Published by Ashley Elliott. Modified over 8 years ago.
1
Coping with Surprise: Multiple CMU MT Approaches
Alon Lavie
Lori Levin, Jaime Carbonell, Alex Waibel, Stephan Vogel, Ralf Brown, Robert Frederking
Language Technologies Institute, Carnegie Mellon University
Joint work with: Katharina Probst, Erik Peterson, Joy Zhang, Fei Huang, Alicia Tribble, Ariadna Font-Llitjos, Rachel Reynolds, Richard Cohen
2
August 5, 2003, TIDES PI Meeting/SLE
Main Hindi SLE Efforts
Data Collection
  – Elicited Data Collection
  – Data from contacts in India
  – Web Crawling
Language Processing Utilities
  – Morphology
  – Encoding identification and conversion
MT system development
  – XFER system
  – SMT system
  – EBMT system
3
Elicited Data Collection
Goal: Acquire high quality word aligned Hindi-English data to support XFER system development (grammar learning)
Recruited team of ~20 bilingual speakers at CMU and in India
Extracted a corpus of phrases (NPs and PPs) from Brown Corpus section of Penn TreeBank
Controlled Elicitation Corpus (typologically diverse, limited vocabulary) also translated into Hindi
Resulting in total of 17589 word aligned translated phrases (~50KB words)
4
The CMU Elicitation Tool
5
Elicited Data Collection: High quality, word-aligned data
Controlled elicitation corpus translated and aligned by Hindi speakers
  – Typologically diverse, vocabulary limited
6
Elicited Data Collection: High quality, word-aligned data
Uncontrolled elicitation corpus: English phrases extracted from the Brown Corpus, translated by Hindi speakers
  – Specific constituent types, large vocabulary
7
Elicited Data Collection: High quality, word-aligned data
Variety of phrase complexities and phrase lengths
8
Elicited Data Collection
Problems and issues:
  – English-to-Hindi direction allowed us to use the Penn TreeBank to extract accurate phrases
  – However, bilingual informants were not well accustomed to typing Hindi, which led to some typos
  – Limits utility of the data; little effect on accuracy
  – Using the WSJ portion of the Penn TreeBank may have been a better fit for genre
9
Main CMU Contributions to SLE Shared Resources
Elicited Data Corpus (~50KB)
Indian Government Parallel Text: ERDC.tgz (338 MB)
CMU Phrase Lexicon: Joyphrase.gz (3.5 MB)
Cleaned IBM lexicon: ibmlex-cleaned.txt.gz (1.5 MB)
CMU Aligned Sentences: CMU-aligned-sentences.tar.gz (1.3 MB)
CMU Phrases and sentences: CMU-phrases+sentences.zip (468 KB)
Bilingual Named Entity List: IndiaTodayLPNETranslists.tar.gz (54 KB)
Web Crawling: Most sites with possible parallel texts had Hindi in proprietary encodings
  – Osho http://www.osho.com/Content.cfm?Language=Hindi
10
Hindi Morphological Analyzer
http://www.iiit.net/ltrc/morph/index.htm
High quality and high coverage morphological analyzer from IIIT
  – Input: full inflected forms (RomanWX encoding)
  – Output: root form + collection of features
Installing as a local server required some effort, e.g. UTF-8 to RomanWX conversion
Used primarily in our XFER system
11
Other Hindi Processing Utilities
Encoding identification and conversion tools
  – Built two automatic encoding identifiers, used for web data collection
  – Located and installed encoding converters from a variety of encodings
  – Most widely used was UTF-8 to RomanWX
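As a rough illustration of the decision an encoding identifier has to make on crawled pages, the Python sketch below checks whether a byte string is valid UTF-8 and consists mostly of Devanagari characters. It is not one of the two CMU identifiers mentioned on the slide; the function name `looks_like_utf8_devanagari` and the 0.3 threshold are assumptions made only for this example. The one fixed fact it relies on is that the Unicode Devanagari block spans U+0900 to U+097F.

```python
def looks_like_utf8_devanagari(data: bytes, threshold: float = 0.3) -> bool:
    """Return True if `data` decodes as UTF-8 and is mostly Devanagari.

    Hypothetical helper for illustration; the threshold is an assumed value.
    """
    try:
        text = data.decode("utf-8")
    except UnicodeDecodeError:
        # Not valid UTF-8, so possibly one of the proprietary encodings.
        return False

    # Ignore whitespace when computing the share of Devanagari characters.
    chars = [ch for ch in text if not ch.isspace()]
    if not chars:
        return False

    # Devanagari Unicode block: U+0900 to U+097F.
    devanagari = sum(1 for ch in chars if "\u0900" <= ch <= "\u097f")
    return devanagari / len(chars) >= threshold
```

Pages that fail such a check would still need a separate classifier to decide which legacy encoding they use before a converter can be applied.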
12
XFER System for Hindi
Three transfer strategies:
  – match against phrase-to-phrase entries (full forms, no morphology)
  – morphologically analyze input words and match against lexicon; matches feed into manual and learned transfer rules
  – match original word against lexicon; provides word-to-word translation as fall-back for input not otherwise covered
Simple decoding: greedy left-to-right search that prefers longer input segments: NIST 5.35
“Strong” decoding with lattices+LM: NIST 5.47
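The "simple decoding" idea, a greedy left-to-right search that prefers longer input segments, can be sketched in Python as follows. This is a minimal illustration and not the XFER decoder itself: the three transfer strategies are collapsed into a single lookup table, uncovered words are passed through unchanged, and the names `greedy_decode`, `phrase_table` and `max_len` as well as the placeholder tokens are assumptions made for the example.

```python
def greedy_decode(tokens, phrase_table, max_len=4):
    """Translate `tokens` left to right, always taking the longest
    input segment that has an entry in `phrase_table`.

    `phrase_table` maps tuples of source tokens to a target string.
    """
    output = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate segment first, then shorter ones.
        for span in range(min(max_len, len(tokens) - i), 0, -1):
            segment = tuple(tokens[i:i + span])
            if segment in phrase_table:
                output.append(phrase_table[segment])
                i += span
                break
        else:
            # No entry covers this word: pass it through unchanged.
            output.append(tokens[i])
            i += 1
    return " ".join(output)


# Placeholder source tokens (h1..h4) and target strings (e12, e3).
table = {("h1",): "e1", ("h1", "h2"): "e12", ("h3",): "e3"}
print(greedy_decode(["h1", "h2", "h3", "h4"], table))
```

Because the search commits to the first longest match and never backtracks, it cannot compare alternative segmentations, which is the gap the lattice plus language model decoding addresses.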
13
Examples of Learned Rules

  {NP,14244}
  ;;Score:0.0429
  NP::NP [N] -> [DET N]
  (
    (X1::Y2)
  )

  {NP,14434}
  ;;Score:0.0040
  NP::NP [ADJ CONJ ADJ N] -> [ADJ CONJ ADJ N]
  (
    (X1::Y1)
    (X2::Y2)
    (X3::Y3)
    (X4::Y4)
  )

  {PP,4894}
  ;;Score:0.0470
  PP::PP [NP POSTP] -> [PREP NP]
  (
    (X2::Y1)
    (X1::Y2)
  )
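One way to read the alignment lines in these rules is that `Xi::Yj` places the translation of the i-th source constituent at the j-th target position. The Python sketch below applies that reading; it is an interpretation for illustration, not the XFER engine, and the helper names `parse_alignments` and `apply_rule` and the `<?>` filler are hypothetical. Target slots with no alignment (such as DET in the first rule) are left as the filler, since the slide does not say how they are realized.

```python
import re


def parse_alignments(text):
    """Extract (source_index, target_index) pairs from text such as
    '( (X2::Y1) (X1::Y2) )'. Indices are 1-based, as in the rules."""
    return [(int(x), int(y)) for x, y in re.findall(r"X(\d+)::Y(\d+)", text)]


def apply_rule(source_translations, target_len, alignments, filler="<?>"):
    """Place each translated source constituent at its aligned target slot.

    Unaligned target slots keep `filler` (an assumption of this sketch).
    """
    target = [filler] * target_len
    for x, y in alignments:
        target[y - 1] = source_translations[x - 1]
    return target


# PP rule: [NP POSTP] -> [PREP NP] with (X2::Y1) (X1::Y2).
pp_alignments = parse_alignments("( (X2::Y1) (X1::Y2) )")
print(apply_rule(["the house", "in"], 2, pp_alignments))
```

For the PP rule this swaps the two constituents, which mirrors the postposition-to-preposition reordering between Hindi and English.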
14
SMT System for Hindi
Resources
  – Trained on commonly available bilingual corpora
  – Used bilingual Hindi-English dictionary
  – Named Entities
  – 70 million word English LM
CMU SMT System
  – Tuned on ISI devtest data
  – Monotone decoding, as reordering did not result in improvement on this test set
  – Mixed casing based on Named Entities and simple rules
NIST score: 6.74
15
EBMT System for Hindi
Training data: same as SMT + a few hand-written equivalence class generalizations
English LM built from APW portion of GigaWord Corpus (600M words)
Encoding variation: raw training data in a variety of different encodings, all converted to UTF-8 (already supported by EBMT)
Preprocessing of example phrases to improve word matching:
  – Match Hindi possessive with English ’s
NIST score: 5.98
16
A Truly Limited Data Scenario for Hindi-to-English
Put together a scenario with very miserly data resources:
  – Elicited Data corpus: 17589 phrases
  – Cleaned portion (top 12%) of LDC dictionary: ~2725 Hindi words (23612 translation pairs)
  – Manually acquired resources during the SLE:
      500 manual bigram translations
      72 manually written phrase transfer rules
      105 manually written postposition rules
      48 manually written time expression rules
No additional parallel text!!
Results presented tomorrow…
17
Other CMU Contributions to SLE Shared Resources
FOUND RESOURCES not on LDC Website:
[From TidesSLList Archive website]
Vogel email 6/2
  – Hindi Language Resources: http://www.cs.colostate.edu/~malaiya/hindilinks.html
  – General Information on Hindi Script: http://www.latrobe.edu.au/indiangallery/devanagari.htm
  – Dictionaries at: http://www.iiit.net/ltrc/Dictionaries/Dict_Frame.html
  – English to Hindi dictionary in different formats: http://sanskrit.gde.to/hindi/
  – A small English to Urdu dictionary: http://www.cs.wisc.edu/~navin/india/urdu.dictionary
  – The Bible at: http://www.gospelcom.net/ibs/bibles/
  – The Emille Project: http://www.emille.lancs.ac.uk/home.htm
  – [Hardcopy phrasebook references]
  – A Monthly Newsletter of Vigyan Prasar: http://www.vigyanprasar.com/dream/index.asp
  – Morphological Analyser: http://www.iiit.net/ltrc/morph/index.htm
18
Other CMU Contributions to SLE Shared Resources
FOUND RESOURCES not on LDC Website: (cont.)
[From TidesSLList Archive website]
Tribble email, via Vogel 6/2
Possible parallel websites:
  – http://www.bbc.co.uk (English)
  – http://www.bbc.co.uk/urdu/ (Hindi)
  – http://sify.com/news_info/news/
  – http://sify.com/hindi/
  – http://in.rediff.com/index.html (English)
  – http://www.rediff.com/hindi/index.html (Hindi)
  – http://www.indiatoday.com/itoday/index.html
  – http://www.indiatodayhindi.com
Vogel email 6/2
  – http://us.rediff.com/index.html
  – http://www.rediff.com/hindi/index.html [Already listed]
  – http://www.niharonline.com/
  – http://www.niharonline.com/hindi/index.html
  – http://www.boloji.com/hindi/index.html
  – http://www.boloji.com/hindi/hindi/index.htm
  – The Gita Supersite http://www.gitasupersite.iitk.ac.in/
  – Press Information Bureau, Government of India
      English: http://pib.nic.in/
      Hindi: http://pib.nic.in/urdu/hindimain.html
19
Other CMU Contributions to SLE Shared Resources
FOUND RESOURCES not on LDC Website: (cont.)
[From TidesSLList Archive website]
6/20 Parallel Hindi/English webpages:
  – GAIL (Natural Gas Co.) http://gail.nic.in/ UTF-8. [Found by CMU undergrad Web team] [Mike Maxwell, LDC, found it at the same time.]
SHARED PROCESSED RESOURCES NOT ON LDC WEBSITE:
[From TidesSLList Archive website:]
Frederking email 6/3 [announced], 6/4 [provided]
  – Ralf Brown's idenc encoding classifier
Frederking email 6/5
  – PDF extractions from LanguageWeaver URLs:
      http://progress.is.cs.cmu.edu/surprise/Hindi/ParDoc/06-04-2003/English/
      http://progress.is.cs.cmu.edu/surprise/Hindi/ParDoc/06-04-2003/Hindi/
Frederking email 6/5
  – Richard Wang's Perl ident.pl encoding classifier and ISCII-UTF8.pl converter
Frederking email 6/11
  – Erik Peterson here has put together a Perl wrapper for the IIIT Morphology package, so that the input can be UTF-8: http://progress.is.cs.cmu.edu/surprise/morph_wrapper.tar.gz
20
Other CMU Contributions to SLE Shared Resources
SHARED PROCESSED RESOURCES NOT ON LDC WEBSITE: (cont.)
[From TidesSLList Archive website:]
Levin email 6/13
  – Directory of Elicited Word-Aligned English-Hindi Translated Phrases: http://progress.is.cs.cmu.edu/surprise/Elicited-Data/
Frederking email 6/20
  – Undecoded but believed to be parallel webpages: http://progress.is.cs.cmu.edu/surprise/merged_urls.txt
  – PDF extractions from same: http://progress.is.cs.cmu.edu/surprise/merged_urls/
Frederking email 6/24
  – Several individual parallel webpages; sites may have more:
      www.commerce.nic.in/setup.htm
      www.commerce.nic.in/hindi/setup.html
      mohfw.nic.in/kk/95/books1.htm
      mohfw.nic.in/oph.htm
      wwww.mp.nic.in