Coping with Surprise: Multiple CMU MT Approaches
Alon Lavie
Lori Levin, Jaime Carbonell, Alex Waibel, Stephan Vogel, Ralf Brown, Robert Frederking
Language Technologies Institute
Carnegie Mellon University
Joint work with: Katharina Probst, Erik Peterson, Joy Zhang, Fei Huang, Alicia Tribble, Ariadna Font-Llitjos, Rachel Reynolds, Richard Cohen
August 5, 2003 TIDES PI Meeting / SLE
Main Hindi SLE Efforts
  Data Collection
    –Elicited Data Collection
    –Data from contacts in India
    –Web Crawling
  Language Processing Utilities
    –Morphology
    –Encoding identification and conversion
  MT system development
    –XFER system
    –SMT system
    –EBMT system
Elicited Data Collection
  Goal: Acquire high-quality word-aligned Hindi-English data to support XFER system development (grammar learning)
  Recruited a team of ~20 bilingual speakers at CMU and in India
  Extracted a corpus of phrases (NPs and PPs) from the Brown Corpus section of the Penn TreeBank
  Controlled Elicitation Corpus (typologically diverse, limited vocabulary) also translated into Hindi
  Resulting in total of word aligned translated phrases (~50KB words)
The CMU Elicitation Tool
Elicited Data Collection: High quality, word-aligned data
  Controlled elicitation corpus translated and aligned by Hindi speakers
    –Typologically diverse, vocabulary limited
Elicited Data Collection: High quality, word-aligned data
  Uncontrolled elicitation corpus: English phrases extracted from the Brown Corpus, translated by Hindi speakers
    –Specific constituent types, large vocabulary
Elicited Data Collection: High quality, word-aligned data
  Variety of phrase complexities and phrase lengths
Elicited Data Collection
  Problems and issues:
    –English-to-Hindi direction allowed us to use the Penn TreeBank to extract accurate phrases
    –However, bilingual informants were not well accustomed to typing Hindi, so the data contains some typos
    –Limits utility of the data, little effect on accuracy
    –Using the WSJ portion of the PennTB may have been a better fit for genre
Main CMU Contributions to SLE Shared Resources
  Elicited Data Corpus (~50KB)
  Indian Government Parallel Text: ERDC.tgz (338 MB)
  CMU Phrase Lexicon: Joyphrase.gz (3.5 MB)
  Cleaned IBM lexicon: ibmlex-cleaned.txt.gz (1.5 MB)
  CMU Aligned Sentences: CMU-aligned-sentences.tar.gz (1.3 MB)
  CMU Phrases and sentences: CMU-phrases+sentences.zip (468 KB)
  Bilingual Named Entity List: IndiaTodayLPNETranslists.tar.gz (54 KB)
  Web Crawling: Most sites with possible parallel texts had Hindi in proprietary encodings
    Osho
Hindi Morphological Analyzer
  High-quality, high-coverage morphological analyzer from IIIT
    –Input: full inflected forms (RomanWX encoding)
    –Output: root form + collection of features
  Installing as a local server required some effort, e.g. UTF-8 to RomanWX conversion
  Used primarily in our XFER system
Other Hindi Processing Utilities
  Encoding identification and conversion tools
    –Built two automatic encoding identifiers, used for web data collection
    –Located and installed encoding converters from a variety of encodings
    –Most widely used was UTF-8 to RomanWX
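As a rough illustration of what encoding identification for web data involves, the sketch below checks whether a byte string is valid UTF-8 and consists largely of Devanagari code points. It is not the idenc or ident.pl classifier mentioned elsewhere in these slides; the function name and the 0.3 threshold are assumptions made for the example, and a check like this cannot recognize Hindi stored in proprietary font encodings.

```python
def looks_like_utf8_devanagari(data: bytes, min_ratio: float = 0.3) -> bool:
    """Heuristic: True if data decodes as UTF-8 and is mostly Devanagari.

    min_ratio is an illustrative threshold, not a tuned value.
    """
    try:
        text = data.decode("utf-8")
    except UnicodeDecodeError:
        # Not valid UTF-8, so some other encoding is in use.
        return False
    chars = [ch for ch in text if not ch.isspace()]
    if not chars:
        return False
    # The Devanagari Unicode block spans U+0900 to U+097F.
    devanagari = sum(1 for ch in chars if "\u0900" <= ch <= "\u097f")
    return devanagari / len(chars) >= min_ratio
```

Pages that fail this check would still need a real classifier and a converter before they could be used as training data.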
XFER System for Hindi
  Three transfer strategies:
    –match against phrase-to-phrase entries (full forms, no morphology)
    –morphologically analyze input words and match against lexicon; matches feed into manual and learned transfer rules
    –match original word against lexicon: provides word-to-word translation as fall-back for input not otherwise covered
  Simple decoding: greedy left-to-right search that prefers longer input segments: NIST 5.35
  “Strong” decoding with lattices+LM: NIST 5.47
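The simple decoding strategy (greedy, left to right, preferring longer input segments) can be sketched as follows. This is a minimal illustration under stated assumptions, not the XFER decoder itself: the `phrase_table` dictionary, the `max_len` limit, and the pass-through of uncovered tokens are all hypothetical, and the placeholder tokens stand in for real Hindi input.

```python
def greedy_translate(tokens, phrase_table, max_len=4):
    """Cover the input left to right, always taking the longest matching segment.

    phrase_table maps tuples of source tokens to a target string (hypothetical format).
    """
    output = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate segment first, then shrink.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            segment = tuple(tokens[i:i + n])
            if segment in phrase_table:
                output.append(phrase_table[segment])
                i += n
                break
        else:
            # No entry covers this token: pass it through unchanged.
            output.append(tokens[i])
            i += 1
    return output


# Toy table with placeholder tokens; not real lexicon entries.
TOY_TABLE = {("a", "b"): "AB", ("a",): "A", ("c",): "C"}
result = greedy_translate(["a", "b", "c", "d"], TOY_TABLE)
```

Because the search commits to the first (longest) match and never revisits it, it cannot trade a long early segment for a better overall translation, which is the gap the lattice and language-model decoding addresses.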
Examples of Learned Rules

{NP,14244}
;;Score:
NP::NP [N] -> [DET N]
(
(X1::Y2)
)

{NP,14434}
;;Score:
NP::NP [ADJ CONJ ADJ N] -> [ADJ CONJ ADJ N]
(
(X1::Y1)
(X2::Y2)
(X3::Y3)
(X4::Y4)
)

{PP,4894}
;;Score:
PP::PP [NP POSTP] -> [PREP NP]
(
(X2::Y1)
(X1::Y2)
)
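To show how the alignment pairs in these rules drive reordering, here is a small sketch that reads `(Xi::Yj)` as "the translation of the i-th source constituent fills the j-th target slot". The `apply_rule` function and its `inserted` argument are hypothetical; in particular, the slide does not show how an unaligned target slot such as DET in the first rule gets filled, so the example simply supplies a value by hand.

```python
def apply_rule(target_len, alignments, translated_children, inserted=None):
    """Place translated source constituents into target slots.

    alignments: list of (x, y) pairs, 1-based, meaning Xx::Yy.
    translated_children: translations of X1..Xn in source order.
    inserted: optional {y: text} for target slots with no aligned source
              constituent (an assumption made for this example).
    """
    slots = [None] * target_len
    for x, y in alignments:
        slots[y - 1] = translated_children[x - 1]
    for y, text in (inserted or {}).items():
        slots[y - 1] = text
    # Drop any slot that stayed empty.
    return [s for s in slots if s is not None]


# PP::PP [NP POSTP] -> [PREP NP] with (X2::Y1) (X1::Y2):
# the postposition moves in front of the noun phrase.
pp = apply_rule(2, [(2, 1), (1, 2)], ["the house", "in"])

# NP::NP [N] -> [DET N] with (X1::Y2); the determiner is supplied by hand here.
np_out = apply_rule(2, [(1, 2)], ["book"], inserted={1: "the"})
```

The second rule on the slide, with its one-to-one alignments, leaves the order unchanged under the same function.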
SMT System for Hindi
  Resources
    –Trained on commonly available bilingual corpora
    –Used bilingual Hindi-English dictionary
    –Named Entities
    –70 million word English LM
  CMU SMT System
    –Tuned on ISI devtest data
    –Monotone decoding, as reordering did not result in improvement on this test set
    –Mixed casing based on Named Entities and simple rules
  NIST score: 6.74
EBMT System for Hindi
  Training data: same as SMT + a few hand-written equivalence class generalizations
  English LM built from APW portion of GigaWord Corpus (600M words)
  Encoding variation: raw training data in a variety of different encodings, all converted to UTF-8 (already supported by EBMT)
  Preprocessing of example phrases to improve word matching:
    –Match Hindi possessive with English ‘s
  NIST score: 5.98
A Truly Limited Data Scenario for Hindi-to-English
  Put together a scenario with very miserly data resources:
    –Elicited Data corpus: phrases
    –Cleaned portion (top 12%) of LDC dictionary: ~2725 Hindi words (23612 translation pairs)
    –Manually acquired resources during the SLE:
      500 manual bigram translations
      72 manually written phrase transfer rules
      105 manually written postposition rules
      48 manually written time expression rules
  No additional parallel text!!
  Results presented tomorrow…
Other CMU Contributions to SLE Shared Resources
  FOUND RESOURCES not on LDC Website:
  [From TidesSLList Archive website]
  Vogel 6/2
    –Hindi Language Resources:
    –General Information on Hindi Script:
    –Dictionaries at:
    –English to Hindi dictionary in different formats:
    –A small English to Urdu dictionary:
    –The Bible at:
    –The Emille Project:
    –[Hardcopy phrasebook references]
    –A Monthly Newsletter of Vigyan Prasar
    –
    –Morphological Analyser:
Other CMU Contributions to SLE Shared Resources
  FOUND RESOURCES not on LDC Website: (cont.)
  [From TidesSLList Archive website]
  Tribble, via Vogel 6/2
  Possible parallel websites:
    – (English)
    – (Hindi)
    –
    –
    – (English)
    – (Hindi)
    –
    –
  Vogel 6/2
    –
    – [Already listed]
    –
    –
    –
    –
    –The Gita Supersite
    –Press Information Bureau, Government of India
      English:
      Hindi:
Other CMU Contributions to SLE Shared Resources
  FOUND RESOURCES not on LDC Website: (cont.)
  [From TidesSLList Archive website]
  6/20
  Parallel Hindi/English webpages:
    –GAIL (Natural Gas Co.) UTF-8. [Found by CMU undergrad Web team] [Mike Maxwell, LDC, found it at the same time.]
  SHARED PROCESSED RESOURCES NOT ON LDC WEBSITE:
  [From TidesSLList Archive website:]
  Frederking 6/3 [announced], 6/4 [provided]
    –Ralf Brown's idenc encoding classifier
  Frederking 6/5
    –PDF extractions from LanguageWeaver URLs:
  Frederking 6/5
    –Richard Wang's Perl ident.pl encoding classifier and ISCII-UTF8.pl converter
  Frederking 6/11
    –Erik Peterson here has put together a Perl wrapper for the IIIT Morphology package, so that the input can be UTF-8:
Other CMU Contributions to SLE Shared Resources
  SHARED PROCESSED RESOURCES NOT ON LDC WEBSITE: (cont.)
  [From TidesSLList Archive website:]
  Levin 6/13
    –Directory of Elicited Word-Aligned English-Hindi Translated Phrases:
  Frederking 6/20
    –Undecoded but believed to be parallel webpages:
    –PDF extractions from same:
  Frederking 6/24
    –Several individual parallel webpages; sites may have more:
      mohfw.nic.in/kk/95/books1.htm
      mohfw.nic.in/oph.htm
      wwww.mp.nic.in