Coping with Surprise: Multiple CMU MT Approaches Alon Lavie Lori Levin, Jaime Carbonell, Alex Waibel, Stephan Vogel, Ralf Brown, Robert Frederking Language.

Presentation transcript:

Coping with Surprise: Multiple CMU MT Approaches Alon Lavie Lori Levin, Jaime Carbonell, Alex Waibel, Stephan Vogel, Ralf Brown, Robert Frederking Language Technologies Institute Carnegie Mellon University Joint work with: Katharina Probst, Erik Peterson, Joy Zhang, Fei Huang, Alicia Tribble, Ariadna Font-Llitjos, Rachel Reynolds, Richard Cohen

August 5, 2003 TIDES PI Meeting/SLE, slide 2
Main Hindi SLE Efforts
Data Collection
  – Elicited Data Collection
  – Data from contacts in India
  – Web Crawling
Language Processing Utilities
  – Morphology
  – Encoding identification and conversion
MT system development
  – XFER system
  – SMT system
  – EBMT system

Elicited Data Collection
Goal: Acquire high-quality, word-aligned Hindi-English data to support XFER system development (grammar learning)
Recruited a team of ~20 bilingual speakers at CMU and in India
Extracted a corpus of phrases (NPs and PPs) from the Brown Corpus section of the Penn TreeBank
Controlled Elicitation Corpus (typologically diverse, limited vocabulary) also translated into Hindi
Resulting in total of word aligned translated phrases (~50KB words)

The CMU Elicitation Tool

Elicited Data Collection: High quality, word-aligned data
Controlled elicitation corpus translated and aligned by Hindi speakers
  – Typologically diverse, vocabulary limited

Elicited Data Collection: High quality, word-aligned data
Uncontrolled elicitation corpus: English phrases extracted from the Brown Corpus, translated by Hindi speakers
  – Specific constituent types, large vocabulary

Elicited Data Collection: High quality, word-aligned data
Variety of phrase complexities and phrase lengths

Elicited Data Collection
Problems and issues:
  – English → Hindi direction allowed us to use the Penn TreeBank to extract accurate phrases
  – However, bilingual informants were not well accustomed to typing Hindi → some typos
  – Limits utility of the data, little effect on accuracy
  – Using the WSJ portion of the PennTB may have been a better fit for genre

Main CMU Contributions to SLE Shared Resources
Elicited Data Corpus (~50KB)
Indian Government Parallel Text: ERDC.tgz (338 MB)
CMU Phrase Lexicon: Joyphrase.gz (3.5 MB)
Cleaned IBM lexicon: ibmlex-cleaned.txt.gz (1.5 MB)
CMU Aligned Sentences: CMU-aligned-sentences.tar.gz (1.3 MB)
CMU Phrases and sentences: CMU-phrases+sentences.zip (468 KB)
Bilingual Named Entity List: IndiaTodayLPNETranslists.tar.gz (54 KB)
Web Crawling: Most sites with possible parallel texts had Hindi in proprietary encodings
Osho

Hindi Morphological Analyzer
High quality and high coverage morphological analyzer from IIIT
  – Input: full inflected forms (RomanWX encoding)
  – Output: root form + collection of features
Installing as a local server required some effort, e.g. UTF-8 → RomanWX
Used primarily in our XFER system

Other Hindi Processing Utilities
Encoding identification and conversion tools
  – Built two automatic encoding identifiers, used for web data collection
  – Located and installed encoding converters from a variety of encodings
  – Most widely used was UTF-8 to RomanWX
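To make the idea of an automatic encoding identifier concrete, here is a minimal sketch. It is not the CMU tools named in these slides: the function name, the returned labels, and the 30% threshold are all assumptions chosen for illustration. It only distinguishes valid UTF-8 containing Devanagari code points (U+0900 to U+097F) from plain ASCII (possibly romanized text) and from byte streams dominated by high bytes, which may indicate an 8-bit Indic font encoding.

```python
def guess_hindi_encoding(data: bytes) -> str:
    """Heuristic guess at how Hindi text is encoded (illustrative labels only)."""
    if not data:
        return "empty"
    try:
        text = data.decode("utf-8")
    except UnicodeDecodeError:
        text = None

    if text is not None:
        # Valid UTF-8: look for characters in the Devanagari Unicode block.
        if any("\u0900" <= ch <= "\u097f" for ch in text):
            return "utf-8-devanagari"
        if all(ord(ch) < 128 for ch in text):
            return "ascii-or-romanized"
        return "utf-8-other"

    # Not valid UTF-8: count high bytes as weak evidence of an 8-bit encoding.
    # The 0.3 threshold is an arbitrary assumption, not a tuned value.
    high = sum(1 for b in data if b >= 0xA1)
    if high / len(data) > 0.3:
        return "8-bit-indic-candidate"
    return "unknown"
```

A real identifier would need per-encoding evidence (for example byte n-gram statistics) to tell the many proprietary font encodings apart; this sketch only separates the easy cases.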

XFER System for Hindi
Three transfer strategies:
  – match against phrase-to-phrase entries (full forms, no morphology)
  – morphologically analyze input words and match against lexicon; matches feed into manual and learned transfer rules
  – match original word against lexicon: provides word-to-word translation as fall-back for input not otherwise covered
Simple decoding: greedy left-to-right search that prefers longer input segments: NIST 5.35
“Strong” decoding with lattices+LM: NIST 5.47
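The "simple decoding" strategy can be illustrated with a short sketch of a greedy left-to-right search that prefers longer input segments. This is an assumption-laden toy, not the XFER decoder: the function name, the two dictionary lexicons, the `max_len` cap, and the placeholder tokens are hypothetical, and the morphological analysis and transfer-rule steps are omitted.

```python
def greedy_translate(tokens, phrase_lexicon, word_lexicon, max_len=4):
    """Greedy left-to-right translation that tries the longest segment first.

    phrase_lexicon maps tuples of source tokens to a target string;
    word_lexicon maps a single source token to a target string.
    Both are hypothetical stand-ins for the lexical resources on the slide.
    """
    output = []
    i = 0
    while i < len(tokens):
        # Try the longest segment starting at position i, then shorter ones.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            segment = tuple(tokens[i:i + n])
            if segment in phrase_lexicon:
                output.append(phrase_lexicon[segment])
                i += n
                break
        else:
            # Fall-back: word-to-word lookup, passing unknown words through.
            word = tokens[i]
            output.append(word_lexicon.get(word, word))
            i += 1
    return output


# Toy usage with placeholder tokens (not real Hindi).
phrases = {("w1", "w2"): "A B", ("w1",): "A", ("w3",): "C"}
words = {"w4": "D"}
result = greedy_translate(["w1", "w2", "w3", "w4", "w5"], phrases, words)
```

Because the search commits to the first (longest) match at each position and never backtracks, it cannot recover from a locally attractive but globally poor segmentation, which is one plausible reason a lattice plus language model scores higher.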

Examples of Learned Rules

{NP,14244}
;;Score:
NP::NP [N] -> [DET N]
( (X1::Y2) )

{NP,14434}
;;Score:
NP::NP [ADJ CONJ ADJ N] -> [ADJ CONJ ADJ N]
( (X1::Y1) (X2::Y2) (X3::Y3) (X4::Y4) )

{PP,4894}
;;Score:
PP::PP [NP POSTP] -> [PREP NP]
( (X2::Y1) (X1::Y2) )
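The alignment part of these rules, such as (X2::Y1) (X1::Y2), says which source constituent fills which target position. The sketch below shows only that reordering step, under stated assumptions: the function name and data layout are hypothetical, constituents are assumed to be already translated, and unaligned target positions (such as DET in the first rule) are left as a visible placeholder because the slides do not say how they are filled.

```python
def apply_alignments(target_types, alignments, source_translations):
    """Place translated source constituents into target slots.

    target_types: target-side constituent labels, e.g. ["PREP", "NP"].
    alignments: (x, y) pairs with 1-based indices, meaning Xx::Yy.
    source_translations: translations of source constituents X1, X2, ...
    """
    # Unaligned target slots keep a placeholder such as "<DET>".
    slots = [f"<{label}>" for label in target_types]
    for x, y in alignments:
        slots[y - 1] = source_translations[x - 1]
    return slots


# PP::PP [NP POSTP] -> [PREP NP] with (X2::Y1) (X1::Y2):
# the postposition moves in front of the noun phrase.
pp = apply_alignments(["PREP", "NP"], [(2, 1), (1, 2)], ["the house", "in"])

# NP::NP [N] -> [DET N] with (X1::Y2): the DET slot has no aligned source.
np_out = apply_alignments(["DET", "N"], [(1, 2)], ["house"])
```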

SMT System for Hindi
Resources
  – Trained on commonly available bilingual corpora
  – Used bilingual Hindi-English dictionary
  – Named Entities
  – 70 million word English LM
CMU SMT System
  – Tuned on ISI devtest data
  – Monotone decoding, as reordering did not result in improvement on this test set
  – Mixed casing based on Named Entities and simple rules
NIST score: 6.74

EBMT System for Hindi
Training data: same as SMT + a few hand-written equivalence class generalizations
English LM built from APW portion of GigaWord Corpus (600M words)
Encoding variation: raw training data in a variety of different encodings → all converted to UTF-8 (already supported by EBMT)
Preprocessing of example phrases to improve word matching:
  – Match Hindi possessive with English ’s
NIST score: 5.98

A Truly Limited Data Scenario for Hindi-to-English
Put together a scenario with very miserly data resources:
  – Elicited Data corpus: phrases
  – Cleaned portion (top 12%) of LDC dictionary: ~2725 Hindi words (23612 translation pairs)
  – Manually acquired resources during the SLE:
      500 manual bigram translations
      72 manually written phrase transfer rules
      105 manually written postposition rules
      48 manually written time expression rules
No additional parallel text!!
Results presented tomorrow…

Other CMU Contributions to SLE Shared Resources
FOUND RESOURCES not on LDC Website: [From TidesSLList Archive website]
Vogel 6/2
  – Hindi Language Resources:
  – General Information on Hindi Script:
  – Dictionaries at:
  – English to Hindi dictionary in different formats:
  – A small English to Urdu dictionary:
  – The Bible at:
  – The Emille Project:
  – [Hardcopy phrasebook references]
  – A Monthly Newsletter of Vigyan Prasar
  –
  – Morphological Analyser:

Other CMU Contributions to SLE Shared Resources
FOUND RESOURCES not on LDC Website: (cont.) [From TidesSLList Archive website]
Tribble, via Vogel 6/2
Possible parallel websites:
  – (English)
  – (Hindi)
  –
  –
  – (English)
  – (Hindi)
  –
  –
Vogel 6/2
  –
  – [Already listed]
  –
  –
  –
  –
  – The Gita Supersite
  – Press Information Bureau, Government of India English: Hindi:

Other CMU Contributions to SLE Shared Resources
FOUND RESOURCES not on LDC Website: (cont.) [From TidesSLList Archive website]
6/20
Parallel Hindi/English webpages:
  – GAIL (Natural Gas Co.) UTF-8. [Found by CMU undergrad Web team] [Mike Maxwell, LDC, found it at the same time.]
SHARED PROCESSED RESOURCES NOT ON LDC WEBSITE: [From TidesSLList Archive website:]
Frederking 6/3 [announced], 6/4 [provided]
  – Ralf Brown's idenc encoding classifier
Frederking 6/5
  – PDF extractions from LanguageWeaver URLs:
Frederking 6/5
  – Richard Wang's Perl ident.pl encoding classifier and ISCII-UTF8.pl converter
Frederking 6/11
  – Erik Peterson here has put together a Perl wrapper for the IIIT Morphology package, so that the input can be UTF-8:

Other CMU Contributions to SLE Shared Resources
SHARED PROCESSED RESOURCES NOT ON LDC WEBSITE: (cont.) [From TidesSLList Archive website:]
Levin 6/13
  – Directory of Elicited Word-Aligned English-Hindi Translated Phrases:
Frederking 6/20
  – Undecoded but believed to be parallel webpages:
  – PDF extractions from same:
Frederking 6/24
  – Several individual parallel webpages; sites may have more:
      mohfw.nic.in/kk/95/books1.htm
      mohfw.nic.in/oph.htm
      wwww.mp.nic.in