The CMU Arabic-to-English Statistical MT System Alicia Tribble, Stephan Vogel Language Technologies Institute Carnegie Mellon University.

Slides:



Advertisements
Similar presentations
The Chinese Room: Understanding and Correcting Machine Translation This work has been supported by NSF Grants IIS Solution: The Chinese Room Conclusions.
Advertisements

Statistical Machine Translation SMT – Basic Ideas
English-Hindi Translation in 21 Days Ondřej Bojar, Pavel Straňák, Daniel Zeman ÚFAL MFF, Univerzita Karlova, Praha.
1 A Hidden Markov Model- Based POS Tagger for Arabic ICS 482 Presentation A Hidden Markov Model- Based POS Tagger for Arabic By Saleh Yousef Al-Hudail.
Stemming, tagging and chunking Text analysis short of parsing.
The current status of Chinese- English EBMT -where are we now Joy (Ying Zhang) Ralf Brown, Robert Frederking, Erik Peterson Aug 2001.
The current status of Chinese-English EBMT research -where are we now Joy, Ralf Brown, Robert Frederking, Erik Peterson Aug 2001.
“Applying Morphology Generation Models to Machine Translation” By Kristina Toutanova, Hisami Suzuki, Achim Ruopp (Microsoft Research). UW Machine Translation.
Semi-Automatic Learning of Transfer Rules for Machine Translation of Low-Density Languages Katharina Probst April 5, 2002.
MT Summit VIII, Language Technologies Institute School of Computer Science Carnegie Mellon University Pre-processing of Bilingual Corpora for Mandarin-English.
9/12/2003LTI Student Research Symposium1 An Integrated Phrase Segmentation/Alignment Algorithm for Statistical Machine Translation Joy Advisor: Stephan.
LEARNING WORD TRANSLATIONS Does syntactic context fare better than positional context? NCLT/CNGL Internal Workshop Ankit Kumar Srivastava 24 July 2008.
Does Syntactic Knowledge help English- Hindi SMT ? Avinesh. PVS. K. Taraka Rama, Karthik Gali.
The LC-STAR project (IST ) Objectives: Track I (duration 2 years) Specification and creation of large word lists and lexica suited for flexible.
1 1 Automatic Transliteration of Proper Nouns from Arabic to English Mehdi M. Kashani, Fred Popowich, Anoop Sarkar Simon Fraser University Vancouver, BC.
Stephan Vogel - Machine Translation1 Machine Translation Factored Models Stephan Vogel Spring Semester 2011.
Czech-to-English Translation: MT Marathon 2009 Session Preview Jonathan Clark Greg Hanneman Language Technologies Institute Carnegie Mellon University.
Stephan Vogel - Machine Translation1 Machine Translation Word Alignment Stephan Vogel Spring Semester 2011.
Stephan Vogel - Machine Translation1 Statistical Machine Translation Word Alignment Stephan Vogel MT Class Spring Semester 2011.
LDMT MURI Data Collection and Linguistic Annotations November 4, 2011 Jason Baldridge, UT Austin Ulf Hermjakob, USC/ISI.
The CMU-UKA Statistical Machine Translation Systems for IWSLT 2007 Ian Lane, Andreas Zollmann, Thuy Linh Nguyen, Nguyen Bach, Ashish Venugopal, Stephan.
Computational Methods to Vocalize Arabic Texts H. Safadi*, O. Al Dakkak** & N. Ghneim**
Direct Translation Approaches: Statistical Machine Translation
Machine Translation Discriminative Word Alignment Stephan Vogel Spring Semester 2011.
An Integrated Approach for Arabic-English Named Entity Translation Hany Hassan IBM Cairo Technology Development Center Jeffrey Sorensen IBM T.J. Watson.
Advanced Signal Processing 05/06 Reinisch Bernhard Statistical Machine Translation Phrase Based Model.
2012: Monolingual and Crosslingual SMS-based FAQ Retrieval Johannes Leveling CNGL, School of Computing, Dublin City University, Ireland.
Language Knowledge Engineering Lab. Kyoto University NTCIR-10 PatentMT, Japan, Jun , 2013 Description of KYOTO EBMT System in PatentMT at NTCIR-10.
Lecture 6 Hidden Markov Models Topics Smoothing again: Readings: Chapters January 16, 2013 CSCE 771 Natural Language Processing.
2010 Failures in Czech-English Phrase-Based MT 2010 Failures in Czech-English Phrase-Based MT Full text, acknowledgement and the list of references in.
Czech-English Word Alignment Ondřej Bojar Magdalena Prokopová
1 Machine Translation MIRA and MBR Stephan Vogel Spring Semester 2011.
Coping with Surprise: Multiple CMU MT Approaches Alon Lavie Lori Levin, Jaime Carbonell, Alex Waibel, Stephan Vogel, Ralf Brown, Robert Frederking Language.
Arabic Tokenization, Part-of-Speech Tagging and Morphological Disambiguation in One Fell Swoop Nizar Habash and Owen Rambow Center for Computational Learning.
Recent Major MT Developments at CMU Briefing for Joe Olive February 5, 2008 Alon Lavie and Stephan Vogel Language Technologies Institute Carnegie Mellon.
Translating from Morphologically Complex Languages: A Paraphrase-Based Approach Preslav Nakov & Hwee Tou Ng.
Stochastic Inversion Transduction Grammars Dekai Wu Advanced Machine Translation Seminar Presented by: Sanjika Hewavitharana 04/13/2006.
Phrase Reordering for Statistical Machine Translation Based on Predicate-Argument Structure Mamoru Komachi, Yuji Matsumoto Nara Institute of Science and.
Coşkun Mermer, Hamza Kaya, Mehmet Uğur Doğan National Research Institute of Electronics and Cryptology (UEKAE) The Scientific and Technological Research.
NUDT Machine Translation System for IWSLT2007 Presenter: Boxing Chen Authors: Wen-Han Chao & Zhou-Jun Li National University of Defense Technology, China.
Transfer-based MT with Strong Decoding for a Miserly Data Scenario Alon Lavie Language Technologies Institute Carnegie Mellon University Joint work with:
Ibrahim Badr, Rabih Zbib, James Glass. Introduction Experiment on English-to-Arabic SMT. Two domains: text news,spoken travel conv. Explore the effect.
Cluster-specific Named Entity Transliteration Fei Huang HLT/EMNLP 2005.
Hebrew-to-English XFER MT Project - Update Alon Lavie June 2, 2004.
Iterative Translation Disambiguation for Cross Language Information Retrieval Christof Monz and Bonnie J. Dorr Institute for Advanced Computer Studies.
Alignment of Bilingual Named Entities in Parallel Corpora Using Statistical Model Chun-Jen Lee Jason S. Chang Thomas C. Chuang AMTA 2004.
NRC Report Conclusion Tu Zhaopeng NIST06  The Portage System  For Chinese large-track entry, used simple, but carefully- tuned, phrase-based.
For Monday Read chapter 24, sections 1-3 Homework: –Chapter 23, exercise 8.
For Friday Finish chapter 24 No written homework.
For Monday Read chapter 26 Last Homework –Chapter 23, exercise 7.
Stephan Vogel - Machine Translation1 Machine Translation Decoder for Phrase-Based SMT Stephan Vogel Spring Semester 2011.
Auckland 2012Kilgarriff: NLP and Corpus Processing1 The contribution of NLP: corpus processing.
SMT – TIDES – and all that Stephan Vogel Language Technologies Institute Carnegie Mellon University Aus der Vogel-Perspektive A Bird’s View (human translation)
Improving Named Entity Translation Combining Phonetic and Semantic Similarities Fei Huang, Stephan Vogel, Alex Waibel Language Technologies Institute School.
March 2006Introduction to Computational Linguistics 1 CLINT Tokenisation.
Coping with Surprise: Multiple CMU MT Approaches Alon Lavie Lori Levin, Jaime Carbonell, Alex Waibel, Stephan Vogel, Ralf Brown, Robert Frederking Language.
(Statistical) Approaches to Word Alignment
Large Vocabulary Data Driven MT: New Developments in the CMU SMT System Stephan Vogel, Alex Waibel Work done in collaboration with: Ying Zhang, Alicia.
October 10, 2003BLTS Kickoff Meeting1 Transfer with Strong Decoding Learning Module Transfer Rules {PP,4894} ;;Score: PP::PP [NP POSTP] -> [PREP.
CMU Statistical-XFER System Hybrid “rule-based”/statistical system Scaled up version of our XFER approach developed for low-resource languages Large-coverage.
For Monday Read chapter 26 Homework: –Chapter 23, exercises 8 and 9.
Eliciting a corpus of word- aligned phrases for MT Lori Levin, Alon Lavie, Erik Peterson Language Technologies Institute Carnegie Mellon University.
CMU MilliRADD Small-MT Report TIDES PI Meeting 2002 The CMU MilliRADD Team: Jaime Carbonell, Lori Levin, Ralf Brown, Stephan Vogel, Alon Lavie, Kathrin.
Maximum Entropy techniques for exploiting syntactic, semantic and collocational dependencies in Language Modeling Sanjeev Khudanpur, Jun Wu Center for.
Named Entities in Domain Unlimited Speech Translation Alex Waibel, Stephan Vogel, Tanja Schultz Carnegie Mellon University Interactive Systems Labs.
Tasneem Ghnaimat. Language Model An abstract representation of a (natural) language. An approximation to real language Assume we have a set of sentences,
Learning to Generate Complex Morphology for Machine Translation Einat Minkov †, Kristina Toutanova* and Hisami Suzuki* *Microsoft Research † Carnegie Mellon.
LingWear Language Technology for the Information Warrior Alex Waibel, Lori Levin Alon Lavie, Robert Frederking Carnegie Mellon University.
Issues in Arabic MT Alex Fraser USC/ISI 9/22/2018 Issues in Arabic MT.
Presentation transcript:

The CMU Arabic-to-English Statistical MT System Alicia Tribble, Stephan Vogel Language Technologies Institute Carnegie Mellon University

The Data lFor translation model: lUN corpus: 80 million words UN lUmmah lSome smaller news corpora lFor LM lEnglish side from bilingual corpus: Language model should have seen the words generated by the translation model lAdditional data from Xinhua news lGeneral preprocessing and cleaning lSeparate punctuation mark lRemove sentence pairs with large length mismatch lRemove sentences which have too many non-words (numbers, special characters)

The System lAlignment models: IBM1 and HMM, trained in both directions lPhrase extraction lFrom Viterbi path of HMM alignment lIntegrated Segmentation and Alignment lDecoder lEssentially left to right over source sentence lBuild translation lattice with partial translations lFind best path, allowing for local reordering lSentence length model lPruning: remove low-scoring hypotheses

Some Results lTwo test sets: DevTest 203 sentences, May2003 lBaseline: monotone decoding lRO: word reordering lSL: sentence length model DevTest May 2003 NISTBleu4NIST Baseline RO RO + SL ?

Questions lWhat’s specific to Arabic lEncoding lNamed Entities lSyntax and Morphology lWhat’s needed to get further improvements

What’s Specific to Arabic lSpecific to Arabic lRight to left not really an issue, as this is only display Text in file is left to right lProblem in UN corpus: numbers (Latin characters) sometimes in the wrong direction, eg > 7991 lData not in vocalized form lVocalization not really studied lAmbiguity can be handled by statistical systems

Encoding and Vocalization lEncoding lDifferent encodings: Unicode, UTF-8, CP-1256, romanized forms not too bad, definitely not as bad as Hindi;-) lNeeded to convert, e.g. training and testing data in different encodings lNot all conversion are loss-less lUsed romanized form for processing lConverted all data using ‘Darwish’ transliteration lSeveral characters (ya, allef, hamzda) are collapsed into two classes lConversion not completely reversible lEffect of Normalization lReduction in vocabulary: ~5% lReduction of singletons: >10% lReduction of 3-gram perplexity: ~5%

Named Entities lNEs resulted in small but significant improvement in translation quality in the Chinese-English system lIn Chinese: unknown words are splitted into single characters which are then translated as individual words lIn Arabic no segmentation issues -> damage less severe lNEs not used so far for Arabic, but started to work on it

Language-Specific Issues for Arabic MT lSyntactic issues: Error analysis revealed two common syntactic errors lVerb-Noun reordering lSubject-Verb reordering lMorphology issues: Problems specific to AR morphology lBased on Darwish transliteration lBased on Buckwalter transliteration lPoor Man’s morphology

Syntax Issues: Adjective-Noun reordering lAdjectives and nouns are frequently reordered between Arabic and English lExample: EN: ‘big green chair’ AR: ‘chair green big’ lExperiment: identify noun-adjective sequences in AR and reorder them in preprocessing step lProblem: Often long sequences, e.g. N N Adj Adj N Adj N N lResult: no improvement

Syntax Issues: Subject-Noun reordering lAR: main verb at the beginning of the sentence followed by its subject lEN: order prefers to have the subject precede the verb lExample: EN: ‘the President visited Egypt’ AR: ‘Visited Egypt the President’ lExperiment: identify verbs at the beginning of the AR sentence and move them to a position following the first noun lNo full parsing lDone as preprocessing on the Arabic side lResult: no effect

Morphology Issues lStructural mismatch between English and Arabic lArabic has richer morphology lTypes Ar-En: ~2.2 : 1 lTokens Ar-En: ~ 0.9 : 1 Tried two different tools for morphological analysis: lBuckwalter analyzer lhttp:// competencies/content-analysis/arabic/info/buckwalter-about.htmlhttp:// competencies/content-analysis/arabic/info/buckwalter-about.html l1-1 Transliteration scheme for Arabic characters lDarwish analyzer lwww.cs.umd.edu/Library/TRs/CS-TR-4326/CS-TR-4326.pdfwww.cs.umd.edu/Library/TRs/CS-TR-4326/CS-TR-4326.pdf lSeveral characters (ya, alef, hamza) are collapsed into two classes with one character representative each

Morphology with Darwish Transliteration lAddressed the compositional part of AR morphology since this contributes to the structural mismatch between AR and EN lGoal was to get better word-level alignment lToolkit comes with a stemmer lCreated modified version for separating instead of removing affixes lExperiment 1: Trained on stemmed data lArabic types reduced by ~60%, nearly matching number of English types lBut loosing discriminative power lExperiment 2: Trained on affix-separated data lNumber of tokens increased lMismatch in tokens much larger lResult: Doing morphology monolingually can even increase structural mismatch

Morphology with Buckwalter Transliteration lFocused on DET and CONJ prefixes: lAR: ‘the’, ‘and’ frequently attached to nouns and adjectives lEN: always separate lDifferent spitting strategies: lLoosest: Use all prefixes and split even if remaining word is not a stem lMore conservative: Use only prefixes classified as DET or CONJ lMost conservative: Full analysis, split only can be analyzed as a DET or CONJ prefix plus legitimate stem lExperiments: train on each kind of split data lResult: All set-ups gave lower scores

Poor Man’s Morphology lList of pre- and suffixes compiled by native speaker lOnly for unknown words lRemove more and more pre- and suffixes lStop when stripped word is in trained lexicon lTypically: 1/2 to 2/3 of the unknown words can be mapped to known words lTranslation not always correct, therefore overall improvement limited lResult: this has so far been (for us) the only morphological processing which gave a small improvement

Experience with Morphology and Syntax lInitial experiments with full morphological analysis did not give an improvement lMost words are seen in large corpus lUnknown words: < 5% tokens, < 10% types lSimple prefix splitting reduced to half lPhrase translation captures some of the agreement information lLocal word reordering in the decoder reduces word order problems lWe still believe that morphology could give an additional improvement

Requirements for Improvements lData lMore specific data: We have large corpus (UN) but only small news corpora lManual dictionary could help, it helps for Chinese lBetter use of existing resources lLexicon not trained on all data lTreebanks not used lContinues improvement of models and decoder lRecent improvements in decoder (word reordering, overlapping phrases, sentence length model) helped for Arabic lExpect improvement from named entities lIntegrate morphology and alignment