Statistical Machine Translation Part IV – Assignments and Advanced Topics
Alex Fraser
Institute for Natural Language Processing, University of Stuttgart
EMA Summer School

Outline
– Assignment 1: Model 1 and EM
  – Comments on implementation
  – Study questions
– Assignment 2: Decoding with Moses
– Advanced topics

Slide from Koehn 2008

Assignment 1
The first problem is finding data structures for t(e|f) and count(e|f).
– Hashes are a good choice for both of these.
– However, if you have really large matrices, you can be even more memory-efficient:
  – First collect the set of all e and f that cooccur in any sentence.
  – Then build a data structure which, for each f, has a pointer to a block of memory.
  – Each block of memory consists of (e, float) pairs, ordered by e.
  – When you need to look up the float for (e,f), first go to the f block, then do binary search for the right e. (A sketch follows below.)
– This solution is used by the GIZA++ aligner if compiled with the -DBINARY_SEARCH_FOR_TTABLE option (on by default).
– Important: binary search is slower than a hash!
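A minimal sketch of that layout, assuming integer word IDs; the type and function names here are illustrative, not GIZA++'s actual internals:

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// One (e, probability) entry inside the block for a given f.
struct TEntry {
    uint32_t e;   // target word id
    float    val; // t(e|f)
};

// For each source word f, a block of TEntry sorted by e.
// ttable[f] plays the role of the "pointer to a block of memory".
using TTable = std::vector<std::vector<TEntry>>;

// Look up t(e|f) by binary search inside f's block.
// Returns 0 if the pair never cooccurred (so no entry was allocated).
float lookup(const TTable& ttable, uint32_t f, uint32_t e) {
    const auto& block = ttable[f];
    auto it = std::lower_bound(
        block.begin(), block.end(), e,
        [](const TEntry& entry, uint32_t key) { return entry.e < key; });
    if (it != block.end() && it->e == e) return it->val;
    return 0.0f;
}
```

The payoff is memory, not speed: a sorted block of 8-byte entries has essentially no per-entry overhead, whereas a hash table also stores keys, hash metadata and empty buckets. As the slide notes, the binary search itself is slower than a hash lookup.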

The next problem is how to determine the Viterbi alignment.
– This is the alignment of highest probability. (See the sketch below.)
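Under Model 1 there is no distortion or fertility model, so the Viterbi alignment decomposes word by word: each target word independently links to whichever source word (or NULL) gives it the highest translation probability. A minimal sketch, reusing the hypothetical TTable and lookup from the block above:

```cpp
// Viterbi alignment under Model 1: for each target word e[j], pick
// argmax_i t(e[j] | f[i]), where f[0] is the NULL word by convention.
// Returns one source position per target position.
std::vector<size_t> viterbi_model1(const TTable& ttable,
                                   const std::vector<uint32_t>& f,  // f[0] = NULL
                                   const std::vector<uint32_t>& e) {
    std::vector<size_t> alignment(e.size(), 0);
    for (size_t j = 0; j < e.size(); ++j) {
        float best = -1.0f;
        for (size_t i = 0; i < f.size(); ++i) {
            float t = lookup(ttable, f[i], e[j]);
            if (t > best) { best = t; alignment[j] = i; }
        }
    }
    return alignment;
}
```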

Speed, if you use C++ on a fast workstation:
– de-en: 5 iterations in 3 to 4 seconds
– fr-en: 5 iterations in 2 to 3 seconds
Other questions on implementation?

Assignment 1 – Study questions
Word alignments are usually calculated over lowercased data. Compare your alignments with mixed case versus lowercase. Do you see an improvement? Where?
– The alignment of the first word improves if it was rarely observed in the first position but frequently in other positions (lowercased).
– Conflating the case of English proper nouns and common nouns (e.g., Bush vs. bush) does not usually hurt performance.

How are non-compositional phrases aligned? Do you see any problems?
– Non-compositional phrases like „to play a role“
– These are virtually never right in Model 1, unless they can be translated word-for-word into the other language.
– We need features that rely on proximity; Model 4 will get some of these (relative position distortion model).

Generate an alignment in the opposite direction (e.g., swap the English and French files, or English and German, and generate another alignment). Does one direction seem to work better to you?
– The language that is shorter on average works well as the source language:
  – German for German/English
  – English for French/English
– The 1-to-N assumption is particularly important for compound words:
  – German > English > French

Look for the longest English token, and the longest French or German token. Are they aligned well? Why?
– Longest en token: democratically-elected
– Longest de token: selbstbeschränkungsvereinbarungen
– Longest fr token: recherche-développement
The frequency of the longest token will usually be one (Zipf's law). Words observed in only one sentence will often be aligned wrong.
– Particularly if they are compounds!
– Exception: if all other words in the sentence are frequent, pigeonholing may save the alignment. Not the case here.

Assignment 1 – Advanced questions
Implement union and intersection of the two alignments you have generated. What are the differences between them? Consider the longest tokens again: is there an improvement?
– Intersection results in extremely sparse alignments, but the links are right (high precision).
– Union results in dense alignments, with many wrong alignment links (high recall).
– The longest tokens are not improved; they are left unaligned in the intersection.
(A sketch of both operations follows below.)
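A minimal sketch of the two symmetrization operations, representing an alignment as a set of (source position, target position) links; the names are illustrative:

```cpp
#include <algorithm>
#include <iterator>
#include <set>
#include <utility>

using Links = std::set<std::pair<size_t, size_t>>;  // (f position, e position)

// Both inputs must be in the same coordinates, i.e. the reverse-direction
// alignment is assumed to have been flipped into (f, e) pairs already.

// Intersection: keep only links both directions agree on (high precision).
Links intersect(const Links& ef, const Links& fe) {
    Links out;
    std::set_intersection(ef.begin(), ef.end(), fe.begin(), fe.end(),
                          std::inserter(out, out.begin()));
    return out;
}

// Union: keep every link proposed by either direction (high recall).
Links unite(const Links& ef, const Links& fe) {
    Links out = ef;
    out.insert(fe.begin(), fe.end());
    return out;
}
```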

Are all cognates aligned correctly? How could we force them to be aligned correctly?
– An example of this is „Cunha“ in sentence 17; other examples include numbers.
– One way to do this: extract cognates from the parallel sentences and add them as pseudo-sentences, e.g. add „cunha“ on a line by itself to the end of both sentence files.
– This dramatically improves the chances of these words being linked.
– In the first iteration, this will contribute a 0.5 count to „cunha“ -> „cunha“, and a 0.5 count to NULL -> „cunha“.
– After normalization, „cunha“ will have virtually no chance of being generated by NULL.
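To see where the 0.5 comes from: the pseudo-sentence pair gives the target word exactly two possible generators, the source word „cunha“ and NULL, and under a uniform initialization t(e|f) = 1/|V_E| they tie, so the Model 1 E-step splits the fractional count evenly (a worked version of the slide's claim):

```latex
% Model 1 E-step on the pseudo-sentence pair ("cunha", "cunha"):
% two generators tie under uniform initialization, so the count splits 50/50.
c(\text{cunha} \mid \text{cunha}) \mathrel{+}=
    \frac{t(\text{cunha} \mid \text{cunha})}
         {t(\text{cunha} \mid \text{NULL}) + t(\text{cunha} \mid \text{cunha})}
  = \frac{1/|V_E|}{1/|V_E| + 1/|V_E|} = 0.5
```

The other 0.5 goes to NULL -> „cunha“, but since NULL must spread its probability mass over the entire target vocabulary while „cunha“ concentrates its mass on this one word, normalization quickly drives the NULL link's posterior toward zero.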

The Porter stemmer is a simple, widely available tool for reducing English morphology, e.g., mapping a plural variant and a singular variant to the same token. Compare an alignment with Porter stemming versus one without.
– The Porter stemmer maps „energy“ and „energies“ to „energi“.
– There are more counts for the two combined („energies“ only occurs once).
– The alignment improves.

Assignment 2 – Building an SMT system
Tokenize and lowercase the data.
– Also filter out long sentences.
Build the language model.
Run the training script.
– This runs GIZA++ as a sub-process in both directions.
  – See the large files in giza.en-fr and giza.fr-en, which contain the Model 4 alignments.
– It applies the „grow-diag-final-and“ heuristic (see the slide towards the end of lecture 2).
  – Clearly better than both union and intersection.
– It extracts the unfiltered phrase table.
  – See the model subdirectory.

Steps continued
Run MERT training.
– Starts by filtering the phrase table for the development set.
– Optimally sets the lambdas using the loop from lecture 3.
– Ran 13 iterations to convergence for the fr-en system.
– Look at the last line of each *.log file; it shows the BLEU score of the best point.
  – Before tuning: (first line of the run1 log)
  – Iteration 1:
  – Iteration 13:
Decode the test set using the optimal lambdas.
– This results in a lowercased test set.
Post-process the test set.
– Recapitalize: this uses Moses again, as a translator from lowercase to mixed case!
– Detokenize.

Final BLEU scores
– French to English:
– German to English:
These numbers are directly comparable because the English reference is the same. The German-to-English system is much lower quality than the French-to-English system. Why? That motivates the rest of this talk…

Outline
– Improved word alignments
– Morphology
– Syntax

Improved word alignments
My dissertation was on word alignment. Three main pieces of work:
– Measuring alignment quality (F-alpha)
– A new generative model with many-to-many structure
– A hybrid discriminative/generative training technique for word alignment
I will now tell you about several years in… …10 slides

Modeling the Right Structure
– 1-to-N assumption: multi-word „cepts“ (words in one language translated as a unit) are only allowed on the target side; the source side is limited to single-word cepts.
– Phrase-based assumption: cepts must be consecutive words.

LEAF Generative Story
Explicitly model three word types:
– Head word: provides most of the conditioning for translation.
  – A robust representation of multi-word cepts (for this task); this is to semantics as „syntactic head word“ is to syntax.
– Non-head word: attached to a head word.
– Deleted source words and spurious target words (NULL aligned).

LEAF Generative Story
– Once source cepts are determined, exactly one target head word is generated from each source head word.
– Subsequent generation steps are then conditioned on a single target and/or source head word.
– See the EMNLP 2007 paper for details.

Discussion
LEAF is a powerful model, but exact inference is intractable.
– We use hillclimbing search from an initial alignment. (A generic sketch follows below.)
It models the correct structure: M-to-N discontiguous alignments.
– The first general-purpose statistical word alignment model of this structure!
– The head word assumption allows the use of multi-word cepts.
Decisions robustly decompose over words.
– Not limited to only using the 1-best prediction (unlike 1-to-N models combined with heuristics).
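A generic sketch of such a search; the score and neighbours callbacks stand in for LEAF's actual probability model and its move/swap operators, which are assumptions here rather than the paper's exact inference procedure:

```cpp
#include <functional>
#include <set>
#include <utility>
#include <vector>

using Alignment = std::set<std::pair<size_t, size_t>>;  // (f, e) links

// Greedy hillclimbing: repeatedly move to the best-scoring neighbour
// (e.g. one link moved or two links swapped) until no neighbour improves
// the model score; returns a local maximum near the initial alignment.
Alignment hillclimb(Alignment a,
                    const std::function<double(const Alignment&)>& score,
                    const std::function<std::vector<Alignment>(const Alignment&)>& neighbours) {
    double best = score(a);
    bool improved = true;
    while (improved) {
        improved = false;
        for (const Alignment& n : neighbours(a)) {
            double s = score(n);
            if (s > best) { best = s; a = n; improved = true; }
        }
    }
    return a;
}
```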

New knowledge sources for word alignment
– It is difficult to add new knowledge sources to generative models; it requires completely reengineering the generative story for each new source.
– Existing unsupervised alignment techniques cannot use manually annotated data.

Decomposing LEAF
Decompose each step of the LEAF generative story into a sub-model of a log-linear model.
– Add backed-off forms of the LEAF sub-models.
– Add heuristic sub-models (these do not need to be related to the generative story!).
– This allows tuning of the vector λ, which has a scalar for each sub-model controlling its contribution.
How do we train this log-linear model?
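Concretely, the result has the standard log-linear form, where each feature function h_i is one sub-model (a LEAF generative step, a backed-off variant, or a heuristic) and λ_i is the scalar controlling its contribution:

```latex
% Standard log-linear decomposition over alignments.
p_\lambda(a \mid e, f) =
  \frac{\exp\left( \sum_i \lambda_i \, h_i(a, e, f) \right)}
       {\sum_{a'} \exp\left( \sum_i \lambda_i \, h_i(a', e, f) \right)}
```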

Semi-Supervised Training
Define a semi-supervised algorithm which alternates increasing likelihood with decreasing error.
– Increasing likelihood is similar to EM.
– Discriminatively bias EM to converge to a local maximum of likelihood which corresponds to „better“ alignments.
– „Better“ = higher F_α-score on a small gold-standard corpus.
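The F_α-score here is the alpha-weighted harmonic mean of alignment precision P and recall R (Fraser and Marcu 2007), with α tuned so that the metric correlates with downstream MT quality; α = 0.5 recovers the balanced F-measure:

```latex
% F_alpha: weighted harmonic mean of precision and recall.
F_\alpha(P, R) = \frac{1}{\dfrac{\alpha}{P} + \dfrac{1 - \alpha}{R}}
```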

The EMD Algorithm
[Diagram: a bootstrap provides initial sub-model parameters; the algorithm then cycles through an E-step (Viterbi alignments), a D-step (tuned lambda vector), and an M-step (sub-model parameters), feeding into translation.]

Discussion
The usual formulation of semi-supervised learning is „using unlabeled data to help supervised learning“:
– Build an initial supervised system using labeled data, predict on unlabeled data, then iterate.
– But we do not have enough gold-standard word alignments to estimate parameters directly!
EMD allows us to train a small number of important parameters discriminatively and the rest using likelihood maximization, and allows interaction between the two.
– Similar in spirit (but not in details) to semi-supervised clustering.

Contributions
– Found a metric for measuring alignment quality which correlates with MT quality.
– Designed LEAF, the first generative model of M-to-N discontiguous alignments.
– Developed a semi-supervised training algorithm, the EMD algorithm.
– Obtained large gains of 1.2 BLEU and 2.8 BLEU points on French/English and Arabic/English tasks.

Morphology
Up until now, the integration of morphology into SMT has been disappointing.
Inflection
– The best ideas here are to strip redundant morphology, e.g. case markings that are not used in the target language.
– Can also add pseudo-words. One interesting paper looks at translating Czech to English: inflection which should be translated to a pronoun is simply replaced by a pseudo-word matching the pronoun in preprocessing.
Compounds
– Split these using the word frequencies of the components. (A sketch follows below.)
– e.g. Akt-ion-plan vs. Aktion-plan (the latter split is the desired one)
Some new ideas are coming.
– There is one high-performance Arabic/English alignment and decoding system from IBM, but it needed a lot of manual engineering specific to this language pair.
– This would make a good dissertation topic…
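As a concrete illustration of the compound-splitting idea: one standard scoring (Koehn and Knight 2003) prefers the split whose components have the highest geometric mean of corpus frequencies. A minimal two-way sketch; the real method recurses over multi-part splits and handles filler letters such as the German „s“:

```cpp
#include <cmath>
#include <string>
#include <unordered_map>
#include <vector>

// Return the best split of `word` into at most two parts, scored by the
// geometric mean of the parts' corpus frequencies; the word stays whole
// unless some split scores strictly higher than its own frequency.
std::vector<std::string> split_compound(
        const std::string& word,
        const std::unordered_map<std::string, double>& freq,
        size_t min_len = 4) {
    auto count = [&](const std::string& w) {
        auto it = freq.find(w);
        return it == freq.end() ? 0.0 : it->second;
    };
    double best_score = count(word);   // score of leaving it unsplit
    std::vector<std::string> best{word};
    // Try every split point that leaves both parts at least min_len long.
    for (size_t i = min_len; i + min_len <= word.size(); ++i) {
        std::string left = word.substr(0, i), right = word.substr(i);
        double score = std::sqrt(count(left) * count(right));  // geometric mean
        if (score > best_score) {
            best_score = score;
            best = {left, right};
        }
    }
    return best;
}
```

With plausible frequencies, „aktionsplan“-style compounds come apart into their frequent components while rare-but-atomic words stay whole, because the geometric mean punishes any split containing a low-frequency fragment.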

Syntactic Models Slide from Koehn and Lopez 2008

Related work of interest
– Learning reordering rules automatically using word alignments
– Other hand-written rules for local phenomena:
  – French/English adjective/noun inversion
  – Restructuring questions so that the wh-word is in the right position

Slide from Koehn and Lopez 2008

Conclusion
– Lecture 1 covered background, parallel corpora, and sentence alignment, and introduced modeling.
– Lecture 2 was on word alignment using both exact and approximate EM.
– Lecture 3 was on phrase-based modeling and decoding.
– Lecture 4 briefly touched on new research areas.

Bibliography
Please see the web page for an updated version!
Measuring translation quality
– Papineni et al. 2001: defines the BLEU metric
– Callison-Burch et al. 2007: compares automatic metrics
Measuring alignment quality
– Fraser and Marcu 2007: F-alpha
Generative alignment models
– Knight 1999: tutorial on the basics, Model 1 and Model 3
– Brown et al. 1993: the IBM Models
– Vogel et al. 1996: the HMM model (the best model that can be trained using exact EM; see also several recent papers citing this paper)
Discriminative word alignment models
– Fraser and Marcu 2007: hybrid generative/discriminative model
– Moore et al. 2006: pure discriminative model

Phrase-based modeling
– Och and Ney 2004: Alignment Templates (the first phrase-based model)
– Koehn, Och, and Marcu 2003: phrase-based SMT
Phrase-based decoding
– Koehn: manual of Pharaoh
Syntactic modeling
– Galley et al. 2004: string-to-tree, generalizes Yamada and Knight
– Chiang 2005: using formal grammars (without syntactic parses)

General textbook
– Philipp Koehn's SMT textbook (from which some of my slides were derived) will be out soon.
Watch for shared tasks and participate: www.statmt.org
– You only need to follow the steps in Assignment 2 on all of the data.
If you are in Stuttgart, participate in our reading group on Thursday mornings.
– See my web page.

Thank you!