Statistical Machine Translation: IBM Models and the Alignment Template System

Statistical Machine Translation Goal: Given foreign sentence f: “Maria no dio una bofetada a la bruja verde” Find the most likely English translation e: “Maria did not slap the green witch”

Statistical Machine Translation The most likely English translation e is given by e* = argmax_e P(e|f), where P(e|f) is the conditional probability of any e given f

Statistical Machine Translation How to estimate P(e|f)? Noisy channel: decompose P(e|f) into P(f|e) * P(e) / P(f) (P(f) is fixed for a given f, so it can be dropped when maximizing over e); estimate P(f|e) and P(e) separately using a parallel corpus. Direct: estimate P(e|f) directly using a parallel corpus (more on this later)

Noisy Channel Model Translation model P(f|e): how likely is f to be a translation of e? Estimate parameters from a bilingual corpus. Language model P(e): how likely is e to be an English sentence? Estimate parameters from a monolingual corpus. Decoder: given f, what is the best translation e?

Noisy Channel Model Generative story: generate e with probability p(e); pass e through the noisy channel; out comes f with probability p(f|e). Translation task: given f, deduce the most likely e that produced it, i.e. e* = argmax_e p(e) * p(f|e)

Translation Model How to model P(f|e)? Learn parameters of P(f|e) from a bilingual corpus S of sentence pairs <e_i, f_i>: <e_1, f_1> = <the blue witch, la bruja azul>, <e_2, f_2> = <green, verde>, …, <e_S, f_S> = <the witch, la bruja>

Translation Model There is insufficient data in the parallel corpus to estimate P(f|e) at the sentence level (why? most sentence pairs occur only once, so whole-sentence counts are far too sparse). Decompose the process of translating e -> f into small steps whose probabilities can be estimated

Translation Model English sentence e = e_1…e_l Foreign sentence f = f_1…f_m Alignment A = {a_1, …, a_m}, where a_j ∈ {0, …, l} A indicates which English word generates each foreign word (a_j = 0 means f_j is generated by the NULL English word, i.e. it is unaligned)

Alignments e: “the blue witch” f: “la bruja azul” A = {1,3,2} (intuitively “good” alignment)

Alignments e: “the blue witch” f: “la bruja azul” A = {1,1,1} (intuitively “bad” alignment)

Alignments e: "the blue witch" f: "la bruja azul" (illegal alignment! each a_j picks a single English position, so a foreign word cannot be aligned to more than one English word)

Alignments Question: how many possible alignments are there for a given e and f, where |e| = l and |f| = m?

Alignments Question: how many possible alignments are there for a given e and f, where |e| = l and |f| = m? Answer: Each foreign word can align with any one of |e| = l words, or it can remain unaligned Each foreign word has (l + 1) choices for an alignment, and there are |f| = m foreign words So, there are (l+1)^m alignments for a given e and f

Alignments Question: If all alignments are equally likely, what is the probability of any one alignment, given e?

Alignments Question: If all alignments are equally likely, what is the probability of any one alignment, given e? Answer: P(A|e) = p(|f| = m) * 1/(l+1)^m. If we assume that p(|f| = m) is uniform over all possible values of |f|, then we can let p(|f| = m) = C, giving P(A|e) = C/(l+1)^m

Generative Story e: "blue witch" f: "bruja azul" How do we get from e to f?

IBM Model 1 Model parameters: t(f_j | e_{a_j}) = translation probability of foreign word f_j given the English word e_{a_j} that generated it

IBM Model 1 Generative story: Given e: Pick m = |f|, where all lengths m are equally probable. Pick A with probability P(A|e) = 1/(l+1)^m, since all alignments are equally likely given l and m. Pick f_1…f_m with probability P(f|A,e) = ∏_{j=1}^{m} t(f_j | e_{a_j}), where t(f_j | e_{a_j}) is the translation probability of f_j given the English word it is aligned to
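
To make this concrete, here is a minimal Python sketch of the resulting joint probability P(f, A | e); the translation table t (a dict of dicts) and the length constant C are illustrative assumptions, not values from the slides.

    def model1_joint_prob(f_words, e_words, alignment, t, C=1.0):
        """P(f, A | e) = C / (l+1)^m * prod_j t(f_j | e_{a_j})."""
        l, m = len(e_words), len(f_words)
        e_with_null = ["NULL"] + e_words      # position 0 is the NULL word
        prob = C / (l + 1) ** m
        for j in range(m):
            a_j = alignment[j]                # a_j in {0, ..., l}
            prob *= t[f_words[j]][e_with_null[a_j]]
        return prob

For the running example, model1_joint_prob(["bruja", "azul"], ["blue", "witch"], [2, 1], t) multiplies 1/(2+1)^2 by t(bruja|witch) * t(azul|blue).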

IBM Model 1 Example e: “blue witch”

IBM Model 1 Example e: “blue witch” f: “f1 f2” Pick m = |f| = 2

IBM Model 1 Example e: "blue witch" f: "f1 f2" Pick A = {2,1} with probability 1/(l+1)^m

IBM Model 1 Example e: "blue witch" f: "bruja f2" Pick f1 = "bruja" with probability t(bruja|witch)

IBM Model 1 Example e: "blue witch" f: "bruja azul" Pick f2 = "azul" with probability t(azul|blue)

IBM Model 1: Parameter Estimation How does this generative story help us to estimate P(f|e) from the data? Since the model for P(f|e) contains the parameter t(f_j | e_{a_j}), we first need to estimate t(f_j | e_{a_j})

IBM Model 1: Parameter Estimation How to estimate t(f_j | e_{a_j}) from the data? If we had the data and the alignments A, along with P(A|f,e), then we could estimate t(f|e) using expected counts: t(f|e) = C(f,e) / Σ_{f'} C(f',e), where C(f,e) is the count of f being aligned to e, weighted by P(A|f,e) and summed over all alignments and sentence pairs

IBM Model 1: Parameter Estimation How to estimate P(A|f,e)? P(A|f,e) = P(A,f|e) / P(f|e). But P(f|e) = Σ_A P(A,f|e), so we need to compute P(A,f|e)… This is given by the Model 1 generative story: P(A,f|e) = C/(l+1)^m * ∏_{j=1}^{m} t(f_j | e_{a_j})

IBM Model 1 Example e: "the blue witch" f: "la bruja azul" P(A|f,e) = P(f,A|e) / P(f|e) = P(f,A|e) / Σ_{A'} P(f,A'|e)

IBM Model 1: Parameter Estimation So, in order to estimate P(f|e), we first need to estimate the model parameter t(f_j | e_{a_j}). In order to compute t(f_j | e_{a_j}), we need to estimate P(A|f,e). And in order to compute P(A|f,e), we need to estimate t(f_j | e_{a_j})… This circularity is exactly what the EM algorithm resolves

IBM Model 1: Parameter Estimation Training data is a set of pairs <e_i, f_i>. The log likelihood of the training data given the model parameters is Σ_i log P(f_i | e_i) = Σ_i log Σ_A P(f_i, A | e_i). To maximize the log likelihood of the training data given the model parameters, use EM: hidden variable = alignments A; model parameters = translation probabilities t

EM Initialize model parameters t(f|e). Calculate alignment probabilities P(A|f,e) under the current values of t(f|e). Calculate expected counts from the alignment probabilities. Re-estimate t(f|e) from these expected counts. Repeat until the log likelihood of the training data converges (to a local maximum)
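
A compact Python sketch of this EM loop for Model 1 follows; it assumes the training corpus is a list of (e_words, f_words) token-list pairs, includes the NULL word, and uses illustrative names throughout.

    from collections import defaultdict

    def train_model1(corpus, iterations=10):
        """corpus: list of (e_words, f_words) pairs; returns t[f][e]."""
        f_vocab = {f for _, f_words in corpus for f in f_words}
        uniform = 1.0 / len(f_vocab)
        t = defaultdict(lambda: defaultdict(lambda: uniform))

        for _ in range(iterations):
            count = defaultdict(lambda: defaultdict(float))   # expected c(f, e)
            total = defaultdict(float)                        # expected c(e)

            # E-step: under Model 1, P(a_j = i | f, e) is proportional
            # to t(f_j | e_i), so expected counts factor word by word.
            for e_words, f_words in corpus:
                e_null = ["NULL"] + e_words
                for f in f_words:
                    norm = sum(t[f][e] for e in e_null)
                    for e in e_null:
                        c = t[f][e] / norm
                        count[f][e] += c
                        total[e] += c

            # M-step: re-estimate t(f | e) from the expected counts.
            for f in count:
                for e in count[f]:
                    t[f][e] = count[f][e] / total[e]
        return t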

IBM Model 2 Model parameters: t(f_j | e_{a_j}) = translation probability of foreign word f_j given the English word e_{a_j} that generated it; d(i|j,l,m) = distortion probability, i.e. the probability that f_j is aligned to e_i, given l and m

IBM Model 3 Model parameters: t(f_j | e_{a_j}) = translation probability of foreign word f_j given the English word e_{a_j} that generated it; r(j|i,l,m) = reverse distortion probability, i.e. the probability of position j for f_j, given its alignment to e_i, l, and m; n(e_i) = fertility of word e_i, i.e. the number of foreign words aligned to e_i; p1 = probability of generating a foreign word by alignment with the NULL English word

IBM Model 3 Generative Story: Choose fertilities for each English word Insert spurious words according to probability of being aligned to the NULL English word Translate English words -> foreign words Reorder words according to reverse distortion probabilities
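
As a rough illustration of this generative story (not trained parameters), it can be sketched as a sampling procedure; the fertility table, translation options, and the random shuffle standing in for the reverse distortion model are all assumptions for illustration.

    import random

    def model3_generate(e_words, fertility, t_options, p1=0.1):
        # 1. Choose fertilities: copy each English word phi(e) times.
        expanded = [e for e in e_words for _ in range(fertility[e])]
        # 2. Insert spurious NULL-generated words with probability p1.
        with_null = []
        for e in expanded:
            with_null.append(e)
            if random.random() < p1:
                with_null.append("NULL")
        # 3. Translate each English (or NULL) word into a foreign word.
        f_words = [random.choice(t_options[e]) for e in with_null]
        # 4. Reorder: a random shuffle stands in for the reverse
        #    distortion model r(j | i, l, m).
        random.shuffle(f_words)
        return f_words

The worked example on the next slides traces these same four steps for "Maria did not slap the green witch".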

IBM Model 3 Example Consider the following example from [Knight 1999]: Maria did not slap the green witch

IBM Model 3 Example Maria did not slap the green witch Maria not slap slap slap the green witch Choose fertilities: phi(Maria) = 1

IBM Model 3 Example Maria did not slap the green witch Maria not slap slap slap the green witch Maria not slap slap slap NULL the green witch Insert spurious words: p(NULL)

IBM Model 3 Example Maria did not slap the green witch Maria not slap slap slap the green witch Maria not slap slap slap NULL the green witch Maria no dio una bofetada a la verde bruja Translate words: t(verde|green)

IBM Model 3 Example Maria no dio una bofetada a la verde bruja Maria no dio una bofetada a la bruja verde Reorder words

IBM Model 3 For models 1 and 2, we can compute exact EM updates. For models 3 and 4, exact EM updates cannot be efficiently computed; instead, use the best alignments from previous iterations to initialize each successive model, and explore only the subspace of potential alignments that lies within the same neighborhood as the initial alignments

IBM Model 4 Model parameters: Same as model 3, except uses more complicated model of reordering (for details, see Brown et al. 1993)

Language Model Given an English sentence e_1, e_2, …, e_l, the chain rule gives P(e_1, e_2, …, e_l) = P(e_1) * P(e_2|e_1) * … * P(e_l | e_1, e_2, …, e_{l-1}). N-gram model: assume P(e_i) depends only on the N-1 previous words, so that P(e_i | e_1, e_2, …, e_{i-1}) = P(e_i | e_{i-N+1}, …, e_{i-1})

N=2: Bigram Language Model P(Maria did not slap the green witch) = P(Maria|START) * P(did|Maria) * P(not|did) * … P(END|witch)
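
A minimal bigram language model sketch, assuming tokenized English training sentences; the add-one smoothing here is an illustrative choice, not something prescribed by the slides.

    from collections import Counter
    import math

    def train_bigram_lm(sentences):
        unigrams, bigrams = Counter(), Counter()
        for sent in sentences:
            tokens = ["START"] + sent + ["END"]
            unigrams.update(tokens[:-1])
            bigrams.update(zip(tokens[:-1], tokens[1:]))
        vocab_size = len(unigrams)

        def log_prob(sentence):
            tokens = ["START"] + sentence + ["END"]
            lp = 0.0
            for prev, cur in zip(tokens[:-1], tokens[1:]):
                # P(cur | prev) with add-one smoothing
                lp += math.log((bigrams[(prev, cur)] + 1) /
                               (unigrams[prev] + vocab_size))
            return lp
        return log_prob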

Word-Based MT Word = fundamental unit of translation Weaknesses: no explicit modeling of word context word-by-word translation may not accurately convey meaning of phrase: “il ne va pas” -> “he does not go” IBM models prevent alignment of foreign words with >1 English word: “aller” -> “to go”

Phrase-Based MT Phrase = basic unit of translation Strengths: explicit modeling of word context captures local reorderings, local dependencies

Example Rules: English: he does not go Foreign: il ne va pas ne va pas -> does not go

Alignment Template System [Och and Ney, 2004] Alignment template: Pair of source and target language phrases Word alignment among words within those phrases Formally, an alignment template is a triple (F,E,A): F = words on foreign side E = words on English side A = alignments among words on the foreign and English sides

Estimating P(e|f) Noisy channel: decompose P(e|f) into P(f|e) and P(e); estimate P(f|e) and P(e) separately. Direct: estimate P(e|f) directly from the training corpus, using a log-linear model

Log-linear Models for MT Compute the best translation as e* = argmax_e Σ_i λ_i * h_i(e, f), where the h_i are feature functions and the λ_i are model parameters. Typical feature functions include: phrase translation probabilities, lexical translation probabilities, language model probability, a reordering model, and a word penalty [Koehn 2003]
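
A small sketch of this log-linear argmax over a set of candidate translations; the feature functions and weights are placeholders for the ones listed above, and this illustrates the scoring rule only, not a full decoder.

    def best_translation(f, candidates, feature_functions, weights):
        """Return the candidate e maximizing sum_i lambda_i * h_i(e, f)."""
        def score(e):
            return sum(w * h(e, f) for h, w in zip(feature_functions, weights))
        return max(candidates, key=score)

With two features h1(e, f) = log P(f|e) and h2(e, f) = log P(e), both weighted 1, this reduces to the noisy-channel argmax, as the next slide notes.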

Log-linear Models for MT The noisy channel model is a special case of the log-linear model where: h1 = log(P(f|e)), λ1 = 1; h2 = log(P(e)), λ2 = 1. Then argmax_e Σ_i λ_i h_i(e,f) = argmax_e log P(f|e) + log P(e) = argmax_e P(f|e) * P(e) [Och and Ney 2003]

Alignment Template System Word-align training corpus Extract phrase pairs Assign probabilities to phrase pairs Train language model Decode

Word-Align Training Corpus: Run GIZA++ word alignment in the normal direction, from e -> f (alignment matrix over "he does not go" / "il ne va pas")

Word-Align Training Corpus: Run GIZA++ word alignment in the inverse direction, from f -> e (alignment matrix over "he does not go" / "il ne va pas")

Alignment Symmetrization: Merge the bi-directional alignments using some heuristic between intersection and union. Question: what is the tradeoff in precision/recall when using intersection vs. union? (Intersection keeps only links found in both directions, favoring precision; union keeps links found in either direction, favoring recall.) Here, we use the union (alignment matrix over "he does not go" / "il ne va pas")
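
A sketch of this symmetrization step, assuming each directional alignment is a set of (English index, foreign index) links as produced by a word aligner such as GIZA++; the "grow"-style heuristics between the two extremes are only noted in a comment.

    def symmetrize(e2f_links, f2e_links, heuristic="union"):
        inter = e2f_links & f2e_links    # high precision, lower recall
        union = e2f_links | f2e_links    # higher recall, lower precision
        if heuristic == "intersection":
            return inter
        if heuristic == "union":
            return union
        # "grow-diag" style heuristics would start from the intersection and
        # add neighbouring links from the union; omitted here for brevity.
        raise ValueError(heuristic)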

Alignment Template System Word-align training corpus Extract phrase pairs Assign probabilities to phrase pairs Train language model Decode

Extract phrase pairs: Extract all phrase pairs (E,F) consistent with the word alignments over "he does not go" / "il ne va pas", where consistency is defined as follows: (1) each word in the English phrase is aligned only with words in the foreign phrase; (2) each word in the foreign phrase is aligned only with words in the English phrase. Phrase pairs must consist of contiguous words in each language (a code sketch of this extraction procedure follows the worked example below)

Extract phrase pairs: Question: why is the phrase pair illustrated in the alignment matrix inconsistent with the word alignment?

Extract phrase pairs: Question: why is the illustrated phrase pair inconsistent with the alignment matrix? Answer: "ne" is aligned with "not", which is outside the phrase pair; also, "does" is aligned with "pas", which is outside the phrase pair

Extract phrase pairs: <he, il>

Extract phrase pairs: <he, il> <go, va>

Extract phrase pairs: <he, il> <go, va> <does not go, ne va pas>

Extract phrase pairs: <he, il> <go, va> <does not go, ne va pas> <he does not go, il ne va pas>
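
Here is the code sketch of consistent phrase-pair extraction referred to above; it assumes the word alignment is a set of (English index, foreign index) links and follows the standard extraction loop rather than any particular toolkit's implementation.

    def extract_phrases(e_words, f_words, alignment, max_len=4):
        pairs = []
        for e_start in range(len(e_words)):
            for e_end in range(e_start, min(e_start + max_len, len(e_words))):
                # Foreign positions linked to any word in the English span.
                f_pos = [j for (i, j) in alignment if e_start <= i <= e_end]
                if not f_pos:
                    continue
                f_start, f_end = min(f_pos), max(f_pos)
                # Consistency: no alignment link may cross the phrase boundary.
                consistent = all(e_start <= i <= e_end
                                 for (i, j) in alignment
                                 if f_start <= j <= f_end)
                if consistent and f_end - f_start < max_len:
                    pairs.append((" ".join(e_words[e_start:e_end + 1]),
                                  " ".join(f_words[f_start:f_end + 1])))
        return pairs

Assuming a union alignment that links he-il, go-va, does-pas, not-ne and not-pas (consistent with the answer given above), running extract_phrases on "he does not go" / "il ne va pas" yields exactly the four pairs listed on this slide.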

Alignment Template System Word-align training corpus Extract phrase pairs Assign probabilities to phrase pairs Train language model Decode

Probability Assignment Use relative frequency estimation: P(E, A | F) = Count(F, E, A) / Σ_{E', A'} Count(F, E', A')
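
A sketch of this relative-frequency estimate, simplified to phrase pairs without the internal alignment A; phrase_pairs would be the (English, foreign) pairs extracted from the whole word-aligned corpus.

    from collections import Counter

    def estimate_phrase_probs(phrase_pairs):
        """P(E | F) = count(E, F) / sum over E' of count(E', F)."""
        pair_counts = Counter(phrase_pairs)
        f_counts = Counter(f for _, f in phrase_pairs)
        return {(e, f): c / f_counts[f] for (e, f), c in pair_counts.items()}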

Alignment Template System Word-align training corpus Extract phrase pairs Assign probabilities to phrase pairs Train language model Decode

Language Model Use N-gram language model P(e), just as for word-based MT

Alignment Template System Word-align training corpus Extract phrase pairs Assign probabilities to phrase pairs Train language model Decode

Decode Beam search. State space: the set of possible partial translation hypotheses. Start state: the initial, empty translation of the foreign input. Expansion operation: extend an existing English hypothesis one phrase at a time, by translating a phrase of the foreign sentence into English
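
A highly simplified beam-search sketch along these lines; it omits reordering limits, hypothesis recombination and future-cost estimation, and the phrase_table and lm_logprob interfaces are assumed structures, not a real decoder's API.

    import heapq

    def beam_decode(f_words, phrase_table, lm_logprob, beam_size=10):
        # phrase_table: maps a foreign phrase (tuple of words) to a list of
        #   (english_phrase, translation log-prob) options -- assumed format.
        # lm_logprob(context, new_words): LM log-prob of new_words given the
        #   English words produced so far -- assumed interface.
        beam = [(frozenset(), (), 0.0)]    # (covered positions, english, score)
        completed = []
        for _ in range(len(f_words)):      # each expansion adds one phrase
            expansions = []
            for covered, english, score in beam:
                for i in range(len(f_words)):
                    for j in range(i + 1, len(f_words) + 1):
                        span = tuple(f_words[i:j])
                        if covered & frozenset(range(i, j)) or span not in phrase_table:
                            continue
                        for e_phrase, tm_lp in phrase_table[span]:
                            new_words = tuple(e_phrase.split())
                            hyp = (covered | frozenset(range(i, j)),
                                   english + new_words,
                                   score + tm_lp + lm_logprob(english, new_words))
                            (completed if len(hyp[0]) == len(f_words)
                             else expansions).append(hyp)
            beam = heapq.nlargest(beam_size, expansions, key=lambda h: h[2])
        return max(completed, key=lambda h: h[2], default=None)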

Decoder Example Start: f: "Maria no dio una bofetada a la bruja verde", e: "". Expand English translation: translate "Maria" -> "Mary" or "bruja" -> "witch"; mark the translated foreign words as covered; update probabilities

Decoder Example Example from [Koehn 2003]

BLEU MT Evaluation Metric BLEU measures n-gram precision against a set of k reference English translations: what percentage of n-grams (where n typically ranges from 1 through 4) in the MT English output are also found in a reference translation? Brevity penalty: penalize English translations with fewer words than the reference translations. Why is this metric so widely used? It correlates surprisingly well with human judgment of machine-generated translations
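
A simplified, single-reference BLEU sketch (modified n-gram precision for n = 1..4 plus the brevity penalty); real BLEU aggregates counts over a whole test set and supports multiple references.

    from collections import Counter
    import math

    def bleu(candidate, reference, max_n=4):
        """candidate, reference: lists of tokens."""
        precisions = []
        for n in range(1, max_n + 1):
            cand = Counter(zip(*[candidate[i:] for i in range(n)]))
            ref = Counter(zip(*[reference[i:] for i in range(n)]))
            overlap = sum(min(c, ref[g]) for g, c in cand.items())
            total = max(sum(cand.values()), 1)
            precisions.append(max(overlap, 1e-9) / total)    # avoid log(0)
        # Brevity penalty: penalize candidates shorter than the reference.
        bp = 1.0 if len(candidate) >= len(reference) else \
            math.exp(1 - len(reference) / max(len(candidate), 1))
        return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)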

References Brown et al. 1990. “A statistical approach to Machine Translation”. Brown et al. 1993. “The mathematics of statistical machine translation”. Collins 2003. “Lecture Notes from 6.891 Fall 2003: Machine Learning Approaches for Natural Language Processing”. Knight 1999. “A Statistical MT Workbook”. Knight and Koehn 2004. “A Statistical Machine Translation Tutorial”. Koehn, Och and Marcu 2003. “A Phrase-Based Statistical Machine Translation System”. Koehn, 2003. “Pharaoh: A Phrase-Based Decoder”. Och and Ney 2004. “The Alignment Template System”. Och and Ney 2003. “Discriminative Training and Maximum Entropy Models for Statistical Machine Translation”.