Machine Translation (II): Word-based SMT Ling 571 Fei Xia Week 10: 12/1/05-12/6/05.

Outline General concepts –Source channel model –Notations –Word alignment Model 1-2 Model 3-4 Model 5

IBM Model Basics Classic paper: Brown et al. (1993) Translation: F → E (or Fr → Eng) Resource required: –Parallel data (a set of "sentence" pairs) Main concepts: –Source channel model –Hidden word alignment –EM training

Intuition Sentence pairs: word mapping is one-to-one. –(1) S: a b c d e T: l m n o p –(2) S: c a e T: p n m –(3) S: d a c T: n p l ⇒ (b, o), (d, l), (e, m), and either (a, p), (c, n) or (a, n), (c, p)

Source channel model Task: S → T. Source channel (a.k.a. noisy channel, noisy source channel): use Bayes' rule. Two types of parameters: –P(T): language model –P(S | T): its meaning varies.

Source channel model for ASR: Text → [noisy channel] → Speech, with P(T) on the text side and P(S | T) on the channel. Two types of parameters: Language model: P(T); Acoustic model: P(S | T)

Source Channel for ASR People think in text. A sentence can be characterized by a plausibility filter P(T). Sentences are “corrupted” into speech by an acoustic model P(S | T). Our goal is to find the original text. To achieve this goal, we efficiently evaluate P(T) * P(S | T) over many candidate sentences.

Source channel model for MT: Tgt sent → [noisy channel] → Src sent, with P(T) on the target side and P(S | T) on the channel. Two types of parameters: Language model: P(T); Translation model: P(S | T)

Source channel model for MT: Eng sent → [noisy channel] → Fr sent, with P(E) on the English side and P(F | E) on the channel. Two types of parameters: Language model: P(E); Translation model: P(F | E)

Source channel for MT People think in English. English thoughts can be characterized by a plausibility filter P(E). Sentences are “corrupted” into a different “language” by a translation model P(F | E). Our goal is to find the original, uncorrupted English sentence e. To achieve this goal, we efficiently evaluate P(E) * P(F | E) over many candidate Eng sentences.

Source channel vs. direct model Source channel: demands plausible Eng and a strong correlation between e and f. Direct model: demands a strong correlation between e and f. Question: are they the same? Formally, yes; in practice, no, because of different approximations.

Word alignment a(j) = i ⇔ a_j = i; a = (a_1, …, a_m). Ex: –F: f_1 f_2 f_3 f_4 f_5 –E: e_1 e_2 e_3 e_4 –a_4 = 3 –a = (0, 1, 1, 3, 2)

The constraint on word alignment The constraint: each Fr word is generated by exactly one Eng word (including e_0). With l the Eng sent length and m the Fr sent length: –Without the constraint: 2^{lm} possible alignments. –With the constraint: (l+1)^m. Why do the models use the constraint? –We want to use P(f_j | e_i) to estimate P(F | E). How to handle the exceptional cases? –Various methods: target word grouping, phrase-based SMT, etc.
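A quick arithmetic check of these two counts, with toy lengths chosen only for illustration:

```python
# Alignment-space sizes for a toy sentence pair (l and m are illustrative values).
l, m = 4, 5   # Eng sentence length, Fr sentence length

unconstrained = 2 ** (l * m)   # any subset of the l*m possible links
constrained = (l + 1) ** m     # each Fr word linked to exactly one of e_0..e_l

print(unconstrained)           # 1048576
print(constrained)             # 3125
```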

Modeling P(F | E) with alignment: P(F | E) = Σ_a P(F, a | E), where the sum ranges over all word alignments a.

Notation E: the Eng sentence: E = e_1 … e_l; e_i: the i-th Eng word. F: the Fr sentence: F = f_1 … f_m; f_j: the j-th Fr word. e_0: the Eng NULL word; f_0: the Fr NULL word. a_j: the position of the Eng word that generates f_j.

Word alignment An alignment, a, is a function from Fr word position to Eng word position: a(j) = i means that f_j is generated by e_i. The constraint: each Fr word is generated by exactly one Eng word (including e_0).

Notation (cont) l: Eng sent length; m: Fr sent length; i: Eng word position; j: Fr word position; e: an Eng word; f: a Fr word

Outline General concepts –Source channel model –Word alignment –Notations Model 1-2 Model 3-4

Model 1 and 2

Modeling –Generative process –Decomposition –Formula and types of parameters Training Finding the best alignment Decoding

Generative process To generate F from E: –Pick a length m for F, with prob P(m | l) –Choose an alignment a, with prob P(a | E, m) –Generate the Fr sent given the Eng sent and the alignment, with prob P(F | E, a, m). Another way to look at it: –Pick a length m for F, with prob P(m | l). –For j = 1 to m: Pick an Eng word index a_j, with prob P(a_j | j, m, l). Pick a Fr word f_j according to the Eng word e_i, where i = a_j, with prob P(f_j | e_i).
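A minimal sketch of this generative story in Python; the probability tables below are hypothetical toy values (not from the slides), and Model 2 would simply replace the uniform choice of a_j with a draw from P(a_j | j, m, l):

```python
import random

# Hypothetical toy parameters for illustration only.
length_prob = {2: {2: 0.6, 3: 0.4}}                  # P(m | l)
t = {"NULL": {"x": 0.5,  "y": 0.5},                  # t(f | e)
     "b":    {"x": 0.25, "y": 0.75},
     "c":    {"x": 0.5,  "y": 0.5}}

def sample(dist):
    """Draw a key from a {key: probability} dictionary."""
    keys = list(dist)
    return random.choices(keys, weights=[dist[k] for k in keys])[0]

def generate_model1(eng):
    """Generate (F, a) from E following Model 1's generative story."""
    e = ["NULL"] + eng                 # e_0 is the NULL word
    l = len(eng)
    m = sample(length_prob[l])         # pick a length m with prob P(m | l)
    fr, a = [], []
    for _ in range(m):
        a_j = random.randint(0, l)     # Model 1: a_j is uniform over 0..l
        a.append(a_j)
        fr.append(sample(t[e[a_j]]))   # pick f_j with prob t(f_j | e_{a_j})
    return fr, a

print(generate_model1(["b", "c"]))
```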

Decomposition P(F, a | E) = P(m | E) × P(a | E, m) × P(F | E, a, m)

Approximation Fr sent length depends only on Eng sent length: P(m | E) ≈ P(m | l). Each Fr word depends only on the Eng word that generates it: P(F | E, a, m) ≈ Π_{j=1..m} t(f_j | e_{a_j})

Approximation (cont) Estimating P(a | E, m): –Model 1: all alignments are equally likely: P(a | E, m) = 1/(l+1)^m. –Model 2: alignments have different probabilities: P(a | E, m) = Π_{j=1..m} d(a_j | j, m, l). –Model 1 can be seen as a special case of Model 2, where d(i | j, m, l) = 1/(l+1).

Decomposition for Model 1 P(F, a | E) = P(m | l) × 1/(l+1)^m × Π_{j=1..m} t(f_j | e_{a_j})

The magic (for Model 1) Summing over all alignments factorizes: Σ_a Π_{j=1..m} t(f_j | e_{a_j}) = Σ_{a_1=0..l} … Σ_{a_m=0..l} Π_{j=1..m} t(f_j | e_{a_j}) = Π_{j=1..m} Σ_{i=0..l} t(f_j | e_i)

Final formula and parameters for Model 1 P(F | E) = P(m | l) / (l+1)^m × Π_{j=1..m} Σ_{i=0..l} t(f_j | e_i). Two types of parameters: Length prob: P(m | l); Translation prob: P(f_j | e_i), written t(f_j | e_i)
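A small sketch of this formula as code; the function name and tables are this sketch's own, and the numbers are hypothetical toy values:

```python
def model1_prob(fr, eng, t, length_prob):
    """P(F | E) = P(m | l) / (l+1)^m * prod_j sum_i t(f_j | e_i)   (Model 1)."""
    e = ["NULL"] + eng
    l, m = len(eng), len(fr)
    prob = length_prob.get(l, {}).get(m, 0.0) / (l + 1) ** m
    for f_j in fr:
        prob *= sum(t[e_i].get(f_j, 0.0) for e_i in e)
    return prob

# Toy example with made-up numbers:
t = {"NULL": {"x": 0.5,  "y": 0.5},
     "b":    {"x": 0.25, "y": 0.75},
     "c":    {"x": 0.5,  "y": 0.5}}
length_prob = {2: {2: 1.0}}
print(model1_prob(["x", "y"], ["b", "c"], t, length_prob))
```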

Decomposition for Model 2 Same as Model 1 except that Model 2 does not assume all alignments are equally likely.

The magic for Model 2 The same factorization applies with the distortion term included: Σ_a Π_{j=1..m} d(a_j | j, m, l) t(f_j | e_{a_j}) = Π_{j=1..m} Σ_{i=0..l} d(i | j, m, l) t(f_j | e_i)

Final formula and parameters for Model 2 P(F | E) = P(m | l) × Π_{j=1..m} Σ_{i=0..l} d(i | j, m, l) t(f_j | e_i). Three types of parameters: Length prob: P(m | l); Translation prob: t(f_j | e_i); Distortion prob: d(i | j, m, l)
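The Model 2 counterpart only changes the inner sum, weighting each Eng position by the distortion prob; keying `d` by (i, j, m, l) is just this sketch's convention:

```python
def model2_prob(fr, eng, t, d, length_prob):
    """P(F | E) = P(m | l) * prod_j sum_i d(i | j, m, l) * t(f_j | e_i)   (Model 2)."""
    e = ["NULL"] + eng
    l, m = len(eng), len(fr)
    prob = length_prob.get(l, {}).get(m, 0.0)
    for j, f_j in enumerate(fr, start=1):
        prob *= sum(d.get((i, j, m, l), 0.0) * t[e[i]].get(f_j, 0.0)
                    for i in range(l + 1))
    return prob
```

Setting d(i | j, m, l) = 1/(l+1) for every i recovers the Model 1 formula above.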

Summary of Modeling Parameters: Length prob: P(m | l); Translation prob: t(f_j | e_i); Distortion prob (for Model 2): d(i | j, m, l). Model 1: P(F | E) = P(m | l) / (l+1)^m × Π_j Σ_i t(f_j | e_i). Model 2: P(F | E) = P(m | l) × Π_j Σ_i d(i | j, m, l) t(f_j | e_i)

Model 1 and 2

Training Mathematically motivated: –Having an objective function to optimize –Using several clever tricks The resulting formulae –are intuitively expected –can be calculated efficiently EM algorithm –Hill climbing: each iteration is guaranteed to improve the objective function –It is not guaranteed to reach the global optimum.

Length prob: P(m | l) Let Ct(m, l) be the number of sentence pairs where the Fr length is m and the Eng length is l. Length prob: P(m | l) = Ct(m, l) / Σ_{m'} Ct(m', l). No need for iterations.
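This is plain relative-frequency estimation; a small sketch over toy sentence pairs:

```python
from collections import Counter

def estimate_length_prob(sent_pairs):
    """MLE of the length prob P(m | l) from (Eng, Fr) sentence pairs."""
    joint = Counter((len(e), len(f)) for e, f in sent_pairs)    # Ct(m, l), keyed (l, m)
    eng = Counter(len(e) for e, f in sent_pairs)
    return {(m, l): c / eng[l] for (l, m), c in joint.items()}  # P(m | l)

pairs = [(["b", "c"], ["x", "y"]), (["b"], ["y"])]
print(estimate_length_prob(pairs))   # {(2, 2): 1.0, (1, 1): 1.0}
```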

Estimating t(f|e): a naïve approach A naïve approach: –Count the times that f appears in F and e appears in E. –Count the times that e appears in E. –Divide the 1st number by the 2nd number. Problem: –It cannot distinguish true translations from pure coincidence. –Ex: t(el | white) vs. t(blanco | white). Solution: count the times that f aligns to e.

Estimating t(f|e) in Model 1 Three settings: –When each sent pair has a unique word alignment –When each sent pair has several word alignments, each with a probability –When there are no word alignments (only the sentence pairs)

When there is a single word alignment We can simply count. Training data (one aligned sentence pair): Eng: b c b, Fr: x y y, with alignment links (x, c), (y, b), (y, b). Counts: –ct(x,b)=0, ct(y,b)=2, ct(x,c)=1, ct(y,c)=0. Probs: –t(x|b)=0, t(y|b)=1.0, t(x|c)=1.0, t(y|c)=0

When there are several word alignments If a sent pair has several word alignments, use fractional counts. Training data: sentence pair 1 (Eng: b c, Fr: x y) with four possible alignments, each weighted by its prob P(a|E,F), and sentence pair 2 (Eng: b, Fr: y) with a single alignment. Fractional counts: –Ct(x,b)=0.7, Ct(y,b)=1.5, Ct(x,c)=0.3, Ct(y,c)=0.5. Normalized: –P(x|b)=7/22, P(y|b)=15/22, P(x|c)=3/8, P(y|c)=5/8

Fractional counts Let Ct(f, e) be the fractional count of the (f, e) pair in the training data, given alignment prob P: Ct(f, e) = Σ_a P(a | E, F) × (actual count of times e and f are linked in (E, F) by alignment a)

When there are no word alignments We could list all the alignments, and estimate P(a | E, F).

Formulae so far Ct(f, e) = Σ_{(E,F)} Σ_a P(a | E, F) × count(f, e; a) ⇒ New estimate: t(f | e) = Ct(f, e) / Σ_{f'} Ct(f', e)

The algorithm 1. Start with an initial estimate of t(f | e), e.g., a uniform distribution. 2. Calculate P(a | F, E). 3. Calculate Ct(f, e); normalize to get t(f | e). 4. Repeat Steps 2-3 until the "improvement" is too small.

An example Training data: –Sent 1: Eng: "b c", Fr: "x y" –Sent 2: Eng: "b", Fr: "y" To reduce the number of alignments, assume that each Eng word generates exactly one Fr word ⇒ two possible alignments for Sent 1, and one for Sent 2. Step 1: Initial t(f|e): t(x|b)=t(y|b)=1/2, t(x|c)=t(y|c)=1/2

Step 2: calculating P(a|F,E) Alignments: a1: (b–x, c–y) and a2: (b–y, c–x) for Sent 1; a3: (b–y) for Sent 2. Before normalization: –P(a1|E1,F1)*Z = 1/2*1/2 = 1/4 –P(a2|E1,F1)*Z = 1/2*1/2 = 1/4 –P(a3|E2,F2)*Z = 1/2 After normalization: –P(a1|E1,F1) = 1/4 / (1/4+1/4) = 1/2 –P(a2|E1,F1) = 1/4 / (1/2) = 1/2 –P(a3|E2,F2) = 1/2 / 1/2 = 1

Step 3: calculating t(f | e) (a1: (b–x, c–y), a2: (b–y, c–x), a3: (b–y)) Collecting counts: –Ct(x,b) = 1/2 –Ct(y,b) = 1/2 + 1 = 3/2 –Ct(x,c) = 1/2 –Ct(y,c) = 1/2 After normalization: –t(x | b) = 1/2 / (1/2+3/2) = 1/4, t(y | b) = 3/4 –t(x | c) = 1/2 / 1 = 1/2, t(y | c) = 1/2

Repeating step 2: calculating P(a|F,E) (same alignments a1, a2, a3) Before normalization: –P(a1|E1,F1)*Z = 1/4*1/2 = 1/8 –P(a2|E1,F1)*Z = 3/4*1/2 = 3/8 –P(a3|E2,F2)*Z = 3/4 After normalization: –P(a1|E1,F1) = 1/8 / (1/8+3/8) = 1/4 –P(a2|E1,F1) = 3/8 / (4/8) = 3/4 –P(a3|E2,F2) = 3/4 / 3/4 = 1

Repeating step 3: calculating t(f | e) (same alignments a1, a2, a3) Collecting counts: –Ct(x,b) = 1/4 –Ct(y,b) = 3/4 + 1 = 7/4 –Ct(x,c) = 3/4 –Ct(y,c) = 1/4 After normalization: –t(x | b) = 1/4 / (1/4+7/4) = 1/8, t(y | b) = 7/8 –t(x | c) = 3/4 / (3/4+1/4) = 3/4, t(y | c) = 1/4

See the trend?
            t(x|b)  t(y|b)  t(x|c)  t(y|c)  P(a1)  P(a2)
init         1/2     1/2     1/2     1/2      -      -
1st iter     1/4     3/4     1/2     1/2     1/2    1/2
2nd iter     1/8     7/8     3/4     1/4     1/4    3/4
As the iterations continue, t(y|b), t(x|c), and P(a2) keep growing toward 1.
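The whole worked example can be reproduced with a few lines of EM that enumerate the alignments explicitly; like the slides, this sketch assumes each Eng word generates exactly one Fr word, so the alignments are permutations:

```python
from itertools import permutations
from collections import defaultdict

corpus = [(["b", "c"], ["x", "y"]), (["b"], ["y"])]   # the toy corpus above

# Step 1: initial t(f | e), uniform.
t = {("x", "b"): 0.5, ("y", "b"): 0.5, ("x", "c"): 0.5, ("y", "c"): 0.5}

for iteration in range(1, 3):
    counts = defaultdict(float)
    for eng, fr in corpus:
        # Step 2: P(a | E, F) for every alignment (here: every permutation).
        aligns = list(permutations(range(len(eng))))
        scores = []
        for a in aligns:
            p = 1.0
            for j, i in enumerate(a):
                p *= t[(fr[j], eng[i])]
            scores.append(p)
        z = sum(scores)
        # Step 3a: collect fractional counts Ct(f, e).
        for a, p in zip(aligns, scores):
            for j, i in enumerate(a):
                counts[(fr[j], eng[i])] += p / z
    # Step 3b: normalize to get the new t(f | e).
    totals = defaultdict(float)
    for (f, e), c in counts.items():
        totals[e] += c
    t = {(f, e): c / totals[e] for (f, e), c in counts.items()}
    print(f"iteration {iteration}: {t}")
```

The two printed iterations match the 1st-iter and 2nd-iter rows of the table (1/4, 3/4, 1/2, 1/2 and then 1/8, 7/8, 3/4, 1/4).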

So far, we estimate t(f | e) by enumerating all possible alignments. This process is very expensive, as the number of all possible alignments is (l+1)^m. In the count formula, P(a | E, F) is the previous iteration's estimate of the alignment prob, and the other factor is the actual count of times e and f are linked in (E, F) by alignment a.

No need to enumerate all word alignments Luckily, for Model 1, there is a way to calculate Ct(f, e) efficiently: Ct(f, e) = Σ_{(E,F)} Σ_{j: f_j = f} Σ_{i: e_i = e} t(f | e) / Σ_{i'=0..l} t(f | e_{i'})

The algorithm 1. Start with an initial estimate of t(f | e), e.g., a uniform distribution. 2. Calculate P(a | F, E). 3. Calculate Ct(f, e); normalize to get t(f | e). 4. Repeat Steps 2-3 until the "improvement" is too small.

Calculating t(f | e) with the new formulae E1: b c, F1: x y; E2: b, F2: y. Collecting counts: –Ct(x,b) = 1/2 / (1/2+1/2) = 1/2 –Ct(y,b) = 1/2 / (1/2+1/2) + 1/1 = 3/2 –Ct(x,c) = 1/2 / (1/2+1/2) = 1/2 –Ct(y,c) = 1/2 / (1/2+1/2) = 1/2 After normalization: –t(x | b) = 1/2 / (1/2+3/2) = 1/4, t(y | b) = 3/4 –t(x | c) = 1/2 / 1 = 1/2, t(y | c) = 1/2
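A sketch of the same E-step without enumerating alignments: each occurrence of a Fr word spreads one count over the Eng words in proportion to t(f | e). The NULL word is omitted here only because the toy example omits it:

```python
from collections import defaultdict

def model1_em_step(corpus, t):
    """One Model 1 EM iteration using the closed-form fractional counts."""
    counts, totals = defaultdict(float), defaultdict(float)
    for eng, fr in corpus:
        for f in fr:
            z = sum(t[(f, e)] for e in eng)      # sum_i t(f | e_i)
            for e in eng:
                c = t[(f, e)] / z                # fractional count for this link
                counts[(f, e)] += c
                totals[e] += c
    return {(f, e): c / totals[e] for (f, e), c in counts.items()}

corpus = [(["b", "c"], ["x", "y"]), (["b"], ["y"])]
t = {("x", "b"): 0.5, ("y", "b"): 0.5, ("x", "c"): 0.5, ("y", "c"): 0.5}
print(model1_em_step(corpus, t))   # t(x|b)=1/4, t(y|b)=3/4, t(x|c)=1/2, t(y|c)=1/2
```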

Estimating t(f | e) in Model 2 Ct(f, e) is slightly different from the one in Model 1: each link is weighted by the distortion prob as well, so the fractional count for linking f_j to e_i is t(f_j | e_i) d(i | j, m, l) / Σ_{i'} t(f_j | e_{i'}) d(i' | j, m, l).

Estimating d(i | j, m, l) in Model 2 Let Ct(i, j, m, l) be the fractional count that Fr position j is linked to Eng position i. Then d(i | j, m, l) = Ct(i, j, m, l) / Σ_{i'} Ct(i', j, m, l).
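A sketch of one Model 2 EM iteration collecting both kinds of counts; the dictionary keys used here ((f, e) for translation, (i, j, m, l) for distortion) are just this sketch's convention:

```python
from collections import defaultdict

def model2_em_step(corpus, t, d):
    """One Model 2 EM iteration: link posteriors use t(f | e_i) * d(i | j, m, l)."""
    t_cnt, t_tot = defaultdict(float), defaultdict(float)
    d_cnt, d_tot = defaultdict(float), defaultdict(float)
    for eng, fr in corpus:
        e = ["NULL"] + eng
        l, m = len(eng), len(fr)
        for j, f in enumerate(fr, start=1):
            z = sum(t[(f, e[i])] * d[(i, j, m, l)] for i in range(l + 1))
            for i in range(l + 1):
                c = t[(f, e[i])] * d[(i, j, m, l)] / z     # fractional count
                t_cnt[(f, e[i])] += c
                t_tot[e[i]] += c
                d_cnt[(i, j, m, l)] += c
                d_tot[(j, m, l)] += c
    new_t = {k: v / t_tot[k[1]] for k, v in t_cnt.items()}
    new_d = {k: v / d_tot[k[1:]] for k, v in d_cnt.items()}
    return new_t, new_d
```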

The algorithm 1. Start with an initial estimate of t(f | e), e.g., a uniform distribution. 2. Calculate P(a | F, E). 3. Calculate Ct(f, e); normalize to get t(f | e). 4. Repeat Steps 2-3 until the "improvement" is too small.

EM algorithm EM: expectation maximization. In a model with hidden states (e.g., word alignment), how can we estimate the model parameters? EM does the following: –E-step: given the current parameter estimates, calculate the expected values of the hidden data (e.g., the expected alignment counts). –M-step: use the expected values to re-estimate the parameters so as to maximize the likelihood of the training data.

Objective function The likelihood of the training data, Σ_{(E,F)} log P(F | E); each EM iteration improves (or at least does not decrease) it.

Training Summary Mathematically motivated: –Having an objective function to optimize –Using several clever tricks The resulting formulae –are intuitively expected –can be calculated efficiently EM algorithm –Hill climbing: each iteration is guaranteed to improve the objective function –It is not guaranteed to reach the global optimum.

Model 1 and 2 Modeling –Generative process –Decomposition –Formula and types of parameters Training Finding the best alignment

The best alignment in Model 1-5 Given E and F, we are looking for the best alignment a*: a* = argmax_a P(a | E, F) = argmax_a P(F, a | E)

The best alignment in Model 1 Since the positions are scored independently given E, each a_j can be chosen separately: a_j* = argmax_{i in 0..l} t(f_j | e_i)

The best alignment in Model 2 a_j* = argmax_{i in 0..l} t(f_j | e_i) d(i | j, m, l)
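Because the positions are scored independently, the best alignment can be read off position by position; a sketch for Model 2 (Model 1 is the special case d(i | j, m, l) = 1/(l+1), where the d factor drops out of the argmax):

```python
def best_alignment_model2(eng, fr, t, d):
    """a_j = argmax_i t(f_j | e_i) * d(i | j, m, l), computed independently per j."""
    e = ["NULL"] + eng
    l, m = len(eng), len(fr)
    return [max(range(l + 1),
                key=lambda i: t.get((f, e[i]), 0.0) * d.get((i, j, m, l), 0.0))
            for j, f in enumerate(fr, start=1)]
```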

Summary of Model 1 and 2 Modeling: –Pick the length of F with prob P(m | l). –For each position j: Pick an Eng word position a_j, with prob P(a_j | j, m, l). Pick a Fr word f_j according to the Eng word e_i, with t(f_j | e_i), where i = a_j. –The resulting formula can be calculated efficiently. Training: EM algorithm; the update can be done efficiently. Finding the best alignment: can be done easily.

Limitations of Model 1 and 2 There could be some relations among the Fr words generated by the same Eng word (w.r.t. positions and fertility). The relations are not captured by Model 1 and 2. They are captured by Model 3 and 4.

Outline General concepts –Source channel model –Word alignment –Notations Model 1-2 Model 3-4

Model 3 and 4

Modeling –Generative process –Decomposition and final formula –Types of parameters Training Finding the best alignment Decoding

Generative process For each Eng word e_i, choose a fertility φ_i. For each e_i, generate φ_i Fr words. Choose the position of each Fr word.
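A deliberately simplified sketch of this story: it skips the special handling of NULL fertility and uses a toy position-choice scheme (sampling among the still-empty slots), so it illustrates the three steps rather than the exact Model 3 definition. All parameter tables and names are hypothetical:

```python
import random

def sample(dist):
    """Draw a key from a {key: weight} dictionary."""
    keys = list(dist)
    return random.choices(keys, weights=[dist[k] for k in keys])[0]

def generate_model3_sketch(eng, fertility, t, d):
    """fertility[e]: {phi: prob}; t[e]: {f: prob}; d[(i, m, l)]: {j: prob}."""
    l = len(eng)
    # 1. Choose a fertility phi_i for each Eng word.
    phis = [sample(fertility[e]) for e in eng]
    # 2. Generate phi_i Fr words for each Eng word (tagged with the Eng position i).
    words = [(i + 1, sample(t[e]))
             for i, e in enumerate(eng) for _ in range(phis[i])]
    m = len(words)
    # 3. Choose a position for each Fr word among the slots still empty.
    fr, free = [None] * m, list(range(1, m + 1))
    for i, w in words:
        weights = {j: d[(i, m, l)].get(j, 1e-9) for j in free}
        j = sample(weights)
        free.remove(j)
        fr[j - 1] = w
    return fr
```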

An example NULL the cheapest nonstop flights

An example NULL the cheapest nonstop flights: the → le; cheapest → moins cher; nonstop → sans escale; flights → vols; NULL → (empty). After choosing positions: vols sans escale le moins cher

Decomposition

Approximations and types of parameters Fertility prob: n(φ | e); Translation prob: t(f | e); Distortion prob: d(j | i, m, l). Words generated by the NULL word are placed in the remaining empty slots, where N is the number of empty slots.

Approximations and types of parameters (cont)

Modeling summary For each Eng word e_i, choose a fertility that depends only on e_i. For each e_i, generate the Fr words, each depending only on e_i. Choose the position of each Fr word: –Model 3: the position depends only on the position of the Eng word generating it. –Model 4: the position depends on more context (e.g., the positions of previously placed words and word classes).

Training Use EM, just as for Model 1 and 2. Translation and distortion probabilities can be calculated efficiently; fertility probabilities cannot. There is no efficient algorithm to find the best alignment.

Model 3 and 4 Modeling –Generative process –Decomposition and final formula –Types of parameters Training Finding the best alignment Decoding

Model 1-4: modeling

Model 1-4: training Similarities: –Same objective function –Same algorithm: EM algorithm Differences: –Summation over all alignments can be done efficiently for Model 1-2, but not for Model 3-4. –Best alignment can be found efficiently for Model 1-2, but not for Model 3-4.

Summary General concepts –Source channel model: P(E) and P(F|E) –Notations –Word alignment: each Fr word comes from exactly one Eng word (including e_0). Model 1-2 Model 3-4