1
Word-based SMT Ling 580 Fei Xia Week 1: 1/3/06
2
Outline
General concepts: source channel model, notations, word alignment
Model 1-2
Model 3-4
Model 5
3
IBM Model Basics Classic paper: Brown et al. (1993)
Translation: F → E (or Fr → Eng). Resource required: parallel data (a set of “sentence” pairs). Main concepts: source channel model, hidden word alignment, EM training.
4
Intuition Sentence pairs where the word mapping is one-to-one:
(1) S: a b c d e   T: l m n o p
(2) S: c a e       T: p n m
(3) S: d a c       T: n p l
From these pairs we can deduce (b, o), (d, l), (e, m), and either {(a, p), (c, n)} or {(a, n), (c, p)}.
5
Source channel model Task: S → T
Source channel (a.k.a. noisy channel, noisy source channel): use the Bayes rule. Two types of parameters: P(T), the language model, and P(S | T), whose meaning varies by application.
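In symbols (T is the hidden target, S the observed source):

\hat{T} = \arg\max_{T} P(T \mid S) = \arg\max_{T} P(T)\, P(S \mid T)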
6
Source channel model for MT
Diagram: the target sentence (drawn with P(T)) passes through the noisy channel (P(S | T)) to produce the observed source sentence. Two types of parameters: language model P(T); translation model P(S | T).
7
Source channel model for MT
Diagram: the English sentence (drawn with P(E)) passes through the noisy channel (P(F | E)) to produce the French sentence. Two types of parameters: language model P(E); translation model P(F | E).
8
Source channel for MT People think in English.
English thoughts can be characterized by a plausibility filter P(E). Sentences are “corrupted” into a different “language” by a translation model P(F | E). Our goal is to find the original, uncorrupted English sentence E. To achieve this goal, we efficiently evaluate P(E) * P(F | E) over many candidate English sentences.
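For MT this becomes:

\hat{E} = \arg\max_{E} P(E \mid F) = \arg\max_{E} P(E)\, P(F \mid E)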
9
Source channel vs. direct model
Source channel: demands a plausible English sentence and a strong correlation between E and F. Direct model: demands only a strong correlation between E and F. Are they the same? Formally, yes; in practice they differ because of the different approximations each makes.
10
Word alignment a(j) = i, also written a_j = i; a = (a_1, …, a_m). Ex: F: f1 f2 f3 f4 f5
E: e1 e2 e3 e4, with a_4 = 3 and a = (0, 1, 1, 3, 2)
11
The constraint on word alignment
The constraint: each Fr word is generated by exactly one Eng word (including e0), where l is the Eng sentence length and m is the Fr sentence length. Without the constraint there are 2^{lm} possible alignments; with it there are (l+1)^m. Why do the models use the constraint? Because we want to use P(f_j | e_i) to estimate P(F | E). How are the exceptional cases handled? Various methods: target word grouping, phrase-based SMT, etc.
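For a sense of scale (numbers added here for illustration): with l = 4 and m = 5, arbitrary many-to-many link sets number 2^{lm} = 2^{20} ≈ 1.05 million, while constrained alignments number only (l+1)^m = 5^5 = 3125.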
12
Modeling p(F | E) with alignment
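The hidden alignment is summed out:

P(F \mid E) = \sum_{a} P(F, a \mid E)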
13
Notation E: the Eng sentence, E = e_1 … e_l; e_i: the i-th Eng word.
F: the Fr sentence, F = f_1 … f_m; f_j: the j-th Fr word. e_0: the Eng NULL word; f_0: the Fr NULL word. a_j: the position of the Eng word that generates f_j.
14
Word alignment An alignment, a, is a function from Fr word positions to Eng word positions: a(j) = i means that f_j is generated by e_i. The constraint: each Fr word is generated by exactly one Eng word (including e_0), so a maps {1, …, m} into {0, 1, …, l}.
15
Notation (cont) l: Eng sentence length; m: Fr sentence length; i: Eng word position
j: Fr word position; e: an Eng word; f: a Fr word
16
Outline
General concepts: source channel model, word alignment, notations
Model 1-2
Model 3-4
17
Model 1 and 2
18
Model 1 and 2
Modeling: generative process, decomposition, formula and types of parameters
Training
Finding the best alignment
Decoding
19
Generative process To generate F from E:
(1) Pick a length m for F, with prob P(m | l). (2) Choose an alignment a, with prob P(a | E, m). (3) Generate the Fr sentence given the Eng sentence and the alignment, with prob P(F | E, a, m).
Another way to look at it:
(1) Pick a length m for F, with prob P(m | l). (2) For j = 1 to m: pick an Eng word position a_j, with prob P(a_j | j, m, l); then pick a Fr word f_j according to the Eng word e_i, where i = a_j, with prob P(f_j | e_i).
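A minimal Python sketch of this second view, with made-up toy tables for P(m | l), P(i | j, m, l), and t(f | e); the names and numbers are illustrative only, not from the slides.

```python
import random

# Toy parameter tables (illustrative values, not from the slides).
length_prob = {2: {2: 1.0}}               # P(m | l): Fr length m given Eng length l
align_prob = {                            # P(i | j, m, l): Eng position i for Fr position j
    (1, 2, 2): {0: 0.1, 1: 0.6, 2: 0.3},
    (2, 2, 2): {0: 0.1, 1: 0.3, 2: 0.6},
}
t_prob = {"NULL": {"de": 1.0},            # t(f | e)
          "white": {"blanco": 0.9, "casa": 0.1},
          "house": {"casa": 0.9, "blanco": 0.1}}

def sample(dist):
    """Draw a key from a {value: prob} dictionary."""
    r, total = random.random(), 0.0
    for k, p in dist.items():
        total += p
        if r <= total:
            return k
    return k

def generate(eng):
    """Generate a Fr sentence from an Eng sentence following the generative story."""
    e = ["NULL"] + eng                    # e_0 is the NULL word
    l = len(eng)
    m = sample(length_prob[l])            # step 1: pick Fr length m with P(m | l)
    fr, alignment = [], []
    for j in range(1, m + 1):
        i = sample(align_prob[(j, m, l)]) # step 2: pick an Eng position a_j
        alignment.append(i)
        fr.append(sample(t_prob[e[i]]))   # step 3: pick f_j with t(f_j | e_{a_j})
    return fr, alignment

print(generate(["white", "house"]))
```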
20
Decomposition
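For reference, the standard chain-rule decomposition (Brown et al., 1993) underlying these models is

P(F, a \mid E) = P(m \mid E)\, \prod_{j=1}^{m} P(a_j \mid a_1^{j-1}, f_1^{j-1}, m, E)\; P(f_j \mid a_1^{j}, f_1^{j-1}, m, E)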
21
Approximation Fr sent length depends only on Eng sent length:
Fr word depends only on the Eng word that generates it:
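In formulas:

P(m \mid E) \approx P(m \mid l), \qquad P(f_j \mid a_1^{j}, f_1^{j-1}, m, E) \approx t(f_j \mid e_{a_j})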
22
Approximation (cont) Estimating P(a | E, m):
Model 1: all alignments are equally likely. Model 2: alignments have different probabilities. Model 1 can be seen as a special case of Model 2 (see the formulas below).
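For reference, the standard forms:

\text{Model 1: } P(a_j = i \mid j, m, l) = \frac{1}{l+1} \qquad \text{Model 2: } P(a_j = i \mid j, m, l) = d(i \mid j, m, l)

Model 1 is the special case d(i \mid j, m, l) = 1/(l+1).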
23
Decomposition for Model 1
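Putting the pieces together:

P(F, a \mid E) = \frac{P(m \mid l)}{(l+1)^{m}} \prod_{j=1}^{m} t(f_j \mid e_{a_j})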
24
The magic (for Model 1)
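The “magic” is that the sum over all (l+1)^m alignments factorizes into a product of m small sums:

P(F \mid E) = \sum_{a} P(F, a \mid E) = \frac{P(m \mid l)}{(l+1)^{m}} \prod_{j=1}^{m} \sum_{i=0}^{l} t(f_j \mid e_i)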
25
Final formula and parameters for Model 1
Two types of parameters: length prob P(m | l); translation prob P(f_j | e_i), written t(f_j | e_i).
26
Decomposition for Model 2
Same as Model 1 except that Model 2 does not assume all alignments are equally likely.
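For reference:

P(F, a \mid E) = P(m \mid l) \prod_{j=1}^{m} d(a_j \mid j, m, l)\; t(f_j \mid e_{a_j})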
27
The magic for Model 2
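The same rearrangement works, since the positions are still independent:

P(F \mid E) = P(m \mid l) \prod_{j=1}^{m} \sum_{i=0}^{l} d(i \mid j, m, l)\; t(f_j \mid e_i)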
28
Final formula and parameters for Model 2
Three types of parameters: Length prob: P(m | l) Translation prob: t(fj | ei) Distortion prob: d(i | j, m, l)
29
Summary of Modeling Model 1 and Model 2 formulas (as above). Parameters:
Length prob P(m | l); translation prob t(f_j | e_i); distortion prob (Model 2 only) d(i | j, m, l)
30
Model 1 and 2
Modeling: generative process, decomposition, formula and types of parameters
Training
Finding the best alignment
Decoding
31
Training
Mathematically motivated: there is an objective function to optimize, and several clever tricks are used.
The resulting formulae are intuitively expected and can be calculated efficiently.
EM algorithm: hill climbing; each iteration is guaranteed to improve the objective function, but it is not guaranteed to reach the global optimum.
32
Length prob P(m | l): let Ct(m, l) be the number of sentence pairs where the Fr length is m and the Eng length is l. The length prob is a simple relative frequency; no iterations are needed.
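In other words:

P(m \mid l) = \frac{Ct(m, l)}{\sum_{m'} Ct(m', l)}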
33
Estimating t(f|e): a naïve approach
(1) Count the times that f appears in F while e appears in E. (2) Count the times that e appears in E. (3) Divide the first number by the second. Problem: this cannot distinguish true translations from pure co-occurrence. Ex: t(el | white) vs. t(blanco | white). Solution: count the times that f aligns to e.
34
Estimating t(f|e) in Model 1
Three cases: (1) each sentence pair has a unique word alignment; (2) each sentence pair has several word alignments, each with a probability; (3) no word alignments are given.
35
When there is a single word alignment
We can simply count. Training data (with its single alignment): Eng: b c b, Fr: x y y. Counts: ct(x,b)=0, ct(y,b)=2, ct(x,c)=1, ct(y,c)=0. Hence t(x|b)=0, t(y|b)=1.0, t(x|c)=1.0, t(y|c)=0.
36
When there are several word alignments
If a sentence pair has several word alignments, use fractional counts. Training data: the pair Eng “b c” / Fr “x y” with several alignments, each with a probability P(a|E,F) (shown in the original slide figure), plus the pair Eng “b” / Fr “y”. Fractional counts: Ct(x,b)=0.7, Ct(y,b)=1.5, Ct(x,c)=0.3, Ct(y,c)=0.5. Normalizing: t(x|b)=7/22, t(y|b)=15/22, t(x|c)=3/8, t(y|c)=5/8.
37
Fractional counts Let Ct(f, e) be the fractional count of the (f, e) pair in the training data, given the alignment prob P. The formula (below) combines the alignment prob P(a | E, F) with the actual count of times e and f are linked in (E, F) by alignment a.
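Written out, the two factors are the alignment prob and the actual link count:

Ct(f, e) = \sum_{(E,F)} \sum_{a} P(a \mid E, F)\; \cdot\; \#\{\, j : f_j = f \text{ and } e_{a_j} = e \,\}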
38
When there are no word alignments
We could list all the alignments, and estimate P(a | E, F).
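With the current parameter estimates, the alignment probability comes from Bayes' rule:

P(a \mid E, F) = \frac{P(F, a \mid E)}{\sum_{a'} P(F, a' \mid E)}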
39
Formulae so far New estimate for t(f|e)
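For reference, the re-estimation step is the normalized fractional count:

t(f \mid e) = \frac{Ct(f, e)}{\sum_{f'} Ct(f', e)}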
40
The algorithm (1) Start with an initial estimate of t(f | e), e.g., a uniform distribution. (2) Calculate P(a | F, E). (3) Calculate Ct(f, e) and normalize to get t(f | e). (4) Repeat steps 2-3 until the “improvement” is too small.
41
So far, we estimate t(f | e) by enumerating all possible alignments
This process is very expensive, as the number of possible alignments is (l+1)^m. The formula combines the previous iteration's estimate of the alignment prob with the actual count of times e and f are linked in (E, F) by alignment a.
42
No need to enumerate all word alignments
Luckily, for Model 1, there is a way to calculate Ct(f, e) efficiently.
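For reference, the standard efficient form of the Model 1 count for a sentence pair (E, F) is

Ct(f, e; E, F) = \frac{t(f \mid e)}{\sum_{i=0}^{l} t(f \mid e_i)} \;\cdot\; \#\{\, j : f_j = f \,\} \;\cdot\; \#\{\, i : e_i = e \,\}

summed over all sentence pairs; no explicit enumeration of alignments is needed.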
43
The algorithm (1) Start with an initial estimate of t(f | e), e.g., a uniform distribution. (2) Calculate P(a | F, E). (3) Calculate Ct(f, e) and normalize to get t(f | e). (4) Repeat steps 2-3 until the “improvement” is too small.
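A runnable Python sketch of this loop for Model 1, using the per-position posterior form of the count (equivalent to summing over all alignments); the NULL word and the length probability are omitted for simplicity. The corpus is the toy example from the additional slides at the end.

```python
from collections import defaultdict

# Toy corpus from the worked example: (Eng sentence, Fr sentence) pairs.
corpus = [(["b", "c"], ["x", "y"]),
          (["b"],      ["y"])]

eng_vocab = {e for E, _ in corpus for e in E}
fr_vocab  = {f for _, F in corpus for f in F}

# Step 1: initialize t(f | e) uniformly.
t = {e: {f: 1.0 / len(fr_vocab) for f in fr_vocab} for e in eng_vocab}

for iteration in range(2):
    # Steps 2-3: collect fractional counts Ct(f, e) using the current t(f | e).
    count = defaultdict(float)   # Ct(f, e)
    total = defaultdict(float)   # sum over f of Ct(f, e)
    for E, F in corpus:
        for f in F:
            # Posterior that f is generated by each Eng word
            # (Model 1: uniform alignment prior, so only t matters).
            z = sum(t[e][f] for e in E)
            for e in E:
                frac = t[e][f] / z
                count[(f, e)] += frac
                total[e] += frac
    # Normalize to get the new t(f | e).
    for e in eng_vocab:
        for f in fr_vocab:
            t[e][f] = count[(f, e)] / total[e]
    print(f"iteration {iteration + 1}: t = {t}")
```

After the first iteration this prints t(x|b)=0.25, t(y|b)=0.75, t(x|c)=t(y|c)=0.5, matching the “new formulae” slide in the additional slides; later iterations differ slightly from the hand-worked appendix numbers, because the appendix restricts itself to one-to-one alignments.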
44
Estimating t(f | e) in Model 2
Ct(f, e) is slightly different from the one in Model 1
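A reconstructed form of the Model 2 count (the distortion prob replaces Model 1's uniform 1/(l+1)):

Ct(f, e) = \sum_{(E,F)} \sum_{j=1}^{m} \sum_{i=0}^{l} \frac{d(i \mid j, m, l)\, t(f_j \mid e_i)}{\sum_{i'=0}^{l} d(i' \mid j, m, l)\, t(f_j \mid e_{i'})}\; \delta(f, f_j)\, \delta(e, e_i)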
45
Estimating d(i | j, m, l) in Model 2
Let Ct(i, j, m, l) be the fractional count that Fr position j is linked to the Eng position i.
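For reference, the standard update: for each sentence pair with lengths m and l,

Ct(i, j, m, l) \mathrel{+}= \frac{d(i \mid j, m, l)\, t(f_j \mid e_i)}{\sum_{i'=0}^{l} d(i' \mid j, m, l)\, t(f_j \mid e_{i'})}, \qquad d(i \mid j, m, l) = \frac{Ct(i, j, m, l)}{\sum_{i'} Ct(i', j, m, l)}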
46
The algorithm (1) Start with an initial estimate of t(f | e), e.g., a uniform distribution. (2) Calculate P(a | F, E). (3) Calculate Ct(f, e) and normalize to get t(f | e). (4) Repeat steps 2-3 until the “improvement” is too small.
47
Training Summary
EM algorithm: hill climbing; each iteration is guaranteed to improve the objective function, but it is not guaranteed to reach the global optimum.
The resulting formulae are intuitively expected and can be calculated efficiently.
48
Model 1 and 2
Modeling: generative process, decomposition, formula and types of parameters
Training
Finding the best alignment
49
The best alignment in Model 1-5
Given E and F, we are looking for the best alignment a*:
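That is,

a^{*} = \arg\max_{a} P(a \mid E, F) = \arg\max_{a} P(F, a \mid E)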
50
The best alignment in Model 1
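Because the positions are independent in Model 1, each a_j can be chosen separately:

a_j^{*} = \arg\max_{i \in \{0, \dots, l\}} t(f_j \mid e_i)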
51
The best alignment in Model 2
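Likewise for Model 2:

a_j^{*} = \arg\max_{i \in \{0, \dots, l\}} d(i \mid j, m, l)\; t(f_j \mid e_i)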
52
Summary of Model 1 and 2
Modeling: pick the length of F with prob P(m | l); for each position j, pick an Eng word position a_j with prob P(a_j | j, m, l), then pick a Fr word f_j according to the Eng word e_i, with t(f_j | e_i), where i = a_j. The resulting formula can be calculated efficiently.
Training: EM algorithm; the updates can be done efficiently.
Finding the best alignment: can be done easily.
53
Limitations of Model 1 and 2
There can be dependencies among the Fr words generated by the same Eng word (their positions and their number, i.e., fertility). These dependencies are not captured by Model 1 and 2; they are captured by Model 3 and 4.
54
Outline
General concepts: source channel model, word alignment, notations
Model 1-2
Model 3-4
55
Model 3 and 4
56
Model 3 and 4
Modeling: generative process, decomposition and final formula, types of parameters
Training
Finding the best alignment
Decoding
57
Generative process For each Eng word e_i, choose a fertility φ_i.
For each e_i, generate φ_i Fr words. Choose the position of each Fr word.
58
An example NULL the cheapest nonstop flights
59
An example NULL the cheapest nonstop flights. Fertility and translation: the → le; cheapest → moins cher; nonstop → sans escale; flights → vols; NULL → (empty). The generated words are then reordered into the Fr sentence: vols sans escale le moins cher.
60
Decomposition
61
Approximations and types of parameters
Where N is the number of empty slots.
62
Approximations and types of parameters (cont)
63
Modeling summary For each Eng word e_i, choose a fertility φ_i,
which depends only on e_i. For each e_i, generate φ_i Fr words; each depends only on e_i. Choose the position of each Fr word: in Model 3, the position depends only on the position of the Eng word generating it; in Model 4, the position depends on more context (e.g., where the previous word's translations were placed).
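For orientation only, a simplified sketch of the Model 3 probability, with the combinatorial factors for NULL-generated words and for the orderings of each word's translations left out:

P(F, a \mid E) \propto \prod_{i=1}^{l} n(\phi_i \mid e_i)\; \prod_{j=1}^{m} t(f_j \mid e_{a_j})\; \prod_{j:\, a_j \neq 0} d(j \mid a_j, m, l)

Note that the distortion here runs in the opposite direction from Model 2: it gives the Fr position j for the Eng position a_j.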
64
Training Use EM, just like Model 1 and 2
Translation and distortion probabilities can be calculated efficiently, but fertility probabilities cannot. There is no efficient algorithm for finding the best alignment.
65
Model 3 and 4
Modeling: generative process, decomposition and final formula, types of parameters
Training
Finding the best alignment
Decoding
66
Model 1-4: modeling
67
Model 1-4: training
Similarities: same objective function; same algorithm (EM).
Differences: summation over all alignments can be done efficiently for Model 1-2, but not for Model 3-4; the best alignment can be found efficiently for Model 1-2, but not for Model 3-4.
68
Summary
General concepts: source channel model (P(E) and P(F|E)); notations; word alignment (each Fr word comes from exactly one Eng word, including e0).
Model 1-2
Model 3-4
69
Additional slides
70
An example of Model 1 training
Training data: Sent 1: Eng “b c”, Fr “x y”; Sent 2: Eng “b”, Fr “y”. To reduce the number of alignments, assume that each Eng word generates exactly one Fr word. There are two possible alignments for Sent 1 and one for Sent 2. Step 1: initial t(f|e): t(x|b)=t(y|b)=1/2, t(x|c)=t(y|c)=1/2.
71
Step 2: calculating P(a|F,E)
The three alignments: a1 links b-x and c-y; a2 links b-y and c-x; a3 links b-y. Before normalization: P(a1|E1,F1)*Z = 1/2*1/2 = 1/4; P(a2|E1,F1)*Z = 1/2*1/2 = 1/4; P(a3|E2,F2)*Z = 1/2. After normalization: P(a1|E1,F1) = 1/4 / (1/4+1/4) = 1/2; P(a2|E1,F1) = 1/4 / (1/2) = 1/2; P(a3|E2,F2) = 1/2 / 1/2 = 1.
72
Step 3: calculating t(f | e)
(Same alignments a1, a2, a3 as above.) Collecting counts: Ct(x,b) = 1/2; Ct(y,b) = 1/2 + 1 = 3/2; Ct(x,c) = 1/2; Ct(y,c) = 1/2. After normalization: t(x|b) = 1/2 / (1/2+3/2) = 1/4, t(y|b) = 3/4; t(x|c) = 1/2 / 1 = 1/2, t(y|c) = 1/2.
73
Repeating step 2: calculating P(a|F,E)
(Same alignments as above.) Before normalization: P(a1|E1,F1)*Z = 1/4*1/2 = 1/8; P(a2|E1,F1)*Z = 3/4*1/2 = 3/8; P(a3|E2,F2)*Z = 3/4. After normalization: P(a1|E1,F1) = 1/8 / (1/8+3/8) = 1/4; P(a2|E1,F1) = 3/8 / (4/8) = 3/4; P(a3|E2,F2) = 3/4 / 3/4 = 1.
74
Repeating step 3: calculating t(f | e)
(Same alignments as above.) Collecting counts: Ct(x,b) = 1/4; Ct(y,b) = 3/4 + 1 = 7/4; Ct(x,c) = 3/4; Ct(y,c) = 1/4. After normalization: t(x|b) = 1/4 / (1/4+7/4) = 1/8, t(y|b) = 7/8; t(x|c) = 3/4 / (3/4+1/4) = 3/4, t(y|c) = 1/4.
75
See the trend? (P(a1) and P(a2) are the alignment posteriors used in that iteration's update.)

             t(x|b)  t(y|b)  t(x|c)  t(y|c)  P(a1)  P(a2)
  init        1/2     1/2     1/2     1/2     -      -
  1st iter    1/4     3/4     1/2     1/2     1/2    1/2
  2nd iter    1/8     7/8     3/4     1/4     1/4    3/4

t(y|b) and t(x|c) keep growing, and the alignment a2 (b-y, c-x) takes more and more of the probability mass.
76
Calculating t(f | e) with the new formulae
E1: b c, F1: x y; E2: b, F2: y. Collecting counts: Ct(x,b) = 1/2/(1/2+1/2) = 1/2; Ct(y,b) = 1/2/(1/2+1/2) + 1/1 = 3/2; Ct(x,c) = 1/2/(1/2+1/2) = 1/2; Ct(y,c) = 1/2/(1/2+1/2) = 1/2. After normalization: t(x|b) = 1/2 / (1/2+3/2) = 1/4, t(y|b) = 3/4; t(x|c) = 1/2 / 1 = 1/2, t(y|c) = 1/2.
77
EM algorithm EM: expectation maximization
In a model with hidden variables (e.g., word alignments), how can we estimate the model parameters? EM does the following: E-step: take the current model parameters and calculate the expected values of the hidden data. M-step: use those expected values to re-estimate the parameters, maximizing the likelihood of the training data.
78
Objective function
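For reference, the objective that the EM training maximizes is the likelihood of the training data:

\theta^{*} = \arg\max_{\theta} \prod_{(E,F)} P_{\theta}(F \mid E) = \arg\max_{\theta} \sum_{(E,F)} \log P_{\theta}(F \mid E)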