Machine Translation: Discriminative Word Alignment
Stephan Vogel, Spring Semester 2011
Generative Alignment Models

- Generative word alignment models: P(f, a | e) = …
- Alignment a is a hidden variable: the actual word alignment is not observed, so we sum over all alignments
- Well-known models: IBM 1 … 5, HMM, ITG
- Model lexical association, distortion, fertility
- It is difficult to incorporate additional information:
  - POS of words (used in the distortion model, not as direct link features)
  - Manual dictionary
  - Syntax information
  - …
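For concreteness, IBM Model 1 is the simplest instance of this generative form: the hidden alignment is summed out, and the model decomposes into lexical translation probabilities t(f_j | e_i):

    P(f \mid e) = \sum_a P(f, a \mid e) = \frac{\varepsilon}{(I+1)^J} \prod_{j=1}^{J} \sum_{i=0}^{I} t(f_j \mid e_i)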
Discriminative Word Alignment

- Model the alignment directly: p(a | f, e)
- Find the alignment that maximizes p(a | f, e)
- Well-suited framework: maximum entropy
- Set of feature functions h_m(a, f, e), m = 1, …, M
- Set of model parameters (feature weights) λ_m, m = 1, …, M
- Decision rule:

    \hat{a} = \arg\max_a \sum_{m=1}^{M} \lambda_m h_m(a, f, e)
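A minimal sketch of this decision rule in code, assuming a hypothetical list of candidate alignments, toy feature functions, and already-tuned weights (none of these come from the slides):

    # Minimal sketch of the log-linear decision rule.
    # Feature functions and candidate alignments are hypothetical stand-ins.

    def loglinear_score(a, f, e, features, weights):
        # sum_m lambda_m * h_m(a, f, e)
        return sum(w * h(a, f, e) for h, w in zip(features, weights))

    def decide(candidates, f, e, features, weights):
        # argmax over the candidate alignments
        return max(candidates,
                   key=lambda a: loglinear_score(a, f, e, features, weights))

    # Example: two toy features -- link count and exact-match count.
    features = [
        lambda a, f, e: float(len(a)),
        lambda a, f, e: float(sum(1 for (j, i) in a if f[j] == e[i])),
    ]
    weights = [0.1, 1.0]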
Integrating Additional Dependencies

- The log-linear model allows the integration of additional dependencies that carry additional information:
  - POS
  - Parse trees
  - …
- Add an additional variable v to capture these dependencies
- New decision rule:

    \hat{a} = \arg\max_a \sum_{m=1}^{M} \lambda_m h_m(a, f, e, v)
Tasks

- Modeling: design feature functions that capture cross-lingual divergences
- Search: find the alignment with the highest probability
- Training: find optimal feature weights
  - Minimize alignment errors given some gold-standard alignments (note: the alignments are no longer hidden!)
  - Supervised training, i.e. we evaluate against a gold standard
- Note: feature functions may themselves result from a training procedure
  - E.g. a statistical dictionary resulting from IBM-n alignment, trained on a large corpus
  - Here an additional training step on a small (hand-aligned) corpus (similar to MERT for the decoder)
2005 – Year of DWA

- Yang Liu, Qun Liu, and Shouxun Lin. 2005. Log-linear Models for Word Alignment.
- Abraham Ittycheriah and Salim Roukos. 2005. A Maximum Entropy Word Aligner for Arabic-English Machine Translation.
- Ben Taskar, Simon Lacoste-Julien, and Dan Klein. 2005. A Discriminative Matching Approach to Word Alignment.
- Robert C. Moore. 2005. A Discriminative Framework for Bilingual Word Alignment.
- Necip Fazil Ayan, Bonnie J. Dorr, and Christof Monz. 2005. NeurAlign: Combining Word Alignments Using Neural Networks.
Yang Liu et al. 2005

- Start out with features used in generative alignment
- Lexicons, e.g. IBM1
  - Use both directions, p(f_j | e_i) and p(e_i | f_j), and/or a symmetric model
  - => symmetric alignment model
- Fertility model: p(φ_i | e_i)
More Features

- Cross count: number of crossings in the alignment
- Neighbor count: number of links in the immediate neighborhood
- Exact match: number of src/tgt pairs where src = tgt
- Linked word count: total number of links (to influence density)
- Link types: how many 1-1, 1-m, m-1, n-m alignments
- Sibling distance: if a word is aligned to multiple words, add the distance between these aligned words
- Link co-occurrence count: given multiple alignments (e.g. Viterbi alignments from the IBM models), count how often links co-occur
- (A sketch of a few of these counts follows below.)
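A sketch of how some of these counts can be computed, assuming the alignment is represented as a set of (j, i) link pairs; the exact definitions used by Liu et al. may differ:

    # Illustrative feature counts over an alignment given as a set of (j, i) links.

    def cross_count(links):
        # Pairs of links that cross: j1 < j2 but i1 > i2 (or vice versa).
        ls = list(links)
        return sum(1 for a, (j1, i1) in enumerate(ls)
                     for (j2, i2) in ls[a + 1:]
                     if (j1 - j2) * (i1 - i2) < 0)

    def neighbor_count(links):
        # For each link, count links in its immediate (3x3) neighborhood.
        s = set(links)
        return sum(1 for (j, i) in s
                     for dj in (-1, 0, 1) for di in (-1, 0, 1)
                     if (dj, di) != (0, 0) and (j + dj, i + di) in s)

    def exact_match_count(links, f, e):
        # Linked positions where source and target word are identical.
        return sum(1 for (j, i) in links if f[j] == e[i])

    def linked_word_count(links):
        # Total number of links.
        return len(links)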
Search

- Greedy search based on the gain from adding a link
- For each of the features the gain can be calculated, e.g. for IBM1
- Algorithm (a runnable version follows below):

    Start with an empty alignment
    Loop until there is no additional gain:
        Loop over all (j, i) not in the set:
            if gain(j, i) > best_gain, store as (j', i')
        Set link (j', i')
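The same greedy loop as runnable Python, assuming a generic score() callable that returns the model score of a complete alignment (a stand-in for the weighted feature sum above):

    def greedy_search(score, J, I):
        # Greedily add the single link with the highest positive gain
        # until no remaining link improves the score.
        links = set()
        current = score(links)
        while True:
            best_gain, best_link = 0.0, None
            for j in range(J):
                for i in range(I):
                    if (j, i) in links:
                        continue
                    gain = score(links | {(j, i)}) - current
                    if gain > best_gain:
                        best_gain, best_link = gain, (j, i)
            if best_link is None:
                return links
            links.add(best_link)
            current += best_gain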
Moore 2005

- Log-likelihood-ratio-based model
  - Measures word association strength
  - Values can get large
- Conditional-link-probability-based model
  - Estimated probability of two words being linked
  - Uses a simpler alignment model to establish the links
  - Adds simple smoothing
- Additional features: one-to-one, one-to-many, non-monotonicity
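One common form of the log-likelihood-ratio association score (the exact variant used by Moore is an assumption here) sums over the occurrence and non-occurrence events of the two words:

    LLR(f, e) = \sum_{f' \in \{f, \neg f\}} \sum_{e' \in \{e, \neg e\}} C(f', e') \, \log \frac{p(f' \mid e')}{p(f')}

where C(f', e') counts sentence pairs exhibiting the event combination; large positive values indicate strong association, and the values grow with corpus size.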
Training

- Finding the optimal alignment is non-trivial
  - Adding a link can affect the non-monotonicity and one-to-many features
  - Dynamic programming does not work
  - Beam search can be used, but requires pruning
- Parameter optimization: a modified version of averaged-perceptron learning (a minimal sketch follows below)
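A minimal sketch of plain averaged-perceptron training, assuming a hypothetical feature extractor phi() and a search() routine such as the beam search above; Moore's actual modification is not reproduced here:

    from collections import defaultdict

    def averaged_perceptron(data, phi, search, epochs=5):
        # data: list of (f, e, gold_links) triples
        # phi(a, f, e): sparse feature-count dict for alignment a (hypothetical)
        # search(w, f, e): best alignment under weights w (e.g. beam search)
        w = defaultdict(float)        # current weights
        total = defaultdict(float)    # running sum of weights for averaging
        n = 0
        for _ in range(epochs):
            for f, e, gold in data:
                pred = search(w, f, e)
                if pred != gold:
                    # Move weights toward the gold features,
                    # away from the predicted ones.
                    for k, v in phi(gold, f, e).items():
                        w[k] += v
                    for k, v in phi(pred, f, e).items():
                        w[k] -= v
                for k, v in w.items():
                    total[k] += v
                n += 1
        return {k: v / n for k, v in total.items()}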
Modeling Alignment with CRFs

- A CRF is an undirected graphical model
  - Each vertex (node) represents a random variable whose distribution is to be inferred
  - Each edge represents a dependency between two random variables
  - The distribution of each discrete random variable Y in the graph is conditioned on an input sequence X
  - Cliques: sets of nodes in the graph that are fully connected
- In our case:
  - Features derived from the source and target words are the input sequence X
  - The alignment links are the random variables Y
- Different ways to model alignment:
  - Blunsom & Cohn (2006): many-to-one word alignment, where each source word is aligned to zero or one target words (-> asymmetric)
  - Niehues & Vogel (2008): model not a sequence but the entire alignment matrix (-> symmetric)
Modeling the Alignment Matrix

- Random variables y_ji for all possible alignment links
- Two values, 0/1: the word in position j is not linked / linked to the word in position i
- Represented as nodes in a graph
Modeling the Alignment Matrix (cont.)

- Factored nodes x representing the features (observables)
- Linked to the random variables
- Define a potential for each y_ji
Probability of Alignment
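In the standard factor-graph form, the probability of the alignment matrix y given the observations x is the normalized product of the clique potentials:

    p(y \mid x) = \frac{1}{Z(x)} \prod_{c} \Phi_c(y_c, x), \qquad Z(x) = \sum_{y'} \prod_{c} \Phi_c(y'_c, x)

with log-linear potentials, e.g. \Phi_c(y_c, x) = \exp\big( \sum_m \lambda_m h_m(y_c, x) \big).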
Features

- Local features, e.g. lexical, POS, …
- Fertility features
- First-order features: capture the relations between links
- Phrase features: interaction between word and phrase alignment
Local Features

- Local information about the link probability
  - Features derived from positions j and i only
  - Factored node connected to only one random variable
- Features:
  - Lexical probabilities, also normalized to (f, e)
  - Word identity (e.g. for numbers, names)
  - Word similarity (e.g. cognates)
  - Relative position distance
  - Link indicator feature: is (j, i) linked in the Viterbi alignment from a generative alignment model?
  - POS: indicator feature for every src/tgt POS pair
  - High-frequency word indicator feature for every src/tgt word pair over the most frequent words
Fertility Features

- Model word fertility on the source and target sides
- Link to all nodes in a row/column
- Constraint: model fertility only up to a maximum fertility N
- Indicator features: one for each fertility n ≤ N
- Alternative: use the fertility probabilities from IBM4 training
  - Now different for different words
First-Order Features

- Links depend on the links of neighboring words
- Always link two nodes
- Different features for different directions: (1,1), (1,2), (2,1), (1,0), …
- Capture distortions, similar to the HMM and IBM4 alignments
- Indicator features that fire if both links are set
- Also a POS first-order feature: indicator feature for link(j, i) and (POS_j, POS_i) and link(j+k, i+l)
Inference – Finding the Best Alignment

- Word alignment corresponds to an assignment of the random variables
  => Find the most probable variable assignment
- Problem:
  - Complex model structure with many loops
  - No exact inference possible
- Solution:
  - Belief propagation algorithm
  - Inference by message passing
  - Runtime exponential in the number of connected nodes
Belief Propagation

- Messages are sent from the random-variable nodes to the factored nodes, and also in the opposite direction
- Start with some initial values, e.g. uniform
- In each iteration:
  - Calculate the messages from hidden node (j, i) and send them to factored node c
  - Calculate the messages from factored node c and send them to hidden node (j, i)
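In the standard (loopy) sum-product formulation, the two message types are:

    \mu_{(j,i) \to c}(y_{ji}) = \prod_{c' \in N(j,i) \setminus \{c\}} \mu_{c' \to (j,i)}(y_{ji})

    \mu_{c \to (j,i)}(y_{ji}) = \sum_{y_c \setminus y_{ji}} \Phi_c(y_c, x) \prod_{(j',i') \in N(c) \setminus \{(j,i)\}} \mu_{(j',i') \to c}(y_{j'i'})

where N(·) denotes the neighbors of a node in the factor graph. The sum in the factor-to-variable message runs over all assignments of the factor's variables except y_{ji}, which is why the runtime is exponential in the number of nodes connected to a factor.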
Getting the Probability

- After several iterations, a belief value is calculated from the messages sent to the hidden nodes
- The belief value can be interpreted as a posterior probability
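Concretely, the belief at hidden node (j, i) is the product of the incoming factor messages:

    b(y_{ji}) \propto \prod_{c \in N(j,i)} \mu_{c \to (j,i)}(y_{ji})

so that, after normalization, b(y_{ji} = 1) approximates the posterior link probability p(y_{ji} = 1 | x).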
Training

- Maximize the log-likelihood of the correct alignment
  - Use gradient descent to find the optimum
- Train towards minimum alignment error
  - Needs a smoothed version of AER
  - Express AER in terms of link indicator functions
  - Use a sigmoid of the link probability
- A 2-step approach can be used:
  1. Optimize towards ML
  2. Optimize towards AER
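The alignment error rate being smoothed here is the standard definition over sure links S and possible links P (with S ⊆ P), given hypothesis links A:

    AER(A; S, P) = 1 - \frac{|A \cap S| + |A \cap P|}{|A| + |S|}

Replacing the hard membership tests for A with a sigmoid of each link's score makes the measure differentiable, so it can be optimized by gradient descent.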
Some Results: Spanish-English

- Features:
  - IBM1 and IBM4 lexicons
  - Fertilities
  - Link indicator feature
  - POS features
  - Phrase features
- Impact on translation quality (BLEU scores):

                Dev      Eval
    Baseline    40.04    47.73
    DWA         41.62    48.13
Summary

- In the last 5 years, new efforts in word alignment: discriminative word alignment
  - Integrates many features
  - Needs a small amount of hand-aligned data to tune (train) the feature weights
- Different variants:
  - Log-linear modeling
  - Conditional random fields: sequence and alignment-matrix models
- Significant improvements in word alignment error rate
  - But not always improvements in translation quality
  - Different alignment density -> different phrase table size
  - Need to adjust the phrase extraction algorithms?