Slide 1: Machine Translation - Decoder for Phrase-Based SMT
Stephan Vogel, Spring Semester 2011

Slide 2: Decoder
- Decoding issues (previous session)
- Two-step decoding
  - Generation of the translation lattice
  - Best path search
  - With limited word reordering
- Specific issues
  - Recombination of hypotheses
  - Pruning
  - N-best list generation
  - Future cost estimation

Slide 3: Recombination of Hypotheses
- Recombination: of two hypotheses, keep only the better one if no future information can switch their current ranking
- Notice: this depends on the models
  - The model score may depend on the current partial translation and the extension, e.g. the LM
  - The model score may depend on global features known only at the sentence end, e.g. a sentence length model
- The models define equivalence classes for the hypotheses
- Expand only the best hypothesis in each equivalence class (see the sketch below)
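The following is a minimal sketch of how such equivalence classes might be implemented in a stack decoder. The Hypothesis fields, the choice of recombination key (coverage plus LM history), and the dictionary-based stack are illustrative assumptions for this lecture, not the interface of any particular decoder.

```python
from dataclasses import dataclass, field
from typing import FrozenSet, List, Optional, Tuple

@dataclass
class Hypothesis:
    cost: float                      # accumulated negative-log model score
    coverage: FrozenSet[int]         # positions of covered source words
    lm_history: Tuple[str, ...]      # last n-1 target words for an n-gram LM
    target_len: int                  # number of generated target words
    target_words: Tuple[str, ...] = ()           # target side of the last phrase
    backpointer: Optional["Hypothesis"] = None   # surviving predecessor
    recombined: List["Hypothesis"] = field(default_factory=list)  # kept for n-best

def recombination_key(hyp: Hypothesis):
    # Hypotheses are equivalent iff no model can re-rank them later on;
    # here we assume coverage and LM history are the only "stateful" features.
    return (hyp.coverage, hyp.lm_history)

def recombine(stack: dict, hyp: Hypothesis) -> None:
    """Insert hyp into a stack keyed by equivalence class, keeping only the best
    hypothesis per class and attaching the losers for later n-best extraction."""
    key = recombination_key(hyp)
    incumbent = stack.get(key)
    if incumbent is None:
        stack[key] = hyp
    elif hyp.cost < incumbent.cost:
        hyp.recombined.append(incumbent)
        stack[key] = hyp
    else:
        incumbent.recombined.append(hyp)
```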

Slide 4: Recombination of Hypotheses: Example
- n-gram LM
- Hypotheses
  H1: I would like to go
  H2: I would not like to go
  Assume as possible expansions: to the movies | to the cinema | and watch a film
- The LM score of the expansion is identical for H1 and for H2 for bigram, trigram, and 4-gram LMs
- E.g. the 3-gram LM score of expansion 1 is: -log p( to | to go ) - log p( the | go to ) - log p( movies | to the )
- Therefore: if Cost(H1) < Cost(H2), then Cost(H1+E) < Cost(H2+E) for all possible expansions E, so H2 can be recombined into H1

Slide 5: Recombination of Hypotheses: Example 2
- Sentence length model p( I | J )
- Hypotheses
  H1: I would like to go
  H2: I would not like to go
  Assume as possible expansions: to the movies | to the cinema | and watch a film
- Length( H1 ) = 5, Length( H2 ) = 6
- For identical expansions the lengths will remain different
- Situation at the sentence end:
  - Possible that -log P( len( H1 + E ) | J ) > -log P( len( H2 + E ) | J )
  - Then possible that TotalCost( H1 + E ) > TotalCost( H2 + E )
  - I.e. re-ranking of the hypotheses
- Therefore: cannot recombine H2 into H1

Slide 6: Recombination: Keep 'em Around
- Expand only the best hyp
- Store pointers to the recombined hyps for n-best list generation
[Figure: a best hypothesis h_b with recombined hypotheses h_r attached to it; vertical axis: better score, horizontal axis: increasing coverage]

Slide 8: Recombination of Hypotheses
- Typical features for recombination of partial hypotheses:
  - LM history
  - Positions of covered source words - some translations are more expensive
  - Number of generated words on the target side - for the sentence length model
- Often only the number of covered source words is considered, rather than the actual positions
  - Fits the typical organization of a decoder: hyps are stored according to the number of covered source words
  - Hyps are recombined which are not strictly comparable
  - Use the future cost estimate to lessen the impact
- Overall: trade-off between speed and 'correctness' of the search
  - Ideally: only compare (and recombine) hyps if all models used in the search see them as equivalent
  - Realistically: use fewer, coarser equivalence classes by 'forgetting' some of the models (they still add to the scores)

Slide 9: Effect of Reordering
[Table: NIST and BLEU scores for Chinese-English and Arabic-English at different reordering windows R; the numbers did not survive the transcript]
- R: reordering window; R = 1: monotone decoding
- Reordering mainly improves fluency, i.e. a stronger effect on BLEU
- Improvement for Arabic: 4.8% NIST and 12.7% BLEU
- Less improvement for Chinese: ~5% in BLEU
- Arabic devtest set (203 sentences)
- Chinese test set 2002 (878 sentences)

Slide 10: Search Space
- Example: sentence with 48 words
- Full search using coverage and language model state
- "Av. Expanded" is averaged over the entire test set
[Table: expanded hypotheses, hash collisions, and average expanded hypotheses per reordering window R; the figures are garbled in the transcript]
- More reordering -> more collisions
- Growth of the search space is counteracted by recombination of hypotheses and by pruning

Slide 11: Pruning
- Pruning
  - Even after recombination there are too many hyps
  - Remove bad hyps and keep only the best ones
  - In recombination we compared hyps which are equivalent under the models
  - Now we need to compare hyps which are not strictly equivalent under the models
  - We risk removing hyps which would have won the race in the long run
  - I.e. we introduce errors into the search
- Search errors vs. model errors
  - Model errors: our models give higher probability to a worse translation
  - Search errors: our decoder loses translations with higher probability

Slide 12: Pruning: Which Hyps to Compare?
- Which hyps are we comparing?
- How many should we keep?
[Figure: hypothesis stacks illustrating recombination vs. pruning]

Slide 13: Pruning: Which Hyps to Compare?
- Coarser equivalence relation => need to drop at least one of the models, or replace it by a simpler model
  - Recombination according to translated positions and LM state; pruning according to the number of translated positions and LM state
  - Recombination according to the number of translated positions and LM state; pruning according to the number of translated positions OR the LM state
  - Recombination with a 5-gram LM; pruning with a 3-gram LM
- Question: which is the more important feature?
  - Which leads to more search errors?
  - How much loss in translation quality?
  - Quality is more important than speed in most applications!
- Not one correct answer - depends on the other components of the system
- Ideally, the decoder allows for different recombination and pruning settings (see the sketch below)
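As a small illustration of this split between a fine recombination key and a coarser pruning key, the sketch below groups the hypotheses of a stack (as built by the hypothetical recombine() above) into the sets that compete during pruning. Using only the number of covered source words as the pruning key is just one of the choices mentioned on the slide, not a fixed recipe.

```python
from collections import defaultdict

def pruning_key(hyp: Hypothesis):
    # Coarser than recombination_key(): ignore the LM state and the exact
    # positions, keep only the number of covered source words.
    return len(hyp.coverage)

def pruning_groups(stack: dict):
    """Group the hypotheses of a stack (recombination key -> best hyp) into the
    sets that will be compared against each other during pruning."""
    groups = defaultdict(list)
    for hyp in stack.values():
        groups[pruning_key(hyp)].append(hyp)
    return groups
```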

Slide 14: How Many Hyps to Keep?
- Beam search: keep hyp h if Cost(h) < Cost(h_best) + const
- Models separate the alternatives a lot -> keep few hyps
- Models do not separate the alternatives -> keep many hyps
[Figure: hypothesis costs over the number of translated words; the bad hyps outside the beam around the best hypothesis are pruned]

Slide 15: Additive Beam
- Is an additive constant (in the log domain) the right thing to do?
- Hyps may spread more and more
[Figure: as the number of translated words grows, the costs spread and fewer and fewer hyps fall inside a fixed additive beam]

Slide 16: Multiplicative Beam
- Beam search: keep hyp h if Cost(h) < Cost(h_best) * const
[Figure: the beam opens up with the number of translated words and covers more hyps]
(A sketch covering both the additive and the multiplicative threshold follows below.)
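A minimal sketch of both beam criteria applied to one pruning group, assuming costs are negative log scores (smaller is better); the function name and the boolean switch are illustrative choices.

```python
def beam_prune(hyps, beam, multiplicative=False):
    """Keep hyp h iff Cost(h) < Cost(h_best) + beam (additive threshold),
    or Cost(h) < Cost(h_best) * beam (multiplicative threshold)."""
    best = min(h.cost for h in hyps)
    threshold = best * beam if multiplicative else best + beam
    return [h for h in hyps if h.cost < threshold]
```

With the multiplicative threshold the allowed gap grows with the best cost, which is the "opening beam" behaviour sketched on this slide.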

Slide 17: Pruning and Optimization
- Each feature has a feature weight
- Optimization by adjusting the feature weights
- This can result in compressing or spreading the scores
- This actually happened in our first MERT implementation:
  Higher and higher feature weights => hyps spreading further and further apart => fewer hyps inside the beam => lower and lower BLEU score
- Two-pronged repair:
  - Normalizing the feature weights
  - Not proper beam pruning, but restricting the number of hyps

Slide 18: How Many Hyps to Keep?
- Keep the n best hyps
- Does not use the information from the models to decide how many hyps to keep
[Figure: a constant number of hyps is kept at each number of translated words; the bad hyps are pruned]
(See the sketch below.)
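This alternative is commonly called histogram pruning; a short sketch over the same hypothetical hypothesis lists as above:

```python
import heapq

def histogram_prune(hyps, n):
    """Keep the n lowest-cost hypotheses, regardless of how far apart their scores are."""
    return heapq.nsmallest(n, hyps, key=lambda h: h.cost)
```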

Slide 19: Efficiency
- Two problems:
  - Sorting
  - Generating lots of hyps which are then pruned (what a waste of time)
- Can we avoid generating hyps which would most likely be pruned?

Slide 20: Efficiency
- Assumptions:
  - We want to generate hyps which cover n positions
  - All hyp sets H_k, k < n, are sorted according to total score
  - All phrase pairs (edges in the translation lattice) which can be used to expand a hyp h in H_k to cover n positions are sorted according to their score (weighted sum of the individual scores)
[Figure: sorted hyps h1..h5 and sorted phrase pairs p1..p4 are combined into new hyps (h1p1, h1p2, h2p1, ...), which are again kept sorted; the worst combinations are pruned]

Slide 21: Naïve Way
- Naïve way:
    foreach hyp h
      foreach phrase pair p
        newhyp = h ∘ p
        Cost(newhyp) = Cost(h) + Cost(p) + CostLM + CostDM + ...
- This generates many hyps which will be pruned

Slide 22: Early Termination
- If Cost(newhyp) = Cost(h) + Cost(p), it would be easy:
    besthyp = h1 ∘ p1
    loop                                  # over hyps, best first
      h = next hyp
      loop                                # over phrase pairs, best first
        p = next phrase pair
        newhyp = h ∘ p
        Cost(newhyp) = Cost(h) + Cost(p)
      until Cost(newhyp) > Cost(besthyp) + const
    until Cost(newhyp) > Cost(besthyp) + const
- That's for proper beam pruning; it would still generate too many hyps for the max-number-of-hyps strategy
- In addition, we have the LM and DM costs, etc.

Slide 23: 'Cube' Pruning
- Always expand the best hyp, until
  - there are no hyps within the beam anymore, or
  - the max number of hyps is reached
[Figure: grid of hyps h1..h3 against phrase pairs p1..p3; combinations are explored starting from the best corner]
(A heap-based sketch follows below.)
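The sketch below is one common way to realize this idea with a priority queue, assuming the hyps and phrase pairs are already sorted by cost; combine_cost(), max_new, and the tuple layout are illustrative choices rather than the lecture's exact formulation.

```python
import heapq

def cube_prune(hyps, phrases, combine_cost, max_new=100):
    """Lazily enumerate combinations of sorted hyps and sorted phrase pairs,
    best-first, instead of scoring the full |hyps| x |phrases| grid.
    combine_cost(h, p) returns the full cost of extending h with p, including
    LM/DM terms (which is why grid neighbors can be re-ranked)."""
    if not hyps or not phrases:
        return []
    frontier = [(combine_cost(hyps[0], phrases[0]), 0, 0)]   # best corner of the grid
    seen = {(0, 0)}
    new_hyps = []
    while frontier and len(new_hyps) < max_new:
        cost, i, j = heapq.heappop(frontier)
        new_hyps.append((cost, hyps[i], phrases[j]))
        for ni, nj in ((i + 1, j), (i, j + 1)):              # push the grid neighbors
            if ni < len(hyps) and nj < len(phrases) and (ni, nj) not in seen:
                seen.add((ni, nj))
                heapq.heappush(frontier, (combine_cost(hyps[ni], phrases[nj]), ni, nj))
    return new_hyps
```

A beam criterion could replace or complement the max_new cap by stopping as soon as the popped cost exceeds the best cost plus the beam width.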

Slide 24: Effect of Recombination and Pruning
- Average number of expanded hypotheses and NIST scores for different recombination (R) and pruning (P) combinations and different beam sizes (= number of hyps)
- Test set: Arabic DevTest (203 sentences)
[Table: rows C:c, CL:c, and CL:C list the average number of expanded hyps and the NIST scores per beam width; most of the numbers are garbled in the transcript]
- c = number of translation words, C = coverage vector, i.e. positions, L = LM history
- NIST scores: higher is better

Slide 25: Number of Hypotheses versus NIST
- Language model state required as recombination feature
- More hypotheses - better quality
- Different ways to achieve similar translation quality
- CL:C generates more 'useless' hypotheses (the number of bad hyps grows faster than the number of good hyps)

Slide 26: N-Best List Generation
- Benefit:
  - Required for optimizing the model scaling factors
  - Rescoring with richer models
  - For down-stream processing
    - Translation with a pivot language: L1 -> L2 -> L3
    - Information extraction
    - ...
- We have n-best translations at the sentence end
- But: hypotheses are recombined -> many good translations don't reach the sentence end
- Recover those translations

Slide 27: Storing Multiple Backpointers
- When recombining hypotheses, store them with the best (i.e. surviving) hypothesis, but don't expand them
[Figure: the best hypothesis h_b keeps pointers to the recombined hypotheses h_r]

Slide 28: Calculating True Score
- Propagate the final score backwards
  - For the best hypothesis we have the correct final score Q_f(h_b)
  - For a recombined hypothesis we know its current score Q_c(h_r) and the difference to the current score Q_c(h_b) of the best hypothesis
  - The final score of the recombined hypothesis is then: Q(h_r) = Q_f(h_b) + ( Q_c(h_r) - Q_c(h_b) )
- Use B = (Q, h, B') to store sequences of hypotheses which make up a translation
  - Start with the n best final hypotheses
  - For each of the top n Bs, go to the predecessor hypothesis and to the recombined hypotheses of the predecessor hypothesis
  - Store the Bs according to coverage
(See the sketch below.)
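A simplified sketch of this score propagation, reusing the hypothetical Hypothesis objects from the recombination sketch above; it only illustrates the Q(h_r) formula and a recursive enumeration of alternative derivations, not the best-first bookkeeping with back-items B = (Q, h, B').

```python
def true_score(h_r, h_b, Q_f):
    """Final score of a recombined hypothesis h_r that lost against h_b, where
    Q_f is the final score of the derivation going through h_b:
    Q(h_r) = Q_f(h_b) + (Q_c(h_r) - Q_c(h_b))."""
    return Q_f + (h_r.cost - h_b.cost)

def derivations(h, Q_f):
    """Enumerate (final score, hypothesis sequence) for derivations ending in h,
    branching into the recombined hypotheses at every backpointer step.
    A real decoder would enumerate these best-first and stop after n."""
    if h.backpointer is None:
        yield Q_f, [h]
        return
    pred = h.backpointer
    candidates = [(Q_f, pred)] + [(true_score(r, pred, Q_f), r) for r in pred.recombined]
    for Q_alt, p in candidates:
        for Q_full, prefix in derivations(p, Q_alt):
            yield Q_full, prefix + [h]
```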

Slide 29: Problem with N-Best Generation
- Duplicates when using phrases:
    US # companies # and # other # institutions
    US companies # and # other # institutions
    US # companies and # other # institutions
    US # companies # and other # institutions
    ...
- Example run: 1000-best -> ~400 different strings on average
  Extreme case: only 10 different strings
- Possible solution: check uniqueness during backtracking, i.e. create and hash the partial translations (see the sketch below)
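A rough sketch of the hashing idea, building on the hypothetical derivations() generator above and on the illustrative target_words field of the Hypothesis class; a real implementation would hash partial strings during the backward search rather than only complete ones.

```python
def unique_nbest(final_hyps, n=100):
    """Collect up to n distinct target strings from the derivations of the final
    hypotheses, skipping duplicates that differ only in phrase segmentation."""
    seen, nbest = set(), []
    for h in sorted(final_hyps, key=lambda hyp: hyp.cost):
        for score, path in derivations(h, h.cost):
            text = " ".join(w for hyp in path for w in hyp.target_words)
            if text not in seen:
                seen.add(text)
                nbest.append((score, text))
            if len(nbest) >= n:
                return nbest
    return nbest
```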

Slide 30: Rest-Cost Estimation
- In pruning we compare hyps which are not strictly equivalent under the models
  - Risk: we prefer hypotheses which have covered the easy parts
  - Remedy: estimate the remaining cost for each hypothesis and compare hypotheses based on ActualCost + FutureCost
- We want to know the minimum expected cost (similar to A* search)
  - Gives a bound for pruning
  - However, not possible with acceptable effort for all models
- We want to include as many models as possible
  - Translation model costs, word count, phrase count
  - Language model costs
  - Distortion model costs
- Calculate the expected cost R(l, r) for each span (l, r)

Slide 31: Rest Cost for Translation Models
- Translation model, word count and phrase count features are 'local' costs
  - They depend only on the current phrase pair
  - Strictly additive: R(l, m) + R(m, r) = R(l, r)
- Minimize over the alternative translations
  - For each source phrase span (l, r): initialize with the cost of the best translation
  - Combine adjacent spans, take the best combination (see the sketch below)
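A small dynamic-programming sketch of this combination step; phrase_costs is assumed to map each source span (l, r) to the cost of its best translation option (TM plus word/phrase count terms), and spans without any option stay at infinity (a real decoder would assign some OOV cost to uncovered single words).

```python
def rest_costs(J, phrase_costs):
    """R[l][r] = estimated future cost of translating source span [l, r), 0 <= l < r <= J."""
    INF = float("inf")
    R = [[INF] * (J + 1) for _ in range(J + 1)]
    for length in range(1, J + 1):
        for l in range(0, J - length + 1):
            r = l + length
            best = phrase_costs.get((l, r), INF)   # best single phrase covering the span
            for m in range(l + 1, r):              # or the best split into adjacent spans
                best = min(best, R[l][m] + R[m][r])
            R[l][r] = best
    return R
```

The future cost of a partial hypothesis is then the sum of R[l][r] over the maximal uncovered spans of its coverage vector.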

Slide 32: Rest Cost for Language Models
- We do not have the history -> only an approximation
  - For each span (l, r) calculate the LM score without history
  - Combine the LM scores of adjacent spans
  - Notice: p(e_1 ... e_m) * p(e_{m+1} ... e_n) != p(e_1 ... e_n) beyond a 1-gram LM
- Alternative: fast monotone decoding with the TM-best translations
  - History available
  - Then R(l, r) = R(1, r) - R(1, l)

Slide 33: Rest Cost for Distance-Based DM
- Distance-based DM: the rest cost depends on the coverage pattern
- Too many different coverage patterns, cannot pre-calculate
- Estimate by jumping to the first gap, then filling the gaps in sequence
- Moore & Quirk 2007: DM cost plus rest cost
  - S adjacent to S'': d = 0
  - S left of S': d = 2 L(S)
  - S' subsequence of S'': d = 2 ( D(S, S'') + L(S) )
  - Otherwise: d = 2 ( D(S, S') + L(S) )
- S = current phrase, S' = previous phrase, S'' = gap-free initial segment; L(.) = length of a phrase, D(.,.) = distance between phrases

Slide 34: Rest Cost for Lexicalized DM
- Lexicalized DM per phrase pair (f, e) = (f, t(f))
- DM(f, e) scores: in-mon, in-swap, in-dist, out-mon, out-swap, out-dist
- Treat as a local cost for each span (l, r)
- Minimize over the alternative translations and the different orientations in-* and out-*

Slide 35: Effect of Rest-Cost Estimation
- From Richard Zens 2008
- We did not describe 'per position'
- LM is important, DM is important

Slide 36: Summary
- Different translation strategies - related to word reordering
- Two-level decoding strategy (one possible way to do it)
  - Generating the translation lattice: contains all word and phrase translations
  - Finding the best path
- Word reordering as an extension to the best path search
  - Jump ahead in the lattice, fill in the gap later
  - Short reordering window: decoding time exponential in the size of the window
- Recombination of hypotheses
  - If the models cannot re-rank hypotheses, keep only the best
  - Depends on the models used

Slide 37: Summary
- Pruning of hypotheses
  - Beam pruning
  - Problem with too few hyps in the beam (e.g. when running MERT)
  - Keeping a maximum number of hyps
- Efficiency of the implementation
  - Try to avoid generating hyps which are later pruned
  - Cube pruning
- N-best list generation
  - Needed for MERT
  - Spurious ambiguity