Slide 1: Machine Translation - Minimum Error Rate Training
Stephan Vogel, Spring Semester 2011

Slide 2: Overview
- Optimization approaches
  - Simplex
  - MER
  - Avoiding local minima
- Additional considerations
  - Tuning towards different metrics
  - Tuning on different development sets

Slide 3: Tuning the SMT System
- We use different models in the SMT system
  - The models have simplifications
  - They are trained on different amounts of data
- => The models have different levels of reliability, and their scores have different ranges
- => Give a different weight to each model:
  Q = c_1·Q_1 + c_2·Q_2 + … + c_n·Q_n
- Find the optimal scaling factors (feature weights) c_1 … c_n
- Optimal means: highest score for the chosen evaluation metric M, i.e. find (c_1, …, c_n) such that M(argmin_e Q(e, f)) is high
- The metric M is our objective function (a sketch of the weighted combination follows below)
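A minimal sketch of this weighted model combination (the data layout and function names are illustrative, not the original course code; lower combined cost Q is better):

```python
# Sketch of the log-linear combination Q = c_1*Q_1 + ... + c_n*Q_n,
# where each hypothesis carries a vector of model costs Q_1..Q_n.

def combined_cost(weights, model_costs):
    """Q = c_1*Q_1 + ... + c_n*Q_n for one hypothesis."""
    return sum(c * q for c, q in zip(weights, model_costs))

def one_best(weights, nbest):
    """The decoder's 1-best: argmin_e Q(e, f) over the candidate translations.

    nbest: list of (translation, [Q_1, ..., Q_n]) pairs.
    """
    return min(nbest, key=lambda hyp: combined_cost(weights, hyp[1]))
```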

Slide 4: Problems
- The surface of the objective function is not nice
  - Not convex -> local minima (actually, many local minima)
  - Not differentiable -> gradient descent methods not readily applicable
  - There may be dangerous areas (boundary cliffs), where a small change has a big effect
- Example:
  - Tune on a dev set with short reference translations
  - Optimization leads towards short translations
  - A new test set has long reference translations
  - The translations are now too short -> length penalty

Slide 5: Brute Force Approach - Manual Tuning
- Decode with different scaling factors
  - Get a feeling for the range of good values
  - Get a feeling for the importance of the models
    - The LM is typically most important
    - Sentence length (word count feature) balances the shortening effect of the LM
    - Word reordering is more or less effective depending on the language
- Narrow down the range in which the scaling factors are tested
- Essentially multi-linear optimization
- Works well for a small number of models
- Time consuming (CPU-wise) if decoding takes a long time

Slide 6: Automatic Tuning
- Many algorithms for finding (near) optimal solutions are available:
  - Simplex
  - Powell (line search)
  - MIRA (Margin Infused Relaxed Algorithm)
  - Specially designed minimum error training (Och 2003)
  - Genetic algorithms
- Note: the models are not improved, only their combination
- Note: some parameters change the performance of the decoder but are not in Q:
  - Number of alternative translations
  - Beam size
  - Word reordering restrictions

Slide 7: Automatic Tuning on N-best Lists
- Optimization algorithms need many iterations - too expensive to run full translations
- => Use n-best lists, e.g. for each of 500 source sentences the 1000 best translations
- Changing the scaling factors results in re-ranking the n-best lists
- Evaluate the new 1-best translations
- Apply any of the standard optimization techniques
- Advantage: much faster
- The counts (e.g. n-gram matches) can be pre-calculated for each translation to speed up evaluation, as in the sketch below
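A hedged sketch of one such evaluation step, reusing combined_cost from the sketch above. corpus_score is an assumed aggregator that turns the cached per-hypothesis statistics (e.g. n-gram match counts) into a metric score, so nothing is re-decoded or re-scored from scratch:

```python
# Re-rank each cached n-best list under candidate weights and score the
# resulting 1-best translations from their precomputed metric statistics.

def evaluate_weights(weights, nbest_lists, corpus_score):
    """nbest_lists: one list per source sentence, entries (model_costs, stats)."""
    chosen_stats = []
    for nbest in nbest_lists:
        best = min(nbest, key=lambda h: combined_cost(weights, h[0]))
        chosen_stats.append(best[1])        # cached stats of the new 1-best
    return corpus_score(chosen_stats)
```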

Slide 8: Simplex (Nelder-Mead)
- Start with n+1 random configurations
- Get the 1-best translation for each configuration -> objective function
- Sort the points x_k according to the objective function: f(x_1) < f(x_2) < … < f(x_{n+1})
- Calculate x_0 as the center of gravity of x_1 … x_n
- Replace the worst point with a point reflected through the centroid:
  x_r = x_0 + r·(x_0 − x_{n+1})

Slide 9: Demo
- Obviously, we need to change the size of the simplex to enforce convergence
- Also, we want to adjust the step size:
  - If the new point is the best point - increase the step size
  - If the new point is worse than x_1 … x_n - decrease the step size

Slide 10: Expansion and Contraction
- Reflection: calculate x_r = x_0 + r·(x_0 − x_{n+1});
  if f(x_1) <= f(x_r) < f(x_n), replace x_{n+1} with x_r; next iteration
- Expansion: if the reflected point is better than the best, i.e. f(x_r) < f(x_1),
  calculate x_e = x_0 + e·(x_0 − x_{n+1});
  if f(x_e) < f(x_r), replace x_{n+1} with x_e, else replace x_{n+1} with x_r; next iteration.
  Otherwise contract.
- Contraction: the reflected point has f(x_r) >= f(x_n);
  calculate x_c = x_{n+1} + c·(x_0 − x_{n+1});
  if f(x_c) <= f(x_{n+1}), replace x_{n+1} with x_c, else shrink
- Shrinking: for all x_k, k = 2 … n+1: x_k = x_1 + s·(x_k − x_1); next iteration
- A runnable sketch of these rules follows below
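A compact, runnable sketch of these update rules, assuming we minimize a black-box objective f over weight vectors (the coefficient names r, e, c, s follow the slide; 1, 2, 0.5, 0.5 are the usual defaults):

```python
import numpy as np

def nelder_mead(f, x_init, r=1.0, e=2.0, c=0.5, s=0.5, iters=200):
    """x_init: n+1 starting weight vectors; returns the best point found."""
    simplex = [np.asarray(x, dtype=float) for x in x_init]
    for _ in range(iters):
        simplex.sort(key=f)                  # f(x_1) <= f(x_2) <= ... <= f(x_{n+1})
        x0 = np.mean(simplex[:-1], axis=0)   # centroid of all but the worst point
        worst = simplex[-1]
        xr = x0 + r * (x0 - worst)           # reflection
        if f(simplex[0]) <= f(xr) < f(simplex[-2]):
            simplex[-1] = xr                 # accept the reflected point
        elif f(xr) < f(simplex[0]):          # better than the best: try expansion
            xe = x0 + e * (x0 - worst)
            simplex[-1] = xe if f(xe) < f(xr) else xr
        else:                                # f(x_r) >= f(x_n): contraction
            xc = worst + c * (x0 - worst)
            if f(xc) <= f(worst):
                simplex[-1] = xc
            else:                            # shrink everything towards the best point
                simplex = [simplex[0]] + [simplex[0] + s * (x - simplex[0])
                                          for x in simplex[1:]]
    return min(simplex, key=f)
```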

Slide 11: Changing the Simplex
[Figure: the four simplex updates - reflection, expansion, contraction, shrinking - illustrated on a simplex with points x_1, the centroid x_0, and the worst point x_{n+1}]

Slide 12: Powell Line Search
- Select directions in the search space, then:
  loop until convergence:
    loop over directions d:
      perform a line search along direction d until convergence
- Many variants:
  - How to select the directions
    - Easiest is to use the model scores
    - Or combine multiple scores
  - Step size in the line search
- MER (Och 2003) is a line search along the models with a smart selection of steps (see the sketch below)
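A sketch of the outer structure only, with a crude grid-based line search standing in for the real step-size selection; a full Powell implementation also updates the direction set, and MER replaces the grid with the exact threshold points of the following slides:

```python
import numpy as np

def line_search(f, x, d, steps=np.linspace(-1.0, 1.0, 21)):
    """Crude grid line search: try x + t*d for a fixed set of step sizes t."""
    return min((x + t * d for t in steps), key=f)

def direction_set_search(f, x, directions, sweeps=10):
    """Cycle over the search directions, line-searching along each one."""
    for _ in range(sweeps):
        for d in directions:
            x = line_search(f, x, d)
    return x
```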

Slide 13: Minimum Error Training
- For each hypothesis we have Q = Σ_k c_k·Q_k
- Select one feature k and separate its contribution:
  Q = c_k·Q_k + Σ_{n≠k} c_n·Q_n = c_k·Q_k + Q_Rest
- As a function of c_k, each hypothesis is a straight line: the individual model score Q_k gives the slope, Q_Rest the offset
[Figure: total model score Q plotted over c_k for one hypothesis (WER = 8), a line with slope Q_k and intercept Q_Rest]

Slide 14: Minimum Error Training
- Source sentence 1
- Depending on the scaling factor c_k, different hyps are in the 1-best position
- Set c_k so that the metric-best hyp is also model-best
[Figure: model-score lines over c_k for hypotheses h_11 (WER = 8), h_12 (WER = 5), h_13 (WER = 4); the 1-best hypothesis changes from h_11 to h_12 to h_13]

Slide 15: Minimum Error Training
- Select a minimum number of evaluation points (see the sketch below):
  - Calculate the intersection points of the hypothesis lines
  - Keep an intersection point only if its hyps are minimal at that point
  - Choose one evaluation point between consecutive intersection points
[Figure: the same lines h_11 (WER = 8), h_12 (WER = 5), h_13 (WER = 4), with evaluation points placed between the intersection points]
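A hedged sketch of this selection for one sentence and one feature weight c_k: each hypothesis is given as a (slope, offset, id) triple with slope Q_k and offset Q_Rest, and the function returns the lower envelope of the hypothesis lines, i.e. the interval of c_k on which each hypothesis is model-best. Only the envelope's breakpoints need to be checked:

```python
def lower_envelope(lines):
    """lines: (slope, offset, hyp_id) triples, one line per hypothesis.

    Returns segments (c_start, (slope, offset, hyp_id)): hyp_id is model-best
    from c_start up to the next segment's c_start (the last one up to +inf).
    """
    # Sort by slope descending (ties: smaller offset first): the first line
    # is 1-best at c -> -inf, and the envelope's slopes decrease with c.
    lines = sorted(lines, key=lambda l: (-l[0], l[1]))
    env = []
    for m, b, hyp in lines:
        if env and env[-1][1][0] == m:
            continue                     # parallel, higher offset: never below
        while env:
            c_start, (m0, b0, _) = env[-1]
            x = (b - b0) / (m0 - m)      # intersection with the last segment
            if x <= c_start:
                env.pop()                # that segment is never optimal: drop it
            else:
                env.append((x, (m, b, hyp)))
                break
        if not env:
            env.append((float("-inf"), (m, b, hyp)))
    return env
```

Merging the breakpoints of all sentences and evaluating the metric once per interval (e.g. at its midpoint) then yields the exact best c_k along this direction.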

Slide 16: Minimum Error Training
- Source sentence 1, now with different error scores
- The optimization would find a different c_k
- => Different metrics lead to different scaling factors
[Figure: the same lines, now with h_11: WER = 8, h_12: WER = 2, h_13: WER = 4; the best hypothesis changes with c_k]

Slide 17: Minimum Error Training
- Sentence 2
- The best c_k is in a different range
- No matter which c_k, h_22 would never be 1-best
[Figure: lines for h_21 (WER = 2), h_22 (WER = 0), h_23 (WER = 5); only h_21 and h_23 ever reach the 1-best position]

Slide 18: Minimum Error Training
- Multiple sentences: merge the intersection points of all sentences and evaluate the metric over the combined 1-best translations
[Figure: the hypothesis lines of sentences 1 and 2 overlaid - h_11 (WER = 8), h_12 (WER = 5), h_13 (WER = 4) and h_21 (WER = 2), h_22 (WER = 0), h_23 (WER = 5)]

Slide 19: Iterate Decoding - Optimization
- The n-best list is a (very restricted) substitute for the search space
- With updated feature weights we may have generated other (better) translations
- Some of the hyps in the n-best list would have been pruned
- Iterate (outlined in the sketch below):
  - Re-translate with the new feature weights
  - Merge the new translations with the old translations (increases stability)
  - Run the optimizer over the larger n-best lists
  - Repeat until no new translations, or improvement < epsilon, or just k times (typically 5-10 iterations)
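An outline of this loop; decode_nbest, optimize_weights, and the convergence threshold are hypothetical stand-ins for the real components:

```python
def mert_outer_loop(weights, dev_src, max_iters=10, eps=1e-4):
    """Iterate decoding and n-best optimization until scores stop improving."""
    pool = {}                      # merged n-best lists, keyed by source sentence
    prev_score = float("-inf")
    for _ in range(max_iters):
        for src in dev_src:
            hyps = decode_nbest(weights, src)          # hypothetical decoder call
            pool.setdefault(src, set()).update(hyps)   # merge old + new translations
        weights, score = optimize_weights(weights, pool)  # e.g. Simplex/Powell/MER
        if score - prev_score < eps:                   # no (sufficient) improvement
            break
        prev_score = score
    return weights
```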

Slide 20: Avoiding Local Minima
- Optimization can get stuck in a local minimum
- Remedies:
  - Fiddle around with the parameters of your optimization algorithm
  - Larger n-best list -> more evaluation points
  - Combine with a Simulated Annealing type approach (Smith & Eisner, 2007)
  - Restart multiple times (see the sketch below)
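A sketch of the last remedy, which the comparisons on the next slide suggest matters most: wrap any of the optimizers above in random restarts and keep the best run (optimize and evaluate are assumed callables; the restart range is arbitrary):

```python
import random

def with_restarts(optimize, evaluate, n_weights, restarts=20):
    """Run the optimizer from many random starting points; keep the best run."""
    best_w, best_score = None, float("-inf")
    for _ in range(restarts):
        w0 = [random.uniform(-1.0, 1.0) for _ in range(n_weights)]  # arbitrary range
        w = optimize(w0)
        score = evaluate(w)                # metric score on the dev set
        if score > best_score:
            best_w, best_score = w, score
    return best_w
```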

Slide 21: Random Restarts
- Comparison Simplex/Powell (Alok, unpublished)
- Comparison Simplex/extended Simplex/MER (Bing Zhao, unpublished)
- Observations:
  - Alok: Simplex is jumpier than Powell
  - Bing: Simplex is better than MER
  - Both: you need many restarts

Slide 22: Optimizing NOT Towards References
- Ideally, we want system output identical to the reference translations
- But there is no guarantee that the system can generate the reference translations (under realistic conditions):
  - E.g. we restrict the reordering window
  - We have unknown words: the reference translations may contain words unknown to the system
- Instead of forcing the decoder towards the reference translations, optimize towards the best translations the system can generate:
  - Find the hypotheses with the best metric score
  - Use those as pseudo references
  - Optimize towards the pseudo references
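A minimal sketch of the pseudo-reference construction, assuming a sentence-level scorer sentence_metric (higher is better; the function name is an assumption):

```python
def pseudo_references(nbest_lists, references, sentence_metric):
    """For each sentence, pick the metric-best hypothesis the system produced."""
    pseudo = []
    for nbest, ref in zip(nbest_lists, references):
        best = max(nbest, key=lambda hyp: sentence_metric(hyp, ref))
        pseudo.append(best)      # reachable by the system, by construction
    return pseudo
```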

Slide 23: Optimizing Towards Different Metrics
- Automatic metrics have different characteristics
- Optimizing towards one does not mean that other metric scores will also go up
- In particular, different metrics prefer shorter or longer translations; typically: TER < BLEU < METEOR (< means shorter translation)
- Mauser et al. (2007) on the Ch-En NIST 2005 test set:
  - Reasonably well behaved
  - The resulting length of translation differs by more than 15%

Slide 24: Generalization to Other Test Sets
- Optimize on one set, test on multiple other sets
- Again Mauser et al., Ch-En
- Shown is the behavior over Simplex optimization iterations:
  - Nice, nearly parallel development of the metric scores
  - However, we have also observed brittle behavior, esp. when the ratio src_length / ref_length is very different between the dev and eval test sets

Slide 25: Large Weight = Important Feature?
- Assume we have c_LM = 1.0, c_TM = 0.55, c_WC = 3.2
- Which feature is most important? Cannot say!
- We want to re-rank the n-best lists
- Feature weights scale the feature values such that they can compete
- Example (see the sketch below):
  - The variation in the LM and TM scores is larger than for WC
  - A large weight is needed for WC to make its small differences effective
- To know whether a feature is important, remove it and look at the drop in metric score
[Table: feature scores Q_LM, Q_TM, Q_WC and combined score Q for hypotheses H1-H3; the values were lost in the transcript]
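An illustrative sketch of this point (the helper and numbers are invented, not from the slide): a feature's leverage on the ranking is roughly |weight| times the spread of its scores across the n-best list, so a feature with a small score range needs a large weight to compete:

```python
import statistics

def effective_influence(weights, nbest_features):
    """Rough per-feature leverage: |weight| * std-dev of the feature's scores.

    nbest_features: one feature-score vector per hypothesis in the n-best list.
    """
    influence = []
    for k, c in enumerate(weights):
        spread = statistics.pstdev(h[k] for h in nbest_features)
        influence.append(abs(c) * spread)
    return influence

# With weights (c_LM, c_TM, c_WC) = (1.0, 0.55, 3.2): if the WC scores vary
# far less across the n-best list than the LM/TM scores, all three products
# can still be comparable, i.e. a large weight does not imply importance.
```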

Slide 26: Open Issues
- Shouldn't all optimizers get the same results, if done right?
  - The models are the same; it's just finding the right mix
  - If local minima can be avoided, then similarly good optima should be found
- How to stay safe?
  - Avoid good optima close to cliffs
  - If different configurations give very similar metric scores, pick the one which is more stable
- One size fits all?
  - Why one set of feature weights? How about different sets for:
    - Good/bad translations (tuning on the tail: mixed results so far)
    - Short/long sentences
    - Beginning/middle/end of sentence
    - ...

Slide 27: Summary
- Optimize the system by modifying the scaling factors (feature weights)
- Different optimization approaches can be used
  - Simplex and Powell are most common
  - MERT (Och) is similar to Powell, with pre-calculation of the grid points
- Many local optima; avoid getting stuck early
  - Most effective: many restarts
- Generalization
  - To unseen test data: mostly OK; sometimes the selection of the dev set has a big impact (length penalty!)
  - To different metrics: reasonably stable (metrics are reasonably correlated in most cases)
- Still open questions => more research needed