A Tale about PRO and Monsters
Preslav Nakov, Francisco Guzmán and Stephan Vogel
ACL, Sofia, August 5, 2013.



2 Parameter Optimization: MERT, PRO, kb-MIRA, Rampion

3 Some Parameter Optimizers for SMT

Optimizer                                           Scales to many parameters?   Fits the typical SMT architecture?
MERT (Och, 2003)                                    NO                           YES: batch   (simple but effective)
MIRA (Watanabe et al., 2007; Chiang et al., 2008)   YES                          NO: online
PRO (Hopkins & May, 2011)                           YES                          YES: batch   (increased stability... really?)

4 PRO in a Nutshell

PRO treats tuning as a ranking problem: take two translations j and j' from the n-best list; if j is better than j' according to the evaluation score (BLEU+1), then the new weights should also rank j above j' according to the model score.
[Diagram: two translations j and j', ranked once by BLEU+1 score and once by model score.]

5 The Original PRO Algorithm

PRO's steps (1-3 for each sentence separately; 4 combines all):
1. Sampling: randomly sample 5,000 pairs (j, j') from an n-best list
2. Selection: choose those whose BLEU+1 diff > 5 BLEU
3. Acceptance: accept (at most) the top 50 sentence pairs (with max differences)
4. Learning: use the pairs for all sentences to train a ranker

Requires good training examples.
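The per-sentence steps 1-3 can be sketched in Python (a minimal sketch, not PRO's actual implementation; the function name and the representation of the n-best list as strings with a BLEU+1 lookup dict are assumptions of this sketch):

```python
import random

def pro_pairs(nbest, bleu, n_samples=5000, min_diff=5.0, n_accept=50):
    """Sketch of PRO's per-sentence pair generation.

    nbest: list of candidate translations for one source sentence.
    bleu:  dict mapping each candidate to its sentence-level BLEU+1
           score, in BLEU points (hypothetical representation).
    """
    selected = []
    for _ in range(n_samples):                       # 1. Sampling
        j, j2 = random.sample(nbest, 2)
        diff = bleu[j] - bleu[j2]
        if abs(diff) > min_diff:                     # 2. Selection: > 5 BLEU+1 points
            winner, loser = (j, j2) if diff > 0 else (j2, j)
            selected.append((winner, loser, abs(diff)))
    selected.sort(key=lambda p: p[2], reverse=True)  # 3. Acceptance: keep top diffs
    return selected[:n_accept]
```

Step 4 would then pool the accepted pairs from all tuning sentences and train a binary ranker (e.g. a logistic-regression classifier) on the feature-vector differences.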

A Cautionary Tale

7 Tuning on Long Sentences …

MERT works just fine.
[Plot: tuning BLEU and length ratio over iterations; NIST Arabic-English, tuned on the longest 50% of MT06.]

8 …There is Evidence that…

PRO is unstable: up to 5x variation across reruns!
[Plot: tuning BLEU and length ratio over iterations, showing MONSTERS; NIST Arabic-English, tuned on the longest 50% of MT06.]
Monsters also happen on IWSLT and Spanish-English.

9 …Monsters Exist…

What? Bad negative examples:
- Low BLEU
- Too long
- Very divergent from the positive examples
- Not useful for learning

When?
- Tuning on longer sentences
- Several language pairs

[Diagram: positive and negative examples in feature space (x1, x2); the monsters form a distant cluster of negatives.]

10 … and Breed…

n-best list accumulation ensures that monsters persist across iterations.

11 … to Ruin your Translations…

REF: but we have to close ranks with each other and realize that in unity there is strength while in division there is weakness.
IT1: but we are that we add our ranks to some of us and that we know that in the strength and weakness in
IT3: , we are the but of the that that the, and, of ranks the the on the the our the our the some of we can include, and, of to the of we know the the our in of the of some people, force of the that that the in of the that that the the weakness Union the the, and
IT4: namely Dr Heba Handossah and Dr Mona been pushed aside because a larger story EU Ambassador to Egypt Ian Burg highlighted 've dragged us backwards and dragged our speaking, never blame your defaulting a December 7th 1941 in Pearl Harbor ) we can include ranks will be joined by all 've dragged us backwards and dragged our $ 3.8 billion in tourism income proceeds Chamber are divided among themselves : some 've dragged us backwards and dragged our were exaggerated. Al Hakim namely Dr Heba Handossah and Dr Mona December 7th 1941 in Pearl Harbor ) cases might be known to us December 7th 1941 in Pearl Harbor ) platform depends on combating all liberal policies Track and Field Federation shortened strength as well face several challenges, namely Dr Heba Handossah and Dr Mona platform depends on combating all liberal policies the report forecast that the weak structure

(REF: reference translation; ITn: the 1-best translation of the same sentence at tuning iteration n.)

12 …and Only PRO Fears Them…

[Plot: test BLEU on MT09 for several optimizers tuned on the longest 50% of MT06 (NIST Ar-En); only PRO degrades, losing about 3 BP.]
Optimizing for sentence-level BLEU+1 yields short translations (Nakov et al., COLING 2012).
*MIRA = batch-MIRA (Cherry & Foster, 2012)

13 ...but Why?

PRO's steps:
1. Sampling: randomly sample 5,000 pairs
2. Selection: choose those whose BLEU+1 diff > 5 BLEU (focuses on large differentials)
3. Acceptance: accept the top 50 sentence pairs with max differences (selects the TOP differentials)
4. Learning: use the pairs for all sentences to train a ranker

Two candidate fixes: (1) change the selection; (2) accept at random.

14 On Slaying Monsters

Selection:
1. Cut-offs
2. Filter outliers
3. Stochastic sampling

Acceptance:
1. Random sampling

15 Selection Methods: Cut-offs

BLEU+1 diff:
- BLEU diff > 5 (default)
- BLEU diff < 10
- BLEU diff < 20

Length diff:
- length diff < 10 words
- length diff < 20 words
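These cut-offs amount to a simple filter over the selected pairs. A sketch under assumed representations (each pair is reduced here to its (BLEU+1 difference, length difference) tuple; the function name and thresholds-as-arguments are choices of this sketch, not the paper's code):

```python
def cutoff_filter(pairs, max_bleu_diff=20.0, max_len_diff=20):
    """Keep pairs with a meaningful but not monstrous differential:
    the BLEU+1 diff must exceed PRO's default 5 points but stay below
    max_bleu_diff, and the length diff must stay below max_len_diff."""
    return [(b, l) for b, l in pairs
            if 5.0 < b < max_bleu_diff and l < max_len_diff]
```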

16 Selection Methods: Outliers

Assume the BLEU+1 differences are Gaussian, and filter out outliers that are more than λ times the standard deviation away from the mean:
- λ = 2
- λ = 3
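Under the Gaussian assumption, the outlier filter can be sketched as follows (assumed helper name; `diffs` holds the BLEU+1 differentials of the selected pairs):

```python
import statistics

def filter_outliers(diffs, lam=2.0):
    """Drop differentials lying more than lam sample standard deviations
    from the mean, assuming they are roughly Gaussian-distributed."""
    mu = statistics.mean(diffs)
    sigma = statistics.stdev(diffs)
    return [d for d in diffs if abs(d - mu) <= lam * sigma]
```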

17 Selection Methods: Stochastic Sampling

1. Generate an empirical distribution over the pairs (j, j')
2. Sample according to it: select a pair if p_rand <= p(j, j')
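A sketch of this idea (hypothetical helper; how the empirical acceptance probability p(j, j') is estimated from the distribution of BLEU+1 differentials is left out):

```python
import random

def stochastic_select(pairs, probs):
    """Keep pairs[i] when a uniform draw p_rand falls at or below its
    empirical acceptance probability probs[i]."""
    return [p for p, pr in zip(pairs, probs) if random.random() <= pr]
```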

18 Experimental Setup

- NIST Ar-En
- TM: NIST 2012 data (no UN)
- LM: 5-gram, English Gigaword v.5
- Tuning: longest 50% of MT06 (contrast: full MT06)
- Test: MT09
- 3 reruns for each experiment!

19 Altering Selection (Tuning on Longest 50% of MT06)

[Plot: every altered selection method kills the monsters.]
NOTE: We still require at least 5 BLEU+1 points of difference.

20 Altering Selection: Testing on Full MT09

- Tuning on longest 50%: better BLEU, increased stability; kills monsters; outperforms the others
- Tuning on all: same BLEU, same or better stability

NOTE: We still require at least 5 BLEU+1 points of difference.

21 Random Accept (Tuning on Longest 50% of MT06)

[Plot: random acceptance kills the monsters.]
NOTE: No minimum BLEU+1 points of difference.
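The random-acceptance variant swaps PRO's top-50 acceptance step for a uniform draw from the selected pairs; a minimal sketch (function name and pair representation assumed):

```python
import random

def random_accept(pairs, n_accept=50):
    """Accept n_accept pairs uniformly at random instead of the
    n_accept pairs with the largest BLEU+1 differentials."""
    return random.sample(pairs, min(n_accept, len(pairs)))
```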

22 Random Accept: Testing on Full MT09

- Tuning on longest 50%: better BLEU, increased stability; outperforms the others
- Tuning on all: worse BLEU, more unstable

NOTE: No minimum BLEU+1 points of difference.

23 Summary

Sampling-based methods:
- Do not kill monsters
- Make distributional assumptions
- Assume monsters are rare

Random acceptance:
- Kills monsters
- Decreases discriminative power
- Lowers test scores when tuning on the full set

Simple cut-offs:
- Protect against monsters
- Do not affect performance when tuning on the full set
- Recommended!

24 Moral of the Tale

Monsters: examples unsuitable for learning.
PRO's policies are to blame:
- Selection
- Acceptance
Slaying monsters with cut-offs also gives:
- more stability
- better BLEU
If you use PRO, you should care! Would you risk it?
Coming to Moses 1.0 soon!

25 Thank you! Questions?