Ling 575: Machine Translation. Yuval Marton, Winter 2016. February 9: MT Evaluation. Much of the material is borrowed from the course slides of Chris Callison-Burch (2014) and Nizar Habash (2013).

MT Evaluation: more art than science. Wide range of metrics/techniques: interface, …, scalability, …, faithfulness, …, space/time complexity, … etc. Automatic vs. human-based evaluation: dumb machines vs. slow humans.

Human-based Evaluation Example: Accuracy Criteria

Human-based Evaluation Example: Fluency Criteria

Automatic Evaluation Example: the Bleu metric (Papineni et al., 2001). Bleu (BiLingual Evaluation Understudy) is a modified n-gram precision with a length penalty. Quick, inexpensive, and language independent; correlates highly with human evaluation; biased against synonyms and inflectional variations.

Automatic Evaluation Example: the Bleu metric. Test sentence: "colorless green ideas sleep furiously". Gold-standard references: (1) "all dull jade ideas sleep irately"; (2) "drab emerald concepts sleep furiously"; (3) "colorless immature thoughts nap angrily".

Automatic Evaluation Example: the Bleu metric. Test sentence: "colorless green ideas sleep furiously". Gold-standard references: (1) "all dull jade ideas sleep irately"; (2) "drab emerald concepts sleep furiously"; (3) "colorless immature thoughts nap angrily". Unigram precision = 4/5.

Automatic Evaluation Example: the Bleu metric. Test sentence: "colorless green ideas sleep furiously". Gold-standard references: (1) "all dull jade ideas sleep irately"; (2) "drab emerald concepts sleep furiously"; (3) "colorless immature thoughts nap angrily". Unigram precision = 4/5 = 0.8. Bigram precision = 2/4 = 0.5. Bleu score = (a_1 · a_2 · … · a_n)^(1/n) = (0.8 × 0.5)^(1/2) ≈ 0.632, where a_i is the i-gram precision.
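
To make the computation concrete, here is a minimal Python sketch of sentence-level Bleu with clipped (modified) n-gram precisions and a brevity penalty. It reproduces the numbers above, but it is an illustration only: real Bleu is computed at the corpus level, and in practice a standard implementation such as sacrebleu should be used.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """All n-grams of a token list, with counts."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def modified_precision(hyp, refs, n):
    """Clipped n-gram precision: each hypothesis n-gram is credited at most
    as many times as it occurs in any single reference."""
    hyp_counts = ngrams(hyp, n)
    max_ref_counts = Counter()
    for ref in refs:
        for gram, count in ngrams(ref, n).items():
            max_ref_counts[gram] = max(max_ref_counts[gram], count)
    clipped = sum(min(count, max_ref_counts[gram])
                  for gram, count in hyp_counts.items())
    return clipped / max(sum(hyp_counts.values()), 1)

def bleu(hyp, refs, max_n=2):
    """Geometric mean of the 1..max_n gram precisions times a brevity penalty."""
    precisions = [modified_precision(hyp, refs, n) for n in range(1, max_n + 1)]
    if min(precisions) == 0:
        return 0.0
    # Brevity penalty against the reference length closest to the hypothesis.
    ref_len = min((len(r) for r in refs), key=lambda rl: abs(rl - len(hyp)))
    bp = 1.0 if len(hyp) > ref_len else math.exp(1 - ref_len / len(hyp))
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)

hyp = "colorless green ideas sleep furiously".split()
refs = [r.split() for r in (
    "all dull jade ideas sleep irately",
    "drab emerald concepts sleep furiously",
    "colorless immature thoughts nap angrily",
)]
print(bleu(hyp, refs))  # (0.8 * 0.5) ** 0.5 ≈ 0.632
```

The clipping step is what makes the precision "modified": a hypothesis that repeats a matching word many times does not get extra credit beyond the reference counts.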

Automatic Evaluation Example: METEOR (Lavie and Agarwal, 2007), Metric for Evaluation of Translation with Explicit word Ordering. Extended matching between translation and reference: Porter stems, WordNet synsets. Unigram precision, recall, and a parameterized F-measure. Reordering penalty. Parameters can be tuned to optimize correlation with human judgments. Not biased against "non-statistical" MT systems.
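
The scoring formula can be sketched as follows. This is a simplification under stated assumptions: it uses exact matches only with a greedy in-order alignment, whereas the real metric also matches Porter stems and WordNet synonyms and searches for the alignment with the fewest chunks; the parameter values below are assumed defaults, not taken from this slide.

```python
def meteor_sketch(hyp, ref, alpha=0.9, beta=3.0, gamma=0.5):
    """METEOR-style score: parameterized F-mean of unigram precision and
    recall, discounted by a fragmentation (reordering) penalty."""
    # Greedy exact-match alignment: each hyp token to the first free ref slot.
    used, alignment = set(), []
    for i, word in enumerate(hyp):
        for j, ref_word in enumerate(ref):
            if j not in used and word == ref_word:
                used.add(j)
                alignment.append((i, j))
                break
    m = len(alignment)
    if m == 0:
        return 0.0
    # A chunk is a maximal run of matches contiguous in both sentences.
    chunks = 1 + sum(1 for (i1, j1), (i2, j2) in zip(alignment, alignment[1:])
                     if i2 != i1 + 1 or j2 != j1 + 1)
    precision, recall = m / len(hyp), m / len(ref)
    f_mean = precision * recall / (alpha * precision + (1 - alpha) * recall)
    penalty = gamma * (chunks / m) ** beta
    return f_mean * (1 - penalty)

hyp = "colorless green ideas sleep furiously".split()
ref = "drab emerald concepts sleep furiously".split()
print(meteor_sketch(hyp, ref))  # two matches in one chunk -> 0.375
```

Fewer, longer chunks mean less reordering and a smaller penalty, which is how METEOR rewards fluent word order beyond bag-of-words overlap.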

Automatic Evaluation Example: SEPIA (Habash and ElKholy, 2008). A syntactically aware evaluation metric (cf. Liu and Gildea, 2005; Owczarzak et al., 2007; Giménez and Màrquez, 2007). Uses a dependency representation from the MICA parser (Nasr and Rambow, 2006). 77% of all structural bigrams are surface n-grams of size 2, 3, or 4. Includes dependency surface span as a factor in the score: long-distance dependencies should receive a greater weight than short-distance dependencies. Higher degree of grammaticality?
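
As a rough illustration of the span-weighting idea (this is not the actual SEPIA formula), one can weight each matched dependency arc by its surface span, so that getting a long-distance dependency right contributes more than getting a short one right:

```python
from collections import Counter

def span_weighted_dep_precision(hyp_arcs, ref_arcs):
    """Illustration only: precision over (head word, dependent word) arcs,
    weighting each arc by its surface span so that long-distance
    dependencies count more."""
    ref_pairs = Counter((h, d) for h, d, _ in ref_arcs)
    matched = total = 0.0
    for head, dep, span in hyp_arcs:
        total += span
        if ref_pairs[(head, dep)] > 0:
            ref_pairs[(head, dep)] -= 1
            matched += span
    return matched / total if total else 0.0

# Hypothetical parses: (head word, dependent word, surface span in tokens).
hyp_arcs = [("sleep", "ideas", 1), ("ideas", "green", 1), ("sleep", "furiously", 1)]
ref_arcs = [("sleep", "ideas", 1), ("ideas", "jade", 1), ("sleep", "irately", 1)]
print(span_weighted_dep_precision(hyp_arcs, ref_arcs))  # 1/3 of the weight matches
```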

Metrics MATR Workshop: a workshop at the AMTA 2008 conference (Association for Machine Translation in the Americas) on evaluating evaluation metrics. Compared 39 metrics: 7 baselines and 32 new metrics. Used various measures of correlation with human judgment, under different conditions: text genre, source language, number of references, etc.
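
Meta-evaluation of this kind boils down to correlating metric scores with human judgments across systems or segments. A minimal sketch with hypothetical (made-up) per-system scores:

```python
from scipy.stats import pearsonr, spearmanr

# Hypothetical per-system scores: one automatic metric vs. mean human judgment.
metric_scores = [0.62, 0.55, 0.48, 0.41, 0.33]
human_scores = [4.1, 3.9, 3.2, 3.4, 2.5]

print(pearsonr(metric_scores, human_scores))   # linear correlation
print(spearmanr(metric_scores, human_scores))  # rank (Spearman) correlation
```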

Fluency vs. Accuracy. [Figure: MT use cases plotted along accuracy and fluency axes: conMT, FAHQ MT (fully automatic high-quality MT), Prof. MT, Info. MT.]