Feasibility of Human-in-the-loop Minimum Error Rate Training. Omar F. Zaidan, Chris Callison-Burch. The Center for Language and Speech Processing, Johns Hopkins University.

Presentation transcript:

Feasibility of Human-in-the-loop Minimum Error Rate Training. Omar F. Zaidan, Chris Callison-Burch. The Center for Language and Speech Processing, Johns Hopkins University. EMNLP 2009, Singapore. Thursday, August 6th, 2009.

CCB '09: "... quixotic things like human-in-the-loop minimum error rate training ..."
(quixotic: foolishly impractical, especially in the pursuit of ideals; especially: marked by rash, lofty, romantic ideas or extravagantly chivalrous action)

Log-linear MT in One Slide
MT systems rely on several models. A candidate translation $e$ of source sentence $f$ is represented as a feature vector: $\mathbf{h}(e,f) = \langle h_1(e,f), \ldots, h_M(e,f) \rangle$. Corresponding weight vector: $\boldsymbol{\lambda} = \langle \lambda_1, \ldots, \lambda_M \rangle$. Each candidate is assigned a score: $\mathrm{score}(e) = \boldsymbol{\lambda} \cdot \mathbf{h}(e,f) = \sum_{m=1}^{M} \lambda_m h_m(e,f)$. The system selects the highest-scoring translation: $\hat{e} = \arg\max_e \mathrm{score}(e)$.
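As a concrete illustration of the slide above, here is a minimal Python sketch of log-linear scoring and argmax selection; the feature values and names are made up for the example.

```python
import numpy as np

def best_candidate(candidates, weights):
    """Score each candidate with the dot product lambda . h(e, f)
    and return the highest-scoring one."""
    scores = {e: float(np.dot(weights, h)) for e, h in candidates.items()}
    return max(scores, key=scores.get), scores

# Toy example with two made-up features (say, LM log-prob and TM log-prob):
candidates = {
    "the patient was isolated .": np.array([-12.3, -4.1]),
    "the patient isolated .":     np.array([-15.0, -3.2]),
}
weights = np.array([1.0, 0.5])
best, scores = best_candidate(candidates, weights)
print(best, scores[best])  # -> "the patient was isolated ." -14.35
```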

Minimum Error Rate Training
Och (2003): the weight vector should be chosen by directly optimizing the evaluation metric of interest (the MERT phase). But the error surface is ugly: it is piecewise constant in the weights.
– Och suggests an efficient line optimization method…
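The heart of Och's line optimization is that along a search line $\boldsymbol{\lambda} + \gamma \mathbf{d}$, each candidate's model score is linear in $\gamma$, so the 1-best candidate changes only at a few threshold points. Below is a minimal sketch of that envelope computation for one sentence's candidates, assuming no exact ties; real MERT pools thresholds across all sentences and scores each interval with the metric's sufficient statistics.

```python
def one_best_intervals(offsets, slopes):
    """Och's line search in miniature, for one sentence's candidates.

    Along lambda(gamma) = lambda + gamma * d, candidate i has model
    score offsets[i] + gamma * slopes[i]  (offsets[i] = lambda.h_i,
    slopes[i] = d.h_i).  The 1-best candidate traces the upper
    envelope of these lines, so it changes only at the threshold
    gammas enumerated here.  Minimal sketch assuming no exact ties.
    """
    n = len(offsets)
    # Winner as gamma -> -inf: smallest slope (break ties by offset).
    cur = min(range(n), key=lambda i: (slopes[i], -offsets[i]))
    thresholds = [(float("-inf"), cur)]
    while True:
        nxt, cross = None, float("inf")
        for i in range(n):
            if slopes[i] > slopes[cur]:
                g = (offsets[cur] - offsets[i]) / (slopes[i] - slopes[cur])
                if g < cross:
                    nxt, cross = i, g
        if nxt is None:
            return thresholds  # list of (gamma where it takes over, candidate)
        thresholds.append((cross, nxt))
        cur = nxt

# Three candidates as (offset, slope) lines: 0 wins first, then 1, then 2.
print(one_best_intervals([1.0, 0.0, -1.0], [-1.0, 0.5, 1.0]))
```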

Visualizing Och's Method
[Sequence of figure slides plotting the error metric along the line search: "We want to plot this." TER-like sufficient statistics are accumulated per candidate. Fast! … Fast?]

BLEU & MERT
The metric most often optimized is BLEU: $\mathrm{BLEU} = \mathrm{BP} \cdot \exp\big(\sum_{n=1}^{4} \tfrac{1}{4} \log p_n\big)$, where $p_n$ is the modified $n$-gram precision and $\mathrm{BP}$ is the brevity penalty.
Why BLEU?
– It is usually the reported metric,
– it has been shown to correlate well with human judgment, and
– it can be computed efficiently.
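For reference, a self-contained sketch of sentence-level BLEU (modified n-gram precisions combined with a brevity penalty). This is illustrative only: real MERT uses corpus-level BLEU, computed from per-candidate sufficient statistics, with proper smoothing.

```python
import math
from collections import Counter

def bleu(candidate, references, max_n=4):
    """Sentence-level BLEU sketch: geometric mean of modified n-gram
    precisions (n = 1..4) times a brevity penalty."""
    def ngrams(tokens, n):
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    log_p = 0.0
    for n in range(1, max_n + 1):
        cand = ngrams(candidate, n)
        max_ref = Counter()
        for ref in references:
            max_ref |= ngrams(ref, n)        # elementwise max over references
        clipped = sum(min(c, max_ref[g]) for g, c in cand.items())
        total = max(sum(cand.values()), 1)
        log_p += math.log(max(clipped, 1e-9) / total) / max_n  # crude smoothing
    r = len(min(references, key=lambda ref: abs(len(ref) - len(candidate))))
    bp = min(1.0, math.exp(1 - r / max(len(candidate), 1)))
    return bp * math.exp(log_p)

cand = "the patient was in isolation .".split()
refs = ["the patient was isolated .".split()]
print(round(bleu(cand, refs), 3))
```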

The Problem with BLEU MERT
General critique of BLEU:
– Chiang et al. (2008): weaknesses in BLEU.
– Callison-Burch et al. (2006): not always appropriate to use BLEU to compare systems.
Metric disparity:
– Actual evaluations have a human component (e.g. GALE uses H-TER).
What is the alternative? H-TER MERT?

H-TER MERT?
In theory, MERT is applicable to any metric. In practice, scoring thousands of candidate translations with H-TER is expensive.
H-TER cost estimate:
– Assume a sentence takes 10 seconds to post-edit, at a cost of $0.10.
– 100 candidates for each of 1,000 source sentences → 35 work days and $10,000, per iteration(!).
vs. BLEU: minutes per iteration (and free).
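The back-of-the-envelope arithmetic behind those numbers, assuming 8-hour work days:

$$1000 \times 100 = 10^5 \text{ sentences}, \qquad 10^5 \times 10\,\text{s} = 10^6\,\text{s} \approx 278\,\text{h} \approx 35 \text{ work days},$$
$$10^5 \times \$0.10 = \$10{,}000.$$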

A Human-Based Automatic Metric
We suggest a metric that is:
– viable to be used in MERT, yet
– based on human judgment.
Viability: it relies on a prebuilt database; no human involvement during MERT.
Human-based: the database is a repository of human judgments.

Our Metric: RYPT
Main idea: reward syntactic constituents in the source that are aligned to "acceptable" substrings in the candidate translation.
When scoring a candidate:
– Obtain a parse tree for the source sentence.
– Align source words to candidate words.
– Count the number of subtrees translated in an "acceptable" manner.
– RYPT = Ratio of Yes nodes in the Parse Tree.
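A minimal sketch of the RYPT computation over a labeled source parse tree. The `Node` class is hypothetical, and how unlabeled nodes are treated is a design choice; here they simply count against the ratio.

```python
class Node:
    def __init__(self, label=None, children=None):
        self.label = label            # "Y", "N", or None (not yet labeled)
        self.children = children or []

def all_nodes(root):
    stack, out = [root], []
    while stack:
        n = stack.pop()
        out.append(n)
        stack.extend(n.children)
    return out

def rypt(root):
    """Ratio of YES nodes among all nodes in the parse tree."""
    nodes = all_nodes(root)
    yes = sum(1 for n in nodes if n.label == "Y")
    return yes / len(nodes)

# Tiny example: a root judged Y whose two children were judged Y and N.
tree = Node("Y", [Node("Y"), Node("N")])
print(rypt(tree))  # 2/3 ≈ 0.67
```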

RYPT (Ratio of Y in Parse Tree)
[Figure slides, built up over several steps: the source parse tree shown alongside the translation to be scored, with Y labels accumulating on nodes. For example, a label Y indicates that "forecasts" is deemed an acceptable translation of "prognosen".]

Is RYPT Good?
Is RYPT acceptable?
– Must show RYPT is a reasonable substitute for human judgment. [Empirically]
Is RYPT feasible?
– Must show collecting the necessary judgments is efficient and affordable. [Next…]

Feasibility: Reusing Judgments
For each source sentence, we build a database, where each entry is a tuple: roughly, a source substring, a candidate substring, and a judgment label.
A judgment is reused across candidates:
der patient wurde isoliert.
– the patient was isolated.
– the patient isolated.
– the patient was in isolation.
– the patient has been isolated.
(A label collected once for a substring pair, e.g. "der patient" aligned to "the patient", applies to every candidate containing that substring.)
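A sketch of how such a per-sentence judgment database might be reused. The exact tuple layout is an assumption; the point is that a (source substring, candidate substring) pair is judged once and then looked up for every candidate containing it.

```python
# Hypothetical database for one source sentence: keyed by the pair of
# substrings, valued by a human Y/N label.
judgments = {
    ("der patient", "the patient"): "Y",
    ("wurde isoliert .", "was isolated ."): "Y",
    ("wurde isoliert .", "isolated ."): "N",
}

def lookup(src_span, cand_span):
    """Return the stored label, or None if a new human query is needed."""
    return judgments.get((src_span, cand_span))

print(lookup("der patient", "the patient"))               # "Y": reused, no new query
print(lookup("wurde isoliert .", "has been isolated ."))  # None: ask a Turker
```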

Feasibility: Label Percolation
Minimize label collection even further by percolating labels through the parse tree:
– If a node is labeled NO, its ancestors are likely labeled NO → percolate NO up the tree.
– If a node is labeled YES, its descendants are likely labeled YES → percolate YES down the tree.
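A sketch of the percolation step on the `Node` class from the earlier sketch; the direction of each rule follows the slide, and the coverage/accuracy trade-offs of this heuristic are what Section 5.1 of the paper measures.

```python
def percolate(node, parent_label=None):
    """Push YES down to unlabeled descendants; pull NO up from children."""
    if node.label is None and parent_label == "Y":
        node.label = "Y"                           # YES percolates down
    for child in node.children:
        percolate(child, node.label)
    if node.label is None and any(c.label == "N" for c in node.children):
        node.label = "N"                           # NO percolates up

# Example: a YES at the root fills in its unlabeled children.
tree = Node("Y", [Node(), Node()])
percolate(tree)
print([c.label for c in tree.children])  # ['Y', 'Y']
```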

Maximizing Label Percolation
Queries are performed in batch mode. For maximum percolation, queries should avoid overlapping substrings.
One extreme: select the root node (a single YES would label the whole tree, but that never happens…).
Other extreme: select all preterminals (too much focus on individual words, and no percolation).

Query Selection
Middle ground: a frontier node set with some source maxLen.
[Figure slides: successive examples of frontier node sets selected from the parse tree.]
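A sketch of frontier selection. It assumes each node also knows the length of its source span (`span_len`, a field added here for illustration): take the highest nodes whose span fits under maxLen.

```python
def frontier(node, max_len):
    """Return the highest nodes whose source span is <= max_len words;
    these substrings are sent out as one batch of queries."""
    if node.span_len <= max_len:
        return [node]
    nodes = []
    for child in node.children:
        nodes.extend(frontier(child, max_len))
    return nodes
```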

OK, so how do you obtain these labels?

Amazon Mechanical Turk
We use Amazon Mechanical Turk to collect judgment labels.
AMT: a virtual marketplace that allows "requesters" to create and post tasks to be completed by "workers" around the world.
The requester provides an HTML template and a CSV database; AMT creates individual tasks for the workers.
Task = Human Intelligence Task = HIT
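A sketch of preparing the CSV input for such a template. The column names here are invented for illustration; on AMT they must match the placeholders in the requester's HTML template.

```python
import csv

rows = [
    {"source": "der patient wurde isoliert .",
     "reference": "the patient was isolated .",
     "candidates": "the patient was isolated . | the patient isolated ."},
]

with open("hits.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["source", "reference", "candidates"])
    writer.writeheader()
    writer.writerows(rows)  # AMT turns each row into one HIT
```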

HIT Example
[Screenshot slide: Source: prozent. Reference: %. Candidate translations: per cent, …]

[Screenshot slide: Source: des zentralen statistischen amtes. Reference: from the central statistical office. Candidate translations: statistics office data; from the central statistics office; in the central statistical office; in the central statistics office; of central statistical office; of central statistics office; of the central statistical office; of the central statistics office.]

Data Summary
3,873 HITs created, each with 3.4 judgments on average → ~13k labels.
115 distinct workers put in 30.8 hours. One label per 8.4 seconds (426 labels/hr).
Cost:
– Amazon fees: $21.43
– Wages: $53.47
– Bonuses: $6.54
– Total: $81.44
Hourly "wage": ≈ $1.95. Labels per $: ≈ 162.
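The derived rates, worked out from the figures above (assuming ~13,168 labels, i.e. 3,873 HITs × 3.4 judgments):

$$\frac{13{,}168 \text{ labels}}{30.8\,\text{h}} \approx 427 \text{ labels/h} \approx \text{one label per } 8.4\,\text{s},$$
$$\frac{\$53.47 + \$6.54}{30.8\,\text{h}} \approx \$1.95/\text{h}, \qquad \frac{13{,}168 \text{ labels}}{\$81.44} \approx 162 \text{ labels per dollar}.$$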

Is RYPT Good?
Is RYPT acceptable?
– Must show RYPT is a reasonable substitute for human judgment. [Next…]
Is RYPT feasible?
– Must show collecting the necessary judgments is efficient and affordable. [Yes!]

Is RYPT Acceptable?
Is RYPT a reasonable alternative to human judgment?
Our experiment: compare the predictive power of RYPT vs. BLEU.
Compare the top-1 candidate by BLEU score vs. the top-1 candidate by RYPT score.
– Which candidate looks better to a human?

RYPT vs. BLEU
[Figure: candidates cand 1 … cand 7, ranked once by RYPT and once by BLEU; RYPT's choice (e.g. cand 5) vs. BLEU's choice (e.g. cand 3).]
Which one would be preferred by a human? Ask a Turker! Actually, ask 3 Turkers…
3 judgments × 250 sentence pairs = 750 judgments.

RYPT vs. BLEU
RYPT's choice is preferred 46.1% of the time, vs. 36.0% for BLEU's choice. Majority vote breakdown:
– Majority vote picks RYPT's choice: 48.0%
– Majority vote picks BLEU's choice: 35.2%
– No majority: 16.8%
– Majority vote strongly prefers RYPT's choice: 24.0%
– Majority vote strongly prefers BLEU's choice: 13.2%
(Strong preference for X = no votes for Y.)

BLEU's Inherent Advantage
When comparing candidate translations, the worker was shown the references. BLEU's choice, by definition, will have high overlap with the reference.
– An annotator might judge BLEU's choice to be 'better' because it 'looks' like the reference.
With references shown: RYPT 46.1%, BLEU 36.0%, neither 17.9%.
When no references are shown (and workers are restricted to Germany): RYPT 45.2%, BLEU 29.2%, neither 25.6%.

See the Paper for…
– The source-candidate alignment method, which takes advantage of the derivation trees produced by Joshua (see 3.1).
– Percolation coverage and accuracy, and the effect of maxLen (see 5.1).
– Related work (see 6):
– Nießen et al. (2000): a database of judgments.
– WMT workshops: manual evaluation; metric correlation with human judgment.
– Snow et al. (2008): AMT is "fast and cheap."

Future Work
This was a pilot study…
Complete MERT run (already in progress):
– beyond a single iteration,
– using AMT's API.
Probabilistic approach to labeling nodes:
– treat a node label as a random variable;
– existing labels are observed, the others inferred.
Stay tuned for our next paper!

Final Notes
Later today: C. Callison-Burch (2:15 PM), "Fast, Cheap, and Creative: Evaluating Translation Quality Using Amazon's Mechanical Turk."
Funding: EuroMatrix, DARPA's GALE, and the US NSF.
Thanks to: Markus Dreyer, Jason Eisner, and Zhifei Li; and Turkers A14LPCJ1O1773B, A15P4FL5P235I0, AROUZI8PYSKUT.