Zaidan & Callison-Burch – Feasibility of “Human” MERT. Omar F. Zaidan, Chris Callison-Burch. The Center for Language and Speech Processing, Johns Hopkins University.

Presentation transcript:

1 Feasibility of Human-in-the-loop Minimum Error Rate Training. Omar F. Zaidan, Chris Callison-Burch ({ozaidan|ccb}@cs.jhu.edu). The Center for Language and Speech Processing, Johns Hopkins University. EMNLP 2009 – Singapore, Thursday, August 6th, 2009.

2 CCB ’09:

3 CCB ’09: “quixotic things like human-in-the-loop minimum error rate training.” Quixotic: foolishly impractical, especially in the pursuit of ideals; especially: marked by rash lofty romantic ideas or extravagantly chivalrous action.

4 Log-linear MT in One Slide. MT systems rely on several models. A candidate is represented as a feature vector, with a corresponding weight vector. Each candidate is assigned a score, and the system selects the highest-scoring translation. [Formulas not preserved in transcript.]
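The slide's formulas are missing from the transcript. As a minimal sketch, assuming the usual log-linear setup in which a candidate's score is the dot product of its feature vector with the weight vector, selection could look like this. The function names and the example feature values are hypothetical.

```python
def score(weights, features):
    """Log-linear score: dot product of weight vector and feature vector."""
    return sum(w * f for w, f in zip(weights, features))


def best_candidate(weights, candidates):
    """Return the (text, features) pair with the highest score."""
    return max(candidates, key=lambda cand: score(weights, cand[1]))


# Hypothetical two-feature example (e.g. two model log-probabilities).
weights = [1.0, 0.5]
candidates = [
    ("candidate a", [-2.0, -1.0]),  # score -2.5
    ("candidate b", [-1.0, -2.0]),  # score -2.0
]
chosen = best_candidate(weights, candidates)
```

Changing the weight vector changes which candidate wins, which is why the choice of weights matters in the MERT slides that follow.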

5 Minimum Error Rate Training. Och (2003): the weight vector should be chosen by optimizing the evaluation metric of interest (aka the MERT phase). But the error surface is ugly. –Och suggests an efficient line optimization method…

6 Visualizing Och’s Method. We want to plot this. [Figure not preserved in transcript.]

7–10 Visualizing Och’s Method. [Animation builds; figures not preserved in transcript.]

11–14 Visualizing Och’s Method. TER-like sufficient statistics. [Animation builds; figures not preserved in transcript.]

15 Visualizing Och’s Method. TER-like sufficient statistics. Fast!

16 Visualizing Och’s Method. TER-like sufficient statistics. Fast! Fast?
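The plots for these slides are lost, so here is a rough sketch of the idea behind the line optimization. Along one search direction each candidate's score is a line in the step size gamma, so a sentence's top candidate only changes where two lines cross. This sketch is not Och's algorithm itself: it finds crossings by brute-force pairwise comparison rather than an efficient envelope sweep, and it assumes a per-candidate error that simply adds up across sentences. The name `line_search` and the example numbers are hypothetical.

```python
from itertools import combinations


def line_search(sentences):
    """Pick the step size gamma with the lowest total error.

    sentences: one candidate list per source sentence. Each candidate is
    (intercept, slope, error); its score along the search line is
    intercept + slope * gamma.
    """
    # Every gamma where two candidates of one sentence tie is a point at
    # which that sentence's top candidate may change.
    crossings = set()
    for cands in sentences:
        for (a1, b1, _), (a2, b2, _) in combinations(cands, 2):
            if b1 != b2:
                crossings.add((a2 - a1) / (b1 - b2))
    pts = sorted(crossings)

    # Top candidates are constant between crossings: probe each interval once.
    if pts:
        probes = [pts[0] - 1.0]
        probes += [(lo + hi) / 2 for lo, hi in zip(pts, pts[1:])]
        probes.append(pts[-1] + 1.0)
    else:
        probes = [0.0]

    def total_error(gamma):
        return sum(
            max(cands, key=lambda c: c[0] + c[1] * gamma)[2]
            for cands in sentences
        )

    best = min(probes, key=total_error)
    return best, total_error(best)


# One sentence, three candidates; the middle candidate has the lowest error.
example = [[(0.0, 1.0, 0.5), (1.0, 0.0, 0.2), (0.0, -1.0, 0.9)]]
best_gamma, best_error = line_search(example)
```

The search is only fast if the metric can be recomputed cheaply in every interval, which is the point of the "Fast?" on slide 16.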

17 BLEU & MERT. The metric most often optimized is BLEU. Why BLEU? –It is usually the reported metric, –it has been shown to correlate well with human judgment, and –it can be computed efficiently.
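The BLEU formula shown on the slide is not preserved in the transcript. For orientation, this is the standard corpus-level definition (not necessarily the exact form on the slide), where $p_n$ is the modified $n$-gram precision, $w_n$ the $n$-gram weights (commonly $1/N$ with $N = 4$), $c$ the candidate length, and $r$ the effective reference length:

```latex
\mathrm{BLEU} = \mathrm{BP} \cdot \exp\!\left( \sum_{n=1}^{N} w_n \log p_n \right),
\qquad
\mathrm{BP} = \min\!\left( 1,\; e^{\,1 - r/c} \right)
```

Everything in it reduces to n-gram counts and lengths, which is why it can be computed efficiently.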

18 Problem with BLEU MERT. General critique of BLEU: –Chiang et al. (2008): weaknesses in BLEU. –Callison-Burch et al. (2006): not always appropriate to use BLEU to compare systems. Metric disparity: –Actual evaluations have a human component (e.g. GALE uses H-TER). What is the alternative? H-TER MERT?

19 H-TER MERT? In theory, MERT is applicable to any metric. In practice, scoring thousands of candidate translations with H-TER is expensive. H-TER cost estimate: –Assume a sentence takes 10 seconds to post-edit, at a cost of $0.10. –100 candidates for each of 1000 source sentences → 35 work days, $10,000 per iteration(!) vs. BLEU: minutes per iteration (and free).
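The cost estimate above can be checked with a few lines of arithmetic. The 8-hour work day is an assumption made here so that the total matches the slide's "35 work days"; the other numbers come from the slide.

```python
candidates_per_sentence = 100
source_sentences = 1000
seconds_per_edit = 10
dollars_per_edit = 0.10
hours_per_work_day = 8  # assumption, not stated on the slide

edits = candidates_per_sentence * source_sentences      # 100,000 post-edits
total_hours = edits * seconds_per_edit / 3600           # about 278 hours
work_days = total_hours / hours_per_work_day            # about 34.7 days
total_cost = edits * dollars_per_edit                   # about $10,000

print(f"{edits} edits, {work_days:.1f} work days, ${total_cost:,.0f}")
```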

20 A Human-Based Automatic Metric. We suggest a metric that is: –viable for use in MERT, yet –based on human judgment. Viability: relies on a prebuilt database; no human involvement during MERT. Human-based: the database is a repository of human judgments.

21 Our Metric: RYPT. Main idea: reward syntactic constituents in the source that are aligned to “acceptable” substrings in the candidate translation. When scoring a candidate: –Obtain the parse tree for the source sentence. –Align source words to candidate words. –Count the number of subtrees translated in an “acceptable” manner. –RYPT = Ratio of Yes nodes in the Parse Tree.
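The last step of the scoring recipe can be sketched as follows, assuming every node of the source parse tree has already been given a Y or N label for the candidate being scored. Parsing, alignment, and labeling are left out, and the node names and labels in the example are hypothetical.

```python
def rypt(node_labels):
    """Ratio of Y-labeled nodes among all labeled parse-tree nodes.

    node_labels: dict mapping a node identifier to "Y" or "N".
    """
    if not node_labels:
        raise ValueError("the parse tree must have at least one node")
    yes_nodes = sum(1 for label in node_labels.values() if label == "Y")
    return yes_nodes / len(node_labels)


# Hypothetical six-node tree: three nodes judged acceptable.
labels = {"S": "N", "NP": "Y", "Det": "Y", "N": "Y", "VP": "N", "V": "N"}
example_score = rypt(labels)
```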

22 RYPT (Ratio of Y in Parse Tree). [Figure: source sentence, source parse tree, and the translation to be scored.]

23–24 RYPT (Ratio of Y in Parse Tree). [Figure builds: nodes of the source parse tree labeled Y.] Label Y indicates that “forecasts” was deemed an acceptable translation of “prognosen.”

25–26 RYPT (Ratio of Y in Parse Tree). [Figure builds; node labels not preserved in transcript.]

27 Is RYPT Good? To be shown empirically: Is RYPT acceptable? –Must show RYPT is a reasonable substitute for human judgment. Is RYPT feasible? –Must show that collecting the necessary judgments is efficient and affordable. (Next …)

28 Feasibility: Reusing Judgments. For each source sentence, we build a database, where each entry is a tuple. [Tuple definition not preserved in transcript.] A judgment is reused across candidates. Source: der patient wurde isoliert. Candidates: the patient was isolated. / the patient isolated. / the patient was in isolation. / the patient has been isolated.

29 Feasibility: Reusing Judgments. For each source sentence, we build a database, where each entry is a tuple. [Tuple definition not preserved in transcript.] A judgment is reused across candidates. Source: der patient wurde isoliert. Candidates: of the patient was isolated. / of the patient isolated. / of the patient was in isolation. / of the patient has been isolated.
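A minimal sketch of how reuse could work, assuming each entry is keyed by a (source substring, candidate substring) pair. The slide's actual tuple definition is not preserved. The class name, the `ask_human` callback, the alignment of "der patient" to the candidate's opening words, and the Y/N labels are all hypothetical.

```python
class JudgmentDB:
    """Caches one human label per (source substring, candidate substring)."""

    def __init__(self):
        self._labels = {}
        self.human_queries = 0

    def label(self, source_sub, candidate_sub, ask_human):
        key = (source_sub, candidate_sub)
        if key not in self._labels:
            # Only unseen pairs cost a human judgment.
            self._labels[key] = ask_human(source_sub, candidate_sub)
            self.human_queries += 1
        return self._labels[key]


def fake_annotator(source_sub, candidate_sub):
    # Stand-in for a real annotator; these labels are invented.
    return "Y" if candidate_sub == "the patient" else "N"


db = JudgmentDB()
# Eight candidates (slides 28 and 29), but only two distinct substrings
# aligned to "der patient".
aligned = ["the patient"] * 4 + ["of the patient"] * 4
labels = [db.label("der patient", sub, fake_annotator) for sub in aligned]
```

In this example eight lookups cost two human judgments.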

30 Feasibility: Label Percolation. Minimize label collection even further by percolating labels through the parse tree: –If a node is labeled NO, its ancestors are likely labeled NO → percolate NO up the tree. –If a node is labeled YES, its descendants are likely labeled YES → percolate YES down the tree. [Figure: parse tree with percolated N and Y labels.]
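The two percolation rules can be sketched on a toy tree. The `Node` class and the example tree are hypothetical. Never overwriting a label that is already set is an assumption made here; the slide does not say how conflicts are handled.

```python
class Node:
    def __init__(self, name, children=()):
        self.name = name
        self.children = list(children)
        self.parent = None
        self.label = None
        for child in self.children:
            child.parent = self


def percolate(node, label):
    """Set a human label on node, then spread it: N upward, Y downward."""
    node.label = label
    if label == "N":
        ancestor = node.parent
        while ancestor is not None:
            if ancestor.label is None:
                ancestor.label = "N"
            ancestor = ancestor.parent
    elif label == "Y":
        stack = list(node.children)
        while stack:
            current = stack.pop()
            if current.label is None:
                current.label = "Y"
            stack.extend(current.children)


# Toy tree: S -> NP (Det, N), VP (V)
det, noun, verb = Node("Det"), Node("N"), Node("V")
np_node, vp_node = Node("NP", [det, noun]), Node("VP", [verb])
root = Node("S", [np_node, vp_node])

percolate(np_node, "Y")  # Det and N inherit Y
percolate(verb, "N")     # VP and S inherit N
```

Two human labels cover all six nodes here.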

31 Maximizing Label Percolation. Queries are performed in batch mode. For maximum percolation, queries should avoid overlapping substrings. One extreme: select the root node. [Figure: a Y at the root percolates to every node.] Never happens… Other extreme: select all preterminals. [Figure: every preterminal labeled Y.] Too much focus on individual words; no percolation.

32–36 Query Selection. Middle ground: frontier node set with some source maxLen. [Animation builds; figures not preserved in transcript.]
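The figures defining the frontier node set are lost. The sketch below assumes it is the set of highest nodes whose source span is at most maxLen words, so that the selected queries never overlap. That reading is a guess. The tuple-based tree encoding and the example parse are hypothetical.

```python
def leaves(tree):
    """Source words under a node; a leaf is a plain string."""
    if isinstance(tree, str):
        return [tree]
    _label, children = tree
    return [word for child in children for word in leaves(child)]


def frontier(tree, max_len):
    """Highest non-overlapping nodes spanning at most max_len source words."""
    if max_len < 1:
        raise ValueError("max_len must be at least 1")
    if len(leaves(tree)) <= max_len:
        return [tree]
    _label, children = tree
    return [node for child in children for node in frontier(child, max_len)]


# Hypothetical parse of the source sentence from slide 28.
parse = ("S", [("NP", ["der", "patient"]), ("VP", ["wurde", "isoliert"]), "."])

spans_2 = [leaves(node) for node in frontier(parse, 2)]
spans_1 = [leaves(node) for node in frontier(parse, 1)]
spans_5 = [leaves(node) for node in frontier(parse, 5)]
```

With a large max_len this reduces to the root-only extreme, and with max_len of 1 to single words, so intermediate values give the middle ground.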

37 OK, so how do you obtain these labels?

38 Amazon Mechanical Turk. We use Amazon Mechanical Turk (AMT) to collect judgment labels. AMT: a virtual marketplace that allows “requesters” to create and post tasks to be completed by “workers” around the world. The requester provides an HTML template and a CSV database; AMT creates individual tasks for workers. Task = Human Intelligence Task = HIT.

39 HIT Example. [Screenshot with fields Source, Reference, and Candidate translations; text shown: prozent / % / per cent.]

40 HIT Example. [Screenshot with fields Source, Reference, and Candidate translations.] Source: des zentralen statistischen amtes. Other text shown: statistics office data / from the central statistical office / from the central statistics office / in the central statistical office / in the central statistics office / of central statistical office / of central statistics office / of the central statistical office / of the central statistics office.

41 Data Summary. 3,873 HITs created, each with 3.4 judgments on average → 13k labels. 115 distinct workers put in 30.8 hours. One label per 8.4 seconds (426 labels/hr). Cost: $53.47 wages + $6.54 bonuses + $21.43 Amazon fees = $81.44 total. Hourly ‘wage’: $1.95. 161 labels per $.
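The derived figures on this slide can be reproduced from the raw ones. This is a quick consistency check, not the authors' own calculation. It assumes the hourly 'wage' counts wages plus bonuses but not Amazon fees, and it estimates the label count from the reported rate and hours.

```python
hits = 3873
judgments_per_hit = 3.4
worker_hours = 30.8
labels_per_hour = 426

wages, bonuses, amazon_fees = 53.47, 6.54, 21.43

total_cost = wages + bonuses + amazon_fees            # about $81.44
hourly_wage = (wages + bonuses) / worker_hours        # about $1.95
approx_labels = hits * judgments_per_hit              # about 13k
labels_from_rate = labels_per_hour * worker_hours     # about 13.1k
labels_per_dollar = labels_from_rate / total_cost     # about 161

print(f"${total_cost:.2f} total, ${hourly_wage:.2f}/hr, "
      f"{labels_per_dollar:.0f} labels per dollar")
```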

42 Is RYPT Good? Is RYPT acceptable? –Must show RYPT is a reasonable substitute for human judgment. (Next …) Is RYPT feasible? –Must show that collecting the necessary judgments is efficient and affordable. (Yes!)

43 Is RYPT Acceptable? Is RYPT a reasonable alternative to human judgment? Our experiment: compare the predictive power of RYPT vs. BLEU. Compare the top-1 candidate by BLEU score vs. the top-1 candidate by RYPT score. –Which candidate looks better to a human?

44 RYPT vs. BLEU. [Figure: candidates cand 1 through cand 7, scored by RYPT on one side and by BLEU on the other.]

45 RYPT vs. BLEU. [Figure: RYPT’s choice (cand 5) vs. BLEU’s choice (cand 3).] Which one would be preferred by a human? Ask a Turker! Actually, ask 3 Turkers… → 3 judgments × 250 sentence pairs = 750 judgments.

46–47 RYPT vs. BLEU. RYPT’s choice is preferred 46.1% of the time, vs. 36.0% for BLEU’s choice. Majority vote breakdown: majority vote picks RYPT’s choice 48.0%; majority vote picks BLEU’s choice 35.2%; neither 16.8%. Majority vote strongly prefers RYPT’s choice 24.0%; majority vote strongly prefers BLEU’s choice 13.2%. Strong preference for X = no votes for Y.
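As an illustration of the breakdown, the sketch below classifies one sentence pair from its three judgments. It assumes a "majority" means more than half of the votes, and that a worker may also express no preference (written "NONE" here). Both are assumptions, and the vote strings are hypothetical.

```python
from collections import Counter


def majority(votes):
    """Return (winner, strong) for one sentence pair, or (None, False).

    strong is True when the losing side received no votes at all.
    """
    counts = Counter(votes)
    for mine, other in (("RYPT", "BLEU"), ("BLEU", "RYPT")):
        if counts[mine] > len(votes) / 2:
            return mine, counts[other] == 0
    return None, False


plain_win = majority(["RYPT", "RYPT", "BLEU"])
strong_win = majority(["RYPT", "RYPT", "NONE"])
no_majority = majority(["RYPT", "BLEU", "NONE"])
```

Under this reading, the strong-preference cases are a subset of the majority cases.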

48 BLEU’s Inherent Advantage. When comparing candidate translations, the worker was shown the references. BLEU’s choice, by definition, will have high overlap with the reference. –The annotator might judge BLEU’s choice to be ‘better’ because it ‘looks’ like the reference. When no references are shown (and restricted to workers in Germany): RYPT’s choice preferred 45.2%, BLEU’s choice 29.2%, neither 25.6%. With references shown: 46.1%, 36.0%, 17.9%.

49 See Paper for… Source-candidate alignment method, which takes advantage of derivation trees given by Joshua (see 3.1). Percolation coverage and accuracy, and effect of maxLen (see 5.1). Related work (see 6). –Nießen et al. (2000): DB of judgments. –WMT Workshops: manual evaluation; metric correlation with human judgment. –Snow et al. (2008): AMT is “fast and cheap.”

50 Future Work. This was a pilot study… Complete MERT run (already in progress): –Beyond a single iteration. –Using AMT’s API. Probabilistic approach to labeling nodes: –Treat a node label as a random variable. –Existing labels = observed, others inferred. Stay tuned for our next paper.

51 Final Notes. Later today: C. Callison-Burch (2:15 PM), “Fast, Cheap, and Creative: Evaluating Translation Quality Using Amazon’s Mechanical Turk.” Funding: –EuroMatrix, DARPA’s GALE, and US NSF. Thanks to: –Markus Dreyer, Jason Eisner, and Zhifei Li. –Turkers A14LPCJ1O1773B, A15P4FL5P235I0, AROUZI8PYSKUT.

