Evaluation of Text Generation: Automatic Evaluation vs. Variation Amanda Stent, Mohit Singhai, Matthew Marge

Natural Language Generation
- Content planning
- Text planning
- Sentence planning
- Surface realization

Approaches to Surface Realization
- Template based: domain-specific; all output tends to be high quality because it is highly constrained
- Grammar based: typically one high-quality output per input
- Forest based: many outputs per input
- Text-to-text: no need for other generation components

Surface Realization Tasks
To communicate the input meaning as completely, clearly, and elegantly as possible by careful:
- Word selection
- Word and phrase arrangement
- Consideration of context

Importance of Lexical Choice
- I drove to Rochester.
- I raced to Rochester.
- I went to Rochester.

Importance of Syntactic Structure
- I picked up my coat three weeks later from the dry cleaners in Smithtown.
- In Smithtown I picked up my coat from the dry cleaners three weeks later.

Pragmatic Considerations
- An Israeli civilian was killed and another wounded when Palestinian militants opened fire on a vehicle.
- Palestinian militant gunned down a vehicle on a road.

Evaluating Text Generators
- Per-generator: coverage
- Per-sentence: adequacy; fluency / syntactic accuracy; informativeness
- Additional metrics of interest: range (ability to produce valid variants); readability; task-specific metrics (e.g. for dialog)

Evaluating Text Generators
- Human judgments
- Parsing + interpretation
- Automatic evaluation metrics, for generation or machine translation: simple string accuracy+, NIST*, BLEU*+, F-measure*, LSA

Question
What is a "good" sentence?
- Readable
- Fluent
- Adequate

Approach
Question: Which evaluation metric, or set of evaluation metrics, least punishes variation?
- Word choice variation
- Syntactic structure variation
Procedure: correlation between human and automatic judgments of the variations; context not included (see the sketch below).
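A minimal sketch of this kind of correlation analysis, assuming hypothetical per-sentence human ratings and metric scores (the numbers below are placeholders, not the study's data):

```python
# Correlating automatic metric scores with human judgments.
# The arrays here are illustrative placeholders only.
from scipy.stats import pearsonr, spearmanr

human_adequacy = [5, 4, 2, 5, 3, 1, 4]                      # 5-point human ratings
metric_scores = [0.71, 0.63, 0.30, 0.80, 0.45, 0.12, 0.58]  # e.g. per-sentence BLEU

r, r_p = pearsonr(human_adequacy, metric_scores)
rho, rho_p = spearmanr(human_adequacy, metric_scores)
print(f"Pearson r = {r:.3f} (p = {r_p:.3f}); Spearman rho = {rho:.3f} (p = {rho_p:.3f})")
```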

String Accuracy
- Simple string accuracy: 1 − (I + D + S) / #Words, where I, D, S are insertions, deletions, and substitutions (Callaway 2003; Langkilde 2002; Rambow et al. 2002; Leusch et al. 2003)
- Generation string accuracy: 1 − (M + I' + D' + S) / #Words, where M counts block moves and I', D' are the insertions and deletions not accounted for by moves (Rambow et al. 2002)

Simple String Accuracy
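A minimal sketch of simple string accuracy, assuming the definition 1 − (I + D + S) / #reference words; the minimal I + D + S count is just the word-level Levenshtein distance. Function names and example sentences are illustrative.

```python
# Simple string accuracy via word-level edit distance.
def levenshtein(ref_tokens, cand_tokens):
    """Minimal number of insertions + deletions + substitutions."""
    m, n = len(ref_tokens), len(cand_tokens)
    dist = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dist[i][0] = i
    for j in range(n + 1):
        dist[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if ref_tokens[i - 1] == cand_tokens[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,         # deletion
                             dist[i][j - 1] + 1,         # insertion
                             dist[i - 1][j - 1] + cost)  # substitution / match
    return dist[m][n]

def simple_string_accuracy(reference, candidate):
    ref, cand = reference.split(), candidate.split()
    return 1.0 - levenshtein(ref, cand) / len(ref)

print(simple_string_accuracy("I drove to Rochester", "I went to Rochester"))  # 0.75
```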

BLEU
Developed by Papineni et al. at IBM. Key ideas:
- Count matching n-grams between the reference and candidate sentences
- Avoid counting matches multiple times by clipping
- Punish differences in sentence length (brevity penalty)

BLEU
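A simplified sentence-level sketch of the BLEU mechanics: clipped n-gram precisions combined by a geometric mean, with a brevity penalty. Real BLEU is a corpus-level score and supports multiple references; this single-reference version is only meant to show the idea.

```python
# Simplified sentence-level BLEU with clipped n-gram counts and a brevity penalty.
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu(reference, candidate, max_n=4):
    ref, cand = reference.split(), candidate.split()
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts, ref_counts = Counter(ngrams(cand, n)), Counter(ngrams(ref, n))
        # Clip each candidate n-gram count by its count in the reference,
        # so repeated words are not rewarded more than once.
        clipped = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        total = max(sum(cand_counts.values()), 1)
        log_precisions.append(math.log(clipped / total) if clipped else float("-inf"))
    # Brevity penalty punishes candidates shorter than the reference.
    bp = 1.0 if len(cand) >= len(ref) else math.exp(1 - len(ref) / len(cand))
    return bp * math.exp(sum(log_precisions) / max_n)  # geometric mean, uniform weights

print(bleu("the cat sat on the mat", "the cat sat on the mat"))  # 1.0
```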

NIST n-gram
Designed to fix two problems with BLEU:
- The geometric mean penalizes large N
- We might prefer to weight n-grams that are more informative, i.e. less likely

NIST n-gram
- Arithmetic average over all n-gram co-occurrences
- Weight "less likely" n-grams more
- Use a brevity factor to punish varying sentence lengths
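A much-simplified sketch of the NIST idea: matched n-grams are weighted by their information value (rarer n-grams count for more), the weighted matches are averaged arithmetically over n, and a brevity factor scales the score. The background corpus and the brevity factor below are toy stand-ins, not the official NIST formulas.

```python
# NIST-like scoring with information-weighted n-gram matches.
import math
from collections import Counter

corpus = ["the cat sat on the mat", "the dog sat on the rug"]  # toy background corpus
tokens = [t for s in corpus for t in s.split()]

def ngram_counts(n):
    if n == 0:
        return Counter({(): len(tokens)})
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def info(ngram):
    # Info(w1..wn) = log2(count(w1..wn-1) / count(w1..wn)); rarer n-grams score higher.
    denom = ngram_counts(len(ngram))[ngram]
    num = ngram_counts(len(ngram) - 1)[ngram[:-1]]
    return math.log2(num / denom) if denom else 0.0

def nist_like(reference, candidate, max_n=2):
    ref, cand = reference.split(), candidate.split()
    score = 0.0
    for n in range(1, max_n + 1):
        cand_ngrams = [tuple(cand[i:i + n]) for i in range(len(cand) - n + 1)]
        ref_ngrams = set(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
        matched = [g for g in cand_ngrams if g in ref_ngrams]
        score += sum(info(g) for g in matched) / max(len(cand_ngrams), 1)
    brevity = min(1.0, len(cand) / len(ref))  # crude stand-in for NIST's brevity factor
    return brevity * score

print(nist_like("the cat sat on the mat", "the cat sat on the rug"))
```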

F Measure
Idea due to Melamed (1995) and Turian et al. Same basic idea as the n-gram measures, but designed to eliminate the "double counting" done by n-gram measures.
- F = 2 * precision * recall / (precision + recall)
- Precision(candidate | reference) = maximum matching size(candidate, reference) / |candidate|
- Recall(candidate | reference) = maximum matching size(candidate, reference) / |reference|

F Measure
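A sketch of the F-measure idea in its simplest form, where the "matching size" is just a unigram maximum matching (clipped word counts). The full measure of Melamed and Turian et al. additionally rewards longer contiguous matching runs, which this sketch omits; as a result, word order is ignored here.

```python
# Unigram-only F-measure between a candidate and a reference sentence.
from collections import Counter

def f_measure(reference, candidate):
    ref, cand = Counter(reference.split()), Counter(candidate.split())
    match = sum(min(c, ref[w]) for w, c in cand.items())  # unigram matching size
    precision = match / sum(cand.values())
    recall = match / sum(ref.values())
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

print(f_measure("I picked up my coat three weeks later",
                "three weeks later I picked up my coat"))  # 1.0: order ignored here
```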

LSA
- Doesn't care about word order
- Evaluates how similar two bags of words are with respect to a corpus
- A good way of evaluating word choice?
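A sketch of an LSA-style adequacy check, assuming scikit-learn for the SVD: both sentences are mapped into a latent space learned from a background corpus and compared by cosine similarity. The background corpus below is a toy placeholder; words outside its vocabulary are simply dropped.

```python
# Bag-of-words similarity in an LSA latent space.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

background_corpus = [  # toy stand-in for a large domain corpus
    "an israeli civilian was killed in the attack",
    "another person was seriously wounded in the attack",
    "a suicide bomber blew himself up at a bus stop",
    "police and paramedics said five bystanders were wounded",
]

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(background_corpus)
lsa = TruncatedSVD(n_components=2).fit(counts)

def lsa_similarity(sentence_a, sentence_b):
    vectors = lsa.transform(vectorizer.transform([sentence_a, sentence_b]))
    return cosine_similarity(vectors[0:1], vectors[1:2])[0, 0]

# "individual" is out of this toy vocabulary, yet the score stays high:
print(lsa_similarity("another person was seriously wounded",
                     "another individual was seriously wounded"))
```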

Eval. metric | Means of measuring fluency | Means of measuring adequacy | Means of measuring readability | Punishes length differences?
--- | --- | --- | --- | ---
SSA | Comparison against reference sentence | Comparison against reference sentence | Comparison against reference sentence from same context* | Yes (punishes deletions, insertions)
NIST n-gram, BLEU | Comparison against reference sentences (matching n-grams) | Comparison against reference sentences | Comparison against reference sentences from same context* | Yes (weights)
F measure | Comparison against reference sentences (longest matching substrings) | Comparison against reference sentences | Comparison against reference sentences from same context* | Yes (length factor)
LSA | None | Comparison against word co-occurrence frequencies learned from corpus | None | Not explicitly

Experiment 1
Sentence data from Barzilay and Lee's paraphrase generation system (Barzilay and Lee 2002).
- Includes word choice variation, e.g. "Another person was also seriously wounded in the attack" vs. "Another individual was also seriously wounded in the attack"
- Includes word order variation, e.g. "A suicide bomber blew himself up at a bus stop east of Tel Aviv on Thursday, killing himself and wounding five bystanders, one of them seriously, police and paramedics said" vs. "A suicide bomber killed himself and wounded five, when he blew himself up at a bus stop east of Tel Aviv on Thursday"

Paraphrase Generation
1. Cluster like sentences: by hand or using word n-gram co-occurrence statistics; may first remove certain details
2. Compute multiple-sequence alignment: choice points and regularities in input sentence pairs/sets in a corpus
3. Match lattices: match between corpora
4. Generate: via lattice alignment

Paraphrase Generation Issues
- Sometimes words chosen for substitution carry unwanted connotations
- Sometimes extra words are included (or words removed) in a way that changes the meaning

Experiment 1
- To explore the impact of word choice variation, we used the baseline system data
- To explore the impact of word choice and word order variation, we used the MSA data

Human Evaluation
Barzilay and Lee used a binary rating for meaning preservation and difficulty; we used a five-point scale.
- How much of the meaning expressed in Sentence A is also expressed in Sentence B? All / Most / Half / Some / None
- How do you judge the fluency of Sentence B? It is Flawless / Good / Adequate / Poor / Incomprehensible

Human Evaluation: Sentences

Results Sentences: Automatic Evaluations

Results Sentences: Human Evaluations

Discussion
- These metrics achieve some level of correlation with human judgments of adequacy, but could we do better?
- Most metrics are negatively correlated with fluency
- Word order or constituent order variation requires a different metric

Results Sentences: Naïve Combining

Effect of sentence length

Baseline vs. MSA
Note: For the baseline system, there is no significant correlation between human judgments of adequacy and fluency and the automatic evaluation scores. For MSA, there is a positive correlation between human judgments of adequacy and the automatic evaluation scores, but not fluency.

Discussion
Automatic evaluation metrics other than LSA:
- Punish word choice variation
- Punish word order variation, though they are not as affected by word order variation as by word choice variation
- Punish legitimate and illegitimate word and constituent reorderings equally

Discussion
- Fluency: these metrics are not adequate for evaluating fluency in the presence of variation
- Adequacy: these metrics are barely adequate for evaluating adequacy in the presence of variation
- Readability: these metrics do not claim to evaluate readability

A Preliminary Proposal
Modify automatic evaluation metrics so that they:
- Do not punish legitimate word choice variation, e.g. using WordNet (but the 'simple' approach, sketched below, doesn't work)
- Do not punish legitimate word order variation (but this needs a notion of constituency)
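A sketch of the 'simple' WordNet relaxation mentioned above: count a candidate word as matching a reference word whenever the two share a WordNet synset. As the slide notes, this by itself does not work well (synonymy is sense- and context-dependent), but it shows the intended relaxation. Function names are illustrative; NLTK's WordNet data must be installed (nltk.download('wordnet')).

```python
# Unigram precision relaxed by WordNet synset sharing.
from nltk.corpus import wordnet

def wn_equivalent(word_a, word_b):
    if word_a == word_b:
        return True
    synsets_a = set(wordnet.synsets(word_a))
    return any(s in synsets_a for s in wordnet.synsets(word_b))

def relaxed_unigram_precision(reference, candidate):
    ref, cand = reference.split(), candidate.split()
    matched = sum(1 for c in cand if any(wn_equivalent(c, r) for r in ref))
    return matched / len(cand)

# "person" and "individual" share a synset, so the substitution is not punished:
print(relaxed_unigram_precision("another person was seriously wounded",
                                "another individual was seriously wounded"))  # 1.0
```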

Another Preliminary Proposal
When using metrics that depend on a reference sentence:
- Use a set of reference sentences, trying to capture as many of the word choice and word order variations as possible (see the sketch below)
- Use reference sentences from the same context as the candidate sentence, to approach an evaluation of readability
- Combine with some other metric for fluency, for example a grammar checker
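One way to realize the multiple-reference idea is NLTK's sentence-level BLEU, which already accepts a set of reference sentences; the sentences below are illustrative only.

```python
# Sentence-level BLEU against multiple references.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

references = [
    "another person was also seriously wounded in the attack".split(),
    "another individual was also seriously wounded in the attack".split(),
]
candidate = "another individual was seriously wounded in the attack".split()

score = sentence_bleu(references, candidate,
                      smoothing_function=SmoothingFunction().method1)
print(f"multi-reference BLEU: {score:.3f}")
```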

A Proposal
To evaluate a generator:
- Evaluate coverage using recall or a related metric
- Evaluate 'precision' using separate metrics for fluency, adequacy, and readability:
  - At this point in time, only fluency may be evaluable automatically, using a grammar checker
  - Adequacy can be approached using LSA or a related metric
  - Readability can only be evaluated using human judgments at this time

Current and Future Work
Other potential evaluation metrics:
- F measure plus WordNet
- Parsing as a measure of fluency
- F measure plus LSA
- Multiple-sequence alignment as an evaluation metric
- Metrics that evaluate readability