Colouring Summaries BLEU
Katerina Pastra and Horacio Saggion
Natural Language Processing Group, Department of Computer Science, University of Sheffield, U.K.
EACL 2003

Presentation transcript:


Machine Translation vs. Summarization
MT: accurate and fluent translation of the source document
Automatic summarization: an informative, reduced version of the source
We will focus on:
- automatically generated extracts
- single-document summarization with sentence-level compression
- automatic content-based evaluation
- reuse of evaluation metrics across NLP areas

The challenge
MT: demanding content evaluation
Extracts: is their evaluation trivial by definition?
Idiosyncrasies of the extract evaluation task:
- compression level and rate
- high human disagreement on extract adequacy
Could an MT evaluation metric be ported to automatic summarization (extract) evaluation?
If so, which testing parameters should be considered?

BLEU
Developed for MT evaluation (Papineni et al., 2001)
=> achieves high correlation with human judgement
=> is reliable even when run
   >> on different documents
   >> against different numbers of model references
i.e. reliability is not affected by the use of either multiple references or just a single one
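For readers unfamiliar with how BLEU arrives at a single score, the sketch below implements the core idea (clipped n-gram precision combined with a brevity penalty) in plain Python. It is an illustrative approximation, not the official implementation used in the paper, and the example sentences are invented.

```python
from collections import Counter
from math import exp, log

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu(candidate, references, max_n=4):
    """Modified (clipped) n-gram precision combined with a brevity penalty."""
    precisions = []
    for n in range(1, max_n + 1):
        cand_counts = Counter(ngrams(candidate, n))
        # Clip each candidate n-gram count by its maximum count in any reference
        max_ref_counts = Counter()
        for ref in references:
            for gram, count in Counter(ngrams(ref, n)).items():
                max_ref_counts[gram] = max(max_ref_counts[gram], count)
        clipped = sum(min(count, max_ref_counts[gram])
                      for gram, count in cand_counts.items())
        total = max(sum(cand_counts.values()), 1)
        precisions.append(max(clipped, 1e-9) / total)  # crude smoothing avoids log(0)
    # Brevity penalty against the reference length closest to the candidate's
    c_len = len(candidate)
    r_len = min((len(r) for r in references), key=lambda rl: abs(rl - c_len))
    bp = 1.0 if c_len > r_len else exp(1 - r_len / max(c_len, 1))
    return bp * exp(sum(log(p) for p in precisions) / max_n)

candidate = "the number of young drug abusers fell by 17 percent".split()
reference = "this represents a decrease of 17 percent in young drug abusers".split()
print(round(bleu(candidate, [reference]), 3))
```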

Using BLEU in NLP
NLG: Zajic and Dorr (2002)
Summarization: Lin and Hovy (2002)
>> 0.66 correlation for single-document summaries at a 100-word compression rate against a single reference summary
>> 0.82 correlation when multiple-judged document units (a sort of multiple reference) are used
Lin and Hovy conclude: the use of a single reference affects reliability

Evaluation experiments: set-up
Variables: compression rate, text cluster, gold standard
HKNews Corpus (English - Chinese)
- 18K documents in English
- 40 thematic clusters = 400 documents
- each sentence in a cluster assessed by 3 judges with utility values (0-10), encoded in XML

Evaluation software
Features: position, similarity with the document, similarity with the query, term distribution, NE scores, etc. (all normalised)
Features are linearly combined to obtain sentence scores and sentence extracts (sketched below)
GATE and Summarization classes
Semantic tagging and statistical analysis software
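The linear feature combination can be pictured as follows; the feature names echo the slide, but the weights and values are invented for illustration and are not the system's actual settings.

```python
def score_sentences(sentence_features, weights):
    """Each sentence is a dict of normalised feature values in [0, 1]."""
    return [sum(weights.get(name, 0.0) * value for name, value in feats.items())
            for feats in sentence_features]

def build_extract(scores, compression=0.3):
    """Select the top-scoring fraction of sentences, returned in document order."""
    k = max(1, int(len(scores) * compression))
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    return sorted(top)

# Illustrative weights and feature values only
weights = {"position": 0.3, "sim_document": 0.3, "sim_query": 0.2,
           "term_distribution": 0.1, "ne_score": 0.1}
sentences = [
    {"position": 1.0, "sim_document": 0.6, "sim_query": 0.2, "term_distribution": 0.5, "ne_score": 0.0},
    {"position": 0.5, "sim_document": 0.9, "sim_query": 0.7, "term_distribution": 0.4, "ne_score": 0.3},
    {"position": 0.0, "sim_document": 0.3, "sim_query": 0.1, "term_distribution": 0.2, "ne_score": 0.1},
]
print(build_extract(score_sentences(sentences, weights), compression=0.3))
```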

Gold standards and summarisers
QB = query-sentence similarity summary
Simple 1 = document-sentence similarity summary
Simple 2 = lead-based summary
Simple 3 = end-of-document summary
Reference n = utility-based extract built from the utilities given by judge n (n = 1, 2, 3)
Reference all = utility-based extract built from the sum of utilities given by the n judges
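A minimal sketch of how the Reference-n and Reference-all extracts could be built from the judges' 0-10 utility scores; the utility values below are made up for illustration and the selection rule is an assumption, not the paper's exact procedure.

```python
def utility_extract(utilities, compression=0.3):
    """Pick the highest-utility sentences up to the compression rate,
    returning their indices in document order."""
    k = max(1, int(len(utilities) * compression))
    ranked = sorted(range(len(utilities)), key=lambda i: utilities[i], reverse=True)
    return sorted(ranked[:k])

# One row per judge, one utility value (0-10) per sentence; the numbers are invented.
judge_utilities = [
    [8, 2, 5, 9, 1, 4],
    [7, 3, 4, 8, 2, 5],
    [9, 1, 6, 7, 0, 3],
]
reference_1 = utility_extract(judge_utilities[0])                              # judge 1 only
reference_all = utility_extract([sum(col) for col in zip(*judge_utilities)])   # summed utilities
print(reference_1, reference_all)
```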

Experiment 1
Two references compared against the third, at 5 different compression rates, in two text clusters (all available combinations)
Are the results BLEU gives on inter-annotator agreement consistent?
=> Inconsistency both across text clusters and within clusters at different compression rates (the latter more consistent than the former)
=> The reliability of BLEU in summarization seems to depend on the values of the variables used. If so, how could one identify the appropriate values?

Experiment 1 (cont.)
Two references compared against the third, at 5 different compression rates, in two text clusters (all available combinations)
[Tables: BLEU agreement scores between the reference extracts at 10%-50% compression; values not preserved in the transcript]

Experiment 2
For reference X within cluster Y, across compression rates, the ranking of the systems is not consistent
[Table: BLEU scores of Query-Based, Simple 1, Simple 2 and Simple 3 against Reference 3 at 10%-50% compression; values not preserved in the transcript]
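One way to make the "ranking is not consistent" observation concrete is to compare the system orderings produced under two conditions with a rank-agreement measure. The sketch below uses a simple Kendall-style statistic over hypothetical rankings; the rankings shown are invented, not the paper's results.

```python
from itertools import combinations

def rank_agreement(rank_a, rank_b):
    """Kendall-style agreement between two rankings (dicts: system -> rank, 1 = best).
    Returns +1 for identical orderings and -1 for fully reversed ones."""
    concordant = discordant = 0
    for s1, s2 in combinations(rank_a, 2):
        sign = (rank_a[s1] - rank_a[s2]) * (rank_b[s1] - rank_b[s2])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    return (concordant - discordant) / max(concordant + discordant, 1)

# Hypothetical system rankings under one reference at two compression rates
rank_at_10 = {"Query-Based": 1, "Simple 1": 2, "Simple 2": 3, "Simple 3": 4}
rank_at_50 = {"Query-Based": 2, "Simple 1": 1, "Simple 2": 4, "Simple 3": 3}
print(rank_agreement(rank_at_10, rank_at_50))
```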

Experiment 3
For reference X at compression rate Y, across clusters, the ranking of the systems is not consistent
[Table: BLEU scores of Query-Based, Simple 1, Simple 2 and Simple 3 against Reference 1 at 30% compression, per cluster; values not preserved in the transcript]

Experiment 4
For Reference ALL, across clusters and at multiple compression rates, the ranking of the systems is (more) consistent
[Tables: BLEU scores of Query-Based, Simple 1, Simple 2 and Simple 3 against Reference ALL at 10%-50% compression in the two clusters; values not preserved in the transcript]

Experiment 4 (cont.)
Is there a way to use BLEU with a single reference summary and still get reliable results back?
[Table: per-reference system rankings at 10%-50% compression and their average rank; values not preserved in the transcript]
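The average-rank idea floated here (and picked up in the conclusions) can be sketched as follows: rank the summarisers under each single-reference condition, then order them by their mean rank. All BLEU scores below are invented placeholders, not the paper's numbers.

```python
from statistics import mean

def ranks(scores):
    """Turn a dict of BLEU scores into a dict of ranks (1 = best)."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {system: position + 1 for position, system in enumerate(ordered)}

# One score table per (single reference, compression rate) condition; values are invented.
score_tables = [
    {"Query-Based": 0.41, "Simple 1": 0.38, "Simple 2": 0.30, "Simple 3": 0.12},
    {"Query-Based": 0.35, "Simple 1": 0.39, "Simple 2": 0.28, "Simple 3": 0.10},
    {"Query-Based": 0.44, "Simple 1": 0.37, "Simple 2": 0.33, "Simple 3": 0.15},
]
rank_tables = [ranks(table) for table in score_tables]
average_rank = {system: mean(table[system] for table in rank_tables)
                for system in score_tables[0]}
print(sorted(average_rank, key=average_rank.get))  # systems ordered by average rank
```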

Notes on BLEU
BLEU fails to capture semantic equivalences between n-grams in their various lexical and syntactic manifestations
Examples:
“Of the 9,928 drug abusers reported in first half of the year, 1,445 or 14.6% were aged under 21.” vs. “...number of reported abusers”
“This represents a decrease of 17% over the 1,740 young drug abusers in the first half of 1998.”

Conclusions
The use of multiple reference summaries is needed when using BLEU in summarization
A lack of such resources could probably be overcome using the average-rank aggregation technique
Future work:
- scaling up the experiments
- correlation of BLEU with other content-based metrics used in summarization