Evaluating Summaries Automatically – a system proposal
Paulo C. F. de Oliveira, Edson Wilson Torrens, Alexandre Cidral, Sidney Schossland, Evandro Bittencourt
University of Joinville, Joinville – SC – Brazil
{pc.oliveira, edson.wilson, alexandre.cidral, sidney.schossland,
© P C F de Oliveira 2008

Overview
- Introduction
- Method's Description
- Evaluation of Evaluation
- Conclusions

Introduction
- There has been a lot of debate in the automatic text summarisation literature regarding appropriate evaluation metrics for measuring the quality of summaries
- A series of conferences has attested to the importance of evaluation:
  - Text Retrieval Conference (TREC)
  - Message Understanding Conference (MUC)
  - TIPSTER SUMMAC Text Summarisation Evaluation
  - Document Understanding Conference (DUC)
  - Text Summarisation Challenge (TSC)

Introduction
- Difficulties in summary evaluation:
  - No clear definition of what constitutes a good summary
  - Each text can have more than one 'ideal' summary
  - Dependence on human intervention brings drawbacks: it is time consuming, costly, and relies on subjective judgement that is prone to bias
- If one needs an evaluation method which is fast and not influenced by these drawbacks, an automatic evaluation method could be the answer

Overview
- Introduction
- Method's Description
- Evaluation of Evaluation
- Conclusion

Evaluation Method
- Automatic summary evaluation system called VERT (Valuation using Enhanced Rationale Technique)
- VERT-C (χ² statistics)
  - Deals with content-bearing words in both reference text and candidate summary, using correlation analysis and χ² statistics
- VERT-F (n-gram matching)
  - Deals with the matching between sentences using a graph theory method: the maximum bipartite matching problem (MBMP)
- BLEU: BiLingual Evaluation Understudy, "Blue" (Papineni et al 2001)
- ROUGE: Recall-Oriented Understudy for Gisting Evaluation, "Red" (Lin 2004)

Evaluation Method
- VERT-C (Chi-Square statistics) algorithm (Siegel and Castellan, 1988)
- State the null hypothesis (H0): the summary is a good representation of its parent/full text, i.e. the distribution of content-bearing words in the summary is the same as in its parent/full text
- State the alternative hypothesis (H1): the summary is not a good representation of its parent/full text, i.e. the distribution of content-bearing words in the summary is different from that of its parent/full text

Evaluation Method
VERT-C Algorithm (cont'd)
1. PRODUCE a word frequency list for the full text and for the summary (observed frequencies)
2. NORMALISE the list (e.g. analyse, analysed, analysing to analys)
3. ARRANGE the list in an array with 2 columns (one for the full text and the other for the summary): a contingency table
4. SUM UP the cell frequencies across the columns
5. COMPUTE the expected frequencies
6. COMPUTE the observed-vs-expected differences
7. COMPUTE the χ² statistic
8. COMPUTE the VERT-C score
This is VERT-C
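A minimal Python sketch of how these steps could be realised is given below, assuming the expected summary counts are obtained by scaling the full-text frequency distribution (consistent with the contingency table on the next slide); the function name and the final score mapping are illustrative assumptions, not the actual VERT-C implementation.

```python
from collections import Counter

def vert_c_like_chi_square(full_text_tokens, summary_tokens):
    """Toy chi-square over content-word frequencies (illustrative, not the real VERT-C)."""
    # Steps 1-2: word frequency lists over already normalised/stemmed tokens
    ft = Counter(full_text_tokens)
    sm = Counter(summary_tokens)
    # Step 3: two-column contingency table restricted to words of the full text
    words = list(ft)
    # Step 4: column totals
    n_ft = sum(ft[w] for w in words)
    n_sm = sum(sm[w] for w in words)
    if n_ft == 0 or n_sm == 0:
        return 0.0
    # Steps 5-7: expected summary counts scaled from the full-text distribution,
    # then the chi-square statistic over observed vs expected counts
    chi2 = 0.0
    for w in words:
        expected = ft[w] * n_sm / n_ft
        observed = sm[w]
        chi2 += (observed - expected) ** 2 / expected
    # Step 8: VERT-C maps this statistic to a score; that mapping is not shown here
    return chi2
```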

Evaluation Method
- VERT-C contingency table example

Word             FT   SP10: Oi (Ei)
spiers            8    5 (4.9)
affairs           6    4 (3.7)
ambassador        5    3 (3.1)
general           5    3 (3.1)
state             4    2 (2.5)
political         4    3 (2.5)
director          3    3 (1.8)
department        3    2 (1.8)
undersecretary    3    1 (1.8)
pakistan          2    2 (1.2)
appoint           2    1 (1.2)
turkey            2    1 (1.2)
embassy           2    1 (1.2)
london            2    1 (1.2)
charge            2    1 (1.2)
bahamas           2    1 (1.2)
secretary         2    1 (1.2)
Total            57   35
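As a quick check of the figures above, the expected counts Ei appear to be the full-text counts scaled by the ratio of the column totals, 35/57. For 'spiers', E = 8 × 35/57 ≈ 4.9, giving a chi-square contribution of (5 − 4.9)² / 4.9 ≈ 0.002; for 'undersecretary', E = 3 × 35/57 ≈ 1.8 and the contribution is (1 − 1.8)² / 1.8 ≈ 0.36.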

Evaluation Method
- VERT-F (n-gram matching) (Turian et al 2003)
[Figure: bipartite word matching between a reference text, "the dog saw the man", and a candidate text, "the man was seen by the dog"]

Evaluation Method
- VERT-F (n-gram matching)
- Maximum Bipartite Matching Problem (MBMP) (Cormen et al 2001)
Let G = (V, E) be a bipartite graph in which V can be partitioned into two sets V1 and V2 such that V = V1 ∪ V2. A matching M on G is a subset of the edges (arcs) of G such that each vertex (node) in G is incident with no more than one edge in M. A maximum matching is a matching of maximum size, i.e. the largest subset of edges such that no two edges share an endpoint.
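To make the matching step concrete, here is a small Python sketch of Kuhn's augmenting-path algorithm for maximum bipartite matching, where an edge links identical tokens; this is a simplified stand-in for the word-matching criterion of Turian et al (2003), and maximum_matching_size is a hypothetical helper name, not part of VERT.

```python
def maximum_matching_size(reference_tokens, candidate_tokens):
    """Size of a maximum bipartite matching in which an edge links identical tokens."""
    # Adjacency: candidate position -> reference positions holding the same token
    adj = [[j for j, r in enumerate(reference_tokens) if r == c]
           for c in candidate_tokens]
    match_of_ref = [-1] * len(reference_tokens)  # reference position -> matched candidate position

    def try_augment(c, visited):
        # Kuhn's augmenting-path step: find a free (or re-assignable) reference node for c
        for j in adj[c]:
            if j not in visited:
                visited.add(j)
                if match_of_ref[j] == -1 or try_augment(match_of_ref[j], visited):
                    match_of_ref[j] = c
                    return True
        return False

    return sum(try_augment(c, set()) for c in range(len(candidate_tokens)))
```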

Evaluation Method
- Maximum Bipartite Matching Problem (MBMP)

Evaluation Method
- Maximum Bipartite Matching Problem (MBMP)
- The Maximum Match Size (MMS) of a bitext is the size of any maximum matching for that bitext
- For e = 1, the score is the F-measure of the precision and recall derived from the MMS. This is VERT-F
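Using the word-overlap example from the earlier slide and the maximum_matching_size sketch above (unigram matches only), the e = 1 score works out as in the snippet below; this illustrates the F-measure computation with assumed GTM-style precision and recall, and is not output from VERT-F itself.

```python
reference = "the dog saw the man".split()
candidate = "the man was seen by the dog".split()

mms = maximum_matching_size(reference, candidate)        # 4 ("the" twice, "man", "dog")
precision = mms / len(candidate)                         # 4/7 ~= 0.571
recall = mms / len(reference)                            # 4/5 = 0.800
f_score = 2 * precision * recall / (precision + recall)  # ~= 0.667
```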

Evaluation Method
- VERT's Architecture

Overview
- Introduction
- Method's Description
- Evaluation of Evaluation
- Conclusion

Evaluation of Evaluation
- Investigate the performance of VERT
- The efficacy of an automatic evaluation metric must be assessed through 'correlation analysis'
- The automatic scores should correlate highly with human scores (Lin 2004, Mani 2000)
- Correlation analysis makes use of measures of correlation, which are 'descriptive statistical measures that represent the degree of relationship between two or more variables' (Sheskin 2000)

Evaluation of Evaluation
- Descriptive measures are also known as correlation coefficients
- Is it possible to determine whether one metric is 'better' than another through correlation analysis, e.g. ROUGE vs VERT?
- The answer is ranking correlation (Voorhees 2000)
- The rankings produced by a particular scoring method (e.g. an evaluation metric) are more important than the scores themselves
- Kendall's tau (τ) correlation coefficient gives us this insight
- Kendall's τ depends on the number of inversions in the rank order of one variable when the other variable is ranked in order
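For concreteness, a ranking correlation between per-system metric scores and per-system human scores can be computed as in the sketch below; the numbers are invented purely for illustration, and scipy is assumed to be available.

```python
from scipy.stats import kendalltau

# Hypothetical per-system averages (illustrative numbers only)
vert_scores  = [0.42, 0.38, 0.55, 0.47, 0.33]  # e.g. average VERT-F score per system
human_scores = [0.50, 0.41, 0.62, 0.49, 0.37]  # e.g. NIST mean coverage per system

# Kendall's tau compares the two rankings via the number of pairwise inversions
tau, p_value = kendalltau(vert_scores, human_scores)
print(f"Kendall's tau = {tau:.2f} (p = {p_value:.3f})")
```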

Evaluation of Evaluation
- Data set: DUC data, because it contains 3 years of human judgements (DUC 2001, 2002 and 2003)
- For each year (DUC 2001, DUC 2002, DUC 2003) the data set is characterised by: summaries type (single docs), summaries length (words), number of systems, and total number of summaries

Evaluation of Evaluation
- Kendall's τ was computed between the systems' average VERT-C and VERT-F scores and their respective mean coverage scores assigned by NIST assessors
[Table: Kendall's τ per DUC year, VERT-C vs Humans and VERT-F vs Humans]

Evaluation of Evaluation
- Why did VERT-F perform better than VERT-C?
- Due to the difference between the approaches:
  - VERT-F is based on matching all words between the reference and the candidate text
  - VERT-C is word-frequency based, i.e. only the most frequent (content-bearing) words in the reference and the candidate are considered, resulting in less similarity

Evaluation of Evaluation
- What about ROUGE and BLEU vs human scores?
[Table: Kendall's τ per DUC year, BLEU vs Humans and ROUGE vs Humans]

Evaluation of Evaluation

Overview
- Introduction
- Method's Description
- Evaluation of Evaluation
- Conclusions

Conclusions
- An investigation into the power of our evaluation procedure was carried out, relying on correlation analysis
- VERT-F performed better than VERT-C due to the difference between the approaches
- VERT scores correlated highly and positively with human scores
- This is a significant result because 3 years of human evaluations were used
- We found it worthwhile to use BLEU and ROUGE as baselines for a comparative evaluation, because ranking correlation is an attractive method given its strong statistical background

Conclusions
- We found that ROUGE outperformed BLEU and VERT against human scores
- However, VERT-F performed similarly to ROUGE, the official metric used by NIST
- Comparative evaluation is central to summary evaluation research, i.e. evaluation of evaluation
- The use of a mature discipline like statistics allows us to confirm that our evaluation experiments are significant and the results are consistent
- Our work contributed a solid advance to the state of the art

Future Work
- Future research can explore the application of VERT to MT, as with BLEU and ROUGE
- The development of a thesaurus module in VERT could handle linguistic variations such as paraphrasing
- Application of VERT in different domains