1 The Pyramid Method at DUC05 Ani Nenkova Becky Passonneau Kathleen McKeown Other team members: David Elson, Advaith Siddharthan, Sergey Siegelman

2 Overview  Review of Pyramids (Kathy)  Characteristics of the responses  Analyses (Ani)  Scores and Significant Differences  Reliability of Pyramid scoring  Comparisons between annotators  Impact of editing on scores  Impact of Weight 1 SCUs  Correlation with responsiveness and Rouge  Lessons learned

3 Pyramids  Uses multiple human summaries  Previous data indicated 5 needed for score stability  Information is ranked by its importance  Allows for multiple good summaries  A pyramid is created from the human summaries  Elements of the pyramid are content units  System summaries are scored by comparison with the pyramid

4 Summarization Content Units  Near-paraphrases from different human summaries  Clause or less  Avoids explicit semantic representation  Emerges from analysis of human summaries

5 SCU: A cable car caught fire (Weight = 4) A. The cause of the fire was unknown. B. A cable car caught fire just after entering a mountainside tunnel in an alpine resort in Kaprun, Austria on the morning of November 11, 2000. C. A cable car pulling skiers and snowboarders to the Kitzsteinhorn resort, located 60 miles south of Salzburg in the Austrian Alps, caught fire inside a mountain tunnel, killing approximately 170 people. D. On November 10, 2000, a cable car filled to capacity caught on fire, trapping 180 passengers inside the Kitzsteinhorn mountain, located in the town of Kaprun, 50 miles south of Salzburg in the central Austrian Alps.

6 SCU: The cause of the fire is unknown (Weight = 1) A. The cause of the fire was unknown. B. A cable car caught fire just after entering a mountainside tunnel in an alpine resort in Kaprun, Austria on the morning of November 11, 2000. C. A cable car pulling skiers and snowboarders to the Kitzsteinhorn resort, located 60 miles south of Salzburg in the Austrian Alps, caught fire inside a mountain tunnel, killing approximately 170 people. D. On November 10, 2000, a cable car filled to capacity caught on fire, trapping 180 passengers inside the Kitzsteinhorn mountain, located in the town of Kaprun, 50 miles south of Salzburg in the central Austrian Alps.

7 SCU: The accident happened in the Austrian Alps (Weight = 3) A. The cause of the fire was unknown. B. A cable car caught fire just after entering a mountainside tunnel in an alpine resort in Kaprun, Austria on the morning of November 11, 2000. C. A cable car pulling skiers and snowboarders to the Kitzsteinhorn resort, located 60 miles south of Salzburg in the Austrian Alps, caught fire inside a mountain tunnel, killing approximately 170 people. D. On November 10, 2000, a cable car filled to capacity caught on fire, trapping 180 passengers inside the Kitzsteinhorn mountain, located in the town of Kaprun, 50 miles south of Salzburg in the central Austrian Alps.

8 Idealized representation  Tiers of differentially weighted SCUs  Top: few SCUs, high weight  Bottom: many SCUs, low weight  [Pyramid diagram with tiers labeled W=1, W=2, W=3]
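
As a rough illustration of the construction described above (this is not the DUC annotation tooling; the SCU labels and contributor sets are invented, mirroring the examples on slides 5-7), a pyramid can be represented as tiers of SCUs grouped by weight, where an SCU's weight is the number of model summaries that express it:

```python
# Illustrative sketch: build a pyramid from hypothetical SCU annotations.
from collections import defaultdict

# SCU label -> set of model (human) summaries that express it
scu_contributors = {
    "a cable car caught fire":               {"A", "B", "C", "D"},
    "the accident was in the Austrian Alps": {"B", "C", "D"},
    "the cause of the fire is unknown":      {"A"},
}

# Weight of an SCU = number of model summaries expressing it
weights = {scu: len(models) for scu, models in scu_contributors.items()}

# Tiers group SCUs by weight: few high-weight SCUs at the top,
# many low-weight SCUs at the bottom
tiers = defaultdict(list)
for scu, w in weights.items():
    tiers[w].append(scu)

for w in sorted(tiers, reverse=True):
    print(f"W={w}: {tiers[w]}")
```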

9 Creation of pyramids  Done for each of 20 out of 50 sets  Primary annotator, secondary checker  Held round-table discussions of problematic constructions that occurred in this data set  Comma-separated lists, e.g., "Extractive reserves have been formed for managed harvesting of timber, rubber, Brazil nuts, and medical plants without deforestation."  General vs. specific, e.g., Eastern Europe vs. Hungary, Poland, Lithuania, and Turkey

10 Characteristics of the Responses  Proportion of SCUs of Weight 1 is large  44% (D324) to 81% (D695)  Mean SCU weight: 1.9 Agreement among human responders is quite low

11 SCU Weights  [Chart: number of SCUs at each weight]

12 Pyramids: DUC 2003  100 word summaries (vs. 250 word)  word articles per cluster (vs word articles)  3 clusters (vs. 20 clusters)  Mean SCU Weight (7 models)  2005: avg 1.9  2003: avg 2.4  Proportion of SCUs of W=1  2005: avg – 60%, 44% to 81%  2003: avg – 40%, 37% to 47%

13 [Chart comparing DUC03 and DUC05]

14 Computing pyramid scores: Ideally informative summary  Does not include an SCU from a lower tier unless all SCUs from higher tiers are included as well

20 Original Pyramid Score  SCORE = D / MAX  D: sum of the weights of the SCUs in a summary  MAX: sum of the weights of the SCUs in an ideally informative summary with the same number of SCUs  Measures the proportion of good information in the summary: precision

21 Modified pyramid score (recall)  EN = average number of SCUs in the human models  This is the number of content units humans chose to convey about the story  W = weight of a maximally informative summary of size EN  D/W is the modified pyramid score  Shows the proportion of expected good information
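
The two scores on slides 20 and 21 can be sketched in a few lines of code. This is an illustrative reimplementation rather than the official scoring script, and it assumes the peer annotation has already been reduced to the list of weights of the SCUs the peer expressed:

```python
# Sketch of the original (precision-like) and modified (recall-like)
# pyramid scores. A pyramid is represented simply as the list of its SCU
# weights; a peer summary as the list of weights of its matched SCUs.

def max_weight(pyramid_weights, n_scus):
    """Weight of an ideally informative summary expressing n_scus SCUs:
    it never uses a lower-tier SCU before exhausting the higher tiers."""
    return sum(sorted(pyramid_weights, reverse=True)[:n_scus])

def original_pyramid_score(matched_weights, pyramid_weights):
    """D / MAX, where MAX uses as many SCUs as the peer expressed."""
    d = sum(matched_weights)
    return d / max_weight(pyramid_weights, len(matched_weights))

def modified_pyramid_score(matched_weights, pyramid_weights, model_scu_counts):
    """D / W, where W is the weight of a maximally informative summary of
    size EN = average number of SCUs in the human model summaries."""
    en = round(sum(model_scu_counts) / len(model_scu_counts))
    d = sum(matched_weights)
    return d / max_weight(pyramid_weights, en)

# Toy example: pyramid with SCU weights 4, 3, 2, 1, 1, 1; the peer expressed
# one W=4 SCU and two W=1 SCUs; the four model summaries had 4, 5, 4, 3 SCUs.
pyramid = [4, 3, 2, 1, 1, 1]
peer = [4, 1, 1]
models = [4, 5, 4, 3]
print(original_pyramid_score(peer, pyramid))          # 6 / 9 = 0.67
print(modified_pyramid_score(peer, pyramid, models))  # EN = 4, so 6 / 10 = 0.60
```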

22 Scoring Methods  Presents scores for the 20 pyramid sets  Recompute Rouge for comparison  We compute Rouge using only 7 models  8 and 9 reserved for computing human performance  Best because of a significant topic effect  Comparisons between Pyramid (original, modified), responsiveness, and Rouge-SU4  Pyramid scores computed from multiple humans  Responsiveness is just one human's judgment  Rouge-SU4 equivalent to Rouge-2

23 Preview of Results  Manual metrics  Large differences between humans and machines  No single system the clear winner  But a top group identified by all metrics  Significant differences  Different predictions from manual and automatic metrics  Correlations between metrics  Some correlation but one cannot be substituted for another  This is good

24 Human performance / Best system  [Table: Pyramid, Modified, Responsiveness, and ROUGE-SU4 scores for humans A and B and the best systems]  Best system ~50% of human performance on manual metrics  Best system ~80% of human performance on ROUGE

25 [Table, repeated on slides 25-28: original Pyramid, modified Pyramid, responsiveness, and Rouge-SU4 scores for each peer]

29 Significant Differences  Manual metrics  Few differences between systems (Pyramid: 23 is worse; Responsiveness: 23 and 31 are worse)  Both humans better than all systems  Automatic (Rouge-SU4)  Many differences between systems  One human indistinguishable from 5 systems

30 Multiple and pairwise comparisons  Multiple comparisons  Tukey's method  Control for the experiment-wise type I error  Show fewer significant differences  Pairwise comparisons  Wilcoxon paired test  Controls the error for individual comparisons  Appropriate for checking how your system did during development
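
A sketch of the two kinds of test named above; the per-topic scores and peer names are invented, so this shows the mechanics rather than the actual DUC analysis:

```python
import numpy as np
from scipy.stats import wilcoxon
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Made-up per-topic scores for three hypothetical peers on the same 20 topics
rng = np.random.default_rng(0)
topics = 20
scores = {
    "peer14": rng.normal(0.30, 0.05, topics),
    "peer17": rng.normal(0.28, 0.05, topics),
    "peer23": rng.normal(0.20, 0.05, topics),
}

# Multiple comparisons (Tukey's method): controls the experiment-wise
# type I error, so it tends to show fewer significant differences
values = np.concatenate(list(scores.values()))
labels = np.repeat(list(scores.keys()), topics)
print(pairwise_tukeyhsd(values, labels))

# Pairwise comparison (Wilcoxon signed-rank on paired per-topic scores):
# controls the error of one comparison only; useful when checking how
# your own system did during development
stat, p = wilcoxon(scores["peer14"], scores["peer23"])
print(f"Wilcoxon peer14 vs peer23: statistic={stat:.1f}, p={p:.4f}")
```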

31 Modified pyramid: significant differences  One system accounts for most of the differences  Humans significantly better than all systems  [Chart: for each peer, the peers it is significantly better than]

32 Responsiveness-1: significant differences  Differences primarily between 2 systems  Differences between humans and each system

33 Responsiveness-2  Similar shape to the original

34 Skip-bigram (Rouge-SU4): significant differences  Many more differences between systems than with any manual metric  No difference between one human and 5 systems

36 Pairwise comparisons: Modified Pyramid

37 Agreement between annotators (Overall / Low / High)  Percent agreement: 95% / 90% / 96%  Kappa  Alpha  Alpha-Dice
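
A small sketch of the two agreement figures that are simplest to reproduce, percent agreement and Cohen's kappa, on invented annotations (Krippendorff's alpha and the Dice-weighted alpha in the table above would need additional tooling):

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical data: two annotators assign each candidate text span to an SCU
annotator_1 = ["scu_fire", "scu_fire", "scu_cause", "scu_alps", "none", "scu_cause"]
annotator_2 = ["scu_fire", "scu_fire", "scu_cause", "scu_alps", "none", "none"]

# Raw percent agreement: fraction of spans labeled identically
percent_agreement = sum(a == b for a, b in zip(annotator_1, annotator_2)) / len(annotator_1)

# Cohen's kappa: agreement corrected for chance
kappa = cohen_kappa_score(annotator_1, annotator_2)

print(f"percent agreement = {percent_agreement:.2f}")
print(f"Cohen's kappa     = {kappa:.2f}")
```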

38 Editing of participant annotations  To correct obvious errors  Ensures uniform checking  Predominantly involved correctly splitting non-matching SCUs  Average paired differences and average magnitude of the difference computed for both original and modified scores

39 Excluding weight 1 SCUs  Removing weight 1 SCUs improves agreement  Kappa: 0.64 (was 0.57)  Annotating without weight 1 has negligible impact on scores  Set D324 done without weight 1 SCUs  Average magnitude of paired differences: 0.07

40 Correlations: Pearson's, 25 systems  [Table: pairwise Pearson correlations among Pyr-orig, Pyr-mod, Resp-1, Resp-2, R-2, and R-SU4]

41 Correlations: Pearson's, 25 systems  [Same correlation table]  Questionable that responsiveness could be a gold standard

42 Pyramid and responsiveness  [Same correlation table]  High correlation, but the metrics are not mutually substitutable

43 Pyramid and Rouge  [Same correlation table]  High correlation, but the metrics are not mutually substitutable
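
The Pearson correlations in the tables above come from per-system score vectors (one value per system, 25 systems in the real analysis). The sketch below uses invented numbers and only two metric pairs, just to show the computation:

```python
from scipy.stats import pearsonr

# Invented per-system scores (the real vectors would have 25 entries each)
pyr_modified = [0.26, 0.24, 0.22, 0.21, 0.19, 0.18, 0.15, 0.12]
rouge_su4    = [0.13, 0.13, 0.12, 0.12, 0.11, 0.10, 0.09, 0.08]
responsive_1 = [2.9,  2.7,  2.8,  2.5,  2.3,  2.2,  2.0,  1.8]

for name, other in [("Rouge-SU4", rouge_su4), ("responsiveness-1", responsive_1)]:
    r, p = pearsonr(pyr_modified, other)
    print(f"modified pyramid vs {name}: r={r:.2f} (p={p:.3f})")
```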

44 Lessons Learned  Comparing content is hard  All kinds of judgment calls  We didn't evaluate the NIST assessors in previous years  Paraphrases  VP vs. NP: "Ministers have been exchanged" vs. "Reciprocal ministerial visits"  Length and constituent type: "Robotics assists doctors in the medical operating theater" vs. "Surgeons started using robotic assistants"

45 Modified scores better  Easier peer annotation  Can drop weight 1 SCUs  Better agreement  No emphasis on splitting non-matching SCUs

46 Agreement between annotators  Participants can perform peer annotation reliably  Absolute difference between scores computed for both original and modified  Empirical prediction of the difference: 0.06 (HLT 2004)

47 Correlations  Original and modified can substitute for each other  High correlation between manual and automatic, but automatic not yet a substitute  Similar patterns between pyramid and responsiveness

48 Current Directions  Automated identification of SCUs (Harnly et al. 05)  Applied to the DUC05 pyramid data set  Correlation of 0.91 with modified pyramid scores

49 Questions  What was the experience annotating pyramids?  Does it shed insight on the problem?  Are people willing to do it again?  Would you have been willing to go through training?  If you've done pyramid analysis, can you share your insights?

52 Correlations of Scores on Matched Sets

53 SCU Weight by Cardinality (Ten pyramids)

54 Mean SCU Weight (Ten pyramids)