
1 1 Lessons Learned from Evaluation of Summarization Systems: Nightmares and Pleasant Surprises Kathleen McKeown Department of Computer Science Columbia University Major contributors: Ani Nenkova, Becky Passonneau

2 2

3 3 Questions  What kinds of evaluation are possible?  What are the pitfalls?  Are evaluation metrics fair?  Is real research progress possible?  What are the benefits?  Should we evaluate our systems?

4 4 What is the feel of the evaluation?  Is it competitive?  Does it foster a feeling of community?  Are the guidelines clearly established ahead of time?  Are the metrics fair? Do they measure what you want to measure?

5 5

6 6 The night Max wore his wolf suit and made mischief of one kind

7 7 and another

8 8 His mother called him “WILD THING” and he said “I’LL EAT YOU UP!” so he was sent to bed without eating anything.

9 9 DARPA GALE: Global Autonomous Language Environment  Three large teams: BBN, IBM, SRI  SRI: UC Berkeley, U Washington, UCSD, Columbia, NYU, UMASS, NCRI, Systran, Fair Isaacs, Ohio State  Generate responses to open-ended questions  17 templates: definitions, biographies, events, relationships, reactions, etc.  Using English, Chinese, and Arabic text and speech, from blogs to news  Find all instances when a fact is mentioned (redundancy)

10 10 GALE Evaluation  Can systems do at least 50% as well as a human?  If not, the GALE program will not continue  The team that does worst may be cut  Independent evaluator: BAE  Has never done text evaluation before  Has experience with task-based evaluation  Gold Standard  System responses graded by two judges  Relevant facts added to the pool  Granularity of scoring: nuggets  Metrics  Weighted variants of precision/recall  Document citations  Redundancy

11 11 Year 1: Sample Q&A LIST FACTS ABOUT [The Trial of Saddam Hussein]  The judge, however, that all people should have heard voices, the order of a court to solve technical problems. (Chi)  His account of events surrounding the torture and execution of more than 140 men and teenage boys from the Dujail, appeared to do little to advance the prosecution's goal of establishing Saddam 's "command responsibility" for the deaths.  A trial without Saddam could be an embarrassment for the U.S. government, which has worked hard to help create a tribunal that would be perceived by Iraqis as independent and fair.  As the trial got under way, a former secret police officer testified that he had not received any orders from Saddam during the investigations that followed an assassination attempt against him in Dujail in 1982.

12 12 Year 1: Results  F-value (β = 1)  Machine average: 0.230  Human average: 0.353  Machine-to-human average: 0.678
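For reference, an F-value with β of 1 is just the harmonic mean of nugget precision and recall. A minimal sketch; the precision/recall figures in the example are hypothetical, not GALE results:

```python
def f_value(precision, recall, beta=1.0):
    """F-measure; beta = 1 weights precision and recall equally (harmonic mean)."""
    if precision + recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# Hypothetical nugget precision/recall for one set of responses.
print(round(f_value(0.20, 0.27), 3))  # ~0.23
```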

13 13 DUC – Document Understanding Conference  Established and funded by DARPA TIDES  Run by independent evaluator NIST  Open to summarization community  Annual evaluations on common datasets  2001-present  Tasks  Single document summarization  Headline summarization  Multi-document summarization  Multi-lingual summarization  Focused summarization  Update summarization

14 14 DUC is changing direction again  DARPA GALE effort cutting back participation in DUC  Considering co-locating with TREC QA  Considering new data sources and tasks

15 15 DUC Evaluation  Gold Standard  Human summaries written by NIST  From 2 to 9 summaries per input set  Multiple metrics  Manual: Coverage (early years), Pyramids (later years), Responsiveness (later years), quality questions  Automatic: ROUGE (-1, -2, skip-bigrams, LCS, BE)  Granularity  Manual: sub-sentential elements  Automatic: sentences

16 16 TREC definition pilot  Long answer to request for a definition  As a pilot, less emphasis on results  Part of TREC QA

17 17 Evaluation Methods  Pool system responses and break into nuggets  A judge scores nuggets as vital, OK or invalid  Measure information precision and recall  Can a judge reliably determine which facts belong in a definition?
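A hedged sketch of nugget-style scoring in this spirit: recall over vital nuggets, with precision approximated by a per-nugget length allowance as in the TREC QA definition evaluations. The function name, constants, and example figures are illustrative assumptions, not the official scoring script:

```python
def nugget_f(vital_returned, vital_total, okay_returned, answer_length,
             allowance_per_nugget=100, beta=5.0):
    """F-score for one system response under nugget-style evaluation.

    vital_returned -- vital nuggets the judge matched in the response
    vital_total    -- vital nuggets in the pooled gold standard
    okay_returned  -- acceptable ("OK") nuggets matched
    answer_length  -- response length in characters
    """
    recall = vital_returned / vital_total if vital_total else 0.0

    # Precision proxy: each matched nugget "buys" a fixed length allowance;
    # any text beyond the allowance counts against the response.
    allowance = allowance_per_nugget * (vital_returned + okay_returned)
    if answer_length <= allowance:
        precision = 1.0
    else:
        precision = 1.0 - (answer_length - allowance) / answer_length

    if precision + recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# Example: 3 of 5 vital nuggets, 2 OK nuggets, an 800-character response.
print(round(nugget_f(3, 5, 2, 800), 3))  # ~0.6
```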

18 18 Considerations Across Evaluations  Independent evaluator  Not always as knowledgeable as researchers  Impartial determination of approach  Extensive collection of resources  Determination of task  Appealing to a broad cross-section of the community  Changes over time: DUC 2001-2002: single and multi-document; DUC 2003: headlines, multi-document; DUC 2004: headlines, multilingual and multi-document, focused; DUC 2005: focused summarization; DUC 2006: focused plus a new task, up for discussion  How long do participants have to prepare?  When is a task dropped?  Scoring of text at the sub-sentential level

19 19 Task-based Evaluation  Use the summarization system as a browser to do another task  Newsblaster: write a report given a broad prompt  DARPA utility evaluation: given a request for information, use question answering to write a report

20 20 Task Evaluation  Hypothesis: multi-document summaries enable users to find information efficiently  Task: fact-gathering given topic and questions  Resembles intelligence analyst task

21 21 User Study: Objectives  Does multi-document summarization help?  Do summaries help the user find information needed to perform a report writing task?  Do users use information from summaries in gathering their facts?  Do summaries increase user satisfaction with the online news system?  Do users create better quality reports with summaries?  How do full multi-document summaries compare with minimal 1-sentence summaries such as Google News?

22 22 User Study: Design  Compared 4 parallel news browsing systems  Level 1: Source documents only  Level 2: One-sentence multi-document summaries (e.g., Google News) linked to documents  Level 3: Newsblaster multi-document summaries linked to documents  Level 4: Human-written multi-document summaries linked to documents  All groups write reports given four scenarios  A task similar to an analyst's  Can only use Newsblaster for research  Time-restricted

23 23 User Study: Execution  4 scenarios  4 event clusters each  2 directly relevant, 2 peripherally relevant  Average 10 documents/cluster  45 participants  Balanced between liberal arts and engineering  138 reports  Exit survey  Multiple-choice and open-ended questions  Usage tracking  Each click logged, on or off-site

24 24 “Geneva” Prompt  The conflict between Israel and the Palestinians has been difficult for government negotiators to settle. Most recently, implementation of the “road map for peace”, a diplomatic effort sponsored by ……  Who participated in the negotiations that produced the Geneva Accord?  Apart from direct participants, who supported the Geneva Accord preparations and how?  What has the response been to the Geneva Accord by the Palestinians?

25 25 Measuring Effectiveness  Score report content and compare across summary conditions  Compare user satisfaction per summary condition  Compare where subjects took report content from

26 26 Newsblaster

27 27 User Satisfaction  Users rated the system more effective than a web search with Newsblaster summaries; not so with documents only or single-sentence summaries  Easier to complete the task with summaries than with documents only  More likely to feel they had enough time with summaries than with documents only  Summaries reported as helping most: 5% for single-sentence summaries, 24% for Newsblaster summaries, 43% for human summaries

28 28 User Study: Conclusions  Summaries measurably improve a news browser's effectiveness for research  Users are more satisfied with Newsblaster summaries than with single-sentence summaries like those of Google News  Users want search (not included in this evaluation)

29 29 Potential Problems

30 30 That very night in Max’s room a forest grew

31 31 And grew

32 32 And grew until the ceiling hung with vines and the walls became the world all around

33 33 And an ocean tumbled by with a private boat for Max and he sailed all through the night and day

34 34 And he sailed in and out of weeks and almost over a year to where the wild things are

35 35 And when he came to where the wild things are they roared their terrible roars and gnashed their terrible teeth

36 36 Comparing Text Against Text  Which human summary makes a good gold standard? Many summaries are good  At what granularity is the comparison made?  When can we say that two pieces of text match?

37 37 Measuring variation  Types of variation between humans, by application:  Translation: same content, different wording  Summarization: different content??, different wording  Generation: different content??, different wording

38 38 Human variation: content words (Ani Nenkova)  Summaries differ in vocabulary  Differences cannot be explained by paraphrase  Data: 7 translations × 20 documents vs. 7 summaries × 20 document sets  Faster vocabulary growth in summarization
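One way to make the vocabulary-growth comparison concrete: count cumulative distinct word types as texts from more humans are added for the same input, and compare the curves for translations versus summaries. A rough sketch with naive tokenization and invented example texts (not the data from the study):

```python
def vocabulary_growth(texts):
    """Cumulative number of distinct word types after each human text is added."""
    seen, curve = set(), []
    for text in texts:
        seen.update(text.lower().split())
        curve.append(len(seen))
    return curve

translations = ["the minister visited beijing on monday",
                "the minister traveled to beijing on monday"]
summaries = ["pinochet was arrested in london",
             "spain requested the extradition of the former chilean dictator"]
print(vocabulary_growth(translations))  # slow growth: wording largely shared
print(vocabulary_growth(summaries))     # faster growth: little shared content vocabulary
```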

39 39 Variation impacts evaluation  Comparing content is hard  All kinds of judgment calls  Paraphrases  VP vs. NP: "Ministers have been exchanged" vs. "Reciprocal ministerial visits"  Length and constituent type: "Robotics assists doctors in the medical operating theater" vs. "Surgeons started using robotic assistants"

40 40 Nightmare: only one gold standard  A system may have chosen an equally good sentence that is not in the one gold standard  Pinochet arrested in London on Oct 16 at a Spanish judge's request for atrocities against Spaniards in Chile.  Former Chilean dictator Augusto Pinochet has been arrested in London at the request of the Spanish government  In DUC 2001 (one gold standard), the human model used had a significant impact on scores (McKeown et al.)  Five human summaries are needed to avoid changes in rank (Nenkova and Passonneau)  DUC 2003 data  3 topic sets: 1 highest scoring and 2 lowest scoring  10 model summaries
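One way to probe this nightmare (and the question on the next slide) is to re-score systems against subsets of the available model summaries and check when the ranking stops changing. A brute-force sketch, assuming some metric function score(system, models) is supplied by the caller (a hypothetical interface; the published analysis used pyramid scores over the DUC 2003 data):

```python
from itertools import combinations

def ranking(systems, models, score):
    """Rank systems (best first) by a metric computed against a set of model summaries.
    `score(system, models)` is a placeholder for any metric, e.g. a pyramid score."""
    return sorted(systems, key=lambda s: score(s, models), reverse=True)

def models_needed_for_stable_ranking(systems, models, score):
    """Smallest k such that every k-sized subset of the model summaries reproduces
    the ranking obtained from the full set (brute force; fine for the handful of
    model summaries available per topic)."""
    reference = ranking(systems, models, score)
    for k in range(1, len(models) + 1):
        if all(ranking(systems, list(subset), score) == reference
               for subset in combinations(models, k)):
            return k
    return len(models)
```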

41 41 How many summaries are enough?

42 42 Scoring  Two main approaches used in DUC  ROUGE (Lin and Hovy)  Pyramids (Nenkova and Passonneau)  Problems:  Are the results stable?  How difficult is it to do the scoring?

43 43 ROUGE: Recall-Oriented Understudy for Gisting Evaluation  ROUGE-n: n-gram co-occurrence metrics measuring content overlap  ROUGE-n = (count of n-grams in the model summaries that also occur in the candidate summary) / (total n-grams in the model summaries)
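A minimal sketch of the recall-oriented count above, with clipped n-gram matches against each model summary (no stemming, stopword handling, or jackknifing, unlike the official ROUGE package):

```python
from collections import Counter

def ngrams(text, n):
    tokens = text.lower().split()
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n(candidate, models, n=1):
    """Recall-oriented n-gram overlap: clipped matches with each model summary,
    divided by the total number of n-grams in the model summaries."""
    cand = ngrams(candidate, n)
    matched = total = 0
    for model in models:
        ref = ngrams(model, n)
        total += sum(ref.values())
        matched += sum(min(count, cand[gram]) for gram, count in ref.items())
    return matched / total if total else 0.0

models = ["pinochet was arrested in london",
          "former chilean dictator pinochet arrested in london"]
print(round(rouge_n("pinochet has been arrested in london", models, n=1), 3))
```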

44 44 ROUGE  Experimentation with different units of comparison: unigrams, bigrams, longest common subsequence, skip-bigrams, basic elements  Automatic and thus easy to apply  Important to consider confidence intervals when determining differences between systems  Scores falling within the same interval are not significantly different  ROUGE scores place systems into large groups: it can be hard to definitively say one is better than another  Sometimes results are unintuitive:  Multilingual scores as high as English scores  Use in speech summarization shows no discrimination  Good for training regardless of intervals: can see trends
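One common way to obtain such confidence intervals is to bootstrap over per-topic scores. An illustrative resampling sketch with made-up scores (the official ROUGE toolkit has its own confidence-interval machinery):

```python
import random

def bootstrap_ci(per_topic_scores, n_resamples=10000, alpha=0.05, seed=0):
    """95% bootstrap confidence interval for a system's mean score,
    resampling topics with replacement."""
    rng = random.Random(seed)
    k = len(per_topic_scores)
    means = []
    for _ in range(n_resamples):
        sample = [rng.choice(per_topic_scores) for _ in range(k)]
        means.append(sum(sample) / k)
    means.sort()
    low = means[int(alpha / 2 * n_resamples)]
    high = means[int((1 - alpha / 2) * n_resamples) - 1]
    return low, high

# Two systems whose intervals overlap should not be called different.
sys_a = [0.12, 0.15, 0.11, 0.14, 0.13, 0.16, 0.12, 0.15]
sys_b = [0.13, 0.14, 0.12, 0.15, 0.14, 0.15, 0.13, 0.14]
print(bootstrap_ci(sys_a))
print(bootstrap_ci(sys_b))
```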

45 45 Pyramids  Uses multiple human summaries  Information is ranked by its importance  Allows for multiple good summaries  A pyramid is created from the human summaries  Elements of the pyramid are content units  System summaries are scored by comparison with the pyramid
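A hedged sketch of the original pyramid score described by Nenkova and Passonneau: the total weight of the SCUs expressed in a peer summary, divided by the largest total achievable with the same number of SCUs taken from the top of the pyramid (the modified score used later in DUC 2005 normalizes by an ideally informative summary of average model size instead):

```python
def pyramid_score(matched_scu_weights, pyramid_weights):
    """Original pyramid score.

    matched_scu_weights -- weights of the SCUs a judge matched in the peer summary
    pyramid_weights     -- weights of all SCUs in the pyramid (one entry per SCU)
    """
    observed = sum(matched_scu_weights)
    # Best possible total for a summary expressing the same number of SCUs:
    # take that many SCUs from the top tiers of the pyramid.
    top = sorted(pyramid_weights, reverse=True)[:len(matched_scu_weights)]
    maximum = sum(top)
    return observed / maximum if maximum else 0.0

# Example: a pyramid built from 4 model summaries (SCU weights 1..4) and a peer
# summary that expressed three SCUs with weights 4, 2 and 1.
pyramid = [4, 3, 3, 2, 2, 1, 1, 1, 1]
print(round(pyramid_score([4, 2, 1], pyramid), 2))  # 7 / (4+3+3) = 0.7
```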

46 46 Content units: better study of variation than sentences  Semantic units  Link different surface realizations with the same meaning  Emerge from the comparison of several texts

47 47 Content unit example S1 Pinochet arrested in London on Oct 16 at a Spanish judge’s request for atrocities against Spaniards in Chile. S2 Former Chilean dictator Augusto Pinochet has been arrested in London at the request of the Spanish government. S3 Britain caused international controversy and Chilean turmoil by arresting former Chilean dictator Pinochet in London.

48 48 SCU: A cable car caught fire (Weight = 4) A. The cause of the fire was unknown. B. A cable car caught fire just after entering a mountainside tunnel in an alpine resort in Kaprun, Austria on the morning of November 11, 2000. C. A cable car pulling skiers and snowboarders to the Kitzsteinhorn resort, located 60 miles south of Salzburg in the Austrian Alps, caught fire inside a mountain tunnel, killing approximately 170 people. D. On November 10, 2000, a cable car filled to capacity caught on fire, trapping 180 passengers inside the Kitzsteinhorn mountain, located in the town of Kaprun, 50 miles south of Salzburg in the central Austrian Alps.

49 49 SCU: The cause of the fire is unknown (Weight = 1) A. The cause of the fire was unknown. B. A cable car caught fire just after entering a mountainside tunnel in an alpine resort in Kaprun, Austria on the morning of November 11, 2000. C. A cable car pulling skiers and snowboarders to the Kitzsteinhorn resort, located 60 miles south of Salzburg in the Austrian Alps, caught fire inside a mountain tunnel, killing approximately 170 people. D. On November 10, 2000, a cable car filled to capacity caught on fire, trapping 180 passengers inside the Kitzsteinhorn mountain, located in the town of Kaprun, 50 miles south of Salzburg in the central Austrian Alps.

50 50 Idealized representation  Tiers of differentially weighted SCUs  Top: few SCUs, high weight  Bottom: many SCUs, low weight  (Pyramid figure: tiers labeled W=3 at the top down to W=1 at the bottom)

51 51 Comparison of Scoring Methods in DUC05  Analysis of scores for the 20 pyramid sets  Columbia prepared pyramids  Participants scored systems against pyramids  Comparisons between Pyramid (original, modified), responsiveness, and Rouge-SU4  Pyramid scores are computed from multiple humans  Responsiveness is just one human's judgment  Rouge-SU4 equivalent to Rouge-2

52 52 Creation of pyramids  Done at Columbia for each of 20 out of the 50 sets  Primary annotator, secondary checker  Held round-table discussions of problematic constructions that occurred in this data set  Comma-separated lists: "Extractive reserves have been formed for managed harvesting of timber, rubber, Brazil nuts, and medical plants without deforestation."  General vs. specific: "Eastern Europe" vs. "Hungary, Poland, Lithuania, and Turkey"

53 53 Characteristics of the Responses  Proportion of SCUs of weight 1 is large: 44% (D324) to 81% (D695)  Mean SCU weight: 1.9  Agreement among human responders is quite low

54 54 SCU Weights  (Chart: number of SCUs at each weight)

55 55 Preview of Results  Manual metrics  Large differences between humans and machines  No single system the clear winner  But a top group identified by all metrics  Significant differences  Different predictions from manual and automatic metrics  Correlations between metrics  Some correlation but one cannot be substituted for another  This is good

56 56 Human performance / best system

              Pyramid      Modified     Resp      ROUGE-SU4
  Human A     0.4969       0.4617       4.895     0.1722
  Human B     0.5472       0.4814       4.526     0.1552
  Best system 0.2587 (14)  0.2052 (10)  2.85 (4)  0.139 (15)

Best system ~50% of human performance on manual metrics  Best system ~80% of human performance on ROUGE
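The rough ratios quoted above follow directly from the table; a quick check against both human summarizers:

```python
# Best-system and human scores taken from the table above.
best = {"Pyramid": 0.2587, "ROUGE-SU4": 0.139}
humans = {"Pyramid": (0.5472, 0.4969), "ROUGE-SU4": (0.1722, 0.1552)}
for metric, system_score in best.items():
    ratios = ", ".join(f"{system_score / h:.0%}" for h in humans[metric])
    print(f"{metric}: best system at {ratios} of the two humans")
# Pyramid:   47%, 52%  -> roughly half of human performance
# ROUGE-SU4: 81%, 90%  -> roughly 80% (or more) of human performance
```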

57 57 System scores by metric (system: score, best first)

Pyramid original   Modified      Resp        Rouge-SU4
14: 0.2587         10: 0.2052    4: 2.85     15: 0.139
17: 0.2492         17: 0.1972    14: 2.8     4: 0.134
15: 0.2423         14: 0.1908    10: 2.65    17: 0.1346
10: 0.2379         7: 0.1852     15: 2.6     19: 0.1275
4: 0.2321          15: 0.1808    17: 2.55    11: 0.1259
7: 0.2297          4: 0.177      11: 2.5     10: 0.1278
16: 0.2265         16: 0.1722    28: 2.45    6: 0.1239
6: 0.2197          11: 0.1703    21: 2.45    7: 0.1213
32: 0.2145         6: 0.1671     6: 2.4      14: 0.1264
21: 0.2127         12: 0.1664    24: 2.4     25: 0.1188
12: 0.2126         19: 0.1636    19: 2.4     21: 0.1183
11: 0.2116         21: 0.1613    6: 2.4      16: 0.1218
26: 0.2106         32: 0.1601    27: 2.35    24: 0.118
19: 0.2072         26: 0.1464    12: 2.35    12: 0.116
28: 0.2048         3: 0.145      7: 2.3      3: 0.1198
13: 0.1983         28: 0.1427    25: 2.2     28: 0.1203
3: 0.1949          13: 0.1424    32: 2.15    27: 0.110
1: 0.1747          25: 0.1406    3: 2.1      13: 0.1097

58 58 (Same per-metric ranking table as slide 57.)

59 59 (Same per-metric ranking table as slide 57.)

60 60 (Same per-metric ranking table as slide 57.)

61 61 Significant Differences  Manual metrics  Few differences between systems: Pyramid: 23 is worse; Responsiveness: 23 and 31 are worse  Both humans better than all systems  Automatic (Rouge-SU4)  More differences between systems  One human indistinguishable from 5 systems

62 62 Correlations: Pearson's, 25 systems

            Pyr-mod  Resp-1  Resp-2  R-2    R-SU4
Pyr-orig    0.96     0.77    0.86    0.84   0.80
Pyr-mod              0.81    0.90           0.86
Resp-1                       0.83    0.92
Resp-2                               0.88   0.87
R-2                                         0.98
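Pearson's r here is simply the correlation between two metrics' per-system scores over the 25 systems. A small self-contained sketch with placeholder numbers (not the DUC 2005 scores):

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of per-system scores."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Placeholder per-system scores under two metrics (one value per system).
pyramid_scores = [0.26, 0.25, 0.24, 0.23, 0.22, 0.21, 0.20, 0.19]
rouge_scores   = [0.139, 0.134, 0.135, 0.128, 0.126, 0.124, 0.118, 0.120]
print(round(pearson(pyramid_scores, rouge_scores), 2))
```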

63 63 Correlations: Pearson's, 25 systems  (Same correlation table as slide 62.)  Questionable that responsiveness could be a gold standard

64 64 Pyramid and responsiveness  (Same correlation table as slide 62.)  High correlation, but the metrics are not mutually substitutable

65 65 Pyramid and Rouge  (Same correlation table as slide 62.)  High correlation, but the metrics are not mutually substitutable

66 66 Correlations  Original and modified can substitute for each other  High correlation between manual and automatic, but automatic not yet a substitute  Similar patterns between pyramid and responsiveness

67 67 Nightmare  A scoring metric that is not stable is used to decide funding  Insignificant differences between systems determine funding

68 68 Is Task Evaluation Nightmare Free?  Impact of user interface issues  The interface can have more impact than the summary  Controlling for a proper mix of subjects  The number of subjects and the time to carry out the study are large

69 69 Till Max said “Be still!” and tamed them with the magic trick

70 70 Of staring into their yellow eyes without blinking once And they were frightened and called him the most wild thing of all

71 71 And made him king of all wild things

72 72 “And now,” cried Max “Let the wild rumpus start!”

73 73

74 74

75 75

76 76 Are we having fun yet? Benefits of evaluation  Emergence of evaluation methods  ROUGE  Pyramids  Nuggetteer  Research into characteristics of metrics  Analyses of sub-sentential units  Paraphrase as a research issue

77 77 Available Data  DUC data sets  4 years of summary/document set pairs (multi-document summarization training data not available beforehand)  4 years of scoring patterns  Led to analysis of human summaries  Pyramids  Pyramids and peers for 40 topics (DUC04, DUC05)  Many more from Nenkova and Passonneau  Training data for paraphrase  Training data for abstraction -> see systems moving away from pure sentence extraction

78 78 Wrapping up

79 79 Lessons Learned  Evaluation environment is important  Find a task with broad appeal  Use independent evaluator  At least a committee  Use multiple gold standards  Compare text at the content unit level  Evaluate the metrics  Look at significant differences

80 80 Is Evaluation Worth It?  DUC: creation of a community  From ~15 participants year 1 -> 30 participants year 5  No longer impacts funding  Enables research into evaluation  At start, no idea how to evaluate summaries  But, results do not tell us everything

81 81 And he sailed back over a year, in and out of weeks and through a day

82 82 And into the night of his very own room where he found his supper waiting for him.. And it was still warm.

