1 The Pyramid Method at DUC05 Ani Nenkova Becky Passonneau Kathleen McKeown Other team members: David Elson, Advaith Siddharthan, Sergey Siegelman
2 Overview
Review of Pyramids (Kathy)
Characteristics of the responses and analyses (Ani):
- Scores and significant differences
- Reliability of Pyramid scoring: comparisons between annotators, impact of editing on scores, impact of Weight 1 SCUs
- Correlation with responsiveness and Rouge
Lessons learned
3 Pyramids
Use multiple human summaries; previous data indicated 5 are needed for score stability
Information is ranked by its importance
Allow for multiple good summaries
A pyramid is created from the human summaries; elements of the pyramid are content units
System summaries are scored by comparison with the pyramid
4 Summarization Content Units (SCUs)
Near-paraphrases from different human summaries
A clause or less
Avoid explicit semantic representation
Emerge from analysis of the human summaries
5 SCU: A cable car caught fire (Weight = 4)
A. The cause of the fire was unknown.
B. A cable car caught fire just after entering a mountainside tunnel in an alpine resort in Kaprun, Austria on the morning of November 11,
C. A cable car pulling skiers and snowboarders to the Kitzsteinhorn resort, located 60 miles south of Salzburg in the Austrian Alps, caught fire inside a mountain tunnel, killing approximately 170 people.
D. On November 10, 2000, a cable car filled to capacity caught on fire, trapping 180 passengers inside the Kitzsteinhorn mountain, located in the town of Kaprun, 50 miles south of Salzburg in the central Austrian Alps.
6 SCU: The cause of the fire is unknown (Weight = 1)
A. The cause of the fire was unknown.
B. A cable car caught fire just after entering a mountainside tunnel in an alpine resort in Kaprun, Austria on the morning of November 11,
C. A cable car pulling skiers and snowboarders to the Kitzsteinhorn resort, located 60 miles south of Salzburg in the Austrian Alps, caught fire inside a mountain tunnel, killing approximately 170 people.
D. On November 10, 2000, a cable car filled to capacity caught on fire, trapping 180 passengers inside the Kitzsteinhorn mountain, located in the town of Kaprun, 50 miles south of Salzburg in the central Austrian Alps.
7 SCU: The accident happened in the Austrian Alps (Weight = 3)
A. The cause of the fire was unknown.
B. A cable car caught fire just after entering a mountainside tunnel in an alpine resort in Kaprun, Austria on the morning of November 11,
C. A cable car pulling skiers and snowboarders to the Kitzsteinhorn resort, located 60 miles south of Salzburg in the Austrian Alps, caught fire inside a mountain tunnel, killing approximately 170 people.
D. On November 10, 2000, a cable car filled to capacity caught on fire, trapping 180 passengers inside the Kitzsteinhorn mountain, located in the town of Kaprun, 50 miles south of Salzburg in the central Austrian Alps.
8 Idealized representation
Tiers of differentially weighted SCUs
Top tier: few SCUs, high weight
Bottom tier: many SCUs, low weight
[Diagram: pyramid with tiers labeled W=3, W=2, W=1]
9 Creation of pyramids
Done for each of 20 out of the 50 sets
Primary annotator, secondary checker
Held round-table discussions of problematic constructions that occurred in this data set:
- Comma-separated lists: "Extractive reserves have been formed for managed harvesting of timber, rubber, Brazil nuts, and medical plants without deforestation."
- General vs. specific: "Eastern Europe" vs. "Hungary, Poland, Lithuania, and Turkey"
10 Characteristics of the Responses
Proportion of SCUs of Weight 1 is large: from 44% (D324) to 81% (D695)
Mean SCU weight: 1.9
Agreement among the human responders is quite low
11 SCU Weights
[Chart: number of SCUs at each weight]
12 Pyramids: DUC 2003 (DUC 2005 values in parentheses)
100-word summaries (vs. 250-word)
[...]-word articles per cluster (vs. [...]-word articles)
3 clusters (vs. 20 clusters)
Mean SCU weight (7 models): 2003 avg 2.4; 2005 avg 1.9
Proportion of SCUs of W=1: 2003 avg 40%, range 37% to 47%; 2005 avg 60%, range 44% to 81%
13 [Chart: DUC03 vs. DUC05]
14 Computing pyramid scores: the ideally informative summary
An ideally informative summary does not include an SCU from a lower tier unless all SCUs from higher tiers are included as well
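The constraint above can be checked mechanically. A minimal sketch (an illustration, not the official scoring code), assuming a pyramid is represented simply as the list of its SCU weights:

```python
def is_ideally_informative(summary_weights, pyramid_weights):
    """True iff a summary expressing n SCUs took the n highest-weight
    SCUs available, i.e., it never drew from a lower tier while a
    higher tier still had unused SCUs."""
    n = len(summary_weights)
    best = sorted(pyramid_weights, reverse=True)[:n]
    return sorted(summary_weights, reverse=True) == best

# Pyramid with one W=4 SCU, one W=3 SCU, and two W=1 SCUs:
print(is_ideally_informative([4, 3], [4, 3, 1, 1]))  # True
print(is_ideally_informative([4, 1], [4, 3, 1, 1]))  # False: skipped the W=3 tier
```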
20 Original Pyramid Score
SCORE = D / MAX
D: sum of the weights of the SCUs in the summary
MAX: sum of the weights of the SCUs in an ideally informative summary of the same size
Measures the proportion of good information in the summary: precision
21 Modified pyramid score (recall)
EN: average number of SCUs in the human models, i.e., the number of content units humans chose to convey about the story
W: weight of a maximally informative summary of size EN
D / W is the modified pyramid score
Shows the proportion of the expected good information that the summary conveys
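Both scores can be sketched in a few lines of Python. This is an illustrative reimplementation, not the actual scoring code; it assumes a pyramid is represented as a list of SCU weights and a peer summary as the list of weights of the SCUs it expressed:

```python
def max_weight(pyramid_weights, n_scus):
    """Weight of an ideally informative summary expressing n_scus SCUs:
    the n_scus highest-weight SCUs in the pyramid."""
    return sum(sorted(pyramid_weights, reverse=True)[:n_scus])

def original_score(summary_weights, pyramid_weights):
    """D / MAX, where MAX is computed for a summary with as many SCUs
    as the peer expressed (precision-oriented)."""
    return sum(summary_weights) / max_weight(pyramid_weights, len(summary_weights))

def modified_score(summary_weights, pyramid_weights, avg_model_scus):
    """D / W, where W is the weight of a maximally informative summary
    of the average model size EN (recall-oriented)."""
    return sum(summary_weights) / max_weight(pyramid_weights, avg_model_scus)

# Toy pyramid: one SCU of weight 4, one of weight 3, four of weight 1.
pyramid = [4, 3, 1, 1, 1, 1]
summary = [4, 1]  # the peer expressed these two SCUs
print(original_score(summary, pyramid))     # 5/7, about 0.714
print(modified_score(summary, pyramid, 3))  # 5/8 = 0.625
```

Note that a short summary packed with high-weight SCUs can score well on the original (precision) measure while the modified (recall) measure penalizes it for conveying fewer content units than the humans did.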
22 Scoring Methods
We present scores for the 20 pyramid sets
Rouge recomputed for comparison, using only the 7 models; models 8 and 9 were reserved for computing human performance
Recomputing on the same sets is best because of a significant topic effect
Comparisons between Pyramid (original, modified), responsiveness, and Rouge-SU4
The Pyramid score is computed from multiple humans; responsiveness is just one human's judgment; Rouge-SU4 is equivalent to Rouge-2
23 Preview of Results
Manual metrics: large differences between humans and machines; no single system is the clear winner, but a top group is identified by all metrics
Significant differences: different predictions from the manual and automatic metrics
Correlations between metrics: some correlation, but no metric can be substituted for another (this is good)
24 Human performance / Best system
[Table: Pyramid, Modified, Responsiveness, and ROUGE-SU4 scores for humans A and B and the top systems; numeric values lost]
Best system ~50% of human performance on manual metrics
Best system ~80% of human performance on ROUGE
25 [Table: per-system scores for Pyramid original, Modified, Responsiveness, and Rouge-SU4; numeric values lost]
29 Significant Differences
Manual metrics: few differences between systems
- Pyramid: 23 is worse
- Responsiveness: 23 and 31 are worse
Both humans better than all systems
Automatic (Rouge-SU4): many differences between systems; one human indistinguishable from 5 systems
30 Multiple and pairwise comparisons
Multiple comparisons (Tukey's method): control the experiment-wise type I error; show fewer significant differences
Pairwise comparisons (Wilcoxon paired test): control the error for individual comparisons; appropriate for seeing how your own system did, for development
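The Wilcoxon paired test mentioned above can be illustrated with a bare-bones version of its signed-rank statistic; in practice one would use a statistics package, which also supplies the p-value. The per-topic score vectors below are invented numbers, not DUC results:

```python
def wilcoxon_statistic(scores_a, scores_b):
    """Wilcoxon signed-rank statistic W for paired per-topic scores:
    rank the nonzero |differences|, then W = min(sum of ranks of
    positive differences, sum of ranks of negative differences).
    Tied |differences| share their average rank; zero differences
    are dropped."""
    diffs = sorted((a - b for a, b in zip(scores_a, scores_b) if a != b), key=abs)
    ranks = [0.0] * len(diffs)
    i = 0
    while i < len(diffs):
        j = i
        while j < len(diffs) and abs(diffs[j]) == abs(diffs[i]):
            j += 1
        for k in range(i, j):
            ranks[k] = (i + 1 + j) / 2  # average of ranks i+1 .. j
        i = j
    w_plus = sum(r for d, r in zip(diffs, ranks) if d > 0)
    w_minus = sum(r for d, r in zip(diffs, ranks) if d < 0)
    return min(w_plus, w_minus)

# Hypothetical per-topic scores for two systems on five topics:
sys_a = [0.30, 0.42, 0.25, 0.38, 0.31]
sys_b = [0.28, 0.35, 0.26, 0.30, 0.25]
print(wilcoxon_statistic(sys_a, sys_b))  # 1.0 (a small W suggests a real difference)
```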
31 Modified pyramid: significant differences
[Chart: for each peer, which peers it is significantly better than; humans A and B at top]
One system accounts for most of the differences
Humans significantly better than all systems
32 Responsiveness-1: significant differences
[Chart: humans A and B vs. systems]
Differences primarily between 2 systems
Differences between humans and each system
33 Responsiveness-2: similar shape to the original
34 Skip-bigram (Rouge-SU4): significant differences
Many more differences between systems than with any manual metric
No difference between one human and 5 systems
36 Pairwise comparisons: Modified Pyramid
37 Agreement between annotators
                    Overall   Low    High
Percent agreement     95%     90%    96%
Kappa, Alpha, Alpha-Dice: [values lost]
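For two annotators, the Kappa figure above is chance-corrected percent agreement, which is simple to compute. A sketch over invented yes/no judgments of whether peer-summary fragments match an SCU (not the actual DUC annotations):

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa: (observed - expected) / (1 - expected), where
    expected agreement comes from each annotator's label frequencies."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    ca, cb = Counter(labels_a), Counter(labels_b)
    expected = sum(ca[lab] * cb[lab] for lab in ca.keys() | cb.keys()) / n ** 2
    return (observed - expected) / (1 - expected)

# Two annotators disagreeing on 1 of 10 items:
ann1 = ["y", "y", "n", "y", "n", "y", "y", "n", "y", "y"]
ann2 = ["y", "n", "n", "y", "n", "y", "y", "n", "y", "y"]
print(round(cohens_kappa(ann1, ann2), 2))  # 0.78
```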
38 Editing of participant annotations
Done to correct obvious errors and to ensure uniform checking
Predominantly involved correctly splitting non-matching SCUs
Average paired differences (Original, Modified) and average magnitude of the difference (Original, Modified): [values lost]
39 Excluding weight 1 SCUs
Removing weight 1 SCUs improves agreement: Kappa 0.64 (was 0.57)
Annotating without weight 1 SCUs has negligible impact on scores
Set D324 was done without weight 1 SCUs: average magnitude of paired score differences was 0.07
40 Correlations: Pearson's, 25 systems
[Table: pairwise correlations among Pyr-orig, Pyr-mod, Resp-1, Resp-2, Rouge-2, Rouge-SU4; values lost]
41 Correlations: Pearson's, 25 systems (continued)
Questionable that responsiveness could be a gold standard
42 Pyramid and responsiveness
High correlation, but the metrics are not mutually substitutable
43 Pyramid and Rouge
High correlation, but the metrics are not mutually substitutable
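The Pearson's r used throughout these correlation tables is straightforward to reproduce; a stdlib-only sketch, with made-up score vectors:

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two paired score vectors."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Perfectly linearly related scores correlate at r = 1:
print(round(pearson([0.1, 0.2, 0.3, 0.4], [0.2, 0.4, 0.6, 0.8]), 6))  # 1.0
```

High r between two metrics means their scores move together across systems, not that one can replace the other for ranking decisions, which is the point the slide makes.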
44 Lessons Learned
Comparing content is hard: all kinds of judgment calls
We didn't evaluate the NIST assessors in previous years
Paraphrases:
- VP vs. NP: "Ministers have been exchanged" vs. "Reciprocal ministerial visits"
- Length and constituent type: "Robotics assists doctors in the medical operating theater" vs. "Surgeons started using robotic assistants"
45 Modified scores better
Easier peer annotation: weight 1 SCUs can be dropped, which gives better agreement
No emphasis on splitting non-matching SCUs
46 Agreement between annotators
Participants can perform peer annotation reliably
Absolute difference between scores (Original, Modified): [values lost]
Empirical prediction of the difference: 0.06 (HLT 2004)
47 Correlations
Original and modified pyramid scores can substitute for each other
High correlation between manual and automatic metrics, but automatic metrics are not yet a substitute
Similar patterns between pyramid and responsiveness
48 Current Directions
Automated identification of SCUs (Harnly et al., 2005)
Applied to the DUC05 pyramid data set
Correlation of 0.91 with modified pyramid scores
49 Questions
What was the experience of annotating pyramids? Does it shed insight on the problem?
Are people willing to do it again?
Would you have been willing to go through training?
If you've done pyramid analysis, can you share your insights?
52 Correlations of Scores on Matched Sets
53 SCU Weight by Cardinality (Ten pyramids)
54 Mean SCU Weight (Ten pyramids)