The PYTHY Summarization System: Microsoft Research at DUC 2007. Kristina Toutanova, Chris Brockett, Michael Gamon, Jagadeesh Jagarlamudi, Hisami Suzuki, and Lucy Vanderwende. Microsoft Research. April 26, 2007.
DUC Main Task Results. Did well on both automatic and human evaluations. Automatic evaluations (30 participants): rank and score reported for ROUGE and ROUGE-SU [values not preserved]. Human evaluations: Pyramid rank 1=, Content rank 5=.
Overview of PYTHY. Linear sentence ranking model. Learns to rank sentences based on (1) ROUGE scores against model summaries and (2) Semantic Content Unit (SCU) weights of sentences selected by past peers. Considers simplified sentences alongside original sentences.
[Diagram: PYTHY Training. Components: Docs, Sentences, Simplified Sentences, Feature inventory, Targets (ROUGE Oracle, Pyramid/SCU, ROUGE x 2), Ranking/Training, Model.]
[Diagram: PYTHY Testing. Components: Docs, Sentences, Simplified Sentences, Feature inventory, Model, Search, Dynamic Scoring, Summary.]
Sentence Simplification. Extension of the simplification method used for DUC06. Provides sentence alternatives rather than deterministically simplifying a sentence. Uses syntax-based heuristic rules. Simplified sentences are evaluated alongside the originals. In DUC 2007: average new candidates generated: 1.38 per sentence; simplified sentences generated: 61% of all sentences; simplified sentences in final output: 60%.
Sentence-Level Features. SumFocus features: SumBasic (Nenkova et al. 2006) + task focus; cluster frequency and topic frequency (only these were used in MSR DUC06). Other content word unigrams: headline frequency. Sentence length features (binary). Sentence position features (real-valued and binary). N-grams (bigrams, skip bigrams, multiword phrases). All tokens (topic and cluster frequency). Simplified sentences (binary, and ratio of relative length). Inverse document frequency (idf).
Pairwise Ranking. Define preferences for sentence pairs, using human summaries and SCU weights. A log-linear ranking objective is used in training: maximize the probability of choosing the better sentence from each pair of comparable sentences [Ofer et al. 03], [Burges et al. 05].
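The pairwise log-linear objective can be illustrated with a minimal sketch. It assumes a linear scoring function over feature vectors and models the probability of preferring the better sentence as a logistic function of the score difference. The function name `pairwise_log_loss` and the data layout are hypothetical and are not taken from the PYTHY implementation.

```python
import math


def pairwise_log_loss(weights, pairs):
    """Negative log-likelihood of choosing the better sentence in each pair.

    weights: list of floats (parameters of an assumed linear model).
    pairs:   list of (better_features, worse_features), each a list of floats.
    """
    def score(features):
        return sum(w * f for w, f in zip(weights, features))

    loss = 0.0
    for better, worse in pairs:
        margin = score(better) - score(worse)
        # P(better preferred) = 1 / (1 + exp(-margin)), so the per-pair loss
        # is log(1 + exp(-margin)), written here in a numerically stable form.
        loss += max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))
    return loss
```

Minimizing this loss over the weights is equivalent to maximizing the probability of picking the better sentence in every training pair; any gradient-based optimizer could be used for that step.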
ROUGE Oracle Metric. Find an oracle extractive summary: the summary with the highest average ROUGE-2 and ROUGE-SU4 scores. All sentences in the oracle are considered “better” than any sentence not in the oracle. An approximate greedy search is used to find the oracle summary.
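A greedy oracle search of this kind might look like the sketch below. As a loudly stated simplification, it scores candidate summaries by bigram recall against the model summaries instead of the averaged ROUGE-2 and ROUGE-SU4 used in the talk. The names `bigram_recall` and `greedy_oracle` and the word budget `max_words` are hypothetical.

```python
def bigrams(tokens):
    """Set of adjacent token pairs in a token list."""
    return set(zip(tokens, tokens[1:]))


def bigram_recall(summary_sents, model_summaries):
    """Average share of each model's bigrams covered by the summary.

    A simplified stand-in for ROUGE, used only for illustration.
    """
    covered = set()
    for sent in summary_sents:
        covered |= bigrams(sent)
    recalls = []
    for model in model_summaries:
        ref = bigrams(model)
        if ref:
            recalls.append(len(ref & covered) / len(ref))
    return sum(recalls) / len(recalls) if recalls else 0.0


def greedy_oracle(sentences, model_summaries, max_words):
    """Repeatedly add the sentence with the largest positive score gain."""
    chosen, used, best = [], 0, 0.0
    while True:
        best_gain, best_idx = 0.0, None
        for i, sent in enumerate(sentences):
            if i in chosen or used + len(sent) > max_words:
                continue
            candidate = [sentences[j] for j in chosen] + [sent]
            gain = bigram_recall(candidate, model_summaries) - best
            if gain > best_gain:
                best_gain, best_idx = gain, i
        if best_idx is None:  # nothing fits or nothing improves the score
            break
        chosen.append(best_idx)
        used += len(sentences[best_idx])
        best += best_gain
    return chosen
```

Sentences whose indices are returned would be treated as "better" than all remaining sentences when building training pairs.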
Pyramid-Derived Metric. Uses the University of Ottawa SCU-annotated corpus (Copeck et al. 06). Some sentences in the 05 and 06 document collections are known to contain certain SCUs, or known not to contain any SCUs. The sentence score is the sum of the weights of all its SCUs; for un-annotated sentences, the score is undefined. A sentence pair is constructed for training: s_1 > s_2 iff w(s_1) > w(s_2).
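The pair construction rule can be sketched as follows, assuming each sentence carries either a summed SCU weight or `None` when it is un-annotated. The function name `scu_preference_pairs` and the dictionary layout are hypothetical.

```python
from itertools import combinations


def scu_preference_pairs(scu_weights):
    """Build (better, worse) sentence pairs from summed SCU weights.

    scu_weights: dict mapping sentence id -> summed SCU weight, or None
    when the sentence is un-annotated (score undefined, so it is skipped).
    Sentences known to contain no SCUs are assumed to have weight 0.
    """
    scored = [(sid, w) for sid, w in scu_weights.items() if w is not None]
    pairs = []
    for (a, wa), (b, wb) in combinations(scored, 2):
        if wa > wb:
            pairs.append((a, b))
        elif wb > wa:
            pairs.append((b, a))
        # Equal weights express no preference, so no pair is emitted.
    return pairs
```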
Model Frequency Metrics. Based on unigram and skip bigram frequency. Computed for content words only. Sentence s_i is “better” than s_j if [condition not preserved].
Combining Multiple Metrics. From the ROUGE oracle: all sentences in the oracle summary are better than other sentences. From SCU annotations: sentences with higher avg SCU weights are better. From model frequency: sentences with words occurring in models are better. Combined loss: the sum of the losses according to all metrics.
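Adding the per-metric losses can be sketched as below. The sketch assumes a logistic pairwise loss and an unweighted sum over metrics; the names `pair_loss` and `combined_loss` and the metric keys are hypothetical.

```python
import math


def pair_loss(score_better, score_worse):
    """Logistic loss for one preference pair: log(1 + exp(-margin))."""
    margin = score_better - score_worse
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


def combined_loss(scores, pairs_by_metric):
    """Sum the pairwise losses contributed by every target metric.

    scores:          dict sentence id -> current model score.
    pairs_by_metric: dict metric name -> list of (better_id, worse_id).
    """
    total = 0.0
    for pairs in pairs_by_metric.values():
        total += sum(pair_loss(scores[b], scores[w]) for b, w in pairs)
    return total
```

Because the losses are simply added, a metric that yields more pairs contributes more terms to the objective.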
Dynamic Sentence Scoring. Eliminate redundancy by re-weighting. Similar to SumBasic (Nenkova et al. 2006): re-weighting given previously selected sentences. Discounts are applied to features that decompose into word frequency estimates.
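The re-weighting idea can be illustrated with a SumBasic-style sketch: after a sentence is selected, the probability of each word it contains is squared, so later sentences that repeat those words score lower. This shows the general technique only; the exact discounts PYTHY applies to its features are not given here, and the function name `select_with_reweighting` is hypothetical.

```python
from collections import Counter


def select_with_reweighting(sentences, num_sentences):
    """Pick sentences one at a time, discounting already-covered words.

    sentences: list of token lists. Returns indices in selection order.
    """
    counts = Counter(tok for sent in sentences for tok in sent)
    total = sum(counts.values())
    prob = {tok: c / total for tok, c in counts.items()}

    def score(i):
        # Average word probability under the current (discounted) estimates.
        sent = sentences[i]
        return sum(prob[t] for t in sent) / len(sent) if sent else 0.0

    chosen = []
    while len(chosen) < min(num_sentences, len(sentences)):
        remaining = [i for i in range(len(sentences)) if i not in chosen]
        best = max(remaining, key=score)
        chosen.append(best)
        # SumBasic-style update: square the probability of each selected word.
        for tok in set(sentences[best]):
            prob[tok] = prob[tok] ** 2
    return chosen
```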
Search. The search constructs partial summaries and scores them. The score of a summary does not decompose into an independent sum of sentence scores; global dependencies make exact search hard. Used multiple beams for each length of partial summaries [McDonald 2007].
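A length-indexed beam search can be sketched as follows, assuming one beam per partial-summary length in words and a scoring function that looks at the whole summary. The toy `coverage_score` (count of distinct tokens) is a hypothetical stand-in for the model score, chosen because it does not decompose into a sum of sentence scores.

```python
def coverage_score(sentences):
    """Toy whole-summary score: number of distinct tokens covered."""
    def score(indices):
        return len({tok for i in indices for tok in sentences[i]})
    return score


def beam_search(sentences, summary_score, max_words, beam_size=3):
    """Search over partial summaries, keeping one beam per word length.

    sentences:     list of token lists.
    summary_score: callable taking a tuple of sentence indices.
    Returns (best index tuple, best score).
    """
    beams = {0: {()}}  # length in words -> set of partial summaries
    best, best_score = (), summary_score(())
    for length in range(max_words + 1):
        if length not in beams:
            continue
        # Prune this length's beam before expanding it (ties broken by index).
        beam = sorted(beams[length], key=lambda p: (-summary_score(p), p))
        for partial in beam[:beam_size]:
            score = summary_score(partial)
            if score > best_score:
                best, best_score = partial, score
            for i, sent in enumerate(sentences):
                new_len = length + len(sent)
                if i in partial or not sent or new_len > max_words:
                    continue
                beams.setdefault(new_len, set()).add(partial + (i,))
    return best, best_score
```

Keeping a separate beam for each length avoids comparing short partial summaries directly with long ones, which would otherwise bias pruning toward whichever length the score favors.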
Impact of Sentence Simplification. Table comparing SumFocus and PYTHY on R-2 and R-SU4, without and with simplified sentences [values not preserved]. Trained on 05 data, tested on 06 data.
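Evaluating the Metrics. Table columns: Criterion, Num Pairs, Train Acc, Content Only (R-2, R-SU4), All Words (R-2, R-SU4). Number of pairs per criterion: Oracle 941K; SCUs 430K; Model Freq. 6.3M; All 7.7M [accuracy and ROUGE values not preserved]. Trained on 05 data, tested on 06 data. Includes simplified sentences.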
Update Summarization Pilot. SVM novelty classifier trained on the TREC 02 & 03 novelty track. Table of ROUGE-2 and ROUGE-SU4 for: PYTHY + Novelty (1), PYTHY + Novelty (.5), PYTHY + Novelty (.1), PYTHY, SumFocus [values not preserved].
Summary and Future Work. Summary: combination of different target metrics for training; many sentence features; pair-wise ranking function; dynamic scoring. Future work: boost robustness (currently sensitive to cluster properties, e.g., size); improve grammatical quality of simplified sentences; reconcile novelty and (ir)relevance; learn features over whole summaries rather than individual sentences.
Thank You