Text Summarization 黄连恩


Text Summarization http://net.pku.edu.cn/~wbia 黄连恩 hle@net.pku.edu.cn 北京大学信息工程学院 12/17/2013

Overview

What is summarization?

Columbia Newsblaster The academic version

What is the input? News, or clusters of news: a single article or several articles on a related topic. Email and email threads. Scientific articles. Health information: for patients and doctors. Meetings. Video.

What is the output? Keywords. Highlighted information in the input. Chunks of text or speech taken directly from the input, or paraphrases that aggregate the input in novel ways. Modality: text, speech, video, graphics.

Ideal stages of summarization: Analysis (input representation and understanding), Transformation (selecting important content), Realization (generating novel text corresponding to the gist of the input).

Most current systems use shallow analysis methods rather than full understanding, and work by sentence selection: identify important sentences and piece them together to form a summary.

Types of summaries. Extracts: sentences from the original document are displayed together to form a summary. Abstracts: material is transformed (paraphrased, restructured, shortened).

Extractive summarization Each sentence is assigned a score that reflects how important and contentful it is. Data-driven approaches: word statistics, cue phrases, section headers, sentence position. Knowledge-based systems: discourse information (resolve anaphora, text structure), external lexical resources (WordNet, adjective polarity lists, opinion lexicons). Using machine learning.

What are summaries useful for? Relevance judgments Does this document contain information I am interested in? Is this document worth reading? Save time Reduce the need to consult the full document

Recent developments In March 2013, Yahoo bought the news reading app Summly for $30 million! In April 2013, Google purchased Wavii for more than $30 million!

Multi-document summarization Very useful for presenting and organizing search results Many results are very similar, and grouping closely related documents helps cover more event facets Summarizing similarities and differences between documents

How to deal with redundancy? Author JK Rowling has won her legal battle in a New York court to get an unofficial Harry Potter encyclopaedia banned from publication. A U.S. federal judge in Manhattan has sided with author J.K. Rowling and ruled against the publication of a Harry Potter encyclopedia created by a fan of the book series. Shallow techniques not likely to work well

Global optimization for content selection What is the best summary? vs What is the best sentence? Form all summaries and choose the best What is the problem with this approach?

Information ordering In what order to present the selected sentences? An article with permuted sentences will not be easy to understand Very important for multi-document summarization Sentences coming from different documents

Automatic summary edits Some expressions might not be appropriate in the new context. References: "he", "Putin", "Russian Prime Minister Vladimir Putin". Discourse connectives: "However", "moreover", "subsequently". Requires more sophisticated NLP techniques.

Before Pinochet was placed under arrest in London Friday by British police acting on a warrant issued by a Spanish judge. Pinochet has immunity from prosecution in Chile as a senator-for-life under a new constitution that his government crafted. Pinochet was detained in the London clinic while recovering from back surgery.

After Gen. Augusto Pinochet, the former Chilean dictator, was placed under arrest in London Friday by British police acting on a warrant issued by a Spanish judge. Pinochet has immunity from prosecution in Chile as a senator-for-life under a new constitution that his government crafted. Pinochet was detained in the London clinic while recovering from back surgery.

Before Turkey has been trying to form a new government since a coalition government led by Yilmaz collapsed last month over allegations that he rigged the sale of a bank. Ecevit refused even to consult with the leader of the Virtue Party during his efforts to form a government. Ecevit must now try to build a government. Demirel consulted Turkey's party leaders immediately after Ecevit gave up.

After Turkey has been trying to form a new government since a coalition government led by Prime Minister Mesut Yilmaz collapsed last month over allegations that he rigged the sale of a bank. Premier-designate Bulent Ecevit refused even to consult with the leader of the Virtue Party during his efforts to form a government. Ecevit must now try to build a government. President Suleyman Demirel consulted Turkey's party leaders immediately after Ecevit gave up.

Traditional Approaches

1) Word frequency based method Hans Peter Luhn (“father of Information Retrieval”): The Automatic Creation of Literature Abstracts, 1958. (Image: courtesy IBM)

Luhn’s method: basic idea Target documents: technical literature. The method is based on the following assumptions: frequency of word occurrence in an article is a useful measurement of word significance; relative position of these significant words within a sentence is also a useful measurement of word significance. Based on the limited capabilities of machines (IBM 704): no semantic information.

Why word frequency? Important words are repeated throughout the text: examples are given in favor of a certain principle, arguments are given for a certain principle. Technical literature: one word, one notion. Simple and straightforward algorithm, cheap to implement (processing time is costly). Note that different forms of the same word are counted as the same word.

When significant? Words with too low a frequency are not significant; words with too high a frequency are also not significant (e.g. “the”, “and”). Removing low-frequency words is easy: set a minimum frequency threshold. Removing common (high-frequency) words: set a maximum frequency threshold (statistically obtained), or compare to a common-word list. (Figure 1 from [Luhn, 1958])

Using relative position Where the greatest number of high-frequency words are found closest together, the probability is very high that representative information is given. Based on the characteristic that an explanation of a certain idea is represented by words close together (e.g. sentences, paragraphs, chapters).

The significance factor The “significance factor” of a sentence reflects the number of occurrences of significant words within the sentence and the linear distance between them due to non-significant words in between. Only consider the portion of the sentence bracketed by significant words with a maximum of 5 non-significant words in between, e.g. “ (*) - - - [ * - * * - - * - - * ] - - (*) ”. Significance factor formula: (number of bracketed significant words)^2 / (length of the bracketed portion), i.e. 5^2 / 10 = 2.5 in the above example.
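A minimal sketch of this scoring in Python, under the assumptions above (a precomputed set of significant words, whitespace tokens, and Luhn's limit of 5 non-significant words between significant ones); the function name and the toy example are illustrative, not Luhn's original code.

```python
GAP_LIMIT = 5  # Luhn's maximum run of non-significant words inside a cluster

def significance_factor(sentence_tokens, significant_words):
    # positions of significant words in the sentence
    positions = [i for i, w in enumerate(sentence_tokens)
                 if w in significant_words]
    if not positions:
        return 0.0

    best = 0.0
    # grow clusters of significant words separated by <= GAP_LIMIT non-significant words
    start = prev = positions[0]
    count = 1
    for pos in positions[1:]:
        if pos - prev - 1 <= GAP_LIMIT:      # still inside the same cluster
            count += 1
        else:                                # close the cluster, start a new one
            best = max(best, count ** 2 / (prev - start + 1))
            start, count = pos, 1
        prev = pos
    best = max(best, count ** 2 / (prev - start + 1))
    return best

# The bracketed example from the slide: 10 tokens, 5 of them significant
tokens = ["s", "x", "s", "s", "x", "x", "s", "x", "x", "s"]
print(significance_factor(tokens, {"s"}))   # 5**2 / 10 = 2.5
```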

Generating the abstract For every sentence the significance factor is calculated. The sentences with a significance factor higher than a certain cut-off value are returned (alternatively, the N highest-valued sentences can be returned). For large texts, the method can also be applied to subdivisions of the text. No evaluation of the results is present in the journal paper!

2) Position based method H.P. Edmundson: New Methods in Automatic Extracting, 1969. (Image: IBM 7090, courtesy IBM)

Lead method Claim: Important sentences occur at the beginning (and/or end) of texts. Lead method: just take first sentence(s)! Experiments: In 85% of 200 individual paragraphs the topic sentences occurred in initial position and in 7% in final position (Baxendale, 58). Only 13% of the paragraphs of contemporary writers start with topic sentences (Donlan, 80).

Cue-Phrase method Claim 1: Important sentences contain ‘bonus phrases’, such as significantly, In this paper we show, and In conclusion, while non-important sentences contain ‘stigma phrases’ such as hardly and impossible. Claim 2: These phrases can be detected automatically (Kupiec et al. 95; Teufel and Moens 97). Method: Add to sentence score if it contains a bonus phrase, penalize if it contains a stigma phrase.

Four methods for weighting Weighting methods: Cue Method, Key Method, Title Method, Location Method. The weight of a sentence is a linear combination of the weights obtained with the above four methods. The highest-weighted sentences are included in the abstract. Target documents: technical literature.

Cue Method Based on the hypothesis that the probable relevance of a sentence is affected by the presence of pragmatic words (e.g. “Significant”, “Greatest”, “Impossible”, “Hardly”). Three types of Cue words: Bonus words: positively affecting the relevance of a sentence (e.g. “Significant”, “Greatest”). Stigma words: negatively affecting the relevance of a sentence (e.g. “Impossible”, “Hardly”). Null words: irrelevant.

Obtaining Cue words The lists were obtained by statistical analyses of 100 documents: Dispersion (λ): number of documents in which the word occurred. Selection ratio (η): ratio of the number of occurrences in extractor-selected sentences to the number of occurrences in all sentences. Bonus words: η > t_high. Stigma words: η < t_low. Null words: λ > t_λ and t_low < η < t_high.

Resulting Cue lists Bonus list (783 words): comparatives, superlatives, adverbs of conclusion, value terms, etc. Stigma list (73 words): anaphoric expressions, belittling expressions, etc. Null list (139 words): ordinals, cardinals, the verb “to be”, prepositions, pronouns, etc.

Tag all Bonus words with weight b > 0, all Stigma words with weight s < 0, all Null words with weight n = 0. Cue weight of a sentence: Σ (Cue weight of each word in the sentence).

Key Method Principle based on [Luhn], counting the frequency of words. The algorithm differs: Create a key glossary of all non-Cue words in the document which have a frequency larger than a certain threshold. The weight of each key word in the key glossary is set to the frequency with which it occurs in the document. Assign the key weight to each word which can be found in the key glossary; if a word is not in the key glossary, its key weight is 0. No relative position is used (unlike [Luhn]). Key weight of a sentence: Σ (Key weight of each word in the sentence).

Title Method Based on the hypothesis that an author conceives the title as circumscribing the subject matter of the document (similarly for headings vs. paragraphs). Create a title glossary consisting of all non-Null words in the title, subtitle and headings of the document. Words are given a positive title weight if they appear in this glossary; title words are given a larger weight than heading words. Title weight of a sentence: Σ (Title weight of each word in the sentence).

Location Method Based on the hypothesis that: sentences occurring under certain headings are positively relevant; topic sentences tend to occur very early or very late in a document and its paragraphs. Global idea: give each sentence below a heading the same weight as the heading itself (note that this is independent from the Title Method): Heading weight. Give each sentence a certain weight based on its position: Ordinal weight. Location weight of sentence: Ordinal weight of sentence + Heading weight of sentence.

Location Method: Heading weight Compare each word in a heading with the pre-stored Heading dictionary. If the word occurs in this dictionary, assign it a weight equal to the weight it has in the dictionary. Heading weight of a heading: Σ (heading weight of each word in the heading). Heading weight of a sentence = Heading weight of its heading.

Creating the Heading dictionary The Heading dictionary was created by listing all words in the headings of 120 documents and calculating the selection ratio for each word: Selection ratio (η): ratio of the number of occurrences in extractor-selected sentences to the number of occurrences in all headings. Deletions from this list were made on the basis of low frequency and unrelatedness to the desired information types (subject, purpose, conclusion, etc.). Weights were given to the words in the Heading dictionary proportional to the selection ratio. The resulting Heading dictionary contained 90 words.

Location Method: Ordinal weight Sentences of the first paragraph are tagged with weight O1. Sentences of the last paragraph are tagged with weight O2. The first sentence of a paragraph is tagged with weight O3. The last sentence of a paragraph is tagged with weight O4. Ordinal weight of a sentence: O1 + O2 + O3 + O4.

Generating the abstract Calculate the weight of a sentence: aC + bK + cT + dL, with a, b, c, d constant positive integers, C: Cue weight, K: Key weight, T: Title weight, L: Location weight. The values of a, b, c and d were obtained by manually comparing the generated automatic abstracts with the desired (human-made) abstracts. Return the highest N sentences under their proper headings as the abstract (including the title). N is calculated by taking a percentage of the size of the original document; in this journal paper 25% is used.
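A minimal sketch of this combination, assuming the four per-sentence weights have already been computed; the default coefficients and the 25% ratio are just the tunable quantities mentioned above, and the function names are illustrative.

```python
# Edmundson-style linear combination of Cue, Key, Title and Location weights.

def edmundson_score(cue, key, title, location, a=1, b=1, c=1, d=1):
    return a * cue + b * key + c * title + d * location

def make_abstract(sentences, weights, ratio=0.25):
    """sentences: list of sentence strings; weights: list of (C, K, T, L) tuples."""
    n = max(1, int(len(sentences) * ratio))            # e.g. 25% of the document
    scored = sorted(enumerate(weights),
                    key=lambda iw: edmundson_score(*iw[1]),
                    reverse=True)[:n]
    keep = sorted(i for i, _ in scored)                # restore document order
    return [sentences[i] for i in keep]
```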

Which combination is best? All combinations of C, K, T and L were tried to see which result had (on average) the most overlap with the handmade extract. As can be seen in the figure (only the interesting results are shown), the Key method was omitted and only C, T and L are used to create the best abstract. Surprising result! (Luhn used only keywords to create the abstract.) (Figure 4 from [Edmundson, 1969])

Evaluation Evaluation was done on unseen data (40 technical documents), comparison with handmade abstracts. Result: 44% of the sentences co-selected, 66% similarity between abstracts (human judge). Random ‘abstract’: 25% of the sentences co-selected, 34% similarity between abstracts. Another evaluation criterion: ‘extract-worthiness’. Result: 84% of the selected sentences are extract-worthy. Therefore: for one document many possible abstracts (differing in length and content).

3) Machine-learning method Ask people to select sentences. Use these as training examples for machine learning. Each sentence is represented as a number of features. Based on the features, distinguish sentences that are appropriate for a summary and sentences that are not. Run on new inputs.

Scoring sentences For each sentence s, the probability P(s ∈ S | F1, …, Fk) that it will be included in the summary S given the k features is calculated (Bayes’ rule): P(s ∈ S | F1, …, Fk) = P(F1, …, Fk | s ∈ S) · P(s ∈ S) / P(F1, …, Fk). Assuming statistical independence of the features: P(s ∈ S | F1, …, Fk) = [Π_j P(Fj | s ∈ S)] · P(s ∈ S) / [Π_j P(Fj)]. P(F1, …, Fk) is constant, and P(Fj | s ∈ S) and P(s ∈ S) can be estimated directly from the training set by counting occurrences. This function assigns to each s a score which can be used to select sentences for inclusion in the abstract.
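A minimal sketch of this naive Bayes scorer, assuming binary features and a training set of (feature vector, in-summary label) pairs; the add-one smoothing and the helper names are illustrative choices, not taken from the original paper.

```python
def train(examples):
    """examples: list of (tuple of 0/1 features, label 0/1 for 'in summary')."""
    n = len(examples)
    pos = [f for f, y in examples if y == 1]
    prior = len(pos) / n                                   # P(s in S)
    k = len(examples[0][0])
    # P(Fj = 1 | s in S) and P(Fj = 1), both with add-one smoothing
    p_f_given_s = [(sum(f[j] for f in pos) + 1) / (len(pos) + 2) for j in range(k)]
    p_f = [(sum(f[j] for f, _ in examples) + 1) / (n + 2) for j in range(k)]
    return prior, p_f_given_s, p_f

def score(features, prior, p_f_given_s, p_f):
    # P(s in S | F1..Fk) under the independence assumption
    p = prior
    for f, pfs, pf in zip(features, p_f_given_s, p_f):
        num = pfs if f else (1 - pfs)                      # P(Fj | s in S)
        den = pf if f else (1 - pf)                        # P(Fj)
        p *= num / den
    return p
```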

The training material 188 documents with professionally created abstracts from the scientific/technical domain; the average length of the abstracts is 3 sentences (3.5% of the total size of the document). Sentences from the abstract were matched to the original document: 79% direct sentence matches, 3% direct joins (2 sentences combined), 18% no direct match or join possible. Therefore the maximum performance of the automatic system is 82%.

Evaluation Too little material, so cross-validation was used to evaluate. Two evaluation measures: Fraction of manually selected sentences which were reproduced correctly: average result 35%. Fraction of the matchable selected sentences which were reproduced correctly: average result 42%. Performance of features (2nd measure):
Feature | Individual % sentences correct | Cumulative % sentences correct
Paragraph | 33 |
Fixed Phrases | 29 | 42
Length Cut-off | 24 | 44
Thematic Word | 20 |
Uppercase Word | |

4) Discourse-based method Claim: The multi-sentence coherence structure of a text can be constructed, and the ‘centrality’ of the textual units in this structure reflects their importance. Tree-like representation of texts in the style of Rhetorical Structure Theory (Mann and Thompson, 88). Use the discourse representation in order to determine the most important textual units. Attempts: (Ono et al., 1994) for Japanese; (Marcu, 1997, 2000) for English.

Rhetorical parsing (Marcu,97) [With its distant orbit {– 50 percent farther from the sun than Earth –} and slim atmospheric blanket,1] [Mars experiences frigid weather conditions.2] [Surface temperatures typically average about –60 degrees Celsius (–76 degrees Fahrenheit) at the equator and can dip to –123 degrees C near the poles.3] [Only the midday sun at tropical latitudes is warm enough to thaw ice on occasion,4] [but any liquid water formed that way would evaporate almost instantly5] [because of the low atmospheric pressure.6] [Although the atmosphere holds a small amount of water, and water-ice clouds sometimes develop,7] [most Martian weather involves blowing dust or carbon dioxide.8] [Each winter, for example, a blizzard of frozen carbon dioxide rages over one pole, and a few meters of this dry-ice snow accumulate as previously frozen carbon dioxide evaporates from the opposite polar cap.9] [Yet even on the summer pole, {where the sun remains in the sky all day long,} temperatures never warm enough to melt frozen water.10]

Rhetorical parsing (2) Use discourse markers to hypothesize rhetorical relations: rhet_rel(CONTRAST, 4, 5) or rhet_rel(CONTRAST, 4, 6); rhet_rel(EXAMPLE, 9, [7,8]) or rhet_rel(EXAMPLE, 10, [7,8]). Use semantic similarity to hypothesize rhetorical relations: if similar(u1,u2) then rhet_rel(ELABORATION, u2, u1) or rhet_rel(BACKGROUND, u1, u2), else rhet_rel(JOIN, u1, u2); rhet_rel(JOIN, 3, [1,2]); rhet_rel(ELABORATION, [4,6], [1,2]). Use the hypotheses in order to derive a valid discourse representation of the original text.

Rhetorical parsing (3) [Figure: the discourse tree for the Mars text, with relations such as Elaboration, Example, Background, Justification, Concession, Antithesis, Contrast, Evidence and Cause over units 1-10.] Summarization = selection of the most important units: 2 > 8 > 3, 10 > 1, 4, 5, 7, 9 > 6.

Discourse method: Evaluation (using a combination of heuristics for rhetorical parsing disambiguation) TREC Corpus (fourfold cross-validation) Scientific American Corpus

5) VS based method Based on word probability: Score(S) = (1/n) · Σ_{i=1..n} P_i, where S is a sentence of length n and P_i is the probability of the i-th word in the sentence. Based on word tf.idf: the same score computed with tf.idf weights in place of word probabilities.
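A minimal sketch of both scores, assuming whitespace-tokenized text and that a sentence is scored by the average weight of its words; the document-frequency inputs for idf are placeholders for a background collection.

```python
from collections import Counter
import math

def word_probabilities(document_tokens):
    counts = Counter(document_tokens)
    total = len(document_tokens)
    return {w: c / total for w, c in counts.items()}

def score_by_probability(sentence_tokens, probs):
    # average probability of the words in the sentence
    return sum(probs.get(w, 0.0) for w in sentence_tokens) / max(len(sentence_tokens), 1)

def score_by_tfidf(sentence_tokens, tf, doc_freq, n_docs):
    # same average, with tf.idf weights instead of probabilities
    weights = [tf.get(w, 0) * math.log(n_docs / (1 + doc_freq.get(w, 0)))
               for w in sentence_tokens]
    return sum(weights) / max(len(weights), 1)
```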

Centrality measures How representative is a sentence of the overall content of a document? The more similar a sentence is to the document, the more representative it is.
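A minimal sketch of this idea, treating the sentence and the whole document as bag-of-words vectors and using cosine similarity as the centrality score; the bag-of-words representation is an assumption here, not prescribed by the slide.

```python
import math
from collections import Counter

def cosine(a: Counter, b: Counter):
    dot = sum(a[w] * b[w] for w in set(a) & set(b))
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def centrality(sentence_tokens, document_tokens):
    # similarity of the sentence to the document it comes from
    return cosine(Counter(sentence_tokens), Counter(document_tokens))
```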

Evaluation

Comparing Text Against Text Which human summary makes a good gold standard? Many summaries are good At what granularity is the comparison made? When can we say that two pieces of text match?

Variation impacts evaluation Comparing content is hard; all kinds of judgment calls. Paraphrases, VP vs. NP: "Ministers have been exchanged" vs. "Reciprocal ministerial visits". Length and constituent type: "Robotics assists doctors in the medical operating theater" vs. "Surgeons started using robotic assistants".

Nightmare: only one gold standard A system may have chosen an equally good sentence, but not the one in the gold standard: "Pinochet arrested in London on Oct 16 at a Spanish judge’s request for atrocities against Spaniards in Chile" vs. "Former Chilean dictator Augusto Pinochet has been arrested in London at the request of the Spanish government". In DUC 2001 (one gold standard), the human model had a significant impact on scores (McKeown et al.). Five human summaries are needed to avoid changes in rank (Nenkova and Passonneau): DUC 2003 data, 3 topic sets (1 highest scoring and 2 lowest scoring), 10 model summaries.

Scoring Two main approaches used in DUC: ROUGE (Lin and Hovy) and Pyramids (Nenkova and Passonneau). Problems: Are the results stable? How difficult is it to do the scoring?

DUC – Document Understanding Conference Established and funded by DARPA TIDES Run by independent evaluator NIST Open to summarization community Annual evaluations on common datasets 2001-present Tasks Single document summarization Headline summarization Multi-document summarization Multi-lingual summarization Focused summarization

DUC Evaluation Gold standard: human summaries written by NIST, from 2 to 9 summaries per input set. Multiple metrics: Manual: Coverage (early years), Pyramids (later years), Responsiveness (later years), quality questions. Automatic: ROUGE (-1, -2, skip-bigrams, LCS, BE). Granularity: Manual: sub-sentential elements; Automatic: sentences.

ROUGE: Recall-Oriented Understudy for Gisting Evaluation ROUGE: n-gram co-occurrence metrics measuring content overlap. ROUGE-N = (count of n-gram overlaps between candidate and model summaries) / (total n-grams in the model summaries).
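A minimal sketch of ROUGE-N recall following the ratio above, assuming pre-tokenized text; clipping the overlap counts by the candidate's own counts is the usual convention and an assumption here.

```python
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n(candidate_tokens, model_token_lists, n=1):
    cand = ngrams(candidate_tokens, n)
    overlap = total = 0
    for model_tokens in model_token_lists:
        model = ngrams(model_tokens, n)
        total += sum(model.values())                                  # n-grams in the models
        overlap += sum(min(c, cand.get(g, 0)) for g, c in model.items())  # clipped overlap
    return overlap / total if total else 0.0

print(rouge_n("the cat sat on the mat".split(),
              ["the cat was on the mat".split()], n=1))   # 5/6 ≈ 0.83
```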

ROUGE Experimentation with different units of comparison: unigrams, bigrams, longest common subsequence, skip-bigrams, basic elements. Automatic and thus easy to apply. Important to consider confidence intervals when determining differences between systems: scores falling within the same interval are not significantly different. ROUGE scores place systems into large groups: it can be hard to definitively say one is better than another. Sometimes results are unintuitive: multilingual scores as high as English scores; use in speech summarization shows no discrimination. Good for training regardless of intervals: can see trends.

LexPageRank: Prestige in Multi-Document Text Summarization Gunes Erkan and Dragomir R. Radev EMNLP 2004

Abstract This paper considers an approach for computing sentence importance based on the concept of eigenvector centrality (prestige), called LexPageRank. In this model, a sentence connectivity matrix is constructed based on cosine similarity. The experimental results on DUC 2004 show that this approach outperforms centroid-based summarization and is quite successful compared to other summarization systems.

Introduction Text summarization is the process of automatically creating a compressed version of a given text that provides useful information for the user. This summarization approach assesses the centrality of each sentence in a cluster and includes the most important ones in the summary. It introduces two new measures for centrality, Degree and LexPageRank, inspired by the prestige concept in social networks.

Sentence centrality and centroid-based summarization Extractive summarization produces summaries by choosing a subset of the sentences in the original documents. Centrality of a sentence is often defined in terms of the centrality of the words that it contains. The centroid of a cluster is a pseudo-document which consists of words that have frequency*IDF scores above a predefined threshold. In centroid-based summarization (Radev et al., 2000), the sentences that contain more words from the centroid of the cluster are considered central. Centroid-based summarization has given promising results in the past.

Prestige-based sentence centrality We hypothesize that the sentences that are similar to many of the other sentences in a cluster are more central (or prestigious) to the topic. There are two issues: How to define similarity between two sentences: cosine. How to compute the overall prestige of a sentence given its similarity to other sentences: degree centrality, or eigenvector centrality and LexPageRank.

Prestige-based sentence centrality A cluster may be represented by a cosine similarity matrix; most of its entries are nonzero.

Prestige-based sentence centrality Degree centrality Since we are interested in significant similarities in the matrix, we can eliminate low values by defining a threshold, so that the cluster can be viewed as an undirected graph. We define degree centrality as the degree of each node in the similarity graph.
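A minimal sketch of degree centrality, assuming a precomputed cosine similarity matrix; the threshold value is illustrative only.

```python
import numpy as np

def degree_centrality(sim, threshold=0.1):
    adj = (sim > threshold).astype(int)   # keep only significant similarities
    np.fill_diagonal(adj, 0)              # ignore self-similarity
    return adj.sum(axis=1)                # degree of each sentence node

# Sentences with the highest degree are treated as the most central.
```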


Prestige-based sentence centrality Issue for degree centrality: several unwanted sentences may vote for each other and raise their prestige. This situation can be avoided by considering where the votes come from and taking the prestige of the voting node into account in weighting each node. Eigenvector centrality and LexPageRank: PageRank (Page et al., 1998) is a method proposed for assigning a prestige score to each page in the web independent of a specific query, depending on the number of pages that link to that page as well as the individual scores of the linking pages.

Prestige-based sentence centrality The PageRank of page A: PR(A) = (1 - d) + d · (PR(T1)/C(T1) + … + PR(Tn)/C(Tn)), where T1, …, Tn are the pages that link to page A, d is a damping factor, and C(Ti) is the number of outgoing links from page Ti. This recursively defined value can be computed by forming the binary adjacency matrix of the web, normalizing this matrix so that row sums equal 1, and finding the principal eigenvector of the normalized matrix. The PageRank of the i-th page equals the i-th entry in the eigenvector.

Prestige-based sentence centrality This method can be easily applied to the cosine similarity graph to find the most prestigious sentences in a document We called this new measure of sentence similarity LexPageRank
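A minimal sketch of this computation by power iteration, assuming a precomputed cosine similarity matrix for the sentences of a cluster; the threshold, damping factor and convergence settings are illustrative defaults, not necessarily the paper's exact choices (the slide example below uses damping factor 1).

```python
import numpy as np

def lexpagerank(sim, threshold=0.1, d=0.85, iters=100, tol=1e-6):
    n = sim.shape[0]
    adj = (sim > threshold).astype(float)         # thresholded similarity graph
    np.fill_diagonal(adj, 0.0)
    row_sums = adj.sum(axis=1, keepdims=True)
    row_sums[row_sums == 0] = 1.0                 # avoid division by zero
    M = adj / row_sums                            # row-stochastic transition matrix
    p = np.full(n, 1.0 / n)
    for _ in range(iters):
        p_new = (1 - d) / n + d * (M.T @ p)       # PageRank update on the sentence graph
        if np.abs(p_new - p).sum() < tol:
            break
        p = p_new
    return p                                      # higher score = more prestigious sentence
```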

Prestige-based sentence centrality [Example computation on the similarity graph, with damping factor = 1]

Prestige-based sentence centrality Advantage over Centroid It accounts for information subsumption among sentences It prevents unnaturally high IDF scores from boosting up the score of a sentence that is unrelated to the topic

Experiments on DUC 2004 data DUC 2004 data was used in our experiments Task 2 involves summarization of 50 TDT English clusters Task 4 is to produce summaries of machine translation output (in English) of 24 Arabic TDT documents Recall-based measure – Rouge is adopted and 665-byte summaries for each cluster are produced

Experiments on DUC 2004 data MEAD summarization toolkit: extractive multi-document summarization, consisting of three components. Feature extractor (document -> feature vector): Centroid, Position and Length. Combiner (feature vector -> scalar value). Reranker (the scores are adjusted upward or downward): MMR (Maximal Marginal Relevance), CSIS (Cross-Sentence Information Subsumption), weight, threshold.
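A minimal sketch of an MMR-style reranker, assuming a per-sentence relevance score and a sentence-to-sentence similarity matrix; the lambda value is illustrative, and this is a generic sketch of the technique rather than MEAD's actual implementation.

```python
def mmr_rerank(relevance, sim, k, lam=0.7):
    """Greedily pick k sentences, trading relevance against redundancy."""
    selected, candidates = [], list(range(len(relevance)))
    while candidates and len(selected) < k:
        def mmr(i):
            redundancy = max((sim[i][j] for j in selected), default=0.0)
            return lam * relevance[i] - (1 - lam) * redundancy
        best = max(candidates, key=mmr)
        selected.append(best)
        candidates.remove(best)
    return selected
```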

Experiments on DUC 2004 data [Results table comparing the Centroid baseline with the centrality-based methods]

Thank You! Q&A

HOMEWORK Read one of the following papers and write a reading report: SentTopic-MultiRank: a novel ranking model for multi-document summarization. In COLING’12. RelationListwise for query-focused multi-document summarization. In COLING’12. A supervised aggregation framework for multi-document summarization. In COLING’12. Query-Focused Multidocument Summarization Based on Query-Sensitive Feature Space. In CIKM’12. Optimized Event Storyline Generation based on Mixture-Event-Aspect Model. In EMNLP’13.