1 Summarization CENG 784, Fall 2013 Aykut Erdem

2 Today Summarization  Motivation  Kinds of summaries  Approaches and paradigms  Evaluating summaries Slides adapted from F. Rudzicz (Univ. of Toronto), E. Hovy and D. Marcu (Univ. of Southern California)

3 Summarization Summarization: n. the act of producing a shortened version of a text or collection of texts (i.e., a summary) that preserves the most important points.

4 Examples of summaries

5 Headline news News articles are often shortened to one or two sentences. These summaries:  convey the most important aspects of their articles,  are often collected together in a group of summaries for easy scanning by the reader.

6 Abstracts Abstracts are often author-generated and time-saving.

7 Kinds of summaries Summaries can be produced according to several features.  Perspective: whether the summary is informative on its own or if it is merely meant to be indicative.  Composition: whether the summary is extracted directly from the source or synthesized from scratch.  Orientation: whether the author’s view is preserved or if the summary reflects the user’s interest.  Source: whether we summarize a single document or multiple documents.  Background: whether we can assume that the reader has prior knowledge or not.

8 Tasks in summarization Content (sentence) selection  Extractive summarization Information ordering  In what order to present the selected sentences, especially in multi-document summarization Automatic editing, information fusion and compression  Abstractive summaries

9 Summarization by extraction Extractive summarization involves identifying important sections in the original text and copying those sections into the summary. How do we determine which sentences are relevant?

10 Determining relevance Topic models (unsupervised)  Figure out what the topic of the input is  Frequency, lexical chains, TF*IDF  LSA, content models (EM, Bayesian)  Select informative sentences based on the topic Graph models (unsupervised)  Sentence centrality Supervised approaches  Ask people which sentences should be in a summary  Use any imaginable feature to learn to predict human choices

11 Frequency as document topic Simple intuition: look only at the document(s)  Words that repeatedly appear in the document are likely to be related to the topic of the document  Sentences that repeatedly appear in different input documents represent themes in the input But what appears in other documents is also helpful in determining the topic  Background corpus probabilities/weights for words

12 What is an article about? Word probability/frequency  Proposed by Luhn in 1958 [Luhn 1958]  Frequent content words would be indicative of the topic of the article In multi-document summarization, words or facts repeated in the input are more likely to appear in human summaries [Nenkova et al., 2006]

13 Word probability/weights

14 Sentence selection according to word probabilities Step 1 Estimate word weights (probabilities) Step 2 Estimate sentence weights: Weight(Sent) = Σ_{wᵢ ∈ Sent} p(wᵢ) Step 3 Choose best sentence Step 4 Update word weights Step 5 Go to 2 if desired length not reached
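
The selection loop above can be sketched in Python. This is a minimal SumBasic-style illustration, not necessarily the exact algorithm from the lecture: here the sentence weight is taken as the average word probability, and the update step damps the weights of words already covered so the next pick favors new content.

```python
from collections import Counter

def sumbasic(sentences, max_words=100):
    """Greedy sentence selection by word probability (SumBasic-style sketch).

    `sentences` is a list of tokenized sentences (lists of lowercase words).
    """
    # Step 1: estimate word probabilities from the input.
    words = [w for sent in sentences for w in sent]
    counts = Counter(words)
    total = len(words)
    prob = {w: c / total for w, c in counts.items()}

    summary, length = [], 0
    remaining = list(sentences)
    while remaining and length < max_words:
        # Step 2 + 3: score each sentence by the average weight of its
        # words, then pick the best one.
        best = max(remaining, key=lambda s: sum(prob[w] for w in s) / len(s))
        summary.append(best)
        length += len(best)
        remaining.remove(best)
        # Step 4: damp the weights of already-covered words (squaring the
        # probability) to reduce redundancy in later picks.
        for w in best:
            prob[w] = prob[w] ** 2
    return summary
```

Squaring the probability of a used word is one common damping choice; it makes a second sentence about the same theme much less attractive than one about a fresh theme.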

15 Sentence selection according to word probabilities A reasonable approach: yes, people seem to be doing something similar! Shortcomings:  Does not take into account related words  suspects – trial  Gadhafi – Libya  Does not take into account evidence from other documents  Function words: prepositions, articles, etc.  Domain words: “cell” in cell biology articles  Does not take into account many other aspects

16 Good alternatives to word probabilities Lexical chains [Barzilay and Elhadad,1999, Silber and McCoy, 2002, Gurevych and Nahnsen, 2005]  Exploits existing lexical resources (WordNet) TF*IDF weights [most summarizers]  Incorporates evidence from a background corpus Log-likelihood ratio (LLR)

17 Lexical chains and WordNet relations Lexical chains  Word sense disambiguation is performed  Then topically related words represent a topic  Synonyms, hyponyms, hypernyms  Importance is determined by the frequency of the words in a topic rather than of a single word  One sentence per topic is selected Concepts based on WordNet [Schiffman et al., 2002, Ye et al., 2007]  No word sense disambiguation is performed  {war, campaign, warfare, effort, cause, operation}  {concern, carrier, worry, fear, scare}

18 TF*IDF weights for words Combining evidence for document topics from the input and from a background corpus Term Frequency (TF)  Number of times a word occurs in the input Inverse Document Frequency (IDF)  Based on the number of documents (df), out of a background corpus of N documents, that contain the word
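
A minimal sketch of these weights, assuming the input is a token list and the background corpus is a list of word sets (one per document); the `1 + df` smoothing in the denominator is an assumption to avoid division by zero, not something specified on the slide:

```python
import math
from collections import Counter

def tfidf_weights(input_words, background_docs):
    """TF*IDF word weights: term frequency from the input, inverse document
    frequency from a background corpus of N documents (minimal sketch)."""
    N = len(background_docs)
    tf = Counter(input_words)
    weights = {}
    for w, count in tf.items():
        # df: how many background documents contain the word.
        df = sum(1 for doc in background_docs if w in doc)
        idf = math.log(N / (1 + df))  # +1 smoothing avoids division by zero
        weights[w] = count * idf
    return weights
```

A word like "the" appears in nearly every background document, so its IDF (and hence its weight) collapses toward zero, while genuinely topical words keep a high weight.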

19 Log-likelihood ratio (LLR) Given  t : a word that appears in the input  T : cluster of articles on a given topic (input)  NT : articles not on topic T (background corpus) Decide if t is a topic word or not Words that have (almost) the same probability in T and NT are not topic words

20 Log-likelihood ratio (LLR) View a text as a sequence of Bernoulli trials  A word is either our term of interest t or not  The likelihood of observing term t, which occurs with probability p, k times in a text of N words is the binomial L(p; k, N) = C(N, k) p^k (1 − p)^(N−k) Estimate the probability of t in three ways  Input + background corpus combined  Input only  Background only

21 Log-likelihood ratio (LLR) The test statistic −2 log λ(w) has a known statistical distribution: chi-square At a given significance level, we can decide if a word is descriptive of the input or not [Lin and Hovy, 2000]
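
Under these definitions the statistic can be sketched as follows (standard binomial formulation; the binomial coefficients cancel in the ratio, so the log-likelihoods omit them, and the cutoff 10.83 corresponds to the 0.001 significance level of the chi-square distribution with one degree of freedom):

```python
import math

def _log_l(k, n, p):
    """Log-likelihood of k successes in n Bernoulli trials with probability p
    (binomial coefficient omitted: it cancels in the ratio)."""
    if p == 0 or p == 1:
        return 0.0 if (k == 0 or k == n) else float("-inf")
    return k * math.log(p) + (n - k) * math.log(1 - p)

def llr(k1, n1, k2, n2):
    """-2 log lambda for a word with count k1 out of n1 tokens in the input
    (T) and count k2 out of n2 tokens in the background (NT); a sketch of
    the topic-signature test of Lin and Hovy."""
    p = (k1 + k2) / (n1 + n2)   # H0: same probability in input and background
    p1, p2 = k1 / n1, k2 / n2   # H1: separate probabilities
    return -2 * (_log_l(k1, n1, p) + _log_l(k2, n2, p)
                 - _log_l(k1, n1, p1) - _log_l(k2, n2, p2))
```

A word with (almost) the same relative frequency in input and background yields a statistic near 0; comparing the statistic to the chi-square cutoff 10.83 labels the clearly over-represented words as topic words.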

22 Graph-based Approaches Nodes  Sentences  Discourse entities Edges  Between similar sentences  Between syntactically related entities Computing sentence similarity  Distance between their TF*IDF weighted vector representations

23 Example

24 Example

25 Advantages Combines word frequency and sentence clustering Gives a formal model for computing importance: random walks  Normalize weights of edges to sum to 1  They now represent probabilities of transitioning from one node to another

26 Random walks for summarization Represent the input text as a graph Start traversing from node to node  following the transition probabilities  occasionally hopping to a new node What is the probability that you are in any particular node after running this process for a long time?  Standard solution: the stationary distribution This probability is the weight of the sentence
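
A power-iteration sketch of this random walk, assuming a precomputed matrix of non-negative sentence similarities (e.g., cosine over TF*IDF vectors); the hop probability 1 − damping plays the role of the occasional random jump, and the damping value 0.85 is an illustrative default, not from the slides:

```python
import numpy as np

def stationary_distribution(sim, damping=0.85, tol=1e-8):
    """Sentence weights as the stationary distribution of a random walk on
    the similarity graph (LexRank-style sketch)."""
    n = sim.shape[0]
    # Normalize each row to sum to 1: edge weights become transition
    # probabilities out of each node.
    trans = sim / sim.sum(axis=1, keepdims=True)
    # Mix in the occasional hop to a uniformly random node.
    trans = damping * trans + (1 - damping) / n
    p = np.full(n, 1.0 / n)
    while True:
        p_next = p @ trans          # one step of the walk
        if np.abs(p_next - p).sum() < tol:
            return p_next
        p = p_next
```

The returned vector sums to 1, and sentences that are similar to many other (themselves central) sentences receive the highest probability, i.e., the highest weight.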

27 Supervised methods For extractive summarization, the task can be represented as binary classification  A sentence is in the summary or not Use statistical classifiers to determine the score of a sentence: how likely it is to be included in the summary  Feature representation for each sentence  Classification models trained from annotated data Select the sentences with highest scores

28 The basic machine learning framework y = f(x), where x is the input, f is the classification function, and y is the output Learning: given a training set of labeled examples {(x 1,y 1 ), …, (x N,y N )}, estimate the parameters of the prediction function f Inference: apply f to a never-before-seen test example x and output the predicted value y = f(x)

29 Features Sentence length  long sentences tend to be more important Sentence weight  Cosine similarity with documents  sum of term weights for all words in a sentence  calculate term weight after applying LSA Sentence position  beginning is often more important  some sections are more important (e.g., in conclusion section) Cue words/phrases  frequent n-grams  cue phrases (e.g., in summary, as a conclusion)  named entities
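The features above might be assembled as follows (a toy sketch; the feature names and the cue-phrase list are illustrative, not from the lecture):

```python
def sentence_features(sent, position, doc_len,
                      cue_phrases=("in summary", "in conclusion")):
    """Toy feature vector for supervised extractive summarization:
    sentence length, position in the document, and cue phrases."""
    text = " ".join(sent).lower()
    return {
        "length": len(sent),                   # long sentences tend to matter
        "position": 1.0 - position / doc_len,  # earlier sentences score higher
        "has_cue": any(c in text for c in cue_phrases),
    }
```

In a full system each sentence's feature dictionary would be vectorized and fed to one of the classifiers listed on the next slide.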

30 Classifiers Can classify each sentence individually, or use sequence modeling  Maximum entropy [Osborne, 2002]  Conditional random fields (CRF) [Galley, 2006]  Classic Bayesian method [Kupiec et al., 1995]  HMM [Conroy and O'Leary, 2001; Maskey, 2006]  Bayesian networks  SVMs [Xie and Liu, 2010]  Regression [Murray et al., 2005]  Others

31 Issues with supervised approaches Need labeled data for model training  Human agreement is low, therefore labeled training data is noisy Data is unbalanced (summary sentences are sparse) Classifiers learn to determine a sentence’s label (in summary or not) Sentence-level accuracy is different from the summarization evaluation criterion (e.g., summary-level ROUGE scores)

32 Evaluation of summarization As in other domains, we can evaluate a summarizer extrinsically (within a task) or intrinsically (independent of task). e.g., we might ask subjects to perform time-constrained fact-gathering tasks given documents and:  Human-generated summaries,  Automatically-generated summaries,  No summaries. The speed and correctness of this task constitutes an extrinsic evaluation.

33 ROUGE A commonly used automatic intrinsic evaluation in summarization is ROUGE (Recall-Oriented Understudy for Gisting Evaluation). ROUGE is named after and based upon the BLEU metric. ROUGE automatically scores a machine-generated candidate summary by measuring the degree of its n-gram overlap with human-generated summaries (references).

34 ROUGE-2 ROUGE-2 fixes the length of n-gram overlap at n = 2. Given Count_match(bigram), the number of distinct bigrams that occur in both the candidate summary and a given reference S from among all references RefSumm: ROUGE-2 = Σ_{S ∈ RefSumm} Σ_{bigram ∈ S} Count_match(bigram) / Σ_{S ∈ RefSumm} Σ_{bigram ∈ S} Count(bigram) ROUGE-1 is identical, except it counts unigrams.

35 ROUGE-2 example Candidate: An egg falls off a wall.
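
The bigram recall above can be computed with a short sketch (candidate and references assumed to be token lists; counts are clipped, so an n-gram matches at most as often as it appears in the candidate):

```python
from collections import Counter

def rouge_n(candidate, references, n=2):
    """ROUGE-N: n-gram recall of a candidate summary against one or more
    reference summaries (minimal sketch)."""
    def ngrams(tokens):
        return Counter(tuple(tokens[i:i + n])
                       for i in range(len(tokens) - n + 1))
    cand = ngrams(candidate)
    match = total = 0
    for ref in references:
        ref_ngrams = ngrams(ref)
        total += sum(ref_ngrams.values())
        # Clipped matching against the candidate's n-gram counts.
        match += sum(min(c, cand.get(g, 0)) for g, c in ref_ngrams.items())
    return match / total if total else 0.0
```

Note that the denominator counts reference n-grams, which is what makes ROUGE recall-oriented; with n=1 the same function computes ROUGE-1.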

36 Aspects of ROUGE ROUGE is measured relative to the number of n-grams in the references, whereas BLEU was measured relative to the number of n-grams in the candidate.  ROUGE is based on a desire to cover the same content as in human summaries.  Unfortunately, human summarizers often disagree about which sentences to include in a summary.  Overlap between human summaries can be very low.  Although it is a useful baseline, ROUGE is often supplemented with other assessment techniques.

37 Summarizing Summarization Extractive summarizers produce summaries by selecting important/representative sentences from a text. The relevance of sentences can be determined by:  Position: for instance, news articles tend to begin with their most relevant sentences.  Cue words: words that indicate relevance, such as ‘crucially’, can be determined automatically.  Cohesion: sentences that contain many strong lexical chains or many co-reference chains ought to be similar to the rest of the document.