
1 Summarization CENG 784, Fall 2013 Aykut Erdem

2 Today Summarization  Motivation  Kinds of summaries  Approaches and paradigms  Evaluating summaries Slides adapted from F. Rudzicz (Univ. of Toronto), E. Hovy and D. Marcu (Univ. of Southern California)

3 Summarization Summarization: n. the act of producing a shortened version of a text or collection of texts (i.e., a summary) that preserves the most important points.

4 Examples of summaries

5 Headline news News articles are often shortened to one or two sentences. These summaries:  convey the most important aspects of their articles,  are often collected together in a group of summaries for easy scanning by the reader.

6 Abstracts Abstracts are often author-generated and save the reader time.

7 Kinds of summaries Summaries can be characterized according to several features.  Perspective: whether the summary is informative on its own or merely meant to be indicative.  Composition: whether the summary is extracted directly from the source or synthesized from scratch.  Orientation: whether the author’s view is preserved or the summary reflects the user’s interest.  Source: whether we summarize a single document or multiple documents.  Background: whether or not we can assume that the reader has prior knowledge.

8 Tasks in summarization Content (sentence) selection  Extractive summarization Information ordering  In what order to present the selected sentences, especially in multi-document summarization Automatic editing, information fusion and compression  Abstractive summaries

9 Summarization by extraction Extractive summarization involves identifying important sections in the original text and copying those sections into the summary. How do we determine which sentences are relevant?

10 Determining relevance Topic models (unsupervised)  Figure out what the topic of the input is  Frequency, lexical chains, TF*IDF  LSA, content models (EM, Bayesian)  Select informative sentences based on the topic Graph models (unsupervised)  Sentence centrality Supervised approaches  Ask people which sentences should be in a summary  Use any imaginable feature to learn to predict human choices

11 Frequency as document topic Simple intuition: look only at the document(s)  Words that repeatedly appear in the document are likely to be related to the topic of the document  Sentences that repeatedly appear in different input documents represent themes in the input But what appears in other documents is also helpful in determining the topic  Background corpus probabilities/weights for words

12 What is an article about? Word probability/frequency  Proposed by Luhn in 1958 [Luhn, 1958]  Frequent content words would be indicative of the topic of the article In multi-document summarization, words or facts repeated in the input are more likely to appear in human summaries [Nenkova et al., 2006]

13 Word probability/weights
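
The formula on this slide was an image and did not survive the transcript; a standard estimate, as used in frequency-based summarizers, is the maximum-likelihood word probability

    p(w) = c(w) / N

where c(w) is the number of times w occurs in the input and N is the total number of word tokens.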

14 Sentence selection according to word probabilities Step 1: Estimate word weights (probabilities) Step 2: Estimate sentence weights as the average probability of their words: Weight(Sent) = Σ_{w ∈ Sent} p(w) / |Sent| Step 3: Choose the best sentence Step 4: Update the word weights Step 5: Go to Step 2 if the desired length has not been reached
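
A minimal sketch of this loop in Python, assuming sentences arrive as lists of lower-cased tokens; the squaring update in Step 4 is the choice made in SumBasic-style systems, not something the slide specifies:

    from collections import Counter

    def select_sentences(sentences, max_words=100):
        """Greedy selection by average word probability (Steps 1-5 above)."""
        # Step 1: maximum-likelihood word probabilities from raw counts
        counts = Counter(w for sent in sentences for w in sent)
        total = sum(counts.values())
        p = {w: c / total for w, c in counts.items()}

        summary, length = [], 0
        remaining = list(sentences)
        while remaining and length < max_words:
            # Step 2: sentence weight = average probability of its words
            # Step 3: take the currently best-weighted sentence
            best = max(remaining, key=lambda s: sum(p[w] for w in s) / len(s))
            summary.append(best)
            length += len(best)
            remaining.remove(best)
            # Step 4: damp words already covered so the next pick
            # favors new content (squaring, as in SumBasic)
            for w in set(best):
                p[w] = p[w] ** 2
        return summary  # Step 5: loop ends once the length budget is spent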

15 Sentence selection according to word probabilities A reasonable approach: yes, people seem to be doing something similar! Shortcomings:  Does not take into account related words  suspects – trial  Gadhafi – Libya  Does not take into account evidence from other documents  Function words: prepositions, articles, etc.  Domain words: “cell” in cell biology articles  Does not take into account many other aspects

16 Good alternatives to word probabilities Lexical chains [Barzilay and Elhadad, 1999; Silber and McCoy, 2002; Gurevych and Nahnsen, 2005]  Exploit existing lexical resources (WordNet) TF*IDF weights [most summarizers]  Incorporate evidence from a background corpus Log-likelihood ratio (LLR)

17 Lexical chains and WordNet relations Lexical chains  Word sense disambiguation is performed  Topically related words then represent a topic  Synonyms, hyponyms, hypernyms  Importance is determined by the frequency of the words in a topic rather than of a single word  One sentence per topic is selected Concepts based on WordNet [Schiffman et al., 2002; Ye et al., 2007]  No word sense disambiguation is performed  {war, campaign, warfare, effort, cause, operation}  {concern, carrier, worry, fear, scare}
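
A small sketch of the no-disambiguation variant using NLTK's WordNet interface (assumes NLTK and its WordNet data are installed; the function name is illustrative):

    from nltk.corpus import wordnet as wn

    def related_words(word):
        """All lemmas of every synset the surface form participates in,
        i.e., no word sense disambiguation is attempted."""
        lemmas = set()
        for synset in wn.synsets(word):
            lemmas.update(l.name() for l in synset.lemmas())
            # hypernyms/hyponyms could be folded in the same way:
            # for hyper in synset.hypernyms(): ...
        return lemmas

    print(related_words('war'))  # e.g. includes 'warfare', 'state_of_war'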

18 TF*IDF weights for words Combining evidence for document topics from the input and from a background corpus Term frequency (TF)  Number of times a word occurs in the input Inverse document frequency (IDF)  log(N/df), where df is the number of documents from a background corpus of N documents that contain the word
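
A sketch of this weighting in Python, assuming word counts for the input and document frequencies from the background corpus are already available (the +1 smoothing is a conventional choice, not given on the slide):

    import math

    def tf_idf(word, input_counts, doc_freq, num_background_docs):
        """TF*IDF weight: input frequency scaled by background rarity."""
        tf = input_counts.get(word, 0)                  # times seen in the input
        df = doc_freq.get(word, 0)                      # background docs containing it
        idf = math.log(num_background_docs / (1 + df))  # +1 avoids division by zero
        return tf * idf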

19 Log-likelihood ratio (LLR) Given  t: a word that appears in the input  T: cluster of articles on a given topic (the input)  NT: articles not on topic T (background corpus) Decide whether t is a topic word or not Words that have (almost) the same probability in T and NT are not topic words

20 Log-likelihood ratio (LLR) View a text as a sequence of Bernoulli trials  Each word is either our term of interest t or not  The likelihood of observing k occurrences of term t, which occurs with probability p, in a text of N words is the binomial b(k; N, p) = C(N, k) p^k (1 − p)^(N−k) Estimate the probability of t in three ways  Input + background corpus combined  Input only  Background only

21 Log-likelihood ratio (LLR) – λ(w) −2 log λ has a known statistical distribution: chi-square At a given significance level, we can decide whether a word is descriptive of the input or not [Lin and Hovy, 2000]
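
The λ(w) formula itself was an image on the slide; the usual construction (Dunning's likelihood ratio, as applied to topic words by Lin and Hovy) compares one shared probability against separate on-topic and off-topic probabilities:

    λ(w) = [ b(k1; n1, p) · b(k2; n2, p) ] / [ b(k1; n1, p1) · b(k2; n2, p2) ]

where b(k; n, p) is the binomial likelihood above, k1 and n1 are the count of w and the total word count in T, k2 and n2 the same for NT, p = (k1 + k2)/(n1 + n2), p1 = k1/n1, and p2 = k2/n2. A word is then called a topic word when −2 log λ exceeds the chi-square cutoff for the chosen significance level (10.83 at the 0.001 level is a common choice).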

22 Graph-based approaches Nodes  Sentences  Discourse entities Edges  Between similar sentences  Between syntactically related entities Computing sentence similarity  Distance between their TF*IDF-weighted vector representations
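
A sketch of the similarity computation behind such graphs, assuming each sentence has already been mapped to a TF*IDF vector stored as a word-to-weight dict; the 0.1 edge threshold is an illustrative choice:

    import math

    def cosine(u, v):
        """Cosine similarity of two sparse word-to-weight vectors."""
        dot = sum(w * v.get(word, 0.0) for word, w in u.items())
        norm = (math.sqrt(sum(w * w for w in u.values())) *
                math.sqrt(sum(w * w for w in v.values())))
        return dot / norm if norm else 0.0

    def build_graph(vectors, threshold=0.1):
        """Symmetric adjacency matrix over sentences; weak edges are dropped."""
        n = len(vectors)
        adj = [[0.0] * n for _ in range(n)]
        for i in range(n):
            for j in range(i + 1, n):
                sim = cosine(vectors[i], vectors[j])
                if sim >= threshold:
                    adj[i][j] = adj[j][i] = sim
        return adj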

23 23 Example

24 24 Example

25 Advantages Combines word frequency and sentence clustering Gives a formal model for computing importance: random walks  Normalize the weights of edges to sum to 1  They now represent probabilities of transitioning from one node to another

26 Random walks for summarization Represent the input text as a graph Start traversing from node to node  following the transition probabilities  occasionally hopping to a new node What is the probability of being in any particular node after running this process for a long time?  Standard solution: the stationary distribution This probability is the weight of the sentence
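
A minimal power-iteration sketch of this computation (essentially LexRank/PageRank over the sentence graph from the previous slides; the 0.15 hop probability is the conventional damping value, not one given on the slide):

    def stationary_weights(adj, hop=0.15, iters=50):
        """Sentence weights as the stationary distribution of the walk."""
        n = len(adj)
        # Normalize each row so edge weights become transition probabilities.
        trans = []
        for row in adj:
            s = sum(row)
            trans.append([w / s for w in row] if s else [1.0 / n] * n)

        p = [1.0 / n] * n  # start from the uniform distribution
        for _ in range(iters):
            # With probability (1 - hop) follow an edge; otherwise jump anywhere.
            p = [hop / n + (1 - hop) * sum(p[i] * trans[i][j] for i in range(n))
                 for j in range(n)]
        return p  # p[j] is the weight of sentence j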

27 Supervised methods For extractive summarization, the task can be cast as binary classification  A sentence is in the summary or not Use statistical classifiers to score each sentence: how likely it is to be included in the summary  Feature representation for each sentence  Classification models trained from annotated data Select the sentences with the highest scores

28 The basic machine learning framework y = f(x), where x is the input, f is the classification function, and y is the output Learning: given a training set of labeled examples {(x_1, y_1), …, (x_N, y_N)}, estimate the parameters of the prediction function f Inference: apply f to a never-before-seen test example x and output the predicted value y = f(x)

29 Features Sentence length  Long sentences tend to be more important Sentence weight  Cosine similarity with the documents  Sum of term weights for all words in a sentence  Term weights can be calculated after applying LSA Sentence position  The beginning is often more important  Some sections are more important (e.g., the conclusion section) Cue words/phrases  Frequent n-grams  Cue phrases (e.g., “in summary”, “as a conclusion”)  Named entities
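
A sketch of turning these cues into a per-sentence feature vector (feature names and the cue-phrase list are illustrative, not from the slides):

    CUE_PHRASES = ("in summary", "in conclusion", "as a conclusion")  # illustrative

    def sentence_features(tokens, position, num_sentences, term_weights):
        """One feature dict per sentence, mirroring the cues listed above."""
        text = " ".join(tokens).lower()
        return {
            "length": len(tokens),                                    # sentence length
            "weight": sum(term_weights.get(w, 0.0) for w in tokens),  # sum of term weights
            "rel_position": position / num_sentences,                 # 0.0 = first sentence
            "has_cue": any(c in text for c in CUE_PHRASES),           # cue phrases
        }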

30 Classifiers Can classify each sentence individually, or use sequence modeling  Maximum entropy [Osborne, 2002]  Conditional random fields (CRF) [Galley, 2006]  Classic Bayesian method [Kupiec et al., 1995]  HMM [Conroy and O'Leary, 2001; Maskey, 2006]  Bayesian networks  SVMs [Xie and Liu, 2010]  Regression [Murray et al., 2005]  Others
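
Under the sentence-level classification framing, any of these models plugs into the same pipeline; a minimal scikit-learn sketch with logistic regression (a maximum-entropy model) standing in for the classifier:

    from sklearn.feature_extraction import DictVectorizer
    from sklearn.linear_model import LogisticRegression

    # X: feature dicts, one per sentence; y: 1 if the sentence is in the gold summary
    def train_selector(X, y):
        vec = DictVectorizer()
        clf = LogisticRegression(max_iter=1000)
        clf.fit(vec.fit_transform(X), y)
        return vec, clf

    def score_sentences(vec, clf, X):
        # Probability of the "in summary" class; rank sentences by this score.
        return clf.predict_proba(vec.transform(X))[:, 1]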

31 Issues with supervised approaches Need labeled data for model training  Human agreement is low, so labeled training data is noisy Data is unbalanced (summary sentences are sparse) Classifiers learn to predict a sentence’s label (in the summary or not) Sentence-level accuracy differs from the summarization evaluation criterion (e.g., summary-level ROUGE scores)

32 Evaluation of summarization As in other domains, we can evaluate a summarizer extrinsically (within a task) or intrinsically (independently of any task). E.g., we might ask subjects to perform time-constrained fact-gathering tasks given documents and:  human-generated summaries,  automatically generated summaries,  no summaries. The speed and correctness of this task constitute an extrinsic evaluation.

33 ROUGE A commonly used automatic intrinsic evaluation in summarization is ROUGE (Recall-Oriented Understudy for Gisting Evaluation). ROUGE is named after, and based upon, the BLEU metric. ROUGE automatically scores a machine-generated candidate summary by measuring the degree of its n-gram overlap with human-generated summaries (references).

34 ROUGE-2 ROUGE-2 fixes the length of n-gram overlap at n = 2. Given Count_match(bigram), which counts the number of distinct bigrams that occur in both the candidate summary and a given reference S from among all references RefSumm: ROUGE-2 = Σ_{S ∈ RefSumm} Σ_{bigram ∈ S} Count_match(bigram) / Σ_{S ∈ RefSumm} Σ_{bigram ∈ S} Count(bigram) ROUGE-1 is identical, except that it counts unigrams.

35 ROUGE-2 example Candidate: An egg falls off a wall.
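
The reference summaries for this example were on the slide image and are not in the transcript; a sketch of the computation with a single hypothetical reference, so the resulting number is illustrative only:

    from collections import Counter

    def bigrams(tokens):
        return Counter(zip(tokens, tokens[1:]))

    def rouge_2(candidate, references):
        """Matched bigrams over total reference bigrams (recall-oriented)."""
        cand = bigrams(candidate)
        matched = total = 0
        for ref in references:
            ref_bg = bigrams(ref)
            matched += sum(min(n, cand[bg]) for bg, n in ref_bg.items())
            total += sum(ref_bg.values())
        return matched / total if total else 0.0

    candidate = "an egg falls off a wall".split()
    reference = "humpty dumpty an egg fell off the wall".split()  # hypothetical
    print(rouge_2(candidate, [reference]))  # 1 match ('an egg') / 7 ≈ 0.143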

36 Aspects of ROUGE ROUGE is measured relative to the number of n-grams in the references, whereas BLEU is measured relative to the number of n-grams in the candidate.  ROUGE is based on a desire to cover the same content as human summaries.  Unfortunately, human summarizers often disagree about which sentences to include in a summary.  Overlap between human summaries can be very low.  Although it is a useful baseline, ROUGE is often supplemented with other assessment techniques.

37 Summarizing Summarization Extractive summarizers produce summaries by selecting important/representative sentences from a text. The relevance of sentences can be determined by:  Position: for instance, news articles tend to begin with their most relevant sentences.  Cue words: words that indicate relevance, such as ‘crucially’, can be detected automatically.  Cohesion: sentences that contain many strong lexical chains or many co-reference chains ought to be similar to the rest of the document.

