1 Text Summarization: News and Beyond Kathleen McKeown Department of Computer Science Columbia University.

Presentation transcript:

1 Text Summarization: News and Beyond Kathleen McKeown Department of Computer Science Columbia University

2 What is Summarization?  Data as input (database, software trace, expert system), text summary as output  Text as input (one or more articles), paragraph summary as output  Multimedia in input or output  Summaries must convey maximal information in minimal space

3 Types of Summaries  Informative vs. Indicative  Replacing a document vs. describing the contents of a document  Extractive vs. Generative (abstractive)  Choosing bits of the source vs. generating something new  Single-document vs. Multi-document  Generic vs. user-focused

5 Questions (from Sparck Jones)  Should we take the reader into account and how?  “Similarly, the notion of a basic summary, i.e., one reflective of the source, makes hidden fact assumptions, for example that the subject knowledge of the output’s readers will be on a par with that of the readers for whom the source was intended. (p. 5)”  Is the state of the art sufficiently mature to allow summarization from intermediate representations and still allow robust processing of domain independent material?

6 Foundations of Summarization – Luhn; Edmundson  Text as input  Single document  Content selection  Methods  Sentence selection  Criteria

7 Sentence extraction  Sparck Jones:  ‘what you see is what you get’: some of what is on view in the source text is transferred to constitute the summary

8 Luhn 58  Summarization as sentence extraction   Term frequency determines sentence importance  TF*IDF  Stop word filtering  Similar words count as one  Cluster of frequent words indicates a good sentence
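
The cluster criterion on this slide can be sketched in a few lines of Python. This is a minimal illustration, not Luhn's original procedure: the stop-word list, the `top_k` cutoff for "frequent" words, and the `max_gap` of four intervening words are assumptions chosen for the example, and no stemming is done, so "similar words count as one" is not modelled. Each sentence is scored by its best cluster as (significant words in cluster)² / (cluster length in words).

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption, not Luhn's list).
STOP_WORDS = {"the", "a", "an", "of", "in", "is", "and", "to", "on", "it"}

def tokenize(text):
    """Lower-case the text and keep alphabetic tokens only."""
    return re.findall(r"[a-z]+", text.lower())

def luhn_scores(sentences, top_k=5, max_gap=4):
    """Score each sentence by its densest cluster of frequent words."""
    content = [w for s in sentences for w in tokenize(s) if w not in STOP_WORDS]
    significant = {w for w, _ in Counter(content).most_common(top_k)}
    scores = []
    for sentence in sentences:
        tokens = tokenize(sentence)
        positions = [i for i, w in enumerate(tokens) if w in significant]
        if not positions:
            scores.append(0.0)
            continue
        best, start = 0.0, 0
        for j in range(1, len(positions) + 1):
            # Close the cluster at the end of the sentence or when more than
            # max_gap insignificant words separate two significant words.
            if j == len(positions) or positions[j] - positions[j - 1] - 1 > max_gap:
                cluster = positions[start:j]
                span = cluster[-1] - cluster[0] + 1
                best = max(best, len(cluster) ** 2 / span)
                start = j
        scores.append(best)
    return scores
```

A summary is then formed by taking the highest-scoring sentences up to the length limit.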

9 TF*IDF  Intuition: Important terms are those that are frequent in this document but not frequent across all documents

10 Term Weights  Local weights  Generally, some function of the frequency of terms in documents is used  Global weights  The standard technique is known as inverse document frequency: idf_i = log(N / n_i), where N = number of documents and n_i = number of documents with term i

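11 TFxIDF Weighting  To get the weight for a term in a document, multiply the term’s frequency-derived weight by its inverse document frequency: w_ij = tf_ij × log(N / n_i), where tf_ij is the frequency-derived weight of term i in document j, N is the number of documents, and n_i is the number of documents with term i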
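
As a rough sketch of the weighting described on the last three slides, the function below computes a TF*IDF weight for every term in every document. It assumes raw term counts as the local weight and the natural logarithm for the global weight; the slides leave both choices open, and the name `tf_idf` is only illustrative.

```python
import math
from collections import Counter

def tf_idf(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document."""
    n_docs = len(docs)
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)  # local weight: raw term frequency (an assumption)
        weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weights
```

A term that occurs in every document gets weight 0 regardless of how often it appears, which is the intuition on the TF*IDF slide: frequent here, but not frequent everywhere.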

12 Edmundson 69 Sentence extraction using 4 weighted features:  Cue words (“In this paper..”, “The worst thing was..”)  Title and heading words  Sentence location  Frequent key words
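
One way to read "4 weighted features" is as a linear combination of per-sentence feature values. The sketch below assumes exactly that; the feature definitions (a binary cue feature, overlap counts for title and key words, a binary first-or-last-sentence location feature) and the default weights of 1.0 are placeholders for illustration and are not taken from Edmundson's paper.

```python
import re

# Cue phrases taken from the slide's two examples.
CUE_PHRASES = ("in this paper", "the worst thing was")

def tokenize(text):
    """Lower-case the text and keep alphabetic tokens only."""
    return re.findall(r"[a-z]+", text.lower())

def edmundson_score(sentence, index, n_sentences, title_words, key_words,
                    weights=(1.0, 1.0, 1.0, 1.0)):
    """Weighted sum of cue, title, location and key-word features."""
    lowered = sentence.lower()
    tokens = set(tokenize(sentence))
    cue = 1.0 if any(p in lowered for p in CUE_PHRASES) else 0.0
    title = float(len(tokens & title_words))
    # Hypothetical location feature: first or last sentence of the document.
    location = 1.0 if index in (0, n_sentences - 1) else 0.0
    key = float(len(tokens & key_words))
    w_cue, w_title, w_loc, w_key = weights
    return w_cue * cue + w_title * title + w_loc * location + w_key * key
```

Sentences are ranked by this score and the top ones are extracted; the weights are set by hand in this scheme, in contrast to the trained weights discussed later in the deck.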

13 Sentence extraction variants  Lexical Chains  Barzilay and Elhadad  Silber and McCoy  Discourse coherence  Baldwin  Topic signatures  Lin and Hovy

14 Lexical Chains  “Dr. Kenny has invented an anesthetic machine. This device controls the rate at which an anesthetic is pumped into the blood.”  “Dr. Kenny has invented an anesthetic machine. The doctor spent two years on this research.”  Algorithm: Measure strength of a chain by its length and its homogeneity  Select the first sentence from each strong chain until length limit reached  Semantics needed?

15 Discourse Coherence  Saudi Arabia on Tuesday decided to sign…  The official Saudi Press Agency reported that King Fahd made the decision during a cabinet meeting in Riyadh, the Saudi capital.  The meeting was called in response to … the Saudi foreign minister, that the Kingdom…  An account of the Cabinet discussions and decisions at the meeting…  The agency...  It

16 Topic Signature Words  Uses the log-likelihood ratio test to find words that are highly descriptive of the input  The log-likelihood ratio test provides a way of setting a threshold to divide all words in the input into either descriptive or not  Hypothesis 1: the probability of a word in the input is the same as in the background  Hypothesis 2: the word has a different, higher probability in the input than in the background  Binomial distribution used to compute the ratio of the two likelihoods  The sentences containing the highest proportion of topic signatures are extracted.
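
The test on this slide can be sketched as follows. The code compares the binomial likelihood of a word's counts under a single shared probability (same in input and background) against separate probabilities for input and background, and returns -2 log of the ratio. The cutoff of 10.83 is an assumption made for the example: it is the chi-square critical value for one degree of freedom at p = 0.001. The function names are illustrative.

```python
import math

def _log_likelihood(k, n, p):
    """Binomial log-likelihood of k hits in n trials (coefficient omitted,
    since it cancels in the ratio)."""
    total = 0.0
    if k > 0:
        total += k * math.log(p)
    if n - k > 0:
        total += (n - k) * math.log(1 - p)
    return total

def llr(k1, n1, k2, n2):
    """-2 log(lambda) for a word seen k1 times in n1 input tokens and
    k2 times in n2 background tokens."""
    p1, p2 = k1 / n1, k2 / n2
    p = (k1 + k2) / (n1 + n2)  # shared probability under the null hypothesis
    return 2 * (_log_likelihood(k1, n1, p1) + _log_likelihood(k2, n2, p2)
                - _log_likelihood(k1, n1, p) - _log_likelihood(k2, n2, p))

def topic_signatures(input_counts, background_counts, threshold=10.83):
    """Words whose input probability is significantly higher than background."""
    n1 = sum(input_counts.values())
    n2 = sum(background_counts.values())
    signatures = set()
    for word, k1 in input_counts.items():
        k2 = background_counts.get(word, 0)
        if k1 / n1 > k2 / n2 and llr(k1, n1, k2, n2) > threshold:
            signatures.add(word)
    return signatures
```

Sentences can then be ranked by the proportion of their words that fall in the returned set.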

17 Summarization as a Noisy Channel Model  Summary/text pairs  Machine learning model  Identify which features help most

18 Julian Kupiec SIGIR 95 Paper Abstract  To summarize is to reduce in complexity, and hence in length while retaining some of the essential qualities of the original.  This paper focusses on document extracts, a particular kind of computed document summary.  Document extracts consisting of roughly 20% of the original can be as informative as the full text of a document, which suggests that even shorter extracts may be useful indicative summaries.  The trends in our results are in agreement with those of Edmundson who used a subjectively weighted combination of features as opposed to training the feature weights with a corpus.  We have developed a trainable summarization program that is grounded in a sound statistical framework.

19 Statistical Classification Framework  A training set of documents with hand-selected abstracts  Engineering Information Co provides technical article abstracts  188 document/summary pairs  21 journals  Bayesian classifier estimates probability of a given sentence appearing in abstract  Direct matches (79%)  Direct Joins (3%)  Incomplete matches (4%)  Incomplete joins (5%)  New extracts generated by ranking document sentences according to this probability
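
A Bayesian classifier of this kind can be sketched with binary features and an independence assumption: the score for a sentence s is P(s in summary) × Π P(F_j | s in summary) / Π P(F_j). The code below is a minimal sketch under those assumptions; the add-one smoothing and the function names `train` and `score` are choices made for the example, not details from the slides.

```python
def train(examples):
    """examples: list of (features, in_summary); features is a tuple of 0/1."""
    n = len(examples)
    k = len(examples[0][0])
    n_pos = sum(1 for _, in_summary in examples if in_summary)
    prior = n_pos / n
    # Add-one smoothed estimates of P(F_j = 1 | in summary) and P(F_j = 1).
    p_f_given_s = [
        (sum(f[j] for f, y in examples if y) + 1) / (n_pos + 2) for j in range(k)
    ]
    p_f = [(sum(f[j] for f, _ in examples) + 1) / (n + 2) for j in range(k)]
    return prior, p_f_given_s, p_f

def score(features, model):
    """Naive Bayes score used to rank sentences for extraction."""
    prior, p_f_given_s, p_f = model
    result = prior
    for j, value in enumerate(features):
        numerator = p_f_given_s[j] if value else 1 - p_f_given_s[j]
        denominator = p_f[j] if value else 1 - p_f[j]
        result *= numerator / denominator
    return result
```

Because the features are treated as independent, the score is best read as a ranking value rather than a calibrated probability; an extract is formed by taking the top-ranked sentences.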

20 Features  Sentence length cutoff  Fixed phrase feature (26 indicator phrases)  Paragraph feature  First 10 paragraphs and last 5  Is sentence paragraph-initial, paragraph-final, paragraph medial  Thematic word feature  Most frequent content words in document  Upper case Word Feature  Proper names are important

21 Evaluation  Precision and recall  Strict match has 83% upper bound  Trained summarizer: 35% correct  Limit to the fraction of matchable sentences  Trained summarizer: 42% correct  Best feature combination  Paragraph, fixed phrase, sentence length  Thematic and Uppercase Word give slight decrease in performance
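
For reference, precision and recall over extracted sentences can be computed as below. This is a generic sketch: it treats both the system extract and the reference as sets of sentence indices, which is an assumption about the matching procedure rather than the exact protocol behind the figures on this slide.

```python
def precision_recall(extracted, reference):
    """Precision and recall of extracted sentence ids against reference ids."""
    extracted, reference = set(extracted), set(reference)
    overlap = len(extracted & reference)
    precision = overlap / len(extracted) if extracted else 0.0
    recall = overlap / len(reference) if reference else 0.0
    return precision, recall
```

When the extract and the reference contain the same number of sentences, the two values coincide, which is why a single "correct" percentage can summarize both.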

22 Questions (from Sparck Jones)  Should we take the reader into account and how?  “Similarly, the notion of a basic summary, i.e., one reflective of the source, makes hidden fact assumptions, for example that the subject knowledge of the output’s readers will be on a par with that of the readers for whom the source was intended. (p. 5)”  Is the state of the art sufficiently mature to allow summarization from intermediate representations and still allow robust processing of domain independent material?

24 Text Summarization at Columbia  Shallow analysis instead of information extraction  Extraction of phrases rather than sentences  Generation from surface representations in place of semantics

25 Problems with Sentence Extraction  Extraneous phrases  “The five were apprehended along Interstate 95, heading south in vehicles containing an array of gear including … authorities said.”  Dangling noun phrases and pronouns  “The five”  Misleading  Why would the media use this specific word (fundamentalists), so often with relation to Muslims? *Most of them are radical Baptists, Lutheran and Presbyterian groups.

26 Multi-Document Summarization Research Focus  Monitor variety of online information sources  News, multilingual   Gather information on events across source and time  Same day, multiple sources  Across time  Summarize  Highlighting similarities, new information, different perspectives, user specified interests in real-time

27 Our Approach  Use a hybrid of statistical and linguistic knowledge  Statistical analysis of multiple documents  Identify important new, contradictory information  Information fusion and rule-driven content selection  Generation of summary sentences  By re-using phrases  Automatic editing/rewriting summary

28 Newsblaster Integrated in online environment for daily news updates Ani Nenkova David Elson

29 Newsblaster  Clustering articles into events  Categorization by broad topic  Multi-document summarization  Generation of summary sentences  Fusion  Editing of references

30 Newsblaster Architecture  [Diagram; components shown: Crawl News Sites, Form Clusters, Categorize, Title Clusters, Summary Router, Select Images, Event Summary, Biography Summary, Multi-Event, Convert Output to HTML]

32 Fusion

33 Theme Computation  Input: A set of related documents  Output: Sets of sentences that “mean” the same thing  Algorithm  Compute similarity across sentences using the Cosine Metric  Can compare word overlap or phrase overlap  (PLACEHOLDER: IR vector space model)
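
The similarity step can be sketched with word-count vectors and the cosine metric. The grouping routine here is a deliberately simple greedy pass with a hypothetical threshold of 0.5, comparing each sentence to the first member of each existing theme; it stands in for whatever clustering the actual system uses, which the slide does not specify.

```python
import math
from collections import Counter

def cosine(a_tokens, b_tokens):
    """Cosine similarity between two bag-of-words vectors."""
    a, b = Counter(a_tokens), Counter(b_tokens)
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def group_into_themes(sentences, threshold=0.5):
    """Greedily place each sentence in the first theme it is similar to."""
    themes = []
    for sentence in sentences:
        tokens = sentence.lower().split()
        for theme in themes:
            if cosine(tokens, theme[0].lower().split()) >= threshold:
                theme.append(sentence)
                break
        else:
            themes.append([sentence])  # no similar theme found: start a new one
    return themes
```

Replacing the word tokens with phrase tokens gives the phrase-overlap variant mentioned on the slide.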

34 Sentence Fusion Computation  Common information identification  Alignment of constituents in parsed theme sentences: only some subtrees match  Bottom-up local multi-sequence alignment  Similarity depends on: word/paraphrase similarity; tree structure similarity  Fusion lattice computation  Choose a basis sentence  Add subtrees from fusion not present in basis  Add alternative verbalizations  Remove subtrees from basis not present in fusion  Lattice linearization  Generate all possible sentences from the fusion lattice  Score sentences using statistical language model

37 Tracking Across Days  Users want to follow a story across time and watch it unfold  Network model for connecting clusters across days  Separately cluster events from today’s news  Connect new clusters with yesterday’s news  Allows for forking and merging of stories  Interface for viewing connections  Summaries that update a user on what’s new  Statistical metrics to identify differences between article pairs  Uses learned model of features  Identifies differences at clause and paragraph levels

42 Different Perspectives  Hierarchical clustering  Each event cluster is divided into clusters by country  Different perspectives can be viewed side by side  Experimenting with update summarizer to identify key differences between sets of stories

46 Questions (from Sparck Jones)  Should we take the reader into account and how?  “Similarly, the notion of a basic summary, i.e., one reflective of the source, makes hidden fact assumptions, for example that the subject knowledge of the output’s readers will be on a par with that of the readers for whom the source was intended. (p. 5)”  Is the state of the art sufficiently mature to allow summarization from intermediate representations and still allow robust processing of domain independent material?