1 Text Summarization: News and Beyond Kathleen McKeown Department of Computer Science Columbia University.

2 Today
- HW3 assigned
- Summarization (switch in order of topics)
- WEKA tutorial (for HW3)
- Midterms back

3 What is Summarization?
- Data as input (database, software trace, expert system), text summary as output
- Text as input (one or more articles), paragraph summary as output
- Multimedia in input or output
- Summaries must convey maximal information in minimal space

4 Types of Summaries
- Informative vs. indicative: replacing a document vs. describing the contents of a document
- Extractive vs. generative (abstractive): choosing bits of the source vs. generating something new
- Single-document vs. multi-document
- Generic vs. user-focused

6 Questions (from Sparck Jones)
- Should we take the reader into account, and how?
  "Similarly, the notion of a basic summary, i.e., one reflective of the source, makes hidden fact assumptions, for example that the subject knowledge of the output's readers will be on a par with that of the readers for whom the source was intended." (p. 5)
- Is the state of the art sufficiently mature to allow summarization from intermediate representations and still allow robust processing of domain-independent material?

7 Foundations of Summarization – Luhn; Edmundson
- Text as input
- Single document
- Content selection
- Methods
- Sentence selection criteria

8 Sentence extraction
- Sparck Jones: "what you see is what you get" – some of what is on view in the source text is transferred to constitute the summary

9 Luhn 58
- Summarization as sentence extraction
- Example
- Term frequency determines sentence importance
  - TF*IDF
  - Stop word filtering
  - Similar words count as one
- A cluster of frequent words indicates a good sentence
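Luhn's cluster scoring can be sketched roughly as follows (a minimal illustration, not his original implementation; the stop list, the five-word significance cutoff, and the gap of four insignificant words are placeholder choices):

```python
from collections import Counter

STOPWORDS = {"the", "a", "an", "of", "to", "in", "is", "and", "that"}  # toy list

def luhn_score(sentence_words, significant, max_gap=4):
    """Score one sentence: find clusters of significant words separated by
    at most `max_gap` insignificant words; return the best cluster's
    (significant count)^2 / cluster length."""
    positions = [i for i, w in enumerate(sentence_words) if w in significant]
    if not positions:
        return 0.0
    best = 0.0
    start = prev = positions[0]
    count = 1
    for p in positions[1:]:
        if p - prev <= max_gap + 1:       # still inside the same cluster
            count += 1
        else:                             # cluster ends; score it, start anew
            best = max(best, count * count / (prev - start + 1))
            start, count = p, 1
        prev = p
    best = max(best, count * count / (prev - start + 1))
    return best

def summarize(sentences, n_sig=5, n_sents=1):
    """Pick the n_sents sentences with the best cluster scores."""
    words = [w for s in sentences for w in s.lower().split() if w not in STOPWORDS]
    significant = {w for w, _ in Counter(words).most_common(n_sig)}
    scored = sorted(sentences,
                    key=lambda s: luhn_score(s.lower().split(), significant),
                    reverse=True)
    return scored[:n_sents]
```

Here a cluster is a run of significant words with at most `max_gap` insignificant words between any two of them, and the best cluster's score is assigned to the sentence.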

10 TF*IDF
- Intuition: important terms are those that are frequent in this document but not frequent across all documents

11 Term Weights
- Local weights: generally, some function of the frequency of terms in documents is used
- Global weights: the standard technique is known as inverse document frequency,
  idf_i = log(N / n_i)
  where N = number of documents and n_i = number of documents containing term i

12 TF*IDF Weighting
- To get the weight for a term in a document, multiply the term's frequency-derived weight by its inverse document frequency: w_ij = tf_ij * idf_i
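As a sketch, TF*IDF weights for a small tokenized collection, using raw term frequency and idf_i = log(N / n_i) as above (real systems often dampen tf or smooth idf):

```python
import math
from collections import Counter

def tfidf_weights(docs):
    """docs: list of token lists. Returns, per document, a dict
    term -> tf * idf with idf_i = log(N / n_i)."""
    N = len(docs)
    df = Counter()                      # n_i: documents containing term i
    for doc in docs:
        df.update(set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)               # raw term frequency in this document
        weights.append({t: tf[t] * math.log(N / df[t]) for t in tf})
    return weights
```

Note that a term occurring in every document gets idf = log(1) = 0, so it is weighted out entirely.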

13 Edmundson 69
Sentence extraction using four weighted features:
- Cue words ("In this paper..", "The worst thing was..")
- Title and heading words
- Sentence location
- Frequent key words
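The linear combination of the four features can be sketched like this (the cue-phrase list, the location heuristic, and the uniform weights are illustrative placeholders, not Edmundson's original values):

```python
CUE_PHRASES = {"in this paper", "we conclude"}  # illustrative placeholders

def edmundson_score(sentence, index, n_sentences, title_words,
                    keywords, weights=(1.0, 1.0, 1.0, 1.0)):
    """Linear combination of Edmundson's four features:
    cue phrases, title/heading overlap, sentence location, key words."""
    tokens = sentence.lower().split()
    cue = float(any(p in sentence.lower() for p in CUE_PHRASES))
    title = sum(1 for w in tokens if w in title_words) / max(len(tokens), 1)
    # location heuristic: first and last sentences get the credit (illustrative)
    location = 1.0 if index == 0 or index == n_sentences - 1 else 0.0
    key = sum(1 for w in tokens if w in keywords) / max(len(tokens), 1)
    w1, w2, w3, w4 = weights
    return w1 * cue + w2 * title + w3 * location + w4 * key
```

Sentences are then ranked by this score and the top ones extracted; tuning the four weights is exactly what Kupiec et al. later replaced with corpus training.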

14 Sentence extraction variants
- Lexical chains (Barzilay and Elhadad; Silber and McCoy)
- Discourse coherence (Baldwin)
- Topic signatures (Lin and Hovy)

15 Lexical Chains
- "Dr. Kenny has invented an anesthetic machine. This device controls the rate at which an anesthetic is pumped into the blood."
- "Dr. Kenny has invented an anesthetic machine. The doctor spent two years on this research."
- Algorithm: measure the strength of a chain by its length and its homogeneity
- Select the first sentence from each strong chain until the length limit is reached
- Semantics needed?
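A heavily simplified sketch of the chain idea, linking only exact word repetitions across sentences (the chains in Barzilay and Elhadad also follow WordNet relations such as synonymy and hypernymy, e.g. machine/device and Dr./doctor above, which this omits):

```python
def repetition_chains(sentences, min_len=2):
    """Toy lexical chains: a chain is all occurrences of one word across
    sentences; chain strength here is just its length."""
    occurrences = {}
    for i, sent in enumerate(sentences):
        for w in set(sent.lower().split()):
            occurrences.setdefault(w, []).append(i)
    chains = {w: idxs for w, idxs in occurrences.items() if len(idxs) >= min_len}
    # strongest chains first
    return sorted(chains.items(), key=lambda kv: -len(kv[1]))
```

Following the algorithm on the slide, a summarizer would take the first sentence index of each strong chain (`chains[0][1][0]`, and so on) until the length limit is reached.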

16 Discourse Coherence
- Saudi Arabia on Tuesday decided to sign…
- The official Saudi Press Agency reported that King Fahd made the decision during a cabinet meeting in Riyadh, the Saudi capital.
- The meeting was called in response to … the Saudi foreign minister, that the Kingdom…
- An account of the Cabinet discussions and decisions at the meeting…
- The agency...
- It…

17 Topic Signature Words
- Uses the log-likelihood ratio test to find words that are highly descriptive of the input
- The test provides a way of setting a threshold to divide all words in the input into descriptive or not:
  - H0: the probability of a word in the input is the same as in the background
  - H1: the word has a different, higher probability in the input than in the background
- The binomial distribution is used to compute the ratio of the two likelihoods
- The sentences containing the highest proportion of topic signature words are extracted
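The two-hypothesis test can be sketched with Dunning-style binomial log-likelihoods (the 10.83 cutoff corresponds to chi-square significance at p < 0.001 with one degree of freedom; the probability clamping is a numerical convenience):

```python
import math

def _log_binom(k, n, p):
    # log binomial likelihood, ignoring the constant combinatorial term
    p = min(max(p, 1e-12), 1 - 1e-12)   # clamp away from 0 and 1
    return k * math.log(p) + (n - k) * math.log(1 - p)

def llr(k1, n1, k2, n2):
    """-2 log lambda for H0 (one shared p for input and background)
    vs H1 (separate p1 for the input, p2 for the background)."""
    p = (k1 + k2) / (n1 + n2)
    p1, p2 = k1 / n1, k2 / n2
    null = _log_binom(k1, n1, p) + _log_binom(k2, n2, p)
    alt = _log_binom(k1, n1, p1) + _log_binom(k2, n2, p2)
    return 2 * (alt - null)

def topic_signature(input_counts, bg_counts, threshold=10.83):
    """Words significantly MORE probable in the input than the background."""
    n1, n2 = sum(input_counts.values()), sum(bg_counts.values())
    sig = set()
    for w, k1 in input_counts.items():
        k2 = bg_counts.get(w, 0)
        if k1 / n1 > k2 / n2 and llr(k1, n1, k2, n2) > threshold:
            sig.add(w)
    return sig
```

The directional check (`k1 / n1 > k2 / n2`) keeps only words over-represented in the input; a plain chi-square test alone would also flag words that are unusually rare there.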

18 Summarization as a Noisy Channel Model
- Summary/text pairs
- Machine learning model
- Identify which features help most

19 Julian Kupiec, SIGIR 95 – Paper Abstract
- To summarize is to reduce in complexity, and hence in length, while retaining some of the essential qualities of the original.
- This paper focuses on document extracts, a particular kind of computed document summary.
- Document extracts consisting of roughly 20% of the original can be as informative as the full text of a document, which suggests that even shorter extracts may be useful indicative summaries.
- The trends in our results are in agreement with those of Edmundson, who used a subjectively weighted combination of features, as opposed to training the feature weights with a corpus.
- We have developed a trainable summarization program that is grounded in a sound statistical framework.

20 Statistical Classification Framework
- A training set of documents with hand-selected abstracts
  - Engineering Information Co. provides technical article abstracts
  - 188 document/summary pairs; 21 journal articles
- A Bayesian classifier estimates the probability of a given sentence appearing in the abstract
  - Direct matches (79%)
  - Direct joins (3%)
  - Incomplete matches (4%)
  - Incomplete joins (5%)
- New extracts are generated by ranking document sentences according to this probability
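The Bayesian classifier can be sketched as naive Bayes over discrete sentence features, following the paper's independence assumption P(s in summary | f1..fk) ∝ P(s in summary) · Π P(fj | in summary) / P(fj); the add-one smoothing is an illustrative choice, not from the paper:

```python
from collections import defaultdict

class KupiecClassifier:
    """Naive Bayes ranking of sentences by probability of abstract membership."""

    def __init__(self):
        self.in_summary = 0
        self.total = 0
        self.feat_given_summary = defaultdict(lambda: defaultdict(int))
        self.feat_counts = defaultdict(lambda: defaultdict(int))

    def train(self, feature_vectors, labels):
        # labels: 1 if the sentence matched an abstract sentence, else 0
        for feats, label in zip(feature_vectors, labels):
            self.total += 1
            self.in_summary += label
            for j, v in enumerate(feats):
                self.feat_counts[j][v] += 1
                if label:
                    self.feat_given_summary[j][v] += 1

    def score(self, feats):
        # prior * product of per-feature likelihood ratios, add-one smoothed
        s = self.in_summary / self.total
        for j, v in enumerate(feats):
            p_f_given_s = (self.feat_given_summary[j][v] + 1) / (self.in_summary + 2)
            p_f = (self.feat_counts[j][v] + 1) / (self.total + 2)
            s *= p_f_given_s / p_f
        return s
```

Extracts are then produced by sorting sentences by `score` and taking the top ones, mirroring the ranking step on the slide.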

21 Features
- Sentence length cutoff
- Fixed-phrase feature (26 indicator phrases)
- Paragraph feature
  - First 10 paragraphs and last 5
  - Is the sentence paragraph-initial, paragraph-final, or paragraph-medial?
- Thematic word feature: most frequent content words in the document
- Uppercase word feature: proper names are important

22 Evaluation
- Precision and recall; strict match has an 83% upper bound
- Trained summarizer: 35% correct
- Limited to the fraction of matchable sentences, the trained summarizer is 42% correct
- Best feature combination: paragraph, fixed phrase, sentence length
- Thematic and uppercase word features give a slight decrease in performance
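For an extractive summarizer, precision and recall reduce to set overlap between the selected sentence indices and the gold (abstract-matched) indices; a minimal sketch:

```python
def extract_pr(selected, gold):
    """Precision/recall for extractive summaries, given sentence indices."""
    selected, gold = set(selected), set(gold)
    tp = len(selected & gold)                       # correctly chosen sentences
    precision = tp / len(selected) if selected else 0.0
    recall = tp / len(gold) if gold else 0.0
    return precision, recall
```

When the extract is forced to the same length as the gold set, precision and recall coincide, which is why the slide can quote single "percent correct" figures.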

23 Questions (from Sparck Jones)
- Should we take the reader into account, and how?
  "Similarly, the notion of a basic summary, i.e., one reflective of the source, makes hidden fact assumptions, for example that the subject knowledge of the output's readers will be on a par with that of the readers for whom the source was intended." (p. 5)
- Is the state of the art sufficiently mature to allow summarization from intermediate representations and still allow robust processing of domain-independent material?