Processing of large document collections Part 5 (Text summarization) Helena Ahonen-Myka Spring 2005.

2 In this part: text summarization
– Edmundson's method
– corpus-based approaches: the KPC method

3 Edmundson’s method
Edmundson (1969), ”New methods in automatic extracting”, extends earlier work to look at three features in addition to word frequencies:
– cue phrases (e.g. ”significant”, ”impossible”, ”hardly”)
– title and heading words
– location

4 Edmundson’s method
programs to weight sentences based on each of the four features
– weight of a sentence = the sum of the weights for its features
programs were evaluated by comparison against manually created extracts
corpus-based methodology: training set and test set
– in the training phase, the weights were manually readjusted

5 Edmundson’s method
results:
– the three additional features dominated word frequency measures
– the combination of cue-title-location was the best, with location being the best individual feature
– keywords alone were the worst

6 Fundamental issues
What are the most powerful but also most general features to exploit for summarization?
How do we combine these features?
How can we evaluate how well we are doing?

7 Linear Weighting Scheme
Weight(U) = α·Location(U) + β·CuePhrase(U) + γ·StatTerm(U) + δ·AddTerm(U)
U is a text unit such as a sentence; the Greek letters denote tuning parameters
Location: weight assigned to a text unit based on whether it occurs in lead, medial, or final position in a paragraph or the entire document, or whether it occurs in prominent sections such as the document’s introduction or conclusion
CuePhrase: weight assigned to a text unit in case lexical or phrasal in-text summary cues occur: positive weights for bonus words (“significant”, “verified”, etc.), negative weights for stigma words (“hardly”, “impossible”, etc.)
StatTerm: weight assigned to a text unit due to the presence of statistically salient terms (e.g., tf.idf terms) in that unit
AddTerm: weight assigned to a text unit for terms in it that are also present in the title, headline, initial paragraph, or the user’s profile or query
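The linear combination on this slide can be sketched as a short function. The feature scores and the tuning weights below are illustrative placeholders, not Edmundson's actual values:

```python
# Sketch of the linear weighting scheme: the overall weight of a text
# unit U is a tuned linear combination of the four feature scores.
# All numeric values here are illustrative placeholders.

def unit_weight(location, cue_phrase, stat_term, add_term,
                alpha=1.0, beta=1.0, gamma=1.0, delta=1.0):
    """Weight(U) = alpha*Location(U) + beta*CuePhrase(U)
                 + gamma*StatTerm(U) + delta*AddTerm(U)"""
    return (alpha * location + beta * cue_phrase
            + gamma * stat_term + delta * add_term)

# Example: a lead sentence with one bonus cue phrase and two tf.idf terms.
w = unit_weight(location=1.0, cue_phrase=1.0, stat_term=2.0, add_term=0.0,
                alpha=2.0, beta=1.0, gamma=0.5, delta=1.0)
print(w)  # 4.0
```

Tuning the Greek-letter parameters by hand against a training corpus is exactly the readjustment step of Edmundson's methodology described earlier.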

8 Corpus-based approaches
in the classical methods, various features (thematic features, title, location, cue phrases) were used to determine the salience of information for summarization
an obvious issue: determining the relative contribution of different features to any given text summarization task
– the tuning parameters in the previous slide

9 Corpus-based approaches
the contribution is dependent on the text genre, e.g. location:
– in newspaper stories, the leading text often contains a summary
– in TV news, a preview segment may contain a summary of the news to come
– in scientific text: an author-written abstract

10 Corpus-based approaches
the importance of different text features for any given summarization problem can be determined by counting the occurrences of such features in text corpora
in particular, analysis of human-generated summaries, along with their full-text sources, can be used to learn rules for summarization

11 Corpus-based approaches
challenges:
– creating a suitable text corpus
– ensuring that a suitable set of summaries is available
  - may already be available: scientific papers
  - if not: author, professional abstractor, judge
– evaluation in terms of accuracy on unseen test data
– discovering new features for new genres

12 KPC method
Kupiec, Pedersen, Chen (1995): A trainable document summarizer
a learning method using
– a corpus of journal articles and
– abstracts written by professional human abstractors (Engineering Information Co.)
a naïve Bayesian classification method is used to create extracts

13 KPC method: general idea
training phase:
– select a set of features
– calculate the probability of each feature value appearing in a summary sentence, using a training corpus (e.g. originals + manual summaries)

14 KPC method: general idea
when a new document is summarized:
– for each sentence
  - find the values of the features
  - calculate the probability of this feature value combination appearing in a summary sentence
– choose the n best scoring sentences
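The selection step above can be sketched as follows; `score` stands in for the trained probability model, and the toy scoring function in the example is illustrative only:

```python
# Sketch of the summarization step: score every sentence with a trained
# scoring function and keep the n best, returned in document order.

def summarize(sentences, score, n=3):
    ranked = sorted(range(len(sentences)),
                    key=lambda i: score(sentences[i]), reverse=True)
    chosen = sorted(ranked[:n])  # restore document order
    return [sentences[i] for i in chosen]

# Toy usage: rank by raw character length instead of a trained model.
sents = ["Short.",
         "A much longer and more informative sentence.",
         "Yet another long and quite informative sentence here."]
print(summarize(sents, score=len, n=2))
```

Returning the chosen sentences in document order, rather than score order, is what makes the output read as an extract of the original text.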

15 KPC method: features
sentence-length cut-off feature
– given a threshold (e.g. 5 words), the feature is true for all sentences longer than the threshold, and false otherwise
– F1(s) = 0, if sentence s has 5 or fewer words
– F1(s) = 1, if sentence s has more than 5 words
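The sentence-length cut-off feature is a one-line function; the 5-word threshold matches the example on the slide:

```python
# Sentence-length cut-off feature F1: true (1) for sentences longer
# than the threshold, false (0) otherwise.
def f1(sentence, threshold=5):
    return 1 if len(sentence.split()) > threshold else 0
```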

16 KPC method: features
paragraph feature
– sentences in the first 10 paragraphs and the last 5 paragraphs of a document get a higher value
– within paragraphs, paragraph-initial, paragraph-final, and paragraph-medial sentences are distinguished
– F2(s) = i, if sentence s is the first sentence in a paragraph
– F2(s) = f, if there are at least 2 sentences in a paragraph, and s is the last one
– F2(s) = m, if there are at least 3 sentences in a paragraph, and s is neither the first nor the last sentence
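The within-paragraph part of this feature follows directly from the definition above (positions are 0-based here):

```python
# Paragraph feature F2: position of a sentence within its paragraph.
# i = paragraph-initial, f = paragraph-final (paragraphs of >= 2
# sentences), m = medial (only possible with >= 3 sentences).
def f2(position, paragraph_length):
    if position == 0:
        return "i"
    if position == paragraph_length - 1:
        return "f"
    return "m"
```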

17 KPC method: features
thematic word feature
– a small number of thematic words (the most frequent content words) are selected
– each sentence is scored as a function of the frequency of the thematic words
– the highest scoring sentences are selected
– binary feature: the feature is true for a sentence if the sentence is present in the set of highest scoring sentences
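A minimal sketch of the thematic word feature, assuming a simple whitespace tokenizer and a tiny illustrative stopword list (the real system's content-word selection would be more careful):

```python
# Thematic word feature sketch: pick the k most frequent content words,
# score each sentence by how many thematic words it contains, and mark
# the top-scoring sentences with a binary 1. Stopword list is illustrative.
from collections import Counter

STOPWORDS = {"the", "a", "an", "of", "in", "is", "and", "to"}

def thematic_words(sentences, k=5):
    counts = Counter(w for s in sentences for w in s.lower().split()
                     if w not in STOPWORDS)
    return {w for w, _ in counts.most_common(k)}

def f3(sentences, top_n=2, k=5):
    theme = thematic_words(sentences, k)
    scores = [sum(w in theme for w in s.lower().split()) for s in sentences]
    best = sorted(range(len(sentences)), key=lambda i: scores[i],
                  reverse=True)[:top_n]
    return [1 if i in best else 0 for i in range(len(sentences))]
```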

18 KPC method: features
fixed-phrase feature
– this feature is true for sentences that contain any of 26 indicator phrases (e.g. ”this letter…”, ”In conclusion…”), or that follow a section heading containing specific keywords (e.g. ”results”, ”conclusion”)

19 KPC method: features
uppercase word feature
– proper names and explanatory text for acronyms are usually important
– the feature is computed like the thematic word feature
– an uppercase thematic word must not be sentence-initial, must begin with a capital letter, and must occur several times
– its first occurrence is scored twice as much as later occurrences

20 Exercise (CNN news)
sentence-length, F1: let threshold = 14
– fewer than 14 words: F1(s) = 0, else F1(s) = 1
paragraph, F2:
– i = first, f = last, m = medial
thematic words, F3:
– score: how many thematic words a sentence has
– F3(s) = 0, if score > 3, else F3(s) = 1

21 KPC method: classifier
for each sentence s, we compute the probability that s will be included in a summary S, given the k features Fj, j = 1…k
the probability can be expressed using Bayes’ rule:
P(s ∈ S | F1,…,Fk) = P(F1,…,Fk | s ∈ S) · P(s ∈ S) / P(F1,…,Fk)

22 KPC method: classifier
assuming statistical independence of the features:
P(s ∈ S | F1,…,Fk) = P(s ∈ S) · ∏j P(Fj | s ∈ S) / ∏j P(Fj)
P(s ∈ S) is a constant, and P(Fj | s ∈ S) and P(Fj) can be estimated directly from the training set by counting occurrences
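The score under the independence assumption can be computed directly; working in log space avoids underflow when many features are multiplied. The probability tables in the example are illustrative placeholders, not estimates from the KPC corpus:

```python
# Naive Bayes sentence score under the independence assumption:
# log P(s in S | F1..Fk) = log P(s in S)
#   + sum_j [ log P(Fj | s in S) - log P(Fj) ]
import math

def nb_log_score(feature_values, p_f_given_s, p_f, p_s):
    score = math.log(p_s)
    for j, v in enumerate(feature_values):
        score += math.log(p_f_given_s[j][v]) - math.log(p_f[j][v])
    return score

# Example with two binary features (placeholder probability tables).
p_f_given_s = [{0: 0.2, 1: 0.8}, {0: 0.5, 1: 0.5}]
p_f = [{0: 0.6, 1: 0.4}, {0: 0.5, 1: 0.5}]
s = nb_log_score([1, 0], p_f_given_s, p_f, p_s=0.03)
```

Since the score is only used to rank sentences against each other, the monotone log transform does not change which n sentences are selected.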

23 KPC method: corpus
the corpus was acquired from Engineering Information Co., which provides abstracts of technical articles to online information services
the articles do not have author-written abstracts
the abstracts were created by professional abstractors

24 KPC method: corpus
188 document/summary pairs sampled from 21 publications in the scientific/technical domain
summaries are mainly indicative; the average length is 3 sentences
the average number of sentences in the original documents is 86
author, address, and bibliography were removed

25 KPC method: sentence matching
the abstracts from the human abstractors are not extracts, but are inspired by the original sentences
the automatic summarization task here:
– extract sentences that the human abstractor might have chosen to prepare the summary text (with minor modifications…)
for training, a correspondence between the manual summary sentences and sentences in the original document needs to be obtained
matching can be done in several ways

26 KPC method: sentence matching
matching can be done in several ways:
– a direct sentence match: the same sentence is found in both
– a direct join: 2 or more original sentences were used to form a summary sentence
– a summary sentence can be ’unmatchable’
– a summary sentence (single or joined) can be ’incomplete’

27 KPC method: sentence matching
matching was done in two passes:
– first, the best one-to-one sentence matches were found automatically
– second, these matches were used as a starting point for the manual assignment of correspondences

28 KPC method: evaluation
a cross-validation strategy was used for evaluation:
– documents from a given journal were selected for testing one at a time
– all other document/summary pairs (of this journal) were used for training
– results were summed over journals
unmatchable and incomplete summary sentences were excluded, leaving a total of 498 unique sentences
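The cross-validation loop above can be sketched as follows; `train` and `evaluate` are caller-supplied stubs standing in for model fitting and scoring:

```python
# Leave-one-document-out cross-validation per journal: each document of
# a journal is held out for testing while that journal's remaining
# document/summary pairs are used for training; results are collected
# over all journals.

def cross_validate(corpus, train, evaluate):
    """corpus: dict mapping journal name -> list of (doc, summary) pairs."""
    results = []
    for journal, pairs in corpus.items():
        for i, held_out in enumerate(pairs):
            model = train(pairs[:i] + pairs[i + 1:])
            results.append(evaluate(model, held_out))
    return results
```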

29 KPC method: evaluation
two ways of evaluation:
1. the fraction of manual summary sentences that were faithfully reproduced by the summarizer program
– the summarizer produced the same number of sentences as were in the corresponding manual summary -> 35% of summary sentences reproduced
– 83% is the highest possible value, since unmatchable and incomplete sentences were excluded
2. the fraction of the matchable sentences that were correctly identified by the summarizer -> 42%

30 KPC method: evaluation
the effect of different features was also studied:
– best combination (44%): paragraph, fixed-phrase, sentence-length
– baseline: selecting sentences from the beginning of the document (result: 24%)
– if 25% of the original sentences are selected: 84%