UWMS Data Mining Workshop
Content Analysis: Automated Summarizing
Prof. Marti Hearst
SIMS 202, Lecture 16

Marti A. Hearst SIMS 202, Fall 1997
Summarization
- What is it for?
  - Reduce in complexity, and hence in length, while retaining some of the essential qualities of the original. (Kupiec et al. 95)
  - Other definitions?
- What kinds of summaries are there?
  - Abstracts
  - Extracts
  - Highlights
- Difficult to evaluate quality

Marti A. Hearst SIMS 202, Fall 1997
Abstracts
- Act as document surrogates
- Intermediate point between title and entire document
- Summarize main points
- Cohesive narratives
- Reporting vs. critical

Marti A. Hearst SIMS 202, Fall 1997
How to Generate Abstracts Automatically?
- "Understand" the text and summarize?
  - Automated text understanding is still not possible.
  - Approximations to this require huge amounts of hand-coded knowledge.
  - Hard even for people (remember the SATs?)

Marti A. Hearst SIMS 202, Fall 1997
Extracts
- Excerpt directly from source material
- Present an inventory of all major topics or points
- Easier to automate than abstracts

Marti A. Hearst SIMS 202, Fall 1997
Automatic Extracting
- Define the goal:
  - Summarize the main point?
  - Create a survey of topics?
  - Show the relation of the document to the user's query terms?
  - Show the context of the document (for web page excerpts)?

Marti A. Hearst SIMS 202, Fall 1997
Example: excerpts shown in web search results
- Lecture 18. Index language functions (Text Chapter 13) Objectives. The student should understand the principle of request-oriented (user-centered)... size 4K - 23-Apr-97 - English
- 5. Actions/change, Accepted articles. Research area of Planning and Scheduling. Received Research Articles. The following articles have been received for the ETAI area "Planning and... size 2K - 8-Sep-97 - English
- 8. Wilson Readers' Guide Abstracts. Wilson Readers' Guide Abstracts includes citations and abstracts for articles from over 250 of the popular English... size 3K - 29-May-97 - English

Marti A. Hearst SIMS 202, Fall 1997
Automating Extracting
- Just about any simple algorithm can get "good" results for coarse tasks:
  - Pull out "important" phrases
  - Find "meaningfully" related words
  - Create an extract from the document
- Major problem: evaluation
  - Need to define the goal or purpose
  - Human extractors agree on only about 25% of sentences (Rath et al. 61)
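To make the "simple algorithm" point concrete, here is a minimal sketch (not from the lecture) of a Luhn-style term-frequency extractor: score each sentence by the average document frequency of its content words and keep the top scorers in document order. The stopword list, regexes, and normalization are illustrative assumptions.

```python
import re
from collections import Counter

# Illustrative stopword list; a real system would use a fuller one.
STOPWORDS = {"the", "a", "an", "of", "to", "and", "in", "is", "it",
             "that", "this", "for", "on", "with", "as", "are", "be"}

def simple_extract(text, n_sentences=3):
    """Luhn-style extract: rank sentences by mean content-word frequency."""
    # Crude sentence split on terminal punctuation.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    # Document-wide frequency of lower-cased content words.
    words = re.findall(r"[a-z']+", text.lower())
    freq = Counter(w for w in words if w not in STOPWORDS)

    def score(sent):
        toks = [w for w in re.findall(r"[a-z']+", sent.lower())
                if w not in STOPWORDS]
        # Normalize by length so long sentences don't dominate.
        return sum(freq[w] for w in toks) / (len(toks) or 1)

    ranked = sorted(range(len(sentences)),
                    key=lambda i: score(sentences[i]), reverse=True)
    keep = sorted(ranked[:n_sentences])  # restore document order
    return [sentences[i] for i in keep]
```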

Marti A. Hearst SIMS 202, Fall 1997
Summary of Summary Paper
Kupiec, Pedersen, and Chen, SIGIR 95
- To summarize is to reduce in complexity, and hence in length, while retaining some of the essential qualities of the original.
- This paper focuses on document extracts, a particular kind of computed document summary.
- Document extracts consisting of roughly 20% of the original can be as informative as the full text of a document, which suggests that even shorter extracts may be useful indicative summaries.
- The trends in our results are in agreement with those of Edmundson, who used a subjectively weighted combination of features as opposed to training the feature weights using a corpus.
- We have developed a trainable summarizer program that is grounded in a solid statistical framework.

Marti A. Hearst SIMS 202, Fall 1997
Text Pre-Processing
The following steps are typical:
- Tokenization
- Morphological analysis (stemming): inflectional, derivational, or crude IR methods
- Part-of-speech tagging: I/Pro see/VP Pathfinder/PN on/P Mars/PN ...
- Phrase boundary identification: [Subj I] [VP saw] [DO Pathfinder] [PP on Mars] [PP with a telescope].
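A minimal sketch of the first three steps using NLTK (the toolkit choice is an assumption; the lecture names none). NLTK's tagger uses Penn Treebank tags rather than the slide's Pro/VP/PN notation.

```python
import nltk
from nltk.stem import PorterStemmer

# One-time model downloads, needed on first run:
# nltk.download("punkt"); nltk.download("averaged_perceptron_tagger")

sentence = "I saw Pathfinder on Mars with a telescope."

tokens = nltk.word_tokenize(sentence)               # tokenization
stems = [PorterStemmer().stem(t) for t in tokens]   # crude IR-style stemming
tagged = nltk.pos_tag(tokens)                       # part-of-speech tagging

print(tokens)  # ['I', 'saw', 'Pathfinder', 'on', 'Mars', ...]
print(tagged)  # [('I', 'PRP'), ('saw', 'VBD'), ('Pathfinder', 'NNP'), ...]
```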

Marti A. Hearst SIMS 202, Fall 1997
Extracting
- Example: sentence extraction from a single document (Kupiec et al.)
- Start with a training set of manually generated extracts (this allows for objective evaluation)
- Create heuristics
- Train a classification function to estimate the probability that a sentence is included in the extract
- 42% of assigned sentences actually belonged in the extracts

Marti A. Hearst SIMS 202, Fall 1997
Heuristic Feature Selection
- Sentence length cut-off
- Key fixed phrases
  - "this letter", "in conclusion"
  - phrases appearing right after the conclusions section
- Position
  - of paragraph in document (first & last 10 paragraphs)
  - of sentence within paragraph (first, last, median)
- Thematic words
  - most frequent content words
  - see the Choueka article in the reader
- Uppercase words (proper names)
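A sketch of how these heuristics might be encoded as boolean features for a single sentence; the names and thresholds here are illustrative assumptions, not the paper's exact definitions.

```python
import re

FIXED_PHRASES = ("this letter", "in conclusion")  # illustrative examples

def sentence_features(sentence, sent_index_in_para, para_index,
                      n_paragraphs, thematic_words):
    """Boolean features in the spirit of Kupiec et al.'s heuristics."""
    words = re.findall(r"\w+", sentence)
    lower = sentence.lower()
    return {
        # Sentence length cut-off: skip very short sentences.
        "long_enough": len(words) > 5,
        # Key fixed phrases anywhere in the sentence.
        "fixed_phrase": any(p in lower for p in FIXED_PHRASES),
        # Paragraph position: among the first or last 10 paragraphs.
        "para_position": para_index < 10 or para_index >= n_paragraphs - 10,
        # Sentence position within its paragraph.
        "para_initial": sent_index_in_para == 0,
        # Thematic words: contains one of the most frequent content words.
        "thematic": any(w.lower() in thematic_words for w in words),
        # Uppercase (proper-name-like) words, ignoring the sentence-initial word.
        "uppercase": any(w[0].isupper() for w in words[1:]),
    }
```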

Marti A. Hearst SIMS 202, Fall 1997
Classifier Function
- For each sentence S, compute the probability that S will appear in extract E.
- If a feature appears in sentences chosen to be in extracts, and not in other sentences, that feature is useful.
- If a sentence contains many of the useful features, that sentence is likely to be chosen for the extract.
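The equation on this slide did not survive transcription. The classifier in Kupiec et al. is a naive Bayes model: assuming the features F_1, ..., F_k are independent,

```latex
P(s \in E \mid F_1, \ldots, F_k)
  = \frac{P(F_1, \ldots, F_k \mid s \in E)\, P(s \in E)}{P(F_1, \ldots, F_k)}
  \approx \frac{\left(\prod_{j=1}^{k} P(F_j \mid s \in E)\right) P(s \in E)}
               {\prod_{j=1}^{k} P(F_j)}
```

Both the per-feature probabilities and the prior P(s ∈ E) are estimated by counting over the training corpus, as the next slide spells out.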

Marti A. Hearst SIMS 202, Fall 1997
Classifier Function
Compute:
- How likely is each feature to occur anywhere in any document?
- How likely is each feature to occur in a sentence that ends up in an extract?
- Combine the feature scores for a sentence to compute the probability that the sentence is included in the extract.
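A minimal sketch of that computation over boolean features: a naive Bayes trainer and scorer under the independence approximation above. `sentence_features` is the hypothetical helper sketched two slides back, and the add-one smoothing is an assumption.

```python
from math import log

def train(labeled):
    """labeled: list of (feature_dict, in_extract_bool) pairs."""
    n = len(labeled)
    n_in = sum(1 for _, y in labeled if y)
    assert 0 < n_in < n, "training data must contain both classes"
    names = labeled[0][0].keys()
    # P(F_j) over all sentences, and P(F_j | s in E), with add-one smoothing.
    p_f = {f: (1 + sum(feats[f] for feats, _ in labeled)) / (n + 2)
           for f in names}
    p_f_in = {f: (1 + sum(feats[f] for feats, y in labeled if y)) / (n_in + 2)
              for f in names}
    prior = n_in / n
    return prior, p_f, p_f_in

def score(feats, model):
    """Log-probability (up to a constant) that the sentence is in the extract."""
    prior, p_f, p_f_in = model
    s = log(prior)
    for f, present in feats.items():
        num = p_f_in[f] if present else 1 - p_f_in[f]
        den = p_f[f] if present else 1 - p_f[f]
        s += log(num) - log(den)
    return s
```

Sentences are then ranked by this score, and the top-scoring ones, up to the desired extract length, are returned in document order.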

Marti A. Hearst SIMS 202, Fall 1997
Evaluation
- Corpus of manual extracts
  - Engineering journal articles, average length 86 sentences
  - 188 document/extract pairs from 21 journals
- Statistics for manual extracts in the training set:

  Direct sentence matches        451  (79%)
  Direct joins                    19   (3%)
  Unmatchable sentences           50   (9%)
  Incomplete single sentences     21   (4%)
  Incomplete joins                27   (5%)
  Total extract sentences        568

- Join: a sentence combined with other material

Marti A. Hearst SIMS 202, Fall 1997
Evaluation
- Training set vs. testing set
  - Must keep them separate for legitimate results
  - Danger to avoid: over-fitting
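A tiny sketch of holding out a test set; the split fraction and seed are arbitrary illustrations, not the paper's protocol.

```python
import random

def split_corpus(pairs, test_fraction=0.25, seed=0):
    """Shuffle document/extract pairs, then hold out a test portion."""
    pairs = list(pairs)
    random.Random(seed).shuffle(pairs)
    cut = int(len(pairs) * (1 - test_fraction))
    return pairs[:cut], pairs[cut:]  # (training set, testing set)
```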

Marti A. Hearst SIMS 202, Fall 1997
Evaluation
- Baseline: use only sentences from the beginning of the document, with the length cut-off
  - 121 of the original extracted sentences selected (24%)
- Results using the classifier (assigning the same number of sentences as the manual abstractors did):
  - 195 direct matches + 6 direct joins = 35% correct
  - When extracts are larger (25% of the size of the original text), the algorithm selects 84% of the extracted sentences

Marti A. Hearst SIMS 202, Fall 1997
Evaluation
Performance for each feature:

  Feature          A          B
  Paragraph        163 (33%)  163 (33%)
  Fixed phrases    145 (29%)  209 (42%)
  Length cut-off   121 (24%)  217 (44%)
  Thematic word    101 (20%)  209 (42%)
  Uppercase word   211 (42%)  211 (42%)

A: sentence-level performance for each feature alone. If there are many sentences with the same feature, they are put in the abstract in order of appearance within the document.
B: how performance varies as features are combined cumulatively from the top down.

Marti A. Hearst SIMS 202, Fall 1997
Thought Points
- How well might this work on a different kind of collection?
- How to summarize a collection?