©2012 Paula Matuszek CSC 9010: Text Mining Applications Document Summarization Dr. Paula Matuszek

Presentation transcript:

©2012 Paula Matuszek CSC 9010: Text Mining Applications Document Summarization Dr. Paula Matuszek (610)

Document Summarization
• Document summarization
  – Provide a meaningful summary for each document
• Examples:
  – Search tool returns "context"
  – Monthly progress reports from multiple projects
  – Summaries of news articles on the human genome
• Often part of a document retrieval system, to enable the user to judge documents better
• Surprisingly hard to make sophisticated
• Surprisingly easy to make effective

Document Summaries: What
• Goal of summarization:
  – Indicative: what's it about?
  – Informative: substitute for reading it
• Number of documents: single vs. multi-document
• Desired summary:
  – paragraph
  – keywords
  – template
  – focused
  – update

Document Summarization -- How
• Two general approaches:
• Abstractive: capture the content in an abstract representation, then generate the summary.
  – Useful in well-defined domains with clear-cut information needs.
  – Still a research topic in the more general case.
• Extractive: extract representative sentences/clauses.
  – Useful in arbitrarily complex and unstructured domains; broadly applicable, and gets the "general feel".

Abstractive: Capture and Generate
• Documents can have arbitrary format.
• Knowledge needed is well-defined.
• Often the information need is for summarization across multiple documents.
• Example:
  – Summarizing restaurant reviews: take newspaper articles and produce price range, kind of food, atmosphere, quality, service.

Capture and Generate: Methods
• State of the art:
  – Create a "template" or "frame"
    – Represent the knowledge you want to capture
  – Extract information to fill in the frame
    – Standard information extraction problem
    – Typically relatively large frames with relatively few relations; mostly facts.
  – Generate based on the template
    – Relatively simple "fill-in-the-blank"
    – More complex based on parse tree.
• Still basically research: parse the entire document into a parse tree tied to a rich semantic net; apply rules to trim the tree; generate continuous narrative.
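
The "fill-in-the-blank" generation step can be illustrated with a minimal Python sketch. It assumes the frame has already been filled by some information extraction component, which is not shown. The `RestaurantFrame` class, its slot names, and the wording of the generated clauses are hypothetical, loosely following the restaurant-review example from the previous slide.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RestaurantFrame:
    """Hypothetical frame; an extraction step (not shown) would fill the slots."""
    name: Optional[str] = None
    cuisine: Optional[str] = None
    price_range: Optional[str] = None
    atmosphere: Optional[str] = None


def generate_summary(frame: RestaurantFrame) -> str:
    """Fill-in-the-blank generation: emit one clause per filled slot."""
    parts = []
    if frame.cuisine:
        parts.append(f"serves {frame.cuisine} food")
    if frame.price_range:
        parts.append(f"is priced in the {frame.price_range} range")
    if frame.atmosphere:
        parts.append(f"has a {frame.atmosphere} atmosphere")

    subject = frame.name or "The restaurant"
    if not parts:
        return f"{subject}: no information extracted."
    if len(parts) == 1:
        body = parts[0]
    else:
        # Join clauses as "a, b and c".
        body = ", ".join(parts[:-1]) + " and " + parts[-1]
    return f"{subject} {body}."


if __name__ == "__main__":
    frame = RestaurantFrame(name="Luigi's", cuisine="Italian", price_range="moderate")
    print(generate_summary(frame))
```

Empty slots are simply skipped, which is why this kind of generation is comparatively simple: the hard work sits in the extraction step that fills the frame.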

Capture and Generate: Advantages and Disadvantages
• Advantages:
  – Produces very focused summaries.
  – Can readily incorporate multiple documents.
  – Not dependent on authors.
• Disadvantages:
  – Assumes the information need is clearly defined.
  – Information extraction component development time is significant.
  – Document parsing is slow; probably not real-time.
• Comment:
  – Makes no attempt to capture the author's intent.

Extractive: Extract Representative Sentence/Clause/Word
• Document format can be arbitrary.
• Document content can also be arbitrary; the information need is not clear-cut.
• The summary consists of text extracted directly from the document.
• Examples:
  – Context returned by Google for each hit
  – Google News summaries

Find Representative Sentences: Method
• Typically, choose representative individual terms, then broaden to capture the sentences containing those terms. The more of these terms a sentence contains, the more important it is.
  – If the summary responds to a search or other information request, the search terms are the representative terms.
  – If there is no prior query, use TF*IDF and other bag-of-words (BOW) approaches. May use pairs or n-ary groups of words.
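
The method above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a reference implementation: the sentence splitter is a naive regular expression, the IDF part is computed against a small hypothetical list of background documents, and a sentence's score is simply the sum of the weights of the distinct terms it contains. All function names are made up for this sketch.

```python
import math
import re
from collections import Counter


def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())


def split_sentences(text):
    # Naive splitter (assumption): break after ., ! or ? followed by whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def tfidf_weights(document, background_docs):
    """TF*IDF weight per term; the document itself counts toward document frequency."""
    tf = Counter(tokenize(document))
    doc_sets = [set(tokenize(d)) for d in background_docs] + [set(tf)]
    n_docs = len(doc_sets)
    weights = {}
    for term, count in tf.items():
        df = sum(term in s for s in doc_sets)  # always >= 1
        weights[term] = count * math.log(n_docs / df)
    return weights


def query_weights(query):
    """With a prior query, the search terms are the representative terms."""
    return {term: 1.0 for term in tokenize(query)}


def rank_sentences(document, weights, top_n=1):
    """Score each sentence by the representative terms it contains."""
    scored = []
    for position, sentence in enumerate(split_sentences(document)):
        score = sum(weights.get(t, 0.0) for t in set(tokenize(sentence)))
        scored.append((score, position, sentence))
    # Highest score first; earlier sentences win ties.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [sentence for _, _, sentence in scored[:top_n]]
```

Passing `query_weights(...)` instead of `tfidf_weights(...)` switches between the two cases on the slide without changing the ranking code.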

Additional Features
• Other features can include
  – position
  – nearness to specific cue phrases such as "in summary"
• Centrality: how similar is the sentence to the overall document? (Any of our similarity measures used for clustering can be applied.)
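
As a rough illustration of these features, the sketch below computes a relative position, a cue-phrase flag, and a centrality score for each sentence. The cue-phrase list is a hypothetical example, the cue feature is simplified to "the sentence contains a cue phrase" rather than a distance, and centrality is taken to be cosine similarity between the sentence's word counts and the whole document's word counts, which is only one of the similarity measures that could be used.

```python
import math
import re
from collections import Counter

CUE_PHRASES = ("in summary", "in conclusion")  # hypothetical cue list


def bag_of_words(text):
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def sentence_features(sentences):
    """Return one feature dict per sentence."""
    document_vector = Counter()
    for sentence in sentences:
        document_vector.update(bag_of_words(sentence))

    last_index = max(len(sentences) - 1, 1)
    features = []
    for index, sentence in enumerate(sentences):
        features.append({
            "position": index / last_index,  # 0.0 = first sentence, 1.0 = last
            "has_cue": any(cue in sentence.lower() for cue in CUE_PHRASES),
            "centrality": cosine(bag_of_words(sentence), document_vector),
        })
    return features
```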

Other Factors
• Once we have sentences, there are new aspects to consider:
  – Sentence ordering
    – Can make a big difference in how good a summary is perceived to be
  – Sentence revision
    – named entity substitution
    – combining information from multiple sentences: fusing
    – eliminating information from some sentences

Find Representative Sentences: Advantages and Disadvantages
• Advantages
  – Can be applied anywhere.
  – Relatively fast (compared to a full parse).
  – Provides a good general idea or feel for the content.
  – Can do multiple-document summaries.
• Disadvantages
  – Often choppy or hard to read.
  – Does poorly when the document doesn't contain good summary sentences.
  – Can miss major information.

Multi-Document Summarization
• When we have multiple documents on a single subject, we might want:
  – one summary with all common information
  – similarities and differences among documents
  – support and opposition to specific ideas/concepts

Data-Driven Summarization
• Extend the "representative sentences" approach.
• May also require some significant NLP.

Frequency as indicator of importance (Based on slides from Ani Nenkova)
• The topic of a document will be repeated many times
• In multi-document summarization, important content is repeated in different sources

Greedy frequency method (Based on slides from Ani Nenkova)
• Compute word probability from input
• Compute sentence weight as function of word probability
• Pick best sentence
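
A minimal Python sketch of the three steps follows. Two choices here are assumptions that the slide does not spell out: the sentence weight is the average probability of the sentence's words, and after a sentence is picked the probability of each word it contains is squared so that later picks favor new content. Both follow the published SumBasic algorithm by Nenkova and Vanderwende, not necessarily the exact variant presented in the lecture.

```python
import re
from collections import Counter


def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())


def greedy_frequency_summary(sentences, max_sentences=2):
    tokens = [tokenize(s) for s in sentences]

    # Step 1: word probability from the input.
    counts = Counter(t for sentence_tokens in tokens for t in sentence_tokens)
    total = sum(counts.values())
    prob = {word: count / total for word, count in counts.items()}

    def weight(i):
        # Step 2 (assumption): sentence weight = average word probability.
        return sum(prob[t] for t in tokens[i]) / len(tokens[i])

    chosen = []
    remaining = [i for i, sentence_tokens in enumerate(tokens) if sentence_tokens]
    while remaining and len(chosen) < max_sentences:
        # Step 3: pick the best remaining sentence.
        best = max(remaining, key=weight)
        chosen.append(best)
        remaining.remove(best)
        # Assumption (SumBasic-style update): down-weight words already covered.
        for word in set(tokens[best]):
            prob[word] = prob[word] ** 2
    return [sentences[i] for i in chosen]
```

The update step matters for redundancy: without it, a second sentence that repeats the most frequent words would usually be picked next, even though it adds little to the summary.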

Sentence clustering for theme identification (Based on slides from Ani Nenkova)
1. PAL was devastated by a pilots' strike in June and by the region's currency crisis.
2. In June, PAL was embroiled in a crippling three-week pilots' strike.
3. Tan wants to retain the 200 pilots because they stood by him when the majority of PAL's pilots staged a devastating strike in June.

(Based on slides from Ani Nenkova)
• Cluster sentences from the input into similar themes
• Choose one sentence to represent a theme
• Consider bigger themes as more important
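
One simple way to realize this idea is sketched below in Python. It is an illustration only: similarity is Jaccard overlap of word sets, clustering is a single greedy pass with a hypothetical threshold of 0.15, the stopword list is a tiny made-up one, and the slides do not say which clustering method or similarity measure was actually used.

```python
import re

STOPWORDS = {"a", "the", "by", "in", "and", "was", "of", "to", "s"}  # tiny hypothetical list


def word_set(sentence):
    return set(re.findall(r"[a-z0-9]+", sentence.lower())) - STOPWORDS


def jaccard(a, b):
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster_themes(sentences, threshold=0.15):
    """Return clusters of sentence indices, biggest theme first."""
    sets = [word_set(s) for s in sentences]
    clusters = []
    for i, words in enumerate(sets):
        best_cluster, best_similarity = None, threshold
        for cluster in clusters:
            # Similarity to a theme = similarity to its closest member.
            similarity = max(jaccard(words, sets[j]) for j in cluster)
            if similarity >= best_similarity:
                best_cluster, best_similarity = cluster, similarity
        if best_cluster is None:
            clusters.append([i])  # start a new theme
        else:
            best_cluster.append(i)
    # Bigger themes are considered more important.
    return sorted(clusters, key=len, reverse=True)


def representatives(sentences, clusters):
    # Assumption: the first sentence seen stands for the theme.
    return [sentences[cluster[0]] for cluster in clusters]
```

Run on the three PAL sentences from the previous slide plus an unrelated sentence, this groups the PAL sentences into one theme because they share "PAL", "pilots", "strike" and "June".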

Using machine learning (Based on slides from Ani Nenkova)
• Ask people to select sentences
• Use these as training examples for machine learning
  – Each sentence is represented as a number of features
  – Based on the features, distinguish sentences that are appropriate for a summary from sentences that are not
• Run on new inputs
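
To make the pipeline concrete, here is a deliberately tiny Python sketch: each sentence becomes a feature vector, human selections become 0/1 labels, and a classifier is trained and then run on new input. The three features and the choice of a perceptron are assumptions made for illustration; the slides do not name a particular feature set or learning algorithm, and a real system would use far more training data.

```python
def features(sentence, index):
    """Hypothetical feature vector for one sentence."""
    n_words = len(sentence.split())
    return [
        1.0,                                               # bias term
        1.0 if index == 0 else 0.0,                        # is it the first sentence?
        1.0 if "in summary" in sentence.lower() else 0.0,  # cue phrase present?
        min(n_words, 30) / 30.0,                           # normalized length
    ]


def dot(weights, x):
    return sum(w * xi for w, xi in zip(weights, x))


def train_perceptron(examples, epochs=20):
    """examples: list of (feature_vector, label), label 1 = human chose the sentence."""
    weights = [0.0] * len(examples[0][0])
    for _ in range(epochs):
        for x, label in examples:
            predicted = 1 if dot(weights, x) > 0 else 0
            error = label - predicted
            if error:
                weights = [w + error * xi for w, xi in zip(weights, x)]
    return weights


def select_sentences(weights, sentences):
    """Run the trained model on a new document."""
    return [s for i, s in enumerate(sentences) if dot(weights, features(s, i)) > 0]
```

Usage follows the slide: build `examples` from documents whose summary sentences were marked by people, call `train_perceptron`, then call `select_sentences` on unseen documents.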

Machine Learning Difficulties
• Getting the training data is expensive
• Human raters are not very consistent
  – Any of several sentences may be equally good
  – Judgment depends in part on focus and intent of rater
• Can be very domain-specific
  – TF*IDF scores vary widely across content domains
  – Features like position vary widely across format domains: chats vs. newspaper vs. scientific articles

Automatic summary edits (Based on slides from Ani Nenkova)
• Some expressions might not be appropriate in the new context
  – References: he; Putin; Russian Prime Minister Vladimir Putin
  – Discourse connectives: however, moreover, subsequently
• Requires more sophisticated NLP techniques

Before (Based on slides from Ani Nenkova)
Pinochet was placed under arrest in London Friday by British police acting on a warrant issued by a Spanish judge. Pinochet has immunity from prosecution in Chile as a senator-for-life under a new constitution that his government crafted. Pinochet was detained in the London clinic while recovering from back surgery.

After (Based on slides from Ani Nenkova)
Gen. Augusto Pinochet, the former Chilean dictator, was placed under arrest in London Friday by British police acting on a warrant issued by a Spanish judge. Pinochet has immunity from prosecution in Chile as a senator-for-life under a new constitution that his government crafted. Pinochet was detained in the London clinic while recovering from back surgery.

Before (Based on slides from Ani Nenkova)
Turkey has been trying to form a new government since a coalition government led by Yilmaz collapsed last month over allegations that he rigged the sale of a bank. Ecevit refused even to consult with the leader of the Virtue Party during his efforts to form a government. Ecevit must now try to build a government. Demirel consulted Turkey's party leaders immediately after Ecevit gave up.

After (Based on slides from Ani Nenkova)
Turkey has been trying to form a new government since a coalition government led by Prime Minister Mesut Yilmaz collapsed last month over allegations that he rigged the sale of a bank. Premier-designate Bulent Ecevit refused even to consult with the leader of the Virtue Party during his efforts to form a government. Ecevit must now try to build a government. President Suleyman Demirel consulted Turkey's party leaders immediately after Ecevit gave up.
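
The before/after pairs follow a simple pattern: the first mention of each person gets a full description, and later mentions keep the short name. The Python sketch below reproduces just that pattern. It assumes the mapping from short name to full first-mention form is supplied by some other component, such as a named entity tagger, which is not shown; the function name and the mapping format are hypothetical. It does not attempt pronoun resolution or discourse-connective handling.

```python
import re


def rewrite_first_mentions(sentences, full_forms):
    """Expand the first mention of each short name to its full form.

    full_forms maps a short name (e.g. "Pinochet") to the text that should
    replace its first mention; later mentions are left unchanged.
    """
    seen = set()
    rewritten = []
    for sentence in sentences:
        for short, full in full_forms.items():
            if short in seen:
                continue
            pattern = re.compile(r"\b" + re.escape(short) + r"\b")
            if pattern.search(sentence):
                # Skip the substitution if the full form is already present.
                if full not in sentence:
                    sentence = pattern.sub(lambda match: full, sentence, count=1)
                seen.add(short)
        rewritten.append(sentence)
    return rewritten


if __name__ == "__main__":
    before = [
        "Pinochet was placed under arrest in London Friday.",
        "Pinochet has immunity from prosecution in Chile.",
    ]
    forms = {"Pinochet": "Gen. Augusto Pinochet, the former Chilean dictator,"}
    for line in rewrite_first_mentions(before, forms):
        print(line)
```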

Getting There in Some Domains

Some systems to try
• This system has canned demos showing several forms of summarization. There is a downloadable trial version, Windows only.
• These are systems which will take a search term and produce a summary:
  – (slow)
  –
• This system will let you enter your own documents
  – iResearch-reporter.com ( 1 or 3)