Processing of large document collections Part 6 (Text summarization: discourse- based approaches) Helena Ahonen-Myka Spring 2005.

Slides:



Advertisements
Similar presentations
Academic Writing Writing an Abstract.
Advertisements

Specialized models and ranking for coreference resolution Pascal Denis ALPAGE Project Team INRIA Rocquencourt F Le Chesnay, France Jason Baldridge.
QA-LaSIE Components The question document and each candidate answer document pass through all nine components of the QA-LaSIE system in the order shown.
Statistical NLP: Lecture 3
Processing of large document collections Part 6 (Text summarization: discourse- based approaches) Helena Ahonen-Myka Spring 2006.
1 Discourse, coherence and anaphora resolution Lecture 16.
Word Order Choices Chapter 12
Chapter 18: Discourse Tianjun Fu Ling538 Presentation Nov 30th, 2006.
Reference Resolution #1 CSCI-GA.2590 Ralph Grishman NYU.
Predicting Text Quality for Scientific Articles Annie Louis University of Pennsylvania Advisor: Ani Nenkova.
Predicting Text Quality for Scientific Articles AAAI/SIGART-11 Doctoral Consortium Annie Louis : Louis A. and Nenkova A Automatically.
CAP 252 Lecture Topic: Requirement Analysis Class Exercise: Use Cases.
CS 4705 Algorithms for Reference Resolution. Anaphora resolution Finding in a text all the referring expressions that have one and the same denotation.
Generative Models of Discourse Eugene Charniak Brown Laboratory for Linguistic Information Processing BL IP L.
CS 4705 Lecture 21 Algorithms for Reference Resolution.
Information Retrieval and Extraction 資訊檢索與擷取 Chia-Hui Chang National Central University
Indexing Overview Approaches to indexing Automatic indexing Information extraction.
Artificial Intelligence Research Centre Program Systems Institute Russian Academy of Science Pereslavl-Zalessky Russia.
Processing of large document collections Part 1 (Introduction, text representation, text categorization) Helena Ahonen-Myka Spring 2005.
A Light-weight Approach to Coreference Resolution for Named Entities in Text Marin Dimitrov Ontotext Lab, Sirma AI Kalina Bontcheva, Hamish Cunningham,
AQUAINT Kickoff Meeting – December 2001 Integrating Robust Semantics, Event Detection, Information Fusion, and Summarization for Multimedia Question Answering.
MECHANICS OF WRITING C.RAGHAVA RAO.
Empirical Methods in Information Extraction Claire Cardie Appeared in AI Magazine, 18:4, Summarized by Seong-Bae Park.
Please check, just in case…. Announcements: Office hour appointments filling up – get yours today! Don’t delay on getting started on next TWO assignments.
A Compositional Context Sensitive Multi-document Summarizer: Exploring the Factors That Influence Summarization Ani Nenkova, Stanford University Lucy Vanderwende,
Discourse Topics, Linguistics, and Language Teaching Richard Watson Todd King Mongkut’s University of Technology Thonburi arts.kmutt.ac.th/crs/research/
Processing of large document collections Fall 2002, Part 2.
Differential effects of constraints in the processing of Russian cataphora Kazanina and Phillips 2010.
Illinois-Coref: The UI System in the CoNLL-2012 Shared Task Kai-Wei Chang, Rajhans Samdani, Alla Rozovskaya, Mark Sammons, and Dan Roth Supported by ARL,
Scott Duvall, Brett South, Stéphane Meystre A Hands-on Introduction to Natural Language Processing in Healthcare Annotation as a Central Task for Development.
Binding Theory Describing Relationships between Nouns.
A multiple knowledge source algorithm for anaphora resolution Allaoua Refoufi Computer Science Department University of Setif, Setif 19000, Algeria .
Processing of large document collections Part 7 (Text summarization: multi- document summarization, knowledge- rich approaches, current topics) Helena.
Introduction  Information Extraction (IE)  A limited form of “complete text comprehension”  Document 로부터 entity, relationship 을 추출 
SOCIO-COGNITIVE APPROACHES TO TESTING AND ASSESSMENT
Bernd Bruegge & Allen H. Dutoit Object-Oriented Software Engineering: Using UML, Patterns, and Java 1 Object Modeling.
(Spring 2015) Instructor: Craig Duckett Lecture 10: Tuesday, May 12, 2015 Mere Mortals Chap. 7 Summary, Team Work Time 1.
1 Learning Sub-structures of Document Semantic Graphs for Document Summarization 1 Jure Leskovec, 1 Marko Grobelnik, 2 Natasa Milic-Frayling 1 Jozef Stefan.
Mining Topic-Specific Concepts and Definitions on the Web Bing Liu, etc KDD03 CS591CXZ CS591CXZ Web mining: Lexical relationship mining.
Collocations and Information Management Applications Gregor Erbach Saarland University Saarbrücken.
Dr. Francisco Perlas Dumanig
IFS310: Module 6 3/1/2007 Data Modeling and Entity-Relationship Diagrams.
Coherence and Coreference Introduction to Discourse and Dialogue CS 359 October 2, 2001.
Computational linguistics A brief overview. Computational Linguistics might be considered as a synonym of automatic processing of natural language, since.
Minimally Supervised Event Causality Identification Quang Do, Yee Seng, and Dan Roth University of Illinois at Urbana-Champaign 1 EMNLP-2011.
Processing of large document collections Part 5 (Text summarization) Helena Ahonen-Myka Spring 2005.
Automatic Evaluation of Linguistic Quality in Multi- Document Summarization Pitler, Louis, Nenkova 2010 Presented by Dan Feblowitz and Jeremy B. Merrill.
Summarizing Encyclopedic Term Descriptions on the Web from Coling 2004 Atsushi Fujii and Tetsuya Ishikawa Graduate School of Library, Information and Media.
MedKAT Medical Knowledge Analysis Tool December 2009.
Evaluation issues in anaphora resolution and beyond Ruslan Mitkov University of Wolverhampton Faro, 27 June 2002.
Academic vs Media Discourse week 3 B. Mitsikopoulou.
Processing of large document collections Part 1 (Introduction) Helena Ahonen-Myka Spring 2006.
Levels of Linguistic Analysis
Text segmentation Amany AlKhayat. Before any real processing is done, text needs to be segmented at least into linguistic units such as words, punctuation,
Abstracting.  An abstract is a concise and accurate representation of the contents of a document, in a style similar to that of the original document.
An evolutionary approach for improving the quality of automatic summaries Constantin Orasan Research Group in Computational Linguistics School of Humanities,
For Monday Read chapter 26 Homework: –Chapter 23, exercises 8 and 9.
NTNU Speech Lab 1 Topic Themes for Multi-Document Summarization Sanda Harabagiu and Finley Lacatusu Language Computer Corporation Presented by Yi-Ting.
Unit 6 Predicates, Referring Expressions, and Universe of Discourse.
A Database of Narrative Schemas A 2010 paper by Nathaniel Chambers and Dan Jurafsky Presentation by Julia Kelly.
Writing 2 ENG 221 Norah AlFayez. Lecture Contents Revision of Writing 1. Introduction to basic grammar. Parts of speech. Parts of sentences. Subordinate.
Statistical NLP: Lecture 3
INAGO Project Automatic Knowledge Base Generation from Text for Interactive Question Answering.
What is linguistics?.
Entity- & Topic-Based Information Ordering
Improving a Pipeline Architecture for Shallow Discourse Parsing
Social Knowledge Mining
Writing Analytics Clayton Clemens Vive Kumar.
Algorithms for Reference Resolution
Levels of Linguistic Analysis
Presentation transcript:

Processing of large document collections Part 6 (Text summarization: discourse- based approaches) Helena Ahonen-Myka Spring 2005

2 Discourse-based approaches discourse structure appears to play an important role in the strategies used by human abstractors and in the structure of their abstracts an abstract is not just a collection of sentences, but it has an internal structure –-> abstract should be coherent and it should represent some of the argumentation used in the source

3 Boguraev, Kennedy (BG) Boguraev, Kennedy (1999): Salience-based content characterisation of text documents goal: identify those phrasal units across the entire span of the document that best function as representative highlights of the document’s content these phrasal units are called topic stamps a set of topic stamps is called capsule overview

4 BG a capsule overview –not a set/sequence of sentences –a semi-formal (normalised) representation of the document, derived after a process of data reduction over the original text –not always very readable, but still represents the flow of the narrative can be combined with surrounding information to produce more coherent presentation

5 Priest is charged with Pope attack A Spanish priest was charged here today with attempting to murder the Pope. Juan Fernandez Krohn, aged 32, was arrested after a man armed with a bayonet approached the Pope while he was saying prayers at Fatima on Wednesday night. According to the police, Fernandez told the investigators today that he trained for the past six months for the assault. He was alleged to have claimed the Pope ’looked furious’ on hearing the priest’s criticism of his handling of the church’s affairs. If found quilty, the Spaniard faces a prison sentence of years.

6 Capsule overview vs. summary summary could be, e.g. –“A Spanish priest is charged after an unsuccessful murder attempt on the Pope” capsule overview: –A SPANISH PRIEST was charged –Attempting to murder the POPE –HE trained for the assault –POPE furious on hearing PRIEST’S criticisms

7 BG primary consideration: methods should apply to any document type and source (domain independence) also: efficient and scalable technology –shallow syntactic analysis, no comprehensive parsing engine needed

8 BG based on the findings on technical terms –technical terms have such linguistic properties that can be used to find terms automatically in different domains quite reliably –technical terms seem to be topical task of content characterization –identifying phrasal units that have lexico-syntactic properties similar to technical terms discourse properties that signify their status as most prominent

9 Terms as content indicators problems –undergeneration –overgeneration –differentiation

10 Undergeneration a set of phrases should contain an exhaustive description of all the entities that are discussed in the text the set of technical terms has to be extended to include also expressions with pronouns etc.

11 Overgeneration already the set of technical terms can be large extensions make the information overload even worse solution: phrases that refer to one participant in the discourse are combined with referential links

12 Differentiation the same list of terms may be used to describe two documents, even if they, e.g., focus on different subtopics it is necessary to differentiate term sets not only according to their membership, but also according to the relative representativeness of the terms they contain

13 Term sets and coreference classes phrases are extracted using a phrasal grammar (e.g. a noun with modifiers) –also expressions with pronouns and incomplete expressions are extracted –using a (Lingsoft) tagger that provides information about the part of speech, number, gender, and grammatical function of tokens in a text –solves the undergeneration problem

14 Term sets and coreference classes the phrase set has to be reduced to solve the problem of overgeneration -> a smaller set of expressions that uniquely identify the objects referred to in the text application of anaphora resolution –e.g. to which noun a pronoun ’he’ refers to?

15 Resolving coreferences procedure –moving through the text sentence by sentence and analysing the nominal expressions in each sentence from left to right –either an expression is identified as a new participant in the discourse, or it is taken to refer to a previously mentioned referent

16 Resolving coreferences coreference is determined by a 3 step procedure –a set of candidates is collected: all nominals within a local segment of discourse –some candidates are eliminated due to morphological mismatches or syntactical restrictions –remaining candidates are ranked according to their relative salience in the discourse

17 Salience factors sent(term) = 100 iff term is in the current sentence cntx(term) = 50 iff term is in the current discourse segment subj(term) = 80 iff term is a subject acc(term) = 50 iff term is a direct object dat(term) = 40 iff term is an indirect obj...

18 Local salience of a candidate the local salience of a candidate is the sum of the values of the salience factors the most salient candidate is selected as the antecedent if the coreference link cannot be established to some other expression, the nominal is taken to introduce a new referent -> coreferent (equivalence) classes

19 Topic stamps in order to further reduce the referent set, some additional structure has to be imposed –the referent set is ranked according to the salience of its members (= coreference classes) –objects in the centre of discussion have a high degree of salience –salience is measured like local saliency in coreference resolution, but tries to measure the importance of unique referents in the discourse salience of coreference class = sum of the salience factor values of the expressions in the class

20 Priest is charged with Pope attack A Spanish priest was charged here today with attempting to murder the Pope. Juan Fernandez Krohn, aged 32, was arrested after a man armed with a bayonet approached the Pope while he was saying prayers at Fatima on Wednesday night. According to the police, Fernandez told the investigators today that he trained for the past six months for the assault. He was alleged to have claimed the Pope ’looked furious’ on hearing the priest’s criticism of his handling of the church’s affairs. If found quilty, the Spaniard faces a prison sentence of years.

21 Saliency ’priest’ is the primary element –eight references to the same actor in the body of the story –these references occur in important syntactic positions: 5 are subjects of main clauses, 2 are subjects of embedded clauses, 1 is a possessive ’Pope attack’ is also important –’Pope’ occurs 5 times, but not in so important positions (2 are direct objects)

22 Capsule overview capsule overview: –occurrences of the topic stamps with some context A SPANISH PRIEST was charged Attempting to murder the POPE HE trained for the assault POPE furious on hearing PRIEST’S criticisms several granularities for context: phrase, sentence, paragraph…

23 Discourse segments if the intention is to use very concise descriptions of one or two salient phrases, i.e. topic stamps, longer text have to be broken down into smaller segments topically coherent, contiguous segments can be found by using a lexical similarity measure –assumption: distribution of words used changes when the topic changes

24 BG: Summarization process 1.linguistic analysis 2.discourse segmentation 3.extended phrase analysis 4.coreference resolution 5.calculation of discourse salience 6.topic stamp identification 7.capsule overview