
NTNU Speech Lab 1 Topic Themes for Multi-Document Summarization Sanda Harabagiu and Finley Lacatusu Language Computer Corporation Presented by Yi-Ting

NTNU Speech Lab 2 References
– Sanda Harabagiu and Finley Lacatusu, "Topic Themes for Multi-Document Summarization", in Proceedings of SIGIR 2005.
– S. Harabagiu, "Incremental Topic Representations", in Proceedings of the 20th COLING Conference, Geneva, Switzerland, 2004.

NTNU Speech Lab 3 Outline
– Introduction
– Topic representation
– Theme representation
– Using Topic and Theme representations for MDS
– Evaluating MDS
– Conclusions

NTNU Speech Lab 4 Introduction
– One of the problems of the data overload we face today is that many documents cover the same topic.
– Multi-document summaries need to be both informative and coherent; much work in summarization has dealt with these problems separately.
– This paper represents topics as a structure of themes, which dictates both (a) the information content to be included in a multi-document summary (MDS) and (b) the order of the themes that are selected.

NTNU Speech Lab 5 Topic representation (1/12) Five different topic representations (TRs):
– (TR1) representing topics via topic signatures (TS1)
– (TR2) representing topics via enhanced topic signatures (TS2)
– (TR3) representing topics via thematic signatures (TS3)
– (TR4) representing topics by modeling the content structure of documents
– (TR5) representing topics as templates, implemented as frames with slots and fillers

NTNU Speech Lab 6 Topic representation (2/12) TR1. Topic Representation 1:
– The topic signature is represented as TS1 = {topic, <(t1, w1), ..., (tn, wn)>}, where the terms ti are highly correlated with the topic, with association weights wi.
– Term selection and weight association are determined by the likelihood ratio.
– With the likelihood ratio method, the confidence level for a specific likelihood-ratio value is found by (a) looking up the chi-square distribution table, (b) using the confidence value c to select an appropriate cutoff weight, and (c) selecting the terms of the topic signature based on the value c.

NTNU Speech Lab 7 Topic representation (3/12) TR1. Topic Representation 1:
– A set of documents is preclassified into (a) topic-relevant texts and (b) topic-nonrelevant texts.
– Two hypotheses (a scoring sketch follows):
Hypothesis 1 (H1): P(R | ti) = p = P(R | ¬ti), i.e., a document's relevance is independent of the term ti.
Hypothesis 2 (H2): P(R | ti) = p1 ≠ p2 = P(R | ¬ti), i.e., the presence of ti is indicative of relevance.
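The slides omit the actual computation, so here is a minimal Python sketch, not from the paper, of the likelihood-ratio statistic the two hypotheses induce; the counting scheme (term occurrences in relevant vs. nonrelevant documents) and all names are illustrative assumptions.

```python
import math

def log_likelihood(k, n, p):
    """Binomial log-likelihood of k successes in n trials at rate p."""
    p = min(max(p, 1e-12), 1 - 1e-12)  # guard against log(0)
    return k * math.log(p) + (n - k) * math.log(1 - p)

def llr_score(k1, n1, k2, n2):
    """-2 log(lambda) for a term seen in k1 of n1 relevant documents
    and k2 of n2 nonrelevant documents.
    H1: a single common rate p; H2: separate rates p1 != p2."""
    p = (k1 + k2) / (n1 + n2)
    h1 = log_likelihood(k1, n1, p) + log_likelihood(k2, n2, p)
    h2 = log_likelihood(k1, n1, k1 / n1) + log_likelihood(k2, n2, k2 / n2)
    return -2 * (h1 - h2)

# Terms whose statistic exceeds a chi-square cutoff (e.g. 10.83 for
# 99.9% confidence at 1 degree of freedom) enter the topic signature.
print(llr_score(k1=40, n1=100, k2=5, n2=900))
```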

NTNU Speech Lab 8 Topic representation (4/12) TR1. Topic Representation 1:

NTNU Speech Lab 9 Topic representation (5/12) TR2. Topic Representation 2:
– Topics can be represented by identifying the relevant relations that exist between topic signature terms: TS2 = {topic, <r1, ..., rm>}, where ri is a binary relation between two topic concepts.
– Two forms of topic relations are considered: (1) syntax-based relations between a VP and its Subject, Object, or Prepositional Attachments; and (2) C-relations between events and entities that cannot be identified by syntactic constraints but belong to the same context.
– The topic relations are discovered by starting with the topic terms uncovered in TS1 and selecting a seed syntactic relation between them.
– Only nouns and verbs from TS1 are considered.

NTNU Speech Lab 10 Topic representation (6/12) TR2. Topic Representation 2: The iterative process of discovering topic relations has four steps (sketched below):
– Step 1: generate candidate relations.
– Step 2: rank the candidate topic relations by their Relevance-Rate and their Frequency, where Relevance-Rate = Frequency / Count.
– Step 3: select a new topic relation based on the ranking from Step 2.
– Step 4: restart the discovery, using the latest discovered relation to classify relevant documents.
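A hedged Python sketch of this bootstrapping loop. The paper gives no code; here `extract_relations(doc)` is a hypothetical function returning the syntactic relations found in a document, and Frequency/Count are interpreted, as an assumption, as occurrences in the relevant documents versus in the whole collection.

```python
def discover_topic_relations(seed, documents, relevant_docs,
                             extract_relations, iterations=5):
    """Sketch of the four-step iterative relation discovery."""
    discovered = [seed]
    for _ in range(iterations):
        # Step 1: candidate relations from the current relevant documents.
        frequency = {}
        for doc in relevant_docs:
            for rel in extract_relations(doc):
                if rel not in discovered:
                    frequency[rel] = frequency.get(rel, 0) + 1
        if not frequency:
            break
        # Step 2: rank by Relevance-Rate = Frequency / Count, breaking
        # ties by Frequency; Count is the collection-wide frequency.
        def rank(rel):
            count = sum(1 for d in documents if rel in extract_relations(d))
            return frequency[rel] / max(count, 1), frequency[rel]
        # Step 3: select the best new topic relation.
        best = max(frequency, key=rank)
        discovered.append(best)
        # Step 4: re-classify documents as relevant using the new relation.
        relevant_docs = [d for d in documents if best in extract_relations(d)]
    return discovered
```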

NTNU Speech Lab 11 Topic representation (7/12) TR2. Topic Representation 2:

NTNU Speech Lab 12 Topic representation (8/12) TR3. Topic Representation 3:
– A third topic representation is based on the concept of themes: TS3 = {topic, <(Th1, r1), ..., (Thk, rk)>}, where Thi is one of the themes associated with the topic and ri is its rank.
– The discovery of themes is based on (1) a segmentation of documents produced by the TextTiling algorithm (a sketch follows) and (2) a method of (i) assigning labels to themes and (ii) ranking them.
– Four cases for theme labeling:
Case 1: a single topic-relevant relation is identified in the segment.
Case 2: several topic relations are recognized in the segment.
Case 3: multiple topic …
Case 4: the theme contains topic-relevant terms, but no topic relation.
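For the segmentation step, NLTK ships an implementation of TextTiling; a minimal sketch (assuming NLTK and its stopwords corpus are installed) that turns each document into candidate theme segments, which the labeling cases above would then classify:

```python
# Requires: pip install nltk; python -m nltk.downloader stopwords
from nltk.tokenize import TextTilingTokenizer

def theme_segments(text):
    """Split a document into topically coherent segments (candidate
    themes). TextTiling expects paragraphs separated by blank lines."""
    return TextTilingTokenizer().tokenize(text)
```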

NTNU Speech Lab 13 Topic representation (9/12) TR4. Topic Representation 4 (Topics Represented as Content Models):
– The content model is a Hidden Markov Model (HMM) in which states correspond to topic themes and state transitions capture either (1) orderings within that domain or (2) the probability of changing from one given topic theme to another.
– Step 1: initial topic induction by complete-link clustering.
– Step 2: the model states and the emission/transition probabilities are determined.
– Step 3: Viterbi re-estimation.
– The resulting clusters constitute topic representation TR4 (a sketch follows).
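A rough sketch of Steps 1–2 under stated assumptions: segments are already vectorized, SciPy's complete-link clustering stands in for the paper's clustering, and transitions are counted between adjacent segments of each document. The Viterbi re-estimation loop (Step 3) is only indicated in a comment.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

def induce_content_model(segment_vectors, doc_spans, n_states):
    """segment_vectors: 2-D array, one row per text segment.
    doc_spans: (start, end) segment-index ranges, one per document."""
    # Step 1: initial topic induction by complete-link clustering.
    tree = linkage(pdist(segment_vectors, metric="cosine"), method="complete")
    states = fcluster(tree, t=n_states, criterion="maxclust") - 1
    # Step 2: transition probabilities from adjacent segments,
    # with add-one smoothing.
    trans = np.ones((n_states, n_states))
    for start, end in doc_spans:
        for i in range(start, end - 1):
            trans[states[i], states[i + 1]] += 1
    trans /= trans.sum(axis=1, keepdims=True)
    # Step 3 (omitted): re-assign segments with Viterbi decoding and
    # re-estimate the probabilities until the assignments stabilize.
    return states, trans
```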

NTNU Speech Lab 14 Topic representation (10/12) TR4. Topic Representation 4:

NTNU Speech Lab 15 Topic representation (11/12) TR5. Topic Representation 5 (Topics Represented as Extraction Templates):
– Topics can be represented as a set of inter-related concepts, implemented as a frame having slots and fillers.

NTNU Speech Lab 16 Topic representation (12/12) TR5. Topic Representation 5:
– It is important to be able to generate scripts automatically from corpora.
– The IS-A and GLOSS lexical relations found in the WordNet lexical database are used to mine topic relations for topic-relevant terms (see the sketch below).
– The IS-A and GLOSS relations are combined to generate the topical relations.
– An ad-hoc five-step template generation algorithm produces the templates.
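How the two WordNet relations can be tapped is easy to illustrate with NLTK's WordNet interface; this is only a sketch of the raw lookup, not the paper's combination algorithm, and the function name is made up:

```python
# Requires: pip install nltk; python -m nltk.downloader wordnet
from nltk.corpus import wordnet as wn

def related_concepts(term):
    """Collect candidate related concepts for a term from WordNet's
    IS-A links (hypernyms) and from the words of its glosses."""
    related = set()
    for synset in wn.synsets(term):
        for hypernym in synset.hypernyms():          # IS-A relation
            related.update(hypernym.lemma_names())
        related.update(synset.definition().split())  # GLOSS relation
    return related

print(sorted(related_concepts("arrest"))[:10])
```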

NTNU Speech Lab 17 Theme representation (1/4)
– In order to produce exhaustive summaries, MDS systems must be able to identify information that is (1) common to multiple documents in the collection, (2) unique to a single document in the collection, and (3) contradictory to information presented in other documents in the collection.
– Extracting all similar sentences would produce a verbose and repetitive summary.
– Observation: predicate-argument structures form the core of the method for representing themes; current semantic parsers are able to recognize all verbal predicates and their arguments (the recognized predicates are underlined in the example).

NTNU Speech Lab 18 Theme representation (2/4)

NTNU Speech Lab 19 Theme representation (3/4) The theme representation is generated through the following six steps:
– For every sentence in each document from the collection, the predicate-argument structures are identified (this involves the recognition of paraphrases as synonyms or idioms).
– All sentences having at least one common predicate with a common argument are clustered together (a clustering sketch follows); the semantic consistency of the other arguments is also checked.
– Conceptual representations for each cluster are generated.
– Candidate themes are selected by mapping the clusters into (1) topic representation TR3 and (2) topic representation TR4.
– Meaningful relations between the themes, namely cohesion relations and discourse relations, are recognized by naïve Bayes classifiers.
– The themes are structured into a graph.
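A toy Python sketch of the second step. The paper relies on a semantic parser; here `parsed` is assumed to already map sentence ids to their (predicate, argument) pairs, and the consistency check on the remaining arguments is left out.

```python
from collections import defaultdict

def cluster_by_predicate_argument(parsed):
    """Group sentences that share a predicate with a common argument."""
    clusters = defaultdict(list)
    for sent_id, pairs in parsed.items():
        for predicate, argument in pairs:
            clusters[(predicate, argument)].append(sent_id)
    # A theme candidate needs support from more than one sentence.
    return {key: ids for key, ids in clusters.items() if len(ids) > 1}

themes = cluster_by_predicate_argument({
    "d1.s3": {("arrest", "suspect")},
    "d2.s1": {("arrest", "suspect"), ("charge", "suspect")},
})
print(themes)  # {('arrest', 'suspect'): ['d1.s3', 'd2.s1']}
```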

NTNU Speech Lab 20 Theme representation (4/4)

NTNU Speech Lab 21 Using Topic and Theme representations for MDS
– Multi-document summarization is performed by (1) extracting the sentences that contain the most salient information, (2) compressing the sentences to retain the most important pieces of information, and (3) ordering the extracted sentences into the final summary.
– Four extraction methods, two ordering methods, and a separate MDS method are implemented (an extraction sketch follows):
EM1 (TR1), EM2 (TR2), EM3 (TR3), EM4 (TR5)
OM1, OM2
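The slides do not spell out the extraction methods; as a hedged illustration of what a signature-driven extractor in the spirit of EM1 might look like, scoring each sentence by the weights of the topic-signature terms it contains:

```python
def extract_sentences(sentences, signature_weights, k=5):
    """Pick the k sentences whose terms carry the highest total
    topic-signature weight, returned in document order."""
    def score(sentence):
        return sum(signature_weights.get(tok.lower(), 0.0)
                   for tok in sentence.split())
    top = sorted(range(len(sentences)),
                 key=lambda i: score(sentences[i]), reverse=True)[:k]
    return [sentences[i] for i in sorted(top)]
```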

NTNU Speech Lab 22 Evaluating MDS (1/2)

NTNU Speech Lab 23 Evaluating MDS (2/2)

NTNU Speech Lab 24 Conclusions
– In this paper, the authors investigated five topic representations previously used in MDS and proposed a new representation based on topic themes.
– Additionally, representing themes in a graph-like structure improves the quality of information ordering for MDS.