Cover Coefficient based Multidocument Summarization
CS 533 Information Retrieval Systems
Özlem İSTEK, Gönenç ERCAN, Nagehan PALA

Outline
- Summarization
- History and Related Work
- Multidocument Summarization (MS)
- Our Approach: MS via C³M
- Datasets
- Evaluation
- Conclusion and Future Work
- References

Summarization
- The information overload problem: an increasing need for IR and automated text summarization systems.
- Summarization: the process of distilling the most salient information from a source (or sources) for a particular user and task.

Steps for Summarization
- Transform the text into an internal representation.
- Detect the important text units.
- Generate the summary:
  - In extracts, there is no generation, only information ordering and anaphora resolution (or avoiding anaphoric structures).
  - In abstracts, text generation: sentence fusion, paraphrasing, Natural Language Generation.

Summarization Techniques
- Surface level: shallow features
  - Term frequency statistics, position in text, presence of text from the title, cue words/phrases (e.g., "in summary", "important")
- Entity level: model text entities and their relationships
  - Vocabulary overlap, distance between text units, co-occurrence, syntactic structure, coreference
- Discourse level: model the global structure of the text
  - Document outlines, narrative structure
- Hybrid

History and Related Work
- 1950s: first systems, surface level approaches
  - Term frequency (Luhn; Rath)
- 1960s: first entity level approaches
  - Syntactic analysis
  - Surface level: location features (Edmundson 1969)
- 1970s:
  - Surface level: cue phrases (Pollock and Zamora)
  - Entity level
  - First discourse level: story grammars
- 1980s:
  - Entity level (AI): use of scripts, logic and production rules, semantic networks (DeJong 1982; Fum et al. 1985)
  - Hybrid (Aretoulaki 1994)
- From the 1990s on: an explosion of all approaches

Multidocument Summarization (MS)
- Multiple source documents about a single topic or event.
- An application-oriented task, for example:
  - News portals presenting articles from different sources
  - Corporate e-mails organized by subject
  - Medical reports about a patient
- Some real-life systems: Newsblaster, NewsInEssence, NewsFeed Researcher

Our Focus
- Multidocument summarization
- Building extracts for a topic
- Sentence selection (surface level)

Term Frequency and Summarization
- Salient: obvious, noticeable.
- Salient sentences should have more terms in common with other sentences.
- Two sentences are talking about the same fact if they share too many common terms (repetition).
- Select salient sentences, but keep inter-sentence similarity low (a sketch of the sentence-by-term matrix this relies on follows below).
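To make the term-overlap idea concrete, here is a minimal sketch (not part of the original slides) of how a sentence-by-term matrix D could be built in plain Python; the whitespace tokenizer, punctuation stripping, and stop-word set are placeholder assumptions for whatever preprocessing the real system uses.

```python
from collections import Counter

def sentence_term_matrix(sentences, stopwords=frozenset()):
    """Build a sentence-by-term count matrix D (list of rows) and the term vocabulary.
    `sentences` is a list of raw sentence strings; tokenization is a naive whitespace split."""
    tokenized = [[w.lower().strip('.,;:!?') for w in s.split()] for s in sentences]
    tokenized = [[w for w in toks if w and w not in stopwords] for toks in tokenized]
    vocab = sorted({w for toks in tokenized for w in toks})
    D = []
    for toks in tokenized:
        counts = Counter(toks)
        D.append([counts.get(w, 0) for w in vocab])  # one row per sentence
    return D, vocab
```

Each row of D corresponds to a sentence and each column to a term; this matrix is the input to the cover coefficient computation described on the next slide.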

C³M vs. CC Summarization

Summarization based on Cover Coefficient:
- Aim: to select the most representative sentences while avoiding redundancy in the summary
- Uses a sentence-by-term matrix
- Creates a sentence-by-sentence C matrix
- Calculates the number of summary sentences using a compression percentage (e.g., 10%)
- Summary power function: s_i = ψ_i
- Select the sentences with the highest s_i values that are dissimilar to already selected sentences

Clustering based on C³M:
- Aim: to cluster
- Uses a document-by-term matrix
- Creates a document-by-document C matrix
- Calculates the number of clusters
- Seed power function: p_i = δ_i × ψ_i × x_di
- Select the seed documents with the highest p_i values
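The slides contain no code; the following is a minimal NumPy sketch of the cover coefficient computation as defined by Can and Özkarahan (1990). Taking the summary power to be the coupling coefficient ψ_i = 1 − c_ii is an assumption here, since the formula symbols on the original slide did not survive the transcript.

```python
import numpy as np

def cover_coefficient_matrix(D):
    """C3M cover coefficient matrix: c_ij = alpha_i * sum_k (d_ik * beta_k * d_jk),
    where alpha_i = 1 / (row sum of D) and beta_k = 1 / (column sum of D).
    D is the sentence-by-term (or document-by-term) weight matrix;
    assumes D has no all-zero rows or columns."""
    D = np.asarray(D, dtype=float)
    alpha = 1.0 / D.sum(axis=1)      # per-sentence (row) normalizers
    beta = 1.0 / D.sum(axis=0)       # per-term (column) normalizers
    return (alpha[:, None] * D) @ (D * beta).T

def summary_power(C):
    """Assumed summary power s_i = psi_i = 1 - c_ii (coupling coefficient)."""
    return 1.0 - np.diag(C)
```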

An Example
C = [5 × 5 sentence-by-sentence cover coefficient matrix shown on the slide]
Sorted s_i values:
- s3 => 0.64
- s1 => 0.64
- s5 => 0.63
- s4 => 0.58
- s2 => 0.44
Let's form a summary of 3 sentences!

An Example (Step 1)
Sorted s_i values:
- s3 => 0.64
- s1 => 0.64
- s5 => 0.63
- s4 => 0.58
- s2 => 0.44
Summary sentences: s3
Choose the sentence that is most similar to the others.

An Example (Step 2)
Sorted s_i values:
- s3 => 0.64
- s1 => 0.64
- s5 => 0.63
- s4 => 0.58
- s2 => 0.44
Summary sentences: s3, s1
s1 is next according to the s_i values. Check whether s1 is too similar to s3, which is already in the summary; include it in the summary if s3 does not cover s1.
AC3 = (c31 + c32 + c34 + c35) / (5 - 1) = (1 - c33) / (5 - 1) = 0.35
(c31 = 0.194) < (AC3 = 0.35), so s1 is added.

An Example (Step 3)
Sorted s_i values:
- s3 => 0.64
- s1 => 0.64
- s5 => 0.63
- s4 => 0.58
- s2 => 0.44
Summary sentences: s3, s1, s5
s5 is next. Check against s3: (c35 = 0.083) < (AC3 = 0.35), OK. Check against s1: (c15 = 0.083) < (AC1 = 0.16), OK.
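The three steps above follow a simple greedy rule: take candidates in decreasing order of s_i and skip any sentence that an already selected sentence covers (c_ij ≥ AC_i). The sketch below is an illustration of that rule, not the authors' code, and reuses the hypothetical cover_coefficient_matrix and summary_power helpers sketched earlier.

```python
import numpy as np

def select_summary_sentences(C, s, k):
    """Greedy CC-based selection of k summary sentences.
    C: m x m cover coefficient matrix, s: salience scores per sentence, k: summary length."""
    m = C.shape[0]
    AC = (1.0 - np.diag(C)) / (m - 1)      # average coverage of sentence i by the others
    order = np.argsort(-np.asarray(s))     # candidates in decreasing order of s_i
    selected = []
    for j in order:
        if len(selected) == k:
            break
        # keep j only if no already-selected sentence i covers it (i.e., c_ij >= AC_i)
        if all(C[i, j] < AC[i] for i in selected):
            selected.append(int(j))
    return selected

# Example usage (D is a sentence-by-term matrix):
# C = cover_coefficient_matrix(D)
# chosen = select_summary_sentences(C, summary_power(C), k=3)
```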

Some Possible Improvements
- Integrate the position of the sentence in its source document.
- Effect of stemming
- Effect of the stop-word list
- Integrating the time of the source document(s) (no promises)

Integrating Position Feature
- In C³M, the probability distribution for α_i is a normal distribution.
- Instead, use a probability distribution where sentences that appear in the first paragraphs are more probable.
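One way this could be realized (purely illustrative; the slides do not give a formula) is to multiply each sentence's score by a position prior that decays toward later positions in the source document. The decay rate below is an arbitrary assumption.

```python
import math

def position_prior(position, total, decay=2.0):
    """Illustrative position weight: 1.0 for the first sentence,
    decaying exponentially for later positions. `position` is 0-based."""
    return math.exp(-decay * position / max(total - 1, 1))

# e.g. reweight the summary power of a sentence at position `pos` among `n` sentences:
# s[i] *= position_prior(pos, n)
```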

Datasets
We will use two datasets:
- The DUC (Document Understanding Conferences) dataset for English multidocument summarization.
- The Turkish New Event Detection and Tracking dataset for Turkish multidocument summarization.

Evaluation
Two methods for evaluation:
1. We will use this method for English multidocument summarization. The overlap between the model summaries prepared by human judges and the system-generated summary gives the accuracy of the summary.
  - ROUGE (Recall-Oriented Understudy for Gisting Evaluation) is the official scoring technique for the Document Understanding Conference (DUC).
  - ROUGE uses different measures: ROUGE-N uses n-grams to measure the overlap, ROUGE-L uses the Longest Common Subsequence, and ROUGE-W uses the Weighted Longest Common Subsequence.
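For intuition only, here is a toy sketch of ROUGE-N recall over pre-tokenized text; the actual evaluation would use Lin's official ROUGE package, which additionally handles stemming, stop words, and multiple reference summaries.

```python
from collections import Counter

def rouge_n(candidate_tokens, reference_tokens, n=2):
    """Toy ROUGE-N recall: overlapping n-grams divided by n-grams in the reference."""
    def ngrams(tokens):
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    cand, ref = ngrams(candidate_tokens), ngrams(reference_tokens)
    overlap = sum(min(count, cand[gram]) for gram, count in ref.items())
    total = sum(ref.values())
    return overlap / total if total else 0.0
```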

Evaluation
2. We will use this method for Turkish multidocument summarization. We will add the extracted summaries to the collection as new documents and select these summary documents as the centroids of clusters. A centroid-based clustering algorithm is then used for clustering. If the documents are attracted to the centroid that is their own summary, we can say our summarization approach is good (illustrated below, with a sketch after the figure).

Evaluation
[Figure: summary documents selected as the centroids of two clusters]
c1 is the summary of d1, d2 and d3. c2 is the summary of d4, d5, d6 and d7.
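A minimal sketch of the centroid-attraction check, assuming term-frequency vectors and cosine similarity; the slides do not name a specific clustering algorithm or similarity measure, so both are assumptions here.

```python
import numpy as np

def attraction_accuracy(doc_vectors, centroid_vectors, membership):
    """Fraction of documents whose nearest centroid (by cosine similarity) is the
    centroid built from their own summary.
    doc_vectors: n_docs x n_terms; centroid_vectors: n_centroids x n_terms;
    membership[i] = index of the centroid that summarizes document i.
    Assumes no all-zero vectors."""
    D = np.asarray(doc_vectors, dtype=float)
    C = np.asarray(centroid_vectors, dtype=float)
    D = D / np.linalg.norm(D, axis=1, keepdims=True)
    C = C / np.linalg.norm(C, axis=1, keepdims=True)
    nearest = np.argmax(D @ C.T, axis=1)   # most similar centroid per document
    return float(np.mean(nearest == np.asarray(membership)))
```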

Conclusion and Future Work
- Multidocument summarization using the cover coefficients of sentences is an intuitive and, to our knowledge, new approach.
- This has its own advantages and disadvantages: it is fun because it is new, but we are anxious because we have not seen any resulting summaries yet.

Conclusion and Future Work
After implementing the CC-based summarization, we can try different methods on the same multidocument set.
First method:
- Form a sentence-by-term matrix from all sentences of all documents.
- Then apply CC-based summarization.

Conclusion and Future Work
Second method:
- Cluster the documents using C³M.
- Then apply the first method to each cluster.
- Combine the extracted summaries of the clusters to form one summary (a pipeline sketch follows below).
Third method:
- Summarize each document using the first method; the only difference is that a sentence-by-term matrix is constructed for the sentences of each document separately.
- Then take the per-document summaries as documents and apply the first approach.
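Purely as an illustration of how the second method could be wired together, the sketch below reuses the hypothetical helpers introduced earlier (sentence_term_matrix, cover_coefficient_matrix, summary_power, select_summary_sentences); the C³M document-clustering step itself is left as an input.

```python
def second_method(doc_sentence_lists, cluster_of_doc, compression=0.10):
    """doc_sentence_lists: one list of sentence strings per document.
    cluster_of_doc: cluster id of each document, as produced by a C3M clustering step (not shown).
    Returns a single combined extract, built cluster by cluster."""
    summary = []
    for cluster_id in sorted(set(cluster_of_doc)):
        # pool the sentences of all documents that fall into this cluster
        sentences = [s for doc, c in zip(doc_sentence_lists, cluster_of_doc)
                     if c == cluster_id for s in doc]
        D, _vocab = sentence_term_matrix(sentences)
        C = cover_coefficient_matrix(D)
        k = max(1, int(round(compression * len(sentences))))
        summary.extend(sentences[i] for i in
                       select_summary_sentences(C, summary_power(C), k))
    return summary
```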

References
- Can, F. and Özkarahan, E. A. Concepts and Effectiveness of the Cover-Coefficient-Based Clustering Methodology for Text Databases. ACM Transactions on Database Systems, 15(4), 1990.
- Lin, C.-Y. ROUGE: A Package for Automatic Evaluation of Summaries.
- Luhn, H. P. The Automatic Creation of Literature Abstracts.
- Rath, G. J., Resnick, A. and Savage, T. R. The Formation of Abstracts by the Selection of Sentences.
- Edmundson, H. P. New Methods in Automatic Extracting.
- Pollock, J. J. and Zamora, A. Automatic Abstracting Research at Chemical Abstracts Service.
- van Dijk, T. A. Macrostructures: An Interdisciplinary Study of Global Structures in Discourse, Interaction and Cognition.
- DeJong, G. F. An Overview of the FRUMP System.
- Fum, D., Guida, G. and Tasso, C. Evaluating Importance: A Step Towards Text Summarization.
- Aretoulaki, M. Towards a Hybrid Abstract Generation System.

Questions Thank you.