Processing of large document collections Part 7 (Text summarization: multi- document summarization, knowledge- rich approaches, current topics) Helena.

Slides:



Advertisements
Similar presentations
Evaluating the Robustness of Learning from Implicit Feedback Filip Radlinski Thorsten Joachims Presentation by Dinesh Bhirud
Advertisements

Pseudo-Relevance Feedback For Multimedia Retrieval By Rong Yan, Alexander G. and Rong Jin Mwangi S. Kariuki
Query Chain Focused Summarization Tal Baumel, Rafi Cohen, Michael Elhadad Jan 2014.
Comparing Twitter Summarization Algorithms for Multiple Post Summaries David Inouye and Jugal K. Kalita SocialCom May 10 Hyewon Lim.
An Approach to Evaluate Data Trustworthiness Based on Data Provenance Department of Computer Science Purdue University.
Information Retrieval in Practice
Search Engines and Information Retrieval
Learning Techniques for Information Retrieval Perceptron algorithm Least mean.
Re-ranking Documents Segments To Improve Access To Relevant Content in Information Retrieval Gary Madden Applied Computational Linguistics Dublin City.
Reference Collections: Task Characteristics. TREC Collection Text REtrieval Conference (TREC) –sponsored by NIST and DARPA (1992-?) Comparing approaches.
ITCS 6010 Natural Language Understanding. Natural Language Processing What is it? Studies the problems inherent in the processing and manipulation of.
1 Discovering Unexpected Information from Your Competitor’s Web Sites Bing Liu, Yiming Ma, Philip S. Yu Héctor A. Villa Martínez.
Latent Semantic Analysis (LSA). Introduction to LSA Learning Model Uses Singular Value Decomposition (SVD) to simulate human learning of word and passage.
Learning Techniques for Information Retrieval We cover 1.Perceptron algorithm 2.Least mean square algorithm 3.Chapter 5.2 User relevance feedback (pp )
Lecture 09 Clustering-based Learning
Chapter 5: Information Retrieval and Web Search
Overview of Search Engines
Artificial Intelligence Research Centre Program Systems Institute Russian Academy of Science Pereslavl-Zalessky Russia.
A Privacy Preserving Efficient Protocol for Semantic Similarity Join Using Long String Attributes Bilal Hawashin, Farshad Fotouhi Traian Marius Truta Department.
Utilising software to enhance your research Eamonn Hynes 5 th November, 2012.
AQUAINT Kickoff Meeting – December 2001 Integrating Robust Semantics, Event Detection, Information Fusion, and Summarization for Multimedia Question Answering.
Rui Yan, Yan Zhang Peking University
Body Paragraphs Writing body paragraphs is always a T.R.E.A.T. T= Transition R= Reason/point from thesis/claim E= Evidence (quote from the text) A= Answer.
Challenges in Information Retrieval and Language Modeling Michael Shepherd Dalhousie University Halifax, NS Canada.
Search Engines and Information Retrieval Chapter 1.
Processing of large document collections Part 2 (Text categorization) Helena Ahonen-Myka Spring 2006.
1 A study on automatically extracted keywords in text categorization Authors:Anette Hulth and Be´ata B. Megyesi From:ACL 2006 Reporter: 陳永祥 Date:2007/10/16.
A Compositional Context Sensitive Multi-document Summarizer: Exploring the Factors That Influence Summarization Ani Nenkova, Stanford University Lucy Vanderwende,
Knowledge-rich approaches for text summarization Minna Vasankari
Newsjunkie: Providing Personalized Newsfeeds via Analysis of Information Novelty Gabrilovich et.al WWW2004.
AnswerBus Question Answering System Zhiping Zheng School of Information, University of Michigan HLT 2002.
S EGMENTATION FOR H ANDWRITTEN D OCUMENTS Omar Alaql Fab. 20, 2014.
Chapter 2 Architecture of a Search Engine. Search Engine Architecture n A software architecture consists of software components, the interfaces provided.
Tamil Summary Generation for a Cricket Match
LexRank: Graph-based Centrality as Salience in Text Summarization
A Machine Learning Approach to Sentence Ordering for Multidocument Summarization and Its Evaluation D. Bollegala, N. Okazaki and M. Ishizuka The University.
1 Single Table Queries. 2 Objectives  SELECT, WHERE  AND / OR / NOT conditions  Computed columns  LIKE, IN, BETWEEN operators  ORDER BY, GROUP BY,
Generic text summarization using relevance measure and latent semantic analysis Gong Yihong and Xin Liu SIGIR, April 2015 Yubin Lim.
Chapter 6: Information Retrieval and Web Search
Introduction to Digital Libraries hussein suleman uct cs honours 2003.
LexPageRank: Prestige in Multi- Document Text Summarization Gunes Erkan and Dragomir R. Radev Department of EECS, School of Information University of Michigan.
Today Ensemble Methods. Recap of the course. Classifier Fusion
Improving Web Search Results Using Affinity Graph Benyu Zhang, Hua Li, Yi Liu, Lei Ji, Wensi Xi, Weiguo Fan, Zheng Chen, Wei-Ying Ma Microsoft Research.
Collocations and Information Management Applications Gregor Erbach Saarland University Saarbrücken.
Enhancing Cluster Labeling Using Wikipedia David Carmel, Haggai Roitman, Naama Zwerdling IBM Research Lab (SIGIR’09) Date: 11/09/2009 Speaker: Cho, Chin.
LANGUAGE MODELS FOR RELEVANCE FEEDBACK Lee Won Hee.
Query Based Event Extraction along a Timeline H.L. Chieu and Y.K. Lee DSO National Laboratories, Singapore (SIGIR 2004)
LOGO Summarizing Conversations with Clue Words Giuseppe Carenini, Raymond T. Ng, Xiaodong Zhou (WWW ’07) Advisor : Dr. Koh Jia-Ling Speaker : Tu.
Creating Subjective and Objective Sentence Classifier from Unannotated Texts Janyce Wiebe and Ellen Riloff Department of Computer Science University of.
Information Retrieval
Learning to Detect Faces A Large-Scale Application of Machine Learning (This material is not in the text: for further information see the paper by P.
Threshold Setting and Performance Monitoring for Novel Text Mining Wenyin Tang and Flora S. Tsai School of Electrical and Electronic Engineering Nanyang.
Mining Dependency Relations for Query Expansion in Passage Retrieval Renxu Sun, Chai-Huat Ong, Tat-Seng Chua National University of Singapore SIGIR2006.
Processing of large document collections Part 1 (Introduction) Helena Ahonen-Myka Spring 2006.
Improved Video Categorization from Text Metadata and User Comments ACM SIGIR 2011:Research and development in Information Retrieval - Katja Filippova -
1 Centroid Based multi-document summarization: Efficient sentence extraction method Presenter: Chen Yi-Ting.
Event-Based Extractive Summarization E. Filatova and V. Hatzivassiloglou Department of Computer Science Columbia University (ACL 2004)
DISTRIBUTED INFORMATION RETRIEVAL Lee Won Hee.
LexPageRank: Prestige in Multi-Document Text Summarization Gunes Erkan, Dragomir R. Radev (EMNLP 2004)
An evolutionary approach for improving the quality of automatic summaries Constantin Orasan Research Group in Computational Linguistics School of Humanities,
1 ICASSP Paper Survey Presenter: Chen Yi-Ting. 2 Improved Spoken Document Retrieval With Dynamic Key Term Lexicon and Probabilistic Latent Semantic Analysis.
The P YTHY Summarization System: Microsoft Research at DUC 2007 Kristina Toutanova, Chris Brockett, Michael Gamon, Jagadeesh Jagarlamudi, Hisami Suzuki,
Instance Discovery and Schema Matching With Applications to Biological Deep Web Data Integration Tantan Liu, Fan Wang, Gagan Agrawal {liut, wangfa,
哈工大信息检索研究室 HITIR ’ s Update Summary at TAC2008 Extractive Content Selection Using Evolutionary Manifold-ranking and Spectral Clustering Reporter: Ph.d.
CS791 - Technologies of Google Spring A Web­based Kernel Function for Measuring the Similarity of Short Text Snippets By Mehran Sahami, Timothy.
A Survey on Automatic Text Summarization Dipanjan Das André F. T. Martins Tolga Çekiç
Extractive Summarisation via Sentence Removal: Condensing Relevant Sentences into a Short Summary Marco Bonzanini, Miguel Martinez-Alvarez, and Thomas.
Information Retrieval in Practice
Introduction to Search Engines
Presented by Nick Janus
Presentation transcript:

Processing of large document collections Part 7 (Text summarization: multi- document summarization, knowledge- rich approaches, current topics) Helena Ahonen-Myka Spring 2005

2 In this part… Summarization of multiple documents –MEAD Knowledge-rich approaches –STREAK Current topics in text summarization

3 Summarization of multiple documents Radev, et al (2004): Centroid-based summarization of multiple documents idea: summarizing news events –news stories come from several sources (e.g. news agencies) –all news stories talking about the same event (e.g. accident, earthquake,…) are clustered stories in one cluster repeat (partially) the same content stories have a chronological order (time stamp) –one summary for each cluster is created a reader does not have to read the same content several times

4 Centroid-based clustering each document is a tf * idf weighted vector documents are clustered 1.cluster centroid = first document 2.a new document D is compared to each centroid C 1.if sim(C, D) > threshold, D is included in C, and C is updated 2.if D is not included in any cluster, it becomes the centroid of a new cluster

5 MEAD extraction algorithm sentences are ranked according to a set of features input: –a cluster of documents, segmented into n sentences –compression rate r output: –a sequence of n x r sentences from the original documents presented in the same order as in the input documents

6 Features three features: –centroid value C i for sentence S i is the sum of the centroid values of all words in the sentence the centroid vector of the cluster represents importance of words for all the documents in the cluster

7 Features –positional value: C max = score of the highest-ranking sentence in the document according to the centroid value the i th sentence in a document gets a value

8 Features first sentence overlap F i : –the inner product of the current sentence S i and the first sentence of the document combined score of sentence S i : linear combination of three features –score(S i ) = w c C i + w p P i + w f F i

9 Cross-sentence dependencies scores of sentences can be further refined after considering possible cross-sentence dependencies, for instance –repeated content in sentences redundant content can be removed –chronological ordering earlier or later sentences can be preferred –source preferences e.g. Helsingin sanomat is trusted more than Iltalehti…

10 Repeated content 1.John Doe was found guilty of the murder. 2.The court found John Doe guilty of the murder of Jane Doe last August and sentenced him to life. (2. presents additional content -> 1. redundant) 3.Eighteen decapitated bodies have been found in a mass grave in northern Algeria, press reports said Thursday. 4.Algerian newspapers have reported on Thursday that 18 decapitulated bodies have been found by the authorities. (equivalent content)

11 Reranking based on repeated content redundancy penalty R ij for each sentence i which overlaps with sentences j that have higher score value redundancy penalty for a sentence: max (R ij ) new_score(s i ) = w c C i + w p P i + w f F i – w R R i all sentences are reranked by new_score and a new extract in created iteration until reranking does not result in a different extract

12 Knowledge-rich approaches structured information can be used as the starting point for summarization –structured information: e.g. data and knowledge bases –may have been produced by processing input text (information extraction) summarizer does not have to address the linguistic complexities and variability of the input, but also the structure of the input text is not available

13 Knowledge-rich approaches there is a need for measures of salience and relevance that are dependent on the knowledge source addressing cohesion, and fluency becomes the entire responsibility of the generator

14 STREAK McKeown, Robin, Kukich (1995): Generating concise natural language summaries goal: folding information from multiple facts into a single sentence using concise linguistic constructions

15 STREAK produces summaries of basketball games first creates a draft of essential facts then uses revision rules constrained by the draft wording to add in additional facts as the text allows –revision rules have been extracted by studying human-written game summaries

16 STREAK input: –a set of box scores for a basketball game –historical information (from a database) task: –summarize the highlights of the game, underscoring their significance in the light of previous games output: –a short summary: a few sentences

17 STREAK the box score input is represented as a conceptual network that expresses relations between what were the columns and rows of the table essential facts: the game result, its location, date and at least one final game statistic (the most remarkable statistic of a winning team player)

18 STREAK essential facts can be obtained directly from the box-score in addition, other potential facts –other notable game statistics of individual players - from box-score –game result streaks (Utah recorded its fourth straight win) - historical –extremum performances such as maximums or minimums - historical

19 STREAK essential facts are always included potential facts are included if there is space –decision on the potential facts to be included could be based on the possibility to combine the facts to the essential information in cohesive and stylistically successful ways

20 STREAK given facts: –Karl Malone scored 39 points. –Karl Malone’s 39 point performance is equal to his season high a single sentence is produced: –Karl Malone tied his season high with 39 points

21 Current topics in text summarization multi-document summarization non-extrative summarization (abstracts) spoken language (incl. dialogue) summarization multilingual summarization integrating of question answering and text summarization web-based & multimedia summarization evaluation of summarization systems –Document Understanding Conferences (DUC)

22 Mapping to the information retrieval process information need matching query reformulation documents document representations query result