D OCUMENT S UMMARIZATION Abhirut Gupta Mandar Joshi Piyush Dungarwal.

Slides:



Advertisements
Similar presentations
Yansong Feng and Mirella Lapata
Advertisements

Effective Keyword Based Selection of Relational Databases Bei Yu, Guoliang Li, Karen Sollins, Anthony K.H Tung.
WWW 2014 Seoul, April 8 th SNOW 2014 Data Challenge Two-level message clustering for topic detection in Twitter Georgios Petkos, Symeon Papadopoulos, Yiannis.
Comparing Twitter Summarization Algorithms for Multiple Post Summaries David Inouye and Jugal K. Kalita SocialCom May 10 Hyewon Lim.
LEDIR : An Unsupervised Algorithm for Learning Directionality of Inference Rules Advisor: Hsin-His Chen Reporter: Chi-Hsin Yu Date: From EMNLP.
Query Dependent Pseudo-Relevance Feedback based on Wikipedia SIGIR ‘09 Advisor: Dr. Koh Jia-Ling Speaker: Lin, Yi-Jhen Date: 2010/01/24 1.
Evaluating Search Engine
Predicting Text Quality for Scientific Articles Annie Louis University of Pennsylvania Advisor: Ani Nenkova.
Predicting Text Quality for Scientific Articles AAAI/SIGART-11 Doctoral Consortium Annie Louis : Louis A. and Nenkova A Automatically.
Orange: a Method for Evaluating Automatic Evaluation Metrics for Machine Translation Chin-Yew Lin & Franz Josef Och (presented by Bilmes) or Orange: a.
Approaches to automatic summarization Lecture 5. Types of summaries Extracts – Sentences from the original document are displayed together to form a summary.
Information retrieval Finding relevant data using irrelevant keys Example: database of photographic images sorted by number, date. DBMS: Well structured.
1 Today  Tools (Yves)  Efficient Web Browsing on Hand Held Devices (Shrenik)  Web Page Summarization using Click- through Data (Kathy)  On the Summarization.
Scalable Text Mining with Sparse Generative Models
Using Social Networking Techniques in Text Mining Document Summarization.
Query session guided multi- document summarization THESIS PRESENTATION BY TAL BAUMEL ADVISOR: PROF. MICHAEL ELHADAD.
CS344: Introduction to Artificial Intelligence Vishal Vachhani M.Tech, CSE Lecture 34-35: CLIR and Ranking in IR.
Mining the Peanut Gallery: Opinion Extraction and Semantic Classification of Product Reviews K. Dave et al, WWW 2003, citations Presented by Sarah.
Challenges in Information Retrieval and Language Modeling Michael Shepherd Dalhousie University Halifax, NS Canada.
Thumbs Up or Thumbs Down? Semantic Orientation Applied to Unsupervised Classification on Reviews Peter D. Turney Institute for Information Technology National.
C OLLECTIVE ANNOTATION OF WIKIPEDIA ENTITIES IN WEB TEXT - Presented by Avinash S Bharadwaj ( )
1 Wikification CSE 6339 (Section 002) Abhijit Tendulkar.
Iterative Readability Computation for Domain-Specific Resources By Jin Zhao and Min-Yen Kan 11/06/2010.
Natural Language Processing Introduction. 2 Natural Language Processing We’re going to study what goes into getting computers to perform useful and interesting.
PAUL ALEXANDRU CHIRITA STEFANIA COSTACHE SIEGFRIED HANDSCHUH WOLFGANG NEJDL 1* L3S RESEARCH CENTER 2* NATIONAL UNIVERSITY OF IRELAND PROCEEDINGS OF THE.
1 Text Summarization: News and Beyond Kathleen McKeown Department of Computer Science Columbia University.
UOS 1 Ontology Based Personalized Search Zhang Tao The University of Seoul.
When Experts Agree: Using Non-Affiliated Experts To Rank Popular Topics Meital Aizen.
Processing of large document collections Part 7 (Text summarization: multi- document summarization, knowledge- rich approaches, current topics) Helena.
Automatic Detection of Tags for Political Blogs Khairun-nisa Hassanali Vasileios Hatzivassiloglou The University.
1 University of Palestine Topics In CIS ITBS 3202 Ms. Eman Alajrami 2 nd Semester
Similar Document Search and Recommendation Vidhya Govindaraju, Krishnan Ramanathan HP Labs, Bangalore, India JOURNAL OF EMERGING TECHNOLOGIES IN WEB INTELLIGENCE.
This work is supported by the Intelligence Advanced Research Projects Activity (IARPA) via Department of Interior National Business Center contract number.
LexRank: Graph-based Centrality as Salience in Text Summarization
A Machine Learning Approach to Sentence Ordering for Multidocument Summarization and Its Evaluation D. Bollegala, N. Okazaki and M. Ishizuka The University.
Math Information Retrieval Zhao Jin. Zhao Jin. Math Information Retrieval Examples: –Looking for formulas –Collect teaching resources –Keeping updated.
Domain-Specific Iterative Readability Computation Jin Zhao 13/05/2011.
CS 533 Information Retrieval Systems.  Introduction  Connectivity Analysis  Kleinberg’s Algorithm  Problems Encountered  Improved Connectivity Analysis.
From Social Bookmarking to Social Summarization: An Experiment in Community-Based Summary Generation Oisin Boydell, Barry Smyth Adaptive Information Cluster,
Introduction to Digital Libraries hussein suleman uct cs honours 2003.
Mining Topic-Specific Concepts and Definitions on the Web Bing Liu, etc KDD03 CS591CXZ CS591CXZ Web mining: Lexical relationship mining.
LexPageRank: Prestige in Multi- Document Text Summarization Gunes Erkan and Dragomir R. Radev Department of EECS, School of Information University of Michigan.
Binxing Jiao et. al (SIGIR ’10) Presenter : Lin, Yi-Jhen Advisor: Dr. Koh. Jia-ling Date: 2011/4/25 VISUAL SUMMARIZATION OF WEB PAGES.
Contextual Ranking of Keywords Using Click Data Utku Irmak, Vadim von Brzeski, Reiner Kraft Yahoo! Inc ICDE 09’ Datamining session Summarized.
Collocations and Information Management Applications Gregor Erbach Saarland University Saarbrücken.
BioSnowball: Automated Population of Wikis (KDD ‘10) Advisor: Dr. Koh, Jia-Ling Speaker: Lin, Yi-Jhen Date: 2010/11/30 1.
Enhancing Cluster Labeling Using Wikipedia David Carmel, Haggai Roitman, Naama Zwerdling IBM Research Lab (SIGIR’09) Date: 11/09/2009 Speaker: Cho, Chin.
LANGUAGE MODELS FOR RELEVANCE FEEDBACK Lee Won Hee.
A Scalable Machine Learning Approach for Semi-Structured Named Entity Recognition Utku Irmak(Yahoo! Labs) Reiner Kraft(Yahoo! Inc.) WWW 2010(Information.
Chapter 8 Evaluating Search Engine. Evaluation n Evaluation is key to building effective and efficient search engines  Measurement usually carried out.
WEB CONTENT SUMMARIZATION Timothy Washington A Look at Algorithms, Methodologies, and Live Systems.
Authors: Marius Pasca and Benjamin Van Durme Presented by Bonan Min Weakly-Supervised Acquisition of Open- Domain Classes and Class Attributes from Web.
Methods for Automatic Evaluation of Sentence Extract Summaries * G.Ravindra +, N.Balakrishnan +, K.R.Ramakrishnan * Supercomputer Education & Research.
Automatic Video Tagging using Content Redundancy Stefan Siersdorfer 1, Jose San Pedro 2, Mark Sanderson 2 1 L3S Research Center, Germany 2 University of.
2015/12/121 Extracting Key Terms From Noisy and Multi-theme Documents Maria Grineva, Maxim Grinev and Dmitry Lizorkin Proceeding of the 18th International.
Using a Named Entity Tagger to Generalise Surface Matching Text Patterns for Question Answering Mark A. Greenwood and Robert Gaizauskas Natural Language.
Creating Subjective and Objective Sentence Classifier from Unannotated Texts Janyce Wiebe and Ellen Riloff Department of Computer Science University of.
Recognizing Stances in Online Debates Unsupervised opinion analysis method for debate-side classification. Mine the web to learn associations that are.
UWMS Data Mining Workshop Content Analysis: Automated Summarizing Prof. Marti Hearst SIMS 202, Lecture 16.
26/01/20161Gianluca Demartini Ranking Categories for Faceted Search Gianluca Demartini L3S Research Seminars Hannover, 09 June 2006.
Event-Based Extractive Summarization E. Filatova and V. Hatzivassiloglou Department of Computer Science Columbia University (ACL 2004)
LexPageRank: Prestige in Multi-Document Text Summarization Gunes Erkan, Dragomir R. Radev (EMNLP 2004)
Single Document Key phrase Extraction Using Neighborhood Knowledge.
MMM2005The Chinese University of Hong Kong MMM2005 The Chinese University of Hong Kong 1 Video Summarization Using Mutual Reinforcement Principle and Shot.
A Survey on Automatic Text Summarization Dipanjan Das André F. T. Martins Tolga Çekiç
GRAPH BASED MULTI-DOCUMENT SUMMARIZATION Canan BATUR
Designing Cross-Language Information Retrieval System using various Techniques of Query Expansion and Indexing for Improved Performance  Hello everyone,
Document Summarization
Applying Key Phrase Extraction to aid Invalidity Search
Presented by Nick Janus
Presentation transcript:

D OCUMENT S UMMARIZATION Abhirut Gupta Mandar Joshi Piyush Dungarwal

M OTIVATION The advent of WWW has created a large reservoir of data A short summary, which conveys the essence of the document, helps in finding relevant information quickly Document summarization also provides a way to cluster similar documents and present a summary

M OTIVATION

O UTLINE Definition Types: Extractive and Abstractive Techniques: Supervised and Unsupervised- Single document summarization Approaches TextRank Multi document summarization Challenges NEATS Multilingual summarization Summarization competitions Evaluation

W HAT IS SUMMARIZATION ? A summary is a text that is produced from one or more texts, that contains a significant portion of the information in the original text(s), and that is no longer than half of the original text(s). Summaries may be classified as: Extractive Abstractive

T YPES OF SUMMARIES

E XTRACTIVE SUMMARIES Extractive summaries are created by reusing portions (words, sentences, etc.) of the input text verbatim. For example, search engines typically generate extractive summaries from webpages. Most of the summarization research today is on extractive summarization.

Text: Extractive summary:

A BSTRACTIVE SUMMARIES In abstractive summarization, information from the source text is re-phrased. Human beings generally write abstractive summaries (except when they do their assignments ). Abstractive summarization has not reached a mature stage because allied problems such as semantic representation, inference and natural language generation are relatively harder.

A BSTRACTIVE SUMMARY : B OOK REVIEW An innocent hobbit of The Shire journeys with eight companions to the fires of Mount Doom to destroy the One Ring and the dark lord Sauron forever.

S UMMARIZATION TECHNIQUES Summarization techniques can be supervised or unsupervised. Supervised techniques use a collection of documents and human-generated summaries for them to train a classifier for the given text.

S UPERVISED TECHNIQUES

Features (e.g. position of the sentence, number of words in the sentence etc.) of sentences that make them good candidates for inclusion in the summary are learnt. Sentences in an original training document can be labelled as “in summary” or “not in summary”.

S UPERVISED TECHNIQUES The main drawback of supervised techniques is that training data is expensive to produce and relatively sparse. Also, most readily available human generated summaries are abstractive in nature.

M AJOR SUPERVISED APPROACHES Wong et al. use SVM to judge the importance of a sentence using feature categories: Surface features: position, length of sentence etc. Content features: uses stats of content-bearing words Relevance features: Exploit inter-sentence relationship e.g. similarity of the sentence with the first sentence. Sentences are then ranked accordingly and top ranked sentences are included in the summary.

M AJOR SUPERVISED APPROACHES Lin and Hovy use the concept of topic signatures to rank sentences. TS = {topic, signature} = {topic,,…, } TS = {restaurant-visit,,,, … } Topic signatures are learnt using a set of documents pre-classified as relevant or non- relevant for each topic.

E XAMPLE If a document is classified as relevant to restaurant visit, we know the important words of this document will form the topic signature of the topic “restaurant visit”. This is a supervised process. Extraction of the important words from a document, is an un-supervised process. E.g., food, occurs a lot of time in a cook book and is an important word in that document.

T OPIC S IGNATURES During deployment topic signatures are used to find the topic or theme of the text. Sentences are then ranked according to the sum of weights of terms relevant to the topic in the sentence.

U NSUPERVISED TECHNIQUES

U NSUPERVISED TECHNIQUES : T EXT R ANK AND L EX R ANK This approach models the document as a graph and uses an algorithm similar to Google’s PageRank algorithm to find top-ranked sentences. The key intuition is the notion of centrality or prestige in social networks i.e. a sentence should be highly ranked if it is recommended by many other highly ranked sentences.

I NTUITION “If Sachin Tendulkar says Malinga is a good batsman, he should be regarded highly. But then if Sachin is a gentleman, who talks highly of everyone, Malinga might not really be as good.” F ORMULA

A N EXAMPLE GRAPH : Image courtesy: Wikipedia

T EXT AS A GRAPH Sentences in the text are modelled as vertices of the graph. Two vertices are connected if there exists a similarity relation between them. Similarity formula After the ranking algorithm is run on the graph, sentences are sorted in reversed order of their score, and the top ranked sentences are selected.

W HY T EXT R ANK WORKS ? Through the graphs, TextRank identifies connections between various entities, and implements the concept of recommendation. A text unit recommends other related text units, and the strength of the recommendation is recursively computed based on the importance of the units making the recommendation. The sentences that are highly recommended by other sentences in the text are likely to be more informative

NEWSBLASTER D EMO

M ULTI - DOCUMENT AND MULTILINGUAL SUMMARIZATION

M ULTI DOCUMENT SUMMARIZATION A large set of documents may have thematic diversity. Individual summaries may have overlapping content. Many desirable features of an ideal summary are relatively difficult to achieve in a multi document setting. Clear structure Meaningful paragraphs Gradual transition form general to specific Good readability

N E ATS Summarization is done in three stages – content selection, content filtering and presentation. Selection stage is similar to that used in single document summarization. Content Filtering uses the following techniques Sentence position Stigma Words MMR

N E ATS Stigma Words conjunctions (e.g., but, although, however ), the verb say and its derivatives, quotation marks, pronouns such as he, she, and they MMR Maximum Marginal Relevancy – Search for the sentence which is most relevant to the query and most dissimilar with the sentences already in the summary

P RESENTATION The presentation stage proposes to solve two major challenges – definite noun phrases and events spread along an extended timeline. Definite noun phrase problem “The Great Depression of 1929 caused severe strain on the economy. The President proposed the New Deal to tackle this challenge.” NeATS uses a buddy system where each sentence is paired with a suitable introductory sentence.

P RESENTATION In multi-document summarization, a date expression such as Monday occurring in two different documents might mean the same date or different dates. Time Annotation is used to tackle such problems. Publication dates are used as reference points to compute the actual date/time for date expressions – weekdays (Sunday, Monday, etc), (past | next | coming) + weekdays, today, yesterday, last night.

M ULTI LINGUAL SUMMARIZATION Two approaches exist Documents from the source language are translated to the target language and then summarization is performed on the target language. Language specific tools are used to perform summarization in the source language and then machine translation is applied on the summarized text.

C OMPARISON Machine Translation is an involved process and not very precise, hence the first approach tends to be expensive as translation is applied to a large text(all the documents in the source language to be summarized) and also leads to error propagation. For the second approach, if the source language is not very widely used, language specific tools for summarization in the source language may not exist and creation of such tools may be expensive.

E XAMPLE A PPROACH 1- Hindi text – ऐसे कई तरीके हैं जिससे आप स्कूल में अपने बच्चे / बच्ची की मदद कर सकते हैं | वह स्कूल नियमित रुप से और समय पर जाता / जाती है यह निश्चित करने के लिए आप कानूनी तौर पर उत्तरदायी हैं | परंतु स्कूल के नियमों और होमवर्क के लिए इसके द्वारा किये जाने वाले व्यवस्थाओं का समर्थन करते हुए भी आप मदद कर सकते हैं | आप स्कूल की नितिओं का समर्थन करते हैं यह बात आपका बच्चा जानता है यह निश्चित करें |

E XAMPLE A PPROACH 1- English translation – There are many ways you can help your child in school. By law you are responsible for making sure he or she goes to school regularly, and on time. But you can also help by supporting the school's rules, and its arrangements for homework. Make sure your child knows that you support the school's policies. Summary of the translated text – You are equally responsible to ensure your child follows school rules.

E XAMPLE A PPROACH 2- Hindi Text – स्कूल में अपने बच्चे की मदद आप किस प्रकार कर सकते हैं इस विषय में आप अपने बच्चे के शिक्षकों से पूछें | ऐसा कुछ भी जिससे कि स्कूल में आपके बच्चे के कार्य पर प्रभाव पड़ सकता है उस बारे में आप उन्हें बताएं, यदि आप अपने बच्चे के प्रगति के विषय में चिन्तित है तो उनसे बातचीत करें | निश्चित करें स्कूल इस बात से अवगत है कि आप चाहतें हैं कि किसी भी समस्या के उत्पन्न होने पर, जिसमें कि आपका बच्चा शामिल है आपको तुरंत ही इस बारे में बताया जाए |

E XAMPLE A PPROACH 2- Hindi summary - आपके बच्चे के शिक्षकों के साथ अच्छा संपर्क आपके बच्चे की प्रगति में लाभदायक है | Translated Summary – Good communication with your child’s teachers is beneficial for your child’s progress.

E VALUATION

C OMPETITIONS : DUC AND MUC Research forums for encouraging development of new technologies for information extraction and text summarization. Have resulted in the development of evaluation criteria of summarization systems

E VALUATION ROUGE stands for Recall-Oriented Understudy for Gisting Evaluation. Recall based score to compare system generated summary with one or more human generated summaries. Can be with respect to n-gram matching. Unigram matching is found to be the best indicator for evaluation. ROUGE-1 is computed as division of count of unigrams in reference that appear in system and count of unigrams in reference summary.

E XAMPLES Reference-summary: Beijing hosted the summer Olympics. System-summary: The summer Olympics were held in Beijing. ROUGE-1 score: 0.75 Reference-summary: The policemen killed the gunman System-summary: The gunman killed the policemen ROUGE-1 score: 1

E VALUATION The ROUGE score is averaged for multiple references. ROUGE-1 does not determine if the result is coherent or if the sentences flow together in a sensible manner. A higher order n-gram ROUGE score can measure fluency to some degree.

C ONCLUSION Most of the current research is based on extractive multi-document summarization. Current summarization systems are widely used to summarize NEWS and other online articles. Top level view: Query-based vs. Generic Query based techniques give consideration to user preferences which can be formulated as a query Rule-based vs. ML/ Statistical Most of the early techniques were rule-based whereas the current one apply statistical approaches Rules(such as Sentence position) can be used as heuristic

C ONCLUSION Keyword vs. Graph-based Keyword based techniques rank sentences based on the occurrence of relevant keywords. Graph based techniques rank sentences based on content overlap.

R EFERENCES Wikipedia page on summarization n n Mihalcea R. and Tarau P TextRank: Bringing Order into Text. In Proc. of EMNLP Lin C.Y. and Hovy E The automated acquisition of topic signatures for text summarization. In Proc. of the 18th conference on Computational linguistics - Volume 1. Lin C.Y. and Hovy E From single to multi- document summarization: a prototype system and its evaluation. In Proc. of the 40th Annual Meeting on Association for Computational Linguistics (ACL '02).

R EFERENCES Kam-Fai Wong, Mingli Wu, and Wenjie Li Extractive summarization using supervised and semi-supervised learning. In Proceedings of the 22nd International Conference on Computational Linguistics - Volume 1 Newsblaster:

Q & A