1
Search and Information Extraction Lab IIIT Hyderabad
2
Information Overload
- Explosive growth of information on the web
- Failure of information retrieval systems to satisfy users' information needs
- Need for sophisticated information access solutions
3
Summarization
A summary is a condensed version of source document(s) with a recognizable genre: to give the reader an exact and concise idea of the contents of the source.
Pipeline: Text interpretation → Extraction of relevant information → Condensing extracted information → Summary generation
4
Flavors of Summarization
- Progressive
- Single document
- Query focused
- Opinion/sentiment
- Code
- Comparative
- Guided
- Personalized
5
Extract vs. Abstract
Extract: a summary consisting entirely of material from the input text.
Abstract: a summary at least some of whose material is not present in the input, e.g. paraphrases of content, subject categories.
6
Towards Abstraction
- Personalized, cross-lingual summarization
- Guided summarization
- Code summarization
- Comparison summarization
- Blog summarization
- Progressive summarization
- Abstractive
- Single-document, query-focused multi-document summarization
7
Technological Aspects
9
Query Focused Summarization
Documents should be ranked in order of probability of relevance to the request or information need, as calculated from whatever evidence is available to the system.
- Query-dependent ranking: relevance-based language models; language models (PHAL)
- Query-independent ranking: sentence prior
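The slide's ranking principle can be sketched with a standard query-likelihood scorer over sentences. This is a simplified stand-in for the RBLM scoring described later, not the system's actual model; the Dirichlet smoothing parameter `mu` and the toy sentences are illustrative assumptions.

```python
import math
from collections import Counter

def rank_sentences(query, sentences, mu=10.0):
    """Rank sentences by smoothed query log-likelihood (Dirichlet
    smoothing against the sentence collection). A simplified stand-in
    for relevance-based language-model ranking."""
    coll = Counter(w for s in sentences for w in s.split())
    total = sum(coll.values())

    def score(s):
        tf = Counter(s.split())
        length = sum(tf.values())
        # Smoothed P(q|s); assumes every query term occurs in the collection.
        return sum(math.log((tf[q] + mu * coll[q] / total) / (length + mu))
                   for q in query.split())

    return sorted(sentences, key=score, reverse=True)

sents = ["the summary condenses the document",
         "weather was pleasant today"]
ranked = rank_sentences("summary document", sents)
```

Sentences sharing terms with the query score higher, giving the query-dependent part of the ranking; a sentence prior would be folded in separately.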
10
RBLM (relevance-based language modeling) is an IR approach that computes the conditional probability of relevance from the document and the query.
PHAL is a probabilistic extension to HAL (Hyperspace Analogue to Language) spaces.
HAL constructs dependencies of a term w on other terms based on their co-occurrence in its context in the corpus.
11
Log-Linear Relevance
The sentence prior captures the importance of a sentence explicitly, using pseudo-relevant documents (Web, Wikipedia).
Based on domain knowledge, background information, and centrality.
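A log-linear model combines query relevance and the sentence prior multiplicatively. The feature names and weights below are illustrative assumptions, not the values used in the system:

```python
import math

def log_linear_score(features, weights):
    """Log-linear sentence score: exp(sum_i w_i * f_i), so query
    relevance and prior evidence combine multiplicatively.
    Feature names and weights here are illustrative."""
    return math.exp(sum(weights[k] * v for k, v in features.items()))

# Hypothetical feature values for one sentence.
s = log_linear_score({"query_rel": 0.8, "prior": 0.5},
                     {"query_rel": 1.0, "prior": 0.5})
```

A zero weight switches a feature off, which makes it easy to compare query-dependent and query-independent evidence in one framework.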
12
DUC Performance
38 systems participated in 2006; there was a significant difference between the first two systems.
13
Extract vs. Abstract Summarization
We conducted a study (post TAC 2006):
- Generated the best possible extracts
- Calculated the scores for these extracts
- Evaluated with respect to the reference summaries

                Rouge-2    Rouge-SU4
Human Answers   0.1025     0.1624
Best Answers    0.09965    0.15407
HAL Feature     0.07618    0.13805
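The ROUGE-N metric behind these numbers is n-gram recall against reference summaries. The sketch below handles a single reference and skips stemming and stopword options, so it is a simplification of the official toolkit:

```python
from collections import Counter

def rouge_n(candidate, reference, n=2):
    """ROUGE-N recall: the fraction of reference n-grams that also
    appear in the candidate (counts clipped by the reference).
    Single reference, no stemming: a simplified ROUGE."""
    def ngrams(text):
        toks = text.lower().split()
        return Counter(tuple(toks[i:i + n]) for i in range(len(toks) - n + 1))

    ref, cand = ngrams(reference), ngrams(candidate)
    if not ref:
        return 0.0
    overlap = sum(min(c, cand[g]) for g, c in ref.items())
    return overlap / sum(ref.values())

score = rouge_n("the cat sat", "the cat sat down")
```

ROUGE-SU4 works the same way but counts skip-bigrams (gaps up to four words) plus unigrams, which is why its scores run higher than ROUGE-2 in the table.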
14
Cross Lingual Summarization
15
A bridge between CLIR and MT
- Extended our monolingual summarization framework to a cross-lingual setting in the RBLM framework
- Designed a cross-lingual experimental setup using the DUC 2005 dataset
- Experiments were conducted for the Telugu-English language pair
- Comparison with the monolingual baseline shows about 90% performance in ROUGE-SU4 and about 85% in ROUGE-2 f-measures
16
Progressive Summarization
- Emerging area of research in summarization
- Summarization with a sense of prior knowledge
- Introduced as "Update Summarization" at DUC 2007, TAC 2008, TAC 2009
- Generate a short summary of a set of newswire articles, under the assumption that the user has already read a given set of earlier articles
- Keeps track of temporal news stories
17
Key challenge
Detect information that is not only relevant but also new, given the reader's prior knowledge:
relevant and new vs. non-relevant and new vs. relevant and redundant.
18
Novelty Detection
Identifying sentences containing new information (novelty detection) in a cluster of documents is the key to progressive summarization.
Shares similarity with the Novelty track at TREC from 2002-2004:
- Task 1: Extract relevant sentences from a set of documents for a topic
- Task 2: Eliminate redundant sentences from the relevant sentences
Progressive summarization differs in that it produces a summary from the novel sentences, which requires scoring and ranking them.
19
Three-level approach to Novelty Detection
- Sentence scoring: developing new features (NF, NW) that capture novelty along with the relevance of a sentence
- Ranking: sentences are re-ranked based on the amount of novelty they contain (ITSim, CoSim)
- Summary generation: a selected pool of sentences that contain novel facts; all remaining sentences are filtered out
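A minimal novelty feature in the spirit of the NW (new words) feature named above is the fraction of a sentence's words unseen in the earlier (cluster A) documents. The slides do not give the exact NF/NW definitions, so this form is an illustrative assumption:

```python
def new_word_ratio(sentence, prior_vocab):
    """NW-style novelty feature: fraction of a sentence's words that
    do not occur in the previously read documents. Illustrative; the
    exact NF/NW formulations may differ."""
    words = sentence.lower().split()
    if not words:
        return 0.0
    return sum(w not in prior_vocab for w in words) / len(words)

# Hypothetical vocabulary from the already-read cluster A articles.
prior = {"the", "quake", "struck", "city"}
r = new_word_ratio("rescue teams reached the city", prior)
```

Such a score would be combined with a relevance score before re-ranking, so that a sentence must be both on-topic and novel to enter the update summary.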
20
Evaluations
- TAC 2008 Update Summarization data for training: 48 topics
- Each topic is divided into clusters A and B with 10 documents each
- The summary for cluster A is a normal summary; the summary for cluster B is an update summary
- TAC 2009 Update Summarization data for testing: 44 topics
- The baseline summarizer generates a summary by picking the first 100 words of the last document
- Run1 - DFS + SL1
- Run2 - PHAL + KL
21
Personalized Summarization
- Perception of text differs with the background of the reader
- Need to incorporate the user's background into the summarization process
- Summarization is a function not only of the input text but also of the reader
- Example: "serve" means different things to a tennis player, a hotel manager, and a politician
22
Web-based profile creation: personal information available on the web - a conference page, a project page, an online paper, or even a weblog.
Estimate a model P(w|M_u) to incorporate the user into the sentence extraction process.
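The user model P(w|M_u) can be sketched as a smoothed unigram distribution estimated from text gathered from the user's web footprint. The slides do not specify the estimator; add-alpha smoothing and the example profile texts below are assumptions:

```python
from collections import Counter

def user_model(profile_texts, alpha=0.01):
    """Estimate a unigram user model P(w|M_u) from profile text
    (homepage, papers, blog posts), with add-alpha smoothing so
    unseen words keep nonzero probability. Illustrative estimator."""
    counts = Counter(w for t in profile_texts for w in t.lower().split())
    total = sum(counts.values())
    vocab = len(counts)

    def p(w):
        return (counts[w] + alpha) / (total + alpha * vocab)

    return p

p = user_model(["tennis serve and volley", "grand slam tennis"])
```

During extraction, sentences can then be scored partly by how likely their words are under M_u, biasing the summary toward the user's interests.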
23
Opinion Summarization / Sentiment Analysis
- User-generated content is growing rapidly through blogs
- Sentiment analysis provides better access to this information
- Textual information on the web can be categorized as facts and opinions
- Computational study of opinions and sentiments from a market perspective
25
Optimize the sentiment captured in the summary to the maximum extent.
Sentiment summarization as a two-stage classification problem at the sentence level:
- Opinion/fact classification
- Polarity estimation: positive/negative
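The two-stage scheme can be sketched with lexicon lookups standing in for the trained classifiers: first decide opinion vs. fact, then assign polarity only to opinion sentences. The word lists are toy assumptions:

```python
def classify_sentence(sentence, subj_words, pos_words, neg_words):
    """Two-stage sentence classification: (1) opinion vs. fact via
    subjectivity cues, (2) positive vs. negative polarity for opinion
    sentences. Lexicons are toy stand-ins for trained models."""
    words = set(sentence.lower().split())
    if not words & subj_words:
        return "fact"
    pos = len(words & pos_words)
    neg = len(words & neg_words)
    return "positive" if pos >= neg else "negative"

subj = {"great", "terrible", "love", "hate"}
pos_lex = {"great", "love"}
neg_lex = {"terrible", "hate"}
label = classify_sentence("the battery life is great", subj, pos_lex, neg_lex)
```

Keeping the stages separate lets the summarizer drop factual sentences or balance positive and negative ones when optimizing overall summary sentiment.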
27
Code Summarization
- What factors lead to accurate comprehension of code?
- How do developers create perceptions of relevance in existing code?
- What would be the utility of code summaries?
Vocabulary problem:
- No one identifier in the code fully represents the code's purpose
- Language specificity
- Misrepresentative content
28
Approach
- Partial order extraction: formal concept analysis
- Concept disambiguation: word sense disambiguation using WordNet; use a domain dictionary or semi-supervised methods
- Partial order ranking/selection: coposet selection based on the number of extents
- Summary construction: template-based sentence creation for each poset; use of rules for lexical choices based on operators
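The final step, template-based sentence creation, can be sketched as slot filling: each selected concept supplies values for a canned pattern. The template, slot names, and example identifier below are illustrative assumptions, not the system's actual templates:

```python
def realize(template, slots):
    """Template-based sentence creation: fill named slots extracted
    from a code concept into a canned sentence pattern. Template and
    slot names are hypothetical."""
    return template.format(**slots)

# Hypothetical concept extracted from a method named sortList.
sentence = realize("The method {name} {verb} the {object}.",
                   {"name": "sortList", "verb": "sorts",
                    "object": "input list"})
```

Lexical-choice rules would pick the verb (e.g. "sorts" vs. "orders") based on the operators found in the code, per the last bullet above.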
29
Comparative Summarization
Summaries for comparing multiple items belonging to a category.
The category "Mobile phones" will have "Nokia", "BlackBerry" as its items.
Comparative summaries provide the properties or facts common to these items and their corresponding values with respect to each item, e.g. "Memory", "Display", "Battery Life".
30
Comparative Summaries Generation
- Attribute extraction: find the attributes of the product class
- Attribute ranking: rank the attributes according to their importance in comparison
- Summary generation: find the occurrence of the attributes in the various products
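The attribute-ranking step above can be sketched by scoring candidate attributes by how many products mention them, a simple frequency proxy for importance in comparison. The candidate list and product texts are toy assumptions; a real system would mine the candidates too:

```python
from collections import Counter

def rank_attributes(product_docs, candidates):
    """Rank candidate attributes of a product class by the number of
    products whose text mentions them. A frequency proxy for
    importance in comparison; candidates are assumed given."""
    freq = Counter()
    for doc in product_docs.values():
        text = doc.lower()
        for a in candidates:
            if a in text:
                freq[a] += 1
    return [a for a, _ in freq.most_common()]

docs = {"Nokia": "great battery life and memory",
        "BlackBerry": "qwerty keyboard, long battery life"}
order = rank_attributes(docs, ["battery life", "memory", "camera"])
```

Attributes shared across all items float to the top, which is exactly what a comparative summary table needs in its rows.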
31
Guided Summarization
Query-focused summarization:
- The user's information need is expressed as a query along with a narrative
- A set of documents related to the topic is given
- The goal is to produce a short, coherent summary focusing on the answer to the query
Guided summarization:
- Each topic is classified into a set of predefined categories
- Each category has a template of important aspects about the topic
- The summary is expected to answer all the aspects of the template while containing other relevant information
32
[Diagram: Docs and a Query feed the Summarizer, which produces a Summary; guided summarization additionally uses template aspects (Who, What, When, Where, How) to produce a Guided Summary]
33
Guided summarization
- Encourages deeper linguistic and semantic analysis of the source documents instead of relying only on document word frequencies to select important concepts
- Shares similarity with information extraction: specific information from unstructured text is identified and classified into a set of semantic labels (templates)
- Makes information more suitable for other information-processing tasks
- A guided summarization system has to produce a readable summary encompassing all the information about the templates
- Very few investigations have explored the potential of merging summarization with information extraction techniques
34
Our approach
- Building a domain model: essential background knowledge for information extraction
- Sentence annotation: identify sentences having answers to aspects of the template
- Concept mining: use semantic concepts instead of words to calculate sentence importance
- Summary extraction: modify the summary extraction algorithm to adapt to the requirements using the sentence annotations
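The sentence-annotation step can be sketched with cue-word lookups per template aspect: a sentence is tagged with every aspect whose cues it contains. The aspect names and cue lists below are illustrative assumptions; the actual annotator described in the slides is richer:

```python
def annotate(sentence, aspect_cues):
    """Tag a sentence with the template aspects it may answer, using
    cue words per aspect (e.g. WHEN/WHERE for an Accidents topic).
    Cue lists here are toy assumptions."""
    words = set(sentence.lower().split())
    return {aspect for aspect, cues in aspect_cues.items() if words & cues}

cues = {"WHEN": {"monday", "yesterday", "january"},
        "WHERE": {"city", "coast", "village"}}
labels = annotate("the quake struck the coast yesterday", cues)
```

Sentences covering unanswered aspects can then be boosted during summary extraction, steering the summary toward filling the whole template.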
35
        ROUGE-2          ROUGE-SU4        Pyramid        Responsiveness
Run1    0.09574 (1/43)   0.13014 (1/43)   0.425 (1/43)   3.130 (2/43)
Run2    0.0695 (23/43)   0.10788 (22/43)  0.347 (21/43)  2.804 (21/43)

Category               Pyramid score   Responsiveness
Accidents              0.445           3.429
Attacks                0.524           3.286
Health and safety      0.300           2.583
Endangered resources   0.396           3.100
Investigations         0.520           3.500

- Run1 is successful in producing informative summaries for cluster A
- Ranked first in all evaluation metrics, including Pyramid and ROUGE
- The difficulty of the task depends on the type of category: summarizing "Health and safety" and "Endangered resources" is relatively hard