Search and Information Extraction Lab, IIIT Hyderabad

Information Overload
- Explosive growth of information on the web
- Failure of information retrieval systems to satisfy the user's information need
- Need for sophisticated information access solutions

Summarization
A summary is a condensed version of the source document(s) with a recognizable genre: its purpose is to give the reader an exact and concise idea of the contents of the source.
Pipeline: Text Interpretation → Extraction of Relevant Information → Condensing Extracted Information → Summary Generation
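A minimal sketch of this pipeline in Python, using plain term-frequency scoring as a stand-in for the lab's relevance models; the function names and the 100-word budget are illustrative assumptions, not the actual system.

```python
# Minimal extractive summarization pipeline: interpret, score, condense, generate.
# Illustrative only; the scoring here is plain term frequency.
import re
from collections import Counter

def split_sentences(text):
    # Naive sentence splitter, sufficient for illustration.
    return [s.strip() for s in re.split(r'(?<=[.!?])\s+', text) if s.strip()]

def summarize(text, max_words=100):
    sentences = split_sentences(text)                                  # text interpretation
    tf = Counter(w.lower() for w in re.findall(r'\w+', text))
    scored = sorted(
        sentences,
        key=lambda s: sum(tf[w.lower()] for w in re.findall(r'\w+', s)) / (len(s.split()) or 1),
        reverse=True)                                                  # extract relevant sentences
    summary, count = [], 0
    for s in scored:                                                   # condense to a word budget
        if count + len(s.split()) > max_words:
            continue
        summary.append(s)
        count += len(s.split())
    return ' '.join(summary)                                           # summary generation
```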

Flavors of Summarization
- Progressive
- Single document
- Query focused
- Opinion/sentiment
- Code
- Comparative
- Guided
- Personalized

Extract vs. Abstract
- Extract: an extract is a summary consisting entirely of material from the input text.
- Abstract: an abstract is a summary at least some of whose material is not present in the input, e.g. paraphrases of content or subject categories.

Towards Abstraction
Research directions moving from extraction towards abstraction: single-document and query-focused multi-document summarization, progressive summarization, blog summarization, comparison summarization, code summarization, guided summarization, personalized and cross-lingual summarization, and abstractive summarization.

Technological Aspects

Query Focused Summarization
- Documents should be ranked in order of probability of relevance to the request or information need, as calculated from whatever evidence is available to the system
- Query-dependent ranking: Relevance-Based Language Models
  - Language models (PHAL)
- Query-independent ranking: sentence prior

- RBLM is an IR approach that computes the conditional probability of relevance from the document and the query
- PHAL is a probabilistic extension of HAL (Hyperspace Analogue to Language) spaces
- HAL captures the dependencies of a term w on other terms, based on their occurrence in its context in the corpus
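A small sketch of how a HAL-style co-occurrence space can be built and turned into conditional probabilities, assuming a simple sliding window with distance-based weights; this illustrates the general idea, not the lab's PHAL implementation.

```python
# Sketch of HAL-style co-occurrence construction: each term accumulates weighted
# counts of the terms that appear within a fixed window before it, with closer
# terms weighted more heavily. A PHAL-like step normalizes the counts into P(t | w).
from collections import defaultdict

def build_hal(tokens, window=5):
    hal = defaultdict(lambda: defaultdict(float))
    for i, w in enumerate(tokens):
        for j in range(max(0, i - window), i):
            dist = i - j
            hal[w][tokens[j]] += (window - dist + 1)   # closer terms get higher weight
    return hal

def phal(hal, w):
    # Normalize the row of w into a probability distribution over context terms.
    total = sum(hal[w].values()) or 1.0
    return {t: c / total for t, c in hal[w].items()}
```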

- Log-linear relevance
- The sentence prior captures the importance of a sentence explicitly, using pseudo-relevant documents (Web, Wikipedia)
- It is based on domain knowledge, background information, and centrality
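A minimal sketch of a log-linear combination of a query-dependent relevance probability with a query-independent sentence prior; the interpolation weight `lam` and the smoothing constant are assumed values for illustration only.

```python
import math

# Log-linear ranking score: a query-dependent relevance term combined with a
# query-independent sentence prior. Weights are illustrative, not tuned values.
def log_linear_score(p_sentence_given_query, sentence_prior, lam=0.7):
    return (lam * math.log(p_sentence_given_query + 1e-12)
            + (1 - lam) * math.log(sentence_prior + 1e-12))
```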

DUC Performance
38 systems participated in DUC 2006. There was a significant difference between the first two systems.

Extract vs. Abstract Summarization
- We conducted a study (post TAC 2006)
- Generated the best possible extracts
- Calculated the scores for these extracts
- Evaluated them with respect to the reference summaries
(Table: ROUGE-2 and ROUGE-SU4 scores for human answers, best-possible extracts, and the HAL feature; values not preserved in this transcript.)
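For reference, a small sketch of how ROUGE-2 recall is computed from bigram overlap between a candidate and a reference summary; it omits the stemming and stopword handling of the official toolkit.

```python
from collections import Counter

# Minimal ROUGE-2 recall: bigram overlap between candidate and reference,
# divided by the number of reference bigrams.
def bigrams(tokens):
    return Counter(zip(tokens, tokens[1:]))

def rouge2_recall(candidate, reference):
    cand = bigrams(candidate.lower().split())
    ref = bigrams(reference.lower().split())
    overlap = sum(min(c, ref[b]) for b, c in cand.items())
    return overlap / (sum(ref.values()) or 1)
```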

Cross Lingual Summarization

- A bridge between CLIR and MT
- Extended our mono-lingual summarization framework to a cross-lingual setting within the RBLM framework
- Designed a cross-lingual experimental setup using the DUC 2005 dataset
- Experiments were conducted for the Telugu-English language pair
- Comparison with the mono-lingual baseline shows about 90% performance in ROUGE-SU4 and about 85% in ROUGE-2 f-measures

Progressive Summarization
- An emerging area of research in summarization
- Summarization with a sense of prior knowledge
- Introduced as "Update Summarization" at DUC 2007, TAC 2008, and TAC 2009
- Generate a short summary of a set of newswire articles, under the assumption that the user has already read a given set of earlier articles
- Helps keep track of temporal news stories

Key Challenge
To detect information that is not only relevant but also new, given the prior knowledge of the reader:
relevant and new vs. non-relevant and new vs. relevant and redundant

Novelty Detection
- Identifying sentences containing new information (novelty detection) in a cluster of documents is the key to progressive summarization
- Shares similarity with the Novelty track at TREC from 2002 to 2004
  - Task 1: extract relevant sentences from a set of documents for a topic
  - Task 2: eliminate redundant sentences from the relevant sentences
- Progressive summarization differs in that it produces a summary from the novel sentences, which requires scoring and ranking

Three-Level Approach to Novelty Detection
- Sentence scoring: new features that capture the novelty of a sentence along with its relevance (NF, NW)
- Ranking: sentences are re-ranked based on the amount of novelty they contain (ITSim, CoSim)
- Summary generation: a selected pool of sentences that contain novel facts; all remaining sentences are filtered out
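A sketch of two simple novelty signals of the kind listed above: a new-word ratio as a stand-in for the NW feature and a maximum cosine similarity against prior-cluster sentences as a stand-in for CoSim; the exact feature definitions used in the system may differ.

```python
from collections import Counter
import math

# Two simple novelty signals for a sentence given the prior cluster:
# 1) fraction of words unseen in the prior cluster (NW-like),
# 2) one minus the max cosine similarity to prior sentences (CoSim-like).
def cosine(a, b):
    num = sum(a[t] * b.get(t, 0) for t in a)
    den = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return num / den if den else 0.0

def novelty_features(sentence, prior_sentences):
    words = sentence.lower().split()
    prior_vocab = {w for s in prior_sentences for w in s.lower().split()}
    new_word_ratio = sum(1 for w in words if w not in prior_vocab) / (len(words) or 1)
    vec = Counter(words)
    max_sim = max((cosine(vec, Counter(s.lower().split())) for s in prior_sentences), default=0.0)
    return new_word_ratio, 1.0 - max_sim   # both high => likely novel
```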

Evaluations
- TAC 2008 Update Summarization data for training: 48 topics
- Each topic is divided into clusters A and B, each with 10 documents
- The summary for cluster A is a normal summary; the summary for cluster B is an update summary
- TAC 2009 Update Summarization data for testing: 44 topics
- The baseline summarizer generates a summary by picking the first 100 words of the last document
- Run 1: DFS + SL1
- Run 2: PHAL + KL

Personalized Summarization
- Perception of text differs with the background of the reader
- Need to incorporate the user's background into the summarization process
- The summary is a function not only of the input text but also of the reader
(Illustration: the word "serve" is perceived differently by a tennis player, a hotel manager, and a politician.)

- Web-based profile creation: personal information available on the web, e.g. a conference page, a project page, an online paper, or even a weblog
- Estimate a user model P(w | Mu) to incorporate the user into the sentence extraction process
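A sketch of this idea, assuming a unigram user model estimated from web pages about the user and a linear interpolation with the base sentence score; the unsmoothed estimate and the weight `alpha` are illustrative choices, not the system's.

```python
from collections import Counter

# Estimate a unigram user model P(w | Mu) from the user's web profile documents,
# then bias sentence scores towards the user's vocabulary.
def estimate_user_model(profile_docs):
    counts = Counter(w.lower() for doc in profile_docs for w in doc.split())
    total = sum(counts.values()) or 1
    return {w: c / total for w, c in counts.items()}

def personalized_score(sentence, base_score, user_model, alpha=0.3):
    words = sentence.lower().split()
    user_affinity = sum(user_model.get(w, 0.0) for w in words) / (len(words) or 1)
    return (1 - alpha) * base_score + alpha * user_affinity
```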

Opinion Summarization
Sentiment analysis
- User-generated content is growing rapidly through blogs
- Sentiment analysis provides better access to this information
Sentiment
- Textual information on the Web can be categorized into facts and opinions
- The computational study of opinions and sentiments from a market perspective

- Maximize the amount of sentiment captured in the summary
- Sentiment summarization is cast as a two-stage classification problem at the sentence level
- Polarity estimation: opinion vs. fact, then positive vs. negative
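A sketch of the two-stage sentence-level setup using scikit-learn; the TF-IDF + logistic regression models and the label names ('opinion'/'fact', 'positive'/'negative') are assumptions for illustration, not the classifiers used in the system.

```python
# Stage 1 separates opinionated sentences from factual ones; stage 2 assigns
# polarity to the opinionated ones. Training data and labels are illustrative.
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

def train_stage(texts, labels):
    model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)),
                          LogisticRegression(max_iter=1000))
    model.fit(texts, labels)
    return model

def classify_sentences(sentences, subjectivity_clf, polarity_clf):
    results = []
    for s in sentences:
        if subjectivity_clf.predict([s])[0] == 'opinion':
            results.append((s, polarity_clf.predict([s])[0]))   # 'positive' / 'negative'
        else:
            results.append((s, 'fact'))
    return results
```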

Code Summarization
- What factors lead to accurate comprehension of code?
- How do developers form perceptions of relevance in existing code?
- What would be the utility of code summaries?
The vocabulary problem
- No single identifier in the code fully represents the code's purpose
- Language specificity
- Misrepresentative content

Approach
- Partial order extraction: formal concept analysis
- Concept disambiguation: word sense disambiguation using WordNet; use a domain dictionary or semi-supervised methods
- Partial order ranking/selection: co-poset selection based on the number of extents
- Summary construction: template-based sentence creation for each poset, with rules for lexical choices based on operators
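A small sketch of the template-based sentence creation step, assuming a hypothetical verb lexicon and a single sentence template; the actual templates and lexical-choice rules are not given in the slides.

```python
# Render one selected concept group (poset) into a sentence using a template
# and a simple rule for choosing the verb. Lexicon and template are hypothetical.
VERB_RULES = {'get': 'retrieves', 'set': 'updates', 'add': 'adds', 'remove': 'removes'}

def render_poset(action, obj, qualifier=None):
    verb = VERB_RULES.get(action, action + 's')
    sentence = f"This code {verb} the {obj}"
    return sentence + (f" {qualifier}." if qualifier else ".")

# Example: render_poset('add', 'listener', 'to the event queue')
# -> "This code adds the listener to the event queue."
```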

Comparative Summarization
- Summaries for comparing multiple items belonging to a category
- The category "Mobile phones" will have "Nokia", "BlackBerry" as its items
- Comparative summaries provide the properties or facts common to these items and their corresponding values with respect to each item, e.g. "Memory", "Display", "Battery Life"

Generating Comparative Summaries
- Attribute extraction: find the attributes of the product class
- Attribute ranking: rank the attributes according to their importance for comparison
- Summary generation: find the occurrence of the attributes across the various products
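A sketch of these steps over already-extracted attribute/value pairs; the data structures, the frequency-based ranking, and the example products are assumptions for illustration.

```python
# Given per-product attribute/value pairs, rank attributes by how many products
# mention them and emit a comparison table. Example data is illustrative.
from collections import Counter

def comparative_summary(product_attrs, top_k=3):
    # product_attrs: {product: {attribute: value}}
    freq = Counter(a for attrs in product_attrs.values() for a in attrs)
    ranked = [a for a, _ in freq.most_common(top_k)]            # attribute ranking
    rows = []
    for product, attrs in product_attrs.items():                # summary generation
        rows.append((product, {a: attrs.get(a, 'n/a') for a in ranked}))
    return ranked, rows

# Example:
# comparative_summary({'Nokia': {'Memory': '16 GB', 'Battery Life': '12 h'},
#                      'BlackBerry': {'Memory': '32 GB', 'Display': '3.1 in'}})
```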

Guided Summarization
Query-focused summarization
- The user's information need is expressed as a query along with a narrative
- A set of documents related to the topic is given
- The goal is to produce a short, coherent summary focused on answering the query
Guided summarization
- Each topic is classified into a set of predefined categories
- Each category has a template of important aspects about the topic
- The summary is expected to answer all the aspects of the template while containing other relevant information

(Diagram: the summarizer takes documents, a query, and the aspects Who/What/When/Where/How as input and produces a guided summary.)

Guided summarization
- Encourages deeper linguistic and semantic analysis of the source documents instead of relying only on document word frequencies to select important concepts
- Shares similarity with information extraction: specific information in unstructured text is identified and then classified into a set of semantic labels (templates)
- Makes the information more suitable for other information processing tasks
- A guided summarization system has to produce a readable summary encompassing all the information required by the templates
- Very few investigations have explored the potential of merging summarization with information extraction techniques

Our Approach
- Building a domain model: essential background knowledge for information extraction
- Sentence annotation: identify sentences that answer aspects of the template
- Concept mining: use semantic concepts instead of words to calculate sentence importance
- Summary extraction: modify the summary extraction algorithm to meet the requirements, using the sentence annotations
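A sketch of the sentence-annotation step, assuming simple surface patterns for a few template aspects; the system's actual domain model and annotation rules are richer than these hypothetical patterns.

```python
import re

# Tag each sentence with the template aspects (WHO, WHEN, WHERE, ...) it appears
# to answer, using crude surface patterns. Patterns are illustrative placeholders.
ASPECT_PATTERNS = {
    'WHEN':  r'\b(on|in)\s+(January|February|March|April|May|June|July|'
             r'August|September|October|November|December|\d{4})\b',
    'WHERE': r'\b(in|at|near)\s+[A-Z][a-z]+\b',
    'WHO':   r'\b(Mr\.|Ms\.|Dr\.|President|Minister)\s+[A-Z][a-z]+\b',
}

def annotate(sentence):
    return [aspect for aspect, pat in ASPECT_PATTERNS.items()
            if re.search(pat, sentence)]

# Example: annotate("The attack occurred in Mumbai in November 2008.")
# -> ['WHEN', 'WHERE']
```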

Run     ROUGE-2        ROUGE-SU4      Pyramid         Responsiveness
Run 1   n/a (1/43)     n/a (1/43)     0.425 (1/43)    3.130 (2/43)
Run 2   n/a (23/43)    n/a (22/43)    0.347 (21/43)   2.804 (21/43)
(ROUGE-2 and ROUGE-SU4 values were not preserved in this transcript; ranks out of 43 systems are shown in parentheses.)

(Table: Pyramid score and responsiveness by category (Accidents, Attacks, Health and Safety, Endangered Resources, Investigations); values not preserved in this transcript.)

- Run 1 is successful in producing informative summaries for cluster A
- It ranked first in all evaluation metrics, including Pyramid and ROUGE
- The difficulty of the task depends on the type of category: summarizing Health and Safety and Endangered Resources topics is relatively hard