Estimating Importance Features for Fact Mining (With a Case Study in Biography Mining). Sisay Fissaha Adafre, School of Computing, Dublin City University.

Estimating Importance Features for Fact Mining (With a Case Study in Biography Mining). Sisay Fissaha Adafre, School of Computing, Dublin City University. Maarten de Rijke, ISLA, University of Amsterdam.

Outline
- Motivation
- Task
- Approaches
- Experimental Setup
- Results
- Concluding Remarks

Motivation
- Over 60% of Web queries are informational
  - "Tell me about X."
  - Queries are short
- TREC "Other" questions; DUC 2004 summarization: "Who is X?"

Motivation
- Increasing amount of user-annotated data: Wikipedia
  - The largest reference work
  - Open content: anyone can edit it
  - Rich set of categories
- Wikipedia as an "importance model" (Mishne et al. 2005)
  - Nuggets from a newspaper corpus are compared with nuggets from Wikipedia.
  - Higher similarity implies importance.

Applications
Uses of sentence importance estimation:
- Information retrieval
  - Question answering (Ahn et al., 2004)
  - Novelty checking (Allan et al., 2003)
- Summarization
  - Graph-based methods (Erkan & Radev, 2004)
- Topic tracking (Kraaij & Spitters, 2003)

Task
Given a topic, identify sentences that are important for the topic in a general newspaper text corpus.
Example ("William H. McNeill"):
- "William H. McNeill" (born 1917, Vancouver, British Columbia) is a Canadian historian.
- He is currently Professor Emeritus of History at the University of Chicago.
- McNeill's most popular work is "The Rise of the West".
- The book explored human history in terms of the effect of different old world civilizations on one another, and especially the dramatic effect of western civilization on others in the past 500 years.
- It had a major impact on historical theory, especially in distinction to Oswald Spengler.
Scientific aim: to compare techniques for determining important sentences.

System Overview
- Topic → passage retrieval → candidate sentences
- Topic → get Wikipedia categories → select sample articles → sentence extraction → reference corpus
- Candidate sentences + reference corpus → rank sentences → ranked sentences

Candidate Sentence Selection
Input
- Topic: name and category of a person
- Source corpus: the AQUAINT corpus
Sentence extraction
- The source corpus is split into passages and indexed
- The topic is submitted as a query; the top 200 passages are selected
- Passages are split into sentences
- Sentences containing the topic words are retained (a sketch of this step follows)
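As a rough illustration of the extraction step, here is a minimal Python sketch; the function name, the naive tokenisation, and the choice to keep a sentence as soon as it shares any topic word are all assumptions, not the authors' exact procedure.

```python
import re

def select_candidates(topic, passages, top_k=200):
    """Keep sentences from the top-ranked passages that mention the topic."""
    topic_words = set(topic.lower().split())
    candidates = []
    for passage in passages[:top_k]:
        # Naive splitting on sentence-final punctuation; a real system
        # would use a proper sentence tokenizer.
        for sentence in re.split(r"(?<=[.!?])\s+", passage):
            # Assumption: one shared topic word is enough to retain a sentence.
            if topic_words & set(sentence.lower().split()):
                candidates.append(sentence)
    return candidates
```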

Sentence Ranking
Sentences are ranked by their similarity to reference sentences.
Reference sentences
- Given a topic and its category (e.g., Brad Pitt, actor)
- The reference corpus is a set of sentences describing other entities in the same category, i.e., other actors.

System Overview
- Topic → passage retrieval → candidate sentences
- Topic → get Wikipedia categories → select sample articles → sentence extraction → reference corpus
- Candidate sentences + reference corpus → rank sentences → ranked sentences

Ranking Sentences
Two dimensions
- Graph-based vs. non-graph-based
- Using (or not) a reference corpus
Five ways
- Word overlap
- Language modelling
- Graph-based methods:
  - Generic
  - Graph-based method with reference corpus
  - Graph-based method with reference corpus plus a lexical layer

Assumptions
Given an entity of some category:
- We consider other entities of the same category and the properties that are typically described for them.
- That is, if a property is included in the descriptions of a significant portion of entities in the same category as our input entity, we assume it to be an important one.

Sentence Ranking
Similarity measures
- Word overlap
  - Compute the Jaccard coefficient between candidate and reference sentences
  - Sentences are ranked by their maximum scores
- Language modelling
  - Sentences are ranked by their likelihood w.r.t. the language model of the reference corpus
- Graph-based methods (next slides)
Sketches of the first two measures follow.
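The two non-graph rankers can be sketched as follows. Whitespace tokenisation and add-one smoothing are illustrative simplifications; the slides do not specify these details.

```python
import math
from collections import Counter

def jaccard(a, b):
    """Jaccard coefficient between two token sets."""
    return len(a & b) / len(a | b) if (a | b) else 0.0

def rank_by_overlap(candidates, references):
    """Score each candidate by its maximum Jaccard score against any reference sentence."""
    ref_sets = [set(r.lower().split()) for r in references]
    return sorted(
        ((max((jaccard(set(c.lower().split()), r) for r in ref_sets), default=0.0), c)
         for c in candidates),
        reverse=True,
    )

def rank_by_language_model(candidates, references):
    """Score each candidate by its length-normalised log-likelihood under a
    unigram model of the reference corpus (add-one smoothing is an assumption)."""
    counts = Counter(w for r in references for w in r.lower().split())
    total, vocab = sum(counts.values()), len(counts) + 1
    scored = []
    for c in candidates:
        tokens = c.lower().split()
        loglik = sum(math.log((counts[w] + 1) / (total + vocab)) for w in tokens)
        scored.append((loglik / max(len(tokens), 1), c))
    return sorted(scored, reverse=True)
```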

Sentence Ranking
Graph-based method for summarization (Erkan & Radev, 2004)
- Given a text to be summarized
- Construct a graph by linking related sentences (word overlap)
- Assign a score to each sentence using PageRank
- The sentence with the highest PageRank score is assumed to contain the salient information
A sketch of this generic variant follows.
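A minimal sketch of the generic graph-based ranker, reusing the jaccard helper from the previous sketch; the overlap threshold and the damping factor are illustrative choices, not values reported in the slides.

```python
import networkx as nx  # pip install networkx

def rank_by_graph(sentences, threshold=0.1):
    """Link sentences whose word overlap exceeds a threshold, then rank them
    by PageRank score (in the spirit of LexRank, Erkan & Radev 2004)."""
    token_sets = [set(s.lower().split()) for s in sentences]
    g = nx.Graph()
    g.add_nodes_from(range(len(sentences)))
    for i in range(len(sentences)):
        for j in range(i + 1, len(sentences)):
            if jaccard(token_sets[i], token_sets[j]) > threshold:
                g.add_edge(i, j)
    scores = nx.pagerank(g, alpha=0.85)  # 0.85 is the conventional damping factor
    return sorted(((scores[i], s) for i, s in enumerate(sentences)), reverse=True)
```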

Sentence Ranking: Graph-based method
[Figure: three graph variants. (a) Generic method without a reference corpus: a graph over target sentences T1–T7 only. (b) With a reference corpus: target sentences T1–T3 linked to reference sentences R1–R4. (c) With a lexical layer: target and reference sentences connected through shared word nodes W1–W3.]

Research Questions
- Does the use of reference corpora help improve importance estimation?
- Do graph-based estimation methods outperform non-graph-based methods?
- Does the additional representation of important lexical items help improve importance estimation for sentences?

Experimental Setup
Data set
- TREC data set? A preliminary experiment showed that some important snippets are not included, e.g.:
  - Fred Durst: "Born in Jacksonville, Fla., Durst grew up in Gastonia, N.C., where his love of hip-hop music and break dancing made him an outcast."
  - Eileen Marie Collins: "She was born Nov. 19, 1956, in Elmira, N.Y., to Jim and Rose Collins."
- New data set
  - 30 topics (persons)
  - 10 occupations

Experimental Setup
Assessment
- Take the top 20 snippets returned by the different systems
- Manually assess each snippet for important biographical information
- Two assessors; they were allowed to examine the topic in Wikipedia or using a general-purpose web search engine
- Agreement: kappa = 0.70 (a sketch of this statistic follows)
Baseline
- Rank sentences by their retrieval scores (this performed well at TREC 2003)
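For reference, agreement between two assessors on binary important/not-important judgements is commonly measured with Cohen's kappa; a small self-contained sketch (the labels below are hypothetical, not the study's data):

```python
def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two assessors' parallel judgements."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected chance agreement from each assessor's marginal label distribution.
    expected = sum(
        (labels_a.count(lbl) / n) * (labels_b.count(lbl) / n)
        for lbl in set(labels_a) | set(labels_b)
    )
    return (observed - expected) / (1 - expected)

# Hypothetical judgements: 1 = important, 0 = not important.
print(cohens_kappa([1, 1, 0, 1, 0, 1], [1, 0, 0, 1, 0, 1]))  # ~0.67
```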

Results
- 600 total snippets for each run
- Two scores
  - WOD: without duplicates
  - WD: with duplicates

Summary of Importance Estimation Methods
- Word overlap
  - Based on a single sentence
  - Returns several duplicates
- Language modelling
  - Based on the combined corpus
  - Does not distinguish between sentences
  - Less effective
- Generic graph-based method
  - Does not use the reference corpus
  - Based on redundancy in the news corpus
- Graph-based + reference corpus
  - Combines evidence from multiple sentences

Concluding Remarks
Task: estimating the importance of sentences.
Main finding: a combination of a corpus-based approach, capturing the knowledge encoded in sentences known to be important, and a graph-based method for ranking sentences performs best.

Thank you

Results: Significant differences?