Answering Definition Questions Using Multiple Knowledge Sources Wesley Hildebrandt, Boris Katz, and Jimmy Lin MIT Computer Science and Artificial Intelligence Laboratory 32 Vassar Street, Cambridge, MA

Abstract
Definition questions represent a largely unexplored area of question answering.
Multi-strategy approach:
- a database constructed offline with surface patterns
- a Web-based dictionary
- an off-the-shelf document retriever
Results come from:
- a component-level evaluation
- an end-to-end evaluation of the system at the TREC 2003 Question Answering Track

Answering Definition Questions
1. Extract the target term (the "target")
2. Look the target up in a database created offline from the AQUAINT corpus
3. Look the target up in a Web dictionary, followed by answer projection
4. Look the target up directly in the AQUAINT corpus with an IR engine (used only as a fallback; see Document Lookup)
5. Merge the answers from steps 2-4 to produce the final system output
A rough sketch of this flow follows.
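As a loose illustration of the five steps above, the pipeline can be read as a fallback chain; the component functions below are hypothetical stand-ins (they are sketched on the following slides), not the authors' actual code.

```python
def answer_definition_question(question, extract_target, database_lookup,
                               dictionary_lookup, document_lookup, merge_answers):
    """Hypothetical top-level flow mirroring steps 1-5 above; the component
    functions are passed in rather than reproduced here."""
    target = extract_target(question)            # step 1: find the target term
    candidates = list(database_lookup(target))   # step 2: precompiled nugget database
    candidates += dictionary_lookup(target)      # step 3: Web dictionary + projection
    if not candidates:                           # step 4: IR over AQUAINT as a fallback
        candidates = document_lookup(target)
    return merge_answers(candidates)             # step 5: merge and trim the responses
```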

Target Extraction
- A simple pattern-based parser extracts the target term using regular expressions (a minimal sketch follows)
- The extractor was tested on all definition questions from the TREC-9 and TREC-10 QA Track test sets and performed with 100% accuracy
- However, several targets from the TREC 2003 definition questions were not extracted correctly
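A minimal sketch of regular-expression target extraction, assuming question forms like "What is X?" or "Who was X?"; these patterns are illustrative only and are not the authors' actual rules.

```python
import re

# Illustrative patterns; the authors' extractor is not reproduced here.
TARGET_PATTERNS = [
    re.compile(r"^(?:what|who)\s+(?:is|are|was|were)\s+(?:(?:a|an|the)\s+)?(.+?)\s*\??$",
               re.IGNORECASE),
    re.compile(r"^(?:define|describe)\s+(.+?)\s*\??$", re.IGNORECASE),
]

def extract_target(question):
    """Return the target term of a definition question, or None."""
    question = question.strip()
    for pattern in TARGET_PATTERNS:
        match = pattern.match(question)
        if match:
            return match.group(1)
    return None

print(extract_target("What is Bausch & Lomb?"))   # -> Bausch & Lomb
print(extract_target("Who was Alexander Pope?"))  # -> Alexander Pope
```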

Database Lookup
Surface patterns for answer extraction:
- an effective strategy for factoid questions
- but one that often suffers from low recall
To boost recall:
- apply the set of surface patterns offline
- "precompile" from the AQUAINT corpus a list of nuggets about every entity
- construct an immense relational database containing nuggets distilled from every article in the corpus
Answering a question then becomes a simple lookup on the relevant target term (sketched below)
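A minimal sketch of the precompiled-nugget idea using SQLite; the schema and the extract_nuggets pattern matcher are assumptions for illustration, not the authors' actual implementation.

```python
import sqlite3

conn = sqlite3.connect("nuggets.db")
conn.execute("CREATE TABLE IF NOT EXISTS nuggets (target TEXT, nugget TEXT, docid TEXT)")
conn.execute("CREATE INDEX IF NOT EXISTS idx_target ON nuggets (target)")

def precompile(corpus, extract_nuggets):
    """corpus: iterable of (docid, text) pairs from AQUAINT.
    extract_nuggets: a function applying surface patterns (Table 1) to one
    article and yielding (target, nugget) pairs -- not shown here."""
    for docid, text in corpus:
        for target, nugget in extract_nuggets(text):
            conn.execute("INSERT INTO nuggets VALUES (?, ?, ?)",
                         (target.lower(), nugget, docid))
    conn.commit()

def database_lookup(target):
    """At question time, answering becomes a simple lookup on the target term."""
    rows = conn.execute("SELECT nugget, docid FROM nuggets WHERE target = ?",
                        (target.lower(),))
    return rows.fetchall()
```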

Database Lookup
Surface patterns operate at both the word and the part-of-speech level:
- rudimentary chunking, such as marking the boundaries of noun phrases, is performed by grouping words based on their part-of-speech tags
- this contextual information results in higher-quality answers
The 11 surface patterns are listed in Table 1, with examples in Table 2 (two illustrative patterns are sketched below)
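Two word-level approximations of the kinds of patterns listed in Table 1 (an appositive pattern and a copular pattern); the real patterns also use part-of-speech tags and noun-phrase chunks, so these regexes are only a sketch.

```python
import re

def appositive_nuggets(sentence, target):
    """e.g. "Bausch & Lomb, the world's largest eye care company, said ..."
    -> "the world's largest eye care company" """
    pattern = re.compile(re.escape(target) + r",\s+((?:a|an|the)\s+[^,.]+)[,.]",
                         re.IGNORECASE)
    return pattern.findall(sentence)

def copular_nuggets(sentence, target):
    """e.g. "Mold is a fungus that grows in damp places."
    -> "a fungus that grows in damp places" """
    pattern = re.compile(re.escape(target) + r"\s+(?:is|are|was|were)\s+((?:a|an|the)\s+[^,.]+)",
                         re.IGNORECASE)
    return pattern.findall(sentence)

print(appositive_nuggets(
    "Bausch & Lomb, the world's largest eye care company, said ...", "Bausch & Lomb"))
print(copular_nuggets("Mold is a fungus that grows in damp places.", "Mold"))
```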

11 Surface Patterns

Surface Patterns with examples

Dictionary Lookup
TREC evaluations require (answer, document) pairs, so answer projection techniques (Brill, 2001) are applied.
Dictionary-lookup approach:
- keywords from the target term's Web dictionary definition, together with the target itself, are used as the query to Lucene
- the top 100 documents returned are tokenized into individual sentences, and sentences without the target term are discarded
- the remaining sentences are scored by keyword overlap with the dictionary definition, weighted by the idf of each keyword (sketched below)
- sentences with non-zero scores are retained and, if necessary, shortened to 100 characters centered around the target term
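A minimal sketch of the idf-weighted keyword-overlap scoring used in the projection step; doc_freqs and num_docs are assumed to come from the index over AQUAINT, and this is not necessarily the authors' exact formula.

```python
import math
import re

def tokenize(text):
    return re.findall(r"\w+", text.lower())

def score_sentence(sentence, definition_keywords, doc_freqs, num_docs):
    """Score a corpus sentence by its keyword overlap with the Web dictionary
    definition, weighting each shared keyword by its idf."""
    words = set(tokenize(sentence))
    score = 0.0
    for keyword in set(definition_keywords):
        if keyword in words:
            score += math.log(num_docs / (1 + doc_freqs.get(keyword, 0)))
    return score

def trim(sentence, target, width=100):
    """Shorten a retained sentence to ~100 characters centered on the target."""
    if len(sentence) <= width:
        return sentence
    center = sentence.lower().find(target.lower())
    center = center + len(target) // 2 if center >= 0 else len(sentence) // 2
    start = max(0, min(center - width // 2, len(sentence) - width))
    return sentence[start:start + width]
```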

Document Lookup
- Adopted only if no answers were found by the previous two techniques (database and dictionary lookup)
- Uses traditional IR techniques over the AQUAINT corpus

Answer Merging
Redundancy removal:
- the problem is especially severe since nuggets about every entity in the entire AQUAINT corpus were precompiled
- simple heuristic: if two responses share more than 60% of their keywords, one of them is randomly discarded (sketched below)
Expected accuracy:
- all responses are ordered by the expected accuracy of the technique that extracted the nugget
Number of responses to return:
- given n total responses, the number is ...
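A minimal sketch of the 60%-overlap redundancy heuristic; how the overlap fraction is normalized (here, by the smaller keyword set) is an assumption rather than a detail from the slides.

```python
import random
import re

def keywords(text):
    return set(re.findall(r"\w+", text.lower()))

def remove_redundant(responses, threshold=0.6):
    """If two responses share more than 60% of their keywords, randomly
    discard one of them."""
    kept = []
    for response in responses:
        kws = keywords(response)
        duplicate_index = None
        for i, other in enumerate(kept):
            other_kws = keywords(other)
            overlap = len(kws & other_kws) / max(1, min(len(kws), len(other_kws)))
            if overlap > threshold:
                duplicate_index = i
                break
        if duplicate_index is None:
            kept.append(response)
        elif random.random() < 0.5:   # keep one of the near-duplicate pair at random
            kept[duplicate_index] = response
    return kept
```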

Component Evaluation
160 definition questions from the TREC-9 and TREC-10 QA Track test sets:
- database lookup: 8 nuggets per question at ... accuracy
- dictionary lookup: 1.5 nuggets per question at ... accuracy
Recall of these techniques is extremely hard to measure directly.
The results represent a baseline for the performance of each technique.
The focus was not on perfecting each individual pattern and the dictionary-matching algorithm, but on building a complete working system.

Component Evaluation

TREC 2003 Results
- The system performed well, ranking 8th out of 25 groups that participated in the TREC 2003 QA Track
- Official results for the definition sub-task are shown in Table 4
- The formula used to calculate the F-measure is given in Figure 1 (reconstructed below)
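Figure 1 is not reproduced in this transcript; the following is a reconstruction of the TREC 2003 definition F-measure from the track's published scoring description, where r is the number of vital nuggets returned, a the number of non-vital nuggets returned, R the total number of vital nuggets, and length the character length of the whole response.

```latex
\begin{align*}
\mathrm{NR} &= \frac{r}{R} \\
\mathrm{allowance} &= 100 \times (r + a) \\
\mathrm{NP} &=
  \begin{cases}
    1 & \text{if } \mathit{length} < \mathrm{allowance} \\[2pt]
    1 - \dfrac{\mathit{length} - \mathrm{allowance}}{\mathit{length}} & \text{otherwise}
  \end{cases} \\
F(\beta) &= \frac{(\beta^{2} + 1)\,\mathrm{NP}\cdot\mathrm{NR}}
                 {\beta^{2}\,\mathrm{NP} + \mathrm{NR}},
  \qquad \beta = 5 \text{ in TREC 2003}
\end{align*}
```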

TREC 2003 Results

BBN (Xu et al., 2003)
- The best run, with an F-measure of ...
- Uses many of the same techniques described here
- One important exception: they did not precompile nuggets into a database
- They also cited recall as a major cause of poor performance
- Their IR baseline achieved an F-measure of ..., which beat all other runs
Because the F-measure heavily favored recall over precision, simple IR techniques worked extremely well.

TREC 2003 Results

Evaluation Reconsidered
- The Scoring Metric
- Variations in Judgment

The Scoring Metric
Recall = r/R, computed over vital nuggets only (r = number of vital nuggets returned, R = total number of vital nuggets):
- a system that returned every non-vital nugget but no vital nuggets would receive a score of zero
The distinction between vital and non-vital nuggets is itself somewhat arbitrary. For example, for "What is Bausch & Lomb?":
- world's largest eye care company -> vital
- about ... employees -> vital
- in 50 countries -> vital
- approx. $1.8 billion annual revenue -> vital
- based in Rochester, New York -> non-vital

The Scoring Metric
What is the proper value of β?
- If β = 1, the difference in performance between our system and the top system is virtually indistinguishable
- The advantages of surface patterns, linguistic processing, answer fusion, and other techniques become more obvious when the F-measure is not so heavily biased towards recall

Variations in Judgment
- Humans naturally have differing opinions
- These differences of opinion are not mistakes, but legitimate variations in what assessors consider acceptable
- Different assessors may judge nuggets differently, contributing to detectable variations in score

Variations in Judgment
The assessors' nugget list should satisfy:
- Atomicity: each nugget should ideally represent an atomic concept
- Uniqueness: nuggets should be unique, not only in their text but also in their meaning
- Completeness: many relevant items of information returned by systems did not make it onto the assessors' nugget list (even as non-vital nuggets)
Evaluating answers to definition questions is a challenging task; consistent, repeatable, and meaningful scoring guidelines are critical to the field.

Examples of Atomic & Unique
Atomicity:
- "Harlem civil rights leader": provides one fact
- Alexander Pope is an "English poet": two separate facts
Uniqueness

Future Work
A robust named-entity extractor would help:
- Target extraction: a key, non-trivial capability critical to the success of the system
- Database lookup: works only if the relevant target terms are identified and indexed while preprocessing the corpus
- For example, the extractor should be able to identify specialized names (e.g., "Bausch & Lomb", "Destiny's Child", "Akbar the Great"); a minimal sketch follows
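A minimal sketch of what plugging in an off-the-shelf named-entity tagger might look like, assuming spaCy and its small English model (spaCy is not a tool the slides mention; it stands in for "a robust named-entity extractor").

```python
import spacy

# Assumes the small English model is installed:
#   python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

def candidate_targets(question):
    """Return named entities found in the question, e.g. "Bausch & Lomb" or
    "Destiny's Child", as candidate target terms."""
    doc = nlp(question)
    return [(ent.text, ent.label_) for ent in doc.ents]

print(candidate_targets("What is Bausch & Lomb?"))
```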

Future Work
More accurate surface patterns:
- expand the context on which these patterns operate to reduce false matches
- as an example, consider the e1_is pattern: over 60% of its irrelevant nuggets were cases where the target is the object of a preposition rather than the subject of the copular verb immediately following it
- for example, the question "What is mold?" matched the sentence "tools you need to look for mold are ..." (a heuristic filter is sketched below)
Good-nugget predictor:
- separate "good" nuggets from "bad" nuggets using machine learning techniques
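A heuristic sketch of the kind of context check the slide suggests for the e1_is pattern: reject matches where the word immediately before the target is a preposition, so "tools you need to look for mold are ..." would not yield a nugget for "mold". The preposition list and the regex are illustrative assumptions, not the authors' rules.

```python
import re

PREPOSITIONS = {"for", "of", "in", "on", "to", "with", "about", "at", "by", "from"}

def is_false_copular_match(sentence, target):
    """Return True if the target looks like the object of a preposition rather
    than the subject of the copular verb that follows it."""
    match = re.search(r"(\w+)\s+" + re.escape(target) + r"\s+(?:is|are|was|were)\b",
                      sentence, re.IGNORECASE)
    if not match:
        return False
    return match.group(1).lower() in PREPOSITIONS

print(is_false_copular_match("tools you need to look for mold are hard to find", "mold"))  # True
print(is_false_copular_match("Experts say mold is a common household fungus", "mold"))     # False
```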

Conclusion
A novel set of strategies for answering definition questions from multiple knowledge sources:
- database, dictionary, and document lookup
The answers derived from each source are smoothly integrated to produce a final set of answers.
The analyses show:
- the difficulty of evaluating definition questions
- the inability of present metrics to accurately capture the information needs of real-world users