AQUAINT Evaluation Overview
Ellen M. Voorhees
NIST

Components of Evaluation
– Pilot investigations into how to evaluate advanced QA systems
– TREC QA track

Pilot Studies
– Definitions
– Dialogue
– Fixed Domain
– Justifications
– Multimedia & multilingual systems
– No answer
– Opinions
– Relationships

Pilot Studies
– Too many pilots diluted efforts
  – dialog and definition pilots mostly complete
  – relationship and opinion pilots had a start
  – data obtained for fixed domain pilot
– Emphasis for 2nd year of AQUAINT
  – fixed domain
  – relationships
  – multimedia/multilingual versions

TREC QA Track
– Community-wide evaluation of open-domain QA systems
  – currently limited to "factoid" questions
    Who was the first person to run the mile in less than four minutes?
    When was the Rosenberg trial?
    How deep is Crater Lake?
  – use news corpus as source of answers
    systems required to return a document from the corpus in support of the answer

QA Track Participation
– 34 groups from North America, Europe, & Asia

Task
– Return exactly one response for each of 500 questions
  – response is either a [doc, string] pair or NIL
  – NIL indicates belief that there is no answer in the corpus
  – new this year: string must be nothing more or less than an answer
– Rank questions by confidence in answer
  – emphasis on getting systems to recognize when they have found a good answer
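A minimal Python sketch of what one per-question response and a confidence-ordered submission could look like under the task rules above. The QAResponse structure, the field names, and the question and document IDs are illustrative assumptions, not the official TREC submission format.

    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class QAResponse:
        question_id: str       # illustrative ID, not a real TREC question number
        doc_id: Optional[str]  # None encodes a NIL response (no answer in corpus)
        answer: Optional[str]  # exact answer string, or None for NIL
        confidence: float      # system's belief that the answer is right

    # Exactly one response per question; the submission is then ordered by
    # confidence, most confident first, for confidence-weighted scoring.
    responses = [
        QAResponse("Q0001", "XIE19990325.0123", "Roger Bannister", 0.91),  # made-up doc ID
        QAResponse("Q0002", None, None, 0.10),                             # NIL response
    ]
    submission = sorted(responses, key=lambda r: r.confidence, reverse=True)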

Data
– New AQUAINT document set
  – articles from NY Times newswire, AP newswire, and Xinhua News Agency
  – approximately 3 GB of text
  – approximately 1,033,000 articles
– Questions taken from MSNSearch and AskJeeves logs
  – no definition questions
  – some spelling/grammatical errors remain
  – 46 questions with no known answer in docs

Motivation for Exact Answers
– Text snippets masking important differences among systems
– Pinpointing precise extent of answer important to driving technology
  – not a statement that deployed systems should return only exact answers
  – exact answers may be important as a component in larger language systems

Exact Answers
– Human assessors judged responses
  – Wrong: string does not contain a correct answer, or answer is unresponsive
  – Not Supported: string contains a correct answer, but doc does not support that answer
  – Not Exact: string contains a correct answer and doc supports it, but string contains too much (or too little) info
  – Right: string is exactly a correct answer that is supported by the doc
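A small Python sketch of how these four judgment categories might be represented when computing scores. Treating only Right as correct is the strict interpretation; the lenient variant in the helper is my own assumption, shown for illustration only.

    from enum import Enum

    class Judgment(Enum):
        WRONG = "W"        # no correct answer, or answer is unresponsive
        UNSUPPORTED = "U"  # correct answer, but the cited doc does not support it
        INEXACT = "X"      # correct and supported, but too much or too little text
        RIGHT = "R"        # exactly a correct answer, supported by the doc

    def counts_as_correct(judgment: Judgment, strict: bool = True) -> bool:
        # Strict scoring credits only RIGHT; the lenient alternative shown
        # here (also crediting UNSUPPORTED and INEXACT) is an assumption.
        if strict:
            return judgment is Judgment.RIGHT
        return judgment in (Judgment.RIGHT, Judgment.UNSUPPORTED, Judgment.INEXACT)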

Distribution of Judgments
– 15,948 judgments across 500 questions
– In general, systems can find extent of answer if they can find it at all
  – distribution skewed across systems
  – attempt to get exact answer sometimes caused units to be lost (so marked Wrong)
– Judgment breakdown:
  – 12,… Wrong
  – …% Unsupported
  – …% ineXact
  – 2,… Right

Confidence-weighted Scoring
– Focus on getting systems to know when they have found a good answer
  – questions ranked by confidence in answer
  – compute score based on the ranking:
    score = (1/N) * sum_{i=1..N} (number right up to rank i) / i
    where N is the number of questions
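A short Python sketch of the confidence-weighted score defined above, computed over per-question correctness flags that are already sorted from the system's most confident question to its least confident (the function and variable names are my own):

    def confidence_weighted_score(correct_in_rank_order):
        # correct_in_rank_order: list of booleans, one per question, ordered
        # from most to least confident. Returns
        # (1/N) * sum over i of (number right in first i ranks) / i.
        n = len(correct_in_rank_order)
        total, right_so_far = 0.0, 0
        for i, is_right in enumerate(correct_in_rank_order, start=1):
            if is_right:
                right_so_far += 1
            total += right_so_far / i
        return total / n if n else 0.0

    # Worked example with 3 questions (right, wrong, right in confidence order):
    # 1/1 + 1/2 + 2/3 = 2.1667, divided by N = 3 gives about 0.72
    print(confidence_weighted_score([True, False, True]))

The metric rewards placing correct answers near the top of the ranking: the same three judgments ordered worst-first ([False, True, True]) score only about 0.39.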

Main Task Results

Main Themes
– Many systems now using specific data sources for expected question types
  – name lists
  – gazetteers
– Web used by most systems, but in different ways
  – primary source of answer that is then mapped to corpus
  – one of several sources whose results are fused
  – place to validate answer found in corpus

Confidence Ranking
– Different approaches
  – most groups used the type of question as a factor
  – some systems that use scoring techniques to rank candidate answers also used that score for ranking questions
  – a few groups used a training set to learn a good feature set and corresponding weights, then applied the classifier to the test set
  – many groups ranked NIL questions last
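A minimal sketch of one of the simpler strategies listed above, reusing the hypothetical QAResponse records from the earlier sketch: treat the answer-extraction score as the question's confidence and push NIL responses to the bottom of the ranking.

    def rank_for_submission(responses):
        # Non-NIL responses first, ordered by descending confidence;
        # NIL responses (doc_id is None) are ranked last.
        return sorted(responses, key=lambda r: (r.doc_id is None, -r.confidence))

    ranked = rank_for_submission(responses)  # 'responses' from the earlier sketch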