Text REtrieval Conference (TREC): Implementing a Question-Answering Evaluation for AQUAINT
Ellen M. Voorhees and Donna Harman

TREC QA Track
- Goal: encourage research into systems that return answers, rather than document lists, in response to a question
- NIST (TREC-8) subgoal: investigate whether the evaluation methodology used for text retrieval systems is appropriate for another NLP task

Task
- For each closed-class question, return a ranked list of 5 [docid, text-snippet] pairs
  - snippets drawn from a large news collection
  - score: reciprocal rank of the first correct response (see the sketch below)
- Test conditions
  - TRECs 8 and 9: 50- or 250-byte snippets; answer guaranteed to exist in the collection
  - TREC 2001: 50-byte snippets only; no guarantee of an answer in the collection
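As a concrete illustration of the per-question scoring rule above, here is a minimal Python sketch that computes the reciprocal rank of the first correct response in a ranked list of at most five [docid, snippet] pairs. The `judge` callable stands in for a NIST assessor's correct/incorrect decision; the function and variable names are illustrative and not part of any official TREC scoring software.

```python
from typing import Callable, List, Tuple

def reciprocal_rank(
    responses: List[Tuple[str, str]],      # ranked [docid, text-snippet] pairs, at most 5
    judge: Callable[[str, str], bool],     # assessor decision: is (docid, snippet) correct?
) -> float:
    """Score one question: 1/rank of the first correct response, 0 if none is correct."""
    for rank, (docid, snippet) in enumerate(responses[:5], start=1):
        if judge(docid, snippet):
            return 1.0 / rank
    return 0.0

# Example: the first correct response appears at rank 3, so the question scores 1/3.
ranked = [("APW001", "..."), ("APW002", "..."),
          ("APW003", "the 16th President, Abraham Lincoln"),
          ("APW004", "..."), ("APW005", "...")]
print(reciprocal_rank(ranked, lambda d, s: "Abraham Lincoln" in s))   # 0.333...
```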

Sample Questions
- How many calories are there in a Big Mac?
- What is the fare for a round trip between New York and London on the Concorde?
- Who was the 16th President of the United States?
- Where is the Taj Mahal?
- When did French revolutionaries storm the Bastille?

Selecting Questions
- TREC-8
  - most questions created specifically for the track
  - NIST staff selected the questions
- TREC-9
  - questions suggested by logs of real questions
  - much more ambiguous, and therefore more difficult ("Who is Colin Powell?" vs. "Who invented the paper clip?")
- TREC 2001
  - questions taken directly from filtered logs
  - large percentage of definition questions

What Evaluation Methodology?
Different philosophies:
- IR: the "user" is the sole judge of a satisfactory response
  - human assessors judge responses
  - flexible interpretation of a correct response
  - final scores are comparative, not absolute
- IE: there exists "the" answer
  - answer keys developed by an application expert
  - requires enumeration of all acceptable responses at the outset
  - subsequent scoring is trivial; final scores are absolute

QA Track Evaluation Methodology
- NIST assessors judge answer strings
  - binary judgment of correct/incorrect
  - the document provides context for the answer
- In TREC-8, each question was independently judged by 3 assessors
  - built a high-quality final judgment set
  - provided data for measuring the effect of differences between judges on final scores

Judging Guidelines
- Document context is used
  - as a frame of reference ("Who is President of the United States?", "What is the world's population?")
  - credit given for getting the answer from a mistaken document
- Answers must be responsive
  - no credit when a list of possible answers is given
  - must include units and appropriate punctuation
  - for questions about famous objects, answers must pertain to that one object

Validating the Evaluation Methodology
- Is user-based evaluation appropriate?
- Is it reliable?
  - can assessors perform the task?
  - do differences between assessors affect scores?
- Does the methodology produce a QA test collection?
  - can it evaluate runs that were not judged?

Assessors Can Perform the Task
- Examined assessors during the task
  - written comments
  - think-aloud sessions
- Measured agreement among assessors (sketch below)
  - on average, 6% of judged strings had some disagreement
  - mean overlap of 0.641 across 3 judges for the 193 questions that had some correct answer found
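A minimal sketch of how the agreement figure above could be computed, assuming "overlap" means the size of the intersection of the judges' sets of strings marked correct divided by the size of their union, averaged over questions with at least one correct answer found. The exact formula and the data structures here are assumptions for illustration, not taken from the track's scoring code.

```python
from typing import Dict, List, Set

def overlap(judged_correct: List[Set[str]]) -> float:
    """Intersection-over-union of the answer strings each judge marked correct."""
    union = set().union(*judged_correct)
    if not union:
        return 0.0                         # no judge found a correct answer
    intersection = set.intersection(*judged_correct)
    return len(intersection) / len(union)

def mean_overlap(per_question: Dict[str, List[Set[str]]]) -> float:
    """Mean overlap over questions for which at least one correct answer was found."""
    scores = [overlap(judges) for judges in per_question.values() if any(judges)]
    return sum(scores) / len(scores)

# Toy example with 3 judges for one question: overlap = |{a}| / |{a, b, c}| = 1/3
example = {"Q42": [{"a", "b"}, {"a", "c"}, {"a"}]}
print(mean_overlap(example))   # 0.333...
```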

User Evaluation Is Necessary
- Even for these questions, context matters
  - e.g., the Taj Mahal casino in Atlantic City
- Legitimate differences of opinion as to whether a string contains a correct answer
  - granularity of dates
  - completeness of names
  - "confusability" of the answer string
- If assessors' opinions differ, so will eventual end users' opinions

QA Track Scoring Metric
- Mean reciprocal rank (MRR), computed as sketched below
  - the score for an individual question is the reciprocal of the rank at which the first correct response was returned (0 if no correct response was returned)
  - the score of a run is the mean over the test set of questions
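A brief sketch of the run-level computation described above, assuming each question has already been reduced to the rank of its first correct response (None when no correct response was returned); the names are illustrative.

```python
from typing import List, Optional

def mean_reciprocal_rank(first_correct_ranks: List[Optional[int]]) -> float:
    """MRR: mean over the test set of 1/rank, counting unanswered questions as 0."""
    per_question = [1.0 / r if r is not None else 0.0 for r in first_correct_ranks]
    return sum(per_question) / len(per_question)

# Worked example: first correct responses at ranks 1 and 3, and one question with
# no correct response -> (1 + 1/3 + 0) / 3 = 0.444...
print(mean_reciprocal_rank([1, 3, None]))
```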

TREC 2001 QA Results
- Scores for the best run of each of the top 8 groups, using strict evaluation

Comparative Scores Are Stable
- Quantify the effect of different judgments by calculating the correlation between the system rankings they produce (sketch below)
  - mean Kendall τ of 0.96 (both TREC-8 and TREC 2001)
  - equivalent to the variation found in IR test collections
- Judgment sets based on one judge's opinion are equivalent to the adjudicated judgment set
  - the adjudicated set costs more than 3 times as much as a single-judge set
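The stability claim above rests on comparing the ranking of systems produced under one judgment set with the ranking produced under another. Below is a minimal sketch of Kendall's τ over two such rankings (the simple form without tie correction); the run names and scores are made up for illustration.

```python
from itertools import combinations
from typing import Dict

def kendall_tau(scores_a: Dict[str, float], scores_b: Dict[str, float]) -> float:
    """Kendall's tau between the system orderings induced by two score sets
    (simple form: (concordant - discordant) / (n choose 2), ignoring ties).
    Assumes both dictionaries cover the same set of systems."""
    systems = sorted(scores_a)
    concordant = discordant = 0
    for s, t in combinations(systems, 2):
        a = scores_a[s] - scores_a[t]
        b = scores_b[s] - scores_b[t]
        if a * b > 0:
            concordant += 1
        elif a * b < 0:
            discordant += 1
    n_pairs = len(systems) * (len(systems) - 1) / 2
    return (concordant - discordant) / n_pairs

# Hypothetical MRR scores for four runs under two different judgment sets
judgeset_1 = {"runA": 0.68, "runB": 0.55, "runC": 0.41, "runD": 0.30}
judgeset_2 = {"runA": 0.66, "runB": 0.57, "runC": 0.29, "runD": 0.31}
print(kendall_tau(judgeset_1, judgeset_2))   # 0.666... (one swapped pair out of six)
```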

Methodology Summary
User-based evaluation is appropriate and necessary for the QA task:
+ user-based evaluation accommodates different opinions: different assessors have conflicting opinions as to the correctness of a response, which reflects the real world
+ assessors understand their task and can do it
+ comparative results are stable
- the effect of judgment differences on training is unknown
- a more coherent user model is needed

TREC 2001
Introduced new tasks:
- a task that requires collating information from multiple documents to form a list of responses ("What are 9 novels written by John Updike?")
- short sequences of interrelated questions ("Where was Chuck Berry born? What was his first song on the radio? When did it air?")

Participation in QA Tracks

TREC 2001 QA Participants

AQUAINT Proposals

Proposed Evaluations
- Extended QA (end-to-end) track in TREC
- Knowledge base task
- Dialog task

Dialog Evaluation
- Goal: evaluate interactive use of QA systems; explore issues at the analyst/system interface
- Participants: contractors whose main emphasis is on dialog (e.g., SUNY/Rutgers, LCC, SRI); others welcome, including TREC interactive participants
- Plan: design a pilot study for 2002 during the breakout; full evaluation in 2003

Knowledge Base Evaluation
- Goal: investigate systems' ability to exploit deep knowledge; assume the KB already exists
- Participants: SAIC, SRI, Cyc; others welcome (AAAI '99 workshop participants?)
- Plan: design a full evaluation plan for 2002 during the breakout

End-to-End Evaluation
- Goals: continue the TREC QA track as suggested by the roadmap; add AQUAINT-specific conditions; introduce task variants based on a question typology
- Participants: all contractors and other TREC participants
- Plan: quickly devise an interim typology for use in 2002; an ad hoc working group to develop a more formal typology by December 2002