Text REtrieval Conference (TREC)
Implementing a Question-Answering Evaluation for AQUAINT
Ellen M. Voorhees
Donna Harman
TREC QA Track
Goal: encourage research into systems that return answers, rather than document lists, in response to a question
NIST (TREC-8) subgoal: investigate whether the evaluation methodology used for text retrieval systems is appropriate for another NLP task
Task
For each closed-class question, return a ranked list of 5 [docid, text-snippet] pairs
  – snippets drawn from a large news collection
  – score: reciprocal rank of the first correct response
Test conditions
  – TRECs 8, 9: 50- or 250-byte snippets; answer guaranteed to exist in the collection
  – TREC 2001: 50-byte snippets only; no guarantee of an answer in the collection
Sample Questions
How many calories are there in a Big Mac?
What is the fare for a round trip between New York and London on the Concorde?
Who was the 16th President of the United States?
Where is the Taj Mahal?
When did French revolutionaries storm the Bastille?
Selecting Questions
TREC-8
  – most questions created specifically for the track
  – NIST staff selected the questions
TREC-9
  – questions suggested by logs of real questions
  – much more ambiguous, and therefore difficult (Who is Colin Powell? vs. Who invented the paper clip?)
TREC 2001
  – questions taken directly from filtered logs
  – large percentage of definition questions
What Evaluation Methodology?
Different philosophies
  – IR: the “user” is the sole judge of a satisfactory response
      human assessors judge responses
      flexible interpretation of a correct response
      final scores are comparative, not absolute
  – IE: there exists “the” answer
      answer keys developed by an application expert
      requires enumeration of all acceptable responses at the outset
      subsequent scoring is trivial; final scores are absolute
QA Track Evaluation Methodology
NIST assessors judge answer strings
  – binary judgment of correct/incorrect
  – document provides context for the answer
In TREC-8, each question was independently judged by 3 assessors
  – built a high-quality final judgment set
  – provided data for measuring the effect of differences between judges on final scores
Judging Guidelines
Document context is used
  – frame of reference (Who is President of the United States? What is the world’s population?)
  – credit for getting the answer from a mistaken document
Answers must be responsive
  – no credit when a list of possible answers is given
  – must include units and appropriate punctuation
  – for questions about famous objects, answers must pertain to that one object
Validating Evaluation Methodology
Is user-based evaluation appropriate?
Is it reliable?
  – can assessors perform the task?
  – do differences affect scores?
Does the methodology produce a QA test collection?
  – evaluate runs that were not judged
Assessors Can Perform Task
Examined assessors during the task
  – written comments
  – think-aloud sessions
Measured agreement among assessors
  – on average, 6% of judged strings had some disagreement
  – mean overlap of .641 across 3 judges for the 193 questions that had some correct answer found
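A minimal sketch of how such an overlap figure could be computed, assuming overlap for a question is the size of the intersection of the strings the three judges marked correct divided by the size of their union, averaged over questions with at least one correct string found; the function names and judgment data below are illustrative, not from the track.

```python
# Overlap sketch: |intersection| / |union| of the strings each assessor
# judged correct, averaged over questions (illustrative data only).

def question_overlap(judgment_sets):
    """judgment_sets: list of sets of strings each assessor marked correct."""
    union = set().union(*judgment_sets)
    if not union:                      # no assessor found a correct string
        return None                    # question excluded from the mean
    intersection = set.intersection(*judgment_sets)
    return len(intersection) / len(union)

def mean_overlap(per_question_sets):
    """per_question_sets: one list of assessor judgment sets per question."""
    scores = [s for s in (question_overlap(js) for js in per_question_sets)
              if s is not None]
    return sum(scores) / len(scores)

# Toy example: two questions, three assessors each.
example = [
    [{"1865", "April 1865"}, {"1865"}, {"1865", "in 1865"}],
    [{"Lincoln"}, {"Abraham Lincoln", "Lincoln"}, {"Lincoln"}],
]
print(round(mean_overlap(example), 3))   # 0.417 with these made-up judgments
```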
User Evaluation Necessary
Even for these questions, context matters
  – Taj Mahal casino in Atlantic City
Legitimate differences of opinion as to whether a string contains a correct answer
  – granularity of dates
  – completeness of names
  – “confusability” of the answer string
If assessors’ opinions differ, so will eventual end-users’ opinions
QA Track Scoring Metric
Mean reciprocal rank
  – the score for an individual question is the reciprocal of the rank at which the first correct response is returned (0 if no correct response is returned)
  – the score of a run is the mean over the test set of questions
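The metric is straightforward to compute; the sketch below scores a few toy questions under the definition above (the per-rank judgments are hypothetical, and this is not an official scoring script).

```python
# Mean reciprocal rank (MRR) over a run that returns up to 5 responses
# per question; judgments are booleans in rank order.

def reciprocal_rank(judged_responses):
    """judged_responses: list of booleans, in rank order (rank 1 first)."""
    for rank, correct in enumerate(judged_responses, start=1):
        if correct:
            return 1.0 / rank
    return 0.0                      # no correct response among those returned

def mean_reciprocal_rank(judged_run):
    """judged_run: dict mapping question id -> list of per-rank judgments."""
    scores = [reciprocal_rank(r) for r in judged_run.values()]
    return sum(scores) / len(scores)

# Toy run over three questions: first correct at rank 1, rank 3, and never.
toy_run = {
    "Q1": [True, False, False, False, False],    # score 1.0
    "Q2": [False, False, True, False, False],    # score 1/3
    "Q3": [False, False, False, False, False],   # score 0.0
}
print(round(mean_reciprocal_rank(toy_run), 3))   # 0.444
```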
TREC 2001 QA Results
Scores for the best run from each of the top 8 groups, using strict evaluation
Comparative Scores Stable
Quantify the effect of different judgments by calculating the correlation between the system rankings they produce
  – mean Kendall τ of .96 (both TREC-8 and TREC 2001)
  – equivalent to the variation found in IR test collections
Judgment sets based on one judge’s opinion are equivalent to the adjudicated judgment set
  – the adjudicated set costs more than 3 times as much as a single-judge set
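A rough sketch of the ranking comparison, assuming Kendall's τ is computed over all pairs of systems as (concordant − discordant) / total pairs with no tied ranks; the system names and ranks are invented for illustration.

```python
# Kendall tau between two rankings of the same systems (no ties assumed).
from itertools import combinations

def kendall_tau(ranking_a, ranking_b):
    """ranking_a, ranking_b: dicts mapping system name -> rank (1 = best)."""
    systems = list(ranking_a)
    concordant = discordant = 0
    for s1, s2 in combinations(systems, 2):
        order_a = ranking_a[s1] - ranking_a[s2]
        order_b = ranking_b[s1] - ranking_b[s2]
        if order_a * order_b > 0:       # pair ordered the same way in both
            concordant += 1
        else:
            discordant += 1
    return (concordant - discordant) / (concordant + discordant)

# Rankings of four hypothetical systems under two different judgment sets.
ranks_judge1 = {"sysA": 1, "sysB": 2, "sysC": 3, "sysD": 4}
ranks_judge2 = {"sysA": 1, "sysB": 3, "sysC": 2, "sysD": 4}
print(round(kendall_tau(ranks_judge1, ranks_judge2), 3))   # 0.667
```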
Methodology Summary
User-based evaluation is appropriate and necessary for the QA task
  + user-based evaluation accommodates different opinions
      different assessors have conflicting opinions as to the correctness of a response; this reflects the real world
  + assessors understand their task and can do it
  + comparative results are stable
  – the effect of judgment differences on training is unknown
  – a more coherent user model is needed
TREC 2001
Introduced new tasks
  – a task that requires collating information from multiple documents to form a list of responses
      (What are 9 novels written by John Updike?)
  – short sequences of interrelated questions
      (Where was Chuck Berry born? What was his first song on the radio? When did it air?)
Participation in QA Tracks
TREC 2001 QA Participants
AQUAINT Proposals
Proposed Evaluations
Extended QA (end-to-end) track in TREC
Knowledge base task
Dialog task
Dialog Evaluation
Goal: evaluate the interactive use of QA systems
  – explore issues at the analyst/system interface
Participants: contractors whose main emphasis is on dialog (e.g., SUNY/Rutgers, LCC, SRI)
  – others welcome, including TREC interactive participants
Plan: design a pilot study for 2002 during the breakout
  – full evaluation in 2003
Knowledge Base Evaluation
Goal: investigate systems’ ability to exploit deep knowledge
  – assume the KB already exists
Participants: SAIC, SRI, Cyc
  – others welcome (AAAI ’99 workshop participants?)
Plan: design the full evaluation plan for 2002 during the breakout
End-to-End Evaluation
Goals: continue the TREC QA track as suggested by the roadmap
  – add AQUAINT-specific conditions
  – introduce task variants based on a question typology
Participants: all contractors, other TREC participants
Plan: quickly devise an interim typology for use in 2002
  – ad hoc working group to develop a more formal typology by December 2002