
1 Text REtrieval Conference (TREC)
Implementing a Question-Answering Evaluation for AQUAINT
Ellen M. Voorhees and Donna Harman

2 TREC QA Track
Goal: encourage research into systems that return answers, rather than document lists, in response to a question
NIST (TREC-8) subgoal: investigate whether the evaluation methodology used for text retrieval systems is appropriate for another NLP task

3 Task
For each closed-class question, return a ranked list of 5 [docid, text-snippet] pairs
–snippets drawn from large news collection
–score: reciprocal rank of first correct response
Test conditions
–TRECs 8, 9: 50 or 250 byte snippets; answer guaranteed to exist in collection
–TREC 2001: 50 byte snippets only; no guarantee of answer in collection
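To make the scoring rule concrete, here is a minimal sketch in Python. It is not TREC's evaluation software: the `question_score` helper and the `is_correct` callback are hypothetical names, and the callback stands in for a human assessor's binary correct/incorrect judgment of each [docid, text-snippet] pair.

```python
# Hypothetical sketch of the per-question scoring rule: a run submits up to
# five (docid, snippet) pairs per question, and the question's score is the
# reciprocal of the rank of the first pair judged correct (0 if none is).
from typing import Callable, List, Tuple

Response = Tuple[str, str]  # (docid, text snippet)

def question_score(responses: List[Response],
                   is_correct: Callable[[str, str], bool]) -> float:
    """Reciprocal rank of the first correct response among at most five."""
    for rank, (docid, snippet) in enumerate(responses[:5], start=1):
        if is_correct(docid, snippet):
            return 1.0 / rank
    return 0.0

# Toy example: the first correct snippet appears at rank 3, so the score is 1/3.
run = [("APW-001", "The president spoke on Tuesday"),
       ("APW-002", "Washington was the first U.S. president"),
       ("APW-003", "Abraham Lincoln, the 16th president of the United States"),
       ("APW-004", "Lincoln, Nebraska"),
       ("APW-005", "Ford's Theatre reopened")]
judged = lambda docid, snippet: "16th president" in snippet.lower()
print(question_score(run, judged))  # 0.333...
```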

4 Sample Questions
How many calories are there in a Big Mac?
What is the fare for a round trip between New York and London on the Concorde?
Who was the 16th President of the United States?
Where is the Taj Mahal?
When did French revolutionaries storm the Bastille?

5 Selecting Questions
TREC-8
–most questions created specifically for track
–NIST staff selected questions
TREC-9
–questions suggested by logs of real questions
–much more ambiguous, and therefore difficult: Who is Colin Powell? vs. Who invented the paper clip?
TREC 2001
–questions taken directly from filtered logs
–large percentage of definition questions

6 What Evaluation Methodology?
Different philosophies
–IR: the “user” is the sole judge of a satisfactory response
   human assessors judge responses
   flexible interpretation of correct response
   final scores comparative, not absolute
–IE: there exists the answer
   answer keys developed by application expert
   requires enumeration of all acceptable responses at outset
   subsequent scoring trivial; final scores absolute

7 QA Track Evaluation Methodology
NIST assessors judge answer strings
–binary judgment of correct/incorrect
–document provides context for answer
In TREC-8, each question was independently judged by 3 assessors
–built high-quality final judgment set
–provided data for measuring effect of differences between judges on final scores

8 Judging Guidelines
Document context used
–frame of reference: Who is President of the United States? What is the world’s population?
–credit for getting answer from mistaken doc
Answers must be responsive
–no credit when list of possible answers given
–must include units, appropriate punctuation
–for questions about famous objects, answers must pertain to that one object

9 Validating Evaluation Methodology
Is user-based evaluation appropriate?
Is it reliable?
–can assessors perform the task?
–do differences affect scores?
Does the methodology produce a QA test collection?
–evaluate runs that were not judged

10 Assessors Can Perform Task
Examined assessors during the task
–written comments
–think-aloud sessions
Measured agreement among assessors
–on average, 6% of judged strings had some disagreement
–mean overlap of 0.641 across 3 judges for 193 questions that had some correct answer found
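The overlap figure above can be illustrated with a short sketch, under the assumption (consistent with how TREC typically reports assessor overlap) that overlap for a question is the size of the intersection of the sets of strings the three assessors judged correct, divided by the size of their union. The assessor sets below are invented placeholders, not actual TREC judgments.

```python
# Sketch of the assessor-overlap statistic, assuming overlap =
# |intersection| / |union| of the response strings each assessor judged
# correct for a question. The three sets are illustrative only.
def overlap(judgment_sets):
    union = set().union(*judgment_sets)
    intersection = set.intersection(*map(set, judgment_sets))
    return len(intersection) / len(union) if union else 1.0

assessor_1 = {"Abraham Lincoln", "Lincoln"}
assessor_2 = {"Abraham Lincoln"}
assessor_3 = {"Abraham Lincoln", "Lincoln"}

print(overlap([assessor_1, assessor_2, assessor_3]))  # 0.5 for this toy question
```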

11 User Evaluation Necessary
Even for these questions, context matters
–Taj Mahal casino in Atlantic City
Legitimate differences in opinion as to whether string contains correct answer
–granularity of dates
–completeness of names
–“confusability” of answer string
If assessors’ opinions differ, so will eventual end-users’ opinions

12 QA Track Scoring Metric
Mean reciprocal rank
–score for an individual question is the reciprocal of the rank at which the first correct response was returned (0 if no correct response is returned)
–score of a run is the mean over the test set of questions
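Stated as a formula (a restatement of the definition above, not text from the original slide), where Q is the test set of questions and rank_i is the rank of the first correct response for question i, with 1/rank_i taken as 0 when no correct response is returned:

```latex
\[
  \mathrm{MRR} \;=\; \frac{1}{|Q|} \sum_{i=1}^{|Q|} \frac{1}{\mathrm{rank}_i}
\]
```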

13 TREC 2001 QA Results
Scores for the best run of the top 8 groups using strict evaluation

14 Comparative Scores Stable
Quantify effect of different judgments by calculating correlation between rankings of systems
–mean Kendall τ of 0.96 (both TREC-8 & TREC 2001)
–equivalent to variation found in IR collections
Judgment sets based on 1 judge’s opinion equivalent to adjudicated judgment set
–adjudicated > 3 times the cost of 1-judge
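As a sketch of this stability check, the Kendall τ between two system rankings can be computed with SciPy. The two score lists below are invented placeholders standing in for the same set of runs ranked under two different assessors' judgment sets.

```python
# Sketch of comparing system rankings produced by two judgment sets using
# Kendall's tau. The MRR values are invented placeholders; each position in
# both lists refers to the same hypothetical run.
from scipy.stats import kendalltau

mrr_under_judge_a = [0.68, 0.58, 0.46, 0.41, 0.35, 0.22]
mrr_under_judge_b = [0.66, 0.59, 0.42, 0.44, 0.33, 0.21]

tau, p_value = kendalltau(mrr_under_judge_a, mrr_under_judge_b)
print(f"Kendall tau between the two system rankings: {tau:.2f}")  # 0.87 here
```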

15 Methodology Summary
User-based evaluation is appropriate and necessary for the QA task
+user-based evaluation accommodates different opinions
   different assessors have conflicting opinions as to the correctness of a response
   reflects the real world
+assessors understand their task and can do it
+comparative results are stable
–effect of differences on training unknown
–need more coherent user model

16 TREC 2001
Introduced new tasks
–new task that requires collating information from multiple documents to form a list of responses
   What are 9 novels written by John Updike?
–short sequences of interrelated questions
   Where was Chuck Berry born? What was his first song on the radio? When did it air?

17 Text REtrieval Conference (TREC) Participation in QA Tracks

18 Text REtrieval Conference (TREC) TREC 2001 QA Participants

19 Text REtrieval Conference (TREC) AQUAINT Proposals

20 Proposed Evaluations
Extended QA (end-to-end) track in TREC
Knowledge base task
Dialog task

21 Dialog Evaluation
Goal:
–evaluate interactive use of QA systems
–explore issues at the analyst/system interface
Participants:
–contractors whose main emphasis is on dialog (e.g., SUNY/Rutgers, LCC, SRI)
–others welcome, including TREC interactive participants
Plan:
–design pilot study for 2002 during breakout
–full evaluation in 2003

22 Knowledge Base Evaluation
Goal:
–investigate systems’ ability to exploit deep knowledge
–assume KB already exists
Participants:
–SAIC, SRI, Cyc
–others welcome (AAAI ’99 workshop participants?)
Plan:
–design full evaluation plan for 2002 during breakout

23 End-to-End Evaluation
Goals:
–continue TREC QA track as suggested by roadmap
–add AQUAINT-specific conditions
–introduce task variants based on question typology
Participants:
–all contractors, other TREC participants
Plan:
–quickly devise interim typology for use in 2002
–ad hoc working group to develop more formal typology by December 2002

