
1 Text REtrieval Conference (TREC)
Implementing a Question-Answering Evaluation for AQUAINT
Ellen M. Voorhees and Donna Harman

2 TREC QA Track
Goal: encourage research into systems that return answers, rather than document lists, in response to a question
NIST (TREC-8) subgoal: investigate whether the evaluation methodology used for text retrieval systems is appropriate for another NLP task

3 Task
For each closed-class question, return a ranked list of 5 [docid, text-snippet] pairs
–snippets drawn from large news collection
–score: reciprocal rank of first correct response
Test conditions
–TRECs 8, 9: 50 or 250 byte snippets; answer guaranteed to exist in collection
–TREC 2001: 50 byte snippets only; no guarantee of answer in collection
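To make the scoring rule concrete, here is a minimal sketch in Python. It is not TREC's evaluation software: the `question_score` helper and the `is_correct` callback are hypothetical names, and the callback stands in for a human assessor's binary correct/incorrect judgment of each [docid, text-snippet] pair.

```python
# Hypothetical sketch of the per-question scoring rule: a run submits up to
# five (docid, snippet) pairs per question, and the question's score is the
# reciprocal of the rank of the first pair judged correct (0 if none is).
from typing import Callable, List, Tuple

Response = Tuple[str, str]  # (docid, text snippet)

def question_score(responses: List[Response],
                   is_correct: Callable[[str, str], bool]) -> float:
    """Reciprocal rank of the first correct response among at most five."""
    for rank, (docid, snippet) in enumerate(responses[:5], start=1):
        if is_correct(docid, snippet):
            return 1.0 / rank
    return 0.0

# Toy example: the first correct snippet appears at rank 3, so the score is 1/3.
run = [("APW-001", "The president spoke on Tuesday"),
       ("APW-002", "Washington was the first U.S. president"),
       ("APW-003", "Abraham Lincoln, the 16th president of the United States"),
       ("APW-004", "Lincoln, Nebraska"),
       ("APW-005", "Ford's Theatre reopened")]
judged = lambda docid, snippet: "16th president" in snippet.lower()
print(question_score(run, judged))  # 0.333...
```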

4 Sample Questions
How many calories are there in a Big Mac?
What is the fare for a round trip between New York and London on the Concorde?
Who was the 16th President of the United States?
Where is the Taj Mahal?
When did French revolutionaries storm the Bastille?

5 Selecting Questions
TREC-8
–most questions created specifically for track
–NIST staff selected questions
TREC-9
–questions suggested by logs of real questions
–much more ambiguous, and therefore difficult: Who is Colin Powell? vs. Who invented the paper clip?
TREC 2001
–questions taken directly from filtered logs
–large percentage of definition questions

6 What Evaluation Methodology?
Different philosophies
–IR: the “user” is the sole judge of a satisfactory response
   human assessors judge responses
   flexible interpretation of correct response
   final scores comparative, not absolute
–IE: there exists the answer
   answer keys developed by application expert
   requires enumeration of all acceptable responses at outset
   subsequent scoring trivial; final scores absolute

7 QA Track Evaluation Methodology
NIST assessors judge answer strings
–binary judgment of correct/incorrect
–document provides context for answer
In TREC-8, each question was independently judged by 3 assessors
–built high-quality final judgment set
–provided data for measuring effect of differences between judges on final scores

8 Judging Guidelines
Document context used
–frame of reference: Who is President of the United States? What is the world’s population?
–credit for getting answer from mistaken doc
Answers must be responsive
–no credit when list of possible answers given
–must include units, appropriate punctuation
–for questions about famous objects, answers must pertain to that one object

9 Validating Evaluation Methodology
Is user-based evaluation appropriate?
Is it reliable?
–can assessors perform the task?
–do differences affect scores?
Does the methodology produce a QA test collection?
–evaluate runs that were not judged

10 Assessors Can Perform Task
Examined assessors during the task
–written comments
–think-aloud sessions
Measured agreement among assessors
–on average, 6% of judged strings had some disagreement
–mean overlap of 0.641 across 3 judges for 193 questions that had some correct answer found
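The overlap figure above can be illustrated with a short sketch, under the assumption (consistent with how TREC typically reports assessor overlap) that overlap for a question is the size of the intersection of the sets of strings the three assessors judged correct, divided by the size of their union. The assessor sets below are invented placeholders, not actual TREC judgments.

```python
# Sketch of the assessor-overlap statistic, assuming overlap =
# |intersection| / |union| of the response strings each assessor judged
# correct for a question. The three sets are illustrative only.
def overlap(judgment_sets):
    union = set().union(*judgment_sets)
    intersection = set.intersection(*map(set, judgment_sets))
    return len(intersection) / len(union) if union else 1.0

assessor_1 = {"Abraham Lincoln", "Lincoln"}
assessor_2 = {"Abraham Lincoln"}
assessor_3 = {"Abraham Lincoln", "Lincoln"}

print(overlap([assessor_1, assessor_2, assessor_3]))  # 0.5 for this toy question
```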

11 User Evaluation Necessary
Even for these questions, context matters
–Taj Mahal casino in Atlantic City
Legitimate differences in opinion as to whether string contains correct answer
–granularity of dates
–completeness of names
–“confusability” of answer string
If assessors’ opinions differ, so will eventual end-users’ opinions

12 QA Track Scoring Metric
Mean reciprocal rank
–score for an individual question is the reciprocal of the rank at which the first correct response was returned (0 if no correct response is returned)
–score of a run is the mean over the test set of questions
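Stated as a formula (a restatement of the definition above, not text from the original slide), where Q is the test set of questions and rank_i is the rank of the first correct response for question i, with 1/rank_i taken as 0 when no correct response is returned:

```latex
\[
  \mathrm{MRR} \;=\; \frac{1}{|Q|} \sum_{i=1}^{|Q|} \frac{1}{\mathrm{rank}_i}
\]
```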

13 TREC 2001 QA Results
Scores for the best run of the top 8 groups using strict evaluation

14 Comparative Scores Stable
Quantify effect of different judgments by calculating correlation between rankings of systems
–mean Kendall τ of 0.96 (both TREC-8 & TREC 2001)
–equivalent to variation found in IR collections
Judgment sets based on 1 judge’s opinion equivalent to adjudicated judgment set
–adjudicated > 3 times the cost of 1-judge
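As a sketch of this stability check, the Kendall τ between two system rankings can be computed with SciPy. The two score lists below are invented placeholders standing in for the same set of runs ranked under two different assessors' judgment sets.

```python
# Sketch of comparing system rankings produced by two judgment sets using
# Kendall's tau. The MRR values are invented placeholders; each position in
# both lists refers to the same hypothetical run.
from scipy.stats import kendalltau

mrr_under_judge_a = [0.68, 0.58, 0.46, 0.41, 0.35, 0.22]
mrr_under_judge_b = [0.66, 0.59, 0.42, 0.44, 0.33, 0.21]

tau, p_value = kendalltau(mrr_under_judge_a, mrr_under_judge_b)
print(f"Kendall tau between the two system rankings: {tau:.2f}")  # 0.87 here
```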

15 Methodology Summary
User-based evaluation is appropriate and necessary for the QA task
+user-based evaluation accommodates different opinions
   different assessors have conflicting opinions as to the correctness of a response
   reflects the real world
+assessors understand their task and can do it
+comparative results are stable
–effect of differences on training unknown
–need more coherent user model

16 TREC 2001
Introduced new tasks
–new task that requires collating information from multiple documents to form a list of responses
   What are 9 novels written by John Updike?
–short sequences of interrelated questions
   Where was Chuck Berry born? What was his first song on the radio? When did it air?

17 Text REtrieval Conference (TREC) Participation in QA Tracks

18 Text REtrieval Conference (TREC) TREC 2001 QA Participants

19 Text REtrieval Conference (TREC) AQUAINT Proposals

20 Proposed Evaluations
Extended QA (end-to-end) track in TREC
Knowledge base task
Dialog task

21 Dialog Evaluation
Goal:
–evaluate interactive use of QA systems
–explore issues at the analyst/system interface
Participants:
–contractors whose main emphasis is on dialog (e.g., SUNY/Rutgers, LCC, SRI)
–others welcome, including TREC interactive participants
Plan:
–design pilot study for 2002 during breakout
–full evaluation in 2003

22 Knowledge Base Evaluation
Goal:
–investigate systems’ ability to exploit deep knowledge
–assume KB already exists
Participants:
–SAIC, SRI, Cyc
–others welcome (AAAI ’99 workshop participants?)
Plan:
–design full evaluation plan for 2002 during breakout

23 End-to-End Evaluation
Goals:
–continue TREC QA track as suggested by roadmap
–add AQUAINT-specific conditions
–introduce task variants based on question typology
Participants:
–all contractors, other TREC participants
Plan:
–quickly devise interim typology for use in 2002
–ad hoc working group to develop more formal typology by December 2002

