1
Evaluating Answers to Definition Questions (HLT-NAACL 2003) & Overview of the TREC 2003 Question Answering Track (TREC 2003)
Ellen Voorhees, NIST
2
QA Tracks in NIST
Pilot evaluation in the ARDA AQUAINT program (fall 2002). The purpose of each pilot is to develop an effective evaluation methodology for systems that answer a certain kind of question. The HLT-NAACL 2003 paper is about the Definition Pilot.
3
QA Tracks in NIST (Cont.)
TREC 2003 QA Track (August 2003)
Passage task: systems returned a single text snippet in response to factoid questions.
Main task: contains factoid, list, and definition questions; the final score is a combination of the scores for the separate question types.
4
Definition Questions
Ask for the definition or explanation of a term, or an introduction to a person or an organization, e.g. "What is mold?" and "Who is Colin Powell?"
Answer texts are longer than for factoid questions.
Answers vary widely, so system performance is not easy to evaluate: precision? recall? exactness?
5
Example Response to a Definition Question
"Who is Christopher Reeve?"
System responses:
Actor
the actor who was paralyzed when he fell off his horse
the name attraction stars on Sunday in ABC's remake of "Rear Window
was injured in a show jumping accident and has become a spokesman for the cause
6
First Round of the Definition Pilot
8 runs (A-H); multiple answers allowed for each question in one run; no length limit.
Two assessors (the author of the questions and one other person).
Two kinds of scores (0-10 points):
Content score: higher if the response contains more useful and less misleading information.
Organization score: higher if useful information appears earlier.
The final score combines the two, with more emphasis on the content score.
7
Results of the 1st Round of the Definition Pilot
Ranking of runs:
Author: F A D E B G C H
Other:  F A E G D B H C
Scores varied across assessors, partly due to different interpretations of the "organization score", but the organization score was strongly correlated with the content score.
A rough relative ranking emerged.
8
Second Round of the Definition Pilot
Goal: develop a more quantitative evaluation of system responses.
"Information nuggets": atomic pieces of information about the target of the question.
What assessors do:
1. Create a list of information nuggets.
2. Decide which nuggets are vital (must appear in a good definition).
3. Mark which nuggets appear in each system response.
9
Example of Assessment
Nugget (concept) recall is quite straightforward: the ratio of concepts retrieved.
Precision is hard to define: it is hard to divide a response into concepts, so the denominator is unknown.
Using recall alone to evaluate systems is untenable: returning entire documents would earn full recall.
10
Approximation to Precision
Borrowed from DUC (Harman and Over, 2002).
An allowance of 100 (non-space) characters is given for each nugget retrieved.
The response is penalized if its length exceeds the allowance:
precision = 1 - (length - allowance) / length
In the previous example, allowance = 4 * 100 and length = 175, so precision = 1.
11
Final Score
Recall is computed only over vital nuggets (2/3 in the previous example).
Precision is computed over all nuggets.
Let
  r   = the number of vital nuggets returned in a response,
  a   = the number of acceptable (non-vital but listed) nuggets returned in a response,
  R   = the total number of vital nuggets in the assessor's list,
  len = the number of non-whitespace characters in an answer string, summed over all answer strings in the response.
Then
  recall    = r / R
  allowance = 100 * (r + a)
  precision = 1                              if len < allowance
            = 1 - (len - allowance) / len    otherwise
  F(beta)   = (beta^2 + 1) * precision * recall / (beta^2 * precision + recall)
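A minimal Python sketch of the scoring above, using the counts r, a, R, and len as defined on this slide; the function name and the worked example call are illustrative, not the official NIST scoring code.

```python
def definition_f_score(r, a, R, length, beta=5):
    """Nugget-based F-measure for a single definition response.

    r      -- number of vital nuggets returned in the response
    a      -- number of acceptable (non-vital but listed) nuggets returned
    R      -- total number of vital nuggets on the assessor's list
    length -- non-whitespace character count of the whole response
    beta   -- recall weight (beta=5 strongly favours recall)
    """
    recall = r / R if R > 0 else 0.0
    allowance = 100 * (r + a)            # 100 characters per retrieved nugget
    if length < allowance:
        precision = 1.0
    else:
        precision = 1.0 - (length - allowance) / length
    if precision + recall == 0:
        return 0.0
    return ((beta ** 2 + 1) * precision * recall) / (beta ** 2 * precision + recall)

# Worked example from the precision slide: 4 nuggets retrieved, 175 characters,
# 2 of 3 vital nuggets found -> precision 1.0, recall 2/3.
print(definition_f_score(r=2, a=2, R=3, length=175))
```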
12
Results of the 2nd Round of the Definition Pilot
F-measure: different β values give different F-measure rankings; β = 5 approximates the ranking of the first round.

      author         other          length
  F   0.688      F   0.757      F    935.6   more verbose
  A   0.606      A   0.687      A   1121.2   more verbose
  D   0.568      G   0.671      D    281.8
  G   0.562      D   0.669      G    164.5   relatively terse
  E   0.555      E   0.657      E    533.9
  B   0.467      B   0.522      B   1236.5   complete sentences
  C   0.349      C   0.384      C     84.7
  H   0.330      H   0.365      H     33.7   single snippet

Rankings are stable!
13
Definition Task in the TREC QA Track
50 questions:
30 about people (e.g. Andrea Bocelli, Ben Hur)
10 about organizations (e.g. Friends of the Earth)
10 about other things (e.g. TB, feng shui)
Scenario: The questioner is an adult, a native speaker of English, and an "average" reader of US newspapers. In reading an article, the user has come across a term that they would like to find out more about. They may have some basic idea of what the term means, either from the context of the article (for example, a bandicoot must be a type of animal) or from basic background knowledge (Ulysses S. Grant was a US president). They are not experts in the domain of the target, and are therefore not seeking esoteric details (e.g., not a zoologist looking to distinguish the different species in genus Perameles).
14
Result of Def Task, QA Track
15
Analysis of the TREC QA Track
Fidelity: the extent to which the evaluation measures what it is intended to measure. For TREC: the extent to which the abstraction captures (some of) the issues of the real task.
Reliability: the extent to which an evaluation result can be trusted. For TREC: the extent to which the evaluation ranks a better system ahead of a worse system.
16
Definition Task Fidelity
It is unclear whether the average user strongly prefers recall (as β = 5 implies), and longer responses seem to receive higher scores.
To determine how selective a system is, two baselines were used:
Baseline: return all sentences in the corpus that contain the target.
Smarter baseline (BBN): like the baseline, but with little overlap between the returned sentences.
17
Definition Task Fidelity (Cont.)
No conclusion about the β value can be drawn; at least β = 5 matches the user need observed in the pilot.
18
Definition Task Reliability
Sources of noise or error:
Human mistakes in judgment
Different opinions from different assessors
The question set
Evaluating the effect of different opinions:
Two assessors create two different nugget sets, and the runs are scored using both nugget lists.
The stability of the rankings is measured by Kendall's τ.
The τ score is 0.848 (a ranking is considered stable if τ > 0.9), which is not good enough.
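As an illustration of this ranking-stability check, here is a small Python sketch that computes a simple (tie-ignoring) Kendall's τ over two sets of per-run F scores; the input values are borrowed from the pilot table above purely as example data, and the function is not the assessors' actual tooling.

```python
from itertools import combinations

def kendall_tau(scores_a, scores_b):
    """Kendall's tau between two rankings of the same runs.

    scores_a, scores_b -- dicts mapping run name -> F score under each
    assessor's nugget list.  Pairs tied in either ranking count as
    neither concordant nor discordant (a simplification).
    """
    runs = list(scores_a)
    concordant = discordant = 0
    for x, y in combinations(runs, 2):
        da = scores_a[x] - scores_a[y]
        db = scores_b[x] - scores_b[y]
        if da * db > 0:
            concordant += 1
        elif da * db < 0:
            discordant += 1
    n_pairs = len(runs) * (len(runs) - 1) / 2
    return (concordant - discordant) / n_pairs

# Example input: the pilot's F scores for runs A-H under the two assessors.
author = {"F": 0.688, "A": 0.606, "D": 0.568, "G": 0.562,
          "E": 0.555, "B": 0.467, "C": 0.349, "H": 0.330}
other  = {"F": 0.757, "A": 0.687, "G": 0.671, "D": 0.669,
          "E": 0.657, "B": 0.522, "C": 0.384, "H": 0.365}
print(kendall_tau(author, other))
```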
19
Example of Different Nugget Lists
"What is a golden parachute?"

Assessor 1:
1  vital  Agreement between companies and top executives
2  vital  Provides remuneration to executives who lose jobs
3  vital  Remuneration is usually very generous
4         Encourages execs not to resist takeover beneficial to shareholders
5         Incentive for execs to join companies
6         Arrangement for which IRS can impose excise tax

Assessor 2:
1  vital  provides remuneration to executives who lose jobs
2  vital  assures officials of rich compensation if lose job due to takeover
3  vital  contract agreement between companies and their top executives
4         aids in hiring and retention
5         encourages officials not to resist a merger
6         IRS can impose taxes
20
Definition Task Reliability (Cont.)
If two large question sets of the same size are used, the F-measure scores of a system should be similar.
Simulation of such an evaluation (sketched below):
Randomly create two question sets of the required size.
Define the error rate as the percentage of rank swaps.
Group run pairs by the difference in their F(β = 5) scores.
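A rough Python sketch of the simulation just described, assuming hypothetical per-question F(β = 5) scores for two runs; the score distributions, set size, and trial count are illustrative, not the values used in the track analysis.

```python
import random
from statistics import mean

def rank_swap_rate(scores_x, scores_y, set_size, trials=1000, seed=0):
    """Estimate the rank-swap rate between two runs.

    scores_x, scores_y -- per-question F(beta=5) scores for runs X and Y
    (one value per question).  Each trial draws two disjoint question sets
    of the same size and checks whether they disagree about which run
    scores higher (a "rank swap").
    """
    rng = random.Random(seed)
    questions = list(range(len(scores_x)))
    swaps = 0
    for _ in range(trials):
        rng.shuffle(questions)
        set_a = questions[:set_size]
        set_b = questions[set_size:2 * set_size]
        diff_a = (mean(scores_x[q] for q in set_a)
                  - mean(scores_y[q] for q in set_a))
        diff_b = (mean(scores_x[q] for q in set_b)
                  - mean(scores_y[q] for q in set_b))
        if diff_a * diff_b < 0:      # the two sets rank the runs differently
            swaps += 1
    return swaps / trials

# Hypothetical per-question scores for two runs over a 50-question set.
rng = random.Random(42)
run_x = [min(1.0, max(0.0, rng.gauss(0.55, 0.2))) for _ in range(50)]
run_y = [min(1.0, max(0.0, rng.gauss(0.50, 0.2))) for _ in range(50)]
print(rank_swap_rate(run_x, run_y, set_size=25))
```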
22
Definition Task Reliability (Cont.)
Most errors (rank swaps) happen in the groups with small score differences.
A difference greater than 0.123 is required to have confidence in F(β = 5).
More questions are needed in the test set to increase sensitivity while remaining equally confident in the result.
23
List Task
List questions have multiple possible answers, e.g. "List the names of chewing gums."
No target number of answers is specified.
The final answer list for a question is the collection of correct answers found in the corpus.
Systems are scored by instance precision (IP) and instance recall (IR), combined as F = 2 * IP * IR / (IP + IR); a sketch follows below.
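A minimal Python sketch of the list-task scoring, assuming simple case-insensitive exact matching stands in for the assessor's judgment of which returned instances are correct; the function name and example data are illustrative.

```python
def list_task_f(system_answers, final_answer_list):
    """Instance precision/recall F for a single list question.

    system_answers    -- distinct answer instances returned by the system
    final_answer_list -- assessor's collection of correct answers in the corpus
    """
    answers = {a.strip().lower() for a in system_answers}
    gold = {a.strip().lower() for a in final_answer_list}
    correct = len(answers & gold)
    ip = correct / len(answers) if answers else 0.0   # instance precision
    ir = correct / len(gold) if gold else 0.0         # instance recall
    return 2 * ip * ir / (ip + ir) if ip + ir else 0.0

# Example against part of the chewing-gum answer list on the next slide.
gold = ["Orbit", "Trident", "Dentyne", "Chiclets", "Big Red"]
print(list_task_f(["Orbit", "Trident", "Wrigley"], gold))
```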
24
Example of a Final Answer List
Question 1915: List the names of chewing gums.
Stimorol, Orbit, Winterfresh, Double Bubble, Dirol, Trident, Spearmint, Bazooka, Doublemint, Dentyne, Freedent, Hubba Bubba, Juicy Fruit, Big Red, Chiclets, Nicorette
25
Other Tasks
Passage task: return a short (under 250 characters) span of text containing an answer; the text must be extracted from a document.
Factoid task: exact answers.
The passage task is evaluated separately.
The final score for the main task is
FinalScore = 1/2 * FactoidScore + 1/4 * ListScore + 1/4 * DefScore
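For example (hypothetical scores), a run with factoid accuracy 0.6, list F 0.4, and definition F 0.5 would receive 0.5 * 0.6 + 0.25 * 0.4 + 0.25 * 0.5 = 0.525.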