Evaluating Answers to Definition Questions (HLT-NAACL 2003) & Overview of the TREC 2003 Question Answering Track (TREC 2003)
Ellen Voorhees, NIST

QA Tracks at NIST

- Pilot evaluations in the ARDA AQUAINT program (fall 2002)
  - The purpose of each pilot is to develop an effective evaluation methodology for systems that answer a certain kind of question.
  - The HLT-NAACL 2003 paper describes the Definition Pilot.

QA Tracks at NIST (cont.)

- TREC 2003 QA Track (August 2003)
  - Passage task: systems returned a single text snippet in response to each factoid question.
  - Main task: contains factoid, list, and definition questions; the final score is a combination of the scores for the separate question types.

Definition Questions

- Ask for the definition or explanation of a term, or an introduction to a person or organization
  - e.g., "What is mold?" and "Who is Colin Powell?"
- Answer texts are longer than factoid answers.
- Answers vary widely, so it is not easy to evaluate system performance.
  - Precision? Recall? Exactness?

Example of Responses to a Definition Question

"Who is Christopher Reeve?"

System responses:
- Actor
- the actor who was paralyzed when he fell off his horse
- the name attraction
- stars on Sunday in ABC's remake of "Rear Window"
- was injured in a show jumping accident and has become a spokesman for the cause

First Round of the Definition Pilot

- 8 runs (A-H); multiple answers allowed for each question in a run; no length limit
- Two assessors: the author of the questions and one other person
- Two kinds of scores (0-10 points)
  - Content score: higher for more useful and less misleading information
  - Organization score: higher when useful information appears earlier
- The final score combines the two, with more emphasis on the content score.
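
The pilot does not spell out exactly how the two 0-10 scores were combined, only that content was emphasized more. The sketch below therefore uses assumed, illustrative weights (`w_content` and `w_org` are not from the source), just to make the combination concrete:

```python
def first_round_score(content: float, organization: float,
                      w_content: float = 0.7, w_org: float = 0.3) -> float:
    """Combine a 0-10 content score and a 0-10 organization score.

    The 0.7/0.3 weights are an assumption for illustration only; the pilot
    just states that content carries more weight than organization.
    """
    return w_content * content + w_org * organization

# A response judged 8/10 on content and 5/10 on organization.
print(first_round_score(8, 5))  # 7.1 under the assumed weights
```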

Results of the First Round of the Definition Pilot

- Ranking of runs:
  - Author assessor: F A D E B G C H
  - Other assessor: F A E G D B H C
- Scores varied across assessors.
  - Different interpretations of the "organization score"
  - But the organization score was strongly correlated with the content score.
- Still, a rough relative ranking of systems emerged.

Second Round of the Definition Pilot

- Goal: develop a more quantitative evaluation of system responses
- "Information nuggets": atomic pieces of information about the target of the question
- What assessors do (a sketch of this bookkeeping follows below):
  1. Create a list of information nuggets
  2. Decide which nuggets are vital (must appear in a good definition)
  3. Mark which nuggets appear in each system response
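
As a rough illustration of steps 1-3 (not the assessors' actual tooling), here is a minimal Python sketch of the bookkeeping: a nugget list with vital flags, and a naive string-match stand-in for the assessor's judgment of which nuggets a response contains. The example nuggets are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Nugget:
    text: str    # the assessor's phrasing of one atomic fact
    vital: bool  # True if the nugget must appear in a good definition

# Hypothetical nugget list for "Who is Christopher Reeve?" (illustration only).
nuggets = [
    Nugget("actor", vital=True),
    Nugget("paralyzed after falling off his horse", vital=True),
    Nugget("spokesman for spinal cord injury research", vital=False),
]

def mark_nuggets(response: str, nuggets: list[Nugget]) -> list[bool]:
    """Crude stand-in for step 3: a nugget counts as retrieved if its wording
    appears in the response. Real assessors judge conceptual matches, not
    string matches."""
    return [n.text.lower() in response.lower() for n in nuggets]

response = "the actor who was paralyzed after falling off his horse"
print(mark_nuggets(response, nuggets))  # [True, True, False]
```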

Example of Assessment

- Nugget (concept) recall is quite straightforward: the ratio of nuggets retrieved.
- Precision is hard to define: it is hard to divide a response into concepts, so the denominator is unknown.
- Using only recall to evaluate systems is untenable: returning entire documents would earn full recall.

Approximation to Precision

- Borrowed from DUC (Harman and Over, 2002)
- An allowance of 100 non-whitespace characters for each nugget retrieved
- A length penalty applies only if the response is longer than its allowance:
  precision = 1 - (length - allowance) / length
- In the previous example, allowance = 4 * 100 and length = 175, so precision = 1 (see the sketch below).
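
A minimal sketch of this length-based precision, following the description above; the nugget count of 4 and length of 175 come from the slide's own example.

```python
def length_precision(num_nuggets_returned: int, length: int,
                     allowance_per_nugget: int = 100) -> float:
    """Length-based approximation to precision used in the definition pilot.

    A response gets an allowance of 100 non-whitespace characters per nugget
    it returns; only the excess beyond the allowance is penalized."""
    allowance = allowance_per_nugget * num_nuggets_returned
    if length <= allowance:
        return 1.0
    return 1.0 - (length - allowance) / length

# The slide's example: 4 nuggets retrieved, 175 non-whitespace characters.
print(length_precision(4, 175))  # 1.0
```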

Final Score

- Recall is computed only over vital nuggets (2/3 in the previous example).
- Precision is computed over all nuggets returned.
- Let
  - r be the number of vital nuggets returned in a response;
  - a be the number of acceptable (non-vital, but on the list) nuggets returned in a response;
  - R be the total number of vital nuggets in the assessor's list;
  - len be the number of non-whitespace characters in an answer string, summed over all answer strings in the response.
- Then
  recall NR = r / R
  allowance = 100 * (r + a)
  precision NP = 1 if len < allowance, otherwise 1 - (len - allowance) / len
  F(β) = (β^2 + 1) * NP * NR / (β^2 * NP + NR)
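
Putting the pieces together, here is a small sketch of the nugget F-measure exactly as defined above; the variable names mirror the slide's r, a, R, and len, and the example numbers are illustrative.

```python
def nugget_f(r: int, a: int, R: int, length: int, beta: float = 5.0) -> float:
    """Nugget F-measure for a single definition response.

    r: vital nuggets returned, a: acceptable (non-vital) nuggets returned,
    R: total vital nuggets on the assessor's list,
    length: non-whitespace characters summed over all answer strings."""
    recall = r / R
    allowance = 100 * (r + a)
    precision = 1.0 if length < allowance else 1.0 - (length - allowance) / length
    if precision == 0 and recall == 0:
        return 0.0
    return (beta**2 + 1) * precision * recall / (beta**2 * precision + recall)

# Example: 2 of 3 vital nuggets plus 2 acceptable ones in a 175-character response.
print(round(nugget_f(r=2, a=2, R=3, length=175), 3))  # recall = 2/3, precision = 1
```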

Results of the Second Round of the Definition Pilot

- F-measure: different β values produce different rankings.
- β = 5 approximates the ranking from the first round.

  author F(β=5)    other F(β=5)    avg. length (chars)    note
  F 0.688          F 0.757         F  935.6               more verbose
  A 0.606          A 0.687         A 1121.2               more verbose
  D 0.568          G 0.671         D  281.8
  G 0.562          D 0.669         G  164.5               relatively terse
  E 0.555          E 0.657         E  533.9
  B 0.467          B 0.522         B 1236.5               complete sentences
  C 0.349          C 0.384         C   84.7
  H 0.330          H 0.365         H   33.7               single snippet

- Rankings are stable!

Definition Task in the TREC QA Track

- 50 questions
  - 30 about people (e.g., Andrea Bocelli, Ben Hur)
  - 10 about organizations (e.g., Friends of the Earth)
  - 10 about other things (e.g., TB, feng shui)
- Scenario
  - The questioner is an adult, a native speaker of English, and an "average" reader of US newspapers. In reading an article, the user has come across a term that they would like to find out more about. They may have some basic idea of what the term means, either from the context of the article (for example, a bandicoot must be a type of animal) or from basic background knowledge (Ulysses S. Grant was a US president). They are not experts in the domain of the target, and therefore are not seeking esoteric details (e.g., not a zoologist looking to distinguish the different species in genus Perameles).

Result of Def Task, QA Track

Analysis of the TREC QA Track

- Fidelity: the extent to which the evaluation measures what it is intended to measure.
  - In TREC terms: the extent to which the abstraction captures (some of) the issues of the real task.
- Reliability: the extent to which an evaluation result can be trusted.
  - In TREC terms: the extent to which the evaluation ranks a better system ahead of a worse one.

Definition Task Fidelity

- It is unclear whether the average user really prefers recall as strongly as β = 5 implies.
- Longer responses appear to receive higher scores.
- Baselines help determine how selective a system is (a sketch follows below):
  - Baseline: return every sentence in the corpus that contains the target.
  - Smarter baseline (BBN): like the baseline, but keep the overlap between returned sentences small.
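
Not the actual BBN system, just a rough sketch of the two baselines described above: return every sentence mentioning the target, and a greedy variant that skips sentences with high word overlap against what has already been selected. The overlap threshold is an assumption for illustration.

```python
def simple_baseline(sentences: list[str], target: str) -> list[str]:
    """Return every sentence in the corpus that mentions the target string."""
    return [s for s in sentences if target.lower() in s.lower()]

def low_overlap_baseline(sentences: list[str], target: str,
                         max_overlap: float = 0.5) -> list[str]:
    """Greedy variant: skip a sentence if it shares too many words with the
    sentences already selected (a rough stand-in for the BBN-style baseline)."""
    selected: list[str] = []
    seen_words: set[str] = set()
    for s in simple_baseline(sentences, target):
        words = set(s.lower().split())
        overlap = len(words & seen_words) / max(len(words), 1)
        if overlap <= max_overlap:
            selected.append(s)
            seen_words |= words
    return selected
```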

Definition Task Fidelity (cont.)

- No firm conclusion about the right β value can be drawn.
- At least β = 5 matches the user need expressed in the pilot.

Definition Task Reliability

- Sources of noise or error:
  - Human mistakes in judgment
  - Different opinions from different assessors
  - The question set itself
- Evaluating the effect of different opinions (see the sketch below):
  - Two assessors create two different nugget sets.
  - Runs are scored against both nugget lists.
  - The stability of the two rankings is measured by Kendall's τ.
- The resulting τ score (rankings are considered stable only if τ > 0.9) is not good enough.
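
As an illustration of this check, Kendall's τ between two rankings can be computed with scipy. The scores below reuse the pilot table from the earlier slide purely as example input; the actual TREC comparison was run over the track's submissions, not these numbers.

```python
from scipy.stats import kendalltau

# F(beta=5) per run under the two assessors' nugget lists
# (example input only, taken from the pilot table above).
runs = ["A", "B", "C", "D", "E", "F", "G", "H"]
author = [0.606, 0.467, 0.349, 0.568, 0.555, 0.688, 0.562, 0.330]
other  = [0.687, 0.522, 0.384, 0.669, 0.657, 0.757, 0.671, 0.365]

tau, p_value = kendalltau(author, other)
print(f"Kendall's tau = {tau:.3f}")  # rankings count as stable only if tau > 0.9
```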

Example of Different Nugget Lists

"What is a golden parachute?"

Nugget list from assessor 1:
1. vital - Agreement between companies and top executives
2. vital - Provides remuneration to executives who lose jobs
3. vital - Remuneration is usually very generous
4. Encourages execs not to resist takeover beneficial to shareholders
5. Incentive for execs to join companies
6. Arrangement for which IRS can impose excise tax

Nugget list from assessor 2:
1. vital - provides remuneration to executives who lose jobs
2. vital - assures officials of rich compensation if lose job due to takeover
3. vital - contract agreement between companies and their top executives
4. aids in hiring and retention
5. encourages officials not to resist a merger
6. IRS can impose taxes

Definition Task Reliability (cont.)

- If a system were evaluated on two large question sets of the same size, its F-measure scores should be similar.
- Simulation of such an evaluation (sketched below):
  - Randomly create two question sets of the required size.
  - Define the error rate as the percentage of rank swaps between the two sets.
  - Group the comparisons by the difference in F(β = 5).
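
A rough sketch of this kind of simulation under assumptions of my own (per-question F scores kept in a dict mapping run name to a list of scores); this is not the paper's exact procedure, just the rank-swap idea made concrete.

```python
import random
from itertools import combinations

def swap_rate(per_question_f: dict[str, list[float]],
              set_size: int, trials: int = 1000) -> float:
    """Estimate how often two same-size random question sets disagree on the
    relative order of a pair of runs (a rank swap)."""
    num_questions = len(next(iter(per_question_f.values())))
    swaps = comparisons = 0
    for _ in range(trials):
        qs = random.sample(range(num_questions), 2 * set_size)
        set1, set2 = qs[:set_size], qs[set_size:]
        for run_a, run_b in combinations(per_question_f, 2):
            diff1 = (sum(per_question_f[run_a][q] for q in set1)
                     - sum(per_question_f[run_b][q] for q in set1))
            diff2 = (sum(per_question_f[run_a][q] for q in set2)
                     - sum(per_question_f[run_b][q] for q in set2))
            comparisons += 1
            if diff1 * diff2 < 0:  # the two sets order this pair differently
                swaps += 1
    return swaps / comparisons

# Toy usage with random per-question scores for three hypothetical runs.
rng = random.Random(0)
data = {run: [rng.random() for _ in range(50)] for run in ["run1", "run2", "run3"]}
print(swap_rate(data, set_size=25, trials=200))
```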

Definition Task Reliability (cont.)

- Most errors (rank swaps) happen in the groups with small score differences.
  - A sufficiently large difference in F(β = 5) is required before one can have confidence that one system is really better.
- More questions are needed in the test set to increase sensitivity while remaining equally confident in the result.

List Task

- List questions have multiple possible answers.
  - e.g., "List the names of chewing gums"
- No target number of answers is specified.
- The final answer list for a question is the collection of correct answers found in the corpus.
- Scored with instance precision (IP) and instance recall (IR), combined as
  F = 2 * IP * IR / (IP + IR)
  (see the sketch below).
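
A minimal sketch of the list-task score as described above; the answer strings in the usage example are illustrative, not an official answer list.

```python
def list_score(returned: set[str], final_answers: set[str]) -> float:
    """Instance precision/recall F for a single list question."""
    correct = returned & final_answers
    if not returned or not final_answers or not correct:
        return 0.0
    ip = len(correct) / len(returned)        # instance precision
    ir = len(correct) / len(final_answers)   # instance recall
    return 2 * ip * ir / (ip + ir)

final = {"orbit", "trident", "dentyne", "chiclets"}  # part of an assessor's list
system = {"orbit", "trident", "bubble tape"}         # a hypothetical system response
print(round(list_score(system, final), 3))  # IP = 2/3, IR = 2/4 -> F ~= 0.571
```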

Example of a Final Answer List

Question 1915: List the names of chewing gums.

Stimorol, Orbit, Winterfresh, Double Bubble, Dirol, Trident, Spearmint, Bazooka, Doublemint, Dentyne, Freedent, Hubba Bubba, Juicy Fruit, Big Red, Chiclets, Nicorette

Other Tasks

- Passage task:
  - Return a short span of text (< 250 characters) containing an answer.
  - The span must be extracted from a single document.
- Factoid task:
  - Exact answers are required.
- The passage task is evaluated separately.
- The final score of the main task is
  FinalScore = 1/2 * FactoidScore + 1/4 * ListScore + 1/4 * DefScore
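
For completeness, the main-task combination as a small function; the weights come directly from the slide, while the example component scores are made up.

```python
def main_task_score(factoid: float, list_score: float, definition: float) -> float:
    """Weighted combination of the three component scores (TREC 2003 main task)."""
    return 0.5 * factoid + 0.25 * list_score + 0.25 * definition

# Hypothetical component scores, purely for illustration.
print(main_task_score(factoid=0.6, list_score=0.3, definition=0.4))  # 0.475
```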