Evaluating Question Answering Validation Anselmo Peñas (and Alvaro Rodrigo) NLP & IR group UNED nlp.uned.es Information Science Institute Marina del Rey,

Presentation transcript:

Evaluating Question Answering Validation Anselmo Peñas (and Alvaro Rodrigo) NLP & IR group UNED nlp.uned.es Information Science Institute Marina del Rey, December 11, 2009

UNED nlp.uned.es
Old friends
Question Answering: nothing else than answering a question
Natural Language Understanding: something there, if you are able to answer a question
QA: extrinsic evaluation for NLU
Suddenly… (See the track?) …The QA Track at TREC

Question Answering at TREC
Object of evaluation itself
Redefined as (roughly speaking) a highly precision-oriented IR task
Where NLP was necessary, especially for Answer Extraction

What’s this story about?
QA Tasks at CLEF
Multiple Language QA Main Task
ResPubliQA
Temporal restrictions and lists
Answer Validation Exercise (AVE)
GikiCLEF
Real Time
QA over Speech Transcriptions (QAST)
WiQA
WSD QA

Outline
1. Motivation and goals
2. Definition and general framework
3. AVE 2006
4. AVE 2007 & 2008
5. QA 2009

Outline
1. Analysis of current systems performance
2. Mid term goals and strategy
3. Evaluation Task definition
4. Analysis of the evaluation cycle
Short cycle / Long cycle: Result analysis, Methodology analysis, Generation of methodology and evaluation resources, Task activation and development

Systems performance (Spanish)
Overall: best result <60%
Definitions: best result >80%, NOT IR approach

Pipeline Upper Bounds
Question -> Question analysis -> Passage Retrieval -> Answer Extraction -> Answer Ranking -> Answer
Not enough evidence
SOMETHING to break the pipeline

Results in CLEF-QA 2006 (Spanish)
Perfect combination: 81%
Best system: 52.5%
Best with ORGANIZATION, best with PERSON, best with TIME
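The "perfect combination" figure can be read as an oracle upper bound: a question counts as solved if at least one participating system answered it correctly. A minimal sketch of that reading, assuming each run is available as a set of correctly answered question IDs. The function name and the toy data are illustrative, not taken from the campaign.

```python
def perfect_combination(correct_by_system, n_questions):
    """Oracle accuracy: share of questions answered correctly by at least one system."""
    solved = set().union(*correct_by_system.values()) if correct_by_system else set()
    return len(solved) / n_questions

# Hypothetical toy data: question IDs each system answered correctly (10 questions in total).
runs = {
    "sys1": {1, 2, 3},
    "sys2": {3, 4},
    "sys3": {5},
}
best_single = max(len(ids) for ids in runs.values()) / 10
oracle = perfect_combination(runs, 10)
print(best_single, oracle)
```

The gap between `best_single` and `oracle` is the room that a good answer selection component could in principle exploit.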

Collaborative architectures
Different systems respond better to different types of questions
Specialization
Collaboration
Question -> QA sys 1, QA sys 2, QA sys 3, ..., QA sys n -> Candidate answers -> SOMETHING for combining / selecting -> Answer

Collaborative architectures
How to select the good answer?
Redundancy
Voting
Confidence score
Performance history
Why not deeper content analysis?
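As an illustration of the shallow selection strategies listed above (redundancy, voting, confidence score), here is a minimal sketch that picks the answer proposed by most streams and breaks ties by summed confidence. The input format and the lower-casing normalisation are assumptions made for the example, not a description of any participant system.

```python
from collections import defaultdict

def select_by_voting(candidates):
    """Pick the answer proposed by most streams; break ties by summed confidence.

    candidates: non-empty list of (answer, confidence) pairs, one per QA stream.
    """
    votes = defaultdict(int)
    confidence = defaultdict(float)
    for answer, score in candidates:
        key = answer.strip().lower()  # naive normalisation (an assumption)
        votes[key] += 1
        confidence[key] += score
    return max(votes, key=lambda k: (votes[k], confidence[k]))

# Hypothetical candidate answers from three streams.
candidates = [("Rome", 0.4), ("rome", 0.3), ("Milan", 0.9)]
selected = select_by_voting(candidates)
print(selected)
```

Note that this kind of selection never looks at the supporting text, which is exactly the limitation the slide's last question points at.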

Mid Term Goal
Goal: improve QA systems performance
New mid term goal: improve the devices for Rejecting / Accepting / Selecting Answers
The new task (2006): validate the correctness of the answers given by real QA systems, the participants at CLEF QA

Outline
1. Motivation and goals
2. Definition and general framework
3. AVE 2006
4. AVE 2007 & 2008
5. QA 2009

Define Answer Validation
Decide whether an answer is correct or not
More precisely, the task: given a Question, an Answer and a Supporting Text, decide if the answer is correct according to the supporting text
Let’s call it Answer Validation Exercise (AVE)

Wish list
Test collection: Questions, Answers, Supporting Texts, Human assessments
Evaluation measures
Participants

Evaluation linked to main QA task
Question Answering Track: Questions, Systems’ answers, Systems’ Supporting Texts
Answer Validation Exercise: (ACCEPT / REJECT)
Human Judgements (R, W, X, U): QA Track results
Mapping to (ACCEPT / REJECT): Evaluation, AVE Track results
Reuse human assessments

Answer Validation Exercise (AVE)
Input: Question, Candidate answer, Supporting Text
Answer Validation (AVE 2006): Automatic Hypothesis Generation -> Hypothesis -> Textual Entailment
Output: Answer is correct / Answer is not correct or not enough evidence

Outline
Motivation and goals
Definition and general framework
AVE 2006
  Underlying architecture: pipeline
  Evaluating the validation
  As RTE exercise: pairs text-hypothesis
AVE 2007 & 2008
QA 2009

AVE 2006: An RTE exercise
If the text semantically entails the hypothesis, then the answer is expected to be correct.
QA system: Question, Supporting snippet, Exact Answer
Question + Exact Answer -> Hypothesis; Supporting snippet -> Text; Entailment?
Is this true? Yes, 95% with current QA systems (J LOG COMP 2009)
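To make the text-hypothesis framing concrete, here is a deliberately naive sketch of an entailment check based on word overlap, the simplest kind of technique in the "overlapping" family that some participants reported. The threshold, the tokenisation and the hand-written hypotheses are assumptions for illustration only. Real systems generated hypotheses automatically and used much richer analysis.

```python
import re

def tokens(s):
    """Lower-cased word tokens as a set."""
    return set(re.findall(r"\w+", s.lower()))

def overlap_entails(text, hypothesis, threshold=0.8):
    """Accept the pair if enough hypothesis words also occur in the text."""
    hyp = tokens(hypothesis)
    if not hyp:
        return False
    coverage = len(hyp & tokens(text)) / len(hyp)
    return coverage >= threshold

text = "Zanussi was an Italian producer of home appliances that in 1984 was bought"
good = overlap_entails(text, "Zanussi is an Italian producer of home appliances")
bad = overlap_entails(text, "Zanussi is a Polish film director")
print(good, bad)
```

Word overlap ignores negation, word order and paraphrase, so it serves only as a low baseline for the entailment decision.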

Collections AVE 2006
Available at: nlp.uned.es/clef-qa/ave/
Language     Testing (pairs entail.)   Training
English      2088 (10% YES)            2870 (15% YES)
Spanish      2369 (28% YES)            2905 (22% YES)
German       1443 (25% YES)
French       3266 (22% YES)
Italian      1140 (16% YES)
Dutch        807 (10% YES)
Portuguese   1324 (14% YES)

Evaluating the Validation
Validation: decide if each candidate answer is correct or not (YES | NO)
Not balanced collections
Approach: detect if there is enough evidence to accept an answer
Measures: precision, recall and F over correct answers
Baseline system: accept all answers

Evaluating the Validation
                  Correct Answer   Incorrect Answer
Answer Accepted   n_CA             n_WA
Answer Rejected   n_CR             n_WR
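The measures named for this setting (precision, recall and F over correct answers) and the accept-all baseline can be sketched as follows. The parameter names are descriptive labels chosen for this example rather than the slide's symbols, and the counts are hypothetical.

```python
def validation_scores(correct_accepted, incorrect_accepted, correct_rejected):
    """Precision, recall and F over the answers a validator accepts."""
    accepted = correct_accepted + incorrect_accepted
    all_correct = correct_accepted + correct_rejected
    precision = correct_accepted / accepted if accepted else 0.0
    recall = correct_accepted / all_correct if all_correct else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f = 2 * precision * recall / (precision + recall)
    return precision, recall, f

# Accept-all baseline on a hypothetical collection: 25 correct answers out of 100 pairs.
p, r, f = validation_scores(correct_accepted=25, incorrect_accepted=75, correct_rejected=0)
print(p, r, f)
```

Because the collections are unbalanced, accepting everything gives perfect recall but a precision equal to the share of correct answers, which is why F over the accepted class is a stricter yardstick than plain accuracy.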

Results AVE 2006
Language     Baseline (F)   Best system (F)   Reported Techniques
English      .27            .44               Logic
Spanish      .45            .61               Logic
German       .39            .54               Lexical, Syntax, Semantics, Logic, Corpus
French       .37            .47               Overlapping, Learning
Dutch        .19            .39               Syntax, Learning
Portuguese   .38            .35               Overlapping
Italian      .29            .41               Overlapping, Learning

Outline
Motivation and goals
Definition and general framework
AVE 2006
AVE 2007 & 2008
  Underlying architecture: multi-stream
  Quantify the potential benefit of AV in QA
  Evaluating the correct selection of one answer
  Evaluating the correct rejection of all answers
QA 2009

AVE 2007 & 2008
Question -> QA sys 1, QA sys 2, QA sys 3, ..., QA sys n (participant systems in a CLEF QA) -> Candidate answers + Supporting Texts -> Answer Validation & Selection -> Answer
Evaluation of Answer Validation & Selection

Collections
Question: What is Zanussi?
Candidate answer: was an Italian producer of home appliances
Supporting text (Zanussi): For the Polish film director, see Krzysztof Zanussi. For the hot-air balloon, see Zanussi (balloon). Zanussi was an Italian producer of home appliances that in 1984 was bought
Candidate answer: who had also been in Cassibile since August 31
Supporting text: Only after the signing had taken place was Giuseppe Castellano informed of the additional clauses that had been presented by general Ronald Campbell to another Italian general, Zanussi, who had also been in Cassibile since August
(1985) 3 Out of 5 Live (1985) What Is This?

Evaluating the Selection
Goals:
Quantify the potential gain of Answer Validation in Question Answering
Compare AV systems with QA systems
Develop measures more comparable to QA accuracy

Evaluating the selection
Given a question with several candidate answers, two options:
Selection: select an answer ≡ try to answer the question
  Correct selection: answer was correct
  Incorrect selection: answer was incorrect
Rejection: reject all candidate answers ≡ leave question unanswered
  Correct rejection: all candidate answers were incorrect
  Incorrect rejection: not all candidate answers were incorrect
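The two options and their four outcomes can be sketched as a small decision routine. This assumes each candidate answer already carries a validation score in [0, 1]. The threshold and the helper names are hypothetical.

```python
def select_or_reject(scored_candidates, threshold=0.5):
    """Return the best-scored candidate at or above the threshold, or None to reject all."""
    accepted = [c for c in scored_candidates if c[1] >= threshold]
    if not accepted:
        return None  # leave the question unanswered
    return max(accepted, key=lambda c: c[1])[0]

def outcome(selected, correct_candidates):
    """Classify a decision, given the set of candidates judged correct by humans."""
    if selected is None:
        return "incorrect rejection" if correct_candidates else "correct rejection"
    return "correct selection" if selected in correct_candidates else "incorrect selection"

# Hypothetical validation scores for two candidate answers.
scored = [("Italian producer of home appliances", 0.9), ("Polish film director", 0.2)]
selected = select_or_reject(scored)
print(outcome(selected, {"Italian producer of home appliances"}))
```

Raising the threshold trades incorrect selections for rejections, which only pays off under measures that give some credit to correct rejections.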

Evaluating the Selection
n questions, n = n_CA + n_WA + n_WS + n_WR + n_CR
                                                      Question with Correct Answer   Question without Correct Answer
Question Answered Correctly (One Answer Selected)     n_CA                           -
Question Answered Incorrectly                         n_WA                           n_WS
Question Unanswered (All Answers Rejected)            n_WR                           n_CR

Evaluating the Selection
Rewards rejection (not balanced collections)
Interpretation for QA: all questions correctly rejected by AV will be answered correctly

Evaluating the Selection
Interpretation for QA: questions correctly rejected have value as if they were answered correctly in qa_accuracy proportion
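The formulas on these two slides did not survive in the transcript. The sketch below is one reading that is consistent with the two stated interpretations, using the counts n_CA and n_CR defined earlier. The function names, and the definition of qa_accuracy as n_CA / n, are assumptions.

```python
def qa_accuracy(n_ca, n):
    """Share of questions answered correctly (assumed definition)."""
    return n_ca / n

def accuracy_rewarding_rejection(n_ca, n_cr, n):
    """Correct rejections count as if they were answered correctly."""
    return (n_ca + n_cr) / n

def estimated_qa_performance(n_ca, n_cr, n):
    """Correct rejections count only in qa_accuracy proportion."""
    return (n_ca + n_cr * qa_accuracy(n_ca, n)) / n

# Hypothetical counts, with n = n_CA + n_WA + n_WS + n_WR + n_CR.
n_ca, n_wa, n_ws, n_wr, n_cr = 40, 20, 10, 10, 20
n = n_ca + n_wa + n_ws + n_wr + n_cr
print(qa_accuracy(n_ca, n),
      accuracy_rewarding_rejection(n_ca, n_cr, n),
      estimated_qa_performance(n_ca, n_cr, n))
```

Under this reading the second measure is the more conservative one: a rejected question is only worth as much as the system's demonstrated ability on the questions it does answer.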

Analysis and discussion (AVE 2007 Spanish)
Validation
Selection
Comparing AV & QA

Techniques in AVE 2007
Generates hypotheses: 6
Wordnet: 3
Chunking: 3
n-grams, longest common subsequences: 5
Phrase transformations: 2
NER: 5
Num. expressions: 6
Temp. expressions: 4
Coreference resolution: 2
Dependency analysis: 3
Syntactic similarity: 4
Functions (sub, obj, etc): 3
Syntactic transformations: 1
Word-sense disambiguation: 2
Semantic parsing: 4
Semantic role labeling: 2
First order logic representation: 3
Theorem prover: 3
Semantic similarity: 2

Conclusion of AVE
Answer Validation before: it was assumed as a QA module, but with no space for its own development
The new devices should help to improve QA. They:
  Introduce more content analysis
  Use Machine Learning techniques
  Are able to break pipelines or combine streams
Let’s transfer them to the QA main task

Outline
Motivation and goals
Definition and general framework
AVE 2006
AVE 2007 & 2008
QA 2009

CLEF QA 2009 campaign
ResPubliQA: QA on European Legislation
GikiCLEF: QA requiring geographical reasoning on Wikipedia
QAST: QA on Speech Transcriptions of European Parliament Plenary sessions

CLEF QA 2009 campaign
Task         Registered groups    Participant groups   Submitted runs        Organizing people
ResPubliQA                                             (baseline runs)       9
GikiCLEF     27                   8                    17 runs               2
QAST         12                   4                    86 (5 subtasks)       8
Total        59 showed interest   23 groups            147 runs evaluated    19 + additional assessors

ResPubliQA 2009: QA on European Legislation
Organizers: Anselmo Peñas, Pamela Forner, Richard Sutcliffe, Álvaro Rodrigo, Corina Forascu, Iñaki Alegria, Danilo Giampiccolo, Nicolas Moreau, Petya Osenova
Additional Assessors: Fernando Luis Costa, Anna Kampchen, Julia Kramme, Cosmina Croitoru
Advisory Board: Donna Harman, Maarten de Rijke, Dominique Laurent

Evolution of the task
Target languages
Collections: News, News, Wikipedia Nov, European Legislation
Number of questions
Type of questions: 200 Factoid, + Temporal restrictions, + Definitions, - Type of question, + Lists, + Linked questions, + Closed lists, - Linked, + Reason, + Purpose, + Procedure
Supporting information: Document, Snippet, Paragraph
Size of answer: Snippet, Exact, Paragraph

Collection
Subset of JRC-Acquis (10,700 docs x lang)
Parallel at document level
EU treaties, EU legislation, agreements and resolutions
Economy, health, law, food, …
Between 1950 and 2006

500 questions
REASON: Why did a commission expert conduct an inspection visit to Uruguay?
PURPOSE/OBJECTIVE: What is the overall objective of the eco-label?
PROCEDURE: How are stable conditions in the natural rubber trade achieved?
In general, any question that can be answered in a paragraph

500 questions
Also:
FACTOID: In how many languages is the Official Journal of the Community published?
DEFINITION: What is meant by “whole milk”?
No NIL questions

Systems response
No Answer ≠ Wrong Answer
1. Decide if they answer or not [ YES | NO ]: a classification problem (Machine Learning, Provers, etc.; Textual Entailment)
2. Provide the paragraph (ID + Text) that answers the question
Aim: to leave a question unanswered has more value than to give a wrong answer

Assessments
R: The question is answered correctly
W: The question is answered incorrectly
NoA: The question is not answered
  NoA R: NoA, but the candidate answer was correct
  NoA W: NoA, and the candidate answer was incorrect
  NoA Empty: NoA and no candidate answer was given
Evaluation measure: extension of the traditional accuracy (as proportion of questions correctly answered), considering unanswered questions
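As a small illustration of how these assessment labels feed an evaluation measure, the sketch below tallies a list of per-question labels into the total number of questions, the number answered correctly and the number left unanswered. The label strings and the function name are assumptions for the example.

```python
from collections import Counter

def tally(assessments):
    """Return (n, n_R, n_U) from per-question assessment labels."""
    counts = Counter(assessments)
    n = len(assessments)
    n_r = counts["R"]
    # All three NoA variants count as unanswered, whatever the discarded candidate was.
    n_u = counts["NoA R"] + counts["NoA W"] + counts["NoA empty"]
    return n, n_r, n_u

# Hypothetical assessments for six questions.
labels = ["R", "R", "W", "NoA R", "NoA W", "NoA empty"]
n, n_r, n_u = tally(labels)
print(n, n_r, n_u)
```

Keeping the NoA R / NoA W split around is still useful for diagnosis: a high share of NoA W means the system is good at withholding answers that would have been wrong.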

Evaluation measure
n: Number of questions
n_R: Number of correctly answered questions
n_U: Number of unanswered questions

Evaluation measure
If n_U = 0 then the measure is n_R / n, i.e. Accuracy
If n_R = 0 then the measure is 0
If n_U = n then the measure is 0
Leaving a question unanswered gives value only if this avoids returning a wrong answer
The added value is the performance shown with the answered questions: Accuracy
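The formula itself is missing from the transcript. The sketch below assumes the measure published for ResPubliQA 2009 under the name c@1, in which each unanswered question is credited with the accuracy shown on the whole question set. That identification, and the function name, are assumptions. The limiting cases checked at the end follow from this formula.

```python
def c_at_1(n_r, n_u, n):
    """Accuracy extended with unanswered questions: (n_R + n_U * n_R / n) / n."""
    if n == 0:
        raise ValueError("n must be positive")
    return (n_r + n_u * (n_r / n)) / n

# Hypothetical runs over 100 questions.
always_answers = c_at_1(n_r=50, n_u=0, n=100)    # reduces to plain accuracy
never_correct = c_at_1(n_r=0, n_u=30, n=100)     # no correct answers, no credit
never_answers = c_at_1(n_r=0, n_u=100, n=100)    # nothing answered, no credit
cautious = c_at_1(n_r=40, n_u=20, n=100)
print(always_answers, never_correct, never_answers, cautious)
```

With these numbers a system that answers 40 questions correctly and withholds 20 scores higher than one that answers the same 40 correctly and gets those 20 wrong, which is the behaviour the slide's aim asks for.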

List of Participants
System   Team
elix     ELHUYAR-IXA, SPAIN
icia     RACAI, ROMANIA
iiit     Search & Info Extraction Lab, INDIA
iles     LIMSI-CNRS-2, FRANCE
isik     ISI-Kolkata, INDIA
loga     U.Koblenz-Landau, GERMANY
mira     MIRACLE, SPAIN
nlel     U. politecnica Valencia, SPAIN
syna     Synapse Developpment, FRANCE
uaic     Al.I.Cuza U. of IASI, ROMANIA
uned     UNED, SPAIN

Value of reducing wrong answers
Table columns: R, #NoA W, #NoA empty
Rows: combination, icia092roro, icia091roro, UAIC092roro, UAIC091roro, base092roro, base091roro

Detecting wrong answers
Table columns: R, #NoA W, #NoA empty
Rows: combination, loga091dede, loga092dede, base092dede, base091dede
Maintaining the number of correct answers, the candidate answer was not correct for 83% of unanswered questions
Very good step towards improving the system

IR important, not enough
Table columns: R, #NoA W, #NoA empty
Rows: combination, uned092enen, uned091enen, nlel091enen, uaic092enen, base092enen, base091enen, elix092enen, uaic091enen, elix091enen, syna091enen, isik091enen, iiit091enen, elix092euen, elix091euen
Achievable task
Perfect combination is 50% better than best system
Many systems under the IR baselines

Outline
Motivation and goals
Definition and general framework
AVE 2006
AVE 2007 & 2008
QA 2009
Conclusion

Conclusion
New QA evaluation setting
Assuming that to leave a question unanswered has more value than to give a wrong answer
This assumption gives space for further development of QA systems
And hopefully improves their performance

Thanks! Acknowledgement: EU project T-CLEF (ICT )