1 CLEF 2009, Corfu. Question Answering Track Overview
J. Turmo, P.R. Comas, S. Rosset, O. Galibert, N. Moreau, D. Mostefa, P. Rosso, D. Buscaldi, D. Santos, L.M. Cabral, A. Peñas, P. Forner, R. Sutcliffe, Á. Rodrigo, C. Forascu, I. Alegria, D. Giampiccolo, P. Osenova
2 QA Tasks & Time
QA tasks run over the CLEF campaigns:
- Multiple Language QA Main Task
- ResPubliQA
- Temporal restrictions and lists
- Answer Validation Exercise (AVE)
- GikiCLEF
- Real Time QA over Speech Transcriptions (QAST)
- WiQA
- WSD QA
3 QA 2009 campaign
- ResPubliQA: QA on European Legislation
- GikiCLEF: QA requiring geographical reasoning on Wikipedia
- QAST: QA on Speech Transcriptions of European Parliament plenary sessions
4 QA 2009 campaign
Task: registered groups / participant groups / submitted runs / organizing people
- ResPubliQA: – / 11 / – (+ baseline runs) / 9
- GikiCLEF: 27 / 8 / 17 runs / 2
- QAST: 12 / 4 / 86 runs (5 subtasks) / 8
- Total: 59 showed interest / 23 groups / 147 runs evaluated / 19 + additional assessors
5 ResPubliQA 2009: QA on European Legislation
Organizers: Anselmo Peñas, Pamela Forner, Richard Sutcliffe, Álvaro Rodrigo, Corina Forascu, Iñaki Alegria, Danilo Giampiccolo, Nicolas Moreau, Petya Osenova
Additional assessors: Fernando Luis Costa, Anna Kampchen, Julia Kramme, Cosmina Croitoru
Advisory board: Donna Harman, Maarten de Rijke, Dominique Laurent
6 Evolution of the task
- Target languages: extended over the campaigns
- Collections: News → News + Wikipedia (Nov. dump) → European Legislation
- Number of questions: 200 → 500
- Type of questions: Factoid; + temporal restrictions; + definitions; − type of question; + lists; + linked questions; + closed lists; − linked; + reason, purpose, procedure
- Supporting information: document → snippet → paragraph
- Size of answer: snippet → exact → paragraph
7 Objectives
1. Move towards a domain of potential users
2. Compare systems working in different languages
3. Compare QA technology with pure IR
4. Introduce more types of questions
5. Introduce answer validation technology
8 Collection
- Subset of JRC-Acquis (10,700 documents per language)
- Parallel at document level
- EU treaties, EU legislation, agreements and resolutions
- Economy, health, law, food, …
- Dated between 1950 and 2006
- XML-TEI.2 encoding
- Unfortunately, not parallel at the paragraph level → extra work
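A minimal sketch of extracting paragraphs (the retrieval unit in ResPubliQA) from a TEI-encoded JRC-Acquis document, in case it helps to reproduce the setup. The element name, the "n" attribute and the file name are assumptions about the encoding, not part of any official track tooling.

```python
# Sketch: extract numbered paragraphs from one JRC-Acquis TEI document.
# Assumption: paragraphs are <p> elements carrying an "n" attribute.
import xml.etree.ElementTree as ET

def extract_paragraphs(tei_file):
    """Return a list of (paragraph_id, text) pairs from one TEI document."""
    root = ET.parse(tei_file).getroot()
    paragraphs = []
    for p in root.iter("p"):                      # walk every <p> element
        pid = p.get("n", str(len(paragraphs) + 1))
        text = "".join(p.itertext()).strip()      # flatten nested markup
        if text:
            paragraphs.append((pid, text))
    return paragraphs

# Example (hypothetical file name):
# for pid, text in extract_paragraphs("jrc31994R1264-en.xml"):
#     print(pid, text[:60])
```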
9 500 questions
- REASON: Why did a Commission expert conduct an inspection visit to Uruguay?
- PURPOSE/OBJECTIVE: What is the overall objective of the eco-label?
- PROCEDURE: How are stable conditions in the natural rubber trade achieved?
In general, any question that can be answered in a paragraph.
10 500 questions (cont.)
Also:
- FACTOID: In how many languages is the Official Journal of the Community published?
- DEFINITION: What is meant by "whole milk"?
No NIL questions.
12 Translation of questions
13 Selection of the final pool of 500 questions out of the 600 produced
15 Systems' response
No Answer ≠ Wrong Answer
1. Decide whether the answer is given or not [ YES | NO ] — a classification problem (machine learning, provers, textual entailment, etc.)
2. Provide the paragraph (ID + text) that answers the question
Aim: leaving a question unanswered is worth more than giving a wrong answer.
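As an illustration of the two-step response, a minimal sketch follows. The validation score, threshold and field names are hypothetical: the track did not prescribe any particular decision method or run format beyond the paragraph ID and text.

```python
# Sketch of the two-step system response described above.
# The validation score and threshold are illustrative assumptions; any
# classifier, prover or textual-entailment component could produce them.
def respond(question, candidate_paragraph, validation_score, threshold=0.5):
    """Return a run entry: either an answered paragraph or a non-answer."""
    if candidate_paragraph is not None and validation_score >= threshold:
        # Step 1: decide YES -> Step 2: return paragraph ID and text.
        return {"q_id": question["id"],
                "answered": "YES",
                "passage_id": candidate_paragraph["id"],
                "passage": candidate_paragraph["text"]}
    # Leave the question unanswered; the candidate can still be recorded
    # so assessors may judge it later as NoA R / NoA W.
    return {"q_id": question["id"], "answered": "NO",
            "candidate_id": None if candidate_paragraph is None
            else candidate_paragraph["id"]}
```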
16 Assessments
- R: the question is answered correctly
- W: the question is answered incorrectly
- NoA: the question is not answered
  - NoA R: NoA, but the candidate answer was correct
  - NoA W: NoA, and the candidate answer was incorrect
  - NoA Empty: NoA and no candidate answer was given
Evaluation measure: an extension of traditional accuracy (the proportion of questions answered correctly) that takes unanswered questions into account.
17 Evaluation measure
- n: number of questions
- n_R: number of correctly answered questions
- n_U: number of unanswered questions

18 Evaluation measure
score = ( n_R + n_U · (n_R / n) ) / n
- If n_U = 0, the measure reduces to accuracy: n_R / n
- If n_R = 0, the measure is 0
- If n_U = n, the measure is 0
Leaving a question unanswered adds value only if it avoids returning a wrong answer: each unanswered question is credited with the accuracy shown on the answered questions.
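A small sketch of the measure as defined above, useful as a sanity check of its boundary cases; the label strings mirror the assessment categories of the previous slide.

```python
# Sketch of the ResPubliQA evaluation measure:
#   score = (n_R + n_U * n_R / n) / n
def evaluate(assessments):
    """assessments: list of labels 'R', 'W', 'NoA R', 'NoA W', 'NoA Empty'."""
    n = len(assessments)
    n_r = sum(1 for a in assessments if a == "R")             # correct answers
    n_u = sum(1 for a in assessments if a.startswith("NoA"))  # unanswered
    return (n_r + n_u * n_r / n) / n

# Boundary cases: with no unanswered questions the measure equals accuracy,
# and unanswered questions only add value when some answers are correct.
print(evaluate(["R"] * 50 + ["W"] * 50))        # 0.5  (plain accuracy)
print(evaluate(["R"] * 50 + ["NoA W"] * 50))    # 0.75 (NoA better than W)
print(evaluate(["NoA Empty"] * 100))            # 0.0
```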
19 List of participants
System – Team
- elix – ELHUYAR-IXA, Spain
- icia – RACAI, Romania
- iiit – Search & Info Extraction Lab, India
- iles – LIMSI-CNRS-2, France
- isik – ISI-Kolkata, India
- loga – U. Koblenz-Landau, Germany
- mira – MIRACLE, Spain
- nlel – U. Politécnica de Valencia, Spain
- syna – Synapse Développement, France
- uaic – Al.I.Cuza U. of Iasi, Romania
- uned – UNED, Spain
20 Value of reducing wrong answers
[Chart: R, #NoA W, #NoA Empty and the perfect combination for the Romanian runs icia092roro, icia091roro, UAIC092roro, UAIC091roro and the baselines base092roro, base091roro; values not reproduced here.]
21 Detecting wrong answers
[Chart: R, #NoA W, #NoA Empty and the perfect combination for the German runs loga091dede, loga092dede and the baselines base092dede, base091dede.]
While maintaining the number of correct answers, the candidate answer was incorrect for 83% of the unanswered questions: a very good step towards improving the system.
22 IR is important, but not enough
[Chart: R, #NoA W, #NoA Empty and the perfect combination for the English runs uned092enen, uned091enen, nlel091enen, uaic092enen, base092enen, base091enen, elix092enen, uaic091enen, elix091enen, syna091enen, isik091enen, iiit091enen, elix092euen, elix091euen.]
- A feasible task
- The perfect combination is 50% better than the best system
- Many systems fall below the IR baselines
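The "perfect combination" above is read here as the usual oracle upper bound: a question counts as correct if at least one run answered it correctly. A sketch under that assumption follows; the run data in the example is invented purely for illustration.

```python
# Sketch of the "perfect combination" oracle upper bound over several runs.
def perfect_combination(runs):
    """runs: dict run_name -> dict question_id -> 'R'/'W'/'NoA ...' label."""
    questions = set()
    for judgments in runs.values():
        questions.update(judgments)
    correct = sum(
        1 for q in questions
        if any(judgments.get(q) == "R" for judgments in runs.values())
    )
    return correct / len(questions)

# Invented toy data: two runs over three questions.
runs = {"run_a": {1: "R", 2: "W", 3: "NoA W"},
        "run_b": {1: "W", 2: "R", 3: "W"}}
print(perfect_combination(runs))   # 2/3: questions 1 and 2 are covered
```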
23 Comparison across languages
- Same questions
- Same documents
- Same baseline systems
Not a strict comparison (the language variable still plays a role), but it is feasible to detect the most promising approaches across languages.
24–28 Comparison across languages
[Table: results per system (icia, nlel, uned, uaic, loga) and the baselines for RO, ES, EN, IT and DE; values not reproduced here.]
Systems above the baselines:
- icia: Boolean retrieval + intensive NLP + ML-based validation, with very good knowledge of the collection (Eurovoc terms, …)
- nlel092: n-gram-based retrieval, combining evidence from several languages
- uned: Okapi BM25 + NER + paragraph validation + n-gram-based re-ranking
- nlel091: n-gram-based paragraph retrieval
- loga: Lucene + deep NLP + logic + ML-based validation
Baseline: Okapi BM25 tuned for paragraph retrieval.
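The baseline's retrieval model is Okapi BM25 over paragraphs. A minimal sketch of BM25 paragraph scoring is given below; the k1 and b values are generic defaults rather than the parameters tuned for the track, and tokenization is left out.

```python
# Minimal sketch of Okapi BM25 scoring over a set of paragraphs.
# k1 and b are illustrative defaults, not the tuned track settings.
import math
from collections import Counter

def bm25_scores(query_terms, paragraphs, k1=1.2, b=0.75):
    """paragraphs: list of token lists. Returns one score per paragraph."""
    n = len(paragraphs)
    avgdl = sum(len(p) for p in paragraphs) / n
    df = Counter(t for p in paragraphs for t in set(p))   # document frequency
    scores = []
    for p in paragraphs:
        tf = Counter(p)
        dl = len(p)
        s = 0.0
        for t in query_terms:
            if t not in tf:
                continue
            idf = math.log((n - df[t] + 0.5) / (df[t] + 0.5) + 1.0)
            s += idf * tf[t] * (k1 + 1) / (tf[t] + k1 * (1 - b + b * dl / avgdl))
        scores.append(s)
    return scores

# Toy example with two pre-tokenized paragraphs.
docs = [["official", "journal", "languages"], ["whole", "milk", "definition"]]
print(bm25_scores(["official", "journal"], docs))  # first paragraph scores higher
```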
29 Conclusion
- Compare systems working in different languages
- Compare QA technology with pure IR
  - Pay more attention to paragraph retrieval: an old issue, with the state of the art dating from the late 90's (for English)
  - Pure IR performance: the largest difference with respect to the IR baselines was 0.44 vs. 0.68, obtained with intensive NLP and ML-based answer validation
- Introduce more types of questions
  - Some types are difficult to distinguish
  - In general, any question that can be answered in a paragraph
  - Analysis of results by question type (in progress)
30 Conclusion
- Introduce answer validation technology
  - Evaluation measure: the value of reducing wrong answers
  - Detecting wrong answers is feasible
- A feasible task: 90% of the questions have been answered
- Room for improvement: the best systems reach around 60%
- Even with fewer participants we get more comparison, more analysis and more learning
- ResPubliQA proposal for 2010: SC and breakout session
31 Interest in ResPubliQA 2010
1. Uni. "Al.I.Cuza" Iasi (Dan Cristea, Diana Trandabat)
2. Linguateca (Nuno Cardoso)
3. RACAI (Dan Tufis, Radu Ion)
4. Jesus Vilares
5. Univ. Koblenz-Landau (Bjorn Pelzer)
6. Thomson Reuters (Isabelle Moulinier)
7. Gracinda Carvalho
8. UNED (Alvaro Rodrigo)
9. Uni. Politecnica Valencia (Paolo Rosso & Davide Buscaldi)
10. Uni. Hagen (Ingo Glockner)
11. Linguit (Jochen L. Leidner)
12. Uni. Saarland (Dietrich Klakow)
13. ELHUYAR-IXA (Arantxa Otegi)
14. MIRACLE TEAM (Paloma Martínez Fernández)
But we need more. You already have a gold standard of 500 questions & answers to play with…