1 CLEF 2009, Corfu. Question Answering Track Overview
J. Turmo, P.R. Comas, S. Rosset, O. Galibert, N. Moreau, D. Mostefa, P. Rosso, D. Buscaldi, D. Santos, L.M. Cabral, A. Peñas, P. Forner, R. Sutcliffe, Á. Rodrigo, C. Forascu, I. Alegria, D. Giampiccolo, P. Osenova
2 QA Tasks & Time
QA tasks run over the CLEF campaigns:
- Multiple Language QA Main Task
- ResPubliQA
- Temporal restrictions and lists
- Answer Validation Exercise (AVE)
- GikiCLEF
- Real Time QA over Speech Transcriptions (QAST)
- WiQA
- WSD QA
3 QA 2009 campaign
- ResPubliQA: QA on European Legislation
- GikiCLEF: QA requiring geographical reasoning on Wikipedia
- QAST: QA on Speech Transcriptions of European Parliament plenary sessions
4 QA 2009 campaign
Task: registered groups / participant groups / submitted runs / organizing people
- ResPubliQA: – / 11 / – (+ baseline runs) / 9
- GikiCLEF: 27 / 8 / 17 runs / 2
- QAST: 12 / 4 / 86 runs (5 subtasks) / 8
- Total: 59 showed interest / 23 groups / 147 runs evaluated / 19 + additional assessors
5 ResPubliQA 2009: QA on European Legislation
Organizers: Anselmo Peñas, Pamela Forner, Richard Sutcliffe, Álvaro Rodrigo, Corina Forascu, Iñaki Alegria, Danilo Giampiccolo, Nicolas Moreau, Petya Osenova
Additional assessors: Fernando Luis Costa, Anna Kampchen, Julia Kramme, Cosmina Croitoru
Advisory board: Donna Harman, Maarten de Rijke, Dominique Laurent
6 Evolution of the task
- Target languages: extended over the campaigns
- Collections: News → News + Wikipedia (Nov. dump) → European Legislation
- Number of questions: 200 → 500
- Type of questions: Factoid; + temporal restrictions; + definitions; − type of question; + lists; + linked questions; + closed lists; − linked; + reason, purpose, procedure
- Supporting information: document → snippet → paragraph
- Size of answer: snippet → exact → paragraph
7 Objectives
1. Move towards a domain of potential users
2. Compare systems working in different languages
3. Compare QA technology with pure IR
4. Introduce more types of questions
5. Introduce answer validation technology
8 Collection
- Subset of JRC-Acquis (10,700 documents per language)
- Parallel at document level
- EU treaties, EU legislation, agreements and resolutions
- Economy, health, law, food, …
- Dated between 1950 and 2006
- XML-TEI.2 encoding
- Unfortunately, not parallel at the paragraph level → extra work
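A minimal sketch of extracting paragraphs (the retrieval unit in ResPubliQA) from a TEI-encoded JRC-Acquis document, in case it helps to reproduce the setup. The element name, the "n" attribute and the file name are assumptions about the encoding, not part of any official track tooling.

```python
# Sketch: extract numbered paragraphs from one JRC-Acquis TEI document.
# Assumption: paragraphs are <p> elements carrying an "n" attribute.
import xml.etree.ElementTree as ET

def extract_paragraphs(tei_file):
    """Return a list of (paragraph_id, text) pairs from one TEI document."""
    root = ET.parse(tei_file).getroot()
    paragraphs = []
    for p in root.iter("p"):                      # walk every <p> element
        pid = p.get("n", str(len(paragraphs) + 1))
        text = "".join(p.itertext()).strip()      # flatten nested markup
        if text:
            paragraphs.append((pid, text))
    return paragraphs

# Example (hypothetical file name):
# for pid, text in extract_paragraphs("jrc31994R1264-en.xml"):
#     print(pid, text[:60])
```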
9 500 questions
- REASON: Why did a Commission expert conduct an inspection visit to Uruguay?
- PURPOSE/OBJECTIVE: What is the overall objective of the eco-label?
- PROCEDURE: How are stable conditions in the natural rubber trade achieved?
In general, any question that can be answered in a paragraph.
10 500 questions (cont.)
Also:
- FACTOID: In how many languages is the Official Journal of the Community published?
- DEFINITION: What is meant by "whole milk"?
No NIL questions.
12 Translation of questions
13 Selection of the final pool of 500 questions out of the 600 produced
15 Systems' response
No Answer ≠ Wrong Answer
1. Decide whether the answer is given or not [ YES | NO ] — a classification problem (machine learning, provers, textual entailment, etc.)
2. Provide the paragraph (ID + text) that answers the question
Aim: leaving a question unanswered is worth more than giving a wrong answer.
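As an illustration of the two-step response, a minimal sketch follows. The validation score, threshold and field names are hypothetical: the track did not prescribe any particular decision method or run format beyond the paragraph ID and text.

```python
# Sketch of the two-step system response described above.
# The validation score and threshold are illustrative assumptions; any
# classifier, prover or textual-entailment component could produce them.
def respond(question, candidate_paragraph, validation_score, threshold=0.5):
    """Return a run entry: either an answered paragraph or a non-answer."""
    if candidate_paragraph is not None and validation_score >= threshold:
        # Step 1: decide YES -> Step 2: return paragraph ID and text.
        return {"q_id": question["id"],
                "answered": "YES",
                "passage_id": candidate_paragraph["id"],
                "passage": candidate_paragraph["text"]}
    # Leave the question unanswered; the candidate can still be recorded
    # so assessors may judge it later as NoA R / NoA W.
    return {"q_id": question["id"], "answered": "NO",
            "candidate_id": None if candidate_paragraph is None
            else candidate_paragraph["id"]}
```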
16 Assessments
- R: the question is answered correctly
- W: the question is answered incorrectly
- NoA: the question is not answered
  - NoA R: NoA, but the candidate answer was correct
  - NoA W: NoA, and the candidate answer was incorrect
  - NoA Empty: NoA and no candidate answer was given
Evaluation measure: an extension of traditional accuracy (the proportion of questions answered correctly) that takes unanswered questions into account.
17 Evaluation measure
- n: number of questions
- n_R: number of correctly answered questions
- n_U: number of unanswered questions

18 Evaluation measure
score = ( n_R + n_U · (n_R / n) ) / n
- If n_U = 0, the measure reduces to accuracy: n_R / n
- If n_R = 0, the measure is 0
- If n_U = n, the measure is 0
Leaving a question unanswered adds value only if it avoids returning a wrong answer: each unanswered question is credited with the accuracy shown on the answered questions.
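A small sketch of the measure as defined above, useful as a sanity check of its boundary cases; the label strings mirror the assessment categories of the previous slide.

```python
# Sketch of the ResPubliQA evaluation measure:
#   score = (n_R + n_U * n_R / n) / n
def evaluate(assessments):
    """assessments: list of labels 'R', 'W', 'NoA R', 'NoA W', 'NoA Empty'."""
    n = len(assessments)
    n_r = sum(1 for a in assessments if a == "R")             # correct answers
    n_u = sum(1 for a in assessments if a.startswith("NoA"))  # unanswered
    return (n_r + n_u * n_r / n) / n

# Boundary cases: with no unanswered questions the measure equals accuracy,
# and unanswered questions only add value when some answers are correct.
print(evaluate(["R"] * 50 + ["W"] * 50))        # 0.5  (plain accuracy)
print(evaluate(["R"] * 50 + ["NoA W"] * 50))    # 0.75 (NoA better than W)
print(evaluate(["NoA Empty"] * 100))            # 0.0
```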
19 List of participants
System – Team
- elix – ELHUYAR-IXA, Spain
- icia – RACAI, Romania
- iiit – Search & Info Extraction Lab, India
- iles – LIMSI-CNRS-2, France
- isik – ISI-Kolkata, India
- loga – U. Koblenz-Landau, Germany
- mira – MIRACLE, Spain
- nlel – U. Politécnica de Valencia, Spain
- syna – Synapse Développement, France
- uaic – Al.I.Cuza U. of Iasi, Romania
- uned – UNED, Spain
20 Value of reducing wrong answers
[Chart: R, #NoA W, #NoA Empty and the perfect combination for the Romanian runs icia092roro, icia091roro, UAIC092roro, UAIC091roro and the baselines base092roro, base091roro; values not reproduced here.]
21 Detecting wrong answers
[Chart: R, #NoA W, #NoA Empty and the perfect combination for the German runs loga091dede, loga092dede and the baselines base092dede, base091dede.]
While maintaining the number of correct answers, the candidate answer was incorrect for 83% of the unanswered questions: a very good step towards improving the system.
22 IR is important, but not enough
[Chart: R, #NoA W, #NoA Empty and the perfect combination for the English runs uned092enen, uned091enen, nlel091enen, uaic092enen, base092enen, base091enen, elix092enen, uaic091enen, elix091enen, syna091enen, isik091enen, iiit091enen, elix092euen, elix091euen.]
- A feasible task
- The perfect combination is 50% better than the best system
- Many systems fall below the IR baselines
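The "perfect combination" above is read here as the usual oracle upper bound: a question counts as correct if at least one run answered it correctly. A sketch under that assumption follows; the run data in the example is invented purely for illustration.

```python
# Sketch of the "perfect combination" oracle upper bound over several runs.
def perfect_combination(runs):
    """runs: dict run_name -> dict question_id -> 'R'/'W'/'NoA ...' label."""
    questions = set()
    for judgments in runs.values():
        questions.update(judgments)
    correct = sum(
        1 for q in questions
        if any(judgments.get(q) == "R" for judgments in runs.values())
    )
    return correct / len(questions)

# Invented toy data: two runs over three questions.
runs = {"run_a": {1: "R", 2: "W", 3: "NoA W"},
        "run_b": {1: "W", 2: "R", 3: "W"}}
print(perfect_combination(runs))   # 2/3: questions 1 and 2 are covered
```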
23 Comparison across languages
- Same questions
- Same documents
- Same baseline systems
Not a strict comparison (the language variable still plays a role), but it is feasible to detect the most promising approaches across languages.
24–28 Comparison across languages
[Table: results per system (icia, nlel, uned, uaic, loga) and the baselines for RO, ES, EN, IT and DE; values not reproduced here.]
Systems above the baselines:
- icia: Boolean retrieval + intensive NLP + ML-based validation, with very good knowledge of the collection (Eurovoc terms, …)
- nlel092: n-gram-based retrieval, combining evidence from several languages
- uned: Okapi BM25 + NER + paragraph validation + n-gram-based re-ranking
- nlel091: n-gram-based paragraph retrieval
- loga: Lucene + deep NLP + logic + ML-based validation
Baseline: Okapi BM25 tuned for paragraph retrieval.
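The baseline's retrieval model is Okapi BM25 over paragraphs. A minimal sketch of BM25 paragraph scoring is given below; the k1 and b values are generic defaults rather than the parameters tuned for the track, and tokenization is left out.

```python
# Minimal sketch of Okapi BM25 scoring over a set of paragraphs.
# k1 and b are illustrative defaults, not the tuned track settings.
import math
from collections import Counter

def bm25_scores(query_terms, paragraphs, k1=1.2, b=0.75):
    """paragraphs: list of token lists. Returns one score per paragraph."""
    n = len(paragraphs)
    avgdl = sum(len(p) for p in paragraphs) / n
    df = Counter(t for p in paragraphs for t in set(p))   # document frequency
    scores = []
    for p in paragraphs:
        tf = Counter(p)
        dl = len(p)
        s = 0.0
        for t in query_terms:
            if t not in tf:
                continue
            idf = math.log((n - df[t] + 0.5) / (df[t] + 0.5) + 1.0)
            s += idf * tf[t] * (k1 + 1) / (tf[t] + k1 * (1 - b + b * dl / avgdl))
        scores.append(s)
    return scores

# Toy example with two pre-tokenized paragraphs.
docs = [["official", "journal", "languages"], ["whole", "milk", "definition"]]
print(bm25_scores(["official", "journal"], docs))  # first paragraph scores higher
```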
29 Conclusion
- Compare systems working in different languages
- Compare QA technology with pure IR
  - Pay more attention to paragraph retrieval: an old issue, with the state of the art dating from the late 90's (for English)
  - Pure IR performance: the largest difference with respect to the IR baselines was 0.44 vs. 0.68, obtained with intensive NLP and ML-based answer validation
- Introduce more types of questions
  - Some types are difficult to distinguish
  - In general, any question that can be answered in a paragraph
  - Analysis of results by question type (in progress)
30 Conclusion
- Introduce answer validation technology
  - Evaluation measure: the value of reducing wrong answers
  - Detecting wrong answers is feasible
- A feasible task: 90% of the questions have been answered
- Room for improvement: the best systems reach around 60%
- Even with fewer participants we get more comparison, more analysis and more learning
- ResPubliQA proposal for 2010: SC and breakout session
31 Interest in ResPubliQA 2010
1. Uni. "Al.I.Cuza" Iasi (Dan Cristea, Diana Trandabat)
2. Linguateca (Nuno Cardoso)
3. RACAI (Dan Tufis, Radu Ion)
4. Jesus Vilares
5. Univ. Koblenz-Landau (Bjorn Pelzer)
6. Thomson Reuters (Isabelle Moulinier)
7. Gracinda Carvalho
8. UNED (Alvaro Rodrigo)
9. Uni. Politecnica Valencia (Paolo Rosso & Davide Buscaldi)
10. Uni. Hagen (Ingo Glockner)
11. Linguit (Jochen L. Leidner)
12. Uni. Saarland (Dietrich Klakow)
13. ELHUYAR-IXA (Arantxa Otegi)
14. MIRACLE TEAM (Paloma Martínez Fernández)
But we need more. You already have a gold standard of 500 questions & answers to play with…