ResPubliQA 2010: QA on European Legislation
Anselmo Peñas, UNED, Spain
Pamela Forner, CELCT, Italy
Richard Sutcliffe, U. Limerick, Ireland
Alvaro Rodrigo, UNED, Spain
ResPubliQA 2010, 22 September, Padua, Italy

Outline
- The Multiple Language Question Answering Track at CLEF: a bit of history
- ResPubliQA this year: what is new
- Participation, runs and languages
- Assessment and metrics
- Results
- Conclusions

Multiple Language Question Answering at CLEF
Started in 2003: this is the eighth year.
- Era I: ungrouped, mainly factoid questions asked against monolingual newspapers; exact answers returned
- Era II: grouped questions asked against newspapers and Wikipedia; exact answers returned
- Era III (ResPubliQA): ungrouped questions against multilingual, parallel-aligned EU legislative documents; passages returned

ResPubliQA 2010 – Second Year
Key points:
- same set of questions in all languages
- same document collections: parallel-aligned documents
Same objectives:
- to move towards a domain of potential users
- to allow direct comparison of performance across languages
- to allow QA technologies to be evaluated against IR approaches
- to promote the use of validation technologies
But also some novelties…

What’s new
1. New task (Answer Selection)
2. New document collection (EuroParl)
3. New question types
4. Automatic evaluation

The Tasks
- Paragraph Selection (PS): extract a relevant paragraph of text that completely satisfies the information need expressed by a natural language question
- Answer Selection (AS), NEW: demarcate the shorter string of text corresponding to the exact answer, supported by the entire paragraph

The Collections
- Subset of JRC-Acquis (10,700 documents per language)
  - EU treaties, EU legislation, agreements and resolutions
  - Between 1950 and 2006
  - Parallel-aligned at the document level (not always at the paragraph level)
  - XML-TEI.2 encoding
- Small subset of EuroParl (~150 documents per language), NEW
  - Proceedings of the European Parliament, with translations into Romanian from January 2009: Debates (CRE) from 2009 and Texts Adopted (TA) from 2007
  - Parallel-aligned at the document level (not always at the paragraph level)
  - XML encoding
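As a rough illustration of working with these XML collections, the following is a minimal sketch of how a participant system might pull paragraph text out of one document. The element name "p" and identifier attribute "n" are assumptions about the TEI-style markup, not guaranteed by the track guidelines, and namespaces are ignored.

```python
# Minimal sketch: loading paragraphs from one XML document of the collection.
# Assumes TEI-style markup where paragraphs are <p> elements carrying an
# identifier attribute (here "n"); adjust names to the actual DTD if they differ.
import xml.etree.ElementTree as ET

def load_paragraphs(path):
    """Return a list of (paragraph_id, text) pairs for one document."""
    tree = ET.parse(path)
    paragraphs = []
    for i, p in enumerate(tree.iter("p")):
        pid = p.get("n", str(i))              # fall back to position if no id attribute
        text = "".join(p.itertext()).strip()  # concatenate all text inside the paragraph
        if text:
            paragraphs.append((pid, text))
    return paragraphs

# Usage (hypothetical file name):
# paras = load_paragraphs("jrc31950AP001-en.xml")
```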

EuroParl Collection
- Compatible with the Acquis domain
- Allows the scope of the questions to be widened
- Unfortunately: small number of texts; documents are not fully translated
The specific fragments of JRC-Acquis and EuroParl used by ResPubliQA are available at

Questions
- Two new question categories:
  - OPINION: What did the Council think about the terrorist attacks on London?
  - OTHER: What is the e-Content program about?
- REASON and PURPOSE categories merged: Why was Perwiz Kambakhsh sentenced to death?
- Also FACTOID, DEFINITION and PROCEDURE

ResPubliQA Campaigns
[Original slide: a table comparing the two campaigns by registered groups, participant groups, submitted runs and organizing people; most figures were not preserved in the transcript.]
- ResPubliQA 2009: submitted runs included baseline runs; 9 organizing people
- ResPubliQA 2010: 49 submitted runs (42 PS and 7 AS); 6 organizing people (+ 6 additional translators/assessors)
More participants and more submissions than in 2009.

ResPubliQA 2010 Participants
System (team, reference):
- bpac: SZTAKI, Hungary (Nemeskey)
- dict: Dhirubhai Ambani Institute of Information and Communication Technology, India (Sabnani et al.)
- elix: University of Basque Country, Spain (Agirre et al.)
- icia: RACAI, Romania (Ion et al.)
- iles: LIMSI-CNRS, France (Tannier et al.)
- ju_c: Jadavpur University, India (Pakray et al.)
- loga: University Koblenz, Germany (Glöckner and Pelzer)
- nlel: U. Politecnica Valencia, Spain (Correa et al.)
- prib: Priberam, Portugal
- uaic: Al. I. Cuza University of Iasi, Romania (Iftene et al.)
- uc3m: Universidad Carlos III de Madrid, Spain (Vicente-Díez et al.)
- uiir: University of Indonesia, Indonesia (Toba et al.)
- uned: UNED, Spain (Rodrigo et al.)
13 participants, 8 countries, 4 new participants

Submissions by Task and Language
Runs per source → target language pair, shown as total (PS runs, AS runs):
- DE → DE: 4 (4,0)
- EN → EN: 19 (16,3)
- EN → RO: 2 (2,0)
- ES → ES: 7 (6,1)
- EU → EN: 2 (2,0)
- FR → FR: 7 (5,2)
- IT → IT: 3 (2,1)
- PT → PT: 1 (1,0)
- RO → RO: 4 (4,0)
Totals per target language: DE 4 (4,0), EN 21 (18,3), ES 7 (6,1), FR 7 (5,2), IT 3 (2,1), PT 1 (1,0), RO 6 (6,0). Overall: 49 runs (42 PS, 7 AS).

System Output
Two options:
- Give an answer (a paragraph or an exact answer)
- Return NOA as the response: no answer is given, because the system is not confident about the correctness of its candidate answer
Objective:
- avoid returning an incorrect answer
- reduce only the proportion of wrong answers, not of correct ones
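A minimal sketch of this answer/NOA decision, assuming a hypothetical confidence score produced by some validation component inside the system; the threshold value is purely illustrative.

```python
# Minimal sketch of the answer/NOA decision: return the candidate paragraph only
# when a (hypothetical) validation confidence clears a threshold; otherwise
# return NOA but keep the discarded candidate, since assessors also judge it.
from typing import NamedTuple, Optional

class Response(NamedTuple):
    answered: bool
    paragraph: Optional[str]   # candidate kept even when answered is False

def decide(candidate_paragraph: Optional[str], confidence: float,
           threshold: float = 0.5) -> Response:
    if candidate_paragraph is not None and confidence >= threshold:
        return Response(True, candidate_paragraph)
    # NOA: no answer is given, but the candidate (possibly None) is still reported
    return Response(False, candidate_paragraph)
```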

Evaluation Measure
c@1 = (n_R + n_U · (n_R / n)) / n
where:
- n_R: number of questions correctly answered
- n_U: number of questions unanswered
- n: total number of questions (200 this year)
If n_U = 0, then c@1 = n_R / n, i.e. plain accuracy.
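A small sketch of the measure as a function, using the variable names from the slide (n_R, n_U, n); the worked numbers in the comments are invented for illustration only.

```python
# c@1 rewards unanswered questions in proportion to the accuracy achieved on the
# answered ones, and reduces to plain accuracy when nothing is left unanswered.
def c_at_1(n_r: int, n_u: int, n: int) -> float:
    """n_r: correctly answered, n_u: unanswered, n: total questions."""
    return (n_r + n_u * (n_r / n)) / n

# Illustrative values: 100 correct and 40 unanswered out of 200 questions
# c_at_1(100, 40, 200) == (100 + 40 * 0.5) / 200 == 0.6
# With n_u == 0 the value equals accuracy: c_at_1(100, 0, 200) == 0.5
```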

Assessment
Two steps:
1) Automatic evaluation
- responses are automatically compared against the manually produced gold standard
- answers that exactly match the gold standard are marked correct (R)
- correctness requires an exact match of the document identifier, the paragraph identifier, and the text retrieved by the system with respect to the gold standard
2) Manual assessment
- non-matching paragraphs/answers are judged by human assessors
- assessment is anonymous and simultaneous for all runs answering the same question
31% of the answers were automatically marked as correct.
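A minimal sketch of the first, automatic step, under the assumption that each gold-standard entry stores a document id, a paragraph id and the paragraph text; the field names are hypothetical. Only exact matches are auto-marked R, everything else goes to the manual pool.

```python
# Minimal sketch of the automatic evaluation step: a response is marked R only
# if document id, paragraph id and retrieved text all exactly match the
# manually produced gold standard; otherwise it is left for human assessors.
def auto_mark(response: dict, gold: dict) -> str:
    same = (response["doc_id"] == gold["doc_id"]
            and response["par_id"] == gold["par_id"]
            and response["text"].strip() == gold["text"].strip())
    return "R" if same else "needs manual assessment"
```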

Assessment for Paragraph Selection (PS)
- Binary assessment: Right (R) or Wrong (W)
- NOA responses:
  - automatically filtered and marked as U (Unanswered)
  - discarded candidate answers were also evaluated:
    - NoA R: NOA, but the candidate answer was correct
    - NoA W: NOA, and the candidate answer was incorrect
    - NoA Empty: NOA, and no candidate answer was given
- Evaluators were guided by the initial “gold” paragraph, used only as a hint

Assessment for Answer Selection (AS)
- R (Right): the answer string is an exact, correct answer, supported by the returned paragraph
- X (ineXact): the answer string contains only part of a correct answer present in the returned paragraph, or all of the correct answer plus unnecessary additional text
- M (Missed): the answer string does not contain a correct answer even in part, although the returned paragraph does contain one
- W (Wrong): the answer string does not contain a correct answer and neither does the returned paragraph, or the answer is unsupported

Monolingual Results for PS
[Table: monolingual PS results (c@1) per system and target language (DE, EN, ES, FR, IT, PT, RO), including a combination run and the IR baseline (uned). Systems ranked per language include uiir, dict, bpac, loga, prib, nlel, elix, uned, uc3m, ju_c, iles, uaic and icia; the scores themselves were not preserved in the transcript.]

Improvement in the Performance
[Two tables: best and average monolingual PS scores for ResPubliQA 2009 vs. ResPubliQA 2010, and best and average 2010 scores broken down by collection (JRC-Acquis vs. EuroParl); the figures were not preserved in the transcript.]

Cross-language Results for PS
- elix102 (EU → EN): 0.36
- elix101 (EU → EN): 0.33
- icia101 (EN → RO): 0.29
- icia102 (EN → RO): 0.29
In comparison to ResPubliQA 2009:
- more cross-language runs (+2)
- improvement in the best performance: from 0.18 to 0.36

Results for the AS Task
[Table: per-run counts of R, W, M and X answers together with the NoA breakdown (including NoA empty) and a combination row, for the runs ju_c101ASenen, iles101ASenen, iles101ASfrfr, nlel101ASenen, nlel101ASeses, nlel101ASitit and nlel101ASfrfr; the counts were not preserved in the transcript.]

Conclusions
- Successful continuation of ResPubliQA 2009
- AS task: few groups and poor results
- Overall improvement of results
- New document collection and new question types
- The evaluation metric encourages the use of a validation module

More on System Analyses and Approaches
MLQA’10 Workshop on Wednesday, 14:30 – 18:00

ResPubliQA 2010: QA on European Legislation
Thank you!