3rd Answer Validation Exercise (AVE 2008)
QA subtrack at the Cross-Language Evaluation Forum 2008
UNED: Anselmo Peñas, Álvaro Rodrigo, Felisa Verdejo
Thanks to the Main Task QA organizing committee

Answer Validation Exercise 2008
Goal: validate the correctness of real systems' answers.
Pipeline: a Question goes to a Question Answering system, which returns a Candidate Answer together with a Supporting Text; the Answer Validation module then decides whether the answer is correct, or not correct / not enough evidence.

Collections: candidate answers grouped by question

<q id="116" lang="EN">
  <q_str>What is Zanussi?</q_str>
  <a id="116_1" value="">
    <a_str>was an Italian producer of home appliances</a_str>
    <t_str doc="Zanussi">Zanussi For the Polish film director, see Krzysztof Zanussi. For the hot-air balloon, see Zanussi (balloon). Zanussi was an Italian producer of home appliances that in 1984 was bought</t_str>
  </a>
  <a id="116_2" value="">
    <a_str>who had also been in Cassibile since August 31</a_str>
    <t_str doc="en/p29/2998260.xml">Only after the signing had taken place was Giuseppe Castellano informed of the additional clauses that had been presented by general Ronald Campbell to another Italian general, Zanussi, who had also been in Cassibile since August 31.</t_str>
  </a>
  <a id="116_4" value="">
    <a_str>3</a_str>
    <t_str doc="1618911.xml">(1985) 3 Out of 5 Live (1985) What Is This?</t_str>
  </a>
</q>

Systems must accept or reject each answer and select one of the accepted answers.
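This format can be loaded with a standard XML parser. Below is a minimal Python sketch, assuming a well-formed collection file whose elements follow the fragment above (q, q_str, a, a_str, t_str); the file name in the usage comment is hypothetical.

```python
import xml.etree.ElementTree as ET

def load_ave_collection(path):
    """Parse an AVE-style collection into {question id: {lang, question text, answers}}."""
    root = ET.parse(path).getroot()
    collection = {}
    for q in root.iter("q"):
        answers = []
        for a in q.findall("a"):
            t = a.find("t_str")
            answers.append({
                "id": a.get("id"),
                "value": a.get("value"),           # gold judgement; empty in the test release
                "answer": a.findtext("a_str", ""),
                "snippet": (t.text or "") if t is not None else "",
                "doc": t.get("doc") if t is not None else None,
            })
        collection[q.get("id")] = {
            "lang": q.get("lang"),
            "question": q.findtext("q_str", ""),
            "answers": answers,
        }
    return collection

# Hypothetical usage:
# groups = load_ave_collection("AVE2008_EN_test.xml")
```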

Collections preprocessing:
- Remove duplicated answers inside the same question group.
- Discard NIL answers, void answers and answers with too long a supporting snippet.
This processing led to a reduction in the number of answers to be validated.
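As an illustration of this preprocessing step, here is a small sketch operating on answer dictionaries such as those produced by the parsing sketch above; the snippet-length cut-off is an assumed value, not the official threshold.

```python
def clean_question_group(answers, max_snippet_chars=700):
    """Drop NIL/void answers and over-long snippets, then de-duplicate within the group.
    The 700-character cut-off is an illustrative assumption."""
    seen, kept = set(), []
    for a in answers:
        text = a["answer"].strip()
        if not text or text.upper() == "NIL":
            continue                                  # void or NIL answer
        if len(a["snippet"]) > max_snippet_chars:
            continue                                  # supporting snippet too long
        key = (text.lower(), a["snippet"].strip().lower())
        if key in seen:
            continue                                  # duplicate within the question group
        seen.add(key)
        kept.append(a)
    return kept
```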

AVE 2008 collections (number of answers to validate)

Language     Test (% correct)    Development (% correct)
Spanish      1528 (10%)          551 (23%)
English      1055 (7.5%)         195 (10.8%)
German       1027                264 (25.4%)
Portuguese   1014 (20.5%)        148 (42.8%)
Dutch        228 (19.3%)         78 (15.8%)
French       199 (26.1%)         171 (49.7%)
Romanian     119 (10.5%)         82 (43.7%)
Basque       104 (7.2%)          -
Bulgarian    27 (44.4%)
Italian      100 (16%)

Available to CLEF participants at nlp.uned.es/clef-qa/ave/

Evaluation
- Collections are not balanced (real-world distribution).
- Approach: detect whether there is enough evidence to accept an answer.
- Measures: precision, recall and F over ACCEPTED answers.
- Baseline system: accept all answers.
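A minimal sketch of these measures, assuming system decisions and gold judgements keyed by answer id; under the accept-all baseline, recall is 1 and precision equals the proportion of correct answers in the collection.

```python
def precision_recall_f(decisions, gold):
    """Precision, recall and F1 computed over the answers the system ACCEPTED.
    decisions / gold: answer id -> True (accepted / correct) or False."""
    accepted = {a for a, d in decisions.items() if d}
    correct = {a for a, g in gold.items() if g}
    tp = len(accepted & correct)
    precision = tp / len(accepted) if accepted else 0.0
    recall = tp / len(correct) if correct else 0.0
    f = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f

# "Accept all answers" baseline:
# baseline = precision_recall_f({a: True for a in gold}, gold)
```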

Participants and runs (languages: DE, EN, ES, FR, RO)
Nine groups submitted runs: Fernuniversität in Hagen (2), LIMSI, U. Iasi (4), DFKI (1), INAOE, U. Alicante (3), UNC, U. Jaén (UJA) (6), LINA; 24 runs in total.

Evaluation: precision, recall and F measure over correct answers (English)

Group  System          F      Precision  Recall
DFKI   ltqa            0.64   0.54       0.78
UA     ofe             0.49   0.35       0.86
UNC    jota_2          0.21   0.13       0.56
IASI   uaic_2          0.19   0.11       0.85
UNC    jota_1          0.17   0.09       0.94
IASI   uaic_1                            0.76
       100% VALIDATED  0.14   0.08       1.00
UJA    magc_2 (bbr)    0.02   0.01
UJA    magc_1 (timbl)

Additional measures
- Compare AVE systems with QA systems' performance.
- Count the answers SELECTED correctly.
- Reward the detection of question groups in which all answers are incorrect, since this allows a new, justified attempt to answer the question.
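A rough sketch of how such selection-oriented measures could be computed: credit for correctly selected answers and for question groups correctly rejected as having no correct answer. The exact definitions and normalisation used in the official evaluation are not reproduced here, so treat this as an assumption-based illustration rather than the official metric.

```python
def selection_measures(selections, gold_by_question):
    """selections: question id -> id of the single selected answer, or None if the
    whole group was rejected.
    gold_by_question: question id -> set of correct answer ids (possibly empty).
    Returns (qa_accuracy, qa_rej_accuracy); per-question normalisation is an assumption."""
    n = len(gold_by_question)
    selected_ok = sum(1 for q, sel in selections.items()
                      if sel is not None and sel in gold_by_question[q])
    rejected_ok = sum(1 for q, sel in selections.items()
                      if sel is None and not gold_by_question[q])
    return selected_ok / n, rejected_ok / n
```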


Evaluation: estimated performance

System             Type  est. QA perf.  QA accuracy (% of best combination)  QA rej. accuracy  QA accuracy max
Perfect selection        0.56           0.34 (100%)                          0.66              1
ltqa               AV    0.34           0.24 (70.37%)                        0.44              0.68
ofe                      0.27           0.19 (57.41%)                        0.40              0.59
uaic_2                   0.24                                                0.01              0.25
wlvs081roen        QA    0.21           0.21 (62.96%)
uaic_1                   0.19
jota_2                   0.17           0.16 (46.30%)                        0.10              0.26
dfki081deen                             0.17 (50%)
jota_1                   0.16
dcun081deen              0.10           0.10 (29.63%)
Random                   0.09           0.09 (25.25%)
nlel081enen              0.06           0.06 (18.52%)
nlel082enen              0.05           0.05 (14.81%)
ilkm081nlen              0.04           0.04 (12.96%)
magc_2 (bbr)                            0.01 (1.85%)                         0.64              0.65
dcun082deen
magc_1 (timbl)                          0 (0%)                               0.63

Comparing AV systems' performance with QA systems (English)

Techniques reported at AVE 2007 & 2008 (10 system reports in 2007, 9 in 2008)
- Syntactic similarity (4)
- Syntactic functions (subject, object, etc.) (3)
- Syntactic transformations (1 / 2)
- Word-sense disambiguation
- Semantic parsing
- Semantic role labeling
- First-order logic representation
- Theorem prover
- Semantic similarity
- Hypothesis generation (6 / 2)
- WordNet (3 / 5)
- Chunking (4)
- n-grams, longest common subsequences
- Phrase transformations
- Named entity recognition (7)
- Numeric expressions
- Temporal expressions
- Coreference resolution
- Dependency analysis

Conclusion (of AVE)
Three years of evaluation in a real environment: real systems' outputs are the AVE input.
Developed methodologies:
- Build collections from QA responses.
- Evaluate in chain with a QA track.
- Compare results with QA systems.
Introduction of RTE techniques into QA: more NLP, more machine learning.
New testing collections for the QA and RTE communities, in 8 languages, not only English.

Many thanks to CLEF, the AVE and QA organizing committees, the AVE participants, and the UNED team.