QA Pilot Task at CLEF 2004
Jesús Herrera, Anselmo Peñas, Felisa Verdejo
UNED NLP Group
Cross-Language Evaluation Forum, Bath, UK - September 2004

Objectives of the Pilot Task
- Difficult questions: inspect systems' performance in answering
  – Conjunctive lists
  – Disjunctive lists
  – Questions with temporal restrictions
  – Questions that require some inference
- Self-scoring: study systems' capability to give an accurate confidence score
- New evaluation measures: correlation between confidence score and correctness
- Generate evaluation and training resources
- Feedback for the Main Tasks

Pilot Task definition
- Usual methodology, carried out simultaneously with the Main Track
- Same guidelines except:
  – Source and target language: Spanish
  – Number and type of questions (100 in total):
    – Factoid: 18 (2 NIL)
    – Definition: 2
    – Conjunctive list: 20
    – Temporally restricted: 60
  – Number of answers per question: unlimited
    – Several correct and distinct answers per question (disjunctive list)
    – Answers may be context dependent or may evolve in time
    – Reward correct and distinct answers and "punish" incorrect ones
  – Evaluation measures

Temporally restricted questions
- 3 moments with regard to the restriction: before, during, after
- 3 types of restriction:
  – Restricted by date (20 questions): day, month, year, etc.
    "¿Qué sistema de gobierno tenía Andorra hasta mayo de 1993?"
    (What system of government did Andorra have until May 1993?)
  – Restricted by period (20 questions)
    "¿Quién gobernó en Bolivia entre julio de 1980 y agosto de 1981?"
    (Who governed Bolivia between July 1980 and August 1981?)
  – Restricted by event (nested question) (20 questions)
    "¿Quién fue el rey de Bélgica inmediatamente antes de la coronación de Alberto II?"
    (Who was the king of Belgium immediately before the coronation of Albert II?)
- Systems may need to inspect several documents to answer a question

Evaluation measures: considerations
- "Do it better" versus "How to get better results?": systems are tuned according to the evaluation measure
- Criteria: reward systems that give
  – Answers to more questions
  – More different correct answers to each question
  – Fewer incorrect answers to each question
  – Higher confidence scores to correct answers
  – Lower confidence scores to incorrect answers
  – Answers to questions with fewer known answers
- "Punish" incorrect answers: users prefer void answers rather than incorrect ones
- Promote answer validation and accurate self-scoring
- An unlimited number of answers is permitted → self-regulation

Evaluation measure: K-measure
- score(r): confidence self-scoring, score(r) ∈ [0,1]
- R(i): number of different known answers to question i
- answered(sys,i): number of answers given by sys to question i
- K(sys) ∈ [-1,1]
- Baseline: K(sys) = 0 ⟺ ∀r, score(r) = 0 (a system without knowledge)
(See the reconstruction sketched below.)
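The formula itself did not survive the transcription (it was presumably an image on the slide). The following is a hedged reconstruction, consistent with the variable definitions above and with the stated range K(sys) ∈ [-1,1]; eval(r) is assumed to be +1 for a correct answer and -1 for an incorrect one:

```latex
K(\mathit{sys}) = \frac{1}{|Q|} \sum_{i \in Q}
  \frac{\sum_{r \in \mathrm{answers}(\mathit{sys},\,i)} \mathrm{eval}(r)\cdot \mathrm{score}(r)}
       {\max\bigl(R(i),\ \mathrm{answered}(\mathit{sys},\,i)\bigr)}
```

Under this reading each question contributes a value in [-1,1], so K(sys) ∈ [-1,1], and a system that assigns score(r) = 0 to every answer gets exactly the K(sys) = 0 baseline.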

Self-scoring and correctness
- Correlation coefficient (r) between:
  – Correctness (human assessment): assess(sys,r) ∈ {0,1} (0: incorrect, 1: correct)
  – Self-scoring: score(sys,r) ∈ [0,1]
- r ∈ [-1,1]: 0 means no correlation, 1 perfect correlation, -1 inverse correlation
(A small computation sketch follows below.)
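To make the intended computation concrete, here is a minimal Python sketch (not from the slides): it computes the Pearson correlation between hypothetical binary assessments and confidence self-scores; the lists assess and score and their values are illustrative assumptions.

```python
from math import sqrt

def pearson_r(assess, score):
    """Pearson correlation between binary human assessments (0/1)
    and the system's confidence self-scores in [0, 1]."""
    n = len(assess)
    mean_a = sum(assess) / n
    mean_s = sum(score) / n
    cov = sum((a - mean_a) * (s - mean_s) for a, s in zip(assess, score))
    var_a = sum((a - mean_a) ** 2 for a in assess)
    var_s = sum((s - mean_s) ** 2 for s in score)
    if var_a == 0 or var_s == 0:
        return 0.0  # correlation undefined; treated here as "no correlation"
    return cov / sqrt(var_a * var_s)

# Hypothetical run: six answers judged correct (1) or incorrect (0),
# with the confidence the system attached to each answer.
assess = [1, 0, 1, 1, 0, 0]
score = [0.9, 0.2, 0.8, 0.6, 0.1, 0.4]
print(round(pearson_r(assess, score), 3))  # close to 1, i.e. accurate self-scoring
```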

Results at the Pilot Task
- Only one participant: U. Alicante, with splitting of nested questions (Saquete et al., 2004)
- Correctly answered: 15% (factoid: 22%); correctly answered in the Main Track: 32%
- The system obtains better results when evaluated over TERQAS
- Questions were too difficult
- K = –0.086 (K < 0)
- Correlation between assessment and self-scoring: further work on improving self-scoring is needed

Case study
- Are systems able to give an accurate confidence score? Does the K-measure reward it better than other measures do?
- Study the ranking of the 48 participant systems at the Main Track according to:
  – Number of correct answers
  – CWS (confidence-weighted score)
  – K1, the variant of the K-measure when just one answer per question is requested
(A sketch of both confidence-based measures follows below.)
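As an illustration only (not taken from the slides), here is a small Python sketch of how the two confidence-based measures could be computed for a run with one answer per question. The cws function follows the standard TREC-style definition (average precision over prefixes of the confidence ranking); the k1 function assumes one known answer per question, so each question contributes +score for a correct answer and -score for an incorrect one.

```python
def cws(run):
    """Confidence-weighted score: sort answers by decreasing confidence,
    then average the precision of each prefix of that ranking."""
    ranked = sorted(run, key=lambda a: a["score"], reverse=True)
    hits, acc = 0, 0.0
    for i, a in enumerate(ranked, start=1):
        hits += a["correct"]
        acc += hits / i
    return acc / len(ranked)

def k1(run):
    """K1 sketch: one answer and one known answer per question, so each
    question contributes +score if correct and -score if incorrect."""
    return sum((1 if a["correct"] else -1) * a["score"] for a in run) / len(run)

# Hypothetical run: one answer per question, with its confidence and judgement.
run = [
    {"score": 1.0, "correct": 1},
    {"score": 0.9, "correct": 0},
    {"score": 0.3, "correct": 1},
    {"score": 0.1, "correct": 0},
]
print(f"CWS = {cws(run):.3f}, K1 = {k1(run):+.3f}")
```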

[Chart: the 48 systems ordered by number of correct answers and re-ranked with K1 vs. CWS; systems fall into three groups (K1>0, K1=0, K1<0), with the correlation value r highlighted for ptue041ptpt, dfki041deen and fuha041dede.]

Case study
- CWS rewards some systems with very bad confidence self-scoring. For example, fuha041dede:
  – CWS: 1st position
  – K1: 27th position
- Its strategy was oriented towards obtaining a better CWS: converting answers with low confidence to NIL with score=1 ensures 20 correct answers at the top of the ranking (the 20 NIL questions)
- However, the system shows very good self-knowledge: giving score=0 to its NIL answers raises r and brings it to 1st position under K1
(See the sketch below for why this trick pays off under CWS but not under K1.)
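Purely as an illustration (the run below is synthetic, not the actual fuha041dede submission), this sketch repeats the cws and k1 helpers from the earlier example and shows why the NIL trick pays off under CWS but not under K1: the 20 sure NIL answers sit at the top of the confidence ranking and dominate the prefix-precision average, while confidently wrong answers still cost under K1.

```python
import random

def cws(run):
    """TREC-style confidence-weighted score (same as the earlier sketch)."""
    ranked = sorted(run, key=lambda a: a["score"], reverse=True)
    hits, acc = 0, 0.0
    for i, a in enumerate(ranked, start=1):
        hits += a["correct"]
        acc += hits / i
    return acc / len(ranked)

def k1(run):
    """K1 sketch: +score for a correct answer, -score for an incorrect one."""
    return sum((1 if a["correct"] else -1) * a["score"] for a in run) / len(run)

random.seed(0)
# Synthetic run: the 20 NIL questions answered NIL with score=1 (correct by
# construction), plus 80 questions answered with inflated confidence (0.9)
# and only ~20% accuracy.
gamed = [{"score": 1.0, "correct": 1} for _ in range(20)] + \
        [{"score": 0.9, "correct": int(random.random() < 0.2)} for _ in range(80)]

print(f"CWS = {cws(gamed):.3f}")   # high: the 20 early hits dominate the ranking
print(f"K1  = {k1(gamed):+.3f}")   # typically negative: confident wrong answers are punished
```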

Conclusions
- Some systems are able to give an accurate self-scoring: r up to 0.7
- K-measures reward good confidence self-scoring better than CWS does
- But good self-scoring (high r) is not enough:
  – A system with perfect self-scoring (r=1) would still need to answer more than 40 questions correctly to reach 1st position
  – Find a good balance
- Promote answer validation and accurate self-scoring

Conclusions
- "Difficult" questions still remain a challenge
- Some specialisation should be expected: the QA Main Track shows that different systems answer correctly different subsets of questions
- K-measures permit:
  – Some specialisation
  – Posing new types of questions
  – Leaving the door open to new teams
- "Just give score=0 to the things you don't K-now"

And what about multilinguality?
- Start thinking about promoting fully multilingual systems:
  – Is it too soon for a single task with several target languages? (multilingual collection)
  – Join bilingual subtasks with the same target language into a multilingual task? (multilingual set of questions)
- Allow bilingual, promote multilingual (help the transition):
  – ~50 questions in each different language
  – Systems could answer with score=0 the questions in source languages they don't manage
  – Systems that manage several source languages would be rewarded (a transition could be expected)

Thanks!
QA Pilot Task at CLEF 2004
Jesús Herrera, Anselmo Peñas, Felisa Verdejo
UNED NLP Group
Cross-Language Evaluation Forum, Bath, UK - September 2004