1 3rd Answer Validation Exercise (AVE 2008)
QA subtrack at the Cross-Language Evaluation Forum 2008
UNED: Anselmo Peñas, Álvaro Rodrigo, Felisa Verdejo
Thanks to the Main Task QA organizing committee

2 Answer Validation Exercise 2008
Validate the correctness of real systems' answers. A question is sent to a Question Answering system, which returns a candidate answer plus a supporting text; the Answer Validation system then decides "answer is correct" or "answer is not correct, or there is not enough evidence".

3 Candidate answers grouped by question
Collections: candidate answers grouped by question.

<q id="116" lang="EN">
  <q_str>What is Zanussi?</q_str>
  <a id="116_1" value="">
    <a_str>was an Italian producer of home appliances</a_str>
    <t_str doc="Zanussi">Zanussi For the Polish film director, see Krzysztof Zanussi. For the hot-air balloon, see Zanussi (balloon). Zanussi was an Italian producer of home appliances that in 1984 was bought</t_str>
  </a>
  <a id="116_2" value="">
    <a_str>who had also been in Cassibile since August 31</a_str>
    <t_str doc="en/p29/ xml">Only after the signing had taken place was Giuseppe Castellano informed of the additional clauses that had been presented by general Ronald Campbell to another Italian general, Zanussi, who had also been in Cassibile since August 31.</t_str>
  </a>
  <a id="116_4" value="">
    <a_str>3</a_str>
    <t_str doc=" xml">(1985) 3 Out of 5 Live (1985) What Is This?</t_str>
  </a>
</q>

Systems must:
- Accept or reject all answers
- Select one of the accepted answers
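The collections are plain XML, so a validation system can load them with a standard parser. A minimal sketch in Python, assuming a well-formed collection file whose root element wraps the <q> groups (the file name ave2008_en.xml is hypothetical):

import xml.etree.ElementTree as ET

# Load an AVE collection: questions with grouped candidate answers.
# The file name is hypothetical; the real collections are available to
# CLEF participants at nlp.uned.es/clef-qa/ave/.
tree = ET.parse("ave2008_en.xml")

for q in tree.getroot().iter("q"):
    print(q.get("id"), q.findtext("q_str"))
    for a in q.iter("a"):
        answer = a.findtext("a_str")       # candidate answer string
        t = a.find("t_str")
        support = t.text                   # supporting snippet
        doc = t.get("doc")                 # source document id
        print(" ", a.get("id"), repr(answer), "from doc", doc)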

4 Collections: preprocessing
Remove duplicated answers inside the same question group.
Discard NIL answers, void answers, and answers with a too-long supporting snippet.
This processing led to a reduction in the number of answers to be validated.
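A sketch of this filtering for one question group, assuming each answer is a dict with "answer" and "support" strings; the length cutoff is an assumption, since the slides do not say what counted as too long:

def clean_group(answers, max_support_len=700):
    """Deduplicate and filter one question group, as described above.

    max_support_len is an assumed cutoff; the actual limit used in
    AVE 2008 is not given on the slide.
    """
    seen = set()
    kept = []
    for a in answers:
        text = a["answer"].strip()
        if not text or text.upper() == "NIL":       # void or NIL answer
            continue
        if len(a["support"]) > max_support_len:     # overlong snippet
            continue
        key = text.lower()
        if key in seen:                             # duplicate in the group
            continue
        seen.add(key)
        kept.append(a)
    return kept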

5 AVE Collections 2008 (# answers to validate)

Language     Testing   % Correct   Development   % Correct
Spanish      1528      10%         551           23%
English      1055      7.5%        195           10.8%
German       1027      –           264           25.4%
Portuguese   1014      20.5%       148           42.8%
Dutch        228       19.3%       78            15.8%
French       199       26.1%       171           49.7%
Romanian     119       10.5%       82            43.7%
Basque       104       7.2%        –             –
Bulgarian    –         –           27            44.4%
Italian      –         –           100           16%

Available for CLEF participants at nlp.uned.es/clef-qa/ave/

6 Evaluation
The collections are not balanced (real-world distribution of correct answers).
Approach: detect whether there is enough evidence to accept an answer.
Measures: precision, recall and F over ACCEPTED answers.
Baseline system: accept all answers.
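For concreteness, a sketch of these measures, taking the gold correctness labels and the system's decisions as parallel boolean lists; the accept-all baseline is the same function with every decision set to True:

def prf_over_accepted(gold_correct, accepted):
    """Precision, recall and F1 over ACCEPTED answers.

    gold_correct[i] -- answer i is actually correct
    accepted[i]     -- the system decided to accept answer i
    """
    tp = sum(g and a for g, a in zip(gold_correct, accepted))
    p = tp / sum(accepted) if sum(accepted) else 0.0
    r = tp / sum(gold_correct) if sum(gold_correct) else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f

# Accept-all baseline: recall is 1 and precision equals the proportion of
# correct answers in the collection (about 0.08 for English testing,
# matching the "100% VALIDATED" row on slide 8).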

7 Participants and runs (in DE, EN, ES, FR and RO)

Group                      Runs
Fernuniversität in Hagen   2
LIMSI                      –
U. Iasi                    4
DFKI                       1
INAOE                      –
U. Alicante                3
UNC                        –
U. Jaén (UJA)              6
LINA                       –
Total                      24

8 Evaluation: P, R, F
Precision, Recall and F measure over correct answers for English.

Group   System           F      Precision   Recall
DFKI    Ltqa             0.64   0.54        0.78
UA      Ofe              0.49   0.35        0.86
UNC     Jota_2           0.21   0.13        0.56
IASI    Uaic_2           0.19   0.11        0.85
UNC     Jota_1           0.17   0.09        0.94
IASI    Uaic_1           –      –           0.76
–       100% VALIDATED   0.14   0.08        1
UJA     Magc_2(bbr)      0.02   0.01        –
UJA     Magc_1(timbl)    –      –           –

(100% VALIDATED is the accept-all baseline.)

9 Additional measures
Compare AVE systems with QA systems' performance:
- Count the answers SELECTED correctly
- Reward the detection of question groups in which all answers are incorrect: a correct rejection allows a new, justified attempt to answer the question

10 Additional measures (the formal definitions of the measures were shown as images; see the reconstruction below)
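The definitions can be reconstructed from the slide-11 table, whose rows are consistent with estimated_qa_performance = qa_accuracy + qa_rej_accuracy * qa_accuracy and qa_accuracy_max = qa_accuracy + qa_rej_accuracy (e.g. perfect selection: 0.34 + 0.66 * 0.34 ≈ 0.56). A sketch under that reconstruction:

def qa_measures(n_selected_ok, n_rejected_ok, n_questions):
    """AVE 2008 additional measures, reconstructed from the slide-11 table.

    n_selected_ok -- questions where the system selected a correct answer
    n_rejected_ok -- questions where it correctly rejected every candidate
                     (no correct answer existed in the group)
    """
    qa_accuracy = n_selected_ok / n_questions
    qa_rej_accuracy = n_rejected_ok / n_questions
    # A correct rejection earns a new attempt at the question, assumed to
    # succeed at the system's own qa_accuracy rate:
    estimated_qa_performance = qa_accuracy + qa_rej_accuracy * qa_accuracy
    # Upper bound: every earned second attempt succeeds.
    qa_accuracy_max = qa_accuracy + qa_rej_accuracy
    return estimated_qa_performance, qa_accuracy, qa_rej_accuracy, qa_accuracy_max

# Sanity check against the "Perfect selection" row of slide 11:
# qa_measures(34, 66, 100) -> (0.5644, 0.34, 0.66, 1.0), i.e. 0.56 rounded.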

11 Evaluation: estimated performance
System              Type   estimated_qa_performance   qa_accuracy (% best combination)   qa_rej_accuracy   qa_accuracy_max
Perfect selection   –      0.56                       0.34 (100%)                        0.66              1
ltqa                AV     0.34                       0.24 (70.37%)                      0.44              0.68
ofe                 AV     0.27                       0.19 (57.41%)                      0.4               0.59
uaic_2              AV     0.24                       0.24                               0.01              0.25
wlvs081roen         QA     0.21                       0.21 (62.96%)                      –                 –
uaic_1              AV     0.19                       –                                  –                 –
jota_2              AV     0.17                       0.16 (46.30%)                      0.1               0.26
dfki081deen         QA     0.17                       0.17 (50%)                         –                 –
jota_1              AV     0.16                       –                                  –                 –
dcun081deen         QA     0.10                       0.10 (29.63%)                      –                 –
Random              –      0.09                       0.09 (25.25%)                      –                 –
nlel081enen         QA     0.06                       0.06 (18.52%)                      –                 –
nlel082enen         QA     0.05                       0.05 (14.81%)                      –                 –
ilkm081nlen         QA     0.04                       0.04 (12.96%)                      –                 –
magc_2(bbr)         AV     0.01                       0.01 (1.85%)                       0.64              0.65
dcun082deen         QA     –                          –                                  –                 –
magc_1(timbl)       AV     0                          0 (0%)                             0.63              0.63

12 Comparing AV systems' performance with QA systems (English)

13 Techniques reported at AVE 2007 & 2008

Technique                               2007 (10 reports)   2008 (9 reports)
Syntactic similarity                    4                   –
Functions (sub, obj, etc.)              3                   –
Syntactic transformations               1                   2
Word-sense disambiguation               –                   –
Semantic parsing                        –                   –
Semantic role labeling                  –                   –
First-order logic representation        –                   –
Theorem prover                          –                   –
Semantic similarity                     –                   –
Generates hypotheses                    6                   2
WordNet                                 3                   5
Chunking                                4                   –
n-grams, longest common subsequences    –                   –
Phrase transformations                  –                   –
NER                                     7                   –
Numeric expressions                     –                   –
Temporal expressions                    –                   –
Coreference resolution                  –                   –
Dependency analysis                     –                   –
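Most of these techniques compute comparisons between a hypothesis built from the question plus the candidate answer and the supporting text. As an illustration only, not any participant's actual system, a token n-gram overlap feature of the kind listed above:

def ngram_overlap(hypothesis, support, n=2):
    """Fraction of hypothesis n-grams that also occur in the supporting text.

    A minimal version of the n-gram features in the table above; real AVE
    systems combined such scores with NER, WordNet, dependency analysis, etc.
    """
    def ngrams(text):
        toks = text.lower().split()
        return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}

    hyp = ngrams(hypothesis)
    return len(hyp & ngrams(support)) / len(hyp) if hyp else 0.0

# e.g. ngram_overlap("Zanussi was an Italian producer of home appliances",
#                    "Zanussi was an Italian producer of home appliances "
#                    "that in 1984 was bought")  # -> 1.0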

14 Conclusion (of AVE)
Three years of evaluation in a real environment:
- Real systems' outputs → AVE input
Developed methodologies:
- Build collections from QA responses
- Evaluate in chain with a QA track
- Compare results with QA systems
Introduction of RTE techniques in QA:
- More NLP
- More machine learning
New testing collections for the QA and RTE communities:
- In 8 languages, not only English

15 Many Thanks!!
CLEF, the QA Organizing Committee, the AVE participants, and the UNED team

