3rd Answer Validation Exercise (AVE 2008)
QA subtrack at the Cross-Language Evaluation Forum 2008
UNED: Anselmo Peñas, Álvaro Rodrigo, Felisa Verdejo
Thanks to… Main task QA organizing committee
Answer Validation Exercise 2008
Validate the correctness of real systems' answers
Diagram: Question -> Question Answering -> Candidate answer + Supporting Text
Diagram: Question + Candidate answer + Supporting Text -> Answer Validation -> "Answer is correct" or "Answer is not correct or not enough evidence"
Collections
Candidate answers grouped by question

<q id="116" lang="EN">
  <q_str>What is Zanussi?</q_str>
  <a id="116_1" value="">
    <a_str>was an Italian producer of home appliances</a_str>
    <t_str doc="Zanussi">Zanussi For the Polish film director, see Krzysztof Zanussi. For the hot-air balloon, see Zanussi (balloon). Zanussi was an Italian producer of home appliances that in 1984 was bought</t_str>
  </a>
  <a id="116_2" value="">
    <a_str>who had also been in Cassibile since August 31</a_str>
    <t_str doc="en/p29/2998260.xml">Only after the signing had taken place was Giuseppe Castellano informed of the additional clauses that had been presented by general Ronald Campbell to another Italian general, Zanussi, who had also been in Cassibile since August 31.</t_str>
  </a>
  <a id="116_4" value="">
    <a_str>3</a_str>
    <t_str doc="1618911.xml">(1985) 3 Out of 5 Live (1985) What Is This?</t_str>
  </a>
</q>

Tasks for participant systems:
- Accept or Reject all answers
- Select one of the accepted answers
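As a reading aid, the question-group format above can be loaded with a few lines of Python. This is a minimal sketch, assuming each group is available as a well-formed XML string; the function name `read_question_group` and the dictionary keys are illustrative choices, not part of the AVE distribution.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the sample group shown above (supporting text abridged).
SAMPLE = """<q id="116" lang="EN">
  <q_str>What is Zanussi?</q_str>
  <a id="116_1" value="">
    <a_str>was an Italian producer of home appliances</a_str>
    <t_str doc="Zanussi">Zanussi was an Italian producer of home appliances that in 1984 was bought</t_str>
  </a>
  <a id="116_4" value="">
    <a_str>3</a_str>
    <t_str doc="1618911.xml">(1985) 3 Out of 5 Live (1985) What Is This?</t_str>
  </a>
</q>"""


def read_question_group(xml_text):
    """Parse one <q> element into a dict with its candidate answers."""
    q = ET.fromstring(xml_text)
    answers = []
    for a in q.findall("a"):
        t = a.find("t_str")
        answers.append({
            "id": a.get("id"),
            "answer": (a.findtext("a_str") or "").strip(),
            "support": (t.text or "").strip() if t is not None else "",
            "doc": t.get("doc") if t is not None else None,
        })
    return {
        "id": q.get("id"),
        "lang": q.get("lang"),
        "question": q.findtext("q_str"),
        "answers": answers,
    }


group = read_question_group(SAMPLE)
for ans in group["answers"]:
    print(ans["id"], "|", ans["answer"])
```

A validation system would then fill in a decision for each entry in `group["answers"]` and, optionally, mark one accepted answer as selected.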
Collections
- Remove duplicated answers inside the same question group
- Discard NIL answers, void answers and answers with too long a supporting snippet
- This processing led to a reduction in the number of answers to be validated
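The cleaning steps above could be sketched as follows. Several details are assumptions, since the slides do not give them: the length limit `MAX_SNIPPET_CHARS` is a hypothetical value, duplicates are taken to be answers with the same string ignoring case, and each answer is a dict with `answer` and `support` keys.

```python
MAX_SNIPPET_CHARS = 700  # hypothetical limit; the actual threshold is not stated


def clean_answers(answers, max_snippet_chars=MAX_SNIPPET_CHARS):
    """Filter the candidate answers of ONE question group."""
    seen = set()
    kept = []
    for ans in answers:
        text = ans["answer"].strip()
        support = ans["support"].strip()
        # Discard void and NIL answers.
        if not text or text.upper() == "NIL":
            continue
        # Discard answers whose supporting snippet is too long.
        if len(support) > max_snippet_chars:
            continue
        # Assumption: a duplicate is the same answer string, case-insensitive.
        key = text.lower()
        if key in seen:
            continue
        seen.add(key)
        kept.append(ans)
    return kept


raw = [
    {"answer": "an Italian producer", "support": "Zanussi was an Italian producer"},
    {"answer": "An Italian producer", "support": "another snippet"},
    {"answer": "NIL", "support": ""},
    {"answer": "", "support": "some text"},
    {"answer": "3", "support": "x" * 1000},
]
cleaned = clean_answers(raw)
print(len(raw), "->", len(cleaned))
```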
AVE Collections 2008 (# answers to validate and % correct)

Language     Testing  % Correct  Development  % Correct
Spanish      1528     10%        551          23%
English      1055     7.5%       195          10.8%
German       1027                264          25.4%
Portuguese   1014     20.5%      148          42.8%
Dutch        228      19.3%      78           15.8%
French       199      26.1%      171          49.7%
Romanian     119      10.5%      82           43.7%
Basque       104      7.2%       -
Bulgarian    27       44.4%
Italian                          100          16%

Available for CLEF participants at nlp.uned.es/clef-qa/ave/
Evaluation
- Not balanced collections (real world)
- Approach: detect if there is enough evidence to accept an answer
- Measures: precision, recall and F over ACCEPTED answers
- Baseline system: accept all answers
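The measures can be illustrated with a short sketch. It assumes the usual reading of "over ACCEPTED answers": precision is the share of accepted answers that are correct, recall is the share of correct answers that were accepted, and F is their harmonic mean. The function name and the toy data are illustrative only.

```python
def prf_over_accepted(gold, decisions):
    """gold[i] is True if answer i is correct; decisions[i] is True if ACCEPTED."""
    accepted = sum(decisions)
    correct_total = sum(gold)
    correct_accepted = sum(1 for g, d in zip(gold, decisions) if g and d)
    precision = correct_accepted / accepted if accepted else 0.0
    recall = correct_accepted / correct_total if correct_total else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Toy collection: one correct answer out of four (unbalanced on purpose).
gold = [True, False, False, False]

# Baseline system: accept every answer.
baseline = [True] * len(gold)
p, r, f = prf_over_accepted(gold, baseline)
print(f"baseline  P={p:.2f} R={r:.2f} F={f:.2f}")
```

On an unbalanced collection the accept-all baseline always reaches recall 1, while its precision equals the proportion of correct answers, which is why it makes a natural lower bound for F.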
Participants and runs
Columns: DE, EN, ES, FR, RO, Tot
Fernuniversität in Hagen 2
LIMSI
U. Iasi 4
DFKI 1
INAOE
U. Alicante 3
UNC
U. Jaén (UJA) 6
LINA
Total 8 5 24
Evaluation: P, R, F
Precision, Recall and F measure over correct answers for English

Group  System           F     Precision  Recall
DFKI   Ltqa             0.64  0.54       0.78
UA     Ofe              0.49  0.35       0.86
UNC    Jota_2           0.21  0.13       0.56
IASI   Uaic_2           0.19  0.11       0.85
UNC    Jota_1           0.17  0.09       0.94
IASI   Uaic_1                            0.76
       100% VALIDATED   0.14  0.08       1
UJA    Magc_2(bbr)      0.02  0.01
UJA    Magc_1(timbl)
Additional measures
- Compare AVE systems with QA systems performance
- Count the answers SELECTED correctly
- Reward the detection of groups in which all answers are incorrect
- Allows a new justified attempt to answer the question (new)
Evaluation: estimated performance

System             Type  estimated_qa_performance  qa_accuracy (% best combination)  qa_rej_accuracy  qa_accuracy_max
Perfect selection        0.56                      0.34 (100%)                       0.66             1
ltqa               AV    0.34                      0.24 (70.37%)                     0.44             0.68
ofe                AV    0.27                      0.19 (57.41%)                     0.4              0.59
uaic_2             AV    0.24                                                        0.01             0.25
wlvs081roen        QA    0.21                      0.21 (62.96%)
uaic_1             AV    0.19
jota_2             AV    0.17                      0.16 (46.30%)                     0.1              0.26
dfki081deen        QA                              0.17 (50%)
jota_1             AV    0.16
dcun081deen        QA    0.10                      0.10 (29.63%)
Random                   0.09                      0.09 (25.25%)
nlel081enen        QA    0.06                      0.06 (18.52%)
nlel082enen        QA    0.05                      0.05 (14.81%)
ilkm081nlen        QA    0.04                      0.04 (12.96%)
magc_2(bbr)        AV                              0.01 (1.85%)                      0.64             0.65
dcun082deen        QA
magc_1(timbl)      AV                              0 (0%)                            0.63
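The slides' own formulas for these measures did not survive, so the sketch below is an assumption inferred from the complete rows of the table: qa_accuracy_max appears to be qa_accuracy + qa_rej_accuracy (0.24 + 0.44 = 0.68 for ltqa), and estimated_qa_performance appears to be qa_accuracy + qa_rej_accuracy × qa_accuracy (0.34 + 0.66 × 0.34 ≈ 0.56 for perfect selection). The function name, argument names and the counts in the example are hypothetical.

```python
def selection_measures(n_questions, n_correctly_selected, n_groups_all_rejected_ok):
    """Question-level measures, under the assumed definitions.

    n_correctly_selected:     questions where the SELECTED answer is correct.
    n_groups_all_rejected_ok: questions where every answer was incorrect
                              and the system rejected all of them.
    """
    qa_accuracy = n_correctly_selected / n_questions
    qa_rej_accuracy = n_groups_all_rejected_ok / n_questions
    return {
        "qa_accuracy": qa_accuracy,
        "qa_rej_accuracy": qa_rej_accuracy,
        # Upper bound: every correctly rejected question is answered
        # correctly on a second attempt.
        "qa_accuracy_max": qa_accuracy + qa_rej_accuracy,
        # Estimate: the second attempt succeeds at the system's own accuracy.
        "estimated_qa_performance": qa_accuracy + qa_rej_accuracy * qa_accuracy,
    }


# Hypothetical counts chosen to match the "Perfect selection" proportions.
perfect = selection_measures(n_questions=100,
                             n_correctly_selected=34,
                             n_groups_all_rejected_ok=66)
for name, value in perfect.items():
    print(f"{name}: {value:.2f}")
```

Under this reading, a system that only rejects well (high qa_rej_accuracy, near-zero qa_accuracy) gets a high qa_accuracy_max but a low estimated performance, which matches the magc rows.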
Comparing AV systems performance with QA systems (English)
Techniques reported at AVE 2007 & 2008
Columns: technique, 10 reports (2007), 9 reports (2008)
Syntactic similarity 4
Functions (sub, obj, etc) 3
Syntactic transformations 1 2
Word-sense disambiguation
Semantic parsing
Semantic role labeling
First order logic representation
Theorem prover
Semantic similarity
Generates hypotheses 6 2
Wordnet 3 5
Chunking 4
n-grams, longest common subsequences
Phrase transformations
NER 7
Num. expressions
Temp. expressions
Coreference resolution
Dependency analysis
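To make one of the listed techniques concrete, here is a minimal sketch of a longest-common-subsequence overlap between a hypothesis (question combined with the candidate answer) and the supporting text. It is not taken from any participant's system: whitespace tokenisation, lowercasing and the normalisation by hypothesis length are all assumptions made for illustration.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def lcs_overlap(hypothesis, support):
    """Share of hypothesis tokens covered, in order, by the supporting text."""
    h = hypothesis.lower().split()
    s = support.lower().split()
    return lcs_length(h, s) / len(h) if h else 0.0


good = lcs_overlap(
    "Zanussi was an Italian producer of home appliances",
    "Zanussi was an Italian producer of home appliances that in 1984 was bought",
)
bad = lcs_overlap(
    "Zanussi is 3",
    "(1985) 3 Out of 5 Live (1985) What Is This?",
)
print(f"well supported: {good:.2f}  poorly supported: {bad:.2f}")
```

A score like this would typically be one feature among several, with the accept/reject threshold tuned on the development collections.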
Conclusion (of AVE)
- Three years of evaluation in a real environment
  - Real systems' outputs -> AVE input
- Developed methodologies
  - Build collections from QA responses
  - Evaluate in chain with a QA Track
  - Compare results with QA systems
- Introduction of RTE techniques in QA
  - More NLP
  - More Machine Learning
- New testing collections for the QA and RTE communities
  - In 8 languages, not only English
Many Thanks!! CLEF AVE QA Organizing Committee AVE participants UNED team