Evaluating Answer Validation in multi-stream Question Answering. Álvaro Rodrigo, Anselmo Peñas, Felisa Verdejo. UNED NLP & IR group, nlp.uned.es. The Second International Workshop on Evaluating Information Access (EVIA-NTCIR 2008), Tokyo, 16 December 2008.
Content: 1. Context and motivation (Question Answering at CLEF; the Answer Validation Exercise at CLEF). 2. Evaluating the validation of answers. 3. Evaluating the selection of answers (correct selection; correct rejection). 4. Analysis and discussion. 5. Conclusion.
Evolution of the CLEF-QA Track. Target languages: EU official languages. Collections: news (1994+), Wikipedia (Nov.), JRC-Acquis. Type of questions: 200 questions; factoid + temporal restrictions + definitions; + lists + linked questions + closed lists; factoid, definition, motive, purpose, procedure. Supporting information: document, snippet, paragraph. Pilots and exercises: temporal restrictions, lists, AVE, Real Time, WiQA, QAST, WSDQA, GikiCLEF.
Evolution of Results (Spanish). Overall: best result below 60%. Definitions: best result above 80%, obtained with an approach that is not IR-based.
Pipeline Upper Bounds. Use Answer Validation to break the pipeline. [Pipeline diagram: Question → Question analysis → Passage Retrieval → Answer Extraction → Answer Ranking → Answer; stage accuracies multiply, and "Not enough evidence" is a possible outcome.]
Results in CLEF-QA 2006 (Spanish). Perfect combination of systems: 81%. Best single system: 52.5%. Different systems were the best for ORGANIZATION, for PERSON and for TIME questions.
Collaborative architectures. Different systems answer different types of questions better: specialisation and collaboration. [Architecture: QA sys 1, QA sys 2, QA sys 3, …, QA sys n receive the question and produce candidate answers; an Answer Validation & Selection module chooses the final answer.] Answer Evaluation Framework.
Collaborative architectures. How to select the right answer? Redundancy, voting, confidence scores, performance history. Why not a deeper analysis?
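A minimal sketch of the shallow strategies just listed (redundancy/voting with confidence scores as tie-breaker); the function name, data layout and normalisation are illustrative, not from the slides:

```python
from collections import defaultdict

def select_answer(candidates):
    """Pick one answer from several QA streams by redundancy/voting,
    breaking ties with the streams' own confidence scores.

    `candidates` is a list of (answer_text, confidence) pairs, one per stream.
    """
    votes = defaultdict(lambda: {"count": 0, "conf": 0.0, "surface": None})
    for answer, confidence in candidates:
        key = answer.strip().lower()          # naive normalisation for redundancy
        entry = votes[key]
        entry["count"] += 1
        entry["conf"] += confidence
        entry["surface"] = entry["surface"] or answer
    # Most-voted answer wins; accumulated confidence breaks ties.
    best = max(votes.values(), key=lambda e: (e["count"], e["conf"]))
    return best["surface"]

# Example: three streams answer the same question
print(select_answer([("Akira Kurosawa", 0.7), ("Kurosawa", 0.4), ("Akira Kurosawa", 0.9)]))
# -> "Akira Kurosawa"
```

Such surface-level combination is exactly what the slide questions: it ignores the supporting texts, which is what motivates a deeper, validation-based analysis.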
Answer Validation Exercise (AVE). Objective: validate the correctness of answers given by real QA systems, namely the participants at CLEF QA.
Answer Validation Exercise (AVE). [Diagram: from a QA system, AVE takes the question, a candidate answer and its supporting text; a hypothesis is generated automatically from the question and the answer, and a textual entailment decision over the hypothesis and the supporting text yields either "answer is correct" or "answer is not correct or not enough evidence". Textual entailment alone corresponds to AVE 2006; hypothesis generation plus entailment is the full AVE answer validation task.]
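To make the hypothesis-generation step concrete, here is a toy sketch; real AVE systems used proper question analysis, so the patterns, names and the `entails` stand-in below are assumptions for illustration only:

```python
def build_hypothesis(question: str, answer: str) -> str:
    """Turn a question plus a candidate answer into a declarative hypothesis.
    Naive pattern substitution, just to show the idea."""
    q = question.rstrip("?").strip()
    if q.lower().startswith("what is "):
        return f"{q[len('What is '):]} is {answer}"
    if q.lower().startswith("who "):
        return f"{answer} {q[len('Who '):]}"
    return f"{q}: {answer}"

def validate(question, answer, supporting_text, entails) -> str:
    """Accept the answer only if the supporting text entails the hypothesis.
    `entails(text, hypothesis)` stands in for a textual-entailment engine."""
    hypothesis = build_hypothesis(question, answer)
    return "YES" if entails(supporting_text, hypothesis) else "NO"

# e.g. build_hypothesis("What is Zanussi?", "an Italian producer of home appliances")
# -> "Zanussi is an Italian producer of home appliances"
```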
Techniques in AVE 2007 (from the AVE 2007 overview), with the number of systems per technique: generates hypotheses (6), WordNet (3), chunking (3), n-grams / longest common subsequences (5), phrase transformations (2), NER (5), numerical expressions (6), temporal expressions (4), coreference resolution (2), dependency analysis (3), syntactic similarity (4), functions (subj, obj, etc.) (3), syntactic transformations (1), word-sense disambiguation (2), semantic parsing (4), semantic role labeling (2), first-order logic representation (3), theorem prover (3), semantic similarity (2).
Evaluation linked to the main QA task. [Diagram: the Question Answering Track supplies the questions, the systems' answers and the systems' supporting texts to the Answer Validation Exercise; AVE systems return validations (YES, NO); the human judgements from the QA Track (R, W, X, U) are mapped to (YES, NO) and used for the AVE evaluation, yielding QA Track results and AVE Track results.] Human assessments are reused.
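The slide only states that the judgements (R, W, X, U) are mapped to (YES, NO); one possible mapping is sketched below, and the treatment of X (inexact) and U (unsupported) in particular is an assumption, not something the slides specify:

```python
# Hypothetical mapping from CLEF-QA human judgements to AVE gold labels.
# R = Right, W = Wrong, X = ineXact, U = Unsupported.
# How X and U are handled here is an assumption for illustration.
QA_TO_AVE = {
    "R": "YES",  # correct answer -> should be validated
    "W": "NO",   # wrong answer -> should be rejected
    "X": "NO",   # inexact answer -> treated as not validated
    "U": "NO",   # unsupported answer -> treated as not validated
}

def gold_label(judgement: str) -> str:
    return QA_TO_AVE[judgement]
```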
Content: 1. Context and motivation. 2. Evaluating the validation of answers. 3. Evaluating the selection of answers. 4. Analysis and discussion. 5. Conclusion.
Proposed evaluation of Answer Validation & Selection. [Setting: participant systems in a CLEF-QA campaign (QA sys 1 … QA sys n) receive the question and produce candidate answers; the Answer Validation & Selection module under evaluation selects the final answer.]
Collections: an example. Question: What is Zanussi? Candidate answers with their supporting texts:
1. Answer: "was an Italian producer of home appliances". Supporting text: "Zanussi. For the Polish film director, see Krzysztof Zanussi. For the hot-air balloon, see Zanussi (balloon). Zanussi was an Italian producer of home appliances that in 1984 was bought …"
2. Answer: "who had also been in Cassibile since August 31". Supporting text: "Only after the signing had taken place was Giuseppe Castellano informed of the additional clauses that had been presented by general Ronald Campbell to another Italian general, Zanussi, who had also been in Cassibile since August".
3. Answer: "(1985)". Supporting text: "3 Out of 5 Live (1985) What Is This?"
Evaluating the Validation. Validation: decide whether each candidate answer is correct or not (YES | NO). The collections are not balanced. Approach: detect whether there is enough evidence to accept an answer. Measures: precision, recall and F over correct answers. Baseline system: accept all answers.
Evaluating the Validation.
                  Correct answer   Incorrect answer
Answer accepted   n_CA             n_WA
Answer rejected   n_WR             n_CR
(n_CA: correctly accepted; n_WA: wrongly accepted; n_WR: wrongly rejected; n_CR: correctly rejected.)
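With these counts, the measures over correct answers named on the previous slide take their usual form (a reconstruction from the slide text; no formula survives in the extraction):

\[
\mathrm{precision} = \frac{n_{CA}}{n_{CA}+n_{WA}}
\qquad
\mathrm{recall} = \frac{n_{CA}}{n_{CA}+n_{WR}}
\qquad
F = \frac{2 \cdot \mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision}+\mathrm{recall}}
\]

Under this reading, the accept-all baseline has recall 1 and precision equal to the proportion of correct answers in the (unbalanced) collection.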
Evaluating the Selection. Goals: quantify the potential gain of Answer Validation in Question Answering; compare AV systems with QA systems; develop measures more comparable to QA accuracy.
Evaluating the Selection. Given a question with several candidate answers, there are two options. Selection: select an answer, i.e. try to answer the question (correct selection: the selected answer was correct; incorrect selection: the selected answer was incorrect). Rejection: reject all candidate answers, i.e. leave the question unanswered (correct rejection: all candidate answers were incorrect; incorrect rejection: not all candidate answers were incorrect).
Evaluating the Selection. Over n questions, n = n_CA + n_WA + n_WS + n_WR + n_CR:
                                                    Question with a correct answer   Question without a correct answer
Question answered correctly (one answer selected)   n_CA                             -
Question answered incorrectly                       n_WA                             n_WS
Question unanswered (all answers rejected)          n_WR                             n_CR
Not comparable to qa_accuracy.
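As an illustration of how these five counts could be tallied for a validation-and-selection system (the function name and the data layout are assumptions, not part of the slides):

```python
def tally_selection_counts(questions):
    """Tally n_CA, n_WA, n_WS, n_WR, n_CR for a selection system.

    `questions` is a list of dicts with:
      - "candidates": list of (answer_id, is_correct) pairs for one question
      - "selected":   the answer_id the system selected, or None if it
                      rejected all candidates.
    """
    counts = {"CA": 0, "WA": 0, "WS": 0, "WR": 0, "CR": 0}
    for q in questions:
        has_correct = any(ok for _, ok in q["candidates"])
        if q["selected"] is None:                 # all answers rejected
            counts["WR" if has_correct else "CR"] += 1
        else:
            selected_ok = dict(q["candidates"])[q["selected"]]
            if selected_ok:
                counts["CA"] += 1                 # correct selection
            else:
                counts["WA" if has_correct else "WS"] += 1
    return counts
```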
Evaluating the Selection. A measure that rewards rejection (the collections are not balanced). Interpretation for QA: all questions correctly rejected by the AV system will be answered correctly.
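The formula itself did not survive the extraction; a measure matching the interpretation just stated (each correct rejection counted as a correctly answered question) would be:

\[
\frac{n_{CA} + n_{CR}}{n}
\]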
Evaluating the Selection. Interpretation for QA: questions correctly rejected by the AV system will be answered correctly in the qa_accuracy proportion.
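Again the formula is not in the extracted text; an estimated performance matching this interpretation, taking qa_accuracy as the proportion of correct selections, would be:

\[
\frac{n_{CA} + \mathrm{qa\_accuracy} \cdot n_{CR}}{n}
\qquad\text{with}\qquad
\mathrm{qa\_accuracy} = \frac{n_{CA}}{n}
\]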
Content: 1. Context and motivation. 2. Evaluating the validation of answers. 3. Evaluating the selection of answers. 4. Analysis and discussion. 5. Conclusion.
Analysis and discussion (AVE 2007 English). Validation and selection results. qa_accuracy is correlated with recall (R); the "estimated" measure adjusts for this.
Multi-stream QA performance (AVE 2007 English).
Analysis and discussion (AVE 2007 Spanish). Validation and selection results; comparison between AV and QA systems.
Conclusion. An evaluation framework for Answer Validation & Selection systems, with measures that reward not only correct selection but also correct rejection. The framework promotes the improvement of QA systems, allows comparison between AV and QA systems, and shows under which conditions multi-stream QA performs better, the room for improvement available just by using multi-stream QA, and the potential gain that AV systems can provide to QA.
Thanks! Acknowledgement: EU project T-CLEF (ICT )