1
Evaluating Answer Validation in multi-stream Question Answering
Álvaro Rodrigo, Anselmo Peñas, Felisa Verdejo
UNED NLP & IR group, nlp.uned.es
The Second International Workshop on Evaluating Information Access (EVIA-NTCIR 2008)
Tokyo, 16 December 2008
2
UNED nlp.uned.es
Content
1. Context and motivation
   Question Answering at CLEF
   Answer Validation Exercise at CLEF
2. Evaluating the validation of answers
3. Evaluating the selection of answers
   Correct selection
   Correct rejection
4. Analysis and discussion
5. Conclusion
3
Evolution of the CLEF-QA Track
(Table flattened in extraction; each row lists its values in slide order.)
Years: 2003, 2004, 2005, 2006, 2007, 2008, 2009
Target languages: 3, 7, 8, 9, 10, 11, UE Official
Collections: News 1994; + News 1995; + Wikipedia Nov. 2006; JRC-Acquis
Type of questions: 200 Factoid; + Temporal restrictions; + Definitions; - Type of question; + Lists; + Linked questions; + Closed lists; Factoid, Definition, Motive, Purpose, Procedure
Supporting information: Document; Snippet; Paragraph
Pilots and Exercises: Temporal restriction, Lists, AVE, Real Time, WiQA, AVE, QAST, AVE, QAST, WSDQA, GikiCLEF, QAST
4
Evolution of Results 2003 - 2006 (Spanish)
Overall: best result <60%
Definitions: best result >80%, NOT IR approach
5
Pipeline Upper Bounds
Question -> Question analysis -> Passage Retrieval -> Answer Extraction -> Answer Ranking -> Answer
1.0 x 0.8 x 0.8 = 0.64
Not enough evidence
Use Answer Validation to break the pipeline
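The arithmetic behind this upper bound can be sketched in a few lines of Python. The sketch assumes that stage errors are independent, so per-stage accuracies multiply. The stage names come from the slide, but which stage gets which value is an assumption made only for illustration.

```python
from math import prod

# Hypothetical per-stage accuracies; the assignment of values to stages is assumed.
stage_accuracy = {
    "question_analysis": 1.0,
    "passage_retrieval": 0.8,
    "answer_extraction": 0.8,
}

def pipeline_upper_bound(accuracies):
    """Upper bound on end-to-end accuracy when every stage must succeed."""
    return prod(accuracies)

bound = pipeline_upper_bound(stage_accuracy.values())
print(f"{bound:.2f}")  # 0.64
```

Because the bound is a product, no later stage can recover what an earlier stage lost, which is the motivation the slide gives for breaking the pipeline with Answer Validation.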
6
Results in CLEF-QA 2006 (Spanish)
Perfect combination: 81%
Best system: 52.5%
Chart labels: Best with ORGANIZATION, Best with PERSON, Best with TIME
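One way to read the "perfect combination" figure is as an oracle that counts a question as answered when at least one system answered it correctly. A minimal sketch follows, assuming a correctness judgement is available per question for each system. The system names and judgements are made up for illustration and are not CLEF data.

```python
def best_single_accuracy(judgements):
    """Accuracy of the best individual system."""
    return max(sum(row) / len(row) for row in judgements.values())

def oracle_accuracy(judgements):
    """Share of questions answered correctly by at least one system."""
    per_question = zip(*judgements.values())
    hits = [any(question) for question in per_question]
    return sum(hits) / len(hits)

# Hypothetical judgements: one list per system, one boolean per question.
judgements = {
    "sys_a": [True, False, False, True],
    "sys_b": [False, True, False, True],
    "sys_c": [False, False, False, True],
}

print(best_single_accuracy(judgements))  # 0.5
print(oracle_accuracy(judgements))       # 0.75
```

The gap between the two numbers is the room that a perfect answer selector could exploit.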
7
Collaborative architectures
Different systems respond better to different types of questions
Specialisation
Collaboration
Question -> QA sys 1, QA sys 2, QA sys 3, ..., QA sys n -> Candidate answers -> Answer Validation & Selection -> Answer
Evaluation Framework
8
Collaborative architectures
How to select the good answer?
Redundancy
Voting
Confidence score
Performance history
Why not deeper analysis?
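As a point of comparison for the shallow strategies listed here, voting with a confidence tie-break can be sketched as below. This is an illustrative baseline, not a method from the talk. The normalisation (lowercasing and trimming) and the `(answer, confidence)` input format are assumptions.

```python
from collections import defaultdict

def select_by_voting(candidates):
    """Pick the most frequent answer; break ties by highest confidence.

    candidates: list of (answer_text, confidence) pairs, one per stream.
    Returns the normalised answer string, or None if there are no candidates.
    """
    votes = defaultdict(int)
    best_conf = defaultdict(float)
    for text, conf in candidates:
        key = text.strip().lower()  # assumed normalisation
        votes[key] += 1
        best_conf[key] = max(best_conf[key], conf)
    if not votes:
        return None
    return max(votes, key=lambda k: (votes[k], best_conf[k]))

# Hypothetical candidate answers from three streams.
print(select_by_voting([("Rome", 0.4), ("rome ", 0.3), ("Milan", 0.9)]))  # rome
```

Note that the redundant answer wins even though a single stream was more confident about another one, and that this strategy never looks at the supporting text.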
9
Answer Validation Exercise (AVE)
Objective: validate the correctness of the answers given by real QA systems, namely the participants at CLEF QA
10
Answer Validation Exercise (AVE)
(Diagram flattened in extraction; components listed in slide order.)
Question Answering: Question, Candidate answer, Supporting Text
Automatic Hypothesis Generation: Question, Hypothesis
Textual Entailment: "Answer is correct" or "Answer is not correct or not enough evidence"
Scope labels: AVE 2006; AVE 2007 - 2008; Answer Validation
11
Techniques in AVE 2007 (Overview AVE 2007)
Generates hypotheses: 6
WordNet: 3
Chunking: 3
n-grams, longest common subsequences: 5
Phrase transformations: 2
NER: 5
Num. expressions: 6
Temp. expressions: 4
Coreference resolution: 2
Dependency analysis: 3
Syntactic similarity: 4
Functions (sub, obj, etc): 3
Syntactic transformations: 1
Word-sense disambiguation: 2
Semantic parsing: 4
Semantic role labeling: 2
First order logic representation: 3
Theorem prover: 3
Semantic similarity: 2
12
Evaluation linked to main QA task
(Diagram flattened in extraction; components listed in slide order.)
Question Answering Track: Systems' answers, Systems' Supporting Texts
Answer Validation Exercise: Questions, Systems' Validation (YES, NO)
Human Judgements (R, W, X, U) -> QA Track results
Mapping (YES, NO) -> Evaluation -> AVE Track results
Reuse human assessments
13
Content
1. Context and motivation
2. Evaluating the validation of answers
3. Evaluating the selection of answers
4. Analysis and discussion
5. Conclusion
14
Evaluation Proposed
Question -> QA sys 1, QA sys 2, QA sys 3, ..., QA sys n (participant systems in a CLEF-QA) -> Candidate answers -> Answer Validation & Selection -> Answer
Evaluation of Answer Validation & Selection
15
Collections
Question: What is Zanussi?
Candidate answer: was an Italian producer of home appliances
Supporting text: Zanussi For the Polish film director, see Krzysztof Zanussi. For the hot-air balloon, see Zanussi (balloon). Zanussi was an Italian producer of home appliances that in 1984 was bought
Candidate answer: who had also been in Cassibile since August 31
Supporting text: Only after the signing had taken place was Giuseppe Castellano informed of the additional clauses that had been presented by general Ronald Campbell to another Italian general, Zanussi, who had also been in Cassibile since August 31.
Candidate answer: 3
Supporting text: (1985) 3 Out of 5 Live (1985) What Is This?
16
Evaluating the Validation
Validation: decide if each candidate answer is correct or not (YES | NO)
Not balanced collections
Approach: detect if there is enough evidence to accept an answer
Measures: precision, recall and F over correct answers
Baseline system: accept all answers
17
Evaluating the Validation
                   Correct Answer   Incorrect Answer
Answer Accepted    n_CA             n_WA
Answer Rejected    n_WR             n_CR
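Precision, recall and F over correct answers follow directly from these counts. The sketch below uses the standard definitions (precision over accepted answers, recall over correct answers, balanced F); the function name and the example collection of 20 correct and 80 incorrect answers are hypothetical.

```python
def validation_scores(n_ca, n_wa, n_wr):
    """Precision, recall and F over correct answers.

    n_ca: correct answers accepted
    n_wa: incorrect answers accepted
    n_wr: correct answers rejected
    """
    accepted = n_ca + n_wa
    correct = n_ca + n_wr
    precision = n_ca / accepted if accepted else 0.0
    recall = n_ca / correct if correct else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure

# "Accept all answers" baseline on a hypothetical unbalanced collection:
# every correct answer is accepted, so recall is 1 and precision equals
# the proportion of correct answers in the collection.
precision, recall, f_measure = validation_scores(n_ca=20, n_wa=80, n_wr=0)
print(round(precision, 3), round(recall, 3), round(f_measure, 3))  # 0.2 1.0 0.333
```

Correct rejections (n_CR) do not enter any of the three measures, which is why they suit collections where incorrect answers dominate.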
18
Evaluating the Selection
Quantify the potential gain of Answer Validation in Question Answering
Compare AV systems with QA systems
Develop measures more comparable to QA accuracy
19
Evaluating the Selection
Given a question with several candidate answers, there are two options:
Selection: select an answer ≡ try to answer the question
   Correct selection: answer was correct
   Incorrect selection: answer was incorrect
Rejection: reject all candidate answers ≡ leave question unanswered
   Correct rejection: all candidate answers were incorrect
   Incorrect rejection: not all candidate answers were incorrect
20
Evaluating the Selection
n questions: n = n_CA + n_WA + n_WS + n_WR + n_CR
                                                     Question with     Question without
                                                     Correct Answer    Correct Answer
Question Answered Correctly (One Answer Selected)    n_CA              -
Question Answered Incorrectly                        n_WA              n_WS
Question Unanswered (All Answers Rejected)           n_WR              n_CR
Not comparable to qa_accuracy
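The five outcomes in this table can be assigned per question as sketched below. The function name and the input encoding (`None` meaning that all candidates were rejected) are assumptions for illustration; the outcome labels follow the table.

```python
from collections import Counter

def question_outcome(selected_is_correct, has_correct_candidate):
    """Map one question to CA, WA, WS, WR or CR.

    selected_is_correct: True or False if an answer was selected,
                         None if all candidate answers were rejected.
    has_correct_candidate: whether any candidate answer was correct.
    """
    if selected_is_correct is None:
        return "WR" if has_correct_candidate else "CR"
    if selected_is_correct:
        if not has_correct_candidate:
            raise ValueError("a correct selection implies a correct candidate")
        return "CA"
    return "WA" if has_correct_candidate else "WS"

# Hypothetical run over five questions, one per outcome.
run = [(True, True), (False, True), (False, False), (None, True), (None, False)]
counts = Counter(question_outcome(sel, has) for sel, has in run)
print(dict(counts))  # {'CA': 1, 'WA': 1, 'WS': 1, 'WR': 1, 'CR': 1}
```

The counts sum to n by construction, since every question falls into exactly one of the five cells.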
22
Evaluating the Selection
Rewards rejection (not balanced collections)
Interpretation for QA: all questions correctly rejected by AV will be answered correctly
23
Evaluating the Selection
Interpretation for QA: questions correctly rejected by AV will be answered correctly in qa_accuracy proportion
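The formulas on these slides did not survive extraction, so the sketch below reconstructs the measures from the two stated interpretations only. It assumes qa_accuracy is the share of questions answered correctly, an optimistic measure that credits every correct rejection as if it were later answered correctly, and an estimated measure that credits correct rejections only in qa_accuracy proportion. The names `accuracy` and `estimated_qa_performance` and the example counts are hypothetical.

```python
def selection_measures(n_ca, n_wa, n_ws, n_wr, n_cr):
    """Selection measures reconstructed from the slide interpretations (assumed)."""
    n = n_ca + n_wa + n_ws + n_wr + n_cr
    qa_accuracy = n_ca / n
    # Every correctly rejected question is assumed to be answered correctly later.
    accuracy = (n_ca + n_cr) / n
    # Correctly rejected questions are answered correctly in qa_accuracy proportion.
    estimated_qa_performance = (n_ca + n_cr * qa_accuracy) / n
    return {
        "qa_accuracy": qa_accuracy,
        "accuracy": accuracy,
        "estimated_qa_performance": estimated_qa_performance,
    }

# Hypothetical counts for 100 questions.
measures = selection_measures(n_ca=40, n_wa=20, n_ws=10, n_wr=10, n_cr=20)
for name, value in measures.items():
    print(name, round(value, 2))
# qa_accuracy 0.4
# accuracy 0.6
# estimated_qa_performance 0.48
```

Under these assumptions the estimated measure always lies between qa_accuracy and the optimistic accuracy, so a system cannot score well simply by rejecting everything.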
24
Content
1. Context and motivation
2. Evaluating the validation of answers
3. Evaluating the selection of answers
4. Analysis and discussion
5. Conclusion
25
Analysis and discussion (AVE 2007 English)
Result columns: Validation, Selection
QA_acc correlated to R
"Estimated" adjusts it
26
Multi-stream QA performance (AVE 2007 English)
27
Analysis and discussion (AVE 2007 Spanish)
Result columns: Validation, Selection
Comparing AV & QA
28
Conclusion
Evaluation framework for Answer Validation & Selection systems
Measures that reward not only Correct Selection but also Correct Rejection
Promote improvement of QA systems
Allow comparison between AV and QA systems
Under what conditions multi-stream QA performs better
Room for improvement just using multi-stream QA
Potential gain that AV systems can provide to QA
29
Thanks!
http://nlp.uned.es/clef-qa/ave
http://www.clef-campaign.org
Acknowledgement: EU project T-CLEF (ICT-1-4-1 215231)