A Data Driven Approach to Query Expansion in Question Answering
Leon Derczynski, Robert Gaizauskas, Mark Greenwood and Jun Wang
Natural Language Processing Group, University of Sheffield

Presentation transcript:

A Data Driven Approach to Query Expansion in Question Answering Leon Derczynski, Robert Gaizauskas, Mark Greenwood and Jun Wang Natural Language Processing Group Department of Computer Science University of Sheffield, UK

Summary
- Introduce a system for QA
- Find that its IR component limits system performance
- Explore alternative IR components
- Identify which questions cause IR to stumble
- Using answer lists, find extension words that make these questions easier
- Show how knowledge of these words can rapidly accelerate the development of query expansion methods
- Show why one simple relevance feedback technique cannot improve IR for QA

How we do QA
The question answering system follows a linear procedure to get from question to answers:
- Pre-processing
- Text retrieval
- Answer extraction
Performance at each stage affects later results.

Measuring QA Performance
Overall metrics:
- Coverage
- Redundancy
TREC provides answers:
- Regular expressions for matching text
- IDs of documents deemed helpful
Ways of assessing correctness:
- Lenient: the document text contains an answer
- Strict: additionally, the document ID is listed by TREC
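
As a concrete illustration of the two metrics, here is a minimal Python sketch (not from the original system) that computes coverage and redundancy from per-question counts of retrieved documents judged relevant; the input dictionary and its construction are assumptions for illustration.

```python
def coverage_and_redundancy(relevant_retrieved):
    """Compute coverage and redundancy over a question set.

    relevant_retrieved: dict mapping each question ID to the number of retrieved
    documents judged relevant (strict: the document ID is in TREC's list;
    lenient: the document text matches an answer pattern).
    """
    n_questions = len(relevant_retrieved)
    # Coverage: fraction of questions with at least one relevant document retrieved.
    coverage = sum(1 for hits in relevant_retrieved.values() if hits > 0) / n_questions
    # Redundancy: mean number of relevant documents retrieved per question.
    redundancy = sum(relevant_retrieved.values()) / n_questions
    return coverage, redundancy


# Example: three questions, one of which no retrieved document answers.
print(coverage_and_redundancy({"q1": 2, "q2": 0, "q3": 1}))  # (0.666..., 1.0)
```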

Assessing IR Performance
- Low initial system performance
- Analysed each component in the system
- Question pre-processing was correct
- Coverage and redundancy checked in the IR part

IR component issues
- Only 65% of questions generate any text to be prepared for answer extraction
- IR failings cap the performance of the entire system
- Need to balance the amount of information retrieved for AE: retrieving more text boosts coverage, but also introduces excess noise

Initial performance
Lucene statistics, using strict matching, at paragraph level:

Question year    Coverage    Redundancy
2006             56.8%       1.18

Potential performance inhibitors
IR engine:
- Is Lucene causing problems?
- Profile some alternative engines
Difficult questions:
- Identify which questions cause problems
- Examine these: what common factors do they share? How can they be made approachable?

Information Retrieval Engines
- AnswerFinder uses a modular framework, including an IR plugin for Lucene
- Indri and Terrier are two public domain IR engines, both of which have been adapted to perform TREC tasks
  - Indri: based on the Lemur toolkit and INQUERY engine
  - Terrier: developed in Glasgow for dealing with terabyte corpora
- Plugins are created for Indri and Terrier, which are then used as replacement IR components
- Automated testing of overall QA performance is done using multiple IR engines
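
The slide describes swapping IR engines behind a common plugin layer. The sketch below is a hypothetical minimal version of such an interface and a comparison harness, not AnswerFinder's actual API: RetrieverPlugin, compare_engines and is_relevant are illustrative names, and coverage_and_redundancy refers to the earlier sketch.

```python
from abc import ABC, abstractmethod


class RetrieverPlugin(ABC):
    """Hypothetical minimal IR plugin interface; the real AnswerFinder API may differ."""

    @abstractmethod
    def index(self, documents):
        """Build an index over (doc_id, text) pairs."""

    @abstractmethod
    def retrieve(self, query, n=20):
        """Return the top-n (doc_id, passage) pairs for the query."""


def compare_engines(engines, questions, is_relevant):
    """Run the same question set through each engine (e.g. Lucene, Indri and Terrier
    wrappers implementing RetrieverPlugin) and collect per-question relevant counts.
    Feeding each result into coverage_and_redundancy (earlier sketch) gives
    per-engine coverage and redundancy figures."""
    results = {}
    for name, engine in engines.items():
        per_question = {}
        for q in questions:  # each question assumed to be {"qid": ..., "text": ...}
            retrieved = engine.retrieve(q["text"], n=20)
            per_question[q["qid"]] = sum(1 for doc_id, passage in retrieved
                                         if is_relevant(q["qid"], doc_id, passage))
        results[name] = per_question
    return results
```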

IR Engine performance

Engine     Coverage    Redundancy
Indri      55.2%       1.15
Lucene     56.8%       1.18
Terrier    49.3%       1.00

With n=20; strict retrieval; TREC 2006 question set; paragraph-level texts.
- Performance does not seem to vary significantly between engines
- Tweaking a non-QA-specific IR engine is probably not a promising avenue for performance increases

Identification of difficult questions
- Coverage of 56.8% indicates that for over 40% of questions, no useful documents are found
- Some questions are difficult for all engines
- How to define a "difficult" question?
  - Calculate the average redundancy (over multiple engines) for each question in a set
  - Questions with average redundancy below a certain threshold are deemed difficult
  - A threshold of zero is usually enough to find a sizeable dataset
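
A small sketch of this selection step, under the assumption that per-question redundancy scores are already available for each engine; the function and argument names are illustrative, not from the original system.

```python
def difficult_questions(redundancy_by_engine, threshold=0.0):
    """Questions whose average redundancy across engines is at or below the threshold.

    redundancy_by_engine: dict mapping engine name -> {question ID: redundancy}.
    With the default threshold of zero, a question counts as difficult when no
    engine retrieved any answer-bearing text for it.
    """
    engines = list(redundancy_by_engine.values())
    question_ids = engines[0].keys()
    difficult = set()
    for qid in question_ids:
        average = sum(scores[qid] for scores in engines) / len(engines)
        if average <= threshold:
            difficult.add(qid)
    return difficult
```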

Examining the answer data
- TREC answer data provides hints about which documents an IR engine ideal for QA should retrieve:
  - Lists of helpful documents
  - Regular expressions of answers
- Some questions are marked by TREC as having no answer; these are excluded from the difficult question set

Making questions accessible
- Given the answer-bearing documents and answer text, it is easy to extract words from answer-bearing paragraphs
- For example, where the answer is "baby monitor": "The inventor of the baby monitor found this device almost accidentally"
- These surrounding words may improve coverage when used as query extensions
- How can we find out which extension words are most helpful?

Rebuilding the question set
- Only use answerable difficult questions
- For each question:
  - Add the original question to the question set as a control
  - Find target paragraphs in "correct" texts
  - Build a list of all words in those paragraphs, except answers, stop words, and question words
  - For each word, create a sub-question consisting of the original question extended by that word

Rebuilding the question set
Example:
- Single factoid question: Q + E
  "How tall is the Eiffel tower?" + height
- Question in a series: Q + T + E
  "Where did he play in college?" + Warren Moon + NFL
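
A sketch of how this extended question set might be built from answer-bearing paragraphs, assuming a stop list and simple alphabetic tokenisation (neither is specified in the slides); all helper names and the placeholder stop list are hypothetical.

```python
import re

STOP_WORDS = {"the", "a", "of", "in", "did", "he", "where", "is", "how"}  # placeholder stop list


def tokens(text):
    """Lower-cased alphabetic tokens (a simplifying assumption)."""
    return {t.lower() for t in re.findall(r"[A-Za-z]+", text)}


def candidate_extensions(paragraph, answers, question):
    """Words from an answer-bearing paragraph, minus answers, stop words and question words."""
    excluded = set(STOP_WORDS) | tokens(question)
    for answer in answers:
        excluded |= tokens(answer)
    return tokens(paragraph) - excluded


def build_subquestions(question, target, paragraphs, answers):
    """The original question as a control, plus one sub-question per candidate word."""
    base = f"{question} {target}".strip()   # Q + T for questions in a series, Q otherwise
    subquestions = [question]               # control
    for paragraph in paragraphs:
        for word in sorted(candidate_extensions(paragraph, answers, question)):
            subquestions.append(f"{base} {word}")   # Q (+ T) + E
    return subquestions


# Example from the slide: a factoid question ends up extended with "height".
print(build_subquestions("How tall is the Eiffel tower?", "",
                         ["The tower reaches a height of 324 metres."],
                         ["324 metres"]))
```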

Do data-driven extensions help?
- Base performance is at or below the difficult question threshold (typically zero)
- Any extension that brings performance above zero is deemed a "helpful word"
- From the set of difficult questions, 75% were made approachable by using a data-driven extension
- If we can add these terms to questions accurately, the cap on answer extraction performance is raised
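
A sketch of how candidate extensions could be classified as helpful words, assuming a retrieve function and a redundancy_of judgement function exist elsewhere; both names are hypothetical.

```python
def helpful_words(question, candidates, retrieve, redundancy_of, threshold=0.0):
    """Candidate extensions whose addition lifts redundancy above the threshold.

    retrieve(query) returns ranked passages; redundancy_of(question, passages)
    counts the answer-bearing passages among them. Any word whose extended query
    scores above the difficult-question threshold is recorded as helpful,
    together with the redundancy it achieves.
    """
    helpful = {}
    for word in candidates:
        passages = retrieve(f"{question} {word}")
        score = redundancy_of(question, passages)
        if score > threshold:
            helpful[word] = score
    return helpful
```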

Do data-driven extensions help?
Question: Where did he play in college?
Target: Warren Moon
Base redundancy is zero.
Extensions:
  Football    Redundancy: 1
  NFL         Redundancy: 2.5
Adding some generic related words improves performance.

Do data-driven extensions help?
Question: Who was the nominal leader after the overthrow?
Target: Pakistani government overthrown in 1999
Base redundancy is zero.
Extensions:
  Islamabad    Redundancy: 2.5
  Pakistan     Redundancy: 4
  Kashmir      Redundancy: 4
Location-based words can raise redundancy.

Do data-driven extensions help?
Question: Who have commanded the division?
Target: 82nd Airborne Division
Base redundancy is zero; the question expects a list of answers.
Extensions:
  Col          Redundancy: 2
  Gen          Redundancy: 3
  officer      Redundancy: 1
  decimated    Redundancy: 1
The proper names for ranks help; this can be hinted at by "Who". Events related to the target may suggest words. Possibly not a victorious unit!

Observations on helpful words
- Inclusion of pertainyms has a positive effect on performance, agreeing with more general observations in Greenwood (2004)
- Army ranks stood out strongly, suggesting the use of an always-include list
- Some related words help, though there is often no deterministic relationship between them and the questions

Measuring automated expansion
- Known helpful words are also the target set of words that any expansion method should aim for
- Once the target expansions are known, measuring automated expansion becomes easier:
  - No need to perform IR for every candidate expanded query (some runs over AQUAINT took up to 14 hours on a 4-core 2.3 GHz system)
  - Rapid evaluation permits faster development of expansion techniques

Relevance feedback in QA
- Simple RF works by using features of an initial retrieval to alter a query
- We picked the highest frequency words in the initially retrieved texts (IRT) and used them to expand a query
- The size of the IRT set is denoted r
- Previous work (Monz 2003) looked at relevance feedback using a small range of values for r
- Different sizes of initial retrieval are used here, between r=5 and r=50
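
A minimal sketch of TF-based relevance feedback as described above: take the top-r initially retrieved texts, count term frequencies, and append the most frequent non-stopword terms to the query. The tokenisation, the number of feedback terms k, and the stop list are assumptions for illustration, not details given in the slides.

```python
from collections import Counter
import re


def tf_feedback_terms(initially_retrieved_texts, r=5, k=5, stop_words=frozenset()):
    """Highest-frequency non-stopword terms in the top-r initially retrieved texts."""
    counts = Counter()
    for text in initially_retrieved_texts[:r]:
        counts.update(t.lower() for t in re.findall(r"[A-Za-z]+", text)
                      if t.lower() not in stop_words)
    return [term for term, _ in counts.most_common(k)]


def expand_query(question, initially_retrieved_texts, r=5, k=5, stop_words=frozenset()):
    """Append the selected feedback terms to the original query."""
    terms = tf_feedback_terms(initially_retrieved_texts, r=r, k=k, stop_words=stop_words)
    return " ".join([question] + terms)
```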

Rapidly evaluating RF
Three metrics show how a query expansion technique performs:
- Percentage of all helpful words found in IRT: the intersection between words in initially retrieved texts and the helpful words
- Percentage of texts containing helpful words: if this is low, the IR system does not retrieve many documents containing helpful words, given the initial query
- Percentage of expansion terms that are helpful: the key statistic; the higher this is, the better performance is likely to be
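
A sketch of how these three diagnostics could be computed for a single question, assuming whitespace tokenisation; the function and argument names are illustrative.

```python
def rf_diagnostics(irt_texts, proposed_expansions, known_helpful_words):
    """Return the three diagnostics from the slide as fractions in [0, 1]."""
    text_vocabs = [{t.lower() for t in text.split()} for text in irt_texts]
    irt_vocab = set().union(*text_vocabs) if text_vocabs else set()
    helpful = {w.lower() for w in known_helpful_words}
    proposed = {w.lower() for w in proposed_expansions}

    # 1. Share of the known helpful words that appear anywhere in the IRT.
    helpful_found_in_irt = len(helpful & irt_vocab) / len(helpful) if helpful else 0.0
    # 2. Share of initially retrieved texts containing at least one helpful word.
    texts_with_helpful = (sum(1 for vocab in text_vocabs if vocab & helpful) / len(text_vocabs)
                          if text_vocabs else 0.0)
    # 3. Share of the proposed expansion terms that are actually helpful.
    proposed_that_help = len(proposed & helpful) / len(proposed) if proposed else 0.0
    return helpful_found_in_irt, texts_with_helpful, proposed_that_help
```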

Relevance feedback predictions
RF selects some words to be added to a query, based on an initial search.
- Less than 35% of the documents used in relevance feedback actually contain helpful words
- Picking helpful words out of initial retrievals is not easy when there is so much noise
- Due to the small probability of adding helpful words, relevance feedback is unlikely to make difficult questions accessible
- Adding noise to the query will drown out otherwise helpful documents for non-difficult questions

Helpful words found in IRT        4.2%     18.6%    8.9%
IRT containing helpful words      10.0%    33.3%    34.3%
RF words that are "helpful"       1.25%    1.67%    5.71%

Relevance feedback results
- Only 1.25% to 5.71% of the words that relevance feedback chose were actually helpful; the rest only add noise
- Performance using TF-based relevance feedback is consistently lower than the baseline
- The hypothesis of poor performance is supported

Coverage at n docs    r=5     r=50     Baseline
...                   ...     28.4%    43.4%
...                   ...     39.8%    55.3%

Conclusions
- IR engine performance for QA does not vary wildly
- Identifying helpful words provides a tool for assessing query expansion methods
- TF-based relevance feedback cannot be generally effective in IR for QA
- Linguistic relationships exist that can help in query expansion

Any questions?