An Analysis of the AskMSR Question-Answering System
Eric Brill, Susan Dumais, and Michelle Banko (Microsoft Research)
From Proceedings of the EMNLP Conference, 2002
Goals
– Evaluate the contributions of the system components
– Explore strategies for predicting when answers are incorrect
AskMSR – What Sets It Apart
– Dependency on data redundancy
– No sophisticated linguistic analyses of questions or of answers
TREC Question Answering Track
– Fact-based, short-answer questions
– “How many calories are there in a Big Mac?” (562, in case you’re wondering)
– “Who killed Abraham Lincoln?”
– “How tall is Mount Everest?”
– Motivation for much of the recent work in QA
Other Approaches
– POS tagging
– Parsing
– Named entity extraction
– Semantic relations
– Dictionaries
– WordNet
AskMSR Approach
– The Web as a “gigantic data repository”
– Different from other systems using the web: simplicity & efficiency
– No complex parsing, no entity extraction (for queries or for best-matching web pages), no local caching
– Claim: the techniques used in this approach to short-answer tasks are more broadly applicable
Some QA Difficulties
– A single, small information source: likely only one answer formulation exists
– A source with a small number of answer formulations forces the system to handle complex relations between question and answer
– Lexical, syntactic, and semantic relations; anaphora, synonymy, alternate syntactic formulations, and indirect answers make this difficult
Answer Redundancy
– The greater the answer redundancy in the source:
– the more likely a simple relation between question and answer exists
– the less likely the system needs to deal with the difficulties facing NLP systems
System Architecture
Query Reformulation
– Rewrite the question as weighted substrings of a declarative answer
– “when was the paper clip invented?” → “the paper clip was invented”
– Also produce less precise rewrites with a greater chance of matching
– Backoff: simple ANDing of the non-stopwords
Query Reformulation (cont.)
– String-based manipulations only: no parser, no POS tagging
– Small lexicon of possible POS tags and morphological variants
– Rewrite rules created by hand; associated weights chosen by hand
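As a rough illustration of this string-based rewriting, here is a minimal Python sketch. The rewrite patterns, templates, and weights below are invented stand-ins; the actual AskMSR rules and weights were handwritten by the authors and are not reproduced here.

```python
import re

# Hypothetical rewrite rules: (question pattern, declarative template, weight).
# The real AskMSR rules and weights were handwritten; these are stand-ins.
REWRITE_RULES = [
    (r"^when was (?P<x>.+) invented$", '"{x} was invented"', 5),
    (r"^when was (?P<x>.+) invented$", '"invented {x}"', 2),
    (r"^who (?P<v>\w+) (?P<x>.+)$", '"{x} was {v} by"', 3),
]

STOP_WORDS = {"the", "a", "an", "of", "in", "is", "was", "who", "when", "how"}

def reformulate(question):
    """Return (query string, weight) pairs, most precise rewrites first,
    ending with the backoff AND query over non-stopwords."""
    q = question.lower().strip().rstrip("?").strip()
    queries = []
    for pattern, template, weight in REWRITE_RULES:
        m = re.match(pattern, q)
        if m:
            queries.append((template.format(**m.groupdict()), weight))
    content = [w for w in re.findall(r"\w+", q) if w not in STOP_WORDS]
    queries.append((" AND ".join(content), 1))
    return queries

print(reformulate("When was the paper clip invented?"))
# [('"the paper clip was invented"', 5), ('"invented the paper clip"', 2),
#  ('paper AND clip AND invented', 1)]
```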
N-gram Mining
– Send each rewrite to a search engine
– Collect and analyze the returned page summaries
– Why use summaries? Efficiency, and they contain the search terms plus some context
– N-grams are collected from the summaries
N-gram Mining (cont.)
– Extract 1-, 2-, and 3-grams from each summary, scored by the weight of the rewrite that retrieved it
– Sum scores across all summaries containing the n-gram; frequency within a summary is ignored
– An n-gram’s final score thus reflects the weights of the associated rewrite rules and the number of unique summaries it appears in
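A minimal sketch of this scoring scheme: each n-gram is credited once per unique summary it appears in, weighted by the rewrite that retrieved that summary. The tokenization and the example summaries are assumptions for illustration.

```python
from collections import defaultdict

def mine_ngrams(summaries):
    """Score candidate n-grams from (summary_text, rewrite_weight) pairs.

    An n-gram's score is the sum of the rewrite weights of the unique
    summaries it appears in; frequency within a single summary is ignored.
    """
    scores = defaultdict(float)
    for text, weight in summaries:
        tokens = text.lower().split()
        seen = set()
        for n in (1, 2, 3):
            for i in range(len(tokens) - n + 1):
                seen.add(" ".join(tokens[i:i + n]))
        for gram in seen:          # count each summary at most once per n-gram
            scores[gram] += weight
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Hypothetical summaries returned for rewrites of "Who killed Abraham Lincoln?"
summaries = [
    ("Abraham Lincoln was shot by John Wilkes Booth in 1865", 5),
    ("John Wilkes Booth killed President Lincoln at Ford's Theatre", 2),
]
print(mine_ngrams(summaries)[:5])
```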
N-gram Filtering
– Use handwritten filter rules
– Assign a question type (e.g. who, what, how)
– Choose the set of filters based on the question type
– Rescore n-grams based on the presence of features relevant to those filters
N-gram Filtering (cont.)
– 15 simple filters, based on human knowledge of question types and answer domains
– Surface string features: capitalization, digits, handcrafted regular expression patterns
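A toy sketch of type-driven rescoring, with two invented filters standing in for the paper's 15 handwritten ones; the boost and penalty factors are arbitrary.

```python
import re

# Illustrative filters only: simplified stand-ins for the handwritten rules.
FILTERS = {
    "how many": lambda g: bool(re.search(r"\d", g)),   # prefer candidates with digits
    "who":      lambda g: g[:1].isupper(),             # prefer capitalized names
}

def apply_filter(question, scored_ngrams):
    """Boost n-grams that match the filter for the question type, demote the rest."""
    qtype = next((t for t in FILTERS if question.lower().startswith(t)), None)
    if qtype is None:
        return scored_ngrams
    keep = FILTERS[qtype]
    rescored = [(g, s * (2.0 if keep(g) else 0.5)) for g, s in scored_ngrams]
    return sorted(rescored, key=lambda kv: kv[1], reverse=True)

candidates = [("sled dogs", 8.0), ("16", 5.0), ("Iditarod", 7.0)]
print(apply_filter("How many dogs pull a sled in the Iditarod?", candidates))
# [('16', 10.0), ('sled dogs', 4.0), ('Iditarod', 3.5)]
```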
N-gram Tiling
– Merge similar answers; create longer answers from overlapping smaller answer fragments
– “A B C” + “B C D” → “A B C D”
– Greedy algorithm: start with the top-scoring n-gram and check lower-scoring n-grams for tiling potential
– If they can be tiled, replace the higher-scoring n-gram with the tiled n-gram and remove the lower-scoring one
– Stop when nothing more can be tiled
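A sketch of this greedy tiling loop, assuming word-level overlap detection; how the merged n-gram is rescored is not specified on the slide, so summing the two scores here is an assumption.

```python
def tile(a, b):
    """Merge b into a if their word sequences overlap at a boundary; else None."""
    wa, wb = a.split(), b.split()
    for k in range(min(len(wa), len(wb)), 0, -1):
        if wa[-k:] == wb[:k]:
            return " ".join(wa + wb[k:])      # "A B C" + "B C D" -> "A B C D"
        if wb[-k:] == wa[:k]:
            return " ".join(wb + wa[k:])
    return None

def tile_answers(scored):
    """Greedy tiling: repeatedly merge lower-scoring n-grams into the best one."""
    scored = sorted(scored, key=lambda kv: kv[1], reverse=True)
    changed = True
    while changed:
        changed = False
        best, best_score = scored[0]
        for gram, score in scored[1:]:
            merged = tile(best, gram)
            if merged:
                scored.remove((gram, score))
                scored[0] = (merged, best_score + score)   # assumed rescoring
                changed = True
                break
    return scored

print(tile_answers([("San Francisco", 6.0), ("San", 3.0), ("Francisco Bay", 2.0)]))
```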
Experiments
– First 500 TREC-9 queries
– Scored with the answer patterns provided by NIST, with some patterns modified to accommodate web answers not found in TREC
– More specific answers allowed (Edward J. Smith vs. Edward Smith)
– More general answers not allowed (Smith vs. Edward Smith)
– Simple substitutions allowed (9 months vs. nine months)
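The NIST answer keys are regular-expression patterns matched against the returned answer strings. Here is a minimal sketch of that kind of check; the patterns and the numeric/word equivalence below are illustrative assumptions, not the actual TREC-9 keys.

```python
import re

# Illustrative answer-key patterns in the spirit of the NIST regexes.
ANSWER_PATTERNS = {
    "Who killed Abraham Lincoln?": [r"\bJohn Wilkes Booth\b"],
    "How long is human gestation?": [r"\b(9|nine) months\b"],  # simple substitution
}

def is_correct(question, answer):
    """Return True if the answer string matches any pattern for the question."""
    return any(re.search(p, answer, re.IGNORECASE)
               for p in ANSWER_PATTERNS.get(question, []))

print(is_correct("Who killed Abraham Lincoln?", "shot by John Wilkes Booth in 1865"))
print(is_correct("How long is human gestation?", "about nine months"))
```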
Experiments (cont.)
– Some answers differ over time between the Web and TREC, e.g. “Who is the president of Bolivia?”
– The answer key was NOT modified for these, since that would make comparison with earlier TREC results impossible (or merely more difficult?)
– These changes influence absolute scores but not relative performance
Experiments (cont.)
– Fully automatic runs: start with the queries, generate a ranked list of 5 answers
– Google is used as the search engine; its query-relevant summaries make n-gram mining efficient
– Answers are at most 50 bytes long, and typically much shorter
“Basic” System Performance
– A somewhat backwards notion of “basic”: the current system with all modules implemented, run at default settings
– Reported measures: Mean Reciprocal Rank (MRR), the % of questions answered correctly, and average answer length (12 bytes)
– Impossible to compare precisely with the TREC-9 groups, but still very good performance
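For reference, MRR scores each question by the reciprocal rank of its first correct answer in the ranked list (0 if none of the five candidates is correct) and averages over questions. A minimal sketch, with a toy correctness check:

```python
def mean_reciprocal_rank(runs, is_correct):
    """runs: {question: [answer1, answer2, ...]} ranked best-first.
    Each question contributes 1/rank of its first correct answer, else 0."""
    total = 0.0
    for question, answers in runs.items():
        for rank, answer in enumerate(answers, start=1):
            if is_correct(question, answer):
                total += 1.0 / rank
                break
    return total / len(runs)

# Toy example: the first question is answered at rank 2, the second at rank 1.
runs = {
    "Who killed Abraham Lincoln?": ["Ford's Theatre", "John Wilkes Booth"],
    "How tall is Mount Everest?": ["29,035 feet", "8,850 meters"],
}
keys = {"Who killed Abraham Lincoln?": "Booth", "How tall is Mount Everest?": "29,035"}
print(mean_reciprocal_rank(runs, lambda q, a: keys[q] in a))  # (1/2 + 1) / 2 = 0.75
```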
Component Contributions
Query Rewrite Contribution
– More precise queries get higher weights
– All rewrites weighted equally: MRR drops 3.6%
– Only the backoff AND query: MRR drops 11.2%
– Rewrites capitalize on web redundancy; more specific regular-expression matching could be used
N-gram Filtering Contribution
– 1-, 2-, and 3-grams are collected from the 100 best-matching summaries and filtered by question type
– “How many dogs pull a sled in the Iditarod?” prefers a number, so candidates like “Run”, “Alaskan”, “dog racing”, and “many mush” are ranked lower than “pool of 16 dogs”, which contains the correct answer
– No filtering: MRR drops 17.9%
N-gram Tiling Contribution
– Substrings no longer take up separate answer slots (e.g. “San”, “Francisco”, “San Francisco” collapse to one candidate)
– Longer answers can be built that could never be found with tri-grams alone, e.g. “light amplification by [stimulated] emission of radiation”
– No tiling: MRR drops 14.2%
Component Combinations
– Using only the weighted sum of occurrences of 1-, 2-, and 3-grams: MRR drops 47.5%
– A simple statistical system, with no linguistic knowledge or processing, only AND queries, no filtering, and (statistical) tiling only: MRR drops 33%, to 0.338
Component Combinations (cont.)
– Is the statistical system’s performance reasonable on an absolute scale? One TREC-9 50-byte run performed better
– All components contribute to accuracy
– The precise weights of the rewrites are unimportant
– N-gram tiling acts as a “poor man’s named-entity recognizer”
– The biggest contribution comes from filtering/answer selection
Component Combinations (cont.)
– Claim: “Because of the effectiveness of our tiling algorithm…we do not need to use any named entity recognition components.”
– But by having filters that use capitalization information (section 2.3, 2nd paragraph), aren’t they doing some NE recognition?
Component Problems
Component Problems (cont.)
Error analysis for questions with no correct answer in the top 5 hypotheses:
– 23% of errors: not knowing units, e.g. “How fast can Bill’s Corvette go?” (mph or km/h)
– 34% (Time, Correct): time problems, or the answer is not in the TREC-9 answer key
– 16%: shortcomings in n-gram tiling
– 5%: number retrieval, a query limitation
Component Problems (cont.)
– 12%: beyond the current system paradigm, i.e. cannot be fixed with minor enhancements
– Is this really so, or have they been easy on themselves in attributing errors?
– 9%: no discussion given
Knowing When…
– There is some cost to answering incorrectly, so the system can choose not to answer rather than give an incorrect answer
– This requires estimating how likely a hypothesis is to be correct
– TREC makes no distinction between a wrong answer and no answer
– A deployed real system faces a trade-off between precision and recall
Knowing When… (cont.)
– The answer score is an ad hoc combination of hand-tuned weights
– Is it possible to induce a useful precision-recall (ROC) curve when answer scores are not meaningful probabilities?
– What is an ROC (Receiver Operating Characteristic) curve?
ROC – figure from Hinrich Schütze (co-author of Foundations of Statistical Natural Language Processing)
ROC (cont.)
Determining Likelihood
– Ideal: determine the likelihood of a correct answer from the question alone; then unpromising questions could simply be skipped
– Use a decision tree built on features of the question string:
– unigrams and bigrams, question type, sentence length, longest word length
– # of capitalized words, # of stop words, ratio of stop words to non-stop words
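A sketch of extracting the question-string features listed above; the exact definitions (tokenization, the stop-word list, how the n-gram features are encoded) are assumptions.

```python
STOP_WORDS = {"the", "a", "an", "of", "in", "is", "was", "to", "who", "what", "when", "how"}

def question_features(question):
    """Question-string features in the spirit of those listed on the slide."""
    tokens = question.rstrip("?").split()
    lowered = [t.lower() for t in tokens]
    n_stop = sum(t in STOP_WORDS for t in lowered)
    n_content = len(tokens) - n_stop
    feats = {
        "qtype": lowered[0] if lowered else "",
        "length": len(tokens),
        "longest_word": max((len(t) for t in tokens), default=0),
        "num_capitalized": sum(t[:1].isupper() for t in tokens),
        "num_stop": n_stop,
        "stop_ratio": n_stop / max(n_content, 1),
    }
    feats.update({f"unigram={w}": 1 for w in lowered})
    feats.update({f"bigram={a}_{b}": 1 for a, b in zip(lowered, lowered[1:])})
    return feats

print(question_features("Who killed Abraham Lincoln?"))
```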
Decision Tree/Diagnostic Tool
– The system performs worst on “how” questions and best on short “who” questions with many stop words
– An ROC curve can be induced from the decision tree: sort the leaf nodes from highest to lowest probability of being correct, then gain precision by declining to answer the questions with the highest probability of error
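A sketch of how such a curve can be induced: sort the leaves by empirical probability of correctness and sweep a cutoff, answering only the questions that land in leaves above the cutoff. The leaf statistics below are made up for illustration.

```python
def precision_recall_curve(leaves):
    """Induce an answer/no-answer trade-off from decision-tree leaves.

    `leaves` is a list of (num_correct, num_total) pairs, one per leaf.
    Answering only the questions that fall in the most reliable leaves
    trades recall for precision.
    """
    leaves = sorted(leaves, key=lambda ct: ct[0] / ct[1], reverse=True)
    total = sum(t for _, t in leaves)
    answered = correct = 0
    curve = []
    for c, t in leaves:
        answered += t
        correct += c
        curve.append((correct / answered,      # precision over questions answered
                      answered / total))       # fraction of questions attempted
    return curve

# Hypothetical leaf statistics (correct, total).
for precision, attempted in precision_recall_curve([(18, 20), (12, 20), (5, 20)]):
    print(f"attempt {attempted:.0%} of questions -> precision {precision:.2f}")
```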
Decision Tree–Query
Decision Tree–Query Results
– Decision tree trained on TREC-9, tested on TREC-10
– It overfits the training data: insufficient generalization
Decision Tree–Query Training
Decision Tree–Query Test
Answer Correctness/Score
– The ad hoc answer score is based on the # of retrieved passages the n-gram occurs in, the weight of the rewrite used to retrieve each passage, which filters apply, and the effects of n-gram tiling
– Measure the correlation between whether a correct answer appears in the top 5 output and…
Correct Answer In Top 5
– …and the score of the system’s first-ranked answer
– Correlation coefficient:
– Without time-sensitive questions:
– …and the score of the first-ranked answer minus that of the second
– Correlation coefficient: 0.270
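The reported numbers are correlation coefficients between the answer score and a binary correctness indicator; a minimal sketch of that computation (Pearson correlation) on made-up data; the scores and labels below are not from the paper.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical data: score of the #1 answer vs. whether a correct answer
# appeared anywhere in the top 5 (1 = yes, 0 = no).
scores  = [310.0, 45.0, 120.0, 15.0, 260.0, 30.0]
correct = [1,     0,    1,     0,    1,     0]
print(round(pearson(scores, correct), 3))
```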
Answer #1 Score – Train
Answer #1 Score – Test
Other Likelihood Indicators
– Snippets are gathered for each question, from AND queries and from the more refined exact-string-match rewrites
– MRR by snippet source:
– All snippets from AND queries:
– 11 to 100 snippets from non-AND rewrites:
– 100 to 400 snippets from non-AND rewrites:
– But wasn’t the MRR for the “base” system 0.507?
Another Decision Tree
– Uses the features of the first decision tree, plus:
– Score of the #1 answer
– State of the system during processing: total # of matching passages, # of non-AND matching passages, filters applied, weight of the best rewrite rule yielding matching passages, and others
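A brief sketch of assembling this expanded feature set; the dictionary keys and the shape of the run-time state are assumptions about how such information might be exposed by the system.

```python
def combined_features(question_feats, ranked_answers, state):
    """Features for the second decision tree: the question-string features
    used by the first tree plus signals from the system's processing state.
    `ranked_answers` is a list of (answer, score) pairs, best first."""
    feats = dict(question_feats)
    feats.update({
        "answer1_score": ranked_answers[0][1] if ranked_answers else 0.0,
        "num_matching_passages": state.get("num_passages", 0),
        "num_non_and_passages": state.get("num_non_and_passages", 0),
        "num_filters_applied": state.get("filters_applied", 0),
        "best_rewrite_weight": state.get("best_rewrite_weight", 0),
    })
    return feats

print(combined_features({"qtype": "who", "length": 4},
                        [("John Wilkes Booth", 310.0)],
                        {"num_passages": 45, "best_rewrite_weight": 5}))
```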
Decision Tree–All features
Decision Tree–All Train
Decision Tree–All Test
Decision Tree–All
– Gives a useful ROC curve on test data, but is outperformed by the Answer #1 Score alone
– Though outperformed by the simpler ad hoc technique, it remains useful as a diagnostic tool
Conclusions
– A novel approach to QA
– Careful analysis of the contributions of the major system components
– Analysis of the factors behind errors
– An approach for learning when the system is likely to answer incorrectly, allowing system designers to decide when to trade recall for precision
My Conclusions
– Claim: the techniques used in this approach to short-answer tasks are more broadly applicable
– Reality: “We are currently exploring whether these techniques can be extended beyond short answer QA to more complex cases of information access.”
My Conclusions (cont.)
– “…we do not need to use any named entity recognition components.” But filters with capitalization information amount to some NE recognition
– 12% of errors are “beyond the system paradigm”: I still wonder whether this is really so
– 9% of errors receive no discussion
– The ad hoc method outperforms the decision tree: did they merely do a good job of designing the system and assigning weights, or did they get lucky?