
1 Answering Definition Questions Using Multiple Knowledge Sources Wesley Hildebrandt, Boris Katz, and Jimmy Lin MIT Computer Science and Artificial Intelligence Laboratory 32 Vassar Street, Cambridge, MA 02139 {wes,boris,jimmylin}@csail.mit.edu

2 Abstract
Definition questions represent a largely unexplored area of question answering.
Multi-strategy approach:
- a database constructed offline with surface patterns
- a Web-based dictionary
- an off-the-shelf document retriever
Results are from:
- a component-level evaluation
- an end-to-end evaluation of the system at the TREC 2003 Question Answering Track

3 Answering Definition Questions
1: extract the target term (target)
2: look up the target in a database created offline from the AQUAINT corpus
3: look up the target in a Web dictionary, followed by answer projection
4: look up the target directly in the AQUAINT corpus with an IR engine (fallback when steps 2 and 3 yield nothing)
5: answers from steps 2-4 are merged to produce the final system output
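A minimal sketch of this control flow, with the individual components passed in as callables; the parameter names (extract_target, primary_lookups, fallback_lookup, merge) are placeholders for the modules described on the following slides, not the authors' actual code:

```python
from typing import Callable, List

def answer_definition_question(
    question: str,
    extract_target: Callable[[str], str],
    primary_lookups: List[Callable[[str], list]],   # database and dictionary lookup
    fallback_lookup: Callable[[str], list],          # plain IR over the AQUAINT corpus
    merge: Callable[[list], list],
) -> list:
    """Sketch of the multi-strategy pipeline; all callables are assumptions."""
    target = extract_target(question)                # step 1: target extraction

    answers = []
    for lookup in primary_lookups:                   # steps 2-3: database and dictionary
        answers.extend(lookup(target))

    if not answers:                                  # step 4: IR fallback only if nothing found
        answers = fallback_lookup(target)

    return merge(answers)                            # step 5: merge, deduplicate, rank
```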

4 Target Extraction
A simple pattern-based parser extracts the target term using regular expressions.
The extractor was tested on all definition questions from the TREC-9 and TREC-10 QA Track test sets and performed with 100% accuracy.
However, several targets from the TREC 2003 definition questions were not extracted correctly.
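An illustration of what such a pattern-based extractor might look like; the specific regular expressions below are assumptions for common definition-question forms, not the authors' actual patterns:

```python
import re
from typing import Optional

# Hypothetical patterns for common definition-question phrasings.
TARGET_PATTERNS = [
    re.compile(r"^(?:what|who)\s+(?:is|are|was|were)\s+(?:an?\s+|the\s+)?(.+?)\??$", re.I),
    re.compile(r"^define\s+(.+?)\??$", re.I),
    re.compile(r"^tell me (?:more )?about\s+(.+?)\??$", re.I),
]

def extract_target(question: str) -> Optional[str]:
    """Return the target term of a definition question, or None if no pattern matches."""
    question = question.strip()
    for pattern in TARGET_PATTERNS:
        match = pattern.match(question)
        if match:
            return match.group(1).strip()
    return None

# Example: extract_target("What is Bausch & Lomb?") -> "Bausch & Lomb"
```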

5 Database Lookup
Surface patterns for answer extraction:
- an effective strategy for factoid questions
- but often suffers from low recall
To boost recall:
- apply the set of surface patterns offline
- "precompile" from the AQUAINT corpus a list of nuggets about every entity
- construct an immense relational database containing nuggets distilled from every article in the corpus
At question time, the task then becomes a simple lookup on the relevant target term (see the sketch below).
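A minimal sketch of the offline precompilation step, assuming a pattern-matching function apply_surface_patterns that yields (target, nugget) pairs; SQLite stands in here for whatever relational database the authors actually used:

```python
import sqlite3
from typing import Callable, Iterable, List, Tuple

def precompile_nuggets(
    articles: Iterable[str],
    apply_surface_patterns: Callable[[str], Iterable[Tuple[str, str]]],
    db_path: str = "nuggets.db",
) -> None:
    """Offline pass: run the surface patterns over every article and store the results."""
    conn = sqlite3.connect(db_path)
    conn.execute("CREATE TABLE IF NOT EXISTS nuggets (target TEXT, nugget TEXT)")
    conn.execute("CREATE INDEX IF NOT EXISTS idx_target ON nuggets(target)")
    for article in articles:
        conn.executemany(
            "INSERT INTO nuggets VALUES (?, ?)",
            list(apply_surface_patterns(article)),
        )
    conn.commit()
    conn.close()

def database_lookup(target: str, db_path: str = "nuggets.db") -> List[str]:
    """At question time, answering reduces to a simple indexed lookup."""
    conn = sqlite3.connect(db_path)
    rows = conn.execute(
        "SELECT nugget FROM nuggets WHERE target = ? COLLATE NOCASE", (target,)
    ).fetchall()
    conn.close()
    return [nugget for (nugget,) in rows]
```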

6 Database Lookup
Surface patterns operated at both the word and the part-of-speech level.
- Rudimentary chunking, such as marking the boundaries of noun phrases, was performed by grouping words based on their part-of-speech tags.
Contextual information results in higher-quality answers.
The 11 surface patterns are listed in Table 1, with examples in Table 2 (a sketch of one such pattern follows).
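As an illustration only, here is one way a copular pattern such as e1_is might be expressed over POS-tagged text; the crude chunking and the pattern itself are simplifications, not the authors' implementation:

```python
from typing import List, Tuple

NOUN_TAGS = {"NN", "NNS", "NNP", "NNPS"}      # Penn Treebank noun tags
NP_TAGS = NOUN_TAGS | {"DT", "JJ", "CD"}      # crude noun-phrase material

def chunk_noun_phrase(tagged: List[Tuple[str, str]], start: int) -> Tuple[str, int]:
    """Greedily group adjacent NP-like tags into a chunk starting at `start`."""
    end = start
    while end < len(tagged) and tagged[end][1] in NP_TAGS:
        end += 1
    return " ".join(word for word, _ in tagged[start:end]), end

def e1_is_pattern(tagged: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Match '<NP1> is/are <NP2>' and emit (target, nugget) pairs."""
    pairs = []
    i = 0
    while i < len(tagged):
        np1, j = chunk_noun_phrase(tagged, i)
        if np1 and j < len(tagged) and tagged[j][0].lower() in {"is", "are", "was", "were"}:
            np2, k = chunk_noun_phrase(tagged, j + 1)
            if np2:
                pairs.append((np1, np2))
                i = k
                continue
        i = j + 1 if j > i else i + 1
    return pairs

# Example (pre-tagged input):
# e1_is_pattern([("Mold", "NN"), ("is", "VBZ"), ("a", "DT"), ("fungus", "NN")])
# -> [("Mold", "a fungus")]
```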

7 11 Surface Patterns

8 Surface Patterns with examples

9 Dictionary Lookup
TREC evaluations require (answer, document) pairs, so dictionary definitions must be projected back onto corpus documents.
- Answer projection techniques similar to (Brill, 2001)
Dictionary-lookup approach:
- keywords from the target term's dictionary definition, together with the target itself, are used as the query to Lucene
- the top 100 documents returned are tokenized into individual sentences, discarding sentences without the target term
- remaining sentences are scored by keyword overlap with the dictionary definition, weighted by the idf of each keyword
- non-zero-score sentences are retained and, if necessary, shortened to 100 characters centered around the target term
A sketch of the sentence-scoring step follows.
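A sketch of the idf-weighted overlap scoring described above, assuming idf values are available from the document collection; the tokenizer and the exact weighting scheme are assumptions:

```python
import re
from typing import Dict, List

def tokenize(text: str) -> List[str]:
    return re.findall(r"[a-z0-9&']+", text.lower())

def score_sentence(sentence: str, definition_keywords: List[str], idf: Dict[str, float]) -> float:
    """Sum the idf of definition keywords that also appear in the sentence."""
    sentence_terms = set(tokenize(sentence))
    return sum(idf.get(kw, 0.0) for kw in set(definition_keywords) if kw in sentence_terms)

def project_definition(sentences: List[str], definition: str, target: str,
                       idf: Dict[str, float], window: int = 100) -> List[str]:
    """Keep target-bearing, non-zero-score sentences, trimmed to ~`window` chars around the target."""
    keywords = tokenize(definition)
    kept = []
    for sent in sentences:
        pos = sent.lower().find(target.lower())
        if pos < 0:
            continue                            # discard sentences without the target term
        score = score_sentence(sent, keywords, idf)
        if score <= 0.0:
            continue                            # keep only non-zero-score sentences
        if len(sent) > window:                  # center a 100-character window on the target
            start = max(0, pos + len(target) // 2 - window // 2)
            sent = sent[start:start + window]
        kept.append(sent)
    return kept
```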

10 Document Lookup
This approach is adopted only if no answers were found by the previous two techniques (database and dictionary lookup).
It uses traditional IR techniques to retrieve candidate answers directly from the AQUAINT corpus.

11 Answer Merging
Redundancy removal:
- this problem is especially severe since nuggets for every entity in the entire AQUAINT corpus were precompiled
- simple heuristic: if two responses share more than 60% of their keywords, one of them is randomly discarded (see the sketch below)
Expected accuracy:
- all responses are ordered by the expected accuracy (EA) of the technique that extracted the nugget
Number of responses to be returned:
- given n total responses, the number to return is computed as a simple function of n
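A sketch of the 60% keyword-overlap heuristic; the overlap measure (share of the smaller response's keyword set) is an assumption, since the slide does not specify how the percentage is computed:

```python
import random
import re
from typing import List

def keywords(text: str) -> set:
    return set(re.findall(r"[a-z0-9&']+", text.lower()))

def overlap_fraction(a: str, b: str) -> float:
    """Fraction of shared keywords relative to the smaller keyword set (an assumption)."""
    ka, kb = keywords(a), keywords(b)
    if not ka or not kb:
        return 0.0
    return len(ka & kb) / min(len(ka), len(kb))

def remove_redundant(responses: List[str], threshold: float = 0.6) -> List[str]:
    """If two responses share more than `threshold` of their keywords, drop one at random."""
    kept: List[str] = []
    for resp in responses:
        duplicate_of = next((i for i, existing in enumerate(kept)
                             if overlap_fraction(resp, existing) > threshold), None)
        if duplicate_of is None:
            kept.append(resp)
        elif random.random() < 0.5:
            # randomly keep either the previously seen response or the new one
            kept[duplicate_of] = resp
    return kept
```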

12 Component Evaluation
160 definition questions from the TREC-9 and TREC-10 QA Track test sets:
- Database lookup: 8 nuggets per question at an accuracy of 38.37%
- Dictionary lookup: 1.5 nuggets per question at an accuracy of 45.23%
Recall of the techniques is extremely hard to measure directly.
The results represent a baseline for the performance of each technique.
The focus was not on perfecting each individual pattern and the dictionary-matching algorithm, but on building a complete working system.

13 Component Evaluation

14 TREC 2003 Results
The system performed well, ranking 8th out of the 25 groups that participated in the TREC 2003 QA Track.
Official results for the definition subtask are shown in Table 4.
The formula used to calculate the F-measure is given in Figure 1.

15 TREC 2003 Results

16 BBN (Xu et al., 2003)
- the best run, with an F-measure of 0.555
- used many of the same techniques described here
- one important exception: they did not precompile nuggets into a database
- they also cited recall as a major cause of poor performance
- their IR baseline achieved an F-measure of 0.493, which beat all other runs
Because the F-measure heavily favored recall over precision, simple IR techniques worked extremely well.

17 TREC 2003 Results

18 Evaluation Reconsidered
- The Scoring Metric
- Variations in Judgment

19 The Scoring Metric
Nugget recall is computed as r/R, where r is the number of vital nuggets returned and R is the total number of vital nuggets.
- A system that returned every non-vital nugget but no vital nuggets would receive a score of zero.
The distinction between vital and non-vital nuggets is itself somewhat arbitrary.
- For example, "What is Bausch & Lomb?"
  - world's largest eye care company -> vital
  - about 12,000 employees -> vital
  - in 50 countries -> vital
  - approx. $1.8 billion annual revenue -> vital
  - based in Rochester, New York -> non-vital

20 The Scoring Metric
What is the proper value of β?
- If β=1, the difference in performance between our system and that of the top system is virtually indistinguishable.
- The advantages of surface patterns, linguistic processing, answer fusion, and other techniques become more obvious if the F-measure is not as heavily biased towards recall.
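For reference, a reconstruction of the nugget-based F-measure used in the TREC 2003 definition subtask; this follows the standard TREC formulation (with β set to 5 in the official evaluation) and is believed to correspond to the Figure 1 mentioned earlier:

```latex
\begin{align*}
\text{nugget recall:}\quad & NR = \frac{r}{R} \\
\text{length allowance:}\quad & \alpha = 100 \times (r + a) \\
\text{nugget precision:}\quad & NP =
  \begin{cases}
    1 & \text{if } \mathit{len} < \alpha \\[4pt]
    1 - \dfrac{\mathit{len} - \alpha}{\mathit{len}} & \text{otherwise}
  \end{cases} \\
\text{F-measure:}\quad & F_\beta = \frac{(\beta^2 + 1)\, NP \cdot NR}{\beta^2\, NP + NR}
\end{align*}
```

Here r is the number of vital nuggets returned, R the total number of vital nuggets, a the number of acceptable (non-vital) nuggets returned, and len the total character length of the response. With β=5, recall dominates the score, which is the bias discussed above.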

21 Variations in Judgment
Humans naturally have differing opinions.
These differences of opinion are not mistakes, but legitimate variations in what assessors consider acceptable.
Different assessors may judge nuggets differently, contributing to detectable variations in score.

22 Variations in Judgment
The assessors' nugget list should satisfy:
- Atomicity: each nugget should ideally represent an atomic concept
- Uniqueness: nuggets should be unique, not only in their text but also in their meaning
- Completeness: in practice, many relevant items of information returned did not make it onto the assessors' nugget list (even as non-vital nuggets)
Evaluating answers to definition questions is a challenging task. Consistent, repeatable, and meaningful scoring guidelines are critical to the field.

23 Examples of Atomicity & Uniqueness
Atomicity:
- "Harlem civil rights leader": provides one fact
- Alexander Pope is an "English poet": two separate facts
Uniqueness

24 Future Work
A robust named-entity extractor for:
- Target extraction: a key, non-trivial capability critical to the success of the system
- Database lookup: works only if the relevant target terms are identified and indexed while preprocessing the corpus
- For example, the extractor should be able to identify specialized names (e.g., "Bausch & Lomb", "Destiny's Child", "Akbar the Great")

25 Future Work
More accurate surface patterns:
- expand the context on which the patterns operate to reduce false matches
- as an example, consider the e1_is pattern: over 60% of irrelevant nuggets were cases where the target is the object of a preposition rather than the subject of the copular verb immediately following it
- for example, the question "What is mold?" matched the sentence "tools you need to look for mold are..."
Good-nugget predictor:
- separate "good" from "bad" nuggets using machine learning techniques
A sketch of the preposition check is given below.
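One possible way to add that extra context: before accepting an e1_is match, check whether the target is immediately preceded by a preposition in the POS-tagged sentence. This is only a sketch of the idea suggested above, not the authors' implementation:

```python
from typing import List, Tuple

PREPOSITION_TAGS = {"IN", "TO"}   # Penn Treebank tags covering prepositions

def target_is_prepositional_object(tagged: List[Tuple[str, str]], target: str) -> bool:
    """Return True if the target word is directly preceded by a preposition,
    e.g. 'look for mold are ...' for the question 'What is mold?'."""
    target_word = target.lower().split()[0]
    for i, (word, _) in enumerate(tagged):
        if word.lower() == target_word and i > 0 and tagged[i - 1][1] in PREPOSITION_TAGS:
            return True
    return False

# Example (pre-tagged): the match below would be rejected as a false e1_is hit.
# target_is_prepositional_object(
#     [("look", "VB"), ("for", "IN"), ("mold", "NN"), ("are", "VBP")], "mold")
# -> True
```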

26 Conclusion
A novel set of strategies drawing on multiple knowledge sources:
- database, dictionary, and documents
The derived answers are smoothly integrated to produce a final set of answers.
The analyses show:
- the difficulty of evaluating definition questions
- the inability of present metrics to accurately capture the information needs of real-world users

