
Using a Named Entity Tagger to Generalise Surface Matching Text Patterns for Question Answering
Mark A. Greenwood and Robert Gaizauskas
Natural Language Processing Group, Department of Computer Science, University of Sheffield, UK

April 10th 2003, Dialogue and Question Answering Reading Group

Outline of Presentation
- Introduction and overview
- The basic approach
- Limitations of the approach
- Generalising using an NE tagger
- Results to date
- Discussion and conclusions

Introduction and Overview

The question answering system outlined in this talk is a relatively simple system in the same vein as those developed by Soubbotin and Soubbotin (2001) and Ravichandran and Hovy (2002):
- These systems use a single resource consisting of a large number of surface matching text patterns.
- These patterns are derived, using simple machine learning techniques, from a set of question-answer pairs and a corpus of documents in which the answer is known to occur.

While agreeing that these systems have enormous potential, we aim to highlight, and provide a solution to, a problem caused by overly specific patterns: they can result in systems which are unable to answer certain types of questions.

The Basic Approach

Learning patterns which can be used to find answers is a two-stage process:
- The first stage learns a set of patterns from a set of question-answer pairs.
- The second stage assigns a precision to each pattern and discards those patterns which are tied to a specific question-answer pair.

To explain the process we will use questions of the form "When was X born?":
- As a concrete example we will use "When was Mozart born?", for which the question-answer pair is: Mozart / 1756.

The Basic Approach

The first stage is to learn a set of patterns from the question-answer pairs for a specific question type:
- For each example, the question and answer terms are submitted to Google and the top ten documents are downloaded.
- Each document then has the question and answer terms replaced by AnCHoR and AnSWeR respectively.
- Those sentences which contain both AnCHoR and AnSWeR are retained and joined together to create a single document, with the sentences separated by a #.
- This generated document is then used to build a token-level suffix tree, from which repeated strings which contain both AnCHoR and AnSWeR and which do not span a sentence boundary (i.e. contain no #) are extracted as patterns.
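The extraction step above can be sketched in a few lines of Python. This is a simplified, hypothetical illustration: it counts repeated token n-grams (capped at a dozen tokens) instead of building the token-level suffix tree the talk describes, but on small inputs it recovers the same repeated strings. The sentences are invented toy data.

```python
from collections import Counter

def learn_patterns(sentences, min_count=2):
    """Find repeated token strings containing both AnCHoR and AnSWeR.

    Simplified stand-in for the token-level suffix tree: every
    contiguous n-gram (up to 12 tokens) is counted instead.
    """
    # Join the substituted sentences into one document, separated by '#'.
    tokens = []
    for s in sentences:
        tokens.extend(s.split())
        tokens.append("#")
    counts = Counter()
    for i in range(len(tokens)):
        for j in range(i + 2, min(i + 12, len(tokens)) + 1):
            gram = tokens[i:j]
            if "#" in gram:  # patterns must not span a sentence boundary
                break
            if "AnCHoR" in gram and "AnSWeR" in gram:
                counts[" ".join(gram)] += 1
    # Keep only strings that occur more than once, i.e. are genuinely repeated.
    return [p for p, c in counts.items() if c >= min_count]

patterns = learn_patterns([
    "AnCHoR ( AnSWeR - 1791 ) was a musical genius",
    "the composer AnCHoR ( AnSWeR - 1791 ) wrote many works",
])
for p in sorted(patterns):
    print(p)
```

With these two toy sentences the repeated strings include "AnCHoR ( AnSWeR" and its longer extensions such as "AnCHoR ( AnSWeR - 1791 )".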

The Basic Approach

The result of the first stage is a set of patterns; for questions of the form "When was X born?" these may include:
- AnCHoR ( AnSWeR –
- From AnCHoR ( AnSWeR – 1969 )
- AnCHoR ( AnSWeR

Unfortunately, some of these patterns are specific to one or more of the questions used to generate them, so the second stage of the approach filters out these specific patterns to produce a set which can be used to answer unseen questions.

The Basic Approach

The second stage of the approach requires a different set of question-answer pairs from those used in the first stage:
- Within each of the top ten documents returned by Google using only the question term, the question term is replaced by AnCHoR and the answer (if present) with AnSWeR.
- Those sentences which contain AnCHoR are retained.
- Each of the patterns from the first stage is converted to a regular expression by replacing AnSWeR with ([^ ]+).
- Each regular expression is then matched against each sentence, and two counts are maintained for each pattern: C_a, the total number of times the pattern has matched, and C_c, the number of matches which correctly selected AnSWeR.
- After a pattern has been matched against every sentence, it is discarded if C_c is less than 5; otherwise its precision is calculated as C_c / C_a and the pattern is retained only if this precision is greater than 0.1.
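A minimal sketch of this filtering step, assuming the anchor-substituted sentences and their known answers are already available as plain strings. The `min_correct=5` and `min_precision=0.1` defaults are the thresholds given above; the demo call lowers `min_correct` only so the tiny two-sentence input produces output.

```python
import re

def score_patterns(patterns, sentences, answers, min_correct=5, min_precision=0.1):
    """Assign a precision C_c / C_a to each pattern and filter.

    `sentences` are anchor-substituted sentences; `answers[i]` is the
    known answer string for sentence i.
    """
    scored = []
    for pattern in patterns:
        # Escape the literal pattern, then open the AnSWeR slot up
        # as the capture group ([^ ]+).
        regex = re.compile(re.escape(pattern).replace("AnSWeR", r"([^ ]+)"))
        c_a = c_c = 0  # C_a: all matches; C_c: matches selecting the known answer
        for sent, ans in zip(sentences, answers):
            for m in regex.finditer(sent):
                c_a += 1
                if m.group(1) == ans:
                    c_c += 1
        if c_c >= min_correct and c_c / c_a > min_precision:
            scored.append((c_c / c_a, regex))
    return scored

scored = score_patterns(
    ["AnCHoR ( AnSWeR -"],
    ["AnCHoR ( 1756 - 1791 ) was a genius",
     "born AnCHoR ( 1870 - 1924 ) in Russia"],
    ["1756", "1870"],
    min_correct=2,  # the talk uses 5; lowered only for this tiny demo
)
print(scored)
```

Here the single pattern matches both sentences and selects the correct answer in each, so it is kept with precision 1.0.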

The Basic Approach

The result of assigning precisions to patterns in this way is a set of regular expressions with associated precisions, such as:
- 0.967: AnCHoR \( ([^ ]+)
- 0.263: AnCHoR ([^ ]+) –

These patterns can then be used to answer unseen questions:
- The question term is submitted to Google, and in each of the top ten returned documents the question term is replaced with AnCHoR.
- Those sentences which contain AnCHoR are extracted and combined to make a single document.
- Each pattern is then applied to each sentence to extract possible answers.
- All the answers found are sorted, firstly on the precision of the pattern which selected them and secondly on the number of times the same answer was found.
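The answer-extraction and sorting step can be sketched as follows. The two precision-tagged regular expressions and the sentences are illustrative toy data, not output from a real run.

```python
import re
from collections import Counter

def answer_question(scored_patterns, sentences):
    """Rank candidate answers extracted by precision-tagged patterns.

    Answers are sorted first by the best precision of any pattern that
    found them, then by how often the same answer was found.
    """
    best_precision = {}
    freq = Counter()
    for precision, regex in scored_patterns:
        for sent in sentences:
            for m in regex.finditer(sent):
                ans = m.group(1)
                freq[ans] += 1
                best_precision[ans] = max(best_precision.get(ans, 0.0), precision)
    return sorted(freq, key=lambda a: (best_precision[a], freq[a]), reverse=True)

ranked = answer_question(
    [(0.967, re.compile(r"AnCHoR \( ([^ ]+)")),
     (0.263, re.compile(r"AnCHoR ([^ ]+) -"))],
    ["AnCHoR ( 1756 - 1791 ) was a genius",
     "the composer AnCHoR 1756 - 1791"],
)
print(ranked)  # the candidate '1756' ranks first
```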

Limitations of the Approach

Most systems of this kind use questions of the form "When was X born?", often with the concrete example "When was Mozart born?", which can be answered from the following text:

"Mozart ( 1756 – 1791 ) was a musical genius who…"

A similar question which is also answerable from the same text snippet is "When did Mozart die?". Unfortunately, the system just presented is unable to answer this question:
- The first stage of the algorithm will generate a pattern such as: AnCHoR ( 1756 – AnSWeR )
- The second stage is, however, unable to assign a precision to this pattern: the specific year of birth means it matches little or nothing but sentences about Mozart, so it can never accumulate enough correct matches over the second-stage question set.

What is needed is a way of generalising such patterns to eliminate some or all of the question-specific text.

Generalising Using an NE Tagger

Many of the question-specific words which appear within the patterns are dates, names, organisations and locations. These are exactly the kinds of words which can be recognised using gazetteers and named entity taggers. The approach taken to generalise the answer patterns is therefore to combine the pattern learning with these NLP techniques:
- A gazetteer and named entity tagger are applied to the documents after the question and answer terms have been replaced with AnCHoR and AnSWeR.
- Text recognised as a named entity is then replaced with a tag representing its type; e.g. a date would be replaced by DatE. Processing then continues as before.
- When using the patterns to answer questions, more work is required, as any tags selected as answers have to be expanded back to their original values.
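A toy illustration of the substitution step. A real system would apply a gazetteer and a full named entity tagger covering names, organisations and locations; here a simple four-digit-year regex stands in as the "date" recogniser, and the original values are remembered so that a DatE tag selected as an answer can be expanded back, as the talk describes.

```python
import re

def tag_entities(sentence):
    """Replace recognised entities with type tags, remembering originals.

    Toy recogniser: only four-digit years are treated as dates and
    replaced with the DatE tag.
    """
    originals = []
    def remember(match):
        # Record the original surface form so the tag can be expanded later.
        originals.append(match.group(0))
        return "DatE"
    tagged = re.sub(r"\b(?:1\d{3}|20\d{2})\b", remember, sentence)
    return tagged, originals

tagged, originals = tag_entities("AnCHoR ( 1756 - AnSWeR ) was a genius")
print(tagged)     # AnCHoR ( DatE - AnSWeR ) was a genius
print(originals)  # ['1756']
```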

Generalising Using an NE Tagger

This results in a system which can now produce generalised patterns, with associated precisions, for questions of the form "When did X die?":
- 1.000: AnCHoR \( DatE – ([^ ]+) \)
- 0.889: AnCHoR DatE – ([^ ]+)

This approach also gives rise to more patterns being generated for questions of the form "When was X born?":
- 0.941: AnCHoR \( ([^ ]+) - DatE \)
- 0.556: AnCHoR ([^ ]+) - DatE

Results to Date

A small set of experiments was carried out to see the effect of extending the system in this way:
- Three question types were used: "When was X born?", "When did X die?" and "What is the capital of X?"
- Six experiments were undertaken: the basic and the extended system applied to each of the three question types.
- For each experiment, twenty questions were used for stage 1, twenty questions for stage 2, and the patterns were then tested using a further 100 questions.
- For each experiment, the mean reciprocal rank (MRR) and confidence weighted scores were calculated as measures of performance, over and above the simple metric of the percentage of questions correctly answered.

Results to Date

Question Type   System     % Correctly Answered
Born            Basic      52%
                Extended   53%
Died            Basic      0%
                Extended   53%
Capital City    Basic      24%
                Extended   24%

Results to Date

[chart]

Discussion and Conclusions

Only a small set of experiments has been carried out using this technique so far, but for these experiments:
- Generalising the patterns benefited question types which could not be answered by the basic approach.
- Generalising the patterns had no detrimental effect on question types which could be answered by the basic approach.

A small number of example questions can be used to generate sets of surface matching text patterns, although increasing the number of examples does seem to increase the performance of the system.

There are other ways of generalising surface matching text patterns which need to be investigated before this method can be hailed as the holy grail of question answering.