Using a Named Entity Tagger to Generalise Surface Matching Text Patterns for Question Answering Mark A. Greenwood and Robert Gaizauskas Natural Language Processing Group Department of Computer Science University of Sheffield, UK

NLP4QA: an EACL 2003 Workshop, April 14th 2003

Outline of Presentation
- Introduction and overview
- The basic approach
- Limitations of the approach
- Generalising using an NE tagger
- Results to date
- Discussion and conclusions
- Future work

Introduction and Overview
- The basic question answering system outlined in this talk is similar to those developed by Soubbotin and Soubbotin (2001) and Ravichandran and Hovy (2002):
  - These systems use a single resource consisting of a large number of surface matching text patterns.
  - These patterns are derived using simple machine learning techniques from a set of question-answer pairs and a corpus of documents in which the answer is known to occur.
- While agreeing that these systems have enormous potential, we aim to highlight, and provide a solution to, a problem caused by overly specific patterns:
  - Overly specific patterns result in systems which may be unable to answer certain types of questions.

The Basic Approach
- Learning patterns which can be used to find answers is a two-stage process:
  - The first stage is to learn a set of patterns from a set of question-answer pairs.
  - The second stage involves assigning a precision to each pattern and discarding those patterns which are tied to a specific question-answer pair.
- To explain the process we will use questions of the form "When was X born?":
  - As a concrete example we will use "When was Mozart born?", for which the question-answer pair is: Mozart, 1756.

The Basic Approach
- The first stage is to learn a set of patterns from the question-answer pairs for a specific question type:
  - For each example the question and answer terms are submitted to Google and the top ten documents are downloaded.
  - Each document then has the question and answer terms replaced by AnCHoR and AnSWeR respectively.
  - Those sentences which contain both AnCHoR and AnSWeR are retained and joined together to create a single document.
  - This generated document is then used to build a token-level suffix tree, from which repeated strings which contain both AnCHoR and AnSWeR and which do not span a sentence boundary are extracted as patterns (a simplified sketch follows this slide).
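As an illustration of this first stage, here is a minimal Python sketch that finds repeated token strings containing both AnCHoR and AnSWeR. It uses a brute-force n-gram scan in place of the token-level suffix tree the slide describes, so it is only a stand-in for the real extraction; the function name, thresholds and example sentences are illustrative, not the authors' implementation.

```python
from collections import Counter

def candidate_patterns(sentences, min_count=2, max_len=8):
    """Collect token n-grams that contain both AnCHoR and AnSWeR and
    occur repeatedly across the rewritten sentences. A token-level
    suffix tree finds all maximal repeated strings efficiently; this
    brute-force n-gram scan is only an illustrative stand-in."""
    counts = Counter()
    for sentence in sentences:          # n-grams never span a sentence boundary
        tokens = sentence.split()
        for n in range(2, max_len + 1):
            for i in range(len(tokens) - n + 1):
                gram = tokens[i:i + n]
                if "AnCHoR" in gram and "AnSWeR" in gram:
                    counts[" ".join(gram)] += 1
    return [p for p, c in counts.items() if c >= min_count]

# Sentences in which the question and answer terms were already replaced:
sentences = ["AnCHoR ( AnSWeR - 1791 ) was a musical genius",
             "the life of AnCHoR ( AnSWeR - 1791 ) was short"]
print(candidate_patterns(sentences))
# e.g. ['AnCHoR ( AnSWeR', 'AnCHoR ( AnSWeR -', 'AnCHoR ( AnSWeR - 1791 )', ...]
```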

The Basic Approach
- The result of the first stage is a set of patterns. For questions of the form "When was X born?" these may include:
    AnCHoR ( AnSWeR –
    From AnCHoR ( AnSWeR – 1791 )
    AnCHoR ( AnSWeR
- Unfortunately some of these patterns are specific to the question used to generate them, so the second stage of the approach is concerned with filtering out these specific patterns to produce a set which can be used to answer unseen questions.

The Basic Approach
- The second stage of the approach requires a different set of question-answer pairs to those used in the first stage:
  - Within each of the top ten documents returned by Google, using only the question term, the question term is replaced by AnCHoR and the answer (if it is present) by AnSWeR.
  - Those sentences which contain AnCHoR are retained.
  - All of the patterns from the first stage are converted to regular expressions designed to capture the token which appears in place of AnSWeR.
  - Each regular expression is then matched against each sentence, and two counts are maintained for each pattern: Ca, the total number of times the pattern has matched, and Cc, the number of times AnSWeR was selected as the answer.
  - After a pattern has been matched against every sentence, it is discarded if Cc is less than 5; otherwise its precision is calculated as Cc/Ca and the pattern is retained only if the precision is greater than 0.1 (a sketch of this filtering step follows this slide).
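A sketch of this filtering step, assuming the patterns and the judged (sentence, answer) pairs have already been prepared; the thresholds are those given on the slide, everything else is illustrative.

```python
import re

def assign_precision(patterns, judged_sentences,
                     min_correct=5, min_precision=0.1):
    """Turn each pattern into a regular expression that captures the
    token in place of AnSWeR, then count Ca (total matches) and Cc
    (matches where the captured token is the known answer). Patterns
    with Cc < 5 are discarded; the rest keep precision Cc/Ca if it
    exceeds 0.1, as described on the slide."""
    scored = []
    for pattern in patterns:
        regex = re.compile(re.escape(pattern).replace("AnSWeR", r"([^ ]+)"))
        ca = cc = 0
        for sentence, answer in judged_sentences:
            for match in regex.finditer(sentence):
                ca += 1
                if match.group(1) == answer:
                    cc += 1
        if cc >= min_correct and cc / ca > min_precision:
            scored.append((cc / ca, regex.pattern))
    return sorted(scored, reverse=True)
```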

The Basic Approach
- The result of assigning precision to patterns in this way is a set of precisions and regular expressions such as:
    0.967: AnCHoR \( ([^ ]+)
    0.263: AnCHoR ([^ ]+) –
- These patterns can then be used to answer unseen questions (a sketch follows this slide):
  - The question term is submitted to Google and the top ten returned documents have the question term replaced with AnCHoR.
  - Those sentences which contain AnCHoR are extracted and combined to make a single document.
  - Each pattern is then applied to each sentence to extract possible answers.
  - All the answers found are sorted firstly on the precision of the pattern which selected them and secondly on the number of times the same answer was found.
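The answer extraction and ranking step might look like the following sketch, reusing the (precision, regex) pairs produced by the previous sketch; the two-key sort mirrors the ordering described on the slide.

```python
import re
from collections import defaultdict

def answer_question(scored_patterns, sentences):
    """Apply every scored pattern to every AnCHoR-bearing sentence and
    rank the candidate answers: first by the best precision of any
    pattern that found them, then by how often they were found."""
    best_precision = defaultdict(float)
    frequency = defaultdict(int)
    for precision, pattern in scored_patterns:
        regex = re.compile(pattern)
        for sentence in sentences:
            for match in regex.finditer(sentence):
                answer = match.group(1)
                frequency[answer] += 1
                best_precision[answer] = max(best_precision[answer], precision)
    return sorted(frequency,
                  key=lambda a: (best_precision[a], frequency[a]),
                  reverse=True)

ranked = answer_question(
    [(0.967, r"AnCHoR \( ([^ ]+)")],
    ["AnCHoR ( 1756 - 1791 ) was a musical genius"])
print(ranked)   # ['1756']
```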

Limitations of the Approach
- Most systems of this kind use questions of the form "When was X born?", often with the concrete example "When was Mozart born?", which can be answered from the following text:
    "Mozart ( 1756 – 1791 ) was a musical genius who…"
- A similar question which is also answerable from the same text snippet is "When did Mozart die?". Unfortunately the system just presented is unable to answer this question:
  - The first stage of the algorithm will generate a pattern such as: AnCHoR ( 1756 – AnSWeR )
  - The second stage is, however, unable to assign a precision to this pattern due to the presence of a specific year of birth.
- What is needed is an approach that can generalise this form of pattern, eliminating some or all of the question-specific text.

Generalising Using an NE Tagger
- Many of the question-specific words which could appear within the patterns are dates, names, organisations and locations. These are exactly the types of words which can be recognised using gazetteers and named entity taggers.
- The approach taken to generalise the answer patterns is therefore to combine the basic approach with these NLP techniques (a sketch follows this slide):
  - A gazetteer and named entity tagger are applied to the documents after the question and answer terms have been replaced with AnCHoR and AnSWeR.
  - Text highlighted as a named entity is then replaced with a tag representing its type, e.g. a date would be replaced by DatE. Processing then continues as before.
  - When using the patterns to answer questions more work is required, as any tags selected as answers have to be expanded back to their original values.
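A sketch of the tag-substitution step, assuming a gazetteer and NE tagger have already produced typed token spans; the span format and the offset-keyed bookkeeping for later expansion are illustrative assumptions, not the authors' implementation.

```python
def generalise(tokens, entity_spans):
    """Replace each named-entity span with its type tag (e.g. DatE) and
    record the original text, keyed on the position of the tag in the
    rewritten token list, so that tags selected as answers can later be
    expanded back to their original values."""
    out, originals = [], {}
    i = 0
    while i < len(tokens):
        span = next(((s, e, t) for (s, e), t in entity_spans.items() if s == i),
                    None)
        if span:
            start, end, tag = span
            originals[len(out)] = " ".join(tokens[start:end])
            out.append(tag)
            i = end
        else:
            out.append(tokens[i])
            i += 1
    return " ".join(out), originals

tokens = "AnCHoR ( AnSWeR - 1791 ) was a musical genius".split()
text, originals = generalise(tokens, {(4, 5): "DatE"})
print(text)       # AnCHoR ( AnSWeR - DatE ) was a musical genius
print(originals)  # {4: '1791'}
```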

Generalising Using an NE Tagger
- This results in a system which can now produce generalised patterns, with associated precisions, for questions of the form "When did X die?":
    1.000: AnCHoR \( DatE – ([^ ]+) \)
    0.889: AnCHoR DatE – ([^ ]+)
- It should be clear that these patterns are capable of extracting the answer from the text we saw earlier (a worked expansion example follows this slide):
    "Mozart ( 1756 – 1791 ) was a musical genius who…"
- This approach also gives rise to more patterns being generated for questions of the form "When was X born?":
    0.941: AnCHoR \( ([^ ]+) - DatE \)
    0.556: AnCHoR ([^ ]+) - DatE
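A worked example of the expansion step: at answer time both years in "Mozart ( 1756 - 1791 )" are tagged, so the token the pattern captures is itself a DatE tag and must be mapped back to its original text. The offset-keyed dictionary is the assumed bookkeeping from the sketch above.

```python
import re

# After anchoring and NE tagging, the snippet becomes:
sentence = "AnCHoR ( DatE - DatE ) was a musical genius"
originals = {2: "1756", 4: "1791"}   # token offset -> original text (assumed)

match = re.search(r"AnCHoR \( DatE - ([^ ]+) \)", sentence)
answer = match.group(1)
if answer == "DatE":
    # recover the captured token's offset from its character position
    offset = len(sentence[:match.start(1)].split())
    answer = originals[offset]
print(answer)   # 1791
```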

Results to Date
- A small set of experiments was carried out to see the effect of extending the system in this way:
  - Three different question types were used: "When was X born?", "When did X die?" and "What is the capital of X?"
  - Six experiments were undertaken, namely the basic and extended system applied to each of the three question types.
  - For each experiment twenty questions were used for stage 1, twenty questions for stage 2, and the patterns were then tested using a further 100 questions.
- For each experiment the mean reciprocal rank (MRR) and confidence weighted score were calculated as measures of performance over and above the simple metric of the percentage of questions correctly answered (standard definitions are sketched after this slide).
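For reference, the two evaluation measures under their standard TREC-style definitions; whether the authors used exactly these variants is an assumption.

```python
def mean_reciprocal_rank(first_correct_ranks):
    """first_correct_ranks[i] is the rank of the first correct answer
    for question i, or None if no correct answer was returned."""
    return sum(1.0 / r for r in first_correct_ranks if r) / len(first_correct_ranks)

def confidence_weighted_score(correct):
    """`correct` lists, for questions sorted from most to least
    confident, whether each was answered correctly; each prefix of the
    list contributes the fraction answered correctly within it."""
    total, running = 0.0, 0
    for i, ok in enumerate(correct, start=1):
        running += ok
        total += running / i
    return total / len(correct)

print(mean_reciprocal_rank([1, None, 2, 1]))                 # 0.625
print(confidence_weighted_score([True, True, False, True]))  # about 0.854
```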

Results to Date

Question Type   System     % Correctly Answered   MRR Score   Confidence Weighted Score
Born            Basic      52%
                Extended   53%
Died            Basic      0%
                Extended   53%
Capital City    Basic      24%
                Extended   24%

Discussion and Conclusions
- Currently only a small set of experiments has been carried out using this technique, but for these experiments:
  - Generalising the patterns benefited question types which could not be answered by the basic approach.
  - Generalising the patterns had no detrimental effect on question types which could be answered by the basic approach.
- A small number of example questions can be used to generate sets of surface matching text patterns: only 40 examples were used to generate and assign precisions to the patterns.
  - Clearly experiments should be carried out to see if there is a performance advantage in using more examples.
- There are other ways of generalising the surface matching text patterns which should be investigated.

Future Work
- Preliminary work on varying the number of examples used to generate the patterns and their precisions suggests that using more examples leads to better performance:
  - This should be continued to try to determine the optimum number of examples to use.
- We intend to investigate other NLP techniques to generalise and improve the patterns, including:
  - Name matching and coreference
  - Noun and verb phrase chunking
  - Part of speech tagging
- We also intend to improve the second stage by experimenting with the document retrieval query:
  - Include keywords along with the question term, e.g. "capital" for questions of the form "What is the capital of X?".

Any Questions? Copies of the paper and these slides can be found at: