Semantic Search via XML Fragments: A High-Precision Approach to IR Jennifer Chu-Carroll, John Prager, David Ferrucci, and Pablo Duboue IBM T.J. Watson.

Slides:



Advertisements
Similar presentations
Dr. Leo Obrst Information Semantics Command & Control Center July 17, 2007 Ontologies Can't Help Records Management Or Can They?
Advertisements

Processing XML Keyword Search by Constructing Effective Structured Queries Jianxin Li, Chengfei Liu, Rui Zhou and Bo Ning Swinburne University of Technology,
Properties of Text CS336 Lecture 3:. 2 Generating Document Representations Want to automatically generate with little human intervention Use significant.
TextMap: An Intelligent Question- Answering Assistant Project Members:Ulf Hermjakob Eduard Hovy Chin-Yew Lin Kevin Knight Daniel Marcu Deepak Ravichandran.
Overview of Collaborative Information Retrieval (CIR) at FIRE 2012 Debasis Ganguly, Johannes Leveling, Gareth Jones School of Computing, CNGL, Dublin City.
Language Model based Information Retrieval: University of Saarland 1 A Hidden Markov Model Information Retrieval System Mahboob Alam Khalid.
Search Engines and Information Retrieval
NaLIX: A Generic Natural Language Search Environment for XML Data Presented by: Erik Mathisen 02/12/2008.
The Informative Role of WordNet in Open-Domain Question Answering Marius Paşca and Sanda M. Harabagiu (NAACL 2001) Presented by Shauna Eggers CS 620 February.
INFO 624 Week 3 Retrieval System Evaluation
Reference Collections: Task Characteristics. TREC Collection Text REtrieval Conference (TREC) –sponsored by NIST and DARPA (1992-?) Comparing approaches.
1 Information Retrieval and Extraction 資訊檢索與擷取 Chia-Hui Chang, Assistant Professor Dept. of Computer Science & Information Engineering National Central.
Information Retrieval and Extraction 資訊檢索與擷取 Chia-Hui Chang National Central University
Enhance legal retrieval applications with an automatically induced knowledge base Ka Kan Lo.
Information Retrieval in Practice
Improving Data Discovery in Metadata Repositories through Semantic Search Chad Berkley 1, Shawn Bowers 2, Matt Jones 1, Mark Schildhauer 1, Josh Madin.
AQUAINT Kickoff Meeting – December 2001 Integrating Robust Semantics, Event Detection, Information Fusion, and Summarization for Multimedia Question Answering.
Search Engines and Information Retrieval Chapter 1.
Probabilistic Model for Definitional Question Answering Kyoung-Soo Han, Young-In Song, and Hae-Chang Rim Korea University SIGIR 2006.
CLEF Ǻrhus Robust – Word Sense Disambiguation exercise UBC: Eneko Agirre, Oier Lopez de Lacalle, Arantxa Otegi, German Rigau UVA & Irion: Piek Vossen.
1 The BT Digital Library A case study in intelligent content management Paul Warren
1 Wikification CSE 6339 (Section 002) Abhijit Tendulkar.
©2008 Srikanth Kallurkar, Quantum Leap Innovations, Inc. All rights reserved. Apollo – Automated Content Management System Srikanth Kallurkar Quantum Leap.
PERSONALIZED SEARCH Ram Nithin Baalay. Personalized Search? Search Engine: A Vital Need Next level of Intelligent Information Retrieval. Retrieval of.
Information Retrieval and Web Search Lecture 1. Course overview Instructor: Rada Mihalcea Class web page:
UOS 1 Ontology Based Personalized Search Zhang Tao The University of Seoul.
1 Searching XML Documents via XML Fragments D. Camel, Y. S. Maarek, M. Mandelbrod, Y. Mass and A. Soffer Presented by Hui Fang.
Querying Structured Text in an XML Database By Xuemei Luo.
1 Query Operations Relevance Feedback & Query Expansion.
April 14, 2003Hang Cui, Ji-Rong Wen and Tat- Seng Chua 1 Hierarchical Indexing and Flexible Element Retrieval for Structured Document Hang Cui School of.
Structured Use of External Knowledge for Event-based Open Domain Question Answering Hui Yang, Tat-Seng Chua, Shuguang Wang, Chun-Keat Koh National University.
Planning a search strategy.  A search strategy may be broadly defined as a conscious approach to decision making to solve a problem or achieve an objective.
21/11/2002 The Integration of Lexical Knowledge and External Resources for QA Hui YANG, Tat-Seng Chua Pris, School of Computing.
XP New Perspectives on The Internet, Sixth Edition— Comprehensive Tutorial 3 1 Searching the Web Using Search Engines and Directories Effectively Tutorial.
INTERESTING NUGGETS AND THEIR IMPACT ON DEFINITIONAL QUESTION ANSWERING Kian-Wei Kor, Tat-Seng Chua Department of Computer Science School of Computing.
Relevance Detection Approach to Gene Annotation Aid to automatic annotation of databases Annotation flow –Extraction of molecular function of a gene from.
Mining Topic-Specific Concepts and Definitions on the Web Bing Liu, etc KDD03 CS591CXZ CS591CXZ Web mining: Lexical relationship mining.
The Business Model of Google MBAA 609 R. Nakatsu.
An Architecture for Emergent Semantics Sven Herschel, Ralf Heese, and Jens Bleiholder Humboldt-Universität zu Berlin/ Hasso-Plattner-Institut.
Collocations and Information Management Applications Gregor Erbach Saarland University Saarbrücken.
LANGUAGE MODELS FOR RELEVANCE FEEDBACK Lee Won Hee.
Splitting Complex Temporal Questions for Question Answering systems ACL 2004.
Next Generation Search Engines Ehsun Daroodi 1 Feb, 2003.
IBM Research © Copyright IBM Corporation 2005 | A Development Environment for Configurable Meta-Annotators in a Pipelined NLP Architecture Youssef Drissi,
Oxygen Indexing Relations from Natural Language Jimmy Lin, Boris Katz, Sue Felshin Oxygen Workshop, January, 2002.
LOGO 1 Corroborate and Learn Facts from the Web Advisor : Dr. Koh Jia-Ling Speaker : Tu Yi-Lang Date : Shubin Zhao, Jonathan Betz (KDD '07 )
© 2004 Chris Staff CSAW’04 University of Malta of 15 Expanding Query Terms in Context Chris Staff and Robert Muscat Department of.
CLEF Kerkyra Robust – Word Sense Disambiguation exercise UBC: Eneko Agirre, Arantxa Otegi UNIPD: Giorgio Di Nunzio UH: Thomas Mandl.
MedKAT Medical Knowledge Analysis Tool December 2009.
Using a Named Entity Tagger to Generalise Surface Matching Text Patterns for Question Answering Mark A. Greenwood and Robert Gaizauskas Natural Language.
Creating Subjective and Objective Sentence Classifier from Unannotated Texts Janyce Wiebe and Ellen Riloff Department of Computer Science University of.
Number Sense Disambiguation Stuart Moore Supervised by: Anna Korhonen (Computer Lab)‏ Sabine Buchholz (Toshiba CRL)‏
AQUAINT IBM PIQUANT ARDACYCORP Subcontractor: IBM Question Answering Update piQuAnt ARDA/AQUAINT December 2002 Workshop This work was supported in part.
Multi-level Bootstrapping for Extracting Parallel Sentence from a Quasi-Comparable Corpus Pascale Fung and Percy Cheung Human Language Technology Center,
Comparing Document Segmentation for Passage Retrieval in Question Answering Jorg Tiedemann University of Groningen presented by: Moy’awiah Al-Shannaq
1 Evaluating High Accuracy Retrieval Techniques Chirag Shah,W. Bruce Croft Center for Intelligent Information Retrieval Department of Computer Science.
What Does the User Really Want ? Relevance, Precision and Recall.
Answer Mining by Combining Extraction Techniques with Abductive Reasoning Sanda Harabagiu, Dan Moldovan, Christine Clark, Mitchell Bowden, Jown Williams.
Query Suggestions in the Absence of Query Logs Sumit Bhatia, Debapriyo Majumdar,Prasenjit Mitra SIGIR’11, July 24–28, 2011, Beijing, China.
Chapter. 3: Retrieval Evaluation 1/2/2016Dr. Almetwally Mostafa 1.
Improving QA Accuracy by Question Inversion John Prager, Pablo Duboue, Jennifer Chu-Carroll Presentation by Sam Cunningham and Martin Wintz.
1 Question Answering and Logistics. 2 Class Logistics  Comments on proposals will be returned next week and may be available as early as Monday  Look.
Using Semantic Relations to Improve Information Retrieval
Toward Entity Retrieval over Structured and Text Data Mayssam Sayyadian, Azadeh Shakery, AnHai Doan, ChengXiang Zhai Department of Computer Science University.
Feature Assignment LBSC 878 February 22, 1999 Douglas W. Oard and Dagobert Soergel.
AQUAINT Mid-Year PI Meeting – June 2002 Integrating Robust Semantics, Event Detection, Information Fusion, and Summarization for Multimedia Question Answering.
Designing Cross-Language Information Retrieval System using various Techniques of Query Expansion and Indexing for Improved Performance  Hello everyone,
Improving Data Discovery Through Semantic Search
Introduction to Information Retrieval
Information Retrieval and Web Design
Presentation transcript:

Semantic Search via XML Fragments: A High-Precision Approach to IR Jennifer Chu-Carroll, John Prager, David Ferrucci, and Pablo Duboue IBM T.J. Watson Research Center Krzysztof Czuba Google Inc. SIGIR 2006

Introduction  high precision search strategy return a small set of documents  highly focused and relevant to the user’s information need  automatic named entity and relation recognizers pre-process text corpora identify the semantic information

Introduction (Con.)  four frequently-occurring query-time semantic needs specifying target information type, disambiguating keywords, specifying search term context, and relating select query terms  three XML fragment operations that can be applied to a query conceptualize, restrict, or relate terms in the query  [section 4] demonstrates how they can be utilized to address our query-time semantic needs

XML Fragments Overview  sufficiently expressive to address most information needs in multiple domains  phrase operator “ White House ”  numeric comparison operators 1999 retrieves books published in or after 1999

Two Sample Book Documents

Example  Bill Clinton allows Bill and Clinton to appear anywhere inside a Book tag retrieves both documents  Bill Clinton retrieves only the first document  Bill +Clinton requires that all books retrieved have titles containing Clinton  + retrieves only those books with publication dates in their records

Corpus Analysis for XML Retrieval  majority of electronic documents available today are significantly less structured  XML search can be more broadly applicable if existing unannotated documents can be automatically processed to include useful semantic information which can subsequently be used to improve search results

Example annotated with named entities and relations

Using XML Fragments for Semantic Search  focus on enriching query expressiveness to address four query-time semantic needs to yield higher precision search results

Three XML Fragment Operations  can be applied to a query to conceptualize, restrict, or relate terms in the query

Conceptualization Operation  generalizes a lexical string to an appropriate concept in the type system represented by that string  query: “ animal ” returns documents containing the word “ animal ”  conceptual query: retrieves documents containing the annotation Animal, e.g., lion, owl, and salmon.

Restriction Operation  constrains the XML tags in which keywords must appear to be considered relevant  query: bass returns documents in which the literal “ bass ” is used in its fish sense  query: bass retrieves those where it represents a musical instrument  the span of the specified annotation needs to contain the span of the keywords query: bass  will match striped bass

Relation Operation  annotation represents a relation that holds between terms covered by the annotation  syntactic Unabomber kill  semantic Unabomber  pragmatic Clinton war on Iraq  allows for nesting of relation and entity annotations John Victoria

Four Query-time Semantic Needs 1. to specify target information type 2. to disambiguate keywords 3. to specify search term context 4. to specify relations between select terms

Target Information Type Specification  conceptualization operation - enables semantic search queries that explicitly specify the semantic type of the information the user is seeking  example: find the zip code for the White House query: “ white house ”  retrieve a large number of documents but few mention its zip code adding “ zip code ” to the query  no improvement  The White House, 1600 Pennsylvania Avenue NW, Washington, DC “ white house ” +  successfully retrieve assuming was annotated as a Zipcode

Utilizing in Two Applications  SAW (Semantic Analysis Workbench) user can include desired concepts in the query to focus search by  by directly typing “ ” as part of the query  selecting tag from a dropdown list of available types  open-domain question answering (QA) system “ What is the telephone number for the University of Kentucky? ” generate query:  + + “ University of Kentucky ”

Search Term Disambiguation  restriction operation bass disambiguate terms based on their word senses  “ Victoria ” can be annotated as City, County, State, Person, Royalty, or Deity generating the query: Victoria  Can inclusion of extra disambiguating keywords yield similar search results? Victoria versus +Victoria +<>city state county

Utilizing in Two Applications  SAW useful in this end user application refine the original query and formulate a disambiguating query using the restriction operator  QA “ When was George Washington born? ” + bear + +George +Washington eliminates matches with  George Washington University  George Washington Bridge

Search Term Context Specification  restriction operation bass semantic tags are specify the context in which keywords should occur  XML tags surrounding a keyword specify meta information syntactic information (Subject/Object) semantic information (Agent/Patient) discourse information (Request/Suggest)  “ war on Iraq ”

Utilizing on AQUAINT 2005 Opinion Pilot  observation: many opinions in documents are expressed as direct quotes from a person ’ s speech  annotated by named entity recognizer with the semantic type Quotation  “ What was the Pentagon panel ’ s position with respect to the dispute over the US Navy training range in the island of Vieques? ” +Pentagon +panel + +US +Navy training range island +Vieques

Relation Specification  relation operation express relations that must hold between query terms  shifts the burden of synonym expansion as a query-time process performed by the user or search engine to the relation annotator  +Iraq + +own expands into +Iraq + +<> own have possess

Utilizing in Two Applications  SAW user can formulate XML Fragment queries containing relations  QA provides the capability of automatically generating XML Fragment queries containing relations from natural language questions “ When did Nixon visit China? ”  + +Nixon +China +

Evaluation – Target Information Type Specification  corpus 3GB AQUAINT corpus contains just over one million news articles between pre-processed with a named entity recognizer that identifies about 100 entity types  test set 50 questions and relevant judgments in the TREC 2005 QA track document task  QA system processes questions to identify answer types and a set of salient keywords, then generates query

Baseline & Results – Target Information Type Specification  baseline run constructed with keywords alone  the run using semantic search the presence of the concept + that corresponds to the semantic type of the answer 2.9%4.1%37.5% improvement

Evaluation – Search Term Disambiguation  same corpus, index, and test set as in the previous experiment  QA system reconfigured to generate queries that include disambiguating tags for query terms

Baseline & Result – Search Term Disambiguation  baseline run semantic search run in previous section (?)  out of the 50 questions, only 2 questions resulted in different queries  improvement is not statistically significant due to the small sample size improvement4.3%1.1%

Evaluation – Search Term Context Specification  same corpus and index  test set 46 out of the 50 questions in the AQUAINT 2005 opinion pilot general form: ” What does OpinionHolder think about SubjectMatter? ”  query: +OpinionHolder + +SubjectMatter

Baseline & Result – Search Term Context Specification  baseline run merely added + as the target information type  retrieved twice as many “ vital ” nuggets as baseline run

Evaluation – Relation Specification  corpus national intelligence domain from CNS contains over 37,000 documents and is about 75MB in size  resulting annotations were indexed along with the lemmatized terms as in the AQUAINT corpus  relation recognizer identifies 10 relations  test set constructed 2-3 semantic queries for each of 10 relations, resulting in a total of 25 queries +Russia + (biological weapons produced by Russia)

Baseline & Result – Relation Specification  baseline run replacing the relation in each query with one or more keywords +Russia + +produce  suitable for applications in which high recall is not initially an important factor improvement6.3%-0.9% 73%

Conclusions  work on semantic search using the XML Fragments query language  the use of three operators to express different query-time semantic needs places additional constraints on the query leads to more focused and more precise results  these uses of XML Fragment operators have been implemented and evaluated in multiple systems  useful in the precision-centric class of applications