Semantic Search via XML Fragments: A High-Precision Approach to IR Jennifer Chu-Carroll, John Prager, David Ferrucci, and Pablo Duboue IBM T.J. Watson.


1 Semantic Search via XML Fragments: A High-Precision Approach to IR Jennifer Chu-Carroll, John Prager, David Ferrucci, and Pablo Duboue IBM T.J. Watson Research Center Krzysztof Czuba Google Inc. SIGIR 2006

2 Introduction  a high-precision search strategy returns a small set of documents that are highly focused on and relevant to the user’s information need  automatic named entity and relation recognizers pre-process text corpora to identify the semantic information

3 Introduction (Cont.)  four frequently occurring query-time semantic needs: specifying target information type, disambiguating keywords, specifying search term context, and relating select query terms  three XML fragment operations that can be applied to a query to conceptualize, restrict, or relate terms in the query  [section 4] demonstrates how they can be utilized to address these query-time semantic needs

4 XML Fragments Overview  sufficiently expressive to address most information needs in multiple domains  phrase operator “ White House ”  numeric comparison operators 1999 retrieves books published in or after 1999

5 Two Sample Book Documents

6 Example  Bill Clinton allows Bill and Clinton to appear anywhere inside a Book tag retrieves both documents  Bill Clinton retrieves only the first document  Bill +Clinton requires that all books retrieved have titles containing Clinton  + retrieves only those books with publication dates in their records
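
The matching behaviour described on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the two book records, the tag names (`Book`, `Author`, `Title`, `Year`) and the helper `search` are hypothetical stand-ins, not the sample documents or the engine from the talk, and the sketch treats every listed term as required (the `+` operator) rather than ranking results.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-ins for the two sample book records.
DOCS = {
    "doc1": "<Book><Author>Bill Clinton</Author><Title>My Life</Title>"
            "<Year>2004</Year></Book>",
    "doc2": "<Book><Author>David Maraniss</Author>"
            "<Title>First in His Class: A Biography of Bill Clinton</Title></Book>",
}


def words_inside(root, tag):
    """Yield the lowercase word set of every element named `tag`, nested text included."""
    for elem in root.iter(tag):
        yield set(" ".join(elem.itertext()).lower().split())


def search(docs, tag, required=()):
    """Return ids of documents with a `tag` element containing all `required` words.

    An empty `required` list only checks that the tag is present at all.
    """
    needed = {w.lower() for w in required}
    hits = []
    for doc_id, xml_text in docs.items():
        root = ET.fromstring(xml_text)
        if any(needed <= words for words in words_inside(root, tag)):
            hits.append(doc_id)
    return hits


print(search(DOCS, "Book", ["Bill", "Clinton"]))    # both records
print(search(DOCS, "Author", ["Bill", "Clinton"]))  # only doc1
print(search(DOCS, "Year"))                         # only records that carry a Year tag
```

Widening the tag from `Author` to `Book` loosens the query, while requiring an empty `Year` tag filters on the presence of a field rather than on its content.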

7 Corpus Analysis for XML Retrieval  majority of electronic documents available today are significantly less structured  XML search can be more broadly applicable if existing unannotated documents can be automatically processed to include useful semantic information which can subsequently be used to improve search results

8 Example annotated with named entities and relations

9 Using XML Fragments for Semantic Search  focus on enriching query expressiveness to address four query-time semantic needs to yield higher precision search results

10 Three XML Fragment Operations  can be applied to a query to conceptualize, restrict, or relate terms in the query

11 Conceptualization Operation  generalizes a lexical string to an appropriate concept in the type system represented by that string  query: “ animal ” returns documents containing the word “ animal ”  conceptual query: retrieves documents containing the annotation Animal, e.g., lion, owl, and salmon.

12 Restriction Operation  constrains the XML tags in which keywords must appear to be considered relevant  query: bass returns documents in which the literal “ bass ” is used in its fish sense  query: bass retrieves those where it represents a musical instrument  the span of the specified annotation needs to contain the span of the keywords query: bass  will match striped bass
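
The span-containment rule for the restriction operation can be sketched as follows. The annotation labels `Fish` and `Instrument`, the token-offset representation and the function names are assumptions made for illustration; the slide does not state how the actual index stores annotation spans.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    label: str   # hypothetical annotation type, e.g. "Fish"
    start: int   # index of the first token covered
    end: int     # index one past the last token covered


def keyword_spans(tokens, keyword):
    """Return (start, end) token spans where the keyword phrase occurs."""
    kw = keyword.lower().split()
    n = len(kw)
    lowered = [t.lower() for t in tokens]
    return [(i, i + n) for i in range(len(tokens) - n + 1) if lowered[i:i + n] == kw]


def restricted_match(tokens, annotations, label, keyword):
    """True if some annotation with `label` fully contains an occurrence of `keyword`."""
    return any(
        a.label == label and a.start <= s and e <= a.end
        for a in annotations
        for (s, e) in keyword_spans(tokens, keyword)
    )


fishing = "He caught a striped bass yesterday".split()
fishing_anns = [Annotation("Fish", 3, 5)]          # covers "striped bass"
music = "She plays bass in a band".split()
music_anns = [Annotation("Instrument", 2, 3)]      # covers "bass"

print(restricted_match(fishing, fishing_anns, "Fish", "bass"))  # True
print(restricted_match(music, music_anns, "Fish", "bass"))      # False
```

Because the check is containment rather than equality, the keyword "bass" matches inside the wider annotation over "striped bass", as the slide notes.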

13 Relation Operation  annotation represents a relation that holds between terms covered by the annotation  syntactic Unabomber kill  semantic Unabomber  pragmatic Clinton war on Iraq  allows for nesting of relation and entity annotations John Victoria

14 Four Query-time Semantic Needs 1. to specify target information type 2. to disambiguate keywords 3. to specify search term context 4. to specify relations between select terms

15 Target Information Type Specification  conceptualization operation: enables semantic search queries that explicitly specify the semantic type of the information the user is seeking  example: find the zip code for the White House  the query “white house” retrieves a large number of documents, but few mention its zip code  adding “zip code” to the query yields no improvement  The White House, 1600 Pennsylvania Avenue NW, Washington, DC 20500  + “white house” +  successfully retrieves this passage, assuming 20500 was annotated as a Zipcode
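
The zip code example can be mimicked with a toy collection. Everything here is an assumption for illustration: the three documents, their annotation sets and the helpers `keyword_search` and `typed_search` are invented, and plain substring matching stands in for a real index.

```python
# Hypothetical mini-collection; "types" lists the annotation types a named
# entity recognizer is assumed to have found in each text.
DOCS = {
    "d1": {"text": "The president spoke at the White House today.",
           "types": {"Organization"}},
    "d2": {"text": "The White House, 1600 Pennsylvania Avenue NW, Washington, DC 20500",
           "types": {"Organization", "Zipcode"}},
    "d3": {"text": "Readers often ask for the zip code of the White House.",
           "types": {"Organization"}},
}


def keyword_search(docs, *phrases):
    """Ids of documents whose text contains every phrase (case-insensitive)."""
    return [
        doc_id for doc_id, doc in docs.items()
        if all(p.lower() in doc["text"].lower() for p in phrases)
    ]


def typed_search(docs, phrase, required_type):
    """Ids of documents that contain the phrase and carry the required annotation type."""
    return [
        doc_id for doc_id in keyword_search(docs, phrase)
        if required_type in docs[doc_id]["types"]
    ]


print(keyword_search(DOCS, "white house"))              # ['d1', 'd2', 'd3']
print(keyword_search(DOCS, "white house", "zip code"))  # ['d3']
print(typed_search(DOCS, "white house", "Zipcode"))     # ['d2']
```

Adding the literal words "zip code" finds text that talks about zip codes, whereas requiring the annotation type finds text that actually contains one.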

16 Utilizing in Two Applications  SAW (Semantic Analysis Workbench): the user can include desired concepts in the query to focus search, either by directly typing “ ” as part of the query or by selecting a tag from a dropdown list of available types  open-domain question answering (QA) system: “What is the telephone number for the University of Kentucky?” generates the query  + + “University of Kentucky”

17 Search Term Disambiguation  restriction operation bass disambiguate terms based on their word senses  “ Victoria ” can be annotated as City, County, State, Person, Royalty, or Deity generating the query: Victoria  Can inclusion of extra disambiguating keywords yield similar search results? Victoria versus +Victoria +<>city state county

18 Utilizing in Two Applications  SAW useful in this end user application refine the original query and formulate a disambiguating query using the restriction operator  QA “ When was George Washington born? ” + bear + +George +Washington eliminates matches with  George Washington University  George Washington Bridge

19 Search Term Context Specification  restriction operation bass semantic tags specify the context in which keywords should occur  XML tags surrounding a keyword specify meta information: syntactic information (Subject/Object), semantic information (Agent/Patient), discourse information (Request/Suggest)  “war on Iraq”

20 Utilizing on AQUAINT 2005 Opinion Pilot  observation: many opinions in documents are expressed as direct quotes from a person ’ s speech  annotated by named entity recognizer with the semantic type Quotation  “ What was the Pentagon panel ’ s position with respect to the dispute over the US Navy training range in the island of Vieques? ” +Pentagon +panel + +US +Navy training range island +Vieques

21 Relation Specification  relation operation: expresses relations that must hold between query terms  shifts the burden of synonym expansion from a query-time process performed by the user or search engine to the relation annotator  +Iraq + +own expands into +Iraq + +<> own have possess
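
The shift of synonym handling from query time to annotation time can be sketched with a toy annotator. The relation label `Owns`, the verb list and the naive "word before the verb is the subject" rule are all assumptions for illustration; the relation recognizer used in the talk is not described at this level of detail.

```python
# Verbs that a hypothetical relation annotator maps to one "Owns" relation.
OWNERSHIP_VERBS = {"own", "owns", "have", "has", "possess", "possesses"}


def annotate(sentence):
    """Toy annotator: emit ("Owns", subject, object) for simple 'X <verb> Y' sentences."""
    words = sentence.rstrip(".").split()
    triples = []
    for i, word in enumerate(words):
        if word.lower() in OWNERSHIP_VERBS and 0 < i < len(words) - 1:
            triples.append(("Owns", words[i - 1], " ".join(words[i + 1:])))
    return triples


def relation_search(sentences, relation, subject):
    """Sentences where the annotator found `relation` with the given subject."""
    return [
        s for s in sentences
        if any(r == relation and subj == subject for r, subj, _ in annotate(s))
    ]


SENTENCES = [
    "Iraq possesses missiles.",
    "Iraq has chemical weapons.",
    "Iran owns tankers.",
    "Iraq denied the report.",
]

print(relation_search(SENTENCES, "Owns", "Iraq"))
```

The query names the relation once; the different surface verbs are absorbed by the annotator, so the user does not have to enumerate "own", "have" and "possess" in every query.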

22 Utilizing in Two Applications  SAW user can formulate XML Fragment queries containing relations  QA provides the capability of automatically generating XML Fragment queries containing relations from natural language questions “ When did Nixon visit China? ”  + +Nixon +China +

23 Evaluation – Target Information Type Specification  corpus: the 3GB AQUAINT corpus contains just over one million news articles from 1996 to 2000, pre-processed with a named entity recognizer that identifies about 100 entity types  test set: 50 questions and relevance judgments from the TREC 2005 QA track document task  the QA system processes questions to identify answer types and a set of salient keywords, then generates a query

24 Baseline & Results – Target Information Type Specification  baseline run constructed with keywords alone  the run using semantic search requires the presence of the concept + that corresponds to the semantic type of the answer  2.9%  4.1%  37.5% improvement

25 Evaluation – Search Term Disambiguation  same corpus, index, and test set as in the previous experiment  QA system reconfigured to generate queries that include disambiguating tags for query terms

26 Baseline & Result – Search Term Disambiguation  baseline run semantic search run in previous section (?)  out of the 50 questions, only 2 questions resulted in different queries  improvement is not statistically significant due to the small sample size  improvement  4.3%  1.1%

27 Evaluation – Search Term Context Specification  same corpus and index  test set 46 out of the 50 questions in the AQUAINT 2005 opinion pilot  general form: “What does OpinionHolder think about SubjectMatter?”  query: +OpinionHolder + +SubjectMatter

28 Baseline & Result – Search Term Context Specification  baseline run merely added + as the target information type  retrieved twice as many “ vital ” nuggets as baseline run

29 Evaluation – Relation Specification  corpus national intelligence domain from CNS contains over 37,000 documents and is about 75MB in size  resulting annotations were indexed along with the lemmatized terms as in the AQUAINT corpus  relation recognizer identifies 10 relations  test set constructed 2-3 semantic queries for each of 10 relations, resulting in a total of 25 queries +Russia + (biological weapons produced by Russia)

30 Baseline & Result – Relation Specification  baseline run replacing the relation in each query with one or more keywords +Russia + +produce  suitable for applications in which high recall is not initially an important factor  improvement  6.3%  -0.9%  73%

31 Conclusions  work on semantic search using the XML Fragments query language  the use of three operators to express different query-time semantic needs places additional constraints on the query leads to more focused and more precise results  these uses of XML Fragment operators have been implemented and evaluated in multiple systems  useful in the precision-centric class of applications

