Slide 1: Type-enabled Keyword Searches with Uncertain Schema
Soumen Chakrabarti, IIT Bombay
www.cse.iitb.ac.in/~soumen
Slide 2: Evolution of Web search (ICML 2005, Chakrabarti)
The first decade of Web search:
- Crawling and indexing at massive scale
- Macroscopic whole-page connectivity analysis
- Very limited expression of information need
Exploiting entities and relations is a clear trend:
- Maintaining large type systems and ontologies
- Discovering mentions of entities and relations
- Deduplicating and canonicalizing mentions
- Forming uncertain, probabilistic E-R graphs
- Enhancing keyword or schema-aware queries
Slide 3: System architecture (diagram)
Corpus side: raw corpus → named entity tagging, relation tagging, disambiguation → annotated corpus → indexer → text index and annotation index.
Query side: question → answer type predictor and keyword match predictor → ranking engine (consulting past query workload stats) → response snippets.
Lexical resources (WordNet, Wikipedia, FrameNet, KnowItAll) feed a uniform lexical network provider. The diagram numbers four components (1-4), which the following slides cover in turn.
Slide 4: Populating entity and relation tables
- Hearst patterns (Hearst 1992): "T such as x", "x and other T", "x is a T"
- DIPRE (Brin 1998)
- Snowball (Agichtein+ 2000): [left] entity1 [middle] entity2 [right]
- PMI-IR (Turney 2001): recognize synonyms using Web hit-count statistics
- KnowItAll (Etzioni+ 2004)
- C-PANKOW (Cimiano+ 2005): is-a relations from Hearst patterns, lists, and PMI
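The Hearst patterns above can be turned into crude surface-level extractors. A minimal sketch, assuming regular expressions over plain text stand in for real noun-phrase chunking; the pattern shapes and toy sentences are illustrative:

```python
import re

# Two of the Hearst patterns as regexes (illustrative approximations):
#   "T such as x"     -> (x, T)
#   "x and other T"   -> (x, T)
SUCH_AS = re.compile(r"(\w+(?: \w+)?) such as ([A-Z]\w+(?: [A-Z]\w+)*)")
AND_OTHER = re.compile(r"([A-Z]\w+(?: [A-Z]\w+)*) and other (\w+)")

def extract_isa(text):
    """Return (instance, type) pairs matched by the two patterns."""
    pairs = []
    m = SUCH_AS.search(text)
    if m:
        pairs.append((m.group(2), m.group(1)))
    m = AND_OTHER.search(text)
    if m:
        pairs.append((m.group(1), m.group(2)))
    return pairs

print(extract_isa("cities such as Karachi"))
print(extract_isa("Exxon and other corporations"))
```

Real systems constrain the slots to noun phrases from a chunker rather than bare capitalized tokens, which is why precision of raw regex versions is low.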
Slide 5: DIPRE and Snowball
Bootstrap loop: seed tuples → tag mentions in free text → generate extraction patterns → locate new tuples → augmented table, and repeat.
Example: "… the Irving-based Exxon Corporation …" yields (location = Irving, organization = Exxon Corporation).
Snowball encodes the left, middle, and right contexts (ℓ, m, r) as bags of words.
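One iteration of the bootstrap loop can be sketched end to end. This toy uses exact string contexts in place of Snowball's bag-of-words encoding; the corpus, seed, and entity "tagging" (exact string lookup) are invented for illustration:

```python
import re

# Toy corpus and seed tuple; entity tagging is faked by exact string matching.
corpus = [
    "the Irving-based Exxon Corporation reported profits",
    "the Redmond-based Microsoft Corporation reported profits",
]
seed = ("Irving", "Exxon Corporation")  # (location, organization)

def learn_pattern(sentence, loc, org):
    """Encode the context around a known tuple as (left, middle, right)."""
    left, rest = sentence.split(loc, 1)
    middle, right = rest.split(org, 1)
    return left, middle, right

def match_pattern(sentence, pattern):
    """Apply a learned pattern; return a new (location, organization) or None."""
    left, middle, right = pattern
    rx = (re.escape(left) + r"(\w+)" + re.escape(middle)
          + r"([\w ]+?)" + re.escape(right) + "$")
    m = re.match(rx, sentence)
    return (m.group(1), m.group(2)) if m else None

pattern = learn_pattern(corpus[0], *seed)
found = [t for t in (match_pattern(s, pattern) for s in corpus) if t]
print(found)  # the seed tuple rediscovered, plus one new tuple
```

The real systems generalize patterns across many occurrences and score them, rather than trusting one exact context.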
Slide 6: Scoring patterns and tuples
- Pattern confidence = m⁺ / (m⁺ + m⁻) over validation tuples, where m⁺ and m⁻ count correct and incorrect extractions
- Soft-or tuple confidence: Conf(t) = 1 − ∏ᵢ (1 − Conf(Pᵢ)), over the patterns Pᵢ that extracted tuple t
- Recent improvements: urn model (Etzioni+ 2005)
- The slide contrasts DIPRE and Snowball; DIPRE uses a 5-part pattern encoding
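Both formulas can be written down directly; a small sketch, where the validation counts in the example are made up:

```python
from math import prod

def pattern_conf(m_pos, m_neg):
    """Pattern confidence m+/(m+ + m-) over validation tuples."""
    return m_pos / (m_pos + m_neg)

def tuple_conf(confs):
    """Soft-or combination: Conf(t) = 1 - prod_i (1 - Conf(P_i))."""
    return 1.0 - prod(1.0 - c for c in confs)

# A tuple extracted by two patterns, validated as 8/10 and 5/10 correct:
c = tuple_conf([pattern_conf(8, 2), pattern_conf(5, 5)])
print(round(c, 6))  # 1 - 0.2 * 0.5 = 0.9
```

The soft-or rewards redundancy: a tuple seen by several mediocre patterns can still earn high confidence.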
Slide 7: KnowItAll and C-PANKOW
A "propose-validate" approach:
- Using existing patterns, generate search-engine queries
- For each web page w returned, extract a potential fact e and assign it a confidence score
- Add the fact to the database if its score is high enough
Patterns use shallow chunk information.
Slide 8: Exploiting answer types with PMI
From two-word queries to two text boxes: an answer type and keywords to match.
- author; "Harry Potter"
- person; "Eiffel Tower"
- director; Swades movie
- city; India Pakistan cricket
Procedure: send the keywords to a search engine; every token/chunk in a returned snippet is a candidate answer (after elimination hacks that we won't discuss); then fire Hearst-pattern queries pairing the desired answer type with each candidate token/chunk.
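The validation step can be sketched with stubbed hit counts. In practice hits() would issue search-engine queries, as in Turney's PMI-IR; the numbers and the specific normalization below are invented for illustration:

```python
# Stubbed search-engine hit counts (illustrative numbers).
HITS = {
    '"cities such as Karachi"': 120,
    '"Karachi"': 50000,
    '"cities such as Khalid"': 0,
    '"Khalid"': 30000,
}

def hits(query):
    """Stand-in for a search-engine hit-count query."""
    return HITS.get(query, 0)

def pmi_score(atype_plural, candidate):
    """PMI-style score: co-occurrence with the discriminator phrase,
    normalized by the candidate's own hit count."""
    phrase = f'"{atype_plural} such as {candidate}"'
    denom = hits(f'"{candidate}"')
    return hits(phrase) / denom if denom else 0.0

# Rank candidate answers for atype "city":
cands = ["Karachi", "Khalid"]
best = max(cands, key=lambda c: pmi_score("cities", c))
print(best)
```

The normalization by the candidate's own frequency is what keeps globally popular but type-incompatible tokens from winning.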
Slide 9: Information carnivores at work
Discriminator phrases at work, and their pitfalls:
- "Garth Brooks is a country" [singer], "gift such as wall" [clock]
- "person like Paris" [Hilton], "researchers like Michael Jordan" (which one?)
Example snippet: "KO :: India Pakistan Cricket Series. A web site by Khalid Omar, sort of live from Karachi, Pakistan."
Probe queries: "cities such as [probe]", "[probe] and other cities", "[probe] is a city", etc.
Slide 10: Sample output
- author; "Harry Potter" → J K Rowling, Ron
- person; "Eiffel Tower" → Gustave, (Eiffel), Paris
- director; Swades movie → Ashutosh Gowariker, Ashutosh Gowarikar
What can search engines do to help?
- Cluster mentions and assign IDs; allow queries for IDs (expensive!)
- "Harry Potter" context yields "Ron is an author": context-induced false positives
- Ambiguity and extremely skewed Web popularity remain problems
Slide 11: System architecture (diagram repeated from Slide 3, introducing the next component: the answer type predictor)
Slide 12: Answer type (atype) prediction
A standard sub-problem in question answering; increasingly important, but more difficult, for grammar-free Web queries (Broder 2002).
Current approaches:
- Pattern matching, e.g. the head of the noun phrase adjacent to "what" or "which"; map "when", "who", "where" directly to the classes time, person, place
- Coupled perceptrons (Li and Roth 2002)
- Linear SVM on bag-of-2-grams (Hacioglu 2002)
- SVM with a tree kernel on the parse (Zhang and Lee 2004): slim gains
Surely a parse tree holds more usable information.
Slide 13: Informer span
A short, contiguous span of question tokens reveals the anticipated answer type (atype). Except in multi-function questions, one informer span is dominant and sufficient:
- What is the weight of a rhino?
- How much does a rhino weigh?
- How much does a rhino cost?
- Who is the CEO of IBM?
Pipeline: question → parse → informer span tagger → learn the atype label from the informer plus the question.
Slide 14: Example
A pre-in-post Markov process produces the question. Train a CRF with features derived from the parse tree: POS tags and attachments to neighboring chunks, at multiple levels (e.g. is this the first noun chunk? is it adjacent to the second verb?).
Parse diagram: "What is the capital city of Japan", POS-tagged WP VBZ DT NN NN IN NNP, with phrase structure WHNP / NP / PP / VP / SQ / SBARQ; level-1 chunks: What | is | the capital | city of | Japan.
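Feature generation of this flavor can be sketched per token. The exact feature set below (neighboring POS tags, a first-noun-chunk flag) is an assumption for illustration, not the talk's precise recipe, and a real tagger would feed these dictionaries into a CRF toolkit:

```python
# Sketch: per-token features for an informer-span tagger (illustrative set).
def token_features(tokens, pos_tags, chunk_ids, i):
    """Features for token i: its POS, its neighbors' POS, chunk position."""
    return {
        "word": tokens[i].lower(),
        "pos": pos_tags[i],
        "prev_pos": pos_tags[i - 1] if i > 0 else "BOS",
        "next_pos": pos_tags[i + 1] if i + 1 < len(tokens) else "EOS",
        # Multi-level cue from the slide: is this in the first noun chunk?
        "is_first_noun_chunk": chunk_ids[i] == 1 and pos_tags[i].startswith("NN"),
    }

tokens = ["What", "is", "the", "capital", "city", "of", "Japan"]
pos    = ["WP", "VBZ", "DT", "NN", "NN", "IN", "NNP"]
chunks = [0, 0, 1, 1, 1, 2, 3]  # toy chunk ids
print(token_features(tokens, pos, chunks, 4))  # features for "city"
```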
Slide 15: Atype guessing accuracy
Pipeline (diagram): question → trained CRF → filter → informer feature generator and ordinary feature generator → merged feature vector → linear SVM → atype.
Slide 16: System architecture (diagram repeated from Slide 3, introducing the next component: the ranking engine and its scoring function)
Slide 17: Scoring function for typed search
Reward an instance of the atype occurring "near" keyword (selector) matches, up to some maximum window. Prior art:
- IR systems: "hard" proximity predicates
- Search engines: unknown reward for proximity
- XML+IR, XRank: "hard" word containment in a subtree
Example. Question: Who invented the television? Atype: person#n#1; selectors: invent*, television. In the snippet "… television was invented in 1925. Inventor John Baird was born …", the candidate "John Baird" IS-A person#n#1, even though it is not the token closest to the selector matches.
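A soft proximity reward, in contrast to the "hard" predicates above, might look like the following sketch. The inverse-distance decay and the window size are assumptions (the talk learns the decay form rather than fixing it), and only tokens already known to be instances of the atype would be scored:

```python
# Sketch: score an atype-instance candidate by decayed proximity to
# selector matches, up to a maximum window (decay form is an assumption).
def proximity_score(cand_pos, selector_positions, max_window=10):
    """Sum a decaying reward from each selector match within the window."""
    score = 0.0
    for p in selector_positions:
        d = abs(cand_pos - p)
        if 0 < d <= max_window:
            score += 1.0 / d  # assumed decay: inverse token distance
    return score

# Token positions in: television(0) was invented(2) in 1925(4) . Inventor
# John(7) Baird was born ...   Selectors match at positions 0 and 2.
selectors = [0, 2]
print(proximity_score(7, selectors))  # "John" (a person instance)
print(proximity_score(4, selectors))  # "1925" scores higher, but is filtered
                                      # out because it is not IS-A person#n#1
```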
Slide 18: Learning a scoring function
- Assume a parametric form for a ranking classifier: the form of the IDF weighting, the window size, and a choice among decay-function forms
- Question-answer pairs give partial orders for training (Joachims 2004)
- Evaluation: recall in the top 50 and mean reciprocal rank (MRR)
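The pairwise reduction can be sketched with a plain perceptron standing in for the ranking SVM of Joachims (2004), to stay dependency-free; the two-dimensional feature vectors (say, a proximity score and an IDF sum) are toys:

```python
# Sketch: learn a linear scoring function from preference pairs
# (better, worse), perceptron-style, on feature-vector differences.
def train_pairwise(pairs, dim, epochs=50, lr=0.1):
    """Learn w such that w . better > w . worse for each pair."""
    w = [0.0] * dim
    for _ in range(epochs):
        for better, worse in pairs:
            margin = sum(wi * (b - x) for wi, b, x in zip(w, better, worse))
            if margin <= 0:  # preference violated: update toward the difference
                w = [wi + lr * (b - x) for wi, b, x in zip(w, better, worse)]
    return w

# Toy partial orders: the correct answer sits nearer the selector matches.
pairs = [([0.9, 0.3], [0.2, 0.3]), ([0.7, 0.5], [0.1, 0.6])]
w = train_pairwise(pairs, dim=2)
score = lambda x: sum(wi * xi for wi, xi in zip(w, x))
ranked = sorted([[0.9, 0.3], [0.2, 0.3]], key=score, reverse=True)
print(ranked[0])  # the nearer candidate ranks first
```

The SVM version adds a margin and regularization, but the reduction from partial orders to classification on differences is the same.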
Slide 19: Indexing issues
- Standard IR posting: word → {(doc, offsets)}; "word1 near word2" is a standard query
- Typed search needs: instance-of(atype) near {word1, word2, …}
- WordNet has 80,000 atype nodes, 17,000 of them internal, with depth > 10; "horse" is also indexed as mammal, animal, sports equipment, chess piece, …
- Space blow-up: original corpus 4 GB, gzipped corpus 1.3 GB, IR index 0.9 GB, full atype index 4.3 GB
- XML structure indices are not designed for fine-grained, word-as-element-node use
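The atype-expanded posting lists can be sketched as follows. The tiny hypernym map stands in for WordNet, and the "atype:" term prefix is an invented convention; indexing every token under all of its hypernyms is exactly what inflates the full atype index to 4.3 GB:

```python
from collections import defaultdict

# Toy hypernym map standing in for WordNet (illustrative).
HYPERNYMS = {"horse": ["mammal", "animal"], "rhino": ["mammal", "animal"]}

def build_index(docs):
    """postings: term -> [(doc_id, offset)], including synthetic atype terms."""
    postings = defaultdict(list)
    for doc_id, text in enumerate(docs):
        for off, tok in enumerate(text.lower().split()):
            postings[tok].append((doc_id, off))
            for atype in HYPERNYMS.get(tok, []):  # the blow-up happens here
                postings["atype:" + atype].append((doc_id, off))
    return postings

idx = build_index(["the horse ran", "a rhino charged"])
print(idx["atype:mammal"])  # both occurrences, queryable like any word
```

With this layout, "instance-of(mammal) near ran" reduces to an ordinary proximity query over the postings of "atype:mammal" and "ran".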
Slide 20: Exploit skew in query atypes?
- Index only a small registered set of atypes R
- Relax the query atype a to a generalization g in R
- Test each response for reachability (is it truly an instance of a?) and retain or discard it
- How to pick R? What is a good objective? The relaxed query and the discarding steps cost extra time
- Rare atypes appear in "what", "which", and "name" questions: a long-tailed distribution
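Relaxation and the reachability post-filter can be sketched over a toy child-to-parent taxonomy (invented for illustration):

```python
# Toy taxonomy: child -> parent (illustrative stand-in for WordNet).
PARENT = {"stallion": "horse", "horse": "mammal", "mammal": "animal"}

def relax(atype, registered):
    """Walk up the taxonomy until we reach an atype in the registered set R."""
    g = atype
    while g is not None and g not in registered:
        g = PARENT.get(g)
    return g

def is_instance(token_atype, query_atype):
    """Reachability test: is token_atype a descendant of (or equal to)
    the original query atype?"""
    t = token_atype
    while t is not None:
        if t == query_atype:
            return True
        t = PARENT.get(t)
    return False

R = {"mammal", "animal"}
g = relax("stallion", R)  # query the coarser "mammal" postings instead
print(g)
print(is_instance("stallion", "horse"))  # post-filter keeps true instances
```

The trade-off on the slide is visible here: the coarser g retrieves more candidates, and the is_instance filter pays the extra query time back.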
Slide 21: Approximate objective and approach
- Approximate both the index space and the expected query-time bloat
- Minimize the approximate index space subject to an upper bound on the bloat (hard, as expected)
- Sparseness: queryProb(a) is observed to be zero for most atypes a in a large taxonomy
- Smooth queryProb using similarity between atypes
Slide 22: Sample results
- The index space approximation is reasonable
- Average query-time bloat is reasonable with small index-space overheads
(Chart: runtime across queries, comparing "using a" against "using g".)
Slide 23: Summary
- Entity and relation annotators: a maturing technology, but unlikely to be perfect for open-domain sources
- The future: query paradigms that combine text and annotations, with end-user-friendly selection and aggregation; allow uncertainty, exploit redundancy
- Open questions: Can we scale to terabytes of text? Will centralized search engines be feasible? How do we federate annotation management?