Published by Alban Edwards. Modified over 9 years ago.
1
Structured Querying of Web Text: A Technical Challenge
Michael J. Cafarella, Christopher Re, Dan Suciu, Oren Etzioni, Michele Banko
University of Washington
Asilomar, CA
January 9, 2007
2
Structured Queries, Unstructured Data
“Show me some people, what they invented, and the years they died”
q(?a, ?b, ?c) :- invented(?a, ?b), died-in(?a, ?c)

a           b                 c     prob
Kepler      log books         1630  .7902
Heisenberg  matrix mechanics  1976  .7897
Galileo     telescope         1642  .7395
Newton      calculus          1727  .7366
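The query on this slide is a two-atom conjunctive query that joins on the shared variable ?a. As a rough illustration only, the sketch below evaluates it with a hash join over an in-memory list of triples. The function name `conjunctive_query` and the list `FACTS` are hypothetical; the triples are read off the result rows shown on the slide, and probability scoring is left out.

```python
from collections import defaultdict

# Toy (obj1, pred, obj2) triples, taken from the result rows on the slide.
FACTS = [
    ("Kepler", "invented", "log books"),
    ("Kepler", "died-in", "1630"),
    ("Heisenberg", "invented", "matrix mechanics"),
    ("Heisenberg", "died-in", "1976"),
    ("Galileo", "invented", "telescope"),
    ("Galileo", "died-in", "1642"),
    ("Newton", "invented", "calculus"),
    ("Newton", "died-in", "1727"),
]


def conjunctive_query(facts, pred1, pred2):
    """Answer q(?a, ?b, ?c) :- pred1(?a, ?b), pred2(?a, ?c) via a hash join on ?a."""
    # Build side: index every pred2 tuple by its first argument (?a).
    by_subject = defaultdict(list)
    for obj1, pred, obj2 in facts:
        if pred == pred2:
            by_subject[obj1].append(obj2)

    # Probe side: for each pred1 tuple, emit one answer per matching pred2 tuple.
    answers = []
    for obj1, pred, obj2 in facts:
        if pred == pred1:
            for c in by_subject.get(obj1, []):
                answers.append((obj1, obj2, c))
    return answers


ROWS = conjunctive_query(FACTS, "invented", "died-in")
```

Each element of `ROWS` is one binding of (?a, ?b, ?c). A real system would also carry a probability per tuple and rank the answers by it.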
3
ExDB
Pipeline: 1. Run extractors  2. Populate data model  3. Queries

Web
…no one could surprising. In 1877, Edison invented the phonograph. Although he…
…didnt surprising. In 1877, Edison invented the phonograph. Although he…
…was surprising. In 1877, Edison invented the phonograph. Although he…

Facts
Obj1    Pred      Obj2        prob
Edison  invented  light bulb  0.97
Morgan  born-in   1837        0.85

Types
Type       Instance  prob
scientist  Einstein  0.99
city       Seattle   0.92

Synonyms
Pred1     Pred2       prob
invented  did-invent  0.85
invented  created     0.72

RDBMS
Query middleware
invented(Edison ?e, ?i)
5
Information Extraction
Each concept has an IE mechanism

Example                               Description       IE technique
invented(Edison, phonograph)          Arity-2 fact      TextRunner
Einstein                              Type (hypernymy)  KnowItAll
has-invented = invented               Synonymy          DIRT
invented discovered                   ID (troponymy)    ?
FD: has-capital(x, y) has-capital(y)  FD (rule)         ?
7
Populate Data Model
Use extractions to fill tables

Source text
It was big news when Edison invented the light bulb.
He visited cities such as Boston and New York.
We all know that Edison invented the light bulb.
…
In 1877 Edison created the light bulb.

Facts
Obj1    Pred      Obj2        prob
Edison  invented  light bulb  0.97
Morgan  born-in   1837        0.85

Types
Type       Instance  prob
scientist  Einstein  0.99
city       Boston    0.92

Synonyms
Pred1     Pred2       prob
invented  did-invent  0.85
invented  created     0.72

IDs
Inclusion  Includer    prob
invented   discovered  0.81
Seattle    Washington  0.65

FDs
LHS            RHS         prob
capital(x, y)  capital(y)  0.77
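To make the five tables concrete, here is a minimal sketch that creates them in an in-memory SQLite database and loads the example rows from this slide. SQLite is only a stand-in so the example runs anywhere (the prototype itself is described as built on DB2). The table names, column names, and the `populate` helper are assumptions made for illustration, and the sketch does not model how the extractors assign probabilities.

```python
import sqlite3

# Hypothetical schema mirroring the five tables on the slide.
SCHEMA = """
CREATE TABLE facts    (obj1 TEXT, pred TEXT, obj2 TEXT, prob REAL);
CREATE TABLE types    (type TEXT, instance TEXT, prob REAL);
CREATE TABLE synonyms (pred1 TEXT, pred2 TEXT, prob REAL);
CREATE TABLE ids      (inclusion TEXT, includer TEXT, prob REAL);
CREATE TABLE fds      (lhs TEXT, rhs TEXT, prob REAL);
"""

# Example rows copied from the slide.
ROWS = {
    "facts": [
        ("Edison", "invented", "light bulb", 0.97),
        ("Morgan", "born-in", "1837", 0.85),
    ],
    "types": [("scientist", "Einstein", 0.99), ("city", "Boston", 0.92)],
    "synonyms": [("invented", "did-invent", 0.85), ("invented", "created", 0.72)],
    "ids": [("invented", "discovered", 0.81), ("Seattle", "Washington", 0.65)],
    "fds": [("capital(x, y)", "capital(y)", 0.77)],
}


def populate(conn):
    """Create the tables and bulk-insert the example extractions."""
    conn.executescript(SCHEMA)
    for table, rows in ROWS.items():
        # One "?" placeholder per column of this table.
        placeholders = ", ".join("?" * len(rows[0]))
        conn.executemany(f"INSERT INTO {table} VALUES ({placeholders})", rows)
    conn.commit()


conn = sqlite3.connect(":memory:")
populate(conn)
```

Once loaded, an ordinary SQL query such as `SELECT obj2, prob FROM facts WHERE obj1 = 'Edison' AND pred = 'invented'` returns the stored extraction together with its probability.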
9
Query Processing
For non-projecting queries, we can compute top-k answers
  The combination function is the product of probabilities
For projecting queries, we compute the disjunction of m probabilistic events
  In general NP-hard, so we approximate using the panel of experts
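The two cases above can be sketched in a few lines. The conjunction score follows the slide directly (a product of tuple probabilities). The disjunction function, however, is a deliberate simplification: it assumes the m events are independent, which is exactly the case that is easy, and it is not the panel-of-experts approximation mentioned on the slide. All function names and the probabilities in `ANSWERS` are hypothetical.

```python
import heapq
from math import prod


def conjunction_prob(probs):
    """Score one non-projecting answer as the product of its tuple probabilities."""
    return prod(probs)


def top_k(answers, k):
    """Return the k highest-scoring (score, answer) pairs.

    answers: iterable of (answer, [probabilities of the tuples it joins]).
    """
    scored = ((conjunction_prob(ps), ans) for ans, ps in answers)
    return heapq.nlargest(k, scored)


def disjunction_prob_independent(probs):
    """P(e1 or ... or em), ASSUMING the m events are mutually independent.

    With correlated events this formula is wrong, which is why the general
    case needs an approximation scheme.
    """
    return 1.0 - prod(1.0 - p for p in probs)


# Hypothetical probabilities, chosen only to make the arithmetic easy to check.
ANSWERS = [
    (("Edison", "light bulb"), [0.5, 0.5]),
    (("Edison", "phonograph"), [0.5, 0.25]),
]
BEST = top_k(ANSWERS, 1)
```

Projecting a variable away merges several derivations into one answer, so the answer holds if any derivation holds. That is where the disjunction comes from, and shared tuples between derivations are what break the independence assumption.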
10
Related Work
Query systems: CIMple (CIDR07), AVATAR (DEBul06), Liu, Dong, Halevy (WebDB06), Gubanov and Bernstein (WebDB06)
Extraction: Sarawagi (VLDB06 and others), Etzioni (WWW04), …
Probabilistic DBs: MYSTIQ, Trio, …
Deep web, reference reconciliation, …
11
Our prototype
Web crawl: 90M pages
Facts: 338M tuples, 102M objects
Types: 6.6M instances
Synonyms: 17k pairs
No IDs or FDs yet
Most queries in ~30 seconds
Built on DB2 with custom middleware; we want to try a compressed C-store