Published by Lucas Hubbard. Modified over 9 years ago.
1
Structured Querying of Web Text: A Technical Challenge
Michael J. Cafarella, Christopher Re, Dan Suciu, Oren Etzioni, Michele Banko
Presenter: Shahina Ferdous
ID: 1000630375
Date: 03/23/10
2
Querying over Unstructured Data
The Web (text documents) contains a vast amount of text documents, which are:
  Unstructured
  Accessed by keywords
  Limited in search quality
3
Querying over Unstructured Data
Query to the Web: “Show me some people, what they invented, and the years they died”
Keyword in, document out
4
Querying over Unstructured Data
Query to the Web: “List some scientists with their inventions and the years they died”
Keyword in, document out
5
Structured Querying of Web Text
Query: “Show me some people, what they invented, and the years they died”

Scientist    Invention          Year   Prob
Kepler       log books          1630   .7902
Heisenberg   matrix mechanics   1976   .7897
Galileo      telescope          1642   .7395
Newton       calculus           1727   .7366

The paper proposes a structured Web query system called an extraction database, ExDB. ExDB uses information extraction (IE) systems to extract data. Because the extracted data can be erroneous, ExDB assigns a probability to each tuple.
6
ExDB Work Flow
1. Run extractors over Web text, e.g. “…was surprising. In 1877, Edison invented the phonograph. Although he…”
2. Populate data model (stored in an RDBMS):

Facts
Obj1     Pred       Obj2         prob
Edison   invented   phonograph   0.97
Morgan   born-in    1837         0.85

Types
Type        Instance   prob
scientist   Einstein   0.99
city        Seattle    0.92

Synonyms
Pred1      Pred2        prob
invented   did-invent   0.85
invented   created      0.72

3. Query processing & applications, through query middleware, e.g. invented(Edison ?e, ?i)
7
Information Extraction
ExDB extracts several base-level concepts through a combination of existing IE techniques:
  Objects are data values in the system. Examples: Einstein, telephone, Boston, light-bulb.
  Predicates represent binary relations between pairs of objects. Examples: discovered(Edison, phonograph), born-in(A. Einstein, Switzerland) and sells(Amazon, PlayStation).
  Semantic types represent unary relations over objects. Examples: city(Boston), city(New-York) and electronics(dvd-player).
8
Information Extraction
ExDB should also extract several further kinds of relationships to make queries even easier for the user:
  Synonyms denote equivalent objects, predicates or types. Examples: Einstein and A. Einstein almost certainly refer to the same object; invented and has-invented refer to the same predicate.
  Inclusion dependencies describe a subset relationship between two predicates. Example: invented(?x, ?y) ⊆ discovered(?x, ?y).
  Functional dependencies are useful for answering queries with negation, or for explaining why an object is not an answer. For example, a probabilistic FD indicating that a person can be born in only one country:
    born-in(?x, ?y): ?x -> ?y  p=0.95
  Suppose a user asks for “all scientists born in Germany that taught at Princeton” and, after receiving the answers, asks the system “Why is Einstein not an answer?” Using the FD above, the system can answer: the database contains born-in(Einstein, Switzerland), and the FD says a person can be born in only one country, so the probability of born-in(Einstein, Germany) is very low.
9
Information Extraction

Example                                     Description        IE technique
invented(Edison, phonograph)                Arity-2 fact       TextRunner
Einstein                                    Type (hypernymy)   KnowItAll
has-invented = invented                     Synonymy           DIRT
invented ⊆ discovered                       ID (troponymy)     ?
FD: has-capital(x, y) -> has-capital(y)     FD (rule)          ?
11
Populate Data Model

Source text:
  It was big news when Edison invented the phonograph…
  He visited cities such as Boston and New York.
  We all know that Edison did-invent the light bulb.
  … In 1877 Edison created the phonograph.
  Morgan was born-in 1837 into a prosperous mercantile-banking family…
  Einstein is one of the best known scientists and intellectuals of all time.

Facts
Obj1     Pred       Obj2         prob
Edison   invented   phonograph   0.97
Morgan   born-in    1837         0.85

Types
Type        Instance   prob
scientist   Einstein   0.99
city        Boston     0.92

Synonyms
Pred1      Pred2        prob
invented   did-invent   0.85
invented   created      0.72

IDs
Inclusion   Includer     prob
invented    discovered   0.81
Seattle     Washington   0.65

FDs
LHS             RHS          prob
capital(x, y)   capital(y)   0.77
born-in(x)      country(y)   0.95

TextRunner: For fact extraction, ExDB uses an unsupervised system called TextRunner. TextRunner generates a large set of extractions while running over the entire corpus of text. Unlike other IE systems, it does not require a set of target predicates to be specified beforehand. Instead, it starts by using a heavyweight linguistic parser to generate high-quality extraction triples. These high-quality triples then serve as the training set for a lightweight extraction classifier that can run on the entire web-scale corpus.

KnowItAll: For type extraction, ExDB uses the KnowItAll system. KnowItAll searches the entire corpus to extract hypernym or “is-a” relationships. For example, it extracts city(Boston) from “cities such as Seattle and Boston”. It assigns each extraction a probability based on its frequency (or search engine hit count).

DIRT: ExDB uses the DIRT algorithm to extract predicate synonyms. DIRT computes the degree to which the argument pairs of two predicates coincide. For example, invented and has-invented will share many argument pairs, such as Edison/light-bulb or Einstein/theory-of-relativity.
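The idea of comparing argument pairs can be illustrated with a small sketch. It uses a plain Jaccard overlap as a stand-in, which is an assumption made here for illustration and is not DIRT's actual similarity measure; the function names and the sample triples are likewise hypothetical.

```python
def argument_pairs(facts, predicate):
    """Collect the set of (obj1, obj2) pairs observed with a predicate."""
    return {(o1, o2) for (o1, pred, o2) in facts if pred == predicate}


def pair_overlap(facts, pred_a, pred_b):
    """Jaccard overlap of two predicates' argument-pair sets (0.0 to 1.0).

    A simplified stand-in for a synonym score, not DIRT's real measure.
    """
    a = argument_pairs(facts, pred_a)
    b = argument_pairs(facts, pred_b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


# Illustrative triples only; not taken from the ExDB corpus.
facts = [
    ("Edison", "invented", "light-bulb"),
    ("Edison", "has-invented", "light-bulb"),
    ("Einstein", "invented", "theory-of-relativity"),
    ("Einstein", "has-invented", "theory-of-relativity"),
    ("Edison", "invented", "phonograph"),
    ("Dukakis", "visited", "Boston"),
]

overlap_syn = pair_overlap(facts, "invented", "has-invented")
overlap_other = pair_overlap(facts, "invented", "visited")
```

In this toy data, invented and has-invented share two of three distinct argument pairs, while invented and visited share none, so only the first pair would be a synonym candidate.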
13
ExDB Queries
ExDB lets users query the Web data model using a Datalog-like notation.
Example: q(?i) :- invented(Edison, ?i) returns all inventions by Edison.
Example with constraints: q(?x, ?y) :- died-in(?x, 1955 ?y)
Example query for locally available inexpensive electronics: q(?x, ?y, ?z) :- for-sale-in(?x, Seattle ?y), costs(?x, ?z), (?z < 25)
Another example: q(?x, ?y, ?z) :- invented(?x, ?y), died-in(?x, ?z), (?z < 1900)
Example of a projection query: q(?s) :- invented(?s, ?i)
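To make the notation concrete, here is a minimal sketch of how such a conjunctive query could be evaluated over a table of facts. It is not ExDB's query middleware: the tuple layout (predicate, obj1, obj2), the function names, and the sample facts are assumptions for illustration, and probabilities are ignored entirely.

```python
def is_var(term):
    """Query variables are strings that start with '?', e.g. '?x'."""
    return isinstance(term, str) and term.startswith("?")


def match_atom(atom, fact, binding):
    """Unify one query atom with one fact; return the extended binding or None."""
    extended = dict(binding)
    for term, value in zip(atom, fact):
        if is_var(term):
            if extended.setdefault(term, value) != value:
                return None  # variable already bound to a different value
        elif term != value:
            return None  # constant in the query does not match the fact
    return extended


def evaluate(head_vars, body, facts, constraint=None):
    """Nested-loop evaluation of a conjunctive query over (pred, obj1, obj2) facts."""
    bindings = [{}]
    for atom in body:
        next_bindings = []
        for binding in bindings:
            for fact in facts:
                extended = match_atom(atom, fact, binding)
                if extended is not None:
                    next_bindings.append(extended)
        bindings = next_bindings
    if constraint is not None:
        bindings = [b for b in bindings if constraint(b)]
    return [tuple(b[v] for v in head_vars) for b in bindings]


# Hypothetical facts, stored as (predicate, obj1, obj2) without probabilities.
facts = [
    ("invented", "Edison", "phonograph"),
    ("invented", "Edison", "light-bulb"),
    ("invented", "Newton", "calculus"),
    ("invented", "Einstein", "relativity"),
    ("died-in", "Newton", 1727),
    ("died-in", "Einstein", 1955),
]

# q(?i) :- invented(Edison, ?i)
edison_inventions = evaluate(["?i"], [("invented", "Edison", "?i")], facts)

# q(?x, ?y, ?z) :- invented(?x, ?y), died-in(?x, ?z), (?z < 1900)
died_before_1900 = evaluate(
    ["?x", "?y", "?z"],
    [("invented", "?x", "?y"), ("died-in", "?x", "?z")],
    facts,
    constraint=lambda b: b["?z"] < 1900,
)
```

The shared variable ?x is what turns the two atoms into a join: a binding survives only if the same object satisfies both atoms.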
14
Query Processing
Non-projecting queries:
  Involve a series of joins against tables in the Web data model.
  The probability of a joined tuple is the product of the individual tuples' probabilities.
  The top-k result tuples, ranked by probability, are returned.

Types (prob column: 0.99, 0.98, 0.95, 0.92, 0.91, …)
Object        Class
einstein      scientist
boston        city
bohr          scientist
france        country
curie         scientist
Bugs bunny    scientist

Facts (prob column: 0.99, 0.97, 0.96, 0.93, 0.92, 0.01, …)
Object1    Predicate     Object2
einstein   invented      relativity
1848       Was-year-of   revolution
edison     invented      phonograph
dukakis    visited       boston
einstein   died-in       1955
humans     have          Cold-fusion

Example: q(?x, ?y, ?z) :- invented(?x, ?y), died-in(?x, ?z).

Scientist   Invented     Died-in   prob
einstein    relativity   1955      0.90
…
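The product rule and top-k ranking described on this slide can be sketched as follows. The function names, the tuple layout (obj1, obj2, prob), and all probabilities below are hypothetical values chosen for illustration, not figures from the ExDB tables.

```python
import heapq


def join_on_first(left, right):
    """Join two lists of (obj1, obj2, prob) rows on obj1.

    The probability of each joined row is the product of the two
    input probabilities, as stated for non-projecting queries.
    """
    joined = []
    for (x1, y, p1) in left:
        for (x2, z, p2) in right:
            if x1 == x2:
                joined.append((x1, y, z, p1 * p2))
    return joined


def top_k(rows, k):
    """Return the k rows with the highest probability (last field)."""
    return heapq.nlargest(k, rows, key=lambda row: row[-1])


# Hypothetical probabilities, for illustration only.
invented = [("einstein", "relativity", 0.99), ("newton", "calculus", 0.80)]
died_in = [("einstein", 1955, 0.92), ("newton", 1727, 0.90)]

joined = join_on_first(invented, died_in)
best = top_k(joined, 1)
```

With these inputs the einstein row scores 0.99 × 0.92 and the newton row 0.80 × 0.90, so the einstein row is ranked first.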
15
Projecting queries
q(?s) :- invented(?s, ?i) ranks scientists according to the probability that the scientist invented something, without caring much about the actual invention.
This requires computing a disjunction of m probabilistic events.
A scientist such as Tesla appears in the output of q if a tuple invented(Tesla, I_0) is in the database. There can be many inventions I_1, …, I_m for Tesla such that invented(Tesla, I_i) is in the database, and any one of them is sufficient to return Tesla as an answer for q.
Because m can be very large, a large number of very low-probability extractions can unexpectedly add up to quite a large probability.
Therefore, the tuples are instead abstracted as a panel of experts, where an expert is a tuple with a score, such as invented(Tesla, Fluorescent-Lighting), 0.95, and the experts determine the probability of the scientist appearing in q.
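The problem with a naive disjunction can be seen in a short sketch. It assumes the m extraction events are independent, which is an assumption made here for illustration; it shows only why low-probability extractions accumulate, not the panel-of-experts method itself. The function name and input values are hypothetical.

```python
import math


def disjunction_probability(probs):
    """P(at least one event holds) = 1 - prod(1 - p_i).

    Assumes the events are independent (an illustrative assumption).
    """
    return 1.0 - math.prod(1.0 - p for p in probs)


# One confident extraction versus 100 very doubtful ones (made-up values).
one_strong = disjunction_probability([0.95])
many_weak = disjunction_probability([0.01] * 100)
```

Under this assumption, a hundred extractions at probability 0.01 each combine to roughly 0.63, so a scientist supported only by noise can rank close to one supported by a single reliable tuple.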
16
Result of Projecting Queries
q(?s) :- invented(?s, x)
[Result listing: Scientist, invented]
17
ExDB Prototype
Web crawl: 90M pages
Facts: 338M tuples, 102M objects
Types: 6.6M instances
Synonyms: 17k pairs
No IDs or FDs yet
18
Applications
ExDB's extracted data is not meant to be examined directly; rather, it is used to build topic-specific tables that a human user can appreciate.
Example: a synthetic table about scientists, generated by merging answers from died-in(?x, ?y), invented(?x, ?y), published(?x, ?y) and taught(?x, ?y).
If ExDB queries can be generated automatically from keywords, a very powerful query system can be built.
A Web data cube can be built over the large amount of read-only structured data in ExDB.
19
Alternative Models
The Schema Extraction Model aims to find the single best schema for the entire set of extractions, transforming the Web text into a traditional relational database.
Three good criteria for schema extraction are:
  Simplicity (few tables).
  Completeness (all extractions appear in the output).
  Fullness (the output database has no NULLs).
20
Alternative Models
The Text Query Model performs no information extraction at all; instead, it offers a descriptive query language that generates answers to a user's query very quickly.
Example user's query:
  Extract city/date tuples from a band's website.
  Indicate the city where the user lives.
  Compute the dates when the band's city and the user's city are within 100 miles of each other.
21
Questions? Thank You