Structured Querying of Web Text: A Technical Challenge. Kulsawasd Jitkajornwanich, University of Texas at Arlington, CSE6339 Web Mining.

Presentation transcript:

1 Structured Querying of Web Text: A Technical Challenge
  by Cafarella, Ré, Suciu, Etzioni & Banko
  Presented by Kulsawasd Jitkajornwanich, University of Texas at Arlington (kulsawasdj@hotmail.com)
  CSE6339 Web Mining | April 16, 2009 | 9:30 am

2 Introduction
  What is a structured query?
  There are two types of query: structured and unstructured.
  1. Structured query
     - Has a condition in the query
     - Can express complicated queries
     - Example: an SQL query. List the employees whose name starts with 'David' and whose salary is greater than 5000:
         SELECT E.name, E.salary
         FROM Employee E
         WHERE E.name LIKE 'David%' AND E.salary > 5000
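To make the contrast between the two query types concrete, here is a minimal sketch that runs the slide's SQL query with Python's built-in sqlite3 module and compares it with a plain keyword match. The three employee rows are invented for illustration only; the query uses `LIKE 'David%'` and `AND`, which is what "name starts with 'David' and salary > 5000" requires.

```python
import sqlite3

# Hypothetical sample rows, not from the slides.
EMPLOYEES = [("David Smith", 6000), ("David Jones", 4000), ("Maria David", 9000)]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Employee (name TEXT, salary INTEGER)")
conn.executemany("INSERT INTO Employee VALUES (?, ?)", EMPLOYEES)

# Structured query: two conditions combined with AND.
structured_hits = conn.execute(
    "SELECT E.name, E.salary FROM Employee E "
    "WHERE E.name LIKE 'David%' AND E.salary > 5000"
).fetchall()

# Unstructured query: plain string matching on the keyword, no conditions.
keyword_hits = [name for name, _ in EMPLOYEES if "David" in name]

print(structured_hits)  # only the row that satisfies both conditions
print(keyword_hits)     # every row that mentions the keyword
```

The keyword match returns every name containing "David", while the structured query can also enforce the prefix and salary conditions.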

3 Introduction
  What is a structured query?
  2. Unstructured query
     - Example: keyword search
     - Has no condition in the query
     - Simply does string matching

4 Introduction
  We have just talked about types of query. What about types of data?
  There are two types of data:
  1. Structured data
     - Example: relational tables
  2. Unstructured data
     - Example: web documents

5 Introduction
  Objective of the paper: to propose a tool called ExDB that makes structured queries on web documents (unstructured data).
  [Diagram relating query types to data types. Labels: Relational Database, Web Text, SQL Query, Search Engine, ExDB; unstructured query (keyword search), structured query (a complicated query such as an SQL query); structured data, unstructured data.]

6 How it works: Big Picture of ExDB
  [Diagram: collection of web documents -> ExDB Extractor -> Fact Table, Type Table, Constraint Table (stored in an RDBMS database). The user sends a query such as q(?x,?y):- invented(?x,?y) to the ExDB Compiler, which returns the resulting table.]


8 Outline
  - 1st Component: ExDB Extractor
    What does it do, and how, in more detail?
  - 2nd Component: ExDB Compiler
    What does it do, and how, in more detail?
  - Test your understanding!
  - Working on tasks
  - Comparing the results of ExDB and Google
  - Conclusion

9 How ExDB Works
  1st Component: ExDB Extractor
  What does it do? It extracts data from the web documents and puts it into tables.

10 How ExDB Works
  2nd Component: ExDB Compiler
  What does it do? It processes the user's structured query on the tables produced by the 1st component (ExDB Extractor) and returns the resulting table to the user.
  Example: q(?x, ?y):- invented(?x, ?y)

11 How it works: Big Picture of ExDB
  [Diagram: 1st Component (ExDB Extractor): a collection of web documents (e.g. "... was surprising. In 1877, Edison invented the light bulb. Although he ...") -> ExDB Extractor -> Fact Table, Type Table, Constraint Table in an RDBMS database. 2nd Component (ExDB Compiler): the user makes a query using ExDB syntax -> ExDB Compiler.]

12 1st Component: ExDB Extractor
  What does it do? It extracts data from the web documents and puts it into tables.
  There are 3 tables:
  1. Fact Table
  2. Type Table
  3. Constraint Table
  Each table has an additional column that stores the tuple probability.
  - Discussion: why do we need this column?
  - 0 < p < 1, Σ p_i = 1
  - One way to assign the probability: count occurrence frequency
  - Assume independence among tuples

13 1st Component: ExDB Extractor
  1.1 Fact Table
  - Stores fact information, e.g. "Edison invented light bulb"
  - Uses TextRunner to extract
  - What does it look like?

  Fact Table
  Predicate | Object 1 | Object 2   | Probability
  invented  | Edison   | Light bulb | 0.75
  died-in   | Edison   | 1877       | 0.55
  ...       | ...      | ...        | ...

  Probability = (number of occurrences) / (number of predicate occurrences)

14 Example 1: how the Fact Table is built
  Sentences in the web documents (processed by TextRunner):
  - "... was surprising. In 1877, Edison invented the light bulb. Although he ..."
  - "It was big news when Edison invented the light bulb."
  - "... We all know that Edison invented light bulb."
  - "... not only that, Edison also invented the phonograph."

  Fact Table
  Predicate | Object 1 | Object 2   | Probability
  invented  | Edison   | Light bulb | 0.75
  invented  | Edison   | Phonograph | 0.25
  ...       | ...      | ...        | ...

  Probability = (number of occurrences) / (number of predicate occurrences)
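The counting rule on this slide can be sketched in a few lines of Python. This is only an illustration of "occurrences divided by predicate occurrences"; it assumes the extractor has already produced (predicate, object 1, object 2) triples, one per matching sentence, and the function name `fact_probabilities` is made up for the example.

```python
from collections import Counter

# One triple per sentence in Example 1 (assumed extractor output).
extractions = [
    ("invented", "Edison", "light bulb"),
    ("invented", "Edison", "light bulb"),
    ("invented", "Edison", "light bulb"),
    ("invented", "Edison", "phonograph"),
]

def fact_probabilities(triples):
    """Probability of a triple = its count / count of all triples with the same predicate."""
    triple_counts = Counter(triples)
    predicate_counts = Counter(predicate for predicate, _, _ in triples)
    return {
        triple: count / predicate_counts[triple[0]]
        for triple, count in triple_counts.items()
    }

fact_table = fact_probabilities(extractions)
for (predicate, obj1, obj2), prob in fact_table.items():
    print(predicate, obj1, obj2, prob)
```

With three "light bulb" sentences and one "phonograph" sentence, the two rows get 3/4 and 1/4, matching the table on the slide.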

15 1st Component: ExDB Extractor
  Discussion: what might be a problem with this design of the Fact Table?
  It cannot support ternary predicates, e.g. "David donates books to Child Organization."

  Fact Table
  Predicate | Object 1 | Object 2   | Probability
  invented  | Edison   | Light bulb | 0.75
  died-in   | Edison   | 1877       | 0.55
  ...       | ...      | ...        | ...

16 1st Component: ExDB Extractor
  1.2 Type Table
  - Stores object type information, e.g. "Edison is a scientist."
  - Uses KnowItAll to extract
  - What does it look like?

  Type Table
  Type      | Object | Probability
  Scientist | Edison | 0.73
  City      | Boston | 0.36
  ...       | ...    | ...

  Probability = (number of occurrences) / (number of type occurrences)

17 Example 2: how the Type Table is built
  Sentences in the web documents (processed by KnowItAll):
  - "... As we know, Edison is a scientist. Although he ..."
  - "... there are many world-famous scientists such as Edison, ..."
  - "However, someone claims that Benjamin is also a scientist."
  - "... scientists such as Edison, ..."

  Type Table
  Type      | Object   | Probability
  scientist | Edison   | 0.75
  scientist | Benjamin | 0.25
  ...       | ...      | ...

  Probability = (number of occurrences) / (number of type occurrences)

18 1st Component: ExDB Extractor
  1.3 Constraint Table
  - Stores constraint information about objects or predicates
  - Two types of constraint are discussed in this paper: Synonym and Inclusion Dependency
  - Uses DIRT to extract
  1. Synonym
     - Example for predicates: did-invented = invented
     - Example for objects: Edison T. = Edison
  2. Inclusion Dependency
     - Example for predicates: be-guardian ⊇ be-parent
     - Example for objects: relative ⊇ sister

19 Example: how DIRT works for the Synonym constraint
  [Diagram: DIRT reads the collection of web documents (e.g. "... was surprising. In 1877, Edison invented the light bulb. Although he ...") and relates the names Thomas E., Edison T., and Thomas Edison.]

20 Example: how DIRT works for the Inclusion Dependency constraint
  [Diagram: DIRT reads the collection of web documents (e.g. "... was surprising. In 1877, Edison invented the light bulb. Although he ...") and relates the predicates be-parent, be-guardian, and be-babysitter.]

21 1st Component: ExDB Extractor
  1.3 Constraint Table
  What does it look like?

  Constraint Table
  Constraint           | Object 1  | Object 2    | Probability
  Synonym              | Edison T. | Edison      | 0.75
  Inclusion Dependency | be-parent | be-guardian | 0.55
  ...                  | ...       | ...         | ...

  [Diagram labels on the Inclusion Dependency row: Superset, Subset.]

22 1st Component: ExDB Extractor
  Key points of the 1st component (ExDB Extractor):
  1. The ExDB Extractor uses several existing extractors: TextRunner, KnowItAll, and DIRT.
  2. A probability column indicates the degree of correctness of each tuple and deals with uncertainty.
  3. Drawback of the Fact Table: only binary predicates are allowed.

23 How it works: Big Picture of ExDB
  [Diagram repeated from slide 11, introducing the 2nd Component (ExDB Compiler): the user makes a query using ExDB syntax, and the ExDB Compiler evaluates it over the Fact Table, Type Table, and Constraint Table in the RDBMS database.]

24 2nd Component: ExDB Compiler
  What does it do?
  - It processes the user's structured query on the tables produced by the 1st component (ExDB Extractor).
  - The result is returned as a table, ranked from highest to lowest probability.
  - Example: q(?x, ?y):- invented(?x, ?y)
  - However, users are not expected to know the table schema.

25 2nd Component: ExDB Compiler
  ExDB syntax:
  - ?x = variable x
  - w = constant value w
  - q(?x,?y):- = defines the resulting table q, consisting of columns x and y
  - invented(?x,?y) = returns the list of objects x and y related by the predicate "invented"
  - invented(<type> ?x,?y) = returns the list of objects x whose type is <type>, and y, related by the predicate "invented"
  This syntax is called "Datalog-like notation".
  Example: q(?x, ?y):- invented(?x, ?y)
  Let's try some examples!
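As a rough illustration of the notation, the sketch below parses a query string into a head and a list of body atoms with regular expressions. It is not the ExDB compiler's parser. It assumes that a type restriction is written in angle brackets before a variable (for example `<scientist> ?x`), and it ignores comparison conditions such as `(?z > 1900)`; the function names are hypothetical.

```python
import re

# An atom looks like name(term, term, ...); names may contain hyphens (e.g. died-in).
ATOM = re.compile(r"([\w-]+)\(([^()]*)\)")
# A term is an optional <type> annotation followed by a variable (?x) or a constant.
TERM = re.compile(r"^(?:<([\w-]+)>\s*)?(.+)$")

def parse_term(text):
    type_name, value = TERM.match(text.strip()).groups()
    value = value.strip()
    return {
        "type": type_name,               # None when no <type> annotation is given
        "is_var": value.startswith("?"),
        "value": value.lstrip("?"),
    }

def parse_atom(match):
    name, args = match.groups()
    return {"name": name, "terms": [parse_term(arg) for arg in args.split(",")]}

def parse_query(query):
    """Split 'head :- body' and parse every atom; bare comparisons are skipped."""
    head_text, body_text = query.split(":-", 1)
    head = parse_atom(ATOM.search(head_text))
    body = [parse_atom(m) for m in ATOM.finditer(body_text)]
    return head, body

head, body = parse_query("q(?x, ?y):- invented(<scientist> ?x, ?y)")
print(head)
print(body)
```

The parsed head names the output columns, and each body atom names a predicate to look up in the Fact Table together with any type restriction to check against the Type Table.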

26 Make a Query
  Example 4: list all inventions invented by Edison.

  Fact Table
  Predicate | Object 1 | Object 2   | Probability
  invented  | Edison   | Light bulb | 0.55
  invented  | Edison   | Telescope  | 0.14
  invented  | Edison   | Phonograph | 0.14
  invented  | Jason    | Cell phone | 0.14
  died-in   | Mary T.  | 1877       | 0.05

  Answer: q(?i):- invented(Edison, ?i)

  q Table
  i          | Probability
  Light bulb | 0.55
  Telescope  | 0.14
  Phonograph | 0.14
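Example 4 is a selection on the Fact Table followed by a ranking. A minimal sketch, assuming the Fact Table is held as a list of tuples in memory (the slides store it in an RDBMS) and using a made-up function name:

```python
# Fact Table rows from Example 4: (predicate, object 1, object 2, probability).
FACT_TABLE = [
    ("invented", "Edison", "Light bulb", 0.55),
    ("invented", "Edison", "Telescope", 0.14),
    ("invented", "Edison", "Phonograph", 0.14),
    ("invented", "Jason", "Cell phone", 0.14),
    ("died-in", "Mary T.", "1877", 0.05),
]

def inventions_of(inventor):
    """Evaluate q(?i) :- invented(inventor, ?i), ranked by decreasing probability."""
    rows = [
        (obj2, prob)
        for predicate, obj1, obj2, prob in FACT_TABLE
        if predicate == "invented" and obj1 == inventor
    ]
    return sorted(rows, key=lambda row: row[1], reverse=True)

q_table = inventions_of("Edison")
for invention, prob in q_table:
    print(invention, prob)
```

Only the three rows whose predicate is "invented" and whose first object is Edison survive, in the same order as the q Table on the slide.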

27 Make a Query
  Example 5: list all scientists who died in 1955.

  Fact Table
  Predicate | Object 1 | Object 2   | Prob
  invented  | Edison   | Light bulb | 0.70
  died-in   | Edison   | 1955       | 0.40
  invented  | David A. | Guitar     | 0.30
  died-in   | Peter    | 1955       | 0.20
  died-in   | Mary T.  | 1800       | 0.05

  Type Table
  Type      | Object   | Prob
  scientist | Edison   | 0.50
  scientist | Peter    | 0.15
  scientist | Mary T.  | 0.15
  scientist | David A. | 0.10
  city      | Boston   | 0.05

  Answer: q(?i):- died-in(<scientist> ?i, 1955)

28 Make a Query
  Example 5 (continued): list all scientists who died in 1955.
  (Fact Table and Type Table as on slide 27.)

  Answer: q(?i):- died-in(<scientist> ?i, 1955)

  Joining Table
  Type      | Object | Predicate | Object | Prob
  scientist | Edison | died-in   | 1955   | 0.20
  scientist | Peter  | died-in   | 1955   | 0.03

  0.20 = 0.50 x 0.40, because we assume independence among tuples, i.e. P(t1, t2) = P(t1) * P(t2).

  q Table
  ?i     | Prob
  Edison | 0.20
  Peter  | 0.03
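Since the tables live in an RDBMS, one plausible way to evaluate Example 5 is a join in SQL in which the two tuple probabilities are multiplied, as the independence assumption allows. The sketch below uses Python's sqlite3 module; the table and column names (`fact`, `type_table`, `joint_prob`) are assumptions for illustration, and the SQL is not claimed to be what the ExDB compiler actually generates.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE fact (predicate TEXT, object1 TEXT, object2 TEXT, prob REAL);
CREATE TABLE type_table (type TEXT, object TEXT, prob REAL);
""")
conn.executemany("INSERT INTO fact VALUES (?, ?, ?, ?)", [
    ("invented", "Edison", "Light bulb", 0.70),
    ("died-in", "Edison", "1955", 0.40),
    ("invented", "David A.", "Guitar", 0.30),
    ("died-in", "Peter", "1955", 0.20),
    ("died-in", "Mary T.", "1800", 0.05),
])
conn.executemany("INSERT INTO type_table VALUES (?, ?, ?)", [
    ("scientist", "Edison", 0.50),
    ("scientist", "Peter", 0.15),
    ("scientist", "Mary T.", 0.15),
    ("scientist", "David A.", 0.10),
    ("city", "Boston", 0.05),
])

# Join type and fact tuples on the object; multiply probabilities (independence).
SQL = """
SELECT t.object, t.prob * f.prob AS joint_prob
FROM type_table AS t
JOIN fact AS f ON f.object1 = t.object
WHERE t.type = 'scientist'
  AND f.predicate = 'died-in'
  AND f.object2 = '1955'
ORDER BY joint_prob DESC
"""
answers = [(name, round(p, 2)) for name, p in conn.execute(SQL)]
print(answers)
```

The rounding is only there to hide floating-point noise; the ranking comes from the ORDER BY clause.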

29 Make a Query
  Example 6: list all scientists who died after 1900, together with their inventions and the year they died.
  (Fact Table and Type Table as on slide 27.)

  Answer: q(?x, ?y, ?z):- invented(?x, ?y), died-in(<scientist> ?x, ?z), (?z > 1900)

30 Make a Query
  Example 6 (continued): list all scientists who died after 1900, together with their inventions and the year they died.
  (Fact Table and Type Table as on slide 27.)

  Joining Table
  Type      | Predicate | Predicate | Object | Object | Object     | Prob
  scientist | died-in   | invented  | Edison | 1955   | light bulb | 0.14

  0.14 = 0.50 x 0.40 x 0.70

  q Table
  ?x     | ?y         | ?z   | Prob
  Edison | Light bulb | 1955 | 0.14

31 Test Your Understanding!
  Problem 1: list all singers who were born in 1980, together with their instruments.

  Fact Table
  Predicate | Object 1 | Object 2   | Prob
  invented  | Edison   | Light bulb | 0.70
  play      | John     | Guitar     | 0.40
  invented  | David A. | Guitar     | 0.30
  play      | Jackson  | Piano      | 0.20
  play      | Jackson  | Guitar     | 0.05
  born-in   | John     | 1990       | 0.05
  born-in   | Jackson  | 1980       | 0.05
  born-in   | Bobby    | 1980       | 0.05

  Type Table
  Type       | Object  | Prob
  singer     | John    | 0.50
  instrument | Guitar  | 0.15
  instrument | Piano   | 0.15
  singer     | Jackson | 0.10
  singer     | Bobby   | 0.05

  Answer: q(?x, ?y):- play(<singer> ?x, ?y), born-in(<singer> ?x, 1980)

32 Test Your Understanding!
  Problem 2: list all singers whose income is higher than their producer's.

  Fact Table
  Predicate      | Object 1 | Object 2 | Prob
  being-producer | Matt     | Bobby    | 0.70
  being-producer | Matt     | Jackson  | 0.40
  has-income     | Bobby    | 2500     | 0.30
  has-income     | Jackson  | 3000     | 0.20
  has-income     | Matt     | 2000     | 0.05
  being-producer | Matt     | John     | 0.05
  has-income     | John     | 1000     | 0.05

  Type Table
  Type     | Object  | Prob
  singer   | John    | 0.50
  producer | Matt    | 0.15
  producer | David   | 0.15
  singer   | Jackson | 0.10
  singer   | Bobby   | 0.05

  Answer: q(?x):- has-income(<singer> ?x, ?y), has-income(<producer> ?m, ?n), being-producer(?m, ?x), (?y > ?n)

33 Make a Query
  Example 7: list all inventions discovered by Edison.

  Fact Table
  Predicate | Obj.1  | Obj.2      | Prob
  invented  | Edison | Light bulb | 0.55
  invented  | Edison | Telescope  | 0.14
  invented  | Edison | Phonograph | 0.14
  invented  | Jason  | Cell phone | 0.14
  died-in   | Mary   | 1877       | 0.05

  Constraint Table
  Const | Obj.1    | Obj.2      | Prob
  ID    | invented | discovered | 0.50
  Syn   | Edison   | Edison T   | 0.15
  Syn   | Edison   | Thomas E   | 0.10

  Discussion: the Fact Table has no "discovered" predicate. What can we do to answer this query?

  Answer: q(?i):- discovered(Edison, ?i)

  q Table
  i          | Probability
  Light bulb | 0.55 x 0.5
  Telescope  | 0.14 x 0.5
  Phonograph | 0.14 x 0.5
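The idea in Example 7 can be sketched as follows: when the queried predicate is missing from the Fact Table, look for an inclusion dependency that links an existing predicate to it, and multiply the fact probability by the constraint probability. This sketch assumes the ID row reads "facts of Obj.1 (invented) also count as facts of Obj.2 (discovered)", which is an interpretation of the slide; the function name and in-memory layout are made up, and synonym constraints are left out for brevity.

```python
FACT_TABLE = [
    ("invented", "Edison", "Light bulb", 0.55),
    ("invented", "Edison", "Telescope", 0.14),
    ("invented", "Edison", "Phonograph", 0.14),
    ("invented", "Jason", "Cell phone", 0.14),
    ("died-in", "Mary", "1877", 0.05),
]

# Inclusion dependencies as (included predicate, including predicate, probability).
INCLUSION_DEPENDENCIES = [("invented", "discovered", 0.50)]

def answer(predicate, subject):
    """Evaluate q(?i) :- predicate(subject, ?i), falling back on inclusion dependencies."""
    results = []
    for fact_pred, obj1, obj2, prob in FACT_TABLE:
        if obj1 != subject:
            continue
        if fact_pred == predicate:
            results.append((obj2, prob))
            continue
        for included, including, constraint_prob in INCLUSION_DEPENDENCIES:
            if included == fact_pred and including == predicate:
                # Independence assumption: multiply fact and constraint probabilities.
                results.append((obj2, round(prob * constraint_prob, 3)))
    return sorted(results, key=lambda row: row[1], reverse=True)

discovered = answer("discovered", "Edison")
for invention, prob in discovered:
    print(invention, prob)
```

A query on "invented" is still answered directly from the Fact Table; only the unknown predicate goes through the Constraint Table and pays the 0.5 factor.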

34 Make a Query: Problem Scenario
  Example 8 (this example involves PROJECTION): list all names of those who invented something.

  Answer: q(?x):- invented(?x, ?y)

  Joining Table
  ?x     | ?y         | Prob
  Edison | light bulb | 0.34
  Edison | telescope  | 0.13
  Edison | Phonograph | 0.13
  Tree   | Table      | 0.09
  Tree   | Pen        | 0.09
  Tree   | Paper      | 0.09
  Tree   | Fruit      | 0.09
  Tree   | Forest     | 0.09
  Tree   | Eraser     | 0.09
  Tree   | ruler      | 0.09

  q Table
  ?x     | Prob
  Tree   | 0.63
  Edison | 0.60

  0.63 = 0.09 x 7
  Discussion: can you see something wrong in the resulting table?

35 Solving the Problem Scenario with the "Panel of Experts" technique
  The problem scenario is caused by the projection operation.
  Conventional way:
    newProb = Σ duplicateProb_i
  New way, using the "Panel of Experts" technique:
  1. Define the number n of duplicate outputs to consider, e.g. n = 5 (meaning that if there are 10 duplicate outputs in total, we consider only 5 and eliminate the other 5). This eliminates low-quality outputs.
  2. Calculate newProb by selecting the maximum value among those n duplicate outputs:
    newProb = max {duplicateProb_i}; i ≤ n
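The two ways of combining duplicates can be compared directly on the joining table of Example 8. In this sketch the "panel" keeps the n highest-probability duplicates of each value and returns their maximum; how the n duplicates are chosen is an assumption, since the slide only says that low-quality outputs are eliminated. Function names are hypothetical.

```python
from collections import defaultdict

# Joining table from Example 8: (?x, ?y, probability).
JOINED = [
    ("Edison", "light bulb", 0.34), ("Edison", "telescope", 0.13),
    ("Edison", "phonograph", 0.13),
    ("Tree", "table", 0.09), ("Tree", "pen", 0.09), ("Tree", "paper", 0.09),
    ("Tree", "fruit", 0.09), ("Tree", "forest", 0.09), ("Tree", "eraser", 0.09),
    ("Tree", "ruler", 0.09),
]

def group_probs(rows):
    groups = defaultdict(list)
    for x, _, prob in rows:
        groups[x].append(prob)
    return groups

def rank(scores):
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

def project_sum(rows):
    """Conventional projection: add up the probabilities of all duplicates."""
    return rank({x: round(sum(ps), 2) for x, ps in group_probs(rows).items()})

def project_panel(rows, n=5):
    """Panel of Experts: keep the n best duplicates (assumed), then take their maximum."""
    return rank({
        x: max(sorted(ps, reverse=True)[:n]) for x, ps in group_probs(rows).items()
    })

print(project_sum(JOINED))    # many weak duplicates push Tree to the top
print(project_panel(JOINED))  # one strong extraction keeps Edison on top
```

Summing rewards a value for appearing in many low-confidence tuples, whereas the maximum ranks each value by its single best piece of evidence.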

36 Make a Query: Problem Scenario
  Example 8 (problem caused by the projection operation): list all names of those who invented something.

  Answer: q(?x):- invented(?x, ?y)

  Joining Table
  ?x     | ?y         | Prob
  Edison | light bulb | 0.34
  Edison | telescope  | 0.13
  Edison | Phonograph | 0.13
  Tree   | WrongInfo1 | 0.09
  Tree   | WrongInfo2 | 0.09
  Tree   | WrongInfo3 | 0.09
  Tree   | WrongInfo4 | 0.09
  Tree   | WrongInfo5 | 0.09
  Tree   | WrongInfo6 | 0.09
  Tree   | WrongInfo7 | 0.09

  q Table (conventional way; 0.63 = 0.09 x 7)
  ?x     | Prob
  Tree   | 0.63
  Edison | 0.60

  q Table (solved by the "Panel of Experts" technique)
  ?x     | Prob
  Edison | 0.34
  Tree   | 0.09

37 2nd Component: ExDB Compiler
  Key points of the 2nd component (ExDB Compiler):
  1. ExDB has its own syntax.
  2. The result is returned in table format.
  3. The last column is the probability value, and rows are ranked in decreasing order of probability. The assumption is that the higher the probability, the more accurate the answer.
  4. Top-K can be implemented to reduce the running time (increase performance).
  5. For a JOIN, the resulting probability is the product of the probabilities of the joined tuples.
  6. For a PROJECTION, use the Panel of Experts technique to solve the duplicate problem.
  7. If the user's query contains a relation that does not exist in the Fact Table, the Constraint Table can be used to answer the query.
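Point 4 can be illustrated with a small top-K helper built on Python's heapq module. This only shows the final selection step on an already computed result; how ExDB actually prunes work during query evaluation is not described in the slides, and the rows and function name below are hypothetical.

```python
import heapq

# A hypothetical result table: (answer, probability).
RESULT_ROWS = [
    ("Telescope", 0.14),
    ("Light bulb", 0.55),
    ("Cell phone", 0.05),
    ("Phonograph", 0.26),
]

def top_k(rows, k):
    """Return the k rows with the highest probability without sorting the whole table."""
    return heapq.nlargest(k, rows, key=lambda row: row[1])

best_two = top_k(RESULT_ROWS, 2)
print(best_two)
```

Keeping only a heap of size k costs roughly O(N log k) instead of the O(N log N) of a full sort, which matters when the result has many low-probability rows.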

38 Working on Task #1
  Synthetic Table: an additional feature that combines the results of several queries into one table.
  Example: a synthetic table generated by MERGING the answers from died-in(?x,?y), invented(?x,?y), published(?x,?y), taught(?x,?y).

39 Working on Task #2
  Implementing with the Google search engine
  Query: list all scientists who died before 1955, together with their inventions.
  [Screenshot: search text box with a GO button.]
  q(?x, ?y):- invented(<scientist> ?x, ?y), died-in(?x, ?z), (?z < 1955)

40 Comparing the Results of ExDB and Google
  Test query: list all scientists who created something.
  [Figures: output from ExDB; output from Google.]
  Comments:
  - ExDB performs much better than Google.
  - For the Google result, after investigating all the links, only 1 document comes close to the answer.
  - The ExDB result contains some redundancy, but the answer is still better.

41 Conclusion
  - Only binary predicates are allowed.
  - The result is returned in table format (unlike the Google search engine).
  - The way ExDB produces answers makes more sense, because it integrates all the data before queries are run on it.
  - The extractor has to run beforehand, before users can make a query.
  - The information extraction (IE) systems involved in this paper are TextRunner, KnowItAll, and DIRT.
  - The user is not expected to know the schema of the tables; instead, the system itself tries to match as much as it can to answer the query (using synonyms and inclusion dependencies).

42 Questions?


