
Seminar on Databases and the Internet
Focus: Probabilistic and Uncertain Databases
Credit: Slideshow is a hodge-podge of slides, some of which were taken from Dan Suciu, Benny Kimelfeld, Gerhard Weikum

Traditionally, Databases are Deterministic and Queries are Boolean
An item either is in the database or is not
– The database represents a “complete world”
A tuple either is in the query answer or is not
The schema of the data is well known
This applies to both relational data and XML (= tree-structured data)
Why is this so important? Does this capture real-world application scenarios?

Example 1: RFID Ecosystem at UW [Welbourne’2007]

RFID Data is Uncertain
RFID data = noisy
– SIGHTING(tagID, antennaID, time)
Derived data = probabilistic
– “John entered Room 524 at 9:15” prob=0.6
– “John carried laptop x77 at 11:03” prob=0.8
– ...
Queries
– “Which people were in Room 478 yesterday?”
Massive amounts of probabilistic data from RFIDs, sensors

Example 2: Information Extraction [Gupta&Sarawagi’2006]
Source text: ...52 A Goregaon West Mumbai...
ID  House-No  Street         City         P
1   52        Goregaon West  Mumbai       0.1
1   52-A      Goregaon West  Mumbai       0.4
1   52        Goregaon       West Mumbai  0.2
1   52-A      Goregaon       West Mumbai  0.2
2   ..        ..             ..           ..
Here probabilities are meaningful: ≈20% of such extractions are correct

Example 3: Scanning Aerial Photography
Find regions that include a factory building and a road … with a high probability

Analyzing a Region
[Figure: aerial photo of a region with annotated objects: road (90%); road (60%); factory bldg. & wall (40%) / house & road (30%); house (50%) / factory bldg. (50%); factory bldg. (40%) / apt. building (50%). Candidate matches: 45%, 36%, 24%, 36%.]
What is the probability that this region is an answer (i.e., includes a factory building and a road)?
But specifying the probability of each match does not answer the question!
The probability of each match can be significantly smaller than the probability that there is any match
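The gap between "each match" and "any match" can be checked numerically. The sketch below is only an illustration under assumptions that are not on the slide: the five objects are treated as independent, the alternative labels (house, apt. building) are ignored, and any road together with any factory building counts as a match. The names `roads`, `factories` and `prob_any_match` are hypothetical.

```python
from itertools import product

# Assumed marginal probabilities, read off the figure labels.
roads = [0.9, 0.6]
factories = [0.4, 0.5, 0.4]

def prob_any_match(roads, factories):
    """Sum the probability of every world that holds a road and a factory."""
    probs = roads + factories
    total = 0.0
    for world in product([True, False], repeat=len(probs)):
        weight = 1.0
        for present, p in zip(world, probs):
            weight *= p if present else 1.0 - p
        has_road = any(world[:len(roads)])
        has_factory = any(world[len(roads):])
        if has_road and has_factory:
            total += weight
    return total

# Most likely single (road, factory) pair versus "some pair exists".
best_single_match = max(r * f for r in roads for f in factories)
any_match = prob_any_match(roads, factories)
print(best_single_match, any_match)
```

Under these assumptions the best single match stays below one half while the probability that the region is an answer is well above it, which is the point the slide makes.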

Question: What must we rethink to deal with uncertain data? Ideas?
Answer: Everything!
– Data model
– Query semantics (meaning)
– Physical data storage
– Efficient evaluation algorithms
– Intuitive querying

Types of Uncertainty
Uncertain queries:
– Data may (or may not) be deterministic
– User wants approximate answers to the query (= fuzzy queries)
Uncertain or probabilistic data:
– May have possibilities for data, without probabilities (= uncertain database)
– May have probabilities of data (= probabilistic database)
– Queries may or may not be exact

Fuzzy Queries

When are Fuzzy Queries Needed?
When there are no answers
When there are too many answers
When the user has preferences over the data
When the user is not sufficiently familiar with the structure of the database

The Empty Answers Problem [Agrawal,Chaudhuri,Das,Gionis 2003]
Query is overspecified: no answers
Example: try to buy a house in SF…
    SELECT * FROM Houses
    WHERE bedrooms = 3 AND style = 'craftsman'
      AND district = 'Noe Valley' AND price < 400000
… good luck!

WHIRL: IR over Relations [W.W. Cohen: SIGMOD’98]
Gerhard Weikum, June 14, 2007
Add text-similarity selection and join to relational algebra
Example:
    Select * From Movies M, Reviews R
    Where M.Plot ~ "fight" And M.Year > 1990 And R.Rating > 3
      And M.Title ~ R.Title
Movies
Title    Plot                                                             …  Year
Matrix   In the near future … computer hacker Neo … fight training …         1999
Hero     In ancient China … fights … sword fight … fights Broken Sword …     2002
Shrek 2  In Far Far Away … our lovely hero fights with cat killer …          2004
Reviews
Title                 Comment                                             …  Rating
Matrix 1              … cool fights … new techniques …                       4
Matrix Reloaded       … fights … and more fights … fairly boring …           1
Matrix Eigenvalues    … matrix spectrum … orthonormal …                      5
Ying xiong aka. Hero  … fight for peace … sword fight … dramatic colors …    5
The record linkage problem!
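A rough feel for the similarity join `M.Title ~ R.Title` can be had with a few lines of Python. This is a simplified sketch, not WHIRL itself: it scores pairs by cosine similarity over raw term counts, whereas WHIRL uses TF-IDF weights, and the helper names `tokens`, `cosine` and `similarity_join` are made up for this illustration.

```python
import math
import re
from collections import Counter

def tokens(text):
    """Lowercase alphanumeric tokens of a string."""
    return re.findall(r"[a-z0-9]+", text.lower())

def cosine(a, b):
    """Cosine similarity of the term-count vectors of two strings."""
    va, vb = Counter(tokens(a)), Counter(tokens(b))
    dot = sum(va[t] * vb[t] for t in va)
    norm = math.sqrt(sum(c * c for c in va.values())) * \
           math.sqrt(sum(c * c for c in vb.values()))
    return dot / norm if norm else 0.0

def similarity_join(left, right, threshold=0.0):
    """All pairs scoring above the threshold, best first."""
    pairs = [(cosine(l, r), l, r) for l in left for r in right]
    return sorted((p for p in pairs if p[0] > threshold), reverse=True)

movie_titles = ["Matrix", "Hero", "Shrek 2"]
review_titles = ["Matrix 1", "Matrix Reloaded", "Matrix Eigenvalues",
                 "Ying xiong aka. Hero"]

for score, movie, review in similarity_join(movie_titles, review_titles):
    print(f"{score:.2f}  {movie!r} ~ {review!r}")
```

Note that "Matrix Eigenvalues" scores as high as the genuine Matrix reviews, which is exactly the record linkage problem the slide points to.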

XML Document
[Figure: XML tree with element nodes bib, journal, article, title, author, details, name, aff and text values “Journal of XML”, “XML: Fighting Global Terrorism”, “A Survey of XML”, “E. Smith” (twice), “Cal Tech”, “U. of Chicago”]

Query
[Figure: the same XML document next to a tree-pattern query with nodes bib, article, title, name; the legend distinguishes descendant edges from child edges]

Query
[Figure: the same XML document next to a tree-pattern query with nodes author, article, title, name, article, title, email]
Cannot be satisfied:
1. Root nodes do not agree
2. No email available
3. Author should be below article, not above
[Kanza,et.al.:PODS98,PODS01,] [Amer-Yahia,et.al.:SIGMOD04] [Brodiansky,et.al.:CIKM07]

Another Type of Fuzzy Query: Keyword Proximity Search
Keyword Search
Nowadays…
– Exposure to many databases
– Different types (relational, XML, RDF…)
– Different schemas
Traditional querying paradigms (e.g., SQL, XQuery, SPARQL) are not easy to use and, moreover, require a thorough understanding of the schema
Goal: Enable users to instantly pose (inaccurate) queries without knowing the schema
The natural (and popular) option: Keyword Search

Keyword Proximity Search (KPS)
Data have varying degrees of structure
– Relational (w/ foreign keys), XML (w/ id-references)
– Natural representation by a graph
– Usually, data-centric rather than document-centric
A query is a set of keywords
– No structural constraints
The Goal: Extract meaningful parts of data w.r.t. the keywords
Agrawal et al., ICDE’02; Hristidis et al., VLDB’02,03, ICDE’03; Bhalotia et al., VLDB’05; Kacholia et al., VLDB’06; Ding et al., ICDE’07; Liu et al., SIGMOD’06; Wang et al., VLDB’06; Luo et al., SIGMOD’07; Golenberg SIGMOD’07; …

Example: Search in RDB
search: Belgium, Brussels
Cities
ID  Name       Population
22  Amsterdam  1101407
73  Brussels   951580
Organizations
ID   Name  Head Q.
135  EU    73
175  ESA   81
Countries
Code  Name         Area   Capital
NL    Netherlands  37330  22
B     Belgium      30510  73
Memberships
Country  Org.
B        135
NL       135

Example: Search in RDB (contd.)
search: Belgium, Brussels
An answer over the same Cities, Organizations, Countries and Memberships tables: Brussels is the capital city of Belgium

Example: Search in RDB (contd.)
search: Belgium, Brussels
Another answer over the same Cities, Organizations, Countries and Memberships tables: Brussels hosts EU and Belgium is a member
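Both answers to "Belgium, Brussels" can be recovered mechanically by viewing tuples as graph nodes and foreign-key references as edges. The sketch below is a toy illustration, not any of the cited KPS algorithms: it enumerates simple paths between the two keyword nodes (for two keywords, a connecting subtree is a path). Node identifiers such as `"Countries:B"` and the helper names are hypothetical.

```python
from collections import defaultdict

# One node per tuple; the text is what a keyword can match.
nodes = {
    "Cities:22": "Amsterdam", "Cities:73": "Brussels",
    "Organizations:135": "EU", "Organizations:175": "ESA",
    "Countries:NL": "Netherlands", "Countries:B": "Belgium",
    "Memberships:B-135": "", "Memberships:NL-135": "",
}
# Foreign-key references between tuples, treated as undirected edges.
# ESA's Head Q. (81) refers to no listed city, so it has no edge.
edges = [
    ("Countries:NL", "Cities:22"),           # Capital
    ("Countries:B", "Cities:73"),            # Capital
    ("Organizations:135", "Cities:73"),      # Head Q.
    ("Memberships:B-135", "Countries:B"),
    ("Memberships:B-135", "Organizations:135"),
    ("Memberships:NL-135", "Countries:NL"),
    ("Memberships:NL-135", "Organizations:135"),
]
graph = defaultdict(list)
for a, b in edges:
    graph[a].append(b)
    graph[b].append(a)

def simple_paths(start, goal, path=None):
    """Yield every cycle-free path from start to goal."""
    path = (path or []) + [start]
    if start == goal:
        yield path
        return
    for nxt in graph[start]:
        if nxt not in path:
            yield from simple_paths(nxt, goal, path)

def keyword_node(keyword):
    """First node whose text contains the keyword."""
    return next(n for n, text in nodes.items() if keyword in text)

answers = sorted(simple_paths(keyword_node("Belgium"), keyword_node("Brussels")),
                 key=len)
for path in answers:
    print(" - ".join(path))
```

The shorter path corresponds to "Brussels is the capital city of Belgium", the longer one to "Brussels hosts EU and Belgium is a member"; ranking by path length is just one plausible choice.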

Example: Search in XML
search: Yannakakis, Approximation

Example: Search in XML (contd.)
search: Yannakakis, Approximation
An answer: Yannakakis wrote a paper about Approximation

Example: Search in XML (contd.)
search: Yannakakis, Approximation
Another answer: Yannakakis is cited by a paper about Approximation

Some Open Problems
Much of the existing work applies only to unsatisfiable queries, not to queries with too many answers
No user feedback to improve results of fuzzy queries
Limited queries
Little has been done for the case where the query is fuzzy and the data is uncertain
– We look at uncertain data (with exact queries) now…

Uncertain Databases

Motivation
Until now, we assumed that the data was certain, but the query was uncertain
– query was answered approximately
If data comes from many sources, of varying reliability, then data is uncertain
– how is such uncertainty represented?
– how do we query uncertain data?

Main Issues [Das Sarma, Benjelloun, Halevy, Widom, ICDE06]
Possible worlds semantics
Constructs/Models for uncertain data
– completeness: can everything be modeled?
– closure: can query answers always be represented in the model?
– expressiveness
Also: membership, uniqueness, equivalence, minimization

Example
Volunteers record birds that they have seen
Tables:
– BirdInfo(birdname, color, size)
– Sightings(observer, when, where, birdname)
A volunteer may not be sure which bird she saw
Explicit model: List all possible databases
Implicit model: Use a compact representation for possible databases

Attribute-Or Model
Amy saw a jay, and Mike saw either a crow or a raven
Observer  When      Where     BirdName
Amy       12/23/06  Technion  jay
Mike      12/23/06  Technion  {crow, raven}
Represented instance 1:
Observer  When      Where     BirdName
Amy       12/23/06  Technion  jay
Mike      12/23/06  Technion  crow
Represented instance 2:
Observer  When      Where     BirdName
Amy       12/23/06  Technion  jay
Mike      12/23/06  Technion  raven
Is this model complete?
What should be the answer to
    SELECT * FROM Sightings WHERE Birdname = 'raven'
Is this model closed?
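One way to explore the questions on this slide is to expand the attribute-or relation into its possible instances and run the selection in each. The following sketch assumes set semantics and represents an or-set as a Python set in the last column; `possible_worlds` and `select_bird` are hypothetical helper names, and reading the result as "possible" versus "certain" answers is one common interpretation, not something the slide prescribes.

```python
from itertools import product

# Attribute-or relation: the last field holds the set of alternatives.
sightings = [
    ("Amy", "12/23/06", "Technion", {"jay"}),
    ("Mike", "12/23/06", "Technion", {"crow", "raven"}),
]

def possible_worlds(relation):
    """Yield every ordinary instance obtained by fixing each or-set."""
    choices = [sorted(row[-1]) for row in relation]
    for picked in product(*choices):
        yield frozenset(row[:-1] + (bird,)
                        for row, bird in zip(relation, picked))

def select_bird(world, bird):
    """SELECT * FROM Sightings WHERE Birdname = bird, on one instance."""
    return frozenset(t for t in world if t[3] == bird)

worlds = list(possible_worlds(sightings))
answers = [select_bird(w, "raven") for w in worlds]

possible = frozenset().union(*answers)        # in the answer of some world
certain = frozenset.intersection(*answers)    # in the answer of every world
print(len(worlds), possible, certain)
```

The query result is empty in one world and a single tuple in the other. A relation built only from attribute-ors has the same number of rows in every instance, so this pair of results cannot be written with or-sets alone, which bears on the closure question.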

Maybe-Tuples
A maybe-tuple is one that may or may not appear
What are the instances represented by this relation?
Observer  When      Where     BirdName
Amy       12/23/06  Technion  jay
Mike      12/23/06  Technion  {crow, raven}   ?

Tuple Constraints: Example
     Observer  When      Where     BirdName
t1   Amy       12/23/06  Technion  jay
t2   Amy       12/23/06  Technion  crow
Constraint: t1 or t2
What instances are represented here?
Can it be represented by attribute-ors?
Can it be represented by maybe-tuples?
Can it be represented by attribute-ors with maybe-tuples?

Open Questions
Solutions have been presented to achieve closure, completeness and efficiency in relational databases [Antova,Koch,Olteanu,ICDE07] [Benjelloun,Das Sarma,Halevy,Theobald,Widom,J of VLDB,2008]
How should one model an uncertain XML database?
– Especially while taking into consideration closure, completeness, efficiency?

Probabilistic Databases

What is a Probabilistic Database?
“An item belongs to the database” is a probabilistic event
– Tuple-existence uncertainty
– Attribute-value uncertainty
“A tuple is an answer to the query” is a probabilistic event
Can be extended to all data models

Possible Worlds Semantics
Attribute domains: int, char(30), varchar(55), datetime
– # values: $2^{32}$, $2^{120}$, $2^{440}$, $2^{64}$
Relational schema: Employee(name:varchar(55), dob:datetime, salary:int)
– # of tuples: $2^{440} \times 2^{64} \times 2^{32}$
– # of instances: $2^{2^{440} \times 2^{64} \times 2^{32}}$
Database schema: Employee(...), Projects(...), Groups(...), WorksFor(...)
– # of instances: N (= BIG but finite)

The Definition
The set of all possible database instances: $INST = \{I_1, I_2, I_3, \ldots, I_N\}$
Definition: A probabilistic database $I^p$ is a probability distribution on $INST$:
$\Pr : INST \to [0,1]$ s.t. $\sum_{i=1}^{N} \Pr(I_i) = 1$
Definition: A possible world is an instance $I$ s.t. $\Pr(I) > 0$
We will use $\Pr$ and $I^p$ interchangeably

Example
$I^p$ =
$I_1$, $\Pr(I_1) = 1/3$
Customer  Address  Product
John      Seattle  Gizmo
John      Seattle  Camera
Sue       Denver   Gizmo
$I_2$, $\Pr(I_2) = 1/12$
Customer  Address  Product
John      Boston   Gadget
Sue       Denver   Gizmo
$I_3$, $\Pr(I_3) = 1/2$
Customer  Address  Product
John      Seattle  Gizmo
John      Seattle  Camera
Sue       Seattle  Camera
$I_4$, $\Pr(I_4) = 1/12$
Customer  Address  Product
John      Boston   Gadget
Sue       Seattle  Camera
Possible worlds = $\{I_1, I_2, I_3, I_4\}$

Tuples as Events
One tuple $t$: event $t \in I$
$\Pr(t) = \sum_{I :\, t \in I} \Pr(I)$
Two tuples $t_1, t_2$: event $t_1 \in I \wedge t_2 \in I$
$\Pr(t_1 t_2) = \sum_{I :\, t_1 \in I \,\wedge\, t_2 \in I} \Pr(I)$
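These two sums are easy to evaluate on the four-world example from the previous slide. The sketch below uses exact fractions; the function name `pr` and the dictionary layout are choices made for this illustration only.

```python
from fractions import Fraction

# The four possible worlds of the example, each with its probability.
worlds = {
    "I1": (Fraction(1, 3), {("John", "Seattle", "Gizmo"),
                            ("John", "Seattle", "Camera"),
                            ("Sue", "Denver", "Gizmo")}),
    "I2": (Fraction(1, 12), {("John", "Boston", "Gadget"),
                             ("Sue", "Denver", "Gizmo")}),
    "I3": (Fraction(1, 2), {("John", "Seattle", "Gizmo"),
                            ("John", "Seattle", "Camera"),
                            ("Sue", "Seattle", "Camera")}),
    "I4": (Fraction(1, 12), {("John", "Boston", "Gadget"),
                             ("Sue", "Seattle", "Camera")}),
}

def pr(*tuples):
    """Total probability of the worlds that contain all given tuples."""
    return sum((p for p, instance in worlds.values()
                if all(t in instance for t in tuples)), Fraction(0))

t1 = ("John", "Seattle", "Gizmo")
t2 = ("Sue", "Denver", "Gizmo")

print(pr(t1), pr(t2), pr(t1, t2))
# The joint probability differs from the product of the marginals,
# so the two tuple events are correlated in this database.
print(pr(t1, t2) == pr(t1) * pr(t2))
```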

Query Semantics
Given a query $Q$ and a probabilistic database $I^p$, what is the meaning of $Q(I^p)$?

Query Semantics
Semantics 1: Possible Answers
A probability distribution on sets of tuples
$\forall A.\ \Pr(Q = A) = \sum_{I \in INST :\, Q(I) = A} \Pr(I)$
Semantics 2: Possible Tuples
A probability function on tuples
$\forall t.\ \Pr(t \in Q) = \sum_{I \in INST :\, t \in Q(I)} \Pr(I)$

Example: Query Semantics
Purchase^p =
$I_1$, $\Pr(I_1) = 1/3$
Name  City     Product
John  Seattle  Gizmo
John  Seattle  Camera
Sue   Denver   Gizmo
Sue   Denver   Camera
$I_2$, $\Pr(I_2) = 1/12$
Name  City     Product
John  Boston   Gizmo
Sue   Denver   Gizmo
Sue   Seattle  Gadget
$I_3$, $\Pr(I_3) = 1/2$
Name  City     Product
John  Seattle  Gizmo
John  Seattle  Camera
Sue   Seattle  Camera
$I_4$, $\Pr(I_4) = 1/12$
Name  City     Product
John  Boston   Camera
Sue   Seattle  Camera
Query:
    SELECT DISTINCT x.product
    FROM Purchase^p x, Purchase^p y
    WHERE x.name = 'John' and x.product = y.product and y.name = 'Sue'
Possible answers semantics:
Answer set     Probability
Gizmo, Camera  1/3    = Pr(I_1)
Gizmo          1/12   = Pr(I_2)
Camera         7/12   = Pr(I_3) + Pr(I_4)
Possible tuples semantics:
Tuple   Probability
Camera  11/12  = Pr(I_1) + Pr(I_3) + Pr(I_4)
Gizmo   5/12   = Pr(I_1) + Pr(I_2)
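Both result tables can be reproduced by evaluating the query in each world and aggregating the world probabilities. The sketch below hard-codes the self-join as a Python comprehension; `query`, `possible_answers` and `possible_tuples` are names chosen for this illustration.

```python
from collections import defaultdict
from fractions import Fraction

# Possible worlds of Purchase^p as (probability, set of tuples).
worlds = [
    (Fraction(1, 3), {("John", "Seattle", "Gizmo"), ("John", "Seattle", "Camera"),
                      ("Sue", "Denver", "Gizmo"), ("Sue", "Denver", "Camera")}),
    (Fraction(1, 12), {("John", "Boston", "Gizmo"), ("Sue", "Denver", "Gizmo"),
                       ("Sue", "Seattle", "Gadget")}),
    (Fraction(1, 2), {("John", "Seattle", "Gizmo"), ("John", "Seattle", "Camera"),
                      ("Sue", "Seattle", "Camera")}),
    (Fraction(1, 12), {("John", "Boston", "Camera"), ("Sue", "Seattle", "Camera")}),
]

def query(purchase):
    """Products bought by both John and Sue, on one ordinary instance."""
    return frozenset(x[2] for x in purchase for y in purchase
                     if x[0] == "John" and y[0] == "Sue" and x[2] == y[2])

possible_answers = defaultdict(Fraction)  # answer set -> probability
possible_tuples = defaultdict(Fraction)   # answer tuple -> probability
for probability, instance in worlds:
    answer = query(instance)
    possible_answers[answer] += probability
    for product_name in answer:
        possible_tuples[product_name] += probability

for answer, probability in possible_answers.items():
    print(sorted(answer), probability)
for product_name, probability in possible_tuples.items():
    print(product_name, probability)
```

The possible-tuples probabilities need not sum to 1, since several tuples can be answers in the same world.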

Possible-Worlds Semantics: Summary
Very powerful model
– Complete: Can capture any instance distribution, any tuple correlations
Intuitive, clean formal semantics for any SQL query
– Translates to queries over deterministic instances

Possible Worlds Semantics: Summary (contd.)
Possible answers semantics:
– Precise
– Can be used to compose queries
– Difficult user interface
Possible tuples semantics:
– Less precise, but simple; sufficient for most apps
– Cannot be used to compose queries
– Simple user interface

Possible Worlds Semantics: Summary (contd.)
Not very useful as a representation or implementation tool
HUGE number of possible worlds!
Need more effective representation formalisms
Something that users can understand/explore
Allow more efficient query execution
– Avoid “possible worlds explosion”
Perhaps giving up completeness

Explicit Independent Tuples
Tuple-independent probabilistic database
$TUP = \{t_1, t_2, \ldots, t_M\}$ = all tuples
$pr : TUP \to [0,1]$ (no restrictions)
$INST = \mathcal{P}(TUP)$, $N = 2^M$
$\Pr(I) = \prod_{t \in I} pr(t) \times \prod_{t \notin I} (1 - pr(t))$

Tuple Prob.: Possible Worlds
Name  City     pr
John  Seattle  p_1 = 0.8
Sue   Boston   p_2 = 0.6
Fred  Boston   p_3 = 0.9
$I^p$ =
World  Tuples                                           Probability
I_1    ∅                                                (1-p_1)(1-p_2)(1-p_3)
I_2    (John, Seattle)                                  p_1(1-p_2)(1-p_3)
I_3    (Sue, Boston)                                    (1-p_1)p_2(1-p_3)
I_4    (Fred, Boston)                                   (1-p_1)(1-p_2)p_3
I_5    (John, Seattle), (Sue, Boston)                   p_1 p_2(1-p_3)
I_6    (John, Seattle), (Fred, Boston)                  p_1(1-p_2)p_3
I_7    (Sue, Boston), (Fred, Boston)                    (1-p_1)p_2 p_3
I_8    (John, Seattle), (Sue, Boston), (Fred, Boston)   p_1 p_2 p_3
The probabilities sum to 1
E[size($I^p$)] = 2.3 tuples
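The eight worlds and the two summary figures on this slide can be recomputed directly from the product formula for tuple-independent databases. This is a small sketch; `world_probability` and the variable names are chosen for illustration.

```python
from itertools import combinations

# Marginal tuple probabilities from the slide.
pr = {("John", "Seattle"): 0.8, ("Sue", "Boston"): 0.6, ("Fred", "Boston"): 0.9}

def world_probability(world, pr):
    """Product of pr(t) for tuples in the world and 1 - pr(t) for the rest."""
    prob = 1.0
    for t, p in pr.items():
        prob *= p if t in world else 1.0 - p
    return prob

tuples = list(pr)
# Every subset of the tuples is a possible world (2^3 = 8 of them).
worlds = [frozenset(c) for k in range(len(tuples) + 1)
          for c in combinations(tuples, k)]

total = sum(world_probability(w, pr) for w in worlds)
expected_size = sum(len(w) * world_probability(w, pr) for w in worlds)
print(len(worlds), total, expected_size)
```

By linearity of expectation the expected size is simply 0.8 + 0.6 + 0.9, which the enumeration confirms.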

Tuple-Independent DBs are Incomplete
Name  Address  pr
John  Seattle  p_1
Sue   Seattle  p_2
$I^p$ = instances {(John, Seattle), (Sue, Seattle)}, {(John, Seattle)}, ∅; probabilities shown: $p_1$, $p_1 p_2$, $1 - p_1 - p_1 p_2$
Very limited: cannot capture correlations across tuples
Not closed
– Query operators can introduce complex correlations!

Tuple Prob.: Query Evaluation
Person
Name  City     pr
John  Seattle  p_1
Sue   Boston   p_2
Fred  Boston   p_3
Purchase
Customer  Product  Date  pr
John      Gizmo    ...   q_1
John      Gadget   ...   q_2
John      Gadget   ...   q_3
Sue       Camera   ...   q_4
Sue       Gadget   ...   q_5
Sue       Gadget   ...   q_6
Fred      Gadget   ...   q_7
Query:
    SELECT DISTINCT x.city
    FROM Person x, Purchase y
    WHERE x.Name = y.Customer and y.Product = 'Gadget'
Tuple    Probability
Seattle  $p_1 \bigl(1 - (1-q_2)(1-q_3)\bigr)$
Boston   $1 - \bigl(1 - p_2 (1 - (1-q_5)(1-q_6))\bigr) \times \bigl(1 - p_3 q_7\bigr)$
Query evaluation is usually intractable (#P-complete)
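For this particular query the closed-form expressions can be sanity-checked against a brute-force sum over all possible worlds. In the sketch below the numeric values of p and q and the date placeholders `d1` to `d7` are invented (the slide leaves them symbolic or elided), and all tuples are assumed independent.

```python
from itertools import product

# Hypothetical probabilities; only the table shapes come from the slide.
person = {("John", "Seattle"): 0.8, ("Sue", "Boston"): 0.6, ("Fred", "Boston"): 0.9}
purchase = {
    ("John", "Gizmo", "d1"): 0.5,   # q1
    ("John", "Gadget", "d2"): 0.3,  # q2
    ("John", "Gadget", "d3"): 0.7,  # q3
    ("Sue", "Camera", "d4"): 0.4,   # q4
    ("Sue", "Gadget", "d5"): 0.2,   # q5
    ("Sue", "Gadget", "d6"): 0.6,   # q6
    ("Fred", "Gadget", "d7"): 0.5,  # q7
}

def sample_space(table):
    """Yield (probability, rows) for every subset of an independent table."""
    rows = list(table.items())
    for mask in product([True, False], repeat=len(rows)):
        weight, present = 1.0, []
        for keep, (row, p) in zip(mask, rows):
            weight *= p if keep else 1.0 - p
            if keep:
                present.append(row)
        yield weight, present

def query(persons, purchases):
    """Cities of people who bought a Gadget, on one ordinary instance."""
    return {city for name, city in persons
            for customer, item, _ in purchases
            if name == customer and item == "Gadget"}

def brute_force():
    result = {}
    for w1, persons in sample_space(person):
        for w2, purchases in sample_space(purchase):
            for city in query(persons, purchases):
                result[city] = result.get(city, 0.0) + w1 * w2
    return result

def closed_form():
    p1, p2, p3 = person.values()
    q1, q2, q3, q4, q5, q6, q7 = purchase.values()
    seattle = p1 * (1 - (1 - q2) * (1 - q3))
    boston = 1 - (1 - p2 * (1 - (1 - q5) * (1 - q6))) * (1 - p3 * q7)
    return {"Seattle": seattle, "Boston": boston}

print(brute_force())
print(closed_form())
```

The brute-force version touches all 2^10 worlds, which is the blow-up that motivates evaluating probabilities on the tuple table directly when the query allows it.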

Other Models for Probabilistic Data
There are many more models for relational probabilistic data (see papers on course site)
There are many models for probabilistic XML
– We will see one now

A ProTDB Document [Nierman & Jagadish 02]
Rooted tree
2 types of nodes
– Ordinary nodes
– Distributional nodes
2 types of distributions
– Independent
– Mutually exclusive
[Figure: probabilistic XML tree rooted at aerial-photo, with ordinary nodes region, neighborhood, house, size (s, m), building, factory, park.lot, heliport, vehicle, type (track, private) and edge probabilities 0.8, 0.4, 0.5, 0.5, 0.75, 0.8, 0.8, 0.4, 0.3]

A ProTDB Document [Nierman & Jagadish 02]
A probability for each outgoing edge of a distributional node
[Figure: the same ProTDB tree rooted at aerial-photo, with the edge probabilities highlighted]

Instance Generation: Step 1
Distributional nodes choose a set of children
– Traverse the tree top-down
– Independent nodes: choose children independently
– Mutually exclusive nodes: choose at most one child
– Drop unchosen children
[Figure: the ProTDB tree rooted at aerial-photo, with chosen and unchosen children marked]

Instance Generation: Step 2
Drop the distributional nodes
[Figure: the tree after Step 1, keeping aerial-photo, region, neighborhood, house, size (s), factory, heliport, vehicle, type (track) and the distributional nodes above them]

Instance Generation: Step 2 (contd.)
Drop the distributional nodes
Connect each ordinary node to its closest ordinary ancestor
[Figure: the remaining ordinary nodes aerial-photo, region, neighborhood, house, size (s), factory, heliport, vehicle, type (track)]

The Result: An Ordinary Document
[Figure: ordinary XML tree with nodes aerial-photo, region, neighborhood, house, size (s), factory, heliport, vehicle, type (track)]
Many interesting queries are tractable! [Kimelfeld,Sagiv VLDB07] [Kimelfeld,Sagiv,Cohen,PODS08] [Kimelfeld,Kosharovsky,Sagiv,SIGMOD08]
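The two generation steps can be folded into one recursive pass: a distributional node picks children and then disappears, so whatever it picked is attached to the nearest ordinary node above it. The sketch below is a minimal illustration of that idea, not the ProTDB implementation; the `Node` class, the kind tags `"ord"`, `"ind"`, `"mux"` and the small example tree are all assumptions made here.

```python
import random
from dataclasses import dataclass, field

@dataclass
class Node:
    label: str
    kind: str = "ord"                              # "ord", "ind" or "mux"
    children: list = field(default_factory=list)   # (probability, Node) pairs

def generate(node, rng):
    """Return the ordinary subtrees that `node` contributes to its parent."""
    if node.kind == "ord":
        out = Node(node.label)
        for _, child in node.children:             # ordinary edges are certain
            out.children += [(1.0, sub) for sub in generate(child, rng)]
        return [out]
    if node.kind == "ind":                         # each child chosen on its own
        chosen = [c for p, c in node.children if rng.random() < p]
    else:                                          # "mux": at most one child
        chosen, r, acc = [], rng.random(), 0.0
        for p, c in node.children:
            acc += p
            if r < acc:
                chosen = [c]
                break
    # The distributional node is dropped; its picks move up to the caller.
    return [sub for c in chosen for sub in generate(c, rng)]

def to_tuple(node):
    """Nested (label, children) view of an ordinary tree."""
    return (node.label, [to_tuple(c) for _, c in node.children])

# Hypothetical document, loosely following labels from the slides.
doc = Node("region", children=[(1.0, Node("", "ind", [
    (0.8, Node("building", children=[(1.0, Node("", "mux", [
        (0.5, Node("factory")), (0.5, Node("house"))]))])),
    (0.4, Node("heliport")),
]))])

print(to_tuple(generate(doc, random.Random(0))[0]))
```

Each call with a fresh random generator yields one ordinary document; repeating the call many times gives an empirical view of the distribution the probabilistic document encodes.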

Some Interesting Open Questions
Possible answers semantics (instead of possible tuples semantics) for ProTDB
Efficient physical storage
External memory algorithms for query processing
More expressive models (e.g., Bayesian correlations)
Richer query languages
Incremental update of results
Many more…
