Presentation is loading. Please wait.

Presentation is loading. Please wait.

A Course on Probabilistic Databases

Similar presentations


Presentation on theme: "A Course on Probabilistic Databases"— Presentation transcript:

1 A Course on Probabilistic Databases
June, 2014 Probabilistic Databases - Dan Suciu A Course on Probabilistic Databases Dan Suciu University of Washington

2 Outline Motivating Applications The Probabilistic Data Model Chapter 2
June, 2014 Probabilistic Databases - Dan Suciu Outline Part 1 Motivating Applications The Probabilistic Data Model Chapter 2 Extensional Query Plans Chapter 4.2 The Complexity of Query Evaluation Chapter 3 Extensional Evaluation Chapter 4.1 Intensional Evaluation Chapter 5 Conclusions Part 2 Part 3 Part 4

3 Probabilistic Data Model: Outline
June, 2014 Probabilistic Databases - Dan Suciu Probabilistic Data Model: Outline Review: Relational Data Model Incomplete Databases Probabilistic Databases (PDB) Query semantics Block Independent Disjoint

4 Review: Relational Data Model
June, 2014 Probabilistic Databases - Dan Suciu Review: Relational Data Model Owner Location Data: stored in relations (= tables) Name Object Joe Book302 Laptop77 Jim Fred GgleGlass Object Time Loc Laptop77 5:07 Hall 9:05 Office Book302 8:18

5 Review: Relational Data Model
June, 2014 Probabilistic Databases - Dan Suciu Review: Relational Data Model Owner Location Data: stored in relations (= tables) Name Object Joe Book302 Laptop77 Jim Fred GgleGlass Object Time Loc Laptop77 5:07 Hall 9:05 Office Book302 8:18 Queries: SQL, Find all owners of objects in the Office -- SQL: e.g. postgres SELECT DISTINCT Owner.name FROM Owner, Location WHERE Owner.object = Location.object and Location.loc = ‘Office’

6 Review: Relational Data Model
June, 2014 Probabilistic Databases - Dan Suciu Review: Relational Data Model Owner Location Data: stored in relations (= tables) Name Object Joe Book302 Laptop77 Jim Fred GgleGlass Object Time Loc Laptop77 5:07 Hall 9:05 Office Book302 8:18 Queries: SQL, Unions of Conjunctive Queries Find all owners of objects in the Office -- SQL: e.g. postgres SELECT DISTINCT Owner.name FROM Owner, Location WHERE Owner.object = Location.object and Location.loc = ‘Office’ Q(z) = Owner(z,x), Location(x,t,y),y=‘Office’ Note that x,t, are existentially quantified: Q(z) = ∃x ∃t (Owner(z,x), Location(x,t,’Office’))

7 Review: Relational Data Model
June, 2014 Probabilistic Databases - Dan Suciu Review: Relational Data Model Owner Location Data: stored in relations (= tables) Name Object Joe Book302 Laptop77 Jim Fred GgleGlass Object Time Loc Laptop77 5:07 Hall 9:05 Office Book302 8:18 Queries: SQL, Unions of Conjunctive Queries Find all owners of objects in the Office -- SQL: e.g. postgres SELECT DISTINCT Owner.name FROM Owner, Location WHERE Owner.object = Location.object and Location.loc = ‘Office’ Q(z) = Owner(z,x), Location(x,t,y),y=‘Office’ Note that x,t, are existentially quantified: Q(z) = ∃x ∃t (Owner(z,x), Location(x,t,’Office’)) Name Joe Jim Query answer: Q=

8 Review: Relational Data Model
June, 2014 Probabilistic Databases - Dan Suciu Review: Relational Data Model Owner Location Data: stored in relations (= tables) Name Object Joe Book302 Laptop77 Jim Fred GgleGlass Object Time Loc Laptop77 5:07 Hall 9:05 Office Book302 8:18 Queries: SQL, Unions of Conjunctive Queries Find all owners of objects in the Office -- SQL: e.g. postgres SELECT DISTINCT Owner.name FROM Owner, Location WHERE Owner.object = Location.object and Location.loc = ‘Office’ Q(z) = Owner(z,x), Location(x,t,y),y=‘Office’ Note that x,t, are existentially quantified: Q(z) = ∃x ∃t (Owner(z,x), Location(x,t,’Office’)) Name Joe Jim Query answer: Q=

9 Review: Complexity of Query Evaluation
Query Q, database D Data complexity: fix Q, complexity = f(D) Query complexity: fix D, complexity = f(Q) Combined complexity: complexity = f(D,Q) This course Moshe Vardi Data complexity is the right metric for query language, and this is what I will discuss today. This is what allows database systems to scale to large data sets. ALL query languages that exists today in systems have PTIME data complexity: SQL, datalog, datalog with negation, xquery, sparql, etc etc. So this is the lens through which we will study query evaluation on probabilistic databases. Data complexity is unique to database research

10 June, 2014 Probabilistic Databases - Dan Suciu Incomplete Database Definition An Incomplete Database is a finite set of database instances W = (W1, W2, …, Wn) Each Wi is called a possible world

11 June, 2014 Probabilistic Databases - Dan Suciu Incomplete Database Definition An Incomplete Database is a finite set of database instances W = (W1, W2, …, Wn) Each Wi is called a possible world W1 W2 W3 W4 Owner Name Object Joe Book302 Laptop77 Jim Fred GgleGlass Location Object Time Loc Laptop77 5:07 Hall 9:05 Office Book302 8:18

12 June, 2014 Probabilistic Databases - Dan Suciu Incomplete Database Definition An Incomplete Database is a finite set of database instances W = (W1, W2, …, Wn) Each Wi is called a possible world W1 W2 W3 W4 Owner Owner Name Object Joe Book302 Laptop77 Jim Fred GgleGlass Name Object Joe Book302 Jim Laptop77 Fred GgleGlass Location Location Object Time Loc Laptop77 5:07 Hall 9:05 Office Book302 8:18 Object Time Loc Book302 8:18 Office

13 June, 2014 Probabilistic Databases - Dan Suciu Incomplete Database Definition An Incomplete Database is a finite set of database instances W = (W1, W2, …, Wn) Each Wi is called a possible world W1 W2 W3 W4 Owner Owner Owner Owner Name Object Joe Book302 Laptop77 Jim Fred GgleGlass Name Object Joe Book302 Jim Laptop77 Fred GgleGlass Name Object Jim Laptop77 Name Object Joe Book302 Jim Laptop77 Fred GgleGlass Location Location Location Location Object Time Loc Laptop77 5:07 Hall 9:05 Office Book302 8:18 Object Time Loc Book302 8:18 Office Object Time Loc Laptop77 5:07 Hall 9:05 Office Object Time Loc Laptop77 5:07 Hall 9:05 Office Book302 8:18

14 Incomplete Database: Query Semantics
June, 2014 Probabilistic Databases - Dan Suciu Incomplete Database: Query Semantics Definition Given query Q, incomplete database W: An answer t is certain, if ∀Wi, t ∈Q(Wi) An answer t is possible if ∃Wi, t ∈Q(Wi)

15 Incomplete Database: Query Semantics
June, 2014 Probabilistic Databases - Dan Suciu Incomplete Database: Query Semantics Definition Given query Q, incomplete database W: An answer t is certain, if ∀Wi, t ∈Q(Wi) An answer t is possible if ∃Wi, t ∈Q(Wi) Q(z) = Owner(z,x), Location(x,t,’Office’) W1 W2 W3 W4 Owner Owner Owner Owner Name Object Joe Book302 Laptop77 Jim Fred GgleGlass Name Object Joe Book302 Jim Laptop77 Fred GgleGlass Name Object Joe Laptop77 Name Object Joe Book302 Jim Laptop77 Fred GgleGlass Location Location Location Location Object Time Loc Laptop77 5:07 Hall 9:05 Office Book302 8:18 Object Time Loc Book302 8:18 Office Object Time Loc Laptop77 5:07 Hall 9:05 Office Object Time Loc Laptop77 5:07 Hall 9:05 Office Book302 8:18

16 Incomplete Database: Query Semantics
June, 2014 Probabilistic Databases - Dan Suciu Incomplete Database: Query Semantics Definition Given query Q, incomplete database W: An answer t is certain, if ∀Wi, t ∈Q(Wi) An answer t is possible if ∃Wi, t ∈Q(Wi) Q(z) = Owner(z,x), Location(x,t,’Office’) W1 W2 W3 W4 Q= Owner Owner Owner Owner Name Object Joe Book302 Laptop77 Jim Fred GgleGlass Name Object Joe Book302 Jim Laptop77 Fred GgleGlass Name Object Joe Laptop77 Name Object Joe Book302 Jim Laptop77 Fred GgleGlass Location Location Location Location Object Time Loc Laptop77 5:07 Hall 9:05 Office Book302 8:18 Object Time Loc Book302 8:18 Office Object Time Loc Laptop77 5:07 Hall 9:05 Office Object Time Loc Laptop77 5:07 Hall 9:05 Office Book302 8:18 Joe Jim Joe Joe Joe Jim

17 Incomplete Database: Query Semantics
June, 2014 Probabilistic Databases - Dan Suciu Incomplete Database: Query Semantics Definition Given query Q, incomplete database W: An answer t is certain, if ∀Wi, t ∈Q(Wi) An answer t is possible if ∃Wi, t ∈Q(Wi) Q(z) = Owner(z,x), Location(x,t,’Office’) Certain answers to Q: Joe Possible answers to Q: Joe, Jim W1 W2 W3 W4 Q= Owner Owner Owner Owner Name Object Joe Book302 Laptop77 Jim Fred GgleGlass Name Object Joe Book302 Jim Laptop77 Fred GgleGlass Name Object Joe Laptop77 Name Object Joe Book302 Jim Laptop77 Fred GgleGlass Location Location Location Location Object Time Loc Laptop77 5:07 Hall 9:05 Office Book302 8:18 Object Time Loc Book302 8:18 Office Object Time Loc Laptop77 5:07 Hall 9:05 Office Object Time Loc Laptop77 5:07 Hall 9:05 Office Book302 8:18 Joe Jim Joe Joe Joe Jim

18 Probabilistic Database
June, 2014 Probabilistic Databases - Dan Suciu Probabilistic Database Definition A Probabilistic Database is (W, P), where W is an incomplete database, and P: W  [0,1] a probability distribution: Σi=1,n P(Wi) = 1

19 Probabilistic Database
June, 2014 Probabilistic Databases - Dan Suciu Probabilistic Database Definition A Probabilistic Database is (W, P), where W is an incomplete database, and P: W  [0,1] a probability distribution: Σi=1,n P(Wi) = 1 W1 W2 W3 W4 0.3 0.4 0.2 0.1 Owner Owner Owner Owner Name Object Joe Book302 Laptop77 Jim Fred GgleGlass Name Object Joe Book302 Jim Laptop77 Fred GgleGlass Name Object Jim Laptop77 Name Object Joe Book302 Jim Laptop77 Fred GgleGlass Location Location Location Location Object Time Loc Laptop77 5:07 Hall 9:05 Office Book302 8:18 Object Time Loc Book302 8:18 Office Object Time Loc Laptop77 5:07 Hall 9:05 Office Object Time Loc Laptop77 5:07 Hall 9:05 Office Book302 8:18

20 Probabilistic Database: Query Semantics
June, 2014 Probabilistic Databases - Dan Suciu Probabilistic Database: Query Semantics Definition Given query Q, probabilistic database (W,P): The marginal probability of an answer t is: P(t) = Σ { P(Wi) | Wi ∈ W, t ∈Q(Wi) }

21 Probabilistic Database: Query Semantics
June, 2014 Probabilistic Databases - Dan Suciu Probabilistic Database: Query Semantics Definition Given query Q, probabilistic database (W,P): The marginal probability of an answer t is: P(t) = Σ { P(Wi) | Wi ∈ W, t ∈Q(Wi) } Q(z) = Owner(z,x), Location(x,t,’Office’) W1 W2 W3 W4 0.3 0.4 0.2 0.1 Owner Owner Owner Owner Name Object Joe Book302 Laptop77 Jim Fred GgleGlass Name Object Joe Book302 Jim Laptop77 Fred GgleGlass Name Object Jim Laptop77 Name Object Joe Book302 Jim Laptop77 Fred GgleGlass Location Location Location Location Object Time Loc Laptop77 5:07 Hall 9:05 Office Book302 8:18 Object Time Loc Book302 8:18 Office Object Time Loc Laptop77 5:07 Hall 9:05 Office Object Time Loc Laptop77 5:07 Hall 9:05 Office Book302 8:18

22 Probabilistic Database: Query Semantics
June, 2014 Probabilistic Databases - Dan Suciu Probabilistic Database: Query Semantics Definition Given query Q, probabilistic database (W,P): The marginal probability of an answer t is: P(t) = Σ { P(Wi) | Wi ∈ W, t ∈Q(Wi) } Q(z) = Owner(z,x), Location(x,t,’Office’) W1 W2 W3 W4 0.3 0.4 0.2 0.1 Q= Owner Owner Owner Owner Name Object Joe Book302 Laptop77 Jim Fred GgleGlass Name Object Joe Book302 Jim Laptop77 Fred GgleGlass Name Object Joe Laptop77 Name Object Joe Book302 Jim Laptop77 Fred GgleGlass Location Location Location Location Object Time Loc Laptop77 5:07 Hall 9:05 Office Book302 8:18 Object Time Loc Book302 8:18 Office Object Time Loc Laptop77 5:07 Hall 9:05 Office Object Time Loc Laptop77 5:07 Hall 9:05 Office Book302 8:18 Joe Jim Joe Joe Joe Jim

23 Probabilistic Database: Query Semantics
June, 2014 Probabilistic Databases - Dan Suciu Probabilistic Database: Query Semantics Definition Given query Q, probabilistic database (W,P): The marginal probability of an answer t is: P(t) = Σ { P(Wi) | Wi ∈ W, t ∈Q(Wi) } Q(z) = Owner(z,x), Location(x,t,’Office’) P(Joe) = 1.0 P(Jim) = 0.4 W1 W2 W3 W4 0.3 0.4 0.2 0.1 Q= Owner Owner Owner Owner Name Object Joe Book302 Laptop77 Jim Fred GgleGlass Name Object Joe Book302 Jim Laptop77 Fred GgleGlass Name Object Joe Laptop77 Name Object Joe Book302 Jim Laptop77 Fred GgleGlass Location Location Location Location Object Time Loc Laptop77 5:07 Hall 9:05 Office Book302 8:18 Object Time Loc Book302 8:18 Office Object Time Loc Laptop77 5:07 Hall 9:05 Office Object Time Loc Laptop77 5:07 Hall 9:05 Office Book302 8:18 Joe Jim Joe Joe Joe Jim

24 June, 2014 Probabilistic Databases - Dan Suciu Discussion Intuition: a probabilistic database says that the database can be in one of possible states, each with a probability Possible query answers: a set of answers annotated with probabilities: (t1, p1), (t2, p2), (t3, p3), … Usually: p1 ≥ p2 ≥ p3 ≥ … Problem: the number of possible world in a probabilistic database is astronomically large. To represent it, we impose some restrictions

25 Independent, Disjoint Tuples
June, 2014 Probabilistic Databases - Dan Suciu Independent, Disjoint Tuples Definition Given a probabilistic database (W, P). Two tuples t1, t2 are called: Independent, if: P(t1 t2) = P(t1) P(t2) Disjoint (or exclusive), if: P(t1t2) = 0

26 Independent, Disjoint Tuples
June, 2014 Probabilistic Databases - Dan Suciu Independent, Disjoint Tuples Definition Given a probabilistic database (W, P). Two tuples t1, t2 are called: Independent, if: P(t1 t2) = P(t1) P(t2) Disjoint (or exclusive), if: P(t1t2) = 0 Definition A probabilistic database is called Block-Independent-Disjoint (BID), if its tuples are grouped into blocks such that: Tuples from the same block are disjoint Tuples from different blocks are independent

27 Example: BID Table W={ } Object Time Loc P Laptop77 9:07 Rm444 p1 Hall
June, 2014 Probabilistic Databases - Dan Suciu Example: BID Table BID Table Object Time Loc P Laptop77 9:07 Rm444 p1 Hall p2 Book302 9:18 Office p3 p4 Lift p5 Disjoint Inde pen dent Disjoint W={ Object Time Loc Laptop77 9:07 Rm444 Book302 9:18 Office Object Time Loc Laptop77 9:07 Rm444 Book302 9:18 } Possible worlds Object Time Loc Laptop77 9:07 Rm444 Book302 9:18 Lift Object Time Loc Laptop77 9:07 Hall Book302 9:18 Office Object Time Loc Laptop77 9:07 Hall Book302 9:18 Rm444 Object Time Loc Laptop77 9:07 Hall Book302 9:18 Lift p1p3 Object Time Loc Laptop77 9:07 Rm444 p1p4 Object Time Loc Laptop77 9:07 Hall Object Time Loc Book302 9:18 Office Object Time Loc Book302 9:18 Rm444 Object Time Loc Book302 9:18 Lift p1(1- p3-p4-p5) Object Time Loc

28 The Query Evaluation Problem
June, 2014 Probabilistic Databases - Dan Suciu The Query Evaluation Problem Given: a BID database D, a query Q, and output tuple t Compute: P(t) Note: D has, say, tuples, while the number of possible worlds is Challenge: compute P(t) efficiently, in the size of D Data complexity: the complexity of P depends dramatically on Q

29 An Example P(Q) = 1- {1- } * p1*[ ] 1-(1-q1)*(1-q2) {1- } p2*[ ]
June, 2014 Probabilistic Databases - Dan Suciu An Example Boolean query SELECT DISTINCT ‘true’ FROM R, S WHERE R.x = S.x Q() = R(x), S(x,y) P(Q) = 1- { } * p1*[ ] 1-(1-q1)*(1-q2) { } p2*[ ] 1-(1-q3)*(1-q4)*(1-q5) One can compute P(Q) in PTIME in the size of the database D S x y P a1 b1 q1 b2 q2 a2 b3 q3 b4 q4 b5 q5 R x P a1 p1 a2 p2 a3 p3

30 Summary of the Probabilistic Data Model
June, 2014 Probabilistic Databases - Dan Suciu Summary of the Probabilistic Data Model Possible Worlds Semantics: very powerful but difficult to represent Block-Independent-Disjoint databases have efficient representation: D is stored in a traditional database Independent Databases: even simpler Main challenge: evaluate Q efficiently in the size of D This tutorial


Download ppt "A Course on Probabilistic Databases"

Similar presentations


Ads by Google