1 Probabilistic/Uncertain Data Management -- III
Slides based on the Suciu/Dalvi SIGMOD’05 tutorial
1. Dalvi, Suciu. “Efficient query evaluation on probabilistic databases”, VLDB’2004.
2. Das Sarma et al. “Working models for uncertain data”, ICDE’2006.

2 What is a Probabilistic Database?
“An item belongs to the database” is a probabilistic event
– Tuple-existence uncertainty
– Attribute-value uncertainty
“A tuple is an answer to the query” is a probabilistic event
Can be extended to all data models; we discuss only probabilistic relational data

3 Possible Worlds Semantics
The set of all possible database instances: $INST = \{I_1, I_2, I_3, \ldots, I_N\}$
Definition: A probabilistic database $I^p$ is a probability distribution on INST, $\Pr : INST \to [0,1]$, s.t. $\sum_{i=1}^{N} \Pr(I_i) = 1$
Definition: A possible world is an instance $I$ s.t. $\Pr(I) > 0$

4 Query Semantics
Given a query $Q$ and a probabilistic database $I^p$, what is the meaning of $Q(I^p)$?

5 Query Semantics
Semantics 1: Possible Answers. A probability distribution on sets of tuples:
$\forall A.\ \Pr(Q = A) = \sum_{I \in INST,\ Q(I) = A} \Pr(I)$
Semantics 2: Possible Tuples. A probability function on tuples:
$\forall t.\ \Pr(t \in Q) = \sum_{I \in INST,\ t \in Q(I)} \Pr(I)$
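The two semantics on slide 5 can be contrasted on a tiny explicit set of possible worlds. The following is a minimal sketch, not taken from the slides: the three worlds, their probabilities, and the projection query `query` are all assumed for illustration.

```python
from collections import defaultdict

# Hypothetical probabilistic database: explicit possible worlds with probabilities.
# Each world is a frozenset of (name, city) tuples; the values are assumptions.
worlds = {
    frozenset({("John", "Seattle"), ("Sue", "Boston")}): 0.5,
    frozenset({("John", "Seattle")}): 0.3,
    frozenset(): 0.2,
}

def query(instance):
    """Hypothetical query Q: project every tuple onto its city (set semantics)."""
    return frozenset(city for _, city in instance)

def possible_answers(worlds, q):
    """Semantics 1: Pr(Q = A) is the sum of Pr(I) over worlds I with Q(I) = A."""
    dist = defaultdict(float)
    for instance, pr in worlds.items():
        dist[q(instance)] += pr
    return dict(dist)

def possible_tuples(worlds, q):
    """Semantics 2: Pr(t in Q) is the sum of Pr(I) over worlds I with t in Q(I)."""
    dist = defaultdict(float)
    for instance, pr in worlds.items():
        for t in q(instance):
            dist[t] += pr
    return dict(dist)

answers = possible_answers(worlds, query)
tuples = possible_tuples(worlds, query)
```

Here `answers` is a distribution over whole answer sets (its values sum to 1), while `tuples` assigns one marginal probability per answer tuple and in general does not sum to 1.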

6 Possible Worlds Query Semantics
Possible answers semantics
– Precise
– Can be used to compose queries
– Difficult user interface
Possible tuples semantics
– Less precise, but simple; sufficient for most apps
– Cannot be used to compose queries
– Simple user interface

7 Possible Worlds Semantics: Summary Complete model; Clean formal semantics for SQL queries Not very useful as a representation or implementation tool HUGE number of possible worlds! Need more effective representation formalisms Something that users can understand/explore Allow more efficient query execution –Avoid “possible worlds explosion” Perhaps giving up completeness

8 Representation Formalisms Problem Need a good representation formalism Will be interpreted as possible worlds Several formalisms exist, but no winner Main open problem in probabilistic db

9 Evaluation of Formalisms Completeness? What possible worlds can it represent? What probability distributions on worlds? Closure? Is it closed under evaluation of query operators?

10 A Complete Formalism: Intensional Databases
Atomic event ids: $e_1, e_2, e_3, \ldots$
Probabilities: $p_1, p_2, p_3, \ldots \in [0,1]$
Event expressions: built with $\land, \lor, \lnot$, e.g. $e_3 \land (e_5 \lor \lnot e_2)$
Intensional probabilistic database $J$: each tuple $t$ has an event attribute $t.E$

11 Intensional DB $\Rightarrow$ Possible Worlds
J =
Name | Address | E
John | Seattle | $e_1 \land (e_2 \lor e_3)$
Sue | Denver | $(e_1 \land e_2) \lor (e_2 \land e_3)$
$I^p$ = (truth assignments written as $e_1 e_2 e_3$)
World | Assignments | Probability
$\emptyset$ | 000, 001, 010, 100 | $(1-p_1)(1-p_2)(1-p_3) + (1-p_1)(1-p_2)p_3 + (1-p_1)p_2(1-p_3) + p_1(1-p_2)(1-p_3)$
{(John, Seattle)} | 101 | $p_1(1-p_2)p_3$
{(Sue, Denver)} | 011 | $(1-p_1)p_2 p_3$
{(John, Seattle), (Sue, Denver)} | 110, 111 | $p_1 p_2 (1-p_3) + p_1 p_2 p_3$
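The translation on slide 11 can be reproduced mechanically: enumerate every truth assignment of the atomic events, weight it by the independent event probabilities, and collect the tuples whose event expression holds. The sketch below assumes example values for $p_1, p_2, p_3$ (the slide leaves them symbolic) and models event expressions as Python predicates, which is an illustrative choice.

```python
from itertools import product

# Assumed example probabilities for the atomic events e1, e2, e3.
p = {"e1": 0.5, "e2": 0.4, "e3": 0.3}

# Intensional database J from slide 11: each tuple carries an event expression,
# modeled here as a predicate over a truth assignment a.
J = {
    ("John", "Seattle"): lambda a: a["e1"] and (a["e2"] or a["e3"]),
    ("Sue", "Denver"): lambda a: (a["e1"] and a["e2"]) or (a["e2"] and a["e3"]),
}

def worlds_of(J, p):
    """Return the distribution over possible worlds induced by J and p."""
    events = sorted(p)
    dist = {}
    for bits in product([False, True], repeat=len(events)):
        a = dict(zip(events, bits))
        # Probability of this truth assignment (atomic events are independent).
        pr = 1.0
        for e in events:
            pr *= p[e] if a[e] else 1.0 - p[e]
        # The world contains exactly the tuples whose event expression is true.
        world = frozenset(t for t, expr in J.items() if expr(a))
        dist[world] = dist.get(world, 0.0) + pr
    return dist

I_p = worlds_of(J, p)
```

With these assumed values, the world containing only John gets $p_1(1-p_2)p_3$ and the world containing both tuples gets $p_1 p_2$, matching the table on the slide.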

12 Possible Worlds $\Rightarrow$ Intensional DB
$I^p$ = four possible worlds:
World | Tuples | Probability
1 | (John, Seattle), (John, Boston), (Sue, Seattle) | $p_1$
2 | (John, Seattle), (Sue, Seattle) | $p_2$
3 | (Sue, Seattle) | $p_3$
4 | (John, Boston) | $p_4$
J =
Name | Address | E
John | Seattle | $E_1 \lor E_2$
John | Boston | $E_1 \lor E_4$
Sue | Seattle | $E_1 \lor E_2 \lor E_3$
“Prefix code”: $E_1 = e_1$, $E_2 = \lnot e_1 \land e_2$, $E_3 = \lnot e_1 \land \lnot e_2 \land e_3$, $E_4 = \lnot e_1 \land \lnot e_2 \land \lnot e_3 \land e_4$
$\Pr(e_1) = p_1$, $\Pr(e_2) = p_2/(1-p_1)$, $\Pr(e_3) = p_3/(1-p_1-p_2)$, $\Pr(e_4) = p_4/(1-p_1-p_2-p_3)$
Intensional DBs are complete
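The “prefix code” on slide 12 can be checked numerically: choosing $\Pr(e_i) = p_i / (1 - p_1 - \ldots - p_{i-1})$ should make $\Pr(E_i) = p_i$ for every world. The sketch below assumes four example world probabilities and assumes every proper prefix sum is below 1 (otherwise the division is undefined); the function names are hypothetical.

```python
def prefix_code_event_probs(world_probs):
    """Pr(e_i) = p_i / (1 - p_1 - ... - p_{i-1}), as on slide 12."""
    event_probs = []
    remaining = 1.0  # probability mass not yet assigned to an earlier world
    for p_i in world_probs:
        event_probs.append(p_i / remaining)
        remaining -= p_i
    return event_probs

def prefix_event_probability(event_probs, i):
    """Pr(E_i) for E_i = (not e_1) and ... and (not e_{i-1}) and e_i,
    with independent atomic events (index i is 0-based)."""
    pr = event_probs[i]
    for q in event_probs[:i]:
        pr *= 1.0 - q
    return pr

world_probs = [0.1, 0.2, 0.3, 0.4]  # assumed values for p1..p4
event_probs = prefix_code_event_probs(world_probs)
recovered = [prefix_event_probability(event_probs, i) for i in range(len(world_probs))]
```

Up to floating-point rounding, `recovered` equals `world_probs`, which is the sense in which the events $E_1, \ldots, E_4$ encode the original distribution.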

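13 Closure Under Operators
Selection $\sigma$: input $(v, E)$ gives output $(v, E)$
Product $\times$: inputs $(v_1, E_1)$ and $(v_2, E_2)$ give output $(v_1 v_2, E_1 \land E_2)$
Projection $\Pi$: inputs $(v, E_1), (v, E_2), \ldots$ give output $(v, E_1 \lor E_2 \lor \ldots)$
Difference $-$: inputs $(v, E_1)$ and $(v, E_2)$ give output $(v, E_1 \land \lnot E_2)$
One still needs to compute the probability of the event expression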
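A small sketch of how the operators on slide 13 propagate event expressions. The representation is an assumption made for illustration: a relation is a dict from value to event expression, and an expression is a nested tuple such as `("and", E1, E2)`. The final probability is computed by brute-force enumeration, which is exactly the expensive step the slide warns about.

```python
from itertools import product as cartesian

def select(rel, pred):
    """Selection keeps the event expression unchanged."""
    return {v: E for v, E in rel.items() if pred(v)}

def join(r, s):
    """Product: the output event is the conjunction of the input events."""
    return {(v1, v2): ("and", E1, E2) for v1, E1 in r.items() for v2, E2 in s.items()}

def project(rel, f):
    """Projection: tuples that collapse to the same value are combined with 'or'."""
    out = {}
    for v, E in rel.items():
        k = f(v)
        out[k] = ("or", out[k], E) if k in out else E
    return out

def difference(r, s):
    """Difference: E1 and not E2 when the value also occurs in s."""
    return {v: ("and", E, ("not", s[v])) if v in s else E for v, E in r.items()}

def atoms(E):
    """Names of the atomic events occurring in expression E."""
    if E[0] == "var":
        return {E[1]}
    return set().union(*(atoms(sub) for sub in E[1:]))

def holds(E, a):
    """Evaluate expression E under truth assignment a."""
    op = E[0]
    if op == "var":
        return a[E[1]]
    if op == "not":
        return not holds(E[1], a)
    if op == "and":
        return holds(E[1], a) and holds(E[2], a)
    if op == "or":
        return holds(E[1], a) or holds(E[2], a)
    raise ValueError(op)

def prob(E, p):
    """Brute-force Pr(E): sum over all assignments of the atoms in E."""
    names = sorted(atoms(E))
    total = 0.0
    for bits in cartesian([False, True], repeat=len(names)):
        a = dict(zip(names, bits))
        if holds(E, a):
            w = 1.0
            for n in names:
                w *= p[n] if a[n] else 1.0 - p[n]
            total += w
    return total

# Assumed example data.
p = {"e1": 0.5, "e2": 0.4}
R = {("x", 1): ("var", "e1"), ("x", 2): ("var", "e2")}
S = {"a": ("var", "e1")}
T = {"a": ("var", "e2")}
projected = project(R, lambda v: v[0])
diff = difference(S, T)
joined = join(S, T)
```

The operators themselves are cheap; only `prob` is exponential in the number of atomic events.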

14 Summary on Intensional Databases Event expression for each tuple Possible worlds: any subset Probability distribution: any Complete… but impractical Evaluate the probability of long event expressions Important abstraction: consider restrictions Related to c-tables [Imielinski&Lipski:1984]

15 A Restricted Formalism: Explicit Independent Tuples
Tuple independent probabilistic database
$TUP = \{t_1, t_2, \ldots, t_M\}$ = all tuples
$pr : TUP \to [0,1]$ (no restrictions)
$INST = \mathcal{P}(TUP)$, $N = 2^M$
$\Pr(I) = \prod_{t \in I} pr(t) \times \prod_{t \notin I} (1 - pr(t))$

16 Tuple Prob. $\Rightarrow$ Possible Worlds
J =
Name | City | pr
John | Seattle | $p_1 = 0.8$
Sue | Boston | $p_2 = 0.6$
Fred | Boston | $p_3 = 0.9$
$I^p$ =
World | Tuples | Probability
$I_1$ | $\emptyset$ | $(1-p_1)(1-p_2)(1-p_3)$
$I_2$ | (John, Seattle) | $p_1(1-p_2)(1-p_3)$
$I_3$ | (Sue, Boston) | $(1-p_1)p_2(1-p_3)$
$I_4$ | (Fred, Boston) | $(1-p_1)(1-p_2)p_3$
$I_5$ | (John, Seattle), (Sue, Boston) | $p_1 p_2 (1-p_3)$
$I_6$ | (John, Seattle), (Fred, Boston) | $p_1 (1-p_2) p_3$
$I_7$ | (Sue, Boston), (Fred, Boston) | $(1-p_1) p_2 p_3$
$I_8$ | (John, Seattle), (Sue, Boston), (Fred, Boston) | $p_1 p_2 p_3$
$\sum = 1$
$E[\,size(I^p)\,] = 2.3$ tuples
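The numbers on slide 16 can be reproduced by enumerating all $2^3$ subsets of the table and applying the tuple-independence formula $\Pr(I) = \prod_{t \in I} pr(t) \times \prod_{t \notin I} (1 - pr(t))$. This is a minimal sketch; the function names are hypothetical, the probabilities are the ones shown on the slide.

```python
from itertools import combinations

# Tuple probabilities from slide 16.
pr = {
    ("John", "Seattle"): 0.8,
    ("Sue", "Boston"): 0.6,
    ("Fred", "Boston"): 0.9,
}

def world_probability(world, pr):
    """Pr(I): product of pr(t) for t in I times (1 - pr(t)) for t not in I."""
    result = 1.0
    for t, p_t in pr.items():
        result *= p_t if t in world else 1.0 - p_t
    return result

def all_worlds(pr):
    """Every subset of TUP is a candidate possible world."""
    tuples = list(pr)
    for k in range(len(tuples) + 1):
        for subset in combinations(tuples, k):
            yield frozenset(subset)

dist = {w: world_probability(w, pr) for w in all_worlds(pr)}
total = sum(dist.values())
expected_size = sum(len(w) * q for w, q in dist.items())
```

`total` comes out as 1 and `expected_size` as 2.3 (up to rounding), the latter being simply $p_1 + p_2 + p_3$ by linearity of expectation.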

17 Tuple-Independent DBs are Incomplete
Tuple-independent table:
Name | Address | pr
John | Seattle | $p_1$
Sue | Seattle | $p_2$
$I^p$ =
World | Probability
{(John, Seattle), (Sue, Seattle)} | $p_1 p_2$
{(John, Seattle)} | $p_1$
$\emptyset$ | $1 - p_1 - p_1 p_2$
Very limited – cannot capture correlations across tuples
Not Closed: Query operators can introduce complex correlations!

18 Tuple Prob. $\Rightarrow$ Query Evaluation
Person:
Name | City | pr
John | Seattle | $p_1$
Sue | Boston | $p_2$
Fred | Boston | $p_3$
Purchase:
Customer | Product | Date | pr
John | Gizmo | ... | $q_1$
John | Gadget | ... | $q_2$
John | Gadget | ... | $q_3$
Sue | Camera | ... | $q_4$
Sue | Gadget | ... | $q_5$
Sue | Gadget | ... | $q_6$
Fred | Gadget | ... | $q_7$
SELECT DISTINCT x.city FROM Person x, Purchase y WHERE x.Name = y.Customer and y.Product = 'Gadget'
Answer:
Tuple | Probability
Seattle | $p_1 (1 - (1-q_2)(1-q_3))$
Boston | $1 - (1 - p_2 (1 - (1-q_5)(1-q_6))) \times (1 - p_3 q_7)$

19 Application: Similarity Predicates
Person:
Name | City | Profession
John | Seattle | statistician
Sue | Boston | musician
Fred | Boston | physicist
Purchase:
Cust | Product | Category
John | Gizmo | dishware
John | Gadget | instrument
John | Gadget | instrument
Sue | Camera | musicware
Sue | Gadget | microphone
Sue | Gadget | instrument
Fred | Gadget | microphone
SELECT DISTINCT x.city FROM Person x, Purchase y WHERE x.Name = y.Cust and y.Product = 'Gadget' and x.profession ~ 'scientist' and y.category ~ 'music'
Step 1: evaluate ~ predicates

20 Application: Similarity Predicates
$Person^p$:
Name | City | Profession | pr
John | Seattle | statistician | $p_1 = 0.8$
Sue | Boston | musician | $p_2 = 0.2$
Fred | Boston | physicist | $p_3 = 0.9$
$Purchase^p$:
Cust | Product | Category | pr
John | Gizmo | dishware | $q_1 = 0.2$
John | Gadget | instrument | $q_2 = 0.6$
John | Gadget | instrument | $q_3 = 0.6$
Sue | Camera | musicware | $q_4 = 0.9$
Sue | Gadget | microphone | $q_5 = 0.7$
Sue | Gadget | instrument | $q_6 = 0.6$
Fred | Gadget | microphone | $q_7 = 0.7$
SELECT DISTINCT x.city FROM Person^p x, Purchase^p y WHERE x.Name = y.Cust and y.Product = 'Gadget' and x.profession ~ 'scientist' and y.category ~ 'music'
Step 1: evaluate ~ predicates
Step 2: evaluate rest of query
Answer:
Tuple | Probability
Seattle | $p_1 (1 - (1-q_2)(1-q_3))$
Boston | $1 - (1 - p_2 (1 - (1-q_5)(1-q_6))) \times (1 - p_3 q_7)$
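Plugging the probabilities from slide 20 into the two answer formulas, and cross-checking them against a brute-force enumeration of all $2^{10}$ possible worlds, gives a concrete feel for step 2. This is a sketch under two assumptions: step 1 has already produced the `pr` values shown on the slide, and the two identical-looking John/Gadget rows are treated as distinct tuples.

```python
from itertools import product

# (name, city, pr) and (cust, product, pr), with the values from slide 20.
person = [("John", "Seattle", 0.8), ("Sue", "Boston", 0.2), ("Fred", "Boston", 0.9)]
purchase = [
    ("John", "Gizmo", 0.2), ("John", "Gadget", 0.6), ("John", "Gadget", 0.6),
    ("Sue", "Camera", 0.9), ("Sue", "Gadget", 0.7), ("Sue", "Gadget", 0.6),
    ("Fred", "Gadget", 0.7),
]

def closed_form():
    """The answer formulas from the slide."""
    p1, p2, p3 = (pr for _, _, pr in person)
    q1, q2, q3, q4, q5, q6, q7 = (pr for _, _, pr in purchase)
    seattle = p1 * (1 - (1 - q2) * (1 - q3))
    boston = 1 - (1 - p2 * (1 - (1 - q5) * (1 - q6))) * (1 - p3 * q7)
    return {"Seattle": seattle, "Boston": boston}

def brute_force():
    """Possible-tuples semantics by enumerating every world (exponential)."""
    answer = {}
    n = len(person)
    for bits in product([False, True], repeat=n + len(purchase)):
        weight = 1.0
        for present, (_, _, pr) in zip(bits, person + purchase):
            weight *= pr if present else 1.0 - pr
        persons = [t for t, b in zip(person, bits[:n]) if b]
        purchases = [t for t, b in zip(purchase, bits[n:]) if b]
        cities = {city for name, city, _ in persons
                  for cust, item, _ in purchases
                  if name == cust and item == "Gadget"}
        for city in cities:
            answer[city] = answer.get(city, 0.0) + weight
    return answer

formula = closed_form()
exact = brute_force()
```

Both computations agree on roughly 0.672 for Seattle and 0.695 for Boston.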

21 Summary on Explicit Independent Tuples Independent tuples Possible worlds: subsets Probability distribution: restricted Closure: no

22 Query Evaluation on Probabilistic DBs Focus on possible tuple semantics –Compute likelihood of individual answer tuples Probability of Boolean expressions Complexity of query evaluation

23 Probability of Boolean Expressions
$E = X_1 X_3 \lor X_1 X_4 \lor X_2 X_5 \lor X_2 X_6$
Randomly make each variable true with the following probabilities: $\Pr(X_1) = p_1, \Pr(X_2) = p_2, \ldots, \Pr(X_6) = p_6$
What is $\Pr(E)$?
Answer: re-group cleverly: $E = X_1 (X_3 \lor X_4) \lor X_2 (X_5 \lor X_6)$
$\Pr(E) = 1 - (1 - p_1 (1 - (1-p_3)(1-p_4)))\,(1 - p_2 (1 - (1-p_5)(1-p_6)))$
Needed for query processing
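The regrouped formula on slide 23 works because, after factoring, no variable occurs twice, so independence applies at every step. A quick numerical check against exhaustive enumeration is sketched below; the six probabilities are assumed example values.

```python
from itertools import product

def brute_force_probability(expr, probs):
    """Sum the weights of all truth assignments that satisfy expr."""
    total = 0.0
    for bits in product([False, True], repeat=len(probs)):
        if expr(bits):
            w = 1.0
            for b, p in zip(bits, probs):
                w *= p if b else 1.0 - p
            total += w
    return total

def E(x):
    """E = X1 X3 or X1 X4 or X2 X5 or X2 X6."""
    x1, x2, x3, x4, x5, x6 = x
    return (x1 and x3) or (x1 and x4) or (x2 and x5) or (x2 and x6)

def grouped(probs):
    """Closed form after regrouping E = X1 (X3 or X4) or X2 (X5 or X6)."""
    p1, p2, p3, p4, p5, p6 = probs
    left = p1 * (1 - (1 - p3) * (1 - p4))
    right = p2 * (1 - (1 - p5) * (1 - p6))
    return 1 - (1 - left) * (1 - right)

probs = [0.5, 0.4, 0.3, 0.2, 0.6, 0.7]  # assumed values for p1..p6
by_formula = grouped(probs)
by_enumeration = brute_force_probability(E, probs)
```

The enumeration visits $2^6$ assignments, while the grouped formula needs a constant number of arithmetic operations.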

24 Now let’s try this: $E = X_1 X_2 \lor X_1 X_3 \lor X_2 X_3$
No clever grouping seems possible. Brute force:
$X_1$ | $X_2$ | $X_3$ | E | Pr
0 | 0 | 0 | 0 |
0 | 0 | 1 | 0 |
0 | 1 | 0 | 0 |
0 | 1 | 1 | 1 | $(1-p_1) p_2 p_3$
1 | 0 | 0 | 0 |
1 | 0 | 1 | 1 | $p_1 (1-p_2) p_3$
1 | 1 | 0 | 1 | $p_1 p_2 (1-p_3)$
1 | 1 | 1 | 1 | $p_1 p_2 p_3$
$\Pr(E) = (1-p_1) p_2 p_3 + p_1 (1-p_2) p_3 + p_1 p_2 (1-p_3) + p_1 p_2 p_3$
Seems inefficient in general…

25 Complexity of Boolean Expression Probability
Theorem [Valiant:1979]: For a boolean expression E, computing Pr(E) is #P-complete
NP = class of problems of the form “is there a witness?” (e.g. SAT)
#P = class of problems of the form “how many witnesses?” (e.g. #SAT)
The decision problem for 2CNF is in PTIME
The counting problem for 2CNF is #P-complete [Valiant:1979]

26 Summary on Boolean Expression Probability #P-complete It’s hard even in simple cases: 2DNF Can approximate through Monte Carlo (MC) simulation
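As a rough illustration of the Monte Carlo remark, the sketch below estimates $\Pr(E)$ for $E = X_1 X_2 \lor X_1 X_3 \lor X_2 X_3$ by naive sampling and compares it with the exact sum of the satisfying rows of the truth table. This is plain sampling, not the specific estimator used in the cited papers; the probabilities, sample count, and seed are assumptions.

```python
import random

def exact_probability(p1, p2, p3):
    """Sum over the four satisfying rows of the truth table for E."""
    return ((1 - p1) * p2 * p3 + p1 * (1 - p2) * p3
            + p1 * p2 * (1 - p3) + p1 * p2 * p3)

def E(x):
    """E = X1 X2 or X1 X3 or X2 X3."""
    x1, x2, x3 = x
    return (x1 and x2) or (x1 and x3) or (x2 and x3)

def monte_carlo(expr, probs, samples, seed=0):
    """Naive estimator: fraction of random assignments that satisfy expr."""
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    hits = 0
    for _ in range(samples):
        x = [rng.random() < p for p in probs]
        if expr(x):
            hits += 1
    return hits / samples

probs = [0.5, 0.4, 0.3]  # assumed values for p1, p2, p3
estimate = monte_carlo(E, probs, samples=100_000)
exact = exact_probability(*probs)
```

The standard error of the naive estimator shrinks as $1/\sqrt{\text{samples}}$, so it is cheap for moderate probabilities but needs many samples when $\Pr(E)$ is very small.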

27 Query Complexity
Data complexity of a query Q: Compute $Q(I^p)$, for probabilistic database $I^p$
Simplest scenario only:
– Possible tuples semantics for Q
– Independent tuples for $I^p$

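28 Extensional Query Evaluation
Relational ops compute probabilities [Fuhr&Roellke:1997, Dalvi&Suciu:2004]
Selection $\sigma$: input $(v, p)$ gives output $(v, p)$
Product $\times$: inputs $(v_1, p_1)$ and $(v_2, p_2)$ give output $(v_1 v_2, p_1 p_2)$
Projection $\Pi$: inputs $(v, p_1), (v, p_2), \ldots$ give output $(v, 1 - (1-p_1)(1-p_2)\ldots)$
Difference $-$: inputs $(v, p_1)$ and $(v, p_2)$ give output $(v, p_1 (1-p_2))$
Unlike intensional evaluation, data complexity: PTIME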

29 SELECT DISTINCT x.City FROM Person^p x, Purchase^p y WHERE x.Name = y.Cust and y.Product = 'Gadget'
$Person^p$: (Jon, Sea) with probability $p_1$. $Purchase^p$: three tuples for Jon with probabilities $q_1, q_2, q_3$.
Plan 1 (join $\times$, then project $\Pi$): the join produces (Jon, Sea) three times, with probabilities $p_1 q_1$, $p_1 q_2$, $p_1 q_3$; the projection then gives $1 - (1 - p_1 q_1)(1 - p_1 q_2)(1 - p_1 q_3)$. Wrong!
Plan 2 (project $\Pi$, then join $\times$): projecting $Purchase^p$ gives Jon with probability $1 - (1-q_1)(1-q_2)(1-q_3)$; the join then gives (Jon, Sea) with probability $p_1 (1 - (1-q_1)(1-q_2)(1-q_3))$. Correct
Depends on plan !!! [Dalvi&Suciu:2004]
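The plan dependence on slide 29 can be reproduced with two small extensional operators: a join that multiplies probabilities and a duplicate-eliminating projection that combines them as $1 - \prod (1 - p)$. The sketch below is illustrative only: relations are lists of `(tuple, probability)` pairs, the values of $p_1$ and $q_1, q_2, q_3$ are assumed, and all three purchases are assumed to be Gadget purchases.

```python
from itertools import product

def join(r, s, on):
    """Extensional join: multiply the probabilities of matching tuples."""
    return [(tr + ts, pr * ps) for tr, pr in r for ts, ps in s if on(tr, ts)]

def project(r, f):
    """Extensional projection: combine duplicates as 1 - prod(1 - p),
    which is only valid if the combined tuples are independent."""
    miss = {}
    for t, p in r:
        k = f(t)
        miss[k] = miss.get(k, 1.0) * (1.0 - p)
    return [(k, 1.0 - m) for k, m in miss.items()]

# Assumed example values: p1 = 0.8, q1 = q2 = q3 = 0.5.
person = [(("Jon", "Sea"), 0.8)]
purchase = [(("Jon",), 0.5), (("Jon",), 0.5), (("Jon",), 0.5)]
same_name = lambda tr, ts: tr[0] == ts[0]

# Plan 1: join first, then project onto City. The three joined tuples all
# depend on the same Person tuple, so the projection's independence
# assumption is violated.
plan1 = dict(project(join(person, purchase, same_name), lambda t: (t[1],)))

# Plan 2: project Purchase onto Cust first, then join, then project onto City.
buyers = project(purchase, lambda t: (t[0],))
plan2 = dict(project(join(person, buyers, same_name), lambda t: (t[1],)))

def brute_force(person, purchase):
    """Pr(Sea is in the answer), by enumerating all possible worlds."""
    tuples = person + purchase
    n = len(person)
    total = 0.0
    for bits in product([False, True], repeat=len(tuples)):
        w = 1.0
        for b, (_, p) in zip(bits, tuples):
            w *= p if b else 1.0 - p
        persons = [t for (t, _), b in zip(person, bits[:n]) if b]
        custs = {t[0] for (t, _), b in zip(purchase, bits[n:]) if b}
        if any(name in custs for name, _ in persons):
            total += w
    return total

truth = brute_force(person, purchase)
```

With these values plan 1 yields about 0.784 while plan 2 and the brute-force check both yield 0.7.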

30 Query Complexity
Sometimes $\nexists$ a correct (“safe”) extensional plan
$Q_{bad}$ :- R(x), S(x,y), T(y)
Data complexity is #P-complete
Theorem [Dalvi&Suciu:2004]: The following are equivalent
– Q has PTIME data complexity
– Q admits an extensional plan (and one finds it in PTIME)
– Q does not have $Q_{bad}$ as a subquery

31 Computing a Safe SPJ Extensional Plan
Problem is due to projection operations
– An “unsafe” extensional projection combines correlated tuples as if they were independent
Projection over a join that projects away at least one of the join attrs $\Rightarrow$ Unsafe projection!
– Intuitive: Joins create correlated output tuples

32 Computing a Safe SPJ Extensional Plan Algorithm for Safe Extensional SPJ Evaluation Apply safe projections as late as possible in the plan If no more safe projections exist, look for joins where all attributes are included in the output –Recurse on the LHS, RHS of the join Sound and complete safe SPJ evaluation algorithm If a safe plan exists, the algo finds it!

33 Summary on Query Complexity
Extensional query evaluation:
– Very popular
– Guarantees polynomial complexity
– However, the result depends on the query plan, and correctness is not always possible!
General query complexity
– #P-complete (not surprising, given #SAT)
– Already #P-hard for a very simple query ($Q_{bad}$)
Probabilistic databases have high query complexity
