Probabilistic/Uncertain Data Management -- III


1 Probabilistic/Uncertain Data Management -- III
Slides based on the Suciu/Dalvi SIGMOD’05 tutorial
1. Dalvi, Suciu. “Efficient query evaluation on probabilistic databases”, VLDB’
2. Das Sarma et al. “Working models for uncertain data”, ICDE’2006.

2 What is a Probabilistic Database?
“An item belongs to the database” is a probabilistic event
–Tuple-existence uncertainty
–Attribute-value uncertainty
“A tuple is an answer to the query” is a probabilistic event
Can be extended to all data models; we discuss only probabilistic relational data

3 Possible Worlds Semantics
The set of all possible database instances: $INST = \{I_1, I_2, I_3, \ldots, I_N\}$
Definition: A probabilistic database $I^p$ is a probability distribution on $INST$, $\Pr : INST \to [0,1]$, s.t. $\sum_{i=1}^{N} \Pr(I_i) = 1$
Definition: A possible world is an instance $I$ s.t. $\Pr(I) > 0$

4 Query Semantics
Given a query $Q$ and a probabilistic database $I^p$, what is the meaning of $Q(I^p)$?

5 Query Semantics
Semantics 1: Possible Answers
A probability distribution on sets of tuples
$\forall A.\ \Pr(Q = A) = \sum_{I \in INST,\ Q(I) = A} \Pr(I)$
Semantics 2: Possible Tuples
A probability function on tuples
$\forall t.\ \Pr(t \in Q) = \sum_{I \in INST,\ t \in Q(I)} \Pr(I)$
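The two semantics can be illustrated by direct enumeration. The following is a minimal sketch, not taken from the slides: the three-world database, its probabilities, and the query "distinct cities" are all made-up assumptions chosen only to show how the two sums differ.

```python
from collections import defaultdict

# Hypothetical toy probabilistic database (not from the slides): each
# possible world is a frozenset of (name, city) tuples with a probability.
worlds = [
    (frozenset({("John", "Seattle"), ("Sue", "Boston")}), 0.5),
    (frozenset({("John", "Seattle")}), 0.3),
    (frozenset(), 0.2),
]

def query(instance):
    """Illustrative query Q: the set of distinct cities in the instance."""
    return frozenset(city for _, city in instance)

def possible_answers(worlds, q):
    """Semantics 1: Pr(Q = A) for every answer set A."""
    dist = defaultdict(float)
    for instance, pr in worlds:
        dist[q(instance)] += pr
    return dict(dist)

def possible_tuples(worlds, q):
    """Semantics 2: Pr(t in Q) for every answer tuple t."""
    marginals = defaultdict(float)
    for instance, pr in worlds:
        for t in q(instance):
            marginals[t] += pr
    return dict(marginals)

answers = possible_answers(worlds, query)
tuples = possible_tuples(worlds, query)
```

Here `answers` is a distribution over whole answer sets (it sums to 1), while `tuples` holds one marginal per answer tuple and need not sum to 1.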

6 Possible Worlds Query Semantics
Possible answers semantics
–Precise
–Can be used to compose queries
–Difficult user interface
Possible tuples semantics
–Less precise, but simple; sufficient for most apps
–Cannot be used to compose queries
–Simple user interface

7 Possible Worlds Semantics: Summary
Complete model; clean formal semantics for SQL queries
Not very useful as a representation or implementation tool
HUGE number of possible worlds!
Need more effective representation formalisms
Something that users can understand/explore
Allow more efficient query execution
–Avoid “possible worlds explosion”
Perhaps giving up completeness

8 Representation Formalisms
Problem: need a good representation formalism
Will be interpreted as possible worlds
Several formalisms exist, but no winner
Main open problem in probabilistic db

9 Evaluation of Formalisms
Completeness?
–What possible worlds can it represent?
–What probability distributions on worlds?
Closure?
–Is it closed under evaluation of query operators?

10 A Complete Formalism: Intensional Databases
Atomic event ids: $e_1, e_2, e_3, \ldots$
Probabilities: $p_1, p_2, p_3, \ldots \in [0,1]$
Event expressions: built with $\wedge, \vee, \neg$, e.g. $e_3 \wedge (e_5 \vee \neg e_2)$
Intensional probabilistic database $J$: each tuple $t$ has an event attribute $t.E$

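11 Intensional DB $\Rightarrow$ Possible Worlds
$J$ =
Name | Address | E
John | Seattle | $e_1 \wedge (e_2 \vee e_3)$
Sue | Denver | $(e_1 \wedge e_2) \vee (e_2 \wedge e_3)$
$I^p$ = (one world per set of tuples, summing over the truth assignments to $e_1 e_2 e_3$ that produce it)
World | Probability
$\emptyset$ | $(1-p_1)(1-p_2)(1-p_3) + (1-p_1)(1-p_2)p_3 + (1-p_1)p_2(1-p_3) + p_1(1-p_2)(1-p_3)$
{(John, Seattle)} | $p_1(1-p_2)p_3$
{(Sue, Denver)} | $(1-p_1)p_2 p_3$
{(John, Seattle), (Sue, Denver)} | $p_1 p_2(1-p_3) + p_1 p_2 p_3$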
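The translation from an intensional database to possible worlds can be sketched by enumerating all truth assignments to the atomic events. This is an illustrative sketch only: the numeric values chosen for $p_1, p_2, p_3$ are assumptions, the events are assumed independent, and the helper name `to_possible_worlds` is hypothetical.

```python
from itertools import product
from collections import defaultdict

# Event probabilities p1, p2, p3: illustrative values, not from the slides.
p = {"e1": 0.5, "e2": 0.4, "e3": 0.3}

# Intensional database J from the slide: tuple -> event expression,
# written as a Python predicate over a truth assignment `a`.
J = {
    ("John", "Seattle"): lambda a: a["e1"] and (a["e2"] or a["e3"]),
    ("Sue", "Denver"): lambda a: (a["e1"] and a["e2"]) or (a["e2"] and a["e3"]),
}

def to_possible_worlds(J, p):
    """Enumerate every truth assignment and group weights by resulting world."""
    events = sorted(p)
    worlds = defaultdict(float)
    for bits in product([False, True], repeat=len(events)):
        a = dict(zip(events, bits))
        weight = 1.0
        for e in events:
            # Assumes atomic events are independent.
            weight *= p[e] if a[e] else 1.0 - p[e]
        world = frozenset(t for t, expr in J.items() if expr(a))
        worlds[world] += weight
    return dict(worlds)

worlds = to_possible_worlds(J, p)
```

With these values the world containing only John gets $p_1(1-p_2)p_3$ and the world containing both tuples gets $p_1 p_2$, in line with the table on the slide.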

12 Possible Worlds $\Rightarrow$ Intensional DB
$I^p$ =
World | Probability
{(John, Seattle), (John, Boston), (Sue, Seattle)} | $p_1$
{(John, Seattle), (Sue, Seattle)} | $p_2$
{(Sue, Seattle)} | $p_3$
{(John, Boston)} | $p_4$
$J$ =
Name | Address | E
John | Seattle | $E_1 \vee E_2$
John | Boston | $E_1 \vee E_4$
Sue | Seattle | $E_1 \vee E_2 \vee E_3$
“Prefix code”:
$E_1 = e_1$
$E_2 = \neg e_1 \wedge e_2$
$E_3 = \neg e_1 \wedge \neg e_2 \wedge e_3$
$E_4 = \neg e_1 \wedge \neg e_2 \wedge \neg e_3 \wedge e_4$
$\Pr(e_1) = p_1$
$\Pr(e_2) = p_2/(1-p_1)$
$\Pr(e_3) = p_3/(1-p_1-p_2)$
$\Pr(e_4) = p_4/(1-p_1-p_2-p_3)$
Intensional DBs are complete
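As a sanity check on the prefix code, the following short derivation assumes that the atomic events $e_1, \ldots, e_4$ are independent (an assumption made explicit here) and that all denominators are nonzero. It shows that each $E_k$ receives exactly the probability $p_k$ of world $k$.

```latex
\begin{align*}
\Pr(E_1) &= \Pr(e_1) = p_1 \\
\Pr(E_2) &= (1-p_1)\cdot\frac{p_2}{1-p_1} = p_2 \\
\Pr(E_3) &= (1-p_1)\Bigl(1-\frac{p_2}{1-p_1}\Bigr)\cdot\frac{p_3}{1-p_1-p_2}
          = (1-p_1-p_2)\cdot\frac{p_3}{1-p_1-p_2} = p_3 \\
\Pr(E_4) &= (1-p_1-p_2-p_3)\cdot\frac{p_4}{1-p_1-p_2-p_3} = p_4
\end{align*}
```

Because the $E_k$ are mutually exclusive by construction, exactly one of them can hold in any assignment, so each assignment selects one of the original worlds.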

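13 Closure Under Operators
$\sigma$: $(v, E) \mapsto (v, E)$
$\times$: $(v_1, E_1), (v_2, E_2) \mapsto (v_1 v_2,\ E_1 \wedge E_2)$
$\Pi$: $(v, E_1), (v, E_2), \ldots \mapsto (v,\ E_1 \vee E_2 \vee \ldots)$
$-$: $(v, E_1) - (v, E_2) \mapsto (v,\ E_1 \wedge \neg E_2)$
One still needs to compute the probability of the event expression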

14 Summary on Intensional Databases
Event expression for each tuple
Possible worlds: any subset
Probability distribution: any
Complete… but impractical
–Must evaluate the probability of long event expressions
Important abstraction: consider restrictions
Related to c-tables [Imielinski&Lipski:1984]

15 A Restricted Formalism: Explicit Independent Tuples
Tuple independent probabilistic database
$TUP = \{t_1, t_2, \ldots, t_M\}$ = all tuples
$pr : TUP \to [0,1]$ (no restrictions)
$INST = \mathcal{P}(TUP)$, $N = 2^M$
$\Pr(I) = \prod_{t \in I} pr(t) \times \prod_{t \notin I} (1 - pr(t))$

16 Tuple Prob. $\Rightarrow$ Possible Worlds
$J$ =
Name | City | pr
John | Seattle | $p_1 = 0.8$
Sue | Boston | $p_2 = 0.6$
Fred | Boston | $p_3 = 0.9$
$I^p$ =
World | Tuples | Probability
$I_1$ | $\emptyset$ | $(1-p_1)(1-p_2)(1-p_3)$
$I_2$ | John | $p_1(1-p_2)(1-p_3)$
$I_3$ | Sue | $(1-p_1)p_2(1-p_3)$
$I_4$ | Fred | $(1-p_1)(1-p_2)p_3$
$I_5$ | John, Sue | $p_1 p_2(1-p_3)$
$I_6$ | John, Fred | $p_1(1-p_2)p_3$
$I_7$ | Sue, Fred | $(1-p_1)p_2 p_3$
$I_8$ | John, Sue, Fred | $p_1 p_2 p_3$
$\sum = 1$
$E[\,size(I^p)\,] = 2.3$ tuples
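The expansion of a tuple-independent table into its $2^M$ possible worlds can be checked with a few lines of code. This is a minimal sketch using the three tuple probabilities from the slide; the function names are hypothetical.

```python
from itertools import combinations

# Tuple probabilities from the slide (p1 = 0.8, p2 = 0.6, p3 = 0.9).
pr = {("John", "Seattle"): 0.8, ("Sue", "Boston"): 0.6, ("Fred", "Boston"): 0.9}

def world_probability(world, pr):
    """Pr(I): product of pr(t) for t in I times (1 - pr(t)) for t not in I."""
    prob = 1.0
    for t, p in pr.items():
        prob *= p if t in world else 1.0 - p
    return prob

def all_worlds(pr):
    """Yield every subset of TUP as a frozenset."""
    tuples = list(pr)
    for k in range(len(tuples) + 1):
        for subset in combinations(tuples, k):
            yield frozenset(subset)

worlds = {w: world_probability(w, pr) for w in all_worlds(pr)}
total = sum(worlds.values())
expected_size = sum(len(w) * p for w, p in worlds.items())
```

The expected size equals the sum of the tuple probabilities, $0.8 + 0.6 + 0.9 = 2.3$, by linearity of expectation.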

17 Tuple-Independent DBs are Incomplete
Tuple-independent table:
Name | Address | pr
John | Seattle | $p_1$
Sue | Seattle | $p_2$
$I^p$ has three possible worlds: {(John, Seattle), (Sue, Seattle)}, {(John, Seattle)}, $\emptyset$
World probabilities as listed on the slide: $p_1$, $p_1 p_2$, $1 - p_1 - p_1 p_2$
Very limited: cannot capture correlations across tuples
Not closed
–Query operators can introduce complex correlations!

18 Tuple Prob. $\Rightarrow$ Query Evaluation
Person =
Name | City | pr
John | Seattle | $p_1$
Sue | Boston | $p_2$
Fred | Boston | $p_3$
Purchase =
Customer | Product | Date | pr
John | Gizmo | ... | $q_1$
John | Gadget | ... | $q_2$
John | Gadget | ... | $q_3$
Sue | Camera | ... | $q_4$
Sue | Gadget | ... | $q_5$
Sue | Gadget | ... | $q_6$
Fred | Gadget | ... | $q_7$
Query:
  SELECT DISTINCT x.city
  FROM Person x, Purchase y
  WHERE x.Name = y.Customer and y.Product = 'Gadget'
Answer:
Tuple | Probability
Seattle | $p_1\,(1-(1-q_2)(1-q_3))$
Boston | $1 - (1 - p_2(1-(1-q_5)(1-q_6))) \times (1 - p_3 q_7)$

19 Application: Similarity Predicates
Person =
Name | City | Profession
John | Seattle | statistician
Sue | Boston | musician
Fred | Boston | physicist
Purchase =
Cust | Product | Category
John | Gizmo | dishware
John | Gadget | instrument
John | Gadget | instrument
Sue | Camera | musicware
Sue | Gadget | microphone
Sue | Gadget | instrument
Fred | Gadget | microphone
Query:
  SELECT DISTINCT x.city
  FROM Person x, Purchase y
  WHERE x.Name = y.Cust and y.Product = 'Gadget'
    and x.profession ~ 'scientist' and y.category ~ 'music'
Step 1: evaluate ~ predicates

20 Application: Similarity Predicates
Person$^p$ =
Name | City | Profession | pr
John | Seattle | statistician | $p_1 = 0.8$
Sue | Boston | musician | $p_2 = 0.2$
Fred | Boston | physicist | $p_3 = 0.9$
Purchase$^p$ =
Cust | Product | Category | pr
John | Gizmo | dishware | $q_1 = 0.2$
John | Gadget | instrument | $q_2 = 0.6$
John | Gadget | instrument | $q_3 = 0.6$
Sue | Camera | musicware | $q_4 = 0.9$
Sue | Gadget | microphone | $q_5 = 0.7$
Sue | Gadget | instrument | $q_6 = 0.6$
Fred | Gadget | microphone | $q_7 = 0.7$
Query:
  SELECT DISTINCT x.city
  FROM Person$^p$ x, Purchase$^p$ y
  WHERE x.Name = y.Cust and y.Product = 'Gadget'
    and x.profession ~ 'scientist' and y.category ~ 'music'
Step 1: evaluate ~ predicates
Step 2: evaluate rest of query
Tuple | Probability
Seattle | $p_1\,(1-(1-q_2)(1-q_3))$
Boston | $1 - (1 - p_2(1-(1-q_5)(1-q_6))) \times (1 - p_3 q_7)$
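Plugging the slide's numbers into the two answer formulas gives concrete probabilities, and a brute-force pass over all possible worlds can confirm them. The sketch below assumes tuple independence; the two identical-looking John/Gadget rows are treated as distinct tuples (they presumably differ in columns not shown), and the function names are hypothetical.

```python
from itertools import product

# (name, city, pr) and (cust, product, pr) rows with the slide's probabilities.
person = [("John", "Seattle", 0.8), ("Sue", "Boston", 0.2), ("Fred", "Boston", 0.9)]
purchase = [
    ("John", "Gizmo", 0.2), ("John", "Gadget", 0.6), ("John", "Gadget", 0.6),
    ("Sue", "Camera", 0.9), ("Sue", "Gadget", 0.7), ("Sue", "Gadget", 0.6),
    ("Fred", "Gadget", 0.7),
]

def closed_form():
    """Evaluate the two answer formulas from the slide."""
    p1, p2, p3 = (p for _, _, p in person)
    q = [p for _, _, p in purchase]  # q[0] is q1, ..., q[6] is q7
    seattle = p1 * (1 - (1 - q[1]) * (1 - q[2]))
    boston = 1 - (1 - p2 * (1 - (1 - q[4]) * (1 - q[5]))) * (1 - p3 * q[6])
    return {"Seattle": seattle, "Boston": boston}

def brute_force():
    """Sum world probabilities over all 2^10 worlds where a city is an answer."""
    rows = person + purchase
    result = {"Seattle": 0.0, "Boston": 0.0}
    for present in product([False, True], repeat=len(rows)):
        weight = 1.0
        for (_, _, p), here in zip(rows, present):
            weight *= p if here else 1 - p
        people = [r for r, here in zip(person, present[:len(person)]) if here]
        buys = [r for r, here in zip(purchase, present[len(person):]) if here]
        cities = {city for name, city, _ in people
                  for cust, item, _ in buys
                  if name == cust and item == "Gadget"}
        for city in cities:
            result[city] += weight
    return result

exact = closed_form()
checked = brute_force()
```

For these inputs the formulas evaluate to $0.8 \times 0.84 = 0.672$ for Seattle and $1 - 0.824 \times 0.37 = 0.69512$ for Boston.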

21 Summary on Explicit Independent Tuples
Independent tuples
Possible worlds: subsets
Probability distribution: restricted
Closure: no

22 Query Evaluation on Probabilistic DBs
Focus on possible tuple semantics
–Compute likelihood of individual answer tuples
Probability of Boolean expressions
Complexity of query evaluation

23 Probability of Boolean Expressions
Needed for query processing
$E = X_1 X_3 \vee X_1 X_4 \vee X_2 X_5 \vee X_2 X_6$
Randomly make each variable true with the following probabilities:
$\Pr(X_1) = p_1,\ \Pr(X_2) = p_2,\ \ldots,\ \Pr(X_6) = p_6$
What is $\Pr(E)$?
Answer: re-group cleverly
$E = X_1 (X_3 \vee X_4) \vee X_2 (X_5 \vee X_6)$
$\Pr(E) = 1 - (1 - p_1(1-(1-p_3)(1-p_4)))\,(1 - p_2(1-(1-p_5)(1-p_6)))$

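24 Now let’s try this:
$E = X_1 X_2 \vee X_1 X_3 \vee X_2 X_3$
No clever grouping seems possible. Brute force:
$X_1$ | $X_2$ | $X_3$ | $E$ | Pr
0 | 0 | 0 | 0 |
0 | 0 | 1 | 0 |
0 | 1 | 0 | 0 |
0 | 1 | 1 | 1 | $(1-p_1)p_2 p_3$
1 | 0 | 0 | 0 |
1 | 0 | 1 | 1 | $p_1(1-p_2)p_3$
1 | 1 | 0 | 1 | $p_1 p_2(1-p_3)$
1 | 1 | 1 | 1 | $p_1 p_2 p_3$
$\Pr(E) = (1-p_1)p_2 p_3 + p_1(1-p_2)p_3 + p_1 p_2(1-p_3) + p_1 p_2 p_3$
Seems inefficient in general…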
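The brute-force method amounts to summing the weights of all satisfying truth assignments, which takes $2^n$ steps for $n$ variables. The sketch below is illustrative: the probability values are assumptions, and it also cross-checks the regrouped closed form for the six-variable expression from the previous slide.

```python
from itertools import product

def probability(expr, probs):
    """Pr(expr) by summing over all 2^n truth assignments (exponential time)."""
    total = 0.0
    for bits in product([False, True], repeat=len(probs)):
        weight = 1.0
        for p, bit in zip(probs, bits):
            weight *= p if bit else 1.0 - p
        if expr(*bits):
            total += weight
    return total

def e_grouped(x1, x2, x3, x4, x5, x6):
    """E = X1 X3 v X1 X4 v X2 X5 v X2 X6 (admits a clever regrouping)."""
    return (x1 and x3) or (x1 and x4) or (x2 and x5) or (x2 and x6)

def e_majority(x1, x2, x3):
    """E = X1 X2 v X1 X3 v X2 X3 (no clever grouping)."""
    return (x1 and x2) or (x1 and x3) or (x2 and x3)

# Illustrative probabilities, not from the slides.
p1, p2, p3, p4, p5, p6 = 0.5, 0.4, 0.3, 0.2, 0.6, 0.7

brute_grouped = probability(e_grouped, [p1, p2, p3, p4, p5, p6])
closed_grouped = 1 - (1 - p1 * (1 - (1 - p3) * (1 - p4))) * (1 - p2 * (1 - (1 - p5) * (1 - p6)))

brute_majority = probability(e_majority, [p1, p2, p3])
row_sum = (1 - p1) * p2 * p3 + p1 * (1 - p2) * p3 + p1 * p2 * (1 - p3) + p1 * p2 * p3
```

The regrouped formula needs only a handful of multiplications, whereas `probability` visits every assignment; that gap is what the complexity result on the next slide formalizes.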

25 Complexity of Boolean Expression Probability
Theorem [Valiant:1979]: For a Boolean expression E, computing Pr(E) is #P-complete
NP = class of problems of the form “is there a witness?” (SAT)
#P = class of problems of the form “how many witnesses?” (#SAT)
The decision problem for 2CNF is in PTIME
The counting problem for 2CNF is #P-complete [Valiant:1979]

26 Summary on Boolean Expression Probability
#P-complete
It’s hard even in simple cases: 2DNF
Can approximate through Monte Carlo (MC) simulation
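A naive Monte Carlo estimator gives a feel for the approximation route. This sketch is an assumption-laden illustration: it samples each variable independently and counts how often the expression holds. It is not the specific estimator used in the cited papers, and the probabilities, sample count, and seed are arbitrary choices.

```python
import random

def monte_carlo(expr, probs, samples=100_000, seed=0):
    """Estimate Pr(expr) as the fraction of random assignments satisfying it."""
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    hits = 0
    for _ in range(samples):
        bits = [rng.random() < p for p in probs]
        if expr(*bits):
            hits += 1
    return hits / samples

def e_majority(x1, x2, x3):
    """E = X1 X2 v X1 X3 v X2 X3."""
    return (x1 and x2) or (x1 and x3) or (x2 and x3)

# Illustrative probabilities; the exact answer for these values is 0.35.
estimate = monte_carlo(e_majority, [0.5, 0.4, 0.3])
```

The standard error shrinks as $1/\sqrt{samples}$, so the estimate is cheap to compute but only approximate, and very small probabilities need many samples.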

27 Query Complexity
Data complexity of a query $Q$: compute $Q(I^p)$, for probabilistic database $I^p$
Simplest scenario only:
–Possible tuples semantics for $Q$
–Independent tuples for $I^p$

28 Extensional Query Evaluation
Relational ops compute probabilities [Fuhr&Roellke:1997, Dalvi&Suciu:2004]
$\sigma$: $(v, p) \mapsto (v, p)$
$\times$: $(v_1, p_1), (v_2, p_2) \mapsto (v_1 v_2,\ p_1 p_2)$
$\Pi$: $(v, p_1), (v, p_2), \ldots \mapsto (v,\ 1-(1-p_1)(1-p_2)\ldots)$
$-$: $(v, p_1) - (v, p_2) \mapsto (v,\ p_1(1-p_2))$
Unlike intensional evaluation, data complexity: PTIME
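The four extensional operators translate almost directly into code. The following is a minimal sketch under stated assumptions: a probabilistic relation is modelled as a list of `(value_tuple, probability)` pairs, the function names and signatures are hypothetical, and each operator simply applies the slide's formula, which is only correct when the combined tuples really are independent.

```python
from collections import defaultdict

# A probabilistic relation is a list of (value_tuple, probability) pairs.

def select(rel, pred):
    """Selection keeps the probability of each surviving tuple unchanged."""
    return [(v, p) for v, p in rel if pred(v)]

def join(r, s, on):
    """Join concatenates matching tuples and multiplies their probabilities."""
    return [(v1 + v2, p1 * p2) for v1, p1 in r for v2, p2 in s if on(v1, v2)]

def project(rel, cols):
    """Projection with duplicate elimination: 1 - prod(1 - p_i) per group."""
    miss = defaultdict(lambda: 1.0)
    for v, p in rel:
        miss[tuple(v[c] for c in cols)] *= 1.0 - p
    return [(v, 1.0 - m) for v, m in miss.items()]

def difference(r, s):
    """Difference: p1 * (1 - p2), with p2 = 0 when the tuple is absent from s."""
    s_prob = dict(s)
    return [(v, p * (1.0 - s_prob.get(v, 0.0))) for v, p in r]
```

A small usage example: `project([(("a", 1), 0.5), (("a", 2), 0.5)], [0])` groups both rows under `("a",)` and assigns it probability $1 - 0.5 \times 0.5 = 0.75$.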

29
Person$^p$: (Jon, Sea) with probability $p_1$; Purchase$^p$: three tuples for Jon with probabilities $q_1, q_2, q_3$
  SELECT DISTINCT x.City
  FROM Person$^p$ x, Purchase$^p$ y
  WHERE x.Name = y.Cust and y.Product = 'Gadget'
Plan 1: join first, then project
–Join ($\times$) yields (Jon, Sea) three times, with probabilities $p_1 q_1$, $p_1 q_2$, $p_1 q_3$
–Projection ($\Pi$) yields $1-(1-p_1 q_1)(1-p_1 q_2)(1-p_1 q_3)$
–Wrong!
Plan 2: project Purchase$^p$ first, then join
–Projection ($\Pi$) yields Jon with probability $1-(1-q_1)(1-q_2)(1-q_3)$
–Join ($\times$) yields (Jon, Sea) with probability $p_1(1-(1-q_1)(1-q_2)(1-q_3))$
–Correct
Depends on plan!!! [Dalvi&Suciu:2004]
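The difference between the two plans shows up numerically as soon as concrete probabilities are plugged in. The sketch below uses made-up values for $p_1$ and $q_1, q_2, q_3$ (assumptions, not from the slides) and compares both plans against the ground truth obtained by enumerating all possible worlds.

```python
from itertools import product

# Illustrative probabilities, not from the slides.
p1 = 0.8
q = [0.5, 0.5, 0.5]

def independent_or(probs):
    """1 - prod(1 - p): probability that at least one independent event holds."""
    miss = 1.0
    for p in probs:
        miss *= 1.0 - p
    return 1.0 - miss

# Plan 1: join first, then project as if the joined tuples were independent.
# They are not: all three share the same Person tuple.
plan1 = independent_or([p1 * qi for qi in q])

# Plan 2: project Purchase on Cust first, then join with Person.
plan2 = p1 * independent_or(q)

# Ground truth: enumerate the 2^4 worlds over the Person tuple and 3 purchases.
truth = 0.0
for jon, *buys in product([False, True], repeat=1 + len(q)):
    weight = p1 if jon else 1.0 - p1
    for qi, here in zip(q, buys):
        weight *= qi if here else 1.0 - qi
    if jon and any(buys):
        truth += weight
```

With these values plan 2 agrees with the enumeration at 0.7, while plan 1 overestimates at 0.784 because it counts the shared Person tuple three times.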

30 Query Complexity
A correct (“safe”) extensional plan does not always exist
$Q_{bad}$ :- R(x), S(x,y), T(y)
–Data complexity is #P-complete
Theorem [Dalvi&Suciu:2004]: The following are equivalent
–Q has PTIME data complexity
–Q admits an extensional plan (and one finds it in PTIME)
–Q does not have $Q_{bad}$ as a subquery

31 Computing a Safe SPJ Extensional Plan
Problem is due to projection operations
An “unsafe” extensional projection combines tuples that are correlated, while assuming independence
Projection over a join that projects away at least one of the join attrs $\Rightarrow$ unsafe projection!
Intuition: joins create correlated output tuples

32 Computing a Safe SPJ Extensional Plan
Algorithm for Safe Extensional SPJ Evaluation
Apply safe projections as late as possible in the plan
If no more safe projections exist, look for joins where all attributes are included in the output
–Recurse on the LHS, RHS of the join
Sound and complete safe SPJ evaluation algorithm
If a safe plan exists, the algo finds it!

33 Summary on Query Complexity
Extensional query evaluation:
–Very popular
–Guarantees polynomial complexity
–However, the result depends on the query plan, and a correct plan does not always exist!
General query complexity
–#P-complete (not surprising, given #SAT)
–Already #P-hard for a very simple query ($Q_{bad}$)
Probabilistic databases have high query complexity