1 Probabilistic/Uncertain Data Management -- IV 1.Dalvi, Suciu. “Efficient query evaluation on probabilistic databases”, VLDB’2004. 2.Sen, Deshpande. “Representing.

Slides:



Advertisements
Similar presentations
A*-tree: A Structure for Storage and Modeling of Uncertain Multidimensional Arrays Presented by: ZHANG Xiaofei March 2, 2011.
Advertisements

Naïve Bayes. Bayesian Reasoning Bayesian reasoning provides a probabilistic approach to inference. It is based on the assumption that the quantities of.
A COURSE ON PROBABILISTIC DATABASES Dan Suciu University of Washington June, 2014Probabilistic Databases - Dan Suciu 1.
Representing and Querying Correlated Tuples in Probabilistic Databases
Online Filtering, Smoothing & Probabilistic Modeling of Streaming Data In short, Applying probabilistic models to Streams Bhargav Kanagal & Amol Deshpande.
Queries with Difference on Probabilistic Databases Sanjeev Khanna Sudeepa Roy Val Tannen University of Pennsylvania 1.
PAPER BY : CHRISTOPHER R’E NILESH DALVI DAN SUCIU International Conference on Data Engineering (ICDE), 2007 PRESENTED BY : JITENDRA GUPTA.
Top-K Query Evaluation on Probabilistic Data Christopher Ré, Nilesh Dalvi and Dan Suciu University of Washington.
A COURSE ON PROBABILISTIC DATABASES June, 2014Probabilistic Databases - Dan Suciu 1.
1 The Monte Carlo method. 2 (0,0) (1,1) (-1,-1) (-1,1) (1,-1) 1 Z= 1 If  X 2 +Y 2  1 0 o/w (X,Y) is a point chosen uniformly at random in a 2  2 square.
Efficient Query Evaluation on Probabilistic Databases
A COURSE ON PROBABILISTIC DATABASES Dan Suciu University of Washington June, 2014Probabilistic Databases - Dan Suciu 1.
Ming Hua, Jian Pei Simon Fraser UniversityPresented By: Mahashweta Das Wenjie Zhang, Xuemin LinUniversity of Texas at Arlington The University of New South.
Approximate Counting via Correlation Decay Pinyan Lu Microsoft Research.
Lineage Processing over Correlated Probabilistic Databases Bhargav Kanagal Amol Deshpande University of Maryland.
Efficient Query Evaluation on Probabilistic Databases Nilesh Dalvi Dan Suciu Presenter : Amit Goyal Discussion Lead : Jonatan.
CMPT 354, Simon Fraser University, Fall 2008, Martin Ester 52 Database Systems I Relational Algebra.
1 Probabilistic/Uncertain Data Management -- III Slides based on the Suciu/Dalvi SIGMOD’05 tutorial 1.Dalvi, Suciu. “Efficient query evaluation on probabilistic.
Bayes Nets Rong Jin. Hidden Markov Model  Inferring from observations (o i ) to hidden variables (q i )  This is a general framework for representing.
1 Learning Entity Specific Models Stefan Niculescu Carnegie Mellon University November, 2003.
Exploiting Correlated Attributes in Acquisitional Query Processing Amol Deshpande University of Maryland Joint work with Carlos Sam
Bayesian Networks What is the likelihood of X given evidence E? i.e. P(X|E) = ?
Probabilistic Robotics Introduction Probabilities Bayes rule Bayes filters.
. Inference I Introduction, Hardness, and Variable Elimination Slides by Nir Friedman.
1 Probabilistic/Uncertain Data Management Slides based on the Suciu/Dalvi SIGMOD’05 tutorial 1.Dalvi, Suciu. “Efficient query evaluation on probabilistic.
G. Cowan Lectures on Statistical Data Analysis 1 Statistical Data Analysis: Lecture 8 1Probability, Bayes’ theorem, random variables, pdfs 2Functions of.
Computer vision: models, learning and inference Chapter 10 Graphical Models.
1 Relational Algebra and Calculus Yanlei Diao UMass Amherst Feb 1, 2007 Slides Courtesy of R. Ramakrishnan and J. Gehrke.
. PGM 2002/3 – Tirgul6 Approximate Inference: Sampling.
Quiz 4: Mean: 7.0/8.0 (= 88%) Median: 7.5/8.0 (= 94%)
Efficient Query Evaluation over Temporally Correlated Probabilistic Streams Bhargav Kanagal, Amol Deshpande ΗΥ-562 Advanced Topics on Databases Αλέκα Σεληνιωτάκη.
Ranking Queries on Uncertain Data: A Probabilistic Threshold Approach Wenjie Zhang, Xuemin Lin The University of New South Wales & NICTA Ming Hua,
1  Special Cases:  Query Semantics: (“Marginal Probabilities”)  Run query Q against each instance D i ; for each answer tuple t, sum up the probabilities.
General Database Statistics Using Maximum Entropy Raghav Kaushik 1, Christopher Ré 2, and Dan Suciu 3 1 Microsoft Research 2 University of Wisconsin--Madison.
第十讲 概率图模型导论 Chapter 10 Introduction to Probabilistic Graphical Models
Christopher Re and Dan Suciu University of Washington Efficient Evaluation of HAVING Queries on a Probabilistic Database.
A COURSE ON PROBABILISTIC DATABASES Dan Suciu University of Washington June, 2014Probabilistic Databases - Dan Suciu 1.
Week 10Complexity of Algorithms1 Hard Computational Problems Some computational problems are hard Despite a numerous attempts we do not know any efficient.
1 Variable Elimination Graphical Models – Carlos Guestrin Carnegie Mellon University October 11 th, 2006 Readings: K&F: 8.1, 8.2, 8.3,
May 9, New Topic The complexity of counting.
A COURSE ON PROBABILISTIC DATABASES Dan Suciu University of Washington June, 2014Probabilistic Databases - Dan Suciu 1.
1 Relational Algebra and Calculas Chapter 4, Part A.
1 Relational Algebra Chapter 4, Sections 4.1 – 4.2.
Webdamlog and Contradictions Daniel Deutch Tel Aviv University Joint work with Serge Abiteboul, Meghyn Bienvenu, Victor Vianu.
CHAPTER 5 Probability Theory (continued) Introduction to Bayesian Networks.
Indexing Correlated Probabilistic Databases Bhargav Kanagal, Amol Deshpande University of Maryland, College Park, USA SIGMOD Presented.
Bayesian networks and their application in circuit reliability estimation Erin Taylor.
Random Interpretation Sumit Gulwani UC-Berkeley. 1 Program Analysis Applications in all aspects of software development, e.g. Program correctness Compiler.
Inference Algorithms for Bayes Networks
1 Introduction to Statistics − Day 4 Glen Cowan Lecture 1 Probability Random variables, probability densities, etc. Lecture 2 Brief catalogue of probability.
Quiz 3: Mean: 9.2 Median: 9.75 Go over problem 1.
Probabilistic Robotics Introduction Probabilities Bayes rule Bayes filters.
A Unified Approach to Ranking in Probabilistic Databases Jian Li, Barna Saha, Amol Deshpande University of Maryland, College Park, USA VLDB
Introduction on Graphic Models
G. Cowan Lectures on Statistical Data Analysis Lecture 9 page 1 Statistical Data Analysis: Lecture 9 1Probability, Bayes’ theorem 2Random variables and.
A COURSE ON PROBABILISTIC DATABASES Dan Suciu University of Washington June, 2014Probabilistic Databases - Dan Suciu 1.
Efficient Query Evaluation on Probabilistic Databases Nilesh Dalvi Dan Suciu Modified by Veeranjaneyulu Sadhanala.
1 Variable Elimination Graphical Models – Carlos Guestrin Carnegie Mellon University October 15 th, 2008 Readings: K&F: 8.1, 8.2, 8.3,
CS498-EA Reasoning in AI Lecture #19 Professor: Eyal Amir Fall Semester 2011.
P & NP.
A Course on Probabilistic Databases
Probabilistic Data Management
Intro to Theory of Computation
Queries with Difference on Probabilistic Databases
Lecture 16: Probabilistic Databases
CS 188: Artificial Intelligence Fall 2008
Probabilistic Databases
Umans Complexity Theory Lectures
Variable Elimination Graphical Models – Carlos Guestrin
Probabilistic Databases with MarkoViews
Presentation transcript:

1 Probabilistic/Uncertain Data Management -- IV 1.Dalvi, Suciu. “Efficient query evaluation on probabilistic databases”, VLDB’ Sen, Deshpande. “Representing and Querying Correlated Tuples in Probabilistic DBs”, ICDE’2007.

2 A Restricted Formalism: Explicit Independent Tuples Pr(I) =  t 2 I pr(t) £  t  I (1-pr(t)) No restrictions pr : TUP ! [0,1] Tuple independent probabilistic database INST = P (TUP) N = 2 M TUP = {t 1, t 2, …, t M } = all tuples

3 Tuple Prob. ) Possible Worlds NameCitypr JohnSeattlep 1 = 0.8 SueBostonp 2 = 0.6 FredBostonp 3 = 0.9 I p = NameCity JohnSeattl SueBosto FredBosto NameCity SueBosto FredBosto NameCity JohnSeattl FredBosto NameCity JohnSeattl SueBosto NameCity FredBosto NameCity SueBosto NameCity JohnSeattl ; I1I1 (1-p 1 ) (1-p 2 ) (1-p 3 ) I2I2 p 1 (1-p 2 )(1-p 3 ) I3I3 (1-p 1 )p 2 (1-p 3 ) I4I4 (1-p 1 )(1-p 2 )p 3 I5I5 p 1 p 2 (1-p 3 ) I6I6 p 1 (1-p 2 )p 3 I7I7 (1-p 1 )p 2 p 3 I8I8 p1p2p3p1p2p3  = 1 J = E[ size(I p ) ] = 2.3 tuples

4 Tuple-Independent DBs are Incomplete NameAddresspr JohnSeattlep1p1 SueSeattlep2p2 NameAddress JohnSeattle SueSeattle NameAddress JohnSeattle p1p1 p1p2p1p2 =I p ; 1-p 1 - p 1 p 2 Very limited – cannot capture correlations across tuples Not Closed Query operators can introduce complex correlations!

5 Query Evaluation on Probabilistic DBs Focus on possible tuple semantics –Compute likelihood of individual answer tuples Probability of Boolean expressions –Key operation for Intensional Query Evaluation Complexity of query evaluation

6 Complexity of Boolean Expression Probability Theorem [Valiant:1979] For a boolean expression E, computing Pr(E) is #P-complete NP = class of problems of the form “is there a witness ?” SAT #P = class of problems of the form “how many witnesses ?” #SAT The decision problem for 2CNF is in PTIME The counting problem for 2CNF is #P-complete [Valiant:1979]

7 Query Complexity Data complexity of a query Q: Compute Q(I p ), for probabilistic database I p Simplest scenario only: Possible tuples semantics for Q Independent tuples for I p

8 Extensional Query Evaluation Relational ops compute probabilities  vpvp £ v1v1 p1p1 v1v1 v2v2 p 1 p 2 v2v2 p2p2  vp1p1 vp2p2 v1-(1-p 1 )(1-p 2 )… [Fuhr&Roellke:1997,Dalvi&Suciu:2004] - vp1p1 vp 1 (1-p 2 )vp2p2 Unlike intensional evaluation, data complexity: PTIME

9 JonSeap1p1 Jonq1q1 q2q2 q3q3 SELECT DISTINCT x.City FROM Person p x, Purchase p y WHERE x.Name = y.Cust and y.Product = ‘Gadget’ SELECT DISTINCT x.City FROM Person p x, Purchase p y WHERE x.Name = y.Cust and y.Product = ‘Gadget’ JonSeap1q1p1q1 JonSeap1q2p1q2 JonSeap1q3p1q3 1-(1-p 1 q 1 )(1- p 1 q 2 )(1- p 1 q 3 ) £  JonSeap1p1 Jonq1q1 q2q2 q3q3 1-(1-q 1 )(1-q 2 )(1-q 3 ) £  JonSeap 1 (1-(1-q 1 )(1-q 2 )(1-q 3 )) [Dalvi&Suciu:2004] Wrong ! Correct Depends on plan !!!

10 Query Complexity correct (“safe”) extensional plan Q bad :- R(x), S(x,y), T(y) Data complexity is #P complete Theorem The following are equivalent Q has PTIME data complexity Q admits an extensional plan (and one finds it in PTIME) Q does not have Q bad as a subquery Theorem The following are equivalent Q has PTIME data complexity Q admits an extensional plan (and one finds it in PTIME) Q does not have Q bad as a subquery [Dalvi&Suciu:2004]

11 Computing a Safe SPJ Extensional Plan Problem is due to projection operations An “unsafe” extensional projection combines tuples that are correlated assuming independence Projection over a join that projects away at least one of the join attrs  Unsafe projection! Intuitive: Joins create correlated output tuples

12 Computing a Safe SPJ Extensional Plan Algorithm for Safe Extensional SPJ Evaluation Apply safe projections as late as possible in the plan If no more safe projections exist, look for joins where all attributes are included in the output –Recurse on the LHS, RHS of the join Sound and complete safe SPJ evaluation algorithm If a safe plan exists, the algo finds it!

13 Summary on Query Complexity Extensional query evaluation: Very popular Guarantees polynomial complexity However, result depends on query plan and correctness not always possible! General query complexity #P complete (not surprising, given #SAT) Already #P hard for very simple query (Q bad ) Probabilistic databases have high query complexity

14 Efficient Approximate Evaluation: Monte-Carlo Simulation Run evaluation with no projection/dup elimination till the very final step Intermediate tuples carry all attributes Each result tuple = group t 1,…,t n of tuples with the same projection attribute values –Prob(group) = Prob( C 1 OR C 2 OR… C n ), where each C i = e 1 AND e 2 … AND e k –Evaluate the probability of a large DNF expression –Can be efficiently approximated through MC simulation (a.k.a. sampling)

15 Monte Carlo Simulation [Karp,Luby&Madras:1989] E = X 1 X 2 Ç X 1 X 3 Ç X 2 X 3 Cnt à 0 repeat N times randomly choose X 1, X 2, X 3 2 {0,1} if E(X 1, X 2, X 3 ) = 1 then Cnt = Cnt+1 P = Cnt/N return P /* ' Pr(E) */ Cnt à 0 repeat N times randomly choose X 1, X 2, X 3 2 {0,1} if E(X 1, X 2, X 3 ) = 1 then Cnt = Cnt+1 P = Cnt/N return P /* ' Pr(E) */ Theorem. If N ¸ (1/ Pr(E)) £ (4ln(2/  )/  2 ) then: Pr[ | P/Pr(E) - 1 | >  ] <  Theorem. If N ¸ (1/ Pr(E)) £ (4ln(2/  )/  2 ) then: Pr[ | P/Pr(E) - 1 | >  ] <  May be very big X1X2X1X2 X1X3X1X3 X2X3X2X3 Naïve: 0/1-estimator theorem Works for any E Not in PTIME

16 Monte Carlo Simulation [Karp,Luby&Madras:1989] Cnt à 0; S à Pr(C 1 ) + … + Pr(C m ); repeat N times randomly choose i 2 {1,2,…, m}, with prob. Pr(C i ) / S randomly choose X 1, …, X n 2 {0,1} s.t. C i = 1 if C 1 =0 and C 2 =0 and … and C i-1 = 0 then Cnt = Cnt+1 P = Cnt/N * 1/ return P /* ' Pr(E) */ Cnt à 0; S à Pr(C 1 ) + … + Pr(C m ); repeat N times randomly choose i 2 {1,2,…, m}, with prob. Pr(C i ) / S randomly choose X 1, …, X n 2 {0,1} s.t. C i = 1 if C 1 =0 and C 2 =0 and … and C i-1 = 0 then Cnt = Cnt+1 P = Cnt/N * 1/ return P /* ' Pr(E) */ Theorem. If N ¸ (1/ m) £ (4ln(2/  )/  2 ) then: Pr[ | P/Pr(E) - 1 | >  ] <  Theorem. If N ¸ (1/ m) £ (4ln(2/  )/  2 ) then: Pr[ | P/Pr(E) - 1 | >  ] <  E = C 1 Ç C 2 Ç... Ç C m Improved: Now it’s better Only for E in DNF In PTIME

17 Summary on Monte Carlo Some form of simulation is needed in probabilistic databases, to cope with the #P-hardness bottleneck –Naïve MC: works well when Prob is big –Improved MC: needed when Prob is small Recent work [Re,Dalvi,Suciu, ICDE’07] describes optimized MC for top-k tuple evaluation

18 Handling Tuple Correlations Tuple correlations/dependencies arise naturally –Sensor networks: Temporal/spatial correlations –During query evaluation (even starting with independent tuples) Need representation formalism that can capture and evaluate queries over such correlated tuples

19 Capturing Tuple Correlations: Basic Ideas Use key ideas of Probabilistic Graphical Models (PGMs) –Bayes and Markov networks are special cases Tuple-based random variables –Each tuple t corresponds to a Boolean RV X t Factors capturing correlations across subsets of RVs –f(X) is a function of a (small) subset X of the X t RVs [Sen, Deshpande:2007]

20 Capturing Tuple Correlations: Basic Ideas Associate each probabilistic tuple with a binomial RV Define PGM factors capturing correlations across subsets of tuple RVs Probability of a possible world = product of all PGM factors –PGM = factored, economical representation of possible worlds distribution –Closed & complete representation formalism

21 Example: Mutual Exclusion Want to capture mutual exclusion (XOR) between tuples s 1 and t 1

22 Example: Positive Correlation Want to capture positive correlation between tuples s 1 and t 1

23 PGM Representation Definition: A PGM is a graph whose nodes represent RVs and edges represent correlations Factors correspond to the cliques of the PGM graph –Graph structure encodes conditional independencies –Joint pdf =  clique factors –Economical representation (O(2^k), k=|max clique|)

24 Query Evaluation: Basic Ideas Carefully represent correlations between base, intermediate, and result tuples to generate a PGM for the query result distribution –Each relational op generates Boolean factors capturing the dependencies of its input/output tuples Final model = product of all generated factors Cast probabilistic computations in query evaluation as a probabilistic inference problem over the resulting (factored) PGM –Can import ML techniques and optimizations

25 Query Evaluation: Example

26 Probabilistic DBs: Summary Principled framework for managing uncertainties –Uncertainty management: ML, AI, Stats –Benefits of DB world: declarative QL, optimization, scale to large data, physical access structs, … Prob DBs = “Marriage” of DBs and ML/AI/Stats –ML folks have also been moving our way: Relational extensions to ML models (PRMs, FO models, inductive logic programming, …)

27 Probabilistic DBs: Future Importing more sophisticated ML techniques and tools inside the DBMS –Inference as queries, FO models and optimizations, access structs for relational queries + inference, … More on the algorithmic front: Probabilstic DBs and possible worlds semantics brings new challenges –E.g., approximate query processing, probabilistic data streams (e.g., sketching), …