Christopher Re and Dan Suciu University of Washington Efficient Evaluation of HAVING Queries on a Probabilistic Database.

Presentation transcript:


High-level Overview
- Evaluation of conjunctive Boolean queries with aggregate tests on probabilistic DBs: HAVING in SQL, e.g. is SUM(profit) > 100k?
- Looking for optimal algorithms (dichotomies). For every query q with aggregate A, either q has a P-time algorithm (call q A-safe) [DS04, DS07], or there is some instance on which q is hard (#P).
- Technique: in safe plans, use multiplication; in A-safe plans, use convolution (on monoids).

Motivation
Profit:
| Item | Forecaster | Amount | P |
|---|---|---|---|
| Widget | Alice | $-99k | 0.99 |
| Widget | Bob | $100M | 0.01 |
| Whatsit | Alice | $1M | 1 |
- Expectation style [prior art]: SELECT SUM(Amount) FROM Profit WHERE item='Widget'
  Ans: -99k * 0.99 + 100M * 0.01 ~ 900K
- HAVING style: SELECT item FROM Profit WHERE item='Widget' GROUP BY item HAVING SUM(Amount) > 0
  Ans: 0.01
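
The gap between the two answers can be checked by brute force. The sketch below enumerates the possible worlds of the two Widget rows, assuming the rows are present independently of each other (the restriction the talk adopts later). The names `widget_rows`, `worlds`, `expected_sum`, and `prob_sum_positive` are illustrative and do not come from the slides.

```python
from itertools import product

# (amount, probability) for the two Widget rows of Profit.
# Assumption: each row is present independently of the other.
widget_rows = [(-99_000, 0.99), (100_000_000, 0.01)]

def worlds(rows):
    """Yield (world_probability, amounts_present) for every subset of rows."""
    for mask in product([False, True], repeat=len(rows)):
        prob = 1.0
        present = []
        for keep, (amount, p) in zip(mask, rows):
            prob *= p if keep else 1.0 - p
            if keep:
                present.append(amount)
        yield prob, present

def expected_sum(rows):
    """Expectation-style answer: the average of SUM(Amount) over all worlds."""
    return sum(prob * sum(present) for prob, present in worlds(rows))

def prob_sum_positive(rows):
    """HAVING-style answer: probability that the group exists and SUM(Amount) > 0."""
    # An empty group produces no answer row, so it never satisfies HAVING.
    return sum(prob for prob, present in worlds(rows) if present and sum(present) > 0)

print(round(expected_sum(widget_rows)))
print(round(prob_sum_positive(widget_rows), 4))
```

Under these assumptions the expectation comes out at roughly 902k while the HAVING probability is 0.01, which is the contrast the slide draws.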

Overview
- Preliminaries
- Formal Problem Description
- Query plans and Datalog
- Monoid Random Variables and Convolutions
- Max, Min, Count and hints for others
- Conclusions

HAVING Query semantics
- Example: SELECT ITEM FROM PROFIT WHERE ITEM='Widget' GROUP BY ITEM HAVING SUM(PROFIT) > 0
- NB: assume SQL-like semantics.
- Conjunctive rule: no repeated symbols.
- Aggregates.
- Comparison: k is a constant.

Probabilistic Semantics
- Possible worlds model.
- Query semantics.
- NB: the paper allows disjoint tuples; the talk restricts to tuple independence.
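
For reference, the standard possible-worlds semantics under tuple independence can be written as below. The symbols are assumed notation rather than the slide's own: I is the set of possible tuples, p(t) is the probability of tuple t, and W ranges over worlds (subsets of I).

```latex
\begin{align*}
\Pr[W] &= \prod_{t \in W} p(t) \prod_{t \in I \setminus W} \bigl(1 - p(t)\bigr),
  \qquad W \subseteq I, \\
\Pr[Q] &= \sum_{W \subseteq I \,:\, W \models Q} \Pr[W].
\end{align*}
```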

Complexity and formal problem
- Data complexity: fix the query; the instance grows. In practice, the query is small.
- Consider k, e.g. 1000, as part of the input.
- Skeleton.


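Monoids and Semirings
- A monoid is a triple $(M, +, 0)$ where M is a set and + is associative with identity 0, e.g. a monoid parameterised by n. NB: n=1 is logical OR.
- A commutative semiring is a tuple $(S, +, *, 0, 1)$ where both $(S, +, 0)$ and $(S, *, 1)$ are commutative monoids and * distributes over +, e.g. a Boolean algebra.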
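
As a concrete sketch, one monoid that is consistent with the note "n=1 is logical OR" is truncated addition on {0, ..., n}. Choosing this monoid is an assumption, since the transcript does not spell out the slide's example, and `truncated_add` and `is_monoid` are hypothetical helper names.

```python
from itertools import product

def truncated_add(n):
    """Return the operation x + y capped at n, on the carrier {0, ..., n}."""
    return lambda x, y: min(x + y, n)

def is_monoid(elements, op, identity):
    """Brute-force check of associativity and the identity law."""
    associative = all(op(op(a, b), c) == op(a, op(b, c))
                      for a, b, c in product(elements, repeat=3))
    has_identity = all(op(identity, a) == a == op(a, identity) for a in elements)
    return associative and has_identity

n = 3
carrier = range(n + 1)
print(is_monoid(carrier, truncated_add(n), 0))

# With n = 1 the operation coincides with Boolean OR on {0, 1}.
or_like = truncated_add(1)
print(all(or_like(a, b) == int(bool(a) or bool(b))
          for a, b in product((0, 1), repeat=2)))
```

Both checks are exhaustive, which is only practical because the carrier is tiny.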

Datalog + Semirings [GKT07]
- Fix a semiring S.
- An annotation is a function to S with finite support.
- Plans are defined inductively.

Inductive definition [GKT07]
- Goal: define the value of a tuple t in a plan P.
- Support: the tuples contributing to a value.
- Value of a plan: the annotation that the plan computes.
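
A minimal sketch of annotation propagation in the style of [GKT07]: a join multiplies annotations and a projection adds them. The `Semiring` class, the dict-based relation encoding, the join-on-first-column convention, and all function names are illustrative assumptions, not the slide's definitions.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, Tuple

@dataclass(frozen=True)
class Semiring:
    zero: Any
    one: Any
    add: Callable[[Any, Any], Any]
    mul: Callable[[Any, Any], Any]

# A relation maps tuples to annotations; tuples not listed are annotated with zero.
Relation = Dict[Tuple, Any]

def join(S: Semiring, r: Relation, s: Relation) -> Relation:
    """Join on the first column: annotations of matching tuples are multiplied."""
    out: Relation = {}
    for t1, a1 in r.items():
        for t2, a2 in s.items():
            if t1[0] == t2[0]:
                key = t1 + t2[1:]
                out[key] = S.add(out.get(key, S.zero), S.mul(a1, a2))
    return out

def project(S: Semiring, r: Relation, cols: Tuple[int, ...]) -> Relation:
    """Projection: annotations of tuples that collapse together are added."""
    out: Relation = {}
    for t, a in r.items():
        key = tuple(t[i] for i in cols)
        out[key] = S.add(out.get(key, S.zero), a)
    return out

counting = Semiring(0, 1, lambda x, y: x + y, lambda x, y: x * y)
boolean = Semiring(False, True, lambda x, y: x or y, lambda x, y: x and y)

R = {("a", 1): 2, ("a", 2): 1, ("b", 3): 1}
S_rel = {("a", "x"): 3}
print(project(counting, join(counting, R, S_rel), (0,)))
```

Swapping `counting` for `boolean` (with True/False annotations) turns the same plan into ordinary set-semantics evaluation, which is the point of parameterising by the semiring.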

Annotations and HAVING
| X | Y |
|---|---|
| A | 10 |
| B | 100 |
| C | 1 |
- Annotation t(Y): 0 is tuple not present; 1 is tuple present, y > 3; 2 is tuple present, y <= 3.
- The monoid sum is 1 iff all values present are bigger than 3.
- Monoids and Aggregates: how can we deal with probabilities?


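Monoid Random Variables
- An M-random variable (rv) is a random variable that takes values in a monoid M.
- Correlations: r, s are independent if for any m, m' in M, $\Pr[r = m \wedge s = m'] = \Pr[r = m] \cdot \Pr[s = m']$.
- Extended to sets via total independence.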

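Monoid Convolutions
- Let r be an rv. A marginal vector $\mu^{r}$ is the vector indexed by M with $\mu^{r}[m] = \Pr[r = m]$.
- The monoid convolution * (depending on +) is $(\mu * \nu)[m] = \sum_{a + b = m} \mu[a] \, \nu[b]$.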

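Convolutions
- If r, s are monoid rvs then r + s is an rv defined pointwise as $(r + s)(\omega) = r(\omega) + s(\omega)$ for each possible world $\omega$.
- PROP: if r, s are independent then the distribution of r + s is given by convolution: $\mu^{r+s} = \mu^{r} * \mu^{s}$, where $\mu^{r}[m] = \Pr[r = m]$.
- Convolution is associative.
- PROP: by direct summation a single convolution takes time $O(|M|^2)$, so the convolution of n rvs can be computed in time $O(n |M|^2)$.
- Convolutions are efficient if M is not too big.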
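
The step behind the first proposition can be written out as follows. The notation is assumed for illustration: $\mu^{r}[m] = \Pr[r = m]$ is the marginal vector of r, and the sums range over pairs (a, b) in M x M whose monoid sum is m.

```latex
\begin{align*}
\Pr[r + s = m]
  &= \sum_{a + b = m} \Pr[r = a \wedge s = b] \\
  &= \sum_{a + b = m} \Pr[r = a] \, \Pr[s = b]
     && \text{(independence of } r \text{ and } s\text{)} \\
  &= \bigl(\mu^{r} * \mu^{s}\bigr)[m].
\end{align*}
```

The first line holds because the events $\{r = a \wedge s = b\}$ are disjoint and together cover $\{r + s = m\}$.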


Annotations and HAVING
| X | Y | Marginal vector (probabilities) |
|---|---|---|
| A | 10 | (0.8, 0.2, 0) |
| B | 100 | (0.6, 0.4, 0) |
| C | 1 | (0.9, 0, 0.1) |
- Annotation t(Y): 0 is tuple not present; 1 is tuple present, y > 3; 2 is tuple present, y <= 3.
- The monoid sum is 1 iff all values present are bigger than 3.
- Monoids and Aggregates: the marginal of 1 after convolution = value of the query.
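
The marginal vectors on this slide can be pushed through a monoid convolution directly. The sketch assumes the three-state monoid is {0, 1, 2} under max (0 is the identity, 2 is absorbing), which fits the slide's description of the states but is not spelled out in the transcript; `monoid_add` and `convolve` are illustrative names.

```python
from functools import reduce

STATES = (0, 1, 2)

def monoid_add(a, b):
    # Assumed monoid: 0 = absent, 1 = present with y > 3, 2 = present otherwise.
    # Under max, 0 is the identity and 2 absorbs every other state.
    return max(a, b)

def convolve(u, v):
    """(u * v)[m] = sum of u[a] * v[b] over all pairs with a + b = m in the monoid."""
    out = [0.0] * len(STATES)
    for a in STATES:
        for b in STATES:
            out[monoid_add(a, b)] += u[a] * v[b]
    return out

# Marginal vectors from the slide, one per tuple, in the order they are listed.
marginals = [(0.8, 0.2, 0.0), (0.6, 0.4, 0.0), (0.9, 0.0, 0.1)]

result = reduce(convolve, marginals)
print([round(x, 3) for x in result])
```

Under this assumed monoid, the entry at index 1 (about 0.468) is the probability that the monoid sum is 1, which the slide identifies with the value of the query.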

"Safe plans" for semirings
- Compute the value of "safe plans": a plan is safe [DS04] if all of its projections and joins are on independent tuples; else the query is hard (#P).
- THM: the value is correct if the plan is safe.
- Only efficient if the semiring is "small".
- Gives a dichotomy for MIN, MAX, COUNT, but not for the others.
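
For plain Boolean queries, the safe-plan idea reduces to multiplying probabilities. The sketch below evaluates the query q :- R(x), S(x, y) on a small tuple-independent instance; the instance, the query, and the helper names are hypothetical, chosen only to show an independent join and an independent projection.

```python
from math import prod

def independent_project(probs):
    """Probability that at least one of several independent events holds."""
    return 1.0 - prod(1.0 - p for p in probs)

def independent_join(p, q):
    """Probability that two independent events both hold."""
    return p * q

# Hypothetical tuple-independent instance: each tuple maps to its probability.
R = {"a": 0.5, "b": 0.8}
S = {("a", 1): 0.5, ("a", 2): 0.5, ("b", 1): 0.25}

def safe_plan_probability(R, S):
    per_x = []
    for x, p_r in R.items():
        # Project away y within each x, then join the result with R(x).
        p_s = independent_project(p for (x2, _), p in S.items() if x2 == x)
        per_x.append(independent_join(p_r, p_s))
    # Project away x: different x values depend on disjoint sets of tuples.
    return independent_project(per_x)

print(safe_plan_probability(R, S))
```

The order of operations matters: projecting y before joining keeps every step on independent events, which is what makes the plan safe in the sense of [DS04].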

Additional Results
- Dichotomy for SUM, AVG, COUNT DISTINCT.
- Not all safe plans are allowed, e.g. cannot have independent projections "on top".
- Disjoint tuples in the paper: need a "disjoint projection" operation; more work for dichotomies.
- Algorithms for finding safe plans (P time).

Conclusion
- Semantics for aggregation queries on prob DBs, similar to HAVING in SQL.
- Proposed a complexity measure for such queries.
- Central technique was marginal vectors and convolutions.
- Dichotomy for HAVING queries without self-joins.


HAVING Query semantics
- Example: SELECT ITEM FROM PROFIT WHERE ITEM='Widget' GROUP BY ITEM HAVING SUM(PROFIT) > 0
- NB: assume SQL-like semantics.
- Conjunctive rule: no repeated subgoals.
- Aggregates.
- Comparison: k is a constant.
