Efficient Query Evaluation on Probabilistic Databases


Efficient Query Evaluation on Probabilistic Databases. Papers by Nilesh Dalvi, Dan Suciu, Chris Re

Outline: Motivation; Definitions through examples; Evaluation; Complexity

Motivation: Imprecise information on the web; partial information; contradictions; imprecise queries

Imprecise Querying

Interpreting the ‘~’: For the actor's name we can use edit distance, frequency similarity measures, … For the film's rating we can use user preferences, analysis of previous queries, … But how do we combine them? And how do we assign a score to a tuple w.r.t. the entire query?

Probabilistic Independence: P(a) denotes the probability of event a; P(S) the probability of a set of events S; P(B) the probability of a Boolean expression B over events. If a and b are independent, then P({a,b}) = P(a and b) = P(a)*P(b) and P(a or b) = 1 - (1-P(a))*(1-P(b))

Probabilistic DB: Each tuple has a probability of appearing in the DB. Assume tuple independence. This gives a distribution over all possible DB instances (possible worlds semantics).

Example

Semantics: A query is evaluated on every possible world. Note that for each concrete world, the query may have several answers. The probability of each answer is the sum of the probabilities of the worlds in which it appears in the set of answers. Example
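The possible worlds semantics can be sketched by brute force. The relations S(A, B) and T(C, D), their tuples and their probabilities below are assumed for illustration (the slide's own example tables are not part of this transcript), and the query q(D) :- S(A, B), T(C, D), B = C is the join-and-project query discussed later in the deck.

```python
from itertools import product

# Assumed tuple-independent database: each tuple maps to its probability.
S = {("m", 1): 0.8, ("n", 1): 0.5}   # S(A, B)
T = {(1, "p"): 0.6}                  # T(C, D)

def possible_worlds(relation):
    """Yield (set_of_tuples, probability) for every subset of the relation."""
    tuples = list(relation)
    for mask in product([False, True], repeat=len(tuples)):
        world = {t for t, keep in zip(tuples, mask) if keep}
        prob = 1.0
        for t, keep in zip(tuples, mask):
            prob *= relation[t] if keep else 1 - relation[t]
        yield world, prob

def query(s_world, t_world):
    """q(D) :- S(A, B), T(C, D), B = C, evaluated on one concrete world."""
    return {d for (_, b) in s_world for (c, d) in t_world if b == c}

def answer_probabilities():
    """Sum, for each answer, the probabilities of the worlds that return it."""
    result = {}
    for s_world, p_s in possible_worlds(S):
        for t_world, p_t in possible_worlds(T):
            for answer in query(s_world, t_world):
                result[answer] = result.get(answer, 0.0) + p_s * p_t
    return result

print(answer_probabilities())
```

With these assumed numbers the answer 'p' gets probability of about 0.54. The enumeration is exponential in the number of tuples, so it only serves as a reference definition.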

Example (Join on B=C)

Another Example (join and projection on A)

Solution attempt: Obtain a query plan. Compute intermediate results along with probabilities. A plan in our (first) example: first compute the join, then project on D.

Evaluation of the plan

Wrong! The tuples in the original DB were independent. The tuples in the intermediate DB are not! Thus the multiplication (for the projection) is incorrect.

The problem is hard. Theorem: Answering a query over a general probabilistic DB is #P-hard (data complexity). #P-hard is the “equivalent” of NP-hard for counting problems, e.g. #SAT: given a Boolean formula, compute how many satisfying assignments it has. Such problems are unlikely to have a polynomial-time solution.
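To make the #SAT example concrete, here is a minimal brute-force counter. The function name `count_sat` and the sample formula are illustrative choices, not taken from the slides.

```python
from itertools import product

def count_sat(formula, num_vars):
    """Count satisfying assignments by trying all 2**num_vars of them."""
    return sum(
        1
        for assignment in product([False, True], repeat=num_vars)
        if formula(*assignment)
    )

def sample_formula(s1, s2, t1):
    """Illustrative formula: (s1 or s2) and t1."""
    return (s1 or s2) and t1

print(count_sat(sample_formula, 3))
```

The running time doubles with every added variable, which is the behaviour a #P-hardness result suggests cannot be avoided in general.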

Other plans: Some query plans are OK. These are plans that preserve independence. Let us represent the query as a logical formula. Tuples that support the answer ‘p’ satisfy: (s1 or s2) and t1.

Plans and formulas: The query was P((s1 or s2) and t1). “First join, then project” corresponds to P((s1 and t1) or (s2 and t1)). This conversion is fine in a classic DB. But (s1 and t1) and (s2 and t1) are not independent events, because both contain t1!
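A short numeric check of the two formulas, using assumed probabilities for s1, s2 and t1 (the slide's actual numbers are not in this transcript):

```python
# Assumed probabilities of the atomic events.
p_s1, p_s2, p_t1 = 0.8, 0.5, 0.6

def p_or_independent(p, q):
    """P(a or b), valid only when a and b are independent."""
    return 1 - (1 - p) * (1 - q)

# (s1 or s2) and t1: the OR and the AND both combine independent events.
correct = p_or_independent(p_s1, p_s2) * p_t1

# (s1 and t1) or (s2 and t1): each AND is fine, but the OR wrongly
# treats the two conjunctions as independent although both contain t1.
wrong = p_or_independent(p_s1 * p_t1, p_s2 * p_t1)

print(correct, wrong)
```

Under these assumptions the first value is about 0.54 and the second about 0.636, so the join-first computation overestimates the probability.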

Safe Plan: A plan that preserves independence is called safe. In our example: first project S on B, and only then join with T, i.e. first compute the ‘OR’, then the ‘AND’.
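The safe plan can be sketched with two probability-aware operators. The relation contents, the probabilities, and the helper names `project` and `join` are assumptions made for this illustration; both operators simply assume that their inputs are independent, which is exactly what a safe plan guarantees.

```python
from collections import defaultdict

# Assumed tuple-independent relations: tuple -> probability.
S = {("m", 1): 0.8, ("n", 1): 0.5}   # S(A, B)
T = {(1, "p"): 0.6}                  # T(C, D)

def project(relation, positions):
    """Projection: OR the probabilities of tuples that collapse together."""
    miss = defaultdict(lambda: 1.0)  # probability that no merged tuple is present
    for tup, p in relation.items():
        key = tuple(tup[i] for i in positions)
        miss[key] *= 1 - p
    return {key: 1 - m for key, m in miss.items()}

def join(left, right, left_pos, right_pos):
    """Equi-join: AND (multiply) the probabilities of matching tuples."""
    return {
        lt + rt: lp * rp
        for lt, lp in left.items()
        for rt, rp in right.items()
        if lt[left_pos] == rt[right_pos]
    }

# Safe plan: project S on B (the OR), then join with T on B = C (the AND).
s_on_b = project(S, [0 + 1])
joined = join(s_on_b, T, 0, 0)
safe_answer = project(joined, [2])   # keep only D

print(safe_answer)
```

With the assumed data the plan returns a probability of about 0.54 for the answer 'p', matching the value of P((s1 or s2) and t1).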

Safe Plan

Intuition on evaluation: Work with probabilistic events. Carry the events during evaluation.

Probabilistic Events: Atomic events → tuples in the original DB. Complex events (Boolean combinations of events) → tuples in intermediate DBs. Translate a query plan to a complex event.
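One possible way to carry events through a plan is to store each complex event as a set of conjunctive clauses (a DNF). This representation, the helper names, and the probabilities below are assumptions for illustration. A join ANDs the events of the joined tuples, and a projection ORs the events of the merged tuples.

```python
from itertools import product

def atom(name):
    """Atomic event: a DNF with one clause holding one tuple identifier."""
    return frozenset([frozenset([name])])

def e_or(e1, e2):
    """OR of two events: union of their clauses (used by projection)."""
    return e1 | e2

def e_and(e1, e2):
    """AND of two events: pairwise union of clauses (used by join)."""
    return frozenset(c1 | c2 for c1 in e1 for c2 in e2)

def probability(event, probs):
    """Exact P(event) by enumerating assignments of the atomic events."""
    names = sorted({n for clause in event for n in clause})
    total = 0.0
    for values in product([False, True], repeat=len(names)):
        world = dict(zip(names, values))
        if any(all(world[n] for n in clause) for clause in event):
            weight = 1.0
            for n in names:
                weight *= probs[n] if world[n] else 1 - probs[n]
            total += weight
    return total

probs = {"s1": 0.8, "s2": 0.5, "t1": 0.6}  # assumed values
join_then_project = e_or(e_and(atom("s1"), atom("t1")),
                         e_and(atom("s2"), atom("t1")))
project_then_join = e_and(e_or(atom("s1"), atom("s2")), atom("t1"))

print(join_then_project == project_then_join)
print(probability(join_then_project, probs))
```

Because the events themselves are carried, both plans yield the same complex event and therefore the same probability. The price is the exact probability computation at the end, which is exponential in this sketch.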

Translation

Translating events to probabilities (Works iff the DB preserves independence!)

Safe Plans: A relational algebra expression has multiple equivalent expressions. Each corresponds to a concrete execution plan. Some of these plans correspond to correct probabilistic computations and some to incorrect ones. Let us try to detect what makes a plan safe.

Checking for safe plans: Attach a complex event E as an attribute of each tuple. For every relation R, Attr(R) -> R.E is an FD of R, where Attr(R) are all the other attributes.

So what can we do? 1. Compute a safe plan when there is one. 2. Compute an approximation when there is not.

Approximation: The most common method is Monte Carlo approximation. Originally by Karp, improved in [suciu07]. Guarantees convergence. The error is greater than e with a probability of less than d after (4*n / e^2) * ln(2/d) steps.
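The following is a sketch of a Karp-Luby style estimator for a DNF event over independent atomic events, together with the step bound quoted above. Two things are assumptions here: the slide does not define n, so it is taken to be the number of clauses, and the slide does not say how [suciu07] improves on the original method, so this is the textbook estimator rather than that variant. The clause set and probabilities are illustrative.

```python
import math
import random

def sample_count(n, eps, delta):
    """Steps from the bound (4*n / eps^2) * ln(2/delta), rounded up.
    n is assumed to be the number of clauses."""
    return math.ceil(4 * n / eps**2 * math.log(2 / delta))

def karp_luby(clauses, probs, steps, rng):
    """Estimate P(C1 or ... or Cm); each clause is a set of atomic events."""
    weights = [math.prod(probs[v] for v in clause) for clause in clauses]
    total = sum(weights)
    variables = sorted(set().union(*clauses))
    hits = 0
    for _ in range(steps):
        # Pick a clause with probability proportional to its own probability.
        i = rng.choices(range(len(clauses)), weights=weights)[0]
        # Sample a world conditioned on clause i being true.
        world = {v: (v in clauses[i]) or (rng.random() < probs[v])
                 for v in variables}
        # Count the sample only if clause i is the first satisfied clause.
        first = next(j for j, c in enumerate(clauses)
                     if all(world[v] for v in c))
        hits += first == i
    return total * hits / steps

probs = {"s1": 0.8, "s2": 0.5, "t1": 0.6}      # assumed values
clauses = [{"s1", "t1"}, {"s2", "t1"}]          # (s1 and t1) or (s2 and t1)
steps = sample_count(len(clauses), eps=0.1, delta=0.05)
print(steps, karp_luby(clauses, probs, steps, random.Random(0)))
```

For this small event the exact probability is 0.54, so the estimate can be compared against it directly. On real data the exact value is what is too expensive to compute.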

Functional Dependencies (FDs): A functional dependency {A1,…,An} -> B holds for a relation R if the values of A1,…,An determine the value of B.
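A functional dependency can be checked on a concrete relation instance with a single pass. The function `fd_holds` and the sample rows are hypothetical and only illustrate the definition.

```python
def fd_holds(rows, lhs, rhs):
    """Return True if the attributes in lhs determine the attribute rhs."""
    seen = {}
    for row in rows:
        key = tuple(row[a] for a in lhs)
        # setdefault stores the first value seen for this key and returns it.
        if seen.setdefault(key, row[rhs]) != row[rhs]:
            return False
    return True

rows = [
    {"A": "m", "B": 1, "D": "p"},
    {"A": "n", "B": 1, "D": "p"},
    {"A": "m", "B": 2, "D": "p"},
]

print(fd_holds(rows, ["B"], "D"))       # B -> D
print(fd_holds(rows, ["A"], "B"))       # A -> B
print(fd_holds(rows, ["A", "B"], "D"))  # {A, B} -> D
```

Here B -> D and {A, B} -> D hold, while A -> B fails because the value 'm' appears with two different B values. Note that this checks one instance only, whereas an FD in the sense used for safe plans is a constraint on all instances.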

Safe plans using FDs: Selections and joins (over conjunctive queries) are always safe (but may make later operations unsafe). Projection on a1,…,ak of the result obtained from q is safe if, for every relation R in q, the FD a1,...,ak, R.E -> Head(q) holds, where Head(q) are the attributes in the result of q and R.E is the event attribute of R.

Intuition: Projection over a1,…,an → OR over all tuples that have the same values of {a1,…,an}. To be independent, each atomic event must be sufficient to distinguish the tuples that are ORed (otherwise it appears in more than one), i.e. it uniquely determines the other atomic events appearing in the tuple. Hence the FD (valid only in combination with a1,…,an).

Safe and Unsafe queries: We can always compute an answer, but the computation might be exponential… Computing P(e) for a general formula e is #P-hard (data complexity). It is hard when no safe plan exists.

Conjunctive Queries and Unions thereof: Whiteboard discussion

Algorithms: Many more details are in the top-k talk. We give just an overview here.

Optimality: The data complexity of a query q is #P-complete iff the safe plan algorithm fails on q.

Safe Plan algorithm: Top-down. Push all safe projections late in the plan. When you can’t, split the query q into two sub-queries q1 and q2 such that their join is q (when possible). If stuck, the query is unsafe.

(Union of) Conjunctive Queries by example: T(x) :- R(x,y), S(y,30). T(x) :- P(x,y). In relational algebra? There are multiple possible translations, corresponding to different orderings of operations. Each option is called a “query plan”.
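The two rules can be evaluated over ordinary (non-probabilistic) relations with two different operation orders. The relation contents and function names are made up for illustration; only the query itself comes from the slide.

```python
# Hypothetical relation instances.
R = {(1, "a"), (2, "b"), (3, "b")}   # R(x, y)
S = {("a", 30), ("b", 20)}           # S(y, z)
P = {(4, "c")}                       # P(x, y)

def plan_join_first():
    """Join R and S on y, then select z = 30, then project on x."""
    joined = {(x, y, z) for (x, y) in R for (y2, z) in S if y == y2}
    return {x for (x, y, z) in joined if z == 30}

def plan_select_first():
    """Select z = 30 on S and project on y, then join with R, then project on x."""
    matching_y = {y for (y, z) in S if z == 30}
    return {x for (x, y) in R if y in matching_y}

def union_query(first_rule_plan):
    """T is the union of the first rule's answers and the projection of P on x."""
    return first_rule_plan() | {x for (x, _) in P}

print(plan_join_first(), plan_select_first())
print(union_query(plan_select_first))
```

On deterministic data the two plans always agree. The point of the safe-plan discussion is that once probabilities are attached, the order of operations can change the computed result.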

More notation: Head(q) is the set of head variables in q. FreeVar(q) is the set of free variables (i.e. non-head variables) in q. R.Key is the set of variables in the key positions of the relation R. R.NonKey is the set of variables in the non-key positions of the relation R. R.Pred is the predicate that q applies to R. For x in FreeVar(q), denote by qx a new query whose body is identical to that of q and where Head(qx) = Head(q) U {x}.

Conclusion: A probabilistic DB is a very strong tool. It combines the exact semantics of a classic DB with the capabilities of IR. Exact evaluation is sometimes hard, but there are good approximations (with bounds!).