Efficient Query Evaluation on Probabilistic Databases Nilesh Dalvi Dan Suciu Presenter : Amit Goyal Discussion Lead : Jonatan.

Presentation transcript:

Efficient Query Evaluation on Probabilistic Databases Nilesh Dalvi Dan Suciu Presenter : Amit Goyal Discussion Lead : Jonatan

Outline: Motivation; Query Evaluation (Intensional, Extensional); Query Optimization; Complexity; Unsafe Plans; Extensions; Conclusions

Databases Are Deterministic: Databases we see today are deterministic. A tuple is either in the query answer or not. They don't deal with uncertainties.

Future of Data Management: Uncertainties in data arise in biological data, sensor data (geographical data), and data extracted by various AI and data mining techniques (information extraction). Uncertainties are represented as probabilities. Goal: extend data management tools to handle probabilistic data.

Example
Review text: "I have not used IPOD but Apple products are good"
Facts table:
  Company  Products  Rating
  Apple    IPOD      0.3

Representing Uncertainty
Tuple-existence uncertainty: all attributes in a tuple are known precisely; the existence of the tuple is uncertain. E.g. the facts table on the previous slide. More later.
Attribute-value uncertainty: tuples (identified by keys) exist for certain; the values of one or more attributes are uncertain. E.g. "Tomorrow, it may rain" (probability is 0.6).

Our Goal For Today: Understand how queries can be evaluated efficiently on probabilistic databases. For simplicity, we will deal with tuple-level uncertainties only. We also assume independence among tuples, i.e. P(t1, t2) = P(t1) * P(t2).

Possible Worlds: Example 1
Probabilistic relation:
  Camera  Feature  p
  C21     Lens     p1
  C29     Battery  p2
  C31     Lens     p3
Some of the possible worlds (Cam, Feat) and their probabilities:
  I1 = ∅                                         (1-p1)(1-p2)(1-p3)
  I2 = {(C21, Lens)}                             p1(1-p2)(1-p3)
  I3 = {(C29, Batt)}                             (1-p1)p2(1-p3)
  I4 = {(C21, Lens), (C31, Lens)}                p1(1-p2)p3
  I5 = {(C21, Lens), (C29, Batt), (C31, Lens)}   p1 p2 p3
Total number of worlds: 2^count_tuples. The probabilities of all worlds sum to 1: ∑ Pr(Ii) = 1.

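Possible Worlds: Example 2
Probabilistic database D^p:
  S    A    B    Prob.
  s1   'm'  1    0.8
  s2   'n'  1    0.5

  T    C    D    Prob.
  t1   1    'p'  0.6
Possible worlds pwd(D^p), with probabilities computed from the tuple probabilities under independence:
  World               Prob.
  D1 = {s1, s2, t1}   0.24
  D2 = {s1, t1}       0.24
  D3 = {s2, t1}       0.06
  D4 = {t1}           0.06
  D5 = {s1, s2}       0.16
  D6 = {s1}           0.16
  D7 = {s2}           0.04
  D8 = ∅              0.04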

Query Evaluation: So, let's consider a query: Q(D) :- S(A,B), T(C,D), B = C, i.e. S join T on B = C, project on D. Intuitively: execute the query on each possible world. The final result is a probabilistic relation that gives each possible answer together with the total probability of the worlds that produce it.

Query Evaluation: Example
Query: S join T on B = C, project on D
  World               Prob.   Result
  D1 = {s1, s2, t1}   0.24    {'p'}
  D2 = {s1, t1}       0.24    {'p'}
  D3 = {s2, t1}       0.06    {'p'}
  D4 = {t1}           0.06    {}
  D5 = {s1, s2}       0.16    {}
  D6 = {s1}           0.16    {}
  D7 = {s2}           0.04    {}
  D8 = ∅              0.04    {}
q^pwd(D^p) =
  Answer   Prob.
  {'p'}    0.54
  ∅        0.46
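The possible-worlds evaluation of this example can be sketched in a few lines of Python. This is only an illustrative brute-force version, not the authors' implementation; the names `possible_worlds`, `query` and `answer_distribution` are hypothetical, and the tuple probabilities 0.8, 0.5 and 0.6 are the ones given for s1, s2 and t1 in the running example.

```python
from itertools import product

# Running example: (tuple id, values, probability). Tuples are independent.
S = [("s1", ("m", 1), 0.8), ("s2", ("n", 1), 0.5)]   # S(A, B)
T = [("t1", (1, "p"), 0.6)]                           # T(C, D)


def possible_worlds(tuples):
    """Yield (set of present tuple ids, probability) for all 2^n worlds."""
    for present in product([True, False], repeat=len(tuples)):
        prob = 1.0
        world = set()
        for (tid, _values, p), keep in zip(tuples, present):
            prob *= p if keep else 1.0 - p
            if keep:
                world.add(tid)
        yield frozenset(world), prob


def query(world):
    """Evaluate pi_D(S join_{B=C} T) on one deterministic world."""
    s_rows = [vals for tid, vals, _ in S if tid in world]
    t_rows = [vals for tid, vals, _ in T if tid in world]
    return frozenset(d for (_a, b) in s_rows for (c, d) in t_rows if b == c)


def answer_distribution():
    """Sum the probabilities of all worlds that produce the same answer."""
    dist = {}
    for world, prob in possible_worlds(S + T):
        answer = query(world)
        dist[answer] = dist.get(answer, 0.0) + prob
    return dist


dist = answer_distribution()
for answer, prob in dist.items():
    print(sorted(answer), round(prob, 2))
```

The loop visits all 8 worlds, which is exactly the exponential cost that makes this semantics impractical on larger inputs.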

Query Evaluation: Semantically correct. If T has n tuples, there can be as many as 2^n possible worlds. Exponential complexity, thus impractical. Goal of the paper: evaluate queries efficiently.

Intensional Query Evaluation: Define the complex event e^p(t) for each tuple t. For each intermediate tuple, associate an explicit (complex) event expression. Compute the actual probabilities at the end. For this talk, we will look only at select, project, join queries.

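Intensional Semantics (each tuple v carries an event expression E)
  Selection σ: a tuple (v, E) that satisfies the condition is kept with its event E unchanged.
  Product ×: tuples (v1, E1) and (v2, E2) combine into (v1 v2, E1 ∧ E2).
  Projection Π: input tuples (v, E1), (v, E2), … that project to the same value v merge into (v, E1 ∨ E2 ∨ …).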

Theorem (2): The intensional semantics and the possible worlds semantics on probabilistic databases are equivalent for conjunctive queries: pwd(q^i(D^p)) = q^pwd(D^p)

Intensional Semantics: Example
  S    A    B    Prob.
  s1   'm'  1    0.8
  s2   'n'  1    0.5

  T    C    D    Prob.
  t1   1    'p'  0.6
S join T on B = C:
  A    B   C   D    E
  'm'  1   1   'p'  s1 ∧ t1
  'n'  1   1   'p'  s2 ∧ t1
Project on D:
  D    Rank
  'p'  Pr((s1 ∧ t1) ∨ (s2 ∧ t1))
q^rank(D^p): Pr(q) = (0.8 * 0.6) + (0.5 * 0.6) - (0.8 * 0.5 * 0.6) = 0.48 + 0.30 - 0.24 = 0.54
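As a rough check of the intensional result, the probability of the event expression (s1 ∧ t1) ∨ (s2 ∧ t1) can be computed by summing the weights of all satisfying truth assignments. The sketch below assumes the expression is given as a list of conjunctive clauses over independent tuple events; `event_probability` is a hypothetical helper, and the enumeration is exponential in the number of events, which is why this approach does not scale.

```python
from itertools import product

# Probabilities of the base tuple events from the running example.
PROB = {"s1": 0.8, "s2": 0.5, "t1": 0.6}

# (s1 AND t1) OR (s2 AND t1), written as a list of conjunctive clauses.
DNF = [{"s1", "t1"}, {"s2", "t1"}]


def event_probability(dnf, prob):
    """Exact probability of a monotone DNF over independent events."""
    variables = sorted(set().union(*dnf))
    total = 0.0
    for bits in product([True, False], repeat=len(variables)):
        assignment = dict(zip(variables, bits))
        # The expression holds if at least one clause is fully true.
        if any(all(assignment[v] for v in clause) for clause in dnf):
            weight = 1.0
            for v in variables:
                weight *= prob[v] if assignment[v] else 1.0 - prob[v]
            total += weight
    return total


print(round(event_probability(DNF, PROB), 2))
```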

Intensional Semantics: The result does not depend on the choice of plan. But it is impractical to use: the event expressions can become very large due to projections, and for each tuple t one has to compute Pr(e) for its event e, which is a #P-complete problem. Thus very expensive.

Extensional Semantics: Play with probabilities instead of event expressions. Much more efficient. Assumes tuple independence. Not always correct. WHY?

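Extensional Semantics (each tuple v carries a probability p)
  Selection σ: a tuple (v, p) that satisfies the condition is kept with its probability p unchanged.
  Product ×: tuples (v1, p1) and (v2, p2) combine into (v1 v2, p1 p2).
  Projection Π: input tuples (v, p1), (v, p2), … that project to the same value v merge into (v, 1 - (1-p1)(1-p2)…).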

Extensional Query Evaluation: Example
Plan: π_D(S join_{B=C} T)
  S    A    B    Prob.
  s1   'm'  1    0.8
  s2   'n'  1    0.5

  T    C    D    Prob.
  t1   1    'p'  0.6
S join T on B = C:
  A    B   C   D    Prob.
  'm'  1   1   'p'  0.48
  'n'  1   1   'p'  0.30
Project on D:
  D    Prob.
  'p'  1 - (1-0.48)*(1-0.30) = 0.636
Wrong?? The correct answer is 0.54. The projection treats the two joined tuples as independent, but they no longer are: both depend on t1.

Extensional: Alternate Query Plan
Plan: π_D(π_B(S) join_{B=C} T)
  S    A    B    Prob.
  s1   'm'  1    0.8
  s2   'n'  1    0.5

  T    C    D    Prob.
  t1   1    'p'  0.6
Project S on B:
  B   Prob.
  1   1 - (1-0.8)*(1-0.5) = 0.9
Join with T on B = C:
  B   C   D    Prob.
  1   1   'p'  0.9 * 0.6 = 0.54
CORRECT!!
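The two extensional plans can be compared with a small sketch. It assumes relations are lists of (values, probability) pairs; `ext_join` and `ext_project` are hypothetical helper names that apply the extensional rules (multiply on join, 1 - ∏(1 - p) on projection) without checking whether the independence assumption behind them actually holds.

```python
# Relations as lists of (values, probability); values are plain tuples.
S = [(("m", 1), 0.8), (("n", 1), 0.5)]   # S(A, B)
T = [((1, "p"), 0.6)]                     # T(C, D)


def ext_join(left, right, on):
    """Extensional join: concatenate matching rows, multiply probabilities."""
    return [(l + r, pl * pr) for l, pl in left for r, pr in right if on(l, r)]


def ext_project(rows, columns):
    """Extensional projection: merge duplicates with 1 - prod(1 - p)."""
    out = {}
    for values, p in rows:
        key = tuple(values[i] for i in columns)
        out[key] = 1.0 - (1.0 - out.get(key, 0.0)) * (1.0 - p)
    return sorted(out.items())


# Plan 1: pi_D(S join_{B=C} T). Columns after the join are (A, B, C, D).
plan1 = ext_project(ext_join(S, T, lambda s, t: s[1] == t[0]), [3])

# Plan 2: pi_D(pi_B(S) join_{B=C} T). Columns after the join are (B, C, D).
s_on_b = ext_project(S, [1])
plan2 = ext_project(ext_join(s_on_b, T, lambda s, t: s[0] == t[0]), [2])

print(plan1)   # probability of 'p' under plan 1
print(plan2)   # probability of 'p' under plan 2
```

Plan 1 multiplies t1's probability into both joined rows before merging them, so t1 is counted twice; plan 2 merges s1 and s2 first and uses t1 only once.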

Observation The answer depends on query plan

Notations
  R = a relation name
  D = instance of a database schema
  Γ = set of functional dependencies
  E = set of all complex events
  q = query
  PRels(q) = the probabilistic relation names in q
  Attr(q) = all attributes in all relations in q
  Head(q) = the set of attributes that are in the output of the query q

Safe Plan: A plan is safe if it produces the correct result. Formally, given a schema (R^p, Γ^p), a plan P for a query q is safe if P^e(D^p) = q^rank(D^p) for all instances D^p of that schema.

Theorem (3): Consider a database schema where all the probabilistic relations are tuple-independent. Let q, q' be conjunctive queries that do not share any relation name. Then:
  σ is always safe in σ(q).
  × is always safe in q × q'.
  Π_{A1,…,Ak} is safe in Π_{A1,…,Ak}(q) iff, for every probabilistic relation R in PRels(q), the dependency A1,…,Ak, R.E → Head(q) holds.

Example
Same example. Γ^p is:
  S.A, S.B → S.E
  T.C, T.D → T.E
  S.E → S.A, S.B
  T.E → T.C, T.D
Query: S join T on B = C, project on D
Plan: π_D(S join_{B=C} T)
The join is safe. We need to check the safety of the projection. From Theorem 3, we need to check A1,…,Ak, R.E → Head(q), where A1,…,Ak is T.D, R.E is S.E and T.E, and Head(q) is S.A, S.B, T.C, T.D:
  T.D, S.E → S.A, S.B, T.C, T.D (pass)
  T.D, T.E → S.A, S.B, T.C, T.D (fails: T.E determines T.C and T.D, and through B = C also S.B, but not S.A)

Example: Alternative Plan
Query: S join T on B = C, project on D
Plan: π_D(π_B(S) join_{B=C} T)
The projection on B is safe. We need to check the safety of the projection on D. From Theorem 3, we need to check A1,…,Ak, R.E → Head(q), where A1,…,Ak is T.D, R.E is S.E and T.E, and Head(q) is S.B, T.C, T.D:
  T.D, S.E → S.B, T.C, T.D
  T.D, T.E → S.B, T.C, T.D
Both hold. Plan is safe!!
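The projection test from Theorem 3 can be sketched with the standard attribute-closure computation for functional dependencies. This is a minimal illustration under two assumptions that are not spelled out on the slides: the join condition B = C is modeled as the pair of dependencies S.B → T.C and T.C → S.B, and `closure` and `projection_is_safe` are hypothetical helper names.

```python
def closure(attrs, fds):
    """Attribute closure of `attrs` under functional dependencies `fds`."""
    closed = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= closed and not rhs <= closed:
                closed |= rhs
                changed = True
    return closed


# Gamma^p from the example, plus the join condition B = C (an assumption:
# the equality is encoded as two dependencies).
GAMMA = [
    (frozenset({"S.A", "S.B"}), frozenset({"S.E"})),
    (frozenset({"T.C", "T.D"}), frozenset({"T.E"})),
    (frozenset({"S.E"}), frozenset({"S.A", "S.B"})),
    (frozenset({"T.E"}), frozenset({"T.C", "T.D"})),
    (frozenset({"S.B"}), frozenset({"T.C"})),
    (frozenset({"T.C"}), frozenset({"S.B"})),
]


def projection_is_safe(kept, event_attrs, head, fds):
    """Check A1..Ak, R.E -> Head(q) for every event attribute R.E."""
    return all(set(head) <= closure(set(kept) | {e}, fds) for e in event_attrs)


# Plan pi_D(S join T): the input of the projection has head {A, B, C, D}.
plan1_safe = projection_is_safe(
    {"T.D"}, ["S.E", "T.E"], {"S.A", "S.B", "T.C", "T.D"}, GAMMA)

# Plan pi_D(pi_B(S) join T): the input of the projection has head {B, C, D}.
plan2_safe = projection_is_safe(
    {"T.D"}, ["S.E", "T.E"], {"S.B", "T.C", "T.D"}, GAMMA)

print(plan1_safe, plan2_safe)
```

The first check fails only on the T.E condition, because nothing reachable from {T.D, T.E} determines S.A; dropping S.A from the head is what makes the alternative plan pass.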

Separation: Let q be a conjunctive query. Two relations R1, R2 are called connected if the query contains a join condition R1.A = R2.B and either R1.A or R2.B is not in Head(q). The relations R1, R2 are called separate if they are not connected. Two sets of relations Y1 and Y2 are said to form a separation for query q iff (1) they partition the set Rels(q), and (2) every pair R1 in Y1, R2 in Y2 is separate. Intuitively, R1 and R2 are separate when either the query contains no join condition between them or, if it does, the output of the query contains both R1.A and R2.B.

Separation: Example
Query: S(A,B), T(C,D), B = C
q_BC = (S join_{B=C} T) with Head(q_BC) = {B, C, D}
S join T on B = C:
  B   C   D
  1   1   'p'
Both B and C are present in Head(q_BC). Thus S and T are separate for this query.
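The definitions of connected and separate relations translate almost directly into code. The sketch below assumes attributes are written as "Relation.Attribute" strings and join conditions as pairs of such strings; `connected` and `is_separation` are hypothetical names used only for illustration.

```python
def connected(r1, r2, join_conditions, head):
    """R1 and R2 are connected if some join condition between them has an
    attribute that is missing from the head of the query."""
    for a, b in join_conditions:
        relations = {a.split(".")[0], b.split(".")[0]}
        if relations == {r1, r2} and (a not in head or b not in head):
            return True
    return False


def is_separation(y1, y2, relations, join_conditions, head):
    """Y1, Y2 form a separation if they partition the relations and no
    relation in Y1 is connected to a relation in Y2."""
    y1, y2 = set(y1), set(y2)
    is_partition = bool(y1) and bool(y2) and not (y1 & y2) and (y1 | y2) == set(relations)
    return is_partition and not any(
        connected(r1, r2, join_conditions, head) for r1 in y1 for r2 in y2)


joins = [("S.B", "T.C")]

# Head(q_BC) = {B, C, D}: both join attributes are in the head.
sep_bc = is_separation({"S"}, {"T"}, {"S", "T"}, joins, {"S.B", "T.C", "T.D"})

# Head(q) = {D}: the join attributes are projected away.
sep_d = is_separation({"S"}, {"T"}, {"S", "T"}, joins, {"T.D"})

print(sep_bc, sep_d)
```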

Finding Safe Plan: The authors propose the SAFE-PLAN algorithm to find safe plans for a query. It works top-down and tries to place all safe projections as late as possible in the query plan. When no more safe projections are possible, it tries to perform a join, by splitting q into q1 join q2. Since this join is then the last operation of the plan, all attributes of the join condition must be in Head(q), which ensures that the relations involved in the join are separate. If a safe plan exists, the algorithm finds it.

Finding Safe Plan: Example
Processing: SAFE-PLAN(π_D(S join_{B=C} T)), Head(q) = {D}
Try Z = {A}: q_A = π_{A,D}(S join_{B=C} T), Head(q_A) = {A, D}
Is π_Head(q)(q_A) a safe operator? Conditions:
  T.D, S.E → S.A, T.D (safe)
  T.D, T.E → S.A, T.D (unsafe)

Finding Safe Plan: Example
Processing: SAFE-PLAN(π_D(S join_{B=C} T)), Head(q) = {D}
Try Z = {B}: q_B = π_{B,D}(S join_{B=C} T), Head(q_B) = {B, D}
Is π_Head(q)(q_B) a safe operator? Conditions:
  T.D, S.E → S.B, T.D (safe)
  T.D, T.E → S.B, T.D (safe)
Return π_D(SAFE-PLAN(q_B))

Finding Safe Plan: Example
Processing: π_D(SAFE-PLAN(q_B)), Head(q_B) = {B, D}
Try Z = {A}: q_AB = π_{A,B,D}(S join_{B=C} T), Head(q_AB) = {A, B, D}
Is π_Head(q_B)(q_AB) a safe operator? Conditions:
  T.D, S.E → S.A, S.B, T.D (safe)
  T.D, T.E → S.A, S.B, T.D (unsafe)

Finding Safe Plan: Example
Processing: π_D(SAFE-PLAN(q_B)), Head(q_B) = {B, D}
Try Z = {C}: q_BC = π_{B,C,D}(S join_{B=C} T), Head(q_BC) = {B, C, D}
Is π_Head(q_B)(q_BC) a safe operator? Conditions:
  T.D, S.E → T.C, S.B, T.D (safe)
  T.D, T.E → T.C, S.B, T.D (safe)
Return π_{B,D}(SAFE-PLAN(q_BC))

Finding Safe Plan: Example
Processing: π_D(π_{B,D}(SAFE-PLAN(q_BC))), Head(q_BC) = {B, C, D}
Try Z = {A}: q_ABC = π_{A,B,C,D}(S join_{B=C} T), Head(q_ABC) = {A, B, C, D}
Is π_Head(q_BC)(q_ABC) a safe operator? Conditions:
  T.D, S.E → S.A, T.C, S.B, T.D (safe)
  T.D, T.E → S.A, T.C, S.B, T.D (unsafe)

Finding Safe Plan: Example
Processing: π_D(π_{B,D}(SAFE-PLAN(q_BC)))
No projection possible!! q_BC = π_{B,C,D}(S join_{B=C} T), Head(q_BC) = {B, C, D}
Split q_BC into q1 join_{B=C} q2, such that
  q1(B) :- S(A,B)
  q2(C,D) :- T(C,D)
We know that S and T are separate in query q_BC!!
Return SAFE-PLAN(q1) join_{B=C} SAFE-PLAN(q2)

Finding Safe Plan: Example
Processing: π_D(π_{B,D}(SAFE-PLAN(q1) join_{B=C} SAFE-PLAN(q2))), Head(q1) = {B}
Try Z = {A}: q_A = S(A,B), Head(q_A) = {A, B}
Is π_Head(q1)(q_A) a safe operator? Conditions:
  S.B, S.E → S.A, S.B (safe)
Return π_B(SAFE-PLAN(S(A,B))), i.e. π_B(S(A,B))

Finding Safe Plan: Example
SAFE-PLAN(q2) = T(C,D)
Thus, final result: π_D(π_{B,D}(π_B(S) join_{B=C} T)). π_{B,D} is redundant and can be optimized away.
The SAFE-PLAN algorithm is sound and complete.
How can we optimize our query plan? Traditional equivalences do not work in extensional semantics. We need to define equivalences for extensional semantics.

Query Optimization
  Select behaves exactly like the traditional select operator.
  Extensional joins are commutative: R join S ⇔ S join R
  Extensional joins are associative: R join (S join T) ⇔ (R join S) join T
  Cascading projections: π_A(π_{A ∪ B}(R)) ⇔ π_A(R)
  Pushing projections below a join: π_A(R join S) => (π_A(R)) join (π_A(S))
  Lifting projections up a join, only when it satisfies the projection condition in Theorem 3: (π_A(R)) join S => π_{A ∪ Attrs(S)}(R join S)
  Theorem (10): Let Z1 and Z2 be two safe plans for a query q. Then Z1 ⇔ Z2

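Complexity Fundamentals: PTIME: problems solvable in polynomial time. NP-complete: decision problems ("is there a solution?"), e.g. checking satisfiability. #P-complete: counting problems ("how many solutions?"), e.g. counting satisfying assignments.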

Complexity Analysis: The data complexity of a query q is the complexity of evaluating q^rank(D^p) as a function of the size of D^p. If q has a safe plan, then its data complexity is in PTIME, since all extensional operators are in PTIME. If q does not have a safe plan, i.e. if the SAFE-PLAN algorithm fails to return a plan, then its data complexity is #P-complete.

Unsafe Plans: What if there is no safe plan? The authors propose two solutions: least unsafe plans and Monte-Carlo approximations.

Least Unsafe Plans: Minimize the error in computing the probabilities. Modify the SAFE-PLAN algorithm: when splitting a query q into two sub-queries q1 and q2, allow joins between q1 and q2 on attributes not in Head(q), then project out these attributes. These projections will be unsafe, so minimize their degree of unsafety: pick q1, q2 to be a minimum cut of the graph (rather than a separation). The problem of finding a minimum cut is in PTIME.

Monte-Carlo Approximations: Let q' be the query obtained from q by making it return all the variables in its body. Evaluate q' instead of q without any probability calculations. Group the tuples based on the values of the attributes in Head(q). The complex event expression of a group will be in DNF, i.e. C1 ∨ C2 ∨ … ∨ Cn, where each Ci is a conjunction e1 ∧ e2 ∧ … Back to the same problem!! Evaluating the probability of a Boolean expression is #P-complete.

Monte-Carlo Approximations: Given a DNF formula with N clauses and any ε and δ, the probability can be approximated in time O((N/ε^2) ln(1/δ)), such that the probability of the error being greater than ε is less than δ. If N is small, an exact algorithm may be applied in place of simulation.
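One well-known simulation scheme for DNF probabilities is the Karp-Luby coverage estimator; the slide does not spell out which estimator is used, so the sketch below is an assumption for illustration rather than the authors' exact procedure. It handles monotone DNF over independent events, takes the number of trials as a plain parameter instead of deriving it from ε and δ, and uses the hypothetical name `karp_luby`.

```python
import random


def karp_luby(dnf, prob, trials, rng):
    """Estimate Pr(C1 or ... or Cn) for a monotone DNF over independent events.

    dnf   : list of clauses, each a set of event names (a conjunction)
    prob  : dict mapping event name -> probability
    trials: number of Monte-Carlo samples
    rng   : random.Random instance, passed in for reproducibility
    """
    variables = sorted(set().union(*dnf))
    # Weight of a clause = probability that all of its events are true.
    weights = []
    for clause in dnf:
        w = 1.0
        for v in clause:
            w *= prob[v]
        weights.append(w)
    total = sum(weights)

    hits = 0
    for _ in range(trials):
        # Pick a clause proportionally to its weight ...
        i = rng.choices(range(len(dnf)), weights=weights)[0]
        # ... and sample a world conditioned on that clause being true.
        world = {v: (v in dnf[i]) or (rng.random() < prob[v]) for v in variables}
        # Count the sample only if clause i is the first satisfied clause,
        # so every satisfying world is counted exactly once in expectation.
        first = next(j for j, c in enumerate(dnf) if all(world[v] for v in c))
        hits += first == i
    return total * hits / trials


PROB = {"s1": 0.8, "s2": 0.5, "t1": 0.6}
DNF = [{"s1", "t1"}, {"s2", "t1"}]
estimate = karp_luby(DNF, PROB, trials=20000, rng=random.Random(0))
print(round(estimate, 2))
```

On the running example the estimate should land close to the exact value 0.54; more trials tighten the estimate at a linear cost in running time.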

Extensions: Until now, all the events in probabilistic relations were distinct, and we dealt only with select, project, join queries. The authors have extended their solutions to non-distinct relations and additional operators.

Handling Repeated Events: Multiple tuples can share a common event. 4 easy steps to handle them:
  1. Normalize the schema: represent the same data in normalized form, such that no probabilistic table has repeated events: T^p :- T1 and T2^p
  2. Translate the original query into the new schema
  3. Find a safe plan
  4. Translate back to the original schema

Handling Repeated Events: Example
Consider two probabilistic relations R(A,B) and S(C,D), such that R has all distinct events while S has a distinct event for each value of D.
Query: q(x) :- R(x,y), S(y,z)
Step 1: create a new schema. Decompose S into two relations, S1(C, D, EID) and S2(EID):
  q'(x) :- R(x,y), S1(y,z,eid), S2(eid)
Using SAFE-PLAN, we get the following plan:
  P' = π_A(R join_{B=C} (π_{C,EID}(S1) join_{EID} S2))
Substitute back S1 and S2 accordingly.

Additional Operators: Union, difference and group-by operators. Covers almost all queries with nested sub-queries, aggregates, group-by and existential/universal quantifiers.

Uncertain Predicates: ≈ predicate on a deterministic database
  Syntactic closeness (string matching), e.g. certain ~ uncertain: edit distances, q-grams etc.
  Semantic closeness, e.g. musical ~ opera: TF/IDF, ontologies from Wordnet
  Numeric closeness, e.g. 25 ~ 26: similar numeric values
Once distances are defined, they need to be meaningfully converted into probabilities (gaussian, student-T, normal-gamma). Parameters can be learned (ideal case) or can be specified by the user.

Conclusions: Extensional semantics can be used to evaluate a certain class of queries in PTIME. #P-complete problems can be handled using approximation techniques. In practice, many queries (around 80% in the experiments) have safe plans. The authors extended their approach to deal with non-distinct relations and additional operators.