Containment of Conjunctive Queries on Annotated Relations Todd J. Green University of Pennsylvania March 25, ICDT 09, Saint Petersburg.

Slides:



Advertisements
Similar presentations
Functions Reading: Epp Chp 7.1, 7.2, 7.4
Advertisements

Containment of Conjunctive Queries on Annotated Relations TJ Green University of Pennsylvania Symposium on Database Provenance University of Edinburgh.
Presented by: Ms. Maria Estrellita D. Hechanova, ECE
ANHAI DOAN ALON HALEVY ZACHARY IVES CHAPTER 14: DATA PROVENANCE PRINCIPLES OF DATA INTEGRATION.
CSE 636 Data Integration Conjunctive Queries Containment Mappings / Canonical Databases Slides by Jeffrey D. Ullman.
2005conjunctive-ii1 Query languages II: equivalence & containment (Motivation: rewriting queries using views)  conjunctive queries – CQ’s  Extensions.
CPSC 504: Data Management Discussion on Chandra&Merlin 1977 Laks V.S. Lakshmanan Dept. of CS UBC.
Relational Algebra, Join and QBE Yong Choi School of Business CSUB, Bakersfield.
Queries with Difference on Probabilistic Databases Sanjeev Khanna Sudeepa Roy Val Tannen University of Pennsylvania 1.
Graph Isomorphism Algorithms and networks. Graph Isomorphism 2 Today Graph isomorphism: definition Complexity: isomorphism completeness The refinement.
Having Proofs for Incorrectness
Annotated XML: Queries and Provenance Nate Foster T.J. Green Val Tannen University of Pennsylvania PODS ’08 Vancouver, B.C. June 11, 2008.
1 Relational Algebra & Calculus. 2 Relational Query Languages  Query languages: Allow manipulation and retrieval of data from a database.  Relational.
Efficient Query Evaluation on Probabilistic Databases
Math 3121 Abstract Algebra I
CMPT 354, Simon Fraser University, Fall 2008, Martin Ester 52 Database Systems I Relational Algebra.
1 Introduction to Computability Theory Lecture12: Reductions Prof. Amos Israeli.
1 Lecture 11: Provenance and Data privacy December 8, 2010 Dan Suciu -- CSEP544 Fall 2010.
An Overview of Data Provenance
Collaborative Data Sharing with Mappings and Provenance Todd J. Green University of Pennsylvania Spring 2009.
NP-Complete Problems Reading Material: Chapter 10 Sections 1, 2, 3, and 4 only.
2005certain1 Views as Incomplete Databases – Certain & Possible Answers  Views – an incomplete representation  Certain and possible answers  Complexity.
1 CSE 417: Algorithms and Computational Complexity Winter 2001 Lecture 23 Instructor: Paul Beame.
Computational Complexity, Physical Mapping III + Perl CIS 667 March 4, 2004.
1 Provenance Semirings T.J. Green, G. Karvounarakis, V. Tannen University of Pennsylvania Principles of Provenance (PrOPr) Philadelphia, PA June 26, 2007.
Complexity ©D. Moshkovitz 1 And Randomized Computations The Polynomial Hierarchy.
Chapter 11 Limitations of Algorithm Power Copyright © 2007 Pearson Addison-Wesley. All rights reserved.
1 Relational Algebra and Calculus Yanlei Diao UMass Amherst Feb 1, 2007 Slides Courtesy of R. Ramakrishnan and J. Gehrke.
An Investigation of the Multimorphisms of Tractable and Intractable Classes of Valued Constraints Mathematics of Constraint Satisfaction: Algebra, Logic.
Todd J. Green University of California, Davis Efficiently Supporting Changes to Declarative Schema Mappings January 28, InfoSeminar.
Relation, function 1 Mathematical logic Lesson 5 Relations, mappings, countable and uncountable sets.
Math 3121 Abstract Algebra I Lecture 3 Sections 2-4: Binary Operations, Definition of Group.
Database Systems Normal Forms. Decomposition Suppose we have a relation R[U] with a schema U={A 1,…,A n } – A decomposition of U is a set of schemas.
February 18, 2015CS21 Lecture 181 CS21 Decidability and Tractability Lecture 18 February 18, 2015.
Theory of Computation, Feodor F. Dragan, Kent State University 1 NP-Completeness P: is the set of decision problems (or languages) that are solvable in.
CS151 Complexity Theory Lecture 13 May 11, Outline proof systems interactive proofs and their power Arthur-Merlin games.
Christopher Re and Dan Suciu University of Washington Efficient Evaluation of HAVING Queries on a Probabilistic Database.
On Reducing the Global State Graph for Verification of Distributed Computations Vijay K. Garg, Arindam Chakraborty Parallel and Distributed Systems Laboratory.
1 Lower Bounds Lower bound: an estimate on a minimum amount of work needed to solve a given problem Examples: b number of comparisons needed to find the.
1 Annoucement n Skills you need: (1) (In Thinking) You think and move by Logic Definitions Mathematical properties (Basic algebra etc.) (2) (In Exploration)
CSE 636 Data Integration Conjunctive Queries Containment Mappings / Canonical Databases Slides by Jeffrey D. Ullman Fall 2006.
1 Lecture 6: Schema refinement: Functional dependencies
1 Relational Algebra & Calculus Chapter 4, Part A (Relational Algebra)
1 Relational Algebra and Calculas Chapter 4, Part A.
Reconcilable Differences Todd J. GreenZachary G. IvesVal Tannen University of Pennsylvania March 24, ICDT 09, Saint Petersburg.
LDK R Logics for Data and Knowledge Representation Propositional Logic: Reasoning First version by Alessandro Agostini and Fausto Giunchiglia Second version.
A Logic of Partially Satisfied Constraints Nic Wilson Cork Constraint Computation Centre Computer Science, UCC.
Boolean Algebra and Computer Logic Mathematical Structures for Computer Science Chapter 7.1 Copyright © 2006 W.H. Freeman & Co.MSCS SlidesBoolean Algebra.
DISCRETE COMPUTATIONAL STRUCTURES CSE 2353 Fall 2010 Most slides modified from Discrete Mathematical Structures: Theory and Applications by D.S. Malik.
Lecture 7: Foundations of Query Languages Tuesday, January 23, 2001.
1 Use graphs and not pure logic Variables represented by nodes and dependencies by edges. Common in our language: “threads of thoughts”, “lines of reasoning”,
1 Provenance Semirings T.J. Green, G. Karvounarakis, V. Tannen University of Pennsylvania PODS 2007.
1 Finite Model Theory Lecture 16 L  1  Summary and 0/1 Laws.
Containment of Relational Queries with Annotation Propagation Wang-Chiew Tan University of California, Santa Cruz.
Variants of LTL Query Checking Hana ChocklerArie Gurfinkel Ofer Strichman IBM Research SEI Technion Technion - Israel Institute of Technology.
Provenance for Database Transformations Val Tannen University of Pennsylvania Joint work with J.N. Foster T.J. Green G. Karvounarakis IPAW ’08 Salt Lake.
Group A set G is called a group if it satisfies the following axioms. G 1 G is closed under a binary operation. G 2 The operation is associative. G 3 There.
Lecture 9: Query Complexity Tuesday, January 30, 2001.
Chapter 8: Concurrency Control on Relational Databases
Approximate Lineage for Probabilistic Databases
NP-Completeness Yin Tat Lee
Queries with Difference on Probabilistic Databases
Propositional Calculus: Boolean Algebra and Simplification
Lesson 5 Relations, mappings, countable and uncountable sets
Sungho Kang Yonsei University
CS 583 Fall 2006 Analysis of Algorithms
Lesson 5 Relations, mappings, countable and uncountable sets
Chapter 11 Limitations of Algorithm Power
Probabilistic Databases
NP-Completeness Yin Tat Lee
Presentation transcript:

Containment of Conjunctive Queries on Annotated Relations Todd J. Green University of Pennsylvania March 25, ICDT 09, Saint Petersburg

The Need for Data Provenance Many new database applications must track where data came from (as it is combined and transformed by queries, schema mappings, etc.): data provenance – Debugging schema mappings – Assessing data quality, trustworthiness – Computing probabilities – Enforcing access control policies – Preserving the scientific record Must do this while also satisfying DBMS performance requirements and retaining compatibility with legacy systems 2

Challenge: Provenance May Affect Query Optimization Query optimization strategies depend fundamentally on issues of query containment and equivalence – Query minimization, rewritings queries using materialized views, etc. Well-known difference between set, bag semantics: consider Q(x,y) :– R(x,y) Q’(u,v) :– R(u,v), R(u,w) Under set semantics, Q and Q’ are equivalent; under bag semantics, they are not! (“redundant” join in Q’ affects output tuple multiplicities) Issues pointed out in [Buneman+ 01], reiterated in [Buneman+ 08] 3

Contributions We study containment and equivalence of conjunctive queries (CQs) and unions of conjunctive queries (UCQs), for provenance models captured by semiring-annotated relations: – Provenance polynomials (O RCHESTRA system) [Green+ 07] – Why-provenance [Buneman+ 01] – Data warehousing lineage [Cui+ 01] – Trio system lineage [Das Sarma+ 08] We give positive decidability results and complexity characterizations in (nearly) all cases We show interesting connections with same problems under set semantics and bag semantics 4

Outline Semiring-annotated relations (K-relations) Bounds based on semiring homomorphisms Results for provenance polynomials Overview of other results 5

Basic idea: annotate source tuples with tuple ids, combine and propagate during query processing – Abstract “+” records alternative use of data (union, projection) – Abstract “ ¢ ” records joint use of data (join) – Yields space of annotations K K-relation: a relation whose tuples are annotated with elements from K – Notation: R(t) means annotation of t in K-relation R A Unifying Framework for Data Provenance: Semiring Annotated Relations [Green+ PODS 07] 6

Combining Annotations in Queries 7 IDSpeciesImg 61Lemur cattas SpeciesComm. Name Lemur cattaRing-tailed Lemur u IDSpeciesImgCharacterState 34L.cattahand colorwhitep 47L.cattahand colorwhiteq IDCharacterState 61hand colorblackr source tuples annotated with tuple ids from K A C B D

A Combining Annotations in Queries 8 IDSpeciesImg 61Lemur cattas SpeciesComm. Name Lemur cattaRing-tailed Lemur u IDSpeciesImgCharacterState 34L.cattahand colorwhitep 47L.cattahand colorwhiteq IDCharacterState 61hand colorblackr Comm. NameHand Color Ring-tailed Lemurblack E(name, color) :– B(id, “hand color”, color), C(id, species,_), D(species, name) Operation x ¢ y means joint use of data annotated by x and data annotated by y Union of conjunctive queries (UCQ) join r¢s¢ur¢s¢u r s u C B D E

C B Combining Annotations in Queries 9 IDSpeciesImg 61Lemur cattas SpeciesComm. Name Lemur cattaRing-tailed Lemur u IDSpeciesImgCharacterState 34L.cattahand colorwhitep 47L.cattahand colorwhiteq IDCharacterState 61hand colorblackr Comm. NameHand Color Ring-tailed Lemurblackr¢s¢ur¢s¢u Ring-tailed Lemurwhite Ring-tailed Lemurwhite E(name, color) :– B(id, “hand color”, color), C(id, species,_), D(species, name) Operation x ¢ y means joint use of data annotated by x and data annotated by y Union of conjunctive queries (UCQ) p¢up¢u u E(name, color) :– A(id, species,_, “hand color”, color), D(species, name) q¢uq¢u p q p¢up¢u A D E

C B Comm. NameHand Color Ring-tailed Lemurblackr¢s¢ur¢s¢u Ring-tailed Lemurwhite Combining Annotations in Queries 10 IDSpeciesImg 61Lemur cattas SpeciesComm. Name Lemur cattaRing-tailed Lemur u IDSpeciesImgCharacterState 34L.cattahand colorwhitep 47L.cattahand colorwhiteq IDCharacterState 61hand colorblackr Comm. NameHand Color Ring-tailed Lemurblackr¢s¢ur¢s¢u Ring-tailed Lemurwhite Ring-tailed Lemurwhite E(name, color) :– B(id, “hand color”, color), C(id, species,_), D(species, name) Union of conjunctive queries (UCQ) E(name, color) :– A(id, species,_, “hand color”, color), D(species, name) Operation x+y means alternate use of data annotated by x and data annotated by y p ¢ u + q ¢ u q¢uq¢u p¢up¢u A D E

What Properties Do K-Relations Need? DBMS query optimizers choose from among many plans, assuming certain identities: – union is associative, commutative – join associative, commutative, distributive over union – projections and selections commute with each other and with union and join (when applicable) Equivalent queries should produce same provenance! Proposition [Green+ 07]. Above identities hold for positive relational algebra queries on K-relations iff (K, +, ¢, 0, 1) is a commutative semiring 11

What is a Commutative Semiring? An algebraic structure (K, +, ¢, 0, 1) where: – K is the domain – + is associative, commutative with 0 identity –¢ is associative, commutative with 1 identity –¢ is distributive over + – 8 a 2 K, a ¢ 0 = 0 ¢ a = 0 (unlike ring, no requirement for additive inverses) Big benefit of semiring-based framework: one framework unifies many database semantics 12

Semirings Unify Commonly-Used Database Semantics 13 (PosBool(X), Æ, Ç, >, ? )Conditional tables [Imielinski&Lipski 84] ( P (  ), [, Å, ;,  ) Probabilistic event tables [Fuhr&Rölleke 97] ( B, Æ, Ç, >, ? ) Set semantics ( ℕ, +, ∙, 0, 1) Bag semantics Standard database models: Incomplete/probabilistic data: Also ranked query models, dissemination policies,...

Semirings Unify Provenance Models X a set of indeterminates, can be thought of as tuple ids 14 ( N [X], +, ¢, 0, 1) “most informative” Provenance polynomials [Green+ 07] (Lin(X), [, [ *, ;, ; * ) sets of contributing tuples Data warehousing lineage [Cui+ 00] (Why(X), [, d, ;, { ; }) sets of sets of contributing tuples Why-provenance [Buneman+ 01] (Trio(X), +, ¢, 0, 1) bags of sets of contributing tuples Trio-style lineage [Das Sarma+ 08] ( B [X], +, ¢, 0, 1)Boolean prov. polynomials

A Hierarchy of Provenance N[X]N[X] B[X]B[X] Trio(X) Why(X) Lin(X) PosBool(X) A path downward from K 1 to K 2 indicates that there exists a surjective semiring homomorphism h : K 1  K 2 most informative least informative Example: 2p 2 r + pr + 5r 2 + s drop exponents 3pr + 5r + s drop coefficients p 2 r + pr + r 2 + s collapse terms prs drop both exp. and coeff. pr + r + s apply absorption (pr + r ´ r) r + s 15 B non-zero? true

What Does Query Containment Mean for K-Relations? Notion of containment based on natural order for K: a ≤ K b iff exists c s.t. a + c = b – When this is a partial order, call K naturally ordered; all semirings considered here are naturally ordered Lift to K-relations: R ≤ K R’ iff for all tuples t R(t) ≤ K R’(t) – For K = B (set semantics), this is set-containment – For K = ℕ (bag semantics), this is bag-containment – For K = PosBool(X), this is logical implication Queries on K-relations: say that Q is K-contained in Q’ iff for all K-relations R, Q(R) ≤ K Q’(R) 16

Provenance Hierarchy and Query Containment N[X]N[X] B[X]B[X] Trio(X) Why(X) Lin(X) PosBool(X) B A path downward from K 1 to K 2 also indicates that for UCQs Q 1, Q 2, if Q 1 is K 1 -contained in Q 2, then Q 1 is K 2 -contained in Q 2 most informative least informative 17 strongest notion of containment weakest notion of containment N any K (positive K)

Prov. Hierarchy and Query Containment (2) Provenance hierarchy tells us something about relative behavior of K-containment for various K Doesn’t tell us which implications are strict; we’d also like to know whether containment/equivalence is even decidable! One case already known: Theorem [Grahne+ 97]. If K is a distributive lattice, then for UCQs Q,Q’, Q is K-contained in Q’ iff Q is set-contained in Q’ – Distributive lattices are between PosBool(X) (for c-tables) and B in previous slide – Other examples: dissemination policies, prob. event tables,... 18

Summary: Logical Implications of Containment/Equivalence 19 N[X]N[X] B[X]B[X] Trio(X)Why(X) Lin(X) PosBool(X) B N CQs, cont. N[X]N[X] B[X]B[X] Trio(X)Why(X) Lin(X) PosBool(X) B N[X]N[X] B[X]B[X] Trio(X)Why(X) Lin(X) PosBool(X) B CQs, equiv. N N UCQs, cont. N[X]N[X] Trio(X) Lin(X) PosBool(X) B UCQs, equiv. N Why(X) B[X]B[X] “K 1 K 2 ” indicates that for CQs (UCQs), K 1 cont. (equiv.) implies K 2 cont. (equiv.) All implications not marked “ ” are strict. Red arrows are from [Grahne+ 97].

Summary: Logical Implications of Containment/Equivalence 20 N[X]N[X] B[X]B[X] Trio(X)Why(X) Lin(X) PosBool(X) B N CQs, cont. N[X]N[X] B[X]B[X] Trio(X)Why(X) Lin(X) PosBool(X) B N[X]N[X] B[X]B[X] Trio(X)Why(X) Lin(X) PosBool(X) B CQs, equiv. N N UCQs, cont. N[X]N[X] Trio(X) Lin(X) PosBool(X) B UCQs, equiv. N Why(X) B[X]B[X] CQs separating the various notions of K-containment: Q(x,y) :– R(x,y) Q’(u,v):– R(u,v), R(u,w) Q is set-contained in Q’, but Q is not Lin(X)-contained in Q’ Q(u):– R(u,v), R(u,w) Q’(x) :– R(x,y) Q is Lin(X)-contained in Q’, but Q is not bag-contained in Q’ other examples...other examples... “K 1 K 2 ” indicates that for CQs (UCQs), K 1 cont. (equiv.) implies K 2 cont. (equiv.) All implications not marked “ ” are strict. Red arrows are from [Grahne+ 97].

Summary: Logical Implications of Containment/Equivalence 21 N[X]N[X] B[X]B[X] Trio(X)Why(X) Lin(X) PosBool(X) B N CQs, cont. N[X]N[X] B[X]B[X] Trio(X)Why(X) Lin(X) PosBool(X) B N[X]N[X] B[X]B[X] Trio(X)Why(X) Lin(X) PosBool(X) B CQs, equiv. N N UCQs, cont. N[X]N[X] Trio(X) Lin(X) PosBool(X) B UCQs, equiv. N Why(X) B[X]B[X] bag semantics Bag-equivalence of UCQs implies K- equivalence for provenance models (in fact, bag-equivalence implies K- equivalence for any K) “K 1 K 2 ” indicates that for CQs (UCQs), K 1 cont. (equiv.) implies K 2 cont. (equiv.) All implications not marked “ ” are strict. Red arrows are from [Grahne+ 97].

Tools for Main Results: Containment Mappings, Canonical Databases Theorem [Chandra&Merlin 77]. For CQs Q, Q’, following are equivalent: 1. Q is (set-)contained in Q’ 2. Q(can(Q)) ⊆ Q’(can(Q)) where can(Q) is canonical database for Q 3. There is a containment mapping h : vars(Q)  vars(Q’) Most of our results follow this template, with two key differences: – We use provenance-annotated canonical databases: e.g., Q(x,y) :– R(x,z), R(z,y)can N [X] (Q) is R = – We use variations of containment mappings e.g., exact containment mapping: a containment mapping h : vars(Q)  vars(Q’) that induces a bijection between atoms of Q and atoms of Q’ 22 xzp zyq

N [X]-Containment/Equivalence of CQs Natural order for N [X]: monomial-wise comparison of coefficients e.g., p 2 ≤ N [X] 2p 2 + pq but p 2 ≰ N [X] p 3 Theorem. For CQs Q, Q’, the following are equivalent: 1. Q is N [X]-contained in Q’ 2. Q(can N [X] (Q)) ≤ N [X] Q’(can N [X] (Q)) 3. There is an exact containment mapping h : vars(Q)  vars(Q’) and checking containment is NP-complete Corollary. Q and Q’ are N [X]-equivalent iff they are isomorphic (and checking equivalence is graph isomorphism-complete) 23

N [X]-Containment/Equivalence of UCQs Theorem. For UCQs Q,Q’, if Q is not N [X]-contained in Q’, then there is a small counterexample, i.e., an N [X]-relation R s.t. – Size of R (tuples and their annotations) polynomial in |Q| + |Q’| – Q(R) ≰ N [X] Q’(R) Corollary. N [X]-containment of UCQs is in PSPACE – Exact complexity: don’t know! Theorem. For UCQs Q,Q’, Q is N [X]-equivalent to Q’ iff Q and Q’ are isomorphic (and checking is again graph isomorphism- complete) 24

Highlights of Other Results Why(X) and Trio(X): CQ containment based on onto containment mappings Lin(X): CQ containment based on covering containment mappings These kinds of containment mappings have been used before, for checking bag-containment of CQs [Chaudhuri&Vardi 93] ! – Decidability of this problem: open – But, onto containment mappings sufficient for bag-containment – And, covering containment mappings necessary for bag-containment Hence for CQs, Why(X)/Trio(X)-containment and Lin(X)- containment “sandwich” bag-containment 25

N [X]-Equivalence and Bag-Equivalence Theorem. For UCQs, N [X]-equivalence is the same as bag- equivalence – Proof idea. For polynomials A, B in N [X], we have A = B iff for all valuations ν : X  N, Eval ν (A) = Eval ν (B) We have used this idea in another ICDT 09 paper; and results there for Z -relations also hold for Z [X]-relations – A fact used in O RCHESTRA Penn for optimizing change propagation with provenance 26

Summary: Complexity of Checking Containment/Equivalence of CQs/UCQs B PosBool(X)Lin(X)Why(X)Trio(X) B[X]B[X] N[X]N[X] N CQscontNP ? (Π 2 p - hard) equivNP GI UCQscontNP ? in PSPACE undec equivNP GINPGI 27 Bold type indicates results of this paper “NP” indicates NP-complete, “GI” indicates graph isomorphism-complete NP-complete/GI-complete considered “tractable” here - Complexity in size of query; queries small in practice

Related Work on Query Containment Set semantics [Chandra&Merlin 77], [Sagiv&Yannakakis 80],... Bag, bag-set semantics [Lovász 67], [Chaudhuri&Vardi 93], [Ioannidis&Ramakrishnan 95], [Cohen+ 99], [Jayram+ 06],... – Label systems of [Ioannidis&Ramakrishnan 95] : similar in spirit to K-relations Bilattice-annotated relations [Grahne+ 97], parametric databases [Lakshmanan&Shiri 01] – Also similar in spirit to K-relations Minimal-witness why-prov. [Buneman+ 01], where-prov. [Tan 03] Z -relations/ Z [X]-relations [Green+ 09] 28

Conclusion When optimizers rewrite queries, the provenance of query answers may change! This paper helps us understand how. We have given positive decidability results and complexity characterizations for CQ/UCQ containment/equivalence on various kinds of provenance-annotated databases For optimizations common in commercial DBMSs (i.e., those compatible with bag semantics), we have shown that they imply no change in provenance 29

Open Problems for Future Work Decidability of Trio(X)-containment of UCQs? Exact complexity of N [X]-containment of UCQs? (GI-hard, in PSPACE) Complexity when UCQs are represented as positive relational algebra queries (exponentially more concise than UCQs)? 30