Published by Eugenia Black. Modified over 9 years ago.
1
Provenance for Database Transformations
Val Tannen, University of Pennsylvania
Joint work with J.N. Foster, T.J. Green, G. Karvounarakis
IPAW ’08, Salt Lake City, June 17, 2008
2
Motivation
Some of the work in IPAW!
Data integration [Wang, Madnick 1990; Lee, Bressan, Madnick 1998]
Data warehousing – Lineage [Cui, Widom, Wiener 2000]
Scientific applications – Why-provenance [Buneman, Khanna, Tan 2001]
Collaborative data sharing networks in the ORCHESTRA system (project headed by Zack Ives)
– Trust conditions based on provenance
– Deletion propagation
[Green, Ives, Karvounarakis, T. 2007; Karvounarakis, Ives 2008]
3
Database transformations, e.g., views
View V = q(R): a white box (Ludäscher)
R (columns 1, 2, 3): (a, b, c), (d, b, e), (f, g, e)
V: (a, c), (a, e), (d, c), (d, e), (f, e)
CREATE VIEW V AS
  (SELECT u.1, v.3 FROM R u, R v WHERE u.3 = v.3
   UNION
   SELECT u.1, v.3 FROM R u, R v WHERE u.2 = v.2)
4
Database transformations, e.g., views
View V = q(R)
R (columns 1, 2, 3): (a, b, c), (d, b, e), (f, g, e)
V: (a, c), (a, e), (d, c), (d, e), (f, e)
Datalog without recursion:
  V(x,z) :- R(x,_,z), R(_,_,z)
  V(x,z) :- R(x,y,_), R(_,y,z)
Relational algebra:
  V := π_12((π_13(R) ⋈ π_23(R)) ∪ (π_12(R) ⋈ π_23(R)))
5
Provenance questions
View V = q(R), with t a tuple in the output V
R: (a, b, c), (d, b, e), (f, g, e)
V: (a, c), (a, e), (d, c), (d, e), (f, e)
Which input tuples contributed in some way to t being in the output?
Which sets of input tuples support each way for t to be in the output?
What are all possible ways in which t was caused to be in the output?
6
Provenance answers
R with tuple ids: (a, b, c): p; (d, b, e): r; (f, g, e): s
Output tuple t = (d, e)
Which input tuples contributed in some way to t being in the output?
  {r,s}: lineage [CWW 00]
Which sets of input tuples support each way for t to be in the output?
  {{r},{r,s}}: proof why-provenance [BKT 01], see [PODS 08]
What are all possible ways in which t was caused to be in the output?
  2r² + rs: provenance polynomials [Green, Karvounarakis, T. 2007]
7
More generality: annotated relations
Provenance: an annotation on tuples
Other instances of relations with annotated tuples
– incomplete databases (conditional tables) [Imielinski, Lipski 1984]
– probabilistic databases (independent tuple tables) [Fuhr, Rölleke 1997, Zimányi 1997, others]
– bag semantics databases (tuples with multiplicities) […SQL!]
How do annotations combine as they propagate through queries? (Is there an algebra of annotations?)
Imielinski and Lipski already computed some form of provenance!!
8
Incomplete databases: boolean c-tables [IL 84]
R: (a, b, c): p; (d, b, e): r; (f, g, e): s, where p, r, s are boolean variables
Semantics: a set of instances
I(R) = the 8 subsets of {(a, b, c), (d, b, e), (f, g, e)}, from the full relation down to ∅
9
Queries on c-tables
R: (a, b, c): p; (d, b, e): r; (f, g, e): s
V(x,z) :- R(x,_,z), R(_,_,z)
V(x,z) :- R(x,y,_), R(_,y,z)
V:
(a, c): (p ∧ p) ∨ (p ∧ p) = p
(a, e): p ∧ r
(d, c): r ∧ p = p ∧ r
(d, e): (r ∧ r) ∨ (r ∧ r) ∨ (r ∧ s) = r
(f, e): (s ∧ s) ∨ (s ∧ s) ∨ (s ∧ r) = s
But… simplifying like this misses the general idea!
10
Probabilistic independent-tuple tables
R: (a, b, c): probability 0.6, event X; (d, b, e): probability 0.5, event Y; (f, g, e): probability 0.1, event Z
Events “tuple in instance” are independent
V: (a, c): X; (a, e): X ∩ Y; (d, c): X ∩ Y; (d, e): Y; (f, e): Z
View V may not be representable as an independent-tuple table
View computation: similar to c-tables, but for the algebra of sets
11
C-tables vs. Lineage
tuple | c-table calculations | lineage calculations [CWW 00]
(a, c) | (p ∧ p) ∨ (p ∧ p) | ({p} ∪ {p}) ∪ ({p} ∪ {p})
(a, e) | p ∧ r | {p} ∪ {r}
(d, c) | r ∧ p | {r} ∪ {p}
(d, e) | (r ∧ r) ∨ (r ∧ r) ∨ (r ∧ s) | ({r} ∪ {r}) ∪ ({r} ∪ {r}) ∪ ({r} ∪ {s})
(f, e) | (s ∧ s) ∨ (s ∧ s) ∨ (s ∧ r) | ({s} ∪ {s}) ∪ ({s} ∪ {s}) ∪ ({s} ∪ {r})
The structure of the calculations is the same!
12
Another analogy, with bag semantics
R with tuple multiplicities: (a, b, c): 2; (d, b, e): 5; (f, g, e): 1
tuple | multiplicity calculations | result | c-table calculations
(a, c) | 2 · 2 + 2 · 2 | 8 | (p ∧ p) ∨ (p ∧ p)
(a, e) | 2 · 5 | 10 | p ∧ r
(d, c) | 5 · 2 | 10 | r ∧ p
(d, e) | 5 · 5 + 5 · 5 + 5 · 1 | 55 | (r ∧ r) ∨ (r ∧ r) ∨ (r ∧ s)
(f, e) | 1 · 1 + 1 · 1 + 1 · 5 | 7 | (s ∧ s) ∨ (s ∧ s) ∨ (s ∧ r)
Again, the structure of the calculations is the same!
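The multiplicity column can be checked mechanically. A minimal Python sketch, assuming the view is evaluated by looping over pairs of R-tuples; the function name `view_multiplicities` is made up for illustration and is not from the talk:

```python
# Bag-semantics evaluation of the example view (illustrative sketch).
# Each tuple of R maps to its multiplicity.
R = {("a", "b", "c"): 2, ("d", "b", "e"): 5, ("f", "g", "e"): 1}

def view_multiplicities(rel):
    """Evaluate both rules of V; join multiplies, union/projection adds."""
    out = {}
    for u, ku in rel.items():
        for v, kv in rel.items():
            # V(x,z) :- R(x,_,z), R(_,_,z): the two tuples agree on column 3
            if u[2] == v[2]:
                key = (u[0], u[2])
                out[key] = out.get(key, 0) + ku * kv
            # V(x,z) :- R(x,y,_), R(_,y,z): the two tuples agree on column 2
            if u[1] == v[1]:
                key = (u[0], v[2])
                out[key] = out.get(key, 0) + ku * kv
    return out

V = view_multiplicities(R)
for t in sorted(V):
    print(t, V[t])
```

The printed multiplicities should agree with the result column of the slide (8, 10, 10, 55, 7).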
13
Abstracting the structure of these calculations
db ops | c-tables | bags | lineage | abstract
join | ∧ | · | ∪ | ·
union | ∨ | + | ∪ | +
Abstract calculations:
(a, c): (p · p) + (p · p)
(a, e): p · r
(d, c): r · p
(d, e): (r · r) + (r · r) + (r · s)
(f, e): (s · s) + (s · s) + (s · r)
These expressions capture the abstract structure of the calculations
We will end up using these expressions as provenance!
14
Technical Development: K-relations
Annotations are elements from an algebraic structure (K, +, ·, 0, 1)
Although the notation resembles arithmetic, these are abstract operations
If D is the domain of database values, an n-ary K-relation is a function R: D^n → K, where D^n is the set of all possible tuples
15
K-relations, annotated tables
K-relation corresponds to table: R: D^n → K
If R(t) = k, then t “is annotated by k”
For all but finitely many tuples t, R(t) = 0; we omit the tuples annotated with 0
tuple 1 | k1
tuple 2 | k2
tuple 3 | k3
...
16
Positive K-relational algebra
We define an RA+ on K-relations:
The · corresponds to joint use (join)
The + corresponds to alternative use (union and projection)
0 and 1 are used for selection predicates
17
Positive K-relational algebra: details
Natural join: [R1 ⋈ R2](t) = R1(t1) · R2(t2), where t1 = t on attrs(R1) and t2 = t on attrs(R2)
Union: [R1 ∪ R2](t) = R1(t) + R2(t)
Projection: [π_V R](t) = Σ R(t'), summing over all t' with t' = t on V and R(t') ≠ 0
Selection: [σ_P R](t) = R(t) · P(t), where P(t) = 0 or 1
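The four definitions above translate almost line for line into code. The following is a minimal sketch under stated assumptions: a K-relation is a Python dict from tuples to annotations, tuples are frozensets of (attribute, value) pairs, and the names `Semiring`, `row`, `join`, `union`, `project`, `select` are illustrative choices, not an API from the talk.

```python
from dataclasses import dataclass
from typing import Any, Callable

@dataclass(frozen=True)
class Semiring:
    zero: Any
    one: Any
    add: Callable[[Any, Any], Any]
    mul: Callable[[Any, Any], Any]

def row(**attrs):
    """A tuple with named attributes, hashable so it can be a dict key."""
    return frozenset(attrs.items())

def _accumulate(K, out, t, k):
    out[t] = K.add(out.get(t, K.zero), k)

def _support(K, rel):
    # Tuples annotated with 0 are omitted, as on the slides.
    return {t: k for t, k in rel.items() if k != K.zero}

def join(K, r1, r2):
    out = {}
    for t1, k1 in r1.items():
        for t2, k2 in r2.items():
            d1, d2 = dict(t1), dict(t2)
            # Natural join: the tuples must agree on shared attributes.
            if all(d1[a] == d2[a] for a in d1.keys() & d2.keys()):
                _accumulate(K, out, frozenset({**d1, **d2}.items()), K.mul(k1, k2))
    return _support(K, out)

def union(K, r1, r2):
    out = dict(r1)
    for t, k in r2.items():
        _accumulate(K, out, t, k)
    return _support(K, out)

def project(K, rel, attrs):
    out = {}
    for t, k in rel.items():
        _accumulate(K, out, frozenset((a, v) for a, v in t if a in attrs), k)
    return _support(K, out)

def select(K, rel, pred):
    # P(t) is 0 or 1, so selection multiplies the annotation by 0 or 1.
    return _support(K, {t: K.mul(k, K.one if pred(dict(t)) else K.zero)
                        for t, k in rel.items()})

BAGS = Semiring(0, 1, lambda a, b: a + b, lambda a, b: a * b)
SETS = Semiring(False, True, lambda a, b: a or b, lambda a, b: a and b)

R1 = {row(A=1, B=2): 2, row(A=1, B=3): 3}
R2 = {row(B=2, C=5): 4, row(B=3, C=5): 1}
joined = join(BAGS, R1, R2)
projected = project(BAGS, joined, {"A", "C"})   # 2*4 + 3*1 = 11 copies of (A=1, C=5)
```

Swapping `BAGS` for `SETS` (with Boolean annotations in the input) reuses the same four functions for ordinary set semantics, which is the point of parameterizing by K.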
18
RA+ identities imply semiring structure!
Common RA+ identities
– Union and join are associative, commutative
– Join distributes over union
– etc. (but not idempotence!)
These identities hold for RA+ on K-relations iff (K, +, ·, 0, 1) is a commutative semiring
19
Semiring Bestiary
(B, ∨, ∧, ⊥, ⊤): usual rel. alg. (sets)
(N, +, ·, 0, 1): bag semantics
(PosBool(X), ∨, ∧, ⊥, ⊤): Boolean c-tables, also minimal why-provenance [BKT 01]
(P(Ω), ∪, ∩, ∅, Ω): event tables (prob. db)
(P(P(X)), ∪, ⋓, ∅, {∅}): proof why-provenance, where A ⋓ B := {a ∪ b : a ∈ A, b ∈ B}
(P(X), ∪, ∪, ∅, ∅) ★: lineage
(N[X], +, ·, 0, 1): provenance polynomials
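To make one entry of the bestiary concrete, here is a small sketch of the proof why-provenance semiring (P(P(X)), ∪, ⋓, ∅, {∅}) applied to the running example view. The encoding with nested `frozenset`s and the helper names (`token`, `why_add`, `why_mul`, `view`) are assumptions made for this illustration.

```python
def token(x):
    """Annotation of a base tuple with id x: one witness set, {x}."""
    return frozenset({frozenset({x})})

ZERO = frozenset()                # no derivation at all
ONE = frozenset({frozenset()})    # one derivation that uses no tuples

def why_add(A, B):
    # Alternative use: union of the witness sets.
    return A | B

def why_mul(A, B):
    # Joint use: A ⋓ B = {a ∪ b : a in A, b in B}.
    return frozenset(a | b for a in A for b in B)

R = {("a", "b", "c"): token("p"),
     ("d", "b", "e"): token("r"),
     ("f", "g", "e"): token("s")}

def view(rel):
    out = {}
    for u, ku in rel.items():
        for v, kv in rel.items():
            if u[2] == v[2]:      # V(x,z) :- R(x,_,z), R(_,_,z)
                key = (u[0], u[2])
                out[key] = why_add(out.get(key, ZERO), why_mul(ku, kv))
            if u[1] == v[1]:      # V(x,z) :- R(x,y,_), R(_,y,z)
                key = (u[0], v[2])
                out[key] = why_add(out.get(key, ZERO), why_mul(ku, kv))
    return out

V = view(R)
# Flattening each set of witness sets gives the lineage of the tuple.
lineage = {t: frozenset().union(*w) for t, w in V.items()}
```

For the tuple (d, e) this yields the witness sets {r} and {r, s}, and flattening them gives the lineage {r, s}, matching the answers quoted earlier in the talk.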
20
Provenance polynomials
X = {p, r, s, …}: indeterminates (provenance “tokens” for base tuples)
N[X]: multivariate polynomials with coefficients in N and indeterminates in X
(N[X], +, ·, 0, 1) is the free commutative semiring generated by X; its elements abstract calculations in all semirings
The polynomials capture the propagation of provenance through (positive) relational algebra in the most general way allowed by commutative semiring-based semantics
21
Provenance calculations
R: (a, b, c): p; (d, b, e): r; (f, g, e): s
V:
tuple | lineage | proof why-prov. | minimal why-prov. | boolean c-table annot. | provenance polynomials
(a, c) | {p} | {{p}} | {{p}} | p | 2p²
(a, e) | {p,r} | {{p,r}} | {{p,r}} | p ∧ r | pr
(d, c) | {p,r} | {{p,r}} | {{p,r}} | p ∧ r | pr
(d, e) | {r,s} | {{r},{r,s}} | {{r}} | r | 2r² + rs
(f, e) | {r,s} | {{r,s},{s}} | {{s}} | s | 2s² + rs
minimal why-prov. ≈ boolean c-table annot.
For (d, e), three derivations: two of them use r, twice, and the third uses r and s, once each
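The polynomial column can be reproduced with a few lines of code. This is a sketch only, assuming a polynomial is stored as a `collections.Counter` from monomials (sorted tuples of variables, with repetition) to natural-number coefficients; the helper names `var`, `p_mul`, `view`, `fmt` are invented for the example.

```python
from collections import Counter

def var(x):
    """The polynomial consisting of the single indeterminate x."""
    return Counter({(x,): 1})

def p_mul(a, b):
    """Product of two polynomials: multiply coefficients, merge monomials."""
    out = Counter()
    for m1, c1 in a.items():
        for m2, c2 in b.items():
            out[tuple(sorted(m1 + m2))] += c1 * c2
    return out

R = {("a", "b", "c"): var("p"), ("d", "b", "e"): var("r"), ("f", "g", "e"): var("s")}

def view(rel):
    out = {}
    for u, ku in rel.items():
        for v, kv in rel.items():
            if u[2] == v[2]:      # V(x,z) :- R(x,_,z), R(_,_,z)
                key = (u[0], u[2])
                out[key] = out.get(key, Counter()) + p_mul(ku, kv)
            if u[1] == v[1]:      # V(x,z) :- R(x,y,_), R(_,y,z)
                key = (u[0], v[2])
                out[key] = out.get(key, Counter()) + p_mul(ku, kv)
    return out

def fmt(poly):
    """Render a polynomial such as 2r^2 + rs."""
    terms = []
    for mono, c in sorted(poly.items()):
        powers = Counter(mono)
        body = "".join(v if e == 1 else f"{v}^{e}" for v, e in sorted(powers.items()))
        terms.append((str(c) if c != 1 else "") + body)
    return " + ".join(terms)

V = view(R)
for t in sorted(V):
    print(t, fmt(V[t]))
```

The coefficient 2 in 2r² records that two distinct derivations use r twice, which is exactly the information that the set-based forms of provenance discard.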
22
Trust assessment
p: certified by Moe; r: certified by Larry; s: certified by Curly
R: (a, b, c): p; (d, b, e): r; (f, g, e): s
V:
(a, c): 2p² (2 alternatives, both need Moe, twice)
(a, e), (d, c): pr (needs both Moe and Larry)
(d, e): 2r² + rs (one alternative needs Larry and Curly; two others only need Larry, twice)
(f, e): 2s² + rs
Which output tuples can be trusted after Larry is jailed?
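One way to answer the question is to evaluate each polynomial in the Boolean semiring, setting a token to false when its certifier is no longer trusted. A minimal sketch, assuming the polynomials from the slide are written out by hand as dicts from monomials to coefficients; the function name `trusted` is hypothetical.

```python
# Provenance polynomials of V, copied from the slide:
# each monomial is a tuple of tokens, mapped to its coefficient.
V = {
    ("a", "c"): {("p", "p"): 2},
    ("a", "e"): {("p", "r"): 1},
    ("d", "c"): {("p", "r"): 1},
    ("d", "e"): {("r", "r"): 2, ("r", "s"): 1},
    ("f", "e"): {("s", "s"): 2, ("r", "s"): 1},
}

def trusted(poly, trust):
    """Boolean evaluation: some derivation uses only trusted tokens."""
    return any(coeff > 0 and all(trust[x] for x in mono)
               for mono, coeff in poly.items())

# Moe (p) and Curly (s) are still trusted; Larry (r) is not.
trust = {"p": True, "r": False, "s": True}
still_trusted = sorted(t for t, poly in V.items() if trusted(poly, trust))
print(still_trusted)
```

Under these assumptions only (a, c) and (f, e) survive: every derivation of the other three tuples uses r at least once.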
23
A glimpse at work by T.J. Green: Provenance and Query Optimization
Many kinds of semiring-based provenance annotations to choose from:
– Lineage
– Proof why-provenance
– Minimal why-provenance
– Provenance polynomials
– ...
They keep track of more/less information
A fundamental question, asked repeatedly by Peter Buneman: how does this affect query optimization?
24
Choice of K Affects Query Optimization
K = N (bag semantics) differs from K = B (set semantics)
e.g., the conjunctive queries
Q1(x) :- R(x,y), R(x,z)
Q2(u) :- R(u,v)
are set-equivalent, but not bag-equivalent
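A two-tuple instance is enough to see the difference. The sketch below evaluates both queries over an arbitrary semiring passed in as plain functions; the instance and the function names `q1`, `q2` are chosen for illustration.

```python
def q1(rel, add, mul, zero):
    """Q1(x) :- R(x,y), R(x,z)"""
    out = {}
    for (x, _y), k1 in rel.items():
        for (x2, _z), k2 in rel.items():
            if x == x2:
                out[x] = add(out.get(x, zero), mul(k1, k2))
    return out

def q2(rel, add, zero):
    """Q2(u) :- R(u,v)"""
    out = {}
    for (u, _v), k in rel.items():
        out[u] = add(out.get(u, zero), k)
    return out

def n_add(a, b): return a + b
def n_mul(a, b): return a * b
def b_add(a, b): return a or b
def b_mul(a, b): return a and b

bag_R = {(1, 2): 1, (1, 3): 1}          # K = N
set_R = {(1, 2): True, (1, 3): True}    # K = B

bag_q1, bag_q2 = q1(bag_R, n_add, n_mul, 0), q2(bag_R, n_add, 0)
set_q1, set_q2 = q1(set_R, b_add, b_mul, False), q2(set_R, b_add, False)
print(bag_q1, bag_q2)   # multiplicities differ under bag semantics
print(set_q1, set_q2)   # identical under set semantics
```

On this instance Q1 returns the tuple (1) four times under bag semantics while Q2 returns it twice, so an optimizer may replace Q1 by Q2 only when K = B.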
25
A Hierarchy of Semiring Provenance (1)
Provenance polynomials (N[X], +, ·, 0, 1): tracks calculations abstractly; most general
  e.g., 2p²r + 3ps + ps³
Drop coefficients to get (B[X], +, ·, 0, 1)
  p²r + ps + ps³
Drop exponents to get proof why-prov. (P(P(X)), ∪, ⋓, ∅, {∅})
  {{p,r}, {p,s}}
Flatten set-of-sets to get lineage
  {p,r,s}
Drop, flatten, etc. correspond to surjective semiring homomorphisms
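Each step down the hierarchy is a simple function on the representation. A sketch, assuming the polynomial 2p²r + 3ps + ps³ is stored as a dict from monomials (tuples of variables with repetition) to coefficients; the three function names are illustrative.

```python
# 2p^2 r + 3ps + ps^3
poly = {("p", "p", "r"): 2, ("p", "s"): 3, ("p", "s", "s", "s"): 1}

def drop_coefficients(p):
    """N[X] -> B[X]: keep which monomials occur, forget how often."""
    return {mono for mono, coeff in p.items() if coeff > 0}

def drop_exponents(monomials):
    """B[X] -> proof why-provenance: each monomial becomes a set of tokens."""
    return {frozenset(mono) for mono in monomials}

def flatten(witness_sets):
    """Proof why-provenance -> lineage: union of all witness sets."""
    return set().union(*witness_sets)

bx = drop_coefficients(poly)     # p^2 r + ps + ps^3
why = drop_exponents(bx)         # {{p,r}, {p,s}}
lin = flatten(why)               # {p,r,s}
```

Note how ps and ps³ collapse to the same witness set {p, s} once exponents are dropped, so each step loses information that the previous level still had.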
26
A Hierarchy of Semiring Provenance (2)
Definition: K1 ⪯_L K2 means that for all queries P, Q in language L, P ≡_K2 Q implies P ≡_K1 Q
Languages of interest: CQ and UCQ (equivalent to RA+)
Definition: K1 ≈_L K2 means K1 ⪯_L K2 and K2 ⪯_L K1
Proposition: If there exists a surjective homomorphism h : K2 → K1 then K1 ⪯_UCQ K2
Proposition (from [GKT 07]): If K is a distributive lattice then B ≈_UCQ K (in particular B ≈_UCQ PosBool(X))
27
A Hierarchy of Semiring Provenance (3)
Definition: A semiring is positive if 0 ≠ 1, and a + b = 0 implies a = 0 and b = 0, and a · b = 0 implies a = 0 or b = 0
All the semirings we consider are positive.
Proposition: For any positive K (and “big enough” X): B ⪯_UCQ K ⪯_UCQ N[X]
Moreover:
Proposition (Provenance Hierarchy): B ⪯_UCQ lineage ⪯_UCQ proof why-prov. ⪯_UCQ B[X] ⪯_UCQ N[X]
28
Separating the Models for ≡ of CQs
B ≺_CQ lineage:
  Q1(x,y) :- R(x,y), R(x,z)
  Q2(x,y) :- R(x,y)
  Q1 ≡_B Q2 but Q1 ≢_lin Q2
lineage ≺_CQ why:
  Q1(x) :- R(x,y), R(x,z)
  Q2(x) :- R(x,y)
  Q1 ≡_lin Q2 but Q1 ≢_why Q2
29
Summary: Provenance Hierarchy
[Table; column alignment lost in extraction. Each symbol relates two neighboring semirings.]
CQs, semirings: B, PosB.(B), Lineage, N, Why-Pr., B[X], N[X]
  ⊑_K: ≈ ≺ ≺ ≺ ≺ ≈
  ≡_K: ≈ ≺ ≺ ≈ ≈ ≈
UCQs, semirings: B, PosB.(B), Lineage, Why-Pr., B[X], N[X]
  ⊑_K: ≈ ≺ ≺ ≺ ≺
  ≡_K: ≈ ≺ ≺ ≺ ≺
More importantly, Green’s results also show decidability for containment and equivalence of CQs and UCQs under the various provenance semantics
30
Extension to annotated XML
Data model: unordered XML data with semiring annotations (K-UXML)
Query language: positive, unordered XQuery fragment (K-UXQuery)
Sanity checks: agrees with encoded relational queries, bag semantics, probabilistic XML, ...
Applications: security, incomplete XML databases, ...
31
K-UXML
No attributes, no text values, no repeated children (inessential); no order (essential!)
Each subtree decorated with a value k from semiring K (1 “neutral,” 0 “not present”)
K-collection: a finite set of elements annotated with values from K
The child subtrees of a node form a K-collection
32
K-UXML Example
[Figure: the annotated tree drawn in two equivalent (≡) ways; element annotations are x1, x2, y1, y2, y3, and 1 elsewhere]
In NRC_K: { ⟨a, { ⟨b, { ⟨a, { ⟨c, {}⟩^y3, ⟨d, {}⟩^1 }⟩^1 }⟩^x1, ⟨c, {...}⟩^y1 }⟩^1 }
Annotations are on elements of K-collections. There are 5 K-collections in this tree (all colored differently). To annotate the whole tree, it must be included in a singleton K-collection.
33
K-UXQuery Semantics: for-Loops
Notation: a^x[...] is an element a annotated with x, with its children in brackets
Source, $S: the K-collection { a^x[d^u], b^y[d^v, e^w], c^z[f] }
Query: for $t in $S return $t/*
Computation: x · {d^u}, y · {d^v, e^w}, z · {f}, which gives d^(x·u), d^(y·v), e^(y·w), f^z
Answer: d^(x·u + y·v), e^(y·w), f^z
34
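K-UXQuery Semantics: // Operator
Source, $S: the annotated tree from the K-UXML example (root a; annotated elements b^x1, c^y3, c^y1, c^y2, b^x2)
Query: $S//c
Answer: root r with children c^(x1 · y3 + y1 · y2) (a leaf) and c^y1 (kept with its subtree, which contains a, c^y2, b^x2, d)
Annotation of result is a sum over products of annotations along paths to root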
35
Application: Access Control
Data annotated with clearance levels from a total order C: P < C < S < T < 0
Joint use of data (·) requires access to both (max of clearances); alternative use of data (+) requires access to either (min of clearances)
(C, min, max, 0, P) is a commutative semiring
Source, $S: root a with children b^C and c^C, whose children are d^C, d^S, e^T
Query: $S/*/*
Answer: root p with children d^min(max(P,C,C), max(P,C,S)) and e^max(P,C,T), i.e., d^C and e^T
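The clearance calculation is easy to replay in code. A minimal sketch, assuming the levels are encoded as an `IntEnum` in the order given on the slide, with `NONE` standing in for the slide's top element 0; the names `Level`, `alt`, `joint`, `erase` are made up here, and `erase` follows the "if k ≤ c then k else 0" rule used later in the talk.

```python
from enum import IntEnum

class Level(IntEnum):
    P = 0      # lowest level; plays the role of the semiring 1
    C = 1
    S = 2
    T = 3
    NONE = 4   # the slide's "0": the data is visible to nobody

def alt(*levels):
    """Semiring +: alternative use needs access to either (min)."""
    return min(levels, default=Level.NONE)

def joint(*levels):
    """Semiring ·: joint use needs access to all (max)."""
    return max(levels, default=Level.P)

def erase(k, clearance):
    """Hide everything above the given clearance."""
    return k if k <= clearance else Level.NONE

# Annotations of the answer to $S/*/* as computed on the slide.
d = alt(joint(Level.P, Level.C, Level.C), joint(Level.P, Level.C, Level.S))
e = joint(Level.P, Level.C, Level.T)
print(d.name, e.name)
```

With clearance C, `erase(d, Level.C)` leaves d visible while `erase(e, Level.C)` maps e to `NONE`, so a user cleared at level C would see only d in the answer.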
36
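Security Condition: Non-Interference
For any given clearance level (e.g., C), want the following diagram to commute:
[Diagram: applying the query and then erasing everything above C gives the same result as erasing everything above C in the source and then applying the query; in the example both paths end in root p with the single child d^C]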
37
Application: Incomplete XML
Data annotated with Boolean expressions; tree T represents the set of possible worlds Rep(T)
[Figure: T, a tree whose c elements are annotated with the Boolean variables x, y, z, and Rep(T), its 7 possible worlds]
38
Correctness: Possible Worlds
For every incomplete tree T, and every UXQuery query q, want this diagram to commute:
[Diagram: Rep takes T to Rep(T) and q(T) to Rep(q(T)); q takes T to q(T) and Rep(T) to q(Rep(T))]
q(Rep(T)) = Rep(q(T))
39
Commutation with Homomorphisms
Ex: access control
  h_c : C → C, h_c(k) := if k ≤ c then k else 0
Ex: incomplete databases
  ν : Vars → B, Eval_ν : PosBool(Vars) → B
Ex: duplicate elimination
  δ : N → B, δ(k) := if k = 0 then ⊥ else ⊤
Theorem: Let h : K1 → K2 be a semiring homomorphism. Then for any RA+/NRC/UXQuery query q, and for any K1-instance D, we have h(q(D)) = q(h(D)).
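The theorem can be spot-checked on the running example with the duplicate-elimination homomorphism δ : N → B. The sketch below is an illustration on one instance, not a proof; the instance (with one tuple of multiplicity 0) and the names `view` and `delta` are assumptions made for the example.

```python
import operator

def view(rel, add, mul, zero):
    """The example view V, evaluated over any semiring given by add, mul, zero."""
    out = {}
    for u, ku in rel.items():
        for v, kv in rel.items():
            if u[2] == v[2]:      # V(x,z) :- R(x,_,z), R(_,_,z)
                key = (u[0], u[2])
                out[key] = add(out.get(key, zero), mul(ku, kv))
            if u[1] == v[1]:      # V(x,z) :- R(x,y,_), R(_,y,z)
                key = (u[0], v[2])
                out[key] = add(out.get(key, zero), mul(ku, kv))
    return out

def delta(k):
    """Duplicate elimination N -> B."""
    return k != 0

# A bag instance in which (d, b, e) is absent (multiplicity 0).
D = {("a", "b", "c"): 2, ("d", "b", "e"): 0, ("f", "g", "e"): 1}

# h(q(D)): evaluate with multiplicities, then eliminate duplicates.
lhs = {t: delta(k) for t, k in view(D, operator.add, operator.mul, 0).items()}

# q(h(D)): eliminate duplicates first, then evaluate over the Booleans.
h_D = {t: delta(k) for t, k in D.items()}
rhs = view(h_D, lambda a, b: a or b, lambda a, b: a and b, False)

print(lhs == rhs)
```

Both sides mark (a, c) and (f, e) as present and the other three tuples as absent, as the theorem predicts.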
40
Provenance Polynomials are Universal
Corollary: The semantics of RA+/NRC/UXQuery evaluation on K-instances for any commutative semiring K factors through evaluation using provenance polynomials N[X].
e.g., for any K-UXML document D, for any K-UXQuery q, we have q(D) = Eval_ν(q(D’)) where
1. D’ is obtained by replacing K-annotations in D with fresh variables from X
2. ν : X → K is the corresponding valuation
3. Eval_ν : N[X] → K is the unique semiring homomorphism such that for the one-variable monomials, Eval_ν(x) = ν(x).
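As a concrete instance of the corollary, evaluating the polynomial 2r² + rs of the tuple (d, e) under the valuation r = 5, s = 1 in N recovers the bag multiplicity 55 computed earlier. A sketch of Eval_ν, assuming polynomials are dicts from monomials to natural-number coefficients and the target semiring is passed in as plain functions (the name `evaluate` is hypothetical):

```python
def evaluate(poly, valuation, add, mul, zero, one):
    """Eval_nu: the semiring homomorphism N[X] -> K induced by the valuation."""
    total = zero
    for mono, coeff in poly.items():
        term = one
        for x in mono:
            term = mul(term, valuation[x])
        # A coefficient c in N means "term + ... + term" (c times) in K.
        for _ in range(coeff):
            total = add(total, term)
    return total

# 2r^2 + rs, the provenance polynomial of (d, e)
d_e = {("r", "r"): 2, ("r", "s"): 1}

multiplicity = evaluate(d_e, {"r": 5, "s": 1},
                        lambda a, b: a + b, lambda a, b: a * b, 0, 1)
present = evaluate(d_e, {"r": False, "s": True},
                   lambda a, b: a or b, lambda a, b: a and b, False, True)
print(multiplicity, present)
```

The same polynomial, computed once, answers the question for bags, for sets, and for any other commutative semiring by changing only the valuation and the operations.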
41
Datalog?
The semiring structure on annotations works out nicely for positive relational algebra, positive nested relational calculus (NRC), and a large fragment of XQuery.
What more do we need to capture recursion, i.e., for Datalog queries?
ω-complete semirings with ω-continuous operations (so fixed points exist!): ω-continuous semirings
N is not, but N^∞ ≜ N ∪ {∞} is.
42
Datalog may have infinite derivations!
Polynomials do not suffice, since they are finite!
Nonetheless, the calculations are finitely representable through a system of equations
The equations have a least solution in any ω-continuous semiring
For provenance, we must generalize from polynomials to formal power series (in general, infinitely many monomials)
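As a small worked illustration (the one-fact program below is hypothetical and not taken from the talk), a single recursive rule already produces one equation whose least solution is a power series rather than a polynomial:

```latex
% Hypothetical program:  T(a) :- R(a).    T(a) :- S(a,a), T(a).
% R(a) is annotated r, S(a,a) is annotated s, and x is the annotation of T(a).
\begin{align*}
x &= r + s \cdot x \\
x &= r + s\,r + s^{2} r + \cdots \;=\; \sum_{n \ge 0} r\, s^{n}
\end{align*}
```

Under the valuation r = s = 1 the equation becomes x = 1 + x, which has no solution in N; its least solution in N ∪ {∞} is ∞, matching the infinitely many derivations of T(a).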
43
Related Work
Foundations: semirings/systems of equations/formal power series first used in CS in theory of formal languages [Chomsky, Schutzenberger 1963]
Our work is related to and shares similar goals with “Debugging schema mappings with routes” [Chiticariu, Tan VLDB 2006], where “routes” are like minimal finite portions of our provenance polynomials
44
More Related Work
Bag semantics for NRC [Libkin&Wong 97]
Incomplete XML [Kanza+ 99, Abiteboul+ 06]
Probabilistic XML [Nierman&Jagadish 02, van Keulen+ 05, Abit.&Senellart 06, Sen.&Abit. 07, Hung+ 07]
XML provenance [Buneman+ 01]
NRC provenance [Hidders+ 07]
Soft CSPs [Bistarelli et al]
Semiring-annotated XPath [Grahne+ 07]
Negation, expressiveness of RA_K [Geerts&Poggi 08]
45
Related Work for T.J. Green
Already mentioned
– Set-cont. and equiv. of CQs [Chandra&Merlin 77]
– Set-cont. and equiv. of UCQs [Sagiv&Yannakakis 80]
– Bag-cont. of UCQs [Ioannidis&Ramakrishnan 95]
– Bag-equiv. of CQs [Chaudhuri&Vardi 93]
Containment of CQs with where-provenance [Tan 03]
Bag-set semantics [CV 93], combined semantics [Cohen 06]
– For K-relations: support operator of [Geerts&Poggi 08] generalizes duplicate elimination
Bag-containment of CQ s [Jayram+ 06]
46
Conclusion
Annotations forming a commutative semiring seem to fit well with database transformations expressed in positive query languages, be they relational, even recursive, or for complex values or tree data.
We obtained explanations for a number of puzzles related to why-provenance in a broad sense.
Provenance polynomials also capture tuple multiplicity and serve well systems such as Orchestra.
Big open questions: negation (although see work by Geerts, Poggi) and order
47
Future Work
I have the feeling that we have only scratched the surface so far…
I am working on marrying this approach with data exchange, with a broader perspective on security, with integrity constraints, with a broader perspective on mapping/view maintenance and update…