1 Provenance Semirings
T.J. Green, G. Karvounarakis, V. Tannen
University of Pennsylvania
Principles of Provenance (PrOPr), Philadelphia, PA, June 26, 2007
2 PROPR 2007 Provenance
● First studied in data warehousing
  ▪ Lineage [Cui, Widom, Wiener 2000]
● Scientific applications (to assess quality of data)
  ▪ Why-provenance [Buneman, Khanna, Tan 2001]
● Our interest: P2P data sharing in the ORCHESTRA system (project headed by Zack Ives)
  ▪ Trust conditions based on provenance
  ▪ Deletion propagation
3 Annotated relations
● Provenance: an annotation on tuples
● Our observation: propagating provenance/lineage through views is similar to querying
  ▪ Incomplete databases (conditional tables)
  ▪ Probabilistic databases (independent tuple tables)
  ▪ Bag semantics databases (tuples with multiplicities)
● Hence we look at queries on relations with annotated tuples
4 Incomplete databases: boolean C-tables

R (each tuple is annotated with a boolean variable):
a b c   p
d b e   r
f g e   s

Semantics: a set of instances, one for each valuation of the boolean variables:
I(R) = { ∅, {a b c}, {d b e}, {f g e}, {a b c, d b e}, {a b c, f g e}, {d b e, f g e}, {a b c, d b e, f g e} }
5 Imielinski & Lipski (1984): queries on C-tables

R:
a b c   p
d b e   r
f g e   s

q, a union of conjunctive queries (UCQ):
q(x,z) :- R(x, _, z), R(_, _, z)
q(x,z) :- R(x, y, _), R(_, y, z)

q(R), with the condition computed for each output tuple and its simplified form:
a c   (p ∧ p) ∨ (p ∧ p)              = p
a e   p ∧ r
d c   r ∧ p
d e   (r ∧ r) ∨ (r ∧ r) ∨ (r ∧ s)    = r
f e   (s ∧ s) ∨ (s ∧ s) ∨ (s ∧ r)    = s

Under the valuation p=true, r=false, s=true, the resulting instance contains exactly the tuples a c and f e.
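The correspondence between the C-table answer and the possible instances can be checked mechanically. The sketch below is only an illustration, not code from the talk: the function names `q`, `instance_of`, and `worlds_agree` are made up here, and the conditions are copied from the slide. It evaluates q on each of the 8 instances of R and compares the result with the tuples whose condition is true under the same valuation.

```python
from itertools import product

# Boolean C-table R from the slide: each tuple carries one condition variable.
R = [(("a", "b", "c"), "p"), (("d", "b", "e"), "r"), (("f", "g", "e"), "s")]

def q(instance):
    """Evaluate the UCQ q on an ordinary (set) instance of R."""
    out = set()
    # q(x,z) :- R(x,_,z), R(_,_,z)
    for (x, _, z) in instance:
        for (_, _, z2) in instance:
            if z == z2:
                out.add((x, z))
    # q(x,z) :- R(x,y,_), R(_,y,z)
    for (x, y, _) in instance:
        for (_, y2, z) in instance:
            if y == y2:
                out.add((x, z))
    return out

def instance_of(valuation):
    """The instance of R selected by a valuation of p, r, s."""
    return {t for t, var in R if valuation[var]}

# Conditions attached to the tuples of q(R), as written on the slide.
conditions = {
    ("a", "c"): lambda v: (v["p"] and v["p"]) or (v["p"] and v["p"]),
    ("a", "e"): lambda v: v["p"] and v["r"],
    ("d", "c"): lambda v: v["r"] and v["p"],
    ("d", "e"): lambda v: (v["r"] and v["r"]) or (v["r"] and v["r"]) or (v["r"] and v["s"]),
    ("f", "e"): lambda v: (v["s"] and v["s"]) or (v["s"] and v["s"]) or (v["s"] and v["r"]),
}

def worlds_agree():
    """Check, for all 8 valuations, that the C-table answer matches q on the instance."""
    for bits in product([False, True], repeat=3):
        v = dict(zip("prs", bits))
        from_conditions = {t for t, cond in conditions.items() if cond(v)}
        if from_conditions != q(instance_of(v)):
            return False
    return True

example = q(instance_of({"p": True, "r": False, "s": True}))
```

Here `example` is the answer for the valuation shown on the slide, and `worlds_agree()` confirms that the annotated answer represents the query result in every instance.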
6 Why-provenance/lineage
Which input tuples contribute to the presence of a tuple in the output?
[Cui, Widom, Wiener 2000] [Buneman, Khanna, Tan 2001]

R (annotated with tuple ids):
a b c   p
d b e   r
f g e   s

q(R) (same query):
a c   {p}
a e   {p, r}
d c   {p, r}
d e   {r, s}
f e   {r, s}
7 C-tables vs. Why-provenance

C-table calculations:
a c   (p ∧ p) ∨ (p ∧ p)
a e   p ∧ r
d c   r ∧ p
d e   (r ∧ r) ∨ (r ∧ r) ∨ (r ∧ s)
f e   (s ∧ s) ∨ (s ∧ s) ∨ (s ∧ r)

Why-provenance calculations:
a c   ({p} ∪ {p}) ∪ ({p} ∪ {p})
a e   {p} ∪ {r}
d c   {r} ∪ {p}
d e   ({r} ∪ {r}) ∪ ({r} ∪ {r}) ∪ ({r} ∪ {s})
f e   ({s} ∪ {s}) ∪ ({s} ∪ {s}) ∪ ({s} ∪ {r})

The structure of the calculations is the same!
8 Another analogy, with bag semantics

R (annotated with tuple multiplicities):
a b c   2
d b e   5
f g e   1

q(R) (same query), with the multiplicity calculations:
a c   2 · 2 + 2 · 2            = 8
a e   2 · 5                    = 10
d c   5 · 2                    = 10
d e   5 · 5 + 5 · 5 + 5 · 1    = 55
f e   1 · 1 + 1 · 1 + 1 · 5    = 7

C-table calculations:
a c   (p ∧ p) ∨ (p ∧ p)
a e   p ∧ r
d c   r ∧ p
d e   (r ∧ r) ∨ (r ∧ r) ∨ (r ∧ s)
f e   (s ∧ s) ∨ (s ∧ s) ∨ (s ∧ r)

The structure of the calculations is the same!
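The multiplicity calculations above can be reproduced with a few lines of code. This is a minimal sketch, assuming a bag is stored as a `collections.Counter` from tuples to multiplicities; the name `q_bag` is hypothetical. Each rule of q multiplies the multiplicities of the two joined tuples, and the contributions of all derivations are added.

```python
from collections import Counter

# Bag R from the slide: tuple -> multiplicity.
R = Counter({("a", "b", "c"): 2, ("d", "b", "e"): 5, ("f", "g", "e"): 1})

def q_bag(rel):
    """Evaluate the UCQ q under bag semantics: join multiplies, union/projection adds."""
    out = Counter()
    # q(x,z) :- R(x,_,z), R(_,_,z)
    for (x, _, z), m1 in rel.items():
        for (_, _, z2), m2 in rel.items():
            if z == z2:
                out[(x, z)] += m1 * m2
    # q(x,z) :- R(x,y,_), R(_,y,z)
    for (x, y, _), m1 in rel.items():
        for (_, y2, z), m2 in rel.items():
            if y == y2:
                out[(x, z)] += m1 * m2
    return out

result = q_bag(R)
```

For example, `result[("d", "e")]` collects the three derivations 5·5, 5·5 and 5·1 listed on the slide.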
9 Abstracting the structure of these calculations

         C-tables   Bags   Why-provenance   Abstract
join     ∧          ·      ∪                ·
union    ∨          +      ∪                +

Abstract calculations:
a c   (p · p) + (p · p)
a e   p · r
d c   r · p
d e   (r · r) + (r · r) + (r · s)
f e   (s · s) + (s · s) + (s · r)

These expressions capture the abstract structure of the calculations, which encodes the logical derivation of the output tuples.
We shall use these expressions as provenance.
10 Positive K-relational algebra
● We define an RA+ on K-relations:
  ▪ The · corresponds to join
  ▪ The + corresponds to union and projection
  ▪ 0 and 1 are used for selection predicates
  ▪ Details in the paper (but recall how we evaluated the UCQ q earlier and we will see another example later)
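The following sketch shows one way the operations listed above could be coded. It is an illustration under stated assumptions, not the definition from the paper: the classes `Semiring` and `KRel` and their method names are invented here, a K-relation is stored as a dictionary from tuples to annotations, and tuples annotated with 0 are treated as absent.

```python
import operator
from dataclasses import dataclass
from typing import Any, Callable, Dict, Tuple

@dataclass(frozen=True)
class Semiring:
    zero: Any
    one: Any
    add: Callable[[Any, Any], Any]
    mul: Callable[[Any, Any], Any]

@dataclass
class KRel:
    K: Semiring
    attrs: Tuple[str, ...]
    rows: Dict[tuple, Any]  # tuple -> annotation; missing tuples count as K.zero

    def _collect(self, attrs, pairs):
        # Add up the annotations of equal tuples and drop those annotated with zero.
        out = {}
        for t, k in pairs:
            out[t] = self.K.add(out.get(t, self.K.zero), k)
        rows = {t: k for t, k in out.items() if k != self.K.zero}
        return KRel(self.K, tuple(attrs), rows)

    def union(self, other):
        # Assumes both relations have the same attribute list.
        return self._collect(self.attrs, list(self.rows.items()) + list(other.rows.items()))

    def project(self, attrs):
        idx = [self.attrs.index(a) for a in attrs]
        return self._collect(attrs, [(tuple(t[i] for i in idx), k) for t, k in self.rows.items()])

    def select(self, pred):
        # The predicate contributes K.one or K.zero, multiplied into the annotation.
        K = self.K
        pairs = [(t, K.mul(k, K.one if pred(dict(zip(self.attrs, t))) else K.zero))
                 for t, k in self.rows.items()]
        return self._collect(self.attrs, pairs)

    def rename(self, attrs):
        return KRel(self.K, tuple(attrs), dict(self.rows))

    def join(self, other):
        # Natural join: annotations of matching tuples are multiplied.
        shared = [a for a in self.attrs if a in other.attrs]
        extra = [a for a in other.attrs if a not in self.attrs]
        pairs = []
        for t1, k1 in self.rows.items():
            d1 = dict(zip(self.attrs, t1))
            for t2, k2 in other.rows.items():
                d2 = dict(zip(other.attrs, t2))
                if all(d1[a] == d2[a] for a in shared):
                    pairs.append((t1 + tuple(d2[a] for a in extra), self.K.mul(k1, k2)))
        return self._collect(self.attrs + tuple(extra), pairs)

def q(R):
    """The UCQ q from the earlier slides, written with RA+ operators."""
    rule1 = R.rename(("x", "y1", "z")).join(R.rename(("x2", "y2", "z"))).project(("x", "z"))
    rule2 = R.rename(("x", "y", "z1")).join(R.rename(("x2", "y", "z"))).project(("x", "z"))
    return rule1.union(rule2)

NAT = Semiring(0, 1, operator.add, operator.mul)
BOOL = Semiring(False, True, operator.or_, operator.and_)

tuples = [("a", "b", "c"), ("d", "b", "e"), ("f", "g", "e")]
R_bag = KRel(NAT, ("x", "y", "z"), dict(zip(tuples, [2, 5, 1])))
R_bool = KRel(BOOL, ("x", "y", "z"), dict(zip(tuples, [True, False, True])))
```

The same function `q` yields the bag answer on `R_bag` and the set answer for the valuation p=true, r=false, s=true on `R_bool`; only the semiring changes.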
11 RA+ identities imply semiring structure!
● Common RA+ identities
  ▪ Union and join are associative, commutative
  ▪ Join distributes over union
  ▪ etc. (but not idempotence!)
These identities hold for RA+ on K-relations iff (K, +, ·, 0, 1) is a commutative semiring:
● (K, +, 0) is a commutative monoid
● (K, ·, 1) is a commutative monoid
● · distributes over +, etc.
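The semiring laws can be spot-checked by brute force on a finite sample of elements. The sketch below is only a sanity check, not a proof for infinite carriers such as N, and the helper names `is_commutative_semiring` and `is_idempotent` are hypothetical. It also shows why idempotence is excluded: it holds for the booleans but fails for multiplicities.

```python
import operator
from itertools import product

def is_commutative_semiring(elems, add, mul, zero, one):
    """Brute-force check of the commutative-semiring laws on a finite sample."""
    for a, b, c in product(elems, repeat=3):
        if add(add(a, b), c) != add(a, add(b, c)):
            return False  # + is not associative
        if mul(mul(a, b), c) != mul(a, mul(b, c)):
            return False  # · is not associative
        if add(a, b) != add(b, a) or mul(a, b) != mul(b, a):
            return False  # + or · is not commutative
        if mul(a, add(b, c)) != add(mul(a, b), mul(a, c)):
            return False  # · does not distribute over +
        if add(a, zero) != a or mul(a, one) != a or mul(a, zero) != zero:
            return False  # identity or annihilation law fails
    return True

def is_idempotent(elems, add):
    """Does a + a = a hold on the sample?"""
    return all(add(a, a) == a for a in elems)

naturals = list(range(5))   # a small sample of N
booleans = [False, True]    # all of B

bags_ok = is_commutative_semiring(naturals, operator.add, operator.mul, 0, 1)
sets_ok = is_commutative_semiring(booleans, operator.or_, operator.and_, False, True)
# A non-example: using max as "join" breaks distributivity over +.
broken = is_commutative_semiring(naturals, operator.add, max, 0, 0)
```

With these samples, `bags_ok` and `sets_ok` are true while `broken` is false, and `is_idempotent` separates the booleans from the naturals.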
12 Calculations on annotated tables are particular cases
(B, ∨, ∧, false, true)            usual relational algebra
(N, +, ·, 0, 1)                   bag semantics
(PosBool(B), ∨, ∧, false, true)   boolean C-tables
(P(Ω), ∪, ∩, ∅, Ω)                probabilistic event tables
(P(X), ∪, ∪, ∅, ∅)                lineage/why-provenance
13 Provenance Semirings
● X = {p, r, s, …}: indeterminates (provenance “tokens” for base tuples)
● N[X]: multivariate polynomials with coefficients in N and indeterminates in X
● (N[X], +, ·, 0, 1) is the most “general” commutative semiring: its elements abstract calculations in all semirings
● N[X]-relations are the relations with provenance!
  ▪ The polynomials capture the propagation of provenance through (positive) relational algebra
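To make N[X] concrete, a polynomial can be stored as a map from monomials to natural-number coefficients. The sketch below assumes that representation (a `Counter` keyed by sorted tuples of tokens); the helpers `var`, `padd`, `pmul`, and `show` are names invented for this illustration. It annotates R with the tokens p, r, s and evaluates the query q from the earlier slides.

```python
from collections import Counter

# A polynomial in N[X] is a Counter {monomial: coefficient};
# a monomial is a sorted tuple of tokens, e.g. ("p", "p") stands for p^2.
ZERO = Counter()

def var(name):
    return Counter({(name,): 1})

def padd(f, g):
    # Counter addition adds the coefficients of equal monomials.
    return f + g

def pmul(f, g):
    out = Counter()
    for m1, c1 in f.items():
        for m2, c2 in g.items():
            out[tuple(sorted(m1 + m2))] += c1 * c2
    return out

def show(f):
    """Render a polynomial as text, e.g. 2r^2 + rs."""
    terms = []
    for mono, coeff in sorted(f.items()):
        powers = Counter(mono)
        body = "".join(v if e == 1 else f"{v}^{e}" for v, e in sorted(powers.items()))
        terms.append((str(coeff) if coeff != 1 or not body else "") + body)
    return " + ".join(terms) or "0"

# R annotated with provenance tokens.
R = {("a", "b", "c"): var("p"), ("d", "b", "e"): var("r"), ("f", "g", "e"): var("s")}

def q(rel):
    """Evaluate the UCQ q on an N[X]-relation."""
    out = {}
    def emit(t, k):
        out[t] = padd(out.get(t, ZERO), k)
    # q(x,z) :- R(x,_,z), R(_,_,z)
    for (x, _, z), k1 in rel.items():
        for (_, _, z2), k2 in rel.items():
            if z == z2:
                emit((x, z), pmul(k1, k2))
    # q(x,z) :- R(x,y,_), R(_,y,z)
    for (x, y, _), k1 in rel.items():
        for (_, y2, z), k2 in rel.items():
            if y == y2:
                emit((x, z), pmul(k1, k2))
    return out

prov = q(R)
```

Each coefficient counts derivations that use the same tokens, and each exponent counts how often a token is used within one derivation.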
14 A provenance calculation

R:
a b c   p
d b e   r
f g e   s

q(x,z) :- R(x, _, z), R(_, _, z)
q(x,z) :- R(x, y, _), R(_, y, z)

q(R):
        Provenance polynomial   Why-provenance
a c     2p²                     {p}
a e     pr                      {p, r}
d c     pr                      {p, r}
d e     2r² + rs                {r, s}
f e     2s² + rs                {r, s}

● Not just why- but also how-provenance (encodes derivations)!
● More informative than why-provenance: same why-provenance, different polynomials (compare d e and f e)
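Because the polynomials abstract the calculation in any commutative semiring, the earlier results can be recovered by substituting values for the tokens. The sketch below is an illustration under that assumption: the polynomials are copied by hand from this slide, and `evaluate` is a hypothetical helper that interprets + and · in a chosen semiring.

```python
from functools import reduce

# Provenance polynomials of q(R) from the slide, as {monomial: coefficient};
# a monomial is a tuple of tokens, e.g. ("r", "r") stands for r^2.
PROV = {
    ("a", "c"): {("p", "p"): 2},
    ("a", "e"): {("p", "r"): 1},
    ("d", "c"): {("p", "r"): 1},
    ("d", "e"): {("r", "r"): 2, ("r", "s"): 1},
    ("f", "e"): {("s", "s"): 2, ("r", "s"): 1},
}

def evaluate(poly, val, add, mul, zero, one):
    """Interpret a polynomial in the semiring (add, mul, zero, one) under valuation val."""
    total = zero
    for mono, coeff in poly.items():
        term = reduce(mul, (val[x] for x in mono), one)
        for _ in range(coeff):  # a coefficient n means the term is added n times
            total = add(total, term)
    return total

# Bag semantics: tokens become multiplicities.
bags = {t: evaluate(f, {"p": 2, "r": 5, "s": 1}, lambda a, b: a + b, lambda a, b: a * b, 0, 1)
        for t, f in PROV.items()}

# Why-provenance: tokens become singleton sets, and both operations are union.
singletons = {x: frozenset([x]) for x in "prs"}
why = {t: evaluate(f, singletons, frozenset.union, frozenset.union, frozenset(), frozenset())
       for t, f in PROV.items()}

# Set semantics for the valuation p=true, r=false, s=true.
truth = {t: evaluate(f, {"p": True, "r": False, "s": True},
                     lambda a, b: a or b, lambda a, b: a and b, False, True)
         for t, f in PROV.items()}
```

The tuples d e and f e get the same set in `why` although their entries in `PROV` differ, which is the sense in which the polynomials carry more information.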
15 Further work
● Application: P2P data sharing in the ORCHESTRA system:
  ▪ Need to express trust conditions based on provenance of tuples
  ▪ Incremental propagation of deletions
  ▪ Semiring provenance itself is incrementally maintainable
● Future extensions:
  ▪ Full relational algebra: for difference we need semirings with “proper subtraction”
  ▪ Richer data models: nested relations/complex values, XML