1 Provenance Semirings T.J. Green, G. Karvounarakis, V. Tannen University of Pennsylvania PODS 2007.


2 Provenance
● First studied in data warehousing
  ▪ Lineage [Cui,Widom,Wiener 2000]
● Scientific applications (to assess quality of data)
  ▪ Why-Provenance [Buneman,Khanna,Tan 2001]
● Our interest: P2P data sharing in the ORCHESTRA system (project headed by Zack Ives)
  ▪ Trust conditions based on provenance
  ▪ Deletion propagation

3 Annotated relations
● Provenance: an annotation on tuples
● Our observation: propagating provenance/lineage through views is similar to querying
  ▪ Incomplete Databases (conditional tables)
  ▪ Probabilistic Databases (independent tuple tables)
  ▪ Bag Semantics Databases (tuples with multiplicities)
● Hence we look at queries on relations with annotated tuples

4 Incomplete databases: boolean C-tables
R (p, r, s are boolean variables):
  a b c   p
  d b e   r
  f g e   s
Semantics: a set of instances, one for each truth assignment to the variables:
I(R) = { {a b c, d b e, f g e}, {a b c, d b e}, {a b c, f g e}, {d b e, f g e}, {a b c}, {d b e}, {f g e}, ∅ }
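The set-of-instances semantics can be sketched in a few lines of Python. This is only an illustration: the dictionary encoding of the C-table and the helper name `possible_worlds` are assumptions made here, not part of the slides.

```python
from itertools import product

# Boolean C-table R from the slide: each tuple is guarded by one variable.
# (The dict encoding is an assumption of this sketch.)
R = {("a", "b", "c"): "p", ("d", "b", "e"): "r", ("f", "g", "e"): "s"}

def possible_worlds(ctable):
    """Return the set of instances obtained from all truth assignments."""
    variables = sorted(set(ctable.values()))
    worlds = set()
    for values in product([False, True], repeat=len(variables)):
        valuation = dict(zip(variables, values))
        # Keep exactly the tuples whose guard variable is true.
        worlds.add(frozenset(t for t, v in ctable.items() if valuation[v]))
    return worlds

worlds = possible_worlds(R)
print(len(worlds))  # 8 instances, including the empty one
```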

5 Imielinski & Lipski (1984): queries on C-tables
Union of conjunctive queries (UCQ):
  q(x,z) :- R(x, _,z), R( _, _,z)
  q(x,z) :- R(x,y, _ ), R( _,y,z)
R:
  a b c   p
  d b e   r
  f g e   s
q(R):
  a c   (p ∧ p) ∨ (p ∧ p)
  a e   p ∧ r
  d c   r ∧ p
  d e   (r ∧ r) ∨ (r ∧ r) ∨ (r ∧ s)
  f e   (s ∧ s) ∨ (s ∧ s) ∨ (s ∧ r)
The conditions simplify to p, p ∧ r, p ∧ r, r and s, respectively.
Example: for p=true, r=false, s=true the result is { a c, f e }

6 Why-provenance/lineage [Cui,Widom,Wiener 2000] [Buneman,Khanna,Tan 2001]
Which input tuples contribute to the presence of a tuple in the output?
R (annotations are tuple ids):
  a b c   p
  d b e   r
  f g e   s
q(R) (same query):
  a c   {p}
  a e   {p,r}
  d c   {p,r}
  d e   {r,s}
  f e   {r,s}

7 C-tables vs. Lineage
        c-table calculations               lineage calculations
  a c   (p ∧ p) ∨ (p ∧ p)                  ({p} ∪ {p}) ∪ ({p} ∪ {p})
  a e   p ∧ r                              {p} ∪ {r}
  d c   r ∧ p                              {r} ∪ {p}
  d e   (r ∧ r) ∨ (r ∧ r) ∨ (r ∧ s)        ({r} ∪ {r}) ∪ ({r} ∪ {r}) ∪ ({r} ∪ {s})
  f e   (s ∧ s) ∨ (s ∧ s) ∨ (s ∧ r)        ({s} ∪ {s}) ∪ ({s} ∪ {s}) ∪ ({s} ∪ {r})
The structure of the calculations is the same!

8 Another analogy, with bag semantics
R (annotations are tuple multiplicities):
  a b c   2
  d b e   5
  f g e   1
q(R) (same query):
        multiplicity calculations     result    c-table calculations
  a c   2 · 2 + 2 · 2                 8         (p ∧ p) ∨ (p ∧ p)
  a e   2 · 5                         10        p ∧ r
  d c   5 · 2                         10        r ∧ p
  d e   5 · 5 + 5 · 5 + 5 · 1         55        (r ∧ r) ∨ (r ∧ r) ∨ (r ∧ s)
  f e   1 · 1 + 1 · 1 + 1 · 5         7         (s ∧ s) ∨ (s ∧ s) ∨ (s ∧ r)
The structure of the calculations is the same!

9 Abstracting the structure of these calculations
          C-tables   Bags   Lineage   Abstract
  join    ∧          ·      ∪         ·
  union   ∨          +      ∪         +
Abstract calculations:
  a c   (p · p) + (p · p)
  a e   p · r
  d c   r · p
  d e   (r · r) + (r · r) + (r · s)
  f e   (s · s) + (s · s) + (s · r)
These expressions capture the abstract structure of the calculations, which encodes the logical derivation of the output tuples
We shall use these expressions as provenance
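The shared structure can be made concrete with a small Python sketch that evaluates the same query q once, parameterized by the "plus" and "times" operations. The function name `eval_q` and the list encoding of R are assumptions of this sketch; the annotation values are the ones used on the earlier slides.

```python
import operator
from itertools import product

# Base relation R from the slides; each tuple carries the name of its annotation.
R = [(("a", "b", "c"), "p"), (("d", "b", "e"), "r"), (("f", "g", "e"), "s")]

def eval_q(annot, plus, times):
    """Evaluate the UCQ q over R, combining annotations with plus/times."""
    out = {}

    def add(key, k):
        out[key] = plus(out[key], k) if key in out else k

    for (t1, n1), (t2, n2) in product(R, repeat=2):
        k = times(annot[n1], annot[n2])
        # Rule 1: q(x,z) :- R(x,_,z), R(_,_,z)  (same third column)
        if t1[2] == t2[2]:
            add((t1[0], t1[2]), k)
        # Rule 2: q(x,z) :- R(x,y,_), R(_,y,z)  (same second column)
        if t1[1] == t2[1]:
            add((t1[0], t2[2]), k)
    return out

# Bags: join is multiplication, union is addition.
bags = eval_q({"p": 2, "r": 5, "s": 1}, operator.add, operator.mul)
# C-table under one valuation: join is "and", union is "or".
ctab = eval_q({"p": True, "r": False, "s": True}, operator.or_, operator.and_)
# Lineage: both join and union are set union.
lineage = eval_q({v: frozenset(v) for v in "prs"}, operator.or_, operator.or_)

print(bags[("d", "e")])  # 55
```

Only the two operations change between the three runs; the loop that encodes the query stays the same.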

10 Technical Development
● Abstractly annotated relations (K-relations) and their relational algebra
  ▪ K must be a semiring
  ▪ For provenance, K is the semiring of polynomials
● Datalog on K-relations
  ▪ For provenance, K consists of (possibly infinite) formal power series

11 K-relations
● Annotations are elements from an algebraic structure (K, +, ·, 0, 1)
  ▪ Although the notation resembles arithmetic, these are abstract operations
● If D is the domain of database values, an n-ary K-relation is a function R: Dⁿ → K
  ▪ Dⁿ: all possible tuples

12 K-relations, annotated tables
● K-relation corresponds to table: R: Dⁿ → K
● If R(t)=k, then t “is annotated by k”
● For all but finitely many tuples t, R(t) = 0
  ▪ we omit those tuples from the table representation
Table representation:
  tuple 1   k1
  tuple 2   k2
  tuple 3   k3
  ...

13 Positive K-relational algebra
● We define an RA+ on K-relations:
  ▪ The · corresponds to join
  ▪ The + corresponds to union and projection
  ▪ 0 and 1 are used for selection predicates
  ▪ Details in the paper (but recall how we evaluated the UCQ q earlier and we will see another example later)
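A minimal Python sketch of such an algebra follows. It is not the paper's definition: the names `Semiring`, `union`, `project`, `select` and `cross` are hypothetical, K-relations are encoded as dictionaries from tuples to annotations, and join is modeled here as a selection over a cartesian product.

```python
from collections import namedtuple

# A semiring is given by its two operations and its two constants.
Semiring = namedtuple("Semiring", "plus times zero one")
NAT = Semiring(lambda a, b: a + b, lambda a, b: a * b, 0, 1)  # bag semantics

def union(K, R, S):
    """Annotations of tuples present on both sides are added."""
    out = dict(R)
    for t, k in S.items():
        out[t] = K.plus(out.get(t, K.zero), k)
    return out

def project(K, R, cols):
    """Annotations of tuples that collapse onto the same projection are added."""
    out = {}
    for t, k in R.items():
        key = tuple(t[i] for i in cols)
        out[key] = K.plus(out.get(key, K.zero), k)
    return out

def select(K, R, pred):
    """Multiply each annotation by 1 or 0; tuples annotated 0 are omitted."""
    out = {}
    for t, k in R.items():
        v = K.times(k, K.one if pred(t) else K.zero)
        if v != K.zero:
            out[t] = v
    return out

def cross(K, R, S):
    """Cartesian product: annotations are multiplied."""
    return {t + u: K.times(k, l) for t, k in R.items() for u, l in S.items()}

# The UCQ q from the earlier slides, over the bag-annotated R.
R = {("a", "b", "c"): 2, ("d", "b", "e"): 5, ("f", "g", "e"): 1}
RR = cross(NAT, R, R)
rule1 = project(NAT, select(NAT, RR, lambda t: t[2] == t[5]), (0, 2))
rule2 = project(NAT, select(NAT, RR, lambda t: t[1] == t[4]), (0, 5))
q = union(NAT, rule1, rule2)
print(q[("a", "c")])  # 8
```

Swapping `NAT` for another `Semiring` value changes the semantics without touching the operators.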

14 RA+ identities imply semiring structure!
● Common RA+ identities
  ▪ Union and join are associative, commutative
  ▪ Join distributes over union
  ▪ etc. (but not idempotence!)
● These identities hold for RA+ on K-relations iff (K, +, ·, 0, 1) is a commutative semiring
  ▪ (K, +, 0) is a commutative monoid
  ▪ (K, ·, 1) is a commutative monoid
  ▪ · distributes over +, etc.

15 Calculations on annotated tables are particular cases
  (B, ∨, ∧, false, true)            usual relational algebra
  (N, +, ·, 0, 1)                   bag semantics
  (PosBool(B), ∨, ∧, false, true)   boolean C-tables
  (P(Ω), ∪, ∩, ∅, Ω)                probabilistic event tables
  (P(X), ∪, ∪, ∅, ∅)                lineage/why-provenance

16 Provenance Semirings
● X = {p, r, s, …}: indeterminates (provenance “tokens” for base tuples)
● N[X]: multivariate polynomials with coefficients in N and indeterminates in X
● (N[X], +, ·, 0, 1) is the most “general” commutative semiring: its elements abstract calculations in all semirings
● N[X]-relations are the relations with provenance!
  ▪ The polynomials capture the propagation of provenance through (positive) relational algebra

17 A provenance calculation
  q(x,z) :- R(x, _,z), R(_, _,z)
  q(x,z) :- R(x,y, _), R(_,y,z)
R:
  a b c   p
  d b e   r
  f g e   s
q(R):
        provenance   lineage
  a c   2p²          {p}
  a e   pr           {p,r}
  d c   pr           {p,r}
  d e   2r² + rs     {r,s}
  f e   2s² + rs     {r,s}
● Not just why- but also how-provenance (encodes derivations)!
● More informative than lineage: d e and f e have the same lineage but different provenance
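The polynomials above can be reproduced with a short Python sketch. The encoding is an assumption made here: a polynomial in N[X] is a `Counter` that maps each monomial (a sorted tuple of tokens, with repetition for powers) to its coefficient, and `var`, `p_add`, `p_mul` are hypothetical helper names.

```python
from collections import Counter
from itertools import product

def var(name):
    """The polynomial consisting of a single provenance token."""
    return Counter({(name,): 1})

def p_add(f, g):
    """Polynomial addition: coefficients of equal monomials are summed."""
    return f + g

def p_mul(f, g):
    """Polynomial multiplication: monomials are concatenated and sorted."""
    out = Counter()
    for (m1, c1), (m2, c2) in product(f.items(), g.items()):
        out[tuple(sorted(m1 + m2))] += c1 * c2
    return out

R = [(("a", "b", "c"), var("p")), (("d", "b", "e"), var("r")), (("f", "g", "e"), var("s"))]

prov = {}
for (t1, k1), (t2, k2) in product(R, repeat=2):
    k = p_mul(k1, k2)
    # Rule 1: q(x,z) :- R(x,_,z), R(_,_,z)
    if t1[2] == t2[2]:
        key = (t1[0], t1[2])
        prov[key] = p_add(prov.get(key, Counter()), k)
    # Rule 2: q(x,z) :- R(x,y,_), R(_,y,z)
    if t1[1] == t2[1]:
        key = (t1[0], t2[2])
        prov[key] = p_add(prov.get(key, Counter()), k)

# ("r", "r") stands for r squared, ("r", "s") for the monomial rs.
print(dict(prov[("d", "e")]))
```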

18 Trust assessment
p: justified by Moe; r: justified by Larry; s: justified by Curly
  q(x,z) :- R(x, _,z), R(_, _,z)
  q(x,z) :- R(x,y, _), R(_,y,z)
R:
  a b c   p
  d b e   r
  f g e   s
q(R):
  a c   2p²
  a e   pr
  d c   pr
  d e   2r² + rs
  f e   2s² + rs
Reading the polynomials:
  2p²: 2 alternatives, both need Moe, twice
  pr: needs both Moe and Larry
  2r² + rs: one alternative needs Larry and Curly; two others only need Larry, twice
Which output tuples can be trusted after Larry is jailed?
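One way to answer the question is to evaluate each provenance polynomial in the Boolean semiring, mapping a token to true when its source is still trusted. The sketch below assumes that reading; the dictionary `PROV` is copied by hand from the slide and the function name `trusted` is hypothetical.

```python
# Provenance polynomials from the slide: monomial (tuple of tokens) -> coefficient.
PROV = {
    ("a", "c"): {("p", "p"): 2},
    ("a", "e"): {("p", "r"): 1},
    ("d", "c"): {("p", "r"): 1},
    ("d", "e"): {("r", "r"): 2, ("r", "s"): 1},
    ("f", "e"): {("s", "s"): 2, ("r", "s"): 1},
}

def trusted(poly, trust):
    """Evaluate in the Boolean semiring: + becomes 'or', * becomes 'and'.

    Coefficients are positive, so they do not affect the Boolean result.
    """
    return any(all(trust[x] for x in mono) for mono in poly)

# Larry (token r) is no longer trusted; Moe (p) and Curly (s) still are.
trust = {"p": True, "r": False, "s": True}
result = {t: trusted(poly, trust) for t, poly in PROV.items()}
print(sorted(t for t, ok in result.items() if ok))  # [('a', 'c'), ('f', 'e')]
```

Under this reading, a c and f e remain trusted because each has at least one derivation that does not involve r.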

19 More Technical Development
● The semiring structure on annotations works out nicely for (positive) relational algebra.
● What more do we need for Datalog queries?
● ω-continuous semirings (so fixed points exist)!
  ▪ N is not ω-continuous, but N^∞ ≜ N ∪ {∞} is
● Here we show only what we need for Datalog provenance (formal power series)

20 Beyond RA+: Datalog
  r1: q(X, Y) :- R(X, Y)
  r2: q(X, Y) :- q(X, Z), q(Z, Y)
R:
  a b
  a c
  c b
  b d
  d d
[Figure: derivation trees for q(a d). One tree derives q(a b) from R(a b) and q(b d) from R(b d) by r1, then q(a d) by r2. Further trees also use q(d d), derived from R(d d) by r1, in additional r2 steps, and so on (…).]

21 Provenance: Encoding Infinite Derivations
● Polynomials do not suffice, since they are finite!
● Instead, we use infinite formal power series
● Nonetheless, provenance is finitely representable through a system of equations

22 Provenance equations
  r1: q(X, Y) :- R(X, Y)
  r2: q(X, Y) :- q(X, Z), q(Z, Y)
R:
  a b   m
  a c   n
  c b   p
  b d   r
  d d   s
q(R):
  a b   x
  a c   y
  c b   z
  b d   u
  d d   v
  a d   w
Equations:
  x = m + yz
  y = n
  z = p
  u = r + uv
  v = s + v²
  w = xu + wv
The provenances x, y, etc. are the power series that solve this system of equations (see next)
The polynomials on the right-hand sides are the provenance of the immediate consequence operator (in RA+)
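The solution of this system can be approximated by iterating the right-hand sides from 0, keeping only monomials up to a fixed total degree. The Python sketch below does this under assumptions of its own: the truncation bound `MAX_DEGREE`, the `Counter` encoding of series (sorted tuple of tokens -> coefficient) and the helper names `var`, `mul`, `step` are not from the slides.

```python
from collections import Counter

MAX_DEGREE = 5  # assumption of this sketch: drop monomials of higher total degree

def var(name):
    """The series consisting of a single token."""
    return Counter({(name,): 1})

def mul(f, g):
    """Truncated product of two series."""
    out = Counter()
    for m1, c1 in f.items():
        for m2, c2 in g.items():
            if len(m1) + len(m2) <= MAX_DEGREE:
                out[tuple(sorted(m1 + m2))] += c1 * c2
    return out

m, n, p, r, s = (var(t) for t in "mnprs")

def step(e):
    """One application of the right-hand sides of the equation system."""
    return {
        "x": m + mul(e["y"], e["z"]),
        "y": n,
        "z": p,
        "u": r + mul(e["u"], e["v"]),
        "v": s + mul(e["v"], e["v"]),
        "w": mul(e["x"], e["u"]) + mul(e["w"], e["v"]),
    }

env = {name: Counter() for name in "xyzuvw"}
for _ in range(100):  # safety cap; the truncated iteration stabilizes much earlier
    nxt = step(env)
    if nxt == env:
        break
    env = nxt

# Coefficients of s, s^2, ..., s^5 in v.
print([env["v"][("s",) * k] for k in range(1, 6)])  # [1, 1, 2, 5, 14]
```

Every recursive term multiplies two series without constant term, so low-degree coefficients stop changing after a few rounds, which is why the truncated iteration reaches a fixed point.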

23 Solutions: formal power series
  x = m + np
  y = n
  z = p
  v = s + s² + 2s³ + 5s⁴ + 14s⁵ + …
  u = r v*
  w = r(m + np)(v*)²
  where v* ≜ 1 + v + v² + v³ + …
The coefficients of v have the form (2k)! / (k! (k+1)!)
In general we need coefficients from N^∞ ≜ N ∪ {∞}
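The closed form for the coefficients can be checked against the series for v with a few lines of Python. The function name `coeff` is hypothetical; the index convention assumed here is that `coeff(k)` is the coefficient of s to the power k+1.

```python
from math import factorial

def coeff(k):
    """(2k)! / (k! (k+1)!), i.e. the k-th Catalan number."""
    return factorial(2 * k) // (factorial(k) * factorial(k + 1))

# Coefficients of s, s^2, s^3, s^4, s^5 in v = s + v^2.
print([coeff(k) for k in range(5)])  # [1, 1, 2, 5, 14]
```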

24 Algorithmic results for Datalog provenance
● Given t ∈ q(I), it is decidable whether the provenance of t is a proper (infinite) power series;
● From CFG ambiguity, we know that testing whether all coefficients are ≤ 1 is undecidable
● However, given t ∈ q(I), and a monomial μ, the coefficient of μ in the power series that is the provenance of t is computable (including when it is ∞)

25 Related Work
● Foundations: semirings/systems of equations/formal power series first used in CS in theory of formal languages [Chomsky,Schützenberger 1963]
● Our work is related to and shares similar goals with “Debugging schema mappings with routes” [Chiticariu,Tan VLDB2006], where “routes” are like minimal finite portions of our how-provenance
  ▪ See also tutorial at SIGMOD tomorrow!

26 Further work
● Application: P2P data sharing in the ORCHESTRA system (thanks to our collaborator Zack Ives):
  ▪ Need to express trust conditions based on provenance of tuples
  ▪ Incremental propagation of deletions
  ▪ Semiring provenance itself is incrementally maintainable
  ▪ See demo of ORCHESTRA in SIGMOD on Thursday!
● Future extension: full relational algebra. For difference we need semirings with “proper subtraction”

27 Querying the maybe-table
Union of conjunctive queries (UCQ):
  q(x,z) :- R(x, _,z), R( _, _,z)
  q(x,z) :- R(x,y, _ ), R( _,y,z)
R (maybe-table, every tuple is optional):
  a b c   ?
  d b e   ?
  f g e   ?
q(R) = { {a c, a e, d c, d e, f e}, {a c, a e, d c, d e}, {a c, f e}, {d e, f e}, {a c}, {d e}, {f e}, ∅ }
q(R) cannot be represented by a maybe-table
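The claim can be checked by brute force. The Python sketch below enumerates the possible results of q and compares them with the only maybe-table that could represent them. It relies on a reading of maybe-tables assumed here: a table with certain tuples C and optional tuples O represents every set C ∪ S with S ⊆ O, so C must be the intersection of all worlds and O the remaining tuples of their union. All names in the code are hypothetical.

```python
from itertools import combinations, product

R = [("a", "b", "c"), ("d", "b", "e"), ("f", "g", "e")]

def q(inst):
    """Evaluate the UCQ q on an ordinary (set) instance."""
    out = set()
    for t1, t2 in product(inst, repeat=2):
        if t1[2] == t2[2]:  # rule 1: same third column
            out.add((t1[0], t1[2]))
        if t1[1] == t2[1]:  # rule 2: same second column
            out.add((t1[0], t2[2]))
    return frozenset(out)

# One query result per sub-instance of the maybe-table R.
worlds = {q(sub) for k in range(len(R) + 1) for sub in combinations(R, k)}

# The only candidate maybe-table: certain = tuples in every world,
# optional = every other tuple that occurs in some world.
certain = frozenset.intersection(*worlds)
optional = frozenset.union(*worlds) - certain
maybe_worlds = 2 ** len(optional)

print(len(worlds), maybe_worlds)  # 8 32
```

The candidate table would admit 32 instances, for example {a e} alone, while q(R) has only 8, so under this reading no maybe-table matches.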

28 Beyond RA+: Datalog
  r1: q(X, Y) :- R(X, Y)
  r2: q(X, Y) :- q(X, Z), q(Z, Y)
R:
  a b
  a c
  c b
  b d
  d d
[Figure: two derivation trees for q(a b). One derives q(a c) from R(a c) and q(c b) from R(c b) by r1, then q(a b) by r2. The other derives q(a b) directly from R(a b) by r1.]

