1 Provenance Semirings T.J. Green, G. Karvounarakis, V. Tannen University of Pennsylvania PODS 2007.

Slides:



Advertisements
Similar presentations
1 Datalog: Logic Instead of Algebra. 2 Datalog: Logic instead of Algebra Each relational-algebra operator can be mimicked by one or several Database Logic.
Advertisements

Containment of Conjunctive Queries on Annotated Relations TJ Green University of Pennsylvania Symposium on Database Provenance University of Edinburgh.
Dr. A.I. Cristea CS 319: Theory of Databases: FDs.
ANHAI DOAN ALON HALEVY ZACHARY IVES CHAPTER 14: DATA PROVENANCE PRINCIPLES OF DATA INTEGRATION.
Provenance analysis of algorithms 10/1/13 V. Tannen University of Pennsylvania 1WebDam someTowards ?
CPSC 504: Data Management Discussion on Chandra&Merlin 1977 Laks V.S. Lakshmanan Dept. of CS UBC.
D ATABASE S YSTEMS I R ELATIONAL A LGEBRA. 22 R ELATIONAL Q UERY L ANGUAGES Query languages (QL): Allow manipulation and retrieval of data from a database.
Copyright © Cengage Learning. All rights reserved.
Basic Structures: Sets, Functions, Sequences, Sums, and Matrices
Basic Structures: Sets, Functions, Sequences, Sums, and Matrices
Efficient Query Evaluation on Probabilistic Databases
CMPT 354, Simon Fraser University, Fall 2008, Martin Ester 52 Database Systems I Relational Algebra.
1 Lecture 11: Provenance and Data privacy December 8, 2010 Dan Suciu -- CSEP544 Fall 2010.
An Overview of Data Provenance
1 9. Evaluation of Queries Query evaluation – Quantifier Elimination and Satisfiability Example: Logical Level: r   y 1,…y n  r’ Constraint.
1 2. Constraint Databases Next level of data abstraction: Constraint level – finitely represents by constraints the logical level.
Firewall Policy Queries Author: Alex X. Liu, Mohamed G. Gouda Publisher: IEEE Transaction on Parallel and Distributed Systems 2009 Presenter: Chen-Yu Chang.
1 Digital Logic
1 Provenance Semirings T.J. Green, G. Karvounarakis, V. Tannen University of Pennsylvania Principles of Provenance (PrOPr) Philadelphia, PA June 26, 2007.
2005lav-iii1 The Infomaster system & the inverse rules algorithm  The InfoMaster system  The inverse rules algorithm  A side trip – equivalence & containment.
Monadic Predicate Logic is Decidable Boolos et al, Computability and Logic (textbook, 4 th Ed.)
1 Provenance in O RCHESTRA T.J. Green, G. Karvounarakis, Z. Ives, V. Tannen University of Pennsylvania Principles of Provenance (PrOPr) Philadelphia, PA.
Propositional Calculus Math Foundations of Computer Science.
First Order Logic. This Lecture Last time we talked about propositional logic, a logic on simple statements. This time we will talk about first order.
CS848: Topics in Databases: Foundations of Query Optimization Topics Covered  Databases  QL  Query containment  More on QL.
Systems Architecture I1 Propositional Calculus Objective: To provide students with the concepts and techniques from propositional calculus so that they.
DECIDABILITY OF PRESBURGER ARITHMETIC USING FINITE AUTOMATA Presented by : Shubha Jain Reference : Paper by Alexandre Boudet and Hubert Comon.
The Relational Model: Relational Calculus
DBSQL 3-1 Copyright © Genetic Computer School 2009 Chapter 3 Relational Database Model.
Lecture 05 Structured Query Language. 2 Father of Relational Model Edgar F. Codd ( ) PhD from U. of Michigan, Ann Arbor Received Turing Award.
Boolean Algebra and Computer Logic Mathematical Structures for Computer Science Chapter 7.1 – 7.2 Copyright © 2006 W.H. Freeman & Co.MSCS Slides Boolean.
CSE314 Database Systems The Relational Algebra and Relational Calculus Doç. Dr. Mehmet Göktürk src: Elmasri & Navanthe 6E Pearson Ed Slide Set.
Christopher Re and Dan Suciu University of Washington Efficient Evaluation of HAVING Queries on a Probabilistic Database.
Data Security and Encryption (CSE348) 1. Lecture # 12 2.
First Order Logic Lecture 2: Sep 9. This Lecture Last time we talked about propositional logic, a logic on simple statements. This time we will talk about.
CSE 544 Relational Calculus Lecture #2 January 11 th, Dan Suciu , Winter 2011.
Reconcilable Differences Todd J. GreenZachary G. IvesVal Tannen University of Pennsylvania March 24, ICDT 09, Saint Petersburg.
Propositional Calculus CS 270: Mathematical Foundations of Computer Science Jeremy Johnson.
CompSci 102 Discrete Math for Computer Science
A Logic of Partially Satisfied Constraints Nic Wilson Cork Constraint Computation Centre Computer Science, UCC.
In1200/04-PDS 1 TU-Delft Digital Logic. in1200/04-PDS 2 TU-Delft Unit of Information l Computers consist of digital (binary) circuits l Unit of information:
CS848: Topics in Databases: Information Integration Topics covered  Databases  QL  Query containment  An evaluation of QL.
Database Systems Relational Calculus. Relational Algebra vs. Relational Calculus Relational algebra queries are relatively easy to implement in.
Lecture 7: Foundations of Query Languages Tuesday, January 23, 2001.
1 CSE544 Monday April 26, Announcements Project Milestone –Due today Next paper: On the Unusual Effectiveness of Logic in Computer Science –Need.
Querying Big Data by Accessing Small Data Wenfei FanUniversity of Edinburgh & Beihang University Floris GeertsUniversity of Antwerp Yang CaoUniversity.
Lu Chaojun, SJTU 1 Extended Relational Algebra. Bag Semantics A relation (in SQL, at least) is really a bag (or multiset). –It may contain the same tuple.
ICS 321 Fall 2011 Algebraic and Logical Query Languages (ii) Asst. Prof. Lipyeow Lim Information & Computer Science Department University of Hawaii at.
3-2 Solving Systems Algebraically. In addition to graphing, which we looked at earlier, we will explore two other methods of solving systems of equations.
Provenance for Database Transformations Val Tannen University of Pennsylvania Joint work with J.N. Foster T.J. Green G. Karvounarakis IPAW ’08 Salt Lake.
Update Exchange with Provenance Schemas are related by GLAV schema mappings (tgds) : M4: Domain_Ref(SrcID, 'Interpro', ITAcc), Entry2Meth(ITAcc, DBAcc,
Querying and storing XML
Chapter 2 1. Chapter Summary Sets (This Slide) The Language of Sets - Sec 2.1 – Lecture 8 Set Operations and Set Identities - Sec 2.2 – Lecture 9 Functions.
Lecture 9: Query Complexity Tuesday, January 30, 2001.
CS589 Principles of DB Systems Fall 2008 Lecture 4c: Query Language Equivalence Lois Delcambre
CS589 Principles of DB Systems Spring 2014 Unit 2: Recursive Query Processing Lecture 2-1 – Naïve algorithm for recursive queries Lois Delcambre (slides.
Relational Model By Dr.S.Sridhar, Ph.D.(JNUD), RACI(Paris, NICE), RMR(USA), RZFM(Germany)
Queries with Difference on Probabilistic Databases
Propositional Calculus: Boolean Algebra and Simplification
CSPs and Relational DBs
Consistent Query Answering: a personal perspective
Lecture 10: Query Complexity
Chapter 2: Intro to Relational Model
Example of a Relation attributes (or columns) tuples (or rows)
Chapter 2: Intro to Relational Model
On Provenance of Queries on Linked Web Data
Representations & Reasoning Systems (RRS) (2.2)
Proofs (2.7) (How to compute logical consequences of a KB rather that
CSE544 Wednesday, March 29, 2006.
CS589 Principles of DB Systems Fall 2008 Lecture 4b: Domain Independence and Safety Lois Delcambre
Presentation transcript:

1 Provenance Semirings T.J. Green, G. Karvounarakis, V. Tannen University of Pennsylvania PODS 2007

2 Provenance ● First studied in data warehousing ▪ Lineage [Cui,Widom,Wiener 2000] ● Scientific applications (to assess quality of data) ▪ Why-Provenance [Buneman,Khanna,Tan 2001] ● Our interest: P2P data sharing in the O RCHESTRA system (project headed by Zack Ives) ▪ Trust conditions based on provenance ▪ Deletion propagation

3 PODS 2007 Annotated relations ● Provenance: an annotation on tuples ● Our observation: propagating provenance/lineage through views is similar to querying ▪ Incomplete Databases (conditional tables) ▪ Probabilistic Databases (independent tuple tables) ▪ Bag Semantics Databases (tuples with multiplicities) ● Hence we look at queries on relations with annotated tuples

4 PODS 2007 semantics: a set of instances Incomplete databases: boolean C-tables a b c p d b e r f g e s a b c f g e { } I(R)=,,,,,,, ; d b ea b cf g e d b e f g e a b c d b e f g e R a b c d b e boolean variables

5 PODS 2007 Imielinski & Lipski (1984): queries on C -tables R a b cp d b er f g es a c(p Æ p) Ç (p Æ p) a ep Æ r d cr Æ p d e(r Æ r) Ç (r Æ r) Ç (r Æ s) f e(s Æ s) Ç (s Æ s) Ç (s Æ r) p p Æ r r s = q(R) q(x,z) :- R(x, _,z), R( _, _,z) q(x,z) :- R(x,y, _ ), R( _,y,z) r r rr s union of conjunctive queries (UCQ) a c f e p=true r=false s=true

6 PODS 2007 Why-provenance/lineage a b cp d b er f g es R Which input tuples contribute to the presence of a tuple in the output? q(R) tuple ids same query a c{p} a e{p,r} d c{p,r} d e{r,s} f e{r,s} [Cui,Widom,Wiener 2000] [Buneman,Khanna,Tan 2001]

7 PODS 2007 C – tables vs. Lineage a c ({p}  {p})  ({p}  {p}) a e {p}  {r} d c {r}  {p} d e ({r}  {r})  ({r}  {r})  ({r}  {s}) f e ({s}  {s})  ({s}  {s})  ({s}  {r}) a c (p Æ p) Ç (p Æ p) a e p Æ r d c r Æ p d e (r Æ r) Ç (r Æ r) Ç (r Æ s) f e (s Æ s) Ç (s Æ s) Ç (s Æ r) c-table calculations lineage calculations The structure of the calculations is the same!

8 PODS 2007 Another analogy, with bag semantics a b c2 d b e5 f g e1 R tuple multiplicities a c 2 ¢ ¢ 2 a e 2 ¢ 5 d c 5 ¢ 2 d e 5 ¢ ¢ ¢ 1 f e 1 ¢ ¢ ¢ 5 q(R) a c8 a e10 d c10 d e55 f e7 multiplicity calculations same query a c (p Æ p) Ç (p Æ p) a e p Æ r d c r Æ p d e (r Æ r) Ç (r Æ r) Ç (r Æ s) f e (s Æ s) Ç (s Æ s) Ç (s Æ r) c-table calculations The structure of the calculations is the same!

9 PODS 2007 Abstracting the structure of these calculations These expressions capture the abstract structure of the calculations, which encodes the logical derivation of the output tuples We shall use these expressions as provenance C-tablesBagsLineageAbstract join Æ¢[¢ union Ç + [ + a c (p ¢ p) + (p ¢ p) a e p ¢ r d c r ¢ p d e (r ¢ r) + (r ¢ r) + (r ¢ s) f e (s ¢ s) + (s ¢ s) + (s ¢ r) abstract calculations

10 PODS 2007 Technical Development ● Abstractly annotated relations ( K -relations) and their relational algebra ▪ K must be semiring ▪ For provenance, K is semiring of polynomials ● Datalog on K -relations ▪ For provenance, K consists of (possibly infinite) formal power series

11 PODS 2007 K -relations ● Annotations are elements from an algebraic structure (K,+, ¢, 0, 1) ● If D is the domain of database values, an n -ary K -relation is a function: R: D n ! K Although the notation resembles arithmetic, these are abstract operations All possible tuples

12 PODS 2007 K -relations, annotated tables ● K -relation corresponds to table: R: D n ! K ● If R(t)=k, then t “is annotated by k” ● For all but finitely many tuples t, R(t) = 0 ▪ we omit those tuples from the table representation tuple 1 k1k1 tuple 2 k2k2 tuple 3... k3k3

13 PODS 2007 Positive K -relational algebra ● We define an RA+ on K -relations: ▪ The ¢ corresponds to join: ▪ The + corresponds to union and projection ▪ 0 and 1 are used for selection predicates ▪ Details in the paper (but recall how we evaluated the UCQ q earlier and we will see another example later)

14 PODS 2007 RA+ identities imply semiring structure! ● Common RA+ identities ▪ Union and join are associative, commutative ▪ Join distributes over union ▪ etc. (but not idempotence!) These identities hold for RA+ on K -relations iff (K, +, ¢, 0, 1) is a commutative semiring (K,+,0) is a commutative monoid (K, ¢,1) is a commutative monoid ¢ distributes over +, etc

15 PODS 2007 Calculations on annotated tables are particular cases ( B, Ç, Æ, false, true) usual relational algebra ( N, +, ¢, 0, 1) bag semantics (PosBool(B), Ç, Æ, false, true) boolean C-tables ( P ( ­ ), [, Å, ;, ­ ) probabilistic event tables ( P (X), [, [, ;, ; ) lineage/why-provenance

16 PODS 2007 Provenance Semirings ● X = {p, r, s, …}: indeterminates (provenance “tokens” for base tuples) ● N [X] : multivariate polynomials with coefficients in N and indeterminates in X ● ( N [X], +, ¢, 0, 1) is the most “general” commutative semiring: its elements abstract calculations in all semirings ● N [X] –relations are the relations with provenance! ▪ The polynomials capture the propagation of provenance through (positive) relational algebra

17 PODS 2007 A provenance calculation a b cp d b er f g es R a c2p 2 a epr d cpr d e2r 2 + rs f e2s 2 + rs q(R) a c{p} a e{p,r} d c{p,r} d e{r,s} f e{r,s} Lineage ● Not just why- but also how-provenance (encodes derivations)! ● More informative than lineage same lineage, different provenance q(x,z) :- R(x, _,z), R(_, _,z) q(x,z) :- R(x,y, _), R(_,y,z)

18 PODS 2007 p : justified by Moe r : justified by Larry s : justified by Curly Trust assesment a b cp d b er f g es R a c2p 2 a epr d cpr d e2r 2 + rs f e2s 2 + rs q(R) One alternative needs Larry and Curly Two others only need Larry, twice 2 alternatives, both need Moe, twice Needs both Moe and Larry q(x,z) :- R(x, _,z), R(_, _,z) q(x,z) :- R(x,y, _), R(_,y,z) Which output tuples can be trusted after Larry is jailed?

19 PODS 2007 More Technical Development ● The semiring structure on annotations works out nicely for (positive) relational algebra. ● What more do we need for Datalog queries? ●  -continuous semirings (so fixed points exist)! ▪ N is not  -continuous, but N 1 ≜ N [ { 1 } is ● Here we show only what we need for Datalog provenance (formal power series)

20 PODS 2007 Beyond RA+: Datalog a b a c c b b d d d R R(a b) q(a b) R(b d) q(b d) q(a d) r1r1 r1r1 r2r2 r2r2 R(d d) q(d d) r1r1 q(a d) r2r2 r2r2 … r 1 : q(X, Y) :- R(X, Y) r 2 : q(X, Y) :- q(X, Z), q(Z, Y) R(a b) q(a b) R(b d) q(b d) q(a d) r1r1 r1r1 r2r2 r2r2 R(d d) q(d d) r1r1

21 PODS 2007 Provenance: Encoding Infinite Derivations ● Polynomials do not suffice, since they are finite! ● Instead, we use infinite formal power series ● Nonetheless, provenance is finitely representable through a system of equations

22 PODS 2007 Provenance equations a bm a cn c bp b dr d ds R a bx a cy c bz b du d dv a dw q(R) x= m + yz y= n z= p u= r + uv v= s + v 2 w= xu + wv The provenances x,y etc. are the power series that solve this system of equations (see next) r 1 : q(X, Y) :- R(X, Y) r 2 : q(X, Y) :- q(X, Z), q(Z, Y) Polynomials are the provenance of the immediate consequence operator (in RA+)

23 PODS 2007 Solutions: formal power series x = m + np y = n z = p v = s + s 2 + 2s 3 + 5s s 5 + … u = r v * w = r(m+np)(v * ) 2 where v * ≜ 1 + v + v 2 + v 3 + … In general we need coefficients from: N 1 ≜ N [ { 1 } Coefficients have the form 2k! k!(k+1)!

24 PODS 2007 Algorithmic results for Datalog provenance ● Given t  q(I), it is decidable whether the provenance of t is a proper (infinite) power series; ● From CFG ambiguity, we know that testing whether all coefficients are · 1 is undecidable ● However, given t  q(I), and a monomial , the coefficient of  in the power series that is the provenance of t is computable (including when it is 1 )

25 PODS 2007 Related Work ● Foundations: semirings/systems of equations/formal power series first used in CS in theory of formal languages [Chomsky,Schutzenberger 1963] ● Our work is related to and shares similar goals with “Debugging schema mappings with routes” [Chiticariu,Tan VLDB2006], where “routes” are like minimal finite portions of our how-provenance ▪ See also tutorial at SIGMOD tomorrow!

26 PODS 2007 Further work ● Application: P2P data sharing in the O RCHESTRA system (thanks to our collaborator Zack Ives): ▪ Need to express trust conditions based on provenance of tuples ▪ Incremental propagation of deletions ▪ Semiring provenance itself is incrementally maintainable ▪ See demo of O RCHESTRA in SIGMOD on Thursday! ● Future extension: full relational algebra. For difference we need semirings with “proper subtraction”

27 PODS 2007 Union of conjunctive queries (UCQ): q(x,z) :- R(x, _,z), R( _, _,z) q(x,z) :- R(x,y, _ ), R( _,y,z) q(R) cannot be represented by a maybe-table Querying the maybe-table R a b c ? d b e ? f g e ? a c a e d c d e f e d e f e d e { q(R) = a c f e a c a e d c d e f ea c,,,,,,, ; }

28 PODS 2007 Beyond RA+: Datalog r 1 : q(X, Y) :- R(X, Y) r 2 : q(X, Y) :- q(X, Z), q(Z, Y) a b a c c b b d d d R R(a c) q(a c) R(c b) q(c b) q(a b) r1r1 r1r1 r2r2 r2r2 R(a b) q(a b) r1r1