An Overview of Data Provenance


An Overview of Data Provenance Why – How – Where

What’s up with this presentation?
Cannibalized from:
- Val Tannen’s EDBT and AMW 2010 tutorials
- Peter Buneman and Wang-Chiew Tan’s SIGMOD 2007 tutorial
- Julie and Nodira’s 590q presentation

Provenance
- provenance, n. The fact of coming from some particular source or quarter; origin, derivation [Oxford English Dictionary]
- Data provenance [BunemanKhannaTan01]: aims to explain how a particular result was derived.
- Data-intensive science has to worry about provenance.

Terminology (cf. Peter Buneman)
- Pedigree is for dogs
- Lineage is for kings
- Provenance is for art

Motivation
- Data integration [WangMadnick90, LeeBressanMadnick98]
- Data warehousing [CuiWidomWiener00]
- Scientific data management [BunemanKhannaTan01]
- Determine trust in results
- Ensure reliability and quality of data
- Repeatability/verifiability
- Avoid duplication of effort
- Understand the transport of annotations

Example of Data Provenance
A typical question: for a given database query Q, a database D, and a tuple t in the output Q(D), which parts of D “contribute” to t? The question can also be asked about attribute values, tables, etc.
R:        Emp    Dept
          John   D01
          Susan  D02
          Anna   D04
S:        Did    Mgr
          D01    Mary
          D02    Ken
          D03    Ed
Q(R, S):  Emp    Dept   Mgr
          John   D01    Mary
          Susan  D02    Ken
Q = select r.A, r.B, s.C from R r, S s where r.B = s.B
(Here A, B, C are generic attribute names: A is Emp, B is Dept in R and Did in S, C is Mgr.)
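
A minimal sketch of the question on this slide, assuming the join matches R.Dept with S.Did. The tuple identifiers r1..r3 and s1..s3 and the function name join_with_lineage are hypothetical labels introduced only for the illustration.

```python
# Relations from the slide; the keys r1..s3 are hypothetical tuple identifiers.
R = {"r1": ("John", "D01"), "r2": ("Susan", "D02"), "r3": ("Anna", "D04")}
S = {"s1": ("D01", "Mary"), "s2": ("D02", "Ken"), "s3": ("D03", "Ed")}

def join_with_lineage(R, S):
    """Join on R.Dept = S.Did and record which input tuples produced each row."""
    out = {}
    for rid, (emp, dept) in R.items():
        for sid, (did, mgr) in S.items():
            if dept == did:
                out.setdefault((emp, dept, mgr), set()).update({rid, sid})
    return out

result = join_with_lineage(R, S)
for row, sources in sorted(result.items()):
    print(row, "<-", sorted(sources))
```

Anna (r3) and Ed (s3) contribute to no output tuple, so they appear in no source set.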

Timeline
- 1990: A Polygen Model for Heterogeneous Database Systems: The Source Tagging Perspective. Y. R. Wang and S. E. Madnick. VLDB 1990.
- 1997: Supporting Fine-grained Data Lineage in a Database Visualization Environment. A. Woodruff and M. Stonebraker. ICDE 1997.
- 2000: Tracing the Lineage of View Data in a Warehousing Environment. Y. Cui, J. Widom and J. L. Wiener. TODS 2000.
- 2001: Why and Where: A Characterization of Data Provenance. P. Buneman, S. Khanna, W.-C. Tan. ICDT 2001.
- 2002: On Propagation of Deletions and Annotations through Views. P. Buneman, S. Khanna, W.-C. Tan. PODS 2002.
- 2003: Containment of Relational Queries with Annotation Propagation. W.-C. Tan. DBPL 2003.

Timeline (continued)
- 2004: An Annotation Management System for Relational Databases. D. Bhagwat, L. Chiticariu, W.-C. Tan, G. Vijayvargiya. VLDB 2004, VLDB Journal 2005.
- 2006: MONDRIAN: Annotating and Querying Databases through Colors and Blocks. ICDE 2006.
- 2006: Provenance in Curated Databases. P. Buneman, A. Chapman and J. Cheney. SIGMOD 2006.
- 2006: Annotation Propagation Revisited for Key Preserving Views. G. Cong, W. Fan, F. Geerts. CIKM 2006.
- 2006: ULDBs: Databases with Uncertainty and Lineage. O. Benjelloun, A. D. Sarma, A. Y. Halevy and J. Widom. VLDB 2006.
- 2006: Debugging Schema Mappings with Routes. L. Chiticariu and W.-C. Tan. VLDB 2006.
- 2007: On the Expressiveness of Implicit Provenance in Query and Update Languages. P. Buneman, J. Cheney and S. Vansummeren. ICDT 2007.
- 2007: Intensional Associations Between Data and Metadata. D. Srivastava and Y. Velegrakis. SIGMOD 2007.
- 2007: Provenance Semirings. T. J. Green, G. Karvounarakis and V. Tannen. PODS 2007.
- 2008: Annotated XML: Queries and Provenance. J. N. Foster, T. J. Green, V. Tannen. PODS 2008.
- 2009: Containment of Conjunctive Queries on Annotated Relations. T. J. Green. ICDT 2009.

Two Approaches
- Eager or annotation-based
  - Changes the transformation from Q to Q’ to carry extra information
  - Source data not needed after transformation
- Lazy or non-annotation-based
  - Q is unchanged
  - Good when extra storage is an issue
  - Recomputation and access to source required

Types of Provenance
- Why: “What DB tuples contribute to the presence of each result tuple?”
- How: “By what process is each output tuple produced from the DB instance?”
- Where: “Where (from what attribute of what tuple) does each output tuple value come from?”

Lineage
- Lineage for an output tuple t is a subset of the input tuples which are relevant to the output tuple.
- Example: Lineage: {t1, t5, t6}
- Problem: not very precise. E.g., the lineage above does not specify that t5 and t6 do not both need to exist.

Why provenance
- Witness of t: any subset of the database sufficient to reconstruct tuple t in the query result.
  Examples: {t1, t5}, {t1, t6}, {t1, t2, t6, t8}
- Witness basis: leaves of the “proof tree” showing how result tuple t is generated.
  Example: {{t1, t5}, {t1, t6}}

Why: Query Rewriting
[Figure: input tuples t1, t2, t3 and output tuple t]
- Why(Q, I, t): {{t1}}
- Why(Q’, I, t): {{t1}, {t1, t2}}
- Minimal witness basis: the minimal witnesses in the witness basis. Here it is {{t1}} for both Q and Q’.
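
The minimal witness basis can be computed by discarding every witness that has a proper subset in the basis. The sketch below does this for the two witness bases on this slide; the function name minimal_witness_basis is introduced here for illustration.

```python
def minimal_witness_basis(witness_basis):
    """Keep only the witnesses that have no proper subset in the basis."""
    witnesses = {frozenset(w) for w in witness_basis}
    return {w for w in witnesses if not any(other < w for other in witnesses)}

why_q = [{"t1"}]                      # Why(Q, I, t)
why_q_prime = [{"t1"}, {"t1", "t2"}]  # Why(Q', I, t)

min_q = minimal_witness_basis(why_q)
min_q_prime = minimal_witness_basis(why_q_prime)
print(min_q == min_q_prime)
```

The two witness bases differ, but {t1, t2} is absorbed by {t1}, so the minimal witness bases coincide.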

The View Deletion Problem
D is a database instance and V = Q(D) a view defined over D. Find a set of tuples ΔD to remove from D so that a specific tuple t is removed from the view.
- View side-effect problem: minimize the number of side-effects in the view
  - Hard: queries with joins and projection or union
  - PTIME: the rest
- Source side-effect problem: minimize the number of tuples deleted from D
  - Same dichotomy [BunemanKhannaTan. PODS02]
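
A brute-force sketch of the source side-effect problem, under an assumption that is not stated on the slide: for a monotone query, t leaves the view exactly when every witness in the witness basis of t loses at least one tuple. Under that assumption the task becomes a minimum hitting set, solved here by exhaustive search (exponential, in line with the hardness result). The function name min_source_deletion is hypothetical.

```python
from itertools import combinations

def min_source_deletion(witness_basis):
    """Smallest set of source tuples that intersects every witness (brute force)."""
    witnesses = [frozenset(w) for w in witness_basis]
    candidates = sorted(set().union(*witnesses))
    for size in range(len(candidates) + 1):
        for delta in combinations(candidates, size):
            if all(w & set(delta) for w in witnesses):
                return set(delta)

# Witness basis {{t1, t5}, {t1, t6}} from the why-provenance slide.
delta_d = min_source_deletion([{"t1", "t5"}, {"t1", "t6"}])
print(delta_d)
```

Deleting t1 alone breaks both witnesses, whereas avoiding t1 would require deleting both t5 and t6.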

How provenance
- Identifies “witness tuples” and the operations performed on them to produce each result tuple
- Expresses operations using provenance semirings
  - MERGE (+): union or projection
  - JOIN (·): joins

Propagating Annotations
R (A, B, C):        (a, b, c)        annotated p
S (D, B, E):        (d, b, e)        annotated r
R ⋈ S, join on B:   (a, b, c, d, e)  annotated p · r
The annotation p · r means joint use of the data annotated by p and the data annotated by r.

Propagating Annotations (2)
R (A, B, C):   (a, b, c)   annotated p
S (A, B, C):   (a, b, c)   annotated r
R ∪ S:         (a, b, c)   annotated p + r
The annotation p + r means alternative use of the data annotated by p and the data annotated by r.

Propagating Annotations (3)
R:        A  B  C    annotation
          a  b  c1   p
          a  b  c2   r
          a  b  c3   s
π_{AB}R:  A  B
          a  b       p + r + s
+ denotes alternative use of data.

An example (SPJU)
Q = σ_{C=e} π_{AC}(π_{AC}R ⋈ π_{BC}R ∪ π_{AB}R ⋈ π_{BC}R)
R:   A  B  C   annotation
     a  b  c   p
     d  b  e   r
     f  g  e   s
π_{AC}(...) before the selection:
     A  C
     a  c
     a  e
     d  c
     d  e
     f  e
For selection, multiply each annotation by 0 or 1.

Summary of approach
- Space of annotations K
- K-relations: every tuple annotated with an element of K
- Binary operations on K
  - · : joint use (join)
  - + : alternative use (union/projection)
- Special annotations 0 and 1 in K
  - Absent tuples ↔ 0
  - 1 is a neutral annotation
- What are the laws of (K, +, ·, 0, 1)?

Annotated Relational Algebra
DBMS query optimizers assume certain equivalences:
- Union is associative, commutative
- Join is associative, commutative, distributes over union
- Projections/selections commute with each other and with union/join (when applicable)
- No idempotence, to allow for bag semantics
Equivalent queries should produce the same annotations.
Proposition: the above identities hold for queries on K-relations iff (K, +, ·, 0, 1) is a commutative semiring.

Commutative Semirings?
An algebraic structure (K, +, ·, 0, 1) where:
- K is the domain
- + is associative, commutative, with 0 identity
- · is associative with 1 identity
- · distributes over +
- a · 0 = 0 · a = 0
These laws define a semiring; it is a commutative semiring when · is also commutative.
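
For a finite carrier the laws listed above can be checked exhaustively. The sketch below is only an illustration: the function name and the integer encoding of clearance levels (0 for the most public level, 3 for "no access") are assumptions, not part of the slides.

```python
from itertools import product

def is_commutative_semiring(K, add, mul, zero, one):
    """Brute-force check of the commutative semiring laws on a finite carrier K."""
    for a, b, c in product(K, repeat=3):
        if add(add(a, b), c) != add(a, add(b, c)):
            return False  # + not associative
        if mul(mul(a, b), c) != mul(a, mul(b, c)):
            return False  # . not associative
        if add(a, b) != add(b, a) or mul(a, b) != mul(b, a):
            return False  # + or . not commutative
        if mul(a, add(b, c)) != add(mul(a, b), mul(a, c)):
            return False  # . does not distribute over +
    return all(add(a, zero) == a and mul(a, one) == a and mul(a, zero) == zero
               for a in K)

# Set semantics: Booleans with "or" as + and "and" as .
booleans = is_commutative_semiring(
    [False, True], lambda a, b: a or b, lambda a, b: a and b, False, True)

# Clearance levels encoded as 0..3 (assumed encoding), + = min, . = max.
levels = is_commutative_semiring([0, 1, 2, 3], min, max, 3, 0)

# Non-example: swapping the operations but keeping 0 and 1 breaks the identities.
swapped = is_commutative_semiring(
    [False, True], lambda a, b: a and b, lambda a, b: a or b, False, True)

print(booleans, levels, swapped)
```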

Back to example
R:   A  B  C   annotation
     a  b  c   p
     d  b  e   r
     f  g  e   s
Q:   A  C
     a  c
     a  e
     d  c
     d  e
     f  e
(The selection C = e multiplies the annotations of the rows (a c) and (d c) by 0.)

Applying the laws: polynomials
R:   A  B  C   annotation
     a  b  c   p
     d  b  e   r
     f  g  e   s
Q:   A  C      annotation
     a  e      pr
     d  e      2r² + rs
     f  e      rs + 2s²
Polynomials with coefficients in ℕ and annotation tokens as indeterminates p, r, s capture a very general form of provenance.
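
The polynomials can be reproduced mechanically. The sketch below assumes R = {(a b c): p, (d b e): r, (f g e): s} and reads the SPJU query as σ_{C=e} π_{AC}(π_{AC}R ⋈ π_{BC}R ∪ π_{AB}R ⋈ π_{BC}R), the reading that yields the annotations shown on this slide. The encoding (a polynomial as a dict from monomials to coefficients, a K-relation as a schema plus a dict of annotated rows) and all function names are choices made for this illustration.

```python
from collections import defaultdict

# A polynomial in N[X]: dict from monomial (sorted tuple of tokens) to coefficient.
def p_add(f, g):
    out = defaultdict(int)
    for poly in (f, g):
        for mono, coef in poly.items():
            out[mono] += coef
    return dict(out)

def p_mul(f, g):
    out = defaultdict(int)
    for m1, c1 in f.items():
        for m2, c2 in g.items():
            out[tuple(sorted(m1 + m2))] += c1 * c2
    return dict(out)

# A K-relation: (schema, {row: annotation}).
def project(rel, attrs):
    schema, rows = rel
    idx = [schema.index(a) for a in attrs]
    out = {}
    for row, ann in rows.items():
        key = tuple(row[i] for i in idx)
        out[key] = p_add(out.get(key, {}), ann)  # merged rows: "+"
    return (tuple(attrs), out)

def join(r1, r2):
    (s1, rows1), (s2, rows2) = r1, r2
    shared = [a for a in s1 if a in s2]
    extra = [a for a in s2 if a not in s1]
    out = {}
    for t1, a1 in rows1.items():
        for t2, a2 in rows2.items():
            if all(t1[s1.index(a)] == t2[s2.index(a)] for a in shared):
                key = t1 + tuple(t2[s2.index(a)] for a in extra)
                out[key] = p_add(out.get(key, {}), p_mul(a1, a2))  # joint use: "."
    return (tuple(s1) + tuple(extra), out)

def union(r1, r2):
    (s1, rows1), (s2, rows2) = r1, r2
    out = dict(rows1)
    for row, ann in rows2.items():
        key = tuple(row[s2.index(a)] for a in s1)  # align columns with r1
        out[key] = p_add(out.get(key, {}), ann)
    return (s1, out)

def select(rel, attr, value):
    # Dropping a row is the same as multiplying its annotation by 0.
    schema, rows = rel
    i = schema.index(attr)
    return (schema, {row: ann for row, ann in rows.items() if row[i] == value})

def show(poly):
    return " + ".join(("" if c == 1 else str(c)) + "".join(m)
                      for m, c in sorted(poly.items()))

R = (("A", "B", "C"), {
    ("a", "b", "c"): {("p",): 1},
    ("d", "b", "e"): {("r",): 1},
    ("f", "g", "e"): {("s",): 1},
})
ab, ac, bc = project(R, ["A", "B"]), project(R, ["A", "C"]), project(R, ["B", "C"])
inner = union(join(ac, bc), join(ab, bc))
schema, Q = select(project(inner, ["A", "C"]), "C", "e")
for row in sorted(Q):
    print(row, show(Q[row]))
```

Squares are printed as repeated tokens, so rr stands for r².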

How to read this provenance
Q:   A  C      annotation
     a  e      pr
     d  e      2r² + rs
     f  e      rs + 2s²
3 ways to derive (d e):
- 2 of the ways use only r, but they use it twice
- the 3rd uses r once and s once
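
One way to check this reading is to evaluate the polynomial for (d e) over the natural numbers, i.e. under bag semantics. In the sketch below a polynomial is a dict from monomials (tuples of tokens) to coefficients; this encoding and the function name evaluate are assumptions made for the illustration.

```python
def evaluate(poly, valuation):
    """Evaluate a provenance polynomial over the natural numbers (bag semantics)."""
    total = 0
    for mono, coef in poly.items():
        term = coef
        for tok in mono:
            term *= valuation[tok]
        total += term
    return total

de = {("r", "r"): 2, ("r", "s"): 1}  # 2r^2 + rs, the annotation of (d e)

derivations = evaluate(de, {"r": 1, "s": 1})   # each source tuple present once
with_duplicate = evaluate(de, {"r": 2, "s": 1})  # (d b e) stored twice in R
print(derivations, with_duplicate)
```

With every token set to 1 the value is the number of derivations; with other multiplicities it is the multiplicity of (d e) in the bag result.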

Deletion Propagation
R:   A  B  C   annotation
     a  b  c   p
     d  b  e   r
     f  g  e   s
Q before the deletion:
     A  C      annotation
     a  e      pr
     d  e      2r² + rs
     f  e      rs + 2s²
Delete (d b e) from R: set r to 0!
Q after the deletion:
     A  C      annotation
     f  e      2s²
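
Setting r to 0 can be sketched without re-running the query: every monomial that mentions r vanishes, and a tuple whose whole polynomial vanishes leaves the view. The dict encoding of polynomials and the function name delete_token are assumptions made for this illustration.

```python
def delete_token(poly, tok):
    """Substitute 0 for `tok`: every monomial that mentions it vanishes."""
    return {mono: coef for mono, coef in poly.items() if tok not in mono}

Q = {
    ("a", "e"): {("p", "r"): 1},                 # pr
    ("d", "e"): {("r", "r"): 2, ("r", "s"): 1},  # 2r^2 + rs
    ("f", "e"): {("r", "s"): 1, ("s", "s"): 2},  # rs + 2s^2
}

after = {}
for row, poly in Q.items():
    remaining = delete_token(poly, "r")
    if remaining:  # an empty polynomial is 0, i.e. the tuple is absent
        after[row] = remaining
print(after)
```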

Some useful commutative semirings
- Set semantics
- Bag semantics
- Probabilistic events
- Access control (levels from Public to Top Secret)

Provenance with semirings
R:   A  B  C   annotation
     a  b  c   p
     d  b  e   r
     f  g  e   s
Annotation of the output tuple (d e) in Q:
- Lineage semiring: {r, s}
- Why-provenance semiring: {{r}, {r, s}}
- Minimal witness semiring: {{r}}

Provenance Hierarchy
“One semiring to rule them all” (V. Tannen)
- Most informative: 2x²y + xy + 5y² + z
- Next level: x²y + xy + y² + z and 3xy + 5y + z
- Next level: xy + y + z
- Least informative: xyz and y + z
Levels are connected by semiring homomorphisms h : K1 → K2.
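
Each step down the hierarchy forgets part of the polynomial. The sketch below reproduces the expressions on this slide from 2x²y + xy + 5y² + z. The dict encoding and the function names are assumptions; the slide does not name the intermediate semirings, so the comments describe only what each map forgets.

```python
poly = {("x", "x", "y"): 2, ("x", "y"): 1, ("y", "y"): 5, ("z",): 1}

def drop_coefficients(poly):
    """Forget how many times each monomial occurs."""
    return {mono: 1 for mono in poly}

def drop_exponents(poly):
    """Forget how many times a token is used inside a monomial."""
    out = {}
    for mono, coef in poly.items():
        key = tuple(sorted(set(mono)))
        out[key] = out.get(key, 0) + coef
    return out

def why(poly):
    """Forget both: a set of witnesses, each a set of tokens."""
    return {frozenset(mono) for mono in poly}

def absorb(witnesses):
    """Keep only minimal witnesses (xy is absorbed by y)."""
    return {w for w in witnesses if not any(o < w for o in witnesses)}

def lineage(poly):
    """Flatten everything into one set of tokens."""
    return set().union(*(set(mono) for mono in poly))

print(drop_coefficients(poly))  # x^2 y + xy + y^2 + z
print(drop_exponents(poly))     # 3xy + 5y + z
print(why(poly))                # xy + y + z
print(absorb(why(poly)))        # y + z
print(lineage(poly))            # xyz
```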

Example: distrust scores
- Semiring: [definition missing from the transcript]
- Tokens: X = {p, r, s}
- Assignment function f : X → K

Example: Access Control
R:   A  B  C   annotation   level
     a  b  c   p            P
     d  b  e   r            S
     f  g  e   s            T
Q:   A  C      annotation   level
     a  c      2p²          P
     a  e      pr           S
     d  c      pr           S
     d  e      2r² + rs     S
     f  e      2s² + rs     T
Evaluate with p = P, r = S, s = T using min for “+” and max for “·”.
User with secret clearance: sees the tuples whose level is P or S.
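
The evaluation on this slide can be sketched as follows, assuming the order P < S < T on clearance levels and the polynomial annotations of Q from the earlier example. Coefficients play no role because min is idempotent. The dict encoding and the names LEVELS, RANK and clearance_needed are introduced only for the illustration.

```python
LEVELS = ["P", "S", "T"]  # assumed order: Public < Secret < Top secret
RANK = {lvl: i for i, lvl in enumerate(LEVELS)}

def clearance_needed(poly, assignment):
    """Evaluate a polynomial with "+" read as min and "." read as max."""
    best = None
    for mono in poly:
        # An empty monomial is the constant 1, i.e. the lowest level.
        need = max((RANK[assignment[tok]] for tok in mono), default=0)
        best = need if best is None else min(best, need)
    return None if best is None else LEVELS[best]

Q = {
    ("a", "c"): {("p", "p"): 2},                 # 2p^2
    ("a", "e"): {("p", "r"): 1},                 # pr
    ("d", "c"): {("p", "r"): 1},                 # pr
    ("d", "e"): {("r", "r"): 2, ("r", "s"): 1},  # 2r^2 + rs
    ("f", "e"): {("s", "s"): 2, ("r", "s"): 1},  # 2s^2 + rs
}
assignment = {"p": "P", "r": "S", "s": "T"}

levels = {row: clearance_needed(poly, assignment) for row, poly in Q.items()}
visible_to_secret = sorted(row for row, lvl in levels.items() if RANK[lvl] <= RANK["S"])
print(levels)
print(visible_to_secret)
```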

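Where Provenance
- Identifies “witness cells”
- Important for annotations
R:       A  B
         3  4
         5  6
Result:  A  B
         3  4
         5  7
The result can be produced by the query
  SELECT * FROM R WHERE A <> 5 UNION SELECT A, 7 AS B FROM R WHERE A = 5
or by the update
  UPDATE R SET B=7 WHERE A=5
Which source cell does each result cell come from?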
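
A quick check with Python's built-in sqlite3 module that the query and the update yield the same rows, assuming R holds (3, 4) and (5, 6) as on the Color Algebra slides. It compares values only; it does not model where-provenance, which is the part on which the two formulations may differ.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE R (A INTEGER, B INTEGER)")
conn.executemany("INSERT INTO R VALUES (?, ?)", [(3, 4), (5, 6)])

query = """
    SELECT * FROM R WHERE A <> 5
    UNION
    SELECT A, 7 AS B FROM R WHERE A = 5
"""
via_query = sorted(conn.execute(query).fetchall())

# Apply the update in place and read the table back.
conn.execute("UPDATE R SET B = 7 WHERE A = 5")
via_update = sorted(conn.execute("SELECT * FROM R").fetchall())

print(via_query == via_update)
conn.close()
```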

Color Algebra [Geerts, Kementsietsidis, Milano 06]
R:      A  B
        3  4
        5  6
P[Q]:   A  B
        3  4
        5  7
Q = SELECT * FROM R WHERE A <> 5 UNION SELECT A, 7 AS B FROM R WHERE A = 5

Color Algebra
R:      A  B
        3  4
        5  6
P[Q]:   A  B
        3  4
        5  7
Q = UPDATE R SET B=7 WHERE A=5

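Where Provenance and Semirings
R (relation annotation u):  attributes A^x, B^y, C^1;  tuple (a^1, b^1, c^1) annotated p
S (relation annotation v):  attributes B^1, C^1;       tuple (b^z, c^1) annotated m
Q = π_{AC}(π_{AB}R ⋈ (π_{BC}R ∪ S))
Result:                     attributes A^1, C^1;       tuple (a^1, c^1) annotated u²p²xy² + uvpmxyz
1 is a neutral annotation, used when we don’t bother to track data.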

Different Annotations → Different Tuples
Q = π_{C} σ_{C=e} π_{AC}(π_{AB}R ⋈ π_{BC}R)
R:   A  B  C     annotation
     a  b  c     p
     d  b  e^z   r
     f  g  e^w   s
Q:   C           annotation
     e^z         pr + r²
     e^w         s²

Wrap up: Issues and Directions
- Archiving
- Compression
- Generalizations
- Program Slicing [Cheney07]
- “Negative” Provenance: Why Not? [SIGMOD09], Artemis [PVLDB09]
- Causality?

The end