Published by Annabel Engram. Modified over 9 years ago.
1
Principles of Data Integration
Chapter 14: Data Provenance
AnHai Doan, Alon Halevy, Zachary Ives
2
"Where Did this Data Come from?"
Challenge: integrated data may come from many sources and mappings, of different quality or trustworthiness!
- How did I get this particular result?
- What mappings produced it?
- How much should I trust (believe) it?
Data provenance (lineage) captures the relationships between tuples in a set of data instances.
3
An Example: View Tuple Derivations
Source relations:
R(A, B): (1,2), (1,4)
S(B, C): (2,3), (3,2), (4,3)
View V1 = R ⋈ S ∪ S ⋈ S
Each V1(A, C) tuple and what it is directly derivable by:
(1,3): R(1,2) ⋈ S(2,3) ∪ R(1,4) ⋈ S(4,3)
(2,2): S(2,3) ⋈ ρ_{B→A, C→B} S(3,2)
(3,3): S(3,2) ⋈ ρ_{B→A, C→B} S(2,3)
4
Formulating a Provenance Model
Conceptually, provenance captures the operations and operands going into a result.
There are many options to do this, and many levels of detail!
A "good" provenance model should:
- Have a formal semantics
- Have equivalence properties such that equivalent query plans produce equivalent provenance
- Connect to notions of value, quality or score
5
Outline
- The two views of provenance
- Applications of data provenance
- Provenance semirings: one ring to rule them all
- Storing provenance
6
Provenance as Annotations on Data
Annotate each derivation with an "explanation" in terms of relational algebra and the tuple operands. This lets us "look up" the derivation of a result.
R(A, B): (1,2), (1,4)
S(B, C): (2,3), (3,2), (4,3)
View V1 (in Datalog):
V1(x,z) :- R(x,y), S(y,z)
V1(x,x) :- S(x,y), S(y,x)
Each V1(A, C) tuple and its provenance annotation:
(1,3): R(1,2) ⋈ S(2,3) ∪ R(1,4) ⋈ S(4,3)
(2,2): S(2,3) ⋈ ρ_{B→A, C→B} S(3,2)
(3,3): S(3,2) ⋈ ρ_{B→A, C→B} S(2,3)
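The annotation idea can be sketched in a few lines of Python. This is a minimal illustration rather than the book's implementation: the function name `derive_v1` is hypothetical, and the renaming operator ρ is left out of the printed annotations for brevity.

```python
from collections import defaultdict

# Source relations from the slide: R(A, B) and S(B, C).
R = [(1, 2), (1, 4)]
S = [(2, 3), (3, 2), (4, 3)]

def derive_v1(r, s):
    """Evaluate both V1 rules, recording every derivation of each output tuple."""
    derivations = defaultdict(list)
    # Rule 1: V1(x, z) :- R(x, y), S(y, z)
    for (x, y) in r:
        for (y2, z) in s:
            if y == y2:
                derivations[(x, z)].append(f"R({x},{y}) ⋈ S({y2},{z})")
    # Rule 2: V1(x, x) :- S(x, y), S(y, x)
    for (x, y) in s:
        for (y2, x2) in s:
            if y == y2 and x == x2:
                derivations[(x, x)].append(f"S({x},{y}) ⋈ S({y2},{x2})")
    # Alternative derivations of the same tuple are combined with union.
    return {t: " ∪ ".join(ds) for t, ds in derivations.items()}

annotations = derive_v1(R, S)
for tup, ann in sorted(annotations.items()):
    print(tup, "<-", ann)
```

Keeping the annotation next to each output tuple is what lets a user "look up" why a result exists without re-running the query.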
7
Provenance as a Graph of Relationships
- Bipartite graph: tuple nodes connected via "derivation nodes"
- Encodes a hypergraph (hyperedges = derivations)
- Makes direct derivation relationships more explicit
8
Making the Two Interchangeable
We can make these equivalent by introducing provenance tokens (equivalently, node IDs) for each tuple. Derived tuples' annotations = expressions over tokens.
R(A, B) with tokens: (1,2) = r1, (1,4) = r2
S(B, C) with tokens: (2,3) = s1, (3,2) = s2, (4,3) = s3
V1(A, C) with annotations:
(1,3): v1 = r1 ⋈ s1 ∪ r2 ⋈ s3
(2,2): v2 = s1 ⋈ s2
(3,3): v3 = s2 ⋈ s1
[Figure: provenance graph in which tuple nodes r1, r2, s1, s2, s3 connect to v1, v2, v3 through four V1 derivation nodes]
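One way to see that the graph view and the annotation view carry the same information is to store the graph as a list of derivation nodes and rebuild the annotations from it. The sketch below assumes a simple tuple-based encoding; `derivation_nodes` and `annotation` are illustrative names, not from the book.

```python
# Provenance tokens for base tuples (token names follow the slide).
tokens = {
    "r1": ("R", 1, 2), "r2": ("R", 1, 4),
    "s1": ("S", 2, 3), "s2": ("S", 3, 2), "s3": ("S", 4, 3),
}

# Each derivation node is a hyperedge: (source tokens, derived tuple token).
derivation_nodes = [
    (("r1", "s1"), "v1"),
    (("r2", "s3"), "v1"),
    (("s1", "s2"), "v2"),
    (("s2", "s1"), "v3"),
]

def annotation(target, edges):
    """Rebuild a derived tuple's annotation from its incoming derivation nodes."""
    joins = [" ⋈ ".join(sources) for sources, t in edges if t == target]
    return " ∪ ".join(joins)

for v in ("v1", "v2", "v3"):
    print(v, "=", annotation(v, derivation_nodes))
```

Going the other way (from annotations to graph) amounts to splitting each annotation on ∪ and adding one derivation node per join term.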
9
Outline
- The two views of provenance
- Applications of data provenance
- Provenance semirings: one ring to rule them all
- Storing provenance
10
Where Can We Use Provenance?
- Explanations: help the user understand why an item exists
- Scoring: provide a ranked list of "most relevant" results
- Reasoning about interactions: help the user understand data relationships
11
Examples of Provenance's Utility
- Schema mapping debugging: we may have a bad result. Determine why that result exists and what is faulty.
- Bioinformatics data integration: different sources have different levels of reliability or authoritativeness. Rank results by score!
- Probabilistic databases: we may need to know that results are correlated. Encode the relationships and use them to assign probabilities.
12
Outline
- The two views of provenance
- Applications of data provenance
- Provenance semirings: one ring to rule them all
- Storing provenance
13
The Notion of Provenance as Annotations
Many formalisms were defined for using query computations to produce annotations; each captured certain subtleties.
The key question: is there one "most powerful" model that captures the properties of the relational algebra*?
Equivalent queries should produce equivalent provenance.
* Over multisets (bags), as used by "real" systems.
14
The Provenance Semiring Model
To represent provenance, use:
- A set of provenance tokens or tuple IDs, K
- Abstract operators representing combination of tuples:
  - Abstract sum operator, ⊕, for union or projection; has identity element 0 (a ⊕ 0 ≡ 0 ⊕ a ≡ a)
  - Abstract product operator, ⊗, for join; has identity element 1 (a ⊗ 1 ≡ 1 ⊗ a ≡ a); also 0 annihilates (a ⊗ 0 ≡ 0 ⊗ a ≡ 0)
This is formally a commutative semiring.
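The semiring laws can be spot-checked on sample values. The sketch below is one possible encoding, not part of the book: the `Semiring` class and `satisfies_laws` helper are hypothetical names, and checking a finite sample only gives evidence, not a proof.

```python
from dataclasses import dataclass
from itertools import product
from typing import Any, Callable

@dataclass(frozen=True)
class Semiring:
    zero: Any                         # identity for plus, annihilator for times
    one: Any                          # identity for times
    plus: Callable[[Any, Any], Any]   # abstract sum (union, projection)
    times: Callable[[Any, Any], Any]  # abstract product (join)

def satisfies_laws(sr, samples):
    """Check the commutative-semiring laws on a finite set of sample values."""
    for a in samples:
        if sr.plus(a, sr.zero) != a or sr.plus(sr.zero, a) != a:
            return False
        if sr.times(a, sr.one) != a or sr.times(sr.one, a) != a:
            return False
        if sr.times(a, sr.zero) != sr.zero or sr.times(sr.zero, a) != sr.zero:
            return False
    for a, b, c in product(samples, repeat=3):
        if sr.plus(a, b) != sr.plus(b, a) or sr.times(a, b) != sr.times(b, a):
            return False
        if sr.plus(sr.plus(a, b), c) != sr.plus(a, sr.plus(b, c)):
            return False
        if sr.times(sr.times(a, b), c) != sr.times(a, sr.times(b, c)):
            return False
        if sr.times(a, sr.plus(b, c)) != sr.plus(sr.times(a, b), sr.times(a, c)):
            return False
    return True

counting = Semiring(0, 1, lambda a, b: a + b, lambda a, b: a * b)
boolean = Semiring(False, True, lambda a, b: a or b, lambda a, b: a and b)
# Not a semiring: with max as sum and + as product, 0 does not annihilate.
broken = Semiring(0, 1, max, lambda a, b: a + b)
```

Calling `satisfies_laws(counting, [0, 1, 2, 3])` exercises the identity, annihilation, commutativity, associativity, and distributivity laws on those samples.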
15
The Provenance Semiring Model
We can re-express our example as below, using the semiring operators instead of the relational algebra ones.
R(A, B) with annotations: (1,2) = r1, (1,4) = r2
S(B, C) with annotations: (2,3) = s1, (3,2) = s2, (4,3) = s3
V1(A, C) with annotations:
(1,3): v1 = r1 ⊗ s1 ⊕ r2 ⊗ s3
(2,2): v2 = s1 ⊗ s2
(3,3): v3 = s2 ⊗ s1
[Figure: provenance graph in which tuple nodes r1, r2, s1, s2, s3 connect to v1, v2, v3 through four V1 derivation nodes]
16
Tokens for Mappings
Sometimes we would like to assign a token to the actual mapping or rule used, so we can assign it a value.
View V1 (in Datalog):
V1(x,z) :- R(x,y), S(y,z)    (call this m1)
V1(x,x) :- S(x,y), S(y,x)    (call this m2)
R(A, B) with annotations: (1,2) = r1, (1,4) = r2
S(B, C) with annotations: (2,3) = s1, (3,2) = s2, (4,3) = s3
V1(A, C) with annotations:
(1,3): v1 = m1 ⊗ [r1 ⊗ s1] ⊕ m1 ⊗ [r2 ⊗ s3]
(2,2): v2 = m2 ⊗ [s1 ⊗ s2]
(3,3): v3 = m2 ⊗ [s2 ⊗ s1]
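Giving mappings their own tokens lets us ask "what survives if a rule is distrusted?". The following sketch evaluates the annotations in a Boolean trust semiring. It takes both derivations of (1,3) to use the first rule, as the Datalog rules imply; the trust assignment and the name `trusted` are hypothetical.

```python
# Provenance of each V1 tuple as a sum (outer list) of products (inner lists).
# Assumption: both derivations of v1 come from rule m1; v2 and v3 come from m2.
provenance = {
    "v1": [["m1", "r1", "s1"], ["m1", "r2", "s3"]],
    "v2": [["m2", "s1", "s2"]],
    "v3": [["m2", "s2", "s1"]],
}

def trusted(expr, trust):
    """Evaluate in the Boolean trust semiring: product is AND, sum is OR."""
    return any(all(trust[tok] for tok in term) for term in expr)

# Hypothetical trust assignment: every base tuple is trusted, mapping m2 is not.
trust = {tok: True for tok in ["r1", "r2", "s1", "s2", "s3", "m1", "m2"]}
trust["m2"] = False

result = {v: trusted(expr, trust) for v, expr in provenance.items()}
print(result)
```

Because the mapping token appears in every product term that uses the rule, distrusting m2 removes exactly the tuples that can only be derived through it.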
17
Example Application: Provenance Visualization
[Figure: provenance graph visualization. Legend: tuple nodes; base tuple derivation (token not shown); derivation by mapping M5]
18
Example Application: Tuple Scoring
For ranked query results, we may adopt the following model commonly used in ranking:
- Assign a score to each base tuple = -log2(probability)
- Use arithmetic sum as ⊗
- Use min as ⊕
Suppose prob(r1) = 0.5 and prob(s1) = 0.5, and all others are 1.0. Then score(r1) = score(s1) = 1 and score(r2) = score(s2) = score(s3) = 0.
V1(A, C) with annotations and scores:
(1,3): v1 = r1 ⊗ s1 ⊕ r2 ⊗ s3 = min((1+1), (0+0)) = 0
(2,2): v2 = s1 ⊗ s2 = 1+0 = 1
(3,3): v3 = s2 ⊗ s1 = 0+1 = 1
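The scoring model can be sketched directly from the stated probabilities. This is an illustrative computation only: the helper name `min_plus` and the list-of-lists encoding of the annotations are assumptions made for the example.

```python
import math

# Probabilities stated on the slide: r1 and s1 are 0.5, all others 1.0.
prob = {"r1": 0.5, "s1": 0.5, "r2": 1.0, "s2": 1.0, "s3": 1.0}
# Base score = -log2(probability); adding 0.0 turns IEEE -0.0 into 0.0.
score = {tok: -math.log2(p) + 0.0 for tok, p in prob.items()}

# Annotations as a sum (outer list) of products (inner lists) over tokens.
provenance = {
    "v1": [["r1", "s1"], ["r2", "s3"]],
    "v2": [["s1", "s2"]],
    "v3": [["s2", "s1"]],
}

def min_plus(expr, base):
    """Product is arithmetic sum within a derivation; sum is min across derivations."""
    return min(sum(base[tok] for tok in term) for term in expr)

scores = {v: min_plus(expr, score) for v, expr in provenance.items()}
# Lower score means higher probability, so rank in ascending order.
ranked = sorted(scores, key=scores.get)
print(scores, ranked)
```

Since the score is a negative log-probability, adding scores along a derivation corresponds to multiplying probabilities, and taking the minimum picks the most likely derivation.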
19
Useful Semirings
Use case                 Base value                     Product R ⊗ S       Sum R ⊕ S
Derivability             True                           R ∧ S               R ∨ S
Trust                    Trust condition result         R ∧ S               R ∨ S
Confidentiality level    Tuple confidentiality level    More_secure(R, S)   Less_secure(R, S)
Weight / cost            Base tuple weight              R + S               min(R, S)
Lineage                  Tuple ID                       R ∪ S               R ∪ S
Probabilistic event      Tuple probabilistic event      R ∧ S               R ∨ S
Number of derivations    1                              R ⋅ S               R + S
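The point of the table is that one provenance expression can be evaluated under different semirings. The sketch below does this for v1 = r1 ⊗ s1 ⊕ r2 ⊗ s3 with three of the rows; the generic `evaluate` helper, the deleted-tuple scenario, and the weights are all hypothetical values chosen for illustration.

```python
from functools import reduce

# v1 = r1 ⊗ s1 ⊕ r2 ⊗ s3, written as a list of product terms.
v1 = [["r1", "s1"], ["r2", "s3"]]

def evaluate(expr, base, plus, times):
    """Map tokens to base values, fold with times inside terms and plus across them."""
    terms = [reduce(times, (base(tok) for tok in term)) for term in expr]
    return reduce(plus, terms)

# Number of derivations: base value 1, product is multiplication, sum is addition.
count = evaluate(v1, lambda tok: 1, lambda a, b: a + b, lambda a, b: a * b)

# Derivability: product is AND, sum is OR (hypothetical scenario: r1 was deleted).
present = {"r1": False, "s1": True, "r2": True, "s3": True}
derivable = evaluate(v1, present.get, lambda a, b: a or b, lambda a, b: a and b)

# Weight / cost: product is +, sum is min (hypothetical weights).
weight = {"r1": 2, "s1": 1, "r2": 4, "s3": 3}
cost = evaluate(v1, weight.get, min, lambda a, b: a + b)

print(count, derivable, cost)
```

Only the base-value function and the two operators change between use cases; the stored expression stays the same.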
20
Outline
- The two views of provenance
- Applications of data provenance
- Provenance semirings: one ring to rule them all
- Storing provenance
21
Storing Provenance
- Use tuple keys as tokens
- Encode provenance graph as relations
R(A, B): (1,2), (1,4)
S(B, C): (2,3), (3,2), (4,3)
V1(A, C): (1,3), (2,2), (3,3)
View V1 (in Datalog):
V1(x,z) :- R(x,y), S(y,z)    (relate tuples with table P_v1-1)
V1(x,x) :- S(x,y), S(y,x)    (relate tuples with table P_v1-2)
P_v1-1:
R.A  R.B  S.B  S.C  V1.A  V1.C
1    2    2    3    1     3
1    4    4    3    1     3
P_v1-2:
S.B  S.C  S.B'  S.C'  V1.A  V1.C
2    3    3     2     2     2
3    2    2     3     3     3
22
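Storing Provenance
- Use tuple keys as tokens
- Encode provenance graph as relations
R(A, B): (1,2), (1,4)
S(B, C): (2,3), (3,2), (4,3)
V1(A, C): (1,3), (2,2), (3,3)
View V1 (in Datalog):
V1(x,z) :- R(x,y), S(y,z)
V1(x,x) :- S(x,y), S(y,x)
P_v1-1:
R.A  R.B  S.B  S.C  V1.A  V1.C
1    2    2    3    1     3
1    4    4    3    1     3
P_v1-2:
S.B  S.C  S.B'  S.C'  V1.A  V1.C
2    3    3     2     2     2
3    2    2     3     3     3
Several of these columns are redundant if we know the Datalog: for example, in P_v1-1 the join forces S.B to equal R.B, and V1.A and V1.C simply repeat R.A and S.C.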
23
Storing Provenance
- Use tuple keys as tokens
- Encode provenance graph as relations
R(A, B): (1,2), (1,4)
S(B, C): (2,3), (3,2), (4,3)
V1(A, C): (1,3), (2,2), (3,3)
View V1 (in Datalog):
V1(x,z) :- R(x,y), S(y,z)
V1(x,x) :- S(x,y), S(y,x)
With the redundant columns dropped:
P_v1-1:
A  B  C
1  2  3
1  4  3
P_v1-2:
B  C  C'
2  3  2
3  2  3
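The compact provenance tables can be materialized with ordinary SQL. The sketch below uses Python's built-in `sqlite3` module and is only one possible encoding: the table names `P_v1_1` and `P_v1_2` (underscores instead of hyphens), the column name `C2` standing in for C', and the helper `derivations_of` are assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE R (A INTEGER, B INTEGER);
CREATE TABLE S (B INTEGER, C INTEGER);
INSERT INTO R VALUES (1, 2), (1, 4);
INSERT INTO S VALUES (2, 3), (3, 2), (4, 3);

-- Rule V1(x,z) :- R(x,y), S(y,z): one row per derivation, redundant columns dropped.
CREATE TABLE P_v1_1 AS
  SELECT R.A AS A, R.B AS B, S.C AS C FROM R JOIN S ON R.B = S.B;

-- Rule V1(x,x) :- S(x,y), S(y,x): C2 plays the role of C' on the slide.
CREATE TABLE P_v1_2 AS
  SELECT s1.B AS B, s1.C AS C, s2.C AS C2
  FROM S AS s1 JOIN S AS s2 ON s1.C = s2.B AND s2.C = s1.B;
""")

rule1 = conn.execute("SELECT A, B, C FROM P_v1_1 ORDER BY A, B").fetchall()
rule2 = conn.execute("SELECT B, C, C2 FROM P_v1_2 ORDER BY B").fetchall()

def derivations_of(a, c):
    """Look up every stored derivation of view tuple V1(a, c), per rule."""
    via_rule1 = conn.execute(
        "SELECT A, B, C FROM P_v1_1 WHERE A = ? AND C = ? ORDER BY B", (a, c)
    ).fetchall()
    via_rule2 = conn.execute(
        "SELECT B, C, C2 FROM P_v1_2 WHERE B = ? AND C2 = ? ORDER BY C", (a, c)
    ).fetchall()
    return via_rule1, via_rule2

print(rule1, rule2, derivations_of(1, 3))
```

Each row of a provenance table is one derivation node, so the keys of the source and view tuples it relates can be recovered from the row plus the rule's Datalog definition.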
24
Data Provenance Wrap-up
- Provenance is critical to understanding and assessing the believability of data, and in debugging
- Two equivalent representations: annotations vs. graph
- The provenance semiring model preserves the "expected" equivalences of the relational algebra
- We can take semiring provenance and evaluate it with different semirings to get useful scores
- We can store provenance using relations
Recent work beyond the scope of the book:
- Extending provenance to more complex queries, e.g., with aggregation
- Languages for querying provenance (primarily as a graph)