On Provenance of Queries on Linked Web Data
Yannis Theoharis (1,2), Irini Fundulaki (2), Grigoris Karvounarakis (3,2) and Vassilis Christophides (1,2)
(1) Institute of Computer Science, FORTH; (2) Computer Science Department, University of Crete; (3) LogicBox, USA
What is “Linked Data”?
The W3C Linking Open Data project aims to:
- publish various open datasets as RDF on the Web
- set typed RDF links between data items from different data sources
Motivation: Linked Data Processing
Data is:
- fetched from heterogeneous sources
- integrated
- materialized in RDF
- made available via SPARQL
Range of computations:
- SPARQL queries
- complex programs (logic or procedural)
Provenance-Aware Applications
Application: annotation of interest
- Trust assessment: trustworthiness
- Access control: confidentiality level
- Data cleaning: validity
- Curated databases: source data origin
All these applications need to represent and store the relationship between the input and the output of data processes.
For some of them provenance brings efficiency gains; others are impossible without provenance.
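Data Provenance Models
Annotation models: the annotation computation is coupled with a particular application and a particular assignment of source data annotations.
- Example: R1(X, Y) = {(a, b), (c, d)}, R2(Y, Z) = {(b, e)}, R1 ⋈ R2 = {(a, b, e)}; every tuple carries a concrete annotation (t: trusted, f: untrusted).
- When the source annotations change, the annotation of the result can only be obtained by query recomputation!
Abstract provenance models: abstract provenance tokens and operators are substituted by appropriate concrete tokens and operators for a particular application and assignment.

R1
X  Y  | Annot.
a  b  | c1
c  d  | c2

R2
Y  Z  | Annot.
b  e  | c3

R1 ⋈ R2
X  Y  Z  | Annot.
a  b  e  | c1 * c3

- For Boolean trust, * is read as ∧: with c1 = t and c3 = f the result evaluates to t ∧ f; with c1 = t and c3 = t it evaluates to t ∧ t. No recomputation of the query is needed.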
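The difference between the two kinds of model can be sketched in a few lines of Python. This is only an illustration under assumptions: the nested-tuple encoding of `c1 * c3`, the function names, and the value chosen for `c2` are hypothetical and not part of the talk. The join is computed once with abstract tokens, and the same expression is then evaluated under two Boolean trust assignments.

```python
# Relations from the slide; every tuple carries an abstract provenance token.
R1 = [(("a", "b"), "c1"), (("c", "d"), "c2")]   # columns X, Y
R2 = [(("b", "e"), "c3")]                        # columns Y, Z


def join_with_provenance(r1, r2):
    """Natural join on Y; each result records the product of its input tokens."""
    return [((x, y, z), ("*", t1, t2))
            for (x, y), t1 in r1
            for (y2, z), t2 in r2
            if y == y2]


def evaluate(expr, assignment):
    """Replace tokens by concrete Boolean trust values; '*' is read as AND."""
    if isinstance(expr, str):
        return assignment[expr]
    _op, left, right = expr
    return evaluate(left, assignment) and evaluate(right, assignment)


result = join_with_provenance(R1, R2)   # the query is evaluated only once
row, prov = result[0]

# Two different trust assignments, same provenance expression.
all_trusted = evaluate(prov, {"c1": True, "c2": True, "c3": True})
c3_untrusted = evaluate(prov, {"c1": True, "c2": True, "c3": False})
print(row, prov, all_trusted, c3_untrusted)
```

Changing the assignment only re-evaluates the small expression attached to each result, which is the property that annotation models lack.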
This Talk
Question: “Can previous work on abstract provenance models be leveraged for SPARQL?”
- NO for full SPARQL, due to the OPTIONAL operator (similar to the SQL left outer join)
- YES for the positive fragment of SPARQL (without OPTIONAL)
We present our ongoing work on a SPARQL abstract provenance model.
Challenge: capturing the form of negation that OPTIONAL introduces
Outline
- SPARQL algebra
- Abstract Provenance Models for Positive SPARQL
- Limitations of Previous Models
- Towards a SPARQL Provenance Model
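SPARQL (1/2)
SPARQL is the W3C Recommendation language for querying RDF data.
[Figure: query evaluation pipeline. Triple patterns are matched against the triple set to produce mappings; mappings are composed and filtered (Filter { … }); the result is returned via Select, Construct or Describe.]
Example: in the triple pattern (?x, ?y, e), ?x and ?y are variables and e is a constant.
Triple set (S, P, O): (a, b, c), (d, b, e), (f, g, e)
Resulting mapping set Ω1: μ1 = {(?x,d),(?y,b)}, μ2 = {(?x,f),(?y,g)}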
SPARQL (2/2)
The SPARQL algebra defines 5 operators on bags of mappings:
- Unary: π (projection), σ (selection, also called filtering)
- Binary: ∪ (union), ⋈ (join), ⟕ (optional)
π, σ, ∪ and ⋈ form Positive SPARQL (SPARQL+).
Two mappings μ and μ’ are compatible (μ ~ μ’) if they agree on their common variables.
[Figure: worked examples on small mapping sets for σ?x=a(Ω), π?x(Ω), Ω1 ∪ Ω2, Ω1 ⋈ Ω2, Ω1 \ Ω2 and Ω1 ⟕ Ω2. Annotations on the figure: a joined mapping is the union of two compatible mappings (e.g. μ5 = μ1 ∪ μ4, μ6 = μ3 ∪ μ4); bag cardinalities after projection (card(μ1) = 2, card(μ2) = 1); in the optional result, ?z is unbound in μ1.]
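The operators above can be sketched directly in Python. This is a minimal illustration, assuming that mappings are represented as dictionaries from variable names to values and bags of mappings as lists; the function names are hypothetical. The difference operator `diff` is included because OPTIONAL is defined through it as (Ω1 ⋈ Ω2) ∪ (Ω1 \ Ω2). The sample data are the mapping sets used later in the talk.

```python
def compatible(m1, m2):
    """mu ~ mu': the mappings agree on every variable they share."""
    return all(m1[v] == m2[v] for v in m1.keys() & m2.keys())


def join(o1, o2):
    """Union of every compatible pair of mappings."""
    return [{**m1, **m2} for m1 in o1 for m2 in o2 if compatible(m1, m2)]


def union(o1, o2):
    """Bag union: duplicates are kept."""
    return o1 + o2


def diff(o1, o2):
    """Mappings of o1 that are compatible with no mapping of o2."""
    return [m1 for m1 in o1 if not any(compatible(m1, m2) for m2 in o2)]


def optional(o1, o2):
    """Left outer join, defined as join plus difference."""
    return union(join(o1, o2), diff(o1, o2))


def project(o, variables):
    """Keep only the given variables; duplicates are kept (bag semantics)."""
    return [{v: m[v] for v in variables if v in m} for m in o]


def select(o, predicate):
    """Filter the bag with a Boolean predicate on mappings."""
    return [m for m in o if predicate(m)]


O1 = [{"?x": "d", "?y": "b"}, {"?x": "f", "?y": "g"}]
O2 = [{"?y": "b", "?z": "c"}, {"?y": "e", "?z": "h"}]

print(join(O1, O2))
print(diff(O1, O2))
print(optional(O1, O2))
```

In the optional result, the second mapping has no binding for `?z`, which corresponds to the "-" entries in the tables.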
Abstract Provenance Models
[Figure: the SPARQL evaluation pipeline (triple patterns, mappings, Compose/Filter, Select), now labelled with Provenance.]
Models ordered from most informative to least informative: How, Trio, Why, Lineage.
- Abstract provenance models encode the query operators at different levels of detail.
- Trade-off: expressiveness vs. efficiency (annotation storage and computation time).
Abstract Provenance Models for SPARQL+
- Previous models are defined for positive relational algebra.
- Positive relational operators are monotonic: the addition (removal) of a tuple can only result in additional (removed) tuples in the output.
- This also holds for SPARQL+ (projection, union, join).
- Hence previous models suffice for SPARQL+.
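As an illustration of why existing models carry over to SPARQL+, the sketch below attaches How-style provenance polynomials to mappings. It is a simplified sketch under assumptions: polynomials are encoded as a `Counter` from monomials (sorted tuples of tokens) to coefficients, and the sample mappings, tokens and function names are hypothetical. Join multiplies annotations and projection adds them; under the Boolean reading, turning a token to false can only turn results to false, which reflects monotonicity.

```python
from collections import Counter


def token(name):
    """A provenance polynomial consisting of a single token."""
    return Counter({(name,): 1})


def plus(p, q):
    """+ : alternative derivations (used by union and projection)."""
    return p + q


def times(p, q):
    """* : joint use of inputs (used by join)."""
    out = Counter()
    for m1, k1 in p.items():
        for m2, k2 in q.items():
            out[tuple(sorted(m1 + m2))] += k1 * k2
    return out


def compatible(m1, m2):
    return all(m1[v] == m2[v] for v in m1.keys() & m2.keys())


def join(o1, o2):
    """Join of annotated mapping sets: annotations are multiplied."""
    return [({**m1, **m2}, times(p1, p2))
            for m1, p1 in o1 for m2, p2 in o2 if compatible(m1, m2)]


def project(o, variables):
    """Projection: annotations of mappings that collapse together are added."""
    grouped = {}
    for m, p in o:
        key = tuple((v, m[v]) for v in variables if v in m)
        grouped[key] = plus(grouped.get(key, Counter()), p)
    return [(dict(key), p) for key, p in grouped.items()]


def trusted(poly, trust):
    """Boolean reading of a polynomial: + is OR, * is AND."""
    return any(all(trust[t] for t in monomial) for monomial in poly)


# Hypothetical annotated mapping sets.
O1 = [({"?x": "a", "?y": "b"}, token("c1")), ({"?x": "a", "?y": "d"}, token("c2"))]
O2 = [({"?y": "b", "?z": "e"}, token("c3"))]

joined = join(O1, O2)            # annotation c1*c3
projected = project(O1, ["?x"])  # annotation c1 + c2
print(joined, projected)
```

For example, `trusted(projected[0][1], {"c1": False, "c2": True})` still holds because one derivation remains, while setting both tokens to false removes the result.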
Boolean trust assessment (SPARQL)
Boolean trust semantics = set semantics on trusted mappings.

Ω1: μ1 = (?x=d, ?y=b), μ2 = (?x=f, ?y=g)
Ω2: μ3 = (?y=b, ?z=c), μ4 = (?y=e, ?z=h)

Trusted: μ1, μ2, μ3, μ4
  Ω1 \ Ω2 = { μ2: (f, g) }
  Ω1 ⟕ Ω2 = { μ5: (d, b, c), μ2: (f, g, -) }
Trusted: μ1, μ2, μ4
  Ω1 \ Ω2 = { μ1: (d, b), μ2: (f, g) }
  Ω1 ⟕ Ω2 = { μ1: (d, b, -), μ2: (f, g, -) }

⟕ and \ are not monotonic: when μ3 becomes untrusted, μ5 becomes untrusted and μ1 becomes trusted in Ω1 ⟕ Ω2.
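The non-monotonic behaviour can be reproduced with a short Python sketch. It assumes the dictionary representation of mappings and the helper names below, which are illustrative and not from the talk; Boolean trust is implemented by evaluating OPTIONAL over the trusted mappings only.

```python
def compatible(m1, m2):
    return all(m1[v] == m2[v] for v in m1.keys() & m2.keys())


def join(o1, o2):
    return [{**m1, **m2} for m1 in o1 for m2 in o2 if compatible(m1, m2)]


def diff(o1, o2):
    return [m1 for m1 in o1 if not any(compatible(m1, m2) for m2 in o2)]


def optional(o1, o2):
    return join(o1, o2) + diff(o1, o2)


# Mapping sets from the slide, keyed by mapping name.
O1 = {"mu1": {"?x": "d", "?y": "b"}, "mu2": {"?x": "f", "?y": "g"}}
O2 = {"mu3": {"?y": "b", "?z": "c"}, "mu4": {"?y": "e", "?z": "h"}}


def trusted_optional(trusted_names):
    """Evaluate OPTIONAL over the trusted mappings only (set semantics on trusted input)."""
    t1 = [m for name, m in O1.items() if name in trusted_names]
    t2 = [m for name, m in O2.items() if name in trusted_names]
    return optional(t1, t2)


all_trusted = trusted_optional({"mu1", "mu2", "mu3", "mu4"})
mu3_untrusted = trusted_optional({"mu1", "mu2", "mu4"})
print(all_trusted)
print(mu3_untrusted)
```

Removing μ3 from the trusted input makes (d, b, c) disappear, as a monotonic operator would, but it also makes (d, b) appear, which no monotonic operator can do.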
Perm
Ω1: μ1 = (?x=d, ?y=b), μ2 = (?x=f, ?y=g)
Ω2: μ3 = (?y=b, ?z=c), μ4 = (?y=e, ?z=h)
[Figure: Perm result tables for Ω1 \ Ω2 and Ω1 ⟕ Ω2, in which each result row is extended with provenance attributes (?x1, ?y1, ?y2, ?z2) holding the contributing input mappings. For Ω1 \ Ω2, the row (f, g) carries the Ω2 mappings (b, c) and (e, h) in the attributes ?y2, ?z2.]
- Intuitively, (f, g) is in Ω1 \ Ω2 because it is compatible with neither μ3 nor μ4.
- (d, b, c) is in Ω1 ⟕ Ω2 due to the join between μ1 and μ3.
- If μ3 becomes untrusted, Perm infers that (d, b, c) becomes untrusted, but cannot infer that (d, b, -) should become trusted.
RDF Meta Knowledge & M-semirings
Ω1: μ1 = (d, b) annotated c1; μ2 = (f, g) annotated c2
Ω2: μ3 = (b, c) annotated c3; μ4 = (e, h) annotated c4

Ω1 \ Ω2
     ?x ?y | RDF MK           | M-semirings
μ2   f  g  | c2 ∧ ¬(c3 ∨ c4)  | c2 ⊖ 0 = c2

Ω1 ⟕ Ω2
     ?x ?y ?z | RDF MK           | M-semirings
μ5   d  b  c  | c1 ∧ c3          | c1 * c3
μ2   f  g  -  | c2 ∧ ¬(c3 ∨ c4)  | c2

Like Perm, RDF Meta Knowledge and M-semirings infer that μ5 is untrusted when μ3 becomes untrusted, but cannot infer that μ1: (d, b, -) is trusted.
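A Third Operation for Compatibility (1/2)
- Take care of compatible mappings: only one of μ1, μ5 can appear in the result.
- Keep provenance information for both of them!

Ω1 ⟕ Ω2 = (Ω1 ⋈ Ω2) ∪ (Ω1 \ Ω2)

Ω1: μ1 = (d, b) annotated c1; μ2 = (f, g) annotated c2
Ω2: μ3 = (b, c) annotated c3; μ4 = (e, h) annotated c4

Ω1 ⟕ Ω2
     ?x ?y ?z | How     | SPARQL Prov.
μ5   d  b  c  | c1*c3   | c1*c3
μ1   d  b  -  | No Info | c1*A(μ1, μ3)
μ2   f  g  -  | c2      | c2

For Boolean trust: A(μ1, μ3) = f if μ1 ~ μ3 and c3 = t; t otherwise.
- All mappings trusted: μ5 evaluates to (t ∧ t) = t, and μ1 to t ∧ A(μ1, μ3) = (t ∧ f) = f.
- μ3 untrusted (c3 = f): μ5 evaluates to (t ∧ f) = f, and μ1 to t ∧ A(μ1, μ3) = (t ∧ t) = t.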
A Third Operation for Compatibility (2/2)
Ω1 ⟕ Ω2 = (Ω1 ⋈ Ω2) ∪ (Ω1 \ Ω2)

     ?x ?y ?z | How     | SPARQL Prov.
μ5   d  b  c  | c1*c3   | c1*c3
μ1   d  b  -  | No Info | c1*A(μ1, μ3)
μ2   f  g  -  | c2      | c2

- A is a binary operator on mappings.
- It determines whether the mapping exists in the result or not.
- If it does, its provenance equals the positive part of the provenance expression, e.g. c1 for c1*A(μ1, μ3).
In general: A(μ1, μ3) = 0 if μ1 ~ μ3 and c3 ≠ 0; 1 otherwise.
- 0: the neutral element for +
- 1: the neutral element for *
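A minimal sketch of how A could be evaluated for Boolean trust, where 0 is False and 1 is True. The function names and the data layout are assumptions for illustration. In particular, multiplying A over every mapping of Ω2 is a generalization chosen here; the slide only shows A(μ1, μ3), and since A is 1 for incompatible mappings the two agree on this example.

```python
def compatible(m1, m2):
    return all(m1[v] == m2[v] for v in m1.keys() & m2.keys())


def A(m1, m2, c2):
    """0 (False) if m1 ~ m2 and the annotation c2 of m2 is non-zero, else 1 (True)."""
    return not (compatible(m1, m2) and c2)


def optional_with_provenance(o1, o2, trust):
    """Return every candidate mapping of o1 OPTIONAL o2 with its Boolean provenance value.

    o1, o2: lists of (mapping, token); trust: token -> bool.
    Both the joined mapping and the unextended mapping are kept as candidates.
    """
    out = []
    for m1, t1 in o1:
        for m2, t2 in o2:
            if compatible(m1, m2):
                out.append(({**m1, **m2}, trust[t1] and trust[t2]))      # c1 * c3
        absent = all(A(m1, m2, trust[t2]) for m2, t2 in o2)              # product of A terms
        out.append((dict(m1), trust[t1] and absent))                     # c1 * A(...)
    return out


O1 = [({"?x": "d", "?y": "b"}, "c1"), ({"?x": "f", "?y": "g"}, "c2")]
O2 = [({"?y": "b", "?z": "c"}, "c3"), ({"?y": "e", "?z": "h"}, "c4")]

all_trusted = optional_with_provenance(
    O1, O2, {"c1": True, "c2": True, "c3": True, "c4": True})
mu3_untrusted = optional_with_provenance(
    O1, O2, {"c1": True, "c2": True, "c3": False, "c4": True})
print(all_trusted)
print(mu3_untrusted)
```

The candidates come out in the order μ5, μ1, μ2. Only the truth values change between the two assignments, so the switch from (d, b, c) to (d, b, -) is obtained without re-running the query.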
SPARQL Provenance Operators
Two types of operators:
- on provenance tokens, i.e. + and * (for SPARQL+)
- on mappings, i.e. A (for ⟕ and \)
Good news: every triple of the dataset is uniquely annotated.
Why not use annotations as mapping identifiers in A? Because of the projection operator…
Enrich Tokens with Schema Information
- Use tokens (c1, c2, …) as mapping ids in A expressions:
  A(c1, c2) = 0 if μ1 ~ μ2 and c2 ≠ 0; 1 otherwise
- But μ1 ≁ μ2 might hold while π?y,?z(μ1) ~ π?y,?z(μ2).
- Tokens alone don’t suffice: keep token-schema pairs.
  A((c1, S1), (c2, S2)) = 0 if πS1(μ1) ~ πS2(μ2) and c2 ≠ 0; 1 otherwise

Ω
     ?x ?y ?z | Prov.
μ1   a  b  c  | (c1, {?x, ?y, ?z})
μ2   d  -  -  | (c2, {?x, ?y, ?z})

π?y,?z(Ω)
     ?y ?z | Prov.
μ1   b  c  | (c1, {?y, ?z})
μ2   -  -  | (c2, {?y, ?z})
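The effect of projection on compatibility can be checked with a small Python sketch. The lookup table `MAPPING`, the use of integers 0 and 1 for annotation values, and the function names are assumptions made for this illustration. μ2 binds only ?x, as in the table above.

```python
# Token -> original mapping (every mapping is uniquely annotated).
MAPPING = {
    "c1": {"?x": "a", "?y": "b", "?z": "c"},   # mu1
    "c2": {"?x": "d"},                          # mu2: ?y and ?z are unbound
}


def project(mapping, schema):
    """Restrict a mapping to the variables of the given schema."""
    return {v: mapping[v] for v in schema if v in mapping}


def compatible(m1, m2):
    return all(m1[v] == m2[v] for v in m1.keys() & m2.keys())


def A(pair1, pair2, value_of):
    """A((c1, S1), (c2, S2)): 0 if the projected mappings are compatible and c2 != 0, else 1."""
    (t1, s1), (t2, s2) = pair1, pair2
    m1 = project(MAPPING[t1], s1)
    m2 = project(MAPPING[t2], s2)
    return 0 if compatible(m1, m2) and value_of[t2] != 0 else 1


full = frozenset({"?x", "?y", "?z"})
yz = frozenset({"?y", "?z"})
values = {"c1": 1, "c2": 1}

before_projection = A(("c1", full), ("c2", full), values)   # mu1 and mu2 disagree on ?x
after_projection = A(("c1", yz), ("c2", yz), values)        # projected mu2 is empty
print(before_projection, after_projection)
```

The same two tokens give different values of A depending on the schema, which is why the token alone cannot serve as the mapping identifier.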
Towards a SPARQL Provenance Model
Define an algebra on token-schema pairs with 3 operations:
- 2 for the SPARQL operators
- 1 for compatibility
What if there is no projection (or projection is not allowed to be pushed down)?
- Annotations suffice (no need for schema information), but the compatibility operator is still needed.
What if there is no OPTIONAL?
- Previous models suffice, e.g. How.
Future Work
- SPARQL provenance model
- Extend the model's expressiveness to capture other computations on Linked Data
- Logic explanations
- Implementation
Questions?