Annotated XML: Queries and Provenance
Nate Foster, TJ Green, Val Tannen
University of Pennsylvania
Symposium on Database Provenance, University of Edinburgh, May 21, 2008

Need to Track XML Provenance
For scientific data processing [Buneman+ 01]
– Tree-structured data, heterogeneous sources
– XML is the natural data model
– Data annotated with source info; annotations need to be propagated during query processing
For incomplete/probabilistic data [Sen.&Abit. 06]
– Query output annotated with Boolean formulas
– Annotations indicate correlations between source data and output data
For data warehousing [Cui+ 00]
– Even when data is relational, often have XML views

Provenance for Relational Algebra Views
V := π_AB((π_AC(R) ⋈ π_C(R)) ∪ (π_AB(R) ⋈ π_BC(R)))
Source R:
  A B C
  a b c
  d b e
  f g e
View V (provenance of its tuples: ? ? ?):
  A B
  a c
  a e
  d c
  d e
  f e

Semiring-Annotated Relations [PODS07]
Associate each tuple in the database with an annotation from a commutative semiring (K, +, ·, 0, 1)
Combine and propagate annotations during (positive) relational query processing
– ⋈, ×, ∩ combine annotations using ·
– π, ∪ combine annotations using +
– σ multiplies annotations by 0 or 1

Annotated Relations Example
V := π_AB((π_AC(R) ⋈ π_C(R)) ∪ (π_AB(R) ⋈ π_BC(R)))
Source R:
  A B C   annotation
  a b c   p
  d b e   r
  f g e   s
View V:
  A B     annotation
  a c     2p²
  a e     pr
  d c     pr
  d e     2r² + rs
  f e     2s² + rs

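The annotations in this example can be checked numerically. The sketch below is a minimal illustration, not the authors' implementation. It models a K-relation as a Python dict with K = ℕ, substitutes p = 2, r = 5, s = 1, and evaluates the query in the form used in the PODS 2007 semiring paper, π_AC((π_AB R ⋈ π_BC R) ∪ (π_AC R ⋈ π_BC R)). That query form (whose attribute names may differ from the view definition as printed on the slide), the helper names `rel`, `key`, `project`, `union`, `join`, and the sample values are all assumptions made for illustration.

```python
from collections import defaultdict

def key(**attrs):
    """A tuple is a frozenset of (attribute, value) pairs, so it can be a dict key."""
    return frozenset(attrs.items())

def rel(rows):
    """Build a K-relation (K = N here): maps each tuple to its annotation."""
    return {frozenset(t.items()): k for t, k in rows}

def project(r, attrs):
    """Projection: tuples that collapse together have their annotations added."""
    out = defaultdict(int)
    for t, k in r.items():
        out[frozenset((a, v) for a, v in t if a in attrs)] += k
    return dict(out)

def union(r, s):
    """Union: annotations of equal tuples are added."""
    out = defaultdict(int)
    for x in (r, s):
        for t, k in x.items():
            out[t] += k
    return dict(out)

def join(r, s):
    """Natural join: annotations of joined tuples are multiplied."""
    out = defaultdict(int)
    for t, k in r.items():
        for u, m in s.items():
            dt, du = dict(t), dict(u)
            if all(dt[a] == du[a] for a in dt.keys() & du.keys()):
                out[frozenset({**dt, **du}.items())] += k * m
    return dict(out)

# Sample values standing in for the symbolic annotations p, r, s.
p, r, s = 2, 5, 1
R = rel([
    ({"A": "a", "B": "b", "C": "c"}, p),
    ({"A": "d", "B": "b", "C": "e"}, r),
    ({"A": "f", "B": "g", "C": "e"}, s),
])

AB, BC, AC = {"A", "B"}, {"B", "C"}, {"A", "C"}
V = project(
    union(join(project(R, AB), project(R, BC)),
          join(project(R, AC), project(R, BC))),
    AC,
)
```

With these sample values, 2p² = 8, pr = 10, 2r² + rs = 55 and 2s² + rs = 7, so `V` can be compared entry by entry against the annotated view.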
Semiring Bestiary
(𝔹, ∨, ∧, ⊥, ⊤)                  Set semantics
(ℕ, +, ·, 0, 1)                   Bag semantics
(PosBool(B), ∨, ∧, ⊥, ⊤)         Incomplete dbs
(P(Ω), ∪, ∩, ∅, Ω)                Probabilistic dbs
(P(P(X)), ∪, ⋓, ∅, {∅})           Why-provenance
    where A ⋓ B := {a ∪ b : a ∈ A, b ∈ B}
(C, min, max, absent, public)     Security clearances
(ℕ[X], +, ·, 0, 1)                Prov. polynomials

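As a small illustration of one entry in this list, the why-provenance operations can be sketched with Python frozensets. This is only a sketch under stated assumptions: the function names `why_plus`, `why_times`, `token` and the tokens p, r, s are hypothetical and chosen for the example.

```python
def why_plus(A, B):
    """Addition in the why-provenance semiring: set union."""
    return A | B

def why_times(A, B):
    """Multiplication: A ⋓ B = {a ∪ b : a ∈ A, b ∈ B}."""
    return frozenset(a | b for a in A for b in B)

WHY_ZERO = frozenset()               # ∅, the additive identity
WHY_ONE = frozenset({frozenset()})   # {∅}, the multiplicative identity

def token(name):
    """Annotation of a single source item: the set containing one singleton witness."""
    return frozenset({frozenset({name})})

p, r, s = token("p"), token("r"), token("s")

joint = why_times(p, r)    # joint use of p and r: one witness {p, r}
either = why_plus(p, r)    # alternative use: two witnesses {p} and {r}
# r·r + r·s: the witness for r·r collapses to {r}, since sets ignore multiplicity
example = why_plus(why_times(r, r), why_times(r, s))
```

Compared with provenance polynomials, this representation forgets coefficients and exponents, which is why r·r yields the same witness as r.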
Our Contribution: Annotated XML
– We show how to decorate unordered XML data with semiring annotations: K-UXML
– We propagate the annotations for K-UXQuery (based on a large fragment of positive XQuery)
– We do this by generalizing the semantics of Nested Relational Calculus (NRC) to handle annotated values and to incorporate a recursive tree type and structural recursion on trees
– We prove a commutation-with-homomorphisms theorem, and show that it enables applications in security and incomplete databases

K-UXML
– No attributes, no text values, no repeated children (inessential); no order (essential!)
– Each node decorated with a value k from semiring K (1 neutral, 0 not present)
– K-collection: a finite set of elements annotated with values from K
– Formally, the children of a node form a K-collection of subtrees (to annotate the root, also have a top-level K-collection)

Example: XPath on K-UXML
Source, $T: [tree diagram; annotated nodes b^{x1}, b^{x2}, c^{y1}, c^{y2}, c^{y3}; unannotated nodes a, d]
Query: element r { $T//c }
Answer: [tree diagram; root r, with one c subtree annotated x1·y3 + y1·y2]
Omitted annotations are 1 (and omitted subtrees have annotation 0)

Example: For-Loops in K-UXQuery
Source, $S: [tree diagram; nodes a^z, b^{x1}, c^{x2}, d^{y1}, d^{y2}, e^{y3}]
Query: element p { for $t in $S return for $x in ($t)/* return ($x)/* }
(i.e., element p { $S/*/* })
Answer: p with children d^{z·x1·y1 + z·x2·y2} and e^{z·x2·y3}

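The nested for-loops can be traced with a small provenance-polynomial sketch. It assumes one reading of the source tree that is consistent with the answer: a^z has children b^{x1} and c^{x2}, d^{y1} sits under b, and d^{y2}, e^{y3} sit under c. Polynomials in ℕ[X] are represented as `Counter` objects mapping monomials (sorted tuples of variable names) to coefficients. The representation and the names `var` and `mul` are illustrative assumptions, and the result is keyed by label only because the returned nodes are leaves.

```python
from collections import Counter

def var(name):
    """The polynomial consisting of a single variable."""
    return Counter({(name,): 1})

def mul(p, q):
    """Multiply two polynomials: concatenate monomials, multiply coefficients."""
    out = Counter()
    for m1, c1 in p.items():
        for m2, c2 in q.items():
            out[tuple(sorted(m1 + m2))] += c1 * c2
    return out

# A forest is a list of (annotation, label, child forest) triples.
d1 = (var("y1"), "d", [])
d2 = (var("y2"), "d", [])
e = (var("y3"), "e", [])
S = [(var("z"), "a", [(var("x1"), "b", [d1]),
                      (var("x2"), "c", [d2, e])])]

# for $t in $S return for $x in ($t)/* return ($x)/*
answer = {}
for k_t, _, kids in S:
    for k_x, _, grandkids in kids:
        for k_y, label, _ in grandkids:
            # multiply annotations along the path, add across different paths
            answer.setdefault(label, Counter())
            answer[label] += mul(mul(k_t, k_x), k_y)
```

Under this reading, `answer["d"]` holds the two monomials z·x1·y1 and z·x2·y2, and `answer["e"]` holds z·x2·y3.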
Outline of Technical Approach
Extend NRC with a recursive tree type
– satisfies: tree = label × {tree}
and an operation for structural recursion on trees (srt) [Robertson+ 07]
– apply to each child subtree, collect results using NRC big union
Generalize NRC + srt to handle semiring-annotated complex values ⇒ NRC_K + srt
Define semantics of K-UXQuery by translation to NRC_K + srt

Semantics of Small Union
Sums annotations:
⟦e₁ ∪ e₂⟧_K(x) := ⟦e₁⟧_K(x) + ⟦e₂⟧_K(x)
Example (a^x[b^y] denotes a tree with root a annotated x and child b annotated y):
Query: return ($S, $T) (in NRC: $S ∪ $T)
Source: the forests $S and $T together contain a^x[b^y], a^x[b^y], a^x[b^z]
Answer: a^{2x}[b^y], a^x[b^z]

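A minimal sketch of small union over K-collections, taking K = ℕ and representing a K-collection as a dict from elements to annotations. The split of the source trees between `S` and `T`, the numeric stand-ins for x, y, z, and the helper names `small_union` and `tree` are assumptions made for illustration.

```python
def small_union(c1, c2):
    """Pointwise sum of two K-collections (dicts: element -> annotation), with K = N."""
    out = dict(c1)
    for elem, k in c2.items():
        out[elem] = out.get(elem, 0) + k
    return out

def tree(label, *children):
    """Hashable tree: a label plus a frozenset of (subtree, annotation) pairs."""
    return (label, frozenset(children))

# Numeric stand-ins for the symbolic annotations x, y, z.
x, y, z = 3, 5, 7

a_by = tree("a", (tree("b"), y))   # a with child b annotated y
a_bz = tree("a", (tree("b"), z))   # a with child b annotated z

S = {a_by: x}
T = {a_by: x, a_bz: x}

answer = small_union(S, T)
```

The tree that occurs in both forests ends up with annotation x + x = 2x, while the tree that occurs only once keeps annotation x.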
Semantics of Big Union
Sums and multiplies annotations:
⟦∪(x ∈ e₁) e₂⟧_K(y) := Σ_{i=1..n} ⟦e₁⟧_K(aᵢ) · ⟦e₂⟧_K[x := aᵢ](y)
where the support (the set of elements with non-zero annotations) of ⟦e₁⟧_K is {a₁, ..., aₙ}

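Big Union Example With K = ℕ
Query: return $T/*/* (in NRC: ∪(x ∈ $T) ∪(y ∈ x) {y})
Source, $T: [tree diagram; recoverable annotations b^2 and c^3, plus unannotated b and c nodes; shown as equivalent (≡) to a tree with the annotated nodes repeated]
Answer: c^7 ≡ c, c, c, c, c, c, c
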
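The big-union semantics can be sketched directly for K = ℕ. The source used below is an assumed reading of the figure: a b element annotated 2 whose c child is annotated 3, and a b element annotated 1 whose c child is annotated 1. It is chosen because it reproduces the annotation 7 in the answer. The names `big_union`, `tree` and `children` are hypothetical helpers.

```python
def big_union(coll, body):
    """Big union over a K-collection (K = N): for each element a with annotation k,
    evaluate body(a) and add k * m to every element of the result annotated m."""
    out = {}
    for a, k in coll.items():
        for y, m in body(a).items():
            out[y] = out.get(y, 0) + k * m
    return out

def tree(label, children=()):
    """Hashable tree: a label plus a frozenset of (subtree, annotation) pairs."""
    return (label, frozenset(children))

def children(t):
    """The K-collection of child subtrees of t, as a dict."""
    return dict(t[1])

c = tree("c")
b_heavy = tree("b", [(c, 3)])   # b whose c child has annotation 3
b_light = tree("b", [(c, 1)])   # b whose c child has annotation 1

# Assumed source collection: b_heavy with annotation 2, b_light with annotation 1.
T = {b_heavy: 2, b_light: 1}

# NRC form: ∪(x ∈ T) ∪(y ∈ x) {y}
answer = big_union(T, lambda x: big_union(children(x), lambda y: {y: 1}))
```

Here the single result element c receives 2·3 + 1·1 = 7, which corresponds to seven copies of c under bag semantics.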
XPath Descendant Operator Uses srt
//* applied to forest $T translates to
  ∪(x ∈ $T) π₁((srt(b, s). f) x)
where
  f := let self = Tree(b, ∪(x ∈ s) {π₂(x)}) in
       let matches = ∪(x ∈ s) {π₁(x)} in
       (matches ∪ {self}, self)
//a is similar to the above

Application: Security Clearances
Data annotated with clearance levels from total order C: P < C < S < T < 0
Joint use of data (·) requires access to both (max of clearances); alternative use of data (+) requires access to either (min of clearances)
(C, min, max, 0, P) is a commutative semiring
Query: element p { $S/*/* }
Source, $S: [tree diagram; nodes a^P, b^C, c^C, d^C, d^S, e^T]
Answer: p with children d^{min(max(P,C,C), max(P,C,S))} = d^C and e^{max(P,C,T)} = e^T

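The two annotations in the answer can be recomputed with a short sketch of the clearance semiring. The encoding of levels as strings, and the names `LEVELS`, `RANK`, `plus`, `times`, are assumptions for illustration. The final check brute-forces distributivity and the identity laws over all triples of levels.

```python
from itertools import product

# Total order P < C < S < T < 0, where "0" plays the role of "absent".
LEVELS = ["P", "C", "S", "T", "0"]
RANK = {lvl: i for i, lvl in enumerate(LEVELS)}

def plus(*ks):
    """Alternative use: access to either suffices, so take the minimum clearance."""
    return min(ks, key=RANK.__getitem__)

def times(*ks):
    """Joint use: access to all is required, so take the maximum clearance."""
    return max(ks, key=RANK.__getitem__)

ZERO, ONE = "0", "P"   # identities of plus and times, respectively

# Annotations from the example answer.
d = plus(times("P", "C", "C"), times("P", "C", "S"))
e = times("P", "C", "T")

# Brute-force check of distributivity and the identity/annihilator laws.
laws_hold = all(
    times(a, plus(b, c)) == plus(times(a, b), times(a, c))
    and plus(a, ZERO) == a
    and times(a, ONE) == a
    and times(a, ZERO) == ZERO
    for a, b, c in product(LEVELS, repeat=3)
)
```

The d element is reachable along two paths, so its clearance is the lower of the two path clearances, while e has a single path through a T-level node.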
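Security Condition: Non-Interference
For any given clearance level (e.g., C), want the following diagram to commute:
[Diagram: running the query and then erasing everything annotated > C gives the same result as erasing everything annotated > C and then running the query]
Source: a^P, b^C, c^C, d^C, d^S, e^T; after erase > C: a^P, b^C, c^C, d^C
Answer: p^P, d^C, e^T; after erase > C: p^P, d^C
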
Application: Incomplete XML
Data annotated with Boolean expressions; tree T represents a set of possible worlds Mod(T)
T = [tree diagram; annotated nodes c^{y1}, c^{y2}, c^{y3}; unannotated nodes a, b, d]
Mod(T) = { [tree diagrams], ... }, 7 possible worlds

Correctness: Possible Worlds
For every incomplete tree T, and every UXQuery query q, want this diagram to commute:
  T    --Mod-->  Mod(T)
  | q              | q
  v                v
  q(T) --Mod-->  q(Mod(T)) = Mod(q(T))

Commutation with Homomorphisms
Theorem: Let h : K₁ → K₂ be a semiring homomorphism. Then for any UXQuery query q, and for any K₁-UXML document D, we have h(q(D)) = q(h(D)).
Ex: security clearances
  h_c : C → C, h_c(k) := if k ≤ c then k else 0
Ex: incomplete dbs
  ν : B → 𝔹, Eval_ν : PosBool(B) → 𝔹
Ex: duplicate elimination
  δ : ℕ → 𝔹, δ(k) := if k = 0 then ⊥ else ⊤

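The theorem can be spot-checked on a toy case with the duplicate-elimination homomorphism δ : ℕ → 𝔹. The sketch below is an illustration on one query and one document, not a proof. The two-level document encoding (distinctly named elements, each with an annotation and a collection of annotated child labels), the `Semiring` record, and the functions `query`, `map_doc`, `map_result` are simplifying assumptions. Because the outer elements have distinct names, no merging is needed after applying h.

```python
from dataclasses import dataclass
from typing import Any, Callable

@dataclass(frozen=True)
class Semiring:
    zero: Any
    one: Any
    add: Callable[[Any, Any], Any]
    mul: Callable[[Any, Any], Any]

NAT = Semiring(0, 1, lambda a, b: a + b, lambda a, b: a * b)
BOOL = Semiring(False, True, lambda a, b: a or b, lambda a, b: a and b)

def delta(k):
    """Duplicate elimination: 0 maps to false, everything else to true."""
    return k != 0

def query(K, doc):
    """Collect grandchildren, in the spirit of $T/*/*:
    multiply annotations along a path, add across paths, drop zero results."""
    out = {}
    for k, kids in doc.values():
        for label, m in kids.items():
            out[label] = K.add(out.get(label, K.zero), K.mul(k, m))
    return {label: k for label, k in out.items() if k != K.zero}

def map_doc(h, doc):
    """Apply h to every annotation in the document."""
    return {name: (h(k), {lbl: h(m) for lbl, m in kids.items()})
            for name, (k, kids) in doc.items()}

def map_result(h, K2, res):
    """Apply h to every annotation in a query result, dropping zeros of K2."""
    return {lbl: h(k) for lbl, k in res.items() if h(k) != K2.zero}

# An N-annotated document: name -> (annotation, {child label -> annotation}).
D = {
    "b1": (2, {"c": 3}),
    "b2": (1, {"c": 1, "d": 0}),
    "b3": (0, {"e": 4}),
}

lhs = map_result(delta, BOOL, query(NAT, D))   # h(q(D))
rhs = query(BOOL, map_doc(delta, D))           # q(h(D))
```

Both sides yield the same Boolean-annotated result: evaluating under bag semantics and then eliminating duplicates agrees with eliminating duplicates first and evaluating under set semantics.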
Related Work
– Bag semantics for NRC [Libkin&Wong 97]
– Incomplete XML [Kanza+ 99, Abiteboul+ 06]
– Probabilistic XML [Nierman&Jagadish 02, van Keulen+ 05, Abit.&Senellart 06, Sen.&Abit. 07, Hung+ 07]
– XML provenance [Buneman+ 01]
– NRC provenance [Hidders+ 07]
– Semiring-annotated XPath [Grahne+ 07]
– Negation, expressiveness of RA_K [Geerts&Poggi 08]

Conclusion
– We showed how to annotate unordered XML trees (complex values) with values from a commutative semiring K, and how to propagate those annotations in queries for a large, positive fragment of XQuery (NRC + srt)
– We saw novel applications in security and incomplete dbs, made possible by a fundamental property of our framework: commutation with homomorphisms

Future Work
Practical applications based on the framework
– Security clearances
– Jointly recording provenance, security, multiplicities, uncertainty, etc. (product of semirings is also a semiring!)
Query optimization: containment/equivalence wrt annotated semantics depends on K
– In the paper, we show K-equivalence for UXQuery is the same as 𝔹-equivalence when K is a distributive lattice

K-UXQuery Syntax