Daniel Deutch (Tel Aviv Univ.), Tova Milo (Tel Aviv Univ.), Sudeepa Roy (Univ. of Washington), Val Tannen (Univ. of Pennsylvania)
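“Boolean Provenance/Lineage” as a Boolean formula: Q is true on D $\Leftrightarrow$ $F_{Q,D}$ is true
– Poly-size, poly-time computable (data complexity)
– But Q is an RA+ query
This talk: What if Q is a Datalog program?
Database D (annotation variable in parentheses after each tuple):
  AsthmaPatient: Ann ($x_1$), Bob ($x_2$)
  Friend: (Ann, Joe) ($y_1$), (Ann, Tom) ($y_2$), (Bob, Tom) ($y_3$)
  Smoker: Joe ($z_1$), Tom ($z_2$)
Boolean query Q: $\exists x\, \exists y\; \mathrm{AsthmaPatient}(x) \wedge \mathrm{Friend}(x, y) \wedge \mathrm{Smoker}(y)$
$F_{Q,D} = (x_1 \wedge y_1 \wedge z_1) \vee (x_1 \wedge y_2 \wedge z_2) \vee (x_2 \wedge y_3 \wedge z_2)$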
Provenance
– Reliability and repeatability
– View management and deletion propagation
– Trust and security management
– Query answering in probabilistic databases, …
Datalog
– Datalog is popular again! (two keynotes at this ICDT/EDBT)
– Data extraction on the Web, declarative networking
– Academic/commercial systems (Webdamlog, LogicBlox, Dedalus, Dyna)
Finding a suitable “Provenance for Datalog” is important
– Both from theoretical and practical viewpoints
How do we compute, store, and interpret provenance for datalog programs efficiently and effectively?
Can we get poly-size Boolean formulas for datalog provenance?
– No, even if we allow unbounded time
Do we have a solution?
– Yes! Use Boolean circuits!
What about general “provenance semirings” beyond Boolean provenance? [Green et al. ’07]
– It depends on the semiring
– Background
– Circuits for Boolean Provenance
– Circuits for General Provenance Semirings
Datalog program for Transitive Closure and Single-source Reachability
  T(x, y) :- R(x, y)
  T(x, y) :- R(x, z), T(z, y)
  S(x) :- T(a, x)
EDB (Extensional Database): base relation for edges, R
IDB (Intensional Database): derived relations
– Transitive closure (T)
– Single-source reachability from vertex ‘a’ (S)
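Tuples are annotated with variables from a set X
– Here $X = \{x_1, x_2, y_1, y_2, \ldots\}$
For n annotated tuples (n variables in X), there are $2^n$ possible worlds, given by the assignments $\sigma : X \to \{\text{True}, \text{False}\}$
Useful in query evaluation on incomplete or probabilistic databases
PosBool(X)-database D (annotation variable in parentheses after each tuple):
  AsthmaPatient: Ann ($x_1$), Bob ($x_2$)
  Friend: (Ann, Joe) ($y_1$), (Ann, Tom) ($y_2$), (Bob, Tom) ($y_3$)
  Smoker: Joe ($z_1$), Tom ($z_2$)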
Annotation propagates from input to output
– Join = $\wedge$, Projection/Union = $\vee$
Output tuples are annotated by monotone Boolean formulas
– $F_{Q,D}$ is the annotation of the unique output tuple
PosBool(X)-database D: AsthmaPatient {Ann ($x_1$), Bob ($x_2$)}, Friend {(Ann, Joe) ($y_1$), (Ann, Tom) ($y_2$), (Bob, Tom) ($y_3$)}, Smoker {Joe ($z_1$), Tom ($z_2$)}
RA+ query Q: $\exists x\, \exists y\; \mathrm{AsthmaPatient}(x) \wedge \mathrm{Friend}(x, y) \wedge \mathrm{Smoker}(y)$
$F_{Q,D} = (x_1 \wedge y_1 \wedge z_1) \vee (x_1 \wedge y_2 \wedge z_2) \vee (x_2 \wedge y_3 \wedge z_2)$
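For every RA+ query Q, database D, and assignment $\sigma : X \to \{\text{True}, \text{False}\}$:
1. (Faithful representation) $Q(D^{\sigma}) = \sigma[Q(D)]$, where $D^{\sigma}$ keeps exactly the tuples whose variables $\sigma$ sets to True
2. (Poly-size overhead) The size of $F_{Q,D}$ is polynomial in |D|, and $F_{Q,D}$ can be computed in poly-time
PosBool(X)-database D: AsthmaPatient {Ann ($x_1$), Bob ($x_2$)}, Friend {(Ann, Joe) ($y_1$), (Ann, Tom) ($y_2$), (Bob, Tom) ($y_3$)}, Smoker {Joe ($z_1$), Tom ($z_2$)}
RA+ query Q: $\exists x\, \exists y\; \mathrm{AsthmaPatient}(x) \wedge \mathrm{Friend}(x, y) \wedge \mathrm{Smoker}(y)$
$F_{Q,D} = (x_1 \wedge y_1 \wedge z_1) \vee (x_1 \wedge y_2 \wedge z_2) \vee (x_2 \wedge y_3 \wedge z_2)$
Example (figure): a True/False assignment to the tuple variables under which $F_{Q,D}$ evaluates to False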
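The faithful-representation property can be checked by brute force on the running example. The sketch below is only an illustration: the dictionary encoding of D and the helper names (`provenance_dnf`, `eval_query_on_world`) are assumptions made here, not part of the talk.

```python
from itertools import product

# Tuples of D with their annotation variables (taken from the slide).
asthma = {"Ann": "x1", "Bob": "x2"}
friend = {("Ann", "Joe"): "y1", ("Ann", "Tom"): "y2", ("Bob", "Tom"): "y3"}
smoker = {"Joe": "z1", "Tom": "z2"}
variables = ["x1", "x2", "y1", "y2", "y3", "z1", "z2"]

def provenance_dnf():
    """Join = AND, projection/union = OR: one clause per joining triple."""
    return [
        (asthma[x], ann, smoker[y])
        for (x, y), ann in friend.items()
        if x in asthma and y in smoker
    ]

def eval_formula(dnf, sigma):
    """Evaluate the DNF under the assignment sigma."""
    return any(all(sigma[v] for v in clause) for clause in dnf)

def eval_query_on_world(sigma):
    """Evaluate Q directly on the world keeping only tuples set to True."""
    a = {t for t, v in asthma.items() if sigma[v]}
    f = {t for t, v in friend.items() if sigma[v]}
    s = {t for t, v in smoker.items() if sigma[v]}
    return any(x in a and y in s for (x, y) in f)

F = provenance_dnf()

# Compare both evaluations on all 2^7 possible worlds.
faithful = True
for bits in product([False, True], repeat=len(variables)):
    sigma = dict(zip(variables, bits))
    if eval_formula(F, sigma) != eval_query_on_world(sigma):
        faithful = False
```

For seven variables the loop visits 128 worlds; the exhaustive check is only meant to make the definition concrete, since the point of provenance is to avoid re-evaluating the query per world.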
Semantics using Derivation Trees (Green et al. 2007)
  T(x, y) :- R(x, y)
  T(x, y) :- R(x, z), T(z, y)
  S(x) :- T(a, x)
R: (a, a) annotated p, (a, b) annotated q (graph: self-loop on a labeled p, edge a → b labeled q)
Annotation of T(a, b): $\bigvee_{\text{trees } \tau} \bigwedge_{\text{leaves } t \text{ of } \tau} \mathrm{Annot}(t) = q \vee (p \wedge q) \vee (p \wedge p \wedge q) \vee \ldots = q$
Derivation trees of T(a, b) (figure): T(a, b) from R(a, b); T(a, b) from R(a, a) and T(a, b), which in turn comes from R(a, b); and so on with more uses of R(a, a)
– Infinitely many trees
– But always has a finite equivalent form (here, q)
– But not necessarily poly-size
Theorem: Given PosBool(X)-database D and datalog program P, provenance of tuples in P(D) cannot, in general, have a faithful representation using Boolean formulas of size polynomial in |D|
Proof outline:
– st-connectivity on n nodes requires monotone Boolean formulas of size $n^{\Omega(\log n)}$ [Karchmer-Wigderson, 1988]
– Faithful representation requires: for all True/False assignments $\sigma$ to X, $P(D^{\sigma}) = \sigma[P(D)]$
– Reduce to the hard instance by choosing the right $\sigma$ when P = transitive closure
Solution: Boolean circuits!
– Background
– Circuits for Boolean Provenance or PosBool(X)
– Circuits for General Provenance Semirings
Circuit is a DAG
– Uses common subexpressions
– Boolean formula = tree
Leaf nodes:
– EDB vars in X
Internal nodes:
– $\wedge$: IDB/EDB vars used in one derivation
– $\vee$: alternative derivations
Roots:
– IDB vars
Example:
  T(x, y) :- R(x, y)
  T(x, y) :- R(x, z), T(z, y)
  S(x) :- T(a, x)
R: (a, a) annotated p, (a, b) annotated q
Circuit (figure): root $X_{T(a,b)}$ is an $\vee$ of q (from $X_{R(a,b)}$) and an $\wedge$ of p (from $X_{R(a,a)}$) with $X_{T(a,b)}$
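The gap between a DAG and a tree can be made concrete with a small size count. The chain below is a hypothetical example chosen here for illustration (it does not come from the talk): each layer reuses the previous gate twice, so the circuit grows linearly while the equivalent formula, which must duplicate shared parts, grows exponentially.

```python
def build_chain(k):
    """g_0 = x0; g_i = (g_{i-1} AND a_i) OR (g_{i-1} AND b_i)."""
    g = ("var", "x0")
    for i in range(1, k + 1):
        left = ("and", g, ("var", f"a{i}"))
        right = ("and", g, ("var", f"b{i}"))
        g = ("or", left, right)
    return g

def circuit_size(node, seen=None):
    """Count distinct nodes: a shared sub-circuit is counted once."""
    seen = set() if seen is None else seen
    if id(node) in seen:
        return 0
    seen.add(id(node))
    children = [c for c in node[1:] if isinstance(c, tuple)]
    return 1 + sum(circuit_size(c, seen) for c in children)

def formula_size(node):
    """Count nodes of the tree obtained by copying shared sub-circuits."""
    children = [c for c in node[1:] if isinstance(c, tuple)]
    return 1 + sum(formula_size(c) for c in children)

chain = build_chain(10)
sizes = (circuit_size(chain), formula_size(chain))
```

Sharing is detected by object identity, so the count reflects how the structure was built rather than any clever simplification.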
Theorem: Given any PosBool(X)-database D and datalog program P, provenance of tuples in P(D) can be faithfully represented using monotone Boolean circuits of poly-size in |D| (and can be computed in poly-time)
Two key ideas from previous work
1. Datalog provenance can be represented by a system of equations, obtained by instantiating the vars in the datalog program P to EDB/IDB tuples [Green et al. 2007]
– EDB tuples → constants, IDB tuples → variables
– Iteratively solve this system of equations
– Fixpoint = provenance for all IDB tuples
2. A system of equations with N Boolean variables can be solved in N+1 iterations [Esparza et al. 2011]
– N = #IDB tuples
– Build a circuit with N+1 layers from the system of equations
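As a rough illustration of the second idea, the sketch below iterates the equations of the running example from the all-False assignment for N + 1 rounds, for one fixed truth assignment to the EDB variables. The DNF encoding of the right-hand sides and the function name `solve` are assumptions made for this example.

```python
# Equations of the running example, right-hand sides in DNF:
# a list of clauses, each clause a list of EDB constants / IDB variables.
equations = {
    "T(a,a)": [["p"], ["p", "T(a,a)"]],
    "T(a,b)": [["q"], ["p", "T(a,b)"]],
    "S(a)": [["T(a,a)"]],
    "S(b)": [["T(a,b)"]],
}

def solve(equations, edb):
    """Iterate from all-False; N + 1 rounds for N IDB variables."""
    idb = {v: False for v in equations}
    for _ in range(len(equations) + 1):
        env = {**edb, **idb}
        # Every IDB variable is recomputed from the previous round's values.
        idb = {
            v: any(all(env[u] for u in clause) for clause in rhs)
            for v, rhs in equations.items()
        }
    return idb

world = solve(equations, {"p": False, "q": True})
```

Here S(b) becomes True only in the second round, because it depends on the value of T(a,b) from the round before; this lag is what the layers of the circuit capture.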
  T(x, y) :- R(x, y)
  T(x, y) :- R(x, z), T(z, y)
  S(x) :- T(a, x)
R: (a, a) annotated p, (a, b) annotated q
Step 1: Build the system of equations by all possible instantiations $x, y, z \in \{a, b\}$ (IDB tuples are variables, EDB tuples are constants):
  $X_{T(a,a)} = p \vee (p \wedge X_{T(a,a)})$
  $X_{T(a,b)} = q \vee (p \wedge X_{T(a,b)})$
  $X_{S(b)} = X_{T(a,b)}$
  $X_{S(a)} = X_{T(a,a)}$
Step 2: Build a circuit with N+1 layers (here N = 4)
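Layered circuit for the system of equations (figure):
– Level 0: $X_{T(a,a),0}$, $X_{T(a,b),0}$, $X_{S(a),0}$, $X_{S(b),0}$, all set to false (assign leaf IDB vars to false)
– Inputs: EDB variables p, q
– Level 1: $X_{T(a,a),1}$, $X_{T(a,b),1}$, $X_{S(a),1}$, $X_{S(b),1}$
– Level 2: $X_{T(a,a),2}$, $X_{T(a,b),2}$, $X_{S(a),2}$, $X_{S(b),2}$
– Multiple roots for multiple IDB vars
System of equations:
  $X_{T(a,a)} = p \vee (p \wedge X_{T(a,a)})$
  $X_{T(a,b)} = q \vee (p \wedge X_{T(a,b)})$
  $X_{S(b)} = X_{T(a,b)}$
  $X_{S(a)} = X_{T(a,a)}$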
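The layered construction can be sketched as follows, assuming the equations are given in DNF. Gate names such as `T(a,b)@3` and the functions `build_layered_circuit` and `evaluate` are invented for this illustration; the sketch uses a level 0 of false leaves followed by N + 1 computed levels, which is one possible way to count the layers.

```python
equations = {
    "T(a,a)": [["p"], ["p", "T(a,a)"]],
    "T(a,b)": [["q"], ["p", "T(a,b)"]],
    "S(a)": [["T(a,a)"]],
    "S(b)": [["T(a,b)"]],
}

def build_layered_circuit(equations):
    """Return (gates, roots); gates maps a gate name to (op, inputs)."""
    n = len(equations)
    # Level 0: every IDB variable starts as the constant false.
    gates = {f"{v}@0": ("false", []) for v in equations}
    for level in range(1, n + 2):  # levels 1 .. N + 1
        for v, rhs in equations.items():
            ands = []
            for k, clause in enumerate(rhs):
                gate = f"{v}@{level}/and{k}"
                inputs = [
                    f"{u}@{level - 1}" if u in equations else u
                    for u in clause
                ]
                gates[gate] = ("and", inputs)
                ands.append(gate)
            gates[f"{v}@{level}"] = ("or", ands)
    roots = {v: f"{v}@{n + 1}" for v in equations}
    return gates, roots

def evaluate(gates, edb):
    """Single pass in insertion order, a topological order here."""
    val = dict(edb)
    for gate, (op, inputs) in gates.items():
        if op == "false":
            val[gate] = False
        elif op == "and":
            val[gate] = all(val[i] for i in inputs)
        else:
            val[gate] = any(val[i] for i in inputs)
    return val

gates, roots = build_layered_circuit(equations)
val = evaluate(gates, {"p": False, "q": True})
answers = {v: val[r] for v, r in roots.items()}
```

The circuit has one OR gate per IDB variable per level plus one AND gate per clause, so its size is (N + 1) times the size of the system, which stays polynomial in |D|.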
1. Store only two levels of the circuit instead of N+1 levels
– Evaluate iteratively
2. Embed circuit construction in semi-naïve evaluation
– Check for new derivations, not only new IDB variables
– Sound and complete
3. Remove self-dependency of IDB vars
– Works for PosBool(X) and also some other semirings…
System of equations:
  $X_{T(a,a)} = p \vee (p \wedge X_{T(a,a)})$
  $X_{T(a,b)} = q \vee (p \wedge X_{T(a,b)})$
  $X_{S(b)} = X_{T(a,b)}$
  $X_{S(a)} = X_{T(a,a)}$
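With all these optimizations (figure): a two-level circuit
– Bottom level: $X_{T(a,a),\text{bottom}}$, $X_{T(a,b),\text{bottom}}$, $X_{S(a),\text{bottom}}$
– Inputs: EDB variables p, q
– Top level: $X_{T(a,a),\text{top}}$, $X_{T(a,b),\text{top}}$, $X_{S(a),\text{top}}$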
Linear-time deletion propagation (in circuit size)
Approximation for probabilistic databases
– Even when only the circuit (and not the database) is available
Circuits can be computed “offline”
– Only linear-time evaluation is required when needed (e.g. deletion propagation), compared to storing and solving a system of equations iteratively, or re-evaluating the datalog program
Can use existing techniques for efficient and parallel circuit evaluation
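Deletion propagation on a stored circuit amounts to setting the variables of the deleted EDB tuples to False and evaluating the circuit once. The circuit below is a hypothetical example (edges a → b, b → c, a → c annotated e1, e2, e3), and `propagate_deletion` is a name chosen for this sketch.

```python
# Hypothetical circuit, listed in topological order: gate -> (op, inputs).
# T is the transitive closure of edges a->b (e1), b->c (e2), a->c (e3).
circuit = {
    "T(a,b)": ("or", ["e1"]),
    "T(b,c)": ("or", ["e2"]),
    "g1": ("and", ["e1", "T(b,c)"]),
    "T(a,c)": ("or", ["e3", "g1"]),
}
edb_vars = ["e1", "e2", "e3"]
roots = ["T(a,b)", "T(b,c)", "T(a,c)"]

def propagate_deletion(circuit, edb_vars, roots, deleted):
    """Return the IDB tuples that disappear when `deleted` is removed."""
    val = {v: v not in deleted for v in edb_vars}
    # One pass over the gates: linear in the size of the circuit.
    for gate, (op, inputs) in circuit.items():
        bits = [val[i] for i in inputs]
        val[gate] = all(bits) if op == "and" else any(bits)
    return [r for r in roots if not val[r]]

lost = propagate_deletion(circuit, edb_vars, roots, {"e1", "e3"})
```

Deleting only e3 removes nothing, because T(a,c) still has the derivation through e1 and e2; the database itself is never consulted.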
– Background
– Circuits for Boolean Provenance or PosBool(X)
– Circuits for General Provenance Semirings
$(K, +_K, \cdot_K, 0_K, 1_K)$
– Domain K
– $+_K$, $\cdot_K$: associative, commutative, with neutral elements $0_K$, $1_K$
– $\cdot_K$ distributes over $+_K$, i.e. $a \cdot_K (b +_K c) = a \cdot_K b +_K a \cdot_K c$
– $0_K$ annihilates any element of K, i.e. $a \cdot_K 0_K = 0_K \cdot_K a = 0_K$
Examples:
– $(\mathbb{B}, \vee, \wedge, \text{False}, \text{True})$: set semantics
– $(\mathbb{N}, +, \cdot, 0, 1)$: bag semantics
– $(\mathbb{N} \cup \{\infty\}, \min, +, \infty, 0)$: tropical semiring, to compute cost (e.g. cost of a shortest path)
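The three example semirings can be written down directly. The `Semiring` record and the helper `sum_of_products` are assumptions made for this sketch; the helper combines the annotations within each derivation with the product and combines alternative derivations with the sum.

```python
from dataclasses import dataclass
from functools import reduce
from typing import Any, Callable
import math

@dataclass(frozen=True)
class Semiring:
    plus: Callable[[Any, Any], Any]
    times: Callable[[Any, Any], Any]
    zero: Any
    one: Any

BOOLEAN = Semiring(lambda a, b: a or b, lambda a, b: a and b, False, True)
BAG = Semiring(lambda a, b: a + b, lambda a, b: a * b, 0, 1)
TROPICAL = Semiring(min, lambda a, b: a + b, math.inf, 0)

def sum_of_products(K, derivations):
    """Product within a derivation, sum over alternative derivations."""
    products = (reduce(K.times, d, K.one) for d in derivations)
    return reduce(K.plus, products, K.zero)

# Two derivations of one output tuple: the first uses two input tuples,
# the second uses a single input tuple.
as_set = sum_of_products(BOOLEAN, [[True, False], [True]])
as_bag = sum_of_products(BAG, [[2, 3], [4]])
as_cost = sum_of_products(TROPICAL, [[2, 3], [4]])
```

In the tropical case the result is the cost of the cheapest derivation, which is how the same expression yields shortest-path costs.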
Generalization of PosBool(X): $(K, +_K, \cdot_K, 0_K, 1_K)$
– Tuples are annotated with variables from X
– K is of the form Prov(X)
– $+_K$ denotes alternative usage
– $\cdot_K$ denotes joint usage
Examples:
– $(\mathrm{PosBool}(X), \vee, \wedge, \text{False}, \text{True})$
– $(\mathrm{Lin}(X), +, \cdot, \bot, \emptyset)$, where + and $\cdot$ both act as union on sets of variables: tracks contributing tuples [Cui et al. ’00]
– $(\mathrm{Why}(X), \cup, \Cup, \emptyset, \{\emptyset\})$, where $\Cup$ is the pairwise union of subsets: tracks contributing tuples in alternative derivations [Buneman et al. ’01]
Key property needed for applications like deletion propagation, trust management, cost computation, …
Prov(X) specializes correctly to K if any valuation $v : X \to K$ extends uniquely to a homomorphism $h_v : \mathrm{Prov}(X) \to K$ (which correctly maps +, $\cdot$ of Prov(X) to those of K)
Further, some provenance semirings are “more informative” than others
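For polynomial provenance in N[X], the homomorphism $h_v$ simply evaluates the polynomial in K. The sketch below assumes a polynomial stored as a map from monomials (tuples of variables, repeated for exponents) to natural-number coefficients; `specialize` and the sample valuations are illustrative choices, applied to the provenance of the running RA+ example.

```python
from collections import namedtuple
from functools import reduce
import math

Semiring = namedtuple("Semiring", "plus times zero one")
BOOLEAN = Semiring(lambda a, b: a or b, lambda a, b: a and b, False, True)
BAG = Semiring(lambda a, b: a + b, lambda a, b: a * b, 0, 1)
TROPICAL = Semiring(min, lambda a, b: a + b, math.inf, 0)

def specialize(poly, K, v):
    """Apply h_v: variables become v[x], + becomes K.plus, * K.times."""
    total = K.zero
    for monomial, coeff in poly.items():
        term = reduce(K.times, (v[x] for x in monomial), K.one)
        for _ in range(coeff):  # a coefficient c stands for a c-fold sum
            total = K.plus(total, term)
    return total

# x1*y1*z1 + x1*y2*z2 + x2*y3*z2
poly = {
    ("x1", "y1", "z1"): 1,
    ("x1", "y2", "z2"): 1,
    ("x2", "y3", "z2"): 1,
}
names = ["x1", "x2", "y1", "y2", "y3", "z1", "z2"]

present = {**{x: True for x in names}, "x2": False, "y1": False}
in_set = specialize(poly, BOOLEAN, present)
in_bag = specialize(poly, BAG, {x: 2 for x in names})
costs = {"x1": 1, "x2": 3, "y1": 5, "y2": 2, "y3": 1, "z1": 1, "z2": 1}
cheapest = specialize(poly, TROPICAL, costs)
```

The same stored expression answers a deletion question, a multiplicity question, and a cost question, which is the practical meaning of specializing correctly.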
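Figure: hierarchy of provenance semirings N[X], Why(X), Lin(X), PosBool(X), and Sorp(X) (defined later), ordered from more informative to less informative, with arrows marking which ones specialize correctly to the application semirings Tropical, N (bag), Security, and Boolean (set)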
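PosBool(X): $\bigvee_{\text{trees } \tau} \bigwedge_{\text{leaves } t \text{ of } \tau} \mathrm{Annot}(t)$
General Prov(X): $\sum_{\text{trees } \tau} \prod_{\text{leaves } t \text{ of } \tau} \mathrm{Annot}(t)$, with the sum taken in $+_K$ and the product in $\cdot_K$
Infinite sums should be well-defined
– Need to consider “$\omega$-continuous semirings” and “$\omega$-continuous homomorphisms”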
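Figure: the same hierarchy of semirings (N[X], Why(X), Lin(X), PosBool(X), Sorp(X); Tropical, N (bag), Security, Boolean (set))
– Finite semirings are $\omega$-continuous
– Need to add $\infty$: $\mathbb{N}^{\infty}[[X]]$ and $\mathbb{N}^{\infty}$
– $\mathbb{N}^{\infty}[[X]]$: most informative provenance semiring [Green et al. ’07]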
Poly-size overhead is not valid because of the infinite sum
But can outputs have finite annotations (with X, $\cdot$, +) that specialize correctly to semirings with finite domains?
Theorem: It is not possible to annotate the output of datalog programs following $\mathbb{N}^{\infty}[[X]]$-semantics with finite provenance expressions that specialize “correctly” to the semiring Why(X)
– Finite annotations won’t specialize correctly to Why(X)
Theorem: However, we can generate poly-size circuits in poly-time directly for Why(X)
– Need more levels in the circuit from the system of equations
– Need a different argument for correctness
We propose Sorp(X)
– Most general absorptive semiring: $a + a \cdot b = a$
– N[X], but keep only the monomials that are not “absorbed” by the others, e.g. $pq + p^2q^3 = pq$, whereas $p^2q + pq^2$ stays $p^2q + pq^2$
The same algorithm, proof, and optimizations to construct poly-size circuits hold
– Circuits are more general than Boolean circuits
1. Specializes correctly to interesting semirings
2. Outputs can be annotated by poly-size circuits
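The absorption rule can be sketched on monomials represented as maps from variables to positive exponents. This representation and the functions `divides` and `absorb` are assumptions made for illustration: a monomial is dropped when some other monomial divides it, which is what $a + a \cdot b = a$ says for polynomials.

```python
def divides(m, n):
    """Monomial m divides n: each exponent in m is <= the one in n."""
    return all(n.get(x, 0) >= e for x, e in m.items())

def absorb(monomials):
    """Keep only the monomials not absorbed by a different monomial."""
    # Absorption implies idempotence (a + a = a), so duplicates go first.
    unique = [dict(t) for t in {tuple(sorted(m.items())) for m in monomials}]
    kept = [
        m for m in unique
        if not any(o != m and divides(o, m) for o in unique)
    ]
    return sorted(kept, key=lambda m: sorted(m.items()))

# pq + p^2 q^3 reduces to pq.
first = absorb([{"p": 1, "q": 1}, {"p": 2, "q": 3}])
# p^2 q + p q^2 is unchanged: neither monomial divides the other.
second = absorb([{"p": 2, "q": 1}, {"p": 1, "q": 2}])
```

The empty monomial (the constant 1) divides everything, so any sum containing it collapses to 1, consistent with $1 + b = 1$ in an absorptive semiring.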
Data Provenance
– e.g. [Cui et al. ’00, Buneman et al. ’08, Cheney et al. ’09, Benjelloun et al. ’08]
Circuits
– Circuit complexity (size, depth, parallelism) has been studied for decades, e.g. [Arora-Barak ’09] (book)
Provenance for Datalog
– System of equations, derivation trees, infinite sum [Grahne ’91, Green et al. ’07]
– Poly-size c-tables with Boolean formulas for datalog with contradictions [Abiteboul et al. 2014]
Circuits to represent and store Datalog Provenance
– For PosBool(X) and other semirings
– Semantics, algorithms, limitations, applicability
– Preliminary experiments support our results: we compared circuits for deletion propagation against iteratively solving the system of equations and against re-evaluating the datalog program from scratch
Future Work:
– A complete implementation, evaluation, new applications
Thank you! Questions?