Daniel Deutch (Tel Aviv Univ.), Tova Milo (Tel Aviv Univ.), Sudeepa Roy (Univ. of Washington), Val Tannen (Univ. of Pennsylvania)

Presentation transcript:


- “Boolean provenance/lineage” as a Boolean formula
- Q is true on D $\Leftrightarrow$ $F_{Q,D}$ is true
- Poly-size, poly-time computable (data complexity)
- But Q is an RA+ query
- This talk: what if Q is a Datalog program?
Database D, with each tuple annotated by a variable:
  AsthmaPatient: Ann ($x_1$), Bob ($x_2$)
  Friend: (Ann, Joe) ($y_1$), (Ann, Tom) ($y_2$), (Bob, Tom) ($y_3$)
  Smoker: Joe ($z_1$), Tom ($z_2$)
Boolean query Q: $\exists x\, \exists y\; \mathrm{AsthmaPatient}(x) \wedge \mathrm{Friend}(x, y) \wedge \mathrm{Smoker}(y)$
$F_{Q,D} = (x_1 \wedge y_1 \wedge z_1) \vee (x_1 \wedge y_2 \wedge z_2) \vee (x_2 \wedge y_3 \wedge z_2)$

- Provenance
  - Reliability and repeatability
  - View management and deletion propagation
  - Trust and security management
  - Query answering in probabilistic databases, ...
- Datalog
  - Datalog is popular again! (two keynotes this ICDT/EDBT)
  - Data extraction on the Web, declarative networking
  - Academic/commercial systems (Webdamlog, LogicBlox, Dedalus, Dyna)
- Finding a suitable “provenance for Datalog” is important
  - Both from theoretical and practical viewpoints
- How do we compute, store, and interpret provenance for Datalog programs efficiently and effectively?

- Can we get poly-size Boolean formulas for Datalog provenance? No, even if we allow unbounded time.
- Do we have a solution? Yes! Use Boolean circuits!
- What about general “provenance semirings” beyond Boolean provenance (ref. [Green et al. ’07])? It depends on the semiring.

Outline:
- Background
- Circuits for Boolean Provenance
- Circuits for General Provenance Semirings

Datalog program for transitive closure and single-source reachability:
  T(x, y) :- R(x, y)
  T(x, y) :- R(x, z), T(z, y)
  S(x) :- T(a, x)
- EDB (extensional database, base) relation for edges: R
- IDB (intensional database, derived) relations:
  - Transitive closure (T)
  - Single-source reachability from vertex ‘a’ (S)
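The three rules can be read as a bottom-up fixpoint computation. The following is a minimal Python sketch of naive evaluation, not code from the talk: the function names and the edge set used in the usage lines are illustrative assumptions.

```python
def transitive_closure(R):
    """Naive bottom-up evaluation of the two T rules over an edge set R."""
    T = set(R)  # T(x, y) :- R(x, y)
    while True:
        # T(x, y) :- R(x, z), T(z, y)
        new = {(x, y) for (x, z) in R for (z2, y) in T if z == z2} - T
        if not new:
            return T  # fixpoint reached: no rule derives a new tuple
        T |= new


def reachable_from(T, source):
    """S(x) :- T(source, x)."""
    return {y for (x, y) in T if x == source}


# Hypothetical edge relation, chosen only for illustration.
R = {("a", "b"), ("b", "c"), ("d", "a")}
T = transitive_closure(R)
S = reachable_from(T, "a")
```

With this edge set, S contains the vertices b and c, and T additionally contains the derived pairs (a, c), (d, b), and (d, c).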

- Tuples are annotated with variables from a set X
  - Here X = $\{x_1, x_2, y_1, y_2, \dots\}$
- For n annotated tuples, there are $2^n$ possible worlds, given by the assignments $\mu : X \to \{\mathrm{True}, \mathrm{False}\}$
- Useful in query evaluation on incomplete or probabilistic databases
PosBool(X)-database D:
  AsthmaPatient: Ann ($x_1$), Bob ($x_2$)
  Friend: (Ann, Joe) ($y_1$), (Ann, Tom) ($y_2$), (Bob, Tom) ($y_3$)
  Smoker: Joe ($z_1$), Tom ($z_2$)

- Annotation propagates from input to output
  - Join = $\wedge$, Projection/Union = $\vee$
- Output tuples are annotated by monotone Boolean formulas
  - $F_{Q,D}$ is the annotation of the unique output tuple
PosBool(X)-database D: AsthmaPatient, Friend, and Smoker, annotated as on the previous slide
RA+ query Q: $\exists x\, \exists y\; \mathrm{AsthmaPatient}(x) \wedge \mathrm{Friend}(x, y) \wedge \mathrm{Smoker}(y)$
$F_{Q,D} = (x_1 \wedge y_1 \wedge z_1) \vee (x_1 \wedge y_2 \wedge z_2) \vee (x_2 \wedge y_3 \wedge z_2)$
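To make the propagation rule concrete, here is a small Python sketch that computes the annotation of the output tuple for the example database: each successful join contributes a conjunction, and the projection collects the conjunctions into a disjunction. The dictionary encoding and function names are assumptions made for illustration.

```python
# Annotated relations from the slide: tuple -> provenance variable.
ASTHMA = {("Ann",): "x1", ("Bob",): "x2"}
FRIEND = {("Ann", "Joe"): "y1", ("Ann", "Tom"): "y2", ("Bob", "Tom"): "y3"}
SMOKER = {("Joe",): "z1", ("Tom",): "z2"}


def provenance_dnf():
    """Return F_{Q,D} as a list of clauses; each clause is a tuple of variables."""
    clauses = []
    for (x,), vx in ASTHMA.items():
        for (fx, fy), vy in FRIEND.items():
            if fx != x:
                continue
            for (s,), vz in SMOKER.items():
                if s == fy:
                    clauses.append((vx, vy, vz))  # join -> conjunction
    return clauses  # projection -> disjunction over all clauses


def show(clauses):
    """Pretty-print a list of clauses as a DNF formula."""
    return " ∨ ".join("(" + " ∧ ".join(c) + ")" for c in clauses)


print(show(provenance_dnf()))
```

The printed formula has the same three clauses as $F_{Q,D}$ on the slide.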

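For every RA+ query Q, database D, and assignment $\mu$:
1. (Faithful representation) $Q(D^{\mu}) = [Q(D)]^{\mu}$, where $D^{\mu}$ is the possible world that keeps exactly the tuples whose variables $\mu$ sets to True
2. (Poly-size overhead) The size of $F_{Q,D}$ is polynomial in |D|, and $F_{Q,D}$ can be computed in poly-time
PosBool(X)-database D: AsthmaPatient, Friend, and Smoker, annotated as on the previous slides
RA+ query Q: $\exists x\, \exists y\; \mathrm{AsthmaPatient}(x) \wedge \mathrm{Friend}(x, y) \wedge \mathrm{Smoker}(y)$
$F_{Q,D} = (x_1 \wedge y_1 \wedge z_1) \vee (x_1 \wedge y_2 \wedge z_2) \vee (x_2 \wedge y_3 \wedge z_2)$
Example: the slide substitutes True/False values for the variables and obtains $F_{Q,D} = \mathrm{False}$.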
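Faithfulness can be checked exhaustively on the small example. The sketch below, written under the assumption that a possible world keeps exactly the tuples whose variable is assigned True, compares running Q on each of the $2^7$ worlds with evaluating $F_{Q,D}$ under the same assignment. All names are illustrative.

```python
from itertools import product

ASTHMA = {"Ann": "x1", "Bob": "x2"}
FRIEND = {("Ann", "Joe"): "y1", ("Ann", "Tom"): "y2", ("Bob", "Tom"): "y3"}
SMOKER = {"Joe": "z1", "Tom": "z2"}
VARS = sorted(set(ASTHMA.values()) | set(FRIEND.values()) | set(SMOKER.values()))


def query_on_world(mu):
    """Evaluate Q directly on the possible world selected by the assignment mu."""
    asthma = {t for t, v in ASTHMA.items() if mu[v]}
    friend = {t for t, v in FRIEND.items() if mu[v]}
    smoker = {t for t, v in SMOKER.items() if mu[v]}
    return any(x in asthma and y in smoker for (x, y) in friend)


def formula(mu):
    """Evaluate F_{Q,D} from the slide under the assignment mu."""
    return ((mu["x1"] and mu["y1"] and mu["z1"])
            or (mu["x1"] and mu["y2"] and mu["z2"])
            or (mu["x2"] and mu["y3"] and mu["z2"]))


def is_faithful():
    """Check that query and formula agree on every possible world."""
    for bits in product([False, True], repeat=len(VARS)):
        mu = dict(zip(VARS, bits))
        if query_on_world(mu) != formula(mu):
            return False
    return True
```

This brute-force check only works for tiny databases; the point of the theorem is that the formula makes such enumeration unnecessary.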

- Semantics using derivation trees (Green et al. 2007)
Program:
  T(x, y) :- R(x, y)
  T(x, y) :- R(x, z), T(z, y)
  S(x) :- T(a, x)
R: (a, a) annotated p; (a, b) annotated q
- Annotation of T(a, b): $\bigvee_{\text{trees } \tau} \bigwedge_{\text{leaves } t \text{ of } \tau} \mathrm{Annot}(t) = q \vee (p \wedge q) \vee (p \wedge p \wedge q) \vee \dots = q$
- Infinitely many trees
- But the annotation always has a finite equivalent form
- But not necessarily a poly-size one
[Figure: derivation trees for T(a, b), each built from R(a, b) and zero or more uses of R(a, a).]
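The collapse of the infinite disjunction to q can be illustrated on a finite prefix of the derivation trees. In this sketch (an illustration, not the authors' construction) the tree that uses R(a, a) exactly k times contributes the leaves p, ..., p, q; idempotence removes repeated variables and absorption removes clauses that contain a smaller clause.

```python
def tree_annotations(max_uses):
    """Leaf annotations of the derivation trees of T(a, b) that use R(a, a)
    at most max_uses times (one tree per number of uses)."""
    return [("p",) * k + ("q",) for k in range(max_uses + 1)]


def posbool_collapse(annotations):
    """Simplify the disjunction of the given conjunctions in PosBool(X)."""
    clauses = {frozenset(a) for a in annotations}  # idempotence: p ∧ p = p
    # absorption: drop any clause that strictly contains another clause
    return {c for c in clauses if not any(d < c for d in clauses)}


collapsed = posbool_collapse(tree_annotations(5))
```

However many trees are included, the simplified formula is the single clause q.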

Theorem: Given a PosBool(X)-database D and a Datalog program P, the provenance of tuples in P(D) cannot, in general, be faithfully represented by Boolean formulas of size polynomial in |D|.
Proof outline:
- st-connectivity on n nodes requires monotone Boolean formulas of size $n^{\Omega(\log n)}$ [Karchmer-Wigderson, 1988]
- Faithful representation requires: for all True/False assignments $\mu$ to X, $P(D^{\mu}) = [P(D)]^{\mu}$
- Reduce to the hard instance with the right $\mu$ when P = transitive closure
Solution: Boolean circuits!

Outline:
- Background
- Circuits for Boolean Provenance or PosBool(X)
- Circuits for General Provenance Semirings

- Circuit is a DAG
  - Can reuse common subexpressions
  - Boolean formula = tree
- Leaf nodes: EDB vars in X
- Internal nodes:
  - $\wedge$: IDB/EDB vars used in one derivation
  - $\vee$: alternative derivations
- Roots: IDB vars
Program:
  T(x, y) :- R(x, y)
  T(x, y) :- R(x, z), T(z, y)
  S(x) :- T(a, x)
R: (a, a) annotated p; (a, b) annotated q
[Figure: circuit with root $X_{T(a,b)}$, nodes $X_{R(a,b)}$ and $X_{R(a,a)}$, and leaves q and p.]
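The size gap between a DAG and the equivalent tree can be seen on a contrived circuit in which every level reuses both gates of the previous level. This is a hypothetical example chosen to show sharing, not a circuit from the talk; the gate encoding is an assumption.

```python
from functools import lru_cache


def build_chain(depth):
    """Build a layered circuit: level i has an OR gate g_i and an AND gate h_i,
    both reading the two gates of level i - 1."""
    gates = {"g0": ("var",), "h0": ("var",)}
    for i in range(1, depth + 1):
        gates[f"g{i}"] = ("or", f"g{i - 1}", f"h{i - 1}")
        gates[f"h{i}"] = ("and", f"g{i - 1}", f"h{i - 1}")
    return gates


def dag_size(gates):
    """Number of gates when shared subcircuits are stored once."""
    return len(gates)


def tree_size(gates, root):
    """Number of nodes when the circuit is unfolded into a formula (a tree)."""
    @lru_cache(maxsize=None)
    def size(name):
        gate = gates[name]
        if gate[0] == "var":
            return 1
        return 1 + size(gate[1]) + size(gate[2])
    return size(root)


gates = build_chain(10)
print(dag_size(gates), tree_size(gates, "g10"))
```

For depth 10 the DAG has 22 gates, while the unfolded formula has 2047 nodes: the DAG grows linearly with depth and the tree doubles at every level.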

Theorem: Given any PosBool(X)-database D and Datalog program P, the provenance of tuples in P(D) can be faithfully represented using monotone Boolean circuits of size polynomial in |D| (and these circuits can be computed in poly-time).

Two key ideas from previous work:
1. Datalog provenance can be represented by a system of equations, obtained by instantiating the vars in the Datalog program P to EDB/IDB tuples [Green et al. 2007]
   - EDB tuples become constants, IDB tuples become variables
   - Iteratively solve this system of equations
   - Fixpoint = provenance for all IDB tuples
2. A system of equations with N Boolean variables can be solved in N + 1 iterations [Esparza et al. 2011]
   - N = #IDB tuples
   - Build a circuit with N + 1 layers from the system of equations

Program:
  T(x, y) :- R(x, y)
  T(x, y) :- R(x, z), T(z, y)
  S(x) :- T(a, x)
R: (a, a) annotated p; (a, b) annotated q
Step 1: Build the system of equations from all possible instantiations of x, y, z with a, b (p and q are constants, the $X$'s are variables):
  $X_{T(a,a)} = p \vee (p \wedge X_{T(a,a)})$
  $X_{T(a,b)} = q \vee (p \wedge X_{T(a,b)})$
  $X_{S(b)} = X_{T(a,b)}$
  $X_{S(a)} = X_{T(a,a)}$
Step 2: Build a circuit with N + 1 layers (N = 4)

[Figure: layered circuit with leaves p, q, and false. Level 0 holds $X_{T(a,a),0}$, $X_{T(a,b),0}$, $X_{S(a),0}$, $X_{S(b),0}$; Level 1 and Level 2 hold the $\vee$/$\wedge$ gates computing the corresponding variables with indices 1 and 2.]
System of equations:
  $X_{T(a,a)} = p \vee (p \wedge X_{T(a,a)})$
  $X_{T(a,b)} = q \vee (p \wedge X_{T(a,b)})$
  $X_{S(b)} = X_{T(a,b)}$
  $X_{S(a)} = X_{T(a,a)}$
- Assign leaf IDB vars to false
- Multiple roots for multiple IDB vars
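The layered construction mirrors a fixpoint iteration: start every IDB variable at false and apply the equations N + 1 times. The Python sketch below runs that iteration symbolically, representing each PosBool(X) value as a DNF kept minimal by absorption. The representation and helper names are assumptions made for illustration; the talk builds a circuit rather than expanding formulas.

```python
def absorb(dnf):
    """Drop every clause that strictly contains another clause (a ∨ (a ∧ b) = a)."""
    return frozenset(c for c in dnf if not any(d < c for d in dnf))


def disj(f, g):
    return absorb(f | g)


def conj(f, g):
    return absorb(frozenset(c | d for c in f for d in g))


FALSE = frozenset()


def var(name):
    return frozenset({frozenset({name})})


p, q = var("p"), var("q")

# The system of equations from the slide: EDB tuples are constants (p, q),
# IDB tuples are variables looked up in X.
EQUATIONS = {
    "T(a,a)": lambda X: disj(p, conj(p, X["T(a,a)"])),
    "T(a,b)": lambda X: disj(q, conj(p, X["T(a,b)"])),
    "S(a)": lambda X: X["T(a,a)"],
    "S(b)": lambda X: X["T(a,b)"],
}


def solve(equations):
    """Apply the equations N + 1 times, starting from all-false (N = #IDB tuples)."""
    X = {name: FALSE for name in equations}
    for _ in range(len(equations) + 1):
        X = {name: rhs(X) for name, rhs in equations.items()}
    return X


solution = solve(EQUATIONS)
```

Each round of the loop corresponds to one layer of the circuit. On this example the values stabilize after two rounds, at p for T(a,a) and S(a) and at q for T(a,b) and S(b).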

1. Store only two levels of the circuit instead of N + 1 levels
   - Evaluate iteratively
2. Embed circuit construction in semi-naive evaluation
   - Check for new derivations, not only new IDB variables
   - Sound and complete
3. Remove self-dependency of IDB vars
   - Works for PosBool(X) and also some other semirings...
System of equations:
  $X_{T(a,a)} = p \vee (p \wedge X_{T(a,a)})$
  $X_{T(a,b)} = q \vee (p \wedge X_{T(a,b)})$
  $X_{S(b)} = X_{T(a,b)}$
  $X_{S(a)} = X_{T(a,a)}$

[Figure: the layered circuit again (Levels 0 to 2 over leaves p, q, and false).]

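With all these optimizations:
[Figure: two-level circuit over leaves p and q. Bottom level: $X_{T(a,a),\text{bottom}}$, $X_{T(a,b),\text{bottom}}$, $X_{S(a),\text{bottom}}$. Top level: $X_{T(a,a),\text{top}}$, $X_{T(a,b),\text{top}}$, $X_{S(a),\text{top}}$.]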

- Linear-time deletion propagation (in circuit size)
- Approximation for probabilistic databases
  - Even when only the circuit (and not the database) is available
- Circuits can be computed “offline”
  - Only linear-time evaluation is required when needed (e.g. deletion propagation)
  - Compared to storing and solving a system of equations iteratively, or
  - re-evaluating the Datalog program
- Can use existing techniques for efficient and parallel circuit evaluation
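Deletion propagation on a circuit amounts to one pass over the gates in topological order, with the variables of deleted EDB tuples set to False. The sketch below uses a small hand-written circuit for the running example (R(a, a) annotated p, R(a, b) annotated q); the gate list and names are assumptions for illustration, not the circuit produced by the authors' algorithm.

```python
# Gates in topological order: (name, operation, inputs).
CIRCUIT = [
    ("p", "var", ()),
    ("q", "var", ()),
    ("p_and_q", "and", ("p", "q")),
    ("T(a,a)", "or", ("p",)),
    ("T(a,b)", "or", ("q", "p_and_q")),
    ("S(a)", "or", ("T(a,a)",)),
    ("S(b)", "or", ("T(a,b)",)),
]


def surviving(circuit, deleted):
    """Evaluate every gate once; a gate is True if its tuple is still derivable
    after the EDB tuples whose variables are in `deleted` are removed."""
    value = {}
    for name, op, inputs in circuit:
        if op == "var":
            value[name] = name not in deleted
        elif op == "and":
            value[name] = all(value[i] for i in inputs)
        else:  # "or"
            value[name] = any(value[i] for i in inputs)
    return value


after_deleting_p = surviving(CIRCUIT, {"p"})
```

Deleting R(a, a) removes T(a, a) and S(a) but leaves T(a, b) and S(b) derivable through q. Each gate is visited once, which is where the linear bound in the circuit size comes from.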

Outline:
- Background
- Circuits for Boolean Provenance or PosBool(X)
- Circuits for General Provenance Semirings

- $(K, +_K, \cdot_K, 0_K, 1_K)$
  - Domain K
  - $+_K$, $\cdot_K$: associative, commutative, with neutral elements $0_K$, $1_K$
  - $\cdot_K$ distributes over $+_K$, i.e. $a \cdot_K (b +_K c) = a \cdot_K b +_K a \cdot_K c$
  - $0_K$ cancels any element in K, i.e. $a \cdot_K 0_K = 0_K \cdot_K a = 0_K$
- Examples:
  - $(\mathbb{B}, \vee, \wedge, \mathrm{False}, \mathrm{True})$: set semantics
  - $(\mathbb{N}, +, \cdot, 0, 1)$: bag semantics
  - $(\mathbb{N} \cup \{\infty\}, \min, +, \infty, 0)$: tropical semiring, used to compute cost (e.g. cost of a shortest path)
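The axioms can be spot-checked for the three example semirings. The sketch below tests commutativity, associativity, distributivity, and the neutral and cancelling elements on a handful of sample values; it is a sanity check on samples under an assumed tuple encoding, not a proof.

```python
import math
from itertools import product

# name -> (plus, times, zero, one, sample elements)
SEMIRINGS = {
    "Boolean": (lambda a, b: a or b, lambda a, b: a and b, False, True, [False, True]),
    "Counting": (lambda a, b: a + b, lambda a, b: a * b, 0, 1, [0, 1, 2, 5]),
    "Tropical": (min, lambda a, b: a + b, math.inf, 0, [0, 1, 3, math.inf]),
}


def satisfies_axioms(plus, times, zero, one, samples):
    """Check the commutative-semiring axioms on all triples of sample elements."""
    for a, b, c in product(samples, repeat=3):
        if plus(a, b) != plus(b, a) or times(a, b) != times(b, a):
            return False
        if plus(plus(a, b), c) != plus(a, plus(b, c)):
            return False
        if times(times(a, b), c) != times(a, times(b, c)):
            return False
        if times(a, plus(b, c)) != plus(times(a, b), times(a, c)):
            return False
    return all(
        plus(a, zero) == a and times(a, one) == a and times(a, zero) == zero
        for a in samples
    )


results = {name: satisfies_axioms(*spec) for name, spec in SEMIRINGS.items()}
```

In the tropical semiring the roles are shifted: min plays the part of addition with neutral element infinity, and ordinary addition plays the part of multiplication with neutral element 0.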

- Generalization of PosBool(X)
- $(K, +_K, \cdot_K, 0_K, 1_K)$
  - Tuples are annotated with variables from X
  - K is of the form Prov(X)
  - $+_K$ denotes alternative usage
  - $\cdot_K$ denotes joint usage
- Examples:
  - $(\mathrm{PosBool}(X), \vee, \wedge, \mathrm{False}, \mathrm{True})$
  - $(\mathrm{Lin}(X), \cup, \cup^{*}, \bot, \emptyset)$: tracks contributing tuples [Cui et al. ’00]
  - $(\mathrm{Why}(X), \cup, \Cup, \emptyset, \{\emptyset\})$, where $\Cup$ is the pairwise union of subsets: tracks contributing tuples in alternative derivations [Buneman et al. ’01]

- Key property needed for applications like deletion propagation, trust management, cost computation, ...
- Prov(X) specializes correctly to K if any valuation $v : X \to K$ extends uniquely to a homomorphism $h_v : \mathrm{Prov}(X) \to K$ (which correctly maps $+$, $\cdot$ of Prov(X) to those of K)
- Further, some provenance semirings are “more informative” than others
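Specialization can be illustrated by evaluating one provenance polynomial under valuations into different semirings. The sketch below takes the polynomial $x_1 y_1 z_1 + x_1 y_2 z_2 + x_2 y_3 z_2$ from the asthma example and maps it into the Boolean, counting, and tropical semirings. The `Semiring` class and the sample costs are assumptions chosen for illustration.

```python
import math
from dataclasses import dataclass
from typing import Any, Callable


@dataclass(frozen=True)
class Semiring:
    plus: Callable[[Any, Any], Any]
    times: Callable[[Any, Any], Any]
    zero: Any
    one: Any


BOOLEAN = Semiring(lambda a, b: a or b, lambda a, b: a and b, False, True)
COUNTING = Semiring(lambda a, b: a + b, lambda a, b: a * b, 0, 1)
TROPICAL = Semiring(min, lambda a, b: a + b, math.inf, 0)

# Provenance polynomial of the example query: a sum of products of variables.
POLYNOMIAL = [("x1", "y1", "z1"), ("x1", "y2", "z2"), ("x2", "y3", "z2")]


def evaluate(K, polynomial, valuation):
    """Apply the homomorphism induced by `valuation`: sums and products of the
    polynomial are replaced by the plus and times of the semiring K."""
    total = K.zero
    for monomial in polynomial:
        term = K.one
        for v in monomial:
            term = K.times(term, valuation[v])
        total = K.plus(total, term)
    return total


variables = ["x1", "x2", "y1", "y2", "y3", "z1", "z2"]
still_true = evaluate(BOOLEAN, POLYNOMIAL, {**{v: True for v in variables}, "z2": False})
derivations = evaluate(COUNTING, POLYNOMIAL, {v: 1 for v in variables})
costs = {"x1": 1, "x2": 5, "y1": 2, "y2": 1, "y3": 1, "z1": 4, "z2": 3}  # made-up costs
cheapest = evaluate(TROPICAL, POLYNOMIAL, costs)
```

The Boolean valuation answers a deletion question (the query stays true without Tom's Smoker tuple), the counting valuation gives the number of derivations, and the tropical valuation gives the cost of the cheapest derivation.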

[Figure: diagram relating the semirings N[X], Why(X), Lin(X), PosBool(X), Sorp(X) (defined later), Tropical, N (bag), Security, and Boolean (set). Arrows mean “specializes correctly”; the diagram runs from more informative to less informative.]

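- PosBool(X): $\bigvee_{\text{trees } \tau} \bigwedge_{\text{leaves } t \text{ of } \tau} \mathrm{Annot}(t)$
- General Prov(X): $\sum_{\text{trees } \tau} \prod_{\text{leaves } t \text{ of } \tau} \mathrm{Annot}(t)$, with sum $+_K$ and product $\cdot_K$
- Infinite sums should be well-defined
- Need to consider “$\omega$-continuous semirings” and “$\omega$-continuous homomorphisms”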

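[Figure: the semiring diagram again (N[X], Why(X), Lin(X), PosBool(X), Sorp(X), Tropical, N (bag), Security, Boolean (set)), with the notes below.]
- Finite, so $\omega$-continuous
- Need to add $\infty$: $\mathbb{N}^{\infty}[[X]]$ and $\mathbb{N}^{\infty}$
- $\mathbb{N}^{\infty}[[X]]$: most informative provenance semiring [Green et al. ’07]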

- Poly-size overhead is not valid because of the infinite sum
- But can outputs have finite annotations (with X, $\cdot$, $+$) that specialize correctly to semirings with finite domains?
Theorem: It is not possible to annotate the output of Datalog programs following $\mathbb{N}^{\infty}[[X]]$-semantics with finite provenance expressions that specialize “correctly” to the semiring Why(X).
  - Finite annotations won’t specialize correctly to Why(X)
Theorem: However, we can generate poly-size circuits in poly-time directly for Why(X).
  - Need more levels in the circuit from the system of equations
  - Need a different argument for correctness

- We propose Sorp(X)
  - Most general absorptive semiring: $a + a \cdot b = a$
  - N[X], but keeping only the monomials that are not “absorbed” by the others, e.g. $pq + p^2 q^3 = pq$, while $p^2 q + p q^2$ stays $p^2 q + p q^2$
- The same algorithm, proof, and optimizations to construct poly-size circuits hold
  - Circuits are more general than Boolean circuits
1. Specializes correctly to interesting semirings
2. Outputs can be annotated by poly-size circuits
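The absorption rule can be sketched as a normalization step on polynomials: a monomial is dropped when some other monomial divides it, since $a + a \cdot b = a$. The code below is an illustration under the assumption that a monomial is a dict from variable to positive exponent and that coefficients are ignored (absorption with $b = 1$ gives $a + a = a$).

```python
def divides(m, n):
    """True if monomial m divides monomial n (dicts: variable -> exponent)."""
    return all(n.get(v, 0) >= e for v, e in m.items())


def sorp_normalize(monomials):
    """Keep only the monomials that are not absorbed by a different monomial."""
    unique = []
    for m in monomials:
        if m not in unique:  # a + a = a
            unique.append(m)
    return [n for n in unique
            if not any(m != n and divides(m, n) for m in unique)]


# pq + p^2 q^3: the second monomial is a multiple of the first, so it is absorbed.
first = sorp_normalize([{"p": 1, "q": 1}, {"p": 2, "q": 3}])
# p^2 q + p q^2: neither monomial divides the other, so both are kept.
second = sorp_normalize([{"p": 2, "q": 1}, {"p": 1, "q": 2}])
```

The two calls reproduce the two examples on the slide.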

[Figure: the semiring diagram again (N[X], Why(X), Lin(X), PosBool(X), Sorp(X), Tropical, N (bag), Security, Boolean (set)).]

- Data provenance
  - e.g. [Cui et al. ’00, Buneman et al. ’08, Cheney et al. ’09, Benjelloun et al. ’08]
- Circuits
  - Circuit complexity (size, depth, parallelism) has been studied for decades, e.g. [Arora-Barak ’09] (book)
- Provenance for Datalog
  - System of equations, derivation trees, infinite sum [Grahne ’91, Green et al. ’07]
  - Poly-size c-tables with Boolean formulas for Datalog with contradictions [Abiteboul et al. 2014]

- Circuits to represent and store Datalog provenance
  - For PosBool(X) and other semirings
  - Semantics, algorithms, limitations, applicability
  - Preliminary experiments support our results: we compared circuits for deletion propagation against iteratively solving the system of equations and against re-evaluating the Datalog program from scratch
- Future work:
  - A complete implementation, evaluation, new applications

Thank you. Questions?