Faster Query Answering in Probabilistic Databases using Read-Once Functions
Sudeepa Roy
Joint work with Vittorio Perduca and Val Tannen
University of Pennsylvania

Presentation transcript:


Probabilistic Databases
Possible worlds model
- Each possible world w is a standard database instance and has a probability P[w]
- Compact representation D based on independence assumptions
Query Semantics in Probabilistic Databases
- (wlog.) Boolean query q
- Traditional database: q(D) ∈ {true, false}
- Probabilistic database: $P[q(D)] = \sum_{w \,:\, q(w) = \text{true}} P[w]$
Goal: Efficiently evaluate P[q(D)]
- Data complexity; want time polynomial in n = |D|

Computation of P[q(D)]
Can we efficiently compute P[q(D)]?
- No: in general it is #P-hard
Dalvi-Suciu '04 ff.: positive queries can be partitioned into
- Safe queries: safe plans run in poly-time on all instances
- Unsafe queries: data complexity is #P-hard
  - Includes very simple queries like R(x) S(x, y) T(y)
- Given q as input, we can efficiently decide whether q is safe
BUT:
- For unsafe queries, probabilities on some instances can be efficiently computed
- Our approach: take both q and D as input

Restrictions
Tuple-independent representation D
- Tuple t annotated by P[t]
- [Figure: tables R, S, T of D over values a1, a2, a3 and b1, b2, b3 with a probability per tuple, and one possible world w containing a1 in R, (a1, b1) in S, and b1 in T]
- P[w] = 0.3 (1 − 0.4) (1 − 0.6) 0.1 (1 − 0.5) (1 − 0.2) (1 − 0.1) 0.7 (1 − 0.8) (1 − 0.4)
Conjunctive query without self-join (CQ^-)
- q() := R(x) S(x, y) T(y)
- (This is the H_0 query from Suciu's keynote)
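The product for P[w] above can be reproduced with a few lines of Python. This is only a sketch: the tuple names (r1, s1, ...) and the split of the ten factors into tables of sizes 3, 4, and 3 are assumptions made for illustration; only the numeric factors come from the slide.

```python
from math import prod

def world_probability(tuple_probs, world):
    """P[w] under tuple independence: multiply P[t] for tuples present in w
    and 1 - P[t] for tuples absent from w."""
    return prod(p if t in world else 1.0 - p for t, p in tuple_probs.items())

# Hypothetical tuple names; the probabilities follow the factors on the slide.
tuple_probs = {
    "r1": 0.3, "r2": 0.4, "r3": 0.6,
    "s1": 0.1, "s2": 0.5, "s3": 0.2, "s4": 0.1,
    "t1": 0.7, "t2": 0.8, "t3": 0.4,
}
world = {"r1", "s1", "t1"}  # the tuples that are present in w
print(world_probability(tuple_probs, world))
```

Each tuple contributes exactly one factor, so the cost is linear in the number of tuples for a single world. The difficulty lies in summing over exponentially many worlds.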

Query Answering in Two Steps: Example
q() := R(x), S(x, y), T(y)
- Event variables for tuples
- Step 1 (easy): Event expression for q(D), or "lineage"
  - E = w1 v1 u1 + w2 v2 u1 + w3 v3 u2 + w3 v4 u3
  - The "form" of the expression depends on the query plan; here π_() ((R ⋈ S) ⋈ T)
- Step 2 (hard): Compute P[q(D)] = P[E]
  - given Pr[w1] = 0.3, Pr[v1] = 0.4, ...
This work: take advantage of read-once expressions
[Figure: tables R, S, T of D, each tuple annotated with a probability and an event variable: w1, w2, w3 for R; v1, v2, v3, v4 for S; u1, u2, u3 for T]
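To make Step 2 concrete, here is a brute-force sketch that sums the probabilities of all satisfying assignments of the lineage E. It takes time exponential in the number of event variables, which is why it is only a baseline. Only Pr[w1] = 0.3 and Pr[v1] = 0.4 come from the slide; the remaining probabilities (0.5) are placeholders chosen for illustration.

```python
from itertools import product

def dnf_probability(terms, p):
    """Sum P[w] over all assignments w that satisfy at least one term.
    Exponential in the number of variables; for illustration only."""
    names = sorted(p)
    total = 0.0
    for bits in product([False, True], repeat=len(names)):
        w = dict(zip(names, bits))
        if any(all(w[v] for v in term) for term in terms):
            weight = 1.0
            for v in names:
                weight *= p[v] if w[v] else 1.0 - p[v]
            total += weight
    return total

# Lineage E = w1 v1 u1 + w2 v2 u1 + w3 v3 u2 + w3 v4 u3
E = [{"w1", "v1", "u1"}, {"w2", "v2", "u1"},
     {"w3", "v3", "u2"}, {"w3", "v4", "u3"}]
# Placeholder probabilities (0.5), except the two values given on the slide.
p = {v: 0.5 for term in E for v in term}
p["w1"], p["v1"] = 0.3, 0.4
print(dnf_probability(E, p))
```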

Read-Once Boolean Expressions
Expression in read-once form: every variable occurs exactly once
- e.g. ((x + y)z + w)(u + v)
- Linear time probability computation
  - P(x y) = P(x) P(y)
  - P(x + y) = 1 − (1 − P(x)) (1 − P(y))
Read-once expression: has an equivalent read-once form
- e.g. xzu + xzv + yzu + yzv + wu + wv [in DNF, as large as O(n^|q|)]
- xzu + xzv + (yz + w)(u + v) [not in DNF, can be much smaller]
Non-read-once expressions: no read-once form
- e.g. xy + yz + zx, xy + yz + zw
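The two rules above give a one-pass evaluator over an expression in read-once form. The nested-tuple representation below is an assumption made for this sketch, and the probabilities are arbitrary; the rules themselves are the ones on the slide and rely on the variables being independent and each occurring once.

```python
def read_once_probability(expr, p):
    """expr is a variable name or a pair (op, children) with op in {"and", "or"}.
    Assumes independent variables, each occurring once in expr."""
    if isinstance(expr, str):
        return p[expr]
    op, children = expr
    if op == "and":
        # P(x y) = P(x) P(y)
        result = 1.0
        for child in children:
            result *= read_once_probability(child, p)
        return result
    # P(x + y) = 1 - (1 - P(x)) (1 - P(y))
    miss = 1.0
    for child in children:
        miss *= 1.0 - read_once_probability(child, p)
    return 1.0 - miss

# ((x + y) z + w)(u + v)
example = ("and", [
    ("or", [("and", [("or", ["x", "y"]), "z"]), "w"]),
    ("or", ["u", "v"]),
])
print(read_once_probability(example, {v: 0.5 for v in "xyzwuv"}))
```

Every node is visited once, so the running time is linear in the size of the expression.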

Read-Once Event Expressions
Safe plans for safe queries directly produce expressions in read-once form (Olteanu-Huang '08)
Unsafe queries can also produce read-once expressions
- Our example is read-once
  - E = w1 v1 u1 + w2 v2 u1 + w3 v3 u2 + w3 v4 u3 = (w1 v1 + w2 v2) u1 + w3 (v3 u2 + v4 u3)
- Corresponds to the unsafe query q() := R(x) S(x, y) T(y)
- No query plan can produce the read-once form directly

Problem Definition
Given
- a Boolean CQ^- query q,
- a tuple-independent database D,
can we efficiently decide whether the event expression corresponding to q(D) is read-once?
- If yes, can we compute the read-once form efficiently?
- (Then P[q(D)] can be computed efficiently.)

Read-once-ness: only a sufficient condition to efficiently compute P[q(D)]
- e.g., E = x1 x2 + x2 x3 + x3 x4 + ...
  - Not read-once
  - P[E] can be computed in poly-time using dynamic programming
- Moreover, see the detailed analysis in Jha-Suciu '11 using OBDD, FBDD, d-DNNF
E is read-once ⇒ read-once form of E can be computed efficiently ⇒ P[E] can be computed efficiently
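One way to realize the dynamic programming mentioned for the chain E = x1 x2 + x2 x3 + ... is sketched below. The slide does not spell out the recurrence, so this formulation is an assumption: it tracks the probability that no two consecutive variables are both true, split by the value of the last variable.

```python
def chain_probability(probs):
    """P[x1 x2 + x2 x3 + ... + x(n-1) xn] for independent xi with P[xi] = probs[i-1].
    Assumed recurrence: keep the probability that no adjacent pair is true so far,
    separately for 'last variable false' and 'last variable true'."""
    if not probs:
        return 0.0
    last_false, last_true = 1.0 - probs[0], probs[0]
    for p in probs[1:]:
        last_false, last_true = (last_false + last_true) * (1.0 - p), last_false * p
    return 1.0 - (last_false + last_true)

print(chain_probability([0.3, 0.4, 0.5, 0.6]))
```

The loop does constant work per variable, so the chain is handled in linear time even though it has no read-once form once it reaches four variables.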

Outline
Background
- Existing characterization of read-once expressions
- Co-occurrence graphs
Our Contributions
- Co-table graph
- Step 1. Computation of co-table graph
- Step 2. Computation of read-once form
Related work, future work, and conclusion


Characterization of Read-once Expressions
A positive Boolean expression is read-once if and only if its "co-occurrence graph" is P4-free (no simple induced path with four vertices) and "normal".
- Gurvich '77, '91
- Can be checked (and computed) in poly-time if the expression is given in DNF (Golumbic-M-R '06)
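As an illustration of the P4-free condition only, the following sketch tests every ordered choice of four vertices for an induced path. It is a naive O(n^4) check written for clarity, not the algorithm from the cited work, and it does not test normality. The graph representation (a vertex list plus a collection of edges) is an assumption.

```python
from itertools import permutations

def has_induced_p4(vertices, edges):
    """True if some four vertices a-b-c-d form a path with no other edges among them."""
    edge_set = {frozenset(e) for e in edges}

    def adjacent(a, b):
        return frozenset((a, b)) in edge_set

    for a, b, c, d in permutations(vertices, 4):
        on_path = adjacent(a, b) and adjacent(b, c) and adjacent(c, d)
        chords = adjacent(a, c) or adjacent(a, d) or adjacent(b, d)
        if on_path and not chords:
            return True
    return False

# Co-occurrence graph of xy + yz + zw is the path x - y - z - w.
print(has_induced_p4("xyzw", [("x", "y"), ("y", "z"), ("z", "w")]))
```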

Co-occurrence Graph G_CO
Graph on variables in the expression as vertices
1. Express the Boolean expression in irredundant DNF
   - xy + xyz + zx → xy + zx
2. Put an edge between variables if they co-occur in a disjunct
Can be easily computed if the expression is in DNF
[Figure: G_CO for xy + zx, with edges x–y and x–z]
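The two steps above translate directly into code when the expression is already a monotone DNF. In this sketch a DNF is assumed to be a collection of sets of variable names, and the function names are illustrative.

```python
from itertools import combinations

def irredundant(terms):
    """Drop every term that strictly contains another term (absorption)."""
    terms = {frozenset(t) for t in terms}
    return {t for t in terms if not any(s < t for s in terms)}

def cooccurrence_graph(terms):
    """Edges between variables that appear together in some irredundant disjunct."""
    edges = set()
    for term in irredundant(terms):
        edges.update(frozenset(pair) for pair in combinations(sorted(term), 2))
    return edges

# xy + xyz + zx reduces to xy + zx, giving edges x-y and x-z.
print(cooccurrence_graph([{"x", "y"}, {"x", "y", "z"}, {"z", "x"}]))
```

Without the absorption step the redundant term xyz would add a spurious edge y-z.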


Our Contributions
1. DNF of the event expression is not needed for CQ^-
   - G_CO can be directly computed from "provenance DAGs"
2. We do not need to compute G_CO
   - A subgraph of G_CO suffices: the "co-table graph" G_CT
Our Framework
- (1) Compute G_CO, then use existing read-once testing algorithms
- (2) Compute G_CT, then use our read-once testing algorithm
- (1) uses Gurvich's characterization vs. (2) uses an alternative
- (2) is faster than (1)

Provenance DAG
Event expressions, called "lineage" (Suciu keynote), are a form of provenance (Green-Karvounarakis-Tannen '07).
We use provenance DAGs (Green et al. '07)
- Query: q() := R(x), S(x, y), T(y)
- Query plan: π_() ((R ⋈ S) ⋈ T)
- E = w1 v1 u1 + w2 v2 u1 + w3 v3 u2 + w3 v4 u3
[Figure: provenance DAG for E over the event variables w1, w2, w3, v1, v2, v3, v4, u1, u2, u3]

Co-Table Graph G_CT
Subgraph of G_CO: |G_CT| ≤ |G_CO|
Put an edge between variables only if their tables share variables in q
- e.g.: q() := R(x) S(y)
  - R, S have n tuples each; G_CO has n^2 edges, G_CT has zero!
- q() := R(x) S(x, y) T(y)
  - E = w1 v1 u1 + w2 v2 u1 + w3 v3 u2 + w3 v4 u3
[Figure: G_CO and G_CT for E, drawn side by side on the vertices w1, w2, w3, v1, v2, v3, v4, u1, u2, u3]
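The edge rule for G_CT can be illustrated on a DNF by filtering co-occurring pairs by table. This is only a sketch of the definition: the talk computes G_CT from the provenance DAG without building the DNF, whereas this example starts from an irredundant DNF for simplicity. The dictionaries `table_of` and `query_vars` are assumed inputs.

```python
from itertools import combinations

def cotable_graph(terms, table_of, query_vars):
    """Keep a co-occurrence edge only if the two tables share a query variable.
    terms: irredundant DNF as sets of event variables
    table_of: event variable -> table name
    query_vars: table name -> set of query variables in that table's atom"""
    edges = set()
    for term in terms:
        for a, b in combinations(sorted(term), 2):
            if query_vars[table_of[a]] & query_vars[table_of[b]]:
                edges.add(frozenset((a, b)))
    return edges

# q() := R(x) S(x, y) T(y); w* come from R, v* from S, u* from T.
E = [{"w1", "v1", "u1"}, {"w2", "v2", "u1"},
     {"w3", "v3", "u2"}, {"w3", "v4", "u3"}]
table_of = {v: {"w": "R", "v": "S", "u": "T"}[v[0]] for t in E for v in t}
query_vars = {"R": {"x"}, "S": {"x", "y"}, "T": {"y"}}
print(len(cotable_graph(E, table_of, query_vars)))
```

For this query R and T share no variable, so the w-u edges of G_CO are dropped while the w-v and v-u edges remain.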

Our Algorithm
Input: provenance DAG H
- Obtained from the query plan
Step 1: Compute G_CT
- (The same procedure can compute G_CO as well)
Step 2: Compute read-once form (if possible)
- Otherwise output that the event expression is not read-once

Step 1: Computing G_CT
Theorem: Two variables are adjacent in G_CT if and only if their least common ancestor set contains a product node in the provenance DAG
- Proof critically uses the no-self-join assumption
[Figure: provenance DAG with leaves y, x, z for E = xy + xz]

Step 2: Computing Read-once Form
Input: G_CT
Alternate between row decomposition and table decomposition
- Recursive computation
- Exactly one can be done at a recursion level, otherwise not read-once
- Proof critically uses the no-union assumption
- Sound and complete
Row decomposition: E = E_1 + E_2 + E_3, where each E_i is an event expression for the same query q
Table decomposition: E = E_1 E_2, where E_1 is an event expression for q_1 and E_2 for q_2
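The alternation between a sum split and a product split can be illustrated with a generic cograph-style factoring of a monotone DNF. This is not the G_CT-based algorithm of the talk: it works on the co-occurrence graph of an irredundant DNF, splits by connected components (loosely analogous to row decomposition), otherwise by components of the complement graph (loosely analogous to table decomposition), and gives up if neither applies. All names and the tree representation are assumptions.

```python
from itertools import product
from math import prod

def components(vertices, adjacent):
    """Connected components of the graph whose edges are given by `adjacent`."""
    remaining, comps = set(vertices), []
    while remaining:
        start = min(remaining)
        remaining.discard(start)
        stack, comp = [start], {start}
        while stack:
            v = stack.pop()
            for u in [u for u in remaining if adjacent(v, u)]:
                remaining.discard(u)
                comp.add(u)
                stack.append(u)
        comps.append(frozenset(comp))
    return comps

def factor(terms):
    """Read-once tree for an irredundant monotone DNF, or None if none is found."""
    terms = {frozenset(t) for t in terms}
    variables = set().union(*terms)
    if len(variables) == 1:
        return next(iter(variables))
    co = lambda a, b: any(a in t and b in t for t in terms)
    parts = components(variables, co)
    if len(parts) > 1:  # sum split: groups of terms on disjoint variables
        kids = [factor({t for t in terms if t <= part}) for part in parts]
        return None if None in kids else ("or", kids)
    parts = components(variables, lambda a, b: not co(a, b))
    if len(parts) == 1:
        return None  # neither the graph nor its complement is disconnected
    projections = [{t & part for t in terms} for part in parts]
    if prod(len(pr) for pr in projections) != len(terms):
        return None  # the DNF is not the full product of its projections
    kids = [factor(pr) for pr in projections]
    return None if None in kids else ("and", kids)

def evaluate(tree, world):
    if isinstance(tree, str):
        return world[tree]
    op, kids = tree
    values = [evaluate(k, world) for k in kids]
    return any(values) if op == "or" else all(values)

def equivalent(tree, terms):
    """Truth-table check that the tree and the DNF agree (exponential; for testing)."""
    names = sorted(set().union(*terms))
    for bits in product([False, True], repeat=len(names)):
        world = dict(zip(names, bits))
        if evaluate(tree, world) != any(all(world[v] for v in t) for t in terms):
            return False
    return True

E = [frozenset(t) for t in (("w1", "v1", "u1"), ("w2", "v2", "u1"),
                            ("w3", "v3", "u2"), ("w3", "v4", "u3"))]
print(factor(E))
```

On the running example the first level is a sum split into the terms containing u1 and the terms containing w3, and the next level is a product split that pulls out u1 and w3 respectively.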

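Example: Row Decomposition
q() := R(x), S(x, y), T(y)
E = w1 v1 u1 + w2 v2 u1 + w3 v3 u2 + w3 v4 u3
[Figure: tables R, S, T with event variables and the graph on w1, w2, w3, v1, v2, v3, v4, u1, u2, u3 split into two parts joined by "+". The first part yields the sub-instance R1 (tuples a1, a2 with w1, w2), S1 (tuples (a1, b1), (a2, b1) with v1, v2), and T1 (tuple b1 with u1).]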

Example: Table Decomposition
q() := R(x), S(x, y), T(y)
- q1() := R(x), S(x, y1) gives (w1 v1 + w2 v2)
- q2() := T(y2) gives u1
- Combined: (w1 v1 + w2 v2) u1
Final expression: (w1 v1 + w2 v2) u1 + w3 (v3 u2 + v4 u3)
[Figure: sub-instance R1, S1, T1 over a1, a2, b1 with event variables w1, w2, v1, v2, u1]

Overall Time Complexity
Input: provenance DAG H
Step 1: Compute G_CT or G_CO
- Time complexity ≈ O(n m_H + W_H m_CO)
- m_H = #edges in H, W_H = width of H, m_CO = #edges in G_CO, m_CT = #edges in G_CT
Step 2: Compute read-once form (if possible)
- Using our algorithm: O((m_CT + n) min(|q|, √n)); data complexity O(m_CT + n)
- Using existing algorithms: O(m_CO + n), with m_CT ≤ m_CO
Summary
- Analysis uses a "charging argument"
- Bound recursion depth, total time at each recursion level
- Step 1 is more expensive
- Step 2 is linear
  - In |G_CO| for existing algorithms
  - In |G_CT| for our algorithm
  - |G_CT| ≤ |G_CO|


Related Work
Sen-Deshpande-Getoor '10
- Independent work, considers the same problem
- Shows that the "normality" check is not needed for CQ^-
- Tests P4-freeness using "lineage trees" without computing the co-occurrence graph
Our work:
- Computes the co-occurrence graph without DNF computation ⇒ existing algorithms can be used
  - Was an open question in Sen-Deshpande-Getoor '10
- Obtains a faster and simpler algorithm
  - Time complexity comparison in the paper
  - Uses BFS/DFS, easier to implement
  - Uses compact provenance DAGs instead of lineage trees

Other Related Work
- Semantics of probabilistic query answering
  - Fuhr-Rölleke '97, Zimányi '97
- Dichotomy of CQ^-, CQ and UCQ queries
  - Dalvi-Suciu '04, '07, Dalvi-Schnaitter-Suciu '10
- Knowledge compilation techniques
  - Olteanu-Huang '08
  - Jha-Olteanu-Suciu '10
  - Jha-Suciu '11
  - Fink-Olteanu '11

Conclusion and Future Work
Can the co-occurrence/co-table graph be computed as a pre-processing step?
- This is the more expensive step
- Akin to building indexes on databases, but depends on the query's "join pattern"
- Cache the already computed G_CT with the join pattern
How to handle
- Larger classes of queries (UCQ?) and database models (disjoint independent?)
- Other efficient knowledge-compilation forms

Thank you. Questions?