Queries with Difference on Probabilistic Databases


Sanjeev Khanna, Sudeepa Roy, Val Tannen
University of Pennsylvania

Probabilistic Databases
To model and query uncertain data (sensor networks, information extraction, …)
Possible worlds model: each possible world W is a standard database instance and has a probability P[W]
Compact representation D, assuming independence
Example D (tuple: probability):
  R: a1: 0.3, a2: 0.4, a3: 0.6
  T: b1: 0.7, b2: 0.8, b3: 0.4
  S: tuples pairing a1, a2, a3 with b1, b2, b3; probabilities 0.1, 0.5, 0.2, …

Query Semantics
Query semantics on probabilistic databases:
  Apply the query q on each possible world W
  Add up the probabilities of the worlds that give the same query answer A:
  P[q(D) = A] = ∑_{W : q(W) = A} P[W]
Goal: efficiently evaluate P[q(D) = A]
  Data complexity: want time polynomial in n = |D|
Can we always compute P[q(D)] efficiently? No: in general it is #P-hard
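The possible-worlds semantics can be sketched directly as an enumeration. This is a minimal illustration, not the authors' code: `answer_distribution` and `nonempty` are hypothetical names, the three R tuples and their probabilities are the ones shown in the example instance, and the enumeration is exponential in the number of tuples, which is exactly why it is not a practical evaluation strategy.

```python
from itertools import product

def answer_distribution(tuple_probs, query):
    """Return {answer: probability} by enumerating every possible world.

    tuple_probs maps each tuple to its (independent) probability;
    query maps a world (a frozenset of tuples) to a hashable answer.
    """
    tuples = list(tuple_probs)
    dist = {}
    for bits in product([False, True], repeat=len(tuples)):
        world = frozenset(t for t, present in zip(tuples, bits) if present)
        # P[W] is a product over tuples because tuples are independent.
        p_world = 1.0
        for t, present in zip(tuples, bits):
            p_world *= tuple_probs[t] if present else 1.0 - tuple_probs[t]
        answer = query(world)
        dist[answer] = dist.get(answer, 0.0) + p_world
    return dist

def nonempty(world):
    """Hypothetical Boolean query: does the world contain any tuple?"""
    return len(world) > 0

# Relation R from the example instance.
R = {("R", "a1"): 0.3, ("R", "a2"): 0.4, ("R", "a3"): 0.6}
dist = answer_distribution(R, nonempty)
print(dist)
```

With 3 tuples there are 8 worlds; with n tuples there are 2^n, so the loop above only serves to make the definition concrete.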

Query Answering in Two Steps
Boolean query q() :- R(x), S(x, y), T(y)
Introduce event variables for tuples (P[w1] = 0.3, …)
  R: a1 (w1, 0.3), a2 (w2, 0.4), a3 (w3, 0.6)
  S: (a1, b1) v1, (a2, b1) v2, (a3, b2) v3, (a3, b3) v4
  T: b1 (u1, 0.7), b2 (u2, 0.8), b3 (u3, 0.4)
Step 1 (easy): Boolean provenance for q(D) [FR ’97, Z ’97]
  f = w1 v1 u1 + w2 v2 u1 + w3 v3 u2 + w3 v4 u3
Step 2 (hard): Compute P[q(D)] = P[f] given P[w1] = 0.3, P[v1] = 0.4, …
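Step 2 can be illustrated with a brute-force computation of P[f] for the provenance f above. This is only a sketch under stated assumptions: `dnf_probability` is a hypothetical helper, the probabilities of w1 to w3 and u1 to u3 come from the slide, P[v1] = 0.4 comes from the slide text, and the values for v2, v3, v4 are assumed for illustration. The enumeration takes time exponential in the number of variables.

```python
from itertools import product

def dnf_probability(terms, probs):
    """P[f] for a positive DNF f (list of minterms) over independent variables.

    Brute force: sums the weight of every satisfying assignment.
    """
    variables = sorted(probs)
    total = 0.0
    for bits in product([0, 1], repeat=len(variables)):
        assignment = dict(zip(variables, bits))
        if any(all(assignment[x] for x in term) for term in terms):
            weight = 1.0
            for x in variables:
                weight *= probs[x] if assignment[x] else 1.0 - probs[x]
            total += weight
    return total

# f = w1 v1 u1 + w2 v2 u1 + w3 v3 u2 + w3 v4 u3
f = [["w1", "v1", "u1"], ["w2", "v2", "u1"],
     ["w3", "v3", "u2"], ["w3", "v4", "u3"]]

probs = {
    "w1": 0.3, "w2": 0.4, "w3": 0.6,   # R, from the slide
    "u1": 0.7, "u2": 0.8, "u3": 0.4,   # T, from the slide
    "v1": 0.4,                          # from the slide text
    "v2": 0.1, "v3": 0.5, "v4": 0.2,   # assumed for illustration
}

p_f = dnf_probability(f, probs)
print(p_f)
```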

Probability Computation for Positive Queries
Dichotomy result [DS ’04, ’07; DSS ’10]: given q as input, we can efficiently decide whether q is
  Safe: safe plans run in poly-time on all instances, or
  Unsafe: #P-hard, e.g. q() :- R(x), S(x, y), T(y)
Instance-by-instance approach [SDG ’10, RPT ’11]: both q and D are given as input
  Poly-time algorithm to compute P[q(D)] for special cases, even if q is unsafe
What about queries with difference?

Boolean Provenances for Difference
Tuples of R, S and T are annotated with event variables u1 to u3, v1 to v4 and w1 to w3, respectively
q1(x) :- R(x, y), S(y, z)
  b1: u1(v1 + v2) + u3v4
  b2: u2v3
q2(x) :- R(x, y), S(y, z), T(z)
  b1: u1v1w1 + u1v2w2 + u3v4w2
  b2: u2v3w3
q = q1 – q2 (the provenance of q1 AND the negation of the provenance of q2)
  b1: (u1(v1 + v2) + u3v4) · ¬(u1v1w1 + u1v2w2 + u3v4w2)
  b2: (u2v3) · ¬(u2v3w3)
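The provenance of a difference is the provenance of q1 conjoined with the negated provenance of q2. The sketch below evaluates that for the two answers on this slide; `dnf`, `difference` and `probability` are hypothetical helper names, and since the slide gives no probabilities for this instance, every event variable is assumed to have probability 0.5.

```python
from itertools import product

def dnf(terms):
    """Build an evaluator for a positive DNF given as a list of minterms."""
    return lambda a: any(all(a[x] for x in term) for term in terms)

def difference(f1, f2):
    """Provenance of q1 - q2: f1 AND NOT f2."""
    return lambda a: f1(a) and not f2(a)

def probability(formula, probs):
    """P[formula] by enumerating assignments of independent variables."""
    variables = sorted(probs)
    total = 0.0
    for bits in product([False, True], repeat=len(variables)):
        a = dict(zip(variables, bits))
        if formula(a):
            weight = 1.0
            for x in variables:
                weight *= probs[x] if a[x] else 1.0 - probs[x]
            total += weight
    return total

# Provenances from the slide, with u1(v1 + v2) expanded into minterms.
f1_b1 = dnf([["u1", "v1"], ["u1", "v2"], ["u3", "v4"]])
f2_b1 = dnf([["u1", "v1", "w1"], ["u1", "v2", "w2"], ["u3", "v4", "w2"]])
f1_b2 = dnf([["u2", "v3"]])
f2_b2 = dnf([["u2", "v3", "w3"]])

# Assumption: all event variables have probability 0.5.
names = ["u1", "u2", "u3", "v1", "v2", "v3", "v4", "w1", "w2", "w3"]
probs = {x: 0.5 for x in names}

p_b1 = probability(difference(f1_b1, f2_b1), probs)
p_b2 = probability(difference(f1_b2, f2_b2), probs)
print(p_b1, p_b2)
```

In this example every answer of q2 is also an answer of q1, so P[f1 · ¬f2] equals P[f1] − P[f2]; that identity does not hold for arbitrary q1 and q2.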

Previous Work on Difference
FOR ’11: framework for exact and approximate probability computation
  But no guarantee of polynomial running time
  In fact, we show in this paper that with difference, in some cases no approximation exists (unless NP = RP)
How far can we go with difference in poly-time?

A Quick Comparison
Without difference:
  DNF of the boolean provenance is poly-size (n^|q|)
  P[q(D)] is always approximable (FPRAS)
With difference:
  DNF of the boolean provenance may be exponential in n
  P[q(D)] may not be approximable
FPRAS (Fully Polynomial Randomized Approximation Scheme): in time polynomial in n and 1/ε, compute a value p such that, with probability ≥ ¾,
  p ∈ [(1 − ε) P[q(D)], (1 + ε) P[q(D)]]

Our Results
We study queries of the form q1 – q2 and their generalization
  FPRAS: if q1 is any UCQ and q2 is any safe CQ-
  #P-hardness: even if both q1 and q2 are safe CQ-
  Inapproximability: even if q1 is the trivial TRUE query and q2 is a UCQ
Our FPRAS result extends to a larger class of queries, of which q1 – q2 is a special case
[CQ-: conjunctive queries without self-joins]

Difference Rank
Define the difference rank δ(q) of a query q recursively:
  δ(q1 – q2) = δ(q1) + δ(q2) + 1
    R – S: rank 1
  δ(q1 ⋈ q2) = δ(q1) + δ(q2)
    (R – S1) ⋈ (R – S2): rank 2
    (R – T1) ⋈ T2: rank 1
  δ(q1 ∪ q2) = max(δ(q1), δ(q2))
    (R – S1) ⋈ (R – S2) ∪ (R – T1) ⋈ T2: rank 2
  Select, project: rank remains the same
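The recursive definition of difference rank translates into a short function. The tuple-based query encoding and the constructor names below are hypothetical, chosen only for this sketch; rank 0 for a base relation is an assumption, though it is the value implied by the slide's example "R – S: rank 1".

```python
def rel(name):
    return ("rel", name)

def diff(q1, q2):
    return ("diff", q1, q2)

def join(q1, q2):
    return ("join", q1, q2)

def union(q1, q2):
    return ("union", q1, q2)

def difference_rank(q):
    """Difference rank of a query tree, following the recursive definition."""
    op = q[0]
    if op == "rel":
        return 0  # assumption: base relations have rank 0
    if op == "diff":
        return difference_rank(q[1]) + difference_rank(q[2]) + 1
    if op == "join":
        return difference_rank(q[1]) + difference_rank(q[2])
    if op == "union":
        return max(difference_rank(q[1]), difference_rank(q[2]))
    if op in ("select", "project"):
        return difference_rank(q[1])  # rank remains the same
    raise ValueError(f"unknown operator: {op}")

# The examples from the slide.
R, S, S1, S2, T1, T2 = (rel(n) for n in ["R", "S", "S1", "S2", "T1", "T2"])
ex1 = diff(R, S)
ex2 = join(diff(R, S1), diff(R, S2))
ex3 = join(diff(R, T1), T2)
ex4 = union(ex2, ex3)
print([difference_rank(q) for q in (ex1, ex2, ex3, ex4)])
```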

FPRAS for queries q with δ(q) = 1, given that some conditions hold (inapproximable for δ(q) = 1 in general)

Steps in FPRAS
Step 1: Compute the boolean provenance of q(D) for any query q with δ(q) = 1
Step 2: Write the boolean provenance in a “Probability Friendly Form” (if possible)
Step 3: FPRAS inspired by the Karp-Luby framework

Boolean Provenance for Queries q s.t. δ(q) = 1
Lemma: For any q with δ(q) = 1, on any D, the provenance f of q(D) has form [formula not captured in the transcript]
f is poly-size in n = |D| and poly-time computable

Probability Friendly Form (PFF)
f is in PFF if the negated DNFs can be written as poly-size d-DNNFs (next slide)
If f is in PFF, we can approximate P[f] using the Karp-Luby framework

d-DNNF [Darwiche ’01, ’02; DM ’02]
deterministic Decomposable Negation Normal Form
  Deterministic: under any assignment, at most one child of a +-node is satisfied
  Decomposable: children of a ·-node do not share variables
  Negation normal form: no internal node is a negation (negation appears only at the leaves)
In general, can be a DAG
Probability can be computed in linear time
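The linear-time probability computation follows from the two structural properties: at a +-node the children are mutually exclusive, so their probabilities add, and at a ·-node the children are independent, so their probabilities multiply. A minimal sketch, assuming a hypothetical tuple encoding of nodes (`("lit", var, positive)`, `("and", ...)`, `("or", ...)`) and an illustrative XOR circuit that is not taken from the slides:

```python
def ddnnf_probability(node, probs, cache=None):
    """Probability of a d-DNNF under independent variable probabilities.

    Each node is evaluated once (results are cached by node identity),
    so shared sub-DAGs are not recomputed.
    """
    if cache is None:
        cache = {}
    key = id(node)
    if key in cache:
        return cache[key]
    kind = node[0]
    if kind == "lit":
        _, var, positive = node
        p = probs[var] if positive else 1.0 - probs[var]
    elif kind == "and":
        # Decomposable: children share no variables, so multiply.
        p = 1.0
        for child in node[1:]:
            p *= ddnnf_probability(child, probs, cache)
    elif kind == "or":
        # Deterministic: children are mutually exclusive, so add.
        p = sum(ddnnf_probability(child, probs, cache) for child in node[1:])
    else:
        raise ValueError(f"unknown node kind: {kind}")
    cache[key] = p
    return p

# Illustrative d-DNNF for x XOR y: (x AND NOT y) + (NOT x AND y).
x, not_x = ("lit", "x", True), ("lit", "x", False)
y, not_y = ("lit", "y", True), ("lit", "y", False)
xor = ("or", ("and", x, not_y), ("and", not_x, y))

p_xor = ddnnf_probability(xor, {"x": 0.3, "y": 0.4})
print(p_xor)
```

The function does not verify determinism or decomposability; on a circuit that violates them the result is not a probability.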

Karp-Luby Framework [KL ’83]
Given boolean expression DAGs F1, …, Fm, let f = F1 + F2 + ... + Fm
P[f] can be approximated in poly-time (in m, n) if, for all i, in poly-time
  (1) P[Fi] can be computed
  (2) it can be checked whether a given assignment satisfies Fi
  (3) a random satisfying assignment of Fi can be sampled
Well-studied special case: DNF counting, where F1, …, Fm are DNF minterms, e.g. f = xyz + xyw + wuv
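For the DNF special case, the three conditions are easy to meet, and the estimator can be sketched as follows. This is an illustration of the classic Karp-Luby coverage estimator for positive minterms, not the PFF variant from the talk; `karp_luby_dnf` is a hypothetical name, the probability 0.5 for every variable is assumed, and the sample count is chosen arbitrarily rather than derived from ε.

```python
import random

def karp_luby_dnf(terms, probs, samples, rng):
    """Estimate P[F1 + ... + Fm] where each Fi is a positive minterm."""
    variables = sorted(probs)
    # Condition (1): P[Fi] is a product of variable probabilities.
    weights = []
    for term in terms:
        w = 1.0
        for x in term:
            w *= probs[x]
        weights.append(w)
    total = sum(weights)
    hits = 0
    for _ in range(samples):
        # Pick a term with probability proportional to P[Fi].
        i = rng.choices(range(len(terms)), weights=weights)[0]
        # Condition (3): sample an assignment conditioned on Fi being true.
        forced = set(terms[i])
        assignment = {x: True if x in forced else rng.random() < probs[x]
                      for x in variables}
        # Condition (2): find the first term the assignment satisfies.
        first = next(j for j, t in enumerate(terms)
                     if all(assignment[x] for x in t))
        # Count the sample only if the chosen term is that first term,
        # so each satisfying assignment is counted for exactly one term.
        if first == i:
            hits += 1
    return total * hits / samples

# f = xyz + xyw + wuv, the example from the slide.
f = [["x", "y", "z"], ["x", "y", "w"], ["w", "u", "v"]]
probs = {v: 0.5 for v in "xyzwuv"}  # assumed probabilities
estimate = karp_luby_dnf(f, probs, samples=20000, rng=random.Random(0))
print(estimate)
```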

Conditions (1) and (2) hold for PFF
Product of a minterm and a d-DNNF is another d-DNNF
[Figure: d-DNNF example with w2 = 1, z1 = 1]

Condition (3) also holds
Lemma: Generating a random satisfying assignment of a d-DNNF can be done in poly-time
  Process nodes in reverse topological order
  Generate a random satisfying assignment bottom-up
[Figure: d-DNNF over v1, v2 with partial assignments at the nodes; at a +-node a child is chosen at random]
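One way to sample a satisfying assignment is sketched below. It is a top-down variant (choose a child of a +-node with probability proportional to that child's probability, then recurse), which is an assumption of this sketch rather than the bottom-up procedure named on the slide. The tuple encoding of nodes and the XOR circuit are hypothetical, the root is assumed to have non-zero probability, and child probabilities are recomputed on every visit for brevity where a real implementation would cache them.

```python
import random

def prob(node, probs):
    """Probability of a d-DNNF node (no caching, for brevity)."""
    kind = node[0]
    if kind == "lit":
        return probs[node[1]] if node[2] else 1.0 - probs[node[1]]
    child_probs = [prob(child, probs) for child in node[1:]]
    if kind == "and":
        result = 1.0
        for p in child_probs:
            result *= p
        return result
    return sum(child_probs)  # "or": children are mutually exclusive

def fix_variables(node, probs, rng, assignment):
    """Fix the variables needed to satisfy node, choosing randomly at +-nodes."""
    kind = node[0]
    if kind == "lit":
        assignment[node[1]] = node[2]
    elif kind == "and":
        for child in node[1:]:
            fix_variables(child, probs, rng, assignment)
    else:
        children = node[1:]
        weights = [prob(child, probs) for child in children]
        chosen = rng.choices(children, weights=weights)[0]
        fix_variables(chosen, probs, rng, assignment)
    return assignment

def sample_satisfying(root, probs, rng):
    """Sample an assignment from the distribution conditioned on root being true."""
    assignment = fix_variables(root, probs, rng, {})
    # Variables not fixed by the chosen branch keep their prior distribution.
    for var, p in probs.items():
        if var not in assignment:
            assignment[var] = rng.random() < p
    return assignment

# Illustrative d-DNNF for x XOR y.
x, not_x = ("lit", "x", True), ("lit", "x", False)
y, not_y = ("lit", "y", True), ("lit", "y", False)
xor = ("or", ("and", x, not_y), ("and", not_x, y))

rng = random.Random(0)
draws = [sample_satisfying(xor, {"x": 0.3, "y": 0.4}, rng) for _ in range(20000)]
print(draws[:3])
```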

Expressibility in PFF
So, if f is in PFF, we can approximate P[q(D)]
But can we decide in poly-time whether some sub-expressions of a boolean expression have poly-size d-DNNFs?
  Not known
  But there are natural sufficient conditions that can be verified in poly-time:
    If certain sub-queries are safe and hence generate read-once expressions [OH ’08]
    If sub-queries generate poly-size OBDDs [JS ’11]
Extends to the instance-by-instance approach (both q, D given)

#P-hardness for q1 – q2 when both q1 and q2 are safe CQ-

#P-hardness: Steps in the proof
“Hard” query q = q1 – q2
  q1() :- R1(x, y1), R2(x, y2), R3(x, y3), R4(x, y4)
  q2() :- R1(x1, y), R2(x2, y), R3(x3, y), R4(x4, y)
Reduction chain (each problem is shown hard by reduction from the next):
  Computing P[q(D)] for the hard query
  Counting edge covers in bipartite graphs of degree ≤ 4, where the edge set can be partitioned into 4 disjoint matchings
  Counting independent sets in 3-regular bipartite graphs (XZ ’06)

Other Related Work
Semantics of probabilistic query answering: Fuhr-Rollecke ’97, Zimanyi ’97
Dichotomy of CQ-, CQ and UCQ queries: Dalvi-Suciu ’04, ’07, Dalvi-Schnaitter-Suciu ’10
Knowledge compilation techniques: Olteanu-Huang ’08, Jha-Olteanu-Suciu ’10, Jha-Suciu ’11, Fink-Olteanu ’11
Instance-by-instance approach: Sen-Deshpande-Getoor ’10, Roy-Perduca-Tannen ’11

Conclusions and Future Work
A step towards understanding the complexity of exact and approximate computation for queries with difference operations
Future work:
  Dichotomy results that syntactically classify difference queries (similar to positive UCQ)?
  Extending the FPRAS to queries with difference rank > 1?
  Experimental evaluation of our algorithms

Thank you. Questions?