Approximate Lineage for Probabilistic Databases

Approximate Lineage for Probabilistic Databases Christopher Ré and Dan Suciu University of Washington

Approximate Lineage in One Slide
Lineage (provenance): in a view, lineage is all derivations of a tuple. In probabilistic databases, query processing (QP) uses lineage to track correlations; lineage also explains query/view results.
VLDBs have lots of lineage, especially with complex queries/views. It chokes QP and is hard for users to understand.
Obs: lineage contains a lot of redundancy!
This work: approximate the lineage by keeping only the most important correlations.

Overview
Motivation & Preliminaries
An apx lineage approach: Sufficient Lineage
Experiments
Conclusions

A Protein Database
Inspired by the Geneontology (GO) database. A standard pDB, e.g. Mystiq, Trio.
Process (P):
Protein  Process             l
AGO2     "Cell Death"        X1
AGO2     "Embryonic Devel."  X1
AGO2     "Glands"            X2
Aac11    "Cell Death"        X2
Aac11    "Embryonic Devel."  X3
Atoms:
id  Description       P
X1  "Dr. Z told me"   0.9
X2  "PubMed:123"      0.8
X3  "Lab Experiment"  0.3
Lineage (l) is important: data are from somewhere. Lineage comes from a wide variety of sources, not all trusted the same: manually created, machine inferred, some with confidence, too!

Review: Lineage tracking
Systems: PRA [Fuhr & Rölleke 97], Trio [Widom 05], Mystiq [Ré, Dalvi, Suciu 07].
Lineage propagates with queries/views. Example view, "Proteins related to the same process as 'Aac11'", over the Process (P) table:
V(x) :- P(x,y), P('Aac11', y), x ≠ 'Aac11'
How do we derive the lineage? Each derivation of AGO2 contributes one conjunct: first (X1 ∧ X2), then (X1 ∧ X3). So AGO2 has lineage l1 = (X1 ∧ X2) ∨ (X1 ∧ X3). Lineage tracks all derivations.
Prob QP: Pr[V('AGO2')] = Pr[l1].
Big DB = big lineage: in GO, 1 tuple can carry 10MB of lineage! Big lineage chokes the engine!
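
The derivation above can be sketched in a few lines of Python. This is only an illustration, not how Mystiq or Trio track lineage: the rows of `P` are a reading of the slide's flattened table (an assumption), the head variable is written as the protein, and `view_lineage` and `show` are hypothetical helper names.

```python
from collections import defaultdict

# Process (P): (protein, process, lineage atom). Rows are an assumed reading of the slide.
P = [
    ("AGO2", "Cell Death", "X1"),
    ("AGO2", "Embryonic Devel.", "X1"),
    ("AGO2", "Glands", "X2"),
    ("Aac11", "Cell Death", "X2"),
    ("Aac11", "Embryonic Devel.", "X3"),
]


def view_lineage(table, target="Aac11"):
    """Lineage of V(x) :- P(x, y), P(target, y), x != target, as a DNF.

    Each output protein maps to a set of monomials (frozensets of atoms).
    """
    lineage = defaultdict(set)
    for protein, process, atom in table:
        if protein == target:
            continue
        for other, other_process, other_atom in table:
            # One derivation per matching pair of tuples: AND their atoms.
            if other == target and other_process == process:
                lineage[protein].add(frozenset({atom, other_atom}))
    return dict(lineage)


def show(dnf):
    """Render a DNF with ^ for AND and v for OR, in sorted order."""
    terms = sorted(sorted(monomial) for monomial in dnf)
    return " v ".join("(" + " ^ ".join(term) + ")" for term in terms)


result = view_lineage(P)
print(show(result["AGO2"]))  # (X1 ^ X2) v (X1 ^ X3)
```

Every pair of joining tuples adds one monomial, which is why lineage grows quickly on a large database.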

Problems with Large Lineage in pDB
Lineage is used to:
process queries, but large lineage chokes QP;
give explanations to users, but large lineage means many redundant explanations;
find influential atoms, but in large lineage that is a needle in a haystack.
On VLDBs, it is helpful to shrink (approximate) the lineage.

Approximate Lineage Approach
Original VLDB -> Level 1 Database (big lineage, l) -> Level 2 Database (small lineage, a), with error ε.
a is a smaller, approximate formula for l. Example for protein AGO2: l = (X1 ∧ X2) ∨ (X1 ∧ X3); a = (X1 ∧ X2), or a = 0.5*x1 + 0.3.
All (most) querying happens on the Level 2 database (using a instead of l). The focus is on the Level 2 database.

Sufficient lineage (SL)
Running example: protein AGO2 has l = (X1 ∧ X2) ∨ (X1 ∧ X3) and a = (X1 ∧ X2).
Represent a's? As DNF formulae that logically imply l, so a is a lower bound of l.
Use a's to answer queries? Reuse existing systems!
Use a's to provide explanations and find influential tuples? See paper.
Build good a's, efficiently? The remainder of this talk.
Nugget: an algorithm that always finds small, good SL.

Formalizing "good a's"
Choosing an approximation a for a lineage function l. Running example: atoms X1 ("Dr. Z told me", P = 0.9), X2 ("PubMed:123", P = 0.8), X3 ("Lab Experiment", P = 0.3), and l = (X1 ∧ X2) ∨ (X1 ∧ X3) for protein AGO2.
An atom is a Boolean proposition. A world is a set of the true atoms.
Formalizing this: E[l − a] ≤ ε. The expectation of the difference, taken over all worlds, should be small.
Intuition: a should agree with l on most worlds.
NB: this is really the standard ℓ2 distance.
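
A short derivation may help connect the expectation to the ℓ2 remark. It assumes a is a sufficient lineage, i.e. a logically implies l, and reads l and a as 0/1 functions of a world W; the squared form is one reading of "ℓ2 distance", not a definition taken from the slides.

```latex
\[
a \Rightarrow l
\quad\Longrightarrow\quad
l(W) - a(W) \in \{0, 1\} \ \text{for every world } W,
\]
\[
\mathbb{E}\big[(l - a)^2\big] \;=\; \mathbb{E}[\,l - a\,] \;=\; \Pr[l] - \Pr[a] \;\le\; \varepsilon .
\]
```

Under that assumption the squared distance and the plain difference coincide, so bounding one bounds the other.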

Illustrating Good Lineage
E[l − a] = E[l] − E[a] ≤ ε
Atoms: X1 ("Dr. Z told me", 0.9), X2 ("PubMed:123", 0.8), X3 ("Lab Experiment", 0.3).
Pr[l] for l = (X1 ∧ X2) ∨ (X1 ∧ X3): 0.9 * (1 − (1 − 0.8)(1 − 0.3)) = 0.9 * 0.86 = 0.774
Pr[a] for a = (X1 ∧ X2): 0.9 * 0.8 = 0.72
ε = 0.774 − 0.72 = 0.054
Intuition: Pr[a] high means good lineage.
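
The arithmetic on this slide can be checked by brute force over all worlds. The sketch below assumes the atoms are independent with the probabilities shown; `prob` is a hypothetical helper, and enumerating worlds is exponential, so this is for illustration only.

```python
from itertools import product

PROBS = {"X1": 0.9, "X2": 0.8, "X3": 0.3}


def prob(dnf, probs):
    """Pr of a DNF (list of monomials) by summing the weight of every satisfying world."""
    atoms = sorted(probs)
    total = 0.0
    for values in product([False, True], repeat=len(atoms)):
        world = dict(zip(atoms, values))
        # Weight of this world under independence of the atoms.
        weight = 1.0
        for atom in atoms:
            weight *= probs[atom] if world[atom] else 1.0 - probs[atom]
        if any(all(world[x] for x in monomial) for monomial in dnf):
            total += weight
    return total


l = [("X1", "X2"), ("X1", "X3")]  # original lineage
a = [("X1", "X2")]                # sufficient lineage: a implies l

pr_l = prob(l, PROBS)
pr_a = prob(a, PROBS)
error = pr_l - pr_a
print(round(pr_l, 3), round(pr_a, 3), round(error, 3))  # 0.774 0.72 0.054
```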

1st step: Lineage DNFs to "graphs"
We can think of DNFs as graphs (a k-DNF corresponds to a k-hypergraph): atoms = nodes, monomials = edges.
Example: (X1 ∧ Y1) ∨ (X2 ∧ Y1) is a graph on nodes X1, ..., Xn and Y1, ..., Ym with edges X1-Y1 and X2-Y1.
Trick: a matching is an SL formula.
Goal: given error ε, find a small subset of edges with error smaller than ε, i.e. a best lower bound.
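
A minimal sketch of the graph view, with monomials stored as tuples of atom names. The helper names and the probabilities of 0.5 are assumptions for illustration. The point is that the edges of a matching share no atoms, so they are independent events and the probability has a simple product form.

```python
from math import prod


def is_matching(monomials):
    """True if no atom (node) appears in two monomials (edges)."""
    seen = set()
    for monomial in monomials:
        if seen & set(monomial):
            return False
        seen |= set(monomial)
    return True


def matching_prob(monomials, probs):
    """Pr of a DNF whose monomials are pairwise disjoint, assuming independent atoms."""
    if not is_matching(monomials):
        raise ValueError("monomials share an atom; the product formula does not apply")
    # Probability that every edge fails, then complement.
    miss_all = prod(1.0 - prod(probs[x] for x in m) for m in monomials)
    return 1.0 - miss_all


probs = {"X1": 0.5, "X2": 0.5, "Y1": 0.5, "Y2": 0.5}  # assumed values
dnf = [("X1", "Y1"), ("X2", "Y1")]                    # the slide's example: both edges touch Y1

print(is_matching(dnf))                                      # False
print(matching_prob([("X1", "Y1"), ("X2", "Y2")], probs))    # 0.4375
```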

How big a matching could we need?
Assume Pr[Xi] = Pr[Yj] = 0.5. Then a matching M has Pr[M] = 1 − (1 − 0.25)^|M|.
Size  Pr[M] exceeds
9     0.9
17    0.99
25    0.999
33    0.9999
A matching of size 9 implies Pr[M] > 0.9, so for any ε > 0.1 a matching of size at most 9 always suffices.
Subtle: the size bound depends on k, ε and Pr[Xi], not on the # of tuples.
If l has a small good matching, take a to be that matching. Call this a "good enough matching".
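
The size table follows directly from the formula Pr[M] = 1 − (1 − 0.25)^|M|. The loop below recomputes it under the slide's assumption that every atom has probability 0.5, so each 2-atom edge has probability 0.25; `min_matching_size` is a hypothetical helper name.

```python
def min_matching_size(target, edge_prob=0.25):
    """Smallest |M| with 1 - (1 - edge_prob)^|M| >= target."""
    size = 0
    miss = 1.0  # probability that no edge of the matching is satisfied
    while 1.0 - miss < target:
        miss *= 1.0 - edge_prob
        size += 1
    return size


sizes = [min_matching_size(t) for t in (0.9, 0.99, 0.999, 0.9999)]
print(sizes)  # [9, 17, 25, 33]
```

Nothing in the loop depends on the number of tuples, only on the edge probability and the target.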

There is not always a good-enough matching
Example: (X1 ∧ (Y1 ∨ Y2 ∨ … ∨ Ym)) ∨ (X2 ∧ Z), where (Y1 ∨ Y2 ∨ … ∨ Ym) is a (k−1)-DNF.
The best matching has probability about 0.4, but the formula is very close to 0.625!
Obs: if there is no "good-enough matching", then the cover must be small, because the nodes in any maximal matching form a cover. Formally, {X1, X2} is a small cover here.
Approximate around the cover: (X1 ∧ APX(Y1 ∨ Y2 ∨ … ∨ Ym)) ∨ (X2 ∧ Z). We must apx the (k−1)-DNF with a smaller ε to account for correlations.
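
The gap between the best matching and the full formula can be checked numerically. The sketch assumes every atom has probability 0.5, as on the previous slide, and picks m = 10 as an arbitrary illustration; `dnf_prob` is a hypothetical brute-force helper.

```python
from itertools import product


def dnf_prob(dnf, atoms, p=0.5):
    """Pr of a DNF by enumerating worlds; every atom is independently true with probability p."""
    total = 0.0
    for values in product([False, True], repeat=len(atoms)):
        world = dict(zip(atoms, values))
        if any(all(world[x] for x in monomial) for monomial in dnf):
            n_true = sum(values)
            total += p ** n_true * (1 - p) ** (len(atoms) - n_true)
    return total


m = 10  # assumed number of Y atoms
ys = [f"Y{i}" for i in range(1, m + 1)]
atoms = ["X1", "X2", "Z"] + ys

# X1 ^ (Y1 v ... v Ym) expands to the monomials (X1 ^ Yi); plus (X2 ^ Z).
full = [("X1", y) for y in ys] + [("X2", "Z")]
# Any matching can use at most one edge through X1, so two edges is the best possible.
best_matching = [("X1", "Y1"), ("X2", "Z")]

pr_full = dnf_prob(full, atoms)
pr_matching = dnf_prob(best_matching, atoms)
print(pr_matching, round(pr_full, 4))  # 0.4375 0.6246
```

As m grows, the full formula approaches 0.625 while no matching can do better than 0.4375.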

SL is always small
THM (SL is always small): the size of SL is constant in the data. It is exponential in the query, similar to data complexity. This requires "non-vanishing" probabilities; in datasets, usually Pr > 10^-3.
Two cases:
Small good matching: we're done!
Small cover of important nodes: recurse on the (k−1)-DNF.
Problem: maximum matching in general hypergraphs is NP-hard (apx NP-hard!). But we only need a maximal matching: pick greedily!
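
The "pick greedily" step can be sketched as follows. This is not the paper's full recursive algorithm, only the maximal-matching step; taking the most probable monomials first is an assumption made here (the slide only says "greedily"), and the function name and probabilities are hypothetical.

```python
def greedy_maximal_matching(monomials, probs):
    """Greedily pick pairwise-disjoint monomials, most probable first (assumed order).

    Returns the matching and the set of atoms it uses. Because the matching is
    maximal, every monomial shares an atom with that set, so the set is a cover.
    """
    def monomial_prob(monomial):
        p = 1.0
        for atom in monomial:
            p *= probs[atom]
        return p

    matching, used = [], set()
    for monomial in sorted(monomials, key=monomial_prob, reverse=True):
        if used.isdisjoint(monomial):
            matching.append(monomial)
            used.update(monomial)
    return matching, used


# The previous slide's example with m = 3 and all probabilities 0.5 (assumed values).
dnf = [("X1", "Y1"), ("X1", "Y2"), ("X1", "Y3"), ("X2", "Z")]
probs = {atom: 0.5 for monomial in dnf for atom in monomial}

matching, cover = greedy_maximal_matching(dnf, probs)
print(len(matching), sorted(cover))
```

If the resulting matching is already good enough it is returned as the SL; otherwise its atoms give a small cover to recurse on.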

Summary of Constructing SL
For SL, good lineage = big lineage. Not true in general.
Gave an algorithm that always finds small SL: constant in the data, exponential in almost everything else.
Main trick: don't try to find optimal solutions when sloppy is good enough!

Other fun results in the paper
Sufficient Lineage (SL): error bounds for QP; finding influential tuples.
Polynomial Lineage (PL): DNF to polynomial, using a Taylor/Fourier approximation of the polynomial. Algos for QP, explanations and influential tuples. Leverages extensive prior art!
PL is smaller than SL, but not usable in pDBs (Mystiq, Trio).

Experiments
Geneontology Database: publicly available; predefined views; atoms = "evidence codes".
Discuss a single view: "All proteins associated with a single protein". 6 tables, 2 sources of evidence, 1119 tuples, 141MB.
Similar results on IMDB data, not presented.

Compression Ratio v. Error
[Plot: compression ratio against ε, error level (smaller is more conservative).]
30x compression: 141MB to 4MB. Good compression ratio even for stringent error.

Effect on QP
Compute each tuple in the view.
[Plot: running time in seconds (log10 scale) against ε, error level (smaller is more conservative), for original lineage and sufficient lineage.]

Which l's give the biggest gain?
Compressing a single view.
[Plot: # terms for the top 500 formulas in descending size (# is rank), for original lineage and sufficient lineage.]
Win: compressing big terms.

Conclusion
Discussed the approximate lineage approach. Goal: fast QP, explanations.
Sufficient Lineage: can be used by standard QPs; improves QP dramatically.
Apx lineage is more general, e.g. Polynomial.