1 Management of Probabilistic Data: Foundations and Challenges Nilesh Dalvi and Dan Suciu Univerisity of Washington.

Slides:



Advertisements
Similar presentations
Uncertainty in Data Integration Ai Jing
Advertisements

The Theory of Zeta Graphs with an Application to Random Networks Christopher Ré Stanford.
ANHAI DOAN ALON HALEVY ZACHARY IVES CHAPTER 14: DATA PROVENANCE PRINCIPLES OF DATA INTEGRATION.
A COURSE ON PROBABILISTIC DATABASES Dan Suciu University of Washington June, 2014Probabilistic Databases - Dan Suciu 1.
Materialized Views in Probabilistic Databases for Information Exchange and Query Optimization Christopher Re and Dan Suciu University of Washington 1.
Lecture 11: Datalog Tuesday, February 6, Outline Datalog syntax Examples Semantics: –Minimal model –Least fixpoint –They are equivalent Naive evaluation.
Efficient Processing of Top- k Queries in Uncertain Databases Ke Yi, AT&T Labs Feifei Li, Boston University Divesh Srivastava, AT&T Labs George Kollios,
CS4432: Database Systems II
Representing and Querying Correlated Tuples in Probabilistic Databases
Queries with Difference on Probabilistic Databases Sanjeev Khanna Sudeepa Roy Val Tannen University of Pennsylvania 1.
LAHAR: Extracting Events from Probabilistic Streams Chris Re, Julie Letchner, Magdalena Balazinska and Dan Suciu University of Washington.
PAPER BY : CHRISTOPHER R’E NILESH DALVI DAN SUCIU International Conference on Data Engineering (ICDE), 2007 PRESENTED BY : JITENDRA GUPTA.
Chris Re, Julie Letchner, Magdalena Balazinska and Dan Suciu University of Washington.
Probabilistic Histograms for Probabilistic Data Graham Cormode AT&T Labs-Research Antonios Deligiannakis Technical University of Crete Minos Garofalakis.
Top-K Query Evaluation on Probabilistic Data Christopher Ré, Nilesh Dalvi and Dan Suciu University of Washington.
A COURSE ON PROBABILISTIC DATABASES June, 2014Probabilistic Databases - Dan Suciu 1.
5/17/20151 Probabilistic Reasoning CIS 479/579 Bruce R. Maxim UM-Dearborn.
Efficient Query Evaluation on Probabilistic Databases
A COURSE ON PROBABILISTIC DATABASES Dan Suciu University of Washington June, 2014Probabilistic Databases - Dan Suciu 1.
ANHAI DOAN ALON HALEVY ZACHARY IVES Chapter 13: Incorporating Uncertainty into Data Integration PRINCIPLES OF DATA INTEGRATION.
Uncertainty Lineage Data Bases Very Large Data Bases
LUDWIG- MAXIMILIANS- UNIVERSITY MUNICH DATABASE SYSTEMS GROUP DEPARTMENT INSTITUTE FOR INFORMATICS Probabilistic Similarity Queries in Uncertain Databases.
Lineage Processing over Correlated Probabilistic Databases Bhargav Kanagal Amol Deshpande University of Maryland.
Efficient Query Evaluation on Probabilistic Databases Nilesh Dalvi Dan Suciu Presenter : Amit Goyal Discussion Lead : Jonatan.
Sensitivity Analysis & Explanations for Robust Query Evaluation in Probabilistic Databases Bhargav Kanagal, Jian Li & Amol Deshpande.
A Probabilistic Framework for Information Integration and Retrieval on the Semantic Web by Livia Predoiu, Heiner Stuckenschmidt Institute of Computer Science,
Machine Learning Week 2 Lecture 2.
. Bayesian Networks Lecture 9 Edited from Nir Friedman’s slides by Dan Geiger from Nir Friedman’s slides.
Representing Uncertainty CSE 473. © Daniel S. Weld 2 Many Techniques Developed Fuzzy Logic Certainty Factors Non-monotonic logic Probability Only one.
MystiQ The HusQies* *Nilesh Dalvi, Brian Harris, Chris Re, Dan Suciu University of Washington.
1 Probabilistic/Uncertain Data Management -- IV 1.Dalvi, Suciu. “Efficient query evaluation on probabilistic databases”, VLDB’ Sen, Deshpande. “Representing.
Efficient Query Evaluation over Temporally Correlated Probabilistic Streams Bhargav Kanagal, Amol Deshpande ΗΥ-562 Advanced Topics on Databases Αλέκα Σεληνιωτάκη.
Course Outline Book: Discrete Mathematics by K. P. Bogart Topics:
CS848: Topics in Databases: Foundations of Query Optimization Topics Covered  Databases  QL  Query containment  More on QL.
DBrev: Dreaming of a Database Revolution Gjergji Kasneci, Jurgen Van Gael, Thore Graepel Microsoft Research Cambridge, UK.
10/6/2015 1Intelligent Systems and Soft Computing Lecture 0 What is Soft Computing.
Ranking Queries on Uncertain Data: A Probabilistic Threshold Approach Wenjie Zhang, Xuemin Lin The University of New South Wales & NICTA Ming Hua,
1  Special Cases:  Query Semantics: (“Marginal Probabilities”)  Run query Q against each instance D i ; for each answer tuple t, sum up the probabilities.
The Relational Model: Relational Calculus
Christopher Re and Dan Suciu University of Washington Efficient Evaluation of HAVING Queries on a Probabilistic Database.
A COURSE ON PROBABILISTIC DATABASES Dan Suciu University of Washington June, 2014Probabilistic Databases - Dan Suciu 1.
Extending context models for privacy in pervasive computing environments Jadwiga Indulska The School of Information Technology and Electrical Engineering,
A COURSE ON PROBABILISTIC DATABASES Dan Suciu University of Washington June, 2014Probabilistic Databases - Dan Suciu 1.
Inference Complexity As Learning Bias Daniel Lowd Dept. of Computer and Information Science University of Oregon Joint work with Pedro Domingos.
Describing and Using Query Capabilities of Heterogeneous Sources Vasilis Vassalos& Yannis Papakonstantinou Presented by Srujan Kothapally.
Soft Computing Lecture 19 Part 2 Hybrid Intelligent Systems.
Lecture 7: Foundations of Query Languages Tuesday, January 23, 2001.
Inference Algorithms for Bayes Networks
Probabilities in Databases and Logics I Nilesh Dalvi and Dan Suciu University of Washington.
Uncertain Observation Times Shaunak Chatterjee & Stuart Russell Computer Science Division University of California, Berkeley.
1 Scalable Probabilistic Databases with Factor Graphs and MCMC Michael Wick, Andrew McCallum, and Gerome Miklau VLDB 2010.
1 Finite Model Theory Lecture 16 L  1  Summary and 0/1 Laws.
A Unified Approach to Ranking in Probabilistic Databases Jian Li, Barna Saha, Amol Deshpande University of Maryland, College Park, USA VLDB
CSE 473 Uncertainty. © UW CSE AI Faculty 2 Many Techniques Developed Fuzzy Logic Certainty Factors Non-monotonic logic Probability Only one has stood.
A COURSE ON PROBABILISTIC DATABASES Dan Suciu University of Washington June, 2014Probabilistic Databases - Dan Suciu 1.
Efficient Query Evaluation on Probabilistic Databases Nilesh Dalvi Dan Suciu Modified by Veeranjaneyulu Sadhanala.
1 Overview of Query Evaluation Chapter Outline  Query Optimization Overview  Algorithm for Relational Operations.
Lecture 9: Query Complexity Tuesday, January 30, 2001.
Chapter 7. Propositional and Predicate Logic
A Course on Probabilistic Databases
Solving MAP Exactly by Searching on Compiled Arithmetic Circuits
Approximate Lineage for Probabilistic Databases
Queries with Difference on Probabilistic Databases
Data Integration with Dependent Sources
Lecture 16: Probabilistic Databases
Intelligent Systems and
Lecture 10: Query Complexity
Probabilistic Databases
Chapter 7. Propositional and Predicate Logic
Probabilistic Databases with MarkoViews
Presentation transcript:

1 Management of Probabilistic Data: Foundations and Challenges Nilesh Dalvi and Dan Suciu Univerisity of Washington

2 Databases Are Deterministic Applications since 1970’s required precise semantics –Accounting, inventory Database tools are deterministic –A tuple is an answer or is not Underlying theory assumes determinism –FO (First Order Logic)

3 Future of Data Management We need to cope with uncertainties ! Represent uncertainties as probabilities Extend data management tools to handle probabilistic data Major paradigm shift affecting both foundations and systems

4 Uncertainties Everywhere In the schema mappings: –Data spaces –Pay as you go data integration In the data mapping –Life science data integration –Object reconciliation, fuzzy joins In the data itself –Data “by the masses” –Information Extraction –RFID data, sensor data [Halevy’2007] [Philippi&Kohler’2006] [Gupta&Sarawagi’2006] [Welbourne’2007] [Arasu’06]

5 Example 1 Data Integration in Life Sciences U2 integrates several biological databases [B.Louie et al.2007] User types: “Gene ABCD1” U2 finds 80 “related” proteins Ranks them by uncertainty score Correct 9 functions are among top 11 User types: “Gene ABCD1” U2 finds 80 “related” proteins Ranks them by uncertainty score Correct 9 functions are among top 11 Example: find functional annotations of ABCD1 EntrezProtein, Pfam, TIGRFAM, NCBI Blast, EntrezGene Need to represent uncertainties explicitly

6 Example 2 Information Extraction ID House-NoStreetCityP 152Goregaon WestMumbai AGoregaon WestMumbai GoregaonWest Mumbai AGoregaonWest Mumbai [Gupta&Sarawagi’2006]...52 A Goregaon West Mumbai... Here probabilities are meaningful ≈20% of such extractions are correct

7 Example 3 RFID Ecosystem at UW [Welbourne’2007]

8 RFID data = noisy –SIGHTING(tagID, antennaID, time) Derived data = Probabilistic –“John entered Room 524 at 9:15” prob=0.6 –“John carried laptop x77 at 11:03” prob=0.8 –... Queries –“Which people were in Room 478 yesterday ?” Massive amounts of probabilistic data from RFIDs, sensors

9 A Model for Uncertainties Data is probabilistic Queries formulated in a standard language Answers are annotated with probabilities This talk: Probabilistic Databases

10 Probabilistic databases: Long History Cavallo&Pitarelli:1987 Barbara,Garcia-Molina, Porter:1992 Lakshmanan,Leone,Ross&Subrahmanian:1997 Fuhr&Roellke:1997 Dalvi&S:2004 Widom:2005 Focus today: the Query Evaluation Problem

11 AIDatabases Deterministic Theorem prover Query processing Probabilistic Probabilistic inference [this talk] Has this been solved by AI ? Input: KB Fix q Input: DB

12 Outline Data model Query evaluation Challenges

13 What is a Probabilistic Database (PDB) ? ObjectTime PersonP Laptop779:07 John0.62 Jim0.34 Book3029:18 Mary0.45 John0.33 Fred0.11 HasObject p What does it mean ? Keys Probability [Barbara et al.1992] Non-keys

Background Finite probability space = ( , P)  = {  1,...,  n } = set of outcomes P :   [0,1] P(  1 ) P(  n ) = 1 Event: E  , P(E) =   E P(  ) “Independent”: P(E 1 E 2 ) = P(E 1 ) P(E 2 ) “Mutual exclusive” or “disjoint”: P(E 1 E 2 ) = 0

15 Possible Worlds Semantics ObjectTime PersonP Laptop779:07 Johnp1p1 Jimp2p2 Book3029:18 Maryp3p3 Johnp4p4 Fredp5p5  ={ ObjectTime Person Laptop779:07John Book3029:18Mary ObjectTime Person Laptop779:07John Book3029:18John ObjectTime Person Laptop779:07John Book3029:18Fred ObjectTime Person Laptop779:07Jim Book3029:18Mary ObjectTime Person Laptop779:07Jim Book3029:18John ObjectTime Person Laptop779:07Jim Book3029:18Fred ObjectTime Person Laptop779:07John ObjectTime Person Laptop779:07Jim ObjectTime Person Book3029:18Mary ObjectTime Person Book3029:18John ObjectTime Person Book3029:18Fred ObjectTime Person } p1p3p1p3 p1p4p1p4 p 1 (1- p 3 -p 4 -p 5 ) Possible worlds PDB

16 Definition: A tuple-disjoint/independent table is: R( A1, A2, …, Am, B1, …, Bn, P) Definition: A tuple-independent table is: R( A1, A2, …, Am, P) Definition: Semantics is given by possible worlds Definitions

17 ObjectTime PersonP Laptop779:07 Johnp1p1 Jimp2p2 Book3029:18 Maryp3p3 Johnp4p4 Fredp5p5 HasObject( Object, Time, Person, P) Meets( Person1, Person2, Time, P) Person1Person2Time P JohnJim9.12p1p1 MarySue9:20p2p2 JohnMary9:20p3p3 Disjoint Independent Inde- pen- dent Disjoint

18 Query Semantics P(q) =   |= q P(  ) A boolean query q is an event: {  |  |= q } HasObject(‘MyBook’,x,t), EnterRoom(x,’CoffeeRoom’,t) Did someone take MyBook to the CoffeeRoom ? P(q) =0.96 (meaning: quite likely !) q= 

19 Discussion of Data Model Tuple-disjoint/independent tables: Simple model, can store in any DBMS More advanced models: Symbolic boolean expressions Trio: add lineage Probabilistic Relational Models Graphical models Fuhr and Roellke [Widom05, Das Sarma’06, Benjelloun 06] [Getoor’2006] [Sen&Desphande’07]

20 Outline Data model Query evaluation –Probability of Boolean expressions –From queries to Boolean expressions –Data complexity of query evaluation Challenges

21  = X 1 X 2 Ç X 1 X 3 Ç X 2 X 3 Pr(  )=(1-p 1 )p 2 p 3 + p 1 (1-p 2 )p 3 + p 1 p 2 (1-p 3 ) + p 1 p 2 p 3 X1X1 X2X2 X3X3 P  (1-p 1 )p 2 p p 1 (1-p 2 )p p 1 p 2 (1-p 3 )1 111p1p2p3p1p2p3 1 Probability of Boolean Expressions P(X 1 )= p 1, P(X 2 )= p 2, P(X 3 )= p 3 Compute P(  ) ==

22 Background [Karp&Luby:1983] Theorem For DNF  Approximation of Pr(  ) is in PTIME (FPTRAS) Theorem For DNF  Approximation of Pr(  ) is in PTIME (FPTRAS) Theorem Exact evaluation of Pr(  ) is #P-complete [Valiant:1979] Fix P(X 1 )= P(X 2 )=... = P(X n )= 1/2 [Graedel,Gurevitch,Hirsch:1998] Both theorems extend to rational P(X 1 ),..., P(X n )

23 Query q + Database PDB   R( x, y ), S( x, z ) PDB= AB P a1a1 b1b1 p1p1 X1X1 a2a2 b2b2 p2p2 X2X2 AC P a1a1 c1c1 q1q1 Y1Y1 a1a1 c2c2 q2q2 Y2Y2 a2a2 c3c3 q3q3 Y3Y3 a2a2 c4c4 q4q4 Y4Y4 a2a2 c5c5 q5q5 Y5Y5 X 1 Y 1 Ç X 1 Y 2 Ç X 2 Y 3 Ç X 2 Y 4 Ç X 2 Y 5 q=   = RpRp SpSp

24 Application to Query Evaluation Corollary Fix FO query q Exact evaluation of Pr( q ) on input PDB is in #P Corollary Fix a conjunctive query q. Approximation of Pr( q ) on input PDB is in PTIME (FPTRAS) [Graedel,Gurevitch,Hirsch:1998]

Background: Probabilistic Networks Inference: hard in general KR techniques exploit local properties: E.g. bounded treewidth  PTIME  = X 1 Y 1 ÇX 1 Y 2 ÇX 2 Y 3 ÇX 2 Y 4 ÇX 2 Y 5 X1X1 X2X2 Y2Y2 Y1Y1 Y3Y3 ÆÆÆÆÆ ÇÇ Ç Y4Y4 Y5Y5 p1p1 p2p2 q1q1 q2q2 q3q3 q4q4 q5q5 R( x, y ), S( x, z ) [Zabiyaka&Darwiche’0 6] Note: for this query the treewidth is unbounded

26 AB P a1a1 b1b1 p1p1 a2a2 b2b2 p2p2 AC P a1a1 c1c1 q1q1 a1a1 c2c2 q2q2 a2a2 c3c3 q3q3 a2a2 c4c4 q4q4 a2a2 c5c5 q5q5 q = AP a1a1 1-(1-q 1 )(1-q 2 ) a2a2 1-(1-q 3 )(1-q 4 )(1-q 5 ) AP a1a1 p 1 (1-(1-q 1 )(1-q 2 )) a2a2 p 2 (1-(1-q 3 )(1-q 4 )(1-q 5 )) P(q) = 1 - (1-p 1 (1-(1-q 1 )(1-q 2 ))) * (1-p 2 (1-(1-q 3 )(1-q 4 )(1-q 5 ))) The data complexity of this query is PTIME Rp(A,B)Rp(A,B) Sp(A,C)Sp(A,C) AA Join  AA [D&S’2004] R( x, y ), S( x, z ) “safe plan”

27 Theorem One of the following holds: (1)Either q is in PTIME (2)Or q is #P hard Theorem One of the following holds: (1)Either q is in PTIME (2)Or q is #P hard [D&S’2004] [Andritsos et al’2006] Dichotomy Theorem Let q be a conjunctive query without self-joins In Case (1) q can be computed by a “safe plan” and we call it a “safe query”

#P-Hard QueriesPTIME Queries R( x, y ), S( x, z ) R( x, y), S( y ), T( ‘a’, y) R( x ), S( x, y ), T( y ), U( u, y), W( ‘a’, u)... h1 = R( x ), S( x, y ), T( y ) h2 = R( x,y), S( y ) h3 = R( x,y), S(x, y ) How do we decide if a query is in PTIME or #P hard ?

29 Hierarchical Queries sg(x) = set of subgoals containing the variable x in a key position Definition A query q is hierarchical if forall x, y: sg(x)  sg(y) or sg(x)  sg(y) or sg(x)  sg(y) =  Definition A query q is hierarchical if forall x, y: sg(x)  sg(y) or sg(x)  sg(y) or sg(x)  sg(y) =  h1 = R( x ), S( x, y ), T( y ) R S xy T Non-hierarchical q = R( x, y ), S( x, z ) R S x z Hierarchical y

30 Case 1: Independent Tuples Only Fact If q is hierarchical then q is in PTIME [D&S’2004] q = R( x, y ), S( x, z ) R S x z y R p (x,y) S p (x,z)  -z Join x  -x  -y  Independent project PTIME Queries: The hierarchy gives the safe plan ! 1.Root variable u   -u 2.Connected components  Join

31 Case 1: Independent Tuples Only Fact If q is non-hierarchical then it is #P-hard. [D&S’2004] h1 = R( x ), S( x, y ), T( y ) Proof: it “contains” h1: q =... R( x,...), S( x, y,...), T( y,...)... Theorem Testing if q is PTIME or #P-hard is in AC 0 Recall: #P-hard Queries: h1 is #P-hard (reduction from Partitioned Positive 2DNF) [Provan&Ball’83]

32 Case 2: Independent/disjoint Tuples R( x ), S( x, y ), T( y ), U( u, y), W( ‘a’, u) R S x y T W R p (x)S p (x,y) Join y  -u D  -y D Join u  -x I u W p (‘a’,u) U p (u,y) Join x Independent project PTIME Queries: Disjoint project U 1.Root variable   I 2.CC’s  Join 3.Constant key attrs   D T p (y)

33 Case 2: Independent/disjoint Tuples Theorem Testing if q is PTIME or #P-hard is PTIME complete Recall: h1 = R( x ), S( x, y ), T( y ) h2 = R( x,y), S( y ) h3 = R( x,y), S(x, y ) #P-hard Queries: If the safe-plan algorithm fails on q, then q can be “rewritten” to either h1 or h2 or h3 and hence is #P-hard (see paper for details) #P-hard by reduction from PERMANENT

34 Summary on Query Evaluation We understand completely only queries w/o self-joins Lessons learned from our system MystiQ: When the query is safe: –Evaluate it exactly, in the database engine –Performance: close to regular SQL When the query is unsafe –Approximate it, compute only top-k –Performance: one or two orders of magnitude worse [Re’2007]

35 Outline Data model Query evaluation Challenges

36 Query Optimization Even a #P-hard query often has subqueries that are in PTIME. Needed: Combine safe plans + probabilistic inference “Interesting indepence/disjointness” Model a probabilistic engine as black-box CHALLENGE: Integrate a black-box probabilistic inference in a query processor. [Re’2007,Re’2007b]

37 Probabilistic Inference Algorithms Open the box ! Logical to physical Examine specific algorithms from KR: Variable elimination Junction trees Bounded treewidth [Sen&Deshpande’2007] [Bravo&Ramakrishnan’2007] CHALLENGE: (1) Study the space of optimization alternatives. (2) Estimate the cost of specific probabilistic inference algorithms.

38 Open Theory Problems Self-joins are much harder to study –Solved only for independent tuples Extend to richer query language –Unions, predicates (<, ≤, ≠), aggregates Do hardness results still hold for Pr = 1/2 ? CHALLENGE: Complete the analysis of the query complexity over probabilistic databases [D&S’2007]

39 Complex Probabilistic Model Independent and disjoint tuples are insufficient for real applications Capturing complex correlations: –Lineage –Graphical models [Getoor’06,Sen&Deshpande’07] [Das Sarma’06,Benjelloum’06] CHALLENGE: Explore the connection between complex models and views [Verma&Pearl’1990]

40 Constraints Needed to clean uncertainties in the data Hard constraints: –Semantics = conditional probability Soft constraints: –What is the semantics ? Lots of prior work, but still little understood [Shen’06, Andritsos’06, Richardson’06,Chaudhuri’07] CHALLENGE: Study the impact of hard/soft constraints on query evaluation

41 Information Leakage A view V should not leak information about a secret S Issues: Which prior P ? What is ≈ ? Probability Logic: U  V means P(V | U) ≈ 1 CHALLENGE: Define a probability logic for reasoning about information leakage [Evfimievski’03,Miklau&S’04,DMS’05] [Pearl’88, Adams’98] P(S) ≈ P(S | V)

42 Conclusions Prohibitive cost of cleaning data Represent uncertainties explicitly Need to re-examine many assumptions A call to arms: The management of probabilistic data A call to arms: The management of probabilistic data