Probabilities in Databases and Logics I Nilesh Dalvi and Dan Suciu University of Washington.

Slides:



Advertisements
Similar presentations
University of Washington Database Group The Complexity of Causality and Responsibility for Query Answers and non-Answers Alexandra Meliou, Wolfgang Gatterbauer,
Advertisements

A COURSE ON PROBABILISTIC DATABASES Dan Suciu University of Washington June, 2014Probabilistic Databases - Dan Suciu 1.
Materialized Views in Probabilistic Databases for Information Exchange and Query Optimization Christopher Re and Dan Suciu University of Washington 1.
Efficient Processing of Top- k Queries in Uncertain Databases Ke Yi, AT&T Labs Feifei Li, Boston University Divesh Srivastava, AT&T Labs George Kollios,
CPSC 504: Data Management Discussion on Chandra&Merlin 1977 Laks V.S. Lakshmanan Dept. of CS UBC.
Cleaning Uncertain Data with Quality Guarantees Reynold Cheng, Jinchuan Chen, Xike Xie 2008 VLDB Presented by SHAO Yufeng.
Efficient Processing of Top- k Queries in Uncertain Databases Ke Yi, AT&T Labs Feifei Li, Boston University Divesh Srivastava, AT&T Labs George Kollios,
Queries with Difference on Probabilistic Databases Sanjeev Khanna Sudeepa Roy Val Tannen University of Pennsylvania 1.
PAPER BY : CHRISTOPHER R’E NILESH DALVI DAN SUCIU International Conference on Data Engineering (ICDE), 2007 PRESENTED BY : JITENDRA GUPTA.
SECTION 21.5 Eilbroun Benjamin CS 257 – Dr. TY Lin INFORMATION INTEGRATION.
Top-K Query Evaluation on Probabilistic Data Christopher Ré, Nilesh Dalvi and Dan Suciu University of Washington.
Trust and Profit Sensitive Ranking for Web Databases and On-line Advertisements Raju Balakrishnan (Arizona State University)
A COURSE ON PROBABILISTIC DATABASES June, 2014Probabilistic Databases - Dan Suciu 1.
Efficient Query Evaluation on Probabilistic Databases
E FFICIENT T OP - K Q UERY E VALUATION ON P ROBABILISTIC D ATA P APER B Y C HRISTOPHER R´ E N ILESH D ALVI D AN S UCIU Presented By Chandrashekar Vijayarenu.
A COURSE ON PROBABILISTIC DATABASES Dan Suciu University of Washington June, 2014Probabilistic Databases - Dan Suciu 1.
DATA MINING LECTURE 12 Link Analysis Ranking Random walks.
Efficient Query Evaluation on Probabilistic Databases Nilesh Dalvi Dan Suciu Presenter : Amit Goyal Discussion Lead : Jonatan.
CMPT 354, Simon Fraser University, Fall 2008, Martin Ester 52 Database Systems I Relational Algebra.
CS151 Complexity Theory Lecture 7 April 20, 2004.
1 Probabilistic/Uncertain Data Management -- III Slides based on the Suciu/Dalvi SIGMOD’05 tutorial 1.Dalvi, Suciu. “Efficient query evaluation on probabilistic.
1 9. Evaluation of Queries Query evaluation – Quantifier Elimination and Satisfiability Example: Logical Level: r   y 1,…y n  r’ Constraint.
Derandomization: New Results and Applications Emanuele Viola Harvard University March 2006.
1 Polynomial Church-Turing thesis A decision problem can be solved in polynomial time by using a reasonable sequential model of computation if and only.
1 Management of Probabilistic Data: Foundations and Challenges Nilesh Dalvi and Dan Suciu Univerisity of Washington.
Exploiting Correlated Attributes in Acquisitional Query Processing Amol Deshpande University of Maryland Joint work with Carlos Sam
Job Scheduling Lecture 19: March 19. Job Scheduling: Unrelated Multiple Machines There are n jobs, each job has: a processing time p(i,j) (the time to.
MystiQ The HusQies* *Nilesh Dalvi, Brian Harris, Chris Re, Dan Suciu University of Washington.
On the Semantics and Evaluation of Top-k Queries in Probabilistic Databases Presented by Xi Zhang Feburary 8 th, 2008.
1 Probabilistic/Uncertain Data Management -- IV 1.Dalvi, Suciu. “Efficient query evaluation on probabilistic databases”, VLDB’ Sen, Deshpande. “Representing.
Propositional Logic Reasoning correctly computationally Chapter 7 or 8.
CS848: Topics in Databases: Foundations of Query Optimization Topics Covered  Databases  QL  Query containment  More on QL.
A D ICHOTOMY ON T HE C OMPLEXITY OF C ONSISTENT Q UERY A NSWERING FOR A TOMS W ITH S IMPLE K EYS Paris Koutris Dan Suciu University of Washington.
1 On Provenance of Non-Answers for Queries over Extracted Data Jiansheng Huang Ting Chen AnHai Doan Jeffrey F. Naughton.
Finding Optimal Probabilistic Generators for XML Collections Serge Abiteboul, Yael Amsterdamer, Daniel Deutch, Tova Milo, Pierre Senellart BDA 2011.
Ranking Queries on Uncertain Data: A Probabilistic Threshold Approach Wenjie Zhang, Xuemin Lin The University of New South Wales & NICTA Ming Hua,
1  Special Cases:  Query Semantics: (“Marginal Probabilities”)  Run query Q against each instance D i ; for each answer tuple t, sum up the probabilities.
General Database Statistics Using Maximum Entropy Raghav Kaushik 1, Christopher Ré 2, and Dan Suciu 3 1 Microsoft Research 2 University of Wisconsin--Madison.
Querying Structured Text in an XML Database By Xuemei Luo.
Searching for Extremes Among Distributed Data Sources with Optimal Probing Zhenyu (Victor) Liu Computer Science Department, UCLA.
Structured Querying of Web Text A Technical Challenge Kulsawasd Jitkajornwanich University of Texas at Arlington CSE6339 Web Mining.
Christopher Re and Dan Suciu University of Washington Efficient Evaluation of HAVING Queries on a Probabilistic Database.
DAQ: A New Paradigm for Approximate Query Processing Navneet Potti Jignesh Patel VLDB 2015.
A COURSE ON PROBABILISTIC DATABASES Dan Suciu University of Washington June, 2014Probabilistic Databases - Dan Suciu 1.
1 Bisimulations as a Technique for State Space Reductions.
A COURSE ON PROBABILISTIC DATABASES Dan Suciu University of Washington June, 2014Probabilistic Databases - Dan Suciu 1.
1 Discovering Robust Knowledge from Databases that Change Chun-Nan HsuCraig A. Knoblock Arizona State UniversityUniversity of Southern California Journal.
1 Relational Algebra and Calculas Chapter 4, Part A.
Webdamlog and Contradictions Daniel Deutch Tel Aviv University Joint work with Serge Abiteboul, Meghyn Bienvenu, Victor Vianu.
All right reserved by Xuehua Shen 1 Optimal Aggregation Algorithms for Middleware Ronald Fagin, Amnon Lotem, Moni Naor (PODS01)
Approximate Inference: Decomposition Methods with Applications to Computer Vision Kyomin Jung ( KAIST ) Joint work with Pushmeet Kohli (Microsoft Research)
Answering Top-k Queries Using Views Gautam Das (Univ. of Texas), Dimitrios Gunopulos (Univ. of California Riverside), Nick Koudas (Univ. of Toronto), Dimitris.
Introduction.  Administration  Simple DBMS  CMPT 454 Topics John Edgar2.
Lecture 7: Foundations of Query Languages Tuesday, January 23, 2001.
03/02/20061 Evaluating Top-k Queries Over Web-Accessible Databases Amelie Marian Nicolas Bruno Luis Gravano Presented By: Archana and Muhammed.
1 Finite Model Theory Lecture 16 L  1  Summary and 0/1 Laws.
A Unified Approach to Ranking in Probabilistic Databases Jian Li, Barna Saha, Amol Deshpande University of Maryland, College Park, USA VLDB
Scrubbing Query Results from Probabilistic Databases Jianwen Chen, Ling Feng, Wenwei Xue.
Machine Learning Chapter 7. Computational Learning Theory Tom M. Mitchell.
A COURSE ON PROBABILISTIC DATABASES Dan Suciu University of Washington June, 2014Probabilistic Databases - Dan Suciu 1.
Efficient Query Evaluation on Probabilistic Databases Nilesh Dalvi Dan Suciu Modified by Veeranjaneyulu Sadhanala.
Lecture 9: Query Complexity Tuesday, January 30, 2001.
A Course on Probabilistic Databases
Approximate Lineage for Probabilistic Databases
Queries with Difference on Probabilistic Databases
Data Integration with Dependent Sources
Lecture 16: Probabilistic Databases
Rank Aggregation.
Probabilistic Databases
Probabilistic Databases with MarkoViews
Presentation transcript:

Probabilities in Databases and Logics I Nilesh Dalvi and Dan Suciu University of Washington

2 Two Lectures Today: probabilistic database to model imprecisions probabilistic logics Tomorrow: probabilistic database to model incompletness random graphs

3 Motivation Record reconciliation Information extraction Constraint violations Schema matching

4 Databases 101 Tables: titleyear Twelve Monkeys1995 Monkey Love1997 Monkey Love1935 Monkey Love Planet2005 Queries: SELECT title, rating FROM Movie, Review WHERE title = name Answers: titlerating Twelve Monkeys3 Monkey Love5 Monkey Love Planet5 Movie

5 nameratingp Monkey Lovegood.5 fair.2 fair.6 poor.9 Review Queries: A(x,y) :- Review(x,y), Movie(x,z), z > 1991 Movie(x,z), z > 1991 Problem Setting Tables: titleyearp Twelve Monkeys Monkey Love Monkey Love Monkey Love Pl Answers: titleratingp Twelve Monkeysfair.53 Monkey Lovegood.42 Monkey Love Plfair.15 Movie Top k Problem: complexity of query evaluation

6 Two Problems Fix answer tuple (a,b) Given database I, compute Pr(Q(a,b)) Query evaluation problem Fixed schema S, conjunctive query Q(x,y) Fix k > 0 Given database I, find k answer tuples with highest probabilities Top-k answering problem

7 Related Work: DB Cavallo&Pitarelli:1987 Barbara,Garcia-Molina, Porter:1992 Lakshmanan,Leone,Ross&Subrahmanian:199 7 Fuhr&Roellke:1997 Dalvi&S:2004 Widom:2005

8 Related Work: Logic Query reliability [Gradel,Gurevitch,Hirsch’98] Degrees of belief [Bacchus,Grove,Halpern,Koller’96] Probabilistic Logic [Nielson] Probabilistic model checking [Kwiatkowska’02] Probabilistic Relational Model [Taskar,Abbeel,Koller’02]

9 Outline Definitions Query Evaluation Top-k answering (joint with Chris Re) Conclusions

10 Application 1: Record Linkage Title Monk 12 Monkeys Twelve Monkey Movies Data cleaning remains expensive, critical: Prob dbs: “garbage in, ranked answers out” Extensive research area Which records match ? Review Twelve Monkeys 12 Monkeys (1996) Monkey Love Planet of the Apes Reviews ? Today: “garbage in, garbage out”

11 Application 1: Fuzzy Object Matching TitleMatch titlereviewp Twelve Monkeys12 Monkeys0.7 Monkey Love Monkeys0.45 Monkey Love 1935Monkey Love0.82 Monkey Love 1935Monkey Boy0.68 Monkey Love PlanetMonkey Love0.8 Table q-gram or edit dist.

12 Application 1: Fuzzy Object Matching Intensively studied: “record linkage”, “deduplication”, “object reconciliation”, etc. Current usage: score v.s. threshold New usage: scores as probabilities

13 Application 1: Fuzzy Object Matching Queries: Find all movies rated highly by both Joe and Jim titleyearp Monkey Love Twelve Monkeys Monkey Love Monkey Love Planet Answers: top k

14 Application 2: Information Extraction Collection of unstructured documents Define tables, populate them ➔ scores Variety machine learning techniques

15 Application 2: Information Extraction Posted By Review Text Monkeys is an OK movie is one of the best movies I've seen never seen 12 Monkeys but I love Monk. Which movie is the review about ? Is this review positive or negative ?

16 Application 2: Information Extraction I've never seen 12 Monkeys but I love Monk. Review text: Extensive research area Probabilistic dbs: SQL queries, rank answers MovieActorRating 12 MonkeysMonkfair Facts Table: Avatar (IBM)documentscorporate facts Textrunner (UW) 100M Web pages 1BN facts 0.3 Today: facts can be used only in isolation

17 Application 2: Information Extraction Extensive area: From text segmentation to extraction from WWW AVATAR (IBM): corporate data from docs TextRunner (UW): 1,000,000,000 facts from WWW ATTENEX (startup): discovery for law offices Inherently imprecise ! Probabilistic dbs: keep and use scores

18 Application 3: Activity Recognition Probabilistic dbs: use scores to rank answers NameTimeActivity Suetrun | walk Suet+1walk | stand | sit [Lester05,Liao05] Equip People with sensors, classify activity Elderly health care (e.g. Alzheimer's) Has Mr. Johnson eaten in the last 12 hours?

19 Application 3: Activity Recognition subjecttimeactp Barbaratrun0.3 Barbarateat0.45 Jimt+1eat0.82 Barbarat+1eat0.45 Table

20 Other Applications Sensor data, RFID [Madden] Constraint violations [Bertossi02,Fuxman06] Schema matching [Doan03] Security/privacy [Evfimievski03, Miklau04] Bio-, medical-, clinical informatics [DataGrid (a startup)] Personal information management [Karger, Halevy]

21 Summary of Applications Large range of apps with imprecise data Specific techniques exists Imprecisions handled at the application level Goal of probabilistic databases: manage all imprecisions uniformly

22 Outline Applications Data model Queries Multisimulation Conclusions

23 Pr : Inst → [0,1], ∑ I Pr[I] = 1 Probabilistic Database Schema S, Domain D, Set of instances Inst Definition Probabilistic database is a probability distribution If Pr[I] > 0 then I is called “possible world”

24 Probabilistic Database Representation: Independent tuples: I-database DB over some schema S i Independent and disjoint tuples: ID-database DB over some schema S id Semantics: DB “means” probability distribution Pr over schema S

25 Independent Events A tuple is in the database with probability p Any two tuples are independent events

26 I-Databases MovieScoreP m42good p1p1 m99good p2p2 m76poor p3p3 Pr[I 1 ] + Pr[I 2 ] Pr[I 8 ] = 1 Reviews i (M,S,p) MovScor (1-p 1 )*(1-p 2 )*(1-p 3 ) Pr[I 1 ]= MovScor m42good m99good p 1 *p 2 *(1-p 3 ) Pr[I 4 ]= p 1 *p 2 *p 3 Pr[I 8 ]= Representation Possible worlds semantics Reviews(M,S) MovScor m42good m99good m76poor MovScor m99good MovScor m42good MovScor m m76poor MovScor m42good m76poor MovScor m76poor

27 Disjoint Events Needed in Many-to-1 matchings Possible values for attributes [Barbara’92] NameAgeP John John Mary NameAge John 34 (0.3) 43 (0.7) Mary 25

28 ID-Databases Time d ActivityP t walk p1p1 t run p2p2 t+1 walk p3p3 Pr[I 1 ] + Pr[I 2 ] Pr[I 6 ] = 1 Activities id TimeActTimeAct t run TimeAct t walk t+1 walk TimeAct t walk TimeAct t+1 walk (1-p 1 -p 2 )*(1-p 3 ) Pr[I 1 ]= p 2 *(1-p 3 ) Pr[I 3 ]= p 1 *p 3 Pr[I 5 ]= Activities TimeAct t run t+1 walk

29 ID subsumes I Movie d Score d P m42good p1p1 m99good p2p2 m76poor p3p3 Reviews id MovieScoreP m42good p1p1 m99good p2p2 m76poor p3p3 Reviews i = Note: MovieScoreP m42good p1p1 m99good p2p2 m76poor p3p3 Reviews id means all tuples are disjoint

30 Queries idyearP m m m m midratingp m m m m m Movie i Review i Q(y) :- Movie(x,y), Review(x,z), z >= 3 Syntax: conjunctive queries over schema S

31 Two Query Semantics Possible answer sets Given set A: Used for views Possible tuples Given tuple t: Used for query evaluation and top-k Pr[{t | I ⊨ Q(t)} = A] Pr[I ⊨ Q(t)] Thistalk

32 Possible Answer Sets ActorMatch ActoramzAct TitleMatch TitleReview p2p2 ActorMatch ActoramzAct TitleMatch TitleReview p3p3 MoviewReviewMatch(mid,rid) :- Movie(mid,x,-),Review(rid,y,-,-),TitleMatch(x,y), Movie(mid,x,-),Review(rid,y,-,-),TitleMatch(x,y), Actor(u,mid), ActorReview(v,rid),ActorMatch(u,v) Actor(u,mid), ActorReview(v,rid),ActorMatch(u,v) MoviewReviewMatch midridmidrid MoviewReviewMatch midrid MoviewReviewMatch P 1 + P 2 P3P3 P 4 + P 5 + P 6 ActorMatch ActoramzAct TitleMatch TitleReview p4p4 ActorMatch ActoramzAct TitleReview p1p1 TitleMatch ActorMatch ActoramzAct TitleMatch TitleReview p5p5

33 p2p2 year p1p1 idyear m m m p2p2 idyear m m p4p4 idyear m m p3p3 idyear m m m Query Semantics 1 year p4p4 p 1 + p 3 year 2004 Q(y) :- Movie(x,y), Review(x,z) Possible answer setsUseful for Views

34 ID + Views = Complete Theorem. ID-databases plus views are “complete” for representing possible worlds distributions R A a c d f g p2p2 R A b c d g p3p3 R A a c p4p4 R A a b c p1p1 R A b c d f g p5p5 RA AW aw1w1 bw1w1 cw1w1 aw2w2 cw2w2 dw2w2 f... WP w1w1 p1p1 w2w2 p2p2 w3w3 p3p3 w4w4 p4p4 w5w5 p5p5 PWD di di= ∅ R(x) :- RA(x,w), PWD di (w) PrDB

35 p1p1 idyear m m m p2p2 idyear m m p4p4 idyear m m p3p3 idyear m m m Q(y) :- Movie(x,y), Review(x,z) top k yearp 1935p 2 + p 3 = p 1 + p 3 = p 3 = Query Semantics Tuple probabilities

36 Complex Correlations MoviewReviewMatch(mid,rid) :- Movie(mid,x,-),Review(rid,y,-,-),TitleMatch(x,y), Movie(mid,x,-),Review(rid,y,-,-),TitleMatch(x,y), Actor(u,mid), ActorReview(v,rid),ActorMatch(u,v) Actor(u,mid), ActorReview(v,rid),ActorMatch(u,v) From atomic events to complex events Views

37 Complex Correlations ActorMatch ActoramzAct TitleMatch TitleReview p2p2 ActorMatch ActoramzAct TitleMatch TitleReview p3p3 MoviewReviewMatch(mid,rid) :- Movie(mid,x,-),Review(rid,y,-,-),TitleMatch(x,y), Movie(mid,x,-),Review(rid,y,-,-),TitleMatch(x,y), Actor(u,mid), ActorReview(v,rid),ActorMatch(u,v) Actor(u,mid), ActorReview(v,rid),ActorMatch(u,v) MoviewReviewMatch midridmidrid MoviewReviewMatch midrid MoviewReviewMatch P 1 + P 2 P3P3 P 4 + P 5 + P 6 ActorMatch ActoramzAct TitleMatch TitleReview p4p4 ActorMatch ActoramzAct TitleReview p1p1 TitleMatch ActorMatch ActoramzAct TitleMatch TitleReview p5p5

38 Summary on Data Model Data Model: Semantics = possible worlds Syntax = I-databases or ID-databases Queries: Syntax = unchanged (conjunctive queries) Semantics = tuple probabilities

39 Outline Definitions Query evaluation Top-k answering Conclusions

40 Problem Definition Fix schema S, query Q, answer tuple t Problem: given I/ID-database DB, compute Pr[I ⊨ Q(t)] Conventions: For upper bounds (P or #P): probabilities are rationals For lower bounds (#P): probabilities are 1/2 Pr[Q(t)] notation:

41 Query Evaluation on I-Databases Outline Intuition Extensional plans: PTIME case Hard queries: #P-complete case Dichotomy Theorem

42 Intuition Yearp p 1 × (1 - (1 - q 1 ) ×(1 - q 2 )×(1 - q 3 )) 1 - (1 - ) × (1 - ) p 2 × (1 - (1 - q 4 )×(1 - q 5 )) p 3 × q 6 idyearp m421995p1p1 m992002p2p2 m762002p3p3 m052005p4p4 midrate p m424 q1q1 2 q2q2 3 q3q3 m991 q4q4 3 q5q5 m76 5 q6q6 Movie i Review i Answer Q(y) :- Movie(x,y), Review(x,z) Review(x,z)

43 Add Join ⋈ p = p 1 * p 2 Projection ∏ p = 1-(1-p 1 )(1-p 2 )...(1-p n ) Selection σ p = p Note: data complexity is PTIME p I-Extensional Plans [Barbara92,Lakshmanan97]

44 Extensional Query Plans ⋈ xpx’qx pq xp1 xp2 xp3 ∏ x 1-(1-p1)(1-p2)(1- p3) xp σ xp

45 Extensional Query Plans Each tuple t has a probability t.P Algebra operators compute t.P Data complexity: PTIME

46 ⋈ ∏ Movie Review CORREC T INCORRECT! 1995m1pq m1pq m1pq (1-pq 1 )(1-pq 2 )(1- pq 3 ) ⋈ ∏ ∏ Movie Review m1 1 - (1-q 1 )(1-q 2 )(1- q 3 ) 1995m1 p(1-(1-q 1 )(1-q 2 )(1- q 3 )) m1q1q1 q2q2 q3q3 1995p m1q1q1 q2q2 q3q3 1995p Q(y) :- Movie(x,y), Review(x,z) Review(x,z)

47 Observation 1 The answer depends on the query plan

48 Q bad :- R i (x), S(x,y), T i (y) Ap p1p1 p2p2 p3p3 p4p4 Bp q1q1 q2q2 q3q3 q4q4 AB RiRi STiTi Theorem: Data complexity is #P-complete #P-Complete Queries

49 Proof: Ap x1x1 1/2 x2x2 x3x3 x4x4 Bp y1y1 y2y2 y3y3 AB x2x2 y3y3 x1x1 y2y2 x4x4 y3y3 x3x3 y1y1 RiRi STiTi Reduction: x 2 y 3 V x 1 y 2 V x 4 y 3 V x 3 y 1 Q bad :- R i (x), S(x,y), T i (y) Theorem [Provan&Ball83] Counting the number of satisfying assignments for bipartite DNF is #P- complete

50 Observation 2 Some queries (like Q bad ) don’t admit a correct extensional plan !

I-Dichotomy Definition 1. For each variable x: goals(x) = set of goals that contain x goals(x) = set of goals that contain x Q = boolean conjunctive query Definition 2. Q is hierarchical if forall x, y: (a) goals(x) ∩ goals(y) = ∅, or (b) goals(x) ⊆ goals(y), or (c) goals(y) ⊆ goals(x) (a) goals(x) ∩ goals(y) = ∅, or (b) goals(x) ⊆ goals(y), or (c) goals(y) ⊆ goals(x)

52 Q :- R(x),S(x,y),T(x,y,z),K(x,v) Q :- R(x), S(x,y), T(y) xy z R S T v K xy R S T “hierarchical” “non-hierarchical”

53 I-Dichotomy [Dalvi&S.’04] Theorem Let Q = conjunctive query w/o self-joins. Then one of the following holds: Q is in PTIME Q has a correct extensional plan Q is hierarchical or: Q is #P-complete Q has subgoals R(x,...),S(x,y,...),T(y,...) Schema S i = {R 1 i, R 2 i,..., R m i }

54 Proof Lemma 1. If Q is non-hierarchical, then #P-complete Proof: x y R S T z K v Q :- R i (v,x), S i (x,y), T i (y,z), K i (z) rest is like for Q bad

55 Proof Lemma 2. If Q is hierarchical, then PTIME Proof: Case 1: has no root Pr(Q) = Pr(Q 1 ) Pr(Q 2 ) Pr(Q 3 ) This is extensional join ⋈

56 Proof Case 2: has root x x Pr(Q) = 1 - (1-Pr(Q(a 1 /x))(1-Pr(Q(a 2 /x))...(1-Pr(Q(a n /x))) This is an extensional projection: ∏ Dom={a 1, a 2,..., a n } QED

57 Query Evaluation on ID-Databases ID-extensional plans #P-complete queries Dichotomoy Theorem

58 Only difference: two kinds of projections: independent 1-(1-p 1 )...(1-p n ) disjoint p p n Extensional Plans for ID-DBs

59 #P-Complete Queries Q 2 :- R d (x d,y), S d (y d,z) Q 1 :- R i (x), S i (x,y), T i (y) Q 3 :- R d (x d,y), S d (z d,y)

60 I-DB Dichotomy [Dalvi&S.’04] Theorem Let Q = conjunctive query w/o self-joins. Then one of the following holds: Q is in PTIME Q has a correct extensional plan or: Q is #P-complete Q has one of Q 1, Q 2, Q 3 as subqueries Schema S id s.t. each table is either R i or R id

61 Extensions Extensions of the dichotomoy theorem exists for: Mixed schemas (some relations are deterministic) Functional dependencies

62 Summary on Query Evaluation Extensional plans: popular, efficient, BUT “Equivalent” plans lead to different results Some queries admit “correct” plans Some simple queries: #P-complete complexity Dichotomy theorem Future work: remove ‘no-self-join’ restriction

63 Summary on Queries Extensional plans: popular in the past But not all are correct Some queries have no correct ext. plans Need extensions to the DBMS [Barbara92,Lakshmanan97] [Dalvi&S.’04]

64 Summary of Query Complexity Probabilistic databases have high complexity: #P Extensional plans: popular and efficient BUT: answer depends on the plan When ∄, query has high complexity [Barbara92,Lakshmanan97] [Dalvi&S.’04]

65 Outline Definitions Query evaluation Top-k answering (joint with Chris Re) Conclusions

66 Event Expressions Atomic events: e 1, e 2,... Probabilities: p 1, p 2,... Event expressions: e 1 ⋀ e 2 ⋁ e 1 ⋀ e 3

67 Intensional Query Plans ⋈ xp x’q x p⋀qp⋀q xp1 xp2 xp3 ∏ x p1 ⋁ p2 ⋁ p3 xp σ xp [Fuhr97]

68 Probabilities of Boolean Expressions p 1 = Pr(e 1 ) p 2 = Pr(e 2 ) p 3 = Pr(e 3 ) Compute probability p = Pr(E) ? A: E = e 1 ⋀ (e 2 ⋁ e 3 ) p = p 1 (1-(1-p 2 )(1-p 3 )) Given E= e 1 ⋀ e 2 ⋁ e 1 ⋀ e 3

69 Top-k Ranking Problem Fix schema S, query Q, number k > 0 Problem: given I- or ID-database DB, find k answers t 1,...,t k with highest probabilities Note: Checking Pr[Q(t i )] > Pr[Q(t j )] is PP-complete Goal: efficient polynomial time approximation Pr[Q(t 1 )] > Pr[Q(t 2 )] >.... > Pr[Q(t k )] >...

70 Probabilities of Boolean Expressions What is the probability of e 1 ⋀ e 2 ⋁ e 1 ⋀ e 2 ⋁ e 1 ⋀ e 3 ? (1-p 1 )p 2 p 3 + p 1 (1-p 2 )p 3 + p 1 p 2 (1-p 3 ) + p 1 p 2 p 3 Theorem #P-hard [Valiant] Ap e1e1 p1p1 e2e2 p2p2 e3e3 p3p3

71 Monte Carlo Simulation Better: PTAS Pr( |p’-p| 1-δ [Karp&Luby’83] Algorithm: radomly pick each e 1, e 2, e 3 = false or true radomly pick each e 1, e 2, e 3 = false or true compute e 1 ∧ e 2 ∨ e 1 ∧ e 3 ∨ e 2 ∧ e 3 : true or false ? repeat compute e 1 ∧ e 2 ∨ e 1 ∧ e 3 ∨ e 2 ∧ e 3 : true or false ? repeat Approximate probability p with frequency p’ p’ p’- ε p’+ ε p

72 Monte Carlo Simulation N=0 01 p N=1 N=2 N=3

73 The Multisimulation Problem YearP 1995?? 2002?? 1933?? 1984?? Schedule simulation steps to find top-k 01

74 Multisimulation How to find the top k out of n ? Example: looking for top k=2; Which one simulate next ? p5p5 p1p1 p4p4 p2p2 p3p3

75 Multisimulation Critical region: (k’th left, k+1’th right) 01 k=2

76 Multisimulation Algorithm Case 1: pick a “double crosser” and simulate it 01 this k=2

77 Multisimulation Algorithm Case 2: pick both a “left” AND a “right” crosser k=2 01 this and this

78 Multisimulation Algorithm Case 3: pick a “max crosser” and simulate it 01 this k=2

79 Multisimulation Algorithm End: when critical region is “empty” 01 k=2 To sort the top k, find the top k-1, etc

80 Multisimulation Algorithm Theorem (1) It runs in < 2 Optimal # steps (1) It runs in < 2 Optimal # steps (2) no other deterministic algorithm does better (2) no other deterministic algorithm does better

81 Experiments IMDB+AMZN: about 10M tuples 60% probabilistic k=10n = simulate all multisim plus optimization engine time

82 Experiments

83 Summary on Top-k Answering Simple algorithm, optimal (x2) w.r.t. a very powerful standard Marriage of probabilistic and top-k answers make probabilistic databases practical

84 Experiments

85 Outline Definitions Query evaluation Top-k answering Conclusions

86 Related Work Probabilistic databases: Cavallo87, Barbara92, Laskhmanan97, Fuhr97, Dalvi04, Widom05 Extensional/intensional plans: Fuhr97 Probabilities for degrees of belief Fagin90, Bacchus96 Simulation of boolean functions: Karp&Luby Complexity of boolean function probability Valiant79

87 Conclusions Strong motivation from practical applications Opportunity to merge query and search technologies Probabilistic DB’s are hard ! Great opportunity for impactful theory work Tomorrow: applications of random graphs to model incompleteness in databases

88 Research at UW finish the complexity dichotomy aggregate queries constraints incomplete databases (random graphs)

Thank you ! Questions ?