Probabilities in Databases and Logics I Nilesh Dalvi and Dan Suciu University of Washington
2 Two Lectures Today: probabilistic database to model imprecisions probabilistic logics Tomorrow: probabilistic database to model incompletness random graphs
3 Motivation Record reconciliation Information extraction Constraint violations Schema matching
4 Databases 101 Tables: titleyear Twelve Monkeys1995 Monkey Love1997 Monkey Love1935 Monkey Love Planet2005 Queries: SELECT title, rating FROM Movie, Review WHERE title = name Answers: titlerating Twelve Monkeys3 Monkey Love5 Monkey Love Planet5 Movie
5 nameratingp Monkey Lovegood.5 fair.2 fair.6 poor.9 Review Queries: A(x,y) :- Review(x,y), Movie(x,z), z > 1991 Movie(x,z), z > 1991 Problem Setting Tables: titleyearp Twelve Monkeys Monkey Love Monkey Love Monkey Love Pl Answers: titleratingp Twelve Monkeysfair.53 Monkey Lovegood.42 Monkey Love Plfair.15 Movie Top k Problem: complexity of query evaluation
6 Two Problems Fix answer tuple (a,b) Given database I, compute Pr(Q(a,b)) Query evaluation problem Fixed schema S, conjunctive query Q(x,y) Fix k > 0 Given database I, find k answer tuples with highest probabilities Top-k answering problem
7 Related Work: DB Cavallo&Pitarelli:1987 Barbara,Garcia-Molina, Porter:1992 Lakshmanan,Leone,Ross&Subrahmanian:199 7 Fuhr&Roellke:1997 Dalvi&S:2004 Widom:2005
8 Related Work: Logic Query reliability [Gradel,Gurevitch,Hirsch’98] Degrees of belief [Bacchus,Grove,Halpern,Koller’96] Probabilistic Logic [Nielson] Probabilistic model checking [Kwiatkowska’02] Probabilistic Relational Model [Taskar,Abbeel,Koller’02]
9 Outline Definitions Query Evaluation Top-k answering (joint with Chris Re) Conclusions
10 Application 1: Record Linkage Title Monk 12 Monkeys Twelve Monkey Movies Data cleaning remains expensive, critical: Prob dbs: “garbage in, ranked answers out” Extensive research area Which records match ? Review Twelve Monkeys 12 Monkeys (1996) Monkey Love Planet of the Apes Reviews ? Today: “garbage in, garbage out”
11 Application 1: Fuzzy Object Matching TitleMatch titlereviewp Twelve Monkeys12 Monkeys0.7 Monkey Love Monkeys0.45 Monkey Love 1935Monkey Love0.82 Monkey Love 1935Monkey Boy0.68 Monkey Love PlanetMonkey Love0.8 Table q-gram or edit dist.
12 Application 1: Fuzzy Object Matching Intensively studied: “record linkage”, “deduplication”, “object reconciliation”, etc. Current usage: score v.s. threshold New usage: scores as probabilities
13 Application 1: Fuzzy Object Matching Queries: Find all movies rated highly by both Joe and Jim titleyearp Monkey Love Twelve Monkeys Monkey Love Monkey Love Planet Answers: top k
14 Application 2: Information Extraction Collection of unstructured documents Define tables, populate them ➔ scores Variety machine learning techniques
15 Application 2: Information Extraction Posted By Review Text Monkeys is an OK movie is one of the best movies I've seen never seen 12 Monkeys but I love Monk. Which movie is the review about ? Is this review positive or negative ?
16 Application 2: Information Extraction I've never seen 12 Monkeys but I love Monk. Review text: Extensive research area Probabilistic dbs: SQL queries, rank answers MovieActorRating 12 MonkeysMonkfair Facts Table: Avatar (IBM)documentscorporate facts Textrunner (UW) 100M Web pages 1BN facts 0.3 Today: facts can be used only in isolation
17 Application 2: Information Extraction Extensive area: From text segmentation to extraction from WWW AVATAR (IBM): corporate data from docs TextRunner (UW): 1,000,000,000 facts from WWW ATTENEX (startup): discovery for law offices Inherently imprecise ! Probabilistic dbs: keep and use scores
18 Application 3: Activity Recognition Probabilistic dbs: use scores to rank answers NameTimeActivity Suetrun | walk Suet+1walk | stand | sit [Lester05,Liao05] Equip People with sensors, classify activity Elderly health care (e.g. Alzheimer's) Has Mr. Johnson eaten in the last 12 hours?
19 Application 3: Activity Recognition subjecttimeactp Barbaratrun0.3 Barbarateat0.45 Jimt+1eat0.82 Barbarat+1eat0.45 Table
20 Other Applications Sensor data, RFID [Madden] Constraint violations [Bertossi02,Fuxman06] Schema matching [Doan03] Security/privacy [Evfimievski03, Miklau04] Bio-, medical-, clinical informatics [DataGrid (a startup)] Personal information management [Karger, Halevy]
21 Summary of Applications Large range of apps with imprecise data Specific techniques exists Imprecisions handled at the application level Goal of probabilistic databases: manage all imprecisions uniformly
22 Outline Applications Data model Queries Multisimulation Conclusions
23 Pr : Inst → [0,1], ∑ I Pr[I] = 1 Probabilistic Database Schema S, Domain D, Set of instances Inst Definition Probabilistic database is a probability distribution If Pr[I] > 0 then I is called “possible world”
24 Probabilistic Database Representation: Independent tuples: I-database DB over some schema S i Independent and disjoint tuples: ID-database DB over some schema S id Semantics: DB “means” probability distribution Pr over schema S
25 Independent Events A tuple is in the database with probability p Any two tuples are independent events
26 I-Databases MovieScoreP m42good p1p1 m99good p2p2 m76poor p3p3 Pr[I 1 ] + Pr[I 2 ] Pr[I 8 ] = 1 Reviews i (M,S,p) MovScor (1-p 1 )*(1-p 2 )*(1-p 3 ) Pr[I 1 ]= MovScor m42good m99good p 1 *p 2 *(1-p 3 ) Pr[I 4 ]= p 1 *p 2 *p 3 Pr[I 8 ]= Representation Possible worlds semantics Reviews(M,S) MovScor m42good m99good m76poor MovScor m99good MovScor m42good MovScor m m76poor MovScor m42good m76poor MovScor m76poor
27 Disjoint Events Needed in Many-to-1 matchings Possible values for attributes [Barbara’92] NameAgeP John John Mary NameAge John 34 (0.3) 43 (0.7) Mary 25
28 ID-Databases Time d ActivityP t walk p1p1 t run p2p2 t+1 walk p3p3 Pr[I 1 ] + Pr[I 2 ] Pr[I 6 ] = 1 Activities id TimeActTimeAct t run TimeAct t walk t+1 walk TimeAct t walk TimeAct t+1 walk (1-p 1 -p 2 )*(1-p 3 ) Pr[I 1 ]= p 2 *(1-p 3 ) Pr[I 3 ]= p 1 *p 3 Pr[I 5 ]= Activities TimeAct t run t+1 walk
29 ID subsumes I Movie d Score d P m42good p1p1 m99good p2p2 m76poor p3p3 Reviews id MovieScoreP m42good p1p1 m99good p2p2 m76poor p3p3 Reviews i = Note: MovieScoreP m42good p1p1 m99good p2p2 m76poor p3p3 Reviews id means all tuples are disjoint
30 Queries idyearP m m m m midratingp m m m m m Movie i Review i Q(y) :- Movie(x,y), Review(x,z), z >= 3 Syntax: conjunctive queries over schema S
31 Two Query Semantics Possible answer sets Given set A: Used for views Possible tuples Given tuple t: Used for query evaluation and top-k Pr[{t | I ⊨ Q(t)} = A] Pr[I ⊨ Q(t)] Thistalk
32 Possible Answer Sets ActorMatch ActoramzAct TitleMatch TitleReview p2p2 ActorMatch ActoramzAct TitleMatch TitleReview p3p3 MoviewReviewMatch(mid,rid) :- Movie(mid,x,-),Review(rid,y,-,-),TitleMatch(x,y), Movie(mid,x,-),Review(rid,y,-,-),TitleMatch(x,y), Actor(u,mid), ActorReview(v,rid),ActorMatch(u,v) Actor(u,mid), ActorReview(v,rid),ActorMatch(u,v) MoviewReviewMatch midridmidrid MoviewReviewMatch midrid MoviewReviewMatch P 1 + P 2 P3P3 P 4 + P 5 + P 6 ActorMatch ActoramzAct TitleMatch TitleReview p4p4 ActorMatch ActoramzAct TitleReview p1p1 TitleMatch ActorMatch ActoramzAct TitleMatch TitleReview p5p5
33 p2p2 year p1p1 idyear m m m p2p2 idyear m m p4p4 idyear m m p3p3 idyear m m m Query Semantics 1 year p4p4 p 1 + p 3 year 2004 Q(y) :- Movie(x,y), Review(x,z) Possible answer setsUseful for Views
34 ID + Views = Complete Theorem. ID-databases plus views are “complete” for representing possible worlds distributions R A a c d f g p2p2 R A b c d g p3p3 R A a c p4p4 R A a b c p1p1 R A b c d f g p5p5 RA AW aw1w1 bw1w1 cw1w1 aw2w2 cw2w2 dw2w2 f... WP w1w1 p1p1 w2w2 p2p2 w3w3 p3p3 w4w4 p4p4 w5w5 p5p5 PWD di di= ∅ R(x) :- RA(x,w), PWD di (w) PrDB
35 p1p1 idyear m m m p2p2 idyear m m p4p4 idyear m m p3p3 idyear m m m Q(y) :- Movie(x,y), Review(x,z) top k yearp 1935p 2 + p 3 = p 1 + p 3 = p 3 = Query Semantics Tuple probabilities
36 Complex Correlations MoviewReviewMatch(mid,rid) :- Movie(mid,x,-),Review(rid,y,-,-),TitleMatch(x,y), Movie(mid,x,-),Review(rid,y,-,-),TitleMatch(x,y), Actor(u,mid), ActorReview(v,rid),ActorMatch(u,v) Actor(u,mid), ActorReview(v,rid),ActorMatch(u,v) From atomic events to complex events Views
37 Complex Correlations ActorMatch ActoramzAct TitleMatch TitleReview p2p2 ActorMatch ActoramzAct TitleMatch TitleReview p3p3 MoviewReviewMatch(mid,rid) :- Movie(mid,x,-),Review(rid,y,-,-),TitleMatch(x,y), Movie(mid,x,-),Review(rid,y,-,-),TitleMatch(x,y), Actor(u,mid), ActorReview(v,rid),ActorMatch(u,v) Actor(u,mid), ActorReview(v,rid),ActorMatch(u,v) MoviewReviewMatch midridmidrid MoviewReviewMatch midrid MoviewReviewMatch P 1 + P 2 P3P3 P 4 + P 5 + P 6 ActorMatch ActoramzAct TitleMatch TitleReview p4p4 ActorMatch ActoramzAct TitleReview p1p1 TitleMatch ActorMatch ActoramzAct TitleMatch TitleReview p5p5
38 Summary on Data Model Data Model: Semantics = possible worlds Syntax = I-databases or ID-databases Queries: Syntax = unchanged (conjunctive queries) Semantics = tuple probabilities
39 Outline Definitions Query evaluation Top-k answering Conclusions
40 Problem Definition Fix schema S, query Q, answer tuple t Problem: given I/ID-database DB, compute Pr[I ⊨ Q(t)] Conventions: For upper bounds (P or #P): probabilities are rationals For lower bounds (#P): probabilities are 1/2 Pr[Q(t)] notation:
41 Query Evaluation on I-Databases Outline Intuition Extensional plans: PTIME case Hard queries: #P-complete case Dichotomy Theorem
42 Intuition Yearp p 1 × (1 - (1 - q 1 ) ×(1 - q 2 )×(1 - q 3 )) 1 - (1 - ) × (1 - ) p 2 × (1 - (1 - q 4 )×(1 - q 5 )) p 3 × q 6 idyearp m421995p1p1 m992002p2p2 m762002p3p3 m052005p4p4 midrate p m424 q1q1 2 q2q2 3 q3q3 m991 q4q4 3 q5q5 m76 5 q6q6 Movie i Review i Answer Q(y) :- Movie(x,y), Review(x,z) Review(x,z)
43 Add Join ⋈ p = p 1 * p 2 Projection ∏ p = 1-(1-p 1 )(1-p 2 )...(1-p n ) Selection σ p = p Note: data complexity is PTIME p I-Extensional Plans [Barbara92,Lakshmanan97]
44 Extensional Query Plans ⋈ xpx’qx pq xp1 xp2 xp3 ∏ x 1-(1-p1)(1-p2)(1- p3) xp σ xp
45 Extensional Query Plans Each tuple t has a probability t.P Algebra operators compute t.P Data complexity: PTIME
46 ⋈ ∏ Movie Review CORREC T INCORRECT! 1995m1pq m1pq m1pq (1-pq 1 )(1-pq 2 )(1- pq 3 ) ⋈ ∏ ∏ Movie Review m1 1 - (1-q 1 )(1-q 2 )(1- q 3 ) 1995m1 p(1-(1-q 1 )(1-q 2 )(1- q 3 )) m1q1q1 q2q2 q3q3 1995p m1q1q1 q2q2 q3q3 1995p Q(y) :- Movie(x,y), Review(x,z) Review(x,z)
47 Observation 1 The answer depends on the query plan
48 Q bad :- R i (x), S(x,y), T i (y) Ap p1p1 p2p2 p3p3 p4p4 Bp q1q1 q2q2 q3q3 q4q4 AB RiRi STiTi Theorem: Data complexity is #P-complete #P-Complete Queries
49 Proof: Ap x1x1 1/2 x2x2 x3x3 x4x4 Bp y1y1 y2y2 y3y3 AB x2x2 y3y3 x1x1 y2y2 x4x4 y3y3 x3x3 y1y1 RiRi STiTi Reduction: x 2 y 3 V x 1 y 2 V x 4 y 3 V x 3 y 1 Q bad :- R i (x), S(x,y), T i (y) Theorem [Provan&Ball83] Counting the number of satisfying assignments for bipartite DNF is #P- complete
50 Observation 2 Some queries (like Q bad ) don’t admit a correct extensional plan !
I-Dichotomy Definition 1. For each variable x: goals(x) = set of goals that contain x goals(x) = set of goals that contain x Q = boolean conjunctive query Definition 2. Q is hierarchical if forall x, y: (a) goals(x) ∩ goals(y) = ∅, or (b) goals(x) ⊆ goals(y), or (c) goals(y) ⊆ goals(x) (a) goals(x) ∩ goals(y) = ∅, or (b) goals(x) ⊆ goals(y), or (c) goals(y) ⊆ goals(x)
52 Q :- R(x),S(x,y),T(x,y,z),K(x,v) Q :- R(x), S(x,y), T(y) xy z R S T v K xy R S T “hierarchical” “non-hierarchical”
53 I-Dichotomy [Dalvi&S.’04] Theorem Let Q = conjunctive query w/o self-joins. Then one of the following holds: Q is in PTIME Q has a correct extensional plan Q is hierarchical or: Q is #P-complete Q has subgoals R(x,...),S(x,y,...),T(y,...) Schema S i = {R 1 i, R 2 i,..., R m i }
54 Proof Lemma 1. If Q is non-hierarchical, then #P-complete Proof: x y R S T z K v Q :- R i (v,x), S i (x,y), T i (y,z), K i (z) rest is like for Q bad
55 Proof Lemma 2. If Q is hierarchical, then PTIME Proof: Case 1: has no root Pr(Q) = Pr(Q 1 ) Pr(Q 2 ) Pr(Q 3 ) This is extensional join ⋈
56 Proof Case 2: has root x x Pr(Q) = 1 - (1-Pr(Q(a 1 /x))(1-Pr(Q(a 2 /x))...(1-Pr(Q(a n /x))) This is an extensional projection: ∏ Dom={a 1, a 2,..., a n } QED
57 Query Evaluation on ID-Databases ID-extensional plans #P-complete queries Dichotomoy Theorem
58 Only difference: two kinds of projections: independent 1-(1-p 1 )...(1-p n ) disjoint p p n Extensional Plans for ID-DBs
59 #P-Complete Queries Q 2 :- R d (x d,y), S d (y d,z) Q 1 :- R i (x), S i (x,y), T i (y) Q 3 :- R d (x d,y), S d (z d,y)
60 I-DB Dichotomy [Dalvi&S.’04] Theorem Let Q = conjunctive query w/o self-joins. Then one of the following holds: Q is in PTIME Q has a correct extensional plan or: Q is #P-complete Q has one of Q 1, Q 2, Q 3 as subqueries Schema S id s.t. each table is either R i or R id
61 Extensions Extensions of the dichotomoy theorem exists for: Mixed schemas (some relations are deterministic) Functional dependencies
62 Summary on Query Evaluation Extensional plans: popular, efficient, BUT “Equivalent” plans lead to different results Some queries admit “correct” plans Some simple queries: #P-complete complexity Dichotomy theorem Future work: remove ‘no-self-join’ restriction
63 Summary on Queries Extensional plans: popular in the past But not all are correct Some queries have no correct ext. plans Need extensions to the DBMS [Barbara92,Lakshmanan97] [Dalvi&S.’04]
64 Summary of Query Complexity Probabilistic databases have high complexity: #P Extensional plans: popular and efficient BUT: answer depends on the plan When ∄, query has high complexity [Barbara92,Lakshmanan97] [Dalvi&S.’04]
65 Outline Definitions Query evaluation Top-k answering (joint with Chris Re) Conclusions
66 Event Expressions Atomic events: e 1, e 2,... Probabilities: p 1, p 2,... Event expressions: e 1 ⋀ e 2 ⋁ e 1 ⋀ e 3
67 Intensional Query Plans ⋈ xp x’q x p⋀qp⋀q xp1 xp2 xp3 ∏ x p1 ⋁ p2 ⋁ p3 xp σ xp [Fuhr97]
68 Probabilities of Boolean Expressions p 1 = Pr(e 1 ) p 2 = Pr(e 2 ) p 3 = Pr(e 3 ) Compute probability p = Pr(E) ? A: E = e 1 ⋀ (e 2 ⋁ e 3 ) p = p 1 (1-(1-p 2 )(1-p 3 )) Given E= e 1 ⋀ e 2 ⋁ e 1 ⋀ e 3
69 Top-k Ranking Problem Fix schema S, query Q, number k > 0 Problem: given I- or ID-database DB, find k answers t 1,...,t k with highest probabilities Note: Checking Pr[Q(t i )] > Pr[Q(t j )] is PP-complete Goal: efficient polynomial time approximation Pr[Q(t 1 )] > Pr[Q(t 2 )] >.... > Pr[Q(t k )] >...
70 Probabilities of Boolean Expressions What is the probability of e 1 ⋀ e 2 ⋁ e 1 ⋀ e 2 ⋁ e 1 ⋀ e 3 ? (1-p 1 )p 2 p 3 + p 1 (1-p 2 )p 3 + p 1 p 2 (1-p 3 ) + p 1 p 2 p 3 Theorem #P-hard [Valiant] Ap e1e1 p1p1 e2e2 p2p2 e3e3 p3p3
71 Monte Carlo Simulation Better: PTAS Pr( |p’-p| 1-δ [Karp&Luby’83] Algorithm: radomly pick each e 1, e 2, e 3 = false or true radomly pick each e 1, e 2, e 3 = false or true compute e 1 ∧ e 2 ∨ e 1 ∧ e 3 ∨ e 2 ∧ e 3 : true or false ? repeat compute e 1 ∧ e 2 ∨ e 1 ∧ e 3 ∨ e 2 ∧ e 3 : true or false ? repeat Approximate probability p with frequency p’ p’ p’- ε p’+ ε p
72 Monte Carlo Simulation N=0 01 p N=1 N=2 N=3
73 The Multisimulation Problem YearP 1995?? 2002?? 1933?? 1984?? Schedule simulation steps to find top-k 01
74 Multisimulation How to find the top k out of n ? Example: looking for top k=2; Which one simulate next ? p5p5 p1p1 p4p4 p2p2 p3p3
75 Multisimulation Critical region: (k’th left, k+1’th right) 01 k=2
76 Multisimulation Algorithm Case 1: pick a “double crosser” and simulate it 01 this k=2
77 Multisimulation Algorithm Case 2: pick both a “left” AND a “right” crosser k=2 01 this and this
78 Multisimulation Algorithm Case 3: pick a “max crosser” and simulate it 01 this k=2
79 Multisimulation Algorithm End: when critical region is “empty” 01 k=2 To sort the top k, find the top k-1, etc
80 Multisimulation Algorithm Theorem (1) It runs in < 2 Optimal # steps (1) It runs in < 2 Optimal # steps (2) no other deterministic algorithm does better (2) no other deterministic algorithm does better
81 Experiments IMDB+AMZN: about 10M tuples 60% probabilistic k=10n = simulate all multisim plus optimization engine time
82 Experiments
83 Summary on Top-k Answering Simple algorithm, optimal (x2) w.r.t. a very powerful standard Marriage of probabilistic and top-k answers make probabilistic databases practical
84 Experiments
85 Outline Definitions Query evaluation Top-k answering Conclusions
86 Related Work Probabilistic databases: Cavallo87, Barbara92, Laskhmanan97, Fuhr97, Dalvi04, Widom05 Extensional/intensional plans: Fuhr97 Probabilities for degrees of belief Fagin90, Bacchus96 Simulation of boolean functions: Karp&Luby Complexity of boolean function probability Valiant79
87 Conclusions Strong motivation from practical applications Opportunity to merge query and search technologies Probabilistic DB’s are hard ! Great opportunity for impactful theory work Tomorrow: applications of random graphs to model incompleteness in databases
88 Research at UW finish the complexity dichotomy aggregate queries constraints incomplete databases (random graphs)
Thank you ! Questions ?