1 Probabilities in Databases and Logics I
Nilesh Dalvi and Dan Suciu, University of Washington
2 Two Lectures
Today: probabilistic databases to model imprecision; probabilistic logics
Tomorrow: probabilistic databases to model incompleteness; random graphs
3 Motivation
Record reconciliation, information extraction, constraint violations, schema matching
4 Databases 101
Movie table:
  title               year
  Twelve Monkeys      1995
  Monkey Love         1997
  Monkey Love         1935
  Monkey Love Planet  2005
Query:
  SELECT title, rating FROM Movie, Review WHERE title = name
Answers:
  title               rating
  Twelve Monkeys      3
  Monkey Love         5
  Monkey Love Planet  5
5 Problem Setting
Movie table:
  title           year  p
  Twelve Monkeys  1995  .8
  Monkey Love     1997  .4
  Monkey Love     1935  .9
  Monkey Love Pl  2005  .7
Review table:
  name         rating  p
  Monkey Love  good    .5
               fair    .2
               fair    .6
               poor    .9
Query:
  A(x,y) :- Review(x,y), Movie(x,z), z > 1991
Answers (top k):
  title           rating  p
  Twelve Monkeys  fair    .53
  Monkey Love     good    .42
  Monkey Love Pl  fair    .15
Problem: complexity of query evaluation
6 Two Problems
Query evaluation problem: fix schema S, conjunctive query Q(x,y), and answer tuple (a,b); given database I, compute Pr(Q(a,b)).
Top-k answering problem: fix k > 0; given database I, find the k answer tuples with highest probabilities.
7 Related Work: DB
Cavallo & Pitarelli 1987; Barbara, Garcia-Molina, Porter 1992; Lakshmanan, Leone, Ross & Subrahmanian 1997; Fuhr & Roellke 1997; Dalvi & S. 2004; Widom 2005
8 Related Work: Logic
Query reliability [Gradel, Gurevitch, Hirsch'98]; degrees of belief [Bacchus, Grove, Halpern, Koller'96]; probabilistic logic [Nielson]; probabilistic model checking [Kwiatkowska'02]; probabilistic relational models [Taskar, Abbeel, Koller'02]
9 Outline Definitions Query Evaluation Top-k answering (joint with Chris Re) Conclusions
10 Application 1: Record Linkage
Which records match? Movie titles (Monk, 12 Monkeys, Twelve Monkeys) vs. reviews (Twelve Monkeys, 12 Monkeys (1996), Monkey Love, Planet of the Apes), with match scores such as 0.3, 0.1, 0.9
Data cleaning remains expensive and critical; extensive research area
Today: "garbage in, garbage out"
Probabilistic dbs: "garbage in, ranked answers out"
11 Application 1: Fuzzy Object Matching
TitleMatch table (computed by q-gram or edit distance):
  title               review       p
  Twelve Monkeys      12 Monkeys   0.7
  Monkey Love 1997    12 Monkeys   0.45
  Monkey Love 1935    Monkey Love  0.82
  Monkey Love 1935    Monkey Boy   0.68
  Monkey Love Planet  Monkey Love  0.8
12 Application 1: Fuzzy Object Matching
Intensively studied: "record linkage", "deduplication", "object reconciliation", etc.
Current usage: score vs. threshold
New usage: scores as probabilities
13 Application 1: Fuzzy Object Matching
Query: find all movies rated highly by both Joe and Jim
Answers (top k):
  title               year  p
  Monkey Love         1997  0.73
  Twelve Monkeys      1995  0.68
  Monkey Love         1935  0.43
  Monkey Love Planet  2005  0.12
14 Application 2: Information Extraction
Collection of unstructured documents
Define tables, populate them ➔ scores
A variety of machine learning techniques
15 Application 2: Information Extraction
  Posted By  Review Text
  john@.     12 Monkeys is an OK movie
  joe@.      Monkeys is one of the best movies I've seen
  mjs42@.    I've never seen 12 Monkeys but I love Monk.
Which movie is the review about? Is this review positive or negative?
16 Application 2: Information Extraction
Review text: "I've never seen 12 Monkeys but I love Monk."
Extracted facts table (score 0.3):
  Movie       Actor  Rating
  12 Monkeys  Monk   fair
Avatar (IBM): corporate facts from documents
Textrunner (UW): 1BN facts from 100M Web pages
Extensive research area
Today: facts can be used only in isolation
Probabilistic dbs: SQL queries, rank answers
17 Application 2: Information Extraction
Extensive area: from text segmentation to extraction from the WWW
AVATAR (IBM): corporate data from docs
TextRunner (UW): 1,000,000,000 facts from the WWW
ATTENEX (startup): discovery for law offices
Inherently imprecise!
Probabilistic dbs: keep and use the scores
18 Application 3: Activity Recognition [Lester05, Liao05]
Equip people with sensors, classify activity; e.g. elderly health care (Alzheimer's)
  Name  Time  Activity
  Sue   t     run (0.3) | walk (0.6)
  Sue   t+1   walk (0.2) | stand (0.4) | sit
Has Mr. Johnson eaten in the last 12 hours?
Probabilistic dbs: use scores to rank answers
19 Application 3: Activity Recognition
  subject  time  act  p
  Barbara  t     run  0.3
  Barbara  t     eat  0.45
  Jim      t+1   eat  0.82
  Barbara  t+1   eat  0.45
20 Other Applications Sensor data, RFID [Madden] Constraint violations [Bertossi02,Fuxman06] Schema matching [Doan03] Security/privacy [Evfimievski03, Miklau04] Bio-, medical-, clinical informatics [DataGrid (a startup)] Personal information management [Karger, Halevy]
21 Summary of Applications
Large range of apps with imprecise data
Specific techniques exist, but imprecisions are handled at the application level
Goal of probabilistic databases: manage all imprecisions uniformly
22 Outline Applications Data model Queries Multisimulation Conclusions
23 Probabilistic Database
Schema S, domain D, set of instances Inst
Definition: a probabilistic database is a probability distribution Pr : Inst → [0,1] with ∑_I Pr[I] = 1
If Pr[I] > 0 then I is called a "possible world"
24 Probabilistic Database
Representation:
  Independent tuples: I-database DB over some schema S^i
  Independent and disjoint tuples: ID-database DB over some schema S^id
Semantics: DB "means" a probability distribution Pr over schema S
25 Independent Events A tuple is in the database with probability p Any two tuples are independent events
26 I-Databases
Representation: Reviews^i(M, S, p)
  Movie  Score  P
  m42    good   p1
  m99    good   p2
  m76    poor   p3
Possible worlds semantics over Reviews(M, S): each subset of the three tuples is a possible world I1, ..., I8, e.g.
  Pr[I1] = (1-p1)(1-p2)(1-p3)    (empty world)
  Pr[I4] = p1 p2 (1-p3)          (world { (m42, good), (m99, good) })
  Pr[I8] = p1 p2 p3              (world with all three tuples)
  Pr[I1] + Pr[I2] + ... + Pr[I8] = 1
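The possible-worlds semantics above is easy to make concrete. Here is a minimal Python sketch; the `worlds` helper and the concrete probabilities 0.8, 0.5, 0.3 are illustrative choices, not from the slides:

```python
from itertools import product

# An I-database: each tuple is present independently with its probability.
reviews = [("m42", "good", 0.8), ("m99", "good", 0.5), ("m76", "poor", 0.3)]

def worlds(table):
    """Enumerate all 2^n possible worlds of an I-table with their probabilities."""
    for mask in product([False, True], repeat=len(table)):
        prob = 1.0
        world = []
        for present, (mid, score, p) in zip(mask, table):
            prob *= p if present else (1.0 - p)
            if present:
                world.append((mid, score))
        yield world, prob

total = sum(prob for _, prob in worlds(reviews))
print(round(total, 10))  # the probabilities of all 2^3 worlds sum to 1
```

The empty world gets probability (1-p1)(1-p2)(1-p3) = 0.2 × 0.5 × 0.7 = 0.07, exactly as in the slide's Pr[I1].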
27 Disjoint Events
Needed in: many-to-1 matchings; possible values for attributes [Barbara'92]
  Name  Age  P
  John  34   0.3
  John  43   0.7
  Mary  25   1.0
Equivalently: John has age 34 (0.3) or 43 (0.7); Mary has age 25.
28 ID-Databases
Representation: Activities^id(Time^d, Activity, P)
  Time  Activity  P
  t     walk      p1
  t     run       p2
  t+1   walk      p3
Tuples with the same Time value are disjoint; the possible worlds I1, ..., I6, e.g.
  Pr[I1] = (1-p1-p2)(1-p3)    (empty world)
  Pr[I3] = p2 (1-p3)          (world { (t, run) })
  Pr[I5] = p1 p3              (world { (t, walk), (t+1, walk) })
  Pr[I1] + Pr[I2] + ... + Pr[I6] = 1
29 ID subsumes I
An I-table is the special case of an ID-table in which every tuple forms its own disjointness block:
  Reviews^i(Movie, Score, P) = Reviews^id(Movie^d, Score^d, P)
  Movie  Score  P
  m42    good   p1
  m99    good   p2
  m76    poor   p3
Note: declaring the whole table as one block, Reviews^id(Movie, Score, P), would instead mean all tuples are disjoint.
30 Queries
Syntax: conjunctive queries over schema S
  Q(y) :- Movie(x,y), Review(x,z), z >= 3
Movie^i:
  id   year  P
  m42  1995  0.95
  m99  2002  0.65
  m76  2002  0.1
  m05  2005  0.7
Review^i:
  mid  rating  p
  m42  4       0.7
  m42  5       0.45
  m99  5       0.82
  m99  4       0.68
  m05  5       0.79
31 Two Query Semantics
Possible answer sets: given a set A, Pr[{t | I ⊨ Q(t)} = A] — used for views
Possible tuples: given a tuple t, Pr[I ⊨ Q(t)] — used for query evaluation and top-k (this talk)
32 Possible Answer Sets
MovieReviewMatch(mid,rid) :- Movie(mid,x,-), Review(rid,y,-,-), TitleMatch(x,y),
                             Actor(u,mid), ActorReview(v,rid), ActorMatch(u,v)
Worlds p1, ..., p5 of TitleMatch and ActorMatch group into possible answer sets for MovieReviewMatch, with probabilities P1 + P2, P3, and P4 + P5 + P6
33 Query Semantics 1
Q(y) :- Movie(x,y), Review(x,z)
Possible worlds of Movie:
  p1: { (m42, 2004), (m99, 1901), (m76, 1902) }
  p2: { (m99, 1935), (m05, 1903) }
  p3: { (m76, 1995), (m99, 1935), (m05, 2004) }
  p4: { (m87, 1934), (m44, 1904) }
Possible answer sets, e.g. { 1934, 1904 } with probability p4 and { 2004 } with probability p1 + p3
Possible answer sets: useful for views
34 ID + Views = Complete
Theorem. ID-databases plus views are "complete" for representing possible-worlds distributions.
Encoding: store each world I_j of R as tuples RA(a, w_j); store a disjoint world table PWD(W, P) with rows (w_j, p_j); the view
  R(x) :- RA(x, w), PWD(w)
then reproduces the given distribution over possible worlds.
35 Query Semantics 2: Tuple Probabilities
Q(y) :- Movie(x,y), Review(x,z)
Answers (top k):
  year  p
  1935  p2 + p3 = 0.6
  2004  p1 + p3 = 0.5
  1995  p3 = 0.2
  ...
36 Complex Correlations
Views take us from atomic events to complex events:
MovieReviewMatch(mid,rid) :- Movie(mid,x,-), Review(rid,y,-,-), TitleMatch(x,y),
                             Actor(u,mid), ActorReview(v,rid), ActorMatch(u,v)
37 Complex Correlations
The same example: worlds p1, ..., p5 of TitleMatch and ActorMatch induce possible answer sets for MovieReviewMatch with probabilities P1 + P2, P3, and P4 + P5 + P6
38 Summary on Data Model
Data model: semantics = possible worlds; syntax = I-databases or ID-databases
Queries: syntax = unchanged (conjunctive queries); semantics = tuple probabilities
39 Outline Definitions Query evaluation Top-k answering Conclusions
40 Problem Definition
Fix schema S, query Q, answer tuple t
Problem: given an I/ID-database DB, compute Pr[I ⊨ Q(t)] (notation: Pr[Q(t)])
Conventions: for upper bounds (P or #P), probabilities are rationals; for lower bounds (#P), probabilities are 1/2
41 Query Evaluation on I-Databases
Outline: intuition; extensional plans (PTIME case); hard queries (#P-complete case); Dichotomy Theorem
42 Intuition
Q(y) :- Movie(x,y), Review(x,z)
Movie^i:               Review^i:
  id   year  p           mid  rate  p
  m42  1995  p1          m42  4     q1
  m99  2002  p2          m42  2     q2
  m76  2002  p3          m42  3     q3
  m05  2005  p4          m99  1     q4
                         m99  3     q5
                         m76  5     q6
Answer:
  year  p
  1995  p1 × (1 - (1-q1)(1-q2)(1-q3))
  2002  1 - (1 - p2 × (1 - (1-q4)(1-q5))) × (1 - p3 × q6)
43 I-Extensional Plans [Barbara92, Lakshmanan97]
Join ⋈:       p = p1 * p2
Projection ∏: p = 1 - (1-p1)(1-p2)...(1-pn)
Selection σ:  p = p
Note: data complexity is PTIME
44 Extensional Query Plans
Join ⋈: tuples (x, p) and (x', q) join into (x, x', pq)
Projection ∏: tuples (x, p1), (x, p2), (x, p3) project onto x with probability 1 - (1-p1)(1-p2)(1-p3)
Selection σ: tuple (x, p) keeps probability p
45 Extensional Query Plans Each tuple t has a probability t.P Algebra operators compute t.P Data complexity: PTIME
46 Correct vs. Incorrect Plans
Q(y) :- Movie(x,y), Review(x,z)
Movie: (1995, m1) with probability p; Review: (m1, q1), (m1, q2), (m1, q3)
CORRECT plan ∏(Movie ⋈ ∏(Review)):
  first project Review onto m1: 1 - (1-q1)(1-q2)(1-q3)
  then join: 1995 → p(1 - (1-q1)(1-q2)(1-q3))
INCORRECT plan ∏(Movie ⋈ Review):
  join gives (1995, m1) with pq1, pq2, pq3
  projection gives 1995 → 1 - (1-pq1)(1-pq2)(1-pq3), which wrongly treats the three join tuples as independent although they share the same Movie tuple
47 Observation 1 The answer depends on the query plan
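This observation can be checked numerically. The sketch below evaluates both extensional plans for the Movie/Review example, with made-up probabilities p = q1 = q2 = q3 = 0.5 (the `ext_project` helper is an assumption, not from the slides):

```python
def ext_project(probs):
    """Extensional projection over independent tuples:
    probability that at least one of them survives."""
    out = 1.0
    for p in probs:
        out *= 1.0 - p
    return 1.0 - out

p = 0.5              # Movie tuple (1995, m1)
q = [0.5, 0.5, 0.5]  # three Review tuples joining with m1

# Correct plan: project Review onto m1 first, then join with Movie.
correct = p * ext_project(q)                   # 0.4375

# Incorrect plan: join first, then project. The three join tuples all
# contain the SAME Movie tuple, so they are not independent, but the
# extensional projection treats them as if they were.
incorrect = ext_project([p * qi for qi in q])  # 0.578125

print(correct, incorrect)
```

The two plans are equivalent as relational algebra, yet return different probabilities, which is exactly why plan choice matters here.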
48 #P-Complete Queries
Q_bad :- R^i(x), S(x,y), T^i(y)
R^i: probabilistic table over A (p1, ..., p4); T^i: probabilistic table over B (q1, ..., q4); S: deterministic table over (A, B)
Theorem: the data complexity of Q_bad is #P-complete
49 Proof
Q_bad :- R^i(x), S(x,y), T^i(y)
Theorem [Provan & Ball 83]: counting the satisfying assignments of a bipartite DNF is #P-complete.
Reduction from the DNF x2y3 ∨ x1y2 ∨ x4y3 ∨ x3y1:
  R^i: x1, ..., x4, each with probability 1/2; T^i: y1, y2, y3, each with probability 1/2
  S: { (x2,y3), (x1,y2), (x4,y3), (x3,y1) }
Then Pr[Q_bad] = (#satisfying assignments) / 2^7.
50 Observation 2
Some queries (like Q_bad) don't admit a correct extensional plan!
51 I-Dichotomy
Q = boolean conjunctive query
Definition 1. For each variable x: goals(x) = set of goals that contain x
Definition 2. Q is hierarchical if for all x, y:
  (a) goals(x) ∩ goals(y) = ∅, or (b) goals(x) ⊆ goals(y), or (c) goals(y) ⊆ goals(x)
52 Examples
Q :- R(x), S(x,y), T(x,y,z), K(x,v)   "hierarchical"
  goals(x) = {R,S,T,K} ⊇ goals(y) = {S,T} ⊇ goals(z) = {T}; goals(v) = {K} ⊆ goals(x)
Q :- R(x), S(x,y), T(y)   "non-hierarchical"
  goals(x) = {R,S} and goals(y) = {S,T} overlap, but neither contains the other
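Definition 2 translates directly into code. Below is a sketch of the hierarchy test; representing a query as a mapping from goal names to their variable tuples is my own convention for the example:

```python
from itertools import combinations

def is_hierarchical(goals):
    """goals: mapping goal name -> tuple of variables, e.g. {'R': ('x',)}.
    Q is hierarchical iff for every pair of variables, the sets of goals
    containing them are disjoint or one contains the other."""
    variables = {v for args in goals.values() for v in args}
    gsets = {v: {g for g, args in goals.items() if v in args} for v in variables}
    for u, v in combinations(variables, 2):
        a, b = gsets[u], gsets[v]
        if not (a.isdisjoint(b) or a <= b or b <= a):
            return False
    return True

# The two example queries from the slide:
q_hier = {"R": ("x",), "S": ("x", "y"), "T": ("x", "y", "z"), "K": ("x", "v")}
q_nonh = {"R": ("x",), "S": ("x", "y"), "T": ("y",)}
print(is_hierarchical(q_hier), is_hierarchical(q_nonh))  # True False
```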
53 I-Dichotomy [Dalvi & S.'04]
Theorem. Schema S^i = {R1^i, R2^i, ..., Rm^i}; let Q = conjunctive query w/o self-joins. Then one of the following holds:
  Q is in PTIME; Q has a correct extensional plan; Q is hierarchical
or:
  Q is #P-complete; Q has subgoals R(x,...), S(x,y,...), T(y,...)
54 Proof
Lemma 1. If Q is non-hierarchical, then Q is #P-complete.
Proof: Q contains a pattern such as Q :- R^i(v,x), S^i(x,y), T^i(y,z), K^i(z); the rest is as for Q_bad.
55 Proof
Lemma 2. If Q is hierarchical, then Q is in PTIME.
Proof, Case 1: the query has no root variable. Then Q decomposes into independent subqueries and
  Pr(Q) = Pr(Q1) Pr(Q2) Pr(Q3)
This is an extensional join ⋈.
56 Proof
Case 2: the query has a root variable x. With Dom = {a1, a2, ..., an},
  Pr(Q) = 1 - (1 - Pr(Q[a1/x]))(1 - Pr(Q[a2/x]))...(1 - Pr(Q[an/x]))
This is an extensional projection ∏. QED
57 Query Evaluation on ID-Databases
ID-extensional plans; #P-complete queries; Dichotomy Theorem
58 Extensional Plans for ID-DBs
Only difference: two kinds of projections:
  independent: 1 - (1-p1)...(1-pn)
  disjoint: p1 + ... + pn
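The two projection rules amount to two one-line functions. This is a sketch; a real engine would also track which disjointness block each tuple belongs to, which is omitted here:

```python
def project_independent(probs):
    """Duplicate tuples are independent: probability at least one survives."""
    out = 1.0
    for p in probs:
        out *= 1.0 - p
    return 1.0 - out

def project_disjoint(probs):
    """Duplicate tuples are mutually exclusive: probabilities simply add."""
    s = sum(probs)
    assert s <= 1.0 + 1e-9, "disjoint tuples must sum to at most 1"
    return s

print(project_independent([0.5, 0.5]))  # 0.75
print(project_disjoint([0.5, 0.3]))     # 0.8
```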
59 #P-Complete Queries
Q1 :- R^i(x), S^i(x,y), T^i(y)
Q2 :- R^d(x^d, y), S^d(y^d, z)
Q3 :- R^d(x^d, y), S^d(z^d, y)
60 ID-Dichotomy [Dalvi & S.'04]
Theorem. Schema S^id s.t. each table is either R^i or R^id; let Q = conjunctive query w/o self-joins. Then one of the following holds:
  Q is in PTIME; Q has a correct extensional plan
or:
  Q is #P-complete; Q has one of Q1, Q2, Q3 as a subquery
61 Extensions
Extensions of the dichotomy theorem exist for:
  mixed schemas (some relations are deterministic)
  functional dependencies
62 Summary on Query Evaluation
Extensional plans: popular and efficient, BUT "equivalent" plans lead to different results
Some queries admit "correct" plans; some simple queries have #P-complete complexity
Dichotomy theorem
Future work: remove the 'no-self-joins' restriction
63 Summary on Queries
Extensional plans: popular in the past [Barbara92, Lakshmanan97], but not all are correct
Some queries have no correct extensional plans [Dalvi & S.'04]
Need extensions to the DBMS
64 Summary of Query Complexity
Probabilistic databases have high complexity: #P
Extensional plans [Barbara92, Lakshmanan97]: popular and efficient, BUT the answer depends on the plan
When no correct plan exists, the query has high complexity [Dalvi & S.'04]
65 Outline Definitions Query evaluation Top-k answering (joint with Chris Re) Conclusions
66 Event Expressions
Atomic events: e1, e2, ...; probabilities: p1, p2, ...
Event expressions, e.g.: e1 ⋀ e2 ⋁ e1 ⋀ e3
67 Intensional Query Plans [Fuhr97]
Join ⋈: tuples (x, p) and (x', q) join into (x, x', p ⋀ q)
Projection ∏: tuples (x, p1), (x, p2), (x, p3) project onto (x, p1 ⋁ p2 ⋁ p3)
Selection σ: tuple (x, p) keeps event p
68 Probabilities of Boolean Expressions
Given E = e1 ⋀ e2 ⋁ e1 ⋀ e3 with p1 = Pr(e1), p2 = Pr(e2), p3 = Pr(e3), compute p = Pr(E)?
A: E = e1 ⋀ (e2 ⋁ e3), so p = p1 (1 - (1-p2)(1-p3))
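For small expressions, Pr(E) can be computed exactly by summing over all truth assignments of the atomic events; this enumeration is exponential in the number of events, which is why it does not scale. A Python sketch (the `pr_expr` helper and the example probabilities are my own):

```python
from itertools import product

def pr_expr(expr, probs):
    """Exact Pr[expr] by summing over all truth assignments of the
    independent atomic events. expr: function over a dict of booleans."""
    names = sorted(probs)
    total = 0.0
    for vals in product([False, True], repeat=len(names)):
        world = dict(zip(names, vals))
        if expr(world):
            weight = 1.0
            for n in names:
                weight *= probs[n] if world[n] else 1.0 - probs[n]
            total += weight
    return total

# E = e1 ∧ e2 ∨ e1 ∧ e3 = e1 ∧ (e2 ∨ e3)
probs = {"e1": 0.5, "e2": 0.5, "e3": 0.5}
E = lambda w: w["e1"] and (w["e2"] or w["e3"])
print(pr_expr(E, probs))  # p1 (1 - (1-p2)(1-p3)) = 0.5 * 0.75 = 0.375
```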
69 Top-k Ranking Problem
Fix schema S, query Q, number k > 0
Problem: given an I- or ID-database DB, find the k answers t1, ..., tk with highest probabilities
  Pr[Q(t1)] > Pr[Q(t2)] > ... > Pr[Q(tk)] > ...
Note: checking Pr[Q(ti)] > Pr[Q(tj)] is PP-complete
Goal: efficient polynomial-time approximation
70 Probabilities of Boolean Expressions
What is the probability of e1 ⋀ e2 ⋁ e1 ⋀ e3 ⋁ e2 ⋀ e3, with Pr(e_i) = p_i?
  (1-p1) p2 p3 + p1 (1-p2) p3 + p1 p2 (1-p3) + p1 p2 p3
Theorem: #P-hard [Valiant]
71 Monte Carlo Simulation [Karp & Luby'83]
Algorithm: randomly pick each of e1, e2, e3 = false or true; compute e1 ∧ e2 ∨ e1 ∧ e3 ∨ e2 ∧ e3: true or false?; repeat
Approximate the probability p with the frequency p'
Better: a polynomial-time approximation with guarantee Pr(|p' - p| < ε) > 1 - δ, i.e. the interval [p' - ε, p' + ε] contains p with high confidence
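A naive version of this sampler (plain Monte Carlo over independent events, not the full Karp–Luby importance-sampling scheme) is a few lines of Python; the function name and the seeded generator are my own choices for reproducibility:

```python
import random

def monte_carlo(expr, probs, n, rng):
    """Estimate Pr[expr] from n randomly sampled worlds."""
    hits = 0
    for _ in range(n):
        world = {name: rng.random() < p for name, p in probs.items()}
        if expr(world):
            hits += 1
    return hits / n

probs = {"e1": 0.5, "e2": 0.5, "e3": 0.5}
E = lambda w: (w["e1"] and w["e2"]) or (w["e1"] and w["e3"]) or (w["e2"] and w["e3"])
# Exact value: majority of three fair coins = 0.5
est = monte_carlo(E, probs, 20000, random.Random(0))
print(est)  # close to 0.5
```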
72 Monte Carlo Simulation
As the number of steps grows (N = 0, 1, 2, 3, ...), the confidence interval around p within [0, 1] shrinks.
73 The Multisimulation Problem
  Year  P
  1995  ??
  2002  ??
  1933  ??
  1984  ??
Schedule simulation steps over [0, 1] to find the top k
74 Multisimulation
How to find the top k out of n? Example: looking for top k = 2 among intervals for p1, ..., p5 in [0, 1].
Which one to simulate next?
75 Multisimulation
Critical region (example: k = 2): the open interval from the k-th largest left endpoint to the (k+1)-th largest right endpoint
76 Multisimulation Algorithm
Case 1: pick a "double crosser" (an interval spanning the whole critical region) and simulate it
77 Multisimulation Algorithm
Case 2: pick both a "left" AND a "right" crosser and simulate them
78 Multisimulation Algorithm
Case 3: pick a "max crosser" and simulate it
79 Multisimulation Algorithm
End: when the critical region is "empty"
To sort the top k, find the top k-1, etc.
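Putting the pieces together, a simplified multisimulation scheduler might look as follows. This is a sketch, not the paper's algorithm: it uses Hoeffding confidence intervals, simulates every interval crossing the critical region in batches (rather than choosing a single double/left-right/max crosser per step), and assumes n > k candidates:

```python
import math
import random

def multisimulation(exprs, probs, k, delta=1e-6, batch=200, seed=0):
    """Return the (approximate) top-k candidate indices by probability."""
    rng = random.Random(seed)
    n = len(exprs)
    hits, trials = [0] * n, [0] * n

    def step(i):  # one batch of Monte Carlo steps for candidate i
        for _ in range(batch):
            world = {name: rng.random() < p for name, p in probs.items()}
            hits[i] += 1 if exprs[i](world) else 0
            trials[i] += 1

    def interval(i):  # Hoeffding confidence interval for candidate i
        if trials[i] == 0:
            return 0.0, 1.0
        eps = math.sqrt(math.log(2 / delta) / (2 * trials[i]))
        c = hits[i] / trials[i]
        return max(0.0, c - eps), min(1.0, c + eps)

    while True:
        ivs = [interval(i) for i in range(n)]
        lo = sorted((iv[0] for iv in ivs), reverse=True)[k - 1]  # k-th largest left endpoint
        hi = sorted((iv[1] for iv in ivs), reverse=True)[k]      # (k+1)-th largest right endpoint
        if lo >= hi:          # critical region is empty: top-k separated
            break
        crossers = [i for i in range(n) if ivs[i][0] < hi and ivs[i][1] > lo]
        if not crossers:
            break
        for i in crossers:
            step(i)
    order = sorted(range(n), key=lambda i: hits[i] / max(trials[i], 1), reverse=True)
    return order[:k]

# Hypothetical example: each candidate answer depends on one atomic event.
event_probs = {"a": 0.95, "b": 0.7, "c": 0.3, "d": 0.05}
exprs = [lambda w, name=name: w[name] for name in ("a", "b", "c", "d")]
top2 = multisimulation(exprs, event_probs, k=2)
print(sorted(top2))  # the two most probable candidates: indices 0 and 1
```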
80 Multisimulation Algorithm
Theorem:
(1) it runs in < 2 × the optimal number of steps
(2) no other deterministic algorithm does better
81 Experiments
IMDB + AMZN: about 10M tuples, 60% probabilistic; k = 10, n = 33
Chart compares engine time for simulate-all, multisim, and multisim plus optimization (chart values 16, 1475, 3259)
82 Experiments
83 Summary on Top-k Answering
Simple algorithm, optimal within a factor of 2 w.r.t. a very powerful standard
The marriage of probabilistic and top-k answers makes probabilistic databases practical
84 Experiments
85 Outline Definitions Query evaluation Top-k answering Conclusions
86 Related Work
Probabilistic databases: Cavallo87, Barbara92, Lakshmanan97, Fuhr97, Dalvi04, Widom05
Extensional/intensional plans: Fuhr97
Probabilities for degrees of belief: Fagin90, Bacchus96
Simulation of boolean functions: Karp & Luby
Complexity of boolean function probability: Valiant79
87 Conclusions
Strong motivation from practical applications
Opportunity to merge query and search technologies
Probabilistic DBs are hard! Great opportunity for impactful theory work
Tomorrow: applications of random graphs to model incompleteness in databases
88 Research at UW
Finish the complexity dichotomy; aggregate queries; constraints; incomplete databases (random graphs)
89 Thank you! Questions?