1 Probabilities in Databases and Logics I
Nilesh Dalvi and Dan Suciu, University of Washington
2 Two Lectures
Today: probabilistic databases to model imprecision; probabilistic logics
Tomorrow: probabilistic databases to model incompleteness; random graphs
3 Motivation
Record reconciliation, information extraction, constraint violations, schema matching
4 Databases 101
Movie table:
  title               year
  Twelve Monkeys      1995
  Monkey Love         1997
  Monkey Love         1935
  Monkey Love Planet  2005
Query:
  SELECT title, rating FROM Movie, Review WHERE title = name
Answers:
  title               rating
  Twelve Monkeys      3
  Monkey Love         5
  Monkey Love Planet  5
5 Problem Setting
Movie table:
  title           year  p
  Twelve Monkeys  1995  .8
  Monkey Love     1997  .4
  Monkey Love     1935  .9
  Monkey Love Pl  2005  .7
Review table:
  name         rating  p
  Monkey Love  good    .5
               fair    .2
               fair    .6
               poor    .9
Query:
  A(x,y) :- Review(x,y), Movie(x,z), z > 1991
Answers (top k):
  title           rating  p
  Twelve Monkeys  fair    .53
  Monkey Love     good    .42
  Monkey Love Pl  fair    .15
Problem: complexity of query evaluation
6 Two Problems
Query evaluation problem: fix schema S, conjunctive query Q(x,y), and answer tuple (a,b); given database I, compute Pr(Q(a,b)).
Top-k answering problem: fix k > 0; given database I, find the k answer tuples with highest probabilities.
7 Related Work: DB
Cavallo & Pitarelli 1987; Barbara, Garcia-Molina, Porter 1992; Lakshmanan, Leone, Ross & Subrahmanian 1997; Fuhr & Roellke 1997; Dalvi & S. 2004; Widom 2005
8 Related Work: Logic
Query reliability [Gradel, Gurevitch, Hirsch'98]; degrees of belief [Bacchus, Grove, Halpern, Koller'96]; probabilistic logic [Nielson]; probabilistic model checking [Kwiatkowska'02]; probabilistic relational models [Taskar, Abbeel, Koller'02]
9 Outline Definitions Query Evaluation Top-k answering (joint with Chris Re) Conclusions
10 Application 1: Record Linkage
Which records match? Movie titles (Monk, 12 Monkeys, Twelve Monkeys) vs. reviews (Twelve Monkeys, 12 Monkeys (1996), Monkey Love, Planet of the Apes), with match scores such as 0.3, 0.1, 0.9
Data cleaning remains expensive and critical; extensive research area
Today: "garbage in, garbage out"
Probabilistic dbs: "garbage in, ranked answers out"
11 Application 1: Fuzzy Object Matching
TitleMatch table (computed by q-gram or edit distance):
  title               review       p
  Twelve Monkeys      12 Monkeys   0.7
  Monkey Love 1997    12 Monkeys   0.45
  Monkey Love 1935    Monkey Love  0.82
  Monkey Love 1935    Monkey Boy   0.68
  Monkey Love Planet  Monkey Love  0.8
12 Application 1: Fuzzy Object Matching
Intensively studied: "record linkage", "deduplication", "object reconciliation", etc.
Current usage: score vs. threshold
New usage: scores as probabilities
13 Application 1: Fuzzy Object Matching
Query: find all movies rated highly by both Joe and Jim
Answers (top k):
  title               year  p
  Monkey Love         1997  0.73
  Twelve Monkeys      1995  0.68
  Monkey Love         1935  0.43
  Monkey Love Planet  2005  0.12
14 Application 2: Information Extraction
Collection of unstructured documents
Define tables, populate them ➔ scores
A variety of machine learning techniques
15 Application 2: Information Extraction
  Posted By  Review Text
  john@.     12 Monkeys is an OK movie
  joe@.      Monkeys is one of the best movies I've seen
  mjs42@.    I've never seen 12 Monkeys but I love Monk.
Which movie is the review about? Is this review positive or negative?
16 Application 2: Information Extraction
Review text: "I've never seen 12 Monkeys but I love Monk."
Extracted facts table (score 0.3):
  Movie       Actor  Rating
  12 Monkeys  Monk   fair
Avatar (IBM): corporate facts from documents
Textrunner (UW): 1BN facts from 100M Web pages
Extensive research area
Today: facts can be used only in isolation
Probabilistic dbs: SQL queries, rank answers
17 Application 2: Information Extraction
Extensive area: from text segmentation to extraction from the WWW
AVATAR (IBM): corporate data from docs
TextRunner (UW): 1,000,000,000 facts from the WWW
ATTENEX (startup): discovery for law offices
Inherently imprecise!
Probabilistic dbs: keep and use the scores
18 Application 3: Activity Recognition [Lester05, Liao05]
Equip people with sensors, classify activity; e.g. elderly health care (Alzheimer's)
  Name  Time  Activity
  Sue   t     run (0.3) | walk (0.6)
  Sue   t+1   walk (0.2) | stand (0.4) | sit
Has Mr. Johnson eaten in the last 12 hours?
Probabilistic dbs: use scores to rank answers
19 Application 3: Activity Recognition
  subject  time  act  p
  Barbara  t     run  0.3
  Barbara  t     eat  0.45
  Jim      t+1   eat  0.82
  Barbara  t+1   eat  0.45
20 Other Applications Sensor data, RFID [Madden] Constraint violations [Bertossi02,Fuxman06] Schema matching [Doan03] Security/privacy [Evfimievski03, Miklau04] Bio-, medical-, clinical informatics [DataGrid (a startup)] Personal information management [Karger, Halevy]
21 Summary of Applications
Large range of apps with imprecise data
Specific techniques exist, but imprecisions are handled at the application level
Goal of probabilistic databases: manage all imprecisions uniformly
22 Outline Applications Data model Queries Multisimulation Conclusions
23 Probabilistic Database
Schema S, domain D, set of instances Inst
Definition: a probabilistic database is a probability distribution Pr : Inst → [0,1] with ∑_I Pr[I] = 1
If Pr[I] > 0 then I is called a "possible world"
24 Probabilistic Database
Representation:
  Independent tuples: I-database DB over some schema S^i
  Independent and disjoint tuples: ID-database DB over some schema S^id
Semantics: DB "means" a probability distribution Pr over schema S
25 Independent Events A tuple is in the database with probability p Any two tuples are independent events
26 I-Databases
Representation: Reviews^i(M, S, p)
  Movie  Score  P
  m42    good   p1
  m99    good   p2
  m76    poor   p3
Possible worlds semantics over Reviews(M, S): each subset of the three tuples is a possible world I1, ..., I8, e.g.
  Pr[I1] = (1-p1)(1-p2)(1-p3)    (empty world)
  Pr[I4] = p1 p2 (1-p3)          (world { (m42, good), (m99, good) })
  Pr[I8] = p1 p2 p3              (world with all three tuples)
  Pr[I1] + Pr[I2] + ... + Pr[I8] = 1
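The possible-worlds semantics above is easy to make concrete. Here is a minimal Python sketch; the `worlds` helper and the concrete probabilities 0.8, 0.5, 0.3 are illustrative choices, not from the slides:

```python
from itertools import product

# An I-database: each tuple is present independently with its probability.
reviews = [("m42", "good", 0.8), ("m99", "good", 0.5), ("m76", "poor", 0.3)]

def worlds(table):
    """Enumerate all 2^n possible worlds of an I-table with their probabilities."""
    for mask in product([False, True], repeat=len(table)):
        prob = 1.0
        world = []
        for present, (mid, score, p) in zip(mask, table):
            prob *= p if present else (1.0 - p)
            if present:
                world.append((mid, score))
        yield world, prob

total = sum(prob for _, prob in worlds(reviews))
print(round(total, 10))  # the probabilities of all 2^3 worlds sum to 1
```

The empty world gets probability (1-p1)(1-p2)(1-p3) = 0.2 × 0.5 × 0.7 = 0.07, exactly as in the slide's Pr[I1].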
27 Disjoint Events
Needed in: many-to-1 matchings; possible values for attributes [Barbara'92]
  Name  Age  P
  John  34   0.3
  John  43   0.7
  Mary  25   1.0
Equivalently: John has age 34 (0.3) or 43 (0.7); Mary has age 25.
28 ID-Databases
Representation: Activities^id(Time^d, Activity, P)
  Time  Activity  P
  t     walk      p1
  t     run       p2
  t+1   walk      p3
Tuples with the same Time value are disjoint; the possible worlds I1, ..., I6, e.g.
  Pr[I1] = (1-p1-p2)(1-p3)    (empty world)
  Pr[I3] = p2 (1-p3)          (world { (t, run) })
  Pr[I5] = p1 p3              (world { (t, walk), (t+1, walk) })
  Pr[I1] + Pr[I2] + ... + Pr[I6] = 1
29 ID subsumes I
An I-table is the special case of an ID-table in which every tuple forms its own disjointness block:
  Reviews^i(Movie, Score, P) = Reviews^id(Movie^d, Score^d, P)
  Movie  Score  P
  m42    good   p1
  m99    good   p2
  m76    poor   p3
Note: declaring the whole table as one block, Reviews^id(Movie, Score, P), would instead mean all tuples are disjoint.
30 Queries
Syntax: conjunctive queries over schema S
  Q(y) :- Movie(x,y), Review(x,z), z >= 3
Movie^i:
  id   year  P
  m42  1995  0.95
  m99  2002  0.65
  m76  2002  0.1
  m05  2005  0.7
Review^i:
  mid  rating  p
  m42  4       0.7
  m42  5       0.45
  m99  5       0.82
  m99  4       0.68
  m05  5       0.79
31 Two Query Semantics
Possible answer sets: given a set A, Pr[{t | I ⊨ Q(t)} = A] — used for views
Possible tuples: given a tuple t, Pr[I ⊨ Q(t)] — used for query evaluation and top-k (this talk)
32 Possible Answer Sets
MovieReviewMatch(mid,rid) :- Movie(mid,x,-), Review(rid,y,-,-), TitleMatch(x,y),
                             Actor(u,mid), ActorReview(v,rid), ActorMatch(u,v)
Worlds p1, ..., p5 of TitleMatch and ActorMatch group into possible answer sets for MovieReviewMatch, with probabilities P1 + P2, P3, and P4 + P5 + P6
33 Query Semantics 1
Q(y) :- Movie(x,y), Review(x,z)
Possible worlds of Movie:
  p1: { (m42, 2004), (m99, 1901), (m76, 1902) }
  p2: { (m99, 1935), (m05, 1903) }
  p3: { (m76, 1995), (m99, 1935), (m05, 2004) }
  p4: { (m87, 1934), (m44, 1904) }
Possible answer sets, e.g. { 1934, 1904 } with probability p4 and { 2004 } with probability p1 + p3
Possible answer sets: useful for views
34 ID + Views = Complete
Theorem. ID-databases plus views are "complete" for representing possible-worlds distributions.
Encoding: store each world I_j of R as tuples RA(a, w_j); store a disjoint world table PWD(W, P) with rows (w_j, p_j); the view
  R(x) :- RA(x, w), PWD(w)
then reproduces the given distribution over possible worlds.
35 Query Semantics 2: Tuple Probabilities
Q(y) :- Movie(x,y), Review(x,z)
Answers (top k):
  year  p
  1935  p2 + p3 = 0.6
  2004  p1 + p3 = 0.5
  1995  p3 = 0.2
  ...
36 Complex Correlations
Views take us from atomic events to complex events:
MovieReviewMatch(mid,rid) :- Movie(mid,x,-), Review(rid,y,-,-), TitleMatch(x,y),
                             Actor(u,mid), ActorReview(v,rid), ActorMatch(u,v)
37 Complex Correlations
The same example: worlds p1, ..., p5 of TitleMatch and ActorMatch induce possible answer sets for MovieReviewMatch with probabilities P1 + P2, P3, and P4 + P5 + P6
38 Summary on Data Model
Data model: semantics = possible worlds; syntax = I-databases or ID-databases
Queries: syntax = unchanged (conjunctive queries); semantics = tuple probabilities
39 Outline Definitions Query evaluation Top-k answering Conclusions
40 Problem Definition
Fix schema S, query Q, answer tuple t
Problem: given an I/ID-database DB, compute Pr[I ⊨ Q(t)] (notation: Pr[Q(t)])
Conventions: for upper bounds (P or #P), probabilities are rationals; for lower bounds (#P), probabilities are 1/2
41 Query Evaluation on I-Databases
Outline: intuition; extensional plans (PTIME case); hard queries (#P-complete case); Dichotomy Theorem
42 Intuition
Q(y) :- Movie(x,y), Review(x,z)
Movie^i:               Review^i:
  id   year  p           mid  rate  p
  m42  1995  p1          m42  4     q1
  m99  2002  p2          m42  2     q2
  m76  2002  p3          m42  3     q3
  m05  2005  p4          m99  1     q4
                         m99  3     q5
                         m76  5     q6
Answer:
  year  p
  1995  p1 × (1 - (1-q1)(1-q2)(1-q3))
  2002  1 - (1 - p2 × (1 - (1-q4)(1-q5))) × (1 - p3 × q6)
43 I-Extensional Plans [Barbara92, Lakshmanan97]
Join ⋈:       p = p1 * p2
Projection ∏: p = 1 - (1-p1)(1-p2)...(1-pn)
Selection σ:  p = p
Note: data complexity is PTIME
44 Extensional Query Plans
Join ⋈: tuples (x, p) and (x', q) join into (x, x', pq)
Projection ∏: tuples (x, p1), (x, p2), (x, p3) project onto x with probability 1 - (1-p1)(1-p2)(1-p3)
Selection σ: tuple (x, p) keeps probability p
45 Extensional Query Plans Each tuple t has a probability t.P Algebra operators compute t.P Data complexity: PTIME
46 Correct vs. Incorrect Plans
Q(y) :- Movie(x,y), Review(x,z)
Movie: (1995, m1) with probability p; Review: (m1, q1), (m1, q2), (m1, q3)
CORRECT plan ∏(Movie ⋈ ∏(Review)):
  first project Review onto m1: 1 - (1-q1)(1-q2)(1-q3)
  then join: 1995 → p(1 - (1-q1)(1-q2)(1-q3))
INCORRECT plan ∏(Movie ⋈ Review):
  join gives (1995, m1) with pq1, pq2, pq3
  projection gives 1995 → 1 - (1-pq1)(1-pq2)(1-pq3), which wrongly treats the three join tuples as independent although they share the same Movie tuple
47 Observation 1 The answer depends on the query plan
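This observation can be checked numerically. The sketch below evaluates both extensional plans for the Movie/Review example, with made-up probabilities p = q1 = q2 = q3 = 0.5 (the `ext_project` helper is an assumption, not from the slides):

```python
def ext_project(probs):
    """Extensional projection over independent tuples:
    probability that at least one of them survives."""
    out = 1.0
    for p in probs:
        out *= 1.0 - p
    return 1.0 - out

p = 0.5              # Movie tuple (1995, m1)
q = [0.5, 0.5, 0.5]  # three Review tuples joining with m1

# Correct plan: project Review onto m1 first, then join with Movie.
correct = p * ext_project(q)                   # 0.4375

# Incorrect plan: join first, then project. The three join tuples all
# contain the SAME Movie tuple, so they are not independent, but the
# extensional projection treats them as if they were.
incorrect = ext_project([p * qi for qi in q])  # 0.578125

print(correct, incorrect)
```

The two plans are equivalent as relational algebra, yet return different probabilities, which is exactly why plan choice matters here.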
48 #P-Complete Queries
Q_bad :- R^i(x), S(x,y), T^i(y)
R^i: probabilistic table over A (p1, ..., p4); T^i: probabilistic table over B (q1, ..., q4); S: deterministic table over (A, B)
Theorem: the data complexity of Q_bad is #P-complete
49 Proof
Q_bad :- R^i(x), S(x,y), T^i(y)
Theorem [Provan & Ball 83]: counting the satisfying assignments of a bipartite DNF is #P-complete.
Reduction from the DNF x2y3 ∨ x1y2 ∨ x4y3 ∨ x3y1:
  R^i: x1, ..., x4, each with probability 1/2; T^i: y1, y2, y3, each with probability 1/2
  S: { (x2,y3), (x1,y2), (x4,y3), (x3,y1) }
Then Pr[Q_bad] = (#satisfying assignments) / 2^7.
50 Observation 2
Some queries (like Q_bad) don't admit a correct extensional plan!
51 I-Dichotomy
Q = boolean conjunctive query
Definition 1. For each variable x: goals(x) = set of goals that contain x
Definition 2. Q is hierarchical if for all x, y:
  (a) goals(x) ∩ goals(y) = ∅, or (b) goals(x) ⊆ goals(y), or (c) goals(y) ⊆ goals(x)
52 Examples
Q :- R(x), S(x,y), T(x,y,z), K(x,v)   "hierarchical"
  goals(x) = {R,S,T,K} ⊇ goals(y) = {S,T} ⊇ goals(z) = {T}; goals(v) = {K} ⊆ goals(x)
Q :- R(x), S(x,y), T(y)   "non-hierarchical"
  goals(x) = {R,S} and goals(y) = {S,T} overlap, but neither contains the other
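Definition 2 translates directly into code. Below is a sketch of the hierarchy test; representing a query as a mapping from goal names to their variable tuples is my own convention for the example:

```python
from itertools import combinations

def is_hierarchical(goals):
    """goals: mapping goal name -> tuple of variables, e.g. {'R': ('x',)}.
    Q is hierarchical iff for every pair of variables, the sets of goals
    containing them are disjoint or one contains the other."""
    variables = {v for args in goals.values() for v in args}
    gsets = {v: {g for g, args in goals.items() if v in args} for v in variables}
    for u, v in combinations(variables, 2):
        a, b = gsets[u], gsets[v]
        if not (a.isdisjoint(b) or a <= b or b <= a):
            return False
    return True

# The two example queries from the slide:
q_hier = {"R": ("x",), "S": ("x", "y"), "T": ("x", "y", "z"), "K": ("x", "v")}
q_nonh = {"R": ("x",), "S": ("x", "y"), "T": ("y",)}
print(is_hierarchical(q_hier), is_hierarchical(q_nonh))  # True False
```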
53 I-Dichotomy [Dalvi & S.'04]
Theorem. Schema S^i = {R1^i, R2^i, ..., Rm^i}; let Q = conjunctive query w/o self-joins. Then one of the following holds:
  Q is in PTIME; Q has a correct extensional plan; Q is hierarchical
or:
  Q is #P-complete; Q has subgoals R(x,...), S(x,y,...), T(y,...)
54 Proof
Lemma 1. If Q is non-hierarchical, then Q is #P-complete.
Proof: Q contains a pattern such as Q :- R^i(v,x), S^i(x,y), T^i(y,z), K^i(z); the rest is as for Q_bad.
55 Proof
Lemma 2. If Q is hierarchical, then Q is in PTIME.
Proof, Case 1: the query has no root variable. Then Q decomposes into independent subqueries and
  Pr(Q) = Pr(Q1) Pr(Q2) Pr(Q3)
This is an extensional join ⋈.
56 Proof
Case 2: the query has a root variable x. With Dom = {a1, a2, ..., an},
  Pr(Q) = 1 - (1 - Pr(Q[a1/x]))(1 - Pr(Q[a2/x]))...(1 - Pr(Q[an/x]))
This is an extensional projection ∏. QED
57 Query Evaluation on ID-Databases
ID-extensional plans; #P-complete queries; Dichotomy Theorem
58 Extensional Plans for ID-DBs
Only difference: two kinds of projections:
  independent: 1 - (1-p1)...(1-pn)
  disjoint: p1 + ... + pn
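The two projection rules amount to two one-line functions. This is a sketch; a real engine would also track which disjointness block each tuple belongs to, which is omitted here:

```python
def project_independent(probs):
    """Duplicate tuples are independent: probability at least one survives."""
    out = 1.0
    for p in probs:
        out *= 1.0 - p
    return 1.0 - out

def project_disjoint(probs):
    """Duplicate tuples are mutually exclusive: probabilities simply add."""
    s = sum(probs)
    assert s <= 1.0 + 1e-9, "disjoint tuples must sum to at most 1"
    return s

print(project_independent([0.5, 0.5]))  # 0.75
print(project_disjoint([0.5, 0.3]))     # 0.8
```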
59 #P-Complete Queries
Q1 :- R^i(x), S^i(x,y), T^i(y)
Q2 :- R^d(x^d, y), S^d(y^d, z)
Q3 :- R^d(x^d, y), S^d(z^d, y)
60 ID-Dichotomy [Dalvi & S.'04]
Theorem. Schema S^id s.t. each table is either R^i or R^id; let Q = conjunctive query w/o self-joins. Then one of the following holds:
  Q is in PTIME; Q has a correct extensional plan
or:
  Q is #P-complete; Q has one of Q1, Q2, Q3 as a subquery
61 Extensions
Extensions of the dichotomy theorem exist for:
  mixed schemas (some relations are deterministic)
  functional dependencies
62 Summary on Query Evaluation
Extensional plans: popular and efficient, BUT "equivalent" plans lead to different results
Some queries admit "correct" plans; some simple queries have #P-complete complexity
Dichotomy theorem
Future work: remove the 'no-self-joins' restriction
63 Summary on Queries
Extensional plans: popular in the past [Barbara92, Lakshmanan97], but not all are correct
Some queries have no correct extensional plans [Dalvi & S.'04]
Need extensions to the DBMS
64 Summary of Query Complexity
Probabilistic databases have high complexity: #P
Extensional plans [Barbara92, Lakshmanan97]: popular and efficient, BUT the answer depends on the plan
When no correct plan exists, the query has high complexity [Dalvi & S.'04]
65 Outline Definitions Query evaluation Top-k answering (joint with Chris Re) Conclusions
66 Event Expressions
Atomic events: e1, e2, ...; probabilities: p1, p2, ...
Event expressions, e.g.: e1 ⋀ e2 ⋁ e1 ⋀ e3
67 Intensional Query Plans [Fuhr97]
Join ⋈: tuples (x, p) and (x', q) join into (x, x', p ⋀ q)
Projection ∏: tuples (x, p1), (x, p2), (x, p3) project onto (x, p1 ⋁ p2 ⋁ p3)
Selection σ: tuple (x, p) keeps event p
68 Probabilities of Boolean Expressions
Given E = e1 ⋀ e2 ⋁ e1 ⋀ e3 with p1 = Pr(e1), p2 = Pr(e2), p3 = Pr(e3), compute p = Pr(E)?
A: E = e1 ⋀ (e2 ⋁ e3), so p = p1 (1 - (1-p2)(1-p3))
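For small expressions, Pr(E) can be computed exactly by summing over all truth assignments of the atomic events; this enumeration is exponential in the number of events, which is why it does not scale. A Python sketch (the `pr_expr` helper and the example probabilities are my own):

```python
from itertools import product

def pr_expr(expr, probs):
    """Exact Pr[expr] by summing over all truth assignments of the
    independent atomic events. expr: function over a dict of booleans."""
    names = sorted(probs)
    total = 0.0
    for vals in product([False, True], repeat=len(names)):
        world = dict(zip(names, vals))
        if expr(world):
            weight = 1.0
            for n in names:
                weight *= probs[n] if world[n] else 1.0 - probs[n]
            total += weight
    return total

# E = e1 ∧ e2 ∨ e1 ∧ e3 = e1 ∧ (e2 ∨ e3)
probs = {"e1": 0.5, "e2": 0.5, "e3": 0.5}
E = lambda w: w["e1"] and (w["e2"] or w["e3"])
print(pr_expr(E, probs))  # p1 (1 - (1-p2)(1-p3)) = 0.5 * 0.75 = 0.375
```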
69 Top-k Ranking Problem
Fix schema S, query Q, number k > 0
Problem: given an I- or ID-database DB, find the k answers t1, ..., tk with highest probabilities
  Pr[Q(t1)] > Pr[Q(t2)] > ... > Pr[Q(tk)] > ...
Note: checking Pr[Q(ti)] > Pr[Q(tj)] is PP-complete
Goal: efficient polynomial-time approximation
70 Probabilities of Boolean Expressions
What is the probability of e1 ⋀ e2 ⋁ e1 ⋀ e3 ⋁ e2 ⋀ e3, with Pr(e_i) = p_i?
  (1-p1) p2 p3 + p1 (1-p2) p3 + p1 p2 (1-p3) + p1 p2 p3
Theorem: #P-hard [Valiant]
71 Monte Carlo Simulation [Karp & Luby'83]
Algorithm: randomly pick each of e1, e2, e3 = false or true; compute e1 ∧ e2 ∨ e1 ∧ e3 ∨ e2 ∧ e3: true or false?; repeat
Approximate the probability p with the frequency p'
Better: a polynomial-time approximation with guarantee Pr(|p' - p| < ε) > 1 - δ, i.e. the interval [p' - ε, p' + ε] contains p with high confidence
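A naive version of this sampler (plain Monte Carlo over independent events, not the full Karp–Luby importance-sampling scheme) is a few lines of Python; the function name and the seeded generator are my own choices for reproducibility:

```python
import random

def monte_carlo(expr, probs, n, rng):
    """Estimate Pr[expr] from n randomly sampled worlds."""
    hits = 0
    for _ in range(n):
        world = {name: rng.random() < p for name, p in probs.items()}
        if expr(world):
            hits += 1
    return hits / n

probs = {"e1": 0.5, "e2": 0.5, "e3": 0.5}
E = lambda w: (w["e1"] and w["e2"]) or (w["e1"] and w["e3"]) or (w["e2"] and w["e3"])
# Exact value: majority of three fair coins = 0.5
est = monte_carlo(E, probs, 20000, random.Random(0))
print(est)  # close to 0.5
```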
72 Monte Carlo Simulation
As the number of steps grows (N = 0, 1, 2, 3, ...), the confidence interval around p within [0, 1] shrinks.
73 The Multisimulation Problem
  Year  P
  1995  ??
  2002  ??
  1933  ??
  1984  ??
Schedule simulation steps over [0, 1] to find the top k
74 Multisimulation
How to find the top k out of n? Example: looking for top k = 2 among intervals for p1, ..., p5 in [0, 1].
Which one to simulate next?
75 Multisimulation
Critical region (example: k = 2): the open interval from the k-th largest left endpoint to the (k+1)-th largest right endpoint
76 Multisimulation Algorithm
Case 1: pick a "double crosser" (an interval spanning the whole critical region) and simulate it
77 Multisimulation Algorithm
Case 2: pick both a "left" AND a "right" crosser and simulate them
78 Multisimulation Algorithm
Case 3: pick a "max crosser" and simulate it
79 Multisimulation Algorithm
End: when the critical region is "empty"
To sort the top k, find the top k-1, etc.
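Putting the pieces together, a simplified multisimulation scheduler might look as follows. This is a sketch, not the paper's algorithm: it uses Hoeffding confidence intervals, simulates every interval crossing the critical region in batches (rather than choosing a single double/left-right/max crosser per step), and assumes n > k candidates:

```python
import math
import random

def multisimulation(exprs, probs, k, delta=1e-6, batch=200, seed=0):
    """Return the (approximate) top-k candidate indices by probability."""
    rng = random.Random(seed)
    n = len(exprs)
    hits, trials = [0] * n, [0] * n

    def step(i):  # one batch of Monte Carlo steps for candidate i
        for _ in range(batch):
            world = {name: rng.random() < p for name, p in probs.items()}
            hits[i] += 1 if exprs[i](world) else 0
            trials[i] += 1

    def interval(i):  # Hoeffding confidence interval for candidate i
        if trials[i] == 0:
            return 0.0, 1.0
        eps = math.sqrt(math.log(2 / delta) / (2 * trials[i]))
        c = hits[i] / trials[i]
        return max(0.0, c - eps), min(1.0, c + eps)

    while True:
        ivs = [interval(i) for i in range(n)]
        lo = sorted((iv[0] for iv in ivs), reverse=True)[k - 1]  # k-th largest left endpoint
        hi = sorted((iv[1] for iv in ivs), reverse=True)[k]      # (k+1)-th largest right endpoint
        if lo >= hi:          # critical region is empty: top-k separated
            break
        crossers = [i for i in range(n) if ivs[i][0] < hi and ivs[i][1] > lo]
        if not crossers:
            break
        for i in crossers:
            step(i)
    order = sorted(range(n), key=lambda i: hits[i] / max(trials[i], 1), reverse=True)
    return order[:k]

# Hypothetical example: each candidate answer depends on one atomic event.
event_probs = {"a": 0.95, "b": 0.7, "c": 0.3, "d": 0.05}
exprs = [lambda w, name=name: w[name] for name in ("a", "b", "c", "d")]
top2 = multisimulation(exprs, event_probs, k=2)
print(sorted(top2))  # the two most probable candidates: indices 0 and 1
```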
80 Multisimulation Algorithm
Theorem:
(1) it runs in < 2 × the optimal number of steps
(2) no other deterministic algorithm does better
81 Experiments
IMDB + AMZN: about 10M tuples, 60% probabilistic; k = 10, n = 33
Chart compares engine time for simulate-all, multisim, and multisim plus optimization (chart values 16, 1475, 3259)
82 Experiments
83 Summary on Top-k Answering
Simple algorithm, optimal within a factor of 2 w.r.t. a very powerful standard
The marriage of probabilistic and top-k answers makes probabilistic databases practical
84 Experiments
85 Outline Definitions Query evaluation Top-k answering Conclusions
86 Related Work
Probabilistic databases: Cavallo87, Barbara92, Lakshmanan97, Fuhr97, Dalvi04, Widom05
Extensional/intensional plans: Fuhr97
Probabilities for degrees of belief: Fagin90, Bacchus96
Simulation of boolean functions: Karp & Luby
Complexity of boolean function probability: Valiant79
87 Conclusions
Strong motivation from practical applications
Opportunity to merge query and search technologies
Probabilistic DBs are hard! Great opportunity for impactful theory work
Tomorrow: applications of random graphs to model incompleteness in databases
88 Research at UW
Finish the complexity dichotomy; aggregate queries; constraints; incomplete databases (random graphs)
89 Thank you! Questions?