Probabilistic Databases with MarkoViews

Abhay Jha, Dan Suciu
Presented by: Alon Vizel, 15/1/2017, Soft-Logic Seminar in Computer Science, Technion

Lecture layout:
1. Definitions & Background
2. INDB: tuple-independent database
3. MLN: Markov Logic Network
4. MarkoViews & MVDB
5. Translating MVDB to INDB
6. Experimental Evaluation
7. Summary

Definitions
A database instance $I$ of a relational schema $R = (R_1, \ldots, R_k)$ is a $k$-tuple $(R_1^I, \ldots, R_k^I)$, where $R_i^I$ is an instance of the relation $R_i$.
A probabilistic database is a pair $D = (W, P)$, where $W = \{I_1, \ldots, I_N\}$ is a set of instances, called possible worlds, and $P : W \to [0, 1]$ assigns a probability to each world, with $\sum_j P(I_j) = 1$.
We denote by $Tup$ the set of possible tuples, i.e. the set of all tuples occurring in the possible worlds $I_1, \ldots, I_N$.

Definitions cont.
A conjunctive query (CQ) is a query $Q$ of the form $(\exists \bar{y})(R_1(\bar{x}_1) \wedge \cdots \wedge R_t(\bar{x}_t))$.
A union of conjunctive queries (UCQ) is a query $Q$ of the form $Q_1 \vee \cdots \vee Q_k$, where each $Q_i$ is a CQ.

Background
Many query processing techniques exist for probabilistic databases. They have short running times and deal successfully with large databases.
Problem: most scalable query processing techniques assume that the tuples are independent, and most of them are based on UCQs. This is insufficient for complex knowledge extraction tasks.

What do we want?
1. Represent complex correlations.
2. Efficient query evaluation: an easy translation (our main goal today) and fast evaluation.

Tuple independent database (INDB)
A probabilistic database is tuple-independent if, for any set of possible tuples $t_1, \ldots, t_n$, the events $X_{t_1}, \ldots, X_{t_n}$ are independent, where $X_t$ is the event that tuple $t$ is present.
We write $D_0 = (Tup, p)$, where $Tup$ is the set of possible tuples and $p : Tup \to [0, 1]$.
The possible worlds are all subsets $I \subseteq Tup$, and their probabilities are
$$P(I) = \prod_{t \in I} p(t) \cdot \prod_{t \in Tup - I} (1 - p(t))$$

INDB example

S     A    B    Prob.
s1    m    1    0.6
s2    n         0.5

T     C    D    Prob.
t1    1    p    0.4

Possible worlds:

Instance         Probability
{s1, s2, t1}     0.12
{s1, s2}         0.18
{s1, t1}         0.12
{s1}             0.18
{s2, t1}         0.08
{s2}             0.12
{t1}             0.08
∅                0.12
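The possible-worlds table for this example can be reproduced mechanically from the tuple probabilities. The following is a minimal sketch, not code from the paper; the names `world_probability` and `possible_worlds` are chosen here for illustration only.

```python
from itertools import combinations

# Tuple probabilities taken from the S and T tables of the INDB example.
p = {"s1": 0.6, "s2": 0.5, "t1": 0.4}

def world_probability(world, p):
    """P(I) = prod_{t in I} p(t) * prod_{t not in I} (1 - p(t))."""
    prob = 1.0
    for t, pt in p.items():
        prob *= pt if t in world else 1.0 - pt
    return prob

def possible_worlds(p):
    """Yield every subset of the possible tuples as a frozenset."""
    tuples = sorted(p)
    for k in range(len(tuples) + 1):
        for subset in combinations(tuples, k):
            yield frozenset(subset)

# Map each possible world to its probability.
worlds = {w: world_probability(w, p) for w in possible_worlds(p)}

for w, prob in sorted(worlds.items(), key=lambda kv: -len(kv[0])):
    print(sorted(w), round(prob, 2))
```

Enumeration is exponential in the number of tuples, so this is only useful for checking small examples by hand.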

Alternative INDB definition
$D_0 = (Tup_0, w_0)$, where $Tup_0$ is a set of possible tuples and $w_0(t)$ associates a real number (a weight) to each tuple $t$.
This definition is equivalent to the one given earlier, by setting the tuple probability to $p(t) = w_0(t) / (1 + w_0(t))$.
In a tuple-independent database, a weight represents the odds: $w = p / (1 - p)$.
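The equivalence of the two definitions can be checked in one line. The derivation below is a sketch that assumes $p(t) < 1$ for every tuple, so that the odds are finite.

```latex
\begin{align*}
P(I) &= \prod_{t \in I} p(t) \cdot \prod_{t \in Tup_0 - I} (1 - p(t))
      = \prod_{t \in Tup_0} (1 - p(t)) \cdot \prod_{t \in I} \frac{p(t)}{1 - p(t)} \\
     &= \frac{\prod_{t \in I} w_0(t)}{\prod_{t \in Tup_0} (1 + w_0(t))},
      \qquad \text{since } 1 - p(t) = \frac{1}{1 + w_0(t)}.
\end{align*}
```

So the weight of a world is the product of the weights of its tuples, and the normalizing constant is $\prod_{t \in Tup_0} (1 + w_0(t))$.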

Markov Logic Networks (MLNs)
A Markov Logic Network is a set $L = \{(F_1, w_1), \ldots, (F_m, w_m)\}$, where each $F_i$ is a formula in first-order logic over a relational schema $R$, called a feature, and $w_i$ is its weight.
A grounding of a formula $F_i$ is a formula in which the free variables $\bar{x}$ of $F_i$ are substituted with some constants $\bar{a}$; the set of groundings of $F_i$ is denoted $G(F_i)$.
So the grounded MLN is $G(L) = \{(G, w_i) \mid \exists (F_i, w_i) \in L : G \in G(F_i)\}$.

The semantics of an MLN $L$ is the probabilistic database $D_L = (W, P)$, where $W = \{I \mid I \subseteq Tup\}$ and $P(I) = \varphi(I) / Z$ for all $I \subseteq Tup$.
$\varphi(I)$ is the weight of a possible world: $\varphi(I) = \prod_{(G, w) \in G(L) : I \models G} w$
$Z$ is the partition function: $Z = \sum_{I \subseteq Tup} \varphi(I)$
$w > 1$ means that worlds where the feature holds are more likely.
$w < 1$ means that worlds where the feature holds are less likely.
$w = 1$ means indifference.
$w = \infty$ is interpreted as a hard constraint.
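As a minimal sketch of these semantics, a grounded MLN can be represented as a list of (predicate, weight) pairs and evaluated by brute force. The representation, the function names and the example weights below are hypothetical; hard constraints ($w = \infty$) are not handled.

```python
from itertools import combinations

def all_worlds(tup):
    """All subsets I of the set of possible tuples."""
    tup = sorted(tup)
    return [frozenset(c) for k in range(len(tup) + 1)
            for c in combinations(tup, k)]

def phi(world, grounded_mln):
    """Weight of a world: product of the weights of the grounded
    features that the world satisfies (empty product = 1)."""
    weight = 1.0
    for holds, w in grounded_mln:
        if holds(world):
            weight *= w
    return weight

def mln_distribution(tup, grounded_mln):
    """Return ({world: P(world)}, Z) for a grounded MLN."""
    weights = {I: phi(I, grounded_mln) for I in all_worlds(tup)}
    Z = sum(weights.values())
    return {I: wt / Z for I, wt in weights.items()}, Z

# Hypothetical grounded MLN: two atomic features and one feature
# that correlates them (weights chosen only for illustration).
features = [
    (lambda I: "R(a)" in I, 2.0),
    (lambda I: "S(a)" in I, 3.0),
    (lambda I: "R(a)" in I and "S(a)" in I, 0.5),
]
P, Z = mln_distribution({"R(a)", "S(a)"}, features)
```

With these weights the four worlds have weights 1, 2, 3 and 2 · 3 · 0.5 = 3, so the third feature makes the world containing both tuples half as heavy as it would be under independence.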

MLN examples
1. Two features over a Person relation:
w = 4.5: notSame(i1, i2) :- Person(i1, f1, l1, a1) ∧ Person(i2, f2, l2, a2) ∧ ¬SameCountry(a1, a2)
w = 0.5: Same(i1, i2) :- Person(i1, f1, l1, a1) ∧ Person(i2, f2, l2, a2) ∧ Similar(f1, f2) ∧ Similar(l1, l2) ∧ Close(a1, a2)
With the first feature (w > 1), a world (instance of R) is more likely if two persons who are not from the same country are not the same person.
With the second feature (w < 1), a world (instance of R) is less likely if it contains two persons with similar names who live close to each other.

2. Consider the MLN consisting of the features $(R(a), w_1)$, $(S(a), w_2)$. Recall that $w = p / (1 - p)$ and $p = w / (1 + w)$. This MLN defines a tuple-independent database, so the probabilities are:

Possible world    Weight φ(I_i)    P(I_i) in weights               P(I_i) in probabilities
{R(a), S(a)}      w1 w2            w1 w2 / ((1 + w1)(1 + w2))      p1 p2
{S(a)}            w2               w2 / ((1 + w1)(1 + w2))         (1 − p1) p2
{R(a)}            w1               w1 / ((1 + w1)(1 + w2))         p1 (1 − p2)
∅                 1                1 / ((1 + w1)(1 + w2))          (1 − p1)(1 − p2)

Partition function: $Z = 1 + w_1 + w_2 + w_1 w_2 = (1 + w_1)(1 + w_2)$

Markov View (MarkoView)
$V(\bar{x})[w_{expr}]$ :- $Q$
$V$ is the view name.
$Q$ is a union of conjunctive queries (UCQ).
$\bar{x}$ are variables.
$w_{expr}$ is an expression representing a non-negative weight.
MarkoViews are defined over a probabilistic database and introduce a correlation between all tuples in the lineage expression.

Example:
V1(id1, id2)[w = count(pid)/2] :- Advisor^p(id1, id2), Student^p(id1, year), Wrote(id1, pid), Wrote(id2, pid), Pub(pid, title, year)
The more papers id1 and id2 published together while id1 was a student, the more likely it is that id2 was id1's advisor.

MarkoView Database (MVDB)
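Let $R$ be a relational schema. An MVDB is a triple $(Tup, w, V)$, where:
$Tup$ is a set of possible tuples over the schema $R$;
$w : Tup \to [0, \infty]$ is a weight function;
$V$ is a set of MarkoViews.
Its semantics is given by the probabilistic database $D_L$ associated with the MLN $L = \{(F_t, w_t) \mid t \in Tup \cup Tup_V\}$, where $Tup_V$ is the set of all possible tuples in all views. For a base tuple $t \in Tup$ the feature $F_t$ is the tuple itself; for a view tuple $t = V_i(\bar{a})$ it is the grounded view query $Q_i(\bar{a})$.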

MarkoView Database - example
Consider the MVDB consisting of the features $(R(a), w_1)$, $(S(a), w_2)$, $(V(a), w_3)$, where V(x)[w3] :- R(x), S(x).

Possible world    Weight φ(I_i)    P(I_i)
{R(a), S(a)}      w3 w1 w2         w3 w1 w2 / Z
{S(a)}            w2               w2 / Z
{R(a)}            w1               w1 / Z
∅                 1                1 / Z

Partition function: $Z = 1 + w_1 + w_2 + w_3 w_1 w_2$
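To see how the view weight correlates R(a) and S(a) in this example, one can compare the joint probability with the product of the marginals. This is a small sketch under the weights of the table above; the function name `mvdb_marginals` is made up for illustration.

```python
def mvdb_marginals(w1, w2, w3):
    """Return (P(R(a)), P(S(a)), P(R(a) and S(a))) for the MVDB with
    features (R(a), w1), (S(a), w2) and view V(x)[w3] :- R(x), S(x)."""
    z = 1.0 + w1 + w2 + w3 * w1 * w2          # partition function Z
    p_r = (w1 + w3 * w1 * w2) / z             # worlds {R} and {R, S}
    p_s = (w2 + w3 * w1 * w2) / z             # worlds {S} and {R, S}
    p_rs = (w3 * w1 * w2) / z                 # world {R, S}
    return p_r, p_s, p_rs

for w3 in (0.25, 1.0, 4.0):
    p_r, p_s, p_rs = mvdb_marginals(1.0, 1.0, w3)
    print(w3, round(p_rs, 4), round(p_r * p_s, 4))
```

A short calculation gives P(R ∧ S) − P(R)P(S) = w1 w2 (w3 − 1) / Z², so the tuples are independent exactly when w3 = 1, positively correlated when w3 > 1, and negatively correlated when w3 < 1.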

Example
Consider the MVDB $D$ with $(R(a), w_1)$, $(S(a), w_2)$ and the MarkoView V(x)[w] :- R(x), S(x), where $w$ is a constant.
The four possible worlds have weights $1$, $w_1$, $w_2$, $w w_1 w_2$.
If $Q = R(a) \vee S(a)$, then $\varphi(Q) = w_1 + w_2 + w w_1 w_2$, and $P(Q) = (w_1 + w_2 + w w_1 w_2) / (1 + w_1 + w_2 + w w_1 w_2)$.

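Example cont.
The INDB associated to $D$ is $D_0$ over R, S, NV: $(R(a), w_1)$, $(S(a), w_2)$, $(NV(a), w_0)$, where $w_0 = (1 - w)/w$, i.e. $w = 1/(1 + w_0)$.
Define $W = R(a) \wedge S(a) \wedge NV(a)$. We impose the hard constraint $\neg W$, which means $R(a) \wedge S(a) \Rightarrow \neg NV(a)$; here $NV(a)$ stands for the negation of $V(a)$. Every world in which $V(a)$ does not hold is counted twice, with and without $NV(a)$, and so gains a factor $1 + w_0$; a world in which $V(a)$ holds is counted once. Relative to the others, such a world therefore gets a factor of $w = 1/(1 + w_0)$ in $\varphi(I)$.
Seven of the eight possible worlds of the INDB satisfy $\neg W$, and their weights are:

           {R(a), S(a)}    {S(a)}          {R(a)}          ∅
¬NV(a)     w1 w2           w2              w1              1
NV(a)      -               w0 w2           w0 w1           w0
Total      w1 w2           (1 + w0) w2     (1 + w0) w1     1 + w0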

Example cont.
For this INDB we write $\varphi_0$ for the weight function, $Z_0 = \varphi_0(\text{true})$ for the partition function, and $P_0$ for the probability function.
We want to compute $P(Q)$ for a query $Q$ over the MVDB (schema R, S) by translating it into a query over the INDB.

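Example cont.
For example, take $Q = R(a) \vee S(a)$:
$$\varphi_0(Q \wedge \neg W) = (1 + w_0) w_1 + (1 + w_0) w_2 + w_1 w_2 = (1 + w_0)(w_1 + w_2 + w w_1 w_2) = (1 + w_0) \cdot \varphi(Q)$$
where $w = 1/(1 + w_0)$. The same computation with $Q = \text{true}$ gives $\varphi_0(\neg W) = (1 + w_0) \cdot Z$. Therefore:
$$P(Q) = \frac{\varphi(Q)}{Z} = \frac{\varphi_0(Q \wedge \neg W)}{\varphi_0(\neg W)} = \frac{P_0(Q \wedge \neg W)}{P_0(\neg W)} = \frac{P_0(Q \vee W) - P_0(W)}{1 - P_0(W)}$$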

Translating MVDB to INDB
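Let $D = (Tup, w, V)$ be an MVDB and let $NV = \{NV_i \mid V_i \in V\}$.
The INDB associated to $D$ is the following database over the schema $R \cup NV$: $D_0 = (Tup_0, w_0)$, where
$Tup_0 = Tup \cup Tup_{NV}$
$Tup_{NV} = \{NV_i(\bar{a}) \mid V_i(\bar{a}) \in Tup_{V_i}\}$
$$w_0(t) = \begin{cases} w(t) & \text{if } t \in Tup \\ \dfrac{1 - w_V(t)}{w_V(t)} & \text{if } t \in Tup_{NV} \end{cases}$$
Here $w_V(t)$ denotes the weight of the view tuple $V_i(\bar{a})$ that corresponds to $t = NV_i(\bar{a})$.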

Translating MVDB to INDB cont.
Let $Q_i$ be the UCQ defining the view $V_i$. Then each $W_i$ is the Boolean query $W_i = \exists \bar{x}_i \, (NV_i(\bar{x}_i) \wedge Q_i(\bar{x}_i))$, and $W = \bigvee_i W_i$.
Then, for every Boolean query $Q$, the following holds:
$$P(Q) = \frac{P_0(Q \vee W) - P_0(W)}{1 - P_0(W)}$$
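The identity can be checked numerically on the running example with R(a), S(a) and the view V(x)[w] :- R(x), S(x). The sketch below is a brute-force check, not the paper's evaluation method; both function names are hypothetical.

```python
from itertools import product

def mvdb_prob_direct(w1, w2, w):
    """P(R(a) or S(a)) computed on the MVDB's four possible worlds."""
    phi_q = w1 + w2 + w * w1 * w2
    return phi_q / (1.0 + phi_q)

def mvdb_prob_via_indb(w1, w2, w):
    """Same probability via the INDB over R(a), S(a), NV(a)."""
    w0 = (1.0 - w) / w                      # weight of the tuple NV(a)
    weights = (w1, w2, w0)
    z0 = phi_w = phi_q_or_w = 0.0
    for r, s, nv in product((False, True), repeat=3):
        # Weight of an INDB world: product of weights of present tuples.
        phi0 = 1.0
        for present, wt in zip((r, s, nv), weights):
            if present:
                phi0 *= wt
        z0 += phi0
        big_w = r and s and nv              # W = R(a) and S(a) and NV(a)
        q = r or s                          # Q = R(a) or S(a)
        if big_w:
            phi_w += phi0
        if q or big_w:
            phi_q_or_w += phi0
    p0_w = phi_w / z0
    p0_q_or_w = phi_q_or_w / z0
    return (p0_q_or_w - p0_w) / (1.0 - p0_w)

print(mvdb_prob_direct(2.0, 3.0, 0.5), mvdb_prob_via_indb(2.0, 3.0, 0.5))
```

Both functions should agree for any positive weights, including w > 1, where w0 becomes negative.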

Constructing and compiling MV-index
An MV-index consists of a set of OBDDs augmented with certain pre-computations and indices.
An Ordered Binary Decision Diagram (OBDD) is a rooted DAG in which internal nodes are labeled with Boolean variables and have two outgoing edges, labeled 0 and 1; sink nodes (leaves) are labeled 0 or 1. The variables appear in the same order along every path. More details: R. E. Bryant. Symbolic manipulation of boolean functions using a graphical representation. In DAC, pages 688–694, 1985.
CUDD is a widely used package for OBDDs. More details: F. Somenzi. CUDD: CU Decision Diagram Package Release
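To make the OBDD definition concrete, here is a minimal sketch of an OBDD node and of the standard bottom-up probability computation for independent variables. It does not use the CUDD API and is not the MV-index algorithm; the `Node` class and `obdd_probability` are illustrative names only.

```python
from dataclasses import dataclass
from typing import Union

@dataclass(frozen=True)
class Node:
    var: str                      # Boolean variable labeling this node
    low: "Union[Node, bool]"      # child along the 0-edge
    high: "Union[Node, bool]"     # child along the 1-edge

def obdd_probability(node, p, cache=None):
    """Probability that the function represented by the OBDD is true
    when each variable x is independently true with probability p[x]."""
    if cache is None:
        cache = {}
    if isinstance(node, bool):            # sink node labeled 0 or 1
        return 1.0 if node else 0.0
    key = id(node)                        # shared sub-DAGs are visited once
    if key not in cache:
        cache[key] = ((1.0 - p[node.var]) * obdd_probability(node.low, p, cache)
                      + p[node.var] * obdd_probability(node.high, p, cache))
    return cache[key]

# OBDD for R(a) or S(a) with variable order R(a) < S(a).
s_node = Node("S(a)", False, True)
root = Node("R(a)", s_node, True)

prob = obdd_probability(root, {"R(a)": 0.5, "S(a)": 0.5})
```

The recursion is linear in the size of the diagram because every node is evaluated once, which is why compiling a query's lineage into an OBDD pays off when the same diagram is queried repeatedly.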

Experimental Evaluation
For the experimental evaluation, an MV-index for the MVDB was constructed on top of an extended CUDD package. The new approach was compared with Alchemy, the de facto standard inference engine for MLNs. Its construction time was also compared with that of native CUDD.

Reminder of our old MVDB:
Author(aid, name)
FirstPub(aid, year)
Wrote(aid, pid)
DBLPAffiliation(aid, inst)
Pub(pid, title, year)
HomePage(aid, url)
Student^p(aid, year) [p1]
Advisor^p(aid1, aid2) [p2]
Affiliation^p(aid, inst) [p3]
V1(aid1, aid2)[count(pid)/2] :- Advisor^p(aid1, aid2), Student^p(aid1, year), Wrote(aid1, pid), Wrote(aid2, pid), Pub(pid, title, year)

Experimental Evaluation
Two main questions were asked:
1. How do MarkoViews and the MV-index compare to other approaches for probabilistic inference on large Markov networks?
2. How effective is the MV-index construction algorithm compared to the standard approach for constructing OBDDs?

Alchemy vs MV for querying advisor of a student

Alchemy vs MV for querying all students of an advisor

Cudd vs MV : OBDD construction time

Summary
We made two contributions that allow queries to be processed very efficiently on MVDBs:
1. The first and main contribution is a translation from MarkoViews into tuple-independent databases.
2. The second is the compilation of MarkoViews into OBDDs, which dramatically speeds up query execution.

Questions?

Some of the probabilities in $D_0$ may be negative: if $w > 1$, then $w_0 = (1 - w)/w < 0$, and the probability $p_0 = w_0 / (1 + w_0) = 1 - w$ is negative.
Negative probabilities have been considered before. It has been proven that probability theory can be consistently extended to allow for negative probabilities, and there is interest in applying them to quantum mechanics and financial modeling.
Every query answer $P(Q)$ will be a correct probability, in $[0, 1]$, even if the probabilities $P_0$ on the right-hand side of the translation formula are negative.
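A small worked case may help here. The sketch below uses the running example Q = R(a) ∨ S(a) with the view V(x)[w] :- R(x), S(x) and evaluates the translation formula in closed form over the tuple-independent database; the function name and the chosen weights are illustrative assumptions.

```python
def translated_query_probability(w1, w2, w):
    """Return (p0, P(Q)) for Q = R(a) or S(a), where p0 is the
    (possibly negative) probability of the tuple NV(a) in the INDB."""
    p1 = w1 / (1.0 + w1)
    p2 = w2 / (1.0 + w2)
    w0 = (1.0 - w) / w
    p0 = w0 / (1.0 + w0)                  # equals 1 - w; negative when w > 1
    # W = R(a) and S(a) and NV(a); the three tuples are independent.
    p0_w = p1 * p2 * p0
    # W implies Q, so Q or W is equivalent to Q.
    p0_q_or_w = 1.0 - (1.0 - p1) * (1.0 - p2)
    return p0, (p0_q_or_w - p0_w) / (1.0 - p0_w)

p0, p_q = translated_query_probability(1.0, 1.0, 2.0)
print(p0, p_q)
```

With w1 = w2 = 1 and w = 2 the tuple NV(a) gets probability −1 and P0(W) = −0.25, yet the final answer is (0.75 + 0.25) / 1.25 = 0.8, which matches the direct value (w1 + w2 + w w1 w2) / (1 + w1 + w2 + w w1 w2) = 4/5.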

