
Bayesian Networks. Some slides have been edited from Nir Friedman’s lectures, which are available at www.cs.huji.ac.il/~nir. Changes made by Dan Geiger.


1 Bayesian Networks. Background Readings: An Introduction to Bayesian Networks, Finn Jensen, UCL Press, 1997.

2 The “Visit-to-Asia” Example. Candidate variables: Visit to Asia, Smoking, Lung Cancer, Tuberculosis, Abnormality in Chest, Bronchitis, X-Ray, Dyspnea. What are the relevant variables and their dependencies?

3 Verifying the (In)dependencies. We can now ask the expert: do the following assertions hold?
- I(S; V)
- I(T; S | V)
- I(L; {T, V} | S)
- …
- I(X; {V, S, T, L, B, D} | A)
[Figure: the network over V, S, L, T, A, B, X, D]
Alternative verification: does each variable become independent of the rest, given its Markov boundary? Take-home question: are other variable construction orders as good?

4 Quantifying the Bayesian Network. Bayesian network = Directed Acyclic Graph (DAG), annotated with conditional probability distributions. [Figure: the network over V, S, L, T, A, B, X, D, annotated with the local distributions p(v), p(s), p(t|v), p(l|s), p(b|s), p(a|t,l), p(x|a), p(d|a,b)]
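Written out, the local distributions annotating the graph multiply to give the joint distribution. This is the standard chain-rule factorization of a Bayesian network, applied to the eight variables of the example:

```latex
p(v,s,t,l,a,b,x,d) = p(v)\,p(s)\,p(t \mid v)\,p(l \mid s)\,p(b \mid s)\,p(a \mid t,l)\,p(x \mid a)\,p(d \mid a,b)
```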

5 Local distributions. Conditional Probability Table p(A|T,L), where L, T, and A each take the values Yes/No:
p(A=y | L=n, T=n) = 0.02
p(A=y | L=n, T=y) = 0.60
p(A=y | L=y, T=n) = 0.99
p(A=y | L=y, T=y) = 0.99
Asymmetric independence in the CPT: given L=y, the value of T does not change p(A=y), whereas given L=n it does.
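The asymmetric independence can be read off the table mechanically. The sketch below stores the four numbers from the slide and checks, for each value of L, whether p(A=y) still varies with T. The function name `t_matters_given` is made up for this illustration.

```python
# CPT from the slide: p(A=y | L, T), keyed by (L, T).
p_a_yes = {
    ("n", "n"): 0.02,
    ("n", "y"): 0.60,
    ("y", "n"): 0.99,
    ("y", "y"): 0.99,
}

def t_matters_given(l_value):
    """Return True if p(A=y | L=l_value, T) changes with the value of T."""
    return p_a_yes[(l_value, "n")] != p_a_yes[(l_value, "y")]

print(t_matters_given("y"))  # False: once L=y, T is irrelevant to A
print(t_matters_given("n"))  # True: when L=n, T still matters
```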

6 Queries. There are several types of queries, and most involve evidence. Evidence e is an assignment of values to a set E of variables in the domain. Example, a posteriori belief: P(D=yes | V=yes). In general: P(H=h | E=e), where H and E are subsets of variables. This is equivalent to computing P(H=h, E=e) and then dividing by P(E=e).
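A minimal sketch of answering such a query by brute-force enumeration follows. It computes P(H=h, E=e) and P(E=e) by summing the joint, then divides. Only p(A | T, L) is taken from the slides; every other probability is an invented placeholder, and the helper names (`bern`, `joint`, `prob`, `posterior`) are hypothetical.

```python
from itertools import product

NAMES = ("V", "S", "T", "L", "A", "B", "X", "D")

# p(A=yes | T, L) is taken from the slides; every other number is an
# invented placeholder, not a value from the lecture.
P_V = 0.01
P_S = 0.50
P_T = {True: 0.05, False: 0.01}   # p(T=yes | V)
P_L = {True: 0.10, False: 0.01}   # p(L=yes | S)
P_B = {True: 0.60, False: 0.30}   # p(B=yes | S)
P_A = {(False, False): 0.02, (True, False): 0.60,
       (False, True): 0.99, (True, True): 0.99}    # keyed by (T, L)
P_X = {True: 0.98, False: 0.05}   # p(X=yes | A)
P_D = {(True, True): 0.90, (True, False): 0.70,
       (False, True): 0.80, (False, False): 0.10}  # keyed by (A, B)

def bern(p_yes, value):
    """Probability of a yes/no value, given the probability of yes."""
    return p_yes if value else 1.0 - p_yes

def joint(w):
    """Product of the local distributions for one full assignment w."""
    return (bern(P_V, w["V"]) * bern(P_S, w["S"])
            * bern(P_T[w["V"]], w["T"]) * bern(P_L[w["S"]], w["L"])
            * bern(P_B[w["S"]], w["B"]) * bern(P_A[(w["T"], w["L"])], w["A"])
            * bern(P_X[w["A"]], w["X"]) * bern(P_D[(w["A"], w["B"])], w["D"]))

def prob(event):
    """Sum the joint over all full assignments consistent with event."""
    total = 0.0
    for values in product([True, False], repeat=len(NAMES)):
        w = dict(zip(NAMES, values))
        if all(w[name] == val for name, val in event.items()):
            total += joint(w)
    return total

def posterior(hypothesis, evidence):
    """P(H=h | E=e), computed as P(H=h, E=e) / P(E=e)."""
    return prob({**hypothesis, **evidence}) / prob(evidence)

print(posterior({"D": True}, {"V": True}))
```

Enumeration visits all 2^8 assignments, which is fine for this example but grows exponentially with the number of variables.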

7 A posteriori belief. This query is useful in many cases:
- Prediction: what is the probability of an outcome given the starting condition?
- Diagnosis: what is the probability of a disease/fault given the symptoms?
[Figure: the network over V, S, L, T, A, B, X, D]

8 Example: Predictive + Diagnostic. P(T=Yes | Visit_to_Asia=Yes, Dyspnea=Yes). [Figure: the network over V, S, L, T, A, B, X, D] Probabilistic inference can combine evidence from all parts of the network, diagnostic and predictive, regardless of the directions of edges in the model.

9 Queries: MAP
- Find the maximum a posteriori assignment for some variables of interest (say H_1,…,H_l).
- That is, find h_1,…,h_l that maximize the conditional probability P(h_1,…,h_l | e).
- This is equivalent to maximizing the joint P(h_1,…,h_l, e).
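The equivalence between maximizing the conditional and maximizing the joint holds because P(e) is a constant that does not depend on h. A small sketch on an invented joint table over two hypothesis variables and one evidence variable (all numbers and names are hypothetical):

```python
from itertools import product

# Hypothetical joint P(H1, H2, E) over binary variables, keyed by (h1, h2, e).
JOINT = {
    (0, 0, 0): 0.20, (0, 0, 1): 0.05,
    (0, 1, 0): 0.10, (0, 1, 1): 0.15,
    (1, 0, 0): 0.05, (1, 0, 1): 0.10,
    (1, 1, 0): 0.05, (1, 1, 1): 0.30,
}
HYPOTHESES = list(product([0, 1], repeat=2))

def map_via_joint(e):
    """Return the (h1, h2) that maximizes the joint P(h1, h2, e)."""
    return max(HYPOTHESES, key=lambda h: JOINT[h + (e,)])

def map_via_conditional(e):
    """Return the (h1, h2) that maximizes P(h1, h2 | e)."""
    p_e = sum(JOINT[h + (e,)] for h in HYPOTHESES)
    return max(HYPOTHESES, key=lambda h: JOINT[h + (e,)] / p_e)

print(map_via_joint(1), map_via_conditional(1))  # (1, 1) (1, 1)
```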

10 Queries: MAP. We can use MAP for explanation:
- What is the most likely joint event, given the evidence (e.g., a set of likely diseases given the symptoms)?
- What is the most likely scenario, given the evidence (e.g., a series of likely malfunctions that trigger a fault)?
[Figure: a network of diseases D1 to D4 and symptoms S1 to S4, and a fault network with nodes Dead battery, Not charging, Bad battery, Bad magneto, Bad alternator]

11 How Expressive are Bayesian Networks?
1. Check the diamond example via all boundary bases.
2. The following property holds for d-separation but does not hold for conditional independence: I_D(X, {}, Y) and I_D(X, γ, Y) ⟹ I_D(X, {}, γ) or I_D(γ, {}, Y)

12 Markov Networks that represent probability distributions (rather than just independence)
1. Define for each (maximal) clique C_i a non-negative function g(C_i) called the compatibility function.
2. Take the product ∏_i g(C_i) over all cliques.
3. Define P(X_1,…,X_n) = K · ∏_i g(C_i), where K is a normalizing factor (the inverse of the sum of the product over all assignments).
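The three steps can be sketched on a tiny example. The graph (a chain A - B - C with maximal cliques {A, B} and {B, C}) and the compatibility values are invented for illustration; only the recipe itself comes from the slide.

```python
from itertools import product

# Step 1: a non-negative compatibility function per maximal clique.
def g_ab(a, b):
    return 3.0 if a == b else 1.0

def g_bc(b, c):
    return 2.0 if b == c else 1.0

# Step 2: the product of the compatibility functions over all cliques.
def unnormalized(a, b, c):
    return g_ab(a, b) * g_bc(b, c)

# Step 3: K is the inverse of the sum of the product over all assignments.
STATES = list(product([0, 1], repeat=3))
K = 1.0 / sum(unnormalized(*x) for x in STATES)

def p(a, b, c):
    return K * unnormalized(a, b, c)

print(K, p(0, 0, 0))
```

Note that the individual values of g are not probabilities; only the normalized product is.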

13 The two males and females example

14 Theorem 6 [Hammersley and Clifford 1971]: If a probability function P is formed by a normalized product of non-negative functions on the cliques of G, then G is an I-map of P.
Proof: It suffices to show (Theorem 5) that the neighborhood basis of G holds in P. Namely, show that I(α, B_G(α), U − α − B_G(α)) holds in P, or just that
P(α, B_G(α), U − α − B_G(α)) = f_1(α, B_G(α)) · f_2(U − α). (*)
Let J_α stand for the set of indices marking all cliques in G that include α. Then
P(α, B_G(α), U − α − B_G(α)) = K · ∏_{j ∈ J_α} g(C_j) · ∏_{j ∉ J_α} g(C_j) = f_1(α, B_G(α)) · f_2(U − α).
The first product contains only α and variables adjacent to α, because C_j is a clique. The second product does not contain α. Hence (*) holds.

15 Theorem X: Every undirected graph G has a distribution P such that G is a perfect map of P. (In light of the previous notes, P must have the form of a product over cliques.) Note: The theorem and converse also hold for extreme probabilities, but the presented proof does not apply because of the use of Intersection in Theorem 5.

16 Drawback: Interpreting the links is not simple. Another drawback is the difficulty with extreme probabilities. There is no local test for I-mapness. Both drawbacks disappear in the class of decomposable models, which are a special case of Bayesian networks.

17 Decomposable Models. Example: Markov chains and Markov trees. Assume the following chain is an I-map of some P(x_1, x_2, x_3, x_4) and was constructed using the methods we just described. [Figure: the chain over x_1, x_2, x_3, x_4] The “compatibility functions” on all links can be easily interpreted in the case of chains. The same holds for trees. This idea actually works for all chordal graphs.

18 Chordal Graphs

19 Interpretation of the links. [Figure: a graph with Clique 1, Clique 2, Clique 3] A probability distribution that can be written as a product of low-order marginals divided by a product of low-order marginals is said to be decomposable.
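For a four-variable Markov chain, this definition takes the form P(x_1,…,x_4) = P(x_1,x_2) P(x_2,x_3) P(x_3,x_4) / (P(x_2) P(x_3)): clique marginals over separator marginals. The sketch below checks the identity numerically. The initial distribution and transition matrix are invented, and the function names are hypothetical.

```python
from itertools import product

INIT = [0.3, 0.7]                   # hypothetical P(x1)
TRANS = [[0.9, 0.1], [0.4, 0.6]]    # hypothetical P(x_{i+1} | x_i)

def joint(x):
    """Joint probability of a Markov chain x = (x1, x2, x3, x4)."""
    p = INIT[x[0]]
    for prev, nxt in zip(x, x[1:]):
        p *= TRANS[prev][nxt]
    return p

STATES = list(product([0, 1], repeat=4))

def marginal(indices, values):
    """P(x[indices] = values), obtained by summing the joint."""
    return sum(joint(x) for x in STATES
               if all(x[i] == v for i, v in zip(indices, values)))

def decomposed(x):
    """Product of clique marginals divided by product of separator marginals."""
    numerator = (marginal((0, 1), x[0:2]) * marginal((1, 2), x[1:3])
                 * marginal((2, 3), x[2:4]))
    denominator = marginal((1,), x[1:2]) * marginal((2,), x[2:3])
    return numerator / denominator

max_gap = max(abs(joint(x) - decomposed(x)) for x in STATES)
print(max_gap < 1e-12)
```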

20 Importance of Decomposability. When assigning compatibility functions, it suffices to use marginal probabilities on cliques and make sure that they are locally consistent. Marginals can be assessed from experts or estimated directly from data.

21 The Diamond Example: the smallest non-chordal graph. Adding one more link makes the graph chordal. Turning a general undirected graph into a chordal graph in some optimal way is the key to all exact computations done on Markov and Bayesian networks.
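One way to test chordality, sketched below, is to repeatedly remove a simplicial node (a node whose remaining neighbors are all pairwise adjacent); a graph is chordal exactly when every node can be removed this way. The node labels A, B, C, D for the diamond and the choice of chord B - C are assumptions, since the slide's figure is not part of the transcript.

```python
def is_chordal(edges, nodes):
    """Return True if all nodes can be removed in a simplicial order."""
    adj = {n: set() for n in nodes}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    remaining = set(nodes)
    while remaining:
        # A node is simplicial if its current neighbors form a clique.
        simplicial = next(
            (n for n in sorted(remaining)
             if all(b in adj[a] for a in adj[n] for b in adj[n] if a != b)),
            None)
        if simplicial is None:
            return False  # no legal node to remove, so a chordless cycle exists
        for neighbor in adj[simplicial]:
            adj[neighbor].discard(simplicial)
        del adj[simplicial]
        remaining.discard(simplicial)
    return True

DIAMOND = [("A", "B"), ("A", "C"), ("B", "D"), ("C", "D")]
print(is_chordal(DIAMOND, "ABCD"))                 # False
print(is_chordal(DIAMOND + [("B", "C")], "ABCD"))  # True
```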

23 Extra slides with more details, if time allows

24 Complexity of Inference. Theorem: Computing P(X = x) in a Bayesian network is NP-hard. Main idea: conditional probability tables with zeros and ones are equivalent to logical gates. Hence a reduction from 3-SAT is the easiest to pursue.

25 Proof. We reduce 3-SAT to Bayesian network computation. Assume we are given a 3-SAT problem:
- Let Q_1,…,Q_n be propositions.
- Let φ_1,…,φ_k be clauses, such that φ_i = l_i1 ∨ l_i2 ∨ l_i3, where each l_ij is a literal over Q_1,…,Q_n (e.g., Q_1 = true).
- φ = φ_1 ∧ … ∧ φ_k
We will construct a Bayesian network s.t. P(X=t) > 0 iff φ is satisfiable.

26
- P(Q_i = true) = 0.5
- P(φ_I = true | Q_i, Q_j, Q_l) = 1 iff Q_i, Q_j, Q_l satisfy the clause φ_I
- A_1, A_2, … are simple binary AND gates
[Figure: the constructed network, with proposition nodes Q_1,…,Q_n, clause nodes φ_1,…,φ_k, AND gates A_1,…,A_{k-2}, and output node X]

27 It is easy to check:
- Polynomial number of variables.
- Each conditional probability table can be described by a small table (8 parameters at most).
- P(X = true) > 0 if and only if there exists a satisfying assignment to Q_1,…,Q_n.
Conclusion: polynomial reduction of 3-SAT.
[Figure: the constructed network, with proposition nodes Q_1,…,Q_n, clause nodes φ_1,…,φ_k, AND gates A_1,…,A_{k-2}, and output node X]

28 Inference is even #P-hard
- P(X = t) is the fraction of satisfying assignments to φ.
- Hence 2^n P(X = t) is the number of satisfying assignments to φ.
- Thus, if we know how to compute P(X = t), we know how to count the number of satisfying assignments to φ.
- Consequently, computing P(X = t) is #P-hard.
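The counting argument can be checked on a small instance. The sketch below mimics the construction: uniform priors on the Q_i, deterministic clause nodes, and a chain of AND gates ending in X. It then recovers the number of satisfying assignments as 2^n P(X = t). The 3-SAT instance and all function names are invented for illustration, and P(X = t) is computed by plain enumeration rather than by any efficient inference procedure.

```python
from itertools import product

# Hypothetical 3-SAT instance over Q_1..Q_4 (indices 0..3).
# A literal is (variable index, truth value that satisfies it).
CLAUSES = [
    [(0, True), (1, True), (2, False)],    # Q1 or Q2 or not Q3
    [(0, False), (2, True), (3, True)],    # not Q1 or Q3 or Q4
    [(1, False), (2, False), (3, False)],  # not Q2 or not Q3 or not Q4
]
N = 4

def clause_node(clause, q):
    """Deterministic CPT: the clause node is true iff some literal holds."""
    return any(q[i] == wanted for i, wanted in clause)

def x_given_q(q):
    """Chain of binary AND gates over the clause nodes, ending in X."""
    acc = clause_node(CLAUSES[0], q)
    for clause in CLAUSES[1:]:
        acc = acc and clause_node(clause, q)
    return acc

def prob_x_true():
    """P(X = t) when each Q_i is independently true with probability 0.5."""
    return sum(0.5 ** N
               for q in product([True, False], repeat=N) if x_given_q(q))

count = round(2 ** N * prob_x_true())
print(prob_x_true(), count)  # 0.625 10
```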

29 Hardness - Notes
- We need not use deterministic relations in our construction.
- The construction shows that hardness follows even for graphs of small degree.
- Hardness does not mean we cannot do inference:
  - It implies that we cannot find a general procedure that works efficiently for all networks.
  - For particular families of networks, we can have provably efficient procedures (e.g., trees, HMMs).
  - Variable elimination algorithms.

30 Proof of Theorem X. Given a graph G, it is sufficient to show that for an independence statement σ = I(α, Z, β) that does NOT hold in G, there exists a probability distribution that satisfies all independence statements that hold in the graph and does not satisfy σ = I(α, Z, β). Simply pick a path in G between α and β that does not contain a node from Z. Define a probability distribution that is a perfect map of the chain and multiply it by any marginal probabilities on all other nodes, forming P_σ. Now “multiply” all P_σ (Armstrong relation) to obtain P. Interesting task (replacing HMW #4): Given an undirected graph over binary variables, construct a perfect map probability distribution. (Note: most Markov random fields are perfect maps!)

31 Interesting conclusion of Theorem X: All independence statements that follow, for strictly positive probability, from the neighborhood basis are derivable via symmetry, decomposition, intersection, and weak union. These axioms are (sound and) complete for neighborhood bases. These axioms are (sound and) complete also for pairwise bases. In fact, for saturated statements, conditional independence and separation have the same characterization. See paper P2.

32 Chordal Graphs

33 Example of the Theorem
1. Each cycle has a chord.
2. There is a way to direct edges legally, namely, A → B, A → C, B → C, B → D, C → D, C → E.
3. Legal removal order (e.g.): start with E, then D, then the rest.
4. The maximal cliques form a join (clique) tree.

