
1 Oct 15th, 2001 Copyright © 2001, Andrew W. Moore Bayes Networks Based on notes by professors Andrew W. Moore, Nir Friedman, and Daphne Koller

2 Bayes Nets: Slide 2 Copyright © 2001, Andrew W. Moore Finding Joint distributions Suppose there are two events: M: Maria teaches the class S: It is sunny The joint p.d.f. for these events contains four entries. Knowing only P(M) and P(S) is not enough to derive the joint distribution. We need to make an extra assumption.

3 Bayes Nets: Slide 3 Copyright © 2001, Andrew W. Moore Independence “The sunshine levels do not depend on and do not influence who is teaching.” This can be specified very simply: P(S | M) = P(S) This is a powerful statement! It requires extra domain knowledge: a different kind of knowledge than numerical probabilities, and an understanding of causation.

4 Bayes Nets: Slide 4 Copyright © 2001, Andrew W. Moore Independence From P(S | M) = P(S), the rules of probability imply: (can you prove these?) P(~S | M) = P(~S), P(M | S) = P(M), P(M ^ S) = P(M) P(S), P(~M ^ S) = P(~M) P(S), P(M ^ ~S) = P(M) P(~S), P(~M ^ ~S) = P(~M) P(~S)

5 Bayes Nets: Slide 5 Copyright © 2001, Andrew W. Moore Independence From P(S | M) = P(S), the rules of probability imply: (can you prove these?) P(~S | M) = P(~S), P(M | S) = P(M), P(M ^ S) = P(M) P(S), P(~M ^ S) = P(~M) P(S), P(M ^ ~S) = P(M) P(~S), P(~M ^ ~S) = P(~M) P(~S) And in general: P(M=u ^ S=v) = P(M=u) P(S=v) for each of the four combinations of u=True/False, v=True/False

6 Bayes Nets: Slide 6 Copyright © 2001, Andrew W. Moore Independence Suppose that: P(M) = 0.6, P(S) = 0.3, P(S | M) = P(S). From these statements, we can derive the full joint pdf:
M S Prob
T T 0.18
T F 0.42
F T 0.12
F F 0.28
And since we now have the joint pdf, we can make any queries we like.
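A quick numeric check of the table above (a small sketch added here, not part of the original slides): under the independence assumption every joint entry is just a product of the two marginals.

# Sketch: derive the 4-entry joint of M and S from the two marginals on the slide.
p_m, p_s = 0.6, 0.3
joint = {(m, s): (p_m if m else 1 - p_m) * (p_s if s else 1 - p_s)
         for m in (True, False) for s in (True, False)}
print(joint)  # approximately {(T,T): 0.18, (T,F): 0.42, (F,T): 0.12, (F,F): 0.28}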

7 Bayes Nets: Slide 7 Copyright © 2001, Andrew W. Moore A more interesting case M : Maria teaches the class S : It is sunny L : The lecturer arrives slightly late. Assume both lecturers (Jose and Maria) are sometimes delayed by bad weather. Jose is more likely to arrive late than Maria.

8 Bayes Nets: Slide 8 Copyright © 2001, Andrew W. Moore A more interesting case M : Maria teaches the class S : It is sunny L : The lecturer arrives slightly late. Assume both lecturers (Jose and Maria) are sometimes delayed by bad weather. Jose is more likely to arrive late than Maria. Let’s begin with writing down knowledge we’re happy about: P(S | M) = P(S), P(S) = 0.3, P(M) = 0.6 Lateness is not independent of the weather and is not independent of the lecturer.

9 Bayes Nets: Slide 9 Copyright © 2001, Andrew W. Moore A more interesting case M : Maria teaches the class S : It is sunny L : The lecturer arrives slightly late. Assume both lecturers (Jose and Maria) are sometimes delayed by bad weather. Jose is more likely to arrive late than Maria. Let’s begin with writing down knowledge we’re happy about: P(S | M) = P(S), P(S) = 0.3, P(M) = 0.6 Lateness is not independent of the weather and is not independent of the lecturer. We already know the joint of S and M, so to know the joint pdf of L, S and M, all we need now is P(L | S=u, M=v) in the 4 cases of u/v = True/False.

10 Bayes Nets: Slide 10 Copyright © 2001, Andrew W. Moore A more interesting case M : Maria teaches the class S : It is sunny L : The lecturer arrives slightly late. Assume both lecturers are sometimes delayed by bad weather. Jose is more likely to arrive late than Maria. P(S | M) = P(S) P(S) = 0.3 P(M) = 0.6 P(L | M ^ S) = 0.05 P(L | M ^ ~S) = 0.1 P(L | ~M ^ S) = 0.1 P(L | ~M ^ ~S) = 0.2 Now we can derive a full joint p.d.f. with a “mere” six numbers instead of seven* *Savings are larger for larger numbers of variables.

11 Bayes Nets: Slide 11 Copyright © 2001, Andrew W. Moore A more interesting case M : Maria teaches the class S : It is sunny L : The lecturer arrives slightly late. Assume both lecturers are sometimes delayed by bad weather. Jose is more likely to arrive late than Maria. P(S | M) = P(S) P(S) = 0.3 P(M) = 0.6 P(L | M ^ S) = 0.05 P(L | M ^ ~S) = 0.1 P(L | ~M ^ S) = 0.1 P(L | ~M ^ ~S) = 0.2 Question: Express P(L=x ^ M=y ^ S=z) in terms that only need the above expressions, where x, y and z may each be True or False.
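One way to answer (a sketch; the factorization follows from the chain rule together with the independence of S and M): P(L=x ^ M=y ^ S=z) = P(L=x | M=y ^ S=z) * P(M=y) * P(S=z). The short script below, my own illustration, tabulates all eight entries from the six numbers on the slide.

# Sketch: build the full joint of (L, M, S) from the six numbers above.
p_s, p_m = 0.3, 0.6
p_l_given = {(True, True): 0.05, (True, False): 0.1,    # P(L | M ^ S), P(L | M ^ ~S)
             (False, True): 0.1, (False, False): 0.2}   # P(L | ~M ^ S), P(L | ~M ^ ~S)

def joint(l, m, s):
    pl = p_l_given[(m, s)]
    return (pl if l else 1 - pl) * (p_m if m else 1 - p_m) * (p_s if s else 1 - p_s)

total = sum(joint(l, m, s) for l in (True, False) for m in (True, False) for s in (True, False))
print(total)  # should be 1.0 up to floating point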

12 Bayes Nets: Slide 12 Copyright © 2001, Andrew W. Moore A bit of notation P(S | M) = P(S) P(S) = 0.3 P(M) = 0.6 P(L | M ^ S) = 0.05 P(L | M ^ ~S) = 0.1 P(L | ~M ^ S) = 0.1 P(L | ~M ^ ~S) = 0.2 [Diagram: nodes S and M, each with an arrow into L; each node is annotated with its table: P(S)=0.3, P(M)=0.6, P(L | M^S)=0.05, P(L | M^~S)=0.1, P(L | ~M^S)=0.1, P(L | ~M^~S)=0.2]

13 Bayes Nets: Slide 13 Copyright © 2001, Andrew W. Moore A bit of notation P(S | M) = P(S) P(S) = 0.3 P(M) = 0.6 P(L | M ^ S) = 0.05 P(L | M ^ ~S) = 0.1 P(L | ~M ^ S) = 0.1 P(L | ~M ^ ~S) = 0.2 [Diagram: S and M with arrows into L, annotated with the tables above] Read the absence of an arrow between S and M to mean “it would not help me predict M if I knew the value of S”. Read the two arrows into L to mean that if I want to know the value of L it may help me to know M and to know S. This kind of stuff will be thoroughly formalized later.

14 Bayes Nets: Slide 14 Copyright © 2001, Andrew W. Moore More examples Suppose we have these three events: M : Lecture taught by Maria L : Lecturer arrives late R : Lecture concerns robots Suppose: Jose has a higher chance of being late than Maria. Jose has a higher chance of giving robotics lectures. What kind of independence can we find? How about: P(L | M) = P(L)? No. P(R | M) = P(R)? No. P(L | R) = P(L)?

15 Bayes Nets: Slide 15 Copyright © 2001, Andrew W. Moore Conditional independence Once you know who the lecturer is, then whether they arrive late doesn’t affect whether the lecture concerns robots. P(R | M,L) = P(R | M) and P(R | ~M,L) = P(R | ~M) We express this in the following way: “R and L are conditionally independent given M”. [Diagram: M with arrows into L and into R] Given knowledge of M, knowing anything else in the diagram won’t help us with L, etc., which is also what the diagram notates.

16 Bayes Nets: Slide 16 Copyright © 2001, Andrew W. Moore Conditional Independence formalized R and L are conditionally independent given M if for all x,y,z in {T,F}: P(R=x | M=y ^ L=z) = P(R=x | M=y) More generally: Let S1, S2 and S3 be sets of variables. Set-of-variables S1 and set-of-variables S2 are conditionally independent given S3 if, for all assignments of values to the variables in the sets, P(S1’s assignments | S2’s assignments ^ S3’s assignments) = P(S1’s assignments | S3’s assignments)

17 Bayes Nets: Slide 17 Copyright © 2001, Andrew W. Moore Example: R and L are conditionally independent given M if for all x,y,z in {T,F}: P(R=x | M=y ^ L=z) = P(R=x | M=y) More generally: Let S1, S2 and S3 be sets of variables. S1 and S2 are conditionally independent given S3 if, for all assignments of values to the variables in the sets, P(S1’s assignments | S2’s assignments ^ S3’s assignments) = P(S1’s assignments | S3’s assignments) “Shoe-size is conditionally independent of Glove-size given height, weight and age” means: for all s,g,h,w,a P(ShoeSize=s | Height=h, Weight=w, Age=a) = P(ShoeSize=s | Height=h, Weight=w, Age=a, GloveSize=g)

18 Bayes Nets: Slide 18 Copyright © 2001, Andrew W. Moore Example: R and L are conditionally independent given M if for all x,y,z in {T,F}: P(R=x | M=y ^ L=z) = P(R=x | M=y) More generally: Let S1, S2 and S3 be sets of variables. S1 and S2 are conditionally independent given S3 if, for all assignments of values to the variables in the sets, P(S1’s assignments | S2’s assignments ^ S3’s assignments) = P(S1’s assignments | S3’s assignments) “Shoe-size is conditionally independent of Glove-size given height, weight and age” does not mean: for all s,g,h P(ShoeSize=s | Height=h) = P(ShoeSize=s | Height=h, GloveSize=g)

19 Bayes Nets: Slide 19 Copyright © 2001, Andrew W. Moore Conditional independence [Diagram: M with arrows into L and into R] We can write down P(M). And then, since we know L is only directly influenced by M, we can write down the values of P(L | M) and P(L | ~M) and know we’ve fully specified L’s behavior. Ditto for R. P(M) = 0.6 P(L | M) = 0.085 P(L | ~M) = 0.17 P(R | M) = 0.3 P(R | ~M) = 0.6 ‘R and L conditionally independent given M’

20 Bayes Nets: Slide 20 Copyright © 2001, Andrew W. Moore Conditional independence [Diagram: M with arrows into L and into R] P(M) = 0.6 P(L | M) = 0.085 P(L | ~M) = 0.17 P(R | M) = 0.3 P(R | ~M) = 0.6 Conditional Independence: P(R | M,L) = P(R | M), P(R | ~M,L) = P(R | ~M) Again, we can obtain any member of the joint prob dist that we desire: P(L=x ^ R=y ^ M=z) = P(L=x | M=z) P(R=y | M=z) P(M=z)

21 Bayes Nets: Slide 21 Copyright © 2001, Andrew W. Moore Assume five variables T: The lecture started by 10:35 L: The lecturer arrives late R: The lecture concerns robots M: The lecturer is Maria S: It is sunny T only directly influenced by L (i.e. T is conditionally independent of R,M,S given L) L only directly influenced by M and S (i.e. L is conditionally independent of R given M & S) R only directly influenced by M (i.e. R is conditionally independent of L,S, given M) M and S are independent

22 Bayes Nets: Slide 22 Copyright © 2001, Andrew W. Moore Making a Bayes net S M R L T Step One: add variables. Just choose the variables you’d like to be included in the net. T: The lecture started by 10:35 L: The lecturer arrives late R: The lecture concerns robots M: The lecturer is Maria S: It is sunny

23 Bayes Nets: Slide 23 Copyright © 2001, Andrew W. Moore Making a Bayes net S M R L T Step Two: add links. The link structure must be acyclic. If node X has parents Q1, Q2, ..., Qn then any variable that’s a non-descendant of X is conditionally independent of X given {Q1, Q2, ..., Qn} T: The lecture started by 10:35 L: The lecturer arrives late R: The lecture concerns robots M: The lecturer is Maria S: It is sunny

24 Bayes Nets: Slide 24 Copyright © 2001, Andrew W. Moore Making a Bayes net S M R L T P(S)=0.3 P(M)=0.6 P(R | M)=0.3 P(R | ~M)=0.6 P(T | L)=0.3 P(T | ~L)=0.8 P(L | M^S)=0.05 P(L | M^~S)=0.1 P(L | ~M^S)=0.1 P(L | ~M^~S)=0.2 Step Three: add a probability table for each node. The table for node X must list P(X | Parent Values) for each possible combination of parent values. This table is called a Conditional Probability Table (CPT). T: The lecture started by 10:35 L: The lecturer arrives late R: The lecture concerns robots M: The lecturer is Maria S: It is sunny
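To make the net concrete, here is one possible encoding of its structure and CPTs (a sketch, not from the slides; the data structures are my own choices). The later sketches in this transcript reuse these two dictionaries.

# Sketch: the five-node lecture net as plain Python dictionaries.
# parents[X] lists X's parents; cpt[X] maps a tuple of parent values to P(X=True | those values).
parents = {'S': (), 'M': (), 'L': ('M', 'S'), 'R': ('M',), 'T': ('L',)}
cpt = {
    'S': {(): 0.3},
    'M': {(): 0.6},
    'L': {(True, True): 0.05, (True, False): 0.1,
          (False, True): 0.1, (False, False): 0.2},
    'R': {(True,): 0.3, (False,): 0.6},
    'T': {(True,): 0.3, (False,): 0.8},
}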

25 Bayes Nets: Slide 25 Copyright © 2001, Andrew W. Moore Making a Bayes net S M R L T P(S)=0.3 P(M)=0.6 P(R | M)=0.3 P(R | ~M)=0.6 P(T | L)=0.3 P(T | ~L)=0.8 P(L | M^S)=0.05 P(L | M^~S)=0.1 P(L | ~M^S)=0.1 P(L | ~M^~S)=0.2 Two unconnected variables may still be correlated. Each node is conditionally independent of all non-descendants in the tree, given its parents. You can deduce many other conditional independence relations from a Bayes net. T: The lecture started by 10:35 L: The lecturer arrives late R: The lecture concerns robots M: The lecturer is Maria S: It is sunny

26 Bayes Nets: Slide 26 Copyright © 2001, Andrew W. Moore Naïve Bayes: a simple Bayes Net [Diagram: J with arrows into R, Z and C] J: Person is a Junior C: Brought Coat to Classroom Z: Live in zipcode 15213 R: Saw “Return of the King” more than once

27 Bayes Nets: Slide 27 Copyright © 2001, Andrew W. Moore Naïve Bayes: A simple Bayes Net [Diagram: J with arrows into R, Z and C] J: Person is a Junior C: Brought Coat to Classroom Z: Live in zipcode 15213 R: Saw “Return of the King” more than once What parameters are stored in the CPTs of this Bayes Net? CPT = Conditional Probability Table

28 Bayes Nets: Slide 28 Copyright © 2001, Andrew W. Moore Naïve Bayes: A simple Bayes Net [Diagram: J with arrows into R, Z and C] J: Person is a Junior C: Brought Coat to Classroom Z: Live in zipcode 15213 R: Saw “Return of the King” more than once P(J) = P(C|J) = P(C|~J) = P(R|J) = P(R|~J) = P(Z|J) = P(Z|~J) = Suppose we have a database from 20 people who attended a lecture. How could we use that to estimate the values in this CPT?

30 Bayes Nets: Slide 30 Copyright © 2001, Andrew W. Moore A Naïve Bayes Classifier [Diagram: output attribute J with arrows into input attributes R, Z and C] J: Person is a Junior C: Brought Coat to Classroom Z: Live in zipcode 15213 R: Saw “Return of the King” more than once P(J) = P(C|J) = P(C|~J) = P(R|J) = P(R|~J) = P(Z|J) = P(Z|~J) = A new person shows up at class wearing an “I live right above the Manor Theater where I saw all the Lord of The Rings Movies every night” overcoat. What is the probability that they are a Junior?

31 Bayes Nets: Slide 31 Copyright © 2001, Andrew W. Moore Naïve Bayes Classifier Inference [Diagram: J with arrows into R, Z and C] By Bayes rule and the conditional independences in the net: P(J | C ^ Z ^ R) = P(C|J) P(Z|J) P(R|J) P(J) / [ P(C|J) P(Z|J) P(R|J) P(J) + P(C|~J) P(Z|~J) P(R|~J) P(~J) ]

32 Bayes Nets: Slide 32 Copyright © 2001, Andrew W. Moore The General Case [Diagram: Y with arrows into X1, X2, ..., Xm] 1. Estimate P(Y=v) as the fraction of records with Y=v. 2. Estimate P(Xi=u | Y=v) as the fraction of “Y=v” records that also have Xi=u. 3. To predict the Y value given observations of all the Xi values, compute the value v that maximizes P(Y=v) * P(X1=u1 | Y=v) * … * P(Xm=um | Y=v), as in the sketch below.
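A minimal sketch of steps 1-3 (my own illustration; it assumes each record is a tuple of attribute values paired with a class label, and uses raw count fractions with no smoothing):

from collections import Counter, defaultdict

def train_naive_bayes(records):
    # records: list of (x_values_tuple, y_label); returns (P(Y), P(Xi | Y)) as count fractions.
    y_counts = Counter(y for _, y in records)
    p_y = {y: c / len(records) for y, c in y_counts.items()}
    cond = defaultdict(lambda: defaultdict(Counter))   # cond[i][y][u] = count of Xi=u among Y=y records
    for xs, y in records:
        for i, u in enumerate(xs):
            cond[i][y][u] += 1
    p_x_given_y = {i: {y: {u: c / y_counts[y] for u, c in cnt.items()}
                       for y, cnt in by_y.items()}
                   for i, by_y in cond.items()}
    return p_y, p_x_given_y

def predict(p_y, p_x_given_y, xs):
    # Return the y maximizing P(Y=y) * prod_i P(Xi=xs[i] | Y=y).
    def score(y):
        p = p_y[y]
        for i, u in enumerate(xs):
            p *= p_x_given_y[i][y].get(u, 0.0)
        return p
    return max(p_y, key=score)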

33 Bayes Nets: Slide 33 Copyright © 2001, Andrew W. Moore Bayes Nets Formalized A Bayes net (also called a belief network) is an augmented directed acyclic graph, represented by the pair V, E where: V is a set of vertices. E is a set of directed edges joining vertices. No loops of any length are allowed. Each vertex in V contains the following information: The name of a random variable A probability distribution table indicating how the probability of this variable’s values depends on all possible combinations of parental values.

34 Bayes Nets: Slide 34 Copyright © 2001, Andrew W. Moore Building a Bayes Net 1. Choose a set of relevant variables. 2. Choose an ordering for them. 3. Assume they’re called X1 .. Xm (where X1 is the first in the ordering, X2 is the second, etc). 4. For i = 1 to m: 1. Add the Xi node to the network. 2. Set Parents(Xi) to be a minimal subset of {X1 … Xi-1} such that we have conditional independence of Xi and all other members of {X1 … Xi-1} given Parents(Xi). 3. Define the probability table of P(Xi=k | Assignments of Parents(Xi)).

35 Bayes Nets: Slide 35 Copyright © 2001, Andrew W. Moore Computing a Joint Entry How to compute an entry in a joint distribution? E.g.: What is P(S ^ ~M ^ L ^ ~R ^ T)? S M R L T P(S)=0.3 P(M)=0.6 P(R | M)=0.3 P(R | ~M)=0.6 P(T | L)=0.3 P(T | ~L)=0.8 P(L | M^S)=0.05 P(L | M^~S)=0.1 P(L | ~M^S)=0.1 P(L | ~M^~S)=0.2

36 Bayes Nets: Slide 36 Copyright © 2001, Andrew W. Moore Computing with Bayes Net P(T ^ ~R ^ L ^ ~M ^ S) = P(T | ~R ^ L ^ ~M ^ S) * P(~R ^ L ^ ~M ^ S) = P(T | L) * P(~R ^ L ^ ~M ^ S) = P(T | L) * P(~R | L ^ ~M ^ S) * P(L ^ ~M ^ S) = P(T | L) * P(~R | ~M) * P(L ^ ~M ^ S) = P(T | L) * P(~R | ~M) * P(L | ~M^S) * P(~M ^ S) = P(T | L) * P(~R | ~M) * P(L | ~M^S) * P(~M | S) * P(S) = P(T | L) * P(~R | ~M) * P(L | ~M^S) * P(~M) * P(S). S M R L T P(S)=0.3 P(M)=0.6 P(R | M)=0.3 P(R | ~M)=0.6 P(T | L)=0.3 P(T | ~L)=0.8 P(L | M^S)=0.05 P(L | M^~S)=0.1 P(L | ~M^S)=0.1 P(L | ~M^~S)=0.2
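The same chain of steps as a generic routine (a sketch; it assumes the parents and cpt dictionaries from the earlier block after slide 24 are in scope):

def joint_entry(assignment, parents, cpt):
    # P(full assignment) = product over nodes of P(node value | its parents' values).
    p = 1.0
    for x, value in assignment.items():
        parent_vals = tuple(assignment[q] for q in parents[x])
        p_true = cpt[x][parent_vals]
        p *= p_true if value else 1.0 - p_true
    return p

# P(T ^ ~R ^ L ^ ~M ^ S) = P(T|L) * P(~R|~M) * P(L|~M^S) * P(~M) * P(S)
#                        = 0.3 * 0.4 * 0.1 * 0.4 * 0.3 = 0.00144
print(joint_entry({'S': True, 'M': False, 'L': True, 'R': False, 'T': True}, parents, cpt))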

37 Bayes Nets: Slide 37 Copyright © 2001, Andrew W. Moore The general case P(X1=x1 ^ X2=x2 ^ … ^ Xn-1=xn-1 ^ Xn=xn) = P(Xn=xn ^ Xn-1=xn-1 ^ … ^ X2=x2 ^ X1=x1) = P(Xn=xn | Xn-1=xn-1 ^ … ^ X2=x2 ^ X1=x1) * P(Xn-1=xn-1 ^ … ^ X2=x2 ^ X1=x1) = P(Xn=xn | Xn-1=xn-1 ^ … ^ X2=x2 ^ X1=x1) * P(Xn-1=xn-1 | … ^ X2=x2 ^ X1=x1) * P(Xn-2=xn-2 ^ … ^ X2=x2 ^ X1=x1) = … = the product over i of P(Xi=xi | Assignments of Parents(Xi)). So any entry in the joint pdf table can be computed. And so any conditional probability can be computed.

38 Bayes Nets: Slide 38 Copyright © 2001, Andrew W. Moore Where are we now? We have a methodology for building Bayes nets. We don’t require exponential storage to hold our probability table. Only exponential in the maximum number of parents of any node. We can compute probabilities of any given assignment of truth values to the variables. And we can do it in time linear with the number of nodes. So we can also compute answers to any questions. E.g. what could we do to compute P(R | T, ~S)? S M R L T P(S)=0.3 P(M)=0.6 P(R | M)=0.3 P(R | ~M)=0.6 P(T | L)=0.3 P(T | ~L)=0.8 P(L | M^S)=0.05 P(L | M^~S)=0.1 P(L | ~M^S)=0.1 P(L | ~M^~S)=0.2

39 Bayes Nets: Slide 39 Copyright © 2001, Andrew W. Moore Where are we now? We have a methodology for building Bayes nets. We don’t require exponential storage to hold our probability table. Only exponential in the maximum number of parents of any node. We can compute probabilities of any given assignment of truth values to the variables. And we can do it in time linear with the number of nodes. So we can also compute answers to any questions. E.g. what could we do to compute P(R | T, ~S)? S M R L T P(S)=0.3 P(M)=0.6 P(R | M)=0.3 P(R | ~M)=0.6 P(T | L)=0.3 P(T | ~L)=0.8 P(L | M^S)=0.05 P(L | M^~S)=0.1 P(L | ~M^S)=0.1 P(L | ~M^~S)=0.2 Step 1: Compute P(R ^ T ^ ~S). Step 2: Compute P(~R ^ T ^ ~S). Step 3: Return P(R ^ T ^ ~S) / ( P(R ^ T ^ ~S) + P(~R ^ T ^ ~S) ).

40 Bayes Nets: Slide 40 Copyright © 2001, Andrew W. Moore Where are we now? We have a methodology for building Bayes nets. We don’t require exponential storage to hold our probability table. Only exponential in the maximum number of parents of any node. We can compute probabilities of any given assignment of truth values to the variables. And we can do it in time linear with the number of nodes. So we can also compute answers to any questions. E.g. what could we do to compute P(R | T, ~S)? S M R L T P(S)=0.3 P(M)=0.6 P(R | M)=0.3 P(R | ~M)=0.6 P(T | L)=0.3 P(T | ~L)=0.8 P(L | M^S)=0.05 P(L | M^~S)=0.1 P(L | ~M^S)=0.1 P(L | ~M^~S)=0.2 Step 1: Compute P(R ^ T ^ ~S). Step 2: Compute P(~R ^ T ^ ~S). Step 3: Return P(R ^ T ^ ~S) / ( P(R ^ T ^ ~S) + P(~R ^ T ^ ~S) ). Step 1 is the sum of all the rows in the joint that match R ^ T ^ ~S; Step 2 is the sum of all the rows in the joint that match ~R ^ T ^ ~S.

41 Bayes Nets: Slide 41 Copyright © 2001, Andrew W. Moore Where are we now? We have a methodology for building Bayes nets. We don’t require exponential storage to hold our probability table. Only exponential in the maximum number of parents of any node. We can compute probabilities of any given assignment of truth values to the variables. And we can do it in time linear with the number of nodes. So we can also compute answers to any questions. E.g. what could we do to compute P(R | T, ~S)? S M R L T P(S)=0.3 P(M)=0.6 P(R | M)=0.3 P(R | ~M)=0.6 P(T | L)=0.3 P(T | ~L)=0.8 P(L | M^S)=0.05 P(L | M^~S)=0.1 P(L | ~M^S)=0.1 P(L | ~M^~S)=0.2 Step 1 is the sum of all the rows in the joint that match R ^ T ^ ~S; Step 2 is the sum of all the rows that match ~R ^ T ^ ~S. Each of these sums is obtained by the “computing a joint probability entry” method of the earlier slides (4 joint computes per sum), as in the sketch below.
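Continuing the sketch (again assuming the parents/cpt dictionaries and joint_entry from the earlier blocks), the three steps above reduce to brute-force enumeration over the unmentioned variables M and L:

from itertools import product

def prob_of_evidence(evidence, parents, cpt):
    # Sum joint entries over all completions of the variables not mentioned in the evidence.
    hidden = [x for x in parents if x not in evidence]
    total = 0.0
    for values in product((True, False), repeat=len(hidden)):
        assignment = dict(evidence, **dict(zip(hidden, values)))
        total += joint_entry(assignment, parents, cpt)
    return total

num = prob_of_evidence({'R': True, 'T': True, 'S': False}, parents, cpt)
den = num + prob_of_evidence({'R': False, 'T': True, 'S': False}, parents, cpt)
print(num / den)   # P(R | T, ~S)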

42 Bayes Nets: Slide 42 Copyright © 2001, Andrew W. Moore The good news We can do inference. We can compute any conditional probability: P(Some variable | Some other variable values)

43 Bayes Nets: Slide 43 Copyright © 2001, Andrew W. Moore The good news We can do inference. We can compute any conditional probability: P(Some variable | Some other variable values) Suppose you have m binary-valued variables in your Bayes Net and the evidence expression E2 mentions k variables. How much work is the above computation?

44 Bayes Nets: Slide 44 Copyright © 2001, Andrew W. Moore The sad, bad news Conditional probabilities by enumerating all matching entries in the joint are expensive: Exponential in the number of variables.

45 Bayes Nets: Slide 45 Copyright © 2001, Andrew W. Moore The sad, bad news Conditional probabilities by enumerating all matching entries in the joint are expensive: Exponential in the number of variables. But perhaps there are faster ways of querying Bayes nets? In fact, if I ever ask you to manually do a Bayes Net inference, you’ll find there are often many tricks to save you time. So we’ve just got to program our computer to do those tricks too, right?

46 Bayes Nets: Slide 46 Copyright © 2001, Andrew W. Moore The sad, bad news Conditional probabilities by enumerating all matching entries in the joint are expensive: Exponential in the number of variables. But perhaps there are faster ways of querying Bayes nets? In fact, if I ever ask you to manually do a Bayes Net inference, you’ll find there are often many tricks to save you time. So we’ve just got to program our computer to do those tricks too, right? Sadder and worse news: General querying of Bayes nets is NP-complete.

47 Bayes Nets: Slide 47 Copyright © 2001, Andrew W. Moore Real-sized Bayes Nets How do you build them? From Experts and/or from Data! How do you use them? Predict values that are expensive or impossible to measure. Decide which possible problems to investigate first.

48 Bayes Nets: Slide 48 Copyright © 2001, Andrew W. Moore Example: “ICU Alarm” network Domain: Monitoring Intensive-Care Patients 37 variables 509 parameters … instead of 2^54 [Figure: the ALARM network of 37 nodes, such as PCWP, CO, HRBP, HREKG, HRSAT, HR, HISTORY, CATECHOL, SAO2, EXPCO2, ARTCO2, VENTALV, VENTLUNG, MINVOLSET, VENTMACH, KINKEDTUBE, INTUBATION, PULMEMBOLUS, ANAPHYLAXIS, FIO2, LVFAILURE, HYPOVOLEMIA, CVP, BP, …]

49 Bayes Nets: Slide 49 Copyright © 2001, Andrew W. Moore Building Bayes Nets Bayes nets are sometimes built manually, consulting domain experts for structure and probabilities. More often the structure is supplied by experts, but the probabilities learned from data. And in some cases the structure, as well as the probabilities, are learned from data.

50 Bayes Nets: Slide 50 Copyright © 2001, Andrew W. Moore Estimating Probability Tables
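The body of this slide was an image that did not survive extraction. The usual recipe it illustrates (a sketch in my own wording) is to estimate each CPT row by counting: P(X=u | Parents=p) is approximately the fraction of records with Parents=p that also have X=u, optionally with a small pseudo-count to avoid zero probabilities.

from collections import Counter

def estimate_cpt(records, child, parent_cols, pseudo=1.0):
    # records: list of dicts of boolean values. Returns {parent-value tuple: P(child=True | parents)},
    # estimated as smoothed count fractions (Laplace-style pseudo-counts).
    counts, trues = Counter(), Counter()
    for r in records:
        key = tuple(r[p] for p in parent_cols)
        counts[key] += 1
        trues[key] += 1 if r[child] else 0
    return {k: (trues[k] + pseudo) / (counts[k] + 2 * pseudo) for k in counts}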

52 Bayes Nets: Slide 52 Copyright © 2001, Andrew W. Moore Family of Alarm Bayesian Networks Qualitative part: directed acyclic graph (DAG); nodes are random variables, edges are direct influence. [Network: Earthquake and Burglary point to Alarm; Alarm points to Call; Earthquake points to Radio] Quantitative part: a set of conditional probability distributions, e.g. a CPT for P(A | E,B) with one row per combination of E and B (entries such as 0.9/0.1, 0.2/0.8, 0.01/0.99). Compact representation of probability distributions via conditional independence. Together they define a unique distribution in factored form.

53 Bayes Nets: Slide 53 Copyright © 2001, Andrew W. Moore Why learning? Knowledge acquisition bottleneck Knowledge acquisition is an expensive process Often we don’t have an expert Data is cheap Amount of available information growing rapidly Learning allows us to construct models from raw data

54 Bayes Nets: Slide 54 Copyright © 2001, Andrew W. Moore Why Learn Bayesian Networks? Conditional independencies & graphical language capture structure of many real-world distributions Graph structure provides much insight into domain Allows “knowledge discovery” Learned model can be used for many tasks Supports all the features of probabilistic learning Model selection criteria Dealing with missing data & hidden variables

55 Bayes Nets: Slide 55 Copyright © 2001, Andrew W. Moore Learning Bayesian networks Data plus prior information go into a Learner, which outputs a Bayes net: a structure over E, R, B, A, C together with its CPTs, e.g. a table for P(A | E,B) (entries such as .9/.1, .7/.3, .8/.2, .99/.01).

56 Bayes Nets: Slide 56 Copyright © 2001, Andrew W. Moore Known Structure, Complete Data Network structure is specified; the inducer needs to estimate the parameters. Data does not contain missing values. [Figure: complete records over E, B, A plus the fixed structure E → A ← B go into the Learner, which fills in the unknown (?) entries of the CPT P(A | E,B)]

57 Bayes Nets: Slide 57 Copyright © 2001, Andrew W. Moore Unknown Structure, Complete Data Network structure is not specified; the inducer needs to select the arcs and estimate the parameters. Data does not contain missing values. [Figure: complete records over E, B, A go into the Learner, which chooses the arcs and fills in the CPT P(A | E,B)]

58 Bayes Nets: Slide 58 Copyright © 2001, Andrew W. Moore Known Structure, Incomplete Data Network structure is specified. Data contains missing values, so we need to consider assignments to the missing values. [Figure: records over E, B, A with missing entries go into the Learner, which estimates the CPT P(A | E,B) for the fixed structure]

59 Bayes Nets: Slide 59 Copyright © 2001, Andrew W. Moore Unknown Structure, Incomplete Data Network structure is not specified. Data contains missing values, so we need to consider assignments to the missing values. [Figure: records over E, B, A with missing entries go into the Learner, which must both choose the arcs and estimate the CPT P(A | E,B)]

60 Bayes Nets: Slide 60 Copyright © 2001, Andrew W. Moore Learning BN Structure from Data (Margaritis, D., Ph.D. Thesis, 2003, CMU) Score Metrics: the most implemented method. Define a quality metric to be maximized; use greedy search to determine the next best arc to add; stop when the metric does not increase by adding an arc. Entropy Methods: the earliest method, formulated for trees and polytrees. Conditional Independence (CI): define conditional independencies for each node (Markov boundaries) and infer dependencies within the Markov boundary. Simulated Annealing & Genetic Algorithms.

61 Bayes Nets: Slide 61 Copyright © 2001, Andrew W. Moore Learning a BN structure: The score based method

64 Bayes Nets: Slide 64 Copyright © 2001, Andrew W. Moore Sample Score Metrics Bayesian score: p(network structure | database). Information criterion: log p(database | network structure and parameter set); favors complete networks, so a penalty term on the number of arcs is commonly added. Minimum description length: equivalent to the information criterion with a penalty function; derived using coding theory.

65 Bayes Nets: Slide 65 Copyright © 2001, Andrew W. Moore Score-based Learning Define a scoring function that evaluates how well a structure matches the data, then search for a structure that maximizes the score. [Figure: the data over E, B, A and several candidate structures over E, B, A being compared]

66 Bayes Nets: Slide 66 Copyright © 2001, Andrew W. Moore Likelihood Score for Structure The score rewards mutual information between Xi and its parents: larger dependence of Xi on Pai means a higher score. Adding arcs always helps, since I(X; Y) <= I(X; {Y,Z}), so the maximum score is attained by a fully connected network. Overfitting: a bad idea…
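The score itself was an image on the original slide. In its standard form (a reconstruction written in LaTeX, which may differ in constants from the slide) the maximum-likelihood score decomposes as

\mathrm{score}_L(G : D) \;=\; M \sum_i \hat{I}(X_i ; \mathrm{Pa}_i^G) \;-\; M \sum_i \hat{H}(X_i)

where M is the number of records, \hat{I} is the empirical mutual information and \hat{H} the empirical entropy. The entropy term does not depend on the structure G, which is why only the mutual-information term matters for search and why adding parents can never decrease the score.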

67 Bayes Nets: Slide 67 Copyright © 2001, Andrew W. Moore Bayesian Score The likelihood score plugs in the maximum-likelihood parameters. The Bayesian approach instead deals with uncertainty by assigning probability to all possibilities: the marginal likelihood integrates the likelihood against a prior over the parameters.

68 Bayes Nets: Slide 68 Copyright © 2001, Andrew W. Moore Scoring a structure The score is built from: the number of records (#Records), the number of attributes (#Attributes), the number of non-redundant parameters defining the net, and sums over all the rows in the probability table for each Xj (indexed by the parent values in the k’th row of Xj’s probability table). All these values are estimated from data.

69 Bayes Nets: Slide 69 Copyright © 2001, Andrew W. Moore Scoring a structure This is called a BIC (Bayesian Information Criterion) estimate: one part is the training-set log-likelihood, the other part is a penalty for too many parameters, and all the values involved are estimated from data. BIC asymptotically tries to get the structure right. (There’s a lot of heavy emotional debate about whether this is the best scoring criterion.)
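The equation behind these two slides was an image; the BIC score is usually written as (a standard reconstruction, not necessarily the slide's exact notation)

\mathrm{BIC}(G : D) \;=\; \log P(D \mid \hat{\theta}_G, G) \;-\; \frac{\dim(G)}{2}\,\log M

where \hat{\theta}_G are the maximum-likelihood parameters, \dim(G) is the number of non-redundant parameters defining the net, and M is the number of records: the first term is the training-set log-likelihood and the second is the penalty for too many parameters.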

70 Bayes Nets: Slide 70 Copyright © 2001, Andrew W. Moore Bayesian Score: Asymptotic Behavior As M (the amount of data) grows there is increasing pressure to fit the dependencies in the empirical distribution, while the complexity term avoids fitting noise; the score is asymptotically equivalent to the MDL score; and the Bayesian score is consistent, since the observed data eventually overrides the prior.

71 Bayes Nets: Slide 71 Copyright © 2001, Andrew W. Moore Searching for structure with best score
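The body of this slide is a figure that did not survive extraction. A minimal sketch of the greedy hill-climbing search such slides usually depict (my own illustration; it assumes externally supplied score(arcs, data) and creates_cycle(arcs, u, v) helpers):

def greedy_structure_search(variables, data, score, creates_cycle):
    # Greedily add the single best arc until no addition improves the score.
    arcs = set()
    best = score(arcs, data)
    improved = True
    while improved:
        improved = False
        for u in variables:
            for v in variables:
                if u == v or (u, v) in arcs or creates_cycle(arcs, u, v):
                    continue
                s = score(arcs | {(u, v)}, data)
                if s > best:
                    best, best_arc, improved = s, (u, v), True
        if improved:
            arcs.add(best_arc)
    return arcs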

72 Bayes Nets: Slide 72 Copyright © 2001, Andrew W. Moore Structure Discovery Task: discover structural properties. Is there a direct connection between X and Y? Does X separate two “subsystems”? Does X causally affect Y? Example: scientific data mining. Disease properties and symptoms. Interactions between the expression of genes.

73 Bayes Nets: Slide 73 Copyright © 2001, Andrew W. Moore Discovering Structure Current practice: model selection. Pick a single high-scoring model and use that model to infer domain structure. [Figure: the posterior P(G|D) over structures, with one high-scoring network over E, R, B, A, C selected]

74 Bayes Nets: Slide 74 Copyright © 2001, Andrew W. Moore Discovering Structure Problem: with a small sample size there are many high-scoring models, and an answer based on one model is often useless; we want features common to many models. [Figure: the posterior P(G|D) with several competing high-scoring networks over E, R, B, A, C]

75 Bayes Nets: Slide 75 Copyright © 2001, Andrew W. Moore Incomplete Data Data is often incomplete Some variables of interest are not assigned values This phenomenon happens when we have Missing values: Some variables unobserved in some instances Hidden variables: Some variables are never observed We might not even know they exist

76 Bayes Nets: Slide 76 Copyright © 2001, Andrew W. Moore Incomplete Data In the presence of incomplete data, the likelihood can have multiple maxima. Example (a two-node net H → Y with H hidden): we can rename the values of the hidden variable H, so if H has two values the likelihood has two maxima. In practice, many local maxima.

77 Bayes Nets: Slide 77 Copyright © 2001, Andrew W. Moore Software Many software packages are available; see Kevin Murphy’s page. Deal: Learning Bayesian Networks in R, http://www.math.aau.dk/novo/deal MSBNx Bayes Net Toolbox for Matlab Hugin (www.hugin.dk): good user interface, implements continuous variables.

78 Oct 15th, 2001 Copyright © 2001, Andrew W. Moore Bayesian Network Classifiers Carlos López de Castilla Vásquez

79 Bayes Nets: Slide 79 Copyright © 2001, Andrew W. Moore Chow and Liu (1968) Approximation of discrete probability distributions using dependence trees. A weight is assigned to each arc using the mutual information.

80 Bayes Nets: Slide 80 Copyright © 2001, Andrew W. Moore Chow and Liu (1968) We look for the dependence tree with maximum weight: there are p-1 arcs and no cycles. [Figure: two candidate dependence trees over X1, X2, X3, X4, X5] It only considers first-order dependencies.

81 Bayes Nets: Slide 81 Copyright © 2001, Andrew W. Moore Chow and Liu algorithm Compute the mutual information between each pair of variables. Build a complete undirected graph in which the nodes correspond to the variables and the weight of the arc connecting each pair is their mutual information. Extract the dependence tree with maximum weight. Choose a root variable and direct all arcs so that they point away from it.
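A sketch of the four steps in Python (my own illustration; it assumes a mutual_info(i, j) helper that returns the estimated mutual information between variables i and j, and builds the maximum-weight tree with a Kruskal-style union-find):

def chow_liu_tree(n_vars, mutual_info, root=0):
    # Returns directed arcs (parent, child) of the maximum-weight dependence tree.
    # Steps 1-2: weight every pair in the complete undirected graph.
    edges = sorted(((mutual_info(i, j), i, j)
                    for i in range(n_vars) for j in range(i + 1, n_vars)), reverse=True)
    # Step 3: maximum-weight spanning tree (Kruskal with union-find).
    parent = list(range(n_vars))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    tree = []
    for w, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj
            tree.append((i, j))
    # Step 4: choose a root and direct every tree edge away from it.
    adj = {v: [] for v in range(n_vars)}
    for i, j in tree:
        adj[i].append(j)
        adj[j].append(i)
    arcs, stack, seen = [], [root], {root}
    while stack:
        u = stack.pop()
        for v in adj[u]:
            if v not in seen:
                seen.add(v)
                arcs.append((u, v))
                stack.append(v)
    return arcs

For Construct-TAN (next slides) the same routine can be reused with conditional mutual information as the weight, followed by adding the class node as a parent of every attribute.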

82 Bayes Nets: Slide 82 Copyright © 2001, Andrew W. Moore Tree Augmented Naive Bayes Friedman, Geiger and Goldszmidt (1997). An adaptation of the Chow and Liu algorithm: Naive Bayes with augmented arcs between the attribute variables. [Figure: class node C with arcs to X1…X6, plus tree arcs among the Xi]

83 Bayes Nets: Slide 83 Copyright © 2001, Andrew W. Moore Construct-TAN algorithm Compute the conditional mutual information between each pair of variables. Build a complete undirected graph in which the nodes correspond to the variables and the weight of the arc connecting each pair is their conditional mutual information. Extract the dependence tree with maximum weight. Choose a root variable and direct all arcs so that they point away from it. Add the class variable C and add an arc from it to each attribute variable.

84 Bayes Nets: Slide 84 Copyright © 2001, Andrew W. Moore Connection with the CL algorithm TAN uses the conditional mutual information (conditioned on the class). It has the same time complexity. It only considers first-order dependencies.
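The conditional mutual information that TAN substitutes for plain mutual information is, in its standard definition (the slide's own equation was an image):

I(X_i ; X_j \mid C) \;=\; \sum_{x_i, x_j, c} P(x_i, x_j, c)\,\log \frac{P(x_i, x_j \mid c)}{P(x_i \mid c)\, P(x_j \mid c)}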

85 Bayes Nets: Slide 85 Copyright © 2001, Andrew W. Moore Diabetes NB 768 instances, 8 variables, 2 classes, discretization. [Figure: naive Bayes structure over DPF, Diastolic, Age, Mass, Pregnant, Insulin, Triceps, Glucose] naiveBayes(Diabetes) The error rate is 0.203125.

86 Bayes Nets: Slide 86 Copyright © 2001, Andrew W. Moore Diabetes TAN Mutual information between each pair of variables:
           Glucose    Diastolic  Triceps    Insulin    Mass       DPF        Age
Pregnant   0.3797799  0.3802939  0.3463370  0.3937218  0.3965384  0.3517250  0.8545171
Glucose               0.4026513  0.2984775  0.4687699  0.4365104  0.3779830  0.3918154
Diastolic                        0.3555564  0.3830652  0.4301699  0.3479464  0.4123081
Triceps                                     0.8347453  0.5945625  0.3684279  0.3688592
Insulin                                                0.3727469  0.4153644  0.3548544
Mass                                                              0.4145652  0.3589600
DPF                                                                          0.3756110

87 Bayes Nets: Slide 87 Copyright © 2001, Andrew W. Moore Diabetes TAN [Figure: building the maximum-weight dependence tree over DPF, Diastolic, Age, Mass, Insulin, Triceps, Glucose, Pregnant]

96 Bayes Nets: Slide 96 Copyright © 2001, Andrew W. Moore Diabetes TAN [Figure: the final TAN structure over DPF, Diastolic, Age, Mass, Insulin, Triceps, Glucose, Pregnant] TAN(Diabetes) The error rate is 0.09505208.

97 Bayes Nets: Slide 97 Copyright © 2001, Andrew W. Moore Basic References Pearl, J. (1988). Probabilistic Reasoning in Intelligent Systems. San Mateo, CA: Morgan Kaufmann. Oliver, R.M. and Smith, J.Q. (eds.) (1990). Influence Diagrams, Belief Nets, and Decision Analysis, Chichester: Wiley. Neapolitan, R.E. (1990). Probabilistic Reasoning in Expert Systems, New York: Wiley. Schum, D.A. (1994). The Evidential Foundations of Probabilistic Reasoning, New York: Wiley. Jensen, F.V. (1996). An Introduction to Bayesian Networks, New York: Springer. Chang, K.C. and Fung, R. (1995). Symbolic Probabilistic Inference with Both Discrete and Continuous Variables, IEEE SMC, 25(6), 910-916.

98 Bayes Nets: Slide 98 Copyright © 2001, Andrew W. Moore Algorithm References Cooper, G.F. (1990). The computational complexity of probabilistic inference using Bayesian belief networks. Artificial Intelligence, 42, 393-405. Jensen, F.V., Lauritzen, S.L., and Olesen, K.G. (1990). Bayesian Updating in Causal Probabilistic Networks by Local Computations. Computational Statistics Quarterly, 269-282. Lauritzen, S.L. and Spiegelhalter, D.J. (1988). Local computations with probabilities on graphical structures and their application to expert systems. J. Royal Statistical Society B, 50(2), 157-224.

