Oct 15th, 2001 Copyright © 2001, Andrew W. Moore Bayes Networks Based on notes by professors Andrew W. Moore, Nir Friedman, and Daphne Koller

Bayes Nets: Slide 2 Copyright © 2001, Andrew W. Moore Finding Joint Distributions Suppose there are two events: M: Maria teaches the class S: It is sunny The joint p.d.f. for these events contains four entries. Knowing only P(M) and P(S) is not enough to derive the joint distribution; we need to make an extra assumption.

Bayes Nets: Slide 3 Copyright © 2001, Andrew W. Moore Independence “The sunshine levels do not depend on and do not influence who is teaching.” This can be specified very simply: P(S | M) = P(S) This is a powerful statement! It required extra domain knowledge. A different kind of knowledge than numerical probabilities. It needed an understanding of causation.

Bayes Nets: Slide 4 Copyright © 2001, Andrew W. Moore Independence From P(S | M) = P(S), the rules of probability imply: (can you prove these?) P(~S | M) = P(~S) P(M | S) = P(M) P(M ^ S) = P(M) P(S) P(~M ^ S) = P(~M) P(S), P(M ^ ~S) = P(M) P(~S), P(~M ^ ~S) = P(~M) P(~S)

Bayes Nets: Slide 5 Copyright © 2001, Andrew W. Moore Independence From P(S | M) = P(S), the rules of probability imply: (can you prove these?) P(~S | M) = P(~S) P(M | S) = P(M) P(M ^ S) = P(M) P(S) P(~M ^ S) = P(~M) P(S), P(M ^ ~S) = P(M) P(~S), P(~M ^ ~S) = P(~M) P(~S) And in general: P(M=u ^ S=v) = P(M=u) P(S=v) for each of the four combinations of u = True/False, v = True/False

Bayes Nets: Slide 6 Copyright © 2001, Andrew W. Moore Independence Suppose that: P(M) = 0.6 P(S) = 0.3 P(S | M) = P(S) From these statements, we can derive the full joint p.d.f. (each entry is the product of the corresponding marginals):
M S Prob
T T 0.6 × 0.3 = 0.18
T F 0.6 × 0.7 = 0.42
F T 0.4 × 0.3 = 0.12
F F 0.4 × 0.7 = 0.28
And since we now have the joint p.d.f., we can make any queries we like.
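To make the derivation above concrete, here is a minimal Python sketch (the variable names and printout are mine, not from the slides) that builds the four-entry joint from P(M), P(S) and the independence assumption, then answers a sample query:

# Joint distribution of two independent binary events, with P(M)=0.6 and P(S)=0.3.
p_m, p_s = 0.6, 0.3

joint = {}
for m in (True, False):
    for s in (True, False):
        pm = p_m if m else 1 - p_m
        ps = p_s if s else 1 - p_s
        joint[(m, s)] = pm * ps          # independence: P(M=m ^ S=s) = P(M=m) * P(S=s)

for (m, s), p in joint.items():
    print(f"M={m} S={s} P={p:.2f}")      # 0.18, 0.42, 0.12, 0.28

# Any query can now be read off the joint, e.g. P(M v S):
print(sum(p for (m, s), p in joint.items() if m or s))   # 0.72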

Bayes Nets: Slide 7 Copyright © 2001, Andrew W. Moore A more interesting case M : Maria teaches the class S : It is sunny L : The lecturer arrives slightly late. Assume both lecturers (Jose and Maria) are sometimes delayed by bad weather. Jose is more likely to arrive late than Maria.

Bayes Nets: Slide 8 Copyright © 2001, Andrew W. Moore A more interesting case M : Maria teaches the class S : It is sunny L : The lecturer arrives slightly late. Assume both lecturers (Jose and Maria) are sometimes delayed by bad weather. Jose is more likely to arrive late than Maria. Let’s begin with writing down knowledge we’re happy about: P(S | M) = P(S), P(S) = 0.3, P(M) = 0.6 Lateness is not independent of the weather and is not independent of the lecturer.

Bayes Nets: Slide 9 Copyright © 2001, Andrew W. Moore A more interesting case M : Maria teaches the class S : It is sunny L : The lecturer arrives slightly late. Assume both lecturers (Jose and Maria) are sometimes delayed by bad weather. Jose is more likely to arrive late than Maria. Let’s begin with writing down knowledge we’re happy about: P(S | M) = P(S), P(S) = 0.3, P(M) = 0.6 Lateness is not independent of the weather and is not independent of the lecturer. We already know the joint of S and M, so to know the joint p.d.f. of L, S and M, all we need now is P(L | S=u, M=v) in the four cases of u/v = True/False.

Bayes Nets: Slide 10 Copyright © 2001, Andrew W. Moore A more interesting case M : Maria teaches the class S : It is sunny L : The lecturer arrives slightly late. Assume both lecturers are sometimes delayed by bad weather. Jose is more likely to arrive late than Maria. P(S | M) = P(S) P(S) = 0.3 P(M) = 0.6 P(L | M ^ S) = 0.05 P(L | M ^ ~S) = 0.1 P(L | ~M ^ S) = 0.1 P(L | ~M ^ ~S) = 0.2 Now we can derive a full joint p.d.f. with a “mere” six numbers instead of seven* *Savings are larger for larger numbers of variables.

Bayes Nets: Slide 11 Copyright © 2001, Andrew W. Moore A more interesting case M : Maria teaches the class S : It is sunny L : The lecturer arrives slightly late. Assume both lecturers are sometimes delayed by bad weather. Jose is more likely to arrive late than Maria. P(S | M) = P(S) P(S) = 0.3 P(M) = 0.6 P(L | M ^ S) = 0.05 P(L | M ^ ~S) = 0.1 P(L | ~M ^ S) = 0.1 P(L | ~M ^ ~S) = 0.2 Question: Express P(L=x ^ M=y ^ S=z) in terms that only need the above expressions, where x, y and z may each be True or False.
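One way to answer the question on this slide, spelled out here for the reader (this is just the chain rule plus the stated independence of M and S, not text from the original deck): P(L=x ^ M=y ^ S=z) = P(L=x | M=y ^ S=z) * P(M=y ^ S=z) = P(L=x | M=y ^ S=z) * P(M=y) * P(S=z). For example, P(L ^ ~M ^ S) = 0.1 * 0.4 * 0.3 = 0.012.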

Bayes Nets: Slide 12 Copyright © 2001, Andrew W. Moore A bit of notation P(S | M) = P(S) P(S) = 0.3 P(M) = 0.6 P(L | M ^ S) = 0.05 P(L | M ^ ~S) = 0.1 P(L | ~M ^ S) = 0.1 P(L | ~M ^ ~S) = 0.2 S M L P(S)=0.3 P(M)=0.6 P(L | M^S)=0.05 P(L | M^~S)=0.1 P(L | ~M^S)=0.1 P(L | ~M^~S)=0.2

Bayes Nets: Slide 13 Copyright © 2001, Andrew W. Moore A bit of notation P(S | M) = P(S) P(S) = 0.3 P(M) = 0.6 P(L | M ^ S) = 0.05 P(L | M ^ ~S) = 0.1 P(L | ~M ^ S) = 0.1 P(L | ~M ^ ~S) = 0.2 S M L P(S)=0.3 P(M)=0.6 P(L | M^S)=0.05 P(L | M^~S)=0.1 P(L | ~M^S)=0.1 P(L | ~M^~S)=0.2 Read the absence of an arrow between S and M to mean “it would not help me predict M if I knew the value of S”. Read the two arrows into L to mean that if I want to know the value of L it may help me to know M and to know S. This kind of stuff will be thoroughly formalized later.

Bayes Nets: Slide 14 Copyright © 2001, Andrew W. Moore More examples Suppose we have these three events: M : Lecture taught by Maria L : Lecturer arrives late R : Lecture concerns robots Suppose: Jose has a higher chance of being late than Maria. Jose has a higher chance of giving robotics lectures. What kind of independence can we find? How about: P(L | M) = P(L) ? No P(R | M) = P(R) ? No P(L | R) = P(L) ?

Bayes Nets: Slide 15 Copyright © 2001, Andrew W. Moore Conditional independence Once you know who the lecturer is, then whether they arrive late doesn’t affect whether the lecture concerns robots. P(R | M, L) = P(R | M) and P(R | ~M, L) = P(R | ~M) We express this in the following way: “R and L are conditionally independent given M”, which is also notated by the following diagram. M L R Given knowledge of M, knowing anything else in the diagram won’t help us with L, etc.

Bayes Nets: Slide 16 Copyright © 2001, Andrew W. Moore Conditional Independence formalized R and L are conditionally independent given M if for all x, y, z in {T, F}: P(R=x | M=y ^ L=z) = P(R=x | M=y) More generally: Let S1, S2 and S3 be sets of variables. Set-of-variables S1 and set-of-variables S2 are conditionally independent given S3 if, for all assignments of values to the variables in the sets, P(S1’s assignments | S2’s assignments & S3’s assignments) = P(S1’s assignments | S3’s assignments)

Bayes Nets: Slide 17 Copyright © 2001, Andrew W. Moore Example: R and L are conditionally independent given M if for all x, y, z in {T, F}: P(R=x | M=y ^ L=z) = P(R=x | M=y) More generally: Let S1, S2 and S3 be sets of variables. Set-of-variables S1 and set-of-variables S2 are conditionally independent given S3 if, for all assignments of values to the variables in the sets, P(S1’s assignments | S2’s assignments & S3’s assignments) = P(S1’s assignments | S3’s assignments) “Shoe-size is conditionally independent of Glove-size given height, weight and age” means: for all s, g, h, w, a P(ShoeSize=s | Height=h, Weight=w, Age=a) = P(ShoeSize=s | Height=h, Weight=w, Age=a, GloveSize=g)

Bayes Nets: Slide 18 Copyright © 2001, Andrew W. Moore Example: R and L are conditionally independent given M if for all x, y, z in {T, F}: P(R=x | M=y ^ L=z) = P(R=x | M=y) More generally: Let S1, S2 and S3 be sets of variables. Set-of-variables S1 and set-of-variables S2 are conditionally independent given S3 if, for all assignments of values to the variables in the sets, P(S1’s assignments | S2’s assignments & S3’s assignments) = P(S1’s assignments | S3’s assignments) “Shoe-size is conditionally independent of Glove-size given height, weight and age” does not mean: for all s, g, h P(ShoeSize=s | Height=h) = P(ShoeSize=s | Height=h, GloveSize=g)

Bayes Nets: Slide 19 Copyright © 2001, Andrew W. Moore Conditional independence M L R We can write down P(M). And then, since we know L is only directly influenced by M, we can write down the values of P(L | M) and P(L | ~M) and know we’ve fully specified L’s behavior. Ditto for R. P(M) = 0.6 P(L | M) = 0.085 P(L | ~M) = 0.17 P(R | M) = 0.3 P(R | ~M) = 0.6 ‘R and L conditionally independent given M’

Bayes Nets: Slide 20 Copyright © 2001, Andrew W. Moore Conditional independence M L R P(M) = 0.6 P(L | M) = 0.085 P(L | ~M) = 0.17 P(R | M) = 0.3 P(R | ~M) = 0.6 Conditional Independence: P(R | M, L) = P(R | M), P(R | ~M, L) = P(R | ~M) Again, we can obtain any member of the joint prob dist that we desire: P(L=x ^ R=y ^ M=z) = P(L=x | M=z) P(R=y | M=z) P(M=z)
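Spelling this out with a worked instance of my own (using only the numbers on the slide): conditional independence of R and L given M gives P(L=x ^ R=y ^ M=z) = P(L=x | M=z) * P(R=y | M=z) * P(M=z), so for example P(L ^ R ^ ~M) = 0.17 * 0.6 * 0.4 = 0.0408.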

Bayes Nets: Slide 21 Copyright © 2001, Andrew W. Moore Assume five variables T: The lecture started by 10:35 L: The lecturer arrives late R: The lecture concerns robots M: The lecturer is Maria S: It is sunny T only directly influenced by L (i.e. T is conditionally independent of R,M,S given L) L only directly influenced by M and S (i.e. L is conditionally independent of R given M & S) R only directly influenced by M (i.e. R is conditionally independent of L,S, given M) M and S are independent

Bayes Nets: Slide 22 Copyright © 2001, Andrew W. Moore Making a Bayes net S M R L T Step One: add variables. Just choose the variables you’d like to be included in the net. T: The lecture started by 10:35 L: The lecturer arrives late R: The lecture concerns robots M: The lecturer is Maria S: It is sunny

Bayes Nets: Slide 23 Copyright © 2001, Andrew W. Moore Making a Bayes net S M R L T Step Two: add links. The link structure must be acyclic. If node X has parents Q1, Q2, .., Qn then any variable that’s a non-descendant of X is conditionally independent of X given {Q1, Q2, .., Qn} T: The lecture started by 10:35 L: The lecturer arrives late R: The lecture concerns robots M: The lecturer is Maria S: It is sunny

Bayes Nets: Slide 24 Copyright © 2001, Andrew W. Moore Making a Bayes net S M R L T P(S)=0.3 P(M)=0.6 P(R | M)=0.3 P(R | ~M)=0.6 P(T | L)=0.3 P(T | ~L)=0.8 P(L | M^S)=0.05 P(L | M^~S)=0.1 P(L | ~M^S)=0.1 P(L | ~M^~S)=0.2 Step Three: add a probability table for each node. The table for node X must list P(X | Parent Values) for each possible combination of parent values. This table is called the Conditional Probability Table (CPT). T: The lecture started by 10:35 L: The lecturer arrives late R: The lecture concerns robots M: The lecturer is Maria S: It is sunny
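To make “a probability table for each node” concrete, here is one possible in-memory representation of this five-node net in Python (a sketch; the dictionary layout is my own choice, not something prescribed by the slides). Each node records its parents and a CPT mapping each combination of parent values to P(node = True | parents):

bayes_net = {
    "S": {"parents": (),         "cpt": {(): 0.3}},
    "M": {"parents": (),         "cpt": {(): 0.6}},
    "R": {"parents": ("M",),     "cpt": {(True,): 0.3, (False,): 0.6}},
    "L": {"parents": ("M", "S"), "cpt": {(True, True): 0.05, (True, False): 0.1,
                                         (False, True): 0.1, (False, False): 0.2}},
    "T": {"parents": ("L",),     "cpt": {(True,): 0.3, (False,): 0.8}},
}

Because every variable here is binary, storing P(node = True | parents) for each row is enough; P(node = False | parents) is one minus that entry.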

Bayes Nets: Slide 25 Copyright © 2001, Andrew W. Moore Making a Bayes net S M R L T P(S)=0.3 P(M)=0.6 P(R | M)=0.3 P(R | ~M)=0.6 P(T | L)=0.3 P(T | ~L)=0.8 P(L | M^S)=0.05 P(L | M^~S)=0.1 P(L | ~M^S)=0.1 P(L | ~M^~S)=0.2 Two unconnected variables may still be correlated. Each node is conditionally independent of all non-descendants in the tree, given its parents. You can deduce many other conditional independence relations from a Bayes net. T: The lecture started by 10:35 L: The lecturer arrives late R: The lecture concerns robots M: The lecturer is Maria S: It is sunny

Bayes Nets: Slide 26 Copyright © 2001, Andrew W. Moore Naïve Bayes: a simple Bayes Net J R Z C J: Person is a Junior C: Brought Coat to Classroom Z: Live in zipcode R: Saw “Return of the King” more than once

Bayes Nets: Slide 27 Copyright © 2001, Andrew W. Moore Naïve Bayes: A simple Bayes Net J R Z C J: Person is a Junior C: Brought Coat to Classroom Z: Live in zipcode R: Saw “Return of the King” more than once What parameters are stored in the CPTs of this Bayes Net? CPT = Conditional Probability Table

Bayes Nets: Slide 28 Copyright © 2001, Andrew W. Moore Naïve Bayes: A simple Bayes Net J R Z C J: Person is a Junior C: Brought Coat to Classroom Z: Live in zipcode R: Saw “Return of the King” more than once P(J) = P(C|J) = P(C|~J) = P(R|J) = P(R|~J) = P(Z|J) = P(Z|~J) = Suppose we have a database from 20 people who attended a lecture. How could we use that to estimate the values in this CPT?

Bayes Nets: Slide 29 Copyright © 2001, Andrew W. Moore Naïve Bayes: A simple Bayes Net J R Z C J: Person is a Junior C: Brought Coat to Classroom Z: Live in zipcode R: Saw “Return of the King” more than once P(J) = P(C|J) = P(C|~J) = P(R|J) = P(R|~J) = P(Z|J) = P(Z|~J) = Suppose we have a database from 20 people who attended a lecture. How could we use that to estimate the values in this CPT?

Bayes Nets: Slide 30 Copyright © 2001, Andrew W. Moore A Naïve Bayes Classifier J R Z C J: Person is a Junior C: Brought Coat to Classroom Z: Live in zipcode R: Saw “Return of the King” more than once P(J) = P(C|J) = P(C|~J) = P(R|J) = P(R|~J) = P(Z|J) = P(Z|~J) = Input attributes: C, Z, R. Output attribute: J. A new person shows up at class wearing an “I live right above the Manor Theater where I saw all the Lord of The Rings Movies every night” overcoat. What is the probability that they are a Junior?

Bayes Nets: Slide 31 Copyright © 2001, Andrew W. Moore Naïve Bayes Classifier Inference J R Z C

Bayes Nets: Slide 32 Copyright © 2001, Andrew W. Moore The General Case Y X1 X2 … Xm 1. Estimate P(Y=v) as the fraction of records with Y=v. 2. Estimate P(Xi=u | Y=v) as the fraction of “Y=v” records that also have Xi=u. 3. To predict the Y value given observations of all the Xi values, compute Ypredict = argmax over v of P(Y=v) · Πi P(Xi=ui | Y=v).
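A short Python sketch of steps 1–3 for discrete attributes (the record format and function names are my own, written for illustration rather than taken from the course; it ignores the smoothing of zero counts a real implementation would want):

from collections import Counter, defaultdict

def fit_naive_bayes(records, target):
    # records: list of dicts mapping attribute name -> value; target names the output attribute Y.
    n = len(records)
    prior = {v: c / n for v, c in Counter(r[target] for r in records).items()}   # step 1: P(Y=v)

    counts = defaultdict(Counter)            # (attribute, y) -> Counter over that attribute's values
    for r in records:
        y = r[target]
        for attr, u in r.items():
            if attr != target:
                counts[(attr, y)][u] += 1
    cond = {key: {u: c / sum(ctr.values()) for u, c in ctr.items()}              # step 2: P(Xi=u | Y=v)
            for key, ctr in counts.items()}
    return prior, cond

def predict(prior, cond, x):
    # step 3: arg max over v of P(Y=v) * product over i of P(Xi=xi | Y=v)
    best_v, best_p = None, -1.0
    for v, pv in prior.items():
        p = pv
        for attr, u in x.items():
            p *= cond.get((attr, v), {}).get(u, 0.0)
        if p > best_p:
            best_v, best_p = v, p
    return best_v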

Bayes Nets: Slide 33 Copyright © 2001, Andrew W. Moore Bayes Nets Formalized A Bayes net (also called a belief network) is an augmented directed acyclic graph, represented by the pair V, E where: V is a set of vertices. E is a set of directed edges joining vertices. No loops of any length are allowed. Each vertex in V contains the following information: The name of a random variable A probability distribution table indicating how the probability of this variable’s values depends on all possible combinations of parental values.

Bayes Nets: Slide 34 Copyright © 2001, Andrew W. Moore Building a Bayes Net 1. Choose a set of relevant variables. 2. Choose an ordering for them. 3. Assume they’re called X1 .. Xm (where X1 is the first in the ordering, X2 is the second, etc). 4. For i = 1 to m: 1. Add the Xi node to the network. 2. Set Parents(Xi) to be a minimal subset of {X1 … Xi-1} such that we have conditional independence of Xi and all other members of {X1 … Xi-1} given Parents(Xi). 3. Define the probability table of P(Xi=k | Assignments of Parents(Xi)).

Bayes Nets: Slide 35 Copyright © 2001, Andrew W. Moore Computing a Joint Entry How to compute an entry in a joint distribution? E.g.: What is P(S ^ ~M ^ L ^ ~R ^ T)? S M R L T P(S)=0.3 P(M)=0.6 P(R | M)=0.3 P(R | ~M)=0.6 P(T | L)=0.3 P(T | ~L)=0.8 P(L | M^S)=0.05 P(L | M^~S)=0.1 P(L | ~M^S)=0.1 P(L | ~M^~S)=0.2

Bayes Nets: Slide 36 Copyright © 2001, Andrew W. Moore Computing with Bayes Net
P(T ^ ~R ^ L ^ ~M ^ S)
= P(T | ~R ^ L ^ ~M ^ S) * P(~R ^ L ^ ~M ^ S)
= P(T | L) * P(~R ^ L ^ ~M ^ S)
= P(T | L) * P(~R | L ^ ~M ^ S) * P(L ^ ~M ^ S)
= P(T | L) * P(~R | ~M) * P(L ^ ~M ^ S)
= P(T | L) * P(~R | ~M) * P(L | ~M ^ S) * P(~M ^ S)
= P(T | L) * P(~R | ~M) * P(L | ~M ^ S) * P(~M | S) * P(S)
= P(T | L) * P(~R | ~M) * P(L | ~M ^ S) * P(~M) * P(S).
S M R L T P(S)=0.3 P(M)=0.6 P(R | M)=0.3 P(R | ~M)=0.6 P(T | L)=0.3 P(T | ~L)=0.8 P(L | M^S)=0.05 P(L | M^~S)=0.1 P(L | ~M^S)=0.1 P(L | ~M^~S)=0.2
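The same chain-rule computation can be written as a loop over the nodes in an order where parents come before children. This sketch is my own illustration, not course code, and assumes the bayes_net dictionary from the slide-24 sketch above is in scope:

def joint_entry(net, assignment, order=("S", "M", "L", "R", "T")):
    # P(assignment) = product over nodes of P(node = value | values of its parents)
    p = 1.0
    for name in order:
        node = net[name]
        parent_values = tuple(assignment[pa] for pa in node["parents"])
        p_true = node["cpt"][parent_values]
        p *= p_true if assignment[name] else 1.0 - p_true
    return p

# P(T ^ ~R ^ L ^ ~M ^ S) = 0.3 * 0.4 * 0.1 * 0.4 * 0.3 = 0.00144
print(joint_entry(bayes_net, {"T": True, "R": False, "L": True, "M": False, "S": True}))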

Bayes Nets: Slide 37 Copyright © 2001, Andrew W. Moore The general case
P(X1=x1 ^ X2=x2 ^ … ^ Xn-1=xn-1 ^ Xn=xn)
= P(Xn=xn ^ Xn-1=xn-1 ^ … ^ X2=x2 ^ X1=x1)
= P(Xn=xn | Xn-1=xn-1 ^ … ^ X2=x2 ^ X1=x1) * P(Xn-1=xn-1 ^ … ^ X2=x2 ^ X1=x1)
= P(Xn=xn | Xn-1=xn-1 ^ … ^ X2=x2 ^ X1=x1) * P(Xn-1=xn-1 | Xn-2=xn-2 ^ … ^ X2=x2 ^ X1=x1) * P(Xn-2=xn-2 ^ … ^ X2=x2 ^ X1=x1)
= …
= Πi P(Xi=xi | Assignments of Parents(Xi))
So any entry in the joint p.d.f. table can be computed. And so any conditional probability can be computed.

Bayes Nets: Slide 38 Copyright © 2001, Andrew W. Moore Where are we now? We have a methodology for building Bayes nets. We don’t require exponential storage to hold our probability table. Only exponential in the maximum number of parents of any node. We can compute probabilities of any given assignment of truth values to the variables. And we can do it in time linear with the number of nodes. So we can also compute answers to any questions. E.g. what could we do to compute P(R | T, ~S)? S M R L T P(S)=0.3 P(M)=0.6 P(R | M)=0.3 P(R | ~M)=0.6 P(T | L)=0.3 P(T | ~L)=0.8 P(L | M^S)=0.05 P(L | M^~S)=0.1 P(L | ~M^S)=0.1 P(L | ~M^~S)=0.2

Bayes Nets: Slide 39 Copyright © 2001, Andrew W. Moore Where are we now? We have a methodology for building Bayes nets. We don’t require exponential storage to hold our probability table. Only exponential in the maximum number of parents of any node. We can compute probabilities of any given assignment of truth values to the variables. And we can do it in time linear with the number of nodes. So we can also compute answers to any questions. E.g. what could we do to compute P(R | T, ~S)? S M R L T P(S)=0.3 P(M)=0.6 P(R | M)=0.3 P(R | ~M)=0.6 P(T | L)=0.3 P(T | ~L)=0.8 P(L | M^S)=0.05 P(L | M^~S)=0.1 P(L | ~M^S)=0.1 P(L | ~M^~S)=0.2 Step 1: Compute P(R ^ T ^ ~S). Step 2: Compute P(~R ^ T ^ ~S). Step 3: Return P(R ^ T ^ ~S) / [ P(R ^ T ^ ~S) + P(~R ^ T ^ ~S) ]

Bayes Nets: Slide 40 Copyright © 2001, Andrew W. Moore Where are we now? We have a methodology for building Bayes nets. We don’t require exponential storage to hold our probability table. Only exponential in the maximum number of parents of any node. We can compute probabilities of any given assignment of truth values to the variables. And we can do it in time linear with the number of nodes. So we can also compute answers to any questions. E.g. what could we do to compute P(R | T, ~S)? S M R L T P(S)=0.3 P(M)=0.6 P(R | M)=0.3 P(R | ~M)=0.6 P(T | L)=0.3 P(T | ~L)=0.8 P(L | M^S)=0.05 P(L | M^~S)=0.1 P(L | ~M^S)=0.1 P(L | ~M^~S)=0.2 Step 1: Compute P(R ^ T ^ ~S). Step 2: Compute P(~R ^ T ^ ~S). Step 3: Return P(R ^ T ^ ~S) / [ P(R ^ T ^ ~S) + P(~R ^ T ^ ~S) ] Here P(R ^ T ^ ~S) is the sum of all the rows in the Joint that match R ^ T ^ ~S, and P(~R ^ T ^ ~S) is the sum of all the rows in the Joint that match ~R ^ T ^ ~S.

Bayes Nets: Slide 41 Copyright © 2001, Andrew W. Moore Where are we now? We have a methodology for building Bayes nets. We don’t require exponential storage to hold our probability table. Only exponential in the maximum number of parents of any node. We can compute probabilities of any given assignment of truth values to the variables. And we can do it in time linear with the number of nodes. So we can also compute answers to any questions. E.g. what could we do to compute P(R | T, ~S)? S M R L T P(S)=0.3 P(M)=0.6 P(R | M)=0.3 P(R | ~M)=0.6 P(T | L)=0.3 P(T | ~L)=0.8 P(L | M^S)=0.05 P(L | M^~S)=0.1 P(L | ~M^S)=0.1 P(L | ~M^~S)=0.2 Step 1: Compute P(R ^ T ^ ~S). Step 2: Compute P(~R ^ T ^ ~S). Step 3: Return P(R ^ T ^ ~S) / [ P(R ^ T ^ ~S) + P(~R ^ T ^ ~S) ] Here P(R ^ T ^ ~S) is the sum of all the rows in the Joint that match R ^ T ^ ~S, and P(~R ^ T ^ ~S) is the sum of all the rows in the Joint that match ~R ^ T ^ ~S. Each of these is obtained by the “computing a joint probability entry” method of the earlier slides (4 joint computes).
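The three steps above can be carried out mechanically by enumerating joint entries. A sketch of my own, reusing the joint_entry helper and bayes_net dictionary from the earlier sketches (and, as the next slides warn, exponential in the number of unobserved variables):

from itertools import product

def conditional(net, query_var, query_val, evidence, variables=("S", "M", "L", "R", "T")):
    # P(query_var = query_val | evidence), by summing the matching joint entries.
    hidden = [v for v in variables if v != query_var and v not in evidence]

    def total(qval):
        s = 0.0
        for combo in product((True, False), repeat=len(hidden)):
            a = dict(evidence, **dict(zip(hidden, combo)))
            a[query_var] = qval
            s += joint_entry(net, a)
        return s

    numerator = total(query_val)
    return numerator / (numerator + total(not query_val))

print(conditional(bayes_net, "R", True, {"T": True, "S": False}))   # P(R | T, ~S)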

Bayes Nets: Slide 42 Copyright © 2001, Andrew W. Moore The good news We can do inference. We can compute any conditional probability: P(Some variable | Some other variable values)

Bayes Nets: Slide 43 Copyright © 2001, Andrew W. Moore The good news We can do inference. We can compute any conditional probability: P(Some variable | Some other variable values), i.e. P(E1 | E2) = P(E1 ^ E2) / P(E2), computed by summing the matching joint entries. Suppose you have m binary-valued variables in your Bayes Net and expression E2 mentions k variables. How much work is the above computation?

Bayes Nets: Slide 44 Copyright © 2001, Andrew W. Moore The sad, bad news Conditional probabilities by enumerating all matching entries in the joint are expensive: Exponential in the number of variables.

Bayes Nets: Slide 45 Copyright © 2001, Andrew W. Moore The sad, bad news Conditional probabilities by enumerating all matching entries in the joint are expensive: Exponential in the number of variables. But perhaps there are faster ways of querying Bayes nets? In fact, if I ever ask you to manually do a Bayes Net inference, you’ll find there are often many tricks to save you time. So we’ve just got to program our computer to do those tricks too, right?

Bayes Nets: Slide 46 Copyright © 2001, Andrew W. Moore The sad, bad news Conditional probabilities by enumerating all matching entries in the joint are expensive: Exponential in the number of variables. But perhaps there are faster ways of querying Bayes nets? In fact, if I ever ask you to manually do a Bayes Net inference, you’ll find there are often many tricks to save you time. So we’ve just got to program our computer to do those tricks too, right? Sadder and worse news: General querying of Bayes nets is NP-complete.

Bayes Nets: Slide 47 Copyright © 2001, Andrew W. Moore Real-sized Bayes Nets How do you build them? From Experts and/or from Data! How do you use them? Predict values that are expensive or impossible to measure. Decide which possible problems to investigate first.

Bayes Nets: Slide 48 Copyright © 2001, Andrew W. Moore Example: “ICU Alarm” network Domain: Monitoring Intensive-Care Patients 37 variables 509 parameters …instead of 2^54 [Network diagram; the nodes include PCWP, CO, HRBP, HREKG, HRSAT, ERRCAUTER, HR, HISTORY, CATECHOL, SAO2, EXPCO2, ARTCO2, VENTALV, VENTLUNG, VENTTUBE, DISCONNECT, MINVOLSET, VENTMACH, KINKEDTUBE, INTUBATION, PULMEMBOLUS, PAP, SHUNT, ANAPHYLAXIS, MINVOL, PVSAT, FIO2, PRESS, INSUFFANESTH, TPR, LVFAILURE, ERRBLOWOUTPUT, STROKEVOLUME, LVEDVOLUME, HYPOVOLEMIA, CVP, BP.]

Bayes Nets: Slide 49 Copyright © 2001, Andrew W. Moore Building Bayes Nets Bayes nets are sometimes built manually, consulting domain experts for structure and probabilities. More often the structure is supplied by experts, but the probabilities learned from data. And in some cases the structure, as well as the probabilities, are learned from data.

Bayes Nets: Slide 50 Copyright © 2001, Andrew W. Moore Estimating Probability Tables

Bayes Nets: Slide 51 Copyright © 2001, Andrew W. Moore Estimating Probability Tables

Bayes Nets: Slide 52 Copyright © 2001, Andrew W. Moore Family of Alarm Bayesian Networks Qualitative part: Directed acyclic graph (DAG). Nodes - random variables. Edges - direct influence. Quantitative part: Set of conditional probability distributions. [Example network: Earthquake, Burglary, Radio, Alarm, Call, with a CPT for P(A | E, B); values not shown.] Compact representation of probability distributions via conditional independence. Together: Define a unique distribution in a factored form.

Bayes Nets: Slide 53 Copyright © 2001, Andrew W. Moore Why learning? Knowledge acquisition bottleneck Knowledge acquisition is an expensive process Often we don’t have an expert Data is cheap Amount of available information growing rapidly Learning allows us to construct models from raw data

Bayes Nets: Slide 54 Copyright © 2001, Andrew W. Moore Why Learn Bayesian Networks? Conditional independencies & graphical language capture structure of many real-world distributions Graph structure provides much insight into domain Allows “knowledge discovery” Learned model can be used for many tasks Supports all the features of probabilistic learning Model selection criteria Dealing with missing data & hidden variables

Bayes Nets: Slide 55 Copyright © 2001, Andrew W. Moore Learning Bayesian networks Data + Prior Information → Learner → [Bayesian network over E, R, B, A, C with its CPTs, e.g. P(A | E, B); values not shown.]

Bayes Nets: Slide 56 Copyright © 2001, Andrew W. Moore Known Structure, Complete Data Network structure is specified; the inducer needs to estimate parameters. Data does not contain missing values. [Figure: records over E, B, A plus the fixed structure E → A ← B go into the Learner, which fills in the CPT P(A | E, B).]

Bayes Nets: Slide 57 Copyright © 2001, Andrew W. Moore Unknown Structure, Complete Data Network structure is not specified; the inducer needs to select arcs & estimate parameters. Data does not contain missing values. [Figure: records over E, B, A go into the Learner, which produces both the structure and the CPT P(A | E, B).]

Bayes Nets: Slide 58 Copyright © 2001, Andrew W. Moore Known Structure, Incomplete Data Network structure is specified. Data contains missing values; we need to consider assignments to missing values. [Figure: partially observed records over E, B, A plus the fixed structure go into the Learner, which fills in the CPT P(A | E, B).]

Bayes Nets: Slide 59 Copyright © 2001, Andrew W. Moore Unknown Structure, Incomplete Data Network structure is not specified. Data contains missing values; we need to consider assignments to missing values. [Figure: partially observed records over E, B, A go into the Learner, which produces both the structure and the CPT P(A | E, B).]

Bayes Nets: Slide 60 Copyright © 2001, Andrew W. Moore Learning BN Structure from Data (Margaritis, D., Ph.D. Thesis, 2003, CMU) Score Metrics Most implemented method Define a quality metric to be maximized Use greedy search to determine the next best arc to add Stop when metric does not increase by adding an arc Entropy Methods Earliest method Formulated for trees and polytrees Conditional Independence (CI) Define conditional independencies for each node (Markov boundaries) Infer dependencies within Markov boundary Simulated Annealing & Genetic Algorithms

Bayes Nets: Slide 61 Copyright © 2001, Andrew W. Moore Learning a BN structure: The score based method

Bayes Nets: Slide 62 Copyright © 2001, Andrew W. Moore Learning a BN structure: The score based method

Bayes Nets: Slide 63 Copyright © 2001, Andrew W. Moore

Bayes Nets: Slide 64 Copyright © 2001, Andrew W. Moore Sample Score Metrics Bayesian score: p(network structure | database) Information criterion: log p(database | network structure and parameter set) Favors complete networks Commonly add a penalty term on the number of arcs Minimum description length: equivalent to the information criterion with a penalty function Derived using coding theory

Bayes Nets: Slide 65 Copyright © 2001, Andrew W. Moore Score-based Learning Define a scoring function that evaluates how well a structure matches the data, then search for a structure that maximizes the score. [Figure: data over E, B, A and several candidate structures being scored.]

Bayes Nets: Slide 66 Copyright © 2001, Andrew W. Moore Likelihood Score for Structure The score is based on the mutual information between Xi and its parents: larger dependence of Xi on Pai → higher score. Adding arcs always helps, since I(X; Y) ≤ I(X; {Y, Z}), so the max score is attained by a fully connected network. Overfitting: a bad idea…
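To see why adding arcs always helps, it is worth writing the likelihood score out (a standard identity, added here for the reader; Î and Ĥ denote mutual information and entropy estimated from the M training records): log P(D | ML parameters, G) = M · Σi Î(Xi ; Parents(Xi)) − M · Σi Ĥ(Xi). The second sum does not depend on the graph, so maximizing the likelihood means maximizing the mutual information between each variable and its parents; and since I(X; Y) ≤ I(X; {Y, Z}), extra parents can never lower the score, which is exactly the overfitting problem noted above.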

Bayes Nets: Slide 67 Copyright © 2001, Andrew W. Moore Bayesian Score Likelihood score: plug in the maximum-likelihood parameters. Bayesian approach: deal with uncertainty by assigning probability to all possibilities, i.e. score a structure by the marginal likelihood P(D | G) = ∫ P(D | G, θ) P(θ | G) dθ (the likelihood averaged under a prior over the parameters).

Bayes Nets: Slide 68 Copyright © 2001, Andrew W. Moore Scoring a structure The score combines the training-set log-likelihood with a complexity penalty: score = Σj Σk Σv N(Xj=v, parent values in the k’th row of Xj’s probability table) · log P̂(Xj=v | those parent values) − (d/2) · log R, where j runs over the #Attributes, the sum over k runs over all the rows in the probability table for Xj, R is the #Records, d is the number of non-redundant parameters defining the net, and all these values are estimated from data.

Bayes Nets: Slide 69 Copyright © 2001, Andrew W. Moore Scoring a structure All these values are estimated from data. This is called a BIC (Bayes Information Criterion) estimate: the first part is the training-set log-likelihood, the second part is a penalty for too many parameters. BIC asymptotically tries to get the structure right. (There’s a lot of heavy emotional debate about whether this is the best scoring criterion.)
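As an illustration of how the two parts of the BIC score could be computed for the five-node net above, here is a rough Python sketch of my own (it assumes binary variables, records given as dicts of booleans, and the bayes_net dictionary from the slide-24 sketch, and it skips the zero-count corrections a real implementation would need):

import math
from collections import Counter

def bic_score(net, records, variables=("S", "M", "L", "R", "T")):
    # BIC = training-set log-likelihood under ML parameters - (d / 2) * log(#records),
    # where d is the number of non-redundant parameters in the network.
    R = len(records)
    loglik, d = 0.0, 0
    for name in variables:
        parents = net[name]["parents"]
        d += 2 ** len(parents)                     # one free parameter per CPT row for a binary node
        parent_counts = Counter(tuple(r[p] for p in parents) for r in records)
        row_counts = Counter((tuple(r[p] for p in parents), r[name]) for r in records)
        for (pa_vals, val), n in row_counts.items():
            theta = n / parent_counts[pa_vals]     # ML estimate of P(name=val | parents=pa_vals)
            loglik += n * math.log(theta)
    return loglik - (d / 2) * math.log(R)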

Bayes Nets: Slide 70 Copyright © 2001, Andrew W. Moore Bayesian Score: Asymptotic Behavior As M (the amount of data) grows: increasing pressure to fit the dependencies in the empirical distribution; the complexity term avoids fitting noise; asymptotic equivalence to the MDL score; the Bayesian score is consistent: observed data eventually overrides the prior.

Bayes Nets: Slide 71 Copyright © 2001, Andrew W. Moore Searching for structure with best score

Bayes Nets: Slide 72 Copyright © 2001, Andrew W. Moore Structure Discovery Task: Discover structural properties Is there a direct connection between X & Y Does X separate two “subsystems” Does X causally affect Y Example: scientific data mining Disease properties and symptoms Interactions between the expression of genes

Bayes Nets: Slide 73 Copyright © 2001, Andrew W. Moore Discovering Structure Current practice: model selection Pick a single high-scoring model Use that model to infer domain structure [Figure: the posterior P(G|D) concentrated on one structure over E, R, B, A, C.]

Bayes Nets: Slide 74 Copyright © 2001, Andrew W. Moore Discovering Structure Problem Small sample size → many high-scoring models Answer based on one model often useless Want features common to many models [Figure: the posterior P(G|D) spread over several different structures over E, R, B, A, C.]

Bayes Nets: Slide 75 Copyright © 2001, Andrew W. Moore Incomplete Data Data is often incomplete Some variables of interest are not assigned values This phenomenon happens when we have Missing values: Some variables unobserved in some instances Hidden variables: Some variables are never observed We might not even know they exist

Bayes Nets: Slide 76 Copyright © 2001, Andrew W. Moore Incomplete Data In the presence of incomplete data, the likelihood can have multiple maxima. Example: with a hidden variable H and an observed child Y (H → Y), we can rename the values of hidden variable H. If H has two values, the likelihood has two maxima. In practice, many local maxima.

Bayes Nets: Slide 77 Copyright © 2001, Andrew W. Moore Software Many software packages available; see Kevin Murphy's page. Deal: Learning Bayesian Networks in R. MSBNx. Bayes Net Toolbox for Matlab. Hugin: good user interface, implements continuous variables.

Oct 15th, 2001 Copyright © 2001, Andrew W. Moore Bayesian Network Classifiers Carlos López de Castilla Vásquez

Bayes Nets: Slide 79 Copyright © 2001, Andrew W. Moore Chow and Liu (1968) Approximation of discrete probability distributions using dependence trees. A weight is assigned to each arc using the mutual information.

Bayes Nets: Slide 80 Copyright © 2001, Andrew W. Moore Chow and Liu (1968) We look for the dependence tree with maximum weight: there are p-1 arcs and no cycles. [Figure: two dependence trees over X1, X2, X3, X4, X5.] It only considers first-order dependencies.

Bayes Nets: Slide 81 Copyright © 2001, Andrew W. Moore Chow and Liu Algorithm Compute the mutual information between each pair of variables. Build a complete undirected graph in which the nodes correspond to the variables and the weight of the arc connecting Xi with Xj is that mutual information. Extract the dependence tree with maximum weight. Choose a root variable and direct all arcs so that they point away from it. (A sketch of these steps in code appears below.)
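A compact Python sketch of these four steps on discrete data (my own illustration, not code from the course; it estimates mutual information from counts and uses a simple Prim-style greedy search for the maximum-weight tree, returning undirected edges that can then be oriented away from any chosen root):

import math
from collections import Counter

def mutual_information(xs, ys):
    # Empirical mutual information I(X;Y) estimated from paired samples.
    n = len(xs)
    pxy = Counter(zip(xs, ys))
    px, py = Counter(xs), Counter(ys)
    return sum((c / n) * math.log((c / n) / ((px[x] / n) * (py[y] / n)))
               for (x, y), c in pxy.items())

def chow_liu_tree(data):
    # data: dict {variable name: list of observed values}, all lists the same length.
    names = list(data)
    weight = {(a, b): mutual_information(data[a], data[b])
              for i, a in enumerate(names) for b in names[i + 1:]}
    in_tree, edges = {names[0]}, []
    while len(in_tree) < len(names):
        frontier = [(a, b) for (a, b) in weight if (a in in_tree) != (b in in_tree)]
        best = max(frontier, key=lambda e: weight[e])
        edges.append(best)
        in_tree.update(best)
    return edges     # the p-1 undirected arcs of the maximum-weight dependence tree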

Bayes Nets: Slide 82 Copyright © 2001, Andrew W. Moore Tree Augmented Naive Bayes Friedman, Geiger and Goldszmidt (1997) An adaptation of the Chow and Liu algorithm: Naive Bayes with augmenting arcs between the attribute variables. [Figure: class node C pointing to X1–X6, plus tree arcs among the Xi.]

Bayes Nets: Slide 83 Copyright © 2001, Andrew W. Moore Construct-TAN Algorithm Compute the conditional mutual information (given the class) between each pair of attribute variables. Build a complete undirected graph in which the nodes correspond to the variables and the weight of the arc connecting Xi with Xj is that quantity. Extract the dependence tree with maximum weight. Choose a root variable and direct all arcs so that they point away from it. Add the class variable C and add an arc from C to each attribute variable.

Bayes Nets: Slide 84 Copyright © 2001, Andrew W. Moore Link with the CL algorithm TAN uses the conditional mutual information. It has the same time complexity. It only considers first-order dependencies.

Bayes Nets: Slide 85 Copyright © 2001, Andrew W. Moore Diabetes NB 768 instances, 8 variables, 2 classes, discretization. [Figure: Naive Bayes structure with the class pointing to DPF, Diastolic, Age, Mass, Pregnant, Insulin, Triceps and Glucose.] naiveBayes(Diabetes): the error rate is [value not shown in the transcript].
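The naiveBayes(Diabetes) call above is from R; for readers working in Python, a roughly equivalent experiment could look like the following sketch (the file name diabetes_discretized.csv and the column name class are hypothetical, and it assumes the eight attributes have already been discretized and integer-coded):

import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import CategoricalNB

# Hypothetical CSV with the discretized Pima diabetes data:
# eight integer-coded attribute columns plus a "class" column.
df = pd.read_csv("diabetes_discretized.csv")
X, y = df.drop(columns=["class"]), df["class"]

scores = cross_val_score(CategoricalNB(), X, y, cv=10)
print("estimated error rate:", 1 - scores.mean())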

Bayes Nets: Slide 86 Copyright © 2001, Andrew W. Moore Diabetes TAN [Table: pairwise mutual information between Pregnant, Glucose, Diastolic, Triceps, Insulin, Mass, DPF and Age; values not shown.]

Bayes Nets: Slides 87-95 Copyright © 2001, Andrew W. Moore Diabetes TAN [Figures: across these slides the maximum-weight dependence tree over DPF, Diastolic, Age, Mass, Insulin, Triceps, Glucose and Pregnant is built up one arc at a time.]

Bayes Nets: Slide 96 Copyright © 2001, Andrew W. Moore Diabetes TAN [Figure: final TAN structure over DPF, Diastolic, Age, Mass, Insulin, Triceps, Glucose and Pregnant.] TAN(Diabetes): the error rate is [value not shown in the transcript].

Bayes Nets: Slide 97 Copyright © 2001, Andrew W. Moore Basic References Pearl, J. (1988). Probabilistic Reasoning in Intelligent Systems. San Mateo, CA: Morgan Kaufmann. Oliver, R.M. and Smith, J.Q. (eds.) (1990). Influence Diagrams, Belief Nets, and Decision Analysis. Chichester: Wiley. Neapolitan, R.E. (1990). Probabilistic Reasoning in Expert Systems. New York: Wiley. Schum, D.A. (1994). The Evidential Foundations of Probabilistic Reasoning. New York: Wiley. Jensen, F.V. (1996). An Introduction to Bayesian Networks. New York: Springer. Chang, K.C. and Fung, R. (1995). Symbolic Probabilistic Inference with Both Discrete and Continuous Variables. IEEE SMC, 25(6).

Bayes Nets: Slide 98 Copyright © 2001, Andrew W. Moore Algorithm References Cooper, G.F. (1990). The computational complexity of probabilistic inference using Bayesian belief networks. Artificial Intelligence, 42. Jensen, F.V., Lauritzen, S.L., and Olesen, K.G. (1990). Bayesian Updating in Causal Probabilistic Networks by Local Computations. Computational Statistics Quarterly. Lauritzen, S.L. and Spiegelhalter, D.J. (1988). Local computations with probabilities on graphical structures and their application to expert systems. J. Royal Statistical Society B, 50(2).