PatReco: Bayesian Networks Alexandros Potamianos Dept of ECE, Tech. Univ. of Crete Fall
Definitions
Bayesian networks consist of nodes and (usually directed) arcs.
Nodes (or states) represent a classification class or, more generally, an event, and each is described by a pdf.
Arcs represent relations between nodes, e.g., cause and effect or time sequence.
Two nodes that are connected only via another node are conditionally independent given that node, unless that node is a common effect of both (see the conditional independence cases).
When to use Bayesian nets
Bayesian networks (or networks of inference) are statistical models used for classification (or, in general, pattern recognition) problems in which there are dependencies among classes, e.g., time dependencies or cause-and-effect dependencies.
Conditional Independence
Full independence between A and B:
  P(A|B) = P(A) or P(A,B) = P(A) P(B)
Conditional independence of A, B given C:
  P(A|B,C) = P(A|C) or P(A,B|C) = P(A|C) P(B|C)
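The two definitions above can be checked numerically on a small joint table. The following is a minimal sketch, assuming three binary variables and made-up probabilities (the names `p_c`, `p_a_given_c`, `p_b_given_c` and all numbers are illustrative, not from the slides). The joint is built so that A and B are conditionally independent given C, and the checks show that this does not make them fully independent.

```python
from itertools import product

# Hypothetical parameters (not from the slides): C is a common parent of A and B.
p_c = {0: 0.6, 1: 0.4}            # P(C=c)
p_a_given_c = {0: 0.2, 1: 0.7}    # P(A=1 | C=c)
p_b_given_c = {0: 0.5, 1: 0.9}    # P(B=1 | C=c)


def bernoulli(p_one, value):
    """Probability of a binary value when P(value=1) = p_one."""
    return p_one if value == 1 else 1.0 - p_one


# Joint P(A,B,C) = P(C) P(A|C) P(B|C), keyed by (a, b, c).
joint = {
    (a, b, c): p_c[c] * bernoulli(p_a_given_c[c], a) * bernoulli(p_b_given_c[c], b)
    for a, b, c in product((0, 1), repeat=3)
}


def marginal(**fixed):
    """Sum the joint over every entry that matches the fixed values of a, b, c."""
    total = 0.0
    for (a, b, c), p in joint.items():
        values = {"a": a, "b": b, "c": c}
        if all(values[name] == v for name, v in fixed.items()):
            total += p
    return total


def conditionally_independent(tol=1e-12):
    """Check P(A,B|C) = P(A|C) P(B|C) for every value combination."""
    for a, b, c in product((0, 1), repeat=3):
        pc = marginal(c=c)
        lhs = marginal(a=a, b=b, c=c) / pc
        rhs = (marginal(a=a, c=c) / pc) * (marginal(b=b, c=c) / pc)
        if abs(lhs - rhs) > tol:
            return False
    return True


def fully_independent(tol=1e-12):
    """Check P(A,B) = P(A) P(B) for every value combination."""
    for a, b in product((0, 1), repeat=2):
        if abs(marginal(a=a, b=b) - marginal(a=a) * marginal(b=b)) > tol:
            return False
    return True


print(conditionally_independent(), fully_independent())
```

With these illustrative numbers the first check holds and the second fails, since A and B both depend on C.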
Conditional Independence
A, C independent given B: P(C|B,A) = P(C|B)
B, C independent given A: P(B,C|A) = P(B|A) P(C|A)
A, C dependent given B: P(A,C|B) cannot be reduced!
[Figure: three example networks over the nodes A, B, C, one for each case above]
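The third case is the least intuitive, so a small numeric illustration may help. This sketch assumes the dependent case corresponds to B being a common effect of A and C (arcs A to B and C to B), and uses invented probabilities. It measures how far A and C are from independence before and after conditioning on B=1.

```python
from itertools import product

# Hypothetical parameters (not from the slides) for a common-effect network A -> B <- C.
p_a = 0.3                         # P(A=1)
p_c = 0.4                         # P(C=1)
p_b_given_ac = {                  # P(B=1 | A=a, C=c)
    (0, 0): 0.05,
    (0, 1): 0.80,
    (1, 0): 0.70,
    (1, 1): 0.95,
}


def bernoulli(p_one, value):
    """Probability of a binary value when P(value=1) = p_one."""
    return p_one if value == 1 else 1.0 - p_one


# Joint P(A,B,C) = P(A) P(C) P(B|A,C), keyed by (a, b, c).
joint = {
    (a, b, c): bernoulli(p_a, a) * bernoulli(p_c, c) * bernoulli(p_b_given_ac[(a, c)], b)
    for a, b, c in product((0, 1), repeat=3)
}


def prob(a=None, b=None, c=None):
    """Marginal probability of the given values; None means 'summed out'."""
    return sum(
        p
        for (x, y, z), p in joint.items()
        if (a is None or x == a) and (b is None or y == b) and (c is None or z == c)
    )


# Without evidence on B: |P(A=1,C=1) - P(A=1) P(C=1)| is zero up to rounding.
marginal_gap = abs(prob(a=1, c=1) - prob(a=1) * prob(c=1))

# Given B=1: |P(A=1,C=1|B=1) - P(A=1|B=1) P(C=1|B=1)| is clearly non-zero.
p_b1 = prob(b=1)
conditional_gap = abs(
    prob(a=1, b=1, c=1) / p_b1 - (prob(a=1, b=1) / p_b1) * (prob(b=1, c=1) / p_b1)
)

print(marginal_gap, conditional_gap)
```

Observing the common effect couples its causes, which is why P(A,C|B) does not factor in this case.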
Three problems
1. Probability computation (use independence)
2. Training/Parameter Estimation
   Maximum likelihood (ML) if all data is observable
   Expectation maximization (EM) if data is missing
3. Inference (Testing)
   Diagnosis: P(cause|effect), bottom-up
   Prediction: P(effect|cause), top-down
Probability Computation
For a Bayesian network that consists of N nodes:
1. Compute $P(n_1, n_2, \ldots, n_N)$ using the chain rule, starting from the "last/bottom" node and working your way up:
   $P(n_1, n_2, \ldots, n_N) = P(n_N \mid n_1, n_2, \ldots, n_{N-1}) \, P(n_{N-1} \mid n_1, n_2, \ldots, n_{N-2}) \cdots P(n_2 \mid n_1) \, P(n_1)$
2. Identify the conditional independence conditions from the Bayesian network topology
3. Simplify the conditional probabilities using the independence conditions
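Probability Computation
Topology: [Figure: network over the nodes C, S, R, W; the factorization below implies the arcs C→S, C→R, S→W, R→W]
Chain rule: P(C,S,R,W) = P(W|C,S,R) P(S|C,R) P(R|C) P(C)
Independent: (W,C)|S,R and (S,R)|C
Dependent: (S,R)|W
Simplified: P(C,S,R,W) = P(W|S,R) P(S|C) P(R|C) P(C)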
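The simplified factorization above can be evaluated directly once the four conditional tables are known. The following is a minimal sketch assuming binary variables; every number in the tables is an illustrative assumption (the slides give no parameter values), and the helper names are hypothetical.

```python
from itertools import product

# Illustrative conditional probability tables (assumed values, not from the slides).
P_C1 = 0.5                                   # P(C=1)
P_S1_GIVEN_C = {0: 0.5, 1: 0.1}              # P(S=1 | C=c)
P_R1_GIVEN_C = {0: 0.2, 1: 0.8}              # P(R=1 | C=c)
P_W1_GIVEN_SR = {                            # P(W=1 | S=s, R=r)
    (0, 0): 0.0,
    (1, 0): 0.9,
    (0, 1): 0.9,
    (1, 1): 0.99,
}


def bernoulli(p_one, value):
    """Probability of a binary value when P(value=1) = p_one."""
    return p_one if value == 1 else 1.0 - p_one


def joint(c, s, r, w):
    """P(C,S,R,W) = P(W|S,R) P(S|C) P(R|C) P(C)."""
    return (
        bernoulli(P_W1_GIVEN_SR[(s, r)], w)
        * bernoulli(P_S1_GIVEN_C[c], s)
        * bernoulli(P_R1_GIVEN_C[c], r)
        * bernoulli(P_C1, c)
    )


# Sanity check: the 16 joint probabilities must sum to one.
total = sum(joint(c, s, r, w) for c, s, r, w in product((0, 1), repeat=4))
print(total, joint(c=1, s=0, r=1, w=1))
```

Each joint probability is a product of four table lookups, so no 16-entry joint table ever has to be stored.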
Probability Computation
There are general algorithms for identifying cliques in the Bayesian net.
Cliques are islands of conditional dependence, i.e., terms in the probability computation that cannot be further reduced.
[Figure: clique network with the cliques SC, WSR, RC]
Training/Parameter Estimation
Instead of estimating the joint pdf of the whole network, the joint pdf of each of the cliques is estimated.
For example, if the network joint pdf is
  P(C,S,R,W) = P(W|S,R) P(S|C) P(R|C) P(C)
then instead of computing P(C,S,R,W) directly we compute each of P(W|S,R), P(S|C), P(R|C), P(C) for all possible values of W, S, R, C (much simpler).
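One way to see why the factored form is simpler is to count free parameters. The sketch below assumes every node is binary and encodes the parent sets implied by the factorization above; the function names are hypothetical.

```python
# Parent sets read off P(W|S,R) P(S|C) P(R|C) P(C).
parents = {"C": [], "S": ["C"], "R": ["C"], "W": ["S", "R"]}

# Assumption: every node is binary (two possible values).
cardinality = {name: 2 for name in parents}


def free_parameters_factored(parents, cardinality):
    """One distribution per parent configuration, each with (k - 1) free values."""
    total = 0
    for node, node_parents in parents.items():
        parent_configs = 1
        for parent in node_parents:
            parent_configs *= cardinality[parent]
        total += parent_configs * (cardinality[node] - 1)
    return total


def free_parameters_full_joint(cardinality):
    """A full joint table has one entry per configuration, minus one for normalization."""
    size = 1
    for k in cardinality.values():
        size *= k
    return size - 1


print(free_parameters_factored(parents, cardinality))   # 9
print(free_parameters_full_joint(cardinality))          # 15
```

The gap widens quickly as nodes are added, because the full joint grows exponentially with the number of nodes while the factored form grows only with the size of each node's parent set.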
Training/Parameter Estimation
For fully observable data and discrete probabilities, compute maximum likelihood estimates of the parameters from counts, e.g.,
  $P(W=1 \mid S=1, R=0)_{ML} = \dfrac{\mathrm{counts}(W=1, S=1, R=0)}{\mathrm{counts}(W=*, S=1, R=0)}$
Training/Parameter Estimation
Example: the following observations are given as tuples (W,C,S,R):
  (1,0,1,0), (0,0,1,0), (1,1,1,0), (0,1,1,0), (1,0,1,0), (0,1,0,0), (1,0,0,1), (0,1,1,1), (1,1,1,0)
Using maximum likelihood estimation:
  P(W=1|S=1,R=0)_ML = #(1,*,1,0) / #(*,*,1,0) = 4/6 ≈ 0.67
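The counting estimate can be reproduced mechanically from the listed observations. This is a minimal sketch; the function name `ml_counts` is hypothetical, and the tuple order (W, C, S, R) follows the example.

```python
# Observations from the example, in the order (W, C, S, R).
observations = [
    (1, 0, 1, 0), (0, 0, 1, 0), (1, 1, 1, 0), (0, 1, 1, 0), (1, 0, 1, 0),
    (0, 1, 0, 0), (1, 0, 0, 1), (0, 1, 1, 1), (1, 1, 1, 0),
]


def ml_counts(observations, w, s, r):
    """Return (numerator, denominator) counts for the ML estimate of P(W=w | S=s, R=r)."""
    # Denominator: tuples matching the conditioning values, W and C free.
    matching_parents = [o for o in observations if o[2] == s and o[3] == r]
    if not matching_parents:
        raise ValueError("no observations with the requested values of S and R")
    # Numerator: of those, the tuples that also match W.
    matching_all = [o for o in matching_parents if o[0] == w]
    return len(matching_all), len(matching_parents)


numerator, denominator = ml_counts(observations, w=1, s=1, r=0)
estimate = numerator / denominator
print(numerator, denominator, estimate)
```

For the nine tuples listed, six have S=1 and R=0, and four of those also have W=1.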
Training/Parameter Estimation
When data is unobservable or missing, the EM algorithm is employed.
There are efficient implementations of the EM algorithm for Bayesian nets that operate on the clique network.
When the topology of the Bayesian network is not known, structural EM can be used.
Inference
There are two types of inference (testing):
  Diagnosis: P(cause|effect), bottom-up
  Prediction: P(effect|cause), top-down
Once the parameters of the network are estimated, the joint network pdf can be computed for ALL possible network values.
Inference is simply probability computation using the network pdf.
Inference
For example,
  $P(W=1 \mid C=1) = \dfrac{P(W=1, C=1)}{P(C=1)}$
where
  $P(W=1, C=1) = \sum_{R,S} P(W=1, C=1, R, S)$
  $P(C=1) = \sum_{R,W,S} P(W, C=1, R, S)$
Inference
Efficient algorithms exist for performing inference in large networks; they operate on the clique network.
Inference is often posed as a probability maximization problem, e.g., what is the most probable cause or effect?
  $\arg\max_{W} P(W \mid C=1)$
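For a network this small, inference can be done by brute-force summation over the joint, without the clique-based algorithms mentioned above. The sketch below assumes binary variables and illustrative table values (none of the numbers come from the slides; `probability` and `conditional` are hypothetical helper names). It computes a prediction, a diagnosis, and the maximizing value of W given C=1.

```python
from itertools import product

VARIABLES = ("C", "S", "R", "W")

# Illustrative conditional probability tables (assumed values, not from the slides).
P_C1 = 0.5                                   # P(C=1)
P_S1_GIVEN_C = {0: 0.5, 1: 0.1}              # P(S=1 | C=c)
P_R1_GIVEN_C = {0: 0.2, 1: 0.8}              # P(R=1 | C=c)
P_W1_GIVEN_SR = {(0, 0): 0.0, (1, 0): 0.9, (0, 1): 0.9, (1, 1): 0.99}


def bernoulli(p_one, value):
    """Probability of a binary value when P(value=1) = p_one."""
    return p_one if value == 1 else 1.0 - p_one


def joint(C, S, R, W):
    """P(C,S,R,W) = P(W|S,R) P(S|C) P(R|C) P(C)."""
    return (
        bernoulli(P_W1_GIVEN_SR[(S, R)], W)
        * bernoulli(P_S1_GIVEN_C[C], S)
        * bernoulli(P_R1_GIVEN_C[C], R)
        * bernoulli(P_C1, C)
    )


def probability(**fixed):
    """Marginal probability of the fixed values, summing the joint over all other variables."""
    total = 0.0
    for values in product((0, 1), repeat=len(VARIABLES)):
        assignment = dict(zip(VARIABLES, values))
        if all(assignment[name] == v for name, v in fixed.items()):
            total += joint(**assignment)
    return total


def conditional(target, evidence):
    """P(target | evidence) = P(target, evidence) / P(evidence)."""
    return probability(**target, **evidence) / probability(**evidence)


prediction = conditional({"W": 1}, {"C": 1})   # top-down: P(effect | cause)
diagnosis = conditional({"C": 1}, {"W": 1})    # bottom-up: P(cause | effect)

# Maximization form: the value of W with the highest probability given C=1.
most_probable_w = max((0, 1), key=lambda w: conditional({"W": w}, {"C": 1}))

print(prediction, diagnosis, most_probable_w)
```

Enumeration costs time exponential in the number of nodes, which is the reason large networks need the clique-based algorithms.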
Continuous Case
In our examples the network nodes represented discrete events (states or classes).
Network nodes often hold continuous variables (observations), e.g., length, energy.
For the continuous case, parametric pdfs are introduced and their parameters are estimated using ML (observed data) or EM (hidden data).
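As an illustration of the fully observed continuous case, the sketch below fits one Gaussian per class by maximum likelihood. The choice of a Gaussian as the parametric pdf is an assumption made for the example, and the labeled samples are invented.

```python
import math
from statistics import fmean, pvariance

# Hypothetical labeled observations: (class label, measured length).
samples = [(0, 1.0), (0, 1.2), (0, 0.8), (1, 3.1), (1, 2.9), (1, 3.0)]


def fit_gaussians(samples):
    """ML estimate of (mean, variance) per class; the ML variance divides by n, not n - 1."""
    by_class = {}
    for label, x in samples:
        by_class.setdefault(label, []).append(x)
    return {label: (fmean(xs), pvariance(xs)) for label, xs in by_class.items()}


def gaussian_pdf(x, mean, variance):
    """Density of a univariate Gaussian with the given mean and variance."""
    return math.exp(-((x - mean) ** 2) / (2.0 * variance)) / math.sqrt(2.0 * math.pi * variance)


params = fit_gaussians(samples)

# Class-conditional densities of a new observation x = 1.1 under each fitted pdf.
likelihoods = {label: gaussian_pdf(1.1, mean, var) for label, (mean, var) in params.items()}
print(params, likelihoods)
```

The fitted densities play the same role as the discrete conditional tables: they supply the P(observation | class) factor of the network pdf.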
Some Applications
Medical diagnosis
Computer problem diagnosis (MS)
Markov chains
Hidden Markov Models (HMMs)
Conclusions
Bayesian networks are used to represent dependencies between classes.
Network topology defines conditional independence conditions that simplify the network pdf modeling and computation.
Three problems: probability computation, estimation/training, inference/testing.