Belief Networks. Qian Liu, CSE 391, Spring 2005, University of Pennsylvania.
Outline: Motivation; BN = DAG + CPTs; Inference; Learning; Applications of BNs.
From applications: What kinds of problems can be solved by Belief Networks? Classifiers (classifying email, web pages, ...), medical diagnosis, the troubleshooting system in MS Windows, the bouncy paperclip assistant in MS Word, speech recognition, and gene finding. Also known as: Bayesian Network, Causal Network, Directed Graphical Model, ...
From representation: How can we describe the joint distribution of N binary random variables, P(X1, X2, ..., XN)? If we write the joint distribution out as a table, how many entries does the table have?

  X1, ..., XN     P
  0, 0, ..., 0    0.008
  0, 0, ..., 1    0.001
  ...             ...

Number of entries = 2^N. Too many! Is there a cleverer way of representing the joint distribution? Yes: the Belief Network.
Outline: Motivation; BN = DAG + CPTs; Inference; Learning; Applications of BNs.
What is a BN: BN = DAG + CPTs. A Belief Network consists of a Directed Acyclic Graph (DAG) and a set of Conditional Probability Tables (CPTs). In the DAG, nodes are random variables and directed edges represent causal relations. Each random variable Xi has a CPT, which specifies P(Xi | Parents(Xi)). The BN specifies a joint distribution on the variables: P(X1, ..., XN) = Π_i P(Xi | Parents(Xi)).
Alarm example. Random variables: Burglary, Earthquake, Alarm, JohnCalls, MaryCalls. Causal relationships among the variables: a burglary can set the alarm off; an earthquake can set the alarm off; the alarm can cause Mary to call; the alarm can cause John to call. Causal relations reflect domain knowledge.
Alarm example (cont.): [figure: the alarm network DAG, with Burglary and Earthquake as parents of Alarm, and Alarm as the parent of JohnCalls and MaryCalls, together with the CPT of each node]
Joint distribution: the BN specifies a joint distribution on the variables, P(X1, ..., XN) = Π_i P(Xi | Parents(Xi)). For the alarm example: P(B, E, A, J, M) = P(B) P(E) P(A|B,E) P(J|A) P(M|A), using the shorthand notation B = Burglary, E = Earthquake, A = Alarm, J = JohnCalls, M = MaryCalls.
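To make the DAG + CPTs picture concrete, here is a minimal Python sketch of the alarm network as a parent list plus CPT dictionaries; the numeric CPT values are illustrative placeholders rather than the values from the lecture's figure, and the joint probability is computed exactly as the product above.

```python
# Minimal sketch of the alarm Belief Network as DAG + CPTs.
# The CPT numbers below are illustrative placeholders, not the lecture's values.

parents = {
    "B": [],          # Burglary
    "E": [],          # Earthquake
    "A": ["B", "E"],  # Alarm depends on Burglary and Earthquake
    "J": ["A"],       # JohnCalls depends on Alarm
    "M": ["A"],       # MaryCalls depends on Alarm
}

# Each CPT maps (parent values...) -> P(variable = 1 | parents).
cpt = {
    "B": {(): 0.001},
    "E": {(): 0.002},
    "A": {(1, 1): 0.95, (1, 0): 0.94, (0, 1): 0.29, (0, 0): 0.001},
    "J": {(1,): 0.90, (0,): 0.05},
    "M": {(1,): 0.70, (0,): 0.01},
}

def joint(assignment):
    """P(X1,...,XN) = product over nodes of P(Xi | Parents(Xi))."""
    p = 1.0
    for var in parents:
        pa_vals = tuple(assignment[pa] for pa in parents[var])
        p1 = cpt[var][pa_vals]                      # P(var = 1 | parent values)
        p *= p1 if assignment[var] == 1 else 1.0 - p1
    return p

# e.g. P(B=0, E=0, A=1, J=1, M=0)
print(joint({"B": 0, "E": 0, "A": 1, "J": 1, "M": 0}))
```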
Another example: [figure: another example network with its CPTs and joint probability]. Convention for writing the joint probability: write each variable in a "cause before its effects" order. Belief Networks are generative models, which can generate data in this "cause before its effects" order.
Compactness: a Belief Network offers a simple and compact way of representing the joint distribution of many random variables. The number of entries in a full joint distribution table is ~2^N. But for a BN, if each variable has at most k parents, the total number of entries in all the CPTs is ~N * 2^k. In practice k << N, so we save a lot of space!
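As a quick sanity check on these counts (the particular N and k below are just example numbers, not from the slides):

```python
# Full joint table vs. BN CPTs for N binary variables with at most k parents each.
N, k = 30, 3
full_joint_entries = 2 ** N        # 1,073,741,824 entries (~10^9)
bn_cpt_entries = N * 2 ** k        # at most 240 entries
print(full_joint_entries, bn_cpt_entries)
```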
Outline: Motivation; BN = DAG + CPTs; Inference; Learning; Applications of BNs.
Inference: the task of inference is to compute the posterior probability of a set of query variables Q given a set of evidence variables E, denoted P(Q | E), assuming the Belief Network is known. Since the joint distribution is known, P(Q | E) can be computed naively and inefficiently by the product rule, P(Q | E) = P(Q, E) / P(E), together with marginalization (summing the joint distribution over all the unobserved variables).
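A naive inference-by-enumeration sketch in Python, reusing the `parents` and `joint()` definitions from the earlier alarm sketch (so the CPT values, and therefore the printed number, are only placeholders): sum the joint over the hidden variables, then normalize.

```python
from itertools import product

def query(q_var, evidence):
    """Naive P(q_var = 1 | evidence) by enumerating the full joint distribution."""
    hidden = [v for v in parents if v != q_var and v not in evidence]
    dist = {}
    for q_val in (0, 1):
        total = 0.0
        for h_vals in product((0, 1), repeat=len(hidden)):
            assignment = dict(evidence)
            assignment[q_var] = q_val
            assignment.update(zip(hidden, h_vals))
            total += joint(assignment)           # marginalize out the hidden variables
        dist[q_val] = total
    z = dist[0] + dist[1]                        # normalization constant = P(evidence)
    return dist[1] / z

# e.g. P(Burglary = 1 | JohnCalls = 1, MaryCalls = 1)
print(query("B", {"J": 1, "M": 1}))
```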
Conditional independence. A and B are (unconditionally) independent if:
  (1) P(A,B) = P(A) P(B)   (I.)
  (2) P(A|B) = P(A)        (I.)
  (3) P(B|A) = P(B)        (I.)
A and B are conditionally independent, given evidence E, if:
  (1) P(A,B|E) = P(A|E) P(B|E)   (C.I.)
  (2) P(A|B,E) = P(A|E)          (C.I.)
  (3) P(B|A,E) = P(B|E)          (C.I.)
In each case, (1), (2), and (3) are equivalent.
Conditional independence. By the chain rule:
  P(B,E,A,J,M) = P(B) * P(E|B) * P(A|B,E) * P(J|B,E,A) * P(M|B,E,A,J)
In the BN:
  P(B,E,A,J,M) = P(B) * P(E)      (B is I. of E)
                      * P(A|B,E)
                      * P(J|A)     (J is C.I. of B,E, given A)
                      * P(M|A)     (M is C.I. of B,E,J, given A)
Belief Networks exploit conditional independence among variables so as to represent the joint distribution compactly.
Conditional independence. A Belief Network encodes conditional independence in its graph structure: (1) a variable is C.I. of its non-descendants, given all its parents; (2) a variable is C.I. of all the other variables, given its parents, its children, and its children's parents, that is, given its Markov blanket.
Conditional independence in the alarm example: B is independent of E (equivalently, B is C.I. of E, given nothing), by (1); J is C.I. of B, E, M, given A, by (1); M is C.I. of B, E, J, given A, by (1). Another example (from a network shown only in the slide figure): U is C.I. of X, given Y, V, Z, by (2).
Examples of inference. Alarm example: we know the CPTs P(B), P(E), P(A|B,E), P(J|A), P(M|A) and the conditional independences. What is P(A)?
  P(A) = Σ_{B,E} P(B, E, A)              (marginalization)
       = Σ_{B,E} P(B) P(E|B) P(A|B,E)    (chain rule)
       = Σ_{B,E} P(B) P(E) P(A|B,E)      (B, E are independent)
Examples of inference. Alarm example: what is P(J, M)?
  P(J, M) = Σ_A P(A, J, M)              (marginalization)
          = Σ_A P(A) P(J|A) P(M|A,J)    (chain rule)
          = Σ_A P(A) P(J|A) P(M|A)      (J is C.I. of M, given A)
where P(A) is computed as on the previous slide.
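Both queries can also be checked numerically with brute-force marginalization over the joint from the earlier sketch (placeholder CPT values, so the printed numbers are only illustrative):

```python
from itertools import product

def marginal(target):
    """P(target assignment) = sum of the joint over all the other variables."""
    others = [v for v in parents if v not in target]
    total = 0.0
    for vals in product((0, 1), repeat=len(others)):
        assignment = dict(target)
        assignment.update(zip(others, vals))
        total += joint(assignment)
    return total

print(marginal({"A": 1}))             # P(A = 1)
print(marginal({"J": 1, "M": 1}))     # P(J = 1, M = 1)
```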
Outline: Motivation; BN = DAG + CPTs; Inference; Learning; Applications of BNs.
Learning. The task of learning is to find the Belief Network that best describes the data we observe. If we assume the DAG is known, the learning problem reduces to learning the best CPTs from data, according to some "goodness" criterion. Note: there are many kinds of learning (different things to learn, different "goodness" criteria, ...); we only discuss the simplest kind here.
Training data: all variables are binary; we observe T examples, assumed to be independently and identically distributed (i.i.d.) from the underlying distribution. Task: how do we learn the best CPTs from the T examples?

  Example t   X1, X2, ..., Xn
  1           0, 0, 0, ..., 0
  2           0, 1, 0, ..., 0
  3           ...
  4           0, 1, 1, ..., 0
  5           1, 1, 1, ..., 1
  ...         ...
  T           0, 1, 0, ..., 1
Think of learning this way: someone used a DAG and a set of CPTs to generate the training data (the table above). You are given the DAG and the training data, and you are asked to guess which CPTs were most likely used to generate the training data.
Given the CPTs, we can write down the probability of the t-th example. E.g., for the alarm example, if the t-th example is an assignment (b, e, a, j, m), then P(example t) = P(b) P(e) P(a|b,e) P(j|a) P(m|a).
Given the CPTs:
  Probability of the t-th example: P(example t) = Π_i P(Xi = x_i(t) | Parents(Xi) = pa_i(t))
  Probability of all the data (i.i.d.): P(data) = Π_{t=1..T} P(example t)
  Log-likelihood of the data: L = log P(data) = Σ_{t=1..T} log P(example t)
The log-likelihood of the data is a function of the CPTs. Which CPTs are the best?
Maximum-likelihood learning. The log-likelihood of the data is a function of the CPTs, L(CPTs) = Σ_t log P(example t). So the goodness criterion for the CPTs is the log-likelihood of the data, and the best CPTs are the ones that maximize the log-likelihood: CPTs* = argmax L(CPTs).
Maximum-likelihood learning, mathematical formulation: maximize L(CPTs) subject to the constraints that the probabilities in each CPT (for each parent configuration) sum up to 1. This is constrained optimization with equality constraints, solved with Lagrange multipliers (which you have probably seen in your calculus class). You can solve it yourself; it is not hard at all, and it is a very common technique in machine learning. A derivation sketch follows below.
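A sketch of the Lagrange-multiplier derivation the slide leaves as an exercise, for a single variable Xi and a single parent configuration pa; here θ_{x|pa} denotes P(Xi = x | Parents(Xi) = pa) and N(x, pa) is the number of training examples with that configuration (notation introduced here, not on the slide).

```latex
% The log-likelihood terms involving theta_{x|pa} are  sum_x N(x,pa) log theta_{x|pa},
% subject to  sum_x theta_{x|pa} = 1.
\begin{align*}
\Lambda &= \sum_{x} N(x,\mathrm{pa}) \log \theta_{x|\mathrm{pa}}
          \;-\; \lambda \Big( \sum_{x} \theta_{x|\mathrm{pa}} - 1 \Big) \\
\frac{\partial \Lambda}{\partial \theta_{x|\mathrm{pa}}}
        &= \frac{N(x,\mathrm{pa})}{\theta_{x|\mathrm{pa}}} - \lambda = 0
        \;\;\Rightarrow\;\; \theta_{x|\mathrm{pa}} = \frac{N(x,\mathrm{pa})}{\lambda} \\
\sum_{x} \theta_{x|\mathrm{pa}} = 1
        &\;\;\Rightarrow\;\; \lambda = \sum_{x'} N(x',\mathrm{pa})
        \;\;\Rightarrow\;\;
        \theta_{x|\mathrm{pa}} = \frac{N(x,\mathrm{pa})}{\sum_{x'} N(x',\mathrm{pa})}
\end{align*}
```

This is exactly the count-ratio solution on the next slide.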
ML solution. Nicely, there is a closed-form solution to this constrained optimization problem: the maximum-likelihood estimate of each CPT entry is a ratio of counts in the training data, P(Xi = x | Parents(Xi) = pa) = count(Xi = x, Parents(Xi) = pa) / count(Parents(Xi) = pa). And the solution is very intuitive: each conditional probability is just the corresponding empirical frequency.
ML learning example. Three binary variables X, Y, Z; T = 1000 examples in the training data, summarized by counts:

  X, Y, Z   count
  0, 0, 0   230
  0, 0, 1   100
  0, 1, 0    70
  0, 1, 1    50
  1, 0, 0   110
  1, 0, 1   150
  1, 1, 0   160
  1, 1, 1   130

(The raw examples, e.g. example 1 = (0, 0, 0), example 2 = (0, 1, 0), ..., are summarized by the count table above.)
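A small Python sketch computing the ML estimates from the count table; since the slide's DAG is in a figure not reproduced here, the chain X -> Y -> Z is assumed purely for illustration.

```python
# ML estimates from the count table, assuming (for illustration only) the chain
# DAG  X -> Y -> Z; the slide's actual DAG is in a figure not reproduced here.
counts = {
    (0, 0, 0): 230, (0, 0, 1): 100, (0, 1, 0): 70,  (0, 1, 1): 50,
    (1, 0, 0): 110, (1, 0, 1): 150, (1, 1, 0): 160, (1, 1, 1): 130,
}
T = sum(counts.values())                      # 1000 examples

def freq(pred):
    """Total count of examples (x, y, z) satisfying the predicate."""
    return sum(c for (x, y, z), c in counts.items() if pred(x, y, z))

# P(X = 1) = count(X = 1) / T
p_x1 = freq(lambda x, y, z: x == 1) / T
# P(Y = 1 | X = xv) = count(X = xv, Y = 1) / count(X = xv)
p_y1_given_x = {
    xv: freq(lambda x, y, z, xv=xv: x == xv and y == 1)
        / freq(lambda x, y, z, xv=xv: x == xv)
    for xv in (0, 1)
}
# P(Z = 1 | Y = yv) = count(Y = yv, Z = 1) / count(Y = yv)
p_z1_given_y = {
    yv: freq(lambda x, y, z, yv=yv: y == yv and z == 1)
        / freq(lambda x, y, z, yv=yv: y == yv)
    for yv in (0, 1)
}
print(p_x1)           # 0.55
print(p_y1_given_x)   # {0: 0.266..., 1: 0.527...}
print(p_z1_given_y)   # {0: 0.423..., 1: 0.439...}
```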
Outline: Motivation; BN = DAG + CPTs; Inference; Learning; Applications of BNs.
Naïve Bayes Classifier. Represent an object by its attributes X1, ..., Xn, with a class label C. The joint probability is P(C, X1, ..., Xn) = P(C) Π_i P(Xi | C): the attributes are conditionally independent given the class. Learn the CPTs P(C) and P(Xi | C) from training data; classify a new object by computing the posterior probability of each class given the attributes.
Naïve Bayes Classifier, inference:
  P(C | X1, ..., Xn) = P(C, X1, ..., Xn) / P(X1, ..., Xn)                         (Bayes rule)
                     = P(C, X1, ..., Xn) / Σ_c P(C = c, X1, ..., Xn)              (marginalization)
                     = P(C) Π_i P(Xi | C) / Σ_c [ P(C = c) Π_i P(Xi | C = c) ]    (C.I. of the attributes, given the class)
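A minimal Naïve Bayes sketch in Python using the count-based ML estimates from the learning section; the tiny dataset and the class names are made up for illustration, and there is no smoothing, so zero counts give zero probabilities.

```python
from collections import defaultdict

def train_nb(examples):
    """ML Naive Bayes for binary attributes: estimate P(C) and P(Xi = 1 | C) by counting."""
    n_attrs = len(examples[0][0])
    class_counts = defaultdict(int)
    ones = defaultdict(lambda: [0] * n_attrs)    # ones[c][i] = #(Xi = 1, C = c)
    for attrs, c in examples:
        class_counts[c] += 1
        for i, v in enumerate(attrs):
            ones[c][i] += v
    T = len(examples)
    prior = {c: n / T for c, n in class_counts.items()}
    cond = {c: [ones[c][i] / class_counts[c] for i in range(n_attrs)] for c in class_counts}
    return prior, cond

def classify_nb(prior, cond, attrs):
    """Pick argmax over classes of P(c) * prod_i P(xi | c)."""
    def score(c):
        s = prior[c]
        for i, v in enumerate(attrs):
            p1 = cond[c][i]
            s *= p1 if v == 1 else 1.0 - p1
        return s
    return max(prior, key=score)

# Toy data: (attribute vector, class label); purely illustrative.
data = [((1, 0, 1), "spam"), ((1, 1, 1), "spam"), ((0, 0, 1), "ham"), ((0, 0, 0), "ham")]
prior, cond = train_nb(data)
print(classify_nb(prior, cond, (1, 0, 1)))   # -> "spam"
```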
Medical diagnosis: the QMR-DT model (Shwe et al., 1991). Learning: the prior probability of each disease, and the conditional probability of each finding given its parents. Inference: given the findings of some patient, which is/are the most probable disease(s) causing these findings?
Hidden Markov Model: a sequence / time-series model, with hidden states Q and observations Y. Speech recognition: the observations are the utterance/waveform, the states are words. Gene finding: the observations are the genomic sequence, the states are gene/no-gene and the different components of a gene.
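The HMM is only named on this slide, so the following is a generic sketch rather than the lecture's model: the forward recursion computing P(observations) for a small made-up two-state HMM.

```python
def forward(pi, A, B, obs):
    """HMM forward algorithm: P(obs) = sum over all hidden state sequences.
    pi[i]   : P(Q_1 = i)               (initial state distribution)
    A[i][j] : P(Q_{t+1} = j | Q_t = i)  (transition probabilities)
    B[i][k] : P(Y_t = k | Q_t = i)      (emission probabilities)
    """
    n = len(pi)
    alpha = [pi[i] * B[i][obs[0]] for i in range(n)]   # alpha_1(i) = P(y_1, Q_1 = i)
    for y in obs[1:]:
        # alpha_t(j) = ( sum_i alpha_{t-1}(i) * A[i][j] ) * B[j][y_t]
        alpha = [sum(alpha[i] * A[i][j] for i in range(n)) * B[j][y] for j in range(n)]
    return sum(alpha)                                  # P(y_1, ..., y_T)

# Tiny made-up HMM: 2 hidden states, 2 observation symbols.
pi = [0.6, 0.4]
A = [[0.7, 0.3], [0.2, 0.8]]
B = [[0.9, 0.1], [0.3, 0.7]]
print(forward(pi, A, B, [0, 1, 1, 0]))
```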
Applying a BN to a real-world problem involves the following steps: (1) domain experts (or computer scientists, if the problem is not very hard) specify the causal relations among the random variables, from which we draw the DAG; (2) collect training data from the real world; (3) learn the maximum-likelihood CPTs from the training data; (4) answer the queries we are interested in by inference.
Summary. BN = DAG + CPTs: a compact representation of the joint probability. Inference: conditional independence and the rules of probability. Learning: the maximum-likelihood solution.