Belief Networks
Qian Liu, CSE 391, Spring 2005, University of Pennsylvania
Outline: Motivation, BN = DAG + CPTs, Inference, Learning, Applications of BNs
From applications: What kinds of problems can Belief Networks solve? Classification (e.g., of webpages), medical diagnosis, the troubleshooting system in MS Windows, the bouncy paperclip guy in MS Word, speech recognition, gene finding. Also known as: Bayesian Network, Causal Network, Directed Graphical Model, and so on.
From representation: How do we describe the joint distribution of N random variables, P(X1, X2, …, XN)? If we write the joint distribution out as a table with one row per assignment and its probability (0, 0, …, 0 : 0.008; 0, 0, …, 1 : 0.001; and so on), how many entries does the table have? # entries = 2^N. Too many! Is there a cleverer way of representing the joint distribution? Yes: the Belief Network.
Outline: Motivation, BN = DAG + CPTs, Inference, Learning, Applications of BNs
What is BN: BN=DAG+CPTs
A Belief Network consists of a Directed Acyclic Graph (DAG) and Conditional Probability Tables (CPTs). DAG: nodes are random variables, and directed edges represent causal relations. CPTs: each random variable Xi has a CPT, which specifies P(Xi | parents(Xi)). The BN specifies a joint distribution on the variables: P(X1, …, XN) = P(X1 | parents(X1)) * … * P(XN | parents(XN)).
Alarm example. Random variables: Burglary, Earthquake, Alarm, JohnCalls, MaryCalls. Causal relationships among the variables: a burglary can set the alarm off; an earthquake can set the alarm off; the alarm can cause Mary to call; the alarm can cause John to call. Causal relations reflect domain knowledge.
Alarm example (cont.): the DAG (Burglary and Earthquake point to Alarm; Alarm points to JohnCalls and MaryCalls), with a CPT attached to each node.
Joint distribution: the BN specifies a joint distribution on the variables. Alarm example: P(B, E, A, J, M) = P(B) * P(E) * P(A | B, E) * P(J | A) * P(M | A). Shorthand notation: each letter stands for the corresponding variable (B for Burglary, E for Earthquake, and so on).
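As an illustration, here is a minimal Python sketch of the alarm network as "DAG + CPTs", with a function that evaluates the joint via the factorization above. The CPT numbers are illustrative placeholders, not values taken from the slides.

```python
# Minimal sketch: the alarm network as "DAG + CPTs" in plain Python.
# The CPT numbers are illustrative placeholders, not values from the slides.
P_B = {True: 0.001, False: 0.999}        # P(Burglary = b)
P_E = {True: 0.002, False: 0.998}        # P(Earthquake = e)
P_A = {                                   # P(Alarm = true | Burglary, Earthquake)
    (True, True): 0.95, (True, False): 0.94,
    (False, True): 0.29, (False, False): 0.001,
}
P_J = {True: 0.90, False: 0.05}           # P(JohnCalls = true | Alarm = a)
P_M = {True: 0.70, False: 0.01}           # P(MaryCalls = true | Alarm = a)

def joint(b, e, a, j, m):
    """P(B=b, E=e, A=a, J=j, M=m) via the BN factorization above."""
    p_a = P_A[(b, e)] if a else 1 - P_A[(b, e)]
    p_j = P_J[a] if j else 1 - P_J[a]
    p_m = P_M[a] if m else 1 - P_M[a]
    return P_B[b] * P_E[e] * p_a * p_j * p_m

print(joint(False, False, True, True, False))   # one entry of the joint table
```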
Another example (a second DAG with its CPTs and the corresponding joint probability). Convention for writing the joint probability: write the variables in a "causes before effects" order. Belief Networks are generative models, which can generate data in a "causes before effects" order.
Compactness: a Belief Network offers a simple and compact way of representing the joint distribution of many random variables. For N binary variables, the full joint distribution table has ~2^N entries. In a BN where each variable has at most k parents, the total number of entries in all the CPTs is ~N * 2^k. In practice k << N, so we save a lot of space!
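A quick numeric check of this claim (a sketch; the N and k values below are arbitrary):

```python
# Rough size comparison for N binary variables: full joint table vs. BN CPTs,
# assuming (as on the slide) that each variable has at most k parents.
def full_joint_entries(n):
    return 2 ** n

def bn_cpt_entries(n, k):
    # each of the n variables has a CPT with at most 2**k rows
    return n * 2 ** k

for n, k in [(10, 2), (30, 3), (100, 4)]:
    print(n, k, full_joint_entries(n), bn_cpt_entries(n, k))
```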
Outline: Motivation, BN = DAG + CPTs, Inference, Learning, Applications of BNs
Inference: the task of inference is to compute the posterior probability of a set of query variables Q given a set of evidence variables E, denoted P(Q | E), assuming the Belief Network is known. Since the joint distribution is known, P(Q | E) can be computed naively and inefficiently with the product rule and marginalization: P(Q | E) = P(Q, E) / P(E), where P(Q, E) and P(E) are each obtained by summing the joint distribution over the remaining variables.
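Below is a minimal sketch of this naive inference by enumeration for the alarm network, reusing the joint() function from the earlier sketch; the particular query and evidence are just examples.

```python
# Naive inference by enumeration: sum the full joint (from the earlier alarm
# sketch) over all variables that are neither query nor evidence.
from itertools import product

VARS = ["B", "E", "A", "J", "M"]

def prob(assignment):
    """P(assignment) by summing the joint over every variable not fixed."""
    free = [v for v in VARS if v not in assignment]
    total = 0.0
    for values in product([True, False], repeat=len(free)):
        full = dict(assignment, **dict(zip(free, values)))
        total += joint(full["B"], full["E"], full["A"], full["J"], full["M"])
    return total

def posterior(query, evidence):
    """P(query | evidence) = P(query, evidence) / P(evidence)."""
    return prob({**query, **evidence}) / prob(evidence)

print(posterior({"B": True}, {"J": True, "M": True}))   # P(Burglary | both call)
```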
Conditional independence
A, B are (unconditionally) independent: (1) P(A, B) = P(A) P(B); (2) P(A | B) = P(A); (3) P(B | A) = P(B). A, B are conditionally independent given evidence E: (1) P(A, B | E) = P(A | E) P(B | E); (2) P(A | B, E) = P(A | E); (3) P(B | A, E) = P(B | E). In each case, statements (1), (2), and (3) are equivalent.
Conditional independence
Chain rule: P(B, E, A, J, M) = P(B) * P(E | B) * P(A | B, E) * P(J | B, E, A) * P(M | B, E, A, J). BN: P(B, E, A, J, M) = P(B) * P(E) * P(A | B, E) * P(J | A) * P(M | A), since B is independent of E, J is C.I. of B, E given A, and M is C.I. of B, E, J given A. Belief Networks exploit conditional independence among variables to represent the joint distribution compactly.
Conditional independence
A Belief Network encodes conditional independence in its graph structure. (1) A variable is C.I. of its non-descendants, given all its parents. (2) A variable is C.I. of all the other variables, given its parents, children, and children's parents, that is, given its Markov blanket.
Conditional independence
Alarm example: B is independent of E (or: B is C.I. of E given nothing), by (1). J is C.I. of B, E, M, given A, by (1). M is C.I. of B, E, J, given A, by (1). Another example (a second graph): U is C.I. of X, given Y, V, Z, by (2).
Examples of inference. Alarm example: we know the CPTs P(B), P(E), P(A | B, E), P(J | A), P(M | A) and the conditional independences. What is P(A)? By marginalization, the chain rule, and the independence of B and E: P(A) = sum over b, e of P(B = b, E = e, A) = sum over b, e of P(B = b) * P(E = e) * P(A | B = b, E = e).
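In code, the same computation looks like this (continuing the earlier alarm sketch with its placeholder CPT numbers):

```python
# P(A = a) by marginalizing out B and E, using P_B, P_E, P_A from the
# earlier alarm sketch (illustrative numbers only).
def prob_A(a):
    total = 0.0
    for b in (True, False):
        for e in (True, False):
            p_a_given_be = P_A[(b, e)] if a else 1 - P_A[(b, e)]
            total += P_B[b] * P_E[e] * p_a_given_be
    return total

print(prob_A(True))   # P(Alarm = true)
```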
Examples of inference (cont.). Alarm example: with the same CPTs, what is P(J, M)? By marginalization, the chain rule, and the fact that J is C.I. of M given A: P(J, M) = sum over a of P(A = a) * P(J | A = a) * P(M | A = a).
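And the corresponding sketch for P(J, M), reusing prob_A, P_J, and P_M from the sketches above:

```python
# P(J = j, M = m) = sum_a P(A = a) * P(J = j | A = a) * P(M = m | A = a)
def prob_JM(j, m):
    total = 0.0
    for a in (True, False):
        p_j = P_J[a] if j else 1 - P_J[a]
        p_m = P_M[a] if m else 1 - P_M[a]
        total += prob_A(a) * p_j * p_m
    return total

print(prob_JM(True, True))   # P(John calls and Mary calls)
```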
Outline: Motivation, BN = DAG + CPTs, Inference, Learning, Applications of BNs
Learning: the task of learning is to find the Belief Network that best describes the data we observe. If we assume the DAG is known, the learning problem simplifies to learning the best CPTs from data, according to some "goodness" criterion. Note: there are many kinds of learning (different things to learn, different "goodness" criteria); we only discuss the easiest kind here.
Training data: all variables are binary, and we observe T examples, assumed independently and identically distributed (i.i.d.) from the network's distribution. Each example t is a complete assignment of X1, X2, …, Xn, for instance example 1 = (0, 0, 0, …, 0), example 2 = (0, 1, 0, …, 0), …, example T = (0, 1, 0, …, 1). Task: how do we learn the best CPTs from the T examples?
Think of learning as follows: someone used a DAG and a set of CPTs to generate the training data. You are given the DAG and the training data, and you are asked to guess which CPTs this person most likely used to generate the data.
Given CPTs, the probability of the t-th example, e.g. for the alarm example: if the t-th example is the assignment (b, e, a, j, m), then P(example t) = P(B = b) * P(E = e) * P(A = a | B = b, E = e) * P(J = j | A = a) * P(M = m | A = a).
Given CPTs, the probability of the t-th example is P(example t) as above. The probability of all the data (by the i.i.d. assumption) is P(data) = product over t = 1..T of P(example t), so the log-likelihood of the data is L = sum over t = 1..T of log P(example t). The log-likelihood of the data is a function of the CPTs. Which CPTs are the best?
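A small sketch of this log-likelihood computation, using the joint() function from the earlier alarm sketch and a made-up three-example dataset:

```python
import math

# Log-likelihood of a toy, made-up dataset under the alarm CPTs defined in the
# earlier sketch; each example is a full assignment (b, e, a, j, m).
data = [
    (False, False, False, False, False),
    (False, False, True, True, False),
    (True, False, True, True, True),
]

log_likelihood = sum(math.log(joint(*example)) for example in data)
print(log_likelihood)
```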
Maximum-likelihood learning
The log-likelihood of the data is a function of the CPTs, so the goodness criterion for CPTs is the log-likelihood of the data. The best CPTs are the ones that maximize the log-likelihood: CPTs* = argmax over CPTs of L.
Maximum-likelihood learning
Mathematical formulation: maximize the log-likelihood subject to the constraints that the probabilities in each CPT row sum up to 1. This is constrained optimization with equality constraints, solved with Lagrange multipliers (which you have probably seen in your calculus class). You can solve it yourself; it is not hard at all, and it is a very common technique in machine learning.
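For concreteness, here is a sketch of that derivation in standard notation (the symbols theta, u, and N_i are my own, not from the slides): theta_i(x | u) denotes the CPT entry P(Xi = x | parents(Xi) = u), and N_i(x, u) is the number of training examples in which Xi = x and parents(Xi) = u.

```latex
% Constrained ML for the CPT entries, with Lagrange multipliers.
\begin{align*}
\max_{\theta}\ & \sum_{i}\sum_{x,\,u} N_i(x,u)\,\log \theta_i(x \mid u)
  \quad \text{s.t.} \quad \sum_{x} \theta_i(x \mid u) = 1 \ \text{for every } i, u, \\
\mathcal{L} &= \sum_{i,\,x,\,u} N_i(x,u)\,\log \theta_i(x \mid u)
  - \sum_{i,\,u} \lambda_{i,u}\Bigl(\sum_{x} \theta_i(x \mid u) - 1\Bigr), \\
\frac{\partial \mathcal{L}}{\partial \theta_i(x \mid u)} &= \frac{N_i(x,u)}{\theta_i(x \mid u)} - \lambda_{i,u} = 0
  \ \Longrightarrow\ \theta_i(x \mid u) = \frac{N_i(x,u)}{\sum_{x'} N_i(x',u)}.
\end{align*}
```

The last line is exactly the counting formula given on the next slide.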
ML solution: nicely, this constrained optimization problem has a closed-form solution, and the solution is very intuitive: each CPT entry is a ratio of counts in the training data, P(Xi = x | parents(Xi) = u) = count(Xi = x, parents(Xi) = u) / count(parents(Xi) = u).
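A minimal sketch of this counting estimate in Python, on a tiny made-up dataset with the hypothetical structure X -> Y (neither the data nor the structure comes from the slides):

```python
from collections import Counter

# Closed-form ML estimate: each CPT entry is a ratio of counts.
data = [  # complete assignments (x, y) for the hypothetical BN X -> Y
    (0, 0), (0, 1), (1, 1), (1, 1), (0, 0), (1, 0),
]

count_x = Counter(x for x, _ in data)
count_xy = Counter(data)

# P(X = x) = count(X = x) / T
p_x = {x: c / len(data) for x, c in count_x.items()}

# P(Y = y | X = x) = count(X = x, Y = y) / count(X = x)
p_y_given_x = {x: {y: count_xy[(x, y)] / count_x[x] for y in (0, 1)}
               for x in count_x}

print(p_x)           # {0: 0.5, 1: 0.5}
print(p_y_given_x)   # {0: {0: 2/3, 1: 1/3}, 1: {0: 1/3, 1: 2/3}}
```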
ML learning example Three binary variables X,Y,Z
T = 1000 examples in the training data (individual examples such as (0, 0, 0), (0, 1, 0), …), summarized by counts:
X, Y, Z : count
0, 0, 0 : 230
0, 0, 1 : 100
0, 1, 0 : 70
0, 1, 1 : 50
1, 0, 0 : 110
1, 0, 1 : 150
1, 1, 0 : 160
1, 1, 1 : 130
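The DAG over X, Y, Z is not recoverable from the extracted slide, so as a purely illustrative sketch the code below assumes a hypothetical chain X -> Y -> Z and applies the counting formula to the count table above:

```python
# Hypothetical chain X -> Y -> Z (an assumption, not the slide's DAG);
# ML CPT estimates from the count table by the counting formula.
counts = {
    (0, 0, 0): 230, (0, 0, 1): 100, (0, 1, 0): 70,  (0, 1, 1): 50,
    (1, 0, 0): 110, (1, 0, 1): 150, (1, 1, 0): 160, (1, 1, 1): 130,
}
T = sum(counts.values())  # 1000

def n(**fixed):
    """Number of examples matching the fixed values of x, y, z."""
    keys = ("x", "y", "z")
    return sum(c for xyz, c in counts.items()
               if all(xyz[keys.index(k)] == v for k, v in fixed.items()))

p_x1 = n(x=1) / T                                           # P(X = 1)
p_y1_given_x = {x: n(x=x, y=1) / n(x=x) for x in (0, 1)}    # P(Y = 1 | X = x)
p_z1_given_y = {y: n(y=y, z=1) / n(y=y) for y in (0, 1)}    # P(Z = 1 | Y = y)

print(p_x1, p_y1_given_x, p_z1_given_y)
```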
Outline: Motivation, BN = DAG + CPTs, Inference, Learning, Applications of BNs
Naïve Bayes Classifier
Represent an object by its attributes A1, …, An together with a class label C. The joint probability factorizes as P(C, A1, …, An) = P(C) * P(A1 | C) * … * P(An | C). Learn the CPTs P(C) and P(Ai | C) from training data, then classify a new object by picking the most probable class given its attributes.
Naïve Bayes Classifier
Inference: by Bayes' rule, marginalization, and the conditional independence of the attributes given the class, P(C | A1, …, An) = P(C) * P(A1 | C) * … * P(An | C) / sum over c of P(C = c) * P(A1 | C = c) * … * P(An | C = c).
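A minimal Naive Bayes sketch (training by counting, then classifying by the most probable class); the toy spam/ham dataset and all names are illustrative, not from the slides:

```python
from collections import Counter, defaultdict

# Toy Naive Bayes with binary attributes; ML estimates, no smoothing
# (so an unseen attribute value gives probability 0 for that class).
train = [  # (class label, attribute tuple)
    ("spam", (1, 1, 0)), ("spam", (1, 0, 1)),
    ("ham",  (0, 0, 1)), ("ham",  (0, 1, 0)), ("ham", (0, 0, 0)),
]

class_counts = Counter(c for c, _ in train)
attr_counts = defaultdict(Counter)        # attr_counts[c][(i, value)] = count
for c, attrs in train:
    for i, value in enumerate(attrs):
        attr_counts[c][(i, value)] += 1

def posterior_scores(attrs):
    """Unnormalized P(C = c | attrs) = P(c) * prod_i P(a_i | c)."""
    scores = {}
    for c, n_c in class_counts.items():
        p = n_c / len(train)
        for i, value in enumerate(attrs):
            p *= attr_counts[c][(i, value)] / n_c
        scores[c] = p
    return scores

scores = posterior_scores((1, 0, 0))
print(max(scores, key=scores.get))        # predicted class for a new object
```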
Medical diagnosis: the QMR-DT model (Shwe et al. 1991). Learning: the prior probability of each disease and the conditional probability of each finding given its parent diseases. Inference: given the findings of some patient, which is/are the most probable disease(s) causing these findings?
Hidden Markov Model: a sequence / time-series model with hidden states Q and observations Y. Speech recognition: observations are the utterance/waveform, states are words. Gene finding: observations are the genomic sequence, states are gene / no-gene and the different components of a gene.
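As a sketch of how such a model is evaluated, here is the standard HMM forward algorithm computing P(Y1, …, YT) for a toy two-state model; the transition and emission numbers are illustrative only and are not from the slides.

```python
# HMM forward algorithm for a toy two-state, two-symbol model.
states = [0, 1]
pi = [0.6, 0.4]                          # P(Q_1 = s)
trans = [[0.7, 0.3], [0.4, 0.6]]         # trans[i][j] = P(Q_{t+1} = j | Q_t = i)
emit = [[0.9, 0.1], [0.2, 0.8]]          # emit[i][y]  = P(Y_t = y | Q_t = i)

def forward(observations):
    """Return P(Y_1..Y_T) by summing over all hidden state sequences."""
    alpha = [pi[s] * emit[s][observations[0]] for s in states]
    for y in observations[1:]:
        alpha = [sum(alpha[i] * trans[i][j] for i in states) * emit[j][y]
                 for j in states]
    return sum(alpha)

print(forward([0, 1, 1, 0]))
```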
Applying BNs to real-world problems involves the following steps: (1) domain experts (or computer scientists, if the problem is not very hard) specify causal relations among the random variables, from which we draw the DAG; (2) collect training data from the real world; (3) learn the maximum-likelihood CPTs from the training data; (4) infer the queries we are interested in.
Summary. BN = DAG + CPTs: a compact representation of the joint probability. Inference: conditional independence plus the basic probability rules. Learning: the maximum-likelihood solution.