Belief Networks
Qian Liu, CSE 391, Spring 2005, University of Pennsylvania
Outline: Motivation, BN = DAG + CPTs, Inference, Learning, Applications of BNs
From applications: What kinds of problems can Belief Networks solve? Classification (e.g., of webpages), medical diagnosis, the troubleshooting system in MS Windows, the bouncy paperclip guy in MS Word, speech recognition, gene finding. Also known as: Bayesian Network, Causal Network, Directed Graphical Model, and so on.
From representation: How do we describe the joint distribution of N random variables, P(X1, X2, …, XN)? If we write the joint distribution out as a table with one row per assignment and its probability (0, 0, …, 0 : 0.008; 0, 0, …, 1 : 0.001; and so on), how many entries does the table have? # entries = 2^N. Too many! Is there a cleverer way of representing the joint distribution? Yes: the Belief Network.
Outline: Motivation, BN = DAG + CPTs, Inference, Learning, Applications of BNs
What is BN: BN=DAG+CPTs
A Belief Network consists of a Directed Acyclic Graph (DAG) and Conditional Probability Tables (CPTs). DAG: nodes are random variables, and directed edges represent causal relations. CPTs: each random variable Xi has a CPT, which specifies P(Xi | parents(Xi)). The BN specifies a joint distribution on the variables: P(X1, …, XN) = P(X1 | parents(X1)) * … * P(XN | parents(XN)).
Alarm example. Random variables: Burglary, Earthquake, Alarm, JohnCalls, MaryCalls. Causal relationships among the variables: a burglary can set the alarm off; an earthquake can set the alarm off; the alarm can cause Mary to call; the alarm can cause John to call. Causal relations reflect domain knowledge.
Alarm example (cont.): the DAG (Burglary and Earthquake point to Alarm; Alarm points to JohnCalls and MaryCalls), with a CPT attached to each node.
Joint distribution: the BN specifies a joint distribution on the variables. Alarm example: P(B, E, A, J, M) = P(B) * P(E) * P(A | B, E) * P(J | A) * P(M | A). Shorthand notation: each letter stands for the corresponding variable (B for Burglary, E for Earthquake, and so on).
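As an illustration, here is a minimal Python sketch of the alarm network as "DAG + CPTs", with a function that evaluates the joint via the factorization above. The CPT numbers are illustrative placeholders, not values taken from the slides.

```python
# Minimal sketch: the alarm network as "DAG + CPTs" in plain Python.
# The CPT numbers are illustrative placeholders, not values from the slides.
P_B = {True: 0.001, False: 0.999}        # P(Burglary = b)
P_E = {True: 0.002, False: 0.998}        # P(Earthquake = e)
P_A = {                                   # P(Alarm = true | Burglary, Earthquake)
    (True, True): 0.95, (True, False): 0.94,
    (False, True): 0.29, (False, False): 0.001,
}
P_J = {True: 0.90, False: 0.05}           # P(JohnCalls = true | Alarm = a)
P_M = {True: 0.70, False: 0.01}           # P(MaryCalls = true | Alarm = a)

def joint(b, e, a, j, m):
    """P(B=b, E=e, A=a, J=j, M=m) via the BN factorization above."""
    p_a = P_A[(b, e)] if a else 1 - P_A[(b, e)]
    p_j = P_J[a] if j else 1 - P_J[a]
    p_m = P_M[a] if m else 1 - P_M[a]
    return P_B[b] * P_E[e] * p_a * p_j * p_m

print(joint(False, False, True, True, False))   # one entry of the joint table
```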
Another example (a second DAG with its CPTs and the corresponding joint probability). Convention for writing the joint probability: write the variables in a "causes before effects" order. Belief Networks are generative models, which can generate data in a "causes before effects" order.
Compactness: a Belief Network offers a simple and compact way of representing the joint distribution of many random variables. For N binary variables, the full joint distribution table has ~2^N entries. In a BN where each variable has at most k parents, the total number of entries in all the CPTs is ~N * 2^k. In practice k << N, so we save a lot of space!
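A quick numeric check of this claim (a sketch; the N and k values below are arbitrary):

```python
# Rough size comparison for N binary variables: full joint table vs. BN CPTs,
# assuming (as on the slide) that each variable has at most k parents.
def full_joint_entries(n):
    return 2 ** n

def bn_cpt_entries(n, k):
    # each of the n variables has a CPT with at most 2**k rows
    return n * 2 ** k

for n, k in [(10, 2), (30, 3), (100, 4)]:
    print(n, k, full_joint_entries(n), bn_cpt_entries(n, k))
```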
Outline: Motivation, BN = DAG + CPTs, Inference, Learning, Applications of BNs
Inference: the task of inference is to compute the posterior probability of a set of query variables Q given a set of evidence variables E, denoted P(Q | E), assuming the Belief Network is known. Since the joint distribution is known, P(Q | E) can be computed naively and inefficiently with the product rule and marginalization: P(Q | E) = P(Q, E) / P(E), where P(Q, E) and P(E) are each obtained by summing the joint distribution over the remaining variables.
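Below is a minimal sketch of this naive inference by enumeration for the alarm network, reusing the joint() function from the earlier sketch; the particular query and evidence are just examples.

```python
# Naive inference by enumeration: sum the full joint (from the earlier alarm
# sketch) over all variables that are neither query nor evidence.
from itertools import product

VARS = ["B", "E", "A", "J", "M"]

def prob(assignment):
    """P(assignment) by summing the joint over every variable not fixed."""
    free = [v for v in VARS if v not in assignment]
    total = 0.0
    for values in product([True, False], repeat=len(free)):
        full = dict(assignment, **dict(zip(free, values)))
        total += joint(full["B"], full["E"], full["A"], full["J"], full["M"])
    return total

def posterior(query, evidence):
    """P(query | evidence) = P(query, evidence) / P(evidence)."""
    return prob({**query, **evidence}) / prob(evidence)

print(posterior({"B": True}, {"J": True, "M": True}))   # P(Burglary | both call)
```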
Conditional independence
A, B are (unconditionally) independent: (1) P(A, B) = P(A) P(B); (2) P(A | B) = P(A); (3) P(B | A) = P(B). A, B are conditionally independent given evidence E: (1) P(A, B | E) = P(A | E) P(B | E); (2) P(A | B, E) = P(A | E); (3) P(B | A, E) = P(B | E). In each case, statements (1), (2), and (3) are equivalent.
Conditional independence
Chain rule: P(B, E, A, J, M) = P(B) * P(E | B) * P(A | B, E) * P(J | B, E, A) * P(M | B, E, A, J). BN: P(B, E, A, J, M) = P(B) * P(E) * P(A | B, E) * P(J | A) * P(M | A), since B is independent of E, J is C.I. of B, E given A, and M is C.I. of B, E, J given A. Belief Networks exploit conditional independence among variables to represent the joint distribution compactly.
Conditional independence
A Belief Network encodes conditional independence in its graph structure. (1) A variable is C.I. of its non-descendants, given all its parents. (2) A variable is C.I. of all the other variables, given its parents, children, and children's parents, that is, given its Markov blanket.
Conditional independence
Alarm example: B is independent of E (or: B is C.I. of E given nothing), by (1). J is C.I. of B, E, M, given A, by (1). M is C.I. of B, E, J, given A, by (1). Another example (a second graph): U is C.I. of X, given Y, V, Z, by (2).
Examples of inference. Alarm example: we know the CPTs P(B), P(E), P(A | B, E), P(J | A), P(M | A) and the conditional independences. What is P(A)? By marginalization, the chain rule, and the independence of B and E: P(A) = sum over b, e of P(B = b, E = e, A) = sum over b, e of P(B = b) * P(E = e) * P(A | B = b, E = e).
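In code, the same computation looks like this (continuing the earlier alarm sketch with its placeholder CPT numbers):

```python
# P(A = a) by marginalizing out B and E, using P_B, P_E, P_A from the
# earlier alarm sketch (illustrative numbers only).
def prob_A(a):
    total = 0.0
    for b in (True, False):
        for e in (True, False):
            p_a_given_be = P_A[(b, e)] if a else 1 - P_A[(b, e)]
            total += P_B[b] * P_E[e] * p_a_given_be
    return total

print(prob_A(True))   # P(Alarm = true)
```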
Examples of inference (cont.). Alarm example: with the same CPTs, what is P(J, M)? By marginalization, the chain rule, and the fact that J is C.I. of M given A: P(J, M) = sum over a of P(A = a) * P(J | A = a) * P(M | A = a).
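And the corresponding sketch for P(J, M), reusing prob_A, P_J, and P_M from the sketches above:

```python
# P(J = j, M = m) = sum_a P(A = a) * P(J = j | A = a) * P(M = m | A = a)
def prob_JM(j, m):
    total = 0.0
    for a in (True, False):
        p_j = P_J[a] if j else 1 - P_J[a]
        p_m = P_M[a] if m else 1 - P_M[a]
        total += prob_A(a) * p_j * p_m
    return total

print(prob_JM(True, True))   # P(John calls and Mary calls)
```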
Outline: Motivation, BN = DAG + CPTs, Inference, Learning, Applications of BNs
Learning: the task of learning is to find the Belief Network that best describes the data we observe. If we assume the DAG is known, the learning problem simplifies to learning the best CPTs from data, according to some "goodness" criterion. Note: there are many kinds of learning (different things to learn, different "goodness" criteria); we only discuss the easiest kind here.
Training data: all variables are binary, and we observe T examples, assumed independently and identically distributed (i.i.d.) from the network's distribution. Each example t is a complete assignment of X1, X2, …, Xn, for instance example 1 = (0, 0, 0, …, 0), example 2 = (0, 1, 0, …, 0), …, example T = (0, 1, 0, …, 1). Task: how do we learn the best CPTs from the T examples?
Think of learning as follows: someone used a DAG and a set of CPTs to generate the training data. You are given the DAG and the training data, and you are asked to guess which CPTs this person most likely used to generate the data.
Given CPTs, the probability of the t-th example, e.g. for the alarm example: if the t-th example is the assignment (b, e, a, j, m), then P(example t) = P(B = b) * P(E = e) * P(A = a | B = b, E = e) * P(J = j | A = a) * P(M = m | A = a).
Given CPTs, the probability of the t-th example is P(example t) as above. The probability of all the data (by the i.i.d. assumption) is P(data) = product over t = 1..T of P(example t), so the log-likelihood of the data is L = sum over t = 1..T of log P(example t). The log-likelihood of the data is a function of the CPTs. Which CPTs are the best?
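A small sketch of this log-likelihood computation, using the joint() function from the earlier alarm sketch and a made-up three-example dataset:

```python
import math

# Log-likelihood of a toy, made-up dataset under the alarm CPTs defined in the
# earlier sketch; each example is a full assignment (b, e, a, j, m).
data = [
    (False, False, False, False, False),
    (False, False, True, True, False),
    (True, False, True, True, True),
]

log_likelihood = sum(math.log(joint(*example)) for example in data)
print(log_likelihood)
```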
Maximum-likelihood learning
The log-likelihood of the data is a function of the CPTs, so the goodness criterion for CPTs is the log-likelihood of the data. The best CPTs are the ones that maximize the log-likelihood: CPTs* = argmax over CPTs of L.
Maximum-likelihood learning
Mathematical formulation: maximize the log-likelihood subject to the constraints that the probabilities in each CPT row sum up to 1. This is constrained optimization with equality constraints, solved with Lagrange multipliers (which you have probably seen in your calculus class). You can solve it yourself; it is not hard at all, and it is a very common technique in machine learning.
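For concreteness, here is a sketch of that derivation in standard notation (the symbols theta, u, and N_i are my own, not from the slides): theta_i(x | u) denotes the CPT entry P(Xi = x | parents(Xi) = u), and N_i(x, u) is the number of training examples in which Xi = x and parents(Xi) = u.

```latex
% Constrained ML for the CPT entries, with Lagrange multipliers.
\begin{align*}
\max_{\theta}\ & \sum_{i}\sum_{x,\,u} N_i(x,u)\,\log \theta_i(x \mid u)
  \quad \text{s.t.} \quad \sum_{x} \theta_i(x \mid u) = 1 \ \text{for every } i, u, \\
\mathcal{L} &= \sum_{i,\,x,\,u} N_i(x,u)\,\log \theta_i(x \mid u)
  - \sum_{i,\,u} \lambda_{i,u}\Bigl(\sum_{x} \theta_i(x \mid u) - 1\Bigr), \\
\frac{\partial \mathcal{L}}{\partial \theta_i(x \mid u)} &= \frac{N_i(x,u)}{\theta_i(x \mid u)} - \lambda_{i,u} = 0
  \ \Longrightarrow\ \theta_i(x \mid u) = \frac{N_i(x,u)}{\sum_{x'} N_i(x',u)}.
\end{align*}
```

The last line is exactly the counting formula given on the next slide.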
ML solution: nicely, this constrained optimization problem has a closed-form solution, and the solution is very intuitive: each CPT entry is a ratio of counts in the training data, P(Xi = x | parents(Xi) = u) = count(Xi = x, parents(Xi) = u) / count(parents(Xi) = u).
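A minimal sketch of this counting estimate in Python, on a tiny made-up dataset with the hypothetical structure X -> Y (neither the data nor the structure comes from the slides):

```python
from collections import Counter

# Closed-form ML estimate: each CPT entry is a ratio of counts.
data = [  # complete assignments (x, y) for the hypothetical BN X -> Y
    (0, 0), (0, 1), (1, 1), (1, 1), (0, 0), (1, 0),
]

count_x = Counter(x for x, _ in data)
count_xy = Counter(data)

# P(X = x) = count(X = x) / T
p_x = {x: c / len(data) for x, c in count_x.items()}

# P(Y = y | X = x) = count(X = x, Y = y) / count(X = x)
p_y_given_x = {x: {y: count_xy[(x, y)] / count_x[x] for y in (0, 1)}
               for x in count_x}

print(p_x)           # {0: 0.5, 1: 0.5}
print(p_y_given_x)   # {0: {0: 2/3, 1: 1/3}, 1: {0: 1/3, 1: 2/3}}
```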
ML learning example Three binary variables X,Y,Z
T = 1000 examples in the training data (individual examples such as (0, 0, 0), (0, 1, 0), …), summarized by counts:
X, Y, Z : count
0, 0, 0 : 230
0, 0, 1 : 100
0, 1, 0 : 70
0, 1, 1 : 50
1, 0, 0 : 110
1, 0, 1 : 150
1, 1, 0 : 160
1, 1, 1 : 130
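The DAG over X, Y, Z is not recoverable from the extracted slide, so as a purely illustrative sketch the code below assumes a hypothetical chain X -> Y -> Z and applies the counting formula to the count table above:

```python
# Hypothetical chain X -> Y -> Z (an assumption, not the slide's DAG);
# ML CPT estimates from the count table by the counting formula.
counts = {
    (0, 0, 0): 230, (0, 0, 1): 100, (0, 1, 0): 70,  (0, 1, 1): 50,
    (1, 0, 0): 110, (1, 0, 1): 150, (1, 1, 0): 160, (1, 1, 1): 130,
}
T = sum(counts.values())  # 1000

def n(**fixed):
    """Number of examples matching the fixed values of x, y, z."""
    keys = ("x", "y", "z")
    return sum(c for xyz, c in counts.items()
               if all(xyz[keys.index(k)] == v for k, v in fixed.items()))

p_x1 = n(x=1) / T                                           # P(X = 1)
p_y1_given_x = {x: n(x=x, y=1) / n(x=x) for x in (0, 1)}    # P(Y = 1 | X = x)
p_z1_given_y = {y: n(y=y, z=1) / n(y=y) for y in (0, 1)}    # P(Z = 1 | Y = y)

print(p_x1, p_y1_given_x, p_z1_given_y)
```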
Outline: Motivation, BN = DAG + CPTs, Inference, Learning, Applications of BNs
Naïve Bayes Classifier
Represent an object by its attributes A1, …, An together with a class label C. The joint probability factorizes as P(C, A1, …, An) = P(C) * P(A1 | C) * … * P(An | C). Learn the CPTs P(C) and P(Ai | C) from training data, then classify a new object by picking the most probable class given its attributes.
Naïve Bayes Classifier
Inference: by Bayes' rule, marginalization, and the conditional independence of the attributes given the class, P(C | A1, …, An) = P(C) * P(A1 | C) * … * P(An | C) / sum over c of P(C = c) * P(A1 | C = c) * … * P(An | C = c).
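A minimal Naive Bayes sketch (training by counting, then classifying by the most probable class); the toy spam/ham dataset and all names are illustrative, not from the slides:

```python
from collections import Counter, defaultdict

# Toy Naive Bayes with binary attributes; ML estimates, no smoothing
# (so an unseen attribute value gives probability 0 for that class).
train = [  # (class label, attribute tuple)
    ("spam", (1, 1, 0)), ("spam", (1, 0, 1)),
    ("ham",  (0, 0, 1)), ("ham",  (0, 1, 0)), ("ham", (0, 0, 0)),
]

class_counts = Counter(c for c, _ in train)
attr_counts = defaultdict(Counter)        # attr_counts[c][(i, value)] = count
for c, attrs in train:
    for i, value in enumerate(attrs):
        attr_counts[c][(i, value)] += 1

def posterior_scores(attrs):
    """Unnormalized P(C = c | attrs) = P(c) * prod_i P(a_i | c)."""
    scores = {}
    for c, n_c in class_counts.items():
        p = n_c / len(train)
        for i, value in enumerate(attrs):
            p *= attr_counts[c][(i, value)] / n_c
        scores[c] = p
    return scores

scores = posterior_scores((1, 0, 0))
print(max(scores, key=scores.get))        # predicted class for a new object
```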
Medical diagnosis: the QMR-DT model (Shwe et al. 1991). Learning: the prior probability of each disease and the conditional probability of each finding given its parent diseases. Inference: given the findings of some patient, which is/are the most probable disease(s) causing these findings?
Hidden Markov Model: a sequence / time-series model with hidden states Q and observations Y. Speech recognition: observations are the utterance/waveform, states are words. Gene finding: observations are the genomic sequence, states are gene / no-gene and the different components of a gene.
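As a sketch of how such a model is evaluated, here is the standard HMM forward algorithm computing P(Y1, …, YT) for a toy two-state model; the transition and emission numbers are illustrative only and are not from the slides.

```python
# HMM forward algorithm for a toy two-state, two-symbol model.
states = [0, 1]
pi = [0.6, 0.4]                          # P(Q_1 = s)
trans = [[0.7, 0.3], [0.4, 0.6]]         # trans[i][j] = P(Q_{t+1} = j | Q_t = i)
emit = [[0.9, 0.1], [0.2, 0.8]]          # emit[i][y]  = P(Y_t = y | Q_t = i)

def forward(observations):
    """Return P(Y_1..Y_T) by summing over all hidden state sequences."""
    alpha = [pi[s] * emit[s][observations[0]] for s in states]
    for y in observations[1:]:
        alpha = [sum(alpha[i] * trans[i][j] for i in states) * emit[j][y]
                 for j in states]
    return sum(alpha)

print(forward([0, 1, 1, 0]))
```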
Applying BNs to real-world problems involves the following steps: (1) domain experts (or computer scientists, if the problem is not very hard) specify causal relations among the random variables, from which we draw the DAG; (2) collect training data from the real world; (3) learn the maximum-likelihood CPTs from the training data; (4) infer the queries we are interested in.
Summary. BN = DAG + CPTs: a compact representation of the joint probability. Inference: conditional independence plus the basic probability rules. Learning: the maximum-likelihood solution.