Bayesian Networks I: Static Models & Multinomial Distributions
Peter Woolf, University of Michigan
Michigan Chemical Process Dynamics and Controls Open Textbook, version 1.0, Creative Commons

- Existing plant measurements
- Physics, chemistry, and chemical engineering knowledge & intuition
- Bayesian network models to establish connections
- Patterns of likely causes & influences
- Efficient experimental design to test combinations of causes
- ANOVA & probabilistic models to eliminate irrelevant or uninteresting relationships
- Process optimization (e.g. controllers, architecture, unit optimization, sequencing, and utilization)
- Dynamical process modeling

More scenarios where Bayesian networks can help:
- Inferential sensing: how do you sense the state of something you don't see?
- Sensor redundancy: if multiple sensors disagree, what can you say about the state of the system?
- Noisy systems: if your system is highly variable, how can you model it?

Stages of knowing a model (increasingly realistic from 1 to 5):
1. Topology and parameters are known (e.g. solve a given ODE).
2. Topology is known and we have data to learn the parameters (e.g. fit parameters to an ODE using optimization).
3. Only data are known; we must learn topology and parameters.
4. Only partial data are known; we must learn topology and parameters.
5. The model is unknown and nonstationary (still an area of active research).
Bayesian networks (this lecture) cover the middle of this range, where parameters and/or topology must be learned from data.

Probability Tables (for the two-node network A → B)

P(A=high)     P(A=medium)    P(A=low)
θ01 = 0.21    θ02 = 0.45     θ03 = 0.34

A         P(B=on | A)     P(B=off | A)
high      θ11 = 0.30      θ12 = 0.70
medium    θ21 = 0.99      θ22 = 0.01
low       θ31 = 0.46      θ32 = 0.54

Note: rows sum to 1, but columns don't.

Bayesian Networks
- Graphical form of Bayes' Rule
- Conditional independence
- Decomposition of joint probability: P(C+, S-, R+, W+) = P(C+) P(S-|C+) P(R+|C+) P(W+|S-,R+)
- Causal networks
- Inference on a network vs. inference of a network
[Figure: a four-node network C → S, C → R, S → W, R → W, with conditional probability tables P(C) (P(C-) = P(C+) = 0.5), P(S|C), P(R|C), and P(W|S,R).]

Inference on a network (the A → B network, with the P(A) and P(B|A) tables above)

Exact vs. approximate calculation: in some cases you can exactly calculate probabilities on a Bayesian network given some data. This can be done directly, or using fairly complex algorithms for faster execution time. For large networks, exact inference is impractical.

Inference on a network (same A → B network and tables)

Given a value of A, say A = high, what is B? Reading across the A = high row of the table: P(B=on) = 0.3 and P(B=off) = 0.7. The answer is a probability!

Inference on a network (same A → B network and tables)

Given a value of B, say B = on, what is A? Here the table must be inverted with Bayes' rule: P(A | B=on) ∝ P(B=on | A) P(A). This is what Genie is doing in the wiki examples.
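Both directions of this inference can be written out in a few lines; below is a minimal Python sketch using the table values from the slides above (the variable names and print statements are mine, for illustration only).

```python
# A minimal sketch of exact inference on the two-node network A -> B, using the
# probability-table values from the slides above (variable names are mine).
p_A = {"high": 0.21, "medium": 0.45, "low": 0.34}          # P(A)
p_B_given_A = {"high":   {"on": 0.30, "off": 0.70},        # P(B | A)
               "medium": {"on": 0.99, "off": 0.01},
               "low":    {"on": 0.46, "off": 0.54}}

# Forward inference: given A = high, P(B) is read straight from the table.
print(p_B_given_A["high"])                     # {'on': 0.3, 'off': 0.7}

# Inverse inference: given B = on, apply Bayes' rule,
#   P(A = a | B = on) = P(B = on | A = a) P(A = a) / P(B = on).
joint = {a: p_B_given_A[a]["on"] * p_A[a] for a in p_A}
p_B_on = sum(joint.values())                   # P(B = on), by marginalizing over A
p_A_given_B_on = {a: p / p_B_on for a, p in joint.items()}
print(p_A_given_B_on)                          # e.g. P(A = medium | B = on) is about 0.67
```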

Inference on a network (same A → B network and tables)

Approximate inference via Markov Chain Monte Carlo (MCMC) sampling:
- Given partial data, use your conditional probabilities to sample values around the observed values and head nodes.
- Repeat, sampling outward until you fill the network.
- Start over and gather averages.

Approximate inference via Markov Chain Monte Carlo sampling, illustrated on a larger network: starting from the observed data and the head nodes, sample the remaining nodes outward until the network is filled, then start over and gather averages.
[Figure: * = observed data; e1, e2 = sample estimates in rounds 1 and 2.]

Approximate inference via Markov Chain Monte Carlo sampling (continued): the method always works in the limit of infinite samples.
[Figure: observed node * with sample estimates e1 and e2 filling in the rest of the network.]
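As a rough sketch of the sampling idea (not the exact algorithm from the lecture), the loop below runs on the small A → B network so the answer can be checked against the exact Bayes-rule result. On this tiny network the update collapses to sampling A directly from P(A | B = on); on a larger network the same loop would resample every unobserved node in turn, conditioned on its Markov blanket, while observed nodes stay clamped.

```python
import random

# Sketch of sampling-based (approximate) inference on the A -> B network,
# using the table values from the slides above.
p_A = {"high": 0.21, "medium": 0.45, "low": 0.34}
p_B_given_A = {"high": {"on": 0.30, "off": 0.70},
               "medium": {"on": 0.99, "off": 0.01},
               "low": {"on": 0.46, "off": 0.54}}

def sample_discrete(dist):
    """Draw one outcome from a dict mapping outcome -> probability."""
    r, cumulative = random.random(), 0.0
    for outcome, p in dist.items():
        cumulative += p
        if r <= cumulative:
            return outcome
    return outcome  # guard against floating-point round-off

evidence_B = "on"                      # the observed (clamped) node
counts = {state: 0 for state in p_A}
n_samples = 100_000

for _ in range(n_samples):
    # Resample the unobserved node A given everything else. Here that conditional
    # is proportional to P(A) * P(B = on | A); in a bigger network it would also
    # involve the current sampled values of A's neighbours.
    weights = {state: p_A[state] * p_B_given_A[state][evidence_B] for state in p_A}
    total = sum(weights.values())
    a = sample_discrete({state: w / total for state, w in weights.items()})
    counts[a] += 1

print({state: counts[state] / n_samples for state in counts})  # approximates P(A | B = on)
```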

Example scenario
[Figure: the example process and its network representation.]
This can be interpreted as a Bayesian network! The network is the same as saying that the joint probability decomposes into a product of each variable's probability conditioned on its parents.

Recall: these are equivalence classes, and they are a fundamental property of observed data. Causality can only be determined from observational data to a limited extent! The collider network A → B ← C is fundamentally different (prove it to yourself with Bayes' rule) and can be distinguished from the others using observational data.

FUNDAMENTAL PROPERTY! The first three models (A → B → C, A ← B ← C, and A ← B → C) are equivalent if we just observe A, B, and C. If we intervene and change A, B, or C, we can distinguish between them, or we can use our prior knowledge to choose the direction. No arrangement of the last model, the collider A → B ← C, will produce the upper three models.
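To see why only the collider stands apart, note that the chain, reverse-chain, and fork factorizations are algebraically interchangeable, while the collider encodes a different independence constraint (a standard identity, written out here for reference; it was not spelled out on the slide):

```latex
% Chain, reverse chain, and fork: these factorizations are algebraically identical,
P(A)\,P(B \mid A)\,P(C \mid B) \;=\; P(C)\,P(B \mid C)\,P(A \mid B) \;=\; P(B)\,P(A \mid B)\,P(C \mid B)
% and each equals the full joint P(A,B,C) exactly when A \perp C \mid B.
% The collider A \to B \gets C instead factorizes as
P(A,B,C) \;=\; P(A)\,P(C)\,P(B \mid A, C),
% which encodes marginal independence of A and C -- a different, distinguishable constraint.
```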

Example scenario

(1) Given these data, what is the probability of observing a set of 9 temperature readings of which 4 are high, 2 are medium, and 3 are low? Note that these are independent readings, and we don't care about the ordering of the readings, just the probability of observing a set of 9 readings with this property.

Here we can use the multinomial distribution and the probabilities in the table above. Compare to the binomial distribution we discussed previously (k = 2).
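For reference, the standard form of the multinomial probability for counts $n_1, \dots, n_k$ over $k$ outcomes in $n = n_1 + \cdots + n_k$ independent trials with outcome probabilities $p_1, \dots, p_k$ is:

```latex
P(n_1, n_2, \ldots, n_k) \;=\; \frac{n!}{n_1!\, n_2! \cdots n_k!}\; p_1^{n_1}\, p_2^{n_2} \cdots p_k^{n_k},
\qquad \sum_i n_i = n, \quad \sum_i p_i = 1.
```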

(1) (continued) For this problem we find, plugging the counts (4 high, 2 medium, 3 low) and the table probabilities into the multinomial distribution: …
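A sketch of this calculation in Python is shown below. P(medium) = 0.4 is quoted on the next slide; the values for high and low are placeholders standing in for the temperature table on the original slide, so the resulting number is illustrative only.

```python
from math import factorial

# Sketch of the part (1) calculation with placeholder temperature probabilities.
p_temp = {"high": 0.35, "medium": 0.40, "low": 0.25}   # high/low are hypothetical
counts = {"high": 4, "medium": 2, "low": 3}            # the 9 observed readings

n = sum(counts.values())
orderings = factorial(n)          # builds the multinomial coefficient n! / (n1! n2! n3!)
p_one_ordering = 1.0
for state, k in counts.items():
    orderings //= factorial(k)
    p_one_ordering *= p_temp[state] ** k

print(orderings, orderings * p_one_ordering)   # 1260 orderings; ~0.047 with these placeholders
```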

(2) After gathering these 9 temperature readings, what is the most likely next temperature reading you will see? Why?

The next most likely temperature reading is medium, because this has the highest probability, 0.4. The previous sequence of temperature readings does not matter, assuming these are independent readings, as mentioned above.

(3) What is the probability of sampling a set of 9 observations with 7 of them catalyst A and 2 of them catalyst B? Here again, order does not matter. Here we can use the two-state case of the multinomial distribution (the binomial distribution):
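Writing $p_A$ and $p_B$ for the catalyst probabilities from the table on the slide (the numerical values are not reproduced in this transcript), the expression is:

```latex
P(7\,\text{A},\,2\,\text{B}) \;=\; \binom{9}{7}\, p_A^{7}\, p_B^{2} \;=\; \frac{9!}{7!\,2!}\, p_A^{7}\, p_B^{2} \;=\; 36\, p_A^{7}\, p_B^{2}.
```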

(4) What is the probability of observing the following yield values? Note that here we have the temperature and catalyst values, so we can use the conditional probability values. As before, the order of observations does not matter, but the association of temperature and catalyst with yield does matter. For this part, just write down the expression you would use — you don't need to do the full calculation.

Number of times observed | Temperature | Catalyst | Yield
4x | H | A | H
2x | M | B | L
3x | L | A | H

Calculation method 1: first we calculate the probability of this set for one particular ordering (the product of the corresponding conditional yield probabilities). The number of orderings of identical items is the factorial term in the multinomial: 9!/(4! 2! 3!) = 1260. Thus the total probability is the product of the ordering count and the single-ordering probability. (Same observation table as in part 4.)
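One way to write the resulting expression, using p(Y | T, Cat) for the conditional yield probabilities from the table (whose numerical values are not in this transcript):

```latex
P \;=\; \underbrace{\frac{9!}{4!\,2!\,3!}}_{=\,1260}\;
p(Y{=}H \mid T{=}H, \text{Cat}{=}A)^{4}\;
p(Y{=}L \mid T{=}M, \text{Cat}{=}B)^{2}\;
p(Y{=}H \mid T{=}L, \text{Cat}{=}A)^{3}.
```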

Calculation method 2: the probabilities can be interpreted as another multinomial term. For example, for the first group of observations, we could ask: what is the probability of observing 4 high, 0 medium, and 0 low yields for a system with a high temperature and catalyst A? Using the multinomial distribution we would find p(4H, 0M, 0L | T=high, Cat=A) = … We can repeat this for the second case to find p(0H, 0M, 2L | T=med, Cat=B) = …, which is again the same as above, and likewise for the third group. The combination term is the same as in method 1. Taking the product of the combinations and probabilities, we find the same total probability; this matches the result of calculation method 1 exactly. (Same observation table as in part 4.)
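For instance, the first group in method 2 is a 4-trial multinomial in which all four yields are high, so its own combination term is 1 (a reconstruction of the slide's step, with the conditional probabilities left symbolic):

```latex
p(4H, 0M, 0L \mid T{=}H, \text{Cat}{=}A)
\;=\; \frac{4!}{4!\,0!\,0!}\; p(Y{=}H \mid T{=}H, \text{Cat}{=}A)^{4}
\;=\; p(Y{=}H \mid T{=}H, \text{Cat}{=}A)^{4}.
```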

This term is the probability of the data given a model and parameters: P(data | model, parameters). The absolute value of this probability is not very informative by itself, but it can be if it is compared to something else. Note that the joint probability model here is

p(temperature, catalyst, yield) = p(temperature) · p(catalyst) · p(yield | temperature, catalyst) = 0.047 · … · … = 7.07e-7

(Note: p(temp) and p(cat) were calculated earlier in the lecture.)

As an example, let's say that you try another model, in which yield depends only on temperature. This model is shown graphically below. What is the conditional probability model now?

P(temperature, cat, yield) = p(temp) p(cat) p(yield | temp)   (call this model 2)

P(temperature, cat, yield) = p(temp) p(cat) p(yield | temp)   (model 2)

How do we change the conditional probability table p(yield | temp, cat) to get p(yield | temp)?
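One plausible way to collapse the table (an assumption about the step, not necessarily exactly what the slide did) is to marginalize out the catalyst, weighting by p(cat). The sketch below uses hypothetical placeholder numbers, since the original tables are not in this transcript.

```python
# Marginalize catalyst out of p(yield | temp, cat) to get the p(yield | temp)
# table that model 2 needs. All numbers are hypothetical placeholders.
p_cat = {"A": 0.7, "B": 0.3}
p_yield_given_temp_cat = {("H", "A"): {"H": 0.8, "L": 0.2},
                          ("H", "B"): {"H": 0.5, "L": 0.5},
                          ("M", "A"): {"H": 0.4, "L": 0.6},
                          ("M", "B"): {"H": 0.2, "L": 0.8},
                          ("L", "A"): {"H": 0.6, "L": 0.4},
                          ("L", "B"): {"H": 0.1, "L": 0.9}}

p_yield_given_temp = {
    temp: {y: sum(p_cat[cat] * p_yield_given_temp_cat[(temp, cat)][y] for cat in p_cat)
           for y in ("H", "L")}
    for temp in ("H", "M", "L")
}
print(p_yield_given_temp)   # each row still sums to 1
```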

Now what?

So which model is better?

A Bayes factor (BF) is like a p-value, expressed in probabilistic or Bayesian terms.
BF near 1: the two models are nearly equal.
BF far from 1: the models are different.
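In symbols, for two candidate models M1 and M2 evaluated on the same data D:

```latex
\mathrm{BF} \;=\; \frac{P(D \mid M_1)}{P(D \mid M_2)}
```

A BF well above 1 favors M1, a BF well below 1 favors M2, and a BF near 1 says the data do not distinguish the two models.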

Limitations: the analysis is based on only 9 data points, and we don't always have parameters like the truth table to start with.

Even so, this kind of analysis is useful for identifying unusual behavior. For example, in this case we might conclude that catalysts A and B still have distinct properties, even though, say, they have been recycled many times.

Constraints: there are a total of 100 samples drawn, thus 100 = H + M + L. For the maximum likelihood case, H = 51, so the relationship between M and L is 100 = 51 + M + L, i.e. M = 49 - L. At some lower value of H we get the expression M = (100 - H) - L. Integrate by summing! (51 H, 8 M, and 41 L.)
[Figure: plot with axes L and M showing the constraint lines.]

Take Home Messages
- Using a Bayesian network, you can describe complex relationships between variables.
- Multinomial distributions allow you to handle variables with more than 2 states.
- Using the rules of probability (Bayes' rule, marginalization, and independence), you can infer states on a Bayesian network.