Important Distinctions in Learning BNs


White Parts from: Technical overview for machine-learning researcher – slides from UAI 1999 tutorial

Important Distinctions in Learning BNs:
Complete data versus incomplete data
Observed variables versus hidden variables
Learning parameters versus learning structure
Scoring methods versus conditional independence test methods
Exact scores versus asymptotic scores
Search strategies versus optimal learning of trees/polytrees/TANs

This lecture assumes complete data, no hidden variables, exact scores, and general search.

The Maximum Likelihood approach: maximize the probability of the data with respect to the unknown parameter(s). For binomial sampling with h heads and t tails, the probability of the data is p(data | θ) = θ^h (1 - θ)^t, and the maximum is obtained when θ = h / (h+t). An easier yet equivalent method is to maximize the log-likelihood function, h log θ + t log(1 - θ).
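The maximum-likelihood estimate can be checked numerically. The following is a minimal sketch, not taken from the slides: the function names, the grid search, and the counts (7 heads, 3 tails) are illustrative assumptions.

```python
import math

def log_likelihood(theta, h, t):
    """Log of theta^h * (1 - theta)^t, defined for 0 < theta < 1."""
    return h * math.log(theta) + t * math.log(1.0 - theta)

def mle_closed_form(h, t):
    """Closed-form maximizer of the binomial likelihood: h / (h + t)."""
    return h / (h + t)

def mle_grid(h, t, steps=1000):
    """Brute-force maximizer of the log likelihood over a grid in (0, 1)."""
    grid = [i / steps for i in range(1, steps)]
    return max(grid, key=lambda theta: log_likelihood(theta, h, t))

# Hypothetical counts, chosen only for illustration.
h, t = 7, 3
print("closed form:", mle_closed_form(h, t))
print("grid search:", mle_grid(h, t))
```

Both routes should agree up to the grid resolution, which illustrates that maximizing the log likelihood gives the same answer as maximizing the likelihood itself.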

The Bayesian approach: use a priori knowledge rather than data alone. Encode uncertainty about the parameter as a probability distribution over it. Choose the median or average of that distribution as a point estimate.

p(θ | data) ∝ p(θ) p(data | θ), where p(data | θ) is the likelihood, as before.

If we had 100 more heads, the peak of the posterior would move much further to the right. If we had 50 heads and 50 tails, the peak would just sharpen considerably. p(θ | data) = p(θ | hhth…ttth) ∝ p(θ) p(hhth…ttth | θ) = p(θ) θ^#h (1 - θ)^#t, so p(θ | data) = p(θ | #h, #t): the counts (#h, #t) are sufficient statistics for binomial sampling.
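Both points on this slide can be illustrated with a small grid approximation of the posterior. This is a sketch under stated assumptions: the uniform prior, the grid size, the example sequences, and all function names are hypothetical choices, not part of the original slides.

```python
def sequence_likelihood(sequence, theta):
    """Probability of one specific heads/tails sequence given theta."""
    p = 1.0
    for outcome in sequence:
        p *= theta if outcome == "h" else 1.0 - theta
    return p

def grid_posterior(heads, tails, prior, steps=400):
    """Posterior on a grid, proportional to prior(theta) * theta^heads * (1 - theta)^tails."""
    grid = [(i + 0.5) / steps for i in range(steps)]
    weights = [prior(th) * th**heads * (1.0 - th)**tails for th in grid]
    total = sum(weights)
    return grid, [w / total for w in weights]

def mean_and_std(grid, probs):
    """Mean and standard deviation of a discrete distribution on the grid."""
    mean = sum(th * p for th, p in zip(grid, probs))
    var = sum((th - mean) ** 2 * p for th, p in zip(grid, probs))
    return mean, var ** 0.5

def uniform(theta):
    """Assumed flat prior on (0, 1)."""
    return 1.0

# Same counts (3 heads, 1 tail) in a different order: equal up to rounding.
print(sequence_likelihood("hhth", 0.6), sequence_likelihood("thhh", 0.6))

# Balanced data keeps the peak at 0.5, and more of it narrows the posterior.
for heads, tails in [(5, 5), (55, 55)]:
    print((heads, tails), mean_and_std(*grid_posterior(heads, tails, uniform)))
```

Only the counts enter `grid_posterior`, which is the sufficient-statistics property in code form: the order of the flips never matters.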

Example: ht … htthh

From Prior to Posterior. Observation 1: If the prior is Beta(θ; a, b) and we have seen A heads and B tails, then the posterior is Beta(θ; A+a, B+b). Consequence: If the prior is Beta(θ; a, b) and we use the point estimate θ = a/N, then after seeing the data our point estimate changes to θ = (A+a)/(N+N'), where N = a+b and N' = A+B. Here a and b are imaginary counts and N = a+b is the equivalent sample size, while A and B are the data counts and N' = A+B is the data size. So what is a good choice for the hyper-parameters {a, b}? For a random coin, maybe (100, 100)? For a random thumbtack, maybe (7, 3)?
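The Beta update and its effect on the point estimate can be sketched in a few lines. The function names and the data counts (3 heads, 7 tails) are hypothetical; the two priors are the ones suggested above for a coin and a thumbtack.

```python
def beta_update(a, b, heads, tails):
    """Posterior hyper-parameters after seeing the data: Beta(a + A, b + B)."""
    return a + heads, b + tails

def point_estimate(a, b):
    """Point estimate a / N, where N = a + b is the equivalent sample size."""
    return a / (a + b)

# Hypothetical data: A = 3 heads, B = 7 tails.
A, B = 3, 7

for name, (a, b) in [("coin (100, 100)", (100, 100)), ("thumbtack (7, 3)", (7, 3))]:
    post_a, post_b = beta_update(a, b, A, B)
    print(name,
          "prior estimate:", point_estimate(a, b),
          "posterior estimate:", point_estimate(post_a, post_b))
```

With the large equivalent sample size of the coin prior, ten observations barely move the estimate, whereas the weaker thumbtack prior shifts noticeably toward the data.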

From Prior to Posterior in Blocks. Observation 2: If the prior is Dir(a1,…,an) and we have seen counts A1,…,An for the n states, then the posterior is Dir(a1+A1,…,an+An). Consequence: If the prior is Dir(a1,…,an) and we use the point estimate θ = (a1/N,…,an/N), then after seeing the data our point estimate changes to θ = ((a1+A1)/(N+N'),…,(an+An)/(N+N')), where N = a1+…+an and N' = A1+…+An. Note that the posterior distribution can be updated after each data point (that is, sequentially or "online") and that the posterior can serve as the prior for future data points.
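The equivalence between the block update and the online update can be checked directly. This is an illustrative sketch; the function names, the prior Dir(1, 1, 1), and the observation sequence are assumptions made for the example.

```python
def dirichlet_update(alpha, counts):
    """Block update: add the observed counts to the prior hyper-parameters."""
    return [a + c for a, c in zip(alpha, counts)]

def dirichlet_point_estimate(alpha):
    """Point estimate (a1/N, ..., an/N) with N = a1 + ... + an."""
    total = sum(alpha)
    return [a / total for a in alpha]

def online_update(alpha, observations):
    """Update one data point at a time; each posterior is the next prior."""
    current = list(alpha)
    for state in observations:
        one_hot = [1 if i == state else 0 for i in range(len(current))]
        current = dirichlet_update(current, one_hot)
    return current

prior = [1, 1, 1]                # hypothetical prior over three states
observations = [0, 2, 2, 1, 2]   # hypothetical sequence of state indices
counts = [observations.count(i) for i in range(len(prior))]

print("block :", dirichlet_update(prior, counts))
print("online:", online_update(prior, observations))
print("estimate:", dirichlet_point_estimate(dirichlet_update(prior, counts)))
```

Because the update only adds counts, processing the data in one block or one point at a time yields the same posterior hyper-parameters.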

Another view of the update. Recall the Consequence: from θ = (a1/N,…,an/N) we move, after seeing the data, to θi = (ai+Ai)/(N+N') for each i. This update can be viewed as a mixture of the prior and data estimates: θi = [N/(N+N')] (ai/N) + [N'/(N+N')] (Ai/N').
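The mixture identity is easy to verify numerically. The sketch below uses hypothetical names and counts, and assumes at least one data point so that the data estimate Ai/N' is defined.

```python
def posterior_estimate(a, A):
    """Direct form: (ai + Ai) / (N + N') for each state i."""
    N, N_data = sum(a), sum(A)
    return [(ai + Ai) / (N + N_data) for ai, Ai in zip(a, A)]

def mixture_estimate(a, A):
    """Mixture form: weight N/(N+N') on the prior estimate, N'/(N+N') on the data estimate."""
    N, N_data = sum(a), sum(A)  # assumes N_data > 0
    w_prior = N / (N + N_data)
    w_data = N_data / (N + N_data)
    return [w_prior * (ai / N) + w_data * (Ai / N_data) for ai, Ai in zip(a, A)]

a = [7, 3]   # hypothetical imaginary counts
A = [2, 8]   # hypothetical data counts

print(posterior_estimate(a, A))
print(mixture_estimate(a, A))
```

The two forms agree up to floating-point rounding. The mixture weights make explicit that a larger equivalent sample size N pulls the estimate toward the prior, while more data pulls it toward the observed frequencies.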

Learning Bayes Net parameters: p(θ), p(h | θ) = θ and p(t | θ) = 1 - θ.

Learning Bayes Net parameters

P(X=x | θ_x) = θ_x; P(Y=y | X=x, θ_{y|x}, θ_{y|~x}) = θ_{y|x}; P(Y=y | X=~x, θ_{y|x}, θ_{y|~x}) = θ_{y|~x}. Global and local parameter independence imply three separate, independent thumbtack estimation tasks, assuming complete data.
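For a two-node network X → Y with complete data, the decomposition into three thumbtack problems can be sketched as follows. The record format (pairs of booleans), the shared Beta(1, 1) prior, the function names, and the data are all assumptions made for illustration.

```python
def beta_point_estimate(a, b, heads, tails):
    """Posterior point estimate (a + A) / (N + N') for one thumbtack problem."""
    return (a + heads) / (a + b + heads + tails)

def learn_two_node_network(data, prior=(1, 1)):
    """Estimate theta_x, theta_{y|x}, theta_{y|~x} from complete (x, y) records.

    Each parameter is estimated on its own: theta_x from all records,
    theta_{y|x} only from records with X = x, and theta_{y|~x} only
    from records with X = ~x.
    """
    a, b = prior
    n_x = sum(1 for x, _ in data if x)
    n_not_x = len(data) - n_x
    n_y_and_x = sum(1 for x, y in data if x and y)
    n_y_and_not_x = sum(1 for x, y in data if not x and y)
    return {
        "theta_x": beta_point_estimate(a, b, n_x, n_not_x),
        "theta_y_given_x": beta_point_estimate(a, b, n_y_and_x, n_x - n_y_and_x),
        "theta_y_given_not_x": beta_point_estimate(
            a, b, n_y_and_not_x, n_not_x - n_y_and_not_x
        ),
    }

# Hypothetical complete data set of (x, y) observations.
data = [(True, True), (True, True), (True, False),
        (False, False), (False, True), (True, True)]
print(learn_two_node_network(data))
```

Each estimate uses only its own slice of the counts, which is what parameter independence buys: no estimate depends on the value of another parameter.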