Download presentation
Presentation is loading. Please wait.
1
White Parts from: Technical overview for machine-learning researcher – slides from UAI 1999 tutorial
2
Important Distinctions in Learning BNs
Complete data versus incomplete data Observed variables versus hidden variables Learning parameters versus learning structure Scoring methods versus conditional independence tests methods Exact scores versus asymptotic scores Search strategies versus Optimal learning of trees/polytrees/TANs
3
of this lecture The lecture today assumes: complete data, no hidden variables, exact scores, general search
4
The maximum for binomial sampling data is obtained when = h / (h+t).
The Maximum Likelihood approach: maximize probability of data with respect to the unknown parameter(s). Probability of data: The maximum for binomial sampling data is obtained when = h / (h+t). Easier yet equivalent method is to maximize the log likelihood function.
5
Use a priori knowledge rather than data alone.
Encode uncertainty about the parameter. Choose the median or average as a point estimate
6
p(|data) = p() p(data|)
(As before) p(|data) = p() p(data|)
7
If we had more 100 heads the peak would move much more to the right.
If we had 50 heads and 50 tails the peak would just sharpen considerably. p(|data) = p(| hhth …ttth) = p() p( hhth …ttth|) p(| data) = p() #h (1-) #t = p(| #h, #t) (#h, #t) are sufficient statistics for binomial sampling
9
Example: ht … htthh
10
11
12
15
From Prior to Posterior
Observation 1: If the prior is Beta(;a,b) and we have seen A heads and B tails, then the posterior is Beta(;A+a,B+b). Consequence: If the prior is Beta(;a,b) and we use a point estimate = a/N, then after seeing the data our point estimate changes to = (A+a)/(N+N’) where N’=A+B. So what is a good choice for the hyper-parameters {a, b} ? For a random coin, maybe (100,100)? For a random thumbtack maybe (7,3)? a and b are imaginary counts, N=a+b is the equivalent sample size while A, B are the data counts and A+B is the data size.
21
From Prior to Posterior in Blocks
Observation 2: If the prior is Dir(a1,…,an) and we have seen A1 …An counts from each state, then the posterior is Dir(a1+A1,…, an+An). Consequence: If the prior is Dir(a1,…,an) and we use a point estimate = (a1/N,…,an/N) then after seeing the data our point estimate changes to =( ( a1+A1)/(N+N’), … ,(an+An)/(N+N’)) Note that posterior distribution can be updated after each data point (namely in sequel or “online”) and that the posterior can serve as prior for the future data points.
22
Another view of the update
Recall the Consequence that from = (a1/N,…,an/N) we move on after seeing the data to i = ( ai+Ai)/(N+N’) for each i. This update can be viewed as mixture of prior and data estimates: i = N/(N+N’) (ai/N) + N’(N+N’) (Ai/N’)
23
p(), p(h| ) = and p(t| ) = 1-
Learning Bayes Net parameters p(), p(h| ) = and p(t| ) = 1-
24
Learning Bayes Net parameters
25
P(X=x | x) = x P(Y=y |X=x, y|x, y|~x)= y|x P(Y=y |X=~x, y|x, y|~x)= y|~x Global and Local parameter independence three separate independent thumbtack estimation tasks, assuming complete data.
Similar presentations
© 2024 SlidePlayer.com. Inc.
All rights reserved.