White parts from: Technical overview for the machine-learning researcher – slides from UAI 1999 tutorial
Important Distinctions in Learning BNs
Complete data versus incomplete data
Observed variables versus hidden variables
Learning parameters versus learning structure
Scoring methods versus conditional-independence test methods
Exact scores versus asymptotic scores
Search strategies versus optimal learning of trees/polytrees/TANs
Scope of this lecture. The lecture today assumes: complete data, no hidden variables, exact scores, general search.
The Maximum Likelihood approach: maximize the probability of the data with respect to the unknown parameter(s).
For binomial sampling data, the maximum is obtained when θ = h / (h+t).
An easier yet equivalent method is to maximize the log-likelihood function.
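As a quick illustration, here is a minimal Python sketch of the binomial log-likelihood and its closed-form maximizer; the counts and names are illustrative, not from the slides:

```python
import math

def log_likelihood(theta, h, t):
    """Log-probability of observing h heads and t tails given theta."""
    return h * math.log(theta) + t * math.log(1.0 - theta)

h, t = 7, 3
theta_ml = h / (h + t)  # closed-form maximizer: theta = h/(h+t)

# Sanity check: the closed-form estimate beats nearby values of theta.
assert all(log_likelihood(theta_ml, h, t) >= log_likelihood(x, h, t)
           for x in (0.1, 0.3, 0.5, 0.69, 0.71, 0.9))
print(theta_ml)  # 0.7
```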
The Bayesian approach: use a priori knowledge rather than data alone. Encode uncertainty about the parameter θ. Choose the median or mean of the posterior as a point estimate.
p(θ | data) ∝ p(θ) · p(data | θ)   (as before)
If we had 100 more heads, the peak would move much further to the right. If we had 50 heads and 50 tails, the peak would just sharpen considerably.
p(θ | data) = p(θ | hhth…ttth) ∝ p(θ) · p(hhth…ttth | θ)
p(θ | data) ∝ p(θ) · θ^#h · (1−θ)^#t = p(θ | #h, #t)
(#h, #t) are sufficient statistics for binomial sampling.
Example: ht … htthh
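A small sketch of this peak behavior, assuming a flat (uniform) prior p(θ), which the slide does not fix; under that assumption the posterior is proportional to θ^#h · (1−θ)^#t, i.e. Beta(#h+1, #t+1), so only the sufficient statistics matter:

```python
def posterior_mode(h, t):
    # Mode of Beta(h+1, t+1), the posterior under a flat prior.
    return h / (h + t)

def posterior_variance(h, t):
    a, b = h + 1, t + 1          # Beta(h+1, t+1) under the flat prior
    return a * b / ((a + b) ** 2 * (a + b + 1))

h, t = 3, 2
print(posterior_mode(h, t))            # 0.6
print(posterior_mode(h + 100, t))      # ~0.98: 100 more heads moves the peak right
print(posterior_mode(h + 50, t + 50))  # ~0.50: 50 heads + 50 tails re-centers it...
print(posterior_variance(h, t),        # ...while the variance shrinks sharply,
      posterior_variance(h + 50, t + 50))  # i.e. the peak sharpens
```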
From Prior to Posterior
Observation 1: If the prior is Beta(θ; a, b) and we have seen A heads and B tails, then the posterior is Beta(θ; a+A, b+B).
Consequence: If the prior is Beta(θ; a, b) and we use the point estimate θ = a/N, then after seeing the data our point estimate changes to θ = (a+A)/(N+N'), where N' = A+B.
So what is a good choice for the hyper-parameters {a, b}? For a random coin, maybe (100, 100)? For a random thumbtack, maybe (7, 3)?
Here a and b are imaginary counts and N = a+b is the equivalent sample size, while A and B are the data counts and N' = A+B is the data size.
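A minimal sketch of Observation 1 and its Consequence; the data counts below are made up, while the (7, 3) prior is the slide's thumbtack suggestion:

```python
def beta_update(a, b, A, B):
    """Posterior counts: Beta(a, b) prior plus A heads and B tails."""
    return a + A, b + B

a, b = 7, 3          # "random thumbtack" prior: equivalent sample size N = 10
A, B = 20, 10        # observed data counts, N' = 30
a_post, b_post = beta_update(a, b, A, B)

prior_estimate = a / (a + b)                     # a/N = 0.7
posterior_estimate = a_post / (a_post + b_post)  # (a+A)/(N+N') = 27/40 = 0.675
print(prior_estimate, posterior_estimate)
```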
From Prior to Posterior in Blocks
Observation 2: If the prior is Dir(a1,…,an) and we have seen A1,…,An counts from each state, then the posterior is Dir(a1+A1,…,an+An).
Consequence: If the prior is Dir(a1,…,an) and we use the point estimate θ = (a1/N,…,an/N), then after seeing the data our point estimate changes to θ = ((a1+A1)/(N+N'),…,(an+An)/(N+N')).
Note that the posterior can be updated after each data point (that is, sequentially or "online") and that the posterior can serve as the prior for future data points.
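A sketch of Observation 2 with the online updating noted above; the function names and data are illustrative:

```python
def dirichlet_update(alpha, state):
    """Absorb one observed state (an index) into the Dirichlet counts."""
    alpha = list(alpha)
    alpha[state] += 1
    return alpha

def point_estimate(alpha):
    total = sum(alpha)
    return [a / total for a in alpha]

alpha = [2, 2, 2]             # prior counts (a1, a2, a3), N = 6
for x in [0, 0, 1, 0, 2, 0]:  # six data points, so (A1, A2, A3) = (4, 1, 1)
    alpha = dirichlet_update(alpha, x)  # each posterior is the next prior

print(alpha)                  # [6, 3, 3] = (a_i + A_i)
print(point_estimate(alpha))  # [0.5, 0.25, 0.25]
```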
Another view of the update
Recall the Consequence: from θ = (a1/N,…,an/N) we move, after seeing the data, to θi = (ai+Ai)/(N+N') for each i.
This update can be viewed as a mixture of the prior and data estimates:
θi = N/(N+N') · (ai/N) + N'/(N+N') · (Ai/N')
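A quick numerical check, with made-up counts, that the mixture view agrees with the direct posterior update:

```python
a = [2, 2, 2];  N  = sum(a)   # prior counts and equivalent sample size
A = [4, 1, 1];  Np = sum(A)   # data counts and data size N'

for ai, Ai in zip(a, A):
    direct  = (ai + Ai) / (N + Np)
    mixture = (N / (N + Np)) * (ai / N) + (Np / (N + Np)) * (Ai / Np)
    assert abs(direct - mixture) < 1e-12

print("mixture view matches the direct update")
```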
Learning Bayes Net parameters: p(θ), with p(h | θ) = θ and p(t | θ) = 1−θ.
Global and Local parameter independence
P(X = x | θ_x) = θ_x
P(Y = y | X = x, θ_y|x, θ_y|~x) = θ_y|x
P(Y = y | X = ~x, θ_y|x, θ_y|~x) = θ_y|~x
With global and local parameter independence, these become three separate, independent thumbtack estimation tasks, assuming complete data.
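A minimal sketch of this decomposition under complete data; the tiny dataset and variable names are illustrative:

```python
data = [  # complete observations of (X, Y) as booleans
    (True, True), (True, False), (True, True),
    (False, False), (False, True), (False, False),
]

# Three separate counting ("thumbtack") problems, one per parameter.
n_x = sum(1 for x, _ in data if x)
n_notx = len(data) - n_x
n_y_given_x = sum(1 for x, y in data if x and y)
n_y_given_notx = sum(1 for x, y in data if not x and y)

theta_x = n_x / len(data)                     # ML estimate of P(X=x)
theta_y_given_x = n_y_given_x / n_x           # ML estimate of P(Y=y | X=x)
theta_y_given_notx = n_y_given_notx / n_notx  # ML estimate of P(Y=y | X=~x)

print(theta_x, theta_y_given_x, theta_y_given_notx)  # 0.5 0.666... 0.333...
```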