CS498-EA Reasoning in AI Lecture #20 Instructor: Eyal Amir Fall Semester 2009 Who is in this class? Some slides in this set were adapted from Eran Segal
Summary of last time: Inference We presented the variable elimination (VE) algorithm Specifically, VE for computing the marginal P(Xi) over a single variable Xi from X1,…,Xn Pick an elimination order on the variables; one variable Xj is eliminated at a time: (a) move terms not involving Xj outside the summation over Xj; (b) create a new potential function fXj(.) over the other variables appearing in the terms of the summation in (a) Works for both BNs and MFs (Markov fields)
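A minimal Python sketch of the elimination loop just described, assuming binary variables and a simple (scope, table) factor representation; the helper names multiply, sum_out, and eliminate are illustrative, not code from the lecture.

from itertools import product

def multiply(f1, f2):
    # Pointwise product of two factors given as (scope, table) pairs, where
    # table maps tuples of values (in scope order) to nonnegative numbers.
    scope1, t1 = f1
    scope2, t2 = f2
    scope = list(scope1) + [v for v in scope2 if v not in scope1]
    table = {}
    for assign in product([0, 1], repeat=len(scope)):  # binary domains assumed
        a = dict(zip(scope, assign))
        table[assign] = t1[tuple(a[x] for x in scope1)] * t2[tuple(a[x] for x in scope2)]
    return scope, table

def sum_out(f, var):
    # Step (b): create the new potential f_Xj by summing var out of factor f.
    scope, t = f
    idx = scope.index(var)
    new_scope = [v for v in scope if v != var]
    new_table = {}
    for assign, val in t.items():
        key = assign[:idx] + assign[idx + 1:]
        new_table[key] = new_table.get(key, 0.0) + val
    return new_scope, new_table

def eliminate(factors, order, query):
    # Eliminate the variables in `order` (all variables except `query`),
    # then multiply the remaining factors and normalize to get P(query).
    assert query not in order
    for xj in order:
        used = [f for f in factors if xj in f[0]]      # terms involving Xj
        rest = [f for f in factors if xj not in f[0]]  # step (a): moved outside the sum over Xj
        if not used:
            continue
        prod = used[0]
        for f in used[1:]:
            prod = multiply(prod, f)
        factors = rest + [sum_out(prod, xj)]
    result = factors[0]
    for f in factors[1:]:
        result = multiply(result, f)
    scope, table = result
    z = sum(table.values())
    return {assign: val / z for assign, val in table.items()}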
Exact Inference Treewidth methods: Variable elimination Clique tree algorithm Treewidth
Today: Learning in Graphical Models Parameter Estimation Maximum Likelihood Complete Observations Naïve Bayes Not so Naïve Bayes
Learning Introduction So far, we assumed that the networks were given Where do the networks come from? Knowledge engineering with the aid of experts Automated construction of networks: learning from examples or instances
Parameter estimation Maximum likelihood estimation: maximize Π_i P(b[i], c[i], e[i]; w0, w1)
Learning Introduction Input: dataset of instances D={d[1],...,d[m]} Output: Bayesian network Measures of success: How close is the learned network to the original distribution? Use distance measures between distributions; often hard because we do not have the true underlying distribution Instead, evaluate performance by how well the network predicts new unseen examples (“test data”), e.g., classification accuracy How close is the structure of the network to the true one? Use a distance metric between structures; hard because we do not know the true structure Instead, ask whether the independencies learned hold in test data
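One hedged sketch of the “predict unseen test data” measure of success: score a learned network by its average log-likelihood on held-out instances. The function model_prob is a stand-in for whatever P(x) the learned network defines, not an API from the lecture.

import math

def heldout_log_likelihood(model_prob, test_data):
    # Average log P(d) over held-out instances d; higher means better prediction.
    return sum(math.log(model_prob(d)) for d in test_data) / len(test_data)

# Toy usage with a learned distribution over a single binary variable:
learned = {0: 0.3, 1: 0.7}
print(heldout_log_likelihood(lambda d: learned[d], [1, 1, 0, 1]))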
Prior Knowledge Prespecified structure and variables: learn only CPDs Prespecified variables: learn network structure and CPDs Hidden variables: learn hidden variables, structure, and CPDs Complete/incomplete data: missing data, unobserved variables
Learning Bayesian Networks
[Figure: data and prior information are fed to an Inducer, which outputs a network X1, X2 -> Y together with the CPD P(Y|X1,X2):]
P(Y|X1,X2)      y0     y1
x10, x20        1      0
x10, x21        0.2    0.8
x11, x20        0.1    0.9
x11, x21        0.02   0.98
Known Structure, Complete Data Goal: parameter estimation; the data does not contain missing values [Figure: an initial network over X1, X2, Y and fully observed input data over (X1, X2, Y) are fed to the Inducer, which outputs the CPD P(Y|X1,X2)]
Unknown Structure, Complete Data Goal: structure learning & parameter estimation; the data does not contain missing values [Figure: fully observed input data over (X1, X2, Y) is fed to the Inducer, which outputs both the network structure and the CPD P(Y|X1,X2)]
Known Structure, Incomplete Data Goal: parameter estimation; the data contains missing values (shown as “?”) [Figure: an initial network over X1, X2, Y and input data with missing entries are fed to the Inducer, which outputs the CPD P(Y|X1,X2)]
Unknown Structure, Incomplete Data Goal: structure learning & parameter estimation; the data contains missing values [Figure: input data with missing entries is fed to the Inducer, which outputs both the network structure and the CPD P(Y|X1,X2)]
Parameter Estimation Input: network structure, choice of parametric family for each CPD P(Xi|Pa(Xi)) Goal: learn CPD parameters Two main approaches: maximum likelihood estimation and Bayesian approaches
Biased Coin Toss Example Coin can land in two positions: Head or Tail Estimation task: given toss examples x[1],...,x[m], estimate P(H)=θ and P(T)=1-θ Assumption: i.i.d. samples Tosses are controlled by an (unknown) parameter θ Tosses are sampled from the same distribution Tosses are independent of each other
Biased Coin Toss Example Goal: find θ in [0,1] that predicts the data well “Predicts the data well” = likelihood of the data given θ Example: the probability of the sequence H,T,T,H,H is L(D:θ) = θ·(1-θ)·(1-θ)·θ·θ = θ^3·(1-θ)^2 [Plot: L(D:θ) as a function of θ over [0,1]]
Maximum Likelihood Estimator The parameter θ that maximizes L(D:θ) In our example, θ=0.6 maximizes the likelihood of the sequence H,T,T,H,H [Plot: L(D:θ) over [0,1], peaking at θ=0.6]
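A quick numerical check of this example (assuming, as on the previous slide, L(D:θ) = θ^3·(1-θ)^2 for H,T,T,H,H): a grid search over θ recovers the maximizer 0.6.

# Grid search over theta in [0,1]; the likelihood of H,T,T,H,H peaks at 0.6.
thetas = [i / 1000 for i in range(1001)]
best = max(thetas, key=lambda t: t**3 * (1 - t)**2)
print(best)  # 0.6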
Maximum Likelihood Estimator General case Observations: MH heads and MT tails Find θ maximizing the likelihood L(D:θ) = θ^MH·(1-θ)^MT Equivalent to maximizing the log-likelihood log L(D:θ) = MH·log θ + MT·log(1-θ) Differentiating the log-likelihood and solving for θ, the maximum likelihood parameter is θ = MH / (MH + MT)
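The closed-form estimator above is easy to state in code; a small sketch, assuming the data is just a list of 'H'/'T' outcomes:

def coin_mle(tosses):
    # Maximum likelihood estimate: theta = M_H / (M_H + M_T).
    m_h = sum(1 for t in tosses if t == 'H')
    m_t = len(tosses) - m_h
    return m_h / (m_h + m_t)

print(coin_mle(['H', 'T', 'T', 'H', 'H']))  # 0.6, matching the earlier example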
Sufficient Statistics For computing the parameter of the coin toss example, we only needed MH and MT, since the likelihood depends on the data only through them: L(D:θ) = θ^MH·(1-θ)^MT MH and MT are sufficient statistics
Sufficient Statistics Definition: a function s(D) from instances to a vector in R^k is a sufficient statistic if for any two datasets D and D’ and any θ we have s(D) = s(D’) ⟹ L(D:θ) = L(D’:θ) [Figure: different datasets mapping to the same statistics]
Sufficient Statistics for Multinomial A sufficient statistic for a dataset D over a variable Y with k values is the tuple of counts <M1,...,Mk> such that Mi is the number of times that Y=yi appears in D Per-instance statistic: define s(x) as a tuple of dimension k; if x=yi then s(x) = (0,...,0,1,0,...,0), with the 1 in position i (zeros in positions 1,...,i-1 and i+1,...,k)
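A small sketch of the multinomial sufficient statistics, assuming instances are plain values of Y; the function names are illustrative only.

def indicator(x, values):
    # Per-instance statistic s(x): 1 in the position of the observed value, 0 elsewhere.
    return [1 if v == x else 0 for v in values]

def counts(data, values):
    # s(D) = sum over instances of s(x) = <M1,...,Mk>, the tuple of counts.
    return [sum(1 for x in data if x == v) for v in values]

print(indicator('b', ['a', 'b', 'c']))                     # [0, 1, 0]
print(counts(['a', 'b', 'a', 'c', 'a'], ['a', 'b', 'c']))  # [3, 1, 1]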
Bias and Variance of Estimator
Next Time