CS B553: ALGORITHMS FOR OPTIMIZATION AND LEARNING. Parameter Learning with Hidden Variables & Expectation Maximization
AGENDA: Learning probability distributions from data in the setting of known structure and missing data. The expectation-maximization (EM) algorithm.
BASIC PROBLEM: Given a dataset $D = \{x[1], \ldots, x[M]\}$ and a Bayesian model over observed variables $X$ and hidden (latent) variables $Z$, fit the distribution $P(X, Z)$ to the data. Interpretation: each example $x[m]$ is an incomplete view of the “underlying” sample $(x[m], z[m])$. [Figure: network $Z \to X$.]
APPLICATIONS: Clustering in data mining; dimensionality reduction; latent psychological traits (e.g., intelligence, personality); document classification; human activity recognition.
HIDDEN VARIABLES CAN YIELD MORE PARSIMONIOUS MODELS: Hidden variables => conditional independences. [Figure: left, $Z$ is the parent of $X_1, X_2, X_3, X_4$; right, the same four observables with $Z$ removed.] With $Z$ and binary variables, the model has $1 + 4 \cdot 2 = 9$ parameters. Without $Z$, the observables become fully dependent, which takes $2^4 - 1 = 15$ parameters.
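The parameter counts on this slide can be checked with a few lines of Python. This is a minimal sketch that assumes all variables are binary, as the 1+4*2 count suggests; the function names are illustrative and not from the slides.

```python
def params_with_hidden(n_obs: int) -> int:
    """Free parameters when a binary Z is the only parent of n_obs binary X_i."""
    # One parameter for P(Z), plus two per X_i (one Bernoulli for each value of Z).
    return 1 + 2 * n_obs


def params_full_joint(n_obs: int) -> int:
    """Free parameters of an unrestricted joint over n_obs binary variables."""
    # 2**n_obs table entries, minus one because they must sum to 1.
    return 2 ** n_obs - 1


print(params_with_hidden(4), params_full_joint(4))  # 9 15
```

The gap widens quickly: the hidden-variable count grows linearly in the number of observables, while the full joint grows exponentially.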
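GENERATING MODEL: [Figure: unrolled network $z[1] \to x[1]$, …, $z[M] \to x[M]$; every $z[m]$ has parent $\theta_z$ and every $x[m]$ has parent $\theta_{x|z}$.] These CPTs are identical and given by the shared parameters $\theta_z$ and $\theta_{x|z}$.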
EXAMPLE: DISCRETE VARIABLES: [Figure: the same unrolled network.] Categorical distributions given by parameters $\theta_z$: $P(Z[i] \mid \theta_z) = \mathrm{Categorical}(\theta_z)$. Categorical distribution $P(X[i] \mid z[i], \theta_{x|z[i]}) = \mathrm{Categorical}(\theta_{x|z[i]})$ (in other words, $z[i]$ multiplexes between Categorical distributions).
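The generating process described here can be sketched as a sampler. This is an illustration under stated assumptions: `theta_z` is a list of probabilities over the values of Z, `theta_x_given_z[z]` is a list of probabilities over the values of X for that z, and the function name and `seed` argument are hypothetical.

```python
import random


def sample_dataset(theta_z, theta_x_given_z, M, seed=0):
    """Draw M pairs (x[m], z[m]); a learner would only get to see the x's."""
    rng = random.Random(seed)
    xs, zs = [], []
    for _ in range(M):
        # z[m] ~ Categorical(theta_z)
        z = rng.choices(range(len(theta_z)), weights=theta_z)[0]
        # x[m] ~ Categorical(theta_{x|z[m]}): z selects which distribution is used
        probs = theta_x_given_z[z]
        x = rng.choices(range(len(probs)), weights=probs)[0]
        xs.append(x)
        zs.append(z)
    return xs, zs


xs, zs = sample_dataset([0.75, 0.25], [[0.9, 0.1], [0.2, 0.8]], M=10)
print(xs)  # observed data D
print(zs)  # hidden in the learning setting
```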
MAXIMUM LIKELIHOOD ESTIMATION: Approach: find values of $\theta = (\theta_z, \theta_{x|z})$ and $D_Z = (z[1], \ldots, z[M])$ that maximize the likelihood of the data, $L(\theta, D_Z; D) = P(D \mid \theta, D_Z)$. Find $\arg\max L(\theta, D_Z; D)$ over $\theta, D_Z$.
MARGINAL LIKELIHOOD ESTIMATION: Approach: find values of $\theta = (\theta_z, \theta_{x|z})$ that maximize the likelihood of the data without assuming values of $D_Z = (z[1], \ldots, z[M])$: $L(\theta; D) = \sum_{D_Z} P(D, D_Z \mid \theta)$. Find $\arg\max L(\theta; D)$ over $\theta$. (A partially Bayesian approach.)
COMPUTATIONAL CHALLENGES: $P(D \mid \theta, D_Z)$ and $P(D, D_Z \mid \theta)$ are easy to evaluate, but… Maximum likelihood, $\arg\max L(\theta, D_Z; D)$: optimizing over $M$ assignments to $Z$ ($|\mathrm{Val}(Z)|^M$ possible joint assignments) as well as continuous parameters. Maximum marginal likelihood, $\arg\max L(\theta; D)$: optimizing locally over continuous parameters, but the objective requires summing over $M$ assignments to $Z$.
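For i.i.d. examples, the sum over joint assignments factorizes into one small sum per example, so the marginal log-likelihood itself is cheap to evaluate for a given θ. The sketch below assumes binary observables that are independent given Z, with `theta_z[k] = P(Z=k)` and `theta_x[k][i] = P(X_i=1 | Z=k)`. These names and the `counts` dictionary layout are illustrative choices, not notation from the slides.

```python
import math


def joint_prob(x, k, theta_z, theta_x):
    """P(x, Z=k | theta) for a tuple x of 0/1 values."""
    p = theta_z[k]
    for i, xi in enumerate(x):
        p *= theta_x[k][i] if xi == 1 else 1.0 - theta_x[k][i]
    return p


def marginal_log_likelihood(counts, theta_z, theta_x):
    """log L(theta; D) = sum over examples of log sum_k P(x, Z=k | theta)."""
    total = 0.0
    for x, n in counts.items():  # n = how many times pattern x occurs in D
        px = sum(joint_prob(x, k, theta_z, theta_x) for k in range(len(theta_z)))
        total += n * math.log(px)
    return total


counts = {(1, 1): 222, (1, 0): 382, (0, 1): 364, (0, 0): 32}
print(marginal_log_likelihood(counts, [0.5, 0.5], [[0.4, 0.7], [0.3, 0.6]]))
```

Evaluating the objective is therefore not the hard part; the log of a sum is what prevents a closed-form maximizer.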
EXPECTATION MAXIMIZATION FOR ML: Idea: use a coordinate ascent approach, $\arg\max_{\theta, D_Z} L(\theta, D_Z; D) = \arg\max_\theta \max_{D_Z} L(\theta, D_Z; D)$. Step 1: Finding $D_Z^* = \arg\max_{D_Z} L(\theta, D_Z; D)$ is easy given a fixed $\theta$. Step 2: Set $Q(\theta) = L(\theta, D_Z^*; D)$. Finding $\arg\max_\theta Q(\theta)$ is easy given that $D_Z$ is fixed (fully observed ML parameter estimation). Repeat steps 1 and 2 until convergence.
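The two alternating steps can be sketched as a loop. This is one possible reading, not the course's reference implementation: it assumes binary observables that are independent given Z, scores each candidate z[m] by the joint probability of (x[m], z[m]), breaks ties toward the lowest index, and keeps a type's old parameters if no example is assigned to it. All names are hypothetical.

```python
def hard_em(data, theta_z, theta_x, max_iters=100):
    """Hard EM: theta_z[k] = P(Z=k), theta_x[k][i] = P(X_i=1 | Z=k)."""
    theta_z = list(theta_z)
    theta_x = [list(row) for row in theta_x]
    K, n = len(theta_z), len(data[0])
    assign = None
    for _ in range(max_iters):
        # Step 1: pick the best z[m] for every example under the current theta.
        new_assign = []
        for x in data:
            scores = []
            for k in range(K):
                p = theta_z[k]
                for i in range(n):
                    p *= theta_x[k][i] if x[i] == 1 else 1.0 - theta_x[k][i]
                scores.append(p)
            new_assign.append(scores.index(max(scores)))  # ties -> lowest index
        if new_assign == assign:
            break  # assignments stopped changing: converged
        assign = new_assign
        # Step 2: fully observed ML estimation on the completed data.
        for k in range(K):
            members = [x for x, z in zip(data, assign) if z == k]
            theta_z[k] = len(members) / len(data)
            if members:  # assumption: leave theta_x[k] unchanged if type k is empty
                theta_x[k] = [sum(x[i] for x in members) / len(members)
                              for i in range(n)]
    return theta_z, theta_x, assign


data = [(1, 1, 1)] * 40 + [(0, 0, 0)] * 40
print(hard_em(data, [0.5, 0.5], [[0.8, 0.8, 0.8], [0.2, 0.2, 0.2]])[:2])
```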
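EXAMPLE: CORRELATED VARIABLES: [Figure: the model in plate notation ($Z \to X_1, X_2$ with parameters $\theta_z$, $\theta_{x_1|z}$, $\theta_{x_2|z}$, plate over $M$ examples) next to the unrolled network ($z[m] \to x_1[m], x_2[m]$ for $m = 1, \ldots, M$).]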
EXAMPLE: CORRELATED VARIABLES: [Figure: plate notation, $Z \to X_1, X_2$, plate over $M$ examples.] Suppose 2 types: (1) $X_1 \ne X_2$, random; (2) $(X_1, X_2) = (1,1)$ with 90% chance, $(0,0)$ otherwise. Type 1 is drawn 75% of the time. X dataset, counts by $(x_1, x_2)$: (1,1): 222; (1,0): 382; (0,1): 364; (0,0): 32.
Initial parameter estimates: $\theta_z = 0.5$; $\theta_{x_1|z=1} = 0.4$, $\theta_{x_1|z=2} = 0.3$; $\theta_{x_2|z=1} = 0.7$, $\theta_{x_2|z=2} = 0.6$.
Estimated Z's: (1,1): type 1; (1,0): type 1; (0,1): type 2; (0,0): type 2.
Parameter estimates re-fit to these Z's: $\theta_z = 604/1000 = 0.604$; $\theta_{x_1|z=1} = 1$, $\theta_{x_1|z=2} = 0$; $\theta_{x_2|z=1} = 222/604 \approx 0.368$, $\theta_{x_2|z=2} = 364/396 \approx 0.919$.
Re-estimating the Z's under these parameters leaves the assignments unchanged: converged (true ML estimate).
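The re-fit parameter estimates in this example follow from simple counting over the dataset once the Z's are fixed. The sketch below reproduces that step from the slide's counts and estimated Z's, then checks that re-estimating the Z's changes nothing. It assumes that $\theta_z$ denotes P(Z = type 1); the dictionary layout and function names are illustrative.

```python
counts = {(1, 1): 222, (1, 0): 382, (0, 1): 364, (0, 0): 32}
assign = {(1, 1): 1, (1, 0): 1, (0, 1): 2, (0, 0): 2}  # estimated Z's from the slide


def hard_m_step(counts, assign, types=(1, 2)):
    """ML estimates with every example's type treated as observed."""
    total = sum(counts.values())
    n_type = {k: sum(n for x, n in counts.items() if assign[x] == k) for k in types}
    theta_z = n_type[1] / total  # assumption: theta_z means P(Z = type 1)
    theta_x = {}  # key (i, k) holds P(X_i = 1 | Z = k)
    for k in types:
        for i in (0, 1):
            ones = sum(n for x, n in counts.items() if assign[x] == k and x[i] == 1)
            theta_x[(i + 1, k)] = ones / n_type[k]
    return theta_z, theta_x


def hard_e_step(counts, theta_z, theta_x, types=(1, 2)):
    """Most likely type for each observed pattern under the given parameters."""
    prior = {1: theta_z, 2: 1.0 - theta_z}
    result = {}
    for x in counts:
        def joint(k):
            p = prior[k]
            for i, xi in enumerate(x):
                t = theta_x[(i + 1, k)]
                p *= t if xi == 1 else 1.0 - t
            return p
        result[x] = max(types, key=joint)
    return result


theta_z, theta_x = hard_m_step(counts, assign)
print(round(theta_z, 3), round(theta_x[(2, 1)], 3), round(theta_x[(2, 2)], 3))
# 0.604 0.368 0.919
print(hard_e_step(counts, theta_z, theta_x) == assign)  # True: a fixed point
```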
EXAMPLE: CORRELATED VARIABLES: [Figure: plate notation, $Z \to X_1, X_2, X_3, X_4$ with parameters $\theta_Z$, $\theta_{X_1|Z}$, …, $\theta_{X_4|Z}$, plate over $M$ examples. Table: X dataset counts indexed by $(x_1, x_2)$ and $(x_3, x_4)$.] Random initial guess: $\theta_Z = 0.44$; $\theta_{X_1|Z=1} = 0.97$, $\theta_{X_2|Z=1} = 0.21$, $\theta_{X_3|Z=1} = 0.87$, $\theta_{X_4|Z=1} = 0.57$; $\theta_{X_1|Z=2} = 0.07$, $\theta_{X_2|Z=2} = 0.97$, $\theta_{X_3|Z=2} = 0.71$, $\theta_{X_4|Z=2} = 0.03$.
EXAMPLE: E STEP: [Tables: X dataset and Z assignments, each indexed by $(x_1, x_2)$ and $(x_3, x_4)$.] Random initial guess: $\theta_Z = 0.44$; $\theta_{X_1|Z=1} = 0.97$, $\theta_{X_2|Z=1} = 0.21$, $\theta_{X_3|Z=1} = 0.87$, $\theta_{X_4|Z=1} = 0.57$; $\theta_{X_1|Z=2} = 0.07$, $\theta_{X_2|Z=2} = 0.97$, $\theta_{X_3|Z=2} = 0.71$, $\theta_{X_4|Z=2} = 0.03$. Log likelihood: -4401.
EXAMPLE: M STEP: [Tables: X dataset and Z assignments.] Current estimates: $\theta_Z = 0.43$; $\theta_{X_1|Z=1} = 0.67$, $\theta_{X_2|Z=1} = 0.27$, $\theta_{X_3|Z=1} = 0.37$, $\theta_{X_4|Z=1} = 0.83$; $\theta_{X_1|Z=2} = 0.31$, $\theta_{X_2|Z=2} = 0.68$, $\theta_{X_3|Z=2} = 0.31$, $\theta_{X_4|Z=2} = 0.21$.
EXAMPLE: E STEP: [Tables: X dataset and Z assignments.] Current estimates: $\theta_Z = 0.43$; $\theta_{X_1|Z=1} = 0.67$, $\theta_{X_2|Z=1} = 0.27$, $\theta_{X_3|Z=1} = 0.37$, $\theta_{X_4|Z=1} = 0.83$; $\theta_{X_1|Z=2} = 0.31$, $\theta_{X_2|Z=2} = 0.68$, $\theta_{X_3|Z=2} = 0.31$, $\theta_{X_4|Z=2} = 0.21$.
EXAMPLE: M STEP: [Tables: X dataset and Z assignments.] Current estimates: $\theta_Z = 0.40$; $\theta_{X_1|Z=1} = 0.56$, $\theta_{X_2|Z=1} = 0.31$, $\theta_{X_3|Z=1} = 0.40$, $\theta_{X_4|Z=1} = 0.92$; $\theta_{X_1|Z=2} = 0.45$, $\theta_{X_2|Z=2} = 0.66$, $\theta_{X_3|Z=2} = 0.26$, $\theta_{X_4|Z=2} = 0.04$.
EXAMPLE: LAST E-M STEP: [Tables: X dataset and Z assignments.] Current estimates: $\theta_Z = 0.43$; $\theta_{X_1|Z=1} = 0.51$, $\theta_{X_2|Z=1} = 0.36$, $\theta_{X_3|Z=1} = 0.35$, $\theta_{X_4|Z=1} = 1$; $\theta_{X_1|Z=2} = 0.53$, $\theta_{X_2|Z=2} = 0.57$, $\theta_{X_3|Z=2} = 0.33$, $\theta_{X_4|Z=2} = 0$.
PROBLEM: MANY LOCAL MINIMA: Flipping Z assignments causes large shifts in likelihood, leading to a poorly behaved energy landscape! Solution: EM using the marginal likelihood formulation, “soft” EM. (This is the typical form of the EM algorithm.)
EXPECTATION MAXIMIZATION FOR MML: $\arg\max_\theta L(\theta; D) = \arg\max_\theta E_{D_Z \mid D, \theta}[L(\theta; D_Z, D)]$. Do $\arg\max_\theta E_{D_Z \mid D, \theta}[\log L(\theta; D_Z, D)]$ instead (justified later). Step 1: Given the current fixed $\theta^t$, find $P(D_Z \mid \theta^t, D)$, i.e., compute a distribution over each $Z[i]$. Step 2: Use these probabilities in the expectation $E_{D_Z \mid D, \theta^t}[\log L(\theta; D_Z, D)] = Q(\theta \mid \theta^t)$. Now find $\max_\theta Q(\theta \mid \theta^t)$ (fully observed, weighted ML parameter estimation). Repeat steps 1 (expectation) and 2 (maximization) until convergence.
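E STEP IN DETAIL: Ultimately, we want to maximize $Q(\theta \mid \theta^t) = E_{D_Z \mid D, \theta^t}[\log L(\theta; D_Z, D)]$ over $\theta$, where $Q(\theta \mid \theta^t) = \sum_m \sum_{z[m]} P(z[m] \mid x[m], \theta^t) \log P(x[m], z[m] \mid \theta)$. The E step computes the terms $w_{m,z}(\theta^t) = P(Z[m] = z \mid D, \theta^t)$ over all examples $m$ and all $z \in \mathrm{Val}[Z]$.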
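The E-step weights are posterior probabilities obtained by normalizing the joint P(x[m], z | θ^t) over z. Below is a minimal sketch for binary observables that are independent given Z. The layout `theta_z[k] = P(Z=k)`, `theta_x[k][i] = P(X_i=1 | Z=k)` and the function name are assumptions made for illustration, and the code assumes every example has nonzero probability under the current parameters.

```python
def e_step(data, theta_z, theta_x):
    """Return w[m][k] = P(Z[m] = k | x[m], theta^t) for each example m."""
    weights = []
    for x in data:
        joint = []
        for k, pz in enumerate(theta_z):
            p = pz  # P(Z = k)
            for i, xi in enumerate(x):
                p *= theta_x[k][i] if xi == 1 else 1.0 - theta_x[k][i]
            joint.append(p)  # P(x[m], Z = k | theta^t)
        norm = sum(joint)  # P(x[m] | theta^t), assumed > 0
        weights.append([p / norm for p in joint])
    return weights


w = e_step([(1, 1), (0, 0)], [0.5, 0.5], [[0.4, 0.7], [0.3, 0.6]])
print(w)
```

Unlike the hard assignment, an example that is almost equally likely under both types contributes almost equally to both sets of expected counts.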
M STEP IN DETAIL: $\arg\max_\theta Q(\theta \mid \theta^t) = \arg\max_\theta \sum_m \sum_z w_{m,z}(\theta^t) \log P(x[m], z[m] = z \mid \theta) = \arg\max_\theta \prod_m \prod_z P(x[m], z[m] = z \mid \theta)^{w_{m,z}(\theta^t)}$. This is weighted ML: each completed example $(x[m], z[m] = z)$ is interpreted as being observed $w_{m,z}(\theta^t)$ times. Most closed-form ML expressions (Bernoulli, categorical, Gaussian) can be adapted easily to the weighted case.
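EXAMPLE: BERNOULLI PARAMETER FOR Z: $\theta_Z^* = \arg\max_{\theta_Z} \sum_m \sum_z w_{m,z} \log P(x[m], z[m] = z \mid \theta)$. Dropping the factors that do not depend on $\theta_Z$, this equals $\arg\max_{\theta_Z} \sum_m \sum_z w_{m,z} \log(I[z=1]\,\theta_Z + I[z=0]\,(1 - \theta_Z)) = \arg\max_{\theta_Z} [\log(\theta_Z) \sum_m w_{m,z=1} + \log(1 - \theta_Z) \sum_m w_{m,z=0}]$, so $\theta_Z^* = (\sum_m w_{m,z=1}) / \sum_m (w_{m,z=1} + w_{m,z=0})$. “Expected counts”: $M_{\theta^t}[z] = \sum_m w_{m,z}(\theta^t)$. Express $\theta_Z^* = M_{\theta^t}[z=1] / M_{\theta^t}[\,]$.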
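EXAMPLE: BERNOULLI PARAMETERS FOR $X_i \mid Z$: $\theta_{X_i|z=k}^* = \arg\max_{\theta_{X_i|z=k}} \sum_m w_{m,z=k} \log P(x[m], z[m] = k \mid \theta_{X_i|z=k}) = \arg\max_{\theta_{X_i|z=k}} \sum_m w_{m,z=k} \log(I[x_i[m]=1]\,\theta_{X_i|z=k} + I[x_i[m]=0]\,(1 - \theta_{X_i|z=k}))$ = … (similar derivation), so $\theta_{X_i|z=k}^* = M_{\theta^t}[x_i=1, z=k] / M_{\theta^t}[z=k]$, where $M_{\theta^t}[x_i=1, z=k] = \sum_m w_{m,z=k}(\theta^t)\, I[x_i[m]=1]$.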
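Both closed-form updates are ratios of expected counts, which makes the M step short to write down. This sketch generalizes the Bernoulli prior to a list `theta_z[k] = P(Z=k)`. The `weights[m][k]` layout and the function name are illustrative assumptions, and the code assumes every type has nonzero total weight.

```python
def m_step(data, weights, n_types):
    """Weighted ML: theta_z[k] = M[z=k] / M, theta_x[k][i] = M[x_i=1, z=k] / M[z=k]."""
    n_vars = len(data[0])
    # Expected counts M[z=k] = sum over m of w[m][k]
    m_z = [sum(w[k] for w in weights) for k in range(n_types)]
    theta_z = [mk / len(data) for mk in m_z]
    theta_x = []
    for k in range(n_types):
        row = []
        for i in range(n_vars):
            # Expected count M[x_i=1, z=k]: weights of the examples with x_i = 1
            m_xz = sum(w[k] for x, w in zip(data, weights) if x[i] == 1)
            row.append(m_xz / m_z[k])  # assumes m_z[k] > 0
        theta_x.append(row)
    return theta_z, theta_x


print(m_step([(1, 0), (0, 1)], [[1.0, 0.0], [0.5, 0.5]], n_types=2))
```

With 0/1 weights this reduces to the ordinary counting estimates used in the hard version.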
EM ON PRIOR EXAMPLE (100 ITERATIONS): [Figure: plate notation, $Z \to X_1, X_2, X_3, X_4$, plate over $M$ examples. Tables: X dataset and the posterior $P(Z=2)$ for each cell, indexed by $(x_1, x_2)$ and $(x_3, x_4)$.] Final estimates: $\theta_Z = 0.49$; $\theta_{X_1|Z=1} = 0.64$, $\theta_{X_2|Z=1} = 0.88$, $\theta_{X_3|Z=1} = 0.41$, $\theta_{X_4|Z=1} = 0.46$; $\theta_{X_1|Z=2} = 0.38$, $\theta_{X_2|Z=2} = 0.00$, $\theta_{X_3|Z=2} = 0.27$, $\theta_{X_4|Z=2} = 0.68$.
CONVERGENCE: In general, there is no way to tell a priori how fast EM will converge. Soft EM is usually slower than hard EM. It still runs into local minima, but has more opportunities to coordinate parameter adjustments.
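Since the rate cannot be predicted in advance, a common practical approach is to record the marginal log-likelihood at each iteration and stop when it levels off. Here is a sketch of a full soft EM loop with such a trace, for binary observables that are independent given Z. The parameter layout, the fixed iteration count, and the function name are assumptions, and starting values are assumed to lie strictly between 0 and 1.

```python
import math


def soft_em(counts, theta_z, theta_x, iters=30):
    """Soft EM on pattern counts; returns the parameters and a log-likelihood trace."""
    K, n = len(theta_z), len(theta_x[0])
    total = sum(counts.values())
    trace = []
    for _ in range(iters):
        m_z = [0.0] * K                        # expected counts M[z=k]
        m_xz = [[0.0] * n for _ in range(K)]   # expected counts M[x_i=1, z=k]
        loglik = 0.0
        # E step, accumulating expected counts as we go
        for x, cnt in counts.items():
            joint = []
            for k in range(K):
                p = theta_z[k]
                for i in range(n):
                    p *= theta_x[k][i] if x[i] == 1 else 1.0 - theta_x[k][i]
                joint.append(p)
            px = sum(joint)
            loglik += cnt * math.log(px)       # log-likelihood under current theta
            for k in range(K):
                w = cnt * joint[k] / px
                m_z[k] += w
                for i in range(n):
                    if x[i] == 1:
                        m_xz[k][i] += w
        trace.append(loglik)
        # M step: ratios of expected counts
        theta_z = [m / total for m in m_z]
        theta_x = [[m_xz[k][i] / m_z[k] for i in range(n)] for k in range(K)]
    return theta_z, theta_x, trace


counts = {(1, 1): 222, (1, 0): 382, (0, 1): 364, (0, 0): 32}
tz, tx, trace = soft_em(counts, [0.5, 0.5], [[0.4, 0.7], [0.3, 0.6]])
print(trace[0], trace[-1])
```

A flat trace only signals that the iterates have stopped moving. Rerunning from several starting points is the usual way to probe for other optima.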
WHY DOES IT WORK? Why are we optimizing over $Q(\theta \mid \theta^t) = \sum_m \sum_{z[m]} P(z[m] \mid x[m], \theta^t) \log P(x[m], z[m] \mid \theta)$ rather than the true marginalized likelihood, $L(\theta; D) = \prod_m \sum_{z[m]} P(x[m], z[m] \mid \theta)$? Can prove that: (1) the log likelihood does not decrease at any step; (2) a stationary point of $\arg\max_\theta E_{D_Z \mid D, \theta}[\log L(\theta; D_Z, D)]$ is a stationary point of $\log L(\theta; D)$. See K&F p
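One standard route to the first claim uses Jensen's inequality. The derivation below is a sketch of that textbook argument, written with the weights $w_{m,z} = P(z \mid x[m], \theta^t)$; the entropy-like term $H(\theta^t)$ is notation introduced here, and the slides do not say whether this matches the presentation in K&F.

```latex
\begin{align*}
\log L(\theta; D)
  &= \sum_m \log \sum_z P(x[m], z \mid \theta)
   = \sum_m \log \sum_z w_{m,z} \, \frac{P(x[m], z \mid \theta)}{w_{m,z}} \\
  &\ge \sum_m \sum_z w_{m,z} \log P(x[m], z \mid \theta)
     \; - \; \sum_m \sum_z w_{m,z} \log w_{m,z}
     && \text{(Jensen, since $\log$ is concave)} \\
  &= Q(\theta \mid \theta^t) + H(\theta^t).
\end{align*}
% At theta = theta^t the ratio P(x[m], z | theta^t) / w_{m,z} = P(x[m] | theta^t)
% does not depend on z, so the bound holds with equality. Hence
\begin{align*}
\log L(\theta^{t+1}; D)
  \;\ge\; Q(\theta^{t+1} \mid \theta^t) + H(\theta^t)
  \;\ge\; Q(\theta^t \mid \theta^t) + H(\theta^t)
  \;=\; \log L(\theta^t; D).
\end{align*}
```

The middle inequality is exactly what the M step provides, because $\theta^{t+1}$ maximizes $Q(\cdot \mid \theta^t)$ and $H(\theta^t)$ does not depend on $\theta$.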
GAUSSIAN CLUSTERING USING EM: One of the first uses of EM, and a widely used approach. Finding good starting points: k-means algorithm (hard assignment). Handling degeneracies: regularization.
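As a rough illustration of Gaussian clustering with EM, here is a one-dimensional mixture sketch. It is not the course's implementation: the starting means are passed in by the caller (for example from a k-means run), and a variance floor stands in for regularization against the degenerate case where a component collapses onto a single point. The function name, `var_floor`, and the fixed iteration count are all assumptions.

```python
import math


def gmm_em_1d(xs, init_means, iters=50, var_floor=1e-3):
    """Soft EM for a 1-D Gaussian mixture; returns (weights, means, variances)."""
    K = len(init_means)
    mus = list(init_means)
    pis = [1.0 / K] * K
    vars_ = [1.0] * K
    for _ in range(iters):
        # E step: responsibilities r[m][k] proportional to pi_k * N(x_m; mu_k, var_k)
        resp = []
        for x in xs:
            dens = [pis[k]
                    * math.exp(-0.5 * (x - mus[k]) ** 2 / vars_[k])
                    / math.sqrt(2.0 * math.pi * vars_[k])
                    for k in range(K)]
            s = sum(dens)  # assumed > 0: some component explains x
            resp.append([d / s for d in dens])
        # M step: weighted versions of the usual Gaussian ML estimates
        for k in range(K):
            nk = sum(r[k] for r in resp)  # expected count for component k
            pis[k] = nk / len(xs)
            mus[k] = sum(r[k] * x for r, x in zip(resp, xs)) / nk
            var = sum(r[k] * (x - mus[k]) ** 2 for r, x in zip(resp, xs)) / nk
            vars_[k] = max(var, var_floor)  # crude guard against collapse
    return pis, mus, vars_


xs = [-2.1, -1.9, -2.0, 1.9, 2.0, 2.1]
print(gmm_em_1d(xs, init_means=[-1.0, 1.0]))
```

Replacing the soft responsibilities with a one-hot assignment to the nearest mean, and fixing the variances, turns the same loop into k-means.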
RECAP: Learning with hidden variables (typically categorical).