CS B553: ALGORITHMS FOR OPTIMIZATION AND LEARNING
Parameter Learning with Hidden Variables & Expectation Maximization

AGENDA
- Learning probability distributions from data in the setting of known structure, missing data
- Expectation-maximization (EM) algorithm

BASIC PROBLEM
- Given a dataset $D = \{x[1], \ldots, x[M]\}$ and a Bayesian model over observed variables $X$ and hidden (latent) variables $Z$
- Fit the distribution $P(X, Z)$ to the data
- Interpretation: each example $x[m]$ is an incomplete view of the "underlying" sample $(x[m], z[m])$
- (Figure: network with a single edge $Z \to X$)

APPLICATIONS
- Clustering in data mining
- Dimensionality reduction
- Latent psychological traits (e.g., intelligence, personality)
- Document classification
- Human activity recognition

HIDDEN VARIABLES CAN YIELD MORE PARSIMONIOUS MODELS
- Hidden variables => conditional independences
- (Figure: left, $Z$ with children $X_1, X_2, X_3, X_4$; right, $X_1, \ldots, X_4$ with every pair connected)
- With $Z$: 1 + 4*2 = 9 parameters
- Without $Z$, the observables become fully dependent: 1 + 2 + 4 + 8 = 15 parameters
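
The two parameter counts above can be reproduced with a few lines of arithmetic. This is a minimal sketch assuming all variables are binary, as the counts on the slide imply; the function names are hypothetical.

```python
def params_with_hidden_z(n_obs: int) -> int:
    """Binary Z with n_obs binary children: 1 for P(Z), 2 per P(X_i | Z)."""
    return 1 + 2 * n_obs


def params_fully_dependent(n_obs: int) -> int:
    """Chain-rule factorization over n_obs binary variables.

    X_i conditioned on its i-1 predecessors needs 2**(i-1) parameters,
    so the total is 1 + 2 + ... + 2**(n_obs-1) = 2**n_obs - 1.
    """
    return sum(2 ** i for i in range(n_obs))


if __name__ == "__main__":
    for n in (4, 8, 16):
        print(n, params_with_hidden_z(n), params_fully_dependent(n))
```

The gap grows quickly: the count with Z is linear in the number of observables, while the fully dependent count is exponential.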

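GENERATING MODEL
- (Figure: unrolled network $z[1] \to x[1], \ldots, z[M] \to x[M]$, with parameter nodes $\theta_z$ and $\theta_{x|z}$ shared by all examples)
- These CPTs are identical across examples and given by $\theta_z$ and $\theta_{x|z}$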

EXAMPLE: DISCRETE VARIABLES
- (Figure: the same unrolled network, with shared parameters $\theta_z$ and $\theta_{x|z}$)
- Categorical distribution given by parameters $\theta_z$: $P(Z[i] \mid \theta_z) = \mathrm{Categorical}(\theta_z)$
- Categorical distribution $P(X[i] \mid z[i], \theta_{x|z[i]}) = \mathrm{Categorical}(\theta_{x|z[i]})$
- (In other words, $z[i]$ multiplexes between Categorical distributions)
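
The "multiplexing" reading of this model can be made concrete by sampling from it. The sketch below is illustrative only: `sample_dataset`, its argument layout (a list for $\theta_z$, one list per value of $z$ for $\theta_{x|z}$), and the seed are assumptions, not part of the slides.

```python
import random


def sample_dataset(theta_z, theta_x_given_z, M, seed=0):
    """Draw M pairs (x, z) from the generating model.

    theta_z[k]            = P(Z = k)
    theta_x_given_z[k][j] = P(X = j | Z = k)
    """
    rng = random.Random(seed)
    z_values = list(range(len(theta_z)))
    samples = []
    for _ in range(M):
        # First pick the hidden value z ~ Categorical(theta_z) ...
        z = rng.choices(z_values, weights=theta_z)[0]
        # ... then z selects which categorical distribution generates x.
        row = theta_x_given_z[z]
        x = rng.choices(list(range(len(row))), weights=row)[0]
        samples.append((x, z))
    return samples


if __name__ == "__main__":
    full = sample_dataset([0.75, 0.25], [[0.5, 0.5, 0.0], [0.1, 0.0, 0.9]], M=10)
    observed = [x for x, _ in full]  # the learner only sees the x values
    print(observed)
```

In the learning problem only the `x` component of each pair is available; the `z` component is what EM has to reason about.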

MAXIMUM LIKELIHOOD ESTIMATION
- Approach: find values of $\theta = (\theta_z, \theta_{x|z})$ and $D_Z = (z[1], \ldots, z[M])$ that maximize the likelihood of the data
  $L(\theta, D_Z; D) = P(D \mid \theta, D_Z)$
- Find $\arg\max L(\theta, D_Z; D)$ over $\theta, D_Z$

MARGINAL LIKELIHOOD ESTIMATION
- Approach: find values of $\theta = (\theta_z, \theta_{x|z})$ that maximize the likelihood of the data without assuming values of $D_Z = (z[1], \ldots, z[M])$
  $L(\theta; D) = \sum_{D_Z} P(D, D_Z \mid \theta)$
- Find $\arg\max L(\theta; D)$ over $\theta$
- (A partially Bayesian approach)

COMPUTATIONAL CHALLENGES
- $P(D \mid \theta, D_Z)$ and $P(D, D_Z \mid \theta)$ are easy to evaluate, but…
- Maximum likelihood, $\arg\max L(\theta, D_Z; D)$: optimizing over $M$ assignments to $Z$ ($|\mathrm{Val}(Z)|^M$ possible joint assignments) as well as continuous parameters
- Maximum marginal likelihood, $\arg\max L(\theta; D)$: optimizing locally over continuous parameters, but the objective requires summing over $M$ assignments to $Z$
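
For the i.i.d. generating model shown earlier, the sum over joint assignments factorizes into a product of per-example sums, so evaluating the marginal likelihood costs $O(M\,|\mathrm{Val}(Z)|)$ rather than $O(|\mathrm{Val}(Z)|^M)$. The sketch below checks this numerically on a toy dataset; the function names and parameter values are made up for illustration.

```python
import itertools
import math


def joint(x, z, theta_z, theta_x_given_z):
    """P(x, z | theta) for the categorical model."""
    return theta_z[z] * theta_x_given_z[z][x]


def marginal_brute_force(data, theta_z, theta_x_given_z):
    """Sum P(D, D_Z | theta) over all |Val(Z)|**M joint assignments D_Z."""
    K = len(theta_z)
    total = 0.0
    for dz in itertools.product(range(K), repeat=len(data)):
        total += math.prod(
            joint(x, z, theta_z, theta_x_given_z) for x, z in zip(data, dz)
        )
    return total


def marginal_factorized(data, theta_z, theta_x_given_z):
    """Same quantity as a product over examples of a sum over z."""
    K = len(theta_z)
    return math.prod(
        sum(joint(x, z, theta_z, theta_x_given_z) for z in range(K)) for x in data
    )


if __name__ == "__main__":
    data = [0, 1, 1, 0, 1]
    theta_z = [0.3, 0.7]
    theta_x_given_z = [[0.9, 0.1], [0.2, 0.8]]
    print(marginal_brute_force(data, theta_z, theta_x_given_z))
    print(marginal_factorized(data, theta_z, theta_x_given_z))
```

Evaluation is therefore cheap; the remaining difficulty is that the logarithm of a sum does not decompose into separate terms per parameter, which is what EM works around.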

EXPECTATION MAXIMIZATION FOR ML
- Idea: use a coordinate ascent approach
  $\arg\max_{\theta, D_Z} L(\theta, D_Z; D) = \arg\max_\theta \max_{D_Z} L(\theta, D_Z; D)$
- Step 1: Finding $D_Z^* = \arg\max_{D_Z} L(\theta, D_Z; D)$ is easy given a fixed $\theta$ (each $z[m]$ can be chosen independently of the others)
- Step 2: Set $Q(\theta) = L(\theta, D_Z^*; D)$. Finding $\theta^* = \arg\max_\theta Q(\theta)$ is easy given that $D_Z$ is fixed (fully observed ML parameter estimation)
- Repeat steps 1 and 2 until convergence
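
The two alternating steps can be sketched for a model with one hidden type and several binary observables. This is an illustrative implementation, not the course's code: the data layout (a dict from observation tuples to counts), the zero-based type indices, the use of the joint $P(x, z \mid \theta)$ as the per-example objective, and the rule of leaving an empty type's parameters unchanged are all assumptions.

```python
import math


def bern(p, x):
    """P(X = x) for a Bernoulli variable with P(X = 1) = p."""
    return p if x == 1 else 1.0 - p


def joint(x, z, theta):
    """P(x, z | theta); theta = (pz, px) with px[z][i] = P(X_i = 1 | Z = z)."""
    pz, px = theta
    prob = pz[z]
    for i, xi in enumerate(x):
        prob *= bern(px[z][i], xi)
    return prob


def assign_step(counts, theta):
    """Step 1: pick the most probable z for each distinct observation."""
    K = len(theta[0])
    return {x: max(range(K), key=lambda z: joint(x, z, theta)) for x in counts}


def fit_step(counts, assign, theta):
    """Step 2: fully observed ML estimates given the assignments."""
    pz_old, px_old = theta
    K, n = len(pz_old), len(px_old[0])
    total = sum(counts.values())
    pz, px = [], []
    for z in range(K):
        nz = sum(c for x, c in counts.items() if assign[x] == z)
        pz.append(nz / total)
        if nz == 0:
            px.append(list(px_old[z]))  # assumption: keep old values if empty
        else:
            px.append([
                sum(c for x, c in counts.items() if assign[x] == z and x[i] == 1) / nz
                for i in range(n)
            ])
    return pz, px


def joint_log_lik(counts, assign, theta):
    return sum(c * math.log(joint(x, assign[x], theta)) for x, c in counts.items())


def hard_em(counts, theta, max_iters=50):
    assign, history = None, []
    for _ in range(max_iters):
        new_assign = assign_step(counts, theta)
        if new_assign == assign:  # assignments stopped changing
            break
        assign = new_assign
        theta = fit_step(counts, assign, theta)
        history.append(joint_log_lik(counts, assign, theta))
    return theta, assign, history
```

Each step can only raise (or keep) the joint likelihood, which is why the loop terminates, but the result depends on the starting point and on how ties in step 1 are broken.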

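EXAMPLE: CORRELATED VARIABLES
- (Figure, left: unrolled network $z[m] \to x_1[m], x_2[m]$ for $m = 1, \ldots, M$, with shared parameters $\theta_z$, $\theta_{x_1|z}$, $\theta_{x_2|z}$)
- (Figure, right: the same model in plate notation, $z \to x_1, x_2$ inside a plate of size $M$)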

EXAMPLE: CORRELATED VARIABLES
- Suppose 2 types:
  1. $X_1 \ne X_2$, random
  2. $X_1, X_2 = 1,1$ with 90% chance, $0,0$ otherwise
- Type 1 drawn 75% of the time
- X dataset: (1,1): 222, (1,0): 382, (0,1): 364, (0,0): 32
- Parameter estimates: $\theta_z = 0.5$; $\theta_{x_1|z=1} = 0.4$, $\theta_{x_1|z=2} = 0.3$; $\theta_{x_2|z=1} = 0.7$, $\theta_{x_2|z=2} = 0.6$
- Estimated Z's: (1,1): type 1, (1,0): type 1, (0,1): type 2, (0,0): type 2

EXAMPLE: CORRELATED VARIABLES
- Same two types and X dataset: (1,1): 222, (1,0): 382, (0,1): 364, (0,0): 32
- Parameter estimates: $\theta_z = 0.604$; $\theta_{x_1|z=1} = 1$, $\theta_{x_1|z=2} = 0$; $\theta_{x_2|z=1} = 0.368$, $\theta_{x_2|z=2} = 0.919$
- Estimated Z's: (1,1): type 1, (1,0): type 1, (0,1): type 2, (0,0): type 2
- Converged (true ML estimate)

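EXAMPLE: CORRELATED VARIABLES
- (Figure: plate notation, $z \to x_1, x_2, x_3, x_4$ inside a plate of size $M$, with parameters $\theta_z$ and $\theta_{x_i|z}$)
- Random initial guess: $\theta_Z = 0.44$
  $\theta_{X_1|Z=1} = 0.97$, $\theta_{X_2|Z=1} = 0.21$, $\theta_{X_3|Z=1} = 0.87$, $\theta_{X_4|Z=1} = 0.57$
  $\theta_{X_1|Z=2} = 0.07$, $\theta_{X_2|Z=2} = 0.97$, $\theta_{X_3|Z=2} = 0.71$, $\theta_{X_4|Z=2} = 0.03$
- Log likelihood: -5176
- X dataset (counts; rows are $x_1,x_2$, columns are $x_3,x_4$):

  x1,x2 \ x3,x4    0,0    0,1    1,0    1,1
  0,0              115    142     20     47
  0,1               32     16     37     75
  1,0               12    117     39     58
  1,1              133     92     45     20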

EXAMPLE: E STEP
- X dataset: unchanged from the initial-guess slide
- Random initial guess: $\theta_Z = 0.44$
  $\theta_{X_1|Z=1} = 0.97$, $\theta_{X_2|Z=1} = 0.21$, $\theta_{X_3|Z=1} = 0.87$, $\theta_{X_4|Z=1} = 0.57$
  $\theta_{X_1|Z=2} = 0.07$, $\theta_{X_2|Z=2} = 0.97$, $\theta_{X_3|Z=2} = 0.71$, $\theta_{X_4|Z=2} = 0.03$
- Log likelihood: -4401
- Z assignments (rows are $x_1,x_2$, columns are $x_3,x_4$):

  x1,x2 \ x3,x4    0,0    0,1    1,0    1,1
  0,0                2      1      2      1
  0,1                2      2      2      2
  1,0                1      1      1      1
  1,1                2      1      1      1

EXAMPLE: M STEP
- X dataset: unchanged from the initial-guess slide
- Current estimates: $\theta_Z = 0.43$
  $\theta_{X_1|Z=1} = 0.67$, $\theta_{X_2|Z=1} = 0.27$, $\theta_{X_3|Z=1} = 0.37$, $\theta_{X_4|Z=1} = 0.83$
  $\theta_{X_1|Z=2} = 0.31$, $\theta_{X_2|Z=2} = 0.68$, $\theta_{X_3|Z=2} = 0.31$, $\theta_{X_4|Z=2} = 0.21$
- Log likelihood: -3033
- Z assignments (rows are $x_1,x_2$, columns are $x_3,x_4$):

  x1,x2 \ x3,x4    0,0    0,1    1,0    1,1
  0,0                2      1      2      1
  0,1                2      2      2      2
  1,0                1      1      1      1
  1,1                2      1      1      1

EXAMPLE: E STEP
- X dataset: unchanged from the initial-guess slide
- Current estimates: $\theta_Z = 0.43$
  $\theta_{X_1|Z=1} = 0.67$, $\theta_{X_2|Z=1} = 0.27$, $\theta_{X_3|Z=1} = 0.37$, $\theta_{X_4|Z=1} = 0.83$
  $\theta_{X_1|Z=2} = 0.31$, $\theta_{X_2|Z=2} = 0.68$, $\theta_{X_3|Z=2} = 0.31$, $\theta_{X_4|Z=2} = 0.21$
- Log likelihood: -2965
- Z assignments (rows are $x_1,x_2$, columns are $x_3,x_4$):

  x1,x2 \ x3,x4    0,0    0,1    1,0    1,1
  0,0                2      1      2      1
  0,1                2      2      2      1
  1,0                1      1      1      1
  1,1                2      1      2      1

EXAMPLE: E STEP
- X dataset: unchanged from the initial-guess slide
- Current estimates: $\theta_Z = 0.40$
  $\theta_{X_1|Z=1} = 0.56$, $\theta_{X_2|Z=1} = 0.31$, $\theta_{X_3|Z=1} = 0.40$, $\theta_{X_4|Z=1} = 0.92$
  $\theta_{X_1|Z=2} = 0.45$, $\theta_{X_2|Z=2} = 0.66$, $\theta_{X_3|Z=2} = 0.26$, $\theta_{X_4|Z=2} = 0.04$
- Log likelihood: -2859
- Z assignments (rows are $x_1,x_2$, columns are $x_3,x_4$):

  x1,x2 \ x3,x4    0,0    0,1    1,0    1,1
  0,0                2      1      2      1
  0,1                2      2      2      2
  1,0                1      1      1      1
  1,1                2      1      1      1

EXAMPLE: LAST E-M STEP
- X dataset: unchanged from the initial-guess slide
- Current estimates: $\theta_Z = 0.43$
  $\theta_{X_1|Z=1} = 0.51$, $\theta_{X_2|Z=1} = 0.36$, $\theta_{X_3|Z=1} = 0.35$, $\theta_{X_4|Z=1} = 1$
  $\theta_{X_1|Z=2} = 0.53$, $\theta_{X_2|Z=2} = 0.57$, $\theta_{X_3|Z=2} = 0.33$, $\theta_{X_4|Z=2} = 0$
- Log likelihood: -2683
- Z assignments (rows are $x_1,x_2$, columns are $x_3,x_4$):

  x1,x2 \ x3,x4    0,0    0,1    1,0    1,1
  0,0                2      1      2      1
  0,1                2      1      2      1
  1,0                2      1      2      1
  1,1                2      1      2      1

PROBLEM: MANY LOCAL MINIMA
- Flipping Z assignments causes large shifts in likelihood, leading to a poorly behaved energy landscape!
- Solution: EM using the marginal likelihood formulation
- "Soft" EM (this is the typical form of the EM algorithm)

EXPECTATION MAXIMIZATION FOR MML
- $\arg\max_\theta L(\theta; D) = \arg\max_\theta E_{D_Z \mid D, \theta}[L(\theta; D_Z, D)]$
- Do $\arg\max_\theta E_{D_Z \mid D, \theta}[\log L(\theta; D_Z, D)]$ instead (justified later)
- Step 1: Given current fixed $\theta_t$, find $P(D_Z \mid \theta_t, D)$: compute a distribution over each $Z[i]$
- Step 2: Use these probabilities in the expectation $E_{D_Z \mid D, \theta_t}[\log L(\theta, D_Z; D)] = Q(\theta)$. Now find $\max_\theta Q(\theta)$ (fully observed, weighted, ML parameter estimation)
- Repeat steps 1 (expectation) and 2 (maximization) until convergence
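
A compact sketch of this loop for one hidden type and several binary observables follows. It is an assumption-laden illustration rather than the lecture's implementation: data are given as a dict from observation tuples to counts, types are indexed from zero, and no safeguards are included for a type whose expected count collapses to zero.

```python
import math


def bern(p, x):
    """P(X = x) for a Bernoulli variable with P(X = 1) = p."""
    return p if x == 1 else 1.0 - p


def joint(x, z, pz, px):
    """P(x, z | theta) with px[z][i] = P(X_i = 1 | Z = z)."""
    prob = pz[z]
    for i, xi in enumerate(x):
        prob *= bern(px[z][i], xi)
    return prob


def e_step(counts, pz, px):
    """Posterior P(Z = z | x, theta_t) for every distinct observation x."""
    w = {}
    for x in counts:
        js = [joint(x, z, pz, px) for z in range(len(pz))]
        s = sum(js)
        w[x] = [j / s for j in js]
    return w


def m_step(counts, w, n_vars):
    """Weighted ML estimates using expected counts."""
    K = len(next(iter(w.values())))
    total = sum(counts.values())
    pz, px = [], []
    for z in range(K):
        mz = sum(c * w[x][z] for x, c in counts.items())  # expected count of z
        pz.append(mz / total)
        px.append([
            sum(c * w[x][z] for x, c in counts.items() if x[i] == 1) / mz
            for i in range(n_vars)
        ])
    return pz, px


def log_lik(counts, pz, px):
    """Marginal log likelihood: sum over examples of log sum_z P(x, z | theta)."""
    return sum(
        c * math.log(sum(joint(x, z, pz, px) for z in range(len(pz))))
        for x, c in counts.items()
    )


def soft_em(counts, pz, px, iters=50):
    history = [log_lik(counts, pz, px)]
    for _ in range(iters):
        w = e_step(counts, pz, px)
        pz, px = m_step(counts, w, len(px[0]))
        history.append(log_lik(counts, pz, px))
    return pz, px, history


if __name__ == "__main__":
    counts = {(1, 1): 222, (1, 0): 382, (0, 1): 364, (0, 0): 32}
    pz, px, history = soft_em(counts, [0.5, 0.5], [[0.4, 0.7], [0.3, 0.6]])
    print(pz, px, history[0], history[-1])
```

The recorded `history` lets one check empirically that the marginal log likelihood does not decrease from one iteration to the next.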

E STEP IN DETAIL
- Ultimately, want to maximize $Q(\theta \mid \theta_t) = E_{D_Z \mid D, \theta_t}[\log L(\theta; D_Z, D)]$ over $\theta$
- $Q(\theta \mid \theta_t) = \sum_m \sum_{z[m]} P(z[m] \mid x[m], \theta_t) \log P(x[m], z[m] \mid \theta)$
- E step computes the terms $w_{m,z}(\theta_t) = P(Z[m] = z \mid D, \theta_t)$ over all examples $m$ and $z \in \mathrm{Val}[Z]$
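
Because the examples are independent given $\theta_t$, each weight depends only on its own example, and Bayes' rule gives it in closed form. Written out for the discrete model introduced earlier (the factorization of the joint into $P(z \mid \theta_z)\,P(x \mid z, \theta_{x|z})$ is taken from that slide):

```latex
\begin{aligned}
w_{m,z}(\theta_t)
  &= P(Z[m] = z \mid D, \theta_t)
   = P(Z[m] = z \mid x[m], \theta_t) \\
  &= \frac{P(x[m], z \mid \theta_t)}{\sum_{z' \in \mathrm{Val}[Z]} P(x[m], z' \mid \theta_t)}
   = \frac{P(z \mid \theta_z)\, P(x[m] \mid z, \theta_{x|z})}
          {\sum_{z'} P(z' \mid \theta_z)\, P(x[m] \mid z', \theta_{x|z'})},
\qquad \sum_{z} w_{m,z}(\theta_t) = 1 .
\end{aligned}
```

All parameters on the right-hand side are evaluated at the current estimate $\theta_t$.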

M STEP IN DETAIL
- $\arg\max_\theta Q(\theta \mid \theta_t) = \arg\max_\theta \sum_m \sum_z w_{m,z}(\theta_t) \log P(x[m], z[m] = z \mid \theta) = \arg\max_\theta \prod_m \prod_z P(x[m], z[m] = z \mid \theta)^{w_{m,z}(\theta_t)}$
- This is weighted ML: each $z[m]$ is interpreted to be observed $w_{m,z}(\theta_t)$ times
- Most closed-form ML expressions (Bernoulli, categorical, Gaussian) can be adapted easily to the weighted case
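
The "observed $w$ times" reading can be checked directly: with integer weights, a weighted estimate equals the ordinary estimate on a dataset in which each point is repeated that many times. The helper names below are hypothetical, and the Gaussian variance is the ML (divide-by-total-weight) version.

```python
def weighted_bernoulli_mle(xs, ws):
    """Weighted ML estimate of P(X = 1) from 0/1 observations xs."""
    return sum(w * x for x, w in zip(xs, ws)) / sum(ws)


def weighted_gaussian_mle(xs, ws):
    """Weighted ML estimates (mean, variance) of a 1-D Gaussian."""
    total = sum(ws)
    mean = sum(w * x for x, w in zip(xs, ws)) / total
    var = sum(w * (x - mean) ** 2 for x, w in zip(xs, ws)) / total
    return mean, var


def replicate(xs, ws):
    """Repeat each x an integer number of times w."""
    return [x for x, w in zip(xs, ws) for _ in range(w)]


if __name__ == "__main__":
    xs, ws = [0.0, 2.0], [1, 3]
    rep = replicate(xs, ws)
    print(weighted_gaussian_mle(xs, ws))
    print(weighted_gaussian_mle(rep, [1] * len(rep)))
```

In EM the weights are fractional posteriors rather than integers, but the same formulas apply unchanged.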

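EXAMPLE: BERNOULLI PARAMETER FOR Z
- $\theta_Z^* = \arg\max_{\theta_Z} \sum_m \sum_z w_{m,z} \log P(x[m], z[m] = z \mid \theta)$
  $= \arg\max_{\theta_Z} \sum_m \sum_z w_{m,z} \log\big(I[z=1]\,\theta_Z + I[z=0]\,(1 - \theta_Z)\big)$
  $= \arg\max_{\theta_Z} \big[\log(\theta_Z) \sum_m w_{m,z=1} + \log(1 - \theta_Z) \sum_m w_{m,z=0}\big]$
- $\Rightarrow \theta_Z^* = \big(\sum_m w_{m,z=1}\big) / \sum_m (w_{m,z=1} + w_{m,z=0})$
- "Expected counts": $M_{\theta_t}[z] = \sum_m w_{m,z}(\theta_t)$
- Express $\theta_Z^* = M_{\theta_t}[z=1] / M_{\theta_t}[\,]$, where $M_{\theta_t}[\,]$ is the total expected count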
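
The last implication is a one-variable calculus step. Writing $a = M_{\theta_t}[z=1]$ and $b = M_{\theta_t}[z=0]$ as shorthand (these two symbols are introduced here only for brevity):

```latex
\begin{aligned}
f(\theta_Z) &= a \log \theta_Z + b \log(1 - \theta_Z), \\
f'(\theta_Z) &= \frac{a}{\theta_Z} - \frac{b}{1 - \theta_Z} = 0
  \;\Longrightarrow\; a\,(1 - \theta_Z) = b\,\theta_Z
  \;\Longrightarrow\; \theta_Z^* = \frac{a}{a + b}, \\
f''(\theta_Z) &= -\frac{a}{\theta_Z^2} - \frac{b}{(1 - \theta_Z)^2} < 0 .
\end{aligned}
```

The negative second derivative confirms a maximum. Since $w_{m,z=1} + w_{m,z=0} = 1$ for every example, $a + b = M$, so $\theta_Z^*$ is simply the average of the weights $w_{m,z=1}$.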

EM ON PRIOR EXAMPLE (100 ITERATIONS)
- X dataset: unchanged from the initial-guess slide
- Final estimates: $\theta_Z = 0.49$
  $\theta_{X_1|Z=1} = 0.64$, $\theta_{X_2|Z=1} = 0.88$, $\theta_{X_3|Z=1} = 0.41$, $\theta_{X_4|Z=1} = 0.46$
  $\theta_{X_1|Z=2} = 0.38$, $\theta_{X_2|Z=2} = 0.00$, $\theta_{X_3|Z=2} = 0.27$, $\theta_{X_4|Z=2} = 0.68$
- Log likelihood: -2833
- $P(Z = 2)$ for each observation (rows are $x_1,x_2$, columns are $x_3,x_4$):

  x1,x2 \ x3,x4    0,0    0,1    1,0    1,1
  0,0             0.90   0.95   0.84   0.93
  0,1             0.00 (all columns)
  1,0             0.76   0.89   0.64   0.82
  1,1             0.00 (all columns)

CONVERGENCE
- In general, no way to tell a priori how fast EM will converge
- Soft EM is usually slower than hard EM
- Still runs into local minima, but has more opportunities to coordinate parameter adjustments

WHY DOES IT WORK?
- Why are we optimizing over
  $Q(\theta \mid \theta_t) = \sum_m \sum_{z[m]} P(z[m] \mid x[m], \theta_t) \log P(x[m], z[m] \mid \theta)$
  rather than the true marginalized likelihood:
  $L(\theta; D) = \sum_m \sum_{z[m]} P(z[m] \mid x[m], \theta_t)\, P(x[m], z[m] \mid \theta)$?
- Can prove that:
  - The log likelihood is increased at every step
  - A stationary point of $\arg\max_\theta E_{D_Z \mid D, \theta}[L(\theta; D_Z, D)]$ is a stationary point of $\log L(\theta; D)$
- See K&F p. 882-884
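
One standard way to see the first claim is through Jensen's inequality; the sketch below is not a substitute for the K&F proof. It uses the marginal likelihood as defined on the earlier slide, $L(\theta; D) = \sum_{D_Z} P(D, D_Z \mid \theta)$, and introduces $H(\theta_t)$ as shorthand for the entropy of the E-step weights.

```latex
\begin{aligned}
\log L(\theta; D)
  &= \sum_m \log \sum_{z} w_{m,z}(\theta_t)\,
     \frac{P(x[m], z \mid \theta)}{w_{m,z}(\theta_t)} \\
  &\ge \sum_m \sum_{z} w_{m,z}(\theta_t)\,
     \log \frac{P(x[m], z \mid \theta)}{w_{m,z}(\theta_t)}
   \;=\; Q(\theta \mid \theta_t) + H(\theta_t), \\
H(\theta_t) &= -\sum_m \sum_z w_{m,z}(\theta_t) \log w_{m,z}(\theta_t),
\end{aligned}
```

with equality at $\theta = \theta_t$ because $w_{m,z}(\theta_t) = P(z \mid x[m], \theta_t)$. Since $H(\theta_t)$ does not depend on $\theta$, maximizing $Q$ gives

```latex
\log L(\theta_{t+1}; D) \;\ge\; Q(\theta_{t+1} \mid \theta_t) + H(\theta_t)
\;\ge\; Q(\theta_t \mid \theta_t) + H(\theta_t) \;=\; \log L(\theta_t; D).
```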

GAUSSIAN CLUSTERING USING EM
- One of the first uses of EM
- Widely used approach
- Finding good starting points: k-means algorithm (hard assignment)
- Handling degeneracies: regularization
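
A minimal one-dimensional sketch of Gaussian clustering with EM is given below. Several choices are assumptions made for illustration: initial means are supplied by the caller (for instance from k-means), initial variances are 1 and weights uniform, and degeneracies are handled with a simple variance floor `var_floor`, which is only one of several possible regularizers.

```python
import math


def log_normal(x, mean, var):
    """Log density of a 1-D Gaussian."""
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)


def logsumexp(vals):
    """Numerically stable log(sum(exp(v)))."""
    m = max(vals)
    return m + math.log(sum(math.exp(v - m) for v in vals))


def gmm_em(xs, init_means, iters=50, var_floor=1e-3):
    K = len(init_means)
    means = list(init_means)
    variances = [1.0] * K
    weights = [1.0 / K] * K
    for _ in range(iters):
        # E step: responsibilities resp[m][k] = P(Z = k | x[m], theta_t)
        resp = []
        for x in xs:
            logs = [
                math.log(weights[k]) + log_normal(x, means[k], variances[k])
                for k in range(K)
            ]
            norm = logsumexp(logs)
            resp.append([math.exp(l - norm) for l in logs])
        # M step: weighted ML estimates for each component
        for k in range(K):
            nk = sum(r[k] for r in resp)  # expected count of component k
            if nk < 1e-12:
                continue  # assumption: leave an empty component unchanged
            weights[k] = nk / len(xs)
            means[k] = sum(r[k] * x for r, x in zip(resp, xs)) / nk
            var = sum(r[k] * (x - means[k]) ** 2 for r, x in zip(resp, xs)) / nk
            variances[k] = max(var, var_floor)  # guard against collapse
        total = sum(weights)
        weights = [w / total for w in weights]
    return weights, means, variances


if __name__ == "__main__":
    xs = [-5.2, -5.0, -4.8, 4.8, 5.0, 5.2]
    print(gmm_em(xs, init_means=[-4.0, 4.0]))
```

Without the floor, a component that locks onto a single point would drive its variance to zero and the likelihood to infinity, which is the degeneracy the slide refers to.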

RECAP
- Learning with hidden variables
- Typically categorical
