The EM algorithm, Lecture #11. Acknowledgement: Some slides of this lecture are due to Nir Friedman.
2 Expectation Maximization (EM) for Bayesian networks. Intuition (as before):
• When we have access to all counts, we can find the ML estimate of all parameters in all local tables directly by counting.
• However, missing values do not allow us to perform such counts.
• So instead, we compute the expected counts using the current parameter assignment, and then use them to compute the maximum likelihood estimate.
[Figure: a Bayesian network with A → B, A → C, B → D, C → D and local tables P(A=a|θ), P(B=b|A=a,θ), P(C=c|A=a,θ), P(D=d|b,c,θ).]
3 Expectation Maximization (EM).
[Figure: Data, 10 samples over X, Y, Z with '?' marking a missing value:
X: H T H H T H T H H T
Y: ? ? H T T ? ? H T T
Z: T ? ? T H T ? ? T H
Current parameters, e.g. P(Y=H|X=T,θ) = 0.4 and P(Y=H|X=H,Z=T,θ) = 0.3, are used to fill the expected-count tables N(X,Y) and N(X,Z) over all value combinations of H and T.]
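To make the expected-count idea concrete, here is a minimal Python sketch (not from the slides) that fills a table like N(X,Y) from the data above. It assumes, for simplicity, that Y depends only on X; the slide gives only P(Y=H|X=T,θ') = 0.4, so the value used for X=H is invented for illustration. In general the posterior of the missing variable given all observed evidence is used.

```python
from collections import defaultdict

# Toy data from the slide: '?' marks a missing value of Y.
X = list("HTHHTHTHHT")
Y = list("??HTT??HTT")

# Hypothetical current conditional P(Y=H | X): only P(Y=H|X=T)=0.4 appears
# on the slide; the value for X=H is made up for illustration.
p_y_h_given_x = {"T": 0.4, "H": 0.3}

# E-step idea: observed rows add a count of 1; rows with missing Y add
# fractional counts P(y | x, theta') for every possible completion y.
N = defaultdict(float)          # expected counts N(X, Y)
for x, y in zip(X, Y):
    if y != "?":
        N[(x, y)] += 1.0
    else:
        p = p_y_h_given_x[x]
        N[(x, "H")] += p
        N[(x, "T")] += 1.0 - p

for xy, c in sorted(N.items()):
    print(xy, round(c, 2))
```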
4 EM (cont.).
[Figure: the initial network (G, θ') over X1, X2, X3, Y, Z1, Z2, Z3 and the training data feed the computation of expected counts N(X1), N(X2), N(X3), N(Y, X1, X2, X3), N(Z1, Y), N(Z2, Y), N(Z3, Y) (E-step); reparameterizing from these counts gives the updated network (G, θ) (M-step); then reiterate.]
Note: This EM iteration corresponds to the non-homogeneous HMM iteration. When parameters are shared across local probability tables or are functions of each other, changes are needed.
5 EM in Practice.
Initial parameters:
• Random parameter setting
• "Best" guess from another source
Stopping criteria:
• Small change in the likelihood of the data
• Small change in parameter values
Avoiding bad local maxima:
• Multiple restarts
• Early "pruning" of unpromising runs
6 Relative Entropy, a measure of difference between distributions. We define the relative entropy H(P||Q) for two probability distributions P and Q of a variable X (with x_i being a value of X) as follows:
H(P||Q) = Σ_i P(x_i) log2( P(x_i) / Q(x_i) ).
This is a measure of difference between P(x) and Q(x). It is not a symmetric function. The distribution P(x) is assumed to be the "true" distribution and is used for taking the expectation of the log ratio. The following property holds: H(P||Q) ≥ 0, with equality if and only if P(x) = Q(x) for all x.
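A short Python helper for this definition; the two distributions below are made-up examples, used only to illustrate the non-negativity and asymmetry stated above.

```python
import math

# H(P||Q) = sum_x P(x) * log2( P(x) / Q(x) ), summing only over x with P(x) > 0.
def relative_entropy(P, Q):
    return sum(p * math.log2(p / Q[x]) for x, p in P.items() if p > 0)

P = {"a": 0.5, "b": 0.3, "c": 0.2}
Q = {"a": 0.25, "b": 0.25, "c": 0.5}

print(relative_entropy(P, Q))   # > 0
print(relative_entropy(P, P))   # = 0, equality iff P == Q
print(relative_entropy(Q, P))   # generally different: not symmetric
```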
7 Average Score for sequence comparisons. Recall that we have defined the scoring function via σ(a,b) = log2( P(a,b) / (Q(a) Q(b)) ). Note that the average score Σ_{a,b} P(a,b) σ(a,b) is the relative entropy H(P||Q), where Q(a,b) = Q(a) Q(b). Relative entropy also arises when choosing among competing models.
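A small sketch checking this identity numerically. The joint substitution probabilities P(a,b) and background frequencies Q(a) below are invented for a two-letter alphabet; the scoring function is the log-odds score recalled above.

```python
import math

# Made-up joint P(a,b) and background Q(a) over the alphabet {A, B}.
P = {("A", "A"): 0.4, ("A", "B"): 0.1, ("B", "A"): 0.1, ("B", "B"): 0.4}
Q = {"A": 0.5, "B": 0.5}

def score(a, b):
    # log-odds score sigma(a,b) = log2( P(a,b) / (Q(a) Q(b)) )
    return math.log2(P[(a, b)] / (Q[a] * Q[b]))

avg_score = sum(p * score(a, b) for (a, b), p in P.items())
rel_ent   = sum(p * math.log2(p / (Q[a] * Q[b])) for (a, b), p in P.items())
print(avg_score, rel_ent)   # identical: average score = H(P || Q(a)Q(b))
```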
8 The setup of the EM algorithm. We start with a likelihood function parameterized by θ. The observed quantity is denoted X=x; it is often a vector x_1,…,x_L of observations (e.g., evidence for some nodes in a Bayesian network). The hidden quantity is a vector Y=y (e.g., states of unobserved variables in a Bayesian network). The quantity y is defined such that, if it were known, the likelihood of the completed data point P(x,y|θ) would be easy to maximize. The log-likelihood of an observation x has the form: log P(x|θ) = log P(x,y|θ) – log P(y|x,θ) (because P(x,y|θ) = P(x|θ) P(y|x,θ)).
9 The goal of the EM algorithm. The log-likelihood of ONE observation x has the form: log P(x|θ) = log P(x,y|θ) – log P(y|x,θ). The goal: starting with a current parameter vector θ', EM's goal is to find a new vector θ such that P(x|θ) > P(x|θ'), with the highest possible difference. The result: after enough iterations, EM reaches a local maximum of the likelihood P(x|θ).
10 The Expectation Operator. Recall that the expectation of a random variable Y with distribution p(y) is given by E[Y] = Σ_y y p(y). The expectation of a function L(Y) is given by E[L(Y)] = Σ_y p(y) L(y). An example used by the EM algorithm: E_θ'[log p(x,y|θ)] = Σ_y p(y|x,θ') log p(x,y|θ), which is denoted Q(θ|θ'). The expectation operator E is linear: for two random variables X, Y and constants a, b, E[aX+bY] = a E[X] + b E[Y].
11 Improving the likelihood. Starting with log P(x|θ) = log P(x,y|θ) – log P(y|x,θ), multiplying both sides by P(y|x,θ') and summing over y yields
log P(x|θ) = Σ_y P(y|x,θ') log P(x,y|θ) – Σ_y P(y|x,θ') log P(y|x,θ),
where the first term is E_θ'[log p(x,y|θ)] = Q(θ|θ'). We now observe that
Δ = log P(x|θ) – log P(x|θ') = Q(θ|θ') – Q(θ'|θ') + Σ_y P(y|x,θ') log [P(y|x,θ') / P(y|x,θ)],
and the last term is a relative entropy, hence ≥ 0. So choosing θ* = argmax_θ Q(θ|θ') maximizes the guaranteed improvement and ensures Δ ≥ 0; repeating this process leads to a local maximum of log P(x|θ).
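A toy numeric check of this decomposition, using an invented model with one observed and one hidden binary variable (not an example from the lecture). It verifies that Δ equals the Q-gain plus a non-negative relative-entropy term.

```python
import math

# Invented toy model: P(Y=1) = p, P(X=1 | Y=y) = q[y], with theta = (p, q).
def joint(x, y, theta):
    p, q = theta
    py = p if y == 1 else 1.0 - p
    px = q[y] if x == 1 else 1.0 - q[y]
    return py * px

def marginal(x, theta):
    return sum(joint(x, y, theta) for y in (0, 1))

def posterior(y, x, theta):
    return joint(x, y, theta) / marginal(x, theta)

def Q(theta, theta_prime, x):
    # Q(theta | theta') = sum_y P(y | x, theta') log P(x, y | theta)
    return sum(posterior(y, x, theta_prime) * math.log(joint(x, y, theta))
               for y in (0, 1))

x = 1
theta_prime = (0.5, {0: 0.2, 1: 0.7})   # current parameters theta'
theta       = (0.6, {0: 0.3, 1: 0.8})   # candidate new parameters theta

delta = math.log(marginal(x, theta)) - math.log(marginal(x, theta_prime))
gain  = Q(theta, theta_prime, x) - Q(theta_prime, theta_prime, x)
kl    = sum(posterior(y, x, theta_prime) *
            math.log(posterior(y, x, theta_prime) / posterior(y, x, theta))
            for y in (0, 1))

print(delta, gain + kl)   # equal up to floating-point error
print(kl >= 0)            # the relative-entropy term is non-negative
```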
12 The EM algorithm.
Input: a likelihood function p(x,y|θ) parameterized by θ.
Initialization: fix an arbitrary starting value θ'.
Repeat:
E-step: compute Q(θ|θ') = E_θ'[log P(x,y|θ)]
M-step: θ' ← argmax_θ Q(θ|θ')
Until Δ = log P(x|θ) – log P(x|θ') < ε.
Comment: At the M-step one can actually choose any θ for which Q(θ|θ') > Q(θ'|θ'), so that Δ > 0. This change yields the so-called Generalized EM algorithm; it is important when the argmax is hard to compute.
13 The EM algorithm (with multiple independent samples). Recall that the log-likelihood of an observation x has the form: log P(x|θ) = log P(x,y|θ) – log P(y|x,θ). For independent samples (x_i, y_i), i=1,…,m, we can write: Σ_i log P(x_i|θ) = Σ_i log P(x_i,y_i|θ) – Σ_i log P(y_i|x_i,θ).
E-step: compute Q(θ|θ') = E_θ'[Σ_i log P(x_i,y_i|θ)] = Σ_i E_θ'[log P(x_i,y_i|θ)]
M-step: θ' ← argmax_θ Q(θ|θ')
Each sample is completed separately.
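A minimal sketch of this loop in Python, on the classic "two hidden coins" mixture problem (not an example from these slides): each sample is the number of heads in 10 tosses of one of two coins, and which coin was tossed is hidden. The E-step completes each sample separately with its posterior responsibility, the M-step re-estimates the parameters, iteration stops when the likelihood change drops below ε, and a few random restarts are used as suggested on slide 5. All numbers are made up.

```python
import math
import random

N_TOSSES = 10
data = [9, 8, 7, 2, 1, 8, 3, 9, 2, 8]   # heads counts per sample (made up)
# Parameters theta = (pi, pA, pB): P(coin A), P(heads | A), P(heads | B).

def log_binom(k, n, p):
    return (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
            + k * math.log(p) + (n - k) * math.log(1 - p))

def loglik(theta):
    pi, pA, pB = theta
    return sum(math.log(pi * math.exp(log_binom(k, N_TOSSES, pA))
                        + (1 - pi) * math.exp(log_binom(k, N_TOSSES, pB)))
               for k in data)

def e_step(theta):
    """Posterior responsibility P(coin = A | sample, theta'), one per sample."""
    pi, pA, pB = theta
    resp = []
    for k in data:
        a = pi * math.exp(log_binom(k, N_TOSSES, pA))
        b = (1 - pi) * math.exp(log_binom(k, N_TOSSES, pB))
        resp.append(a / (a + b))
    return resp

def m_step(resp):
    """Maximize Q(theta | theta') given the expected completions."""
    nA = sum(resp)
    headsA = sum(r * k for r, k in zip(resp, data))
    headsB = sum((1 - r) * k for r, k in zip(resp, data))
    return (nA / len(data),
            headsA / (nA * N_TOSSES),
            headsB / ((len(data) - nA) * N_TOSSES))

def run_em(theta, eps=1e-6, max_iter=500):
    ll = loglik(theta)
    for _ in range(max_iter):
        theta = m_step(e_step(theta))
        new_ll = loglik(theta)
        if new_ll - ll < eps:    # stopping criterion: small change in likelihood
            break
        ll = new_ll
    return theta, ll

# Multiple random restarts to avoid bad local maxima.
best = max((run_em((random.uniform(0.3, 0.7),
                    random.uniform(0.1, 0.9),
                    random.uniform(0.1, 0.9))) for _ in range(5)),
           key=lambda t: t[1])
print(best)
```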
14 MLE from Incomplete Data.
• Finding MLE parameters: a nonlinear optimization problem.
Expectation Maximization (EM): use the "current point" θ' to construct an alternative function (which is "nice"), E_θ'[log P(x,y|θ)].
Guarantee: the maximum of the new function has a higher likelihood than the current point.
[Figure: the log-likelihood log P(x|θ) and the alternative function E_θ'[log P(x,y|θ)] plotted against θ.]
15 Gene Counting Revisited (as EM). The observations: the variables X = (N_A, N_B, N_AB, N_O) with a specific assignment x = (n_A, n_B, n_AB, n_O). The hidden quantity: the variables Y = (N_a/a, N_a/o, N_b/b, N_b/o) with a specific assignment y = (n_a/a, n_a/o, n_b/b, n_b/o). The parameters: θ = {θ_a, θ_b, θ_o}. The likelihood of the completed data of n points:
P(x,y|θ) = P(n_AB, n_O, n_a/a, n_a/o, n_b/b, n_b/o | θ)
= C · (θ_a²)^n_a/a (2θ_a θ_o)^n_a/o (θ_b²)^n_b/b (2θ_b θ_o)^n_b/o (2θ_a θ_b)^n_AB (θ_o²)^n_O,
where C is the multinomial coefficient of the completed counts.
16 The E-step of Gene Counting. The likelihood of the hidden data given the observed data of n points factors as:
P(y|x,θ') = P(n_a/a, n_a/o | n_A, θ'_a, θ'_o) · P(n_b/b, n_b/o | n_B, θ'_b, θ'_o).
Each factor is a binomial split of an observed count; e.g., given n_A, the count N_a/a is binomial with success probability θ'_a² / (θ'_a² + 2θ'_a θ'_o). This is exactly the E-step we used earlier!
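A small sketch of this E-step, using the genotype probabilities θ_a², 2θ_a θ_o, etc. from the completed-data likelihood above. The phenotype counts and the current parameter values are invented for illustration.

```python
# Split the observed phenotype counts n_A and n_B into expected genotype counts
# under the current parameters theta' = (a, b, o).
def e_step(n_A, n_B, n_AB, n_O, theta):
    a, b, o = theta
    p_aa = a * a / (a * a + 2 * a * o)   # P(genotype a/a | phenotype A, theta')
    p_bb = b * b / (b * b + 2 * b * o)   # P(genotype b/b | phenotype B, theta')
    return {
        "a/a": n_A * p_aa, "a/o": n_A * (1 - p_aa),
        "b/b": n_B * p_bb, "b/o": n_B * (1 - p_bb),
        "a/b": n_AB, "o/o": n_O,          # these phenotypes are unambiguous
    }

print(e_step(n_A=100, n_B=50, n_AB=20, n_O=30, theta=(0.3, 0.1, 0.6)))
```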
17 The M-step of Gene Counting. The log-likelihood of the completed data of n points:
log P(x,y|θ) = const + (2n_a/a + n_a/o + n_AB) log θ_a + (2n_b/b + n_b/o + n_AB) log θ_b + (n_a/o + n_b/o + 2n_O) log θ_o.
Taking the expectation with respect to Y = (N_a/a, N_a/o, N_b/b, N_b/o) and using linearity of E yields the function Q(θ|θ') which we need to maximize:
Q(θ|θ') = const + (2 E_θ'[N_a/a] + E_θ'[N_a/o] + n_AB) log θ_a + (2 E_θ'[N_b/b] + E_θ'[N_b/o] + n_AB) log θ_b + (E_θ'[N_a/o] + E_θ'[N_b/o] + 2n_O) log θ_o.
18 The M-step of Gene Counting (cont.). We need to maximize the function Q(θ|θ') above under the constraint θ_a + θ_b + θ_o = 1. The solution (obtained using Lagrange multipliers) is given by:
θ_a = (2 E_θ'[N_a/a] + E_θ'[N_a/o] + n_AB) / (2n),
θ_b = (2 E_θ'[N_b/b] + E_θ'[N_b/o] + n_AB) / (2n),
θ_o = (E_θ'[N_a/o] + E_θ'[N_b/o] + 2n_O) / (2n),
which matches the M-step we used earlier!
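Putting the E-step and M-step together, here is a compact sketch of the whole gene-counting EM iteration, using the same genotype probabilities as above; the phenotype counts and the starting point are invented.

```python
def gene_counting_em(n_A, n_B, n_AB, n_O, theta=(1/3, 1/3, 1/3), n_iter=50):
    n = n_A + n_B + n_AB + n_O
    a, b, o = theta
    for _ in range(n_iter):
        # E-step: expected genotype counts given the current theta'
        e_aa = n_A * a * a / (a * a + 2 * a * o)
        e_ao = n_A - e_aa
        e_bb = n_B * b * b / (b * b + 2 * b * o)
        e_bo = n_B - e_bb
        # M-step: the Lagrange-multiplier solution from the slide
        a = (2 * e_aa + e_ao + n_AB) / (2 * n)
        b = (2 * e_bb + e_bo + n_AB) / (2 * n)
        o = (e_ao + e_bo + 2 * n_O) / (2 * n)
    return a, b, o

print(gene_counting_em(n_A=100, n_B=50, n_AB=20, n_O=30))
```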
19 Outline for a different derivation of Gene Counting as an EM algorithm. Define a variable X with values x_A, x_B, x_AB, x_O. Define a variable Y with values y_a/a, y_a/o, y_b/b, y_b/o, y_a/b, y_o/o. Examine the Bayesian network Y → X. The local probability table for Y is P(y_a/a|θ) = θ_a², P(y_a/o|θ) = 2θ_a θ_o, etc. The local probability table for X given Y is P(x_A | y_a/a, θ) = 1, P(x_A | y_a/o, θ) = 1, P(x_A | y_b/o, θ) = 0, etc.; it contains only 0's and 1's. Homework: write down for yourself the likelihood function for n independent points (x_i, y_i), and check that the EM equations match the gene counting equations.