EM and expected complete log-likelihood Mixture of Experts

1 EM and expected complete log-likelihood Mixture of Experts
Learning Theory, Reza Shadmehr. Topics: EM and the expected complete log-likelihood; mixture of experts; identification of a linear dynamical system.

2 The log likelihood of the unlabeled data
Hidden variable z; measured variable x; the unlabeled data. In the last lecture we assumed, in the M step, that we knew the posterior probabilities, and we found the derivative of the log-likelihood with respect to mu and sigma to maximize the log-likelihood. Today we take a more general approach that includes both the E and M steps in the log-likelihood.
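As a concrete reference, here is a sketch of this log-likelihood for a mixture of m Gaussians (the symbols pi_i, mu_i, Sigma_i follow the usual convention and are an assumption about the slide's notation):

\ell(\theta) = \sum_{n=1}^{N} \log p(x_n \mid \theta) = \sum_{n=1}^{N} \log \sum_{i=1}^{m} \pi_i \, \mathcal{N}(x_n \mid \mu_i, \Sigma_i)

The sum over components sits inside the logarithm, which is what makes direct maximization awkward and motivates EM.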

3 A more general formulation of EM: Expected complete log likelihood
The real data are not labeled. But for now, assume that someone labeled them, resulting in the "complete data". From the complete data we form the complete log-likelihood and then the expected complete log-likelihood. In EM, in the E step we fix theta and maximize the expected complete log-likelihood by setting the expected value of the hidden variables z to the posterior probabilities. In the M step, we fix the expected value of z and maximize the expected complete log-likelihood by setting the parameters theta.
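For the same Gaussian mixture, a sketch of the two quantities named above (z_n^{(i)} is the one-of-m indicator labeling point n; the notation is an assumption):

\ell_c(\theta) = \sum_{n=1}^{N} \sum_{i=1}^{m} z_n^{(i)} \left[ \log \pi_i + \log \mathcal{N}(x_n \mid \mu_i, \Sigma_i) \right]

\langle \ell_c(\theta) \rangle = \sum_{n=1}^{N} \sum_{i=1}^{m} \langle z_n^{(i)} \rangle \left[ \log \pi_i + \log \mathcal{N}(x_n \mid \mu_i, \Sigma_i) \right], \qquad \langle z_n^{(i)} \rangle = p(z_n^{(i)} = 1 \mid x_n, \theta)

Because the indicator moves outside the logarithm, the expected complete log-likelihood is much easier to maximize than the log-likelihood of the unlabeled data.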

4 A more general formulation of EM: Expected complete log likelihood
In the M step, we fix the expected value of z and maximize the expected complete log-likelihood with respect to the parameters theta.

5

6 Function to maximize
The value of pi that maximizes this function is one. But that is not interesting, because we also have another constraint: the sum of the priors should be one. So we want to maximize this function subject to the constraint that the pi_i sum to 1.

7 Function to maximize, function to minimize, constraint
We combine the function to maximize with the constraint. We have three such equations, one for each pi_i. Adding the equations together lets us solve for the multiplier, as sketched below.
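A sketch of that derivation with a Lagrange multiplier lambda, assuming the pi-dependent part of the expected complete log-likelihood has the form sum_n sum_i <z_n^{(i)}> log pi_i, as in the Gaussian-mixture sketch above:

L(\pi, \lambda) = \sum_{n} \sum_{i} \langle z_n^{(i)} \rangle \log \pi_i + \lambda \Big( 1 - \sum_{i} \pi_i \Big)

\frac{\partial L}{\partial \pi_i} = \frac{\sum_n \langle z_n^{(i)} \rangle}{\pi_i} - \lambda = 0 \quad \Rightarrow \quad \pi_i = \frac{1}{\lambda} \sum_n \langle z_n^{(i)} \rangle

Adding the three equations and using \sum_i \pi_i = 1 gives \lambda = \sum_i \sum_n \langle z_n^{(i)} \rangle = N, so \pi_i = \frac{1}{N} \sum_n \langle z_n^{(i)} \rangle.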

8 EM algorithm: Summary
We begin with a guess about the mixture parameters.
The "E" step: calculate the expected complete log-likelihood. In the mixture example, this reduces to just computing the posterior probabilities.
The "M" step: maximize the expected complete log-likelihood with respect to the model parameters theta.
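A minimal numerical sketch of this loop for a one-dimensional Gaussian mixture; the function name em_gaussian_mixture and the restriction to 1-D data are choices made here for brevity, not part of the lecture:

import numpy as np

def em_gaussian_mixture(x, m, n_iter=100):
    """EM for a 1-D mixture of m Gaussians; x is a 1-D numpy array."""
    rng = np.random.default_rng(0)
    N = len(x)
    # initial guess about the mixture parameters
    pi = np.full(m, 1.0 / m)
    mu = rng.choice(x, size=m, replace=False)
    var = np.full(m, np.var(x))
    for _ in range(n_iter):
        # E step: posterior probability of each component for each data point
        lik = np.exp(-0.5 * (x[:, None] - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)
        post = pi * lik
        post /= post.sum(axis=1, keepdims=True)
        # M step: maximize the expected complete log-likelihood
        Nk = post.sum(axis=0)
        pi = Nk / N
        mu = (post * x[:, None]).sum(axis=0) / Nk
        var = (post * (x[:, None] - mu) ** 2).sum(axis=0) / Nk
    return pi, mu, var

In practice one would also monitor the log-likelihood for convergence and guard against collapsing variances; those details are omitted here.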

9 Selecting number of mixture components
A simple idea that helps with selecting the number of mixture components is to form a cost that depends on both the log-likelihood of the data and the number of parameters used in the model. As the number of parameters increases, the log-likelihood increases, so we want a cost that balances the change in the log-likelihood against the cost of adding parameters. A common technique is to find the number of mixture components m that minimizes the "description length", which involves the maximum-likelihood estimate of the parameters for m mixture components, the effective number of parameters in the model, and the number of data points.
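A standard form of the description-length cost, taken here as an assumption about the slide's exact expression:

\mathrm{DL}(m) = -\log p(X \mid \hat{\theta}_m) + \frac{d_m}{2} \log N

where \hat{\theta}_m is the maximum-likelihood estimate of the parameters for m mixture components, d_m is the effective number of parameters, and N is the number of data points; we pick the m that minimizes DL(m).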

10 Mixture of Experts
The data set (x, y) is clearly non-linear, but we can break it up into two linear problems. We will try to switch from one "expert" to the other at around x = 0.
[Figure: the data fit by Expert 1 and Expert 2, combined through a moderator, and a plot of the conditional probability of choosing expert 2 as a function of x.]

11 The Moderator (gating network)
We have observed a sequence of data points (x, y) and believe that they were generated by the process shown on the slide. Note that y depends on both x (which we can measure) and z, which is hidden from us. For example, the dependence of y on x might be a simple linear model, but conditioned on z, where z is multinomial. When there are only two experts, the moderator can be a logistic function; when there are multiple experts, the moderator can be a soft-max function.
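A minimal sketch of the two forms of the moderator; the linear parameterization (one score v_i'x per expert) and the function names are assumptions made for illustration:

import numpy as np

def moderator_logistic(x, v):
    """Two experts: p(z = expert 1 | x) as a logistic function of v'x."""
    return 1.0 / (1.0 + np.exp(-(x @ v)))

def moderator_softmax(x, V):
    """Multiple experts: p(z = expert i | x) as a soft-max over linear scores."""
    eta = x @ V.T                                  # one score per expert
    eta = eta - eta.max(axis=-1, keepdims=True)    # subtract max for numerical stability
    e = np.exp(eta)
    return e / e.sum(axis=-1, keepdims=True)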

12 Based on our hypothesis, we should have the following distribution of observed data:
A key quantity is the posterior probability of the latent variable z: the posterior probability that the observed y "belongs" to the i-th expert, written in terms of the parameters of the moderator and the parameters of the experts. Note that the posterior probability for the i-th expert is updated based on how probable the observed data y were for this expert. In other words, the expression tells us how strongly we should assign the observed data y to expert i.
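A sketch of that posterior for linear Gaussian experts with a soft-max moderator (the parameter names V, W, sigma2 are assumptions; the structure follows the slides):

import numpy as np

def expert_posterior(X, y, V, W, sigma2):
    """h[n, i]: posterior that data point (x_n, y_n) belongs to expert i.
    X: N x d inputs, y: N outputs, V: m x d gate weights, W: m x d expert weights,
    sigma2: expert output variances (length m or scalar)."""
    eta = X @ V.T                                   # moderator scores
    g = np.exp(eta - eta.max(axis=1, keepdims=True))
    g /= g.sum(axis=1, keepdims=True)               # moderator output g_i(x_n)
    mu = X @ W.T                                    # each expert's prediction of y_n
    lik = np.exp(-0.5 * (y[:, None] - mu) ** 2 / sigma2) / np.sqrt(2 * np.pi * sigma2)
    h = g * lik                                     # prior (gate) times likelihood
    return h / h.sum(axis=1, keepdims=True)         # normalize across experts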

13 Output of the i-th expert
The output of the whole network combines the output of the moderator with the output of each expert (each expert having its own parameters). Suppose there are two experts (m = 2). For a given value of x, the two regressions each give us a Gaussian distribution centered at their mean, so for each value of x we have a bimodal probability distribution for y: a mixture distribution in the output space y for each input value of x. From this mixture we obtain the log-likelihood of the observed data.
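Written out for linear Gaussian experts with gate g_i(x) (the symbols w_i and sigma_i^2 are assumptions consistent with the regression picture above), the mixture density and the observed-data log-likelihood are:

p(y \mid x, \theta) = \sum_{i=1}^{m} g_i(x) \, \mathcal{N}(y \mid w_i^{\top} x, \sigma_i^2)

\ell(\theta) = \sum_{n=1}^{N} \log \sum_{i=1}^{m} g_i(x_n) \, \mathcal{N}(y_n \mid w_i^{\top} x_n, \sigma_i^2)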

14 The complete log-likelihood for the mixture of experts problem
The "completed" data include the hidden labels. From them we write the complete log-likelihood and then, taking the expectation over the labels (assuming that someone had given us theta), the expected complete log-likelihood.
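Under the same assumed notation, a sketch of the two quantities, with h_n^{(i)} the posterior responsibility of expert i for data point n:

\ell_c(\theta) = \sum_{n} \sum_{i} z_n^{(i)} \left[ \log g_i(x_n) + \log \mathcal{N}(y_n \mid w_i^{\top} x_n, \sigma_i^2) \right]

\langle \ell_c(\theta) \rangle = \sum_{n} \sum_{i} h_n^{(i)} \left[ \log g_i(x_n) + \log \mathcal{N}(y_n \mid w_i^{\top} x_n, \sigma_i^2) \right]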

15 The E step for the mixture of experts problem
In the E step, we begin by assuming that we have theta. To compute the expected complete log-likelihood, all we need are the posterior probabilities. The posterior for each expert depends on the likelihood that the observed data y came from that expert.

16 The M step for the mixture of experts problem: the moderator
This cost has exactly the same form as the IRLS cost function. We compute the first and second derivatives and obtain a learning rule: the moderator learns from the posterior probability.
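A sketch of one Newton (IRLS) step for the two-expert, logistic moderator, with the posterior h_n playing the role of the target; the function name and the undamped step are choices made here:

import numpy as np

def irls_step_moderator(X, h, v):
    """One Newton step for the gate weights v.
    X: N x d inputs, h: posterior p(expert 1 | x_n, y_n), v: current gate weights."""
    g = 1.0 / (1.0 + np.exp(-(X @ v)))       # current moderator output
    grad = X.T @ (h - g)                     # first derivative of the IRLS-style cost
    R = g * (1.0 - g)                        # logistic weights
    H = X.T @ (X * R[:, None])               # second-derivative (Hessian) term
    return v + np.linalg.solve(H, grad)      # move the gate toward the posterior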

17 The M step for the mixture of experts problem: weights of the expert
This is a weighted least-squares problem. Expert i learns from the observed data point y, weighted by the posterior probability that the error came from that expert.
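A sketch of the resulting update for expert i, with the posteriors h_i as the weights (the closed form below is the standard weighted least-squares solution; the helper name is an assumption):

import numpy as np

def expert_weights_m_step(X, y, h_i):
    """w_i = (X' H X)^(-1) X' H y with H = diag(h_i), the posterior weights."""
    XtH = X.T * h_i                            # apply the posterior weight to each data point
    return np.linalg.solve(XtH @ X, XtH @ y)   # solve the weighted normal equations

For example, with the posterior matrix h returned by expert_posterior above, expert i's weights are expert_weights_m_step(X, y, h[:, i]).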

18 The M step for the mixture of experts problem: variance of the expert
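Under the Gaussian-expert model assumed in the sketches above, the M-step update for each expert's variance is the posterior-weighted residual variance (a standard result, stated here as a sketch rather than as the slide's own equation):

\sigma_i^2 = \frac{\sum_n h_n^{(i)} \left( y_n - w_i^{\top} x_n \right)^2}{\sum_n h_n^{(i)}}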

19 Parameter Estimation for Linear Dynamical Systems using EM
Objective: find the parameters A, B, C, Q, and R of a linear dynamical system from a set of data that includes the inputs u and the outputs y. As before, we need to form the expected complete log-likelihood.
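For reference, the generative model implied by these parameters (the noise conventions are an assumption consistent with the named matrices):

x_{t+1} = A x_t + B u_t + w_t, \qquad w_t \sim \mathcal{N}(0, Q)

y_t = C x_t + v_t, \qquad v_t \sim \mathcal{N}(0, R)

where x_t is the hidden state, u_t the measured input, and y_t the measured output.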

20

21

22 Posterior estimate of state and variance

23

24

25

