
1 Expectation-Maximization

2 News o’ the day First “3-d” picture of sun Anybody got red/green sunglasses?

3 Administrivia No noose is good noose

4 Where we're at
Last time: E^3; finished up (our brief survey of) RL
Today: intro to unsupervised learning; the expectation-maximization "algorithm"

5 What’s with this EM thing? Nobody expects...

6 Unsupervised learning
EM is (one form of) unsupervised learning:
Given: data
Find: "structure" of that data
Clusters -- what points "group together"? (we'll do this one today)
Taxonomies -- what's descended from / related to what?
Parses -- grammatical structure of a sentence
Hidden variables -- "behind the scenes"

7 Example task

8 We can see the clusters easily... but the computer can't.
How can we get the computer to identify the clusters?
Need: an algorithm that takes data and returns a label (cluster ID) for each data point

9 Parsing example What’s the grammatical structure of this sentence? He never claimed to be a god.

10-11 Parsing example
He never claimed to be a god.
What's the grammatical structure of this sentence?
[Parse tree: He/N never/Adv claimed/V to be/V a/Det god/N, grouped into NP and VP constituents under S]
Note: entirely hidden information! Need to infer (guess) it in an ~unsupervised way.

12 EM assumptions
All learning algorithms require assumptions about the data
EM: generative model -- a description of the process that generates your data
Assumes: hidden (latent) variables
Probability model: assigns a probability to data + hidden variables
Often think: generate the hidden var, then generate data based on that hidden var

13 Classic latent var model
Data generator looks like this:
Behind a curtain: I flip a weighted coin
Heads: I roll a 6-sided die
Tails: I roll a 4-sided die
I show you: the outcome of the die
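A minimal sketch of this generator in Python (not from the slides): the coin bias and, for simplicity, fair dice are illustrative placeholders; the model itself allows arbitrary die-outcome probabilities.

    import random

    def generate_point(p_heads=0.7, sides_heads=6, sides_tails=4):
        """Sample one observation: flip a weighted coin behind the curtain,
        then roll the 6-sided die (heads) or the 4-sided die (tails)."""
        coin = 'H' if random.random() < p_heads else 'T'
        roll = random.randint(1, sides_heads if coin == 'H' else sides_tails)
        return coin, roll   # only `roll` is shown to the learner; `coin` stays hidden

    # data = [generate_point()[1] for _ in range(13)]   # e.g. 6, 3, 3, 1, 5, ...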

14 Your mission
Data you get is a sequence of die outcomes: 6, 3, 3, 1, 5, 4, 2, 1, 6, 3, 1, 5, 2, ...
Your task: figure out what the coin flip was for each of these numbers
Hidden variable: c ≡ outcome of the coin flip
What makes this hard?

15 A more "practical" example
Robot navigating in the physical world
Locations in the world can be occupied or unoccupied
Robot wants an occupancy map (so it doesn't bump into things)
Sensors are imperfect (noise, object variation, etc.)
Given: sensor data
Infer: occupied/unoccupied for each location

16-20 Classic latent var model
This process describes (generates) a prob distribution over numbers
Hidden state: outcome of the coin flip
Observed state: outcome of the die given (conditioned on) the coin flip result

21-23 Probability of observations
Final probability of outcome x is a mixture of the probability for each possible coin result:
  Pr[x] = Pr[c = heads] · Pr[x | c = heads] + Pr[c = tails] · Pr[x | c = tails]
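As a hypothetical concrete version in Python, with th_heads and th_tails as dictionaries mapping each die outcome to its probability (names are illustrative, not from the slides):

    def prob_x(x, p, th_heads, th_tails):
        """Mixture: Pr[x] = p * Pr[x | heads] + (1 - p) * Pr[x | tails]."""
        return p * th_heads.get(x, 0.0) + (1 - p) * th_tails.get(x, 0.0)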

24 Your goal
Given the model and data x_1, x_2, ..., x_n
Find Pr[c_i | x_i]
So we need the model
Model given by parameters: Θ = ⟨p, θ_heads, θ_tails⟩
where θ_heads and θ_tails are the die outcome probabilities and p is the prob of heads

25 Where's the problem?
To get Pr[c_i | x_i], you need Pr[x_i | c_i]
To get Pr[x_i | c_i], you need the model parameters
To get the model parameters, you need Pr[c_i | x_i]
Oh oh...
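Concretely, the posterior you want comes from Bayes' rule, and evaluating it requires exactly the parameters Θ you don't have yet (a sketch with the same illustrative names as above):

    def prob_heads_given_x(x, p, th_heads, th_tails):
        """Pr[c = heads | x] via Bayes' rule -- only computable once Theta is known."""
        num = p * th_heads.get(x, 0.0)
        den = num + (1 - p) * th_tails.get(x, 0.0)
        return num / den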

26 EM to the rescue!
Turns out you can run this "chicken and egg" process in a loop and eventually get the right* answer:
Make an initial guess about coin assignments
Repeat:
  Use guesses to get parameters (M step)
  Use parameters to update coin guesses (E step)
Until converged

27 EM to the rescue!
function [Prc, Theta] = EM(X)
  // initialization
  Prc = pick_random_values()
  // the EM loop
  repeat {
    // M step: pick maximum-likelihood parameters:
    //   argmax_theta Pr[x, c | theta]
    Theta = get_params_from_c(X, Prc)
    // E step: use the complete model to get the label posterior:
    //   Pr[c|x] = (1/Z) * Pr[x | c, theta] * Pr[c | theta]
    Prc = get_labels_from_params(X, Theta)
  } until (converged)
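A runnable sketch of this loop for the coin/dice model, assuming soft coin guesses Pr[heads | x_i]; the numpy code and its starting values are illustrative, not the course's implementation:

    import numpy as np

    def em_coin_dice(x, n_iters=100, seed=0):
        """x: sequence of die outcomes in 1..6. Returns Pr[heads | x_i] and (p, th_h, th_t)."""
        x = np.asarray(x)
        prc = np.random.default_rng(seed).uniform(0.25, 0.75, size=len(x))  # initial guesses
        for _ in range(n_iters):
            # M step: maximum-likelihood parameters given the current soft guesses
            p = prc.mean()
            th_h = np.array([np.sum(prc * (x == v)) for v in range(1, 7)]) + 1e-12
            th_t = np.array([np.sum((1 - prc) * (x == v)) for v in range(1, 5)]) + 1e-12
            th_h, th_t = th_h / th_h.sum(), th_t / th_t.sum()
            # E step: posterior over the coin, Pr[heads | x] proportional to p * th_h[x]
            lik_h = p * th_h[x - 1]
            lik_t = (1 - p) * np.where(x <= 4, th_t[np.minimum(x, 4) - 1], 0.0)
            prc = lik_h / (lik_h + lik_t)
        return prc, (p, th_h, th_t)

    # prc, theta = em_coin_dice([6, 3, 3, 1, 5, 4, 2, 1, 6, 3, 1, 5, 2])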

28 Weird, but true
This is counterintuitive, but it works
Essentially, you're improving your guesses on each step
M step "maximizes" the parameters, Θ, given the data
E step finds the "expectation" of the hidden data, given Θ
Both are driving toward the max-likelihood joint soln
Guaranteed to converge
Not guaranteed to find the global optimum...

29 Very easy example Two Gaussian (“bell curve”) clusters Well separated in space Two dimensions

30-40 [Figure slides: successive EM iterations on the two well-separated Gaussian clusters]

41-47 In more detail
Gaussian mixture w/ k "components" (clusters/blobs)
Mixture probability:
  Pr[x] = sum_{i=1..k} α_i N(x; μ_i, Σ_i),
  N(x; μ_i, Σ_i) = 1 / ((2π)^{d/2} |Σ_i|^{1/2}) · exp( -(1/2) (x - μ_i)^T Σ_i^{-1} (x - μ_i) )
Reading the pieces:
  the sum has one Gaussian term for each component
  α_i -- weight (probability) of each component
  N(x; μ_i, Σ_i) -- Gaussian distribution for each component w/ mean vector μ_i and covariance matrix Σ_i
  1 / ((2π)^{d/2} |Σ_i|^{1/2}) -- normalizing term for the Gaussian
  (x - μ_i)^T Σ_i^{-1} (x - μ_i) -- squared distance of data point x from mean μ_i (with respect to Σ_i)
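A small numpy sketch of this density (illustrative names, full covariances assumed):

    import numpy as np

    def gaussian_pdf(x, mu, sigma):
        """N(x; mu, sigma): multivariate normal density at a single point x."""
        d = len(mu)
        diff = x - mu
        norm = 1.0 / np.sqrt((2 * np.pi) ** d * np.linalg.det(sigma))
        return norm * np.exp(-0.5 * diff @ np.linalg.inv(sigma) @ diff)

    def mixture_prob(x, alphas, mus, sigmas):
        """Pr[x] = sum_i alpha_i * N(x; mu_i, Sigma_i)."""
        return sum(a * gaussian_pdf(x, m, s) for a, m, s in zip(alphas, mus, sigmas))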

48 Hidden variables
Introduce the "hidden variable" c_i(x) (or just c_i for short)
Denotes the "amount by which data point x belongs to cluster i"
Sometimes called "cluster ownership", "salience", "relevance", etc.

49-53 M step
Need: parameters (Θ) given the hidden variables (c_i) and N data points x_1, x_2, ..., x_N
Q: what are the parameters of the model? (What do we need to learn?)
A: Θ = ⟨α_i, μ_i, Σ_i⟩, i = 1..k
Maximum-likelihood updates, given the cluster ownerships:
  α_i = (1/N) sum_j c_i(x_j)
  μ_i = sum_j c_i(x_j) x_j / sum_j c_i(x_j)
  Σ_i = sum_j c_i(x_j) (x_j - μ_i)(x_j - μ_i)^T / sum_j c_i(x_j)
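A numpy sketch of these updates, assuming the ownerships are stored as an N×k matrix C with C[j, i] = c_i(x_j) and the data as an N×d matrix X (this layout and the names are my assumption, not the slides'):

    import numpy as np

    def m_step(X, C):
        """Re-estimate (alphas, mus, sigmas) from data X (N x d) and ownerships C (N x k)."""
        N, d = X.shape
        Nk = C.sum(axis=0)                                   # effective number of points per component
        alphas = Nk / N
        mus = (C.T @ X) / Nk[:, None]
        sigmas = []
        for i in range(C.shape[1]):
            diff = X - mus[i]
            sigmas.append((C[:, i, None] * diff).T @ diff / Nk[i])
        return alphas, mus, np.array(sigmas)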

54 E step
Need: probability of the hidden variable (c_i) given fixed parameters (Θ) and observed data (x_1, ..., x_N):
  c_i(x_j) = α_i N(x_j; μ_i, Σ_i) / sum_{l=1..k} α_l N(x_j; μ_l, Σ_l)
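The matching numpy sketch for the E step, using scipy's multivariate normal density (same assumed layout as the M-step sketch above):

    import numpy as np
    from scipy.stats import multivariate_normal

    def e_step(X, alphas, mus, sigmas):
        """Ownerships C[j, i] = Pr[cluster i | x_j] under the current parameters."""
        N, k = X.shape[0], len(alphas)
        C = np.empty((N, k))
        for i in range(k):
            C[:, i] = alphas[i] * multivariate_normal.pdf(X, mean=mus[i], cov=sigmas[i])
        return C / C.sum(axis=1, keepdims=True)              # each row sums to 1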

55 Another example k=3 Gaussian clusters Different means, covariances Well separated

56-60 [Figure slides: an EM run on the k=3 example that, because of the starting point, settles into a poor local optimum]

61 Restart
Problem: EM has found a "minimum energy" solution
It's only "locally" optimal
B/c of the poor starting choice, it ended up in the wrong local optimum -- not the global optimum
Default answer: pick a new random start and re-run
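A sketch of that restart recipe, assuming a hypothetical em_gmm(X, k, seed) routine built from the E/M steps above that returns fitted parameters and the final data log-likelihood; keep the restart that scores best:

    import numpy as np

    def em_with_restarts(X, k, n_restarts=10):
        """Run EM from several random starts and keep the highest-likelihood solution."""
        best_ll, best_params = -np.inf, None
        for seed in range(n_restarts):
            params, log_lik = em_gmm(X, k, seed=seed)        # hypothetical EM routine
            if log_lik > best_ll:
                best_ll, best_params = log_lik, params
        return best_params, best_ll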

62-65 [Figure slides: the k=3 example re-run from a new random start]

66 Final example More Gaussians. How many clusters here?

67-71 [Figure slides: EM run on the final Gaussian example]

72 Note...
Doesn't always work out this well in practice
Sometimes the machine is smarter than humans
Usually, if it's hard for us, it's hard for the machine too...
The first ~7-10 times I ran this one, it lost one cluster altogether (α_3 → 0.0001)

73 Unresolved issues
Notice: different cluster IDs (colors) end up on different blobs of points in each run
The answer is "unique only up to permutation"
I can swap around cluster IDs without changing the solution
Can't tell what the "right" cluster assignment is

74 Unresolved issues
"Order" of the model -- i.e., what k should you use?
Hard to know, in general
Can just try a bunch and find the one that "works best" (see the sketch below)
Problem: the answer tends to get monotonically better w/ increasing k
Best answer to date: Chinese restaurant process
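Because the raw likelihood only improves as k grows, "works best" usually means scoring each k with a complexity penalty or held-out data; the slides don't prescribe one, so the sketch below uses BIC together with the hypothetical em_gmm routine from earlier:

    import numpy as np

    def choose_k_by_bic(X, k_candidates=range(1, 8)):
        """Fit a mixture for each candidate k and keep the one with the lowest BIC."""
        N, d = X.shape
        best = None
        for k in k_candidates:
            params, log_lik = em_gmm(X, k, seed=0)           # hypothetical EM routine
            n_free = (k - 1) + k * d + k * d * (d + 1) // 2  # weights + means + covariances
            bic = n_free * np.log(N) - 2.0 * log_lik
            if best is None or bic < best[0]:
                best = (bic, k, params)
        return best[1], best[2]                              # chosen k and its parameters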

