LING 696B: Mixture model and linear dimension reduction

1 LING 696B: Mixture model and linear dimension reduction

2 Statistical estimation
Basic setup. The world: distributions p(x; θ), where θ are the parameters -- "all models may be wrong, but some are useful." Given a parameter θ, p(x; θ) tells us how to calculate the probability of x (also referred to as the "likelihood" p(x | θ)). Observations: X = {x1, x2, …, xN}, generated from some p(x; θ); N is the number of observations. Model-fitting: based on the examples X, make guesses (learning, inference) about θ.

3 Statistical estimation
Example: assume people's heights follow a normal distribution, with θ = (mean, variance); p(x; θ) is then the probability density function of the normal distribution. Observations: measurements of people's heights. Goal: estimate the parameters of the normal distribution.

4 Maximum likelihood estimate (MLE)
Likelihood function: the examples xi are independent of one another, so L(θ) = p(x1; θ) p(x2; θ) … p(xN; θ). Among all possible values of θ, choose the θ that makes L(θ) the biggest. Consistency: the estimate converges to the true θ as N grows, provided the true distribution lies in H!
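As a concrete illustration of the height example on the previous slide, here is a minimal numpy sketch (with simulated data, not data from the lecture): for a single Gaussian, maximizing L(θ) gives the sample mean and the (biased) sample variance as the estimates.

# Minimal MLE sketch: for a Gaussian, the likelihood is maximized by the
# sample mean and the sample variance (dividing by N, not N-1).
import numpy as np

rng = np.random.default_rng(0)
heights = rng.normal(loc=170.0, scale=8.0, size=500)   # simulated heights in cm

mu_hat = heights.mean()        # MLE of the mean
var_hat = heights.var()        # MLE of the variance
print(mu_hat, var_hat)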

5 H matters a lot! Example: curve fitting with polynomials
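A small numpy sketch of the point this example makes (toy data, details assumed): as the polynomial degree grows, the hypothesis space H gets richer and the fit to the training samples always improves, whether or not the higher-degree curve generalizes.

# Curve-fitting sketch: residual error on the training data always shrinks
# as the polynomial degree (the size of H) increases.
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(0, 1, 15)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=x.size)   # noisy samples

for degree in (1, 3, 9):
    coeffs = np.polyfit(x, y, degree)
    residual = np.sum((np.polyval(coeffs, x) - y) ** 2)
    print(degree, residual)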

6 Clustering Need to divide x1, x2, …, xN into clusters, without a priori knowledge of where clusters are An unsupervised learning problem: fitting a mixture model to x1, x2, …, xN Example: height of male and female follow two distributions, but don’t know gender from x1, x2, …, xN

7 The K-means algorithm Start with a random assignment, calculate the means

8 The K-means algorithm Re-assign members to the closest cluster according to the means

9 The K-means algorithm Update the means based on the new assignments, and iterate
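The two alternating steps described on the last three slides can be written in a few lines. Below is a minimal numpy sketch for a hypothetical data matrix X of shape (N, d); it is an illustration, not the implementation used for the experiments later in the talk.

# K-means sketch: assign points to the nearest mean, recompute the means, iterate.
# Assumes no cluster ever becomes empty (fine for a toy illustration).
import numpy as np

def kmeans(X, k, n_iter=50, seed=0):
    rng = np.random.default_rng(seed)
    means = X[rng.choice(len(X), size=k, replace=False)]   # random initial means
    for _ in range(n_iter):
        # Assignment step: index of the closest mean for every point
        dists = np.linalg.norm(X[:, None, :] - means[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: mean of the points assigned to each cluster
        means = np.array([X[labels == j].mean(axis=0) for j in range(k)])
    return labels, means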

10 Why does K-means work? In the beginning, the centers are poorly chosen, so the clusters overlap a lot But if centers are moving away from each other, then clusters tend to separate better Vice versa, if clusters are well-separated, then the centers will stay away from each other Intuitively, these two steps “help each other”

11 Interpreting K-means as statistical estimation
Equivalent to fitting a mixture of Gaussians with a spherical covariance for each component and a uniform prior (equal weight on each Gaussian). Problems: ambiguous data should get gradient membership; the shape of the clusters may not be spherical; the size of a cluster should play a role.

12 Multivariate Gaussian
1-D: N(μ, σ²). N-D: N(μ, Σ), where μ is an N×1 vector and Σ is an N×N matrix whose entries Σ(i, j) = σij capture the correlations between dimensions. Probability calculation: p(x; μ, Σ) = C |Σ|^(-1/2) exp{ -(x - μ)ᵀ Σ⁻¹ (x - μ) / 2 }, where ᵀ denotes transpose and Σ⁻¹ the matrix inverse. Intuitive meaning of Σ⁻¹: it tells us how to calculate the distance from x to μ.

13 Multivariate Gaussian: log likelihood and distance
The distance term (x - μ)ᵀ Σ⁻¹ (x - μ) depends on the structure of the covariance: a spherical covariance matrix (Σ⁻¹ scales all dimensions equally), a diagonal covariance matrix (Σ⁻¹ weights each dimension by its own variance), or a full covariance matrix (Σ⁻¹ also accounts for correlations between dimensions).
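A small numerical sketch (toy numbers, not lecture data) of how the quadratic distance term and the resulting log likelihood change under the three covariance structures.

# Sketch: the distance (x - mu)^T Sigma^{-1} (x - mu) and the Gaussian log
# likelihood under spherical, diagonal, and full covariance matrices.
import numpy as np

x = np.array([1.0, 2.0])
mu = np.array([0.0, 0.0])

sigmas = {
    "spherical": np.eye(2) * 2.0,                      # one variance for all dims
    "diagonal":  np.diag([2.0, 0.5]),                  # one variance per dim
    "full":      np.array([[2.0, 0.8], [0.8, 0.5]]),   # variances + correlation
}

for name, sigma in sigmas.items():
    d = x - mu
    dist = d @ np.linalg.inv(sigma) @ d
    loglik = -0.5 * (dist + np.log(np.linalg.det(sigma)) + len(x) * np.log(2 * np.pi))
    print(name, dist, loglik)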

14 Learning mixture of Gaussian: EM algorithm
Expectation: putting "soft" labels on the data -- each point gets a pair of weights (w, 1 - w), e.g. (0.05, 0.95), (0.8, 0.2), (0.5, 0.5).

15 Learning mixture of Gaussian: EM algorithm
Maximization: doing maximum likelihood with the weighted data. Notice everyone is wearing a hat! (the hats mark re-estimated quantities)
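For concreteness, here is a minimal EM sketch for a two-component 1-D Gaussian mixture (a toy model, not the speech models used later in the talk): the E-step computes the soft labels, and the M-step redoes weighted maximum likelihood to produce the "hatted" parameters.

# EM sketch for a mixture of two 1-D Gaussians.
import numpy as np
from scipy.stats import norm

def em_two_gaussians(x, n_iter=100):
    mu = np.array([x.min(), x.max()])          # crude initialization
    sigma = np.array([x.std(), x.std()])
    pi = np.array([0.5, 0.5])                  # mixture weights
    for _ in range(n_iter):
        # E-step: responsibility of each component for each data point
        dens = np.stack([pi[k] * norm.pdf(x, mu[k], sigma[k]) for k in range(2)])
        resp = dens / dens.sum(axis=0)
        # M-step: weighted maximum-likelihood updates
        nk = resp.sum(axis=1)
        mu = (resp * x).sum(axis=1) / nk
        sigma = np.sqrt((resp * (x - mu[:, None]) ** 2).sum(axis=1) / nk)
        pi = nk / len(x)
    return pi, mu, sigma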

16 EM vs. K-means Same: iterative optimization, provably converges (see demo)
EM better captures the intuition: ambiguous data are assigned gradient membership; clusters can be arbitrarily shaped "pancakes"; the size of each cluster is a parameter; this allows for flexible control based on prior knowledge (see demo)

17 EM is everywhere Our problem: the labels are important, yet not observable – “hidden variables” This situation is common for complex models, and Maximum likelihood --> EM Bayesian Networks Hidden Markov models Probabilistic Context Free Grammars Linear Dynamic Systems Of course, everything we said here still needs to be made more precise if you have to write an algorithm. “guess” has a particular meaning in terms of calculating probabilities, for example. At any rate, this kind of intuition is what is behind the algorithms. The strategy that was taken in the previous couple of slides has an official name called “EM” algorithm. It has a number of incarnations in computational linguistics. Some of them are quite sophisticated, when you dealing with problems like parsing. But they all follow the same strategy, which is to some extent the best you can do in face of missing information

18 Beyond Maximum likelihood? Statistical parsing
An interesting remark from Mark Johnson: initialize a PCFG with treebank counts, then train the PCFG on the treebank with EM. A large amount of NLP research tries to dump the first and improve the second. Measure of success: log likelihood.

19 What’s wrong with this? Mark Johnson’s idea:
Wrong data: humans don't just learn from strings. Wrong model: human syntax isn't context-free. Wrong way of calculating likelihood: p(sentence | PCFG) isn't informative. (Maybe) wrong measure of success?

20 End of excursion: Mixture of many things
Any generative model can be combined with a mixture model to deal with categorical data. Examples: mixture of Gaussians, mixture of HMMs, mixture of factor analyzers, mixture of expert networks. It all depends on what you are modeling.

21 Applying to the speech domain
Speech signals have high dimensions, so we use the front-end acoustic modeling from speech recognition: Mel-Frequency Cepstral Coefficients (MFCC). Speech sounds are dynamic, so we use dynamic acoustic modeling (MFCC + deltas) and mixture components that are Hidden Markov Models (HMMs). The second challenge is that speech segments are dynamic – they include multiple different events and can be long or short.

22 Clustering speech with K-means
Phones from TIMIT

23 Clustering speech with K-means
Diphones Words

24 What's wrong here? Longer sound sequences are more distinguishable for people, yet doing K-means on static feature vectors misses the change over time. Mixture components must be able to capture dynamic data. Solution: mixture of HMMs.

25 Mixture of HMMs
[Figure: an HMM mixture, with HMM states labeled burst, transition, and silence] Here each state is supposed to characterize a different portion of the speech, and the loops allow each portion to be longer or shorter. Learning: EM for the HMMs + EM for the mixture.

26 Mixture of HMMs
Front-end: MFCC + delta. Model-based clustering: a Gaussian mixture for single frames, an HMM mixture for whole sequences. Algorithm: initial guess by K-means, then EM.

27 Mixture of HMMs vs. K-means
Phone clustering: 7 phones from 22 speakers (1–5: cluster index). They don't fall into 5 manner classes as we wished, but at least you can see the improvement.

28 Mixture of HMMs vs. K-means
Diphone clustering: 6 diphones from 300+ speakers. If phones are somewhat static, there should be more obvious improvements on more dynamic units, so let's look at diphones.

29 Mixture of HMMs vs. K-means
Word clustering: 3 words from 300+ speakers. Distinguishing dynamic sounds is not so difficult for people; if it's hard for the computer, then the method must be wrong.

30 Growing the model Guessing 6 clusters at once is hard, but 2 is easy;
Hill-climbing strategy: start with 2, then 3, 4, ... Implementation: split the cluster with the maximum gain in likelihood. Intuition: discriminate within the biggest pile.

31 Learning categories and features with mixture model
Procedure: apply the mixture model and EM algorithm, inductively finding clusters; each split is followed by a retraining step using all the data. Here we use a kind of prefix coding to label the clusters (1, 2, 11, 12, 21, 22, …), which shows where each cluster comes from. It is a kind of mnemonic, because the classes arise from the data and we do not start with assumptions about which sounds should go under which classes. This mnemonic also helps us keep track of things more easily: for example, 112 shows that it comes from 1, then 11, …

32 % classified as Cluster 1 vs. % classified as Cluster 2
[Figure: phones (IPA and TIMIT labels) plotted by % classified as Cluster 1 (obstruent) vs. Cluster 2 (sonorant)] Maybe you can say: what if we draw a line and cut through here, then we will get the obstruents and the sonorants. Actually, what I am trying to demonstrate is that if you consider how to define obstruents and sonorants from a purely acoustic point of view, no such straight line exists! Sonority is a matter of the amount of airflow coming out of your mouth, and you cannot simply say 1 or 0. What you would expect instead is a more gradient picture like this one.

33 % classified as Cluster 11 vs. % classified as Cluster 12
[Figure: obstruents plotted by % classified as Cluster 11 (fricative) vs. Cluster 12] Here you might be surprised that the affricates do not stand out as a separate category; they stand in between the fricatives and the stops. But if you look at child speech errors, that seems to be the kind of mistake children make – taking affricates as stops or fricatives.

34 % classified as Cluster 21 vs. % classified as Cluster 22
[Figure: sonorants plotted by % classified as Cluster 21 (back sonorant) vs. Cluster 22] In this picture there is another surprise: the liquids – l, w and r – go together with the back vowels, and you don't see a special category of approximants. This is again determined by their spectral properties: acoustically, they are much more similar to back vowels than to the other liquid y, which sounds much more like a front high vowel. We are in effect "redefining" the phonetic classes only by their acoustic properties, without worrying about how they are produced.

35 % classified as Cluster 121 vs. % classified as Cluster 122
[Figure: stops plotted by % classified as Cluster 121 (oral stop) vs. Cluster 122 (nasal stop)]

36 % classified as Cluster 221 vs. % classified as Cluster 222
[Figure: front sonorants (e.g. j, i, I, eI) plotted by % classified as Cluster 221 vs. Cluster 222; the accumulated tree distinguishes fricative, oral stop, nasal, back, front, high, and low classes]

37 Summary: learning features
Discovered features: distinctions between natural classes based on spectral properties. The successive splits of the data correspond to [+sonorant] vs. [-sonorant], [+fricative] vs. [-fricative], [+back] vs. [-back], [-nasal] vs. [+nasal], and [+high] vs. [-high]. If features arise from statistical learning, I think it is natural to accept that features are gradient: for individual sounds, the feature values are gradient rather than binary (Ladefoged, 2001).

38 Evaluation: phone classification
How do the "soft" classes fit into "hard" ones? [Tables: classification results on the training set and the test set] This is just intended to show that our "soft" classes at least make sense. But it's also worth reflecting on what we mean by "errors" – are "errors" really errors?

39 Level 2: Learning segments + phonotactics
Segmentation is a kind of hidden structure, and the iterative strategy works here too. Optimization -- the augmented model p(words | units, phonotactics, segmentation): Units ← argmax p({wi} | U, P, {si}), i.e. clustering = argmax p(segments | units) -- Level 1. Phonotactics ← argmax p({wi} | U, P, {si}), i.e. estimating the transitions of a Markov chain. Segmentation ← argmax p({wi} | U, P, {si}), i.e. Viterbi decoding. We wanted to do some kind of counting, but we don't know what to count. Segmentation is an extremely useful thing, and it's what we do when we read spectrograms all the time. The second aspect of the model is how infants learn segments. As we discussed a couple of slides ago, there are two things that need to be done: one is to find where the segments are, and the other is to find what the segments are. How can we learn phonetic categories from this sort of data? A critical question is: where are the phonetic categories? So we start with a guess, use that to update the knowledge, then come back and improve the guess, and so on.

40 Iterative learning as coordinate-wise ascent
[Figure: level curves of the likelihood score, with coordinate-wise steps alternating between the units, phonotactics, and segmentation directions] The initial value comes from Level-1 learning. Each step increases the likelihood score, and the procedure eventually reaches a local maximum.

41 Level 3: Lexicon can be mixtures too
Re-clustering of words using the mixture-based lexical model. Initial values (mixture components, weights) ← bottom-up learning (Stage 2). Iterated steps: classify each word as the best exemplar of a given lexical item (also inferring the segmentation); update the lexical weights + units + phonotactics.

42 Big question: How to choose K?
Basic problem: nested hypothesis spaces, HK-1 ⊂ HK ⊂ HK+1 ⊂ … As K goes up, the likelihood always goes up. Recall the polynomial curve fitting; the same holds for mixture models (see demo).

43 Big question: How to choose K?
Idea #1: don't just look at the likelihood; look at a combination of the likelihood and something else. Bayesian Information Criterion: -2 log L(θ) + (log N) · d. Minimum Description Length: -log L(θ) + description length(θ). Akaike Information Criterion: -2 log L(θ) + 2d. In practice, you often need magical "weights" in front of the something else.
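A small sketch of how these scores could be computed for a K-component Gaussian mixture given its maximized log likelihood (the parameter count here assumes full covariance matrices; adjust it for other model families).

# Penalized model-selection scores: pick the K with the smallest score.
import numpy as np

def num_params(k, dim):
    # (k - 1) mixture weights + k means + k full covariance matrices
    return (k - 1) + k * dim + k * dim * (dim + 1) // 2

def bic(loglik, k, dim, n):
    return -2.0 * loglik + np.log(n) * num_params(k, dim)

def aic(loglik, k, dim):
    return -2.0 * loglik + 2.0 * num_params(k, dim)

# e.g. scores = {k: bic(fitted_loglik[k], k, dim, n) for k in range(1, 10)}
# where fitted_loglik[k] is the log likelihood of the EM fit with k components.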

44 Big question: How to choose K?
Idea #2: use one set of data for learning and another for testing generalization. Cross-validation: run EM until the likelihood starts to hurt on the test set (see demo). What if you have a bad test set? Jack-knife procedure: cut the data into 10 parts and do 10 rounds of training and testing.

45 Big question: How to choose K?
Idea #3: treat K as a "hyper" parameter and do Bayesian learning on K. More flexible: K can grow up and down depending on the amount of data. Allowing K to grow to infinity gives a Dirichlet process / Chinese restaurant process mixture. This needs "hyper-hyper" parameters to control how likely K is to grow, and is also computationally intensive.
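A minimal sketch of the Chinese restaurant process prior mentioned here (alpha is the usual concentration parameter, playing the role of the "hyper-hyper" parameter): the number of clusters is not fixed in advance, and a new cluster opens with probability alpha / (n + alpha).

# Chinese restaurant process: sample cluster assignments for n customers.
import numpy as np

def crp(n_customers, alpha, seed=0):
    rng = np.random.default_rng(seed)
    counts = []                                  # customers per table (cluster sizes)
    assignments = []
    for n in range(n_customers):
        probs = np.array(counts + [alpha], dtype=float) / (n + alpha)
        table = rng.choice(len(probs), p=probs)  # existing tables or a new one
        if table == len(counts):
            counts.append(1)                     # open a new table (new cluster)
        else:
            counts[table] += 1
        assignments.append(table)
    return assignments                           # number of clusters = len(set(assignments))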

46 Big question: How to choose K?
There is really no elegant universal solution. One view: statistical learning looks within HK, but does not come up with HK itself. How do people choose K? (also see later reading)

47 Dimension reduction Why dimension reduction?
Example: estimating a continuous probability distribution by counting histograms of the samples. [Figure: histogram estimates with 10, 20, and 30 bins]

48 Dimension reduction Now think about 2D, 3D …
How many bins do you need? Estimate the density with a Parzen window: p(x) ≈ (number of data points in the window) / (N × window volume). How big does the window size r need to grow?
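A small sketch of the Parzen-window estimate on toy uniform data, showing how quickly the window empties out as the dimension grows (the true density at the origin is 1 / 2^dim here, so the window counts are the thing to watch).

# Parzen-window density estimate at the origin, for growing dimension.
import numpy as np

rng = np.random.default_rng(2)
N, r = 10000, 0.2

for dim in (1, 2, 5, 10):
    X = rng.uniform(-1, 1, size=(N, dim))
    in_window = np.all(np.abs(X) < r, axis=1)   # cube window of half-width r at 0
    volume = (2 * r) ** dim
    p_hat = in_window.sum() / (N * volume)
    print(dim, in_window.sum(), p_hat)          # the window count collapses with dim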

49 Curse of dimensionality
Discrete distributions. Phonetics experiment: M speakers × N sentences × P stresses × Q segments × … Decision rules: (K) nearest-neighbor – how big a K is safe? How long do you have to wait until you are really sure they are your nearest neighbors?

50 One obvious solution Assume we know something about the distribution
This translates to a parametric approach. Example: counting histograms for 10-D data needs lots of bins, but knowing the data form a pancake allows us to fit a Gaussian: d^10 histogram parameters (with d bins per dimension) vs. how many Gaussian parameters?

51 Linear dimension reduction
Principal Components Analysis, Multidimensional Scaling, Factor Analysis, Independent Component Analysis. As we will see, we still need to assume we know something…

52 Principal Component Analysis
Many names (eigenmodes, KL transform, etc.) and relatives. The key is to understand how to make a pancake: centering, rotating and smashing. Step 1: moving the dough to the center, X ← X - μ.

53 Principal Component Analysis
Step 2: find a direction of projection that has the maximal "stretch". Linear projection of X onto a vector w: Proj_w(X) = X w, where X is the centered N×d data matrix and w is a d×1 vector. Now measure the stretch: this is the sample variance, Var(X·w).

54 Principal Component Analysis
Step 3: formulate this as a constrained optimization problem. Objective of the optimization: Var(X·w). We need a constraint on w (otherwise the variance can explode), and only the direction matters, so require ||w|| = 1. Formally: argmax over ||w|| = 1 of Var(X·w).

55 Principal Component Analysis
Some algebra (homework): Var(x) = E[(x - E[x])²] = E[x²] - (E[x])². Applied to matrices (homework): Var(X·w) = wᵀ Xᵀ X w = wᵀ Cov(X) w (why?). Cov(X) is a d×d matrix (homework): it is symmetric (easy), and for any y, yᵀ Cov(X) y ≥ 0 (tricky).
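A quick numerical check of the matrix identity above (toy data; Cov(X) is taken here as XᵀX / N for centered X, i.e. the slide's expression up to the 1/N factor).

# Check that Var(X w) = w^T Cov(X) w for centered X.
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(1000, 3))
X = X - X.mean(axis=0)                 # center the data
w = rng.normal(size=3)

lhs = np.var(X @ w)                    # sample variance of the projection
rhs = w @ (X.T @ X / len(X)) @ w       # w^T Cov(X) w
print(lhs, rhs)                        # agree up to floating-point error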

56 Principal Component Analysis
Going back to the optimization problem: w1 = argmax over ||w|| = 1 of Var(X·w) = argmax over ||w|| = 1 of wᵀ Cov(X) w. The solution is an eigenvector of Cov(X) (the one with the largest eigenvalue): w1, the first principal component!

57 More principal components
We keep looking for w2 among all the directions perpendicular to w1. Formally: w2 = argmax over ||w2|| = 1, w2 ⊥ w1 of w2ᵀ Cov(X) w2. This turns out to be another eigenvector, the one corresponding to the 2nd largest eigenvalue. New coordinates!

58 Rotation Can keep going until we pick up all d eigenvectors, perpendicular to each other. Putting these eigenvectors together, we get a big matrix W = (w1, w2, …, wd). W is called an orthogonal matrix, and multiplying by it corresponds to a rotation of the pancake; in the rotated coordinates the pancake has no correlation between dimensions.
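Putting steps 1-3 together, here is a minimal numpy PCA sketch (an illustration, not a production implementation): center the data, form the covariance matrix, and take its top eigenvectors as the rotation W.

# PCA via the eigenvectors of the covariance matrix.
import numpy as np

def pca(X, n_components):
    Xc = X - X.mean(axis=0)                   # step 1: move the dough to the center
    cov = Xc.T @ Xc / len(Xc)                 # d x d covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)    # symmetric matrix, so eigh applies
    order = np.argsort(eigvals)[::-1]         # sort by decreasing "stretch"
    W = eigvecs[:, order[:n_components]]      # columns w1, w2, ... (orthogonal)
    return Xc @ W, W                          # rotated/projected data and W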

