LING 696B: Mixture model and linear dimension reduction

Statistical estimation. Basic setup: the world consists of distributions p(x; θ), where θ denotes the parameters ("all models may be wrong, but some are useful"). Given a parameter θ, p(x; θ) tells us how to calculate the probability of x (also referred to as the likelihood p(x|θ)). Observations: X = {x1, x2, …, xN} are generated from some p(x; θ); N is the number of observations. Model-fitting: based on some examples X, make guesses (learning, inference) about θ.

Statistical estimation. Example: assume people's heights follow a normal distribution with parameters θ = (mean, variance), so p(x; θ) is the probability density function of the normal distribution. Observation: measurements of people's heights. Goal: estimate the parameters of the normal distribution.

Maximum likelihood estimate (MLE). Likelihood function: the examples xi are independent of one another, so L(θ) = ∏i p(xi; θ). Among all the possible values of θ, choose the θ̂ that makes L(θ) the biggest: θ̂ = argmax_θ L(θ). Consistency: as N grows, θ̂ converges to the true θ, provided the true distribution lies in the hypothesis space H.
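To make the height example concrete, here is a minimal sketch (not from the original slides) of MLE for a 1-D Gaussian on hypothetical data; the closed-form MLE is the sample mean and the (biased) sample variance.

```python
import numpy as np

def gaussian_mle(x):
    """MLE for a 1-D Gaussian: the sample mean and the (biased) sample variance."""
    mu_hat = x.mean()
    var_hat = ((x - mu_hat) ** 2).mean()   # divides by N, not N-1
    return mu_hat, var_hat

def log_likelihood(x, mu, var):
    """log L(theta) = sum_i log p(x_i; mu, var), using independence of the x_i."""
    return np.sum(-0.5 * np.log(2 * np.pi * var) - (x - mu) ** 2 / (2 * var))

# Hypothetical height measurements (cm), just for illustration.
rng = np.random.default_rng(0)
heights = rng.normal(loc=170, scale=8, size=500)
mu_hat, var_hat = gaussian_mle(heights)
print(mu_hat, var_hat, log_likelihood(heights, mu_hat, var_hat))
```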

H matters a lot! Example: curve fitting with polynomials

Clustering. Need to divide x1, x2, …, xN into clusters, without a priori knowledge of where the clusters are. An unsupervised learning problem: fitting a mixture model to x1, x2, …, xN. Example: the heights of males and females follow two different distributions, but we don't know the gender behind each xi.

The K-means algorithm. Start with a random assignment and calculate the means.

The K-means algorithm. Re-assign members to the closest cluster according to the means.

The K-means algorithm. Update the means based on the new assignments, and iterate.
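As a runnable summary of the steps above, here is a bare-bones K-means sketch in Python/NumPy (my own illustration, assuming no cluster ever becomes empty):

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """Plain K-means: random initial assignment, then alternate the two steps."""
    rng = np.random.default_rng(seed)
    labels = rng.integers(k, size=len(X))              # random initial assignment
    for _ in range(n_iters):
        # Step 1: compute the mean of each cluster (assumes no cluster is empty).
        means = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Step 2: re-assign each point to the closest mean.
        dists = np.linalg.norm(X[:, None, :] - means[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if np.array_equal(new_labels, labels):         # no change: converged
            break
        labels = new_labels
    return labels, means
```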

Why does K-means work? In the beginning, the centers are poorly chosen, so the clusters overlap a lot. But if the centers are moving away from each other, then the clusters tend to separate better. Vice versa, if the clusters are well separated, then the centers will stay away from each other. Intuitively, these two steps "help each other".

Interpreting K-means as statistical estimation. It is equivalent to fitting a mixture of Gaussians with spherical covariance and a uniform prior (equal weights on each Gaussian). Problems: ambiguous data should have gradient membership; the shape of the clusters may not be spherical; the size of the cluster should play a role.

Multivariate Gaussian. 1-D: N(μ, σ²). N-D: N(μ, Σ), where μ is an N×1 vector and Σ is an N×N matrix with Σ(i,j) = σij, the covariance (correlation) between dimensions i and j. Probability calculation: p(x; μ, Σ) = (2π)^(-N/2) |Σ|^(-1/2) exp{ -(1/2)(x-μ)^T Σ^(-1) (x-μ) } (T = transpose, -1 = inverse). Intuitive meaning of Σ^(-1): it defines how to calculate the distance from x to μ.

Multivariate Gaussian: log likelihood and distance. The form of Σ^(-1) determines the distance to the mean: a spherical covariance matrix gives ordinary Euclidean distance; a diagonal covariance matrix scales each dimension separately; a full covariance matrix gives the general Mahalanobis distance.
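A small sketch (mine, not from the slides) that evaluates the log density and makes the three covariance cases concrete; the quadratic term is the squared Mahalanobis distance:

```python
import numpy as np

def gaussian_logpdf(x, mu, Sigma):
    """Log N(x; mu, Sigma); the quadratic term is the squared Mahalanobis distance."""
    d = len(mu)
    diff = x - mu
    maha2 = diff @ np.linalg.solve(Sigma, diff)        # (x-mu)^T Sigma^{-1} (x-mu)
    _, logdet = np.linalg.slogdet(Sigma)
    return -0.5 * (d * np.log(2 * np.pi) + logdet + maha2)

mu = np.zeros(2)
x = np.array([1.0, 2.0])
for Sigma in (np.eye(2),                               # spherical: Euclidean distance
              np.diag([1.0, 4.0]),                     # diagonal: per-dimension scaling
              np.array([[2.0, 1.5], [1.5, 2.0]])):     # full: Mahalanobis distance
    print(gaussian_logpdf(x, mu, Sigma))
```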

Learning mixtures of Gaussians: the EM algorithm. Expectation: putting "soft" labels on the data -- each point gets a pair of membership weights (p, 1-p), e.g. (0.05, 0.95), (0.8, 0.2), (0.5, 0.5).

Learning mixtures of Gaussians: the EM algorithm. Maximization: doing maximum likelihood with the weighted data. Notice everyone is wearing a hat (all the re-estimated quantities are parameter estimates, written with hats)!
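The two steps fit together as in this bare-bones EM sketch for a Gaussian mixture (my own illustration, with no safeguards against collapsing components); gamma holds the soft labels from the E-step, and the M-step is weighted maximum likelihood:

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, k, n_iters=50, seed=0):
    """Bare-bones EM for a mixture of Gaussians."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    mus = X[rng.choice(n, size=k, replace=False)]      # initialize means at random points
    Sigmas = np.array([np.cov(X.T) + 1e-6 * np.eye(d) for _ in range(k)])
    weights = np.full(k, 1.0 / k)
    for _ in range(n_iters):
        # E-step: soft labels gamma[i, j] = P(component j | x_i).
        dens = np.column_stack([w * multivariate_normal.pdf(X, m, S)
                                for w, m, S in zip(weights, mus, Sigmas)])
        gamma = dens / dens.sum(axis=1, keepdims=True)
        # M-step: weighted maximum likelihood for each component.
        Nk = gamma.sum(axis=0)
        weights = Nk / n
        mus = (gamma.T @ X) / Nk[:, None]
        for j in range(k):
            diff = X - mus[j]
            Sigmas[j] = (gamma[:, j, None] * diff).T @ diff / Nk[j] + 1e-6 * np.eye(d)
    return weights, mus, Sigmas, gamma
```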

EM vs. K-means. Same: iterative optimization, provably converges (see demo). EM better captures the intuition: ambiguous data are assigned gradient membership; clusters can be arbitrarily shaped pancakes; the size of each cluster is a parameter; this allows for flexible control based on prior knowledge (see demo).

EM is everywhere. Our problem: the labels are important, yet not observable -- "hidden variables". This situation is common for complex models, where maximum likelihood leads to EM: Bayesian networks, hidden Markov models, probabilistic context-free grammars, linear dynamic systems. Of course, everything we said here still needs to be made more precise if you have to write an algorithm: "guess" has a particular meaning in terms of calculating probabilities, for example. At any rate, this kind of intuition is what is behind the algorithms. The strategy taken in the previous couple of slides has an official name, the "EM" algorithm. It has a number of incarnations in computational linguistics. Some of them are quite sophisticated, when you are dealing with problems like parsing. But they all follow the same strategy, which is to some extent the best you can do in the face of missing information.

Beyond maximum likelihood? Statistical parsing. Interesting remark from Mark Johnson: initialize a PCFG with treebank counts, then train the PCFG on the treebank with EM. A large amount of NLP research tries to dump the first and improve the second. Measure of success: log likelihood.

What's wrong with this? Mark Johnson's idea: wrong data -- humans don't just learn from strings; wrong model -- human syntax isn't context-free; wrong way of calculating likelihood -- p(sentence | PCFG) isn't informative; (maybe) wrong measure of success?

End of excursion: mixture of many things. Any generative model can be combined with a mixture model to deal with categorical data. Examples: mixture of Gaussians, mixture of HMMs, mixture of factor analyzers, mixture of expert networks. It all depends on what you are modeling.

Applying this to the speech domain. Speech signals have high dimensionality, so we use front-end acoustic modeling from speech recognition: Mel-Frequency Cepstral Coefficients (MFCC). Speech sounds are dynamic, so we use dynamic acoustic modeling (MFCC-delta), and the mixture components are Hidden Markov Models (HMMs). The second challenge is that speech segments are dynamic – they include multiple different events and can be long or short.
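For concreteness, here is one way such a front-end might be computed, using the librosa library (an assumption on my part; the file name is hypothetical and any MFCC implementation would do):

```python
import numpy as np
import librosa   # assumption: librosa is available

# Load a waveform (hypothetical file) and compute MFCCs plus their deltas.
y, sr = librosa.load("example.wav", sr=16000)
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)     # shape: (13, n_frames)
delta = librosa.feature.delta(mfcc)                    # frame-to-frame change (MFCC-delta)
features = np.vstack([mfcc, delta]).T                  # one 26-D vector per frame
```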

Clustering speech with K-means. Phones from TIMIT.

Clustering speech with K-means. Diphones and words.

What's wrong here? Longer sound sequences are more distinguishable for people, yet doing K-means on static feature vectors misses the change over time. The mixture components must be able to capture dynamic data. Solution: a mixture of HMMs.

Mixture of HMMs. [Figure: an HMM mixture; each component HMM has states such as silence, burst, and transition.] Here each state is supposed to characterize a different portion of speech, and the loops allow each portion to be longer or shorter. Learning: EM for the HMMs + EM for the mixture.

Mixture of HMMs: model-based clustering. Front-end: MFCC + delta. Algorithm: initial guess by K-means, then EM. A Gaussian mixture models single frames; an HMM mixture models whole sequences.
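A simplified, hard-assignment sketch of this procedure using the hmmlearn library (an assumption on my part; the slides describe full EM over the mixture, whereas this K-means-like variant assigns each sequence to its best-scoring HMM and refits):

```python
import numpy as np
from hmmlearn.hmm import GaussianHMM   # assumption: hmmlearn is installed

def hmm_mixture_cluster(seqs, k, n_states=3, n_rounds=5, seed=0):
    """seqs: list of (T_i, d) feature arrays, e.g. MFCC+delta frames per token."""
    rng = np.random.default_rng(seed)
    labels = rng.integers(k, size=len(seqs))             # initial guess (K-means in the slides)
    models = []
    for _ in range(n_rounds):
        models = []
        for j in range(k):
            members = [s for s, l in zip(seqs, labels) if l == j]
            X = np.vstack(members)                        # assumes no cluster goes empty
            lengths = [len(s) for s in members]
            m = GaussianHMM(n_components=n_states, covariance_type="diag", n_iter=20)
            m.fit(X, lengths)                             # EM for each component HMM
            models.append(m)
        # Re-assign each sequence to the HMM that gives it the highest log-likelihood.
        labels = np.array([np.argmax([m.score(s) for m in models]) for s in seqs])
    return labels, models
```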

Mixture of HMMs vs. K-means. Phone clustering: 7 phones from 22 speakers. They don't fall into 5 manner classes as we wished, but at least you see the improvement. (1–5: cluster index)

Mixture of HMMs vs. K-means. Diphone clustering: 6 diphones from 300+ speakers. If phones are somewhat static, there should be more obvious improvements on more dynamic units, so let's look at diphones.

Mixture of HMMs vs. K-means. Word clustering: 3 words from 300+ speakers. Distinguishing dynamic sounds is not so difficult for people; if it's hard for the computer, then the method must be wrong.

Growing the model. Guessing 6 clusters at once is hard, but 2 is easy. Hill-climbing strategy: start with 2, then 3, 4, ... Implementation: split the cluster with the maximum gain in likelihood. Intuition: discriminate within the biggest pile.
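One possible reading of this strategy, sketched with scikit-learn's GaussianMixture (my own illustration, not the original implementation): score a 1-vs-2 component split of each current cluster, then refit the whole mixture with one more component, retraining on all data as described on the next slide.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def grow_mixture(X, k_max):
    """Hill climbing over K: start at 2 and add one component at a time."""
    gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
    while gmm.n_components < k_max:
        labels = gmm.predict(X)
        best_gain, best_j = -np.inf, None
        for j in range(gmm.n_components):
            Xj = X[labels == j]
            if len(Xj) < 4:                               # too small to split
                continue
            one = GaussianMixture(1, random_state=0).fit(Xj).score(Xj) * len(Xj)
            two = GaussianMixture(2, random_state=0).fit(Xj).score(Xj) * len(Xj)
            if two - one > best_gain:
                best_gain, best_j = two - one, j          # cluster with max likelihood gain
        if best_j is None:
            break
        # Retrain the whole mixture with one more component, initialized from the split
        # and using all the data.
        sub = GaussianMixture(2, random_state=0).fit(X[labels == best_j])
        new_means = np.vstack([np.delete(gmm.means_, best_j, axis=0), sub.means_])
        gmm = GaussianMixture(n_components=gmm.n_components + 1,
                              means_init=new_means, random_state=0).fit(X)
    return gmm
```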

Learning categories and features with a mixture model. Procedure: apply the mixture model and EM algorithm, inductively finding clusters; each split is followed by a retraining step using all data. [Figure: clustering tree over all the data, with nodes labeled 1, 2, 11, 12, 21, 22.] Here we use a kind of prefix coding to label the clusters, which shows you where each cluster comes from. It is a kind of mnemonic, because the classes arise from the data, and we do not start with assumptions about which sounds should go under which classes. This mnemonic also helps us keep track of things more easily. For example, 112 shows that it comes from 1, then 11, and so on.

[Figure: for each phone (labeled in IPA and TIMIT symbols), the percentage classified as Cluster 1 (obstruent) vs. Cluster 2 (sonorant).] Maybe you could say: what if we draw a line and cut through here, then we will get the obstruents and sonorants. Actually, what I am trying to demonstrate is that if you consider how to define obstruents and sonorants from a purely acoustic point of view, no such straight line exists! Sonority is a matter of the amount of airflow coming out of your mouth, and you cannot just simply say 1 or 0. What you would expect, instead, is actually a more gradient picture like this one.

[Figure: splitting Cluster 1 (obstruents); for each phone, the percentage classified as Cluster 11 (fricative) vs. Cluster 12.] Here you might be surprised that affricates do not stand out as a separate category; they kind of stand in between fricatives and stops. But if you look at child speech errors, that seems to be the kind of mistake that they make – taking affricates as stops or fricatives.

[Figure: splitting Cluster 2 (sonorants); for each sound, the percentage classified as Cluster 21 (back sonorant) vs. Cluster 22.] In this picture there is another surprise: the liquids l, w and r go together with the back vowels, and you don't see a special category of approximants. This is again determined by their spectral properties: acoustically, they are much more similar to back vowels than to y (IPA j), which sounds much more like a front high vowel. We are kind of "redefining" phonetic classes only by their acoustic properties, and not worrying about how they are produced.

[Figure: splitting Cluster 12; for each phone, the percentage classified as Cluster 121 (oral stop) vs. Cluster 122 (nasal stop).]

[Figure: splitting Cluster 22; for each sound (e.g. j, i, ɪ, eɪ), the percentage classified as Cluster 221 vs. Cluster 222; the accompanying tree labels the clusters with classes such as sonorant, fricative, oral stop, nasal, back, front, high, and low.]

Summary: learning features. Discovered features: distinctions between natural classes based on spectral properties. [Figure: the full clustering tree over all data, with splits labeled [+sonorant]/[-sonorant], [+fricative]/[-fricative], [+back]/[-back], [+nasal]/[-nasal], [+high]/[-high].] If features arise from statistical learning, I think it is natural to accept the nature of features as being gradient. For individual sounds, the feature values are gradient rather than binary (Ladefoged, 2001).

Evaluation: phone classification. How do the "soft" classes fit into "hard" ones? [Tables: classification results on the training set and on the test set.] This is just intended to show that our "soft" classes at least make sense. But it's also worth reflecting on what we mean by "errors": are "errors" really errors?

Level 2: learning segments + phonotactics. Segmentation is a kind of hidden structure, and the iterative strategy works here too. Optimization -- the augmented model is p(words | units, phonotactics, segmentation). Units ← argmax p({wi} | U, P, {si}): clustering = argmax p(segments | units) -- this is Level 1. Phonotactics ← argmax p({wi} | U, P, {si}): estimating the transitions of a Markov chain. Segmentation ← argmax p({wi} | U, P, {si}): Viterbi decoding. We wanted to do some kind of counting, but we don't know what to count. Segmentation is an extremely useful thing, and that's what we do when we read spectrograms all the time. The second aspect of the model is how infants learn segments. As we discussed a couple of slides ago, there are two things that need to be done: one is to find where the segments are, and the other is finding what the segments are. How can we learn phonetic categories from this sort of data? A critical thing here is "where are the phonetic categories?" So we start with a guess, and use that to update the knowledge. Then we come back and improve the guess, and so on.

Iterative learning as coordinate-wise ascent. [Figure: level curves of the likelihood score over the coordinates units, phonotactics, and segmentation; the initial value comes from Level-1 learning.] Each step increases the likelihood score and eventually reaches a local maximum.

Level 3: the lexicon can be a mixture too. Re-clustering of words using the mixture-based lexical model. Initial values (mixture components, weights) come from bottom-up learning (Stage 2). Iterated steps: classify each word as the best exemplar of a given lexical item (also inferring the segmentation); update the lexical weights + units + phonotactics.

Big question: how to choose K? Basic problem: nested hypothesis spaces H_{K-1} ⊂ H_K ⊂ H_{K+1} ⊂ …, so as K goes up, the likelihood always goes up. Recall the polynomial curve fitting; the same holds for mixture models (see demo).

Big question: how to choose K? Idea #1: don't just look at the likelihood; look at a combination of the likelihood and something else. Bayesian Information Criterion: -2 log L(θ) + (log N)·d. Minimum Description Length: -log L(θ) + description length(θ). Akaike Information Criterion: -2 log L(θ) + 2d. In practice, one often needs magical "weights" in front of the "something else".
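A quick sketch of Idea #1 using scikit-learn's GaussianMixture, which provides bic() and aic() directly (my own illustration; pick the K with the smallest score):

```python
from sklearn.mixture import GaussianMixture

def information_criteria(X, k_values):
    """Fit a GMM for each K and report total log-likelihood, BIC and AIC."""
    rows = []
    for k in k_values:
        gmm = GaussianMixture(n_components=k, random_state=0).fit(X)
        rows.append((k, gmm.score(X) * len(X), gmm.bic(X), gmm.aic(X)))
    return rows   # choose the K with the smallest BIC (or AIC)
```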

Big question: how to choose K? Idea #2: use one set of data for learning and another for testing generalization. Cross-validation: run EM until the likelihood starts to hurt on the test set (see demo). What if you have a bad test set? Jack-knife procedure: cut the data into 10 parts and do 10 rounds of training and testing.
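A sketch of the 10-part procedure with scikit-learn (an illustration under the assumption that held-out average log-likelihood is the score being compared):

```python
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.model_selection import KFold

def choose_k_by_cv(X, k_values, n_folds=10):
    """Pick the K with the best held-out log-likelihood, averaged over the folds."""
    scores = {}
    for k in k_values:
        fold_scores = []
        for train_idx, test_idx in KFold(n_splits=n_folds, shuffle=True,
                                         random_state=0).split(X):
            gmm = GaussianMixture(n_components=k, random_state=0).fit(X[train_idx])
            fold_scores.append(gmm.score(X[test_idx]))   # mean held-out log-likelihood
        scores[k] = np.mean(fold_scores)
    return max(scores, key=scores.get), scores
```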

Big question: how to choose K? Idea #3: treat K as a "hyper" parameter and do Bayesian learning on K. More flexible: K can grow up and down depending on the amount of data. Allowing K to grow to infinity gives the Dirichlet / Chinese restaurant process mixture. This needs "hyper-hyper" parameters to control how likely K is to grow, and it is also computationally intensive.

Big question: how to choose K? There is really no elegant universal solution. One view: statistical learning looks within H_K, but does not come up with H_K itself. How do people choose K? (Also see later reading.)

Dimension reduction. Why dimension reduction? Example: estimate a continuous probability distribution by counting histograms on samples. [Figure: histograms of the same sample with 10, 20, and 30 bins.]

Dimension reduction. Now think about 2-D, 3-D, … How many bins do you need? Estimate the density with a Parzen window: p̂(x) ≈ (number of data points in the window) / (N × window volume). How big does the window radius r need to grow?
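A minimal Parzen (hypercube) window sketch of this estimate, illustrating why the window must grow with the dimension (my own example):

```python
import numpy as np

def parzen_estimate(x0, X, r):
    """Density at x0: fraction of points within half-width r of x0, over the window volume."""
    X = np.atleast_2d(X)
    d = X.shape[1]
    inside = np.all(np.abs(X - x0) <= r, axis=1)         # data in the window
    volume = (2 * r) ** d                                # window size grows as r^d
    return inside.sum() / (len(X) * volume)
```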

Curse of dimensionality. Discrete distributions: a phonetics experiment with M speakers × N sentences × P stresses × Q segments × … Decision rules: (K-)nearest-neighbor. How big a K is safe? How long do you have to wait until you are really sure they are your nearest neighbors?

One obvious solution: assume we know something about the distribution. This translates to a parametric approach. Example: counting histograms for 10-D data needs a huge number of bins, but knowing the data form a pancake allows us to fit a Gaussian: (bins per dimension)^10 cells vs. how many Gaussian parameters?
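Rough bookkeeping for the pancake argument (illustrative numbers only, not from the slides):

```python
# A histogram with b bins per dimension needs b**d cells, while a full-covariance
# Gaussian needs only d (mean) + d*(d+1)/2 (covariance) parameters.
d, b = 10, 20
print("histogram cells:", b ** d)                    # 20**10, about 10**13
print("Gaussian parameters:", d + d * (d + 1) // 2)  # 65
```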

Linear dimension reduction: Principal Component Analysis, Multidimensional Scaling, Factor Analysis, Independent Component Analysis. As we will see, we still need to assume we know something…

Principal Component Analysis. Many names (eigenmodes, KL transform, etc.) and relatives. The key is to understand how to make a pancake: centering, rotating and smashing. Step 1: moving the dough to the center, X ← X - μ.

Principal Component Analysis. Step 2: finding a direction of projection that has the maximal "stretch". Linear projection of X onto a vector w: Proj_w(X) = X_{N×d} * w_{d×1} (with X centered). Now measure the stretch: this is the sample variance, Var(X*w).

Principal Component Analysis. Step 3: formulate this as a constrained optimization problem. Objective: Var(X*w). We need a constraint on w (otherwise it can explode), so we only consider its direction. Formally: argmax_{||w||=1} Var(X*w).

Principal Component Analysis. Some algebra (homework): Var(x) = E[(x - E[x])²] = E[x²] - (E[x])². Applying this to matrices (homework): Var(X*w) = w^T (X^T X / N) w = w^T Cov(X) w (why?). Cov(X) is a d×d matrix (homework): it is symmetric (easy), and for any y, y^T Cov(X) y ≥ 0 (tricky).

Principal Component Analysis. Going back to the optimization problem: w1 = argmax_{||w||=1} Var(X*w) = argmax_{||w||=1} w^T Cov(X) w. The solution is the eigenvector of Cov(X) with the largest eigenvalue: w1, the first principal component!
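The whole recipe (center, take the eigenvectors of Cov(X) sorted by eigenvalue) fits in a few lines; a sketch of my own, not the course code, and it also covers the further components and the rotation discussed on the next two slides:

```python
import numpy as np

def pca(X, n_components=None):
    """PCA as in these slides: center, form Cov(X), take eigenvectors by eigenvalue."""
    Xc = X - X.mean(axis=0)                       # Step 1: move the dough to the center
    C = np.cov(Xc, rowvar=False)                  # d x d covariance matrix
    eigvals, eigvecs = np.linalg.eigh(C)          # Cov(X) is symmetric, so eigh applies
    order = np.argsort(eigvals)[::-1]             # largest stretch first
    W = eigvecs[:, order]                         # orthogonal matrix: the rotation
    if n_components is not None:
        W = W[:, :n_components]
    return Xc @ W, W, eigvals[order]              # rotated data, components, variances
```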

More principal components. We keep looking for w2 among all the directions perpendicular to w1. Formally: argmax_{||w2||=1, w2⊥w1} w2^T Cov(X) w2. This turns out to be another eigenvector, the one corresponding to the 2nd largest eigenvalue: w2. These give new coordinates!

Rotation. We can keep going until we pick up all d eigenvectors, perpendicular to each other. Putting these eigenvectors together, we have a big matrix W = (w1, w2, …, wd). W is called an orthogonal matrix, and it corresponds to a rotation of the pancake; in the rotated coordinates the pancake has no correlation between dimensions.