Ch 9. Mixture Models and EM
Pattern Recognition and Machine Learning, C. M. Bishop, 2006
Summarized by Ho-Sik Seok
Biointelligence Laboratory, Seoul National University, http://bi.snu.ac.kr/

Contents
9.1 K-means Clustering
9.2 Mixtures of Gaussians
9.2.1 Maximum likelihood
9.2.2 EM for Gaussian mixtures
9.3 An Alternative View of EM
9.3.1 Gaussian mixtures revisited
9.3.2 Relation to K-means
9.3.3 Mixtures of Bernoulli distributions
9.3.4 EM for Bayesian linear regression
9.4 The EM Algorithm in General

9.1 K-means Clustering (1/3)
Problem of identifying groups, or clusters, of data points in a multidimensional space
Partitioning the data set into some number K of clusters
Cluster: a group of data points whose inter-point distances are small compared with the distances to points outside of the cluster
Goal: an assignment of data points to clusters such that the sum of the squares of the distances of each data point to its closest prototype vector μ_k (the center of its cluster) is a minimum
Formally, minimize the distortion measure J = Σ_n Σ_k r_nk ||x_n − μ_k||², where r_nk ∈ {0, 1} indicates whether data point x_n is assigned to cluster k

9.1 K-means Clustering (2/3)
Two-stage optimization
In the 1st stage: minimize J with respect to the r_nk, keeping the μ_k fixed → assign each data point to its nearest cluster center
In the 2nd stage: minimize J with respect to the μ_k, keeping the r_nk fixed → set each μ_k to the mean of all of the data points assigned to cluster k:
μ_k = Σ_n r_nk x_n / Σ_n r_nk
The two stages are repeated until the assignments (and hence J) no longer change
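A minimal NumPy sketch of this two-stage optimization (illustrative only; the function name, the random initialization, and the convergence test are assumptions, not part of the slides):

```python
import numpy as np

def kmeans(X, K, n_iter=100, seed=0):
    """Two-stage K-means: alternately minimize J over assignments r_nk and means mu_k."""
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(len(X), size=K, replace=False)]  # initialize centers at random data points
    for _ in range(n_iter):
        # Stage 1: minimize J w.r.t. r_nk (mu fixed) -> assign each point to its nearest center
        d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)  # (N, K) squared distances
        r = d2.argmin(axis=1)
        # Stage 2: minimize J w.r.t. mu_k (r_nk fixed) -> mean of the points assigned to cluster k
        new_mu = np.array([X[r == k].mean(axis=0) if np.any(r == k) else mu[k] for k in range(K)])
        if np.allclose(new_mu, mu):  # stop when the means no longer change
            break
        mu = new_mu
    J = ((X - mu[r]) ** 2).sum()  # distortion measure for the final assignment
    return mu, r, J
```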

9.1 K-means Clustering (3/3) (figure slide)

9.2 Mixtures of Gaussians (1/3)
A formulation of Gaussian mixtures in terms of discrete latent variables
The Gaussian mixture distribution can be written as a linear superposition of Gaussians:
p(x) = Σ_k π_k N(x | μ_k, Σ_k)   …(*)
An equivalent formulation of the Gaussian mixture involves an explicit latent variable
Graphical representation of the mixture model: a binary random variable z having a 1-of-K representation, with p(z_k = 1) = π_k and p(x | z_k = 1) = N(x | μ_k, Σ_k)
The marginal distribution of x, p(x) = Σ_z p(z) p(x | z) = Σ_k π_k N(x | μ_k, Σ_k), is a Gaussian mixture of the form (*) → for every observed data point x_n, there is a corresponding latent variable z_n
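A minimal sketch of the superposition (*) in code (the parameter names pi, mus, covs and the use of SciPy are assumptions, not from the slides):

```python
import numpy as np
from scipy.stats import multivariate_normal

def gmm_density(x, pi, mus, covs):
    """p(x) = sum_k pi_k N(x | mu_k, Sigma_k): a linear superposition of Gaussians."""
    return sum(pi[k] * multivariate_normal.pdf(x, mus[k], covs[k]) for k in range(len(pi)))
```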

9.2 Mixtures of Gaussians (2/3)
The posterior probability of component k given an observation x follows from Bayes' theorem:
γ(z_k) ≡ p(z_k = 1 | x) = π_k N(x | μ_k, Σ_k) / Σ_j π_j N(x | μ_j, Σ_j)
γ(z_k) can also be viewed as the responsibility that component k takes for 'explaining' the observation x
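In code, the responsibilities are just this Bayes-theorem ratio (a sketch; the parameter names are assumptions):

```python
import numpy as np
from scipy.stats import multivariate_normal

def responsibilities(x, pi, mus, covs):
    """gamma(z_k) = pi_k N(x|mu_k, Sigma_k) / sum_j pi_j N(x|mu_j, Sigma_j)."""
    weighted = np.array([pi[k] * multivariate_normal.pdf(x, mus[k], covs[k])
                         for k in range(len(pi))])
    return weighted / weighted.sum()  # posterior probability of each component given x
```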

9.2 Mixtures of Gaussians (3/3)
Generating random samples distributed according to the Gaussian mixture model (ancestral sampling):
First generate a value for z, denoted ẑ, from the marginal distribution p(z), and then generate a value for x from the conditional distribution p(x | ẑ)
In the illustration, the three states of z, corresponding to the three components of the mixture, are depicted in red, green, and blue
The corresponding samples from the marginal distribution p(x)
The same samples, in which the colors represent the value of the responsibilities γ(z_nk) associated with each data point
Illustrating the responsibilities by evaluating the posterior probability for each component in the mixture distribution from which this data set was generated
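A minimal sketch of this ancestral sampling procedure (illustrative; the parameter names are assumptions):

```python
import numpy as np

def sample_gmm(pi, mus, covs, n_samples, seed=0):
    """Ancestral sampling: draw z ~ p(z), then x ~ p(x | z)."""
    rng = np.random.default_rng(seed)
    # Draw the latent component index z_n for each sample from the mixing coefficients
    z = rng.choice(len(pi), size=n_samples, p=pi)
    # Draw x_n from the Gaussian component selected by z_n
    X = np.array([rng.multivariate_normal(mus[k], covs[k]) for k in z])
    return X, z
```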

9.2.1 Maximum likelihood (1/3)
Graphical representation of a Gaussian mixture model for a set of N i.i.d. data points {x_n}, with corresponding latent points {z_n}
The log of the likelihood function:
ln p(X | π, μ, Σ) = Σ_n ln { Σ_k π_k N(x_n | μ_k, Σ_k) }   …(*1)

9.2.1 Maximum likelihood (2/3)
For simplicity, consider a Gaussian mixture whose components have covariance matrices given by Σ_k = σ_k² I
Suppose that one of the components of the mixture model has its mean μ_j exactly equal to one of the data points, so that μ_j = x_n
This data point will contribute to the likelihood function a term of the form N(x_n | x_n, σ_j² I) ∝ 1/σ_j^D, which goes to infinity as σ_j → 0
Once there are at least two components in the mixture, one of the components can have a finite variance and therefore assign finite probability to all of the data points, while the other component can shrink onto one specific data point and thereby contribute an ever-increasing additive value to the log likelihood → such singularities are an over-fitting problem of the maximum likelihood approach
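A tiny numerical illustration of this singularity (the values are hypothetical; it only evaluates the density of a component centered exactly on a data point as its variance shrinks):

```python
import numpy as np
from scipy.stats import multivariate_normal

x_n = np.array([1.0, 2.0])  # a data point; set the component mean mu_j = x_n
for sigma in [1.0, 0.1, 0.01, 0.001]:
    # N(x_n | x_n, sigma^2 I) = 1 / ((2*pi)^(D/2) * sigma^D) grows without bound as sigma -> 0
    print(sigma, multivariate_normal.pdf(x_n, mean=x_n, cov=sigma ** 2 * np.eye(2)))
```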

9.2.1 Maximum likelihood (3/3)
Over-fitting problem
These singularities are an example of over-fitting in the maximum likelihood approach; the problem does not occur with a Bayesian approach
In applying maximum likelihood to a Gaussian mixture model, steps must be taken to avoid finding such pathological solutions and instead seek local maxima of the likelihood function that are well behaved
Identifiability problem
A K-component mixture will have a total of K! equivalent solutions, corresponding to the K! ways of assigning K sets of parameters to K components
Difficulty of maximizing the log likelihood function (*1) → the difficulty arises from the presence of the summation over k that appears inside the logarithm in (*1), so the logarithm no longer acts directly on the Gaussians and no closed-form solution exists

9.2.2 EM for Gaussian mixtures (1/4)
Assign some initial values for the means, covariances, and mixing coefficients
Expectation (E) step: use the current values of the parameters to evaluate the posterior probabilities, or responsibilities
Maximization (M) step: use the responsibilities from the E step to re-estimate the means, covariances, and mixing coefficients
It is common to run the K-means algorithm in order to find suitable initial values:
the covariance matrices → the sample covariances of the clusters found by the K-means algorithm
the mixing coefficients → the fractions of data points assigned to the respective clusters
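One way to realize this K-means initialization in code (a sketch; it assumes hard assignments r such as those returned by the kmeans sketch above):

```python
import numpy as np

def init_from_kmeans(X, r, K):
    """Initialize GMM parameters from K-means assignments r (values 0..K-1)."""
    mus = np.array([X[r == k].mean(axis=0) for k in range(K)])        # cluster means
    covs = np.array([np.cov(X[r == k].T) + 1e-6 * np.eye(X.shape[1])  # sample covariances
                     for k in range(K)])
    pis = np.array([(r == k).mean() for k in range(K)])               # fractions of assigned points
    return mus, covs, pis
```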

9.2.2 EM for Gaussian mixtures (2/4)
Given a Gaussian mixture model, the goal is to maximize the likelihood function with respect to the parameters
1. Initialize the means μ_k, covariances Σ_k, and mixing coefficients π_k
2. E step: evaluate the responsibilities using the current parameter values
   γ(z_nk) = π_k N(x_n | μ_k, Σ_k) / Σ_j π_j N(x_n | μ_j, Σ_j)
3. M step: re-estimate the parameters using the current responsibilities
   N_k = Σ_n γ(z_nk)
   μ_k^new = (1/N_k) Σ_n γ(z_nk) x_n
   Σ_k^new = (1/N_k) Σ_n γ(z_nk) (x_n − μ_k^new)(x_n − μ_k^new)^T
   π_k^new = N_k / N
4. Evaluate the log likelihood ln p(X | μ, Σ, π) = Σ_n ln { Σ_k π_k N(x_n | μ_k, Σ_k) } and check for convergence of either the parameters or the log likelihood; if not converged, return to step 2
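A compact NumPy sketch of this loop (illustrative only; the function name, the random initialization, and the small covariance regularizer are assumptions, not part of the slides):

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, K, n_iter=100, tol=1e-6, seed=0):
    """EM for a Gaussian mixture: alternate responsibilities (E) and parameter updates (M)."""
    N, D = X.shape
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(N, size=K, replace=False)]  # random init (K-means is common in practice)
    cov = np.array([np.cov(X.T) + 1e-6 * np.eye(D) for _ in range(K)])
    pi = np.full(K, 1.0 / K)
    prev_ll = -np.inf
    for _ in range(n_iter):
        # E step: gamma(z_nk) = pi_k N(x_n|mu_k, Sigma_k) / sum_j pi_j N(x_n|mu_j, Sigma_j)
        dens = np.stack([pi[k] * multivariate_normal.pdf(X, mu[k], cov[k]) for k in range(K)], axis=1)
        gamma = dens / dens.sum(axis=1, keepdims=True)
        # M step: weighted means, covariances, and mixing coefficients
        Nk = gamma.sum(axis=0)
        mu = (gamma.T @ X) / Nk[:, None]
        for k in range(K):
            diff = X - mu[k]
            cov[k] = (gamma[:, k, None] * diff).T @ diff / Nk[k] + 1e-6 * np.eye(D)
        pi = Nk / N
        # Log likelihood under the parameters used in this E step; check for convergence
        ll = np.log(dens.sum(axis=1)).sum()
        if np.abs(ll - prev_ll) < tol:
            break
        prev_ll = ll
    return pi, mu, cov, gamma
```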

9.2.2 EM for Gaussian mixtures (3/4)
The log likelihood to be maximized: ln p(X | π, μ, Σ) = Σ_n ln { Σ_k π_k N(x_n | μ_k, Σ_k) }   …(*2)
Setting the derivatives of (*2) with respect to the means of the Gaussian components to zero →
   μ_k = (1/N_k) Σ_n γ(z_nk) x_n, where N_k = Σ_n γ(z_nk)
Setting the derivatives of (*2) with respect to the covariances of the Gaussian components to zero →
   Σ_k = (1/N_k) Σ_n γ(z_nk) (x_n − μ_k)(x_n − μ_k)^T
The responsibility γ(z_nk) appears naturally in these updates: μ_k is a weighted mean of all of the points in the data set, each data point weighted by the corresponding posterior probability, and the denominator N_k is given by the effective number of points associated with the corresponding component

9.2.2 EM for Gaussian mixtures (4/4) (figure slide)

9.3 An Alternative View of EM
General EM: maximizing the log likelihood function
Given a joint distribution p(X, Z | Θ) over observed variables X and latent variables Z, governed by parameters Θ:
Choose an initial setting for the parameters Θ_old
E step: evaluate p(Z | X, Θ_old)
M step: evaluate Θ_new given by Θ_new = argmax_Θ Q(Θ, Θ_old), where Q(Θ, Θ_old) = Σ_Z p(Z | X, Θ_old) ln p(X, Z | Θ)
If the convergence criterion is not satisfied, let Θ_old ← Θ_new and return to the E step
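A generic skeleton of this procedure (a sketch; it assumes the model supplies its own e_step, m_step, and log_likelihood functions, whose names are illustrative):

```python
def em(e_step, m_step, log_likelihood, theta_init, tol=1e-6, max_iter=100):
    """General EM: alternate evaluating p(Z|X, theta_old) and maximizing Q(theta, theta_old)."""
    theta_old = theta_init
    prev_ll = float("-inf")
    for _ in range(max_iter):
        posterior = e_step(theta_old)       # E step: p(Z | X, theta_old)
        theta_new = m_step(posterior)       # M step: argmax_theta Q(theta, theta_old)
        ll = log_likelihood(theta_new)      # monitor ln p(X | theta)
        if abs(ll - prev_ll) < tol:         # convergence criterion satisfied -> stop
            return theta_new
        theta_old, prev_ll = theta_new, ll  # otherwise theta_old <- theta_new
    return theta_old
```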

9.3.1 Gaussian mixtures revisited (1/2)
For the complete data set {X, Z}, the likelihood function (Bishop Eq. 9.14) takes the form
   p(X, Z | μ, Σ, π) = Π_n Π_k [ π_k N(x_n | μ_k, Σ_k) ]^{z_nk}
so the complete-data log likelihood is Σ_n Σ_k z_nk { ln π_k + ln N(x_n | μ_k, Σ_k) }
The logarithm now acts directly on the Gaussian distribution → a much simpler solution to the maximum likelihood problem
Because z_n uses the 1-of-K coding → the complete-data log likelihood decomposes into a sum of K independent contributions, one for each mixture component → the maximization with respect to a mean or a covariance is exactly as for a single Gaussian
The mixing coefficients are equal to the fractions of data points assigned to the corresponding components

9.3.1 Gaussian mixtures revisited (2/2)
Since the latent variables are unknown → consider the expected value of the complete-data log likelihood with respect to the posterior distribution of the latent variables
Posterior distribution: p(Z | X, μ, Σ, π) ∝ Π_n Π_k [ π_k N(x_n | μ_k, Σ_k) ]^{z_nk}
The expected value of the indicator variable under this posterior distribution: E[z_nk] = γ(z_nk)
The expected value of the complete-data log likelihood function:
   E_Z[ ln p(X, Z | μ, Σ, π) ] = Σ_n Σ_k γ(z_nk) { ln π_k + ln N(x_n | μ_k, Σ_k) }   …(*3)
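As a sketch, (*3) is straightforward to evaluate once the responsibilities are available (gamma is assumed to be an (N, K) matrix of responsibilities; the other names are also assumptions):

```python
import numpy as np
from scipy.stats import multivariate_normal

def expected_complete_data_ll(X, gamma, pi, mus, covs):
    """E_Z[ln p(X, Z)] = sum_n sum_k gamma_nk { ln pi_k + ln N(x_n | mu_k, Sigma_k) }."""
    K = len(pi)
    log_terms = np.stack([np.log(pi[k]) + multivariate_normal.logpdf(X, mus[k], covs[k])
                          for k in range(K)], axis=1)  # shape (N, K)
    return (gamma * log_terms).sum()
```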

9.4 The EM Algorithm in General (1/3)
The EM algorithm is a general technique for finding maximum likelihood solutions for probabilistic models having latent variables
For any distribution q(Z) over the latent variables, the log likelihood can be decomposed as
   ln p(X | Θ) = L(q, Θ) + KL(q || p)
where L(q, Θ) = Σ_Z q(Z) ln { p(X, Z | Θ) / q(Z) } and KL(q || p) = −Σ_Z q(Z) ln { p(Z | X, Θ) / q(Z) } ≥ 0,
so L(q, Θ) is a lower bound on ln p(X | Θ)
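A small numerical check of this decomposition for a single observation under a two-component mixture (the toy parameters and the choice of q are assumptions made for illustration):

```python
import numpy as np
from scipy.stats import multivariate_normal

pi = np.array([0.3, 0.7])                      # toy mixture parameters
mus = [np.zeros(2), np.ones(2)]
covs = [np.eye(2), 2.0 * np.eye(2)]
x = np.array([0.5, -0.2])                      # a single observation
q = np.array([0.6, 0.4])                       # an arbitrary distribution q(z)

comp = np.array([pi[k] * multivariate_normal.pdf(x, mus[k], covs[k]) for k in range(2)])
log_px = np.log(comp.sum())                    # ln p(x | Theta)
posterior = comp / comp.sum()                  # p(z | x, Theta)
L = (q * np.log(comp / q)).sum()               # lower bound L(q, Theta)
KL = (q * np.log(q / posterior)).sum()         # KL(q || p) >= 0
print(np.isclose(log_px, L + KL))              # the decomposition holds
```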

9.4 The EM Algorithm in General (2/3) Illustration of the decomposition Illustration of the E step - The q distribution is set equal to the posterior distribution for the current parameter values, causing the lower bound to move up to the same value as the log likelihood function Illustration of the M step - The distribution q(Z) is held fixed and the lower bound L(q, Θ) is maximized with respect to the parameter Θ to give a revised value (C) 2007, SNU Biointelligence Lab, http://bi.snu.ac.kr/

9.4 The EM Algorithm in General (3/3)
E step: the lower bound L(q, Θ_old) is maximized with respect to q(Z) while holding Θ_old fixed
The largest value of L(q, Θ_old) occurs when the Kullback-Leibler divergence vanishes, i.e., when q(Z) = p(Z | X, Θ_old)
M step: the distribution q(Z) is held fixed and the lower bound L(q, Θ) is maximized with respect to Θ to give a new value Θ_new
This will cause the lower bound L to increase (unless it is already at a maximum), which in turn causes the corresponding log likelihood function to increase
The quantity that is being maximized in the M step is the expectation of the complete-data log likelihood