580.691 Learning Theory, Reza Shadmehr
Classification of unlabeled data: Mixture models, the K-means algorithm, and Expectation Maximization (EM)

The problem of finding labels for unlabeled data
So far we have solved "supervised" classification problems where a teacher told us the label of each example. In nature, items often do not come with labels. How can we learn labels without a teacher?
[Figure: the same two-dimensional data set plotted twice, once as unlabeled points and once with its true labels.]

Example: image segmentation
Identify pixels that are white matter, gray matter, or outside of the brain.
[Figure: histogram of normalized pixel values with three modes, labeled "Outside the brain," "Gray matter," and "White matter."]

Mixtures
If our data is not labeled, we can hypothesize that:
- there are exactly m classes in the data;
- each class y occurs with a specific frequency (its mixing proportion);
- examples of class y are governed by a class-specific distribution.
According to our hypothesis, each example x(i) must have been generated from a specific "mixture" distribution, and we might further hypothesize that the class-conditional distributions are Gaussian, each with its own parameters (see the equations below).
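In standard notation, the mixture hypothesis can be written as follows, with \pi_j denoting the mixing proportion of class j and N(\cdot \mid \boldsymbol{\mu}_j, \boldsymbol{\Sigma}_j) a multivariate normal density (this notation is an assumption, not necessarily the one used in the lecture):

P(y = j) = \pi_j, \qquad \sum_{j=1}^{m} \pi_j = 1, \qquad p(\mathbf{x} \mid y = j) = N(\mathbf{x} \mid \boldsymbol{\mu}_j, \boldsymbol{\Sigma}_j)

p(\mathbf{x}) = \sum_{j=1}^{m} \pi_j \, N(\mathbf{x} \mid \boldsymbol{\mu}_j, \boldsymbol{\Sigma}_j)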

Mixture densities
Suppose there are only three classes. The data generation process works like this: a multinomial random variable z represents the class that x belongs to, and each data point is then generated by the single Gaussian of that class. If we had a classifier that could sort our data, we would get labels like this: if x is assigned to class i, we set the i-th component of the vector z equal to one and keep the remaining components of z at zero.

[Figure: the class indicator z is the hidden variable; the data point x is the measured variable.]
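A minimal sketch of this generative process in code, assuming a two-dimensional mixture with three classes; the parameter values and names below are illustrative, not taken from the lecture:

```python
import numpy as np

rng = np.random.default_rng(0)

mixing = np.array([0.5, 0.3, 0.2])                         # pi_j
means = np.array([[-5.0, -5.0], [0.0, 0.0], [3.0, -2.0]])  # mu_j
covs = np.array([np.eye(2), 2.0 * np.eye(2), np.eye(2)])   # Sigma_j

def sample_mixture(n):
    """Draw n points: pick a class z ~ multinomial(pi), then x ~ N(mu_z, Sigma_z)."""
    z = rng.choice(len(mixing), size=n, p=mixing)          # hidden class labels
    x = np.array([rng.multivariate_normal(means[k], covs[k]) for k in z])
    return x, z

x, z = sample_mixture(500)   # x is what we measure; z is hidden from the learner
```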

Mixture densities: our imagined classifier
If our data were labeled, we could estimate the mean and variance of each Gaussian and compute the mixing proportions: the number of times a particular class was present (divided by the total) gives its mixing proportion, and the labeled points of each class give its mean and variance.
The K-means algorithm begins by making an initial guess about the class centers and then iterates two steps (a code sketch follows):
Step 1: Assign each data point to the class that has the closest center.
Step 2: Re-estimate the center of each class as the mean of the points assigned to it.
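A minimal sketch of those two steps, assuming the data are the rows of a NumPy array; the function name, stopping rule, and initialization are illustrative choices, not the lecture's:

```python
import numpy as np

def k_means(x, m, n_iters=20, rng=np.random.default_rng(0)):
    """Cluster the rows of x into m classes with the two-step K-means iteration."""
    # Initial guess: m randomly chosen data points as class centers.
    centers = x[rng.choice(len(x), size=m, replace=False)].astype(float)
    for _ in range(n_iters):
        # Step 1: assign each point to the closest center.
        dists = np.linalg.norm(x[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 2: re-estimate each center as the mean of its assigned points.
        for j in range(m):
            if np.any(labels == j):
                centers[j] = x[labels == j].mean(axis=0)
    return centers, labels
```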

K-means algorithm on an easy problem (first 5 iterations; converged)
[Figure: K-means classification after each iteration, compared with the true labels.]

K-means algorithm on a harder problem (first 10 iterations; converged)
[Figure: K-means classification compared with the true labels.]

The cost function for the K-means algorithm
Step 1: assuming that the class centers are known, to minimize the cost we should classify each point based on the closest center.
Step 2: assuming that class memberships are known, setting the derivative of the cost with respect to the centers to zero dictates that each center should sit in the middle of its class: the sum of the class's points divided by the number of points in class j. The cost is written out below.
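The usual form of this cost, with r_{ij} an indicator that point i is assigned to class j and n_j the number of points in class j (assumed notation):

J(\boldsymbol{\mu}_1, \dots, \boldsymbol{\mu}_m) = \sum_{i=1}^{n} \sum_{j=1}^{m} r_{ij}\, \lVert \mathbf{x}^{(i)} - \boldsymbol{\mu}_j \rVert^2

\frac{\partial J}{\partial \boldsymbol{\mu}_j} = -2 \sum_{i} r_{ij}\, (\mathbf{x}^{(i)} - \boldsymbol{\mu}_j) = 0 \;\;\Rightarrow\;\; \boldsymbol{\mu}_j = \frac{1}{n_j} \sum_{i:\, r_{ij}=1} \mathbf{x}^{(i)}, \qquad n_j = \sum_{i} r_{ij}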

Types of classification data
- Supervised: the data set is complete.
- K-means: the data set is incomplete, but we complete it by assigning "hard" memberships.
- EM: the data set is incomplete, but we complete it using posterior probabilities (a "soft" class membership).

EM algorithm
Instead of using a "hard" class assignment, we could use a "soft" assignment that represents the posterior probability of class membership for each data point. Using this soft classifier, we can re-estimate the parameters of the distribution for each class, where the "number of times" a particular class was present is now a sum of posterior probabilities (no longer an integer!). The standard expressions are sketched below.
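In the usual notation, with \gamma_{ij} denoting the soft class membership (the responsibility) of class j for point i (assumed symbols):

\gamma_{ij} = p\big(y^{(i)} = j \mid \mathbf{x}^{(i)}\big) = \frac{\pi_j\, N(\mathbf{x}^{(i)} \mid \boldsymbol{\mu}_j, \boldsymbol{\Sigma}_j)}{\sum_{k=1}^{m} \pi_k\, N(\mathbf{x}^{(i)} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)}, \qquad n_j = \sum_{i=1}^{n} \gamma_{ij}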

EM algorithm: iteration 0
We assume that there are exactly m classes in the data and begin with a guess about the initial setting of the parameters: we select m random samples from the data set and take these as the class means, we set the covariance of each class to the sample covariance of the whole data set, and we could set the mixing coefficients to be equal. A sketch of this initialization in code follows.
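A minimal sketch of that initialization, under the same assumptions as the earlier snippets (data as rows of a NumPy array; illustrative names):

```python
import numpy as np

def init_gmm(x, m, rng=np.random.default_rng(0)):
    """Initial guess: m random points as means, the whole-data covariance for
    every class, and equal mixing proportions."""
    means = x[rng.choice(len(x), size=m, replace=False)].astype(float)
    covs = np.array([np.cov(x, rowvar=False) for _ in range(m)])
    mixing = np.full(m, 1.0 / m)
    return mixing, means, covs
```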

EM algorithm: iteration k+1 We have a guess about the mixture parameters: The “E” step: Complete the incomplete data with the posterior probabilities. For each data point, we assign a class membership by computing the posterior probability for all m classes: The “M” step: estimate new mixture parameters based on the new class assignments:

Example: classification with EM (easy problem)
[Figure: the fitted mixture at the initial guess and at the 5th, 10th, and 15th iterations, alongside the true labeled data and the posterior probabilities. To plot each distribution, I draw an ellipse centered at the mean with an area that covers the expected location of the 1st, 2nd, and 3rd quartiles of the data; colors indicate the probability of belonging to one class or another.]

EM algorithm: the log-likelihood increases with each step
[Figure: log-likelihood of the data plotted against iteration number.]

Example: classification with EM (hard problem)
[Figure: the fitted mixture at the initial guess and at the 30th and 60th iterations, the posterior probabilities, the true labeled data, and the log-likelihood as a function of iteration.]

Some observations about EM:
- EM takes many more iterations to reach convergence than K-means, and each cycle requires significantly more computation.
- It is therefore common to run K-means first to find a reasonable initialization for EM: the covariance matrices can be set to the cluster covariances found by K-means, and the mixing proportions to the fraction of the data assigned to each cluster (see the sketch after this list).
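A sketch of that K-means-based initialization, reusing the hypothetical k_means helper from the earlier snippet (names are illustrative, and every cluster is assumed to be non-empty):

```python
import numpy as np

def init_gmm_from_kmeans(x, m):
    """Seed the EM parameters from a K-means clustering of the data."""
    centers, labels = k_means(x, m)   # hypothetical helper defined earlier
    mixing = np.array([(labels == j).mean() for j in range(m)])
    covs = np.array([np.cov(x[labels == j], rowvar=False) for j in range(m)])
    return mixing, centers, covs
```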

On the “M” step, EM maximizes the log-likelihood of the data Under the assumption that pi is constant

EM and maximizing the log-likelihood of the data
This is the same as the rule that we had come up with for "soft" classification, and the rules are the same for the estimates of the class variances and the mixture probabilities. Note that these equations for the parameters are not closed-form solutions, because the posterior probabilities are themselves functions of the parameters. What the algorithm does is solve for the parameters iteratively: we make a guess about the parameters, compute the posteriors, and then find the parameters that maximize the likelihood of the data, iterating until convergence.

EM algorithm: iteration k+1 We have a guess about the mixture parameters: The “E” step: Compute the posterior probabilities for all data points: The “M” step: under the assumption that the posterior probabilities are constant, estimate new parameter values to maximize log-likelihood: