Unsupervised Learning Part 2

Topics
– How to determine the K in K-means?
– Hierarchical clustering
– Soft clustering with Gaussian mixture models
– Expectation-Maximization method
– Review for quiz Tuesday

Clustering via finite Gaussian mixture models
A “soft”, generative version of K-means clustering.
Given: Training set S = {x_1, ..., x_N}, and K.
Assumption: Data is generated by sampling from a “mixture” (linear combination) of K Gaussians.

[Figure: mixture of three Gaussians (one-dimensional data)]

Clustering via finite Gaussian mixture models
Clusters: Each cluster will correspond to a single Gaussian. Each point x ∈ S will have some probability distribution over the K clusters.
Goal: Given the data, find the Gaussians! (And the coefficients of their sum.) I.e., find parameters {θ_k} of these K Gaussians such that P(S | {θ_k}) is maximized.
This is called a Maximum Likelihood method.
– S is the data
– {θ_k} is the “hypothesis” or “model”
– P(S | {θ_k}) is the “likelihood”

General form of one-dimensional (univariate) Gaussian Mixture Model
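For reference, the standard form of a K-component univariate Gaussian mixture is

p(x) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x \mid \mu_k, \sigma_k^2), \qquad \sum_{k=1}^{K} \pi_k = 1, \quad \pi_k \ge 0,

where \mathcal{N}(x \mid \mu_k, \sigma_k^2) is a Gaussian density with mean \mu_k and variance \sigma_k^2, and the \pi_k are the mixing coefficients.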

Simple Case: Maximum Likelihood for a Single Univariate Gaussian
Assume training set S has N values generated by a univariate Gaussian distribution.
Likelihood function: the probability of the data given the model (i.e., the parameters of the model).
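Under this assumption the likelihood takes the standard product form

p(S \mid \mu, \sigma^2) = \prod_{n=1}^{N} \mathcal{N}(x_n \mid \mu, \sigma^2) = \prod_{n=1}^{N} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left( -\frac{(x_n - \mu)^2}{2\sigma^2} \right).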

How do we estimate the parameters μ and σ² from S? Maximize the likelihood function with respect to μ and σ².
Problem: the individual values of p(x_n | μ, σ²) are typically very small, so their product can underflow the numerical precision of the computer.

Solution: Work with log likelihood instead of likelihood.
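Taking the log turns the product into a sum; the standard expression is

\ln p(S \mid \mu, \sigma^2) = -\frac{1}{2\sigma^2} \sum_{n=1}^{N} (x_n - \mu)^2 - \frac{N}{2} \ln \sigma^2 - \frac{N}{2} \ln(2\pi).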

Now, find the maximum likelihood parameters μ and σ². First, maximize with respect to μ. (ML = “Maximum Likelihood”.) Result:

\mu_{ML} = \frac{1}{N} \sum_{n=1}^{N} x_n \quad \text{(the sample mean)}

Now, maximize with respect to σ². Result:

\sigma^2_{ML} = \frac{1}{N} \sum_{n=1}^{N} (x_n - \mu_{ML})^2 \quad \text{(the sample variance about } \mu_{ML}\text{)}

The resulting distribution is called a “generative model” because it can generate new data values. We say that θ = (μ_ML, σ²_ML) parameterizes the model. In general, θ is used to denote the (learnable) parameters of a probabilistic model.
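As a small illustration of this point (NumPy assumed; the data and names here are synthetic, not from the lecture), we can fit μ and σ² by maximum likelihood and then sample new values from the fitted Gaussian:

```python
import numpy as np

rng = np.random.default_rng(0)
S = rng.normal(loc=5.0, scale=2.0, size=1000)      # hypothetical training data

mu_ml = S.mean()                                   # mu_ML = (1/N) sum_n x_n
var_ml = ((S - mu_ml) ** 2).mean()                 # sigma^2_ML = (1/N) sum_n (x_n - mu_ML)^2

# The fitted Gaussian is a generative model: draw new data values from it.
new_samples = rng.normal(loc=mu_ml, scale=np.sqrt(var_ml), size=10)
```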

Finite Gaussian Mixture Models
Back to finite Gaussian mixture models.
Assumption: Data is generated from a mixture of K Gaussians. Each cluster will correspond to a single Gaussian. Each point x will have some probability distribution over the K clusters.

Multivariate Gaussian Distribution Multivariate (D-dimensional):
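In standard notation,

\mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}, \boldsymbol{\Sigma}) = \frac{1}{(2\pi)^{D/2} \, |\boldsymbol{\Sigma}|^{1/2}} \exp\!\left( -\tfrac{1}{2} (\mathbf{x} - \boldsymbol{\mu})^{\mathsf{T}} \boldsymbol{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu}) \right),

where \boldsymbol{\mu} is the D-dimensional mean vector and \boldsymbol{\Sigma} is the D × D covariance matrix.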

Let S be a set of multivariate data points (vectors): S = {x_1, ..., x_m}.
General expression for a finite Gaussian mixture model:

p(\mathbf{x}) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)

That is, x has a probability of “membership” in multiple clusters/classes.

More Complicated Case: Maximum Likelihood for a Gaussian Mixture Model
Goal: Given S = {x_1, ..., x_N} and K, find the Gaussian mixture model (with K Gaussians) for which S has maximum log-likelihood.
Log likelihood function:

\ln p(S \mid \boldsymbol{\pi}, \boldsymbol{\mu}, \boldsymbol{\Sigma}) = \sum_{n=1}^{N} \ln \left( \sum_{k=1}^{K} \pi_k \, \mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k) \right)

Given S, we can maximize this function to find the parameters {π_k, μ_k, Σ_k}, but there is no closed-form solution (unlike the simple case on the previous slides). In this multivariate case, we can most efficiently maximize this function using the “Expectation-Maximization” (EM) algorithm.

Expectation-Maximization (EM) algorithm
General idea:
– Choose random initial values for means, covariances and mixing coefficients. (Analogous to choosing random initial cluster centers in K-means.)
– Alternate between the E (expectation) and M (maximization) steps:
  E step: use current values for parameters to evaluate posterior probabilities, or “responsibilities”, for each data point. (Analogous to determining which cluster a point belongs to, in K-means.)
  M step: use these probabilities to re-estimate means, covariances, and mixing coefficients. (Analogous to moving the cluster centers to the means of their members, in K-means.)
– Repeat until the log-likelihood or the parameters θ do not change significantly.

More detailed version of the EM algorithm
1. Let X be the set of training data. Initialize the means μ_k, covariances Σ_k, and mixing coefficients π_k, and evaluate the initial value of the log likelihood.
2. E step. Evaluate the “responsibilities” using the current parameter values:

r_{n,k} = \frac{\pi_k \, \mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)}{\sum_{j=1}^{K} \pi_j \, \mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_j, \boldsymbol{\Sigma}_j)}

where r_{n,k} denotes the “responsibility” of the kth cluster for the nth data point.

3. M step. Re-estimate the parameters θ using the current responsibilities:
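The standard M-step updates (as in the usual EM-for-GMM derivation) are

N_k = \sum_{n=1}^{N} r_{n,k}, \qquad \boldsymbol{\mu}_k^{new} = \frac{1}{N_k} \sum_{n=1}^{N} r_{n,k} \, \mathbf{x}_n,

\boldsymbol{\Sigma}_k^{new} = \frac{1}{N_k} \sum_{n=1}^{N} r_{n,k} \, (\mathbf{x}_n - \boldsymbol{\mu}_k^{new})(\mathbf{x}_n - \boldsymbol{\mu}_k^{new})^{\mathsf{T}}, \qquad \pi_k^{new} = \frac{N_k}{N}.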

4. Evaluate the log likelihood with the new parameters and check for convergence of either the parameters or the log likelihood. If not converged, return to step 2.
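Putting steps 1–4 together, here is a minimal NumPy sketch of EM for a univariate Gaussian mixture (the multivariate version replaces the scalar variances with covariance matrices); the function and variable names are illustrative, not from the lecture:

```python
import numpy as np

def normal_pdf(x, mu, var):
    return np.exp(-0.5 * (x - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)

def em_gmm_1d(x, K, max_iters=200, tol=1e-6, seed=0):
    rng = np.random.default_rng(seed)
    n = len(x)
    # Step 1: initialize means, variances, and mixing coefficients.
    mu = rng.choice(x, size=K, replace=False)
    var = np.full(K, np.var(x))
    pi = np.full(K, 1.0 / K)
    prev_ll = -np.inf
    for _ in range(max_iters):
        # Step 2 (E step): responsibilities r[n, k] of cluster k for point n.
        dens = pi * normal_pdf(x[:, None], mu, var)       # shape (n, K)
        r = dens / dens.sum(axis=1, keepdims=True)
        # Step 3 (M step): re-estimate parameters from the responsibilities.
        Nk = r.sum(axis=0)
        mu = (r * x[:, None]).sum(axis=0) / Nk
        var = (r * (x[:, None] - mu) ** 2).sum(axis=0) / Nk
        pi = Nk / n
        # Step 4: check convergence of the log likelihood.
        ll = np.log(dens.sum(axis=1)).sum()
        if abs(ll - prev_ll) < tol:
            break
        prev_ll = ll
    return mu, var, pi

# Example: data drawn from two Gaussians.
rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(-3, 1, 300), rng.normal(2, 0.5, 200)])
mu, var, pi = em_gmm_1d(x, K=2)
```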

EM is much more computationally expensive than K-means.
Common practice: use K-means to set the initial parameters, then improve with EM (a sketch follows below).
– Initial means: means of the clusters found by K-means.
– Initial covariances: sample covariances of the clusters found by the K-means algorithm.
– Initial mixture coefficients: fractions of data points assigned to the respective clusters.
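A short sketch of this practice with scikit-learn (assuming scikit-learn and NumPy are available; the data below are synthetic):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=m, scale=1.0, size=(100, 2)) for m in (-4.0, 0.0, 4.0)])

K = 3
kmeans = KMeans(n_clusters=K, n_init=10, random_state=0).fit(X)

# Use the K-means result to initialize EM: cluster means as initial means,
# and the fraction of points in each cluster as initial mixing coefficients.
counts = np.bincount(kmeans.labels_, minlength=K)
gmm = GaussianMixture(
    n_components=K,
    covariance_type="full",
    means_init=kmeans.cluster_centers_,
    weights_init=counts / counts.sum(),
    random_state=0,
).fit(X)

responsibilities = gmm.predict_proba(X)   # soft cluster assignments r_{n,k}
```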

One can prove that EM finds local maxima of the log-likelihood function. EM is a very general technique for finding maximum-likelihood solutions for probabilistic models.

Text classification from labeled and unlabeled documents using EM (K. Nigam et al., Machine Learning, 2000)
Big problem with text classification: we need labeled data. What we have: lots of unlabeled data.
Question of this paper: can unlabeled data be used to increase classification accuracy? That is, is there any information implicit in unlabeled data? Is there any way to take advantage of this implicit information?

General idea: a version of the EM algorithm.
Train a classifier with the small set of available labeled documents. Use this classifier to assign probabilistically-weighted class labels to the unlabeled documents by calculating the expectation of the missing class labels. Then train a new classifier using all the documents, both originally labeled and formerly unlabeled. Iterate.

Probabilistic framework
Assumes the data are generated by a mixture model (for text, the mixture components are word distributions rather than Gaussians). Assumes a one-to-one correspondence between mixture components and classes. “These assumptions rarely hold in real-world text data.”

Probabilistic framework
Let C = {c_1, ..., c_K} be the classes / mixture components. Let θ denote the full set of mixture parameters (the mixture weights π_j together with the parameters of each of the K components).
Assumptions: a document d_i is created by first selecting a mixture component according to the mixture weights π_j, then having this selected mixture component generate a document according to its own parameters, with distribution p(d_i | c_j; θ).
Likelihood of document d_i:

p(d_i \mid \theta) = \sum_{j=1}^{K} P(c_j \mid \theta) \, p(d_i \mid c_j; \theta)

Now, we will apply EM to a Naive Bayes classifier.
Recall the Naive Bayes classifier: assume each feature is conditionally independent, given c_j.
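Written for text, with the words w_t of a document as the (conditionally independent) features, one standard form of the classifier is

P(c_j \mid d_i) \propto P(c_j) \prod_{w_t \in d_i} P(w_t \mid c_j).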

To “train” naive Bayes from labeled data, estimate the class priors P(c_j) and the word probabilities P(w_t | c_j). These values are estimates of the parameters in θ. Call these estimated values θ̂.
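A common form of these estimates (multinomial naive Bayes with Laplace smoothing, in the style used by Nigam et al.) is

\hat{P}(c_j) = \frac{\text{number of labeled documents in class } c_j}{\text{total number of labeled documents}}, \qquad
\hat{P}(w_t \mid c_j) = \frac{1 + N(w_t, c_j)}{|V| + \sum_{s=1}^{|V|} N(w_s, c_j)},

where N(w_t, c_j) is the number of occurrences of word w_t in the documents of class c_j (the prior can be smoothed similarly).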

Note that Naive Bayes can be thought of as a generative mixture model. Document d_i is represented as a vector of word frequencies (w_1, ..., w_|V|), where V is the vocabulary (all known words). The probability distribution over words associated with each class is parameterized by θ. We need to estimate θ to determine which class's word distribution a document d_i = (w_1, ..., w_|V|) is most likely to have come from.

Applying EM to Naive Bayes
We have a small number of labeled documents, S_labeled, and a large number of unlabeled documents, S_unlabeled.
The initial parameters θ̂ are estimated from the labeled documents S_labeled.
Expectation step: the resulting classifier is used to assign probabilistically-weighted class labels to each unlabeled document x ∈ S_unlabeled.
Maximization step: re-estimate θ̂ using the (weighted) class labels for all x ∈ S_labeled ∪ S_unlabeled.
Repeat until θ̂ (or the log likelihood) has converged. (A sketch of this loop appears below.)
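A simplified NumPy sketch of this loop for a multinomial naive Bayes model over word-count matrices; it assumes Laplace smoothing and equal weighting of labeled and unlabeled documents, and the function names are illustrative rather than taken from the paper:

```python
import numpy as np

def m_step(X, R, alpha=1.0):
    """Estimate class priors and word probabilities from soft labels R (n_docs x K)."""
    priors = (R.sum(axis=0) + alpha) / (R.sum() + alpha * R.shape[1])
    word_counts = R.T @ X                               # (K x |V|) expected word counts
    word_probs = (word_counts + alpha) / (
        word_counts.sum(axis=1, keepdims=True) + alpha * X.shape[1]
    )
    return priors, word_probs

def e_step(X, priors, word_probs):
    """Posterior class probabilities (soft labels) for each document."""
    log_post = np.log(priors) + X @ np.log(word_probs).T    # (n_docs x K)
    log_post -= log_post.max(axis=1, keepdims=True)          # for numerical stability
    R = np.exp(log_post)
    return R / R.sum(axis=1, keepdims=True)

def em_naive_bayes(X_lab, y_lab, X_unlab, K, n_iters=10):
    """X_lab, X_unlab: word-count matrices; y_lab: integer class labels."""
    R_lab = np.eye(K)[y_lab]                      # hard labels as one-hot rows
    priors, word_probs = m_step(X_lab, R_lab)     # initialize from labeled docs only
    X_all = np.vstack([X_lab, X_unlab])
    for _ in range(n_iters):
        R_unlab = e_step(X_unlab, priors, word_probs)        # E step
        R_all = np.vstack([R_lab, R_unlab])
        priors, word_probs = m_step(X_all, R_all)            # M step
    return priors, word_probs
```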

Augmenting EM
What if the basic assumptions (each document is generated by one component; there is a one-to-one mapping between components and classes) do not hold? They tried two things to deal with this:
(1) Weighting unlabeled data less than labeled data.
(2) Allowing multiple mixture components per class: a document may be composed of several different sub-topics, each best captured by a different word distribution.

Data
– 20 UseNet newsgroups
– Web pages (WebKB)
– Newswire articles (Reuters)

Gaussian Mixture Models as Generative Models
General format of a GMM:

p(\mathbf{x}) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)

How would you use this to generate a new data point x?
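One way to answer this, sketched below: first draw a component index k with probability π_k, then draw x from that component's Gaussian. (The parameter values here are made up for illustration.)

```python
import numpy as np

rng = np.random.default_rng(0)
pi = np.array([0.5, 0.3, 0.2])                                 # mixing coefficients
mus = [np.array([0.0, 0.0]), np.array([3.0, 3.0]), np.array([-3.0, 2.0])]
covs = [np.eye(2), 0.5 * np.eye(2), np.diag([1.0, 0.2])]

def sample_gmm(n):
    ks = rng.choice(len(pi), size=n, p=pi)                     # pick a component per sample
    return np.array([rng.multivariate_normal(mus[k], covs[k]) for k in ks])

X_new = sample_gmm(5)
```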