Unsupervised Learning Reading: Chapter 8 from Introduction to Data Mining by Tan, Steinbach, and Kumar, pp. 487-515, 532-541, 546-552 (http://www-users.cs.umn.edu/~kumar/dmbook/ch8.pdf)

Unsupervised learning = No labels on training examples! Main approach: Clustering

Example: Optdigits data set

Optdigits features: x = (f1, f2, ..., f64) = (0, 2, 13, 16, 16, 16, 2, 0, 0, ...), etc.

Partitional Clustering of Optdigits [Figure: clusters shown in a 3-D projection (Feature 1, Feature 2, Feature 3) of the 64-dimensional feature space]

Hierarchical Clustering of Optdigits [Figure: nested clusters shown in the same 3-D projection of the 64-dimensional feature space]

Issues for clustering algorithms How to measure distance between pairs of instances? How many clusters to create? Should clusters be hierarchical? (I.e., clusters of clusters) Should clustering be “soft”? (I.e., an instance can belong to different clusters, with “weighted belonging”)

Most commonly used (and simplest) clustering algorithm: K-Means Clustering

K-means clustering: step-by-step illustration (sequence of figures). Adapted from Andrew Moore, http://www.cs.cmu.edu/~awm/tutorials

K-means clustering algorithm. Distance metric: chosen by the user; for numerical attributes, the L2 (Euclidean) distance is often used. Centroid of a cluster: typically the mean of the points in the cluster.
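The K-means loop itself (assign each point to its nearest centroid, then recompute each centroid as the mean of its points) presumably appeared as pseudocode on the original slide. Here is a minimal runnable sketch in Python using only NumPy; the function name kmeans, the iteration cap, and the simple random initialization are illustrative choices, not something prescribed by the slides.

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """Minimal K-means: X is an (N, D) array, k is the number of clusters."""
    rng = np.random.default_rng(seed)
    # Initialize centroids by picking k distinct data points at random.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Assignment step: each point goes to its nearest centroid (L2 distance).
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each centroid becomes the mean of its assigned points.
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels
```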

Example: Image segmentation by K-means clustering by color (from http://vitroz.com/Documents/Image%20Segmentation.pdf). [Figures: segmentation results with K=5 and K=10 in RGB space]

Clustering text documents: A text document is represented as a feature vector of word frequencies. The similarity between two documents is measured as the cosine of the angle between their corresponding feature vectors.
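A small sketch of this representation and similarity measure, assuming whitespace tokenization and raw word counts (both simplifications for illustration):

```python
from collections import Counter
import math

def word_freq_vector(doc, vocab):
    """Represent a document as a vector of word frequencies over a fixed vocabulary."""
    counts = Counter(doc.lower().split())
    return [counts[w] for w in vocab]

def cosine_similarity(u, v):
    """Cosine of the angle between two frequency vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

docs = ["the cat sat on the mat", "the dog sat on the log"]
vocab = sorted(set(" ".join(docs).split()))
u, v = (word_freq_vector(d, vocab) for d in docs)
print(cosine_similarity(u, v))
```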

Figure 4. Two-dimensional map of the PMRA cluster solution, representing nearly 29,000 clusters and over two million articles. Boyack KW, Newman D, Duhon RJ, Klavans R, et al. (2011) Clustering More than Two Million Biomedical Publications: Comparing the Accuracies of Nine Text-Based Similarity Approaches. PLoS ONE 6(3): e18029. doi:10.1371/journal.pone.0018029 http://www.plosone.org/article/info:doi/10.1371/journal.pone.0018029

Exercise 1

How to evaluate clusters produced by K-means? Unsupervised evaluation Supervised evaluation

Unsupervised Cluster Evaluation: We don’t know the classes of the data instances. Let C denote a clustering (i.e., the set of K clusters that is the result of a clustering algorithm), let c denote a cluster in C, and let |c| denote the number of elements in c. We want each cluster c to be coherent, i.e., to minimize the distance between the elements of c and the centroid μc by minimizing the Mean Square Error:

mse(c) = (1/|c|) Σ_{x ∈ c} d(x, μc)²

Note: The assigned reading uses sum square error rather than mean square error.

Unsupervised Cluster Evaluation: We don’t know the classes of the data instances. We also want to maximize the pairwise separation of the clusters, i.e., maximize the Mean Square Separation, which (in one common formulation) is the average squared distance over all pairs of points that belong to different clusters:

mss(C) = (1 / #pairs) Σ_{x, y in different clusters} d(x, y)²
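A sketch of these two measures in Python, under the formulations given above (within-cluster mean squared distance to the centroid, and mean squared distance over pairs of points in different clusters):

```python
import numpy as np
from itertools import combinations

def mse(cluster):
    """Mean squared distance of the points in one cluster to its centroid."""
    cluster = np.asarray(cluster)
    centroid = cluster.mean(axis=0)
    return np.mean(np.sum((cluster - centroid) ** 2, axis=1))

def mss(clusters):
    """Mean squared distance over all pairs of points in different clusters."""
    sq_dists = [
        float(np.sum((np.asarray(x) - np.asarray(y)) ** 2))
        for a, b in combinations(clusters, 2)
        for x in a for y in b
    ]
    return sum(sq_dists) / len(sq_dists)

clusters = [np.array([[0.0, 0.0], [1.0, 0.0]]), np.array([[5.0, 5.0], [6.0, 5.0]])]
print(mse(clusters[0]), mss(clusters))
```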

Exercises 2-3

Supervised Cluster Evaluation: Suppose we know the classes of the data instances. Entropy of a cluster: the degree to which a cluster consists of objects of a single class, entropy(c) = −Σ_i p_i log2 p_i, where p_i is the fraction of instances in c that belong to class i. Mean entropy of a clustering: the average entropy over all clusters in the clustering. We want to minimize mean entropy.

Entropy Example Suppose there are 3 classes: 1, 2, 3 Cluster 1 1 2 1 3 1 1 3 2 3 3 3 2 3 1 1 3 2 2 3 2
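A worked version of this example, assuming the twenty listed numbers are the class labels of the members of one cluster (my reading of the slide):

```python
from collections import Counter
from math import log2

labels = [1, 2, 1, 3, 1, 1, 3, 2, 3, 3, 3, 2, 3, 1, 1, 3, 2, 2, 3, 2]

def entropy(labels):
    """Entropy of a cluster: -sum_i p_i log2 p_i over the class proportions p_i."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

print(entropy(labels))  # about 1.571 bits (6, 6, and 8 members of classes 1, 2, 3)
```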

Exercise 4

Issues for K-means (adapted from Bing Liu, UIC, http://www.cs.uic.edu/~liub/teach/cs583-fall-05/CS583-unsupervised-learning.ppt): The algorithm is only applicable if the mean is defined; for categorical data, use K-modes, in which the centroid is represented by the most frequent values. The user needs to specify K. The algorithm is sensitive to outliers: data points that are very far away from other data points, which could be errors in the data recording or special data points with very different values.

Issues for K-means: Problems with outliers (adapted from Bing Liu, UIC). [Figure: example illustrating how outliers distort K-means clusters]

Dealing with outliers (adapted from Bing Liu, UIC): One method is to remove, during the clustering process, data points that are much further away from the centroids than other data points. This is expensive and not always a good idea! Another method is to perform random sampling: since sampling chooses only a small subset of the data points, the chance of selecting an outlier is very small. Then assign the rest of the data points to the clusters by distance or similarity comparison, or by classification.

Issues for K-means, cont. (adapted from Bing Liu, UIC): The algorithm is sensitive to the initial seeds; with different seeds, the same data can give poor or good results. K-means results can often be improved by doing several random restarts, and it is often useful to select instances from the data as the initial seeds. [Figures: two runs with different initial seeds, one giving poor clusters and one giving good clusters]

Issues for K-means, cont. (adapted from Bing Liu, UIC): The K-means algorithm is not suitable for discovering clusters that are not hyper-ellipsoids (or hyper-spheres). [Figure: an example of an irregularly shaped cluster]

Other Issues: What if a cluster is empty? Choose a replacement centroid, either at random or from the cluster that has the highest mean square error. How to choose K? The assigned reading discusses several methods for improving a clustering with “postprocessing”.

Choosing the K in K-Means: a hard problem! There is often no “correct” answer for unlabeled data, and many methods have been proposed. Here are a few: (1) Try several values of K and see which is best, via cross-validation; possible metrics are mean square error, mean square separation, and a penalty for too many clusters [why?]. (2) Start with K = 2, then try splitting each cluster; the new means are placed one sigma away from the cluster center in the direction of greatest variation; evaluate with metrics similar to the above.

“Elbow” method: Plot the average mse (or SSE) vs. K, and choose the K at which the SSE (or other metric) stops decreasing abruptly (the “elbow”). However, sometimes there is no clear elbow. [Figure: SSE vs. K curve with the elbow marked]
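A sketch of the elbow method using scikit-learn and matplotlib (assuming both are installed); inertia_ is scikit-learn's attribute for the within-cluster SSE, and the toy three-blob data set here is made up for illustration:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Toy data: three well-separated Gaussian blobs.
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(100, 2)) for c in ([0, 0], [5, 5], [0, 5])])

ks = range(1, 10)
sse = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_ for k in ks]

plt.plot(list(ks), sse, marker="o")
plt.xlabel("K")
plt.ylabel("SSE (inertia)")
plt.title("Elbow method: look for the K where SSE stops dropping sharply")
plt.show()
```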

Homework 5

Quiz 4 Review

Soft Clustering with Gaussian Mixture Models

Soft Clustering with Gaussian mixture models A “soft”, generative version of K-means clustering Given: Training set S = {x1, ..., xN}, and K. Assumption: Data is generated by sampling from a “mixture” (linear combination) of K Gaussians.

Gaussian Mixture Models Assumptions K clusters Each cluster is modeled by a Gaussian distribution with a certain mean and standard deviation (or covariance). [This contrasts with K-means, in which each cluster is modeled only by a mean.] Assume that each data instance we have was generated by the following procedure: 1. Select cluster ci with probability P(ci) = πi 2. Sample point from ci’s Gaussian distribution

Mixture of three Gaussians (one dimensional data)
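To make the generative procedure on the previous slide concrete, here is a small sampling sketch for a mixture of three one-dimensional Gaussians; the particular mixing weights, means, and standard deviations are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative mixture parameters (not from the slides).
pis = np.array([0.5, 0.3, 0.2])      # mixing coefficients, sum to 1
mus = np.array([-2.0, 0.0, 3.0])     # component means
sigmas = np.array([0.5, 1.0, 0.8])   # component standard deviations

def sample_mixture(n):
    # 1. Select a component for each draw with probability pi_k.
    ks = rng.choice(len(pis), size=n, p=pis)
    # 2. Sample each point from the selected component's Gaussian.
    return rng.normal(loc=mus[ks], scale=sigmas[ks])

samples = sample_mixture(1000)
print(samples[:5])
```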

Clustering via finite Gaussian mixture models. Clusters: each cluster will correspond to a single Gaussian. Each point x ∈ S will have some probability distribution over the K clusters. Goal: given the data, find the Gaussians (and their probabilities πi)! I.e., find parameters {θk} of these K Gaussians such that P(S | {θk}) is maximized. This is called a Maximum Likelihood method: S is the data, {θk} is the “hypothesis” or “model”, and P(S | {θk}) is the “likelihood”.

General form of the one-dimensional (univariate) Gaussian Mixture Model:

p(x) = Σ_{k=1}^{K} πk N(x | μk, σk²),  where the mixing coefficients satisfy Σ_{k=1}^{K} πk = 1.

Learning a GMM. Simple case: Maximum Likelihood for a single univariate Gaussian. Assume the training set S has N values generated by a univariate Gaussian distribution:

N(x | μ, σ²) = (1 / √(2πσ²)) exp(−(x − μ)² / (2σ²))

Likelihood function (the probability of the data given the model, or the parameters of the model):

p(S | μ, σ²) = Π_{n=1}^{N} N(xn | μ, σ²)

How do we estimate the parameters μ and σ from S? Maximize the likelihood function with respect to μ and σ: we want the μ and σ that maximize the probability of the data. Problem: the individual values of N(xn | μ, σ²) are typically very small, so their product can underflow the numerical precision of the computer.

Solution: Work with the log likelihood instead of the likelihood. A simplified expression:

ln p(S | μ, σ²) = −(1 / (2σ²)) Σ_{n=1}^{N} (xn − μ)² − (N/2) ln σ² − (N/2) ln(2π)

Now find the maximum likelihood parameters μ and σ². First, maximize with respect to μ: take the derivative of the log likelihood with respect to μ, set it to zero, and solve. Result (ML = “Maximum Likelihood”):

μML = (1/N) Σ_{n=1}^{N} xn   (the sample mean)

Now maximize with respect to σ². Again, take the derivative, set it to zero, and solve. Result:

σ²ML = (1/N) Σ_{n=1}^{N} (xn − μML)²   (the sample variance, using the maximum-likelihood mean)
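A quick numerical check of these closed-form estimates on synthetic data (the true μ = 2.0 and σ = 1.5 used here are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
S = rng.normal(loc=2.0, scale=1.5, size=10_000)   # synthetic univariate Gaussian data

mu_ml = S.mean()                        # mu_ML = (1/N) sum_n x_n
sigma2_ml = ((S - mu_ml) ** 2).mean()   # sigma^2_ML = (1/N) sum_n (x_n - mu_ML)^2

print(mu_ml, np.sqrt(sigma2_ml))        # should be close to 2.0 and 1.5
```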

The resulting distribution is called a “generative model” because it can generate new data values. We say that θ = (μ, σ²) parameterizes the model. In general, θ is used to denote the (learnable) parameters of a probabilistic model.

Learning a GMM, more general case: the multivariate Gaussian distribution. A multivariate (D-dimensional) Gaussian:

N(x | μ, Σ) = (1 / ((2π)^{D/2} |Σ|^{1/2})) exp(−(1/2)(x − μ)ᵀ Σ⁻¹ (x − μ))

where μ is a D-dimensional mean vector and Σ is a D × D covariance matrix with determinant |Σ|.

Covariance: cov(xi, xj) = E[(xi − μi)(xj − μj)]. Variance: var(xi) = cov(xi, xi). Covariance matrix Σ: Σi,j = cov(xi, xj).

Let S be a set of multivariate data points (vectors): S = {x1, ..., xN}. General expression for a finite Gaussian mixture model:

p(x) = Σ_{k=1}^{K} πk N(x | μk, Σk)

That is, x has some probability of “membership” in each of the clusters/classes.

Maximum Likelihood for a Multivariate Gaussian Mixture Model. Goal: given S = {x1, ..., xN} and K, find the Gaussian mixture model (with K multivariate Gaussians) for which S has maximum log-likelihood. Log likelihood function:

ln p(S | π, μ, Σ) = Σ_{n=1}^{N} ln ( Σ_{k=1}^{K} πk N(xn | μk, Σk) )

Given S, we can maximize this function to find the parameters {πk, μk, Σk}, but there is no closed-form solution (unlike the simple case in the previous slides). In this multivariate case, we can efficiently maximize this function using the “Expectation-Maximization” (EM) algorithm.

Expectation-Maximization (EM) algorithm General idea: Choose random initial values for means, covariances and mixing coefficients. (Analogous to choosing random initial cluster centers in K-means.) Alternate between E (expectation) and M (maximization) step: E step: use current values for parameters to evaluate posterior probabilities, or “responsibilities”, for each data point. (Analogous to determining which cluster a point belongs to, in K-means.) M step: Use these probabilities to re-estimate means, covariances, and mixing coefficients. (Analogous to moving the cluster centers to the means of their members, in K-means.) Repeat until the log-likelihood or the parameters θ do not change significantly.

More detailed version of the EM algorithm. Let X be the set of training data. Initialize the means μk, covariances Σk, and mixing coefficients πk, and evaluate the initial value of the log likelihood. E step: evaluate the “responsibilities” using the current parameter values:

rn,k = πk N(xn | μk, Σk) / Σ_{j=1}^{K} πj N(xn | μj, Σj)

where rn,k denotes the “responsibility” of the kth cluster for the nth data point.

M step: re-estimate the parameters θ using the current responsibilities:

Nk = Σ_{n=1}^{N} rn,k
μk(new) = (1/Nk) Σ_{n=1}^{N} rn,k xn
Σk(new) = (1/Nk) Σ_{n=1}^{N} rn,k (xn − μk(new))(xn − μk(new))ᵀ
πk(new) = Nk / N

Evaluate the log likelihood with the new parameters and check for convergence of either the parameters or the log likelihood. If not converged, return to step 2.
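A compact sketch of this E/M loop in Python using NumPy and SciPy (scipy.stats.multivariate_normal supplies the Gaussian density); the small diagonal term added to each covariance and the fixed iteration count are my additions for numerical stability, not part of the slides:

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, K, n_iters=100, seed=0):
    """EM for a Gaussian mixture model. X: (N, D) data array, K: number of components."""
    rng = np.random.default_rng(seed)
    N, D = X.shape
    # Initialization: random data points as means, identity covariances, uniform weights.
    mus = X[rng.choice(N, size=K, replace=False)]
    Sigmas = np.stack([np.eye(D)] * K)
    pis = np.full(K, 1.0 / K)
    for _ in range(n_iters):
        # E step: responsibilities r[n, k] proportional to pi_k * N(x_n | mu_k, Sigma_k).
        r = np.column_stack([
            pis[k] * multivariate_normal.pdf(X, mean=mus[k], cov=Sigmas[k])
            for k in range(K)
        ])
        r /= r.sum(axis=1, keepdims=True)
        # M step: re-estimate parameters from the responsibilities.
        Nk = r.sum(axis=0)
        mus = (r.T @ X) / Nk[:, None]
        for k in range(K):
            diff = X - mus[k]
            Sigmas[k] = (r[:, k, None] * diff).T @ diff / Nk[k] + 1e-6 * np.eye(D)
        pis = Nk / N
    return pis, mus, Sigmas, r

# Usage sketch: soft-cluster 2-D data into 3 components.
# pis, mus, Sigmas, resp = em_gmm(X, K=3)
```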

EM much more computationally expensive than K-means Common practice: Use K-means to set initial parameters, then improve with EM. Initial means: Means of clusters found by k-means Initial covariances: Sample covariances of the clusters found by K-means algorithm. Initial mixture coefficients: Fractions of data points assigned to the respective clusters.

One can prove that EM finds local maxima of the log-likelihood function. EM is a very general technique for finding maximum-likelihood solutions for probabilistic models.

Using GMMs for classification: Assume each cluster corresponds to one of the classes. A new test example x is classified according to the class whose component gives it the highest posterior probability:

class(x) = argmax_k P(ck | x) = argmax_k πk N(x | μk, Σk)
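Continuing the EM sketch above, classifying a new point by this rule might look as follows (an illustrative sketch that reuses the hypothetical pis, mus, Sigmas returned by em_gmm):

```python
import numpy as np
from scipy.stats import multivariate_normal

def classify(x, pis, mus, Sigmas):
    """Assign x to the component (class) with the highest pi_k * N(x | mu_k, Sigma_k)."""
    scores = [pis[k] * multivariate_normal.pdf(x, mean=mus[k], cov=Sigmas[k])
              for k in range(len(pis))]
    return int(np.argmax(scores))
```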

Case Study: Text classification from labeled and unlabeled documents using EM K. Nigam et al., Machine Learning, 2000 Big problem with text classification: need labeled data. What we have: lots of unlabeled data. Question of this paper: Can unlabeled data be used to increase classification accuracy? I.e.: Any information implicit in unlabeled data? Any way to take advantage of this implicit information?

General idea: a version of the EM algorithm. Train a classifier with the small set of available labeled documents. Use this classifier to assign probabilistically-weighted class labels to the unlabeled documents by calculating the expectation of the missing class labels. Then train a new classifier using all the documents, both originally labeled and formerly unlabeled. Iterate.

Probabilistic framework: Assumes the data are generated by a mixture model, and assumes a one-to-one correspondence between mixture components and classes. “These assumptions rarely hold in real-world text data”

Probabilistic framework. Let C = {c1, ..., cK} be the classes / mixture components. Let θ = {π1, ..., πK} ∪ {θ1, ..., θK} be the mixture parameters, where πj are the mixture weights and θj are the parameters of component cj's distribution. Assumptions: a document di is created by first selecting a mixture component according to the mixture weights πj, then having this selected mixture component generate a document according to its own parameters, with distribution p(di | cj; θ). Likelihood of document di:

p(di | θ) = Σ_{j=1}^{K} πj p(di | cj; θ)

Now we will apply EM to a Naive Bayes classifier. Recall the Naive Bayes classifier: assume each feature (word) wi is conditionally independent of the others, given the class cj, so that

P(cj | di) ∝ P(cj) Π_i P(wi | cj)

To “train” naive Bayes from labeled data, estimate the class priors P(cj) and the conditional word probabilities P(wi | cj). These values are estimates of the parameters in θ. Call these estimates θ̂.

Note that Naive Bayes can be thought of as a generative mixture model. Document di is represented as a vector of word frequencies (w1, ..., w|V|), where V is the vocabulary (all known words). The probability distribution over words associated with each class cj is parameterized by θj. We need to estimate θ to determine which class's distribution the document di = (w1, ..., w|V|) is most likely to have come from.

Applying EM to Naive Bayes. We have a small number of labeled documents, Slabeled, and a large number of unlabeled documents, Sunlabeled. The initial parameters θ̂ are estimated from the labeled documents Slabeled. Expectation step: the resulting classifier is used to assign probabilistically-weighted class labels to each unlabeled document x ∈ Sunlabeled. Maximization step: re-estimate θ̂ using the values for all x ∈ Slabeled ∪ Sunlabeled. Repeat until θ̂ (or the classifier) has converged.
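A rough sketch of this loop using scikit-learn's MultinomialNB (assuming scikit-learn is available); the five EM iterations and the replicate-with-sample-weights trick for soft labels are illustrative choices, and this simplified version omits the paper's refinements such as down-weighting the unlabeled data:

```python
import numpy as np
from scipy.sparse import vstack
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

def semi_supervised_nb(labeled_docs, labels, unlabeled_docs, n_em_iters=5):
    """Nigam-style EM around a naive Bayes classifier (simplified sketch)."""
    vec = CountVectorizer()
    X_lab = vec.fit_transform(labeled_docs)     # word-count feature vectors
    X_unlab = vec.transform(unlabeled_docs)
    y_lab = np.asarray(labels)
    classes = np.unique(y_lab)

    # Initial classifier trained on the labeled documents only.
    clf = MultinomialNB().fit(X_lab, y_lab)

    for _ in range(n_em_iters):
        # E step: probabilistically-weighted class labels for the unlabeled documents.
        probs = clf.predict_proba(X_unlab)      # shape (n_unlabeled, n_classes)
        # M step: retrain on labeled docs (weight 1) plus unlabeled docs,
        # each counted once per class with weight equal to its posterior probability.
        X_all = vstack([X_lab] + [X_unlab] * len(classes))
        y_all = np.concatenate([y_lab] + [np.full(X_unlab.shape[0], c) for c in classes])
        w_all = np.concatenate([np.ones(X_lab.shape[0])] +
                               [probs[:, i] for i in range(len(classes))])
        clf = MultinomialNB().fit(X_all, y_all, sample_weight=w_all)
    return clf, vec
```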

Augmenting EM: What if the basic assumptions (each document generated by one component; one-to-one mapping between components and classes) do not hold? They tried two things to deal with this: (1) weighting unlabeled data less than labeled data; (2) allowing multiple mixture components per class, since a document may be composed of several different sub-topics, each best captured with a different word distribution.

Data 20 UseNet newsgroups Web pages (WebKB) Newswire articles (Reuters)