Clustering.


Clustering

Outline: What is Cluster Analysis, k-Means, Adaptive Initialization, EM, Learning Mixture Gaussians, E-step, M-step, k-Means vs Mixture of Gaussians.

What is Cluster Analysis? A cluster is a collection of data objects that are similar to one another within the same cluster and dissimilar to the objects in other clusters. Cluster analysis groups a set of data objects into clusters. Clustering is unsupervised classification: there are no predefined classes. Typical applications: as a stand-alone tool to get insight into the data distribution, or as a preprocessing step for other algorithms.

k-Means Clustering

Feature space: each sample is a point x = (x1, …, xd)^t in a d-dimensional feature space.

Norm: ||x|| ≥ 0, with equality only if x = 0; ||αx|| = |α| ||x|| (homogeneity); ||x1 + x2|| ≤ ||x1|| + ||x2|| (triangle inequality). The l_p norm: ||x||_p = (Σ_i |x_i|^p)^(1/p).

Metric: d(x,y) ≥ 0, with equality only if x = y; d(x,y) = d(y,x) (symmetry); d(x,y) ≤ d(x,z) + d(z,y) (triangle inequality).
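
As an illustration (not part of the original slides), a short numpy sketch that computes l_p norms and checks the norm and metric properties above for the induced distance d(x, y) = ||x − y||_p:

import numpy as np

def lp_norm(x, p=2):
    # l_p norm: (sum_i |x_i|^p)^(1/p)
    return np.sum(np.abs(x) ** p) ** (1.0 / p)

def lp_distance(x, y, p=2):
    # metric induced by the l_p norm
    return lp_norm(x - y, p)

x = np.array([1.0, -2.0, 3.0])
y = np.array([0.0, 1.0, 1.0])
z = np.array([2.0, 2.0, 0.0])

# norm axioms: non-negativity and triangle inequality
assert lp_norm(x) >= 0
assert lp_norm(x + y) <= lp_norm(x) + lp_norm(y)

# metric axioms: symmetry and triangle inequality
assert np.isclose(lp_distance(x, y), lp_distance(y, x))
assert lp_distance(x, y) <= lp_distance(x, z) + lp_distance(z, y)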

k-means Clustering: cluster centers c1, c2, …, ck with corresponding clusters C1, C2, …, Ck.

Error: E = Σ_j Σ_{x ∈ Cj} ||x − cj||², the within-cluster sum of squared distances. The error function has a local minimum if each center cj is the centroid (mean) of its cluster Cj.
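
A minimal numpy sketch (added here, not from the slides) of this error function; X holds the samples, centers the cluster centers, and labels the index of the assigned cluster for each sample:

import numpy as np

def kmeans_error(X, centers, labels):
    # E = sum_j sum_{x in C_j} ||x - c_j||^2
    return sum(np.sum((X[labels == j] - c) ** 2) for j, c in enumerate(centers))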

k-means Example (k=2): pick seeds, reassign clusters, compute centroids, reassign clusters, compute centroids, reassign clusters: converged!

Algorithm: random initialization of k cluster centers; do { assign each xi in the dataset to the nearest cluster center (centroid) cj according to d²; compute all new cluster centers } until ( |Enew − Eold| < ε or number of iterations ≥ max_iterations ).
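
A runnable sketch of the batch k-means loop above, assuming numpy, squared Euclidean distance, and the stopping parameters ε (eps) and max_iterations from the pseudocode:

import numpy as np

def kmeans(X, k, eps=1e-4, max_iterations=100, seed=0):
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    # random initialization: k distinct data points as cluster centers
    centers = X[rng.choice(len(X), size=k, replace=False)]
    e_old = np.inf
    for _ in range(max_iterations):
        # assign each x_i to the nearest center according to d^2
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        e_new = d2[np.arange(len(X)), labels].sum()
        # compute all new cluster centers (centroids of the assigned points)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
        if abs(e_new - e_old) < eps:
            break
        e_old = e_new
    return centers, labels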

Adaptive k-means learning (batch mode) for large datasets: random initialization of cluster centers; do { choose xi from the dataset; update cj*, the nearest cluster center (centroid) according to d² } until ( |Enew − Eold| < ε or number of iterations ≥ max_iterations ).
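
A sketch of the sample-by-sample update described above. The learning rate eta and the rule of moving the winning center toward the chosen sample are the standard adaptive k-means update and are assumptions here, since the slide's update formula is not reproduced in the transcript:

import numpy as np

def adaptive_kmeans_step(centers, x, eta=0.05):
    # find the nearest center cj* according to d^2 and move it a fraction eta toward x
    j_star = np.argmin(((centers - x) ** 2).sum(axis=1))
    centers[j_star] += eta * (x - centers[j_star])
    return centers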

How to choose k? You have to know your data! Repeated runs of k-means clustering on the same data can lead to quite different partitions. Why? Because we use random initialization.

Adaptive Initialization: choose a maximum radius within which every data point should have a cluster seed after completion of the initialization phase. In a single sweep, go through the data and assign cluster seeds according to the chosen radius: a data point becomes a new cluster seed if it is not covered by the spheres (of the chosen radius) around the already assigned seeds. K-MAI clustering (Wichert et al. 2003).
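
A minimal sketch of the radius-based seeding idea described above (single sweep; a point becomes a new seed only if no existing seed lies within the chosen radius). This illustrates the idea only and is not claimed to be the exact K-MAI algorithm of Wichert et al.:

import numpy as np

def radius_seeds(X, radius):
    seeds = [X[0]]
    for x in X[1:]:
        # x becomes a new seed if it is not covered by any existing seed's sphere
        if all(np.linalg.norm(x - s) > radius for s in seeds):
            seeds.append(x)
    return np.array(seeds)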

EM: Expectation-Maximization Clustering

Feature space: for a sample x, the Mahalanobis distance to a mean µ with covariance Σ is d(x, µ) = ((x − µ)^t Σ⁻¹ (x − µ))^(1/2).
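
A short sketch of the squared Mahalanobis distance, assuming a mean vector mu and covariance matrix sigma as numpy arrays:

import numpy as np

def mahalanobis2(x, mu, sigma):
    # squared Mahalanobis distance (x - mu)^t Sigma^-1 (x - mu)
    diff = x - mu
    return float(diff @ np.linalg.inv(sigma) @ diff)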

Bayes' rule: after the evidence is obtained, the posterior probability P(a|b) is the probability of a given that all we know is b, P(a|b) = P(b|a) P(a) / P(b). (Reverend Thomas Bayes, 1702-1761)
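
A tiny numeric illustration of Bayes' rule with made-up probabilities (the values are purely illustrative):

# P(a|b) = P(b|a) P(a) / P(b), with P(b) = P(b|a) P(a) + P(b|not a) P(not a)
p_a = 0.01                # prior P(a)
p_b_given_a = 0.9         # likelihood P(b|a)
p_b_given_not_a = 0.1     # P(b|not a)
p_b = p_b_given_a * p_a + p_b_given_not_a * (1 - p_a)
p_a_given_b = p_b_given_a * p_a / p_b   # posterior, roughly 0.083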

Covariance: a measure of the tendency of two features xi and xj to vary in the same direction. The covariance between features xi and xj is estimated from n patterns.
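
A short numpy sketch estimating the covariance matrix from n patterns, assuming the usual (n − 1)-normalized sample covariance (numpy's default):

import numpy as np

X = np.random.default_rng(0).normal(size=(100, 3))   # n = 100 patterns, 3 features
mu = X.mean(axis=0)
cov = (X - mu).T @ (X - mu) / (len(X) - 1)           # d x d covariance matrix
assert np.allclose(cov, np.cov(X, rowvar=False))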

Learning Mixture Gaussians: what kind of probability distribution might have generated the data? Clustering presumes that the data are generated from a mixture distribution P.

The Normal Density. Univariate density: p(x) = (1 / (√(2π) σ)) exp(−(x − µ)² / (2σ²)), where µ = mean (or expected value) of x and σ² = expected squared deviation or variance. The density is analytically tractable and continuous, and a lot of processes are asymptotically Gaussian.

Example: Mixture of 2 Gaussians

Multivariate density: the multivariate normal density in d dimensions is p(x) = (1 / ((2π)^(d/2) |Σ|^(1/2))) exp(−½ (x − µ)^t Σ⁻¹ (x − µ)), where x = (x1, x2, …, xd)^t (t stands for the transpose vector form), µ = (µ1, µ2, …, µd)^t is the mean vector, Σ is the d×d covariance matrix, and |Σ| and Σ⁻¹ are its determinant and inverse respectively.
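
A sketch evaluating this multivariate normal density directly from the formula above (mu and sigma are assumed to be a d-vector and a d×d covariance matrix):

import numpy as np

def gaussian_pdf(x, mu, sigma):
    d = len(mu)
    diff = x - mu
    norm = (2 * np.pi) ** (d / 2) * np.sqrt(np.linalg.det(sigma))
    return np.exp(-0.5 * diff @ np.linalg.inv(sigma) @ diff) / norm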

Example: Mixture of 3 Gaussians

A mixture distribution has k components, each of which is a distribution in its own right. A data point is generated by first choosing a component and then generating a sample from that component.

Let C denote the component, with values 1, …, k. The mixture distribution is given by P(x) = Σ_{i=1..k} wi P(x | C=i), where x refers to the data point, wi = P(C=i) is the weight of each component, µi the mean (vector) of each component, and Σi the covariance (matrix) of each component.
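
A sketch of this mixture density, reusing the gaussian_pdf sketch from the multivariate density slide; weights, means, and covs are assumed to be the lists of wi, µi, and Σi:

def mixture_pdf(x, weights, means, covs):
    # P(x) = sum_i P(C=i) P(x | C=i)
    return sum(w * gaussian_pdf(x, mu, sigma)
               for w, mu, sigma in zip(weights, means, covs))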

If we knew which component generated each data point, then it would be easy to recover the component Gaussians: we could simply fit the parameters of a Gaussian to the data points assigned to each component.

Basic EM idea: pretend that we know the parameters of the model; infer the probability that each data point belongs to each component; refit the components to the data, where each component is fitted to the entire data set, with each point weighted by the probability that it belongs to that component.

Algorithm: we initialize the mixture parameters arbitrarily. E-step (expectation): compute the probabilities pij = P(C=i | xj), the probability that xj was generated by component i. By Bayes' rule, pij ∝ P(xj | C=i) P(C=i), normalized so that Σi pij = 1. P(xj | C=i) is just the probability at xj of the ith Gaussian, and P(C=i) is just the weight parameter of the ith Gaussian.

M-step (maximization): refit the components using the pij as weights: µi = Σj pij xj / Σj pij, Σi = Σj pij (xj − µi)(xj − µi)^t / Σj pij, and wi = P(C=i) = Σj pij / N.
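
A compact sketch of one EM iteration for the Gaussian mixture, following the E-step and M-step above. It reuses gaussian_pdf from the earlier sketch and assumes X is an (n, d) array, weights a length-k array, means a (k, d) array, and covs a (k, d, d) array; p[j, i] holds pij = P(C=i | xj):

import numpy as np

def em_step(X, weights, means, covs):
    n, k = len(X), len(weights)
    # E-step: p_ij proportional to P(x_j | C=i) P(C=i), normalized over components i
    p = np.array([[weights[i] * gaussian_pdf(x, means[i], covs[i])
                   for i in range(k)] for x in X])
    p /= p.sum(axis=1, keepdims=True)
    # M-step: refit each component, weighting every point by p_ij
    for i in range(k):
        n_i = p[:, i].sum()
        means[i] = (p[:, i, None] * X).sum(axis=0) / n_i
        diff = X - means[i]
        covs[i] = (p[:, i, None, None] * (diff[:, :, None] * diff[:, None, :])).sum(axis=0) / n_i
        weights[i] = n_i / n
    return weights, means, covs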

Problems: a Gaussian component can shrink so that it covers just a single point; its variance goes to zero, and the likelihood will go to infinity. Two components can "merge", acquiring identical means and variances and sharing their data points. These are serious problems, especially in high dimensions; it helps to initialize the parameters with reasonable values.

k-Means vs Mixture of Gaussians: both are iterative algorithms that assign points to clusters. k-Means minimizes the within-cluster sum of squared distances; the mixture of Gaussians maximizes the likelihood P(x | C=i). The mixture of Gaussians is the more general formulation; it reduces to k-means when Σi = I for all components and the soft assignments are replaced by hard assignments.

Summary: What is Cluster Analysis, k-Means, Adaptive Initialization, EM, Learning Mixture Gaussians, E-step, M-step, k-Means vs Mixture of Gaussians.

Tree Clustering COBWEB