Pattern Classification All materials in these slides were taken from Pattern Classification (2nd ed) by R. O. Duda, P. E. Hart and D. G. Stork, John Wiley.


Pattern Classification All materials in these slides were taken from Pattern Classification (2nd ed) by R. O. Duda, P. E. Hart and D. G. Stork, John Wiley & Sons, 2000 with the permission of the authors and the publisher

Chapter 10 Unsupervised Learning & Clustering

Introduction

Previously, all our training samples were labeled; learning from such samples is said to be “supervised”. Why are we interested in “unsupervised” procedures, which use unlabeled samples?

1) Collecting and labeling a large set of sample patterns can be costly.
2) We can train with large amounts of (less expensive) unlabeled data and then use supervision to label the groupings found. This is appropriate for large “data mining” applications where the contents of a large database are not known beforehand.

3) Patterns may change slowly with time. Improved performance can be achieved if classifiers running in an unsupervised mode are used.
4) We can use unsupervised methods to identify features that will then be useful for categorization (‘smart’ feature extraction).
5) We gain some insight into the nature (or structure) of the data: which set of classification labels is appropriate?

Mixture Densities & Identifiability

Assume:
- the functional forms of the underlying probability densities are known;
- the value of an unknown parameter vector must be learned;
i.e., as in Chapter 3 but without class labels.

Specific assumptions:
- The samples come from a known number c of classes.
- The prior probabilities $P(\omega_j)$ for each class are known (j = 1, …, c).
- The forms of the class-conditional densities $p(x \mid \omega_j, \theta_j)$ (j = 1, …, c) are known.
- The values of the c parameter vectors $\theta_1, \theta_2, \dots, \theta_c$ are unknown.
- The category labels are unknown.

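The PDF for the samples is:

$$p(x \mid \theta) = \sum_{j=1}^{c} p(x \mid \omega_j, \theta_j)\, P(\omega_j), \qquad \theta = (\theta_1, \theta_2, \dots, \theta_c)^t$$

This density function is called a mixture density. The conditional densities $p(x \mid \omega_j, \theta_j)$ are its component densities and the priors $P(\omega_j)$ are its mixing parameters. Our goal will be to use samples drawn from this mixture density to estimate the unknown parameter vector $\theta$. Once $\theta$ is known, we can decompose the mixture into its components and use a MAP classifier on the derived densities.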

Can $\theta$ be recovered from the mixture? Consider the case where we have an unlimited number of samples and use a nonparametric technique to find $p(x \mid \theta)$ for every x. If several values of $\theta$ result in the same $p(x \mid \theta)$, we cannot find a unique solution. This is the issue of solution identifiability.

Definition (identifiability): a density $p(x \mid \theta)$ is said to be identifiable if $\theta \neq \theta'$ implies that there exists an x such that $p(x \mid \theta) \neq p(x \mid \theta')$.

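As a simple example, consider the case where x is binary and $P(x \mid \theta)$ is the mixture:

$$P(x \mid \theta) = \tfrac{1}{2}\,\theta_1^{x}(1-\theta_1)^{1-x} + \tfrac{1}{2}\,\theta_2^{x}(1-\theta_2)^{1-x} = \begin{cases} \tfrac{1}{2}(\theta_1+\theta_2) & \text{if } x = 1 \\ 1 - \tfrac{1}{2}(\theta_1+\theta_2) & \text{if } x = 0 \end{cases}$$

Assume that $P(x = 1 \mid \theta) = 0.6$, and hence $P(x = 0 \mid \theta) = 0.4$. We know $P(x \mid \theta)$ but not $\theta$. We can say that $\theta_1 + \theta_2 = 1.2$, but not what $\theta_1$ and $\theta_2$ are individually. Thus, we have a case in which the mixture distribution is completely unidentifiable, and therefore unsupervised learning is impossible.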
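The following is a minimal sketch of this example, assuming the equal-weight mixture of two Bernoulli components with parameters theta1 and theta2. The function name `binary_mixture` and the two parameter vectors are illustrative choices, not part of the slides. It shows that two different parameter vectors with the same sum produce the same distribution over x.

```python
def binary_mixture(x, theta1, theta2):
    """P(x | theta) for an equal-weight mixture of two Bernoulli components."""
    comp1 = theta1 ** x * (1 - theta1) ** (1 - x)
    comp2 = theta2 ** x * (1 - theta2) ** (1 - x)
    return 0.5 * comp1 + 0.5 * comp2


# Two different parameter vectors, both with theta1 + theta2 = 1.2.
theta_a = (0.5, 0.7)
theta_b = (0.3, 0.9)

for x in (0, 1):
    # Both columns agree, so no amount of data can tell theta_a from theta_b.
    print(x, binary_mixture(x, *theta_a), binary_mixture(x, *theta_b))
```

Any pair on the line theta1 + theta2 = 1.2 would behave the same way, which is what makes the mixture unidentifiable.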

For discrete distributions, too many mixture components can be problematic: there are too many unknowns, perhaps more unknowns than independent equations, and identifiability can become a serious problem!

While it can be shown that mixtures of normal densities are usually identifiable, the parameters in the simple mixture density

$$p(x \mid \theta) = \frac{P(\omega_1)}{\sqrt{2\pi}} \exp\!\left[-\tfrac{1}{2}(x-\theta_1)^2\right] + \frac{P(\omega_2)}{\sqrt{2\pi}} \exp\!\left[-\tfrac{1}{2}(x-\theta_2)^2\right]$$

cannot be uniquely identified if $P(\omega_1) = P(\omega_2)$ (we cannot recover a unique $\theta$ even from an infinite amount of data!). In that case $\theta = (\theta_1, \theta_2)$ and $\theta = (\theta_2, \theta_1)$ are two possible vectors that can be interchanged without affecting $p(x \mid \theta)$. Because identifiability can be a problem, from here on we always assume that the densities we are dealing with are identifiable!

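ML Estimates

Suppose that we have a set $D = \{x_1, \dots, x_n\}$ of n unlabeled samples drawn independently from the mixture density

$$p(x \mid \theta) = \sum_{j=1}^{c} p(x \mid \omega_j, \theta_j)\, P(\omega_j)$$

($\theta$ is fixed but unknown!). The MLE is:

$$\hat{\theta} = \arg\max_{\theta}\, p(D \mid \theta), \qquad p(D \mid \theta) = \prod_{k=1}^{n} p(x_k \mid \theta)$$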

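Then the log-likelihood is:

$$l = \sum_{k=1}^{n} \ln p(x_k \mid \theta)$$

And the gradient of the log-likelihood with respect to $\theta_i$ is:

$$\nabla_{\theta_i} l = \sum_{k=1}^{n} P(\omega_i \mid x_k, \theta)\, \nabla_{\theta_i} \ln p(x_k \mid \omega_i, \theta_i)$$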

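Since the gradient must vanish at the value of $\theta_i$ that maximizes l, the ML estimate $\hat{\theta}_i$ must satisfy the conditions

$$\sum_{k=1}^{n} P(\omega_i \mid x_k, \hat{\theta})\, \nabla_{\theta_i} \ln p(x_k \mid \omega_i, \hat{\theta}_i) = 0, \qquad i = 1, \dots, c$$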

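When the priors are also unknown, the MLE for $P(\omega_i)$ and $\theta_i$ must satisfy:

$$\hat{P}(\omega_i) = \frac{1}{n} \sum_{k=1}^{n} \hat{P}(\omega_i \mid x_k, \hat{\theta})$$

$$\sum_{k=1}^{n} \hat{P}(\omega_i \mid x_k, \hat{\theta})\, \nabla_{\theta_i} \ln p(x_k \mid \omega_i, \hat{\theta}_i) = 0$$

where

$$\hat{P}(\omega_i \mid x_k, \hat{\theta}) = \frac{p(x_k \mid \omega_i, \hat{\theta}_i)\, \hat{P}(\omega_i)}{\sum_{j=1}^{c} p(x_k \mid \omega_j, \hat{\theta}_j)\, \hat{P}(\omega_j)}$$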

Applications to Normal Mixtures

$p(x \mid \omega_i, \theta_i) \sim N(\mu_i, \Sigma_i)$

Case   mu_i   Sigma_i   P(omega_i)   c
1      ?      x         x            x
2      ?      ?         ?            x
3      ?      ?         ?            ?

(? = unknown, x = known)

Case 1 = simplest case
Case 2 = more realistic case

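Case 1: Multivariate Normal, Unknown Mean Vectors

Here $\theta_i = \mu_i$ for all i = 1, …, c. The log-likelihood term for the i-th mean is:

$$\ln p(x \mid \omega_i, \mu_i) = -\ln\!\left[(2\pi)^{d/2} |\Sigma_i|^{1/2}\right] - \tfrac{1}{2}(x-\mu_i)^t \Sigma_i^{-1} (x-\mu_i)$$

The ML estimate of $\mu = (\mu_i)$ is:

$$\hat{\mu}_i = \frac{\sum_{k=1}^{n} P(\omega_i \mid x_k, \hat{\mu})\, x_k}{\sum_{k=1}^{n} P(\omega_i \mid x_k, \hat{\mu})} \qquad (1)$$

where $P(\omega_i \mid x_k, \hat{\mu})$ is the fraction of those samples having value $x_k$ that come from the i-th class, and $\hat{\mu}_i$ is the average of the samples coming from the i-th class.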

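Unfortunately, equation (1) does not give $\hat{\mu}_i$ explicitly, because $\hat{\mu}$ also appears on the right-hand side. However, if we have some way of obtaining good initial estimates $\hat{\mu}_i(0)$ for the unknown means, equation (1) can be seen as an iterative process for improving the estimates:

$$\hat{\mu}_i(j+1) = \frac{\sum_{k=1}^{n} P(\omega_i \mid x_k, \hat{\mu}(j))\, x_k}{\sum_{k=1}^{n} P(\omega_i \mid x_k, \hat{\mu}(j))}$$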
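The iterative process can be sketched as follows for a one-dimensional mixture with unit-variance components and known priors. This is a minimal illustration under those assumptions: the names `normal_pdf` and `update_means`, the sample values, the priors and the starting means are all hypothetical. Each pass computes the posterior of every class for every sample under the current means, then replaces each mean by the posterior-weighted average of the samples.

```python
import math


def normal_pdf(x, mu, sigma=1.0):
    """Univariate normal density N(mu, sigma^2) evaluated at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))


def update_means(samples, means, priors):
    """One pass of the fixed-point update for the component means."""
    c = len(means)
    num = [0.0] * c
    den = [0.0] * c
    for x in samples:
        joint = [priors[i] * normal_pdf(x, means[i]) for i in range(c)]
        total = sum(joint)
        for i in range(c):
            post = joint[i] / total  # P(omega_i | x_k, current means)
            num[i] += post * x
            den[i] += post
    return [num[i] / den[i] for i in range(c)]


# Illustrative data: four samples near -2 and six near +2.
samples = [-3.1, -2.4, -1.8, -2.2, 1.5, 2.3, 2.9, 1.9, 2.1, 1.7]
priors = [0.4, 0.6]   # assumed known
means = [-0.5, 0.5]   # initial estimates
for _ in range(50):
    means = update_means(samples, means, priors)
print(means)
```

The result depends on the starting point: a poor initial guess can lead to a different local maximum of the likelihood.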

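This is a gradient ascent for maximizing the log-likelihood function.

Example: consider the simple two-component one-dimensional normal mixture (2 clusters!):

$$p(x \mid \mu_1, \mu_2) = \frac{1}{3\sqrt{2\pi}} \exp\!\left[-\tfrac{1}{2}(x-\mu_1)^2\right] + \frac{2}{3\sqrt{2\pi}} \exp\!\left[-\tfrac{1}{2}(x-\mu_2)^2\right]$$

Let’s set $\mu_1 = -2$, $\mu_2 = 2$ and draw 25 samples sequentially from this mixture. The log-likelihood function is:

$$l(\mu_1, \mu_2) = \sum_{k=1}^{n} \ln p(x_k \mid \mu_1, \mu_2)$$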

The maximum value of l occurs at $\hat{\mu}_1 = -2.130$ and $\hat{\mu}_2 = 1.668$ (which are not far from the true values $\mu_1 = -2$ and $\mu_2 = +2$). There is another peak at $\hat{\mu}_1 = 2.085$ and $\hat{\mu}_2 = -1.257$ which has almost the same height, as can be seen from the following figure.

This mixture of normal densities is identifiable. When the mixture density is not identifiable, the ML solution is not unique.

[Figure: the log-likelihood function l plotted against the two means]

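Case 2: All Parameters Unknown

No constraints are placed on the covariance matrix. Let $p(x \mid \mu, \sigma^2)$ be the two-component normal mixture:

$$p(x \mid \mu, \sigma^2) = \frac{1}{2\sqrt{2\pi}\,\sigma} \exp\!\left[-\tfrac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2\right] + \frac{1}{2\sqrt{2\pi}} \exp\!\left[-\tfrac{1}{2}x^2\right]$$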

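Suppose $\mu = x_1$; therefore:

$$p(x_1 \mid \mu, \sigma^2) = \frac{1}{2\sqrt{2\pi}\,\sigma} + \frac{1}{2\sqrt{2\pi}} \exp\!\left[-\tfrac{1}{2}x_1^2\right]$$

For the rest of the samples:

$$p(x_k \mid \mu, \sigma^2) \geq \frac{1}{2\sqrt{2\pi}} \exp\!\left[-\tfrac{1}{2}x_k^2\right]$$

Finally,

$$p(x_1, \dots, x_n \mid \mu, \sigma^2) \geq \left\{\frac{1}{\sigma} + \exp\!\left[-\tfrac{1}{2}x_1^2\right]\right\} \frac{1}{(2\sqrt{2\pi})^n} \exp\!\left[-\tfrac{1}{2}\sum_{k=2}^{n} x_k^2\right]$$

The likelihood can therefore be made arbitrarily large by letting $\sigma \to 0$, and the maximum-likelihood solution becomes singular.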
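A small numerical sketch of this singularity, assuming a two-component mixture in which one component is N(mu, sigma^2) and the other is a fixed N(0, 1), each with weight 1/2. The function names and the five sample values are made up for illustration. With mu pinned to the first sample, the log-likelihood eventually grows without bound as sigma shrinks.

```python
import math


def mixture_pdf(x, mu, sigma):
    """0.5 * N(mu, sigma^2) + 0.5 * N(0, 1), evaluated at x."""
    z = (x - mu) / sigma
    free = math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))
    fixed = math.exp(-0.5 * x * x) / math.sqrt(2 * math.pi)
    return 0.5 * free + 0.5 * fixed


def log_likelihood(samples, mu, sigma):
    """Sum of log mixture densities over all samples."""
    return sum(math.log(mixture_pdf(x, mu, sigma)) for x in samples)


samples = [0.3, -1.2, 0.8, 1.9, -0.4]  # illustrative data
for sigma in (1.0, 1e-1, 1e-2, 1e-3, 1e-4):
    # mu sits exactly on the first sample, so its term scales like 1 / sigma.
    print(sigma, log_likelihood(samples, mu=samples[0], sigma=sigma))
```

The other samples remain covered by the fixed component, so their terms stay bounded below while the first term diverges.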

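Assumption: the MLE is well-behaved at local maxima. Consider the largest of the finite local maxima of the likelihood function and use it as the ML estimate. We obtain the following local-maximum-likelihood estimates (iterative scheme):

$$\hat{P}(\omega_i) = \frac{1}{n} \sum_{k=1}^{n} \hat{P}(\omega_i \mid x_k, \hat{\theta})$$

$$\hat{\mu}_i = \frac{\sum_{k=1}^{n} \hat{P}(\omega_i \mid x_k, \hat{\theta})\, x_k}{\sum_{k=1}^{n} \hat{P}(\omega_i \mid x_k, \hat{\theta})}$$

$$\hat{\Sigma}_i = \frac{\sum_{k=1}^{n} \hat{P}(\omega_i \mid x_k, \hat{\theta})\, (x_k - \hat{\mu}_i)(x_k - \hat{\mu}_i)^t}{\sum_{k=1}^{n} \hat{P}(\omega_i \mid x_k, \hat{\theta})}$$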

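where:

$$\hat{P}(\omega_i \mid x_k, \hat{\theta}) = \frac{|\hat{\Sigma}_i|^{-1/2} \exp\!\left[-\tfrac{1}{2}(x_k - \hat{\mu}_i)^t \hat{\Sigma}_i^{-1} (x_k - \hat{\mu}_i)\right] \hat{P}(\omega_i)}{\sum_{j=1}^{c} |\hat{\Sigma}_j|^{-1/2} \exp\!\left[-\tfrac{1}{2}(x_k - \hat{\mu}_j)^t \hat{\Sigma}_j^{-1} (x_k - \hat{\mu}_j)\right] \hat{P}(\omega_j)}$$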

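K-Means Clustering

Goal: find the c mean vectors $\mu_1, \mu_2, \dots, \mu_c$.

Replace the squared Mahalanobis distance $(x_k - \hat{\mu}_i)^t \hat{\Sigma}_i^{-1} (x_k - \hat{\mu}_i)$ by the squared Euclidean distance $\|x_k - \hat{\mu}_i\|^2$. Find the mean $\hat{\mu}_m$ nearest to $x_k$ and approximate $\hat{P}(\omega_i \mid x_k, \hat{\theta})$ as:

$$\hat{P}(\omega_i \mid x_k, \hat{\theta}) \approx \begin{cases} 1 & \text{if } i = m \\ 0 & \text{otherwise} \end{cases}$$

Use the iterative scheme to find $\hat{\mu}_1, \hat{\mu}_2, \dots, \hat{\mu}_c$.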

If n is the known number of patterns and c the desired number of clusters, the k-means algorithm is:

  begin initialize n, c, $\mu_1, \mu_2, \dots, \mu_c$ (randomly selected)
    do classify the n samples according to the nearest $\mu_i$
       recompute $\mu_i$
    until no change in $\mu_i$
    return $\mu_1, \mu_2, \dots, \mu_c$
  end

The complexity is O(ndcT), where d is the number of features and T the number of iterations.
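The algorithm can be sketched in plain Python as follows. This is an illustrative version rather than a reference implementation: the function signature, the `max_iter` safeguard and the toy data are assumptions. It stops when the assignment of samples no longer changes, which is equivalent to the means no longer changing.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two points."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def k_means(samples, initial_means, max_iter=100):
    """Return (means, labels) after iterating until the labels stop changing."""
    means = [list(m) for m in initial_means]
    c = len(means)
    labels = None
    for _ in range(max_iter):
        # Classify every sample according to the nearest mean.
        new_labels = [
            min(range(c), key=lambda i: squared_distance(x, means[i]))
            for x in samples
        ]
        if new_labels == labels:
            break
        labels = new_labels
        # Recompute each mean from its members (an empty cluster keeps its mean).
        for i in range(c):
            members = [x for x, lab in zip(samples, labels) if lab == i]
            if members:
                means[i] = [sum(col) / len(members) for col in zip(*members)]
    return means, labels


samples = [(0.0, 0.1), (0.2, -0.1), (-0.1, 0.0), (5.0, 5.1), (5.2, 4.9), (4.9, 5.0)]
rng = random.Random(0)
means, labels = k_means(samples, rng.sample(samples, 2))  # random initial means
print(means, labels)
```

Each iteration costs O(ndc) distance operations, which matches the O(ndcT) total quoted above.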

[Figure: k-means clustering on the data from the previous figure]

Unsupervised Bayesian Learning

Other than the ML estimate, the Bayesian estimation technique can also be used in the unsupervised case (see ML & Bayesian methods, Chap. 3 of the textbook). As before:
- the number of classes is known;
- the class priors are known;
- the forms of the class-conditional probability densities $p(x \mid \omega_j, \theta_j)$ are known.

However, the full parameter vector $\theta$ is unknown. Part of our knowledge about $\theta$ is contained in the prior $p(\theta)$; the rest of our knowledge of $\theta$ is in the training samples. We compute the posterior distribution using the training samples.

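As seen previously, Bayes' rule gives

$$P(\omega_i \mid x, D) = \frac{p(x \mid \omega_i, D)\, P(\omega_i)}{\sum_{j=1}^{c} p(x \mid \omega_j, D)\, P(\omega_j)}$$

where we have used $P(\omega_i \mid D) = P(\omega_i)$, since the selection of $\omega_i$ is independent of previous samples. Passing through the usual formulation that introduces the unknown parameter vector $\theta$:

$$p(x \mid \omega_i, D) = \int p(x \mid \omega_i, \theta_i)\, p(\theta \mid D)\, d\theta$$

Hence, the best estimate of $p(x \mid \omega_i)$ is obtained by averaging $p(x \mid \omega_i, \theta_i)$ over $\theta_i$. The goodness of this estimate depends on $p(\theta \mid D)$; this is the main issue of the problem.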

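From Bayes we get:

$$p(\theta \mid D) = \frac{p(D \mid \theta)\, p(\theta)}{\int p(D \mid \theta)\, p(\theta)\, d\theta}$$

where independence of the samples yields the likelihood

$$p(D \mid \theta) = \prod_{k=1}^{n} p(x_k \mid \theta)$$

or alternately (denoting by $D^n$ the set of n samples) the recursive form:

$$p(\theta \mid D^n) = \frac{p(x_n \mid \theta)\, p(\theta \mid D^{n-1})}{\int p(x_n \mid \theta)\, p(\theta \mid D^{n-1})\, d\theta}$$

If $p(\theta)$ is almost uniform in the region where $p(D \mid \theta)$ peaks, then $p(\theta \mid D)$ peaks in the same place.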
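The recursive form can be illustrated numerically by discretizing the parameter on a grid. The sketch below is only an illustration under stated assumptions: a hypothetical two-component one-dimensional mixture with equal priors, one unknown mean theta, one known mean fixed at 2, unit variances, and a flat prior over the grid. The function names and data are invented for the example.

```python
import math


def normal_pdf(x, mu):
    """Unit-variance normal density N(mu, 1) evaluated at x."""
    return math.exp(-0.5 * (x - mu) ** 2) / math.sqrt(2 * math.pi)


def mixture_pdf(x, theta):
    """Hypothetical mixture: unknown mean theta (class 1), known mean 2 (class 2)."""
    return 0.5 * normal_pdf(x, theta) + 0.5 * normal_pdf(x, 2.0)


def recursive_posterior(samples, grid):
    """Apply p(theta | D^n) ∝ p(x_n | theta) p(theta | D^(n-1)) one sample at a time."""
    step = grid[1] - grid[0]
    posterior = [1.0 / (step * len(grid))] * len(grid)  # flat prior density
    for x in samples:
        unnorm = [mixture_pdf(x, t) * p for t, p in zip(grid, posterior)]
        z = sum(unnorm) * step  # numerical stand-in for the normalizing integral
        posterior = [u / z for u in unnorm]
    return posterior


grid = [-6 + 0.05 * i for i in range(241)]
samples = [-2.3, 1.8, -1.7, 2.4, -2.1, 2.2, -1.9, 1.6]  # unlabeled
posterior = recursive_posterior(samples, grid)
theta_map = grid[max(range(len(grid)), key=posterior.__getitem__)]
print(theta_map)
```

Note that each update multiplies by the whole mixture density, not by a single component, because the label of each sample is unknown.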

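If the only significant peak occurs at $\theta = \hat{\theta}$ and the peak is very sharp, then

$$p(x \mid \omega_i, D) \approx p(x \mid \omega_i, \hat{\theta}_i)$$

and

$$P(\omega_i \mid x, D) \approx \frac{p(x \mid \omega_i, \hat{\theta}_i)\, P(\omega_i)}{\sum_{j=1}^{c} p(x \mid \omega_j, \hat{\theta}_j)\, P(\omega_j)}$$

Therefore, the ML estimate is justified. Both approaches coincide if large amounts of data are available. In small-sample-size problems they may or may not agree, depending on the form of the distributions. The ML method is typically easier to implement than the Bayesian one.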

Formal Bayesian solution: unsupervised learning (UL) of the parameters of a mixture density is similar to supervised learning (SL) of the parameters of a component density. The significant differences are identifiability and computational complexity.

The issue of identifiability:
- With SL, the lack of identifiability means that we do not obtain a unique vector but an equivalence class. This presents no theoretical difficulty, as all members yield the same component density.
- With UL, the lack of identifiability means that the mixture cannot be decomposed into its true components: $p(x \mid D^n)$ may still converge to $p(x)$, but $p(x \mid \omega_i, D^n)$ will not in general converge to $p(x \mid \omega_i)$. Hence there is a theoretical barrier.

The computational complexity:
- With SL, the existence of sufficient statistics allows the solutions to be computationally feasible.

- With UL, the samples come from a mixture density and there is little hope of finding simple exact solutions for $p(D \mid \theta)$: n samples result in $2^n$ terms (corresponding to the ways in which the n samples can be drawn from the 2 classes).

Another way of comparing UL and SL is to consider the usual equation in which the mixture density is explicit:

$$p(\theta \mid D^n) = \frac{\sum_{j=1}^{c} p(x_n \mid \omega_j, \theta_j)\, P(\omega_j)\; p(\theta \mid D^{n-1})}{\int \sum_{j=1}^{c} p(x_n \mid \omega_j, \theta_j)\, P(\omega_j)\; p(\theta \mid D^{n-1})\, d\theta}$$

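If we consider the case in which $P(\omega_1) = 1$ and all other prior probabilities are zero, corresponding to the supervised case in which all samples come from class $\omega_1$, then we get

$$p(\theta \mid D^n) = \frac{p(x_n \mid \omega_1, \theta_1)\, p(\theta \mid D^{n-1})}{\int p(x_n \mid \omega_1, \theta_1)\, p(\theta \mid D^{n-1})\, d\theta}$$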

Comparing the two equations, we see that in both cases observing an additional sample changes the estimate of $\theta$. Ignoring the denominator, which is independent of $\theta$, the only significant difference is that:
- in SL, we multiply the “prior” density for $\theta$ by the component density $p(x_n \mid \omega_1, \theta_1)$;
- in UL, we multiply the “prior” density by the whole mixture.

Assuming that the sample did come from class $\omega_1$, the effect of not knowing its category is to diminish the influence of $x_n$ in changing $\theta$ for category 1.

Data Clustering

The structure of multidimensional patterns is important for clustering. If we know that the data come from a specific distribution, such data can be represented by a compact set of parameters (sufficient statistics). If the samples are assumed to come from a specific distribution but actually do not, these statistics are a misleading representation of the data.

Approximation of density functions: a mixture of normal distributions can approximate arbitrary PDFs. In these cases, one can use parametric methods to estimate the parameters of the mixture density. There is no free lunch, however: the dimensionality issue remains!

Caveat

If little prior knowledge can be assumed, the assumption of a parametric form is meaningless. The issue is imposing structure versus finding structure, so use a nonparametric method to estimate the unknown mixture density. Alternatively, for subclass discovery, use a clustering procedure to identify data points having strong internal similarities.

Similarity measures

What do we mean by similarity? There are two issues:
- How do we measure the similarity between samples?
- How do we evaluate a partitioning of a set into clusters?

The obvious measure of similarity/dissimilarity is the distance between samples. Samples of the same cluster should be closer to each other than to samples in different clusters.

Euclidean distance is a possible metric: assume that samples belong to the same cluster if their distance is less than a threshold $d_0$. Clusters defined by Euclidean distance are invariant to translations and rotations of the feature space, but not to general transformations that distort the distance relationships.

Achieving invariance: normalize the data, e.g., so that all features have zero mean and unit variance, or use principal components for invariance to rotation.

A broad class of metrics is the Minkowski metric

$$d(x, x') = \left(\sum_{k=1}^{d} |x_k - x'_k|^q\right)^{1/q}$$

where $q \geq 1$ is a selectable parameter:
- q = 1: Manhattan or city-block metric
- q = 2: Euclidean metric

One can also use a nonmetric similarity function s(x, x’) to compare 2 vectors.
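A short sketch of the Minkowski metric, assuming points are given as equal-length sequences of numbers. The function name `minkowski` is an illustrative choice.

```python
def minkowski(x, y, q):
    """Minkowski distance between x and y for a parameter q >= 1."""
    if q < 1:
        raise ValueError("q must be at least 1")
    return sum(abs(a - b) ** q for a, b in zip(x, y)) ** (1.0 / q)


print(minkowski((0, 0), (3, 4), 1))  # city-block distance
print(minkowski((0, 0), (3, 4), 2))  # Euclidean distance
```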

It is typically a symmetric function whose value is large when x and x’ are similar. For example, the normalized inner product:

$$s(x, x') = \frac{x^t x'}{\|x\|\, \|x'\|}$$

In the case of binary-valued features we have, e.g., the Tanimoto distance:

$$s(x, x') = \frac{x^t x'}{x^t x + x'^t x' - x^t x'}$$
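Both similarity functions can be sketched in a few lines, assuming the usual textbook definitions: the normalized inner product, and the Tanimoto ratio of shared attributes to attributes present in either vector. The function names are illustrative.

```python
import math


def dot(x, y):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(x, y))


def normalized_inner_product(x, y):
    """Cosine of the angle between x and y."""
    return dot(x, y) / (math.sqrt(dot(x, x)) * math.sqrt(dot(y, y)))


def tanimoto(x, y):
    """Shared attributes divided by attributes possessed by x or y (binary features)."""
    shared = dot(x, y)
    return shared / (dot(x, x) + dot(y, y) - shared)


a = (1, 1, 0, 1)
b = (1, 0, 0, 1)
print(normalized_inner_product(a, b), tanimoto(a, b))
```

For binary vectors the Tanimoto value is 1 only when the two vectors have exactly the same attributes.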

Clustering as optimization

The second issue: how do we evaluate a partitioning of a set into clusters? Clustering can be posed as the optimization of a criterion function:
- the sum-of-squared-error criterion and its variants;
- scatter criteria.

The sum-of-squared-error criterion: let $n_i$ be the number of samples in $D_i$, and $m_i$ the mean of those samples:

$$m_i = \frac{1}{n_i} \sum_{x \in D_i} x$$

The sum of squared error is defined as

$$J_e = \sum_{i=1}^{c} \sum_{x \in D_i} \|x - m_i\|^2$$

This criterion defines clusters by their mean vectors $m_i$: the best representation is the one that minimizes the sum of the squared lengths of the errors $x - m_i$. The minimum-variance partition is the one that minimizes $J_e$.

Results:
- good when the clusters form well-separated compact clouds;
- bad when there are large differences in the number of samples in different clusters.

Scatter criteria

These use the scatter matrices from multiple discriminant analysis, i.e., the within-cluster scatter matrix $S_W$ and the between-cluster scatter matrix $S_B$, with

$$S_T = S_B + S_W$$

Note: $S_T$ does not depend on the partitioning. In contrast, $S_B$ and $S_W$ do depend on the partitioning.

Two approaches: minimize the within-cluster scatter, or maximize the between-cluster scatter.

The trace (sum of diagonal elements) is the simplest scalar measure of a scatter matrix; it is proportional to the sum of the variances in the coordinate directions:

$$\mathrm{tr}[S_W] = \sum_{i=1}^{c} \mathrm{tr}[S_i] = \sum_{i=1}^{c} \sum_{x \in D_i} \|x - m_i\|^2 = J_e$$

This is the sum-of-squared-error criterion, $J_e$.

As $\mathrm{tr}[S_T] = \mathrm{tr}[S_W] + \mathrm{tr}[S_B]$ and $\mathrm{tr}[S_T]$ is independent of the partitioning, no new results can be derived by maximizing $\mathrm{tr}[S_B]$. Rather, seeking to minimize the within-cluster criterion $J_e = \mathrm{tr}[S_W]$ is equivalent to maximizing the between-cluster criterion

$$\mathrm{tr}[S_B] = \sum_{i=1}^{c} n_i \|m_i - m\|^2$$

where m is the total mean vector:

$$m = \frac{1}{n} \sum_{x \in D} x = \frac{1}{n} \sum_{i=1}^{c} n_i m_i$$
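The trace identity can be checked numerically. The sketch below assumes the standard definitions: tr[S_W] is the sum of squared distances of samples to their own cluster mean (that is, J_e), tr[S_B] is the sum over clusters of n_i times the squared distance from the cluster mean to the total mean, and tr[S_T] is the sum of squared distances to the total mean. The helper names and the two toy partitions are illustrative.

```python
def mean(points):
    """Component-wise mean of a non-empty list of points."""
    n = len(points)
    return [sum(p[j] for p in points) / n for j in range(len(points[0]))]


def sq_dist(a, b):
    """Squared Euclidean distance."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def trace_within(clusters):
    """tr[S_W], which equals the sum-of-squared-error criterion J_e."""
    total = 0.0
    for cluster in clusters:
        m_i = mean(cluster)
        total += sum(sq_dist(x, m_i) for x in cluster)
    return total


def trace_between(clusters):
    """tr[S_B]: cluster sizes times squared distances of cluster means to the total mean."""
    m = mean([x for cluster in clusters for x in cluster])
    return sum(len(cluster) * sq_dist(mean(cluster), m) for cluster in clusters)


def trace_total(clusters):
    """tr[S_T]: squared distances of all samples to the total mean."""
    all_points = [x for cluster in clusters for x in cluster]
    m = mean(all_points)
    return sum(sq_dist(x, m) for x in all_points)


# Two different partitions of the same four samples.
partition_a = [[(0.0, 0.0), (1.0, 0.0)], [(4.0, 4.0), (5.0, 5.0)]]
partition_b = [[(0.0, 0.0), (5.0, 5.0)], [(1.0, 0.0), (4.0, 4.0)]]
for part in (partition_a, partition_b):
    print(trace_within(part), trace_between(part), trace_total(part))
```

The total stays fixed across partitions, so a lower within-cluster value always comes with a higher between-cluster value.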

Iterative optimization

Clustering is a discrete optimization problem: a finite data set has a finite number of partitions. What is the cost of exhaustive search? Approximately $c^n / c!$ partitions for c clusters. Not a good idea.

Typically, iterative optimization is used: starting from a reasonable initial partition, redistribute the samples so as to minimize the criterion function. This guarantees local, not global, optimization.

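Consider an iterative procedure to minimize the sum-of-squared-error criterion $J_e$:

$$J_e = \sum_{i=1}^{c} J_i, \qquad J_i = \sum_{x \in D_i} \|x - m_i\|^2$$

where $J_i$ is the effective error per cluster. Moving a sample $\hat{x}$ from cluster $D_i$ to $D_j$ changes the errors in the 2 clusters to:

$$J_j^* = J_j + \frac{n_j}{n_j + 1} \|\hat{x} - m_j\|^2, \qquad J_i^* = J_i - \frac{n_i}{n_i - 1} \|\hat{x} - m_i\|^2$$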

Hence, the transfer is advantageous if the decrease in $J_i$ is larger than the increase in $J_j$:

$$\frac{n_i}{n_i - 1} \|\hat{x} - m_i\|^2 > \frac{n_j}{n_j + 1} \|\hat{x} - m_j\|^2$$
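A minimal sketch of this transfer test, assuming the textbook expressions n_i/(n_i - 1) times the squared distance to the old mean for the decrease and n_j/(n_j + 1) times the squared distance to the new mean for the increase. The function names and the one-dimensional toy clusters are hypothetical. The gain is compared with a brute-force recomputation of the criterion before and after the move.

```python
def mean(points):
    """Component-wise mean of a non-empty list of points."""
    n = len(points)
    return [sum(p[j] for p in points) / n for j in range(len(points[0]))]


def sq_dist(a, b):
    """Squared Euclidean distance."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def sse(clusters):
    """Sum-of-squared-error criterion over all clusters."""
    return sum(sq_dist(x, mean(c)) for c in clusters for x in c)


def transfer_gain(x, cluster_i, cluster_j):
    """Net decrease of the criterion if x moves from cluster_i to cluster_j.

    Returns None when cluster_i has a single sample, since the move would empty it.
    """
    n_i, n_j = len(cluster_i), len(cluster_j)
    if n_i <= 1:
        return None
    decrease = n_i / (n_i - 1) * sq_dist(x, mean(cluster_i))
    increase = n_j / (n_j + 1) * sq_dist(x, mean(cluster_j))
    return decrease - increase


d_i = [(0.0,), (1.0,), (10.0,)]
d_j = [(9.0,), (11.0,)]
x_hat = (10.0,)
gain = transfer_gain(x_hat, d_i, d_j)
before = sse([d_i, d_j])
after = sse([[(0.0,), (1.0,)], [(9.0,), (11.0,), (10.0,)]])
print(gain, before - after)  # a positive gain means the transfer is advantageous
```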

Alg. 3 is a sequential version of the k-means algorithm:
- Alg. 3 updates each time a sample is reclassified.
- k-means waits until all n samples have been reclassified before updating.
- Alg. 3 can get trapped in local minima.
- The result depends on the order of the samples.
- It is basically a myopic approach.
- But it is online!

The starting point is always a problem. Approaches:
1. Random centers of clusters.
2. Repetition with different random initializations.
3. Take as the c-cluster starting point the solution of the (c-1)-cluster problem plus the sample farthest from the nearest cluster center.

Hierarchical Clustering

Many times, clusters are not disjoint: a cluster may have subclusters, which in turn have sub-subclusters, etc.

Consider a sequence of partitions of the n samples into c clusters:
- The first is a partition into n clusters, each one containing exactly one sample.
- The second is a partition into n-1 clusters, the third into n-2, and so on, until the n-th, in which there is only one cluster containing all of the samples.
- At level k in the sequence, c = n-k+1.

Given any two samples x and x’, they will be grouped together at some level, and if they are grouped at level k, they remain grouped at all higher levels. Hierarchical clustering has a tree representation called a dendrogram.

Are the groupings natural or forced? Check the similarity values: evenly distributed similarity values give no justification for grouping. Another representation is based on sets, e.g., on Venn diagrams.

Hierarchical clustering can be divided into agglomerative and divisive procedures.
- Agglomerative (bottom up, clumping): start with n singleton clusters and form the sequence by merging clusters.
- Divisive (top down, splitting): start with all of the samples in one cluster and form the sequence by successively splitting clusters.

Agglomerative hierarchical clustering

The procedure terminates when the specified number of clusters has been obtained, and returns the clusters as sets of points, rather than as a mean or a representative vector for each cluster.
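A rough sketch of such an agglomerative procedure follows. It is an assumption-laden illustration, not the slide's algorithm listing: the function names are invented, and the nearest-member distance is used as the default cluster distance simply because some measure has to be chosen. It starts from singleton clusters, repeatedly merges the nearest pair, and returns the clusters as lists of points.

```python
import math


def nearest_member_distance(a, b):
    """Distance between the closest pair of points, one from each cluster."""
    return min(math.dist(x, y) for x in a for y in b)


def agglomerate(samples, c, cluster_distance=nearest_member_distance):
    """Merge the nearest pair of clusters until only c clusters remain."""
    clusters = [[x] for x in samples]  # start with n singleton clusters
    while len(clusters) > c:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = cluster_distance(clusters[i], clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]  # j > i, so index i is unaffected
    return clusters


samples = [(0, 0), (0, 1), (5, 5), (5, 6), (0, 2)]
result = agglomerate(samples, c=2)
print(result)
```

This naive version rescans all cluster pairs at every merge; it is meant to show the control flow, not to be efficient.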

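At any level, the distance between the nearest clusters can provide the dissimilarity value for that level. To find the nearest clusters, one can use

$$d_{\min}(D_i, D_j) = \min_{x \in D_i,\, x' \in D_j} \|x - x'\|$$
$$d_{\max}(D_i, D_j) = \max_{x \in D_i,\, x' \in D_j} \|x - x'\|$$
$$d_{avg}(D_i, D_j) = \frac{1}{n_i n_j} \sum_{x \in D_i} \sum_{x' \in D_j} \|x - x'\|$$
$$d_{mean}(D_i, D_j) = \|m_i - m_j\|$$

which behave quite similarly if the clusters are hyperspherical and well separated. The computational complexity is $O(c n^2 d^2)$, n >> c.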
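The four inter-cluster distance measures named in these slides (d min, d max, d avg, d mean) can be sketched as below, assuming their usual textbook definitions in terms of Euclidean distance. The function names and the two toy clusters are illustrative.

```python
import math


def d_min(a, b):
    """Smallest distance between a point of a and a point of b."""
    return min(math.dist(x, y) for x in a for y in b)


def d_max(a, b):
    """Largest distance between a point of a and a point of b."""
    return max(math.dist(x, y) for x in a for y in b)


def d_avg(a, b):
    """Average of all pairwise distances between a and b."""
    return sum(math.dist(x, y) for x in a for y in b) / (len(a) * len(b))


def d_mean(a, b):
    """Distance between the two cluster means."""
    mean_a = [sum(col) / len(a) for col in zip(*a)]
    mean_b = [sum(col) / len(b) for col in zip(*b)]
    return math.dist(mean_a, mean_b)


cluster_a = [(0.0, 0.0), (1.0, 0.0)]
cluster_b = [(3.0, 0.0), (5.0, 0.0)]
print(d_min(cluster_a, cluster_b), d_max(cluster_a, cluster_b),
      d_avg(cluster_a, cluster_b), d_mean(cluster_a, cluster_b))
```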

Nearest-neighbor algorithm (single linkage)
- $d_{\min}$ is used.
- Viewed in graph terms, an edge is added between the nearest nonconnected components.
- Equivalent to Prim's minimum spanning tree algorithm.
- Terminates when the distance between the nearest clusters exceeds an arbitrary threshold.

The use of $d_{\min}$ as a distance measure together with agglomerative clustering generates a minimal spanning tree. Chaining effect: a defect of this distance measure (right).

The farthest-neighbor algorithm (complete linkage)
- $d_{\max}$ is used.
- This method discourages the growth of elongated clusters.
- In graph-theoretic terms, every cluster is a complete subgraph, and the distance between two clusters is determined by the most distant nodes in the 2 clusters.
- Terminates when the distance between the nearest clusters exceeds an arbitrary threshold.

When two clusters are merged, the graph is changed by adding edges between every pair of nodes in the 2 clusters. All procedures involving minima or maxima are sensitive to outliers; the use of $d_{mean}$ or $d_{avg}$ is a natural compromise.

The problem of the number of clusters

How many clusters should there be? For clustering by extremizing a criterion function:
- repeat the clustering with c=1, c=2, c=3, etc.;
- look for large changes in the criterion function.

Alternatively, state a threshold for the creation of a new cluster:
- useful for online cases;
- sensitive to the order of presentation of the data.

These approaches are similar to model selection procedures.

Graph-theoretic methods

Caveat: there is no uniform way of posing clustering as a graph-theoretic problem.

Generalize from a threshold distance to arbitrary similarity measures. If $s_0$ is a threshold value, we can say that $x_i$ is similar to $x_j$ if $s(x_i, x_j) > s_0$. We can define a similarity matrix $S = [s_{ij}]$:

$$s_{ij} = \begin{cases} 1 & \text{if } s(x_i, x_j) > s_0 \\ 0 & \text{otherwise} \end{cases}$$

This matrix induces a similarity graph, dual to S, in which nodes correspond to points and an edge joins nodes i and j iff $s_{ij} = 1$.
- Single-linkage algorithm: two samples x and x’ are in the same cluster if there exists a chain x, $x_1$, $x_2$, …, $x_k$, x’ such that x is similar to $x_1$, $x_1$ to $x_2$, and so on. The clusters are the connected components of the graph.
- Complete-link algorithm: all samples in a given cluster must be similar to one another, and no sample can be in more than one cluster.
- The nearest-neighbor algorithm is a method to find the minimum spanning tree, and vice versa. Removal of the longest edge produces a 2-cluster grouping, removal of the next longest edge produces a 3-cluster grouping, and so on.
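The single-linkage view can be sketched by finding the connected components of the similarity graph. In this illustration the similarity test is supplied by the caller; the function name `connected_components`, the distance threshold of 1.5 and the sample points are all assumptions made for the example.

```python
import math


def connected_components(points, similar):
    """Label each point with the index of its connected component.

    `similar(a, b)` should return True when an edge joins a and b.
    """
    n = len(points)
    labels = [-1] * n
    current = 0
    for start in range(n):
        if labels[start] != -1:
            continue
        labels[start] = current
        stack = [start]
        while stack:  # depth-first walk along chains of similar samples
            i = stack.pop()
            for j in range(n):
                if labels[j] == -1 and similar(points[i], points[j]):
                    labels[j] = current
                    stack.append(j)
        current += 1
    return labels


points = [(0.0,), (1.0,), (2.0,), (10.0,), (11.0,)]
# Here "similar" means closer than a hypothetical threshold distance of 1.5.
labels = connected_components(points, lambda a, b: math.dist(a, b) < 1.5)
print(labels)
```

The points at 0 and 2 end up together even though they are not directly similar, because a chain through the point at 1 links them.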

This is a divisive hierarchical procedure, and it suggests ways of dividing the graph into subgraphs: e.g., when selecting an edge to remove, compare its length with the lengths of the other edges incident on its nodes.

One useful statistic that can be estimated from the minimal spanning tree is the edge length distribution, for instance in the case of 2 dense clusters immersed in a sparse set of points.