1
Machine Learning Problems
Unsupervised Learning
– Clustering
– Density estimation
– Dimensionality reduction
Supervised Learning
– Classification
– Regression
2
Clustering
Definition: partition a given set of objects into M groups (clusters) such that the objects within each group are 'similar' to each other and 'different' from the objects of the other groups.
– A distance (or similarity) measure is required.
– Unsupervised learning: no class labels.
– Finding the optimal clustering is NP-hard.
Examples: documents, images, time series, image segmentation, video analysis, gene clustering, motif discovery, web applications.
Big issues: estimating the number of clusters; solutions are difficult to evaluate.
3
Clustering
Cluster assignments: hard vs. soft (fuzzy/probabilistic)
Clustering methods
– Hierarchical (agglomerative, divisive)
– Density-based (non-parametric)
– Parametric (k-means, mixture models, etc.)
Clustering input
– Data vectors
– Similarity matrix
4
Agglomerative Clustering
– The simplest approach and a good starting point.
– Starting from singleton clusters, at each step merge the two most similar clusters.
– A similarity (or distance) measure between clusters is needed.
– Output: a dendrogram.
– Drawback: merging decisions are permanent (they cannot be corrected at a later stage).
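A minimal sketch of agglomerative clustering with SciPy, not taken from the slides; the data, linkage method, and cut level are illustrative assumptions.

```python
# Illustrative sketch: agglomerative clustering and a dendrogram cut with SciPy.
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (20, 2)), rng.normal(5, 1, (20, 2))])

# 'average' linkage: cluster distance = mean pairwise distance between members.
Z = linkage(X, method='average', metric='euclidean')

# Cut the dendrogram to obtain a flat partition into 2 clusters.
labels = fcluster(Z, t=2, criterion='maxclust')
# dendrogram(Z)  # visualize the merge tree (requires matplotlib)
```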
5
Density-Based Clustering (e.g. DBSCAN)
– Identify 'dense regions' in the data space and merge neighboring dense regions.
– Points are labeled as core, border, or outlier (noise).
– Requires sufficiently many points to form dense regions.
– Complexity: O(n²).
– Parameters, set empirically (but how?): Eps (neighborhood radius, e.g. 1 cm) and MinPts (minimum number of neighbors, e.g. 5).
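An illustrative sketch with scikit-learn's DBSCAN; the data and the eps/min_samples values are assumptions, echoing the Eps and MinPts parameters above.

```python
# Illustrative sketch: DBSCAN with scikit-learn (parameter values are assumed).
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.3, (50, 2)), rng.normal(3, 0.3, (50, 2))])

db = DBSCAN(eps=0.5, min_samples=5).fit(X)   # eps ~ 'Eps', min_samples ~ 'MinPts'
labels = db.labels_                          # -1 marks outliers (noise points)
core_mask = np.zeros(len(X), dtype=bool)
core_mask[db.core_sample_indices_] = True    # core vs. border/outlier distinction
```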
6
Parametric Methods (n: the number of objects to be clustered)
– k-means (data vectors): O(n)
– k-medoids (similarity matrix): O(n²)
– Mixture models (data vectors): O(n)
– Spectral clustering (similarity matrix): O(n³)
– Kernel k-means (similarity matrix): O(n²)
– Affinity propagation (similarity matrix): O(n²)
7
k-means
– Partition a dataset X of N vectors x_i into M subsets (clusters) C_k such that the intra-cluster variance is minimized.
– Intra-cluster variance: the sum of squared distances from the cluster prototype m_k.
– In k-means the prototype is the cluster center.
– Finds local minima w.r.t. the clustering error (the sum of intra-cluster variances).
– Highly dependent on the initial positions of the centers m_k (see the sketch below).
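A minimal NumPy sketch of Lloyd's k-means, illustrating the clustering-error objective and the dependence on the initial centers; the function name and arguments are assumptions, not the slides' notation.

```python
# Minimal sketch of k-means (Lloyd's algorithm); different initial centers
# generally lead to different local minima of the clustering error.
import numpy as np

def kmeans(X, M, init_centers, n_iter=100):
    m = np.array(init_centers, dtype=float)               # M x d initial centers
    for _ in range(n_iter):
        # assign each point to its nearest center
        d = ((X[:, None, :] - m[None, :, :]) ** 2).sum(-1)  # N x M squared distances
        labels = d.argmin(1)
        # recompute each center as the mean of its cluster
        for k in range(M):
            if np.any(labels == k):
                m[k] = X[labels == k].mean(0)
    error = ((X - m[labels]) ** 2).sum()                   # sum of intra-cluster variances
    return labels, m, error

# Example: random initialization from the data points themselves
# labels, m, err = kmeans(X, 3, X[np.random.choice(len(X), 3, replace=False)])
```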
8
k-medoids
– Similar to k-means, but the representative is the cluster medoid: the cluster object with the smallest average distance to the other cluster objects.
– At each iteration the medoid is computed instead of the centroid.
– Increased complexity: O(n²).
– The medoid is more robust to outliers.
– k-medoids can be used with a similarity matrix.
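A hedged sketch of the two k-medoids update steps working directly on a precomputed distance matrix; the helper names are illustrative, not from a specific library.

```python
# Sketch: medoid computation and assignment from an n x n distance matrix D.
import numpy as np

def cluster_medoid(D, members):
    """Index of the object with the smallest average distance to the other
    objects of the same cluster (members: array of object indices)."""
    sub = D[np.ix_(members, members)]
    return members[sub.mean(axis=1).argmin()]

def assign_to_medoids(D, medoids):
    """Assign every object to its closest medoid according to D."""
    return np.array(medoids)[D[:, medoids].argmin(axis=1)]
```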
9
k-means (1) vs k-medoids (2)
10
Spectral Clustering (Ng & Jordan, NIPS 2001)
– Input: similarity matrix between pairs of objects and the number of clusters M.
– Example similarity: a(x,y) = exp(-||x-y||²/σ²) (RBF kernel).
– Spectral analysis of the similarity matrix: compute the top M eigenvectors and form the matrix U.
– The i-th object corresponds to a vector in R^M: the i-th row of U.
– The rows of U are clustered into M clusters using k-means (see the sketch below).
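A rough sketch of the pipeline as described on this slide; note the full Ng & Jordan algorithm additionally uses a normalized Laplacian and row normalization, which are omitted here.

```python
# Sketch: RBF similarity -> top-M eigenvectors -> k-means on the rows of U.
import numpy as np
from sklearn.cluster import KMeans

def spectral_clustering(X, M, sigma=1.0):
    # RBF similarity matrix a(x, y) = exp(-||x - y||^2 / sigma^2)
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    A = np.exp(-sq / sigma ** 2)
    # top-M eigenvectors of the similarity matrix form U (one row per object)
    w, v = np.linalg.eigh(A)
    U = v[:, np.argsort(w)[-M:]]
    # cluster the rows of U with k-means
    return KMeans(n_clusters=M, n_init=10).fit_predict(U)
```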
11
Spectral Clustering example (figure): two-rings dataset, comparing k-means with spectral clustering (RBF kernel, σ=1).
12
Spectral Clustering ↔ Graph Cut
– Data graph: vertices are the objects, edge weights are the pairwise similarities.
– Clustering = graph partitioning.
13
Spectral Clustering ↔ Graph Cut
– Cluster indicator vector z_i = (0,0,…,0,1,0,…,0)^T for object i.
– Indicator matrix Z with rows z_i^T (n×k, for k clusters), normalized so that Z^T Z = I.
– Graph partitioning = trace maximization w.r.t. Z: maximize trace(Z^T A Z), where A is the similarity matrix.
– The relaxed problem, maximize trace(Y^T A Y) subject to Y^T Y = I over real-valued Y, is solved optimally by the spectral algorithm (Y = the top-k eigenvectors of A).
– k-means is then applied to the rows y_ij of Y to recover the discrete assignments z_ij.
14
Kernel-Based Clustering (non-linear cluster separation)
– Given a set of objects and the kernel matrix K = [K_ij] containing the similarities between each pair of objects.
– Goal: partition the dataset into subsets (clusters) C_k such that the intra-cluster similarity is maximized.
– Kernel trick: data points are mapped from the input space to a higher-dimensional feature space through a transformation φ(x).
– RBF kernel: K(x,y) = exp(-||x-y||²/σ²).
15
Kernel k-Means
– Kernel k-means = k-means in feature space: it minimizes the clustering error in feature space.
– Differences from k-means:
– Cluster centers m_k in feature space cannot be computed explicitly; each cluster C_k is described by its data objects.
– Distances from the centers in feature space are computed through the kernel matrix:
  ||φ(x_i) - m_k||² = K_ii - (2/|C_k|) Σ_{j∈C_k} K_ij + (1/|C_k|²) Σ_{j,l∈C_k} K_jl
– Finds local minima; strong dependence on the initial partition (see the sketch below).
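A sketch of the distance computation above, using only the kernel matrix; the function name and looping style are illustrative assumptions.

```python
# Sketch: kernel k-means distances from the kernel matrix K alone, using
# ||phi(x_i) - m_k||^2 = K_ii - (2/|C_k|) sum_{j in C_k} K_ij
#                       + (1/|C_k|^2) sum_{j,l in C_k} K_jl
import numpy as np

def kernel_kmeans_distances(K, labels, M):
    n = K.shape[0]
    dist = np.empty((n, M))
    for k in range(M):
        idx = np.where(labels == k)[0]
        size = len(idx)                                      # assumes non-empty cluster
        cross = K[:, idx].sum(axis=1) * (2.0 / size)         # cross term, per point
        within = K[np.ix_(idx, idx)].sum() / (size ** 2)     # constant for cluster k
        dist[:, k] = np.diag(K) - cross + within
    return dist   # assign each point to argmin over k, then repeat until stable
```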
16
Spectral Relaxation of Kernel k-means
(Dhillon, I.S., Guan, Y., Kulis, B., Weighted graph cuts without eigenvectors: A multilevel approach, IEEE TPAMI, 2007)
– The kernel k-means clustering error can be written as a constant minus a trace term, so its relaxation leads to the same spectral problem.
– Spectral methods can substitute for kernel k-means and vice versa.
17
Exemplar-Based Methods
Cluster the data by identifying representative exemplars
– An exemplar is an actual dataset point, similar to a medoid.
– All data points are considered as possible exemplars.
– The number of clusters is decided during learning (but depends on a user-defined parameter).
Methods
– Convex mixture models
– Affinity propagation
18
Affinity Propagation (AP) (Frey & Dueck, Science 2007)
Clusters the data by identifying representative exemplars
– Exemplars are identified by transmitting messages between data points.
Input to the algorithm
– A similarity matrix, where s(i,k) indicates how well data point x_k is suited to be an exemplar for data point x_i.
– Self-similarities s(k,k) that control the number of identified clusters: they are independent of the other similarities, a higher value means that x_k is more likely to become an exemplar, and higher values therefore result in more clusters.
19
Affinity Propagation
Clustering criterion
– Choose an exemplar c_i for every point so that the total similarity Σ_i s(i, c_i) between each data point x_i and its exemplar is maximized.
– The criterion is optimized by passing messages between data points, called responsibilities and availabilities.
Responsibility r(i,k)
– Sent from x_i to candidate exemplar x_k; reflects the accumulated evidence for how well suited x_k is to serve as the exemplar of x_i, taking into account other potential exemplars for x_i.
20
Affinity Propagation
Availability a(i,k)
– Sent from candidate exemplar x_k to x_i; reflects the accumulated evidence for how appropriate it would be for x_i to choose x_k as its exemplar, taking into account the support from other points that x_k should be an exemplar.
– The algorithm alternates between responsibility and availability updates until convergence.
– The exemplars are the points with r(k,k) + a(k,k) > 0.
– http://www.psi.toronto.edu/index.php?q=affinity%20propagation
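An illustrative usage sketch with scikit-learn's AffinityPropagation; the similarity choice (negative squared Euclidean distance), the preference value, and the damping factor are assumptions, with the preference playing the role of the self-similarities s(k,k) above.

```python
# Sketch: affinity propagation on a precomputed similarity matrix.
import numpy as np
from sklearn.cluster import AffinityPropagation

X = np.random.default_rng(2).normal(size=(60, 2))
S = -((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)   # s(i,k) = -||x_i - x_k||^2

ap = AffinityPropagation(affinity='precomputed',
                         preference=np.median(S),     # higher -> more clusters
                         damping=0.9, random_state=0).fit(S)
exemplars = ap.cluster_centers_indices_               # indices of exemplar points
labels = ap.labels_
```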
21
Affinity Propagation
22
Incremental Clustering: Bisecting k-means (Steinbach, Karypis & Kumar, SIGKDD 2000)
– Start with k=1 (m_1 = the data average).
– Given a solution with k clusters:
– Find the 'best' cluster and split it into two subclusters.
– Replace that cluster's center with the two subcluster centers.
– Run k-means with k+1 centers (optional); set k := k+1.
– Repeat until M clusters have been obtained.
Splitting a cluster uses several random trials; in each trial:
– Randomly initialize two centers from the cluster points.
– Run 2-means using the cluster points only.
– Keep the split of the trial with the lowest clustering error (see the sketch below).
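A hedged sketch of bisecting k-means; choosing the largest cluster as the 'best' one to split and the number of trials are assumptions, since the slide leaves the selection rule open.

```python
# Sketch: bisecting k-means using several random 2-means trials per split.
import numpy as np
from sklearn.cluster import KMeans

def bisecting_kmeans(X, M, n_trials=5):
    clusters = [np.arange(len(X))]               # start with one cluster (k = 1)
    while len(clusters) < M:
        # pick the largest cluster to split (one possible 'best cluster' rule)
        i = max(range(len(clusters)), key=lambda j: len(clusters[j]))
        pts = clusters.pop(i)
        best = None
        for t in range(n_trials):
            km = KMeans(n_clusters=2, n_init=1, random_state=t).fit(X[pts])
            if best is None or km.inertia_ < best.inertia_:
                best = km                        # keep the trial with lowest error
        clusters += [pts[best.labels_ == 0], pts[best.labels_ == 1]]
    return clusters                              # list of index arrays, one per cluster
```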
23
Global k-means (Likas, Vlassis & Verbeek, PR 2003)
– Incremental, deterministic clustering algorithm that runs k-means several times.
– Finds near-optimal solutions w.r.t. the clustering error.
– Idea: a near-optimal solution for k clusters can be obtained by running k-means from an initial state where
– the k-1 centers are initialized from a near-optimal solution of the (k-1)-clustering problem, and
– the k-th center is initialized at some data point x_n (which one?).
– Consider all possible initializations (one for each x_n).
24
Global k-means
In order to solve the M-clustering problem:
1. Solve the 1-clustering problem (trivial).
2. Solve the k-clustering problem using the solution of the (k-1)-clustering problem:
– Execute k-means N times; at the n-th run (n=1,…,N) it is initialized with the k-1 centers of the (k-1)-clustering solution plus the data point x_n as the k-th center.
– Keep the solution of the run with the lowest clustering error as the solution with k clusters.
– Set k := k+1.
3. Repeat step 2 until k = M (see the sketch below).
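A hedged sketch of the incremental loop above, using scikit-learn's KMeans with explicit initial centers; the exact implementation in the paper may differ, and the N runs per added center make this expensive on large datasets.

```python
# Sketch: global k-means, trying every data point as the initial k-th center.
import numpy as np
from sklearn.cluster import KMeans

def global_kmeans(X, M):
    centers = X.mean(axis=0, keepdims=True)           # k = 1: the data average
    for k in range(2, M + 1):
        best_err, best_centers = np.inf, None
        for x_n in X:                                  # one run per candidate point
            init = np.vstack([centers, x_n])
            km = KMeans(n_clusters=k, init=init, n_init=1, max_iter=100).fit(X)
            if km.inertia_ < best_err:
                best_err, best_centers = km.inertia_, km.cluster_centers_
        centers = best_centers                         # solution with k clusters
    return centers
```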
25
Figure: the best initial positions of the added centers m_2, m_3, m_4, m_5 at successive steps of global k-means.
27
Fast Global k-Means
How is the complexity reduced?
– Select the initial state with the greatest reduction in clustering error in the first iteration of k-means (this reduction can be computed analytically).
– k-means is then executed only once, from this state.
– The set of candidate initial points can also be restricted (kd-tree, summarization).
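A sketch of the analytic selection rule, under the assumption that the guaranteed first-iteration reduction for candidate x_n is b_n = Σ_j max(d_j - ||x_n - x_j||², 0), with d_j the squared distance of x_j to its closest current center; the function name is illustrative.

```python
# Sketch: pick the candidate k-th center with the largest guaranteed reduction
# of the clustering error, then run k-means once from that initial state.
import numpy as np

def best_new_center(X, centers):
    d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1).min(axis=1)  # d_j
    pair = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)                 # ||x_n - x_j||^2
    b = np.maximum(d[None, :] - pair, 0).sum(axis=1)                      # b_n per candidate
    return X[b.argmax()]   # use [centers, best_new_center] as the single initial state
```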
28
Global Kernel k-Means (Tzortzis & Likas, IEEE TNN 2009)
In order to solve the M-clustering problem:
1. Solve the 1-clustering problem with kernel k-means (trivial solution).
2. Solve the k-clustering problem using the solution of the (k-1)-clustering problem:
a) Let {C_1,…,C_(k-1)} denote the solution to the (k-1)-clustering problem.
b) Execute kernel k-means N times; the n-th run is initialized with the k-1 clusters of that solution and the n-th data point placed in a new singleton cluster.
c) Keep the run with the lowest clustering error as the solution with k clusters.
d) Set k := k+1.
3. Repeat step 2 until k = M.
Notes
– The fast global kernel k-means variant can also be applied.
– Representative data points can be selected using convex mixture models.
29
Figure: the best initial clusters C_2, C_3, C_4; blue circles mark the optimal initialization of the cluster to be added (RBF kernel: K(x,y) = exp(-||x-y||²/σ²)).
31
Clustering Methods: Summary
– Usually we assume that the number of clusters is given.
– k-means is still the most widely used method.
– Mixture models can be used when lots of data are available.
– Spectral clustering (or kernel k-means) is the most popular choice when a similarity matrix is given.
– Beware of the parameter initialization problem!
– The absence of ground truth makes evaluation difficult.
– How could we estimate the number of clusters?