Clustering
- Definition: partition a given set of objects into M groups (clusters) such that the objects within each group are 'similar' to one another and 'different' from the objects of the other groups.
- A distance (or similarity) measure is required.
- Unsupervised learning: no class labels are available.
- Clustering is NP-hard in general (equivalent to graph partitioning).
- Example applications: documents, images, time series, image segmentation, video analysis, gene clustering, motif discovery, web applications.

Clustering
- Cluster assignments: hard vs. soft (fuzzy/probabilistic)
- Clustering methods: hierarchical (agglomerative, divisive), density-based (non-parametric), parametric (k-means, mixture models, etc.)
- Input to the clustering problem: data vectors or a similarity matrix

Agglomerative Clustering
- The simplest approach and a good starting point
- Starting from singleton clusters, at each step we merge the two most similar clusters
- A similarity (or distance) measure between clusters is needed
- Output: a dendrogram
- Drawbacks: merging decisions are permanent (cannot be corrected at a later stage); cubic complexity
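To make the agglomerative procedure concrete, here is a minimal sketch using SciPy's hierarchical clustering; the toy dataset and the choice of 'average' linkage are assumptions made only for illustration:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster, dendrogram

# toy data: two well-separated blobs (assumed for illustration)
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (20, 2)), rng.normal(6, 1, (20, 2))])

# agglomerative clustering: start from singletons, repeatedly merge the
# two closest clusters; 'average' linkage = mean pairwise distance
Z = linkage(X, method="average")

# cut the dendrogram to obtain a flat partition into 2 clusters
labels = fcluster(Z, t=2, criterion="maxclust")

# dendrogram(Z) visualizes the full merge hierarchy
```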

Density-Based Clustering (e.g. DBSCAN)
- Identify 'dense regions' in the data space and merge neighboring dense regions
- A dense region requires a minimum number of points: at least MinPts points within radius Eps (e.g. Eps = 1 cm, MinPts = 5; both are set empirically, but how?)
- Points are labeled as core, border, or outlier (noise)
- Complexity: O(n²)
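A minimal DBSCAN sketch with scikit-learn; the eps and min_samples values play the role of Eps and MinPts above and are assumptions that must be tuned per dataset:

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (100, 2)), rng.normal(4, 0.3, (100, 2))])

# eps = neighborhood radius, min_samples = MinPts (both set empirically)
db = DBSCAN(eps=0.5, min_samples=5).fit(X)

labels = db.labels_                   # cluster id per point; -1 marks outliers (noise)
core = np.zeros(len(X), dtype=bool)
core[db.core_sample_indices_] = True  # core points; clustered non-core points are border points
```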

Parametric Methods
- k-means (data vectors): O(n) (n: the number of objects to be clustered)
- k-medoids (similarity matrix): O(n²)
- Mixture models (data vectors): O(n)
- Spectral clustering (similarity matrix): O(n³)
- Kernel k-means (similarity matrix): O(n²)
- Affinity propagation (similarity matrix): O(n²)

k-means
- Partition a dataset X of N vectors x_i into M subsets (clusters) C_k such that the intra-cluster variance is minimized
- Intra-cluster variance: sum of squared distances from the cluster prototype m_k
- In k-means the prototype is the cluster center (mean)
- Finds local minima with respect to the clustering error (the sum of intra-cluster variances)
- The result depends strongly on the initial positions of the centers m_k
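A minimal NumPy sketch of the standard k-means (Lloyd) iteration; the random choice of initial centers is exactly what makes the outcome depend on the starting positions of the m_k:

```python
import numpy as np

def kmeans(X, M, n_iter=100, seed=0):
    """Plain k-means: alternate assignment and center updates (reaches a local minimum)."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), M, replace=False)]      # random initial prototypes m_k
    for _ in range(n_iter):
        # assignment step: each point joins the cluster of its nearest center
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = d.argmin(axis=1)
        # update step: each center becomes the mean of its assigned points
        new_centers = np.array([X[labels == k].mean(axis=0) if np.any(labels == k)
                                else centers[k] for k in range(M)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    error = ((X - centers[labels]) ** 2).sum()              # sum of intra-cluster variances
    return labels, centers, error
```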

Spectral Clustering (Ng & Jordan, NIPS 2001)
- Input: similarity matrix between pairs of objects, number of clusters M
- Example similarity: a(x,y) = exp(-||x-y||²/σ²) (RBF kernel)
- Spectral analysis of the similarity matrix: compute the top M eigenvectors and form the matrix U
- The i-th object corresponds to a vector in R^M: the i-th row of U
- The rows of U are clustered into M clusters using k-means
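A sketch of the procedure in NumPy; the degree normalization and the row normalization of U follow the Ng & Jordan paper and go slightly beyond the simplified description above:

```python
import numpy as np
from sklearn.cluster import KMeans

def spectral_clustering(X, M, sigma=1.0):
    """Spectral clustering sketch in the style of Ng & Jordan (2001)."""
    # RBF similarity a(x, y) = exp(-||x - y||^2 / sigma^2)
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    A = np.exp(-sq / sigma**2)
    np.fill_diagonal(A, 0.0)
    # normalized affinity D^{-1/2} A D^{-1/2} (the paper's normalization)
    d = A.sum(axis=1)
    L = A / np.sqrt(np.outer(d, d))
    # top M eigenvectors form the matrix U (one row per object)
    _, vecs = np.linalg.eigh(L)
    U = vecs[:, -M:]
    U = U / np.linalg.norm(U, axis=1, keepdims=True)   # row-normalize
    # cluster the rows of U with k-means
    return KMeans(n_clusters=M, n_init=10).fit_predict(U)
```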

Spectral Clustering example (figure): on a two-rings dataset, k-means fails to separate the rings, whereas spectral clustering with an RBF kernel (σ = 1) recovers them.

Spectral Clustering ↔ Graph Cut
- Data graph: vertices are the objects, edge weights are the pairwise similarities
- Clustering = graph partitioning

Spectral Clustering ↔ Graph Cut
- Cluster indicator vector z_i = (0,0,…,0,1,0,…,0)^T for object i
- Indicator matrix Z = [z_1,…,z_n] (n×k, for k clusters), with Z^T Z = I
- Graph partitioning is equivalent to trace maximization with respect to Z (see the formulation sketched below)
- The relaxed problem (dropping the discrete indicator constraint) is solved optimally by the spectral algorithm to obtain Y
- k-means is then applied to the rows y_i of Y to obtain the discrete assignments z_i
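The objective referred to above can be written out as follows (a standard formulation; A denotes the affinity/similarity matrix, and the exact normalization depends on the cut criterion used):

```latex
% graph partitioning as trace maximization over normalized indicator matrices
\max_{Z}\ \operatorname{tr}\!\left(Z^{\top} A Z\right)
\quad\text{s.t.}\quad Z^{\top} Z = I,\ \ Z \text{ a (normalized) cluster indicator matrix}

% relaxation: let Z range over all orthonormal n-by-k matrices Y;
% the optimum is attained by the top-k eigenvectors of A,
% and k-means on the rows of Y recovers discrete assignments z_i
```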

Kernel-Based Clustering (non-linear cluster separation)
- Given a set of objects and the kernel matrix K = [K_ij] containing the similarities between each pair of objects
- Goal: partition the dataset into subsets (clusters) C_k such that intra-cluster similarity is maximized
- Kernel trick: data points are mapped from the input space to a higher-dimensional feature space through a transformation φ(x)
- RBF kernel: K(x,y) = exp(-||x-y||²/σ²)

Kernel k-Means
- Kernel k-means = k-means in feature space; it minimizes the clustering error in feature space
- Differences from k-means:
  - The cluster centers m_k in feature space cannot be computed explicitly; each cluster C_k is described by its data objects
  - Distances from the centers in feature space are computed through the kernel: ||φ(x_i) − m_k||² = K_ii − (2/|C_k|) Σ_{j∈C_k} K_ij + (1/|C_k|²) Σ_{j,l∈C_k} K_jl
- Finds local minima; strong dependence on the initial partition
- Spectral clustering and kernel k-means optimize the same objective function, but in a different way, so they may converge to different local optima
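A minimal kernel k-means sketch in NumPy that uses only the kernel matrix K; it implements the feature-space distance formula given above (the random initial partition is an assumption, and in practice several restarts are used because of the dependence on initialization):

```python
import numpy as np

def kernel_kmeans(K, M, n_iter=100, seed=0):
    """Kernel k-means sketch: k-means in feature space using only the kernel matrix K."""
    n = K.shape[0]
    rng = np.random.default_rng(seed)
    labels = rng.integers(0, M, size=n)              # random initial partition
    for _ in range(n_iter):
        dist = np.full((n, M), np.inf)
        for k in range(M):
            idx = np.flatnonzero(labels == k)
            if len(idx) == 0:
                continue
            # ||phi(x_i) - m_k||^2 = K_ii - (2/|C_k|) sum_j K_ij + (1/|C_k|^2) sum_{j,l} K_jl
            dist[:, k] = (np.diag(K)
                          - 2.0 * K[:, idx].mean(axis=1)
                          + K[np.ix_(idx, idx)].mean())
        new_labels = dist.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
    return labels
```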

Exemplar-Based Methods
- Cluster the data by identifying representative exemplars
- An exemplar is an actual dataset point, similar to a medoid
- All data points are considered as possible exemplars
- The number of clusters is decided during learning (but it depends on a user-defined parameter)
- Methods: convex mixture models, affinity propagation

Affinity Propagation (AP) (Frey & Dueck, Science 2007)
- Clusters the data by identifying representative exemplars
- Exemplars are identified by transmitting messages between data points
- Input to the algorithm: a similarity matrix where s(i,k) indicates how well data point x_k is suited to be an exemplar for data point x_i
- Self-similarities s(k,k) control the number of identified clusters; they are set independently of the other similarities, and a higher value means that x_k is more likely to become an exemplar, so higher values result in more clusters

Affinity Propagation
- Clustering criterion: maximize Σ_i s(i, c_i), where s(i, c_i) is the similarity between data point x_i and its exemplar c_i
- The criterion is optimized by passing messages between data points, called responsibilities and availabilities
- Responsibility r(i,k): sent from x_i to candidate exemplar x_k; reflects the accumulated evidence for how well suited x_k is to serve as the exemplar of x_i, taking into account other potential exemplars for x_i

Affinity Propagation
- Availability a(i,k): sent from candidate exemplar x_k to x_i; reflects the accumulated evidence for how appropriate it would be for x_i to choose x_k as its exemplar, taking into account the support from other points that x_k should be an exemplar
- The algorithm alternates between responsibility and availability updates
- The exemplars are the points with r(k,k) + a(k,k) > 0
- Reference implementation: http://www.psi.toronto.edu/index.php?q=affinity%20propagation
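For concreteness, a sketch of the message-passing updates from Frey & Dueck (2007) in NumPy; the damping factor, iteration count, and the final assignment rule are assumptions of this simplified version:

```python
import numpy as np

def affinity_propagation(S, n_iter=200, damping=0.5):
    """Affinity propagation sketch. S: similarity matrix with preferences s(k,k) on the diagonal."""
    n = S.shape[0]
    R = np.zeros((n, n))      # responsibilities r(i, k)
    A = np.zeros((n, n))      # availabilities a(i, k)
    rows = np.arange(n)
    for _ in range(n_iter):
        # r(i,k) = s(i,k) - max_{k' != k} [a(i,k') + s(i,k')]
        AS = A + S
        best = AS.argmax(axis=1)
        first = AS[rows, best]
        AS[rows, best] = -np.inf
        second = AS.max(axis=1)
        R_new = S - first[:, None]
        R_new[rows, best] = S[rows, best] - second
        R = damping * R + (1 - damping) * R_new
        # a(i,k) = min(0, r(k,k) + sum_{i' not in {i,k}} max(0, r(i',k))) for i != k
        # a(k,k) = sum_{i' != k} max(0, r(i',k))
        Rp = np.maximum(R, 0)
        np.fill_diagonal(Rp, R.diagonal())       # keep r(k,k) unclipped in the column sums
        A_new = Rp.sum(axis=0)[None, :] - Rp     # exclude the message from i itself
        diag = A_new.diagonal().copy()
        A_new = np.minimum(A_new, 0)
        np.fill_diagonal(A_new, diag)
        A = damping * A + (1 - damping) * A_new
    exemplars = np.flatnonzero(np.diag(R) + np.diag(A) > 0)   # r(k,k) + a(k,k) > 0
    labels = exemplars[S[:, exemplars].argmax(axis=1)]        # each point joins its most similar exemplar
    return exemplars, labels
```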

Affinity Propagation

Maximum-Margin Clustering (MMC)
- MMC extends the large-margin principle of SVMs to clustering (also called unsupervised SVM)
- SVM training: given a two-class dataset X = {(x_i, y_i)} (y_i = +1 or −1) and a kernel matrix K (with corresponding feature transformation φ(x)), find the maximum-margin separating hyperplane (w, b) in the feature space φ(x)

Maximum-Margin Clustering (MMC)
- MMC clustering: find the partition y of the dataset into two clusters such that the margin between the clusters is maximized (both objectives are sketched below)
- A cluster-balance constraint is imposed to avoid trivial solutions driven by outliers
- Finds local minima; high computational complexity; handles two clusters only
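A sketch of the two objectives being contrasted, in the standard soft-margin form (the trade-off parameter C and the balance bound ℓ are assumptions of this write-up):

```latex
% SVM training (labels y_i are given):
\min_{w,\,b,\,\xi}\ \tfrac{1}{2}\lVert w\rVert^2 + C\sum_i \xi_i
\quad\text{s.t.}\quad y_i\bigl(w^{\top}\varphi(x_i) + b\bigr) \ge 1 - \xi_i,\ \ \xi_i \ge 0

% maximum-margin clustering (the labels themselves are optimized over,
% with a cluster-balance constraint):
\min_{y \in \{-1,+1\}^n}\ \min_{w,\,b,\,\xi}\ \tfrac{1}{2}\lVert w\rVert^2 + C\sum_i \xi_i
\quad\text{s.t.}\quad y_i\bigl(w^{\top}\varphi(x_i) + b\bigr) \ge 1 - \xi_i,\ \ \xi_i \ge 0,\ \
-\ell \le \textstyle\sum_i y_i \le \ell
```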

Maximum-Margin Clustering (MMC) example (figure): MMC results compared with k-means solutions.

Incremental Clustering: Bisecting k-means (Steinbach, Karypis & Kumar, SIGKDD 2000)
- Start with k = 1 (m_1 = the data average)
- Repeat until M clusters have been obtained:
  - Given the current solution with k clusters, find the 'best' cluster to split into two subclusters
  - Replace that cluster's center with the two subcluster centers
  - Run k-means with the k+1 centers (optional); set k := k+1
- A cluster is split using several random trials: in each trial, two centers are randomly initialized from the cluster's points and 2-means is run using only those points; the split from the trial with the lowest clustering error is kept (a sketch follows)
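A sketch of bisecting k-means on top of scikit-learn's KMeans; choosing to split the cluster with the largest clustering error, and the number of random trials, are assumptions of this version:

```python
import numpy as np
from sklearn.cluster import KMeans

def bisecting_kmeans(X, M, n_trials=10, seed=0):
    """Bisecting k-means sketch: repeatedly split one cluster into two until M clusters exist."""
    rng = np.random.default_rng(seed)
    clusters = [np.arange(len(X))]       # k = 1: everything in one cluster
    while len(clusters) < M:
        # choose the cluster with the largest error (sum of squared distances to its mean)
        errors = [((X[idx] - X[idx].mean(axis=0)) ** 2).sum() for idx in clusters]
        idx = clusters.pop(int(np.argmax(errors)))
        # split it with several random 2-means trials, keep the split with the lowest error
        best_err, best_labels = np.inf, None
        for _ in range(n_trials):
            km = KMeans(n_clusters=2, n_init=1,
                        random_state=int(rng.integers(1 << 30))).fit(X[idx])
            if km.inertia_ < best_err:
                best_err, best_labels = km.inertia_, km.labels_
        clusters.append(idx[best_labels == 0])
        clusters.append(idx[best_labels == 1])
    labels = np.empty(len(X), dtype=int)
    for k, idx in enumerate(clusters):
        labels[idx] = k
    return labels
```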

Global k-means (Likas, Vlassis & Verbeek, PR 2003)
- An incremental, deterministic clustering algorithm that runs k-means several times
- Finds near-optimal solutions with respect to the clustering error
- Idea: a near-optimal solution with k clusters can be obtained by running k-means from an initial state in which the k−1 centers are taken from a near-optimal solution of the (k−1)-clustering problem and the k-th center is initialized at some data point x_n (but which one?)
- Consider all possible initializations, one for each data point x_n

Global k-means
To solve the M-clustering problem:
1. Solve the 1-clustering problem (trivial: the single center is the data mean)
2. Solve the k-clustering problem using the solution of the (k−1)-clustering problem: execute k-means N times, where at the n-th run (n = 1,…,N) the k−1 centers are initialized from the (k−1)-clustering solution and the k-th center is initialized at data point x_n; keep the run with the lowest clustering error as the solution with k clusters
3. Set k := k+1 and repeat step 2 until k = M
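A sketch of global k-means built on scikit-learn's KMeans; it follows the steps above literally (one k-means run per data point at each k), so it is meant only to illustrate the procedure, not as an efficient implementation:

```python
import numpy as np
from sklearn.cluster import KMeans

def global_kmeans(X, M):
    """Global k-means sketch (Likas, Vlassis & Verbeek, 2003): add one center at a time."""
    centers = X.mean(axis=0, keepdims=True)          # solution of the 1-clustering problem
    for k in range(2, M + 1):
        best_err, best_centers = np.inf, None
        # try every data point x_n as the initial position of the new k-th center
        for x_n in X:
            init = np.vstack([centers, x_n[None, :]])
            km = KMeans(n_clusters=k, init=init, n_init=1).fit(X)
            if km.inertia_ < best_err:
                best_err, best_centers = km.inertia_, km.cluster_centers_
        centers = best_centers                       # keep the run with the lowest clustering error
    labels = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1).argmin(axis=1)
    return labels, centers
```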

Figure: the best initial positions found by global k-means for the newly added center at each step (best initial m_2, m_3, m_4, m_5).

Clustering Big Datasets
- First perform a summarization of the big dataset using a low-cost algorithm, then cluster the summarized dataset using any clustering algorithm:
  - A linear clustering algorithm (e.g. k-means) or one-shot clustering is applied to the big dataset to initially partition it into a large number (e.g. thousands) of clusters
  - A representative (e.g. centroid or medoid) is selected for each cluster
  - The set of representatives is further clustered into a smaller number of clusters using any clustering method to obtain the final solution
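A two-stage sketch of this idea using scikit-learn; MiniBatchKMeans as the cheap summarizer and agglomerative clustering as the second stage are assumptions, and any other pair of methods could be substituted:

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans, AgglomerativeClustering

def two_stage_clustering(X, n_summaries=2000, M=10):
    """Big-data clustering sketch: cheap summarization first, then cluster the representatives."""
    # stage 1: low-cost partition of the big dataset into many small clusters
    n_summaries = min(n_summaries, len(X))
    summarizer = MiniBatchKMeans(n_clusters=n_summaries, n_init=3).fit(X)
    representatives = summarizer.cluster_centers_        # one centroid per small cluster
    # stage 2: cluster the representatives with any (possibly more expensive) method
    rep_labels = AgglomerativeClustering(n_clusters=M).fit_predict(representatives)
    # map every original point to the final label of its representative
    return rep_labels[summarizer.labels_]
```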

Clustering Methods: Summary
- Usually we assume that the number of clusters is given
- k-means is still the most widely used method
- Spectral clustering (or kernel k-means) is the most popular choice when a similarity matrix is given
- Beware of the parameter initialization problem!
- The absence of ground truth makes evaluation difficult
- How could we estimate the number of clusters?