Clustering Usman Roshan CS 675

Clustering
Suppose we want to cluster n vectors in R^d into two groups. Define C_1 and C_2 as the two groups. Our objective is to find the C_1 and C_2 that minimize

sum_{i=1}^{2} sum_{x in C_i} ||x - m_i||^2

where m_i is the mean of cluster C_i

Clustering
NP-hard even for 2-means
NP-hard even in the plane
K-means heuristic
–Popular and hard to beat
–Introduced in the 1950s and 1960s

K-means algorithm for two clusters
Input: vectors x_1, ..., x_n in R^d and a small convergence threshold ε
Algorithm:
1. Initialize: assign each x_i to C_1 or C_2 with equal probability and compute the means m_1 and m_2
2. Recompute clusters: assign x_i to C_1 if ||x_i - m_1|| < ||x_i - m_2||, otherwise assign it to C_2
3. Recompute the means m_1 and m_2
4. Compute the objective of the new clustering
5. If the decrease in the objective is smaller than ε then stop, otherwise go to step 2

K-means
Is it guaranteed to find the clustering that optimizes the objective?
No: it is only guaranteed to find a local optimum
We can prove that the objective decreases over successive iterations

Proof sketch of convergence of k-means
The objective never increases from one iteration to the next, by two inequalities:
–First inequality (reassignment step): assigning each x_j to its closest mean decreases the objective or leaves it unchanged
–Second inequality (mean update step): for a given cluster, its mean minimizes the squared error loss, so recomputing the means cannot increase the objective
Since the objective is bounded below by zero and never increases, it converges
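With notation introduced here for the sketch (J is the k-means objective and t indexes iterations), the two inequalities can be written as:

```latex
J(C, m) \;=\; \sum_{k=1}^{2} \sum_{x \in C_k} \lVert x - m_k \rVert^2,
\qquad
\underbrace{J\!\left(C^{(t+1)}, m^{(t)}\right) \;\le\; J\!\left(C^{(t)}, m^{(t)}\right)}_{\text{reassign each point to its closest mean}},
\qquad
\underbrace{J\!\left(C^{(t+1)}, m^{(t+1)}\right) \;\le\; J\!\left(C^{(t+1)}, m^{(t)}\right)}_{\text{each mean minimizes its cluster's squared error}}.
```

Chaining the two inequalities gives J^(t+1) ≤ J^(t); since J ≥ 0, the sequence of objective values converges.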

K-means clustering
K-means is the Expectation-Maximization (EM) solution if we assume the data are generated by a mixture of Gaussians
–EM: find the clustering of the data that maximizes the likelihood
–Unsupervised means no labels are given. Thus we iterate between estimating the expected and actual values of the parameters
PCA gives a relaxed solution to k-means clustering. Sketch of proof:
–Cast k-means clustering as a maximization problem
–Relax the cluster indicator variables to be continuous; the solution is then given by PCA
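The EM view can be sketched with a deliberately simplified mixture: two spherical Gaussians with equal weights and a fixed shared variance (a simplification of full EM, assumed here for brevity). As the variance shrinks, the soft responsibilities harden into k-means assignments.

```python
import numpy as np

def em_two_spherical_gaussians(X, var=1.0, n_iter=50):
    """EM for a mixture of two spherical Gaussians with equal weights and
    fixed shared variance `var` (a sketch, not a full Gaussian mixture)."""
    X = np.asarray(X, dtype=float)
    # Deterministic farthest-point initialization of the two means.
    means = np.stack([X[0], X[((X - X[0]) ** 2).sum(axis=1).argmax()]])
    for _ in range(n_iter):
        # E-step: responsibilities r_ik proportional to exp(-||x_i - m_k||^2 / (2 var)).
        d2 = ((X[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)
        logr = -d2 / (2.0 * var)
        logr -= logr.max(axis=1, keepdims=True)     # numerical stability
        r = np.exp(logr)
        r /= r.sum(axis=1, keepdims=True)
        # M-step: each mean is the responsibility-weighted average of the data.
        means = (r.T @ X) / r.sum(axis=0)[:, None]
    return r.argmax(axis=1), means
```

Replacing the soft responsibilities with hard 0/1 assignments recovers exactly the k-means updates from the earlier slide.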

K-means clustering
K-medians variant:
–Select as cluster center the median of the cluster
–Has the effect of minimizing L1 error
K-medoids variant:
–The cluster center is an actual datapoint (not the same as k-medians)
Both use algorithms similar to k-means and have similar local minima problems
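A tiny illustration (with made-up numbers) of why the median center minimizes L1 error, which is what makes k-medians robust to outliers:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 100.0])            # one outlier

def l1_error(center):
    """Sum of absolute deviations from a candidate cluster center."""
    return np.abs(x - center).sum()

# The median ignores the outlier's magnitude; the mean is dragged toward it.
med, mu = np.median(x), x.mean()
```

Here the median (3.0) gives a strictly smaller L1 error than the mean (22.0).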

Other clustering algorithms
Spherical k-means
–Normalize each datapoint (set its length to 1)
–Cluster by finding the center with minimum cosine angle to the cluster's points
–Uses an iterative algorithm similar to Euclidean k-means
–Converges, with similar proofs
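One iteration of spherical k-means might be sketched as follows (the data in the usage example is hypothetical): points and centers live on the unit sphere, assignment uses cosine similarity, and each new center is the normalized sum of its points.

```python
import numpy as np

def spherical_kmeans_step(X, centers):
    """One spherical k-means iteration: assign each unit-normalized point to
    the center with the largest cosine similarity (smallest angle), then set
    each center to the normalized sum of its assigned points."""
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)   # project onto unit sphere
    labels = (Xn @ centers.T).argmax(axis=1)            # max cosine = min angle
    new_centers = centers.copy()
    for k in range(len(centers)):
        if np.any(labels == k):                         # keep an empty cluster's center
            s = Xn[labels == k].sum(axis=0)
            new_centers[k] = s / np.linalg.norm(s)
    return labels, new_centers
```

Iterating this step to convergence mirrors the Euclidean algorithm, with cosine similarity replacing Euclidean distance.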

Other clustering algorithms
Hierarchical clustering
1. Initialize n clusters where each datapoint is its own cluster
2. Merge the two nearest clusters into one
3. Update the distances from the new cluster to the existing clusters
4. Repeat steps 2 and 3 until k clusters remain

Other clustering algorithms
Graph clustering (spectral clustering)
–Find the minimum-cost cut that bipartitions the data
–NP-hard if we require the minimum cut to produce two clusters of similar size
–Relaxing this problem leads to spectral clustering
–Based on the Laplacian of the similarity graph and the eigenvalues and eigenvectors of the similarity matrix
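The relaxation can be sketched in a few lines: build a similarity graph, form its (unnormalized) Laplacian, and split on the sign of the Fiedler vector, i.e. the eigenvector of the second-smallest eigenvalue. The Gaussian similarity and the unnormalized Laplacian are illustrative choices; normalized variants are common too.

```python
import numpy as np

def spectral_bipartition(X, sigma=1.0):
    """Relaxed minimum-cut bipartition: threshold the Fiedler vector of the
    graph Laplacian built from Gaussian similarities."""
    # Gaussian similarity matrix W and the unnormalized Laplacian L = D - W.
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    W = np.exp(-d2 / (2.0 * sigma ** 2))
    L = np.diag(W.sum(axis=1)) - W
    # eigh returns eigenvalues in ascending order; column 1 is the Fiedler vector.
    vals, vecs = np.linalg.eigh(L)
    fiedler = vecs[:, 1]
    return (fiedler > 0).astype(int)
```

For two well-separated groups the Fiedler vector is nearly constant within each group, with opposite signs, so the sign split recovers the bipartition.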

Application to population structure data
The data are vectors x_i where each feature takes the value 0, 1, or 2, denoting the number of alleles of a particular single nucleotide polymorphism (SNP)
The output y_i is an integer indicating the population group the vector belongs to

Publicly available real data
Datasets (Noah Rosenberg's lab):
–East Asian admixture: 10 individuals from Cambodia, 15 from Siberia, 49 from China, and 16 from Japan; 459,188 SNPs
–African admixture: 32 Biaka Pygmy individuals, 15 Mbuti Pygmy, 24 Mandenka, 25 Yoruba, 7 San from Namibia, 8 Bantu of South Africa, and 12 Bantu of Kenya; 454,732 SNPs
–Middle Eastern admixture: 43 Druze from Israel-Carmel, 47 Bedouins from Israel-Negev, 26 Palestinians from Israel-Central, and 30 Mozabite from Algeria-Mzab; 438,596 SNPs

East Asian populations

African populations

Middle Eastern populations

K-means applied to PCA data
–PCA and kernel PCA (polynomial kernel of degree 2)
–K-means and spherical k-means
–Number of clusters set to the true value
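The pipeline might be sketched as below, for plain PCA and Euclidean k-means only (the kernel-PCA and spherical variants follow the same pattern). The toy genotype matrix in the test is hypothetical, with features coded 0/1/2 as on the earlier slide.

```python
import numpy as np

def pca_then_kmeans2(X, n_components=2, n_iter=20):
    """Project the centered data onto the top principal components, then
    run two-cluster Lloyd k-means on the projection (a sketch)."""
    # PCA via SVD of the centered data matrix; rows of Vt are the PCs.
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    Z = Xc @ Vt[:n_components].T                   # PCA scores
    # Deterministic farthest-point initialization of the two centers.
    centers = np.stack([Z[0], Z[((Z - Z[0]) ** 2).sum(axis=1).argmax()]])
    for _ in range(n_iter):
        d = ((Z[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        centers = np.stack([Z[labels == c].mean(axis=0) if np.any(labels == c)
                            else centers[c] for c in (0, 1)])
    return labels
```

On SNP-like data with two distinct groups, the top principal component separates the populations and k-means recovers them.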