Clustering: k-means clustering. Genome 559: Introduction to Statistical and Computational Genomics. Elhanan Borenstein.

Clustering: k-means clustering Genome 559: Introduction to Statistical and Computational Genomics Elhanan Borenstein

A quick review
 The clustering problem: partition genes into distinct sets with high homogeneity and high separation
 Clustering (unsupervised) vs. classification
 Clustering methods: agglomerative vs. divisive; hierarchical vs. non-hierarchical
 Hierarchical clustering algorithm:
1. Assign each object to a separate cluster.
2. Find the pair of clusters with the shortest distance, and regroup them into a single cluster.
3. Repeat 2 until there is a single cluster.
 Many possible distance metrics
 The choice of metric matters
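The three-step hierarchical algorithm reviewed above can be sketched in a few lines of plain Python. This is an illustrative sketch, not code from the lecture: the names (`agglomerate`, `linkage`) are my own, and it uses single linkage (closest pair of members) as the cluster-to-cluster distance, one of several reasonable choices.

```python
import math

def dist(p, q):
    # Euclidean distance between two 2-D points.
    return math.hypot(p[0] - q[0], p[1] - q[1])

def linkage(c1, c2):
    # Single linkage: distance between the closest pair of members.
    return min(dist(p, q) for p in c1 for q in c2)

def agglomerate(points):
    """Steps 1-3 above: start with singleton clusters, repeatedly merge
    the closest pair, and record each merge until one cluster remains."""
    clusters = [[p] for p in points]              # step 1: one object per cluster
    merges = []
    while len(clusters) > 1:                      # step 3: repeat until one cluster
        i, j = min(                               # step 2: closest pair of clusters
            ((a, b) for a in range(len(clusters))
                    for b in range(a + 1, len(clusters))),
            key=lambda ab: linkage(clusters[ab[0]], clusters[ab[1]]))
        merges.append((clusters[i], clusters[j]))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return merges

merges = agglomerate([(0, 0), (0, 1), (5, 5), (5, 6)])
# The two tight pairs merge first; the final merge joins everything.
```

The recorded merge sequence is exactly the information a dendrogram visualizes.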

K-means clustering: a non-hierarchical (partitional) method, in contrast to the agglomerative hierarchical methods above

K-means clustering
 An algorithm for partitioning n observations/points into k clusters such that each observation belongs to the cluster with the nearest mean/center
 Isn't this a somewhat circular definition?
 Assignment of a point to a cluster is based on the proximity of the point to the cluster mean
 But the cluster mean is calculated based on all the points assigned to the cluster
(Figure: two clouds of points, with the cluster_1 mean and cluster_2 mean marked)

K-means clustering: chicken and egg
 An algorithm for partitioning n observations/points into k clusters such that each observation belongs to the cluster with the nearest mean/center
 The chicken-and-egg problem: I do not know the means before I determine the partitioning into clusters; I do not know the partitioning into clusters before I determine the means
 Key principle - cluster around mobile centers: start with some random locations of the means/centers, partition into clusters according to these centers, and then correct the centers according to the clusters (somewhat similar to the expectation-maximization algorithm)

K-means clustering algorithm
 The number of centers, k, has to be specified a priori
 Algorithm:
1. Arbitrarily select k initial centers
2. Assign each element to the closest center
3. Re-calculate centers (mean position of the assigned elements)
4. Repeat 2 and 3 until one of the following termination conditions is reached:
i. The clusters are the same as in the previous iteration
ii. The difference between two iterations is smaller than a specified threshold
iii. The maximum number of iterations has been reached
 How can we do step 2 efficiently?
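The four steps above translate almost line-for-line into plain Python. A minimal sketch (the names `kmeans`, `closest`, and `mean` are illustrative, not from the lecture); it implements termination conditions i and iii, and keeps a center unchanged if its cluster ever becomes empty:

```python
import math

def closest(point, centers):
    # Index of the nearest center (Euclidean distance).
    return min(range(len(centers)), key=lambda i: math.dist(point, centers[i]))

def mean(cluster):
    # Mean position of the points assigned to a cluster.
    n = len(cluster)
    return (sum(x for x, _ in cluster) / n, sum(y for _, y in cluster) / n)

def kmeans(points, centers, max_iter=100):
    clusters = [[] for _ in centers]
    for _ in range(max_iter):                 # termination condition iii
        # Step 2: assign each element to the closest center.
        clusters = [[] for _ in centers]
        for p in points:
            clusters[closest(p, centers)].append(p)
        # Step 3: re-calculate centers (an empty cluster keeps its old one).
        new = [mean(c) if c else centers[i] for i, c in enumerate(clusters)]
        if new == centers:                    # termination condition i
            break
        centers = new
    return centers, clusters

centers, clusters = kmeans([(0, 0), (0, 1), (5, 5), (5, 6)],
                           [(0, 0), (5, 5)])
```

On these four points the algorithm stabilizes after a single update, with the centers at the two pair midpoints.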

Partitioning the space
 Assigning elements to the closest center
(Figure build: with two centers, A and B, the plane splits into the region of points closer to A and the region closer to B; adding a third center C refines the partition into the regions of points closest to A, closest to B, and closest to C)

Voronoi diagram
 Decomposition of a metric space determined by distances to a specified discrete set of "centers" in the space
 Each colored cell represents the collection of all points in this space that are closer to a specific center s than to any other center
 Several algorithms exist to compute the Voronoi diagram efficiently
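The Voronoi partition can be sampled directly by labeling each point of a grid with its nearest center, which is exactly the assignment step of k-means. This brute-force sketch (the name `voronoi_labels` is illustrative) scans all centers per point; dedicated Voronoi algorithms such as Fortune's sweep-line avoid that per-point scan:

```python
import math

def voronoi_labels(grid, centers):
    """Map each grid point to the index of its nearest center,
    i.e. sample the Voronoi cell each point falls in."""
    return {
        p: min(range(len(centers)), key=lambda i: math.dist(p, centers[i]))
        for p in grid
    }

# Sample the two Voronoi cells induced by centers at opposite corners.
grid = [(x, y) for x in range(5) for y in range(5)]
labels = voronoi_labels(grid, [(0, 0), (4, 4)])
```

Points below the diagonal boundary get label 0, points above it label 1; the boundary itself is equidistant from both centers.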

K-means clustering algorithm
 The number of centers, k, has to be specified a priori
 Algorithm:
1. Arbitrarily select k initial centers
2. Assign each element to the closest center (Voronoi)
3. Re-calculate centers (mean position of the assigned elements)
4. Repeat 2 and 3 until one of the following termination conditions is reached:
i. The clusters are the same as in the previous iteration
ii. The difference between two iterations is smaller than a specified threshold
iii. The maximum number of iterations has been reached

K-means clustering example  Two sets of points randomly generated: 200 centered on (0,0) and 50 centered on (1,1)

K-means clustering example  Two points are randomly chosen as centers (stars)

K-means clustering example  Each dot can now be assigned to the cluster with the closest center

K-means clustering example  First partition into clusters

K-means clustering example  Centers are re-calculated

K-means clustering example  And are again used to partition the points

K-means clustering example  Second partition into clusters

K-means clustering example  Re-calculating centers again

K-means clustering example  And we can again partition the points

K-means clustering example  Third partition into clusters

K-means clustering example  After 6 iterations: the calculated centers remain stable

Cell cycle  (Figure: clustered yeast cell-cycle gene-expression data; Spellman et al., 1998)

K-means clustering: Summary  The convergence of k-means is usually quite fast (sometimes one iteration is enough to reach a stable solution)  K-means is time- and memory-efficient  Strengths:  Simple to use  Fast  Can be used with very large data sets  Weaknesses:  The number of clusters has to be predetermined  The results may vary depending on the initial choice of centers
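The last weakness, sensitivity to the initial centers, can be made concrete with four points at the corners of a wide rectangle. This is an illustrative experiment of my own (reusing a minimal `kmeans` sketch): two different initializations converge to two different stable partitions, and one is far worse than the other.

```python
import math

def closest(point, centers):
    return min(range(len(centers)), key=lambda i: math.dist(point, centers[i]))

def mean(cluster):
    n = len(cluster)
    return (sum(x for x, _ in cluster) / n, sum(y for _, y in cluster) / n)

def kmeans(points, centers, max_iter=100):
    clusters = [[] for _ in centers]
    for _ in range(max_iter):
        clusters = [[] for _ in centers]
        for p in points:
            clusters[closest(p, centers)].append(p)
        new = [mean(c) if c else centers[i] for i, c in enumerate(clusters)]
        if new == centers:
            break
        centers = new
    return centers, clusters

# Four points at the corners of a wide rectangle, k = 2.
rectangle = [(0, 0), (0, 1), (10, 0), (10, 1)]

# Centers started on opposite short sides: converges to the left/right
# split (total squared distance 4 * 0.25 = 1).
good, _ = kmeans(rectangle, [(0, 0), (10, 0)])

# Centers started on the same short side: converges to a top/bottom
# split (total squared distance 4 * 25 = 100), a much worse local optimum.
bad, _ = kmeans(rectangle, [(0, 0), (0, 1)])
```

Both runs satisfy the termination condition, so k-means alone cannot tell that the second solution is poor; this motivates restarts and smarter seeding.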

K-means clustering: Variations  Expectation-maximization (EM): maintains probabilistic assignments to clusters instead of deterministic assignments, and multivariate Gaussian distributions instead of simple means  k-means++: attempts to choose better starting points  Some variations attempt to escape local optima by swapping points between clusters
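The k-means++ seeding idea fits in a few lines: pick the first center uniformly at random, then pick each further center with probability proportional to its squared distance from the nearest center chosen so far, so that far-apart points are favored as seeds. A sketch (the name `kmeanspp_seed` is my own):

```python
import math
import random

def kmeanspp_seed(points, k, rng=random):
    """k-means++ seeding: first center uniform at random; each further
    center drawn with probability proportional to its squared distance
    from the nearest center chosen so far."""
    centers = [rng.choice(points)]
    while len(centers) < k:
        d2 = [min(math.dist(p, c) ** 2 for c in centers) for p in points]
        centers.append(rng.choices(points, weights=d2, k=1)[0])
    return centers

random.seed(0)
seeds = kmeanspp_seed([(0, 0), (0, 1), (10, 10), (10, 11)], 2)
```

A point already chosen has squared distance zero and thus can never be drawn again, so with distinct input points the seeds are always distinct; the resulting centers then feed the ordinary k-means loop.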

The take-home message  (Figure: decision chart for choosing a clustering method, D'haeseleer, 2005)  Hierarchical clustering or k-means clustering?

What else are we missing?

 What if the clusters are not “linearly separable”? What else are we missing?