Bioinformatics: Pattern recognition - Clustering. Ulf Schmitz


Ulf Schmitz, Bioinformatics and Systems Biology Group

Outline
1. Introduction
2. Hierarchical clustering
3. Partitional clustering: k-means and derivatives
4. Fuzzy clustering

Introduction to clustering algorithms

Clustering is the classification of similar objects into separate groups, or the partitioning of a data set into subsets (clusters), so that the data in each subset (ideally) share some common trait. Machine learning typically regards clustering as a form of unsupervised learning. We distinguish:
- Hierarchical clustering (finds successive clusters using previously established clusters)
- Partitional clustering (determines all clusters at once)

Introduction to clustering algorithms: applications
- gene expression data analysis
- identification of regulatory binding sites
- phylogenetic tree clustering (for inference of horizontally transferred genes)
- protein domain identification
- identification of structural motifs

Introduction to clustering algorithms: the data matrix

The data matrix X collects observations of n objects, each described by m measurements; rows refer to objects, which are characterised by the values in the columns. If the units of measurement associated with the columns of X differ, it is necessary to normalise: with column vector x_j, column mean x̄_j = (1/n) Σ_i x_ij and standard deviation s_j, each entry is normalised as z_ij = (x_ij - x̄_j) / s_j.
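The normalisation above can be sketched in a few lines of plain Python (an illustrative sketch; the data values are made up, not from the slides):

```python
import math

# Data matrix X: n objects (rows), m measurements (columns) in different units.
X = [[1.0, 100.0],
     [3.0, 300.0],
     [4.0, 800.0]]

n, m = len(X), len(X[0])

def column(j):
    return [row[j] for row in X]

# column means and sample standard deviations
means = [sum(column(j)) / n for j in range(m)]
stds = [math.sqrt(sum((v - means[j]) ** 2 for v in column(j)) / (n - 1))
        for j in range(m)]

# z-score normalisation: z_ij = (x_ij - mean_j) / std_j
Z = [[(X[i][j] - means[j]) / stds[j] for j in range(m)] for i in range(n)]
```

After normalisation every column has mean 0 and unit standard deviation, so measurements in different units become comparable.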

Hierarchical clustering

Produces a sequence of nested partitions; the steps are:
1. find the similarity or dissimilarity between every pair of objects in the data set by evaluating a distance measure
2. group the objects into a hierarchical cluster tree (dendrogram) by linking newly formed clusters
3. obtain a partition of the data set into clusters by selecting a suitable 'cut-level' of the cluster tree

Agglomerative hierarchical clustering
1. start with n clusters, each containing one object, and calculate the distance matrix D1
2. determine from D1 which of the objects are least distant (e.g. I and J)
3. merge these objects into one cluster and form a new distance matrix by deleting the entries for the clustered objects and adding distances for the new cluster
4. repeat steps 2 and 3 a total of n - 1 times until a single cluster is formed
   - record which clusters are merged at each step
   - record the distances between the clusters that are merged in that step
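The agglomerative procedure can be sketched in plain Python, here with single-linkage merging (a sketch for illustration; function and variable names are mine, not from the slides):

```python
def euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def agglomerate(points):
    """Start with one cluster per object; repeatedly merge the two
    closest clusters (single linkage) until a single cluster remains."""
    clusters = [[i] for i in range(len(points))]
    merges = []  # record which clusters merge, and at what distance
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # single-linkage distance between clusters a and b
                d = min(euclidean(points[i], points[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((clusters[a][:], clusters[b][:], d))
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return merges

# four points: two tight pairs; n - 1 = 3 merges in total
merges = agglomerate([(0, 0), (0, 1), (5, 5), (5, 6)])
```

Note that for n objects exactly n - 1 merges are recorded, matching step 4 above.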

Hierarchical clustering: calculating the distances

One treats the data matrix X as a set of n (row) vectors with m elements. For row vectors x_i and x_j of X:
- Euclidean distance: d(x_i, x_j) = sqrt( Σ_k (x_ik - x_jk)² )
- City block distance: d(x_i, x_j) = Σ_k |x_ik - x_jk|

Hierarchical clustering: an example. (The slide shows a worked numerical example for both the Euclidean and the city block distance; the values were not preserved in this transcript.)
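Both distance measures are easy to compute directly; a sketch with made-up row vectors (the slide's own numbers were lost in the transcript):

```python
def euclidean(xi, xj):
    # square root of the sum of squared coordinate differences
    return sum((a - b) ** 2 for a, b in zip(xi, xj)) ** 0.5

def city_block(xi, xj):
    # sum of absolute coordinate differences (Manhattan distance)
    return sum(abs(a - b) for a, b in zip(xi, xj))

x1, x2 = (1.0, 1.0), (4.0, 5.0)
print(euclidean(x1, x2))   # 5.0
print(city_block(x1, x2))  # 7.0
```

The city block distance is always at least as large as the Euclidean distance, as in this example (7.0 vs. 5.0).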


Hierarchical clustering: merging steps on a distance matrix

Distance matrix over five objects x1 … x5 (the numerical entries were not preserved in this transcript). The closest pair, x1 and x3, is merged first and the matrix is reduced; x4 and x5 are merged subsequently, leaving clusters {x1, x3}, x2 and {x4, x5}.

Hierarchical clustering: methods to define a distance between clusters I and J (N_I and N_J are the numbers of members in the clusters):
- single linkage: d_IJ = min over i ∈ I, j ∈ J of d_ij
- complete linkage: d_IJ = max over i ∈ I, j ∈ J of d_ij
- group average: d_IJ = (1 / (N_I · N_J)) Σ_{i ∈ I} Σ_{j ∈ J} d_ij
- centroid linkage: d_IJ = distance between the centroids of I and J
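The four inter-cluster distances can be written down directly; a sketch where clusters are lists of coordinate tuples (example values are mine):

```python
def d(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def single_linkage(I, J):
    # smallest pairwise distance between the clusters
    return min(d(i, j) for i in I for j in J)

def complete_linkage(I, J):
    # largest pairwise distance between the clusters
    return max(d(i, j) for i in I for j in J)

def group_average(I, J):
    # mean of all N_I * N_J pairwise distances
    return sum(d(i, j) for i in I for j in J) / (len(I) * len(J))

def centroid_linkage(I, J):
    # distance between the two cluster centroids
    cI = tuple(sum(c) / len(I) for c in zip(*I))
    cJ = tuple(sum(c) / len(J) for c in zip(*J))
    return d(cI, cJ)

I = [(0.0, 0.0), (0.0, 2.0)]
J = [(4.0, 0.0)]
```

For these clusters single linkage gives 4.0, complete linkage sqrt(20), group average their mean, and centroid linkage sqrt(17).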

Hierarchical clustering (dendrogram illustration on the slide)


Limits of hierarchical clustering
1. the choice of distance measure is important
2. there is no provision for reassigning objects that have been incorrectly grouped
3. errors are not handled explicitly in the procedure
4. no method of calculating intercluster distances is universally the best; single-linkage clustering is the least successful, while group-average clustering tends to perform fairly well

Partitional clustering: k-means

Involves prior specification of the number of clusters, k. No pairwise distance matrix is required; the relevant distance is the distance from each object to its cluster centre (centroid).

Partitional clustering: k-means
1. partition the objects into k clusters (by random partitioning or by arbitrarily clustering around two or more objects)
2. calculate the centroids of the clusters
3. assign or reassign each object to the cluster whose centroid is closest (distance is calculated as Euclidean distance)
4. recalculate the centroids of the new clusters formed after the gain or loss of objects to or from the previous clusters
5. repeat steps 3 and 4 for a predetermined number of iterations or until membership of the groups no longer changes
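Steps 1-5 can be sketched in plain Python (an illustrative sketch, not an implementation from the slides; empty clusters are not handled here):

```python
def kmeans(points, assignment, max_iter=100):
    """Lloyd-style k-means. `assignment` is an initial list mapping each
    point to a cluster index (step 1 of the slide)."""
    k = max(assignment) + 1
    centroids = []
    for _ in range(max_iter):
        # steps 2/4: recompute centroids as the mean of each cluster's members
        centroids = []
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            centroids.append(tuple(sum(v) / len(members) for v in zip(*members)))
        # step 3: reassign each point to the nearest centroid (Euclidean)
        new_assignment = [
            min(range(k), key=lambda c: sum((x - y) ** 2
                                            for x, y in zip(p, centroids[c])))
            for p in points
        ]
        # step 5: stop when membership no longer changes
        if new_assignment == assignment:
            break
        assignment = new_assignment
    return assignment, centroids

labels, centroids = kmeans([(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)],
                           [0, 1, 0, 1])
```

Even from the deliberately bad initial partition, the two tight point pairs end up in separate clusters after one reassignment pass.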

Partitional clustering: k-means, an example

object  x1  x2
A       1   1
B       3   1
C       4   8
D       8   10
E       9   6

step 1: make an arbitrary partition of the objects into clusters (the threshold criterion was not preserved in this transcript): A, B and C into Cluster 1, D and E into Cluster 2
step 2: calculate the centroids of the clusters: cluster 1: (2.67, 3.33); cluster 2: (8.5, 8.0)
step 3: calculate the Euclidean distance between each object and each of the two cluster centroids:

object  d(x, c1)  d(x, c2)
A       2.87      10.26
B       2.36      8.90
C       4.85      4.50
D       8.54      2.06
E       6.87      2.06
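The slide's example can be reproduced numerically in a few lines (a sketch; helper names are mine):

```python
data = {'A': (1, 1), 'B': (3, 1), 'C': (4, 8), 'D': (8, 10), 'E': (9, 6)}

def centroid(names):
    pts = [data[n] for n in names]
    return tuple(sum(v) / len(pts) for v in zip(*pts))

def dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5

c1 = centroid(['A', 'B', 'C'])   # ≈ (2.67, 3.33)
c2 = centroid(['D', 'E'])        # (8.5, 8.0)

# C lies closer to the centroid of Cluster 2 than to its own cluster's centroid
closer_to_2 = dist(data['C'], c2) < dist(data['C'], c1)
```

Running this confirms the slide's conclusion: C must be reassigned to Cluster 2.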


Partitional clustering: k-means, example continued

step 4: C turns out to be closer to Cluster 2 and has to be reassigned; repeat steps 2 and 3 with the new clusters {A, B} and {C, D, E}: cluster 1: (2.0, 1.0); cluster 2: (7.0, 8.0)

object  d(x, c1)  d(x, c2)
A       1.00      9.22
B       1.00      8.06
C       7.28      3.00
D       10.82     2.24
E       8.60      2.83

No further reassignments are necessary.

Partitional clustering: k-means (illustration on the slide)

Fuzzy clustering

Fuzzy clustering is an extension of k-means clustering:
- an object belongs to a cluster to a certain degree
- for each object the degrees of membership in the k clusters add up to one: Σ_{c=1}^{k} u_ic = 1
- a fuzzy weight ω is introduced, which determines the fuzziness of the resulting clusters:
  - for ω → 1, the clustering becomes a hard partition
  - for ω → ∞, the degrees of membership approach 1/k
  - typical values are ω = 1.25 and ω = 2

Fuzzy clustering (fuzzy c-means)

Fix k, 2 ≤ k < n; fix a termination threshold ε > 0 (e.g. 0.001); and fix ω, 1 < ω < ∞. Initialise the first partition matrix randomly.
step 1: compute the cluster centres: c_c = Σ_i (u_ic)^ω x_i / Σ_i (u_ic)^ω
step 2: compute the distances between objects and cluster centres: d_ic = ||x_i - c_c||

Fuzzy clustering (fuzzy c-means)

step 3: update the partition matrix: u_ic = 1 / Σ_{l=1}^{k} (d_ic / d_il)^{2/(ω-1)}
until: the algorithm is terminated when the changes in the partition matrix are negligible, i.e. when the largest change in any membership falls below ε.
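The three steps can be sketched as a fuzzy c-means loop in plain Python (an illustrative sketch; the symbols follow the slides, but the code and the tiny-distance guard are mine):

```python
import random

def fuzzy_c_means(points, k, omega=2.0, eps=1e-3, max_iter=100, seed=0):
    rng = random.Random(seed)
    n, dim = len(points), len(points[0])
    # random initial partition matrix U; each row sums to one
    U = []
    for _ in range(n):
        row = [rng.random() for _ in range(k)]
        s = sum(row)
        U.append([u / s for u in row])
    centres = []
    for _ in range(max_iter):
        # step 1: cluster centres as membership-weighted means
        centres = []
        for c in range(k):
            w = [U[i][c] ** omega for i in range(n)]
            centres.append(tuple(
                sum(w[i] * points[i][j] for i in range(n)) / sum(w)
                for j in range(dim)))
        # step 2: distances between objects and cluster centres
        # (floored at a tiny value to avoid division by zero)
        D = [[max(sum((a - b) ** 2 for a, b in zip(points[i], centres[c])) ** 0.5,
                  1e-12)
              for c in range(k)] for i in range(n)]
        # step 3: update the partition matrix
        U_new = [[1.0 / sum((D[i][c] / D[i][l]) ** (2 / (omega - 1))
                            for l in range(k))
                  for c in range(k)] for i in range(n)]
        # terminate when changes in U are negligible
        change = max(abs(U_new[i][c] - U[i][c])
                     for i in range(n) for c in range(k))
        U = U_new
        if change < eps:
            break
    return U, centres

U, centres = fuzzy_c_means([(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)], k=2)
```

Each row of the resulting partition matrix sums to one, and the two near-origin points share the same dominant cluster.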

Clustering software
- Cluster 3.0 (for gene expression data analysis)
- PyCluster (Python module)
- Algorithm::Cluster (Perl package)
- C clustering library

Outlook: Bioperl

Thanks for your attention!