Bioinformatics: Pattern recognition - Clustering, by Ulf Schmitz


1 Bioinformatics: Pattern recognition - Clustering
Ulf Schmitz, ulf.schmitz@informatik.uni-rostock.de
Bioinformatics and Systems Biology Group, www.sbi.informatik.uni-rostock.de

2 Outline
1. Introduction
2. Hierarchical clustering
3. Partitional clustering: k-means and derivatives
4. Fuzzy clustering

3 Introduction to clustering algorithms
Clustering is the classification of similar objects into separate groups
– or the partitioning of a data set into subsets (clusters)
– so that the data in each subset (ideally) share some common trait
Machine learning typically regards clustering as a form of unsupervised learning.
We distinguish:
– Hierarchical clustering (finds successive clusters using previously established clusters)
– Partitional clustering (determines all clusters at once)

4 Introduction to clustering algorithms: applications
– gene expression data analysis
– identification of regulatory binding sites
– phylogenetic tree clustering (for inference of horizontally transferred genes)
– protein domain identification
– identification of structural motifs

5 Introduction to clustering algorithms: the data matrix
– the data matrix X collects observations of n objects, each described by m measurements
– rows refer to objects, characterised by the values in the columns
– if the units of measurement associated with the columns of X differ, it is necessary to normalise the columns
Notation: data matrix $X = (x_{ij})$ with $i = 1, \dots, n$ and $j = 1, \dots, m$; column vector $x_j$; mean $\bar{x}_j$ and standard deviation $s_j$ of column j.
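
The normalisation mentioned on this slide can be sketched as follows. This is a minimal illustration, assuming the usual z-score (subtract the column mean, divide by the column standard deviation) and the sample standard deviation with divisor n - 1; the transcript does not preserve the slide's formulas, so both choices are assumptions. The function name `normalise_columns` and the data are hypothetical.

```python
from statistics import mean, stdev

def normalise_columns(X):
    """Return a copy of X in which every column has mean 0 and standard deviation 1."""
    n_cols = len(X[0])
    columns = [[row[j] for row in X] for j in range(n_cols)]
    means = [mean(col) for col in columns]
    # stdev() divides by n - 1 (sample standard deviation); this is an assumption.
    # A constant column would give a standard deviation of 0 and is not handled here.
    stds = [stdev(col) for col in columns]
    return [[(row[j] - means[j]) / stds[j] for j in range(n_cols)] for row in X]

# Hypothetical data: 4 objects (rows), 2 measurements (columns) in very different units
X = [[1.0, 200.0], [2.0, 400.0], [3.0, 600.0], [4.0, 800.0]]
Z = normalise_columns(X)
```

After normalisation both columns of `Z` carry the same values, so neither measurement dominates a distance computation merely because of its unit.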

6 Hierarchical clustering
Produces a sequence of nested partitions. The steps are:
1. find the dissimilarity or similarity between every pair of objects in the data set by evaluating a distance measure
2. group the objects into a hierarchical cluster tree (dendrogram) by linking newly formed clusters
3. obtain a partition of the data set into clusters by selecting a suitable 'cut level' of the cluster tree
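
Step 3, selecting a cut level, can be illustrated with a small sketch. It assumes the cluster tree is stored as a list of merge records `(members_a, members_b, distance)` in order of increasing distance; this record format, the function name `cut_tree`, and the example values are assumptions made for illustration, not something the slides prescribe.

```python
def cut_tree(labels, merges, cut_level):
    """Apply all recorded merges up to cut_level and return the resulting partition."""
    # every object starts in its own cluster
    cluster_of = {label: frozenset([label]) for label in labels}
    for members_a, members_b, distance in merges:
        if distance > cut_level:
            break  # assumes the merges are listed in order of increasing distance
        merged = cluster_of[members_a[0]] | cluster_of[members_b[0]]
        for label in merged:
            cluster_of[label] = merged
    return sorted(sorted(c) for c in set(cluster_of.values()))

# Hypothetical merge record for five objects
labels = ["x1", "x2", "x3", "x4", "x5"]
merges = [
    (("x1",), ("x3",), 1.0),
    (("x4",), ("x5",), 1.0),
    (("x1", "x3"), ("x4", "x5"), 2.0616),
    (("x2",), ("x1", "x3", "x4", "x5"), 2.5),
]
partition = cut_tree(labels, merges, cut_level=1.5)
```

A low cut level leaves many small clusters, a high one merges everything; with the values above, a cut at 1.5 keeps three clusters.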

7 Hierarchical clustering: agglomerative hierarchical clustering
1. start with n clusters, each containing one object, and calculate the distance matrix D_1
2. determine from D_1 which of the objects are least distant (e.g. I and J)
3. merge these objects into one cluster and form a new distance matrix by deleting the entries for the clustered objects and adding distances for the new cluster
4. repeat steps 2 and 3 a total of n - 1 times until a single cluster is formed
– record which clusters are merged at each step
– record the distances between the clusters that are merged in that step
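
The four steps can be sketched in plain Python. The distance matrix below uses the values of the worked example on slide 11; how the distance to a freshly merged cluster is computed is not fixed at this point of the slides, so the sketch assumes single linkage (the smaller of the two old distances). The function name `agglomerate` and the bookkeeping with cluster ids are illustrative choices.

```python
def agglomerate(D, labels):
    """Merge clusters until one remains; return the recorded (members, members, distance) merges."""
    n = len(labels)
    # step 1: n clusters with one object each; dist holds the current distance matrix
    clusters = {i: (labels[i],) for i in range(n)}
    dist = {frozenset((i, j)): D[i][j] for i in range(n) for j in range(i + 1, n)}
    merges = []
    next_id = n
    while len(clusters) > 1:
        # step 2: find the least distant pair of clusters
        pair = min(dist, key=dist.get)
        i, j = sorted(pair)
        merges.append((clusters[i], clusters[j], dist[pair]))
        # step 3: add distances for the new cluster (single linkage, an assumption) ...
        new_entries = {
            frozenset((next_id, k)): min(dist[frozenset((i, k))], dist[frozenset((j, k))])
            for k in clusters if k not in (i, j)
        }
        # ... and delete the entries of the two merged clusters
        dist = {p: d for p, d in dist.items() if i not in p and j not in p}
        dist.update(new_entries)
        clusters[next_id] = clusters.pop(i) + clusters.pop(j)
        next_id += 1
    return merges

labels = ["x1", "x2", "x3", "x4", "x5"]
D = [
    [0.0,    2.9155, 1.0000, 3.0414, 3.0414],
    [2.9155, 0.0,    2.5495, 3.3541, 2.5000],
    [1.0000, 2.5495, 0.0,    2.0616, 2.0616],
    [3.0414, 3.3541, 2.0616, 0.0,    1.0000],
    [3.0414, 2.5000, 2.0616, 1.0000, 0.0],
]
merges = agglomerate(D, labels)
```

For five objects the loop performs four merges. The recorded distances are what a dendrogram would plot on its height axis.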

8 Hierarchical clustering: calculating the distances
One treats the data matrix X as a set of n (row) vectors with m elements. For two row vectors $x_r$ and $x_s$ of X:
Euclidean distance: $d_{rs} = \sqrt{\sum_{j=1}^{m} (x_{rj} - x_{sj})^2}$
City block distance: $d_{rs} = \sum_{j=1}^{m} |x_{rj} - x_{sj}|$

9 Hierarchical clustering: an example (figure comparing the Euclidean distance and the city block distance)
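
Both distance measures are short enough to write out. The five points below are not given in the transcript; they are an assumption, chosen because their pairwise Euclidean distances reproduce the distance matrix shown on slide 11.

```python
import math

def euclidean(r, s):
    """Euclidean distance between two row vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(r, s)))

def city_block(r, s):
    """City block (Manhattan) distance between two row vectors."""
    return sum(abs(a - b) for a, b in zip(r, s))

# Assumed coordinates for x1..x5 (not in the transcript)
X = [(1.0, 2.0), (2.5, 4.5), (2.0, 2.0), (4.0, 1.5), (4.0, 2.5)]

# Full pairwise distance matrix, rounded to four decimals
D = [[round(euclidean(r, s), 4) for s in X] for r in X]
```

The city block distance is never smaller than the Euclidean distance for the same pair, because it sums the coordinate differences instead of taking the straight line.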


11 Hierarchical clustering: distance matrix

        x1      x2      x3      x4      x5
x1      0       2.9155  1.0000  3.0414  3.0414
x2      2.9155  0       2.5495  3.3541  2.5000
x3      1.0000  2.5495  0       2.0616  2.0616
x4      3.0414  3.3541  2.0616  0       1.0000
x5      3.0414  2.5000  2.0616  1.0000  0

Distance matrix after merging x1 with x3 and x4 with x5:

        x1,x3   x2      x4,x5
x1,x3   0       2.9155  2.0616
x2      2.9155  0       2.5000
x4,x5   2.0616  2.5000  0

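12 Hierarchical clustering: methods to define a distance between clusters
For two clusters I and J with N_I and N_J members (N is the number of members in a cluster) and object distances $d_{ij}$ with $i \in I$, $j \in J$:
single linkage: $d_{IJ} = \min_{i \in I,\, j \in J} d_{ij}$
complete linkage: $d_{IJ} = \max_{i \in I,\, j \in J} d_{ij}$
group average: $d_{IJ} = \frac{1}{N_I N_J} \sum_{i \in I} \sum_{j \in J} d_{ij}$
centroid linkage: $d_{IJ} = d(\bar{x}_I, \bar{x}_J)$, the distance between the two cluster centroids
(figure: cluster I = {1, 2} and cluster J = {3, 4, 5}, with the pairwise distances d_13, d_14, d_15, d_23, d_24, d_25 and the cluster distance d_IJ)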
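
The four linkage rules differ only in how they summarise the pairwise distances between the members of two clusters. The sketch below uses the standard textbook definitions and Euclidean object distances; the point coordinates are assumed for illustration and the function names are hypothetical.

```python
import math
from itertools import product

def single_linkage(I, J):
    """Smallest distance between a member of I and a member of J."""
    return min(math.dist(r, s) for r, s in product(I, J))

def complete_linkage(I, J):
    """Largest distance between a member of I and a member of J."""
    return max(math.dist(r, s) for r, s in product(I, J))

def group_average(I, J):
    """Mean of all N_I * N_J pairwise distances."""
    return sum(math.dist(r, s) for r, s in product(I, J)) / (len(I) * len(J))

def centroid(points):
    """Coordinate-wise mean of a list of points."""
    return tuple(sum(p[d] for p in points) / len(points) for d in range(len(points[0])))

def centroid_linkage(I, J):
    """Distance between the centroids of I and J."""
    return math.dist(centroid(I), centroid(J))

# Assumed coordinates: I = {x1, x3}, J = {x4, x5}
I = [(1.0, 2.0), (2.0, 2.0)]
J = [(4.0, 1.5), (4.0, 2.5)]
results = {
    "single": single_linkage(I, J),
    "complete": complete_linkage(I, J),
    "average": group_average(I, J),
    "centroid": centroid_linkage(I, J),
}
```

For any pair of clusters the group average lies between the single-linkage and the complete-linkage value, which is one way to see why it behaves as a compromise between the two.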



15 Hierarchical clustering: limits of hierarchical clustering
1. the choice of distance measure is important
2. there is no provision for reassigning objects that have been incorrectly grouped
3. errors are not handled explicitly in the procedure
4. no method of calculating intercluster distances is universally the best
– but single-linkage clustering is the least successful
– and group average clustering tends to perform fairly well

16 Partitional clustering: k-means
– involves prior specification of the number of clusters, k
– no pairwise distance matrix is required
– the relevant distance is the distance from the object to the cluster center (centroid)

17 Partitional clustering: k-means
1. partition the objects into k clusters (can be done by random partitioning or by arbitrarily clustering around two or more objects)
2. calculate the centroids of the clusters
3. assign or reassign each object to the cluster whose centroid is closest (distance is calculated as Euclidean distance)
4. recalculate the centroids of the new clusters formed after the gain or loss of objects to or from the previous clusters
5. repeat steps 3 and 4 for a predetermined number of iterations or until membership of the groups no longer changes

18 Partitional clustering: k-means (example)

object  x1  x2
A       1   1
B       3   1
C       4   8
D       8   10
E       9   6

step 1: make an arbitrary partition of the objects into clusters, e.g. A, B and C in Cluster 1, and D and E in Cluster 2
step 2: calculate the centroids of the clusters
cluster 1: $c_1 = (2.67, 3.33)$
cluster 2: $c_2 = (8.50, 8.00)$
step 3: calculate the Euclidean distance between each object and each of the two cluster centroids:

object  d(x, c1)  d(x, c2)
A       2.87      10.26
B       2.36      8.90
C       4.85      4.50
D       8.54      2.06
E       6.87      2.06

(figure: scatter plot of the objects A to E, both axes running from 2 to 10)


20 Partitional clustering: k-means (example, continued)
step 4: C turns out to be closer to Cluster 2 and has to be reassigned
repeat step 2 and step 3
cluster 1: $c_1 = (2, 1)$
cluster 2: $c_2 = (7, 8)$

object  d(x, c1)  d(x, c2)
A       1.00      9.22
B       1.00      8.06
C       7.28      3.00
D       10.82     2.24
E       8.60      2.83

no further reassignments are necessary
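
The worked example can be replayed with a few lines of Python. This is a minimal sketch of the k-means steps listed on slide 17, started from the same initial partition (A, B, C in cluster 1; D, E in cluster 2). The function names and the dictionary-based data layout are illustrative assumptions, and the sketch does not handle a cluster that loses all of its objects.

```python
import math

def centroid(points):
    """Coordinate-wise mean of a list of points."""
    return tuple(sum(p[d] for p in points) / len(points) for d in range(len(points[0])))

def kmeans(data, assignment, max_iter=100):
    """data: name -> point; assignment: name -> cluster id (the initial partition)."""
    assignment = dict(assignment)
    centroids = {}
    for _ in range(max_iter):
        ids = sorted(set(assignment.values()))
        # steps 2 and 4: (re)calculate the centroid of every cluster
        centroids = {k: centroid([data[o] for o in data if assignment[o] == k]) for k in ids}
        # step 3: assign each object to the cluster with the closest centroid
        new_assignment = {
            o: min(ids, key=lambda k: math.dist(data[o], centroids[k])) for o in data
        }
        if new_assignment == assignment:
            break  # step 5: membership no longer changes
        assignment = new_assignment
    return assignment, centroids

data = {"A": (1, 1), "B": (3, 1), "C": (4, 8), "D": (8, 10), "E": (9, 6)}
initial = {"A": 1, "B": 1, "C": 1, "D": 2, "E": 2}
assignment, centroids = kmeans(data, initial)
```

In the first pass C moves to cluster 2; the second pass changes nothing, so the loop stops with the centroids (2, 1) and (7, 8).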


22 Fuzzy clustering
is an extension of k-means clustering
– an object belongs to a cluster to a certain degree
for all objects, the degrees of membership in the k clusters add up to one: $\sum_{j=1}^{k} u_{ij} = 1$, where $u_{ij}$ denotes the degree of membership of object i in cluster j
a fuzzy weight ω is introduced, which determines the fuzziness of the resulting clusters
– for ω → 1, the clustering becomes a hard partition
– for ω → ∞, the degree of membership approaches 1/k
– typical values are ω = 1.25 and ω = 2

23 Fuzzy clustering
fix k, 2 ≤ k < n, choose a termination tolerance ε > 0 (e.g. 0.01 or 0.001), and fix ω, 1 ≤ ω < ∞. Initialize the first cluster set randomly.
step 1: compute the cluster centers
step 2: compute the distances between the objects and the cluster centers

24 Fuzzy clustering
step 3: update the partition matrix
until: the algorithm terminates when the changes in the partition matrix are negligible
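
The three steps and the stopping rule can be sketched as below. The transcript does not preserve the slide's formulas, so this sketch assumes the standard fuzzy c-means updates: centers are means weighted by the memberships raised to the power ω, and the membership of object i in cluster j is 1 / Σ_l (d_ij / d_il)^(2/(ω-1)). The function name, the parameter names, the largest absolute change as the measure of "negligible", and the optional initial partition `U0` (used here instead of a random start to keep the example reproducible) are all assumptions.

```python
import math
import random

def fuzzy_c_means(X, k, omega=2.0, eps=0.001, max_iter=100, U0=None, seed=0):
    """Return (partition matrix U, cluster centers) for the objects in X."""
    if omega <= 1.0:
        raise ValueError("this sketch needs omega > 1")
    n, m = len(X), len(X[0])
    if U0 is None:  # random initial partition matrix
        rng = random.Random(seed)
        U0 = [[rng.random() for _ in range(k)] for _ in range(n)]
    U = [[u / sum(row) for u in row] for row in U0]  # every row sums to one
    power = 2.0 / (omega - 1.0)
    centers = []
    for _ in range(max_iter):
        # step 1: cluster centers as means weighted by u_ij ** omega
        centers = []
        for j in range(k):
            w = [U[i][j] ** omega for i in range(n)]
            centers.append(tuple(sum(w[i] * X[i][d] for i in range(n)) / sum(w)
                                 for d in range(m)))
        # step 2: distances between objects and cluster centers
        dist = [[math.dist(X[i], centers[j]) for j in range(k)] for i in range(n)]
        # step 3: update the partition matrix
        new_U = []
        for i in range(n):
            zeros = [j for j in range(k) if dist[i][j] == 0.0]
            if zeros:  # object sits exactly on a center: give it full membership there
                new_U.append([1.0 / len(zeros) if j in zeros else 0.0 for j in range(k)])
            else:
                new_U.append([1.0 / sum((dist[i][j] / dist[i][l]) ** power for l in range(k))
                              for j in range(k)])
        change = max(abs(new_U[i][j] - U[i][j]) for i in range(n) for j in range(k))
        U = new_U
        if change < eps:  # until: changes in the partition matrix are negligible
            break
    return U, centers

# Objects A..E from the k-means example, started from its initial hard partition
X = [(1, 1), (3, 1), (4, 8), (8, 10), (9, 6)]
U0 = [[1, 0], [1, 0], [1, 0], [0, 1], [0, 1]]
U, centers = fuzzy_c_means(X, k=2, U0=U0)
```

Each row of `U` holds the membership degrees of one object; unlike in k-means, an object between the groups keeps a noticeable share in both clusters.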

25 Clustering software
– Cluster 3.0 (for gene expression data analysis)
– PyCluster (Python module)
– Algorithm::Cluster (Perl package)
– C clustering library
http://bonsai.ims.u-tokyo.ac.jp/~mdehoon/software/cluster/software.htm

26 Outlook
Bioperl

27 Thanks for your attention!

