Multivariate Statistical Methods
Cluster Analysis
By Jen-pei Liu, PhD
Division of Biometry, Department of Agronomy, National Taiwan University
and Division of Biostatistics and Bioinformatics, National Health Research Institutes
Cluster Analysis
Introduction
Measures of Similarity
Hierarchical Clustering
K-means Clustering
Summary
Introduction
A sample of n objects, each with measurements on p variables
Use the measurements of the p variables to devise a scheme for grouping the n objects into classes
Similar objects are placed in the same class
Introduction
In general, the number of clusters is not known in advance – unsupervised analysis
In discriminant analysis, the number of classes is pre-specified and allocation is based on a prediction function – supervised analysis
Introduction
Examples
Clusters of depressed patients
Data reduction
Marketing – test markets: a large number of cities is grouped into a small number of groups of similar cities, and one member from each group is selected for testing
Microarray – clusters of genes, clusters of subjects
Introduction
Types of Clustering Methods
Hierarchical clustering: finds a series of partitions – a bottom-up clustering
Partitional methods: produce a single partition of the objects – a top-down clustering
Introduction
Example: Chinese (X1) and Math (X2) test scores of 6 students [score table not shown]
Measures of Similarity
The Euclidean distance between objects i and j measured on p variables:
d_ij = sqrt[ (x_i1 - x_j1)^2 + (x_i2 - x_j2)^2 + ... + (x_ip - x_jp)^2 ]
Measures of Similarity
Euclidean distance matrix for the 6 students [matrix not shown]
Measures of Similarity
The Manhattan (city block) distance:
d_ij = |x_i1 - x_j1| + |x_i2 - x_j2| + ... + |x_ip - x_jp|
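A quick numerical check of the two distances, using the Chinese and Math scores of students 1 and 4 from the worked example later in the deck (a minimal sketch; pairing these two students here is only for illustration):

```python
# Euclidean vs. Manhattan (city block) distance between two students
# measured on p = 2 variables (Chinese, Math).
from scipy.spatial.distance import euclidean, cityblock

student_1 = [85, 82]   # Chinese, Math
student_4 = [90, 95]   # Chinese, Math

print(euclidean(student_1, student_4))   # sqrt((85-90)^2 + (82-95)^2) ≈ 13.93
print(cityblock(student_1, student_4))   # |85-90| + |82-95| = 18
```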
Measures of Similarity
Correlation coefficient
A measure for association
Not a measure for similarity (or agreement)
Euclidean distance
A measure for agreement
Not a measure for association
Measures of Similarity
Example: three cases, each with two variables X1 and X2 [data values not shown]
Case I: r = 1, d^2 = 0
Case II: r = 1, d^2 = 30
Case III: r = 1, d^2 = 270
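The point can be illustrated with a small sketch (the data here are hypothetical, not the Case I–III values above, which are not reproduced): two profiles that differ only by a constant shift have r = 1 but a large Euclidean distance.

```python
# Correlation measures association, not agreement: a shifted copy of a profile
# is perfectly correlated with it but far away in Euclidean distance.
import numpy as np

x1 = np.array([1.0, 2.0, 3.0, 4.0])
x2_same = x1.copy()       # identical profile
x2_shift = x1 + 5.0       # same shape, shifted upward

for label, x2 in [("identical", x2_same), ("shifted", x2_shift)]:
    r = np.corrcoef(x1, x2)[0, 1]
    d2 = np.sum((x1 - x2) ** 2)          # squared Euclidean distance
    print(label, round(r, 2), round(d2, 1))
# identical: r = 1.0, d^2 = 0.0
# shifted:   r = 1.0, d^2 = 100.0
```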
Hierarchical Clustering
General Steps for n objects
Step 1: There are n clusters at the beginning and each object is a cluster. Compute pairwise distances among all clusters.
Step 2: Find the minimum distance and merge the corresponding two clusters into one cluster.
Step 3: Based on the resulting n-1 clusters, compute pairwise distances among all n-1 clusters.
Step 4: Find the minimum distance and merge the corresponding two clusters into one cluster.
Step 5: Repeat Steps 2-4 until all n objects merge into one big cluster.
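These steps are what standard software carries out; a minimal sketch with SciPy on a made-up data matrix (the data and the choice of 10 objects with 3 variables are assumptions for illustration):

```python
# Agglomerative (bottom-up) hierarchical clustering with SciPy.
import numpy as np
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 3))             # n = 10 objects, p = 3 variables (simulated)

d = pdist(X, metric="euclidean")         # Step 1: pairwise distances
Z = linkage(d, method="single")          # Steps 2-5: repeatedly merge the two
                                         # closest clusters until one remains
labels = fcluster(Z, t=3, criterion="maxclust")   # cut the tree into 3 clusters
print(Z[:3])       # each row: the two clusters merged, merge distance, new size
print(labels)
```

Changing `method` to "complete", "average" or "centroid" gives the other linkage rules discussed below.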
Hierarchical Clustering
Single Linkage (Nearest-neighbor) Method
Use the minimal distance
Distance matrix for 5 objects:

       1    2    3    4    5
1      0
2      9    0
3      3    7    0
4      6    5    9    0
5     11   10    2    8    0
Hierarchical Clustering
Single Linkage Method
Step 1: 5 clusters: {1},{2},{3},{4},{5}
Step 2: min{dij} = d35 = 2; merge objects 3 and 5 into one cluster {35}
Step 3: Find the minimal distances among {35},{1},{2},{4}
d{35}1 = min[d31,d51] = min[3,11] = 3
d{35}2 = min[d32,d52] = min[7,10] = 7
d{35}4 = min[d34,d54] = min[9,8] = 8
Hierarchical Clustering
Single Linkage Method
Update the distance matrix:

       {35}   1    2    4
{35}     0
1        3    0
2        7    9    0
4        8    6    5    0
Hierarchical Clustering
Single Linkage Method
Step 4: The minimal distance is 3, between {35} and {1}; merge {35} and {1} into {135}
Step 5: Find the distances between {135} and {2},{4}
d{135}2 = min[d{35}2,d12] = min[7,9] = 7
d{135}4 = min[d{35}4,d14] = min[8,6] = 6
Hierarchical Clustering
Single Linkage Method
Update the distance matrix:

        {135}   2    4
{135}      0
2          7    0
4          6    5    0

The minimal distance is 5, between {2} and {4}
Merge {2} and {4} into {24}
Hierarchical Clustering
Single Linkage Method
Find the minimum distance between {135} and {24}:
d{135}{24} = min[d{135}2,d{135}4] = min[7,6] = 6
Update the distance matrix:

        {135}  {24}
{135}      0
{24}       6     0
Hierarchical Clustering
Single Linkage Method
Distance   Clusters
2          {1},{35},{2},{4}
3          {135},{2},{4}
4          {135},{2},{4}
5          {135},{24}
6          {12345}
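The worked example can be checked mechanically; a sketch using the reconstructed 5x5 distance matrix (the SciPy call is an assumption about tooling, not part of the original slides):

```python
# Verify the single-linkage merges of the worked example with SciPy.
import numpy as np
from scipy.spatial.distance import squareform
from scipy.cluster.hierarchy import linkage

D = np.array([[ 0,  9,  3,  6, 11],
              [ 9,  0,  7,  5, 10],
              [ 3,  7,  0,  9,  2],
              [ 6,  5,  9,  0,  8],
              [11, 10,  2,  8,  0]], dtype=float)

Z = linkage(squareform(D), method="single")
print(Z)
# Merge heights 2, 3, 5 and 6 reproduce the table above:
# {35}, then {135}, then {24}, and finally {135} with {24} at distance 6.
```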
Hierarchical Clustering
Dendrograms
A 2-dimensional tree structure rooted at the top
One dimension is the distance measure
The other dimension is the clustering result
The height of the vertical (horizontal) line represents the distance between the two clusters it merges
Greater height represents greater distance
Hierarchical Clustering
Complete Linkage (Farthest-neighbor) Method
Use the maximal distance
Distance matrix for 5 objects:

       1    2    3    4    5
1      0
2      9    0
3      3    7    0
4      6    5    9    0
5     11   10    2    8    0
Hierarchical Clustering
Complete Linkage Method
Step 1: 5 clusters: {1},{2},{3},{4},{5}
Step 2: min{dij} = d35 = 2; merge objects 3 and 5 into one cluster {35}
Step 3: Find the maximal distances among {35},{1},{2},{4}
d{35}1 = max[d31,d51] = max[3,11] = 11
d{35}2 = max[d32,d52] = max[7,10] = 10
d{35}4 = max[d34,d54] = max[9,8] = 9
Hierarchical Clustering
Complete Linkage Method
Update the distance matrix:

       {35}   1    2    4
{35}     0
1       11    0
2       10    9    0
4        9    6    5    0
Hierarchical Clustering
Complete Linkage Method
Step 4: The minimal distance is 5, between {2} and {4}; merge {2} and {4} into {24}
Step 5: Find the maximal distances
d{24}{35} = max[d2{35},d4{35}] = max[10,9] = 10
d{24}1 = max[d21,d41] = max[9,6] = 9
Hierarchical Clustering
Complete Linkage Method
Update the distance matrix:

       {35}  {24}   1
{35}     0
{24}    10     0
1       11     9    0

The minimal distance is 9, between {1} and {24}
Merge {1} and {24} into {124}
Hierarchical Clustering
Complete Linkage Method
Find the maximal distance between {124} and {35}:
d{124}{35} = max[d1{35},d{24}{35}] = max[11,10] = 11
Update the distance matrix:

       {35}  {124}
{35}     0
{124}   11     0
Hierarchical Clustering
Complete Linkage Method
Distance   Clusters
2          {35},{1},{2},{4}
5          {35},{1},{24}
9          {35},{124}
11         {12345}
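The same matrix can be re-clustered with complete linkage and the tree cut into flat clusters; a short sketch (again an assumption about tooling):

```python
# Complete linkage on the same five objects, then cutting the tree.
# d is the condensed form of the 5x5 distance matrix, pairs ordered
# (1,2),(1,3),(1,4),(1,5),(2,3),(2,4),(2,5),(3,4),(3,5),(4,5).
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

d = np.array([9, 3, 6, 11, 7, 5, 10, 9, 2, 8], dtype=float)
Z = linkage(d, method="complete")
print(Z)                                          # merge heights 2, 5, 9, 11
print(fcluster(Z, t=10, criterion="distance"))    # cut below 11: {1,2,4} and {3,5}
```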
Average Clustering
Average Linkage Method
Use the average distance: the mean of all pairwise distances between members of the two clusters,
d(A,B) = [ sum over i in A and j in B of d_ij ] / (nA x nB)
Average Clustering
Average Linkage Method
Use the average distance
Distance matrix for 5 objects:

       1    2    3    4    5
1      0
2      9    0
3      3    7    0
4      6    5    9    0
5     11   10    2    8    0
Hierarchical Clustering
Average Linkage Method
Step 1: 5 clusters: {1},{2},{3},{4},{5}
Step 2: min{dij} = d35 = 2; merge objects 3 and 5 into one cluster {35}
Step 3: Find the average distances among {35},{1},{2},{4}
d{35}1 = (d31+d51)/(2x1) = (3+11)/2 = 7
d{35}2 = (d32+d52)/(2x1) = (7+10)/2 = 8.5
d{35}4 = (d34+d54)/(2x1) = (9+8)/2 = 8.5
Hierarchical Clustering
Average Linkage Method
Update the distance matrix:

       {35}    1     2     4
{35}     0
1        7     0
2       8.5    9     0
4       8.5    6     5     0
Hierarchical Clustering
Average Linkage Method
Step 4: The minimal distance is 5, between {2} and {4}; merge {2} and {4} into {24}
Step 5: Find the average distances
d{24}{35} = (d23+d25+d43+d45)/(2x2) = (7+10+9+8)/(2x2) = 8.5
d{24}1 = (d21+d41)/(2x1) = (9+6)/2 = 7.5
Hierarchical Clustering
Average Linkage Method
Update the distance matrix:

       {35}  {24}    1
{35}     0
{24}    8.5    0
1        7    7.5    0

The minimal distance is 7, between {1} and {35}
Merge {1} and {35} into {135}
Hierarchical Clustering
Average Linkage Method
Find the average distance between {24} and {135}:
d{24}{135} = (d12+d14+d32+d34+d52+d54)/(3x2) = (9+6+7+9+10+8)/(3x2) = 8.17
Update the distance matrix:

        {135}  {24}
{135}     0
{24}    8.17     0
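The group-average calculation is simply the mean of all between-cluster pairwise distances; a short sketch of that arithmetic on the reconstructed matrix (the NumPy usage is an assumption, not part of the slides):

```python
# Group-average distance between clusters {24} and {135}.
import numpy as np

D = np.array([[ 0,  9,  3,  6, 11],
              [ 9,  0,  7,  5, 10],
              [ 3,  7,  0,  9,  2],
              [ 6,  5,  9,  0,  8],
              [11, 10,  2,  8,  0]], dtype=float)

A = [1, 3]        # cluster {24}  (0-based indices of objects 2 and 4)
B = [0, 2, 4]     # cluster {135} (0-based indices of objects 1, 3 and 5)

d_avg = D[np.ix_(A, B)].mean()    # mean of the 2 x 3 block of pairwise distances
print(round(d_avg, 2))            # (9+6+7+9+10+8)/6 = 8.17
```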
Hierarchical Clustering
Average Linkage Method
Distance   Clusters
2          {35},{1},{2},{4}
5          {35},{1},{24}
7          {135},{24}
8.17       {12345}
Hierarchical Clustering
Example: Manly (2005)
Distance matrix of 5 objects [matrix not shown]
Hierarchical Clustering
Single Linkage Method
Distance   Clusters
2          {12},{3},{4},{5}
3          {12},{3},{45}
4          {12},{345}
5          {12345}
The same results are obtained from the complete and average linkage methods.
Hierarchical Clustering
Example: Canine groups by single linkage clustering
Distance   Clusters                        # of clusters
0.72       {MD,PD},GJ,CW,IW,CU,DI          6
1.38       {MD,PD,CU},GJ,CW,IW,DI          5
1.68       {MD,PD,CU,DI},GJ,CW,IW          4
2.07       {MD,PD,CU,DI,GJ},CW,IW          3
2.31       {MD,PD,CU,DI,GJ},{CW,IW}        2
2.37       {MD,PD,CU,DI,GJ,CW,IW}          1
Results of the single linkage method for European employment data [dendrogram not shown]
Hierarchical Clustering
Centroid (Center or Average) Method
Start with each object being a cluster
Merge the two clusters with the shortest distance
Compute the centroid of the new cluster as the average of all variables over its members, and update the distance matrix using the centroids of the new clusters
Repeat the above steps until all objects form one cluster
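A sketch of the centroid method on the student data; only the scores of students 1 and 4 appear on the slides, so the other four rows here are made-up values (the full 6-student table is not reproduced):

```python
# Centroid-method hierarchical clustering: SciPy recomputes cluster centroids
# after each merge when given the raw observation matrix.
import numpy as np
from scipy.cluster.hierarchy import linkage

X = np.array([[85, 82],    # student 1: Chinese, Math (from the example)
              [30, 28],    # student 2 (hypothetical)
              [60, 58],    # student 3 (hypothetical)
              [90, 95],    # student 4: Chinese, Math (from the example)
              [35, 34],    # student 5 (hypothetical)
              [65, 67]])   # student 6 (hypothetical)

Z = linkage(X, method="centroid")   # merge heights are centroid-to-centroid distances
print(Z)
```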
Hierarchical Clustering
Centroid Method
Euclidean distance matrix of the 6 students [matrix not shown]
Hierarchical Clustering
Centroid Method
The shortest distance is between student {1} and student {4}
Merge {1} and {4} into {14}
Compute the averages for Chinese and Math:
Average of Chinese = (85+90)/2 = 87.5
Average of Math = (82+95)/2 = 88.5
Hierarchical Clustering
Centroid Method
Update the Euclidean distance matrix using the centroid of {14} [matrix not shown]
Hierarchical Clustering
Centroid Method
The shortest distance is between {2} and {5}
Merge {2} and {5} into {25}
The average of Chinese for {25} is 32.5
The average of Math for {25} is 31.0
Hierarchical Clustering
Centroid Method
Update the Euclidean distance matrix using the centroids of {14} and {25} [matrix not shown]
Hierarchical Clustering
Centroid Method
The shortest distance is between {3} and {6}
Merge {3} and {6} into {36}
The average of Chinese for {36} is 62.5
The average of Math for {36} is 62.5
Hierarchical Clustering
Centroid Method
Update the Euclidean distance matrix using the centroids of {14}, {25} and {36} [matrix not shown]
Hierarchical Clustering
Centroid Method
The shortest distance is between {14} and {36}
Merge {14} and {36} into {1346}
Cluster means of Chinese and Math for {25} and {1346} [table not shown]
Hierarchical Clustering
Centroid Method
The distance between {25} and {1346} is 61.53
Distance   Clusters
13.93      {14},{2},{3},{5},{6}
15.13      {14},{25},{3},{6}
15.81      {14},{25},{36}
36.07      {1346},{25}
61.53      {123456}
Hierarchical Clustering
Application to gene expression data from microarray experiments
# of genes >>> # of subjects
Clustering in two directions:
Clusters of subjects (patients)
Clusters of genes
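A sketch of two-way clustering of an expression matrix; the simulated matrix, its dimensions, and the use of seaborn's clustermap are assumptions for illustration:

```python
# Cluster a (genes x subjects) expression matrix in both directions.
import numpy as np
import seaborn as sns

rng = np.random.default_rng(1)
expr = rng.normal(size=(200, 20))     # 200 genes, 20 subjects (simulated)

# clustermap builds one dendrogram over the rows (genes) and one over the
# columns (subjects); 1 - Pearson correlation is a common distance choice here.
g = sns.clustermap(expr, method="average", metric="correlation")
g.savefig("two_way_clustering.png")
```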
Hierarchical Clustering
The complexity of a bottom-up method can vary between O(n^2) and O(n^3), depending on the linkage chosen.
The complexity of a top-down method can vary between O(n log n) and O(n^2), depending on the linkage chosen.
Hierarchical Clustering
Determination of the number of clusters
Criteria:
Root-mean-square total-sample standard deviation (RMSSTD)
Semipartial R-square (SPRSQ)
R-square (RSQ)
Minimum distance (MD)
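A sketch of the R-square criterion only, RSQ = 1 - (within-cluster SS)/(total SS), computed after cutting a hierarchical tree at several numbers of clusters; the data and tree are simulated, so this is an illustrative assumption rather than the SAS output behind the slides:

```python
# R-square (RSQ) for k = 1..5 clusters obtained by cutting an average-linkage tree.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (20, 2)), rng.normal(6, 1, (20, 2))])  # two groups
Z = linkage(X, method="average")

sst = ((X - X.mean(axis=0)) ** 2).sum()                 # total sum of squares
for k in range(1, 6):
    labels = fcluster(Z, t=k, criterion="maxclust")
    ssw = sum(((X[labels == c] - X[labels == c].mean(axis=0)) ** 2).sum()
              for c in np.unique(labels))               # pooled within-cluster SS
    print(k, round(1 - ssw / sst, 3))                   # RSQ jumps sharply at k = 2
```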
Hierarchical Clustering
Determination of the number of clusters
Example: test scores of 6 students – RMSSTD, SPRSQ, RSQ and MD by number of clusters [table not shown]
K-means Clustering
Step 1: Select the number of clusters, say K, and determine the distance measure, such as Euclidean distance or 1 - Pearson correlation coefficient
Step 2: Divide the n objects into K clusters, either randomly or based on a preliminary hierarchical clustering
Step 3: Compute the centroid of each cluster and calculate the distances of each object to the centroids of all clusters
K-means Clustering
Step 4: For each object, find the minimal distance and reallocate the object to the corresponding cluster with the minimal distance
Step 5: Update the clusters and their centroids
Step 6: Repeat Steps 3 and 4 until no reallocation of objects among clusters occurs (see the sketch below)
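A minimal sketch of this loop with Euclidean distance; the data, K = 2 and the helper function name `kmeans` are assumptions for illustration:

```python
# Plain K-means following Steps 2-6 above.
import numpy as np

def kmeans(X, K, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), K, replace=False)]    # Step 2: initial clusters
    for _ in range(n_iter):
        # Step 3: distance of every object to every centroid
        dist = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dist.argmin(axis=1)                        # Step 4: reallocate objects
        new_centroids = np.array([X[labels == k].mean(axis=0) for k in range(K)])
        if np.allclose(new_centroids, centroids):           # Step 6: stop when stable
            break
        centroids = new_centroids                           # Step 5: update centroids
    return labels, centroids

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (30, 2)), rng.normal(5, 1, (30, 2))])  # two groups
labels, centroids = kmeans(X, K=2)
print(centroids)
```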
K-means Clustering
The number of computations that need to be performed can be written as c*p, where c is a value that depends on the number of iterations and p is the number of variables (e.g., the number of genes)
K-means Clustering
The number of clusters is selected to maximize the between-cluster sum of squares (variation) and to minimize the within-cluster sum of squares (variation)
The best-of-10 partition: apply the K-means method 10 times using 10 different randomly chosen sets of initial clusters, and choose the result that minimizes the within-cluster sum of squares
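scikit-learn's KMeans already implements this best-of-n_init strategy, keeping the run with the smallest within-cluster sum of squares; a sketch on simulated data (the library choice and the data are assumptions):

```python
# Best-of-10 K-means: rerun from 10 random starts and keep the lowest inertia.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0, 1, (30, 2)), rng.normal(5, 1, (30, 2))])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.inertia_)      # within-cluster sum of squares of the best of the 10 runs
print(km.labels_[:10])
```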
Issues and Limitations
With considerable overlap between the initial groups, cluster analysis may produce a result that is quite different from the true situation
Different approaches may give different results
The dendrogram itself is almost never the answer to the research question
Hierarchical diagrams convey information only in their topology
Issues and Limitations
Shapes of clusters can create difficulty in cluster analysis [figure not shown]:
(a) and (b): recovered by any reasonable algorithm
(c): some methods will fail because of overlapping points
(d), (e) and (f): great challenges for most clustering algorithms
Issues and Limitations
Anything can be clustered
The same clustering algorithm applied to the same data may produce different results
The magnitudes of the distance measures in the dendrogram tend to be ignored
The position of the patterns within the clusters does not reflect their relationship in the input space
Summary
Goals
Methods
Hierarchical methods: single, complete, average, centroid
K-means clustering
Limitations