Université d’Ottawa / University of Ottawa 2001 Bio 8100s Applied Multivariate Biostatistics L10.1 Lecture 10: Cluster analysis


1 Université d’Ottawa / University of Ottawa 2001 Bio 8100s Applied Multivariate Biostatistics
Lecture 10: Cluster analysis
- Uses of cluster analysis
- Clustering methods
  - Hierarchical
  - Partitioned
  - Additive trees
- Cluster distance metrics
[Figure: dendrogram of Modern dog, Golden jackal, Chinese wolf, Cuon, Dingo and Pre-dog; axis: Distance, 0 to 4]

2 Cluster analysis I: grouping objects
- Given a set of p variables X1, X2, …, Xp, and a set of N objects, the task is to group the objects into classes so that objects within classes are more similar to one another than to members of other classes.
- Questions of interest: does the set of objects fall into a smaller set of “natural” groups? What are the relationships among different objects?
- Note: in most cases, clusters are not defined a priori.

3 Cluster analysis II: grouping variables
- Given a set of p variables X1, X2, …, Xp, and a set of N objects, the task is to group the variables into classes so that variables within classes are more highly correlated with one another than with members of other classes.
- Questions of interest: does the set of variables fall into a smaller set of “natural” groups? What are the relationships among different variables?

4 Cluster analysis III: grouping objects and variables
- Given a set of p variables X1, X2, …, Xp, and a set of N objects, the task is to group the objects and variables into classes so that variables and objects within classes are more highly correlated with, or more similar to, one another than to members of other classes.
- Questions of interest: does the set of variable/object combinations fall into a smaller set of “natural” groups? What are the relationships among the different combinations?

5 The basic principle
- Objects that are similar to/highly correlated with one another should be in the same group, whereas objects that are dissimilar/uncorrelated should be in different groups.
- Thus, all cluster analyses begin with measures of similarity/dissimilarity among objects (distance matrices) or correlation matrices.
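As a rough illustration of the two starting points named on this slide, the sketch below builds a distance matrix among objects and a correlation matrix among variables. The data values and the function names are made up for the example; nothing here comes from the lecture's own data set.

```python
import numpy as np

# Hypothetical data: 4 objects (rows) measured on 3 variables (columns).
X = np.array([
    [1.0, 2.0, 3.0],
    [2.0, 4.0, 1.0],
    [3.0, 6.0, 2.0],
    [4.0, 8.0, 5.0],
])

def object_distances(data):
    """Pairwise Euclidean distances among rows (objects)."""
    diff = data[:, None, :] - data[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def variable_correlations(data):
    """Pairwise Pearson correlations among columns (variables)."""
    return np.corrcoef(data, rowvar=False)

D = object_distances(X)       # N x N: starting point for clustering objects
R = variable_correlations(X)  # p x p: starting point for clustering variables
```

Here D is 4 x 4 (one row per object) and R is 3 x 3 (one row per variable); the first two variables are perfectly correlated by construction.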

6 Clustering objects
- Objects that are closer together based on pairwise multivariate distances or pairwise correlations are assigned to the same cluster, whereas those farther apart or having low pairwise correlations are assigned to different clusters.

7 Clustering variables
- Variables that have high pairwise correlations are assigned to the same cluster, whereas those having low pairwise correlations are assigned to different clusters.

8 Clustering objects and variables
- Object/variable combinations are classified into discrete categories determined by the magnitude of the corresponding entries in the original data matrix.
- Allows for easier visualization of object/variable combinations.

9 Types of clusters
- Exclusive: each object/variable belongs to one and only one cluster.
- Overlapping: an object or variable may belong to more than one cluster.
[Figure: overlapping clusters and exclusive clusters]

10 Scale considerations
- In general, correlation measures are not influenced by differences in scale, but distance measures (e.g. Euclidean distance) are affected.
- So, use distance measures when variables are measured on common scales, or compute distance measures based on standardized values when variables are not on the same scale.
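A small sketch of why scale matters for distances. The three objects below are invented: one variable is recorded in millimetres and the other in metres. Standardizing to z-scores (one common choice, assumed here) changes which object is nearest to the first one.

```python
import numpy as np

# Hypothetical measurements: column 0 in millimetres, column 1 in metres.
X = np.array([
    [1000.0, 1.0],
    [1200.0, 1.1],
    [1100.0, 3.0],
])

def euclidean(a, b):
    """Euclidean distance between two objects."""
    return float(np.sqrt(((a - b) ** 2).sum()))

def standardize(data):
    """Convert each column to z-scores (mean 0, sample SD 1)."""
    return (data - data.mean(axis=0)) / data.std(axis=0, ddof=1)

Z = standardize(X)

# Raw distances are dominated by the millimetre column.
raw_01, raw_02 = euclidean(X[0], X[1]), euclidean(X[0], X[2])
# After standardization both columns contribute on the same footing.
std_01, std_02 = euclidean(Z[0], Z[1]), euclidean(Z[0], Z[2])
```

On the raw values the first object is closer to the third; on the standardized values it is closer to the second.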

11 Exclusive clustering methods I. Hierarchical clustering of objects
- Begins with calculation of distances/correlations among all pairs of objects…
- …with groups being formed by agglomeration (lumping of objects).
- The end result is a dendrogram (tree) which shows the distances between pairs of objects.
[Figure: dendrogram of Modern dog, Golden jackal, Chinese wolf, Cuon, Dingo and Pre-dog; axis: Distance, 0 to 4]

12 Exclusive clustering methods I. Hierarchical clustering of variables
- Begins with calculation of correlations/distances between all pairs of variables…
- …with groups being formed by lumping of highly correlated variables.
- The end result is a dendrogram or tree which shows the distances between pairs of variables.
[Figure: dendrogram of variables MANDBRTH, MANDHT, MOLARL, MOLARBR, MOLARS and MOLARS2; axis: Distance, 0 to 15]

13 Hierarchical clustering of objects and variables
- Standardized data matrix is used to produce a two-dimensional colour/shading graph with colour codes/shading intensities determined by the magnitude of the values in the original data matrix…
- …which allows one to pick out “similar” objects and variables at a glance.

14 Hierarchical joining algorithms
- Single (nearest neighbour): distance between two clusters = distance between the two closest members of the two clusters.
- Complete (furthest neighbour): distance between two clusters = distance between the two most distant cluster members.
- Centroid: distance between two clusters = distance between the multivariate means of each cluster.
[Figure: Cluster 1, Cluster 2 and Cluster 3, with the single, centroid and complete distances marked]
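The three definitions on this slide can be written down directly when the clusters are given as point coordinates. The sketch below assumes Euclidean distance and uses two invented clusters; the function names are illustrative only.

```python
import math

def single(c1, c2):
    """Distance between the two closest members of the two clusters."""
    return min(math.dist(a, b) for a in c1 for b in c2)

def complete(c1, c2):
    """Distance between the two most distant members of the two clusters."""
    return max(math.dist(a, b) for a in c1 for b in c2)

def centroid(c1, c2):
    """Distance between the multivariate means of the two clusters."""
    m1 = [sum(col) / len(c1) for col in zip(*c1)]
    m2 = [sum(col) / len(c2) for col in zip(*c2)]
    return math.dist(m1, m2)

# Hypothetical clusters in two dimensions.
cluster_1 = [(0.0, 0.0), (0.0, 2.0)]
cluster_2 = [(3.0, 0.0), (5.0, 2.0)]

d_single = single(cluster_1, cluster_2)      # closest pair: (0, 0) and (3, 0)
d_complete = complete(cluster_1, cluster_2)  # farthest pair: (0, 0) and (5, 2)
d_centroid = centroid(cluster_1, cluster_2)  # means: (0, 1) and (4, 1)
```

For any pair of clusters the single distance is never larger than the complete distance; the centroid distance usually falls between them but is not guaranteed to.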

15 Hierarchical joining algorithms (cont’d)
- Average: distance between two clusters = average distance between all members of the two clusters.
- Median: distance between two clusters = median distance between all members of the two clusters.
- Ward: distance between two clusters = average distance between all members of the two clusters with adjustment for covariances.
[Figure: Cluster 1, Cluster 2 and Cluster 3, with the mean/median/adjusted mean of all pairwise distances marked]

16 Single joining (nearest neighbour)
Distance matrix:
Object | 1  | 2 | 3 | 4 | 5
1      |    |   |   |   |
2      | 2  |   |   |   |
3      | 6  | 5 |   |   |
4      | 10 | 9 | 4 |   |
5      | 9  | 8 | 5 | 3 |
Joining sequence:
Distance | Clusters
0        | 1, 2, 3, 4, 5
2        | (1, 2), 3, 4, 5
3        | (1, 2), 3, (4, 5)
4        | (1, 2), (3, 4, 5)
5        | (1, 2, 3, 4, 5)

17 Complete joining (furthest neighbour)
Distance matrix:
Object | 1  | 2 | 3 | 4 | 5
1      |    |   |   |   |
2      | 2  |   |   |   |
3      | 6  | 5 |   |   |
4      | 10 | 9 | 4 |   |
5      | 9  | 8 | 5 | 3 |
Joining sequence:
Distance | Clusters
0        | 1, 2, 3, 4, 5
2        | (1, 2), 3, 4, 5
3        | (1, 2), 3, (4, 5)
5        | (1, 2), (3, 4, 5)
10       | (1, 2, 3, 4, 5)

18 Average joining
Distance matrix:
Object | 1  | 2 | 3 | 4 | 5
1      |    |   |   |   |
2      | 2  |   |   |   |
3      | 6  | 5 |   |   |
4      | 10 | 9 | 4 |   |
5      | 9  | 8 | 5 | 3 |
Joining sequence:
Distance | Clusters
0        | 1, 2, 3, 4, 5
2        | (1, 2), 3, 4, 5
3        | (1, 2), 3, (4, 5)
4.5      | (1, 2), (3, 4, 5)
7.8      | (1, 2, 3, 4, 5)
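The joining distances on slides 16 to 18 can be checked with a naive agglomeration loop. This is a minimal sketch, not the software used in the lecture: at each step it merges the two clusters whose members' pairwise distances have the smallest minimum (single), maximum (complete) or mean (average). The function names are illustrative.

```python
from itertools import combinations
from statistics import mean

# Distance matrix from the slides (objects 1 to 5, lower triangle).
D = {
    (1, 2): 2,
    (1, 3): 6, (2, 3): 5,
    (1, 4): 10, (2, 4): 9, (3, 4): 4,
    (1, 5): 9, (2, 5): 8, (3, 5): 5, (4, 5): 3,
}

def d(i, j):
    """Look up the distance between two objects in either order."""
    return D[(min(i, j), max(i, j))]

def agglomerate(objects, linkage):
    """Merge the two closest clusters until one cluster remains.

    linkage is min (single), max (complete) or mean (average); it
    summarises all pairwise distances between members of two clusters.
    Returns a list of (joining distance, merged cluster) tuples.
    """
    def between(a, b):
        return linkage([d(i, j) for i in a for j in b])

    clusters = [(o,) for o in objects]
    history = []
    while len(clusters) > 1:
        a, b = min(combinations(clusters, 2), key=lambda pair: between(*pair))
        merged = tuple(sorted(a + b))
        history.append((between(a, b), merged))
        clusters = [c for c in clusters if c not in (a, b)] + [merged]
    return history

single_heights = [h for h, _ in agglomerate(range(1, 6), min)]
complete_heights = [h for h, _ in agglomerate(range(1, 6), max)]
average_heights = [h for h, _ in agglomerate(range(1, 6), mean)]
```

The last average-joining distance is 47/6, which the slide rounds to 7.8.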

19 Median joining
Distance matrix:
Object | 1  | 2 | 3 | 4 | 5
1      |    |   |   |   |
2      | 2  |   |   |   |
3      | 6  | 5 |   |   |
4      | 10 | 9 | 4 |   |
5      | 9  | 8 | 5 | 3 |
Joining sequence:
Distance | Clusters
0        | 1, 2, 3, 4, 5
2        | (1, 2), 3, 4, 5
3        | (1, 2), 3, (4, 5)
3.75     | (1, 2), (3, 4, 5)
5.44     | (1, 2, 3, 4, 5)

20 Centroid joining
Distance matrix:
Object | 1  | 2 | 3 | 4 | 5
1      |    |   |   |   |
2      | 2  |   |   |   |
3      | 6  | 5 |   |   |
4      | 10 | 9 | 4 |   |
5      | 9  | 8 | 5 | 3 |
Joining sequence:
Distance | Clusters
0        | 1, 2, 3, 4, 5
2        | (1, 2), 3, 4, 5
3        | (1, 2), 3, (4, 5)
3.75     | (1, 2), (3, 4, 5)
6.00     | (1, 2, 3, 4, 5)

21 Ward joining
Distance matrix:
Object | 1  | 2 | 3 | 4 | 5
1      |    |   |   |   |
2      | 2  |   |   |   |
3      | 6  | 5 |   |   |
4      | 10 | 9 | 4 |   |
5      | 9  | 8 | 5 | 3 |
Joining sequence:
Distance | Clusters
0        | 1, 2, 3, 4, 5
2        | (1, 2), 3, 4, 5
3        | (1, 2), 3, (4, 5)
5        | (1, 2), (3, 4, 5)
14.4     | (1, 2, 3, 4, 5)
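The joining distances in the median, centroid and Ward tables are consistent with the standard Lance-Williams update formula applied directly to the unsquared distances in the matrix. That is an inference from the numbers, not something the slides state, so treat the sketch below as one way to reproduce the tables rather than as the lecture's procedure. After clusters i and j merge, the distance from the merged cluster to any other cluster k is alpha_i * d(i, k) + alpha_j * d(j, k) + beta * d(i, j).

```python
from itertools import combinations

# Distance matrix from the slides (objects 1 to 5, lower triangle).
D0 = {
    (1, 2): 2,
    (1, 3): 6, (2, 3): 5,
    (1, 4): 10, (2, 4): 9, (3, 4): 4,
    (1, 5): 9, (2, 5): 8, (3, 5): 5, (4, 5): 3,
}

def coefficients(method, ni, nj, nk):
    """Lance-Williams (alpha_i, alpha_j, beta) for clusters of size ni, nj, nk."""
    if method == "median":
        return 0.5, 0.5, -0.25
    if method == "centroid":
        n = ni + nj
        return ni / n, nj / n, -ni * nj / n ** 2
    if method == "ward":
        n = ni + nj + nk
        return (ni + nk) / n, (nj + nk) / n, -nk / n
    raise ValueError(method)

def join(method):
    """Return the joining distances (rounded to 2 decimals) for one method."""
    clusters = [(o,) for o in range(1, 6)]
    dist = {frozenset([(i,), (j,)]): float(v) for (i, j), v in D0.items()}
    heights = []
    while len(clusters) > 1:
        a, b = min(combinations(clusters, 2), key=lambda p: dist[frozenset(p)])
        d_ab = dist[frozenset((a, b))]
        heights.append(round(d_ab, 2))
        merged = tuple(sorted(a + b))
        rest = [c for c in clusters if c not in (a, b)]
        for k in rest:
            ai, aj, beta = coefficients(method, len(a), len(b), len(k))
            dist[frozenset((merged, k))] = (
                ai * dist[frozenset((a, k))]
                + aj * dist[frozenset((b, k))]
                + beta * d_ab
            )
        clusters = rest + [merged]
    return heights

median_heights = join("median")
centroid_heights = join("centroid")
ward_heights = join("ward")
```

Note that many libraries apply the centroid, median and Ward updates to squared Euclidean distances instead, which would give different joining distances from those in the tables.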

22 Important note!
- Centroid, average, median and Ward joining need not produce a strictly hierarchical tree with increasing lumping distances, resulting in “unattached” branches.
- If you encounter this problem, try another method!
[Figure: cluster tree for objects 1 to 5 with an unattached branch marked “?”]
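A tiny made-up example of a non-increasing lumping distance, using centroid joining (one of the methods named above) on three invented points. The closest pair is joined first, but the centroid of that pair ends up closer to the third point than the pair's own joining distance.

```python
import math

# Hypothetical points forming a nearly equilateral triangle.
points = {"A": (0.0, 0.0), "B": (2.0, 0.0), "C": (1.0, 1.85)}

def centroid(names):
    """Multivariate mean of the named points."""
    coords = [points[n] for n in names]
    return tuple(sum(c) / len(coords) for c in zip(*coords))

# Step 1: A and B are the closest pair, so they are joined first.
d_ab = math.dist(points["A"], points["B"])
d_ac = math.dist(points["A"], points["C"])
d_bc = math.dist(points["B"], points["C"])
first_join = min(d_ab, d_ac, d_bc)

# Step 2: C joins at its distance from the centroid of (A, B).
second_join = math.dist(centroid(["A", "B"]), points["C"])

# The second joining distance is smaller than the first.
inversion = second_join < first_join
```

In a tree drawn with joining distance on the axis, the second join would sit below the first, which is the kind of branch that cannot be attached cleanly.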

23 Exclusive clustering methods II. Partitioned clustering
- In partitioned clustering, the objective is to partition a set of N objects into a predetermined number k of clusters by maximizing the distance between cluster centers while minimizing the within-cluster variation.

24 Partitioned clustering: the procedure
- Choose k “seed” cases which are spread as far as possible from the center of all objects.
- Assign all remaining objects to the nearest seed.
- Reassign objects so that the within-group sum of squares is reduced…
- …and continue to do so until SS within is minimized.
[Figure: objects plotted on axes X1 and X2, showing Seed 1, Seed 2, Seed 3 and the object center]
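The steps above can be sketched as follows. Two choices here are assumptions rather than the lecture's procedure: seeds are taken as the k objects farthest from the overall center (a literal reading of the first step), and reassignment alternates between "assign each object to the nearest center" and "recompute each center as the mean of its members". The function name and data are hypothetical.

```python
import numpy as np

def partition(X, k, max_iter=100):
    """Partition the rows of X into k clusters; returns (labels, centers, SS within)."""
    X = np.asarray(X, dtype=float)
    centre = X.mean(axis=0)
    # Seeds: the k objects farthest from the center of all objects (assumption).
    far = np.argsort(-np.linalg.norm(X - centre, axis=1))[:k]
    centers = X[far].copy()
    labels = None
    for _ in range(max_iter):
        # Assign every object to its nearest center.
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        new_labels = d.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break  # no object changed cluster, so SS within cannot drop further
        labels = new_labels
        # Move each center to the mean of its current members.
        for j in range(k):
            members = X[labels == j]
            if len(members):
                centers[j] = members.mean(axis=0)
    ss_within = float(((X - centers[labels]) ** 2).sum())
    return labels, centers, ss_within

# Hypothetical data: two well-separated groups of three objects.
data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels, centers, ss_within = partition(data, k=2)
```

With these data the first three objects end up in one cluster and the last three in the other.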

25 K-means clustering
- A method of partitioned clustering whereby a set of k clusters is produced by minimizing the SS within based on Euclidean distances.
- This is very much like a single-classification MANOVA with k groups, except that groups are not known a priori.
- Because k-means clustering does not search through every possible partitioning, it is always possible that there are other solutions yielding smaller SS within.
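The last point can be seen on four invented objects at the corners of a long rectangle. The sketch below uses a basic assign-and-update iteration (an assumed implementation, not the lecture's software) and starts it from two different hypothetical sets of seeds; both runs converge, but to different values of SS within.

```python
import numpy as np

def lloyd(X, centers, max_iter=100):
    """Iterate assignment and center updates from the given starting centers.

    Assumes no cluster ever becomes empty. Returns (labels, SS within).
    """
    X = np.asarray(X, dtype=float)
    centers = np.asarray(centers, dtype=float).copy()
    for _ in range(max_iter):
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        new_centers = np.array(
            [X[labels == j].mean(axis=0) for j in range(len(centers))]
        )
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, float(((X - centers[labels]) ** 2).sum())

# Hypothetical objects at the corners of a 4 x 1 rectangle.
corners = [(0, 0), (0, 1), (4, 0), (4, 1)]

# Seeds on the same short side: converges to top row vs bottom row.
_, ss_poor = lloyd(corners, centers=[(0, 0), (0, 1)])
# Seeds on opposite short sides: converges to left pair vs right pair.
_, ss_good = lloyd(corners, centers=[(0, 0), (4, 0)])
```

Running the procedure from several different seeds and keeping the solution with the smallest SS within is a common safeguard.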

26 K-means partitioning: example
- Cluster profile plots give z-scores for each variable used in clustering objects, with variables ordered by univariate F ratios.
- Zero indicates the mean of all objects.
- The more similar the profiles for objects within a cluster, the smaller the within-cluster heterogeneity.
[Figure: cluster profile plots, k = 2 clustering of 6 dog species]

27 K-means partitioning: example
- Cluster means plots give means for each variable used in clustering objects, with variables ordered by univariate F ratios.
- Dashed line indicates the mean of all objects.
- The greater the difference in group means, the greater the discriminating ability of the variable in question.
[Figure: cluster means plots, k = 2 clustering of 6 dog species]

28 Some clustering distances
Distance metric | Description | Data type
Gamma | Computed using 1 – γ correlation | Ordinal, rank order
Pearson | 1 – r for each pair of objects | Quantitative
R² | 1 – r² for each pair of objects | Quantitative
Euclidean | Normalized Euclidean distance | Quantitative
Minkowski | pth root of mean pth powered distance | Quantitative
χ² | χ² measure of independence of rows and columns on 2 × N frequency tables | Counts
MW | Increment in SS within if object moved into a particular cluster | Quantitative
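A sketch of four of the quantitative metrics in the table. The Pearson, R² and Minkowski functions follow the table's descriptions directly. Reading "normalized Euclidean" as the root mean squared difference (the Minkowski form with p = 2) is an assumption, and all function names are illustrative.

```python
import numpy as np

def pearson_distance(x, y):
    """1 - r, where r is the Pearson correlation between two profiles."""
    return 1.0 - float(np.corrcoef(x, y)[0, 1])

def r2_distance(x, y):
    """1 - r squared."""
    return 1.0 - float(np.corrcoef(x, y)[0, 1]) ** 2

def minkowski_mean(x, y, p):
    """pth root of the mean pth-powered absolute difference."""
    diff = np.abs(np.asarray(x, dtype=float) - np.asarray(y, dtype=float))
    return float((diff ** p).mean() ** (1.0 / p))

def normalized_euclidean(x, y):
    """Root mean squared difference (assumed meaning of 'normalized')."""
    return minkowski_mean(x, y, 2)

# Hypothetical profiles for three objects measured on four variables.
a = [1.0, 2.0, 3.0, 4.0]
b = [2.0, 4.0, 6.0, 8.0]   # perfectly positively correlated with a
c = [4.0, 3.0, 2.0, 1.0]   # perfectly negatively correlated with a
```

With these profiles the Pearson distance separates a from c as far as possible (distance 2) while the R² distance treats them as identical (distance 0), which shows how the choice of metric changes what counts as "similar".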

29 Exclusive non-hierarchical clustering: Additive trees
- In additive trees clustering, the objective is to partition a set of N objects into a set of clusters represented by additive rather than hierarchical trees.
- For hierarchical trees, we assume: (1) all within-cluster distances are smaller than between-cluster distances; (2) all within-cluster distances are the same. For additive trees, neither assumption need hold.

30 Additive trees
- In additive tree clustering, branch length can vary within clusters…
- …and objects within clusters are compared by considering the sum of the branch lengths connecting them.
[Figure: hierarchical tree and additive tree for objects 1 to 5]

31 Additive trees joining
Distance matrix:
Object | 1  | 2 | 3 | 4 | 5
1      |    |   |   |   |
2      | 2  |   |   |   |
3      | 6  | 5 |   |   |
4      | 10 | 9 | 4 |   |
5      | 9  | 8 | 5 | 3 |
Tree:
Node | Length | Child
1    | 1.5    | Object 1
2    | 0.5    | Object 2
6    | 4.0    | (1, 2)
7    | 2.25   | (4, 5)
8    | 0.25   | (6, 3)
[Figure: additive tree with nodes 1 to 9]
D(1,3) = 1.5 + 4.0 + 0.5 = 6.0
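The path-length sum on this slide can be sketched with a small parent-pointer tree. Only the part of the tree needed for objects 1, 2 and 3 is encoded. The branch length of 0.5 for object 3 is not listed in the table above; it is inferred from the last term of the slide's sum for D(1,3) and should be read as an assumption.

```python
# child -> (parent node, branch length)
branch = {
    1: (6, 1.5),
    2: (6, 0.5),
    6: (8, 4.0),
    3: (8, 0.5),  # assumption: inferred from the slide's sum for D(1,3)
}

def path_to_root(node):
    """Map the node and each of its ancestors to their distance from the node."""
    dist, total = {node: 0.0}, 0.0
    while node in branch:
        parent, length = branch[node]
        total += length
        dist[parent] = total
        node = parent
    return dist

def tree_distance(a, b):
    """Sum of branch lengths on the path joining a and b."""
    up_a, up_b = path_to_root(a), path_to_root(b)
    common = set(up_a) & set(up_b)
    # The nearest shared ancestor gives the shortest combined path.
    return min(up_a[n] + up_b[n] for n in common)

d_13 = tree_distance(1, 3)
d_12 = tree_distance(1, 2)
d_23 = tree_distance(2, 3)
```

Under that assumption the tree distances for these three objects match the corresponding entries of the distance matrix (6, 2 and 5).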

32 Deciding what to cluster and how to cluster them
Question | Decision
Am I interested in clustering objects, variables or both? | Choose object (row), variable (column) or both (matrix) clustering.
Do I want strictly hierarchical clusters? | Yes: hierarchical trees. No: partitioned clusters (e.g. k-means) or additive trees.
Are my variables quantitative? | Yes: quantitative metrics (e.g. Euclidean, Minkowski, etc.). No: non-quantitative metrics (e.g. γ, χ², etc.).

