Université d’Ottawa / University of Ottawa 2001 Bio 8100s Applied Multivariate Biostatistics L10.1 Lecture 10: Cluster analysis

Presentation transcript:

L10.1 Lecture 10: Cluster analysis
- Uses of cluster analysis
- Clustering methods
  - Hierarchical
  - Partitioned
  - Additive trees
- Cluster distance metrics
[Figure: tree of Modern dog, Golden jackal, Chinese wolf, Cuon, Dingo and Pre-dog; axis: Distance]

L10.2 Cluster analysis I: grouping objects
- Given a set of p variables X_1, X_2, …, X_p and a set of N objects, the task is to group the objects into classes so that objects within classes are more similar to one another than to members of other classes.
- Questions of interest: does the set of objects fall into a smaller set of “natural” groups? What are the relationships among different objects?
- Note: in most cases, clusters are not defined a priori.

L10.3 Cluster analysis II: grouping variables
- Given a set of p variables X_1, X_2, …, X_p and a set of N objects, the task is to group the variables into classes so that variables within classes are more highly correlated with one another than with members of other classes.
- Questions of interest: does the set of variables fall into a smaller set of “natural” groups? What are the relationships among different variables?

L10.4 Cluster analysis III: grouping objects and variables
- Given a set of p variables X_1, X_2, …, X_p and a set of N objects, the task is to group the objects and variables into classes so that variables and objects within classes are more highly correlated with, or more similar to, one another than members of other classes.
- Questions of interest: does the set of variable/object combinations fall into a smaller set of “natural” groups? What are the relationships among the different combinations?

L10.5 The basic principle
- Objects that are similar to/highly correlated with one another should be in the same group, whereas objects that are dissimilar/uncorrelated should be in different groups.
- Thus, all cluster analyses begin with measures of similarity/dissimilarity among objects (distance matrices) or correlation matrices.

L10.6 Clustering objects
- Objects that are closer together based on pairwise multivariate distances or pairwise correlations are assigned to the same cluster, whereas those farther apart or having low pairwise correlations are assigned to different clusters.

L10.7 Clustering variables
- Variables that have high pairwise correlations are assigned to the same cluster, whereas those having low pairwise correlations are assigned to different clusters.

L10.8 Clustering objects and variables
- Object/variable combinations are classified into discrete categories determined by the magnitude of the corresponding entries in the original data matrix.
- This allows for easier visualization of object/variable combinations.

L10.9 Types of clusters
- Exclusive: each object/variable belongs to one and only one cluster.
- Overlapping: an object or variable may belong to more than one cluster.
[Figure: overlapping clusters vs. exclusive clusters]

L10.10 Scale considerations
- In general, correlation measures are not influenced by differences in scale, but distance measures (e.g. Euclidean distance) are affected.
- So, use distance measures when variables are measured on common scales, or compute distance measures based on standardized values when variables are not on the same scale.
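The scale effect can be sketched in a few lines of Python. The data below (body mass in grams, skull length in centimetres) and the helper names `standardize` and `euclidean` are hypothetical and chosen only for illustration; standardizing to z-scores with the sample standard deviation is one common choice, not necessarily the one used in the lecture.

```python
import math
import statistics

def standardize(columns):
    """Convert each variable (column) to z-scores: (x - mean) / sd."""
    result = []
    for col in columns:
        mean = statistics.mean(col)
        sd = statistics.stdev(col)  # sample standard deviation
        result.append([(x - mean) / sd for x in col])
    return result

def euclidean(a, b):
    """Euclidean distance between two objects given as coordinate tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hypothetical data: two variables on very different scales, three objects
mass_g = [12000.0, 15000.0, 30000.0]
skull_cm = [15.0, 22.0, 16.0]

raw = list(zip(mass_g, skull_cm))                      # one tuple per object
z_columns = standardize([mass_g, skull_cm])
scaled = list(zip(*z_columns))                         # same objects, in z-scores

raw_d01 = euclidean(raw[0], raw[1])        # dominated by the mass difference
scaled_d01 = euclidean(scaled[0], scaled[1])  # both variables contribute
print(raw_d01, scaled_d01)
```

On the raw values the distance between the first two objects is essentially the 3000 g mass difference, while the 7 cm skull difference barely registers; after standardization both variables carry comparable weight.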

L10.11 Exclusive clustering methods I. Hierarchical clustering of objects
- Begins with the calculation of distances/correlations among all pairs of objects…
- …with groups being formed by agglomeration (lumping of objects).
- The end result is a dendrogram (tree) which shows the distances between pairs of objects.
[Figure: dendrogram of Modern dog, Golden jackal, Chinese wolf, Cuon, Dingo and Pre-dog; axis: Distance]

L10.12 Exclusive clustering methods I. Hierarchical clustering of variables
- Begins with the calculation of correlations/distances between all pairs of variables…
- …with groups being formed by lumping of highly correlated variables.
- The end result is a dendrogram (tree) which shows the distances between pairs of variables.
[Figure: dendrogram of variables MANDBRTH, MANDHT, MOLARL, MOLARBR, MOLARS and MOLARS2; axis: Distance]

L10.13 Hierarchical clustering of objects and variables
- A standardized data matrix is used to produce a two-dimensional colour/shading graph, with colour codes/shading intensities determined by the magnitude of the values in the original data matrix…
- …which allows one to pick out “similar” objects and variables at a glance.

L10.14 Hierarchical joining algorithms
- Single (nearest neighbour): distance between two clusters = distance between the two closest members of the two clusters.
- Complete (furthest neighbour): distance between two clusters = distance between the two most distant cluster members.
- Centroid: distance between two clusters = distance between the multivariate means of each cluster.
[Figure: Cluster 1, Cluster 2 and Cluster 3, with the single, centroid and complete distances marked]
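As a rough illustration of these three definitions, the sketch below computes each between-cluster distance for two small clusters of points. The function names and the example coordinates are made up for this sketch, and plain Euclidean distance between objects is assumed.

```python
import math
from itertools import product

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def single_distance(c1, c2):
    """Nearest neighbour: distance between the two closest members."""
    return min(euclidean(a, b) for a, b in product(c1, c2))

def complete_distance(c1, c2):
    """Furthest neighbour: distance between the two most distant members."""
    return max(euclidean(a, b) for a, b in product(c1, c2))

def centroid(cluster):
    """Multivariate mean of a cluster (mean of each coordinate)."""
    n = len(cluster)
    return tuple(sum(coords) / n for coords in zip(*cluster))

def centroid_distance(c1, c2):
    """Distance between the multivariate means of the two clusters."""
    return euclidean(centroid(c1), centroid(c2))

# Hypothetical clusters on a line, so the distances are easy to check by eye
cluster_a = [(0.0, 0.0), (1.0, 0.0)]
cluster_b = [(4.0, 0.0), (6.0, 0.0)]

print(single_distance(cluster_a, cluster_b))    # 3.0  (from x=1 to x=4)
print(complete_distance(cluster_a, cluster_b))  # 6.0  (from x=0 to x=6)
print(centroid_distance(cluster_a, cluster_b))  # 4.5  (from x=0.5 to x=5.0)
```

The same pair of clusters is 3.0, 4.5 or 6.0 apart depending on the rule, which is why different joining algorithms can yield different trees from one distance matrix.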

L10.15 Hierarchical joining algorithms (cont’d)
- Average: distance between two clusters = average distance between all members of the two clusters.
- Median: distance between two clusters = median distance between all members of the two clusters.
- Ward: distance between two clusters = average distance between all members of the two clusters, with adjustment for covariances.
[Figure: Cluster 1, Cluster 2 and Cluster 3; mean/median/adjusted mean of all pairwise distances]

L10.16 Single joining (nearest neighbour)
[Figure: distance matrix among objects]
Distance | Cluster
0 | 1, 2, 3, 4, 5
2 | (1, 2), 3, 4, 5
3 | (1, 2), 3, (4, 5)
4 | (1, 2), (3, 4, 5)
5 | (1, 2, 3, 4, 5)
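The agglomeration sequence in this table can be reproduced with a short sketch. The distance matrix on the slide did not survive, so the `DIST` values below are an assumption: a set of pairwise distances chosen because nearest-neighbour joining on them gives merge distances of 2, 3, 4 and 5, matching the table. The function names are likewise illustrative.

```python
from itertools import combinations

# ASSUMED pairwise distances among objects 1-5 (the slide's matrix was lost);
# chosen so that nearest-neighbour joining merges at distances 2, 3, 4, 5.
DIST = {
    (1, 2): 2, (1, 3): 6, (1, 4): 10, (1, 5): 9,
    (2, 3): 5, (2, 4): 9, (2, 5): 8,
    (3, 4): 4, (3, 5): 5,
    (4, 5): 3,
}

def d(i, j):
    """Look up the distance between two objects, in either order."""
    return DIST[(min(i, j), max(i, j))]

def cluster_distance(c1, c2, linkage=min):
    """Reduce all between-cluster pairwise distances with `linkage`
    (min = nearest neighbour, max = furthest neighbour)."""
    return linkage(d(i, j) for i in c1 for j in c2)

def agglomerate(objects, linkage=min):
    """Repeatedly merge the two closest clusters; return (distance, clusters)."""
    clusters = [(o,) for o in objects]
    history = []
    while len(clusters) > 1:
        a, b = min(combinations(clusters, 2),
                   key=lambda pair: cluster_distance(pair[0], pair[1], linkage))
        height = cluster_distance(a, b, linkage)
        merged = tuple(sorted(a + b))
        clusters = [c for c in clusters if c not in (a, b)] + [merged]
        history.append((height, sorted(clusters)))
    return history

for height, clusters in agglomerate([1, 2, 3, 4, 5]):
    print(height, clusters)
```

With `linkage=min` the merges occur at distances 2, 3, 4 and 5, in the order shown in the table. Passing `linkage=max` to the same function on the same assumed matrix gives 2, 3, 5 and 10, the sequence listed for complete joining.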

L10.17 Complete joining (furthest neighbour)
[Figure: distance matrix among objects]
Distance | Cluster
0 | 1, 2, 3, 4, 5
2 | (1, 2), 3, 4, 5
3 | (1, 2), 3, (4, 5)
5 | (1, 2), (3, 4, 5)
10 | (1, 2, 3, 4, 5)

L10.18 Average joining
[Figure: distance matrix among objects]
Distance | Cluster
0 | 1, 2, 3, 4, 5
2 | (1, 2), 3, 4, 5
3 | (1, 2), 3, (4, 5)
4.5 | (1, 2), (3, 4, 5)
7.8 | (1, 2, 3, 4, 5)
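The last two merge distances can be checked by hand. The slide's distance matrix was not recovered, so the pairwise values used here are an assumption: d_{34} = 4, d_{35} = 5, d_{13} = 6, d_{14} = 10, d_{15} = 9, d_{23} = 5, d_{24} = 9 and d_{25} = 8, chosen because they are consistent with the merge distances in the table. Under that assumption, the average of all between-cluster pairwise distances is:

```latex
\begin{aligned}
d\bigl(3,\,(4,5)\bigr) &= \frac{d_{34} + d_{35}}{2} = \frac{4 + 5}{2} = 4.5 \\[4pt]
d\bigl((1,2),\,(3,4,5)\bigr) &= \frac{d_{13} + d_{14} + d_{15} + d_{23} + d_{24} + d_{25}}{6}
  = \frac{6 + 10 + 9 + 5 + 9 + 8}{6} = \frac{47}{6} \approx 7.8
\end{aligned}
```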

L10.19 Median joining
[Figure: distance matrix among objects]
Distance | Cluster
0 | 1, 2, 3, 4, 5
2 | (1, 2), 3, 4, 5
3 | (1, 2), 3, (4, 5)
3.75 | (1, 2), (3, 4, 5)
5.44 | (1, 2, 3, 4, 5)

L10.20 Centroid joining
[Figure: distance matrix among objects]
Distance | Cluster
0 | 1, 2, 3, 4, 5
2 | (1, 2), 3, 4, 5
3 | (1, 2), 3, (4, 5)
3.75 | (1, 2), (3, 4, 5)
6.00 | (1, 2, 3, 4, 5)

L10.21 Ward joining
[Figure: distance matrix among objects]
Distance | Cluster
0 | 1, 2, 3, 4, 5
2 | (1, 2), 3, 4, 5
3 | (1, 2), 3, (4, 5)
5 | (1, 2), (3, 4, 5)
14.4 | (1, 2, 3, 4, 5)

L10.22 Important note!
- Centroid, average, median and Ward joining need not produce a strictly hierarchical tree with increasing lumping distances, resulting in “unattached” branches.
- If you encounter this problem, try another method!
[Figure: Cluster Tree]

L10.23 Exclusive clustering methods II. Partitioned clustering
- In partitioned clustering, the objective is to partition a set of N objects into a predetermined number k of clusters by maximizing the distance between cluster centers while minimizing the within-cluster variation.

L10.24 Partitioned clustering: the procedure
- Choose k “seed” cases which are spread apart from the center of all objects as much as possible.
- Assign all remaining objects to the nearest seed.
- Reassign objects so that the within-group sum of squares is reduced…
- …and continue to do so until SS_within is minimized.
[Figure: objects plotted on axes X_1 and X_2, showing Seed 1, Seed 2, Seed 3 and the object center]
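The procedure above can be sketched as follows. This is a minimal illustration under stated assumptions, not the lecture's exact algorithm: seeds are picked by a simple farthest-point rule (`choose_seeds`, a hypothetical helper), objects are reassigned in batches to the nearest cluster mean, and iteration stops when assignments no longer change, which yields a local rather than guaranteed global minimum of SS_within. The data points are made up.

```python
import math

def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def mean_point(points):
    n = len(points)
    return tuple(sum(c) / n for c in zip(*points))

def choose_seeds(points, k):
    """Farthest-point seeding (an assumed rule): start with the object
    farthest from the overall center, then repeatedly add the object
    farthest from its nearest existing seed."""
    center = mean_point(points)
    seeds = [max(points, key=lambda p: dist(p, center))]
    while len(seeds) < k:
        seeds.append(max(points, key=lambda p: min(dist(p, s) for s in seeds)))
    return seeds

def ss_within(points, labels, centers):
    """Sum of squared distances from each object to its cluster center."""
    return sum(dist(p, centers[lab]) ** 2 for p, lab in zip(points, labels))

def k_means(points, k, max_iter=100):
    centers = choose_seeds(points, k)
    labels = None
    for _ in range(max_iter):
        # Assign every object to its nearest current center
        new_labels = [min(range(k), key=lambda j: dist(p, centers[j]))
                      for p in points]
        if new_labels == labels:   # no object moved: stop
            break
        labels = new_labels
        # Move each center to the mean of its members
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:            # keep the old center if a cluster empties
                centers[j] = mean_point(members)
    return labels, centers, ss_within(points, labels, centers)

# Hypothetical objects forming two well-separated groups
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
labels, centers, ss = k_means(points, k=2)
print(labels, centers, ss)
```

Each pass can only lower or leave unchanged the within-group sum of squares, which is why the loop terminates; a different seeding rule may converge to a different partition.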

L10.25 K-means clustering
- A method of partitioned clustering whereby a set of k clusters is produced by minimizing SS_within based on Euclidean distances.
- This is very much like a single-classification MANOVA with k groups, except that groups are not known a priori.
- Because k-means clustering does not search through every possible partitioning, it is always possible that there are other solutions yielding a smaller SS_within.

L10.26 K-means partitioning: example
- Cluster profile plots give z-scores for each variable used in clustering objects, with variables ordered by univariate F ratios.
- Zero indicates the mean of all objects.
- The more similar the profiles for objects within a cluster, the smaller the within-cluster heterogeneity.
[Figure: cluster profile plots; k = 2 clustering of 6 dog species]

L10.27 K-means partitioning: example
- Cluster means plots give means for each variable used in clustering objects, with variables ordered by univariate F ratios.
- The dashed line indicates the mean of all objects.
- The greater the difference in group means, the greater the discriminating ability of the variable in question.
[Figure: cluster means plots; k = 2 clustering of 6 dog species]

L10.28 Some clustering distances
Distance metric | Description | Data type
Gamma | Computed using 1 − γ correlation | Ordinal, rank order
Pearson | 1 − r for each pair of objects | Quantitative
R² | 1 − r² for each pair of objects | Quantitative
Euclidean | Normalized Euclidean distance | Quantitative
Minkowski | pth root of mean pth powered distance | Quantitative
χ² | χ² measure of independence of rows and columns on 2 × N frequency tables | Counts
MW | Increment in SS_within if object moved into a particular cluster | Quantitative
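The quantitative metrics in this table can be written out directly from their descriptions. The sketch below is one reading of those descriptions, with assumptions flagged: "normalized" is taken to mean dividing by the number of variables (so the Euclidean metric is the p = 2 case of the Minkowski definition), and all function names are illustrative rather than taken from any particular package.

```python
import math
import statistics

def pearson_r(x, y):
    """Pearson correlation between two objects' variable profiles."""
    mx, my = statistics.mean(x), statistics.mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def pearson_distance(x, y):
    """Pearson metric: 1 - r."""
    return 1 - pearson_r(x, y)

def r2_distance(x, y):
    """R-squared metric: 1 - r^2."""
    return 1 - pearson_r(x, y) ** 2

def minkowski_distance(x, y, p):
    """pth root of the mean pth-powered absolute difference."""
    mean_power = sum(abs(a - b) ** p for a, b in zip(x, y)) / len(x)
    return mean_power ** (1 / p)

def normalized_euclidean(x, y):
    """ASSUMPTION: 'normalized' = root mean squared difference (p = 2)."""
    return minkowski_distance(x, y, 2)

# Hypothetical profiles
print(pearson_distance([1, 2, 3], [2, 4, 6]))   # perfectly correlated: about 0
print(pearson_distance([1, 2, 3], [3, 2, 1]))   # perfectly anti-correlated: about 2
print(r2_distance([1, 2, 3], [3, 2, 1]))        # sign is ignored: about 0
print(normalized_euclidean([0, 0], [3, 4]))     # sqrt((9 + 16) / 2)
```

Note the difference between the two correlation-based metrics: 1 − r treats strongly negatively correlated profiles as maximally distant, whereas 1 − r² treats them as close.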

L10.29 Exclusive non-hierarchical clustering: Additive trees
- In additive trees clustering, the objective is to partition a set of N objects into a set of clusters represented by additive rather than hierarchical trees.
- For hierarchical trees, we assume: (1) all within-cluster distances are smaller than between-cluster distances; (2) all within-cluster distances are the same. For additive trees, neither assumption need hold.

L10.30 Additive trees
- In additive tree clustering, branch length can vary within clusters…
- …and objects within clusters are compared by considering the sum of the branch lengths connecting them.
[Figure: hierarchical tree vs. additive tree]

L10.31 Additive trees joining
[Figure: distance matrix among objects]
Node | Length | Child
1 | 1.5 | Object 1
2 | 0.5 | Object 2
6 | 4.0 | (1, 2)
7 | 2.25 | (4, 5)
8 | 0.25 | (6, 3)
D_{1,3} = … = 6.0
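In an additive tree, the distance between two objects is the sum of the branch lengths on the path connecting them. The sketch below applies that rule to the only rows of the table that are fully usable here: nodes 1 and 2, which are both children of node 6. It assumes that "Length" is the length of the branch directly above each node; the dictionaries and function names are hypothetical, and the rows for nodes 3, 4 and 5 are left out because their lengths are not given.

```python
# Branch length above each node and each node's parent, from the surviving
# table rows. ASSUMPTION: "Length" is the branch directly above the node.
# Nodes 3, 4 and 5 are omitted because their lengths are not available.
BRANCH_LENGTH = {1: 1.5, 2: 0.5, 6: 4.0}
PARENT = {1: 6, 2: 6}

def path_to_root(node):
    """List of nodes from `node` up to the topmost known ancestor."""
    path = [node]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path

def tree_distance(a, b):
    """Sum of branch lengths on the path joining objects a and b."""
    path_a, path_b = path_to_root(a), path_to_root(b)
    common = next(n for n in path_a if n in path_b)  # lowest common ancestor
    climb = path_a[:path_a.index(common)] + path_b[:path_b.index(common)]
    return sum(BRANCH_LENGTH[n] for n in climb)

d12 = tree_distance(1, 2)
print(d12)  # 1.5 + 0.5 = 2.0
```

The branch above node 6 (length 4.0) does not enter this distance, because the path from object 1 to object 2 turns around at node 6 without climbing past it.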

L10.32 Deciding what to cluster and how to cluster them
Question | Decision
Am I interested in clustering objects, variables or both? | Choose object (row), variable (column) or both (matrix) clustering.
Do I want strictly hierarchical clusters? | Yes: hierarchical trees. No: partitioned clusters (e.g. k-means) or additive trees.
Are my variables quantitative? | Yes: quantitative metrics (e.g. Euclidean, Minkowski, etc.). No: non-quantitative metrics (e.g. γ, χ², etc.).