Dimension reduction: PCA and Clustering by Agnieszka S. Juncker Parts of the slides are adapted from Chris Workman
Motivation: Multidimensional data [table: expression values for probes (rows, Affymetrix "_at"/"_s_at"/"_x_at" IDs) across patients Pat1–Pat8 (columns)]
Dimension reduction methods Principal component analysis Cluster analysis Multidimensional scaling Correspondence analysis Singular value decomposition
Principal Component Analysis (PCA) –used for visualization of complex data –developed to capture as much of the variation in the data as possible
Principal components 1st principal component (PC1) –the direction along which there is the greatest variation 2nd principal component (PC2) –the direction with the maximum variation remaining in the data, orthogonal to PC1
Principal components
General properties of principal components –summary variables –linear combinations of the original variables –uncorrelated with each other –capture as much of the original variance as possible
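The properties above can be sketched with plain NumPy: centering the data and taking its SVD yields the principal directions, uncorrelated scores, and the variance captured by each component. This is a minimal illustration on synthetic data, not the course's leukemia set.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic data: 50 samples of 5 correlated variables (a stand-in for an
# expression matrix); built from 2 latent factors plus a little noise
X = rng.normal(size=(50, 2)) @ rng.normal(size=(2, 5)) + 0.1 * rng.normal(size=(50, 5))

# PCA is defined on mean-centered variables
Xc = X - X.mean(axis=0)

# SVD of the centered matrix: rows of Vt are the principal directions
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

# Scores: projection of each sample onto the principal components
scores = Xc @ Vt.T

# Fraction of the total variance captured by each component
explained = s**2 / np.sum(s**2)

print(explained)  # decreasing; PC1 captures the most variation
```

Because the synthetic data is essentially rank 2, `explained[:2]` accounts for nearly all the variance, mirroring why a 2-D PCA plot can summarize thousands of genes.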
PCA - example
PCA on all Genes Leukemia data, precursor B and T Plot of 34 patients, dimension of 8973 genes reduced to 2
PCA on 100 top significant genes Leukemia data, precursor B and T Plot of 34 patients, dimension of 100 genes reduced to 2
PCA of genes (Leukemia data) Plot of 8973 genes, dimension of 34 patients reduced to 2
Principal components - Variance
Clustering methods Hierarchical Partitioning –K-means clustering –Self Organizing Maps (SOM)
Hierarchical clustering of leukemia data
Hierarchical clustering Representation of all pairwise distances Parameters: none (only the choice of distance measure) Results: –all items joined into one large cluster –a hierarchical tree (dendrogram) Deterministic
Hierarchical clustering - Algorithm Assign each item to its own cluster Join the two nearest clusters Re-estimate the distances between clusters Repeat until all items are in one cluster (n−1 joins for n items)
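The agglomerative steps above are implemented in SciPy's `linkage`; a minimal sketch, assuming SciPy is available and using small synthetic groups rather than the leukemia data:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(1)
# Two well-separated synthetic groups of 5 items with 4 measurements each
X = np.vstack([rng.normal(0, 0.3, size=(5, 4)),
               rng.normal(3, 0.3, size=(5, 4))])

# Agglomerative clustering: start with each item in its own cluster and
# repeatedly join the two nearest clusters (average linkage, Euclidean distance)
Z = linkage(X, method="average", metric="euclidean")

# Z encodes the dendrogram; cutting it at 2 clusters recovers the two groups
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)
```

`Z` has one row per join (n−1 rows for n items), which is exactly the "repeat until one cluster" loop of the slide.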
Hierarchical clustering
Data with clustering order and distances Dendrogram representation
Leukemia data - clustering of genes
Leukemia data - clustering of patients
Leukemia data - clustering of patients on top 100 significant genes
K-means clustering Partition data into K clusters Parameter: the number of clusters (K) must be chosen Randomized initialization: –may give different clusters each time
K-means - Algorithm Assign each item a class in 1 to K (randomly) For each class 1 to K –Calculate the centroid (one of the K means) –Calculate the distance from the centroid to each item Assign each item to the nearest centroid Repeat until no items are re-assigned (convergence)
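The loop above translates almost line for line into NumPy. A minimal sketch on synthetic 2-D data (random initialization, centroid update, nearest-centroid assignment, stop at convergence):

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Minimal K-means following the slide's steps."""
    rng = np.random.default_rng(seed)
    labels = rng.integers(0, k, size=len(X))      # random initial classes
    for _ in range(n_iter):
        # centroid of each class (fall back to a random item if a class empties)
        centroids = np.array([X[labels == j].mean(axis=0)
                              if np.any(labels == j) else X[rng.integers(len(X))]
                              for j in range(k)])
        # distance from every item to every centroid
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = d.argmin(axis=1)             # assign to nearest centroid
        if np.array_equal(new_labels, labels):    # no re-assignments: converged
            break
        labels = new_labels
    return labels, centroids

rng = np.random.default_rng(2)
# two well-separated synthetic clusters
X = np.vstack([rng.normal(0, 0.2, (20, 2)), rng.normal(4, 0.2, (20, 2))])
labels, centroids = kmeans(X, k=2)
print(labels)
```

Note the randomized initialization: different seeds can give different clusterings, which is the non-determinism the previous slide warns about.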
K-means clustering, K=3
K-means clustering of Leukemia data
Self Organizing Maps (SOM) Partitioning method (similar to the K-means method) Clusters are organized in a two-dimensional grid The size of the grid is specified –(e.g. 2x2 or 3x3) The SOM algorithm finds the optimal organization of the data in the grid
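A bare-bones sketch of the SOM idea in NumPy (a hypothetical toy implementation, not the algorithm used in the exercises): each grid unit holds a weight vector; for every training sample the best-matching unit and its grid neighbours are pulled toward the sample, with a shrinking learning rate and neighbourhood radius.

```python
import numpy as np

def som(X, grid=(2, 2), n_iter=200, seed=0):
    """Toy self-organizing map on a small 2-D grid."""
    rng = np.random.default_rng(seed)
    rows, cols = grid
    # grid coordinates of each unit, and random initial weight vectors
    coords = np.array([(r, c) for r in range(rows) for c in range(cols)], float)
    W = rng.normal(size=(rows * cols, X.shape[1]))
    for t in range(n_iter):
        lr = 0.5 * (1 - t / n_iter)             # decaying learning rate
        radius = 1.5 * (1 - t / n_iter) + 0.1   # decaying neighbourhood radius
        x = X[rng.integers(len(X))]             # pick a random sample
        bmu = np.argmin(np.linalg.norm(W - x, axis=1))  # best-matching unit
        # Gaussian neighbourhood on the grid, centred on the BMU
        g = np.exp(-np.sum((coords - coords[bmu])**2, axis=1) / (2 * radius**2))
        W += lr * g[:, None] * (x - W)          # pull units toward the sample
    return W, coords

rng = np.random.default_rng(3)
# four synthetic groups with different mean expression levels
X = np.vstack([rng.normal(m, 0.2, (15, 3)) for m in (0, 2, 4, 6)])
W, coords = som(X, grid=(2, 2))
# map each sample to its best-matching grid unit
mapped = np.argmin(np.linalg.norm(X[:, None, :] - W[None], axis=2), axis=1)
print(mapped)
```

The grid topology is what distinguishes SOM from K-means: units that are neighbours on the grid end up with similar weight vectors, so similar clusters land next to each other in the 2x2 layout.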
SOM - example
Comparison of clustering methods Hierarchical clustering –Distances between all variables –Time-consuming with a large number of genes –Advantageous when clustering on a selected subset of genes K-means clustering –Faster algorithm –Does not show the relations between all variables (only cluster membership) SOM –More advanced algorithm
Distance measures Euclidean distance Vector angle distance Pearson correlation distance
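The three measures can be sketched in a few lines of NumPy; the example profile `y = 2x + 1` shows how they disagree, since Pearson distance ignores shifts and scaling of an expression profile while Euclidean distance does not:

```python
import numpy as np

def euclidean(x, y):
    # straight-line distance between the two profiles
    return np.linalg.norm(x - y)

def vector_angle(x, y):
    # 1 - cosine of the angle between the vectors (0 for parallel vectors)
    return 1 - np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

def pearson(x, y):
    # 1 - Pearson correlation (0 for perfectly linearly related profiles)
    return 1 - np.corrcoef(x, y)[0, 1]

x = np.array([1.0, 2.0, 3.0, 4.0])
y = 2 * x + 1   # same "shape" of profile, different offset and scale
print(euclidean(x, y), vector_angle(x, y), pearson(x, y))
```

Here the Euclidean distance is large, the vector-angle distance is small but nonzero (the offset tilts the vector slightly), and the Pearson distance is zero: the choice of measure decides which genes count as "similar".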
Comparison of distance measures
Summary Dimension reduction is important to visualize data Methods: –Principal Component Analysis –Clustering Hierarchical K-means Self-organizing maps (choice of distance measure is important)
Coffee break Next: Exercises in PCA and clustering