Dimension reduction: PCA and Clustering
Slides by Agnieszka Juncker and Chris Workman, modified by Hanne Jarmer
The DNA Array Analysis Pipeline
[Figure: flowchart from Question to Comparable Gene Expression Data — stages include Question, Experimental Design, Array design, Probe design, Buy Chip/Array, Sample Preparation, Hybridization, Image analysis, Normalization, Expression Index Calculation, Statistical Analysis, Fit to Model (time series), and Advanced Data Analysis (Clustering, PCA, Classification, Promoter Analysis, Meta analysis, Survival analysis, Regulatory Network)]
Motivation: Multidimensional data
[Table: gene expression matrix — rows are Affymetrix probe sets (…_at, …_s_at, …_x_at), columns are patients Pat1, Pat2, …; thousands of expression values per patient]
Dimension reduction methods
- Principal component analysis (PCA)
  – Singular value decomposition (SVD)
- Multidimensional Scaling (MDS)
- Correspondence Analysis (CA)
- Cluster analysis
  – Can be thought of as a dimension reduction method, since clusters summarize the data
Principal Component Analysis (PCA)
- Used for visualization of high-dimensional data
- Projects high-dimensional data onto a small number of dimensions
  – Typically 2–3 principal component dimensions
- Often captures much of the total data variation in only a few dimensions
- Exact solutions require a fully determined system (matrix with full rank)
  – i.e. a "square" matrix with independent entries
PCA
Singular Value Decomposition
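The slide does not write out the connection between SVD and PCA; as a standard identity, for a mean-centered data matrix X (n samples × p genes) the SVD gives both the principal directions and the projected coordinates:

```latex
X = U \Sigma V^{\top}, \qquad
\text{PC scores} = X V = U \Sigma, \qquad
\operatorname{Var}(\text{PC}_i) = \frac{\sigma_i^2}{n-1}
```

Here the columns of V are the principal directions (loadings) and the singular values σ_i give the variance captured by each component.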
Principal components
- 1st principal component (PC1)
  – Direction along which there is greatest variation
- 2nd principal component (PC2)
  – Direction with maximum variation left in the data, orthogonal to PC1
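A minimal sketch of this idea in Python, assuming numpy and computing the principal components by SVD of the centered data matrix (the function name and example dimensions are illustrative, matching the 34 patients × 8973 genes of the leukemia data):

```python
import numpy as np

def pca(X, n_components=2):
    """Project rows of X (samples x genes) onto the first principal components."""
    Xc = X - X.mean(axis=0)                      # center each gene (column)
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    components = Vt[:n_components]               # principal directions (PC1, PC2, ...)
    scores = Xc @ components.T                   # coordinates of each sample
    explained = s**2 / np.sum(s**2)              # fraction of total variance per PC
    return scores, explained[:n_components]

# Illustrative example: 34 patients x 8973 genes
X = np.random.rand(34, 8973)
scores, explained = pca(X, n_components=2)
print(scores.shape, explained)                   # (34, 2) and two variance fractions
```

Plotting the two columns of `scores` against each other gives the XY projection shown on the following slides.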
PCA: Variance by dimension
PCA dimensions by experiment
PCA projections (as XY-plot)
PCA: Leukemia data, precursor B and T Plot of 34 patients, dimension of 8973 genes reduced to 2
PCA of genes (Leukemia data) Plot of 8973 genes, dimension of 34 patients reduced to 2
Why do we cluster?
- Organize observed data into meaningful structures
- Summarize large data sets
- Used when we have no a priori hypotheses
- Optimization:
  – Minimize within-cluster distances
  – Maximize between-cluster distances
Many types of clustering methods
- Hierarchical, e.g. UPGMA
  – Agglomerative (bottom-up)
  – Divisive (top-down)
- Partitioning
  – K-means
  – PAM
  – SOM
Hierarchical clustering
- Representation of all pair-wise distances
- Parameters: none (only a distance measure)
- Results:
  – One large cluster
  – Hierarchical tree (dendrogram)
- Deterministic
Hierarchical clustering – UPGMA
Algorithm:
- Assign each item to its own cluster
- Join the two nearest clusters
- Re-estimate the distances between clusters
- Repeat until all items are joined into a single cluster
UPGMA: Unweighted Pair Group Method with Arithmetic mean
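A minimal sketch using scipy's hierarchical clustering, where `method='average'` corresponds to UPGMA; the data matrix here is purely illustrative:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster
from scipy.spatial.distance import pdist

# Illustrative data: 20 samples x 50 genes
X = np.random.rand(20, 50)

d = pdist(X, metric='euclidean')                 # all pair-wise distances
Z = linkage(d, method='average')                 # 'average' linkage = UPGMA
labels = fcluster(Z, t=3, criterion='maxclust')  # cut the tree into 3 clusters

dendrogram(Z)                                    # draw the hierarchical tree
plt.show()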
Hierarchical clustering
Data with clustering order and distances Dendrogram representation
Leukemia data - clustering of patients
Leukemia data - clustering of patients on top 100 significant genes
Leukemia data - clustering of genes
K-means clustering
- Input: N objects given as data points in R^p
- Specify the number k of clusters
- Initialize k cluster centers
- Iterate until convergence:
  – Assign each object to the cluster with the closest center (Euclidean distance)
  – The centroids of the obtained clusters are taken as new cluster centers
- K-means can be seen as an optimization problem: minimize the sum of squared within-cluster distances
- The result depends on the initialization
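A minimal numpy sketch of the iteration described above (Lloyd's algorithm); the random initialization and example data are illustrative, and empty clusters are not handled:

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]  # random initialization
    for _ in range(n_iter):
        # assign each object to the cluster with the closest center (Euclidean distance)
        dist = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dist.argmin(axis=1)
        # centroids of the obtained clusters become the new cluster centers
        # (note: an empty cluster would produce NaN in this simple sketch)
        new_centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centers, centers):                # converged
            break
        centers = new_centers
    return labels, centers

X = np.random.rand(100, 2)
labels, centers = kmeans(X, k=3)
```

Running the sketch from different seeds usually gives different partitions, which is exactly the dependence on initialization noted above.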
K-means - Algorithm
K-means clustering, k=3
K-means clustering of Leukemia data
K-means clustering of Cell Cycle data
Partitioning Around Medoids (PAM)
- PAM is a partitioning method like K-means
- For a prespecified number of clusters k, the PAM procedure searches for k representative objects, or medoids M = (m1, ..., mk)
- The medoids minimize the sum of the distances of the observations to their closest medoid
- After finding a set of k medoids, k clusters are constructed by assigning each observation to the nearest medoid
- PAM can be applied to general data types and tends to be more robust than K-means
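A simplified k-medoids sketch in the same spirit, alternating assignment and medoid update on a precomputed distance matrix; this is not the full PAM build/swap procedure, and it assumes no cluster ever becomes empty:

```python
import numpy as np

def k_medoids(D, k, n_iter=100, seed=0):
    """D: precomputed n x n dissimilarity matrix (any distance, not only Euclidean)."""
    rng = np.random.default_rng(seed)
    n = D.shape[0]
    medoids = rng.choice(n, size=k, replace=False)
    for _ in range(n_iter):
        labels = D[:, medoids].argmin(axis=1)        # assign each object to its nearest medoid
        new_medoids = medoids.copy()
        for j in range(k):
            members = np.where(labels == j)[0]
            # new medoid = the member minimizing the sum of distances to the other members
            costs = D[np.ix_(members, members)].sum(axis=0)
            new_medoids[j] = members[costs.argmin()]
        if np.array_equal(np.sort(new_medoids), np.sort(medoids)):
            break
        medoids = new_medoids
    return labels, medoids
```

Because only the dissimilarity matrix D is used, the same code works for non-numeric data for which a distance can be defined, which is why medoid-based methods apply to more general data types than K-means.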
Self Organizing Maps (SOM)
- Partitioning method (similar to the K-means method)
- Clusters are organized in a two-dimensional grid
- The SOM algorithm finds the optimal organization of the data in the grid
- Iteration steps:
  – Pick a data point P at random
  – Move all nodes in the direction of P, the closest node the most
  – Decrease the amount of movement
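A minimal sketch of these iteration steps for a small two-dimensional grid of nodes; the grid size, learning-rate schedule, and neighborhood width are illustrative choices, not values from the slide:

```python
import numpy as np

def som(X, grid=(3, 3), n_iter=1000, seed=0):
    rng = np.random.default_rng(seed)
    rows, cols = grid
    nodes = rng.random((rows * cols, X.shape[1]))            # node weight vectors
    coords = np.array([(r, c) for r in range(rows) for c in range(cols)])
    for t in range(n_iter):
        lr = 0.5 * (1 - t / n_iter)                          # decrease amount of movement
        radius = max(rows, cols) / 2 * (1 - t / n_iter) + 0.5
        p = X[rng.integers(len(X))]                          # pick a data point P at random
        bmu = np.linalg.norm(nodes - p, axis=1).argmin()     # node closest to P
        grid_dist = np.linalg.norm(coords - coords[bmu], axis=1)
        h = np.exp(-(grid_dist**2) / (2 * radius**2))        # closer grid nodes move more
        nodes += lr * h[:, None] * (p - nodes)               # move all nodes toward P
    return nodes, coords
```

After training, each data point can be mapped to its closest node, and the grid coordinates give the two-dimensional organization of the clusters.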
SOM - example
Comparison of clustering methods
- Hierarchical
  – Advantage: fast to compute
  – Disadvantage: rigid
- Partitioning
  – Advantage: provides clusters that roughly satisfy an optimality criterion
  – Disadvantage: needs an initial k, and is time consuming
Distance measures
- Euclidean distance
- Vector angle distance
- Pearson's distance
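Written out explicitly, using the common "1 minus similarity" convention for the angle and Pearson measures (the slide gives only the names):

```latex
d_{\text{Euclid}}(x, y) = \sqrt{\sum_{i=1}^{p} (x_i - y_i)^2}, \qquad
d_{\text{angle}}(x, y) = 1 - \frac{\sum_i x_i y_i}{\sqrt{\sum_i x_i^2}\,\sqrt{\sum_i y_i^2}}, \qquad
d_{\text{Pearson}}(x, y) = 1 - \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_i (x_i - \bar{x})^2}\,\sqrt{\sum_i (y_i - \bar{y})^2}}
```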
Comparison of distance measures
Summary
- Dimension reduction is important for visualizing data
- Methods:
  – PCA/SVD
  – Clustering: hierarchical, K-means/PAM, SOM (choice of distance measure is important)
Coffee break Next: Exercises in Dimension Reduction and clustering