Published by Roy Hudson. Modified over 6 years ago.
1
CZ5211 Topics in Computational Biology Lecture 3: Clustering Analysis for Microarray Data I Prof. Chen Yu Zong Tel: Room 07-24, level 7, SOC1, NUS
2
Clustering Algorithms
Be wary: confounding computational artifacts are associated with all clustering algorithms. You should always understand the basic concepts behind an algorithm before using it. Anything will cluster! Garbage in means garbage out.
3
Supervised vs. Unsupervised Learning
Supervised: there is a teacher and class labels are known. Examples: support vector machines, backpropagation neural networks.
Unsupervised: there is no teacher and class labels are unknown. Examples: clustering, self-organizing maps.
4
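Gene Expression Data
Gene expression data on p genes for n samples, arranged as a matrix with genes as rows and mRNA samples (sample1, sample2, sample3, sample4, sample5, …) as columns.
Entry (i, j) is the gene expression level of gene i in mRNA sample j, measured either as Log(Red intensity / Green intensity) for two-color arrays or as Log(Avg. PM - Avg. MM) for arrays with perfect-match (PM) and mismatch (MM) probes.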
5
Expression Vectors
Gene expression vectors encapsulate the expression of a gene over a set of experimental conditions or sample types.
Numeric vector: -0.8 1.5 1.8 0.5 -0.4 -1.3 0.8 1.5
The same vector can also be shown as a line graph or as a heatmap (color scale from -2 to 2).
6
Expression Vectors As Points in ‘Expression Space’
Each gene is a point with one coordinate per experiment (axes: Experiment 1, Experiment 2, Experiment 3).
G1   -0.8   -0.3   -0.7
G2   -0.4   -0.8   -0.7
G3   -0.6   -0.8   -0.4
G4    0.9    1.2    1.3
G5    1.3    0.9   -0.6
G1, G2 and G3 show similar expression.
7
Cluster Analysis
Group a collection of objects into subsets or “clusters” such that objects within a cluster are more closely related to one another than to objects assigned to different clusters.
8
How can we do this?
Distance or similarity metric: what is closely related? What is close?
Clustering algorithm: how do we minimize distance between objects in a group while maximizing distances between groups?
9
Distance Metrics
Euclidean distance measures the straight-line distance between two points.
Manhattan (city block) distance sums the distance along each dimension.
Correlation measures difference with respect to linear trends.
[Figure: two points, (3.5, 4) and (5.5, 6), plotted with Gene Expression 1 and Gene Expression 2 as the axes.]
10
Clustering Gene Expression Data
The expression measurements form a matrix with genes as rows (index i) and conditions as columns (index j).
Cluster across the rows: group genes together that behave similarly across different conditions.
Cluster across the columns: group different conditions together that behave similarly across most genes.
11
Clustering Time Series Data
Measure gene expression on consecutive days Gene Measurement matrix G1= [ ] G2= [ ] G3= [ ] G4= [ ]
12
Euclidean Distance
Distance is the square root of the sum of the squared differences between coordinates.
Distance values shown on the slide: 5.3 4.3 5.1 6.4 6.5 2.3
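The definition above can be sketched in a few lines of Python. The gene vectors G1 to G4 did not survive in this transcript, so this sketch instead uses the two points (3.5, 4) and (5.5, 6) from the "Distance Metrics" slide; the function name `euclidean_distance` is illustrative only.

```python
import math


def euclidean_distance(a, b):
    """Square root of the sum of squared coordinate differences."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Points taken from the "Distance Metrics" slide.
p = (3.5, 4.0)
q = (5.5, 6.0)

# Coordinate differences are 2 and 2, so the result is sqrt(2**2 + 2**2).
d_euclid = euclidean_distance(p, q)
print(d_euclid)
```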
13
City Block or Manhattan Distance
G1= [ ] G2= [ ] G3= [ ] G4= [ ]
Distance is the sum of the absolute differences between coordinates.
Distance values shown on the slide: 7.8 6.8 9.1 11 11.3 4.3
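As a small illustration of the city block definition, the sketch below again uses the points (3.5, 4) and (5.5, 6) from the "Distance Metrics" slide, because the G1 to G4 values are missing from this transcript. The function name `manhattan_distance` is an assumption made for the example.

```python
def manhattan_distance(a, b):
    """Sum of the absolute coordinate differences (city block distance)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum(abs(x - y) for x, y in zip(a, b))


# Points taken from the "Distance Metrics" slide.
p = (3.5, 4.0)
q = (5.5, 6.0)

# |5.5 - 3.5| + |6 - 4| = 2 + 2
d_manhattan = manhattan_distance(p, q)
print(d_manhattan)
```

For the same pair of points the Manhattan distance is never smaller than the Euclidean distance, which is consistent with the larger values listed on this slide.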
14
Correlation Distance
Pearson correlation measures the degree of linear relationship between variables and lies in [-1, 1].
Distance is 1 - (Pearson correlation), with a range of [0, 2].
Distance values shown on the slide: 0.91 0.98 1.6 1.9 1.7 0.22
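A minimal sketch of the correlation distance follows. The three toy profiles (`up`, `scaled`, `down`) are hypothetical values chosen to show the two ends of the range; they are not data from the lecture, and the function names are illustrative.

```python
import math


def pearson(a, b):
    """Pearson correlation coefficient of two equal-length profiles."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def correlation_distance(a, b):
    """1 - Pearson correlation, which lies in [0, 2]."""
    return 1.0 - pearson(a, b)


# Hypothetical toy profiles (not from the slides).
up = [1.0, 2.0, 3.0, 4.0]
scaled = [10.0, 20.0, 30.0, 40.0]  # same trend, ten times the magnitude
down = [4.0, 3.0, 2.0, 1.0]        # opposite trend

d_same = correlation_distance(up, scaled)    # perfectly correlated: near 0
d_opposite = correlation_distance(up, down)  # perfectly anti-correlated: near 2
print(d_same, d_opposite)
```

Note that `up` and `scaled` are far apart in Euclidean terms but have correlation distance 0: this metric compares the shape of the profiles, not their magnitude.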
15
Similarity Measurements
Pearson Correlation
For two profiles (vectors): -1 ≤ Pearson correlation ≤ +1
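The formula on this slide did not survive extraction. As a reminder, the standard definition of the Pearson correlation is given below; the symbols x and y for the two profiles and N for their length are assumptions, since the slide's own notation is lost.

```latex
r(x, y) =
  \frac{\sum_{i=1}^{N} (x_i - \bar{x})(y_i - \bar{y})}
       {\sqrt{\sum_{i=1}^{N} (x_i - \bar{x})^2}\;
        \sqrt{\sum_{i=1}^{N} (y_i - \bar{y})^2}},
\qquad
\bar{x} = \frac{1}{N}\sum_{i=1}^{N} x_i,
\qquad
-1 \le r(x, y) \le +1
```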
16
Similarity Measurements
Cosine Correlation
-1 ≤ cosine correlation ≤ +1
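The slide's formula is missing from this transcript. The standard definition of the cosine correlation (cosine similarity) is sketched below, with x, y and N as assumed notation for the two profiles and their length. It differs from the Pearson correlation only in that the profiles are not mean-centered first.

```latex
\cos(x, y) =
  \frac{\sum_{i=1}^{N} x_i\, y_i}
       {\sqrt{\sum_{i=1}^{N} x_i^2}\;\sqrt{\sum_{i=1}^{N} y_i^2}},
\qquad
-1 \le \cos(x, y) \le +1
```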
17
Hierarchical Clustering
IDEA: Iteratively combine genes into groups based on similar patterns of observed expression.
By combining genes with genes OR genes with groups, the algorithm produces a dendrogram of the hierarchy of relationships.
Display the data as a heatmap and dendrogram.
Cluster genes, samples or both (HCL-1).
18
Hierarchical Clustering
[Figure: the clustered data shown as a Venn diagram and as a dendrogram.]
19
Hierarchical clustering
Merging (agglomerative): start with every measurement as a separate cluster, then combine.
Splitting (divisive): make one large cluster, then split it up into smaller pieces.
What is the distance between two clusters?
20
Distance between clusters
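Single-link: distance is the shortest distance from any member of one cluster to any member of the other cluster.
Complete link: distance is the longest distance from any member of one cluster to any member of the other cluster.
Average: distance between the averages (centroids) of all points in each cluster. A later slide calls this "average group linkage" and uses "average linkage" for the mean of all pairwise distances between the two clusters.
Ward: merges the two clusters whose combination gives the smallest increase in the within-cluster sum of squares.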
21
Hierarchical Clustering-Merging
[Figure: hierarchical clustering of the gene expression time series using Euclidean distance and average linking. Axis: distance between clusters when combined.]
22
Manhattan Distance
[Figure: hierarchical clustering of the gene expression time series using Manhattan distance and average linking. Axis: distance between clusters when combined.]
23
Correlation Distance
24
Data Standardization
Data points are normalized with respect to mean and variance, “sphering” the data.
After sphering, Euclidean and correlation distance are equivalent, in the sense that they rank pairs of profiles in the same order.
Standardization makes sense if you are not interested in the size of the effects, but in the effect itself.
Results can be misleading for noisy data.
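The equivalence claim can be checked numerically. The sketch below assumes that "sphering" means standardizing each profile to mean 0 and (population) variance 1; under that assumption the squared Euclidean distance between two standardized profiles of length n equals 2n(1 - r), where r is their Pearson correlation. The vectors are G1 and G5 from the "Expression Space" slide; the function names are illustrative.

```python
import math


def standardize(v):
    """Shift to mean 0 and scale to population variance 1."""
    n = len(v)
    mean = sum(v) / n
    sd = math.sqrt(sum((x - mean) ** 2 for x in v) / n)
    return [(x - mean) / sd for x in v]


def pearson(a, b):
    """Pearson correlation coefficient of two equal-length profiles."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


# G1 and G5 from the "Expression Space" slide.
g1 = [-0.8, -0.3, -0.7]
g5 = [1.3, 0.9, -0.6]

z1, z5 = standardize(g1), standardize(g5)
n = len(g1)

# Squared Euclidean distance between the standardized profiles.
sq_euclid = sum((x - y) ** 2 for x, y in zip(z1, z5))

# Right-hand side of the identity: 2n * (correlation distance).
rhs = 2 * n * (1.0 - pearson(g1, g5))

print(sq_euclid, rhs)  # the two values agree up to rounding error
```

Because the squared Euclidean distance is an increasing function of the correlation distance, both metrics order pairs of genes identically after standardization.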
25
Distance Comments
Every clustering method is based SOLELY on the measure of distance or similarity.
E.g., correlation measures linear association between two genes. What if data are not properly transformed? What about outliers? What about saturation effects?
Even good data can be ruined with the wrong choice of distance metric.
26
Hierarchical Clustering
Initial Data Items: A, B, C, D
Distance Matrix
Dist   A    B    C    D
A           20   7    2
B                10   25
C                     3
D
Let me first explain hierarchical clustering for those of you who are not familiar with it. Here is a very simple example. We have 4 data items, and the initial distance matrix is given in this table. In hierarchical agglomerative clustering, we initially consider each data item as an independent cluster, so we have 4 clusters now.
27
Hierarchical Clustering
Initial Data Items: A, B, C, D
Distance Matrix: d(A,B) = 20, d(A,C) = 7, d(A,D) = 2, d(B,C) = 10, d(B,D) = 25, d(C,D) = 3
At first, we choose the most similar pair, A and D. These two items will be merged to form a new cluster.
28
Hierarchical Clustering
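Single Linkage
Current Clusters: {A,D}, B, C
Distance Matrix: d(A,B) = 20, d(A,C) = 7, d(A,D) = 2, d(B,C) = 10, d(B,D) = 25, d(C,D) = 3
The height of this new subtree is 2.
[Figure: dendrogram with leaves in the order A, D, B, C; A and D are joined at height 2.]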
29
Hierarchical Clustering
Single Linkage
Current Clusters: {A,D}, B, C
Distance Matrix
Dist   AD   B    C
AD          20   3
B                10
C
We should update the distance matrix since we have new clusters. The distances between the new cluster and the remaining clusters can be updated in many different ways. Let’s assume that we use ‘single linkage.’ When we calculate the distance between this new cluster {A,D} and, for example, B, we choose the minimum of the distance between A and B and the distance between D and B. In this example, the distance between A and B is 20, and the distance between D and B is 25. So the distance between the new cluster {A,D} and B will be 20.
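Written out as equations, the single-linkage update for this step looks as follows. The first line restates the calculation from the notes; the second applies the same rule to C, using d(A,C) = 7 and d(D,C) = 3 as read from the initial distance matrix (that reading of the flattened matrix is an assumption, but it reproduces the value 3 shown in the updated matrix).

```latex
\begin{align*}
d(\{A,D\}, B) &= \min\bigl(d(A,B),\, d(D,B)\bigr) = \min(20,\, 25) = 20 \\
d(\{A,D\}, C) &= \min\bigl(d(A,C),\, d(D,C)\bigr) = \min(7,\, 3) = 3
\end{align*}
```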
30
Hierarchical Clustering
Single Linkage
Current Clusters: {A,D}, B, C
Distance Matrix: d(AD,B) = 20, d(AD,C) = 3, d(B,C) = 10
Next, we choose 3 in the new distance matrix, so {A,D} and C are merged together.
31
Hierarchical Clustering
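Single Linkage
Current Clusters: {A,D,C}, B
Distance Matrix: d(AD,B) = 20, d(AD,C) = 3, d(B,C) = 10
The height of this new subtree is 3.
[Figure: dendrogram with leaves in the order A, D, C, B; {A,D} and C are joined at height 3.]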
32
Hierarchical Clustering
Single Linkage
Current Clusters: {A,D,C}, B
Distance Matrix: d(ADC,B) = 10
We have a new distance matrix with 2 clusters.
33
Hierarchical Clustering
Single Linkage
Current Clusters: {A,D,C}, B
Distance Matrix: d(ADC,B) = 10
34
Hierarchical Clustering
Single Linkage
Current Clusters: {A,D,C}, B
Distance Matrix: d(ADC,B) = 10
Finally, we merge the remaining two clusters. The height of this final subtree is 10.
35
Hierarchical Clustering
Single Linkage
Final Result: a single cluster {A,D,C,B}
This binary tree is the result of hierarchical clustering using single linkage. If we use a different linkage method, the result can be different from this one.
[Figure: final dendrogram with leaves in the order A, D, C, B.]
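The whole walkthrough can be reproduced with a short script. This is a minimal sketch, not an optimized implementation: it recomputes cluster distances from the original pairwise distances at every step instead of updating a matrix. The pairwise values are read from the initial distance matrix of the example, and the function name `single_linkage` is illustrative.

```python
from itertools import combinations


def single_linkage(dist, items):
    """Agglomerative clustering with single linkage.

    dist maps frozenset({a, b}) to the distance between items a and b.
    Returns the merges in order as (cluster_1, cluster_2, height).
    """
    clusters = [frozenset([x]) for x in items]
    merges = []

    def cluster_distance(c1, c2):
        # Single linkage: the closest pair of members decides.
        return min(dist[frozenset((a, b))] for a in c1 for b in c2)

    while len(clusters) > 1:
        # Pick the pair of current clusters with the smallest distance.
        c1, c2 = min(combinations(clusters, 2),
                     key=lambda pair: cluster_distance(*pair))
        merges.append((set(c1), set(c2), cluster_distance(c1, c2)))
        # Replace the two merged clusters by their union.
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 | c2]
    return merges


# Initial distance matrix of the four-item example.
pairwise = {
    ("A", "B"): 20, ("A", "C"): 7, ("A", "D"): 2,
    ("B", "C"): 10, ("B", "D"): 25, ("C", "D"): 3,
}
dist = {frozenset(pair): d for pair, d in pairwise.items()}

merges = single_linkage(dist, "ABCD")
heights = [height for _, _, height in merges]
for c1, c2, height in merges:
    print(sorted(c1), "+", sorted(c2), "at height", height)
```

The merge heights come out as 2, 3 and 10, matching the subtree heights in the slides. Replacing `min` with `max` inside `cluster_distance` would turn the sketch into complete linkage.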
36
Hierarchical Clustering
Gene 1 Gene 2 Gene 3 Gene 4 Gene 5 Gene 6 Gene 7 Gene 8
37
Hierarchical Clustering
Gene 1 Gene 2 Gene 3 Gene 4 Gene 5 Gene 6 Gene 7 Gene 8
38
Hierarchical Clustering
Gene 1 Gene 2 Gene 4 Gene 5 Gene 3 Gene 8 Gene 6 Gene 7
39
Hierarchical Clustering
Gene 7 Gene 1 Gene 2 Gene 4 Gene 5 Gene 3 Gene 8 Gene 6
40
Hierarchical Clustering
Gene 7 Gene 1 Gene 2 Gene 4 Gene 5 Gene 3 Gene 8 Gene 6
41
Hierarchical Clustering
Gene 7 Gene 1 Gene 2 Gene 4 Gene 5 Gene 3 Gene 8 Gene 6
42
Hierarchical Clustering
Gene 7 Gene 1 Gene 2 Gene 4 Gene 5 Gene 3 Gene 8 Gene 6
43
Hierarchical Clustering
Gene 1 Gene 2 Gene 3 Gene 4 Gene 5 Gene 6 Gene 7 Gene 8
44
Hierarchical Clustering
45
Hierarchical Clustering
[Figure axes: Samples, Genes.]
The Leaf Ordering Problem: find the ‘optimal’ layout of branches for a given dendrogram architecture.
There are 2^(N-1) possible orderings of the branches.
For a small microarray dataset of 500 genes, there are about 1.6 × 10^150 branch configurations.
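The size of that number is easy to verify. A binary dendrogram with N leaves has N - 1 internal nodes, and each one can be flipped independently, which gives 2^(N-1) leaf orderings. The helper name `leaf_orderings` below is illustrative.

```python
def leaf_orderings(n_leaves):
    """Leaf orderings consistent with one binary dendrogram.

    Each of the n_leaves - 1 internal nodes can be flipped independently.
    """
    return 2 ** (n_leaves - 1)


count = leaf_orderings(500)
digits = len(str(count))

# Print the count in scientific notation and its number of decimal digits.
print(f"{count:.2e}", digits)
```

Python integers have arbitrary precision, so the exact value is available; it has 151 decimal digits, in line with the order of magnitude quoted on the slide.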
46
Hierarchical Clustering
The Leaf Ordering Problem:
47
Hierarchical Clustering
Pros: commonly used algorithm; simple and quick to calculate.
Cons: real genes probably do not have a hierarchical organization.
48
Using Hierarchical Clustering
1. Choose what samples and genes to use in your analysis
2. Choose similarity/distance metric
3. Choose clustering direction
4. Choose linkage method
5. Calculate the dendrogram
6. Choose height/number of clusters for interpretation
7. Assess results
8. Interpret cluster structure
49
1. Choose what samples/genes to include
Very important step.
Do you want to include housekeeping genes or genes that didn’t change in your results?
How do you handle replicates from the same sample? Noisy samples?
The dendrogram is a mess if everything is included in large datasets.
Gene screening.
50
No Filtering
51
Filtering 100 relevant genes
52
2. Choose distance metric
The metric should be a valid measure of the distance/similarity of genes.
Examples:
Applying Euclidean distance to categorical data is invalid.
A correlation metric applied to highly skewed data will give misleading results.
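A small numerical sketch shows how a single extreme value can dominate the correlation. The two profiles below are hypothetical toy values, not data from the lecture: over the first four conditions they move in exactly opposite directions, and one shared extreme measurement is then appended to both.

```python
import math


def pearson(a, b):
    """Pearson correlation coefficient of two equal-length profiles."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


# Hypothetical profiles: opposite trends over four conditions.
gene_x = [1.0, 2.0, 3.0, 4.0]
gene_y = [4.0, 3.0, 2.0, 1.0]
r_clean = pearson(gene_x, gene_y)

# Append one extreme value shared by both genes (e.g. a saturated spot).
r_skewed = pearson(gene_x + [100.0], gene_y + [100.0])

print(r_clean, r_skewed)
```

Without the extreme value the genes are perfectly anti-correlated; with it the correlation is close to +1, so a correlation-based clustering would group two genes that disagree in every other condition. Transforming the data (for example taking logs) before clustering is one way to reduce this effect.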
53
3. Choose clustering direction
Merging clustering (bottom up).
Divisive clustering (top down): split so that the genes within each of the two clusters are the most similar, and maximize the distance between the clusters.
54
Nearest Neighbor Algorithm
The Nearest Neighbor Algorithm is an agglomerative approach (bottom-up). It starts with n nodes (n is the size of our sample), merges the 2 most similar nodes at each step, and stops when the desired number of clusters is reached.
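The procedure can be sketched as follows, assuming Euclidean distance between points and the nearest-neighbor (single-linkage) rule for the distance between clusters. The data are the five gene vectors from the "Expression Space" slide, and the function name `nearest_neighbor_clusters` is illustrative.

```python
import math
from itertools import combinations


def nearest_neighbor_clusters(points, k):
    """Merge the two closest clusters until only k clusters remain.

    points maps a name to its coordinates. The distance between two
    clusters is the distance between their closest members.
    """
    clusters = [frozenset([name]) for name in points]

    def cluster_distance(c1, c2):
        return min(math.dist(points[a], points[b]) for a in c1 for b in c2)

    while len(clusters) > k:
        c1, c2 = min(combinations(clusters, 2),
                     key=lambda pair: cluster_distance(*pair))
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 | c2]
    return clusters


# Gene vectors from the "Expression Space" slide.
genes = {
    "G1": (-0.8, -0.3, -0.7),
    "G2": (-0.4, -0.8, -0.7),
    "G3": (-0.6, -0.8, -0.4),
    "G4": (0.9, 1.2, 1.3),
    "G5": (1.3, 0.9, -0.6),
}

result = nearest_neighbor_clusters(genes, k=2)
for cluster in result:
    print(sorted(cluster))
```

With k = 2 the genes marked as having similar expression (G1, G2, G3) end up in one cluster and G4 and G5 in the other. Lowering k by one at a time traces out the levels shown on the following slides.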
55
Nearest Neighbor, Level 3, k = 6 clusters.
56
Nearest Neighbor, Level 4, k = 5 clusters.
57
Nearest Neighbor, Level 5, k = 4 clusters.
58
Nearest Neighbor, Level 6, k = 3 clusters.
59
Nearest Neighbor, Level 7, k = 2 clusters.
60
Nearest Neighbor, Level 8, k = 1 cluster.
61
Hierarchical Clustering
Keys: similarity and clustering.
1. Calculate the similarity between all possible combinations of two profiles.
2. The two most similar clusters are grouped together to form a new cluster.
3. Calculate the similarity between the new cluster and all remaining clusters.
62
Hierarchical Clustering
Merge which pair of clusters? C2 C3
63
Hierarchical Clustering
Single Linkage
Dissimilarity between two clusters = minimum dissimilarity between the members of the two clusters.
Tends to generate “long chains”.
[Figure: clusters C1 and C2.]
64
Hierarchical Clustering
Complete Linkage
Dissimilarity between two clusters = maximum dissimilarity between the members of the two clusters.
Tends to generate “clumps”.
[Figure: clusters C1 and C2.]
65
Hierarchical Clustering
Average Linkage
Dissimilarity between two clusters = averaged distances of all pairs of objects (one from each cluster).
[Figure: clusters C1 and C2.]
66
Hierarchical Clustering
Average Group Linkage
Dissimilarity between two clusters = distance between the two cluster means.
[Figure: clusters C1 and C2.]
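The four linkage rules from the last few slides can be compared side by side. This is a sketch under the assumption that dissimilarity between individual objects is Euclidean distance; the two small clusters `c1` and `c2` are hypothetical points chosen so that the values are easy to check by hand, and the function names are illustrative.

```python
import math
from itertools import product


def single_link(c1, c2):
    """Minimum distance over all cross-cluster pairs."""
    return min(math.dist(a, b) for a, b in product(c1, c2))


def complete_link(c1, c2):
    """Maximum distance over all cross-cluster pairs."""
    return max(math.dist(a, b) for a, b in product(c1, c2))


def average_link(c1, c2):
    """Mean distance over all cross-cluster pairs."""
    return sum(math.dist(a, b) for a, b in product(c1, c2)) / (len(c1) * len(c2))


def mean_point(cluster):
    """Coordinate-wise mean of the points in a cluster."""
    return tuple(sum(coords) / len(cluster) for coords in zip(*cluster))


def average_group_link(c1, c2):
    """Distance between the two cluster means."""
    return math.dist(mean_point(c1), mean_point(c2))


# Hypothetical clusters: two vertical pairs of points, 3 units apart.
c1 = [(0.0, 0.0), (0.0, 2.0)]
c2 = [(3.0, 0.0), (3.0, 2.0)]

d_single = single_link(c1, c2)           # closest pair: 3
d_complete = complete_link(c1, c2)       # diagonal pair: sqrt(13)
d_average = average_link(c1, c2)         # mean of 3, 3, sqrt(13), sqrt(13)
d_group = average_group_link(c1, c2)     # means (0, 1) and (3, 1): 3
print(d_single, d_complete, d_average, d_group)
```

For any pair of clusters the single-link value is the smallest of the pairwise rules and the complete-link value the largest, with average linkage in between.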
67
Which one?
Both methods are “step-wise” optimal: at each step the optimal split or merge is performed. This doesn’t mean that the final result is optimal.
Merging: computationally simple; precise at the bottom of the tree; good for many small clusters.
Divisive: more complex, but more precise at the top of the tree; good for looking at large and/or few clusters.
For gene expression applications, divisive makes more sense.