Published by Edwin Nash. Modified over 9 years ago.
1
Anindya Bhattacharya and Rajat K. De. Bioinformatics, 2008.
2
Introduction, Divisive Correlation Clustering Algorithm, Results, Conclusions
4
Correlation Clustering
5
Correlation clustering was proposed by Bansal et al. in Machine Learning, 2004. It is based on the notion of graph partitioning.
6
How to construct the graph? Nodes: genes. Edges: correlation between the genes. There are two types of edges: positive edges and negative edges.
7
For example: a positive correlation coefficient between genes X and Y gives a positive edge; a negative correlation coefficient between X and Y gives a negative edge.
[Figure: graph construction over genes A to H, followed by graph partitioning into Cluster 1 and Cluster 2.]
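The graph construction can be sketched in a few lines. This is only an illustration: the function name `build_signed_graph`, the gene names and the expression values are made up, and NumPy's `corrcoef` stands in for the correlation step.

```python
import numpy as np

def build_signed_graph(expr, names):
    """Split gene pairs into positive and negative edges by the sign of their correlation."""
    corr = np.corrcoef(expr)  # rows are genes, columns are samples
    positive, negative = [], []
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            if corr[i, j] > 0:
                positive.append((names[i], names[j]))
            elif corr[i, j] < 0:
                negative.append((names[i], names[j]))
    return positive, negative

# Hypothetical expression values for three genes over four samples.
expr = np.array([
    [1.0, 2.0, 3.0, 4.0],   # X
    [2.0, 4.0, 6.0, 8.5],   # Y rises with X
    [4.0, 3.0, 2.0, 0.5],   # Z falls as X rises
])
positive_edges, negative_edges = build_signed_graph(expr, ["X", "Y", "Z"])
```

Here X and Y end up joined by a positive edge, while Z gets a negative edge to both.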
8
How to measure the quality of clusters? By the number of agreements and the number of disagreements. The number of agreements: the number of genes that are in correct clusters. The number of disagreements: the number of genes wrongly clustered.
9
For example: [Figure: genes A, B, C, D and E partitioned into Cluster 1 and Cluster 2.]
The measure of agreements is the sum of: (1) the number of positive edges in the same clusters and (2) the number of negative edges in different clusters. Here: 4 + 4 = 8.
The measure of disagreements is the sum of: (1) the number of negative edges in the same clusters and (2) the number of positive edges in different clusters. Here: 0 + 2 = 2.
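The two counts can be computed with a short loop over the edges. The sketch below is illustrative only: the slide's actual graph is not preserved, so the partition and the edge signs are invented so that the totals match the 8 agreements and 2 disagreements quoted on the slide.

```python
def count_agreements(edges, cluster_of):
    """Return (agreements, disagreements) for a signed graph and a clustering."""
    agreements = disagreements = 0
    for (u, v), sign in edges.items():
        same_cluster = cluster_of[u] == cluster_of[v]
        # Agreement: positive edge inside a cluster, or negative edge across clusters.
        if (sign > 0) == same_cluster:
            agreements += 1
        else:
            disagreements += 1
    return agreements, disagreements

# Hypothetical partition and edge signs (+1 positive, -1 negative).
cluster_of = {"A": 1, "B": 1, "C": 1, "D": 2, "E": 2}
edges = {
    ("A", "B"): +1, ("A", "C"): +1, ("B", "C"): +1, ("D", "E"): +1,
    ("A", "D"): -1, ("A", "E"): -1, ("B", "D"): -1, ("B", "E"): -1,
    ("C", "D"): +1, ("C", "E"): +1,
}
agreements, disagreements = count_agreements(edges, cluster_of)
```

With these made-up signs, the four positive edges inside clusters and the four negative edges across clusters are agreements; the two positive edges from C to D and E are disagreements.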
10
The goal is to minimize disagreements or, equivalently, to maximize agreements. However, Bansal et al. (2004) proved that this problem is NP-complete. Another problem is that the magnitude of the correlation coefficients is not taken into account.
11
Introduction, Divisive Correlation Clustering Algorithm, Results, Conclusions
12
Pearson correlation coefficient. Terms and measurements used in DCCA. Divisive Correlation Clustering Algorithm.
13
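Consider a set of $n$ genes, $X = \{x_1, x_2, \ldots, x_n\}$, for each of which $m$ expression values are given. The Pearson correlation coefficient between two genes $x_i$ and $x_j$ is defined as:
$$\mathrm{Corr}(x_i, x_j) = \frac{\sum_{l=1}^{m} (x_{il} - \bar{x}_i)(x_{jl} - \bar{x}_j)}{\sqrt{\sum_{l=1}^{m} (x_{il} - \bar{x}_i)^2}\,\sqrt{\sum_{l=1}^{m} (x_{jl} - \bar{x}_j)^2}}$$
where $x_{il}$ is the $l$th sample value of gene $x_i$ and $\bar{x}_i$ is the mean value of gene $x_i$ from $m$ samples.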
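As a quick check, the standard Pearson formula can be written out directly. The function name `pearson` and the sample values are illustrative only.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equally long expression profiles."""
    m = len(x)
    mean_x = sum(x) / m
    mean_y = sum(y) / m
    # Numerator: sum of products of deviations from the means.
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    # Denominator: square root of the product of the summed squared deviations.
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

r = pearson([1, 2, 3, 4], [1, 3, 2, 4])  # 0.8 up to floating-point rounding
```

The function is undefined for a constant profile, since the denominator is then zero.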
14
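$\mathrm{Corr}(x_i, x_j) > 0$: $x_i$ and $x_j$ are positively correlated, with the degree of correlation given by its magnitude. $\mathrm{Corr}(x_i, x_j) < 0$: $x_i$ and $x_j$ are negatively correlated, with value $|\mathrm{Corr}(x_i, x_j)|$.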
15
We define some terms and measurements used in DCCA: attraction, repulsion, attraction/repulsion value, and average correlation value.
16
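Attraction: there is an attraction between $x_i$ and $x_j$ if $\mathrm{Corr}(x_i, x_j) > 0$. Repulsion: there is a repulsion between $x_i$ and $x_j$ if $\mathrm{Corr}(x_i, x_j) < 0$. Attraction/repulsion value: the magnitude of $\mathrm{Corr}(x_i, x_j)$ is the strength of the attraction or repulsion.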
17
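The genes will be grouped into $K$ disjoint clusters $C_1, C_2, \ldots, C_K$. Average correlation value: the average correlation value for a gene $x_i$ with respect to cluster $C_p$ is defined as:
$$\mathrm{AVGC}_{ip} = \frac{1}{n_p} \sum_{x_j \in C_p,\; j \neq i} \mathrm{Corr}(x_i, x_j)$$
where $n_p$ is the number of data points in $C_p$.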
18
The average correlation value is the average correlation of a gene $x_i$ with the other genes inside the cluster $C_p$. It reflects the degree of inclusion of $x_i$ in cluster $C_p$.
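A small sketch of the average correlation value follows. The exact formula is not preserved in this transcript, so the form used here is an assumption: the correlations of the gene with the other genes in the cluster are summed and divided by the cluster size. The function name and the correlation matrix are hypothetical.

```python
import numpy as np

def average_correlation(corr, i, cluster):
    """Average correlation of gene i with the other genes in `cluster` (list of indices).

    Assumed form: the self-pair is skipped and the sum is divided by the cluster size.
    """
    return sum(corr[i, j] for j in cluster if j != i) / len(cluster)

# Hypothetical correlation matrix for three genes.
corr = np.array([
    [ 1.0,  0.8, -0.6],
    [ 0.8,  1.0, -0.4],
    [-0.6, -0.4,  1.0],
])
inside = average_correlation(corr, 0, [0, 1])   # gene 0 against its own cluster
outside = average_correlation(corr, 2, [0, 1])  # gene 2 against a cluster it repels
```

A positive value suggests the gene belongs with the cluster; a negative value suggests repulsion from it.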
19
Divisive Correlation Clustering Algorithm
[Figure: DCCA takes n genes x1, ..., xn, each with m samples, and produces K disjoint clusters C1, C2, ..., CK.]
20
Step 1:
Step 2: for each iteration, do:
Step 2-i:
21
Step 2:
Step 2-ii:
Step 2-iii:
[Figure: clusters C1, C2, ..., Cp.] Which cluster has the largest repulsion value? Cluster C.
22
Step 2-iv:
[Figure: cluster C, containing xi, xj and the remaining genes xk, is split into two clusters Cp and Cq.]
23
Step 2-v:
[Figure: for each gene xk, the cluster among C1, C2, ..., CK with the highest average correlation value is found, and a copy of xk is placed in it. CNEW: the new clusters.]
24
Step 2-vi:
[Figure: the clusters C1, C2, ..., CK are compared with the new clusters CNEW.] Any change?
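The step formulas are not preserved in this transcript, so the following is only a loose sketch of the divisive loop that the figures suggest, not the authors' exact algorithm. Everything marked as an assumption is a guess: the starting point of a single cluster, how the cluster "with the most repulsion" is chosen, the splitting rule, the form of the average correlation, and the names `dcca_sketch` and `avg_corr`.

```python
import numpy as np

def avg_corr(corr, k, members):
    # Assumed form: correlation of gene k with the other members, divided by cluster size.
    return sum(corr[k, j] for j in members if j != k) / len(members)

def dcca_sketch(expr, max_rounds=100):
    corr = np.corrcoef(expr)              # rows are genes, columns are samples
    n = corr.shape[0]
    clusters = [list(range(n))]           # assumption: start with all genes in one cluster
    for _ in range(n):                    # safety cap on the number of splits
        # Assumption: the cluster with the most repulsion is the one that
        # contains the most negative pairwise correlation.
        worst, pair, target = 0.0, None, None
        for c_idx, members in enumerate(clusters):
            for a in members:
                for b in members:
                    if a < b and corr[a, b] < worst:
                        worst, pair, target = corr[a, b], (a, b), c_idx
        if pair is None:                  # no repulsion left inside any cluster
            break
        # Assumption: split around the repelling pair; each remaining gene
        # joins the seed it correlates with more strongly.
        xi, xj = pair
        cp, cq = [xi], [xj]
        for k in clusters[target]:
            if k not in pair:
                (cp if corr[k, xi] >= corr[k, xj] else cq).append(k)
        clusters[target:target + 1] = [sorted(cp), sorted(cq)]
        # Place each gene in the cluster with the highest average correlation
        # and repeat until the assignment no longer changes.
        for _ in range(max_rounds):
            new = [[] for _ in clusters]
            for k in range(n):
                scores = [avg_corr(corr, k, members) for members in clusters]
                new[int(np.argmax(scores))].append(k)
            new = [c for c in new if c]   # drop clusters that lost all genes
            if new == clusters:
                break
            clusters = new
    return clusters

# Hypothetical data: genes 0 and 1 rise together, genes 2 and 3 fall together.
expr = np.array([
    [1, 2, 3, 4, 5],
    [2, 4, 6, 8, 11],
    [5, 4, 3, 2, 1],
    [10, 8, 6, 4, 1],
], dtype=float)
result = dcca_sketch(expr)
```

On this toy input the sketch separates the rising genes from the falling ones. The number of clusters is not passed in; it falls out of when the repulsion inside every cluster disappears.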
25
Introduction, Divisive Correlation Clustering Algorithm, Results, Conclusions
26
Performance comparison: a synthetic dataset (ADS) and nine gene expression datasets.
27
A synthetic dataset ADS: three groups.
28
Experimental results: clustering correctly.
29
Experimental results: undesired clusters.
30
Five yeast datasets: Yeast ATP, Yeast PHO, Yeast AFR, Yeast AFRt, Yeast Cho et al. Four mammalian datasets: GDS958 Wild type, GDS958 Knocked out, GDS1423, GDS2745.
31
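Performance comparison: the z-score is calculated by observing the relation between a clustering result and the functional annotation (attributes) of the genes in the cluster. The relation is measured by the mutual information:
$$MI(C, A_1, \ldots, A_{N_A}) = N_A\, H(C) + \sum_{i=1}^{N_A} H(A_i) - \sum_{i=1}^{N_A} H(C, A_i)$$
where $H(C, A_i)$ are the entropies for each cluster-attribute pair, $H(C)$ is the entropy for the clustering result independent of attributes, and $H(A_i)$ are the entropies for each of the $N_A$ attributes independent of clusters.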
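The mutual information between a clustering and a set of gene attributes can be sketched as follows. The formula itself is not preserved on the slide, so this assumes the usual form, summed over attributes: H(C) + H(A_i) - H(C, A_i). The function names and the toy labels are illustrative only.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a sequence of labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def mutual_information(clusters, attributes):
    """clusters: one cluster label per gene; attributes: list of per-gene label lists."""
    h_c = entropy(clusters)                         # clustering alone
    mi = 0.0
    for attr in attributes:
        h_a = entropy(attr)                         # attribute alone
        h_ca = entropy(list(zip(clusters, attr)))   # cluster-attribute pair
        mi += h_c + h_a - h_ca
    return mi

# Toy example: four genes in two clusters.
clusters = [0, 0, 1, 1]
aligned = mutual_information(clusters, [[1, 1, 0, 0]])    # attribute follows the clusters
unrelated = mutual_information(clusters, [[1, 0, 1, 0]])  # attribute ignores the clusters
```

An attribute that follows the cluster boundaries contributes one full bit here, while an attribute spread evenly over both clusters contributes nothing.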
32
The z-score is defined as:
$$z = \frac{MI_{real} - \overline{MI}_{random}}{s_{random}}$$
where $MI_{real}$ is the computed MI for the clustered data, using the attribute database. $MI_{random}$ is computed by computing MI for a clustering obtained by randomly assigning genes to clusters of uniform size and repeating until a distribution of values is obtained. $\overline{MI}_{random}$ is the mean of these MI values and $s_{random}$ is their standard deviation.
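Given the MI of the real clustering and a list of MI values from random clusterings, the z-score is a one-line computation. In this sketch the MI values are assumed to be precomputed, the numbers are made up, and the sample standard deviation is used, which is an assumption since the slide does not say which estimator is meant.

```python
import statistics

def z_score(mi_real, mi_random_values):
    """z-score of the real clustering's MI against MI values from random clusterings."""
    mean_random = statistics.mean(mi_random_values)
    s_random = statistics.stdev(mi_random_values)  # sample standard deviation (assumed)
    return (mi_real - mean_random) / s_random

# Hypothetical MI values.
z = z_score(5.0, [1.0, 2.0, 3.0])
```

At least two random MI values that are not all identical are needed, otherwise the standard deviation is undefined or zero.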
33
A higher value of z indicates that genes are better clustered by function, indicating a more biologically relevant clustering result. Gibbons' ClusterJudge tool is used to calculate the z-score for the five yeast datasets.
34
Experimental results: [figures for slides 34 to 38 not preserved]
39
Introduction, Divisive Correlation Clustering Algorithm, Results, Conclusions
40
Pros: DCCA is able to obtain clustering solutions with high biological significance from gene-expression datasets. DCCA detects clusters of genes with similar variation patterns in their expression profiles, without taking the expected number of clusters as an input.
41
Cons: The computational cost of repairing any misplacement that occurs in the clustering step is high. DCCA will not work if the dataset contains fewer than 3 samples, because the correlation value will then always be either +1 or -1.
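The limitation for small sample counts is easy to check numerically: with only two samples per gene, every pairwise Pearson correlation collapses to +1 or -1. The expression values below are made up for illustration.

```python
import numpy as np

# Four hypothetical genes measured on only two samples.
expr = np.array([
    [1.0, 3.0],
    [5.0, 2.0],
    [0.2, 0.9],
    [7.0, 7.5],
])
corr = np.corrcoef(expr)                      # rows are genes, columns are samples
off_diagonal = corr[~np.eye(4, dtype=bool)]   # all pairwise values, self-pairs dropped
all_extreme = bool(np.allclose(np.abs(off_diagonal), 1.0))
```

Only the direction of change between the two samples survives, so the magnitude of the correlation carries no information.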