Anindya Bhattacharya and Rajat K. De Bioinformatics, 2008.

 Introduction  Divisive Correlation Clustering Algorithm  Results  Conclusions

 Correlation Clustering

 Correlation clustering was proposed by Bansal et al. in Machine Learning, 2004.  It is based on the notion of graph partitioning.

 How to construct the graph?  Nodes: genes.  Edges: correlation between the genes.  Two types of edges:  Positive edge.  Negative edge.

 For example:  A positive correlation coefficient between genes X and Y gives a positive edge.  A negative correlation coefficient between X and Y gives a negative edge.  [Figure: graph construction over the genes, followed by graph partitioning into Cluster 1 and Cluster 2.]
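 The graph construction described above can be sketched in a few lines. This is only an illustration: the function name `signed_edges`, the toy expression profiles, and the choice to treat a correlation of exactly zero as a negative edge are assumptions, not part of the slides.

```python
import numpy as np

def signed_edges(expr, names):
    """Map each gene pair to '+' or '-' using the sign of its Pearson correlation."""
    corr = np.corrcoef(expr)  # rows are genes, columns are samples
    edges = {}
    for a in range(len(names)):
        for b in range(a + 1, len(names)):
            # Assumption: a correlation of exactly zero is treated as negative.
            edges[(names[a], names[b])] = "+" if corr[a, b] > 0 else "-"
    return edges

# Hypothetical toy profiles (not taken from the presentation).
expr = np.array([
    [1.0, 2.0, 3.0, 4.0],  # X rises
    [2.0, 4.0, 6.0, 8.0],  # Y rises with X
    [4.0, 3.0, 2.0, 1.0],  # Z falls as X rises
])
edges = signed_edges(expr, ["X", "Y", "Z"])
print(edges)
```

 Only the sign of each correlation survives in this graph, which is the limitation the presentation raises later.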

 How to measure the quality of clusters?  The number of agreements.  The number of disagreements.  The number of agreements: the number of genes that are in correct clusters.  The number of disagreements: the number of genes wrongly clustered.

 For example:  [Figure: genes A, B, C, D and E partitioned into Cluster 1 and Cluster 2.]  The measure of agreements is the sum of: (1) the number of positive edges within the same cluster and (2) the number of negative edges between different clusters. Here: 4 + 4 = 8.  The measure of disagreements is the sum of: (1) the number of negative edges within the same cluster and (2) the number of positive edges between different clusters. Here: 0 + 2 = 2.
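 The counting rule above is easy to check in code. The figure's edge signs and cluster memberships did not survive in this transcript, so the assignment below (A, B, C in one cluster, D, E in the other, with two positive cross-cluster edges) is an assumed configuration chosen only because it is consistent with the totals 4 + 4 = 8 and 0 + 2 = 2.

```python
from itertools import combinations

def count_agreements(edges, cluster_of):
    """Return (agreements, disagreements) for a signed graph and a partition."""
    agree = disagree = 0
    for (a, b), sign in edges.items():
        same_cluster = cluster_of[a] == cluster_of[b]
        # Agreement: '+' edge inside a cluster, or '-' edge across clusters.
        if (sign == "+") == same_cluster:
            agree += 1
        else:
            disagree += 1
    return agree, disagree

# Assumed partition and edge signs (hypothetical, see the note above).
cluster_of = {"A": 1, "B": 1, "C": 1, "D": 2, "E": 2}
positive_across = {("A", "D"), ("C", "E")}

edges = {}
for a, b in combinations("ABCDE", 2):
    if cluster_of[a] == cluster_of[b]:
        edges[(a, b)] = "+"
    else:
        edges[(a, b)] = "+" if (a, b) in positive_across else "-"

print(count_agreements(edges, cluster_of))  # (8, 2)
```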

 The goal is to minimize disagreements or, equivalently, to maximize agreements.  However, Bansal et al. (2004) proved that this problem is NP-complete.  Another problem is that the formulation ignores the magnitude of the correlation coefficients.

 Pearson correlation coefficient  Terms and measurements used in DCCA  Divisive Correlation Clustering Algorithm

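 Consider a set of $n$ genes, $X = \{x_1, x_2, \dots, x_n\}$, for each of which $m$ expression values are given.  The Pearson correlation coefficient between two genes $x_i$ and $x_j$ is defined as:

$$Corr(x_i, x_j) = \frac{\sum_{l=1}^{m} (x_{il} - \bar{x}_i)(x_{jl} - \bar{x}_j)}{\sqrt{\sum_{l=1}^{m} (x_{il} - \bar{x}_i)^2}\,\sqrt{\sum_{l=1}^{m} (x_{jl} - \bar{x}_j)^2}}$$

 where $x_{il}$ is the $l$th sample value of gene $x_i$ and $\bar{x}_i$ is the mean value of gene $x_i$ over the $m$ samples.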

 $Corr(x_i, x_j) > 0$: genes $x_i$ and $x_j$ are positively correlated, with the degree of correlation given by its magnitude.  $Corr(x_i, x_j) < 0$: genes $x_i$ and $x_j$ are negatively correlated, with value $|Corr(x_i, x_j)|$.
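 As a concrete reading of the Pearson correlation coefficient between two expression profiles, here is a minimal pure-Python sketch. The function name `pearson` and the sample profiles are illustrative assumptions. The function divides by zero if either profile is constant across samples.

```python
import math

def pearson(x, y):
    """Pearson correlation between two equal-length expression profiles."""
    m = len(x)
    mean_x = sum(x) / m
    mean_y = sum(y) / m
    # Numerator: sum of products of deviations from each gene's mean.
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

print(pearson([1, 2, 3], [2, 4, 6]))  # profiles rise together: positive
print(pearson([1, 2, 3], [3, 2, 1]))  # profiles move oppositely: negative
```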

 We define some terms and measurements used in DCCA:  Attraction  Repulsion  Attraction/Repulsion value  Average correlation value

 Attraction: There is an attraction between $x_i$ and $x_j$ if $Corr(x_i, x_j) > 0$.  Repulsion: There is a repulsion between $x_i$ and $x_j$ if $Corr(x_i, x_j) < 0$.  Attraction/Repulsion value: The magnitude of $Corr(x_i, x_j)$ is the strength of attraction or repulsion.

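 The genes will be grouped into $K$ disjoint clusters $C_1, C_2, \dots, C_K$.  Average correlation value: The average correlation value for a gene $x_i$ with respect to cluster $C_k$ is defined as:

$$AVGC_{ik} = \frac{1}{\lambda_k} \sum_{x_j \in C_k,\; j \neq i} Corr(x_i, x_j)$$

 where $\lambda_k$ is the number of data points in $C_k$.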

 The average correlation value indicates the average correlation of gene $x_i$ with the other genes inside cluster $C_k$.  It reflects the degree of inclusion of $x_i$ in cluster $C_k$.
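 The average correlation value can be sketched as follows. The exact formula is not legible in this transcript, so the code makes a stated assumption: it sums a gene's correlations with the other genes in the cluster and divides by the number of data points in the cluster. The function name `average_correlation` and the example matrix are hypothetical.

```python
import numpy as np

def average_correlation(corr, i, members):
    """Average correlation of gene i with respect to the cluster `members`.

    Assumption: self-correlation is skipped and the sum is divided by the
    full cluster size (the number of data points in the cluster).
    """
    members = list(members)
    if not members:
        raise ValueError("cluster must contain at least one gene")
    total = sum(corr[i, j] for j in members if j != i)
    return total / len(members)

# Hypothetical 3-gene correlation matrix.
corr = np.array([
    [1.0, 0.8, -0.6],
    [0.8, 1.0, -0.4],
    [-0.6, -0.4, 1.0],
])
print(average_correlation(corr, 0, [0, 1]))  # gene 0 vs. cluster {0, 1}
print(average_correlation(corr, 0, [2]))     # gene 0 vs. cluster {2}
```

 A positive value means the gene is, on average, attracted to the cluster; a negative value means it is repelled by it.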

 Divisive Correlation Clustering Algorithm  [Figure: DCCA takes the expression values of n genes $x_1, \dots, x_n$ over m samples and produces K disjoint clusters $C_1, C_2, \dots, C_K$.]

 Step 1:  Step 2: for each iteration, do:  Step 2-i:

 Step 2:  Step 2-ii:  Step 2-iii:  [Figure: among the clusters $C_1, C_2, \dots, C_p$, which cluster has the largest repulsion value? Call it cluster C.]

 Step 2-iv:  [Figure: cluster C, containing genes $x_i$, $x_j$ and the remaining genes $x_k$, is divided into two clusters $C_p$ and $C_q$.]

 Step 2-v:  [Figure: each gene $x_k$ is compared against the clusters $C_1, C_2, \dots, C_K$, and a copy of $x_k$ is placed in the cluster with the highest average correlation value. The result is $C^{NEW}$, the new clusters.]

 Step 2-vi:  [Figure: the clusters $C_1, C_2, \dots, C_K$ are compared with the new clusters $C^{NEW}$. Any change?]
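 The step slides above lost most of their text, so the following is a rough sketch of a divisive loop in the spirit of those steps, not the authors' exact procedure. Every gap is filled by an assumption: all genes start in one cluster; the "most repulsive" cluster is the one holding the most negative pairwise correlation; that pair seeds the two halves of the split, and each remaining gene joins the seed it correlates with more strongly; genes are then moved to the cluster with the highest average correlation value until nothing changes. All function names are hypothetical.

```python
import numpy as np

def avg_corr(corr, i, members):
    """Sum of correlations with the other members, divided by cluster size."""
    return sum(corr[i, j] for j in members if j != i) / len(members)

def most_repulsive(corr, clusters):
    """Find (value, cluster index, gene a, gene b) for the most negative
    within-cluster correlation, or None if no repulsion remains."""
    best = None
    for c, members in enumerate(clusters):
        for a in members:
            for b in members:
                if a < b and corr[a, b] < 0 and (best is None or corr[a, b] < best[0]):
                    best = (corr[a, b], c, a, b)
    return best

def reassign(corr, clusters, max_iter=100):
    """Move each gene to its highest-average-correlation cluster until stable."""
    for _ in range(max_iter):
        new = [[] for _ in clusters]
        for g in range(corr.shape[0]):
            scores = [avg_corr(corr, g, m) for m in clusters]
            new[int(np.argmax(scores))].append(g)
        new = [m for m in new if m]  # drop clusters that became empty
        if new == clusters:          # "Any change?" check
            break
        clusters = new
    return clusters

def dcca_sketch(expr, max_iter=100):
    corr = np.corrcoef(expr)                 # rows are genes
    clusters = [list(range(corr.shape[0]))]  # assumed start: one cluster
    for _ in range(max_iter):
        hit = most_repulsive(corr, clusters)
        if hit is None:                      # no repulsion left anywhere
            break
        _, c, a, b = hit
        members = clusters.pop(c)
        part_a, part_b = [a], [b]            # assumed split seeds
        for g in members:
            if g not in (a, b):
                (part_a if corr[g, a] >= corr[g, b] else part_b).append(g)
        clusters += [sorted(part_a), sorted(part_b)]
        clusters = reassign(corr, clusters)
    return clusters

# Hypothetical data: genes 0-2 rise across samples, genes 3-4 fall.
expr = np.array([
    [1, 2, 3, 4, 5],
    [2, 4, 5, 8, 11],
    [0, 1, 1, 3, 4],
    [5, 4, 3, 2, 1],
    [9, 7, 6, 3, 1],
], dtype=float)
print(dcca_sketch(expr))
```

 Note that the number of clusters is never passed in: the loop stops on its own once no cluster contains a negatively correlated pair. The `max_iter` caps are a safety measure added for this sketch.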

 Performance comparison  A synthetic dataset ADS  Nine gene expression datasets

 A synthetic dataset ADS: three groups.

 Experimental results: clustered correctly.

 Experimental results: undesired clusters.

 Five yeast datasets:  Yeast ATP, Yeast PHO, Yeast AFR, Yeast AFRt, Yeast Cho et al.  Four mammalian datasets:  GDS958 Wild type, GDS958 Knocked out, GDS1423, GDS2745.

 Performance comparison: the z-score is calculated by observing the relation between a clustering result and the functional annotation of the genes in the clusters.  For a clustering result $C$ and attributes $A_1, \dots, A_{N_A}$, the mutual information is:

$$MI = \sum_{i=1}^{N_A} \left[ H(C) + H(A_i) - H(C, A_i) \right]$$

 where $H(C, A_i)$ are the entropies for each cluster-attribute pair, $H(C)$ is the entropy for the clustering result independent of attributes, and $H(A_i)$ are the entropies for each of the $N_A$ attributes independent of clusters.

 The z-score is defined as:

$$z = \frac{MI_{real} - \overline{MI}_{random}}{s_{random}}$$

 where $MI_{real}$ is the computed MI for the clustered data, using the attribute database.  $MI_{random}$ is computed by computing MI for a clustering obtained by randomly assigning genes to clusters of uniform size, repeating until a distribution of values is obtained.  $\overline{MI}_{random}$ is the mean of these MI values and $s_{random}$ is their standard deviation.

 A higher value of z indicates that genes are better clustered by function, and therefore that the clustering result is more biologically relevant.  Gibbons' ClusterJudge tool is used to calculate the z-score for the five yeast datasets.
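 To make the z-score concrete, here is a small sketch that computes mutual information between cluster labels and annotation attributes and compares it with random clusterings of uniform size. It is not ClusterJudge and makes no claim to match its implementation. The function names, the binary attribute flags, the number of random repetitions and the seed are all assumptions. The z-score is undefined if every random clustering yields the same MI.

```python
import numpy as np
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a sequence of hashable labels."""
    counts = np.array(list(Counter(labels).values()), dtype=float)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def mutual_information(clusters, attributes):
    """Sum over attributes of H(C) + H(A_i) - H(C, A_i)."""
    clusters = list(clusters)
    h_c = entropy(clusters)
    total = 0.0
    for attr in attributes:  # one label per gene for each attribute
        attr = list(attr)
        total += h_c + entropy(attr) - entropy(list(zip(clusters, attr)))
    return total

def z_score(clusters, attributes, n_random=200, seed=0):
    """(MI_real - mean(MI_random)) / std(MI_random)."""
    rng = np.random.default_rng(seed)
    mi_real = mutual_information(clusters, attributes)
    k = len(set(clusters))
    # Random baseline: same number of clusters, sizes as uniform as possible.
    uniform = np.arange(len(clusters)) % k
    mi_random = [
        mutual_information(rng.permutation(uniform).tolist(), attributes)
        for _ in range(n_random)
    ]
    return float((mi_real - np.mean(mi_random)) / np.std(mi_random))

# Hypothetical example: 20 genes, 2 clusters, one attribute that lines up
# perfectly with the clustering.
clusters = [0] * 10 + [1] * 10
attributes = [[1] * 10 + [0] * 10]
print(mutual_information(clusters, attributes))
print(z_score(clusters, attributes))
```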

 Experimental results:

 Pros:  DCCA is able to obtain clustering solutions with high biological significance from gene-expression datasets.  DCCA detects clusters of genes with similar variation patterns in their expression profiles, without taking the expected number of clusters as an input.

 Cons:  The computational cost of repairing any misplacement that occurs in the clustering step is high.  DCCA will not work if the dataset contains fewer than 3 samples, because the correlation value will then be either +1 or -1.
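 The sample-count limitation is easy to reproduce. With only two samples per gene, any two non-constant profiles lie exactly on a line, so their Pearson correlation carries no information about strength. The random data and seed below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
expr = rng.normal(size=(5, 2))  # 5 hypothetical genes, only 2 samples each
corr = np.corrcoef(expr)        # rows are genes

# Every entry is +1 or -1 up to floating-point rounding.
print(np.round(corr, 6))
```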
