Anindya Bhattacharya and Rajat K. De Bioinformatics, 2008.


1 Anindya Bhattacharya and Rajat K. De Bioinformatics, 2008

2
• Introduction
• Divisive Correlation Clustering Algorithm
• Results
• Conclusions

3
• Introduction
• Divisive Correlation Clustering Algorithm
• Results
• Conclusions

4
• Correlation Clustering

5
• Correlation clustering was proposed by Bansal et al. in Machine Learning, 2004.
• It is based on the notion of graph partitioning.

6
• How to construct the graph?
  • Nodes: genes.
  • Edges: correlation between the genes.
• Two types of edges:
  • Positive edge.
  • Negative edge.

7
• For example:
  • Positive correlation coefficient between X and Y: positive edge.
  • Negative correlation coefficient between X and Y: negative edge.
[Figure: graph construction over genes A to H, followed by graph partitioning into Cluster 1 and Cluster 2.]

8
• How to measure the quality of clusters?
  • The number of agreements: the number of genes that are in correct clusters.
  • The number of disagreements: the number of genes wrongly clustered.

9
• For example:
[Figure: genes A to E partitioned into Cluster 1 and Cluster 2.]
• The measure of agreements is the sum of:
  (1) the number of positive edges in the same clusters
  (2) the number of negative edges in different clusters
  In the example: 4 + 4 = 8.
• The measure of disagreements is the sum of:
  (1) the number of negative edges in the same clusters
  (2) the number of positive edges in different clusters
  In the example: 0 + 2 = 2.
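The counting rule on this slide can be sketched in a few lines of Python. The figure's actual edges are not recoverable here, so the graph below is a hypothetical one, chosen only so that it reproduces the counts 4 + 4 = 8 and 0 + 2 = 2; the function name and data layout are likewise assumptions made for illustration.

```python
def count_agreements(edges, cluster_of):
    """Return (agreements, disagreements) for a signed graph and a partition.

    edges: dict mapping a pair of nodes (u, v) to +1 (positive) or -1 (negative).
    cluster_of: dict mapping each node to its cluster label.
    """
    agreements = disagreements = 0
    for (u, v), sign in edges.items():
        same_cluster = cluster_of[u] == cluster_of[v]
        # Agreement: positive edge inside a cluster, or negative edge across clusters.
        if (sign > 0) == same_cluster:
            agreements += 1
        else:
            disagreements += 1
    return agreements, disagreements


# Hypothetical example (not the slide's actual figure).
cluster_of = {"A": 1, "B": 1, "C": 1, "D": 2, "E": 2}
edges = {
    ("A", "B"): +1, ("A", "C"): +1, ("B", "C"): +1, ("D", "E"): +1,  # within clusters
    ("A", "D"): -1, ("A", "E"): -1, ("B", "D"): -1, ("C", "E"): -1,  # across, negative
    ("B", "E"): +1, ("C", "D"): +1,                                  # across, positive
}
result = count_agreements(edges, cluster_of)
print(result)
```

Treating each edge exactly once keeps agreements + disagreements equal to the number of edges, which is why minimizing one count is the same as maximizing the other.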

10
• Goal: minimize the disagreements or, equivalently, maximize the agreements.
• However, Bansal et al. (2004) proved that this problem is NP-complete.
• Another problem: this formulation ignores the magnitude of the correlation coefficients.

11
• Introduction
• Divisive Correlation Clustering Algorithm
• Results
• Conclusions

12
• Pearson correlation coefficient
• Terms and measurements used in DCCA
• Divisive Correlation Clustering Algorithm

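13
• Consider a set of $n$ genes, $X = \{x_1, x_2, \ldots, x_n\}$, for each of which $m$ expression values are given.
• The Pearson correlation coefficient between two genes $x_i$ and $x_j$ is defined as:

$$\mathrm{Corr}(x_i, x_j) = \frac{\sum_{l=1}^{m} (x_{il} - \bar{x}_i)(x_{jl} - \bar{x}_j)}{\sqrt{\sum_{l=1}^{m} (x_{il} - \bar{x}_i)^2}\,\sqrt{\sum_{l=1}^{m} (x_{jl} - \bar{x}_j)^2}}$$

  where $x_{il}$ is the $l$th sample value of gene $x_i$ and $\bar{x}_i$ is the mean value of gene $x_i$ from the $m$ samples.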

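14
• $\mathrm{Corr}(x_i, x_j) > 0$: $x_i$ and $x_j$ are positively correlated, with the magnitude of $\mathrm{Corr}(x_i, x_j)$ as the degree of correlation.
• $\mathrm{Corr}(x_i, x_j) < 0$: $x_i$ and $x_j$ are negatively correlated, with value $|\mathrm{Corr}(x_i, x_j)|$.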
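As a concrete illustration, the Pearson correlation coefficient between two expression profiles can be computed as in the minimal sketch below. The function name `pearson` and the sample values are assumptions made for this example; only the standard library is used.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length profiles."""
    m = len(x)
    mean_x = sum(x) / m
    mean_y = sum(y) / m
    # Numerator: sum of products of deviations from the means.
    num = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    # Denominator: product of the root sums of squared deviations.
    den = math.sqrt(sum((a - mean_x) ** 2 for a in x)) * math.sqrt(
        sum((b - mean_y) ** 2 for b in y)
    )
    return num / den  # undefined (division by zero) for a constant profile


# Hypothetical expression profiles over four samples.
gene_up = [1.0, 2.0, 3.0, 4.0]
gene_mixed = [1.0, 3.0, 2.0, 4.0]
gene_down = [4.0, 3.0, 2.0, 1.0]

r_pos = pearson(gene_up, gene_mixed)  # positive: the profiles mostly rise together
r_neg = pearson(gene_up, gene_down)   # negative: one rises while the other falls
print(r_pos, r_neg)
```

The sign of the result tells whether the two profiles vary in the same or in opposite directions, and its magnitude gives the degree of correlation.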

15
• We define some terms and measurements used in DCCA:
  • Attraction
  • Repulsion
  • Attraction/Repulsion value
  • Average correlation value

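16
• Attraction: there is an attraction between $x_i$ and $x_j$ if $\mathrm{Corr}(x_i, x_j) > 0$.
• Repulsion: there is a repulsion between $x_i$ and $x_j$ if $\mathrm{Corr}(x_i, x_j) < 0$.
• Attraction/Repulsion value: the magnitude of $\mathrm{Corr}(x_i, x_j)$ is the strength of the attraction or repulsion.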

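17
• The genes will be grouped into $K$ disjoint clusters $C_1, C_2, \ldots, C_K$.
• Average correlation value: the average correlation value for a gene $x_i$ with respect to cluster $C_p$ is defined as:

$$\mathrm{AVGC}_{pi} = \frac{1}{|C_p|} \sum_{x_j \in C_p} \mathrm{Corr}(x_i, x_j)$$

  where $|C_p|$ is the number of data points in $C_p$.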

18
• The average correlation value $\mathrm{AVGC}_{pi}$ indicates the average correlation of gene $x_i$ with the other genes inside cluster $C_p$.
• The average correlation value reflects the degree of inclusion of $x_i$ in cluster $C_p$.
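A small sketch may help make the average correlation value concrete: it averages the Pearson correlation of one gene with every member of a cluster. The helper names, the dictionary layout, and the expression values below are assumptions made for illustration, and the sum here runs over all members of the cluster as given.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length profiles."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = math.sqrt(sum((a - mx) ** 2 for a in x)) * math.sqrt(
        sum((b - my) ** 2 for b in y)
    )
    return num / den


def average_correlation(gene, cluster, expression):
    """Mean correlation of `gene` with every gene listed in `cluster`."""
    total = sum(pearson(expression[gene], expression[other]) for other in cluster)
    return total / len(cluster)


# Hypothetical expression data: g2 and g3 rise with g1, g4 and g5 fall.
expression = {
    "g1": [1.0, 2.0, 3.0, 4.0],
    "g2": [2.0, 4.0, 6.0, 8.0],
    "g3": [0.5, 1.0, 1.5, 2.0],
    "g4": [4.0, 3.0, 2.0, 1.0],
    "g5": [8.0, 6.0, 4.0, 2.0],
}
avg_rising = average_correlation("g1", ["g2", "g3"], expression)
avg_falling = average_correlation("g1", ["g4", "g5"], expression)
print(avg_rising, avg_falling)
```

A gene fits best in the cluster where this value is highest, so here `g1` belongs with the rising genes rather than the falling ones.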

19
• Divisive Correlation Clustering Algorithm
[Figure: DCCA takes n genes X1, ..., Xn, each measured over m samples, and produces K disjoint clusters C1, C2, ..., CK.]

20
• Step 1:
• Step 2: for each iteration, do:
  • Step 2-i:

21
• Step 2:
  • Step 2-ii:
  • Step 2-iii:
[Figure: clusters C1, C2, ..., Cp. Which cluster contains the strongest repulsion value? Cluster C.]

22
• Step 2-iv:
[Figure: cluster C, containing genes xi, xj and the remaining genes xk, and the two clusters Cp and Cq.]

23
• Step 2-v:
[Figure: for a gene xk, the cluster among C1, C2, ..., CK with the highest average correlation value is found, and a copy of xk is placed in CNEW, the new clusters.]

24
• Step 2-vi:
[Figure: the clusters C1, C2, ..., CK are compared with CNEW, the new clusters. Any change?]
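The step descriptions above are terse, so the following is only a rough sketch of a divisive, correlation-driven loop consistent with the figure captions: pick the cluster holding the strongest repulsion, split it around the most repelling pair, then move genes to the cluster with the highest average correlation until nothing changes. Every detail not stated on the slides (starting from a single cluster, the split rule, the stopping rule, the zero-variance guard, and all function names) is an assumption, not the authors' published procedure.

```python
import math


def pearson(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = math.sqrt(sum((a - mx) ** 2 for a in x)) * math.sqrt(
        sum((b - my) ** 2 for b in y)
    )
    return num / den if den else 0.0  # assumption: constant profile counts as 0


def avg_corr(gene, cluster, corr):
    return sum(corr[gene, other] for other in cluster) / len(cluster)


def dcca_sketch(expression, max_iter=100):
    names = sorted(expression)
    corr = {(a, b): pearson(expression[a], expression[b])
            for a in names for b in names}
    clusters = [set(names)]  # assumed Step 1: all genes start in one cluster
    for _ in range(max_iter):
        # Assumed Steps 2-ii/2-iii: find the most negative pair inside any cluster.
        worst = None
        for idx, members in enumerate(clusters):
            for a in members:
                for b in members:
                    if a < b and corr[a, b] < 0 and (
                        worst is None or corr[a, b] < worst[0]
                    ):
                        worst = (corr[a, b], idx, a, b)
        if worst is None:
            break  # no repulsion left inside any cluster
        _, idx, xi, xj = worst
        # Assumed Step 2-iv: split cluster C around xi and xj.
        c = clusters.pop(idx)
        cp = {xk for xk in c if corr[xi, xk] >= corr[xj, xk]}
        clusters += [cp, c - cp]
        # Assumed Steps 2-v/2-vi: reassign genes until the clusters stop changing.
        for _ in range(max_iter):
            new = [set() for _ in clusters]
            for xk in names:
                best = max(range(len(clusters)),
                           key=lambda p: avg_corr(xk, clusters[p], corr))
                new[best].add(xk)
            new = [members for members in new if members]
            unchanged = sorted(map(sorted, new)) == sorted(map(sorted, clusters))
            clusters = new
            if unchanged:
                break
    return clusters


# Hypothetical data: two rising genes and two falling genes.
expression = {
    "a": [1.0, 2.0, 3.0, 4.0], "b": [2.0, 4.0, 6.0, 8.5],
    "c": [4.0, 3.0, 2.0, 1.0], "d": [8.0, 6.0, 4.0, 2.5],
}
found = sorted(sorted(members) for members in dcca_sketch(expression))
print(found)
```

Note that the number of clusters is never passed in: the loop keeps splitting only while some cluster still contains a negatively correlated pair.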

25
• Introduction
• Divisive Correlation Clustering Algorithm
• Results
• Conclusions

26
• Performance comparison
  • A synthetic dataset ADS
  • Nine gene expression datasets

27
• A synthetic dataset ADS:
[Figure: three groups.]

28
• Experimental results:
[Figure: clustering correctly.]

29
• Experimental results:
[Figure: undesired clusters.]

30
• Five yeast datasets: Yeast ATP, Yeast PHO, Yeast AFR, Yeast AFRt, Yeast Cho et al.
• Four mammalian datasets: GDS958 Wild type, GDS958 Knocked out, GDS1423, GDS2745.

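31
• Performance comparison: the z-score is calculated by observing the relation between a clustering result and the functional annotation of the genes in the cluster.
• Mutual information between a clustering result $C$ and the attributes $A_1, \ldots, A_{N_A}$:

$$MI(C, A_1, \ldots, A_{N_A}) = N_A\,H(C) + \sum_{i=1}^{N_A} H(A_i) - \sum_{i=1}^{N_A} H(C, A_i)$$

  where $H(C, A_i)$ are the entropies for each cluster-attribute pair, $H(C)$ is the entropy for the clustering result independent of attributes, and $H(A_i)$ are the entropies for each of the $N_A$ attributes independent of clusters.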
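To make the entropy terms concrete, here is a minimal sketch that computes the mutual information between a clustering and a set of gene attributes as entropy of the clustering plus entropy of the attribute minus their joint entropy, summed over attributes. The function names, the use of base-2 logarithms, and the toy labels are assumptions made for illustration; this is not the ClusterJudge implementation.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a sequence of hashable labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def mutual_information(clusters, attribute):
    """MI between cluster labels and one attribute: H(C) + H(A) - H(C, A)."""
    joint = list(zip(clusters, attribute))
    return entropy(clusters) + entropy(attribute) - entropy(joint)


def total_mutual_information(clusters, attributes):
    """Sum of the per-attribute mutual information values."""
    return sum(mutual_information(clusters, a) for a in attributes)


# Hypothetical data: four genes, two clusters, two binary attributes.
clusters = [0, 0, 1, 1]
attr_matches = [1, 1, 0, 0]    # perfectly aligned with the clusters
attr_unrelated = [1, 0, 1, 0]  # independent of the clusters

mi_total = total_mutual_information(clusters, [attr_matches, attr_unrelated])
print(mi_total)
```

An attribute that lines up with the clusters contributes its full entropy, while an attribute spread evenly across clusters contributes nothing.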

32
• The z-score is defined as:

$$z = \frac{MI_{real} - \overline{MI}_{random}}{s_{random}}$$

  where $MI_{real}$ is the computed MI for the clustered data, using the attribute database. $MI_{random}$ is computed by computing MI for a clustering obtained by randomly assigning genes to clusters of uniform size and repeating until a distribution of values is obtained; $\overline{MI}_{random}$ is the mean of these MI values and $s_{random}$ is their standard deviation.

33
• A higher value of z indicates that genes are better clustered by function, indicating a more biologically relevant clustering result.
• Gibbons' ClusterJudge tool is used to calculate the z-score for the five yeast datasets.
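The z-score itself is a one-line computation once the MI of the real clustering and the MI values of the random clusterings are available. The sketch below assumes those values have already been computed; the function name, the choice of the sample standard deviation, and the numbers are hypothetical.

```python
import statistics


def z_score(mi_real, mi_random_values):
    """Standardize the real MI against the MI values of random clusterings."""
    mean_random = statistics.mean(mi_random_values)
    # Assumption: sample standard deviation; the tool's exact choice is not stated.
    std_random = statistics.stdev(mi_random_values)
    return (mi_real - mean_random) / std_random


# Hypothetical MI values, for illustration only.
mi_real = 5.0
mi_random_values = [1.0, 2.0, 3.0]
z = z_score(mi_real, mi_random_values)
print(z)
```

A large positive z means the real clustering shares far more information with the functional annotation than random clusterings of the same shape do.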

34
• Experimental results:

35
• Experimental results:

36
• Experimental results:

37
• Experimental results:

38
• Experimental results:

39
• Introduction
• Divisive Correlation Clustering Algorithm
• Results
• Conclusions

40
• Pros:
  • DCCA is able to obtain clustering solutions with high biological significance from gene-expression datasets.
  • DCCA detects clusters of genes with similar variation patterns in their expression profiles, without taking the expected number of clusters as an input.

41
• Cons:
  • The computation cost of repairing any misplacement occurring in the clustering step is high.
  • DCCA will not work if the dataset contains fewer than 3 samples, because the correlation value will then always be either +1 or -1.
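The second limitation is easy to check numerically: with only two samples per gene, any two non-constant profiles lie on a straight line, so their Pearson correlation is exactly +1 or -1 up to rounding. The sketch below draws random two-sample profiles to illustrate this; the function name, the seed, and the number of trials are arbitrary choices for the example.

```python
import math
import random


def pearson(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = math.sqrt(sum((a - mx) ** 2 for a in x)) * math.sqrt(
        sum((b - my) ** 2 for b in y)
    )
    return num / den


rng = random.Random(0)  # fixed seed so the run is repeatable
correlations = []
for _ in range(1000):
    # Two genes, each measured on only two samples.
    gene_a = [rng.random(), rng.random()]
    gene_b = [rng.random(), rng.random()]
    correlations.append(pearson(gene_a, gene_b))

# Every value has magnitude 1 (up to floating-point rounding).
all_unit = all(abs(abs(r) - 1.0) < 1e-9 for r in correlations)
print(all_unit)
```

Because the magnitude carries no information in this case, the attraction and repulsion values that DCCA relies on cannot rank gene pairs.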

