1.On finding clusters in undirected simple graphs: application to protein complex detection 2.DPClus software tool 3.Introduction to DPClusO 4.Concept.


1 Today's lecture will cover the following topics: 1. On finding clusters in undirected simple graphs: application to protein complex detection 2. DPClus software tool 3. Introduction to DPClusO 4. Concept of BiClustering 5. Concept of DNA sequencing

2 On finding clusters in undirected simple graphs: application to protein complex detection. Outline: Introduction; Some basic concepts; The proposed algorithm; The DPClus software; Results & Discussion; Conclusions

3 Introduction There is no universal definition of a cluster. But clustering is an important issue. Consequently there are diverse definitions and various methods. The major purpose of clustering is finding cohesive groups. Here, we are going to discuss a graph clustering algorithm.

4 Regarding a graph, a cluster is a subgraph whose nodes are densely connected with each other compared to their connections with other nodes in the graph. This is a flexible definition of a cluster. Intuitively, we can recognize two clusters in this arbitrary graph. Introduction But it is difficult to draw a big graph revealing its clusters.

5 Introduction An E. coli protein-protein interaction network consisting of 3007 proteins and 11531 interactions (from Mori Lab, NAIST, Japan). Some algorithm is needed to detect locally dense regions.

6 Md. Altaf-Ul-Amin, Yoko Shinbo, Kenji Mihara, Ken Kurokawa and Shigehiko Kanaya, “Development and implementation of an algorithm for detection of protein complexes in large interaction networks”, BMC Bioinformatics 7:207, April 2006. Introduction

7 Some basic concepts It is likely that two nodes that belong to the same cluster have more common neighbors than two nodes that do not.

8 Some basic concepts

9 Some basic concepts The density d of a cluster is the ratio of the number of edges present in it to the maximum possible number of edges in it. For a cluster with |N| nodes and |E| edges, d = |E| / |E|_max = 2|E| / (|N| (|N| - 1)). d is a real number ranging from 0 to 1.
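The density formula can be checked with a few lines of Python. This is a minimal sketch; the function name `density` and the choice to return 0 for graphs with fewer than two nodes are assumptions made for illustration, not part of the slides.

```python
def density(num_nodes: int, num_edges: int) -> float:
    """Density d = 2|E| / (|N| (|N| - 1)) of a simple undirected graph."""
    if num_nodes < 2:
        # Assumption: a cluster with fewer than two nodes gets density 0.
        return 0.0
    return 2.0 * num_edges / (num_nodes * (num_nodes - 1))


# A complete graph on 5 nodes has 10 edges; removing one edge lowers d.
full = density(5, 10)
one_missing = density(5, 9)
print(full, one_missing)
```

An 8-node graph with 14 edges has density 0.5 under this definition.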

10 Some basic concepts Density of the total graph = 0.241. Complexes: d = 0.9, d = 1.0. The densities of the complexes are relatively higher.

11 Some basic concepts Considering density alone is not enough. Both graphs consist of 8 nodes and both have density 0.5, but one of them seems to be a single cluster while the other is divided into two clusters. Such situations can be tackled by keeping track of the periphery.

12 Some basic concepts The cluster property of any node n with respect to any cluster k of density d_k and size |N_k| is defined as follows: cp_nk = |E_nk| / (d_k * |N_k|). Here, |E_nk| is the total number of edges between the node n and the nodes of cluster k. Figure labels: cluster property of node f ≈ 0.57; cluster property of node f = 0.2.
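The cluster property can be sketched directly from its definition. The function name and the input values below are illustrative assumptions; only the formula cp_nk = |E_nk| / (d_k * |N_k|) comes from the slide.

```python
def cluster_property(edges_to_cluster: int,
                     cluster_density: float,
                     cluster_size: int) -> float:
    """cp_nk = |E_nk| / (d_k * |N_k|) for node n and cluster k."""
    return edges_to_cluster / (cluster_density * cluster_size)


# A node with 3 edges into a 4-node cluster of density 1.0:
print(cluster_property(3, 1.0, 4))
# A node with 1 edge into a 5-node cluster of density 0.9:
print(round(cluster_property(1, 0.9, 5), 2))
```

A node connected to every member of a complete cluster gets cp = 1, and the value falls as the node touches fewer cluster members, which is how the periphery is tracked.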

13 The proposed Algorithm The proposed algorithm is a sequential constructive algorithm: it initializes the complex/cluster by choosing a seed node, then repeatedly adds other nodes on the basis of priority and some conditions. The major methods of the algorithm: choosing a seed node; selecting a priority node; checking necessary conditions before adding a node to a complex.

14 The proposed Algorithm Inputs to the algorithm are: the associated (adjacency) matrix of the network; a minimum threshold density for the generated clusters; a parameter to determine how we separate a complex from its periphery. Outputs of the algorithm are: overlapping/non-overlapping complexes whose densities are greater than or equal to the given density.

15 - Flowchart of the proposed Algorithm

16 The proposed Algorithm M_uv = 1 if there is an edge between nodes u and v and 0 otherwise. M =
0 1 0 0 0 0 0 0 0 0 0 0 0 0
1 0 1 1 0 1 0 0 0 0 0 0 0 0
0 1 0 1 1 1 0 0 0 0 0 0 0 0
0 1 1 0 1 1 0 1 0 0 0 0 0 0
0 0 1 1 0 1 0 0 0 0 0 0 0 0
0 1 1 1 1 0 1 0 0 0 0 0 0 0
0 0 0 0 0 1 0 0 0 0 1 0 0 0
0 0 0 1 0 0 0 0 1 0 0 0 0 0
0 0 0 0 0 0 0 1 0 1 0 0 1 1
0 0 0 0 0 0 0 0 1 0 1 0 1 1
0 0 0 0 0 0 1 0 0 1 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 1 0
0 0 0 0 0 0 0 0 1 1 0 1 0 1
0 0 0 0 0 0 0 0 1 1 0 0 1 0

17 The proposed Algorithm (M^2)_uv for u ≠ v represents the number of common neighbors of the nodes u and v. M^2 =
1 0 1 1 0 1 0 0 0 0 0 0 0 0
0 4 2 2 3 2 1 1 0 0 0 0 0 0
1 2 4 3 2 3 1 1 0 0 0 0 0 0
1 2 3 5 2 3 1 0 1 0 0 0 0 0
0 3 2 2 3 2 1 1 0 0 0 0 0 0
1 2 3 3 2 5 0 1 0 0 1 0 0 0
0 1 1 1 1 0 2 0 0 1 0 0 0 0
0 1 1 0 1 1 0 2 0 1 0 0 1 1
0 0 0 1 0 0 0 0 4 2 1 1 2 2
0 0 0 0 0 0 1 1 2 4 0 1 2 2
0 0 0 0 0 1 0 0 1 0 2 0 1 1
0 0 0 0 0 0 0 0 1 1 0 1 0 1
0 0 0 0 0 0 0 1 2 2 1 0 4 2
0 0 0 0 0 0 0 1 2 2 1 1 2 3
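Squaring the matrix can be sketched in plain Python. The edge list below is read off the 14-node matrix M of the previous slide, under the assumption that rows and columns are numbered 1 to 14 in order (the slide gives no node labels); the helper name `square` is also an assumption.

```python
# Edges of the 14-node example, nodes numbered 1..14 by row/column order
# (the numbering is an assumption; the slide shows an unlabeled matrix).
EDGES = [(1, 2), (2, 3), (2, 4), (2, 6), (3, 4), (3, 5), (3, 6), (4, 5),
         (4, 6), (4, 8), (5, 6), (6, 7), (7, 11), (8, 9), (9, 10), (9, 13),
         (9, 14), (10, 11), (10, 13), (10, 14), (12, 13), (13, 14)]
N = 14

# Build the symmetric 0/1 matrix M.
M = [[0] * N for _ in range(N)]
for u, v in EDGES:
    M[u - 1][v - 1] = M[v - 1][u - 1] = 1


def square(m):
    """Return m * m using plain nested lists."""
    n = len(m)
    return [[sum(m[i][k] * m[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


M2 = square(M)
# Off-diagonal entries count common neighbors; diagonal entries are degrees.
print(M2[1][4])  # common neighbors of nodes 2 and 5
print(M2[3][3])  # degree of node 4
```

For a symmetric 0/1 matrix, entry (u, v) of the square sums M_uk * M_kv over k, which is 1 exactly when k is adjacent to both u and v.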

19 The weights of edges are derived by squaring the associated matrix of the graph. [Figure: example graph with each edge labeled by its weight]

20 The proposed Algorithm The weight of a node is the sum of the weights of its connecting edges. [Figure: the same graph with edge weights and node weights]
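Edge and node weights can be sketched as follows. The five-node toy graph and all names are hypothetical, and picking the highest-weight node as the seed is an assumption about the seed rule, which the slides do not spell out.

```python
# Hypothetical toy graph (not the one in the figure).
EDGES = [("a", "b"), ("a", "c"), ("b", "c"), ("a", "d"),
         ("b", "d"), ("a", "e"), ("c", "e")]

# Neighbor sets of every node.
neighbors = {}
for u, v in EDGES:
    neighbors.setdefault(u, set()).add(v)
    neighbors.setdefault(v, set()).add(u)

# Edge weight = number of common neighbors of its two end nodes,
# i.e. the corresponding off-diagonal entry of the squared matrix.
edge_weight = {(u, v): len(neighbors[u] & neighbors[v]) for u, v in EDGES}

# Node weight = sum of the weights of its connecting edges.
node_weight = {n: 0 for n in neighbors}
for (u, v), w in edge_weight.items():
    node_weight[u] += w
    node_weight[v] += w

# Assumption: the seed is the node with the highest weight.
seed = max(node_weight, key=node_weight.get)
print(edge_weight, node_weight, seed)
```

Weighting edges by common neighbors favors nodes that sit inside a dense region over nodes that merely have many loosely connected neighbors.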

21 The proposed Algorithm Seed and its neighbors. Neighbors (sum of edge weights, # of edges): P1 (2, 1), P3 (3, 1), P4 (2, 1), P5 (3, 1). [Figure: example graph with edge and node weights]

22 The proposed Algorithm Neighbors (sum of edge weights, # of edges): P3 (3, 1), P5 (3, 1), P1 (2, 1), P4 (2, 1). cp of P3 = 1 [Figure: example graph with edge and node weights]

23 The proposed Algorithm d=1.0 Neighbors (sum of edge weights, # of edges): P1 (4, 2), P4 (4, 2), P5 (6, 2), P7 (0, 1). [Figure: example graph with edge and node weights]

24 The proposed Algorithm d=1.0 Neighbors (sum of edge weights, # of edges): P5 (6, 2), P1 (4, 2), P4 (4, 2), P7 (0, 1). cp of P5 = 1 [Figure: example graph with edge and node weights]

25 The proposed Algorithm d=1.0 Neighbors (sum of edge weights, # of edges): P1 (4, 2), P4 (4, 2), P6 (0, 1), P7 (0, 1). cp of P1 = 1 [Figure: example graph with edge and node weights]

26 The proposed Algorithm d=1.0 Neighbors (sum of edge weights, # of edges): P0 (0, 1), P4 (4, 2), P6 (0, 1), P7 (0, 1). [Figure: example graph with edge and node weights]

27 The proposed Algorithm d=1.0 Neighbors (sum of edge weights, # of edges): P4 (4, 2), P0 (0, 1), P6 (0, 1), P7 (0, 1). cp of P4 = 0.75 [Figure: example graph with edge and node weights]

28 The proposed Algorithm d=0.9 Neighbors (sum of edge weights, # of edges, cp value): P0 (0, 1, ~0.22), P6 (0, 1, ~0.22), P7 (0, 1, ~0.22). [Figure: example graph with edge and node weights]

29 The proposed Algorithm The remaining graph, with its seed. [Figure: remaining graph with recomputed edge and node weights]

30 The proposed Algorithm d=1.0 [Figure: remaining graph with edge and node weights]

31 The proposed Algorithm d=1.0 [Figure: remaining graph with edge and node weights]

32 The proposed Algorithm d=1.0 [Figure: remaining graph with edge and node weights]

33 The remaining graph

34 The proposed Algorithm Clustering by the proposed algorithm

35 Results: Complexes in the E. coli PPI Network The network of E. coli proteins consists of 363 interactions involving a total of 336 proteins. Example interactions (DIP node ID and protein name of each partner):
DIP:339N GroEL, DIP:1081N PrnP
DIP:1025N CarB, DIP:1026N CarA
DIP:539N MalG, DIP:508N MalE
DIP:124N XerD, DIP:726N XerC
DIP:367N PntB, DIP:366N PntA
DIP:342N SbcC, DIP:572N Gam
----------------------------------------------
Source: http://dip.mbi.ucla.edu/

36 Results: Complexes in the E. coli PPI Network Components of RNA polymerase (RpoA, RpoB, RpoC, Rsd, RpoZ, RpoD, RpoN, FliA)

37 Results: Complexes in the E. coli PPI Network Components of ATP synthetase (AtpA, AtpB, AtpE, AtpF, AtpG, AtpH, AtpL)

38 Results: Complexes in the E. coli PPI Network Proteins involved in cell division (FtsQ, FtsI, FtsW, FtsN, FtsK and FtsL)

39 Results: Complexes in the E. coli PPI Network Components of DNA polymerase (DnaX, HolA, HolB, HolD, and HolC)

40 We extract a set of 12487 unique binary interactions involving 4648 proteins by discarding self-interactions of the PPI data obtained from ftp://ftpmips.gsf.de/yeast/PPI/. Results: Complexes in the S. cerevisiae PPI Network

41 Results: Details of a Group of Predicted Complexes Information on the complexes that are of size ≥ 6 of the set generated using din=0.7, cpin=0.50 and non-overlapping mode. We considered 15 functional classes: (1) Cell cycle and DNA processing, (2) Protein with binding function or cofactor requirement (structural or catalytic), (3) Protein fate (folding, modification, destination), (4) Biogenesis of cellular components, (5) Cellular transport, transport facilitation and transport routes, (6) Metabolism, (7) Interaction with the cellular environment, (8) Transcription, (9) Energy, (10) Cell rescue, defense and virulence, (11) Cell type differentiation, (12) Cellular communication/signal transduction mechanism, (13) Protein activity regulation, (14) Protein synthesis, and (15) Transposable elements, viral and plasmid proteins

42 Results: Hypergeometric distribution N= Total number of proteins in the network F= Number of proteins of a functional group in the network C= Number of proteins in a cluster k= Number of proteins of a functional group in a cluster The p-value of a cluster implies the probability that the proteins of the cluster have been randomly selected The lower the p-value the higher the statistical significance

43 P-value & Hypergeometric distribution Put 3 green and 4 red balls in a box and randomly choose any 3. With C(n, r) denoting the binomial coefficient: P0 (# of red balls is 0) = C(4,0)C(3,3)/C(7,3) = 1/35, P1 (# of red balls is 1) = C(4,1)C(3,2)/C(7,3) = 12/35, P2 (# of red balls is 2) = C(4,2)C(3,1)/C(7,3) = 18/35, P3 (# of red balls is 3) = C(4,3)C(3,0)/C(7,3) = 4/35. Notice that P0 + P1 + P2 + P3 = 1.

44 P-value & Hypergeometric distribution P0 (# of red balls is 0), P1 (# of red balls is 1), P2 (# of red balls is 2), P3 (# of red balls is 3). [Figure: illustration of the four outcomes]

45 P-value & Hypergeometric distribution N=7, F=4, C=3. P(# of red balls ≤ 1) = P0 + P1; P(# of red balls ≥ 2) = 1 - (P0 + P1); in general, P(# of red balls ≥ k) = 1 - (P0 + P1 + … + P_(k-1)).
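The ball example can be reproduced with exact fractions. This is a minimal sketch using the symbols N, F, C and k from the slides; the function names are assumptions made for illustration.

```python
from fractions import Fraction
from math import comb


def hypergeom_pmf(N: int, F: int, C: int, i: int) -> Fraction:
    """Probability that exactly i of the C drawn items belong to the group of size F."""
    return Fraction(comb(F, i) * comb(N - F, C - i), comb(N, C))


def p_value(N: int, F: int, C: int, k: int) -> Fraction:
    """P(# >= k) = 1 - (P0 + P1 + ... + P_(k-1))."""
    # i cannot exceed C, so the sum is capped there.
    return 1 - sum(hypergeom_pmf(N, F, C, i) for i in range(min(k, C + 1)))


# 4 red and 3 green balls (N=7, F=4), draw C=3.
probabilities = [hypergeom_pmf(7, 4, 3, i) for i in range(4)]
print(probabilities)
print(p_value(7, 4, 3, 2))  # probability of drawing at least 2 red balls
```

For a cluster, N is the number of proteins in the network, F the size of the functional group, C the cluster size and k the number of group members found in the cluster.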

46 Results: Details of a Group of Predicted Complexes Information on the complexes that are of size ≥ 6 of the set generated using din=0.7, cpin=0.50 and non-overlapping mode. Protein YDR425w of complex 19 is related to cellular transport and YIP1, YGL198w, YGL161c and GCS1 are related to vesicular transport. Hence, we predict the function-unknown protein YPL095c of this complex is a transport related protein most likely related to vesicular transport.

47 Conclusions In this work, we present an algorithm to detect locally dense regions in undirected simple graphs. The algorithm can be used to detect protein complexes in large protein-protein interaction networks or co-expressed gene clusters based on microarray data. It can also be used for protein/gene function prediction by finding complexes/clusters in networks consisting of function-known and function-unknown proteins. DPClus can also be applied to other networks where finding cohesive groups is of interest. The DPClus software is available at http://kanaya.naist.jp/DPClus/

48 Md. Altaf-Ul-Amin, Hisashi Tsuji, Ken Kurokawa, Hiroko Asahi, Yoko Shinbo, Shigehiko Kanaya, “DPClus: A Density-periphery Based Graph Clustering Software Mainly Focused on Detection of Protein Complexes in Interaction Networks”, Journal of Computer Aided Chemistry, Vol.7, 150-156, 2006. 2. The DPClus Software The DPClus software is available at http://kanaya.naist.jp/DPClus/ The DPClus software has been developed based on the proposed algorithm.

49 The main window of DPClus The DPClus Software

50 The DPClus Software The input file format
List of edges:
AtpB AtpA
AtpG AtpE
AtpA AtpH
AtpB AtpH
AtpG AtpH
AtpE AtpH
Corresponding network, as an adjacency matrix:
0 0 1 0 1
0 0 0 1 1
1 0 0 0 1
0 1 0 0 1
1 1 1 1 0
Adjacency list:
AtpA: AtpB, AtpH
AtpB: AtpA, AtpH
AtpH: AtpB, AtpA, AtpG, AtpE
AtpG: AtpH, AtpE
AtpE: AtpG

51 The DPClus Software Output file format
ClusterLength of cluster 1 is: 8
RpoA RpoB RpoC Rsd RpoZ RpoD RpoN FliA
ClusterLength of cluster 2 is: 8
AtpH AtpG AtpB AtpA AtpF AtpL AtpE AtpB(A)
ClusterLength of cluster 3 is: 5
--------------------------------------
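The three representations on this slide can be derived from one another. The sketch below assumes a whitespace-separated edge file with one pair per line (the actual separator used by DPClus is not visible in the transcript), and the row order AtpA, AtpG, AtpB, AtpE, AtpH is an assumption chosen because it reproduces the matrix shown.

```python
EDGE_TEXT = """AtpB AtpA
AtpG AtpE
AtpA AtpH
AtpB AtpH
AtpG AtpH
AtpE AtpH"""


def read_edges(text):
    """Parse one whitespace-separated pair per line (assumed format)."""
    return [tuple(line.split()) for line in text.splitlines() if line.strip()]


def adjacency_list(edges):
    """Map every node to the sorted list of its neighbors."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return {node: sorted(nbrs) for node, nbrs in adj.items()}


def adjacency_matrix(edges, order):
    """Build the symmetric 0/1 matrix with rows and columns in the given order."""
    index = {name: i for i, name in enumerate(order)}
    matrix = [[0] * len(order) for _ in order]
    for u, v in edges:
        matrix[index[u]][index[v]] = matrix[index[v]][index[u]] = 1
    return matrix


edges = read_edges(EDGE_TEXT)
adj = adjacency_list(edges)
matrix = adjacency_matrix(edges, ["AtpA", "AtpG", "AtpB", "AtpE", "AtpH"])
for row in matrix:
    print(*row)
```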

52 Intra cluster edges are green and inter cluster edges are red Nodes have been arranged by dragging The DPClus Software

53 Click Hierarchical graph of the clusters The DPClus Software

54 The DPClus Software Clustering of microarray data. Sample microarray data: to apply DPClus, we need to convert this data to a network.

55 The DPClus Software Microarray data (Experiment ID, Genes); Gene-Gene correlation; Select highly correlated gene pairs; Edges of a Network:
At3g10060 At3g54150
At3g10060 At3g63140
At3g10060 At5g07020
---------------------------

56 The DPClus Software # of experiments: 626; threshold correlation: 0.95; cp value: 0.5; density value: 0.9; minimum cluster size: 3
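The conversion from expression profiles to a network can be sketched as follows. The slides say only "correlation"; using the Pearson coefficient is an assumption, and the gene names and values below are made up for illustration.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # assumption: a constant profile correlates with nothing
    return cov / sqrt(var_x * var_y)


def correlation_edges(profiles, threshold=0.95):
    """Keep gene pairs whose correlation reaches the threshold."""
    return [(g1, g2) for g1, g2 in combinations(sorted(profiles), 2)
            if pearson(profiles[g1], profiles[g2]) >= threshold]


# Hypothetical profiles over four experiments.
profiles = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [2.0, 4.0, 6.0, 8.1],
    "geneC": [4.0, 3.0, 2.0, 1.0],
}
network = correlation_edges(profiles, threshold=0.95)
print(network)
```

The resulting pairs form an edge list in the same two-column shape as the input shown on the earlier slide.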

57 Ribosomal protein clusters Electron transport clusters Photosynthesis clusters The DPClus Software

58 Partitioning a PPI Network into Overlapping Modules Constrained by High-Density and Periphery Tracking Md. Altaf-Ul-Amin, Masayoshi Wada and Shigehiko Kanaya Volume 2012 (2012), Article ID 726429, ISRN Biomathematics The DPClusO Algorithm

59 DPClusO has been developed with concepts similar to DPClus, but DPClusO is more general and advantageous. It ensures that: each node goes to at least one cluster; no two clusters are completely the same; the density of each cluster is greater than or equal to the user-given density; clusters are constrained by periphery, if that exists. Major differences with DPClus: each node goes to at least one cluster as big as possible; memory efficient; faster computation.

60 Example showing the difference in clustering by DPClus and DPClusO. In both cases clustering was done using din = 0.6 and cpin = 0.5. [Figure: the same graph on nodes A to T, clustered by DPClus and by DPClusO]

61 Evaluation of DPClusO

62 Measures used for Evaluation Overlapping score: how two clusters match with each other. How a set of predicted clusters matches a set of known clusters. How rich a cluster is with similar-function proteins.

63 Plot of the number of clusters generated by DPClusO with respect to maximum overlapping. OVmax=0 means all modules are completely non-overlapping. For other points OVmax indicates the maximum overlapping score between any two modules. DPClusO generated clusters are not too overlapping

64 Plots showing how many and to what extent the known protein complexes (all complexes and size 3 or more complexes shown separately) of yeast matched with modules predicted by DPClusO, COACH and CORE corresponding to five different datasets. DPClusO detected more known protein complexes

65 Variation of F-measure with maximum overlapping score (used as a filtering parameter) for modules of size 3 or more generated by DPClusO, COACH and CORE. The marked horizontal lines indicate F-measures for three algorithms in case of no filtering. By adding simple filtering DPClusO achieved the best F-measure

66 Verifying robustness of DPClusO by comparing generated modules from real and randomly altered PPI networks in the context of matching with known complexes. (a) & (b) In case of addition, removal and rearrangement of 5% edges in the context of all and size 3 or more known complexes respectively. (c) & (d) In case of addition, removal and rearrangement of 10% edges in the context of all and size 3 or more known complexes respectively. DPClusO is a robust algorithm [Figure: panels (a) for 5% changes, (b) for 5% changes (S3), (c) for 10% changes, (d) for 10% changes (S3); x-axis: OV from 0.0 to 1.0; y-axis: # of matched clusters; series: Original, Add, Remove, Rearrange]

67 Comparison between the distributions of the high density modules and randomly selected protein groups with respect to –log(p-value) in the contexts of three types of gene ontology terms: (a), (b) biological process (BP), (c), (d) cellular compartment (CC), (e), (f) molecular function (MF). DPClusO detected modules are rich with similar function proteins

68 Comparison between the distributions of the star and star-like modules and randomly selected protein groups with respect to –log(p-value) in the contexts of three types of gene ontology terms: (a), (b) biological process (BP), (c), (d) cellular compartment (CC), (e), (f) molecular function (MF). Also, as a consequence of DPClusO clustering, it was learnt that a PPI network is a combination of mainly high density and star-like modules.

69 DPClusO is a network clustering algorithm Easily we can convert multivariate data into networks and apply DPClusO for clustering DPClusO is freely available at: http://kanaya.naist.jp/DPClusO

70 Definition of a bicluster Given an n×p data matrix X, where n is the number of objects (e.g. genes) and p is the number of conditions (e.g. arrays), a bicluster is defined as a submatrix X_IJ of X within which a subset of objects I exhibits similar behavior across the subset of conditions J. An n×p data matrix X can be converted to a bipartite graph by applying a threshold. Finding biclusters (densely connected regions) in a bipartite graph is a similar problem.

71 A graph G=(V,E) is bipartite if its vertex set V can be partitioned into two subsets V1 and V2 such that each edge of E has one end vertex in V1 and the other in V2. [Figure: vertex sets V1 and V2]

72 Biclusters are densely connected regions in a bipartite graph. Edges of the example bipartite graph (uppercase node, lowercase node): Cd Aa Gg If Kk Dc Ab Gh Ig Li Dd Ba He Ih Lj Ec Bb Hf Jf Lk Ed Ca Hg Jg Ml Fc Cb Hh Kh Mm Fd Da Ie Gf Nl Gd Db Ki Cc Nm Kj
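Whether a graph admits such a partition can be tested by two-coloring it with a breadth-first search. This is a standard technique, not something taken from the slides; the function name and the small example graphs are assumptions.

```python
from collections import deque


def bipartition(adj):
    """Return (V1, V2) if the undirected graph is bipartite, else None.

    adj maps each node to an iterable of its neighbors (both directions listed).
    """
    color = {}
    for start in adj:
        if start in color:
            continue
        color[start] = 0
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for v in adj.get(u, ()):
                if v not in color:
                    color[v] = 1 - color[u]  # put v on the opposite side
                    queue.append(v)
                elif color[v] == color[u]:
                    return None  # an edge inside one side: not bipartite
    v1 = {n for n, c in color.items() if c == 0}
    v2 = {n for n, c in color.items() if c == 1}
    return v1, v2


genes_conditions = {"A": ["a", "b"], "B": ["a"], "a": ["A", "B"], "b": ["A"]}
triangle = {"x": ["y", "z"], "y": ["x", "z"], "z": ["x", "y"]}
print(bipartition(genes_conditions))
print(bipartition(triangle))
```

A gene-condition graph built from an expression matrix is bipartite by construction, since edges only ever join a gene to a condition.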

73 Gene expression data can be represented as bipartite graphs. Before transforming, the data can be normalized.
gene/cond.  cond0  cond1  cond2  cond3  cond4
YAL005C     2.85   3.34   0      0      0
YAL012W     0.21   0.03   0.18   -0.27  -0.32
YAL014C     -0.03  -0.07  0.28   0.32   -0.27
YAL015C     -0.25  0.58   0.77   0.28   0.32
YAL016W     0.11   0.04   0.75   0.82   0.21
YAL017W     0.24   0.31   0.95   0.12   0.18
YAL021C     -0.3   0.22   0.02   -0.64  0.06
By transforming the highest 5% of values to 1:
gene/cond.  cond0  cond1  cond2  cond3  cond4
YAL005C     1      1      0      0      0
YAL012W     0      0      0      0      0
YAL014C     0      0      0      0      0
YAL015C     0      0      0      0      0
YAL016W     0      0      0      1      0
YAL017W     0      0      1      0      0
YAL021C     0      0      0      0      0
Biclusters in gene expression data represent transcription modules/co-expressed gene groups.
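The thresholding step can be sketched as below. How the 5% cutoff is computed (over the whole matrix, ties included) is an assumption, and the toy values are hypothetical; the seven rows shown on the slide are presumably an excerpt of a larger matrix, so they are not expected to reproduce its cutoff on their own.

```python
import math


def binarize_top_fraction(rows, fraction=0.05):
    """Set the highest `fraction` of all values to 1 and the rest to 0."""
    values = sorted((v for row in rows.values() for v in row), reverse=True)
    keep = max(1, math.ceil(fraction * len(values)))
    cutoff = values[keep - 1]  # smallest value that still counts as "high"
    return {gene: [1 if v >= cutoff else 0 for v in row]
            for gene, row in rows.items()}


def bipartite_edges(binary, conditions):
    """List (gene, condition) edges for every 1 in the binary matrix."""
    return [(gene, conditions[j])
            for gene, row in binary.items()
            for j, bit in enumerate(row) if bit]


# Hypothetical data: 2 genes x 5 conditions, keep the top 20% of values.
conditions = ["cond0", "cond1", "cond2", "cond3", "cond4"]
data = {"g1": [0.9, 0.1, 0.2, 0.3, 0.4], "g2": [0.0, 0.8, 0.1, 0.2, 0.3]}
binary = binarize_top_fraction(data, fraction=0.2)
edges = bipartite_edges(binary, conditions)
print(binary, edges)
```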

74 Tanay,A. et al. (2002) Discovering statistically significant biclusters in gene expression data. Bioinformatics, 18 (Suppl. 1), S136–S144. Ihmels,J. et al. (2002) Revealing modular organization in the yeast transcriptional network. Nat. Genet., 31, 370–377. Ben-Dor,A., Chor,B., Karp,R. and Yakhini,Z. (2002) Discovering local structure in gene expression data: the order-preserving sub-matrix problem. In Proceedings of the 6th Annual International Conference on Computational Biology, ACM Press, New York, NY, USA, pp. 49–57. Cheng,Y. and Church,G. (2000) Biclustering of expression data. Proc. Int. Conf. Intell. Syst. Mol. Biol. pp. 93–103. Murali,T.M. and Kasif,S. (2003) Extracting conserved gene expression motifs from gene expression data. Pac. Symp. Biocomput., 8, 77–88.

75 We propose a biclustering method incorporating DPClus. An example bipartite graph and its corresponding matrix (for i ≠ k):
G/E a b c d e f g h i j k l m
A   1 1 0 0 0 0 0 0 0 0 0 0 0
B   1 1 0 0 0 0 0 0 0 0 0 0 0
C   1 1 1 1 0 0 0 0 0 0 0 0 0
D   1 1 1 1 0 0 0 0 0 0 0 0 0
E   0 0 1 1 0 0 0 0 0 0 0 0 0
F   0 0 1 1 0 0 0 0 0 0 0 0 0
G   0 0 0 1 1 1 1 0 0 0 0 0 0
H   0 0 0 0 1 1 1 1 0 0 0 0 0
I   0 0 0 0 1 1 1 1 0 0 0 0 0
J   0 0 0 0 1 1 0 0 0 0 0 0 0
K   0 0 0 0 0 0 0 1 1 1 1 0 0
L   0 0 0 0 0 0 0 0 1 1 1 0 0
M   0 0 0 0 0 0 0 0 0 0 0 1 1
N   0 0 0 0 0 0 0 0 0 0 0 1 1

76 BiClus: Biclustering method incorporating DPClus Concerning each row i (i = 0 to |G|-1) of M_CN, we calculate threshold_i = avg_i + (max_i - avg_i) × G_margin and set (M_SG)_ik = (M_SG)_ki = 1 if (M_CN)_ik ≥ threshold_i and threshold_i is not an indeterminate number (for k = 0 to |G|-1). Here, avg_i = SUM_i / n_i, where n_i is the number of non-zero entries in row i of M_CN and max_i is the maximum value of the entries in row i of M_CN. G_margin is a user-defined value ≤ 1. Common neighbor matrix of the bipartite graph:
    A B C D E F G H I J K L M N
A   0 2 2 2 0 0 0 0 0 0 0 0 0 0
B   2 0 2 2 0 0 0 0 0 0 0 0 0 0
C   2 2 0 4 2 2 1 0 0 0 0 0 0 0
D   2 2 4 0 2 2 1 0 0 0 0 0 0 0
E   0 0 2 2 0 2 1 0 0 0 0 0 0 0
F   0 0 2 2 2 0 1 0 0 0 0 0 0 0
G   0 0 1 1 1 1 0 3 3 2 0 0 0 0
H   0 0 0 0 0 0 3 0 4 2 1 0 0 0
I   0 0 0 0 0 0 3 4 0 2 1 0 0 0
J   0 0 0 0 0 0 2 2 2 0 0 0 0 0
K   0 0 0 0 0 0 0 1 1 0 0 3 0 0
L   0 0 0 0 0 0 0 0 0 0 3 0 0 0
M   0 0 0 0 0 0 0 0 0 0 0 0 0 2
N   0 0 0 0 0 0 0 0 0 0 0 0 2 0

77 BiClus: Biclustering method incorporating DPClus This matrix represents a simple graph:
    A B C D E F G H I J K L M N
A   0 1 1 1 0 0 0 0 0 0 0 0 0 0
B   1 0 1 1 0 0 0 0 0 0 0 0 0 0
C   1 1 0 1 1 1 0 0 0 0 0 0 0 0
D   1 1 1 0 1 1 0 0 0 0 0 0 0 0
E   0 0 1 1 0 1 1 0 0 0 0 0 0 0
F   0 0 1 1 1 0 0 0 0 0 0 0 0 0
G   0 0 0 0 1 0 0 1 1 1 0 0 0 0
H   0 0 0 0 0 0 1 0 1 1 0 0 0 0
I   0 0 0 0 0 0 1 1 0 1 0 0 0 0
J   0 0 0 0 0 0 1 1 1 0 0 0 0 0
K   0 0 0 0 0 0 0 0 0 0 0 1 0 0
L   0 0 0 0 0 0 0 0 0 0 1 0 0 0
M   0 0 0 0 0 0 0 0 0 0 0 0 0 1
N   0 0 0 0 0 0 0 0 0 0 1 0 0 0
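The row-wise thresholding can be sketched as follows, reading the garbled operators on the slide as ×, ≥ and ≤ (an assumption). The four genes and their condition sets are taken from the example bipartite matrix; the function names and the small floating-point tolerance are assumptions added for illustration.

```python
def common_neighbor_matrix(gene_conditions, genes):
    """(M_CN)_ik = number of conditions shared by genes i and k (0 on the diagonal)."""
    return [[0 if g == h else len(gene_conditions[g] & gene_conditions[h])
             for h in genes] for g in genes]


def simple_graph(m_cn, g_margin=1.0):
    """Apply threshold_i = avg_i + (max_i - avg_i) * G_margin row by row."""
    n = len(m_cn)
    m_sg = [[0] * n for _ in range(n)]
    for i, row in enumerate(m_cn):
        nonzero = [v for v in row if v != 0]
        if not nonzero:
            continue  # avg_i would be 0/0, an indeterminate threshold
        avg = sum(nonzero) / len(nonzero)
        threshold = avg + (max(row) - avg) * g_margin
        for k, value in enumerate(row):
            # Small tolerance guards against floating-point rounding.
            if k != i and value >= threshold - 1e-9:
                m_sg[i][k] = m_sg[k][i] = 1
    return m_sg


# Genes A, B, C and G with their conditions from the example matrix.
gene_conditions = {"A": set("ab"), "B": set("ab"),
                   "C": set("abcd"), "G": set("defg")}
genes = ["A", "B", "C", "G"]
m_cn = common_neighbor_matrix(gene_conditions, genes)
m_sg = simple_graph(m_cn, g_margin=1.0)
print(m_cn)
print(m_sg)
```

Because the assignment is symmetric, an edge survives if it passes the threshold of either endpoint, so a weakly connected gene such as G in this subset can still keep its strongest link.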

78 BiClus:Biclustering method incorporating DPClus Simple graph derived from the common neighbor matrix. We can use DPClus to find clusters in the simple graph.

79 BiClus:Biclustering method incorporating DPClus Clustering by DPClus

80 BiClus:Biclustering method incorporating DPClus Clustering by DPClus

81 BiClus:Biclustering method incorporating DPClus Finally determined biclusters

82 Evaluation of BiClus -Using Synthetic data -Using real data

83 Synthetic data Artificially embedded biclusters with noise Evaluation of BiClus

84 Synthetic data Artificially embedded biclusters with overlap Evaluation of BiClus

85 Evaluation of BiClus Let M1, M2 be two sets of biclusters. The gene match score of M1 with respect to M2 is given by the function [formula not preserved in the transcript]. Source: Amela Prelić, Stefan Bleuler, Philip Zimmermann, Anja Wille, Peter Bühlmann, Wilhelm Gruissem, Lars Hennig, Lothar Thiele and Eckart Zitzler, "A systematic comparison and evaluation of biclustering methods for gene expression data", Bioinformatics, Vol. 22 no. 9, 2006, pages 1122–1129

86 Evaluation of BiClus Synthetic data Artificially embedded biclusters with noise

87 Evaluation of BiClus Synthetic data Artificially embedded biclusters with overlap

88 Gasch,A.P. et al. (2000) Genomic expression programs in the response of yeast cells to environmental changes. Mol. Biol. Cell, 11, 4241–4257. Gene expression data collected from the above work

89 Gene expression data can be represented as bipartite graphs. Before transforming, the data can be normalized.
gene/cond.  cond0  cond1  cond2  cond3  cond4
YAL005C     2.85   3.34   0      0      0
YAL012W     0.21   0.03   0.18   -0.27  -0.32
YAL014C     -0.03  -0.07  0.28   0.32   -0.27
YAL015C     -0.25  0.58   0.77   0.28   0.32
YAL016W     0.11   0.04   0.75   0.82   0.21
YAL017W     0.24   0.31   0.95   0.12   0.18
YAL021C     -0.3   0.22   0.02   -0.64  0.06
By transforming the highest 5% of values to 1:
gene/cond.  cond0  cond1  cond2  cond3  cond4
YAL005C     1      1      0      0      0
YAL012W     0      0      0      0      0
YAL014C     0      0      0      0      0
YAL015C     0      0      0      0      0
YAL016W     0      0      0      1      0
YAL017W     0      0      1      0      0
YAL021C     0      0      0      0      0
Biclusters in gene expression data represent transcription modules.

90 Evaluation of BiClus Real gene expression data of yeast. P-values shown in the figure: 0.001, 0.01, 0.003, 0.002. P-values represent the statistical significance of the functional richness of the modules. P-values were calculated using FuncAssociate: The Gene Set Functionator from http://llama.med.harvard.edu/cgi/func/funcassociate

91 Application of network concepts in DNA sequencing

92 Sequencing by hybridization (SBH) Given an unknown DNA sequence, an array provides information about all strings of length l that the sequence contains. Input: A spectrum S representing all l-mers from an unknown string s. Output: The string s such that spectrum(s,l) = S. Example: s = TATGGTGC. Orderly placed: S(s,l) = {TAT, ATG, TGG, GGT, GTG, TGC}. Randomly placed: S(s,l) = {GTG, ATG, TGG, TAT, GGT, TGC}.
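The spectrum of a known string is easy to generate, which helps when testing a reconstruction method. A minimal sketch; the function name `spectrum` mirrors the slide's notation, and returning a set (so that repeated l-mers collapse) is an assumption.

```python
def spectrum(s: str, l: int) -> set:
    """Return the set of all l-mers (substrings of length l) of s."""
    return {s[i:i + l] for i in range(len(s) - l + 1)}


S = spectrum("TATGGTGC", 3)
print(sorted(S))
```

Since a set is unordered, the "orderly placed" and "randomly placed" listings on the slide describe the same spectrum.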

93 Input: A spectrum S representing all l-mers from an unknown string s Output: The string s such that spectrum (s,l) = S. The reduction of the SBH problem to an Eulerian path problem is to construct a graph whose edges correspond to l-mers from spectrum(s,l) and then to find a path in this graph visiting every edge exactly once. Sequencing by hybridization (SBH)

94 The reduction of the SBH problem to an Eulerian path problem is to construct a graph whose nodes correspond to (l-1)-mers and edges correspond to l-mers from spectrum(s,l) and then to find a path in this graph visiting every edge exactly once. S(s,l)={GTG, ATG, TGG, TAT, GGT, TGC} (l-1)-mers: GT, TG, AT, TG, TG, GG, TA, AT, GG, GT, TG, GC (l-1)-mers(redundancy removed): GT, TG, AT, GG, TA, GC GT AT GG TA GC TG s=TATGGTGC Sequencing by hybridization (SBH)
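The reduction can be sketched end to end: build the graph on (l-1)-mers and walk an Eulerian path. The slides do not name a path-finding method; Hierholzer's algorithm is used here as one standard choice, and the function name `reconstruct` is an assumption. The sketch assumes every l-mer occurs once and that an Eulerian path exists.

```python
from collections import defaultdict


def reconstruct(spectrum):
    """Spell a string whose l-mers are exactly the given spectrum."""
    graph = defaultdict(list)
    balance = defaultdict(int)  # outdegree - indegree per node
    for lmer in sorted(spectrum):
        prefix, suffix = lmer[:-1], lmer[1:]
        graph[prefix].append(suffix)  # one edge per l-mer
        balance[prefix] += 1
        balance[suffix] -= 1
    nodes = sorted(balance)
    # Start at the node with one more outgoing than incoming edge, if any.
    start = next((n for n in nodes if balance[n] == 1), nodes[0])

    # Hierholzer's algorithm with an explicit stack.
    stack, path = [start], []
    while stack:
        node = stack[-1]
        if graph[node]:
            stack.append(graph[node].pop())
        else:
            path.append(stack.pop())
    path.reverse()
    if len(path) != len(spectrum) + 1:
        raise ValueError("no Eulerian path covers every l-mer")
    return path[0] + "".join(node[-1] for node in path[1:])


print(reconstruct({"GTG", "ATG", "TGG", "TAT", "GGT", "TGC"}))
```

Each node on the path after the first contributes one new character, so a path over m edges spells a string of length m + l - 1.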

95 Sequencing by hybridization (SBH) A path in a graph visiting every edge exactly once is called an Eulerian (pronounced "Oilerian") path. A connected graph has an Eulerian path if and only if it contains at most two semibalanced nodes and all other nodes are balanced. Balanced node: indegree = outdegree. Semibalanced node: |indegree - outdegree| = 1. [Figure: the graph on nodes GT, AT, GG, TA, GC, TG with its semibalanced nodes marked]
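The degree condition can be checked before attempting a reconstruction. This sketch tests only the balanced/semibalanced counts and assumes the graph is connected; the function names are assumptions.

```python
from collections import Counter


def degree_difference(spectrum):
    """Return indegree - outdegree for every (l-1)-mer node."""
    diff = Counter()
    for lmer in spectrum:
        diff[lmer[:-1]] -= 1  # the prefix node gains an outgoing edge
        diff[lmer[1:]] += 1   # the suffix node gains an incoming edge
    return dict(diff)


def satisfies_degree_condition(spectrum):
    """At most two semibalanced nodes, all other nodes balanced."""
    diffs = degree_difference(spectrum).values()
    semibalanced = sum(1 for d in diffs if abs(d) == 1)
    unbalanced = sum(1 for d in diffs if abs(d) > 1)
    return unbalanced == 0 and semibalanced <= 2


example = {"TAT", "ATG", "TGG", "GGT", "GTG", "TGC"}
print(degree_difference(example))
print(satisfies_degree_condition(example))
```

In the TATGGTGC example the two semibalanced nodes are the start and end of the path, TA and GC.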

96 S(s,l)={ATG, TGG, TGC, GTG, GGC, GCA, GCG, CGT} (l-1)-mers:AT, TG, TG, GG, TG, GC, GT, TG, GG, GC, GC, CA, GC, CG, CG, GT (l-1)-mers(redundancy removed):AT, TG, GG, GC, GT, CA, CG GG AT GC TG GT CA CG ATGGCGTGCA Sequencing by hybridization (SBH) Another example

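97 Sequencing by hybridization (SBH) The same spectrum admits a second Eulerian path, so the reconstruction is not always unique. S(s,l)={ATG, TGG, TGC, GTG, GGC, GCA, GCG, CGT} (l-1)-mers: AT, TG, TG, GG, TG, GC, GT, TG, GG, GC, GC, CA, GC, CG, CG, GT (l-1)-mers (redundancy removed): AT, TG, GG, GC, GT, CA, CG. Resulting string: ATGCGTGGCA [Figure: the graph on nodes GG, AT, GC, TG, GT, CA, CG]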

