Mining Coherent Dense Subgraphs across Multiple Biological Networks Vahid Mirjalili CSE 891
Motivation
Finding patterns across multiple networks helps identify biological modules and predict function
Current algorithms are too costly
A novel algorithm was developed: CODENSE
  – Scalable in the number and size of networks
  – Adjustable for exact or approximate pattern mining
Clustering can detect meaningful biological modules
  – e.g. a dense protein interaction sub-network may correspond to a protein complex
  – A dense co-expression sub-network may represent a co-expression cluster
Biological modules are expected to be active across multiple conditions
One idea: aggregate all the networks and identify dense sub-graphs in the aggregated network
  – Risk of false-positive detection
Aggregated graph: false positives in the aggregated graph
Adding six graphs together and deleting the edges that occur fewer than 3 times yields the summary graph
A sub-graph that is dense in the summary graph need not be dense in any of the original graphs
Solution to the false-positive summary graph: frequent sub-graphs
Mine the dense sub-graphs directly in each original network
A sub-graph is frequent if it occurs multiple times in a set of graphs
In biological networks, each gene occurs only once in a graph, so there is no sub-graph isomorphism problem
Frequent dense sub-graph
A frequent dense sub-graph can still give an inaccurate picture
  – Some edges of the frequent sub-graph in the figure do not occur together in the original graphs
  – It is more meaningful to divide it into two sub-graphs
Coherent Dense Sub-graphs
All edges in a coherent sub-graph should have correlated occurrences in the original graph set
CODENSE condenses the networks into 2 meta-graphs and performs clustering on these two graphs only (instead of on the individual networks)
  – CODENSE can distinguish the two modules
  – Good scalability
  – Discovery of overlapping clusters
Overlapping Sub-graphs
Partition-based clustering algorithms fail to identify overlapping sub-graphs
Mining Overlapping Dense Sub-graphs (MODES)
Application
Identify frequent co-expression clusters across multiple microarray datasets
Each microarray dataset is modeled as a graph:
  – Un-weighted, undirected
  – Each node represents a gene
  – Two genes are connected by an edge if they show high expression correlation
A densely connected sub-graph corresponds to a tight co-expression cluster
Clusters from a single microarray dataset include spurious links and may not be homogeneous in function and regulation
Problem Formulation
A relation graph dataset contains n simple graphs, $D = \{G_1, G_2, \ldots, G_n\}$ with $G_i = (V, E_i)$
  – A common vertex set V is shared by the graphs
Support(G): the number of graphs in the relation graph dataset D that contain G as a sub-graph
A graph G is frequent if support(G) > threshold
Summary graph: an un-weighted graph extracted from D, in which an edge exists only if it occurs in more than k graphs in D
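The support and summary-graph definitions above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes each graph is stored as a Python set of undirected edges, and the helper names `edge`, `support` and `summary_graph` are hypothetical.

```python
from collections import Counter

def edge(u, v):
    """Canonical form of an undirected edge, so (a, b) and (b, a) compare equal."""
    return (u, v) if u <= v else (v, u)

def support(subgraph_edges, graphs):
    """Number of graphs in the dataset that contain every edge of the sub-graph."""
    return sum(1 for g in graphs if subgraph_edges <= g)

def summary_graph(graphs, k):
    """Un-weighted summary graph: keep an edge only if it occurs in more than k graphs."""
    counts = Counter(e for g in graphs for e in g)
    return {e for e, c in counts.items() if c > k}

# Toy relation graph dataset with three graphs over the vertices a..d.
D = [
    {edge("a", "b"), edge("b", "c")},
    {edge("b", "a")},
    {edge("a", "b"), edge("c", "d")},
]
print(sorted(summary_graph(D, k=1)))                  # [('a', 'b')]
print(support({edge("a", "b"), edge("b", "c")}, D))   # 1
```

Because all graphs share the vertex set V and every gene appears once, checking containment reduces to a set-inclusion test on edges; no isomorphism search is needed.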
Problem Formulation
Edge Support Vector: $w(e) = (w_1(e), w_2(e), \ldots, w_n(e))$, where $w_i(e)$ is the weight of edge e in graph i (for an un-weighted graph it is 0 or 1)
Second-Order Graph S: each node represents an edge from the relation graph dataset D, and an edge between nodes u and v exists if the edge support vectors w(u) and w(v) are highly correlated
For efficiency, S is constructed only for a sub-graph of the summary graph
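A small sketch of how edge support vectors lead to a second-order graph. It assumes un-weighted graphs stored as sets of edge tuples and uses Euclidean distance between support vectors as the similarity measure; the cutoff `max_dist` and the function names are illustrative assumptions, not values or names from the source.

```python
from itertools import combinations
from math import sqrt

def support_vector(e, graphs):
    """0/1 vector: entry i is 1 if edge e occurs in graph i."""
    return [1 if e in g else 0 for g in graphs]

def euclidean(x, y):
    """Euclidean distance between two equal-length vectors."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def second_order_graph(edges, graphs, max_dist):
    """Connect two first-order edges when their support vectors are close.

    Each node of the result is an edge of the original graphs; `max_dist`
    is a hypothetical similarity cutoff chosen by the caller.
    """
    vectors = {e: support_vector(e, graphs) for e in edges}
    return {
        (e1, e2)
        for e1, e2 in combinations(sorted(edges), 2)
        if euclidean(vectors[e1], vectors[e2]) <= max_dist
    }

# Edges ab and bc always occur together, as do cd and de, but never all four.
graphs = [
    {("a", "b"), ("b", "c")},
    {("a", "b"), ("b", "c")},
    {("c", "d"), ("d", "e")},
    {("c", "d"), ("d", "e")},
]
edges = {("a", "b"), ("b", "c"), ("c", "d"), ("d", "e")}
print(sorted(second_order_graph(edges, graphs, max_dist=1.0)))
# [(('a', 'b'), ('b', 'c')), (('c', 'd'), ('d', 'e'))]
```

In this toy case the second-order graph falls apart into two components, which is how edges with uncorrelated occurrences end up in different modules.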
Coherent Graph: a sub-graph extracted from the summary graph is coherent if
  – All its edges have support > k
  – Its second-order graph is dense
Graph Density: $\text{density} = \frac{2m}{n(n-1)}$, where m is the number of edges and n is the number of nodes
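As a quick worked example of graph density, the sketch below assumes the usual definition for a simple undirected graph, 2m / (n(n-1)), with the node set inferred from the edge list. The function name `density` is illustrative.

```python
def density(edges):
    """Density 2m / (n(n-1)) of a simple undirected graph given as a set of edges.

    Nodes are inferred from the edges, so isolated vertices are not counted.
    """
    nodes = {v for e in edges for v in e}
    n, m = len(nodes), len(edges)
    if n < 2:
        return 0.0
    return 2 * m / (n * (n - 1))

triangle = {("a", "b"), ("b", "c"), ("a", "c")}
path = {("a", "b"), ("b", "c")}
print(density(triangle))        # 1.0
print(round(density(path), 3))  # 0.667
```

The same function applies to a second-order graph, since its nodes are simply the edge tuples of the original graphs.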
Two facts:
1. If a frequent sub-graph is dense, then it must be dense in the summary graph as well, but the converse does not always hold
2. If a sub-graph is coherent (its edges have highly correlated occurrences across the dataset), then its second-order graph is dense
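The first fact follows directly from the definitions. The short derivation below is a sketch under two assumptions that the slide does not state explicitly: the frequency threshold equals the summary-graph cutoff k, and density is 2m / (n(n-1)). The symbols $\hat{G}$, $\hat{E}$ and $\hat{m}$ are introduced here for illustration.

```latex
% Let g be a sub-graph with n vertices and m edges,
% and let \hat{G} = (V, \hat{E}) be the summary graph.
\begin{align*}
\mathrm{support}(g) > k
  &\;\Rightarrow\; \text{every } e \in E(g) \text{ occurs in more than } k \text{ graphs of } D \\
  &\;\Rightarrow\; E(g) \subseteq \hat{E} \\
  &\;\Rightarrow\; \mathrm{density}_{\hat{G}}\bigl(V(g)\bigr)
     = \frac{2\hat{m}}{n(n-1)} \;\ge\; \frac{2m}{n(n-1)} = \mathrm{density}(g)
\end{align*}
% Here \hat{m} \ge m is the number of summary-graph edges among the n vertices of g.
```

The converse fails because the edges of a dense sub-graph of the summary graph may each be frequent in a different subset of the graphs, so no single original graph has to contain them all.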
Aggregate the graphs into a summary graph and eliminate infrequent edges
MODES: Mining Overlapping DEnse Subgraphs
Developed based on HCS (Highly Connected Sub-graphs)
Can efficiently identify dense sub-graphs
Can mine overlapping sub-graphs
Two partitioning approaches:
  – Minimum cut
  – Normalized cut (Shi, Malik 2000)
Apply the normalized cut in the initial steps of the HCS algorithm; once the partitions are small, proceed with the minimum cut
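To make the minimum-cut branch concrete, here is a simplified HCS-style sketch: it recursively splits a graph along a global minimum cut (Stoer-Wagner) until each part reaches a density cutoff. It is not MODES itself; the normalized-cut stage and the recovery of overlapping sub-graphs are omitted, and the names `stoer_wagner`, `split_until_dense`, `min_density` and `min_size` are assumptions made for illustration.

```python
def stoer_wagner(nodes, edges):
    """Global minimum cut of an undirected graph: returns (cut_weight, one_side)."""
    w = {u: {} for u in nodes}            # w[u][v]: edges between merged groups
    for u, v in edges:
        w[u][v] = w[u].get(v, 0) + 1
        w[v][u] = w[v].get(u, 0) + 1
    groups = {u: {u} for u in nodes}      # original vertices inside each merged node
    best = (float("inf"), set())
    while len(w) > 1:
        start = next(iter(w))
        added = [start]
        attach = {u: w[start].get(u, 0) for u in w if u != start}
        while attach:                     # add the most tightly attached node next
            nxt = max(attach, key=attach.get)
            cut_of_phase = attach.pop(nxt)
            added.append(nxt)
            for u, wt in w[nxt].items():
                if u in attach:
                    attach[u] += wt
        s, t = added[-2], added[-1]
        if cut_of_phase < best[0]:
            best = (cut_of_phase, set(groups[t]))
        for u, wt in w[t].items():        # merge t into s
            if u != s:
                w[s][u] = w[s].get(u, 0) + wt
                w[u][s] = w[u].get(s, 0) + wt
            del w[u][t]
        del w[t]
        groups[s] |= groups.pop(t)
    return best

def split_until_dense(nodes, edges, min_density, min_size=3):
    """Recursively bisect along minimum cuts; keep parts with density >= min_density."""
    nodes = set(nodes)
    edges = {(u, v) for u, v in edges if u in nodes and v in nodes}
    n = len(nodes)
    if n < max(min_size, 2):
        return []
    if 2 * len(edges) / (n * (n - 1)) >= min_density:
        return [nodes]
    _, side = stoer_wagner(nodes, edges)
    return (split_until_dense(side, edges, min_density, min_size)
            + split_until_dense(nodes - side, edges, min_density, min_size))

# Two triangles joined by a single bridge edge c-d.
E = {("a", "b"), ("b", "c"), ("a", "c"), ("d", "e"), ("e", "f"), ("d", "f"), ("c", "d")}
parts = split_until_dense("abcdef", E, min_density=0.9)
print(sorted(sorted(p) for p in parts))  # [['a', 'b', 'c'], ['d', 'e', 'f']]
```

A pure partitioning scheme like this assigns each vertex to at most one part, which is exactly the limitation that motivates mining overlapping sub-graphs.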
CODENSE analysis
Simplifies the identification of coherent dense sub-graphs across n graphs into mining two special graphs: the summary graph and the second-order graph
Can mine network modules
Can mine both exact and approximate patterns (by modifying the similarity threshold)
Can be extended to weighted graphs (using Pearson correlation instead of Euclidean distance)
Experimental Study: co-expression network
39 yeast microarray datasets
6661 genes
Calculate the Pearson correlation (r) between the expression levels of each pair of genes
Construct the relation graph (the connectivity of two genes is determined by the Pearson correlation)
n: number of measurements used to compute r
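One way to turn an expression matrix into an un-weighted co-expression graph is sketched below. It assumes a plain cutoff on the Pearson correlation r; the cutoff `min_r`, the toy expression profiles and the function names are illustrative, and the source does not state which threshold or significance transform was actually applied.

```python
from itertools import combinations
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length profiles (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / sqrt(vx * vy)

def coexpression_graph(expr, min_r):
    """Edge between two genes when their correlation is at least min_r.

    `expr` maps a gene name to its list of n measurements; `min_r` is a
    hypothetical cutoff.
    """
    return {
        (g1, g2)
        for g1, g2 in combinations(sorted(expr), 2)
        if pearson(expr[g1], expr[g2]) >= min_r
    }

# Toy dataset: four genes, four measurements each.
expr = {
    "g1": [1, 2, 3, 4],
    "g2": [2, 4, 6, 8],
    "g3": [4, 3, 2, 1],
    "g4": [1, 3, 2, 4],
}
print(sorted(coexpression_graph(expr, min_r=0.9)))  # [('g1', 'g2')]
```

Running this once per microarray dataset would give one graph per dataset over the shared gene set, which is the input format the relation graph dataset requires.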
1. Create the summary graph, removing edges that occur fewer than 6 times across the 39 graphs
2. Apply MODES to identify dense sub-graphs sub( ) with cutoff density d1
3. For each sub( ), construct the second-order graph S
4. Apply MODES to S to identify sub-graphs with density > d2
5. Convert the vertices of these second-order sub-graphs back into edges, and apply MODES again to identify the dense sub-graphs with density > d3
Functional Module Discovery: MODES vs CODENSE
A cluster is considered functionally homogeneous if:
1. Its functional homogeneity, modeled by a hypergeometric distribution, is significant at α=
2. At least 40% of its member genes belong to a specific GO functional category
MODES identified 366 clusters, but only 151 of them were functionally homogeneous (42%)
CODENSE identified 770 clusters, 76% of which were homogeneous
The improvement is due to the second-order graph, which eliminates edges that do not co-occur across the networks
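The two homogeneity criteria can be checked with a one-sided hypergeometric test. The sketch below is an illustration under assumptions: the significance level is not given in the source, so `alpha` is left to the caller and the value 0.05 in the example is arbitrary; the gene names and function names are hypothetical.

```python
from math import comb

def hypergeom_pvalue(N, K, n, x):
    """P(X >= x) when drawing n genes from N, of which K belong to the category."""
    total = comb(N, n)
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(x, min(n, K) + 1))
    return tail / total

def is_homogeneous(cluster, category, all_genes, alpha, min_fraction=0.4):
    """Both criteria: enrichment significant at alpha AND enough members in the category."""
    x = len(cluster & category)
    p = hypergeom_pvalue(len(all_genes), len(category & all_genes), len(cluster), x)
    return p <= alpha and x / len(cluster) >= min_fraction

# Toy universe of 20 genes, 5 of which carry the annotation.
all_genes = {f"g{i}" for i in range(20)}
category = {"g0", "g1", "g2", "g3", "g4"}
cluster = {"g0", "g1", "g2", "g3", "g10"}          # 4 of 5 members annotated

print(round(hypergeom_pvalue(20, 5, 5, 4), 4))     # 0.0049
print(is_homogeneous(cluster, category, all_genes, alpha=0.05))  # True (alpha is illustrative)
```

Requiring both conditions guards against large clusters that are statistically enriched but where only a small share of the genes carries the annotation.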
Example of a MODES false positive:
MODES identified a cluster of 5 genes (MSF1, PHB1, CBP4, NDI1, SCO2) that are not functionally homogeneous
Figure legend: protein biosynthesis, replicative cell aging, mitochondrial electron transfer
Functional prediction:
CODENSE identified a 6-node sub-graph
5 of its genes belong to the “protein biosynthesis” category
Prediction: ASC1 is involved in protein biosynthesis as well
Test with 448 known genes: 50% accuracy
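The prediction step can be read as a simple guilt-by-association rule: an unannotated gene inherits the category shared by most of its cluster. The sketch below is one hedged interpretation, not the exact rule from the source; the cutoff `min_fraction` and the placeholder gene names `geneA` to `geneE` are assumptions (only ASC1 and the category name come from the slide).

```python
from collections import Counter

def predict_function(cluster, annotations, min_fraction=0.5):
    """Assign the dominant category of a cluster to its unannotated genes.

    Returns an empty dict when no category covers at least `min_fraction`
    of the cluster (a hypothetical cutoff).
    """
    known = [annotations[g] for g in cluster if g in annotations]
    if not known:
        return {}
    category, count = Counter(known).most_common(1)[0]
    if count / len(cluster) < min_fraction:
        return {}
    return {g: category for g in cluster if g not in annotations}

# Six-gene cluster: five annotated members plus the unannotated ASC1.
cluster = ["geneA", "geneB", "geneC", "geneD", "geneE", "ASC1"]
annotations = {g: "protein biosynthesis" for g in cluster if g != "ASC1"}
print(predict_function(cluster, annotations))  # {'ASC1': 'protein biosynthesis'}
```

Accuracy can then be estimated as on the slide: hide the annotation of genes with known function, predict it from their clusters, and count how often the prediction matches.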