Mining Coherent Dense Subgraphs across Multiple Biological Networks. Vahid Mirjalili, CSE 891.

Motivation: Finding patterns across multiple networks helps identify biological modules and predict function. Current algorithms are too costly. A novel algorithm, CODENSE, was developed; it is scalable in the number and size of the input networks and adjustable for exact or approximate pattern mining.

Clustering can detect meaningful biological modules. For example, a dense protein interaction sub-network may correspond to a protein complex, and a dense co-expression sub-network may represent a co-expression cluster. Biological modules are expected to be active across multiple conditions. One idea is to aggregate all the networks and identify dense sub-graphs in the aggregated network, but this carries a risk of false-positive detections.

Aggregated graph: False positives can appear in the aggregated graph. In the example, six graphs are added together and the edges that occur fewer than 3 times are deleted → resulting summary graph.

Solution to false positives in the summary graph: frequent sub-graphs. Mine the dense sub-graphs directly in each original network. A sub-graph is frequent if it occurs multiple times in a set of graphs. In biological networks, each gene occurs only once in a graph → no isomorphism problem.

Frequent dense sub-graph: A frequent dense sub-graph does not necessarily show accurate information. Some edges in the frequent sub-graph shown on the slide do not occur in the original set, so it is more meaningful to divide it into two sub-graphs.

Coherent Dense Sub-graphs: All edges in a coherent sub-graph should have correlated occurrences in the original graph set. CODENSE reduces the networks to 2 meta-graphs and performs clustering on these two graphs only (instead of on the individual networks). As a result, CODENSE can distinguish the two modules, scales well, and can discover overlapping clusters.

Overlapping Sub-graphs: Partition-based clustering algorithms fail to identify overlapping sub-graphs. Mining Overlapping Dense Sub-graphs (MODES) is designed to find them.

Application: Identify frequent co-expression clusters across multiple microarray datasets. Each microarray dataset is modeled as an un-weighted, undirected graph in which each node represents a gene and two genes are connected by an edge if they show high expression correlation. A densely connected sub-graph → tight co-expression cluster. Clusters from a single microarray dataset include spurious links and may not be homogeneous in function and regulation.

Problem Formulation: A relation graph set D contains n simple graphs, $D = \{G_1, G_2, \dots, G_n\}$ with $G_i = (V, E_i)$, so a common vertex set V is shared by the graphs. Support(G) is the number of graphs in the relation graph dataset (D) that contain G. A graph G is frequent if support(G) > threshold. The summary graph is an un-weighted graph extracted from D, where an edge exists only if it occurs in more than k graphs in D.
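A minimal sketch of these definitions, assuming each graph is stored as a set of undirected edges over the shared vertex set. The helper names (`edge`, `summary_graph`, `support`) and the toy dataset are illustrative, not taken from the slides.

```python
from collections import Counter

def edge(u, v):
    """Canonical form of an undirected edge, so (u, v) == (v, u)."""
    return (u, v) if u <= v else (v, u)

def summary_graph(graphs, k):
    """Keep only the edges that occur in more than k of the graphs."""
    counts = Counter(e for g in graphs for e in g)
    return {e for e, c in counts.items() if c > k}

def support(subgraph_edges, graphs):
    """Number of graphs in D that contain every edge of the sub-graph."""
    return sum(1 for g in graphs if subgraph_edges <= g)

# Hypothetical toy dataset D with n = 3 graphs on vertices a..d.
D = [
    {edge("a", "b"), edge("b", "c"), edge("a", "c")},
    {edge("a", "b"), edge("b", "c")},
    {edge("a", "b"), edge("c", "d")},
]

summary = summary_graph(D, k=1)
print(sorted(summary))
print(support({edge("a", "b"), edge("b", "c")}, D))
```

Because every gene appears once per graph, containment reduces to a subset test on edge sets; no isomorphism search is needed.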

Problem Formulation, Edge Support Vector: for an edge e, $w(e) = [w_1(e), w_2(e), \dots, w_n(e)]$, where $w_i(e)$ is the weight of edge e in graph i (for an un-weighted graph it is 0 or 1).
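The edge support vector can be sketched as follows, assuming each graph is a dictionary from an edge to its weight and that an absent edge has weight 0. The function name and the toy data are hypothetical.

```python
def edge_support_vector(e, graphs):
    """Return the weights of edge e across all graphs; absent edges count as 0."""
    return [g.get(e, 0) for g in graphs]

# Hypothetical toy dataset: three un-weighted graphs (weights are 0 or 1).
D = [
    {("a", "b"): 1, ("b", "c"): 1},
    {("a", "b"): 1},
    {("b", "c"): 1, ("c", "d"): 1},
]

w_ab = edge_support_vector(("a", "b"), D)
w_cd = edge_support_vector(("c", "d"), D)
print(w_ab, w_cd)
```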

Second-Order Graph: the second-order graph S is a graph in which each node represents an edge from the relation graph dataset (D), and an edge between nodes u and v exists if the edge support vectors w(u) and w(v) are highly correlated. For efficiency, S is constructed only for a sub-graph of the summary graph.
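One way to build such a graph, sketched under the assumption that "highly correlated" is approximated by a small Euclidean distance between support vectors. The threshold `max_dist`, the function name, and the vectors below are illustrative choices.

```python
import math
from itertools import combinations

def second_order_graph(vectors, max_dist):
    """vectors maps each first-order edge to its support vector.

    Returns the set of second-order edges: pairs of first-order edges
    whose support vectors lie within max_dist of each other.
    """
    links = set()
    for (e1, w1), (e2, w2) in combinations(sorted(vectors.items()), 2):
        if math.dist(w1, w2) <= max_dist:
            links.add((e1, e2))
    return links

# Hypothetical support vectors over six graphs.
vectors = {
    ("a", "b"): [1, 1, 1, 0, 0, 0],
    ("b", "c"): [1, 1, 1, 0, 0, 0],
    ("c", "d"): [0, 0, 0, 1, 1, 1],
    ("d", "e"): [0, 0, 1, 1, 1, 1],
}

S = second_order_graph(vectors, max_dist=1.0)
print(sorted(S))
```

In this toy case the edges fall into two groups that are active in different graphs, so S has two components even though all four edges might survive a support cutoff.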

Coherent Graph: a sub-graph extracted from the summary graph is coherent if all its edges have support > k and its second-order graph is dense. Graph Density: $d = \frac{2m}{n(n-1)}$, where m is the number of edges and n is the number of nodes.
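A short sketch of the density computation, assuming the usual definition for a simple undirected graph, 2m / (n(n-1)), with the node set inferred from the edge list. The function name is illustrative.

```python
def density(edges):
    """Density 2m / (n(n-1)) of a simple undirected graph given as an edge list.

    Nodes are inferred from the edges, so isolated nodes are not counted.
    """
    nodes = {v for e in edges for v in e}
    n, m = len(nodes), len(edges)
    return 2 * m / (n * (n - 1)) if n > 1 else 0.0

triangle = [("a", "b"), ("b", "c"), ("a", "c")]
path = [("a", "b"), ("b", "c"), ("c", "d")]
print(density(triangle), density(path))
```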

Two facts: First, if a frequent sub-graph is dense, then it must be dense in the summary graph as well, but the converse does not always hold. Second, if a sub-graph is coherent (its edges have highly correlated occurrences across the dataset), then its second-order graph is dense.

Aggregate the graphs into a summary graph and eliminate infrequent edges.

MODES: Mining Overlapping DEnse Subgraphs. MODES is built on HCS (Highly Connected Sub-graphs). It can efficiently identify dense sub-graphs and can mine overlapping sub-graphs. Two cut approaches are used: minimum cut and normalized cut (Shi, Malik 2000). Apply the normalized cut in the initial steps of the HCS algorithm; then, once the partitions are small, proceed with the minimum cut.
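For orientation, here is a sketch of the plain HCS recursion that MODES builds on, not MODES itself: it uses only the minimum cut (via a simple Stoer-Wagner implementation), omits the normalized-cut step, and does not recover overlapping sub-graphs. A sub-graph is accepted when its minimum cut exceeds half its node count, the "highly connected" criterion of HCS. Function names and the toy graph are illustrative.

```python
def stoer_wagner(nodes, edges):
    """Global minimum cut with unit edge weights: returns (cut_size, one_side)."""
    w = {u: {} for u in nodes}
    for u, v in edges:
        w[u][v] = w[u].get(v, 0) + 1
        w[v][u] = w[v].get(u, 0) + 1
    members = {u: {u} for u in nodes}  # original nodes merged into each super-node
    active = set(nodes)
    best = (float("inf"), set())
    while len(active) > 1:
        # Minimum cut phase: repeatedly add the most tightly connected node.
        start = min(active)
        added = [start]
        conn = {u: w[start].get(u, 0) for u in sorted(active) if u != start}
        while conn:
            nxt = max(conn, key=conn.get)
            cut_of_phase = conn.pop(nxt)
            added.append(nxt)
            for u in conn:
                conn[u] += w[nxt].get(u, 0)
        s, t = added[-2], added[-1]
        if cut_of_phase < best[0]:
            best = (cut_of_phase, set(members[t]))
        # Merge the last node t into s.
        for u, wt in w[t].items():
            if u != s:
                w[s][u] = w[s].get(u, 0) + wt
                w[u][s] = w[u].get(s, 0) + wt
            del w[u][t]
        del w[t]
        members[s] |= members[t]
        active.remove(t)
    return best

def hcs(nodes, edges):
    """Recursively split along minimum cuts; keep highly connected pieces."""
    nodes = set(nodes)
    if len(nodes) < 2:
        return []  # singletons are discarded
    edges = [(u, v) for u, v in edges if u in nodes and v in nodes]
    cut, side = stoer_wagner(nodes, edges)
    if cut > len(nodes) / 2:
        return [nodes]
    return hcs(side, edges) + hcs(nodes - side, edges)

# Hypothetical toy graph: two 4-cliques joined by a single bridge edge d-e.
k4 = lambda vs: [(a, b) for i, a in enumerate(vs) for b in vs[i + 1:]]
E = k4("abcd") + k4("efgh") + [("d", "e")]
clusters = hcs("abcdefgh", E)
print(sorted(sorted(c) for c in clusters))
```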

CODENSE analysis: CODENSE simplifies the identification of coherent dense sub-graphs across n graphs into mining two special graphs, the summary graph and the second-order graph. It can mine network modules, and it can mine both exact and approximate patterns (by modifying the similarity threshold). It can be extended to weighted graphs (using Pearson correlation instead of Euclidean distance).

Experimental Study: co-expression network. The data comprise 39 yeast microarray datasets and 6661 genes. Calculate the Pearson correlation (r) between the expression levels → construct the relation graph (connectivity of two genes is determined by the Pearson correlation; n: number of measurements).
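A minimal sketch of turning one expression dataset into a co-expression graph, assuming a plain cutoff on the Pearson correlation r. The cutoff `r_min`, the function names, and the expression values are hypothetical; the slides do not state the exact rule used to decide connectivity.

```python
import math
from itertools import combinations

def pearson(x, y):
    """Pearson correlation of two equal-length profiles (0.0 if one is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    denom = math.sqrt(sxx * syy)
    return sxy / denom if denom > 0 else 0.0

def coexpression_graph(expr, r_min):
    """Connect two genes when the correlation of their profiles is at least r_min."""
    return {
        (g1, g2)
        for g1, g2 in combinations(sorted(expr), 2)
        if pearson(expr[g1], expr[g2]) >= r_min
    }

# Hypothetical expression profiles over n = 4 measurements.
expr = {
    "g1": [1.0, 2.0, 3.0, 4.0],
    "g2": [2.0, 4.0, 6.0, 8.0],
    "g3": [4.0, 1.0, 3.0, 2.0],
}
G = coexpression_graph(expr, r_min=0.9)
print(G)
```

Running this once per microarray dataset would give one graph per dataset over the same gene set, which is the relation graph set the later steps operate on.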

Procedure: 1. Create the summary graph, removing edges that occur fewer than 6 times across the 39 graphs. 2. Apply MODES to the summary graph to identify dense sub-graphs with cutoff density d1. 3. For each dense sub-graph, construct the second-order graph S. 4. Apply MODES to S to identify sub-graphs with density > d2. 5. Transform the edges → vertices, and apply MODES again to identify the dense sub-graphs with density > d3.

Functional Module Discovery: MODES vs CODENSE. A cluster is considered functionally homogeneous if: 1. its functional homogeneity, modeled by the hypergeometric distribution, is significant at α=0.01; 2. at least 40% of its member genes belong to a specific Gene Ontology (GO) functional category. MODES identified 366 clusters, but only 151 were functionally homogeneous (42%). CODENSE identified 770 clusters, 76% of which were homogeneous. The improvement is due to the second-order graph, which eliminates edges that do not show co-occurrence across all networks.
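The two homogeneity criteria can be sketched as an upper-tail hypergeometric test plus a fraction check. The category size used in the example (200 genes) is a made-up value for illustration, and the function names are hypothetical; only the 6661-gene population, α=0.01, and the 40% rule come from the slides.

```python
from math import comb

def hypergeom_pvalue(N, K, n, k):
    """P(X >= k) when drawing n genes from N, of which K belong to the category."""
    upper = min(K, n)
    return sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1)) / comb(N, n)

def is_homogeneous(N, K, n, k, alpha=0.01, min_fraction=0.4):
    """Both criteria: significant enrichment and at least 40% of members in the category."""
    return hypergeom_pvalue(N, K, n, k) < alpha and k / n >= min_fraction

# 6661 genes in total; assume (hypothetically) 200 are annotated to one category.
print(is_homogeneous(6661, 200, n=6, k=5))  # 5 of 6 cluster genes in the category
print(is_homogeneous(6661, 200, n=6, k=1))  # 1 of 6 cluster genes in the category
```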

Example of MODES false positive: MODES identified a cluster of 5 genes (MSF1, PHB1, CBP4, NDI1, SCO2) that are not functionally homogeneous. Their annotated functions include protein biosynthesis, replicative cell aging, and mitochondrial electron transfer.

Functional prediction: CODENSE identified a 6-node sub-graph in which 5 genes belong to the “protein biosynthesis” category. Prediction: ASC1 should be involved in protein biosynthesis as well. Test with 448 known genes: 50% accuracy.
