SiS: Significant Subnetworks in Massive Number of Network Topologies


1 SiS: Significant Subnetworks in Massive Number of Network Topologies
Md Mahmudul Hasan, Yusuf Kavurucu and Tamer Kahveci
University of Florida

2 Goal
Input: a collection of networks D = {G1, G2, …, Gm} and two positive integers, n and k.
Goal: Find the n connected subnetworks, each containing k edges, that appear in the largest number of these networks. In short, find the n size-k subnetworks that appear in most of these networks.
11/24/2018 Md Mahmudul Hasan et al.

3 Gene Regulatory Network Inference Problem
Errors in the measurement of high-throughput experiments.
Different models can explain the data equally well.
[Figure labels: Inference Algorithm, Gene Regulatory Network]

4 Definitions
[Figure: four graphs G1, G2, G3, G4 over the nodes A, B, C, D, and a subgraph G’ over the nodes A, C, D]
exists(G’, Gi) = 1 if Gi contains the subgraph G’, and 0 otherwise; frequency(G’, D) is the number of graphs in D that contain G’.
exists(G’, G1) = 1
exists(G’, G2) = 0
frequency(G’, D) = 2

5 Definitions (cont.)
[Figure: the four graphs G1, G2, G3, G4 over the nodes A, B, C, D]
Most Frequent Size-k Subgraph: The most frequent size-k subgraph is a size-k subgraph G’ for which frequency(G’, D) is maximum.
Given D, n and k, find the n most frequent size-k subgraphs in D.
The most frequent size-2 subgraph in D (over the nodes A, B, D): frequency(G’, D) = 4.

6 Definitions (cont.)
[Figure: the four graphs G1, G2, G3, G4 over the nodes A, B, C, D]
Normalized Frequency: fr(e, D) is the probability that a randomly selected graph Gi ∈ D contains the edge e.
frequency(C → A) = 2, fr(C → A) = 0.5
frequency(A → B) = 4, fr(A → B) = 1
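
The definitions of exists, frequency and fr can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: it assumes nodes carry fixed labels, so that "contains" reduces to an edge-subset test, and the collection D below is hypothetical, chosen only so that C → A occurs in two of four graphs and A → B in all four.

```python
from typing import FrozenSet, List, Tuple

Edge = Tuple[str, str]     # directed edge (source, target)
Graph = FrozenSet[Edge]    # a network, represented by its edge set


def exists(sub: Graph, g: Graph) -> int:
    """Return 1 if every edge of `sub` is also an edge of `g`, else 0."""
    return 1 if sub <= g else 0


def frequency(sub: Graph, D: List[Graph]) -> int:
    """Number of graphs in D that contain `sub`."""
    return sum(exists(sub, g) for g in D)


def fr(edge: Edge, D: List[Graph]) -> float:
    """Normalized frequency: fraction of graphs in D containing `edge`."""
    return frequency(frozenset([edge]), D) / len(D)


# Hypothetical collection (not the graphs drawn on the slide).
D = [
    frozenset({("A", "B"), ("C", "A"), ("D", "C")}),
    frozenset({("A", "B"), ("A", "D")}),
    frozenset({("A", "B"), ("C", "A"), ("A", "D")}),
    frozenset({("A", "B"), ("D", "C")}),
]

print(frequency(frozenset({("C", "A")}), D), fr(("C", "A"), D))
print(frequency(frozenset({("A", "B")}), D), fr(("A", "B"), D))
```

Representing each network as a frozen set of labeled edges keeps the containment test a plain subset check; unlabeled networks would need subgraph isomorphism instead.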

7 Definitions (cont.)
For any collection of edges {e1, e2, ..., ek} from the graphs in D, the probability that a randomly drawn graph in D contains all of these edges is as follows: [equation not captured in the transcript]
This is a multiplication of probabilities, resulting in a very small number.
The score of a subgraph G’ is: [equation not captured in the transcript]
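
The two formulas on this slide did not survive in the transcript. One reconstruction, consistent with the edge weight ψ(e) = -log(fr(e, D)) defined on the next slide and with the summed score shown on the slide after it, is given here. It assumes the edges are treated as independent and that E’ denotes the edge set of G’; both are assumptions, not statements taken from the slide.

```latex
P(e_1, e_2, \dots, e_k) \;=\; \prod_{j=1}^{k} fr(e_j, D)

score(G', D) \;=\; -\log \prod_{e \in E'} fr(e, D)
             \;=\; \sum_{e \in E'} -\log fr(e, D)
             \;=\; \sum_{e \in E'} \psi(e)
```

Taking the negative logarithm turns the small product into a sum of non-negative terms, so a smaller score corresponds to a larger probability.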

8 Template graph
ψ(e) = -log(fr(e, D))
[Figure: the graphs G1, G2, G3, G4; a graph over the nodes A, B, C, D whose edges carry fr(e, D) values of 1.0, 0.25 and 0.5; and the template graph, in which the same edges carry ψ(e) values of 0.0, 1.38 and 0.69]

9 Frequent subgraphs consist of frequent edges.
Frequent edges have smaller -log(fr(e, D)) values.
The more frequent a subgraph, the smaller its score value.
[Figure: a subgraph G’ and the template graph over the nodes A, B, C, D, with edge values 0.0, 0.69 and 1.38]
score(G’, D) = 0.69
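
The template-graph weights and the subgraph score can be sketched as follows. The natural logarithm is assumed because it reproduces the slide's values (-ln 0.5 ≈ 0.69, -ln 0.25 ≈ 1.386, which the slide shows as 1.38). The fr values for A → B and C → A come from the normalized-frequency slide; the edge D → C with fr = 0.25 is hypothetical, since the transcript does not say which edge carries that value.

```python
import math
from typing import Dict, Iterable, Tuple

Edge = Tuple[str, str]


def psi(f: float) -> float:
    """Edge weight -log(fr); written as log(1/fr) so that fr = 1 gives 0.0."""
    return math.log(1.0 / f)


def template_graph(fr_values: Dict[Edge, float]) -> Dict[Edge, float]:
    """Map every edge to its psi weight."""
    return {e: psi(f) for e, f in fr_values.items()}


def score(sub_edges: Iterable[Edge], template: Dict[Edge, float]) -> float:
    """Score of a subgraph: sum of the psi weights of its edges."""
    return sum(template[e] for e in sub_edges)


fr_values = {
    ("A", "B"): 1.0,    # from the normalized-frequency slide
    ("C", "A"): 0.5,    # from the normalized-frequency slide
    ("D", "C"): 0.25,   # hypothetical edge for the 0.25 value
}
template = template_graph(fr_values)
for edge, weight in template.items():
    print(edge, round(weight, 3))
print(round(score([("A", "B"), ("C", "A")], template), 3))
```

A subgraph made only of edges present in every network scores 0, the smallest possible value.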

10 Definitions (cont.)
Most Probable Size-k Subgraph: The most probable size-k subgraph is a size-k subgraph G’ for which score(G’, D) is minimum.
Given D, n and k, find the n most probable size-k subgraphs in D.

11 Three phases of SiS
Pre-processing: Lower-bound to score(G’, D) calculation. Upper-bound to score(G’, D) calculation.
Exploration: Looks for the most probable subgraphs.
Post-processing and Extension: Calculates the frequency of the most probable subgraphs.

12 Pre-processing
Lower-bound to score(G’, D) calculation.
Upper-bound to score(G’, D) calculation.

13 Pre-processing: Lower-bound
Compute lower-bound values (i.e., LB) to score(G’, D) for size-k’ subgraphs G’ (k’ = 1 … k).
Relax the connectedness constraint.
LB[i] ≤ score(optimal size-i subgraph)
How to calculate LB[1], LB[2], LB[3]: [worked example on the figure; the results "= 0.1" and "= 0.3" survive in the transcript]
[Figure: template graph over the nodes A, B, C, D, E, F with edge values 0.0, 0.1, 0.3, 0.2, 0.3 and 0.1]
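
One natural reading of "relax the connectedness constraint" is that LB[i] is the sum of the i smallest ψ values in the template graph, wherever those edges lie. The sketch below implements that reading; it is an assumption rather than the authors' stated procedure, and the weights are hypothetical, not the ones in the figure.

```python
from typing import List


def lower_bounds(psi_values: List[float], k: int) -> List[float]:
    """LB[i] = sum of the i smallest psi values, for i = 0..k (LB[0] = 0)."""
    if k > len(psi_values):
        raise ValueError("k exceeds the number of edges in the template graph")
    LB = [0.0]
    for w in sorted(psi_values)[:k]:
        LB.append(LB[-1] + w)   # prefix sums over the sorted weights
    return LB


weights = [0.4, 0.0, 0.7, 0.2]   # hypothetical psi values
LB = lower_bounds(weights, 3)
print(LB)
```

Any connected size-i subgraph uses i distinct edges, so its score can never be smaller than the sum of the i smallest weights, which is why this value is a valid lower bound.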

14 Pre-processing: Upper-bound
Upper-bound for size-3 subgraph, if we start with the edge {C→F}:
G’ = {C→F}, score(G’, D) = 0.1
G’ = {C→F, C→D}, score(G’, D) = 0.4
G’ = {C→F, C→D, A→D}, score(G’, D) = 0.5
UB[3] = 0.5
[Figure: template graph over the nodes A, B, C, D, E, F with edge values 0.0, 0.1, 0.3, 0.2, 0.3 and 0.1, and the array UB[1], UB[2], UB[3]]

15 Pre-processing: Upper-bound (cont.)
Upper-bound for size-3 subgraph, starting with {A→B}:
G’ = {A→B}, score(G’, D) = 0
G’ = {A→B, A→D}, score(G’, D) = 0.1
G’ = {A→B, A→D, E→D}, score(G’, D) = 0.3
UB[3] is lowered from 0.5 to 0.3.
[Figure: template graph over the nodes A, B, C, D, E, F with edge values 0.0, 0.1, 0.3, 0.2, 0.3 and 0.1, and the array UB[1], UB[2], UB[3]]
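
The two worked examples follow a greedy rule: start from one edge, repeatedly add the cheapest adjacent edge, and keep the smallest score seen over all starting edges. A minimal sketch of that rule is given here. The edge weights are reconstructed from the two examples, so the template graph in the figure may contain edges that are missing from this list, and treating two edges as adjacent when they share a node regardless of direction is an assumption.

```python
from typing import Dict, List, Set, Tuple

Edge = Tuple[str, str]

# psi(e) values reconstructed from the worked examples; possibly incomplete.
TEMPLATE: Dict[Edge, float] = {
    ("A", "B"): 0.0,
    ("A", "D"): 0.1,
    ("C", "D"): 0.3,
    ("E", "D"): 0.2,
    ("C", "F"): 0.1,
}


def greedy_scores(start: Edge, template: Dict[Edge, float], k: int) -> List[float]:
    """Scores of the greedily grown subgraph at sizes 1..k (shorter if it stalls)."""
    chosen: Set[Edge] = {start}
    nodes: Set[str] = set(start)
    total = template[start]
    scores = [total]
    while len(chosen) < k:
        # Assumption: an edge is adjacent if it shares a node, ignoring direction.
        frontier = [e for e in template
                    if e not in chosen and (e[0] in nodes or e[1] in nodes)]
        if not frontier:
            break
        best = min(frontier, key=lambda e: template[e])
        chosen.add(best)
        nodes.update(best)
        total += template[best]
        scores.append(total)
    return scores


def upper_bounds(template: Dict[Edge, float], k: int) -> List[float]:
    """UB[size] = smallest greedy score of that size over all starting edges."""
    UB = [0.0] + [float("inf")] * k
    for start in template:
        for size, s in enumerate(greedy_scores(start, template, k), start=1):
            UB[size] = min(UB[size], s)
    return UB


print(greedy_scores(("C", "F"), TEMPLATE, 3))
print(greedy_scores(("A", "B"), TEMPLATE, 3))
print(upper_bounds(TEMPLATE, 3))
```

Every greedy result is a real connected subgraph, so its score can only be greater than or equal to the optimum; that is what makes the minimum over all starting edges a valid upper bound.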

16 Exploration
Looks for the most probable subgraphs.

17 Example for a size-6 subgraph
[Figure: template graph with the arrays LB[1], LB[2], …, LB[6] and UB[1], UB[2], …, UB[6]; LB[2] = 0.5 and UB[6] = 2.1]
Size-6 subgraph G’’: score(G’’) = 2.05, |E’| = 6. Tighter upper-bound values are obtained on the fly: UB[6] drops from 2.1 to 2.05.
Size-3 subgraph G’: score(G’) = 1.5, |E’| = 3.
Adding an edge of weight 0.2 gives a size-4 subgraph G’’: score(G’’) = 1.7, |E’| = 4.
Here score(G’’) + LB[2] > UB[6], since 1.7 + 0.5 = 2.2, so this branch cannot lead to a better size-6 subgraph and is pruned.

18 Post-processing and Extension
Calculates the frequency of the most probable subgraphs.

19 Post-processing
[Figure: the arrays LB[1], LB[2], …, LB[k] and UB[1], …, UB[k], and a table of the n resulting subgraphs (rows 1, 2, …, n) with the columns G’, score(G’, D) and frequency(G’, D)]

20 Extension
Maximal frequent subgraph:

21 Datasets
KEGG database (February 2012 freeze).
Columns: Dataset; Number of Organisms; Entire Dataset (Nodes, Edges); Template Graph (Nodes, Edges).
Eukaryote-G: 145; 45,315; 78,499; 1,413; 1,541
Prokaryote-G: 1,486; 393,681; 616,546; 1,676
Eukaryote-AAG: 2,048; 2,951; 43; 54
Prokaryote-AAG: 1,442; 20,692; 27,334; 79
Eukaryote-P: 2,942; 5,757; 63; 103
Prokaryote-P: 31,130; 60,802; 114
Rows with fewer than five values lost entries in the transcript; the surviving values are listed in their original order.

22 Results
[Figure] Frequency of top 50 size-k (k = 6 … 20) subgraphs in Global dataset.
[Figure] Frequency of top 50 size-k (k = 6 … 15) subgraphs in eukaryotes for AAG and Pyrimidine network.

23 Results (cont.)
[Figure] Correlation of frequent subgraphs in eukaryotes and prokaryotes in Global dataset.

24 Results (cont.)
Columns: Dataset; Support; Running time in minutes (MULE, SiS).
Eukaryote-G; 83%; 20; 0.07
Eukaryote-G; 80%; 447; 4.3
Prokaryote-G; MULE: Didn’t run; SiS: 0.09 and 5
[Figure] Maximal frequent size-25 subgraph in eukaryotes (global map).

25 Conclusion
A method (SiS) that efficiently discovers significant subnetworks in a collection of networks.
SiS scales easily to very large datasets (i.e., datasets with over a thousand networks, having a total of hundreds of thousands of nodes and edges).
SiS shows significant improvement over the state-of-the-art maximal frequent subnetwork detection algorithm (MULE).

26 Acknowledgements
Yusuf Kavurucu
Tamer Kahveci
NSF CCF-0829867
NSF IIS-0845439

27 Thank You

28 Additional - 1
Calculate the upper-bound values (i.e., UB) to score(G’, D) for size-k’ subgraphs G’ (k’ = 1 … k).
Greedy approach:
Start at any edge e ∈ ET and initialize G’ to {e}.
Add the edge e’ adjacent to G’ that has the smallest ψ(e’) value.
Repeat until G’ has k’ edges.
UB[k’] = smallest score(G’, D) among these |ET| values (one greedy run per starting edge).
score(optimal size-i subgraph) ≤ UB[i]

29 Additional - 2
Initialize G’ = (V’, E’) to a size-1 subgraph (i.e., E’ = {ei}).
Grow G’ by inserting a new edge ej into E’ if all of the following hold (let the graph obtained after adding ej to E’ be G’’):
(i) ej is incident to a node v such that v ∈ V’.
(ii) The index of the new edge (i.e., j) is greater than the index of the first edge in E’ (i.e., i).
(iii) score(G’’, D) + LB[k - |E’| - 1] ≤ UB[k].
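
The three growth conditions can be sketched as a small branch-and-bound search. This is an illustration under stated assumptions, not the authors' code: edges are indexed by their position in a list, adjacency ignores edge direction, a `seen` set is added to avoid expanding the same edge set twice, and a small tolerance guards the floating-point comparison. The edge weights, the LB values and UB[3] = 0.3 in the usage part are taken from the same reconstructed five-edge example used for the pre-processing slides.

```python
from typing import Dict, FrozenSet, List, Set, Tuple

Edge = Tuple[str, str]
EPS = 1e-9  # tolerance for floating-point comparisons


def explore(edges: List[Tuple[Edge, float]], k: int, LB: List[float],
            UB_k: float, n: int) -> List[Tuple[float, FrozenSet[Edge]]]:
    """Return up to n connected size-k edge sets with the smallest scores."""
    found: Dict[FrozenSet[int], float] = {}
    seen: Set[FrozenSet[int]] = set()

    def grow(chosen: FrozenSet[int], nodes: FrozenSet[str],
             total: float, first: int) -> None:
        if chosen in seen:          # this edge set was already expanded
            return
        seen.add(chosen)
        if len(chosen) == k:
            found[chosen] = total
            return
        for j in range(first + 1, len(edges)):        # condition (ii)
            if j in chosen:
                continue
            (u, v), w = edges[j]
            if u not in nodes and v not in nodes:     # condition (i)
                continue
            remaining = k - len(chosen) - 1
            if total + w + LB[remaining] > UB_k + EPS:  # condition (iii)
                continue
            grow(chosen | {j}, nodes | {u, v}, total + w, first)

    for i, ((u, v), w) in enumerate(edges):
        grow(frozenset([i]), frozenset([u, v]), w, i)

    ranked = sorted(found.items(), key=lambda item: item[1])[:n]
    return [(s, frozenset(edges[j][0] for j in idx)) for idx, s in ranked]


# Reconstructed example weights (possibly incomplete), k = 3.
EDGES = [(("A", "B"), 0.0), (("A", "D"), 0.1), (("C", "D"), 0.3),
         (("E", "D"), 0.2), (("C", "F"), 0.1)]
LB = [0.0, 0.0, 0.1]   # LB[0..2]: sums of the smallest weights (assumed reading)
result = explore(EDGES, k=3, LB=LB, UB_k=0.3, n=5)
print(result)
```

Condition (ii) ensures each subgraph is generated only from its lowest-indexed edge, and condition (iii) discards a partial subgraph as soon as even the cheapest possible completion would exceed the known upper bound.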

30 Additional - 3
[Figure: a metabolic network (labels A, B, C, Z-1, Z-2) and the corresponding enzyme network]
Metabolic networks are downloaded from the KEGG database.

31 Additional - 4
[Figure] A frequent size-20 subgraph in prokaryotes which is infrequent in eukaryotes.
[Figure] A size-20 subgraph which is frequent both in prokaryotes and eukaryotes.

