NeMoFinder: Dissecting genome-wide protein-protein interactions with meso-scale network motifs. Mike Yuan



Outline of this presentation
– Introduction to PPI
– Introduction to Graph Mining
– Related work
– Problem statement
– Details of the NeMoFinder algorithm
– Summary
– References

Protein Interactions
A protein may interact with:
– Other proteins
– Nucleic acids
– Small molecules

Finding Protein Partners

Motivation
– Important for biological functions
– To understand the function of a protein, we need to find its interacting partners

Graph Theory
Molecular interaction networks are mapped as graphs.
[Figure: vertex (node), edge, cycle, directed edge (arc), weighted edge]

The protein protein interaction network…

Graph mining
– Methods for Mining Frequent Subgraphs
– Mining Variant and Constrained Substructure Patterns
– Applications: Graph Indexing, Similarity Search, Classification and Clustering

Why Graph Mining?
Graphs are ubiquitous
– Chemical compounds (Cheminformatics)
– Protein structures, biological pathways/networks (Bioinformatics)
– Program control flow, traffic flow, and workflow analysis
– XML databases, Web, and social network analysis
Graph is a general model
– Trees, lattices, sequences, and items are degenerate graphs
Complexity of algorithms: many problems are of high complexity

Graph, Graph, Everywhere
[Figure panels: aspirin; yeast protein interaction network (from H. Jeong et al., Nature 411, 41 (2001)); Internet; co-author network]

Graph Pattern Mining
Frequent subgraphs
– A (sub)graph is frequent if its support (occurrence frequency) in a given dataset is no less than a minimum support threshold
Applications of graph pattern mining
– Mining biochemical structures
– Program control flow analysis
– Mining XML structures or Web communities
– Building blocks for graph classification, clustering, compression, comparison, and correlation analysis

Example: Frequent Subgraphs
[Figure: a graph dataset with graphs (A), (B), (C) and its frequent patterns (1), (2); min support is 2]

Frequent Subgraph Mining Approaches
Apriori-based approach: if a graph is frequent, all of its subgraphs are frequent (the Apriori property)
– AGM/AcGM: Inokuchi, et al. (PKDD’00)
– FSG: Kuramochi and Karypis (ICDM’01)
– PATH#: Vanetik and Gudes (ICDM’02, ICDM’04)
– FFSM: Huan, et al. (ICDM’03)
Pattern growth approach
– MoFa: Borgelt and Berthold (ICDM’02)
– gSpan: Yan and Han (ICDM’02)
– Gaston: Nijssen and Kok (KDD’04)

Problem Statement
PPI network G = (V, E)
– each vertex represents a unique protein
– each edge between v_A and v_B indicates there is an interaction between A and B
Network motif
– a frequently occurring subgraph pattern in a network
– f_g is the number of occurrences of a subgraph g; g is repeated if f_g > F
– f_g_rand_i is the frequency of g in a randomized network G_rand_i, for 1 ≤ i ≤ N, where N is the number of randomized networks
– s_g is the number of times f_g ≥ f_g_rand_i; g is unique if s_g > S
Network motif discovery algorithm
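The two definitions above reduce to simple counting once the frequencies are known. The following is a minimal sketch, not the authors' code: the function names and the sample frequencies are made up for illustration, and it uses the strict inequalities from this slide (f_g > F and s_g > S).

```python
def is_repeated(f_g, F):
    """g is repeated if its frequency in the input network exceeds F."""
    return f_g > F


def uniqueness(f_g, rand_freqs):
    """s_g: number of randomized networks in which f_g >= f_g_rand_i."""
    return sum(1 for f_rand in rand_freqs if f_g >= f_rand)


def is_network_motif(f_g, rand_freqs, F, S):
    """A motif must be both repeated (f_g > F) and unique (s_g > S)."""
    return is_repeated(f_g, F) and uniqueness(f_g, rand_freqs) > S


# Hypothetical numbers: f_g = 5 in the real network, N = 5 randomized networks.
f_g = 5
rand_freqs = [1, 6, 3, 5, 0]
s_g = uniqueness(f_g, rand_freqs)  # counts 1, 3, 5 and 0, but not 6
```

With these illustrative values, s_g is 4, so g counts as a motif for S = 3 but not for S = 4.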

Problem Statement (cont)
Motivation of NeMoFinder: existing research has the following limitations
– The number of network motif candidates increases exponentially
– Interesting network motifs are both repeated and unique, and Apriori algorithms are not applicable
– The graph isomorphism problem is in NP, and no polynomial-time algorithm for it is known
NeMoFinder
– a network motif discovery algorithm to discover repeated and unique meso-scale network motifs in a large PPI network

Key procedures
[Figure: example graph G]
– Find repeated trees
– Use repeated trees to partition a network into a set of graphs
– Introduce graph cousins to facilitate the candidate generation and frequency counting processes

Step 1. Discover repeated subgraphs
Step 1.1: find repeated size-k trees
E.g. size-2 to size-5 trees
[Figure: trees t2, t3, t4_1, t4_2, t5_1, t5_2, t5_3]

Step 1. Discover repeated subgraphs (cont)
f_t2 = 7, f_t3 = 13, f_t4_1 = 6, f_t4_2 = 17, f_t5_1 = 1, f_t5_2 = 5, f_t5_3 = 7.
T2 = {t2}, T3 = {t3}, T4 = {t4_1, t4_2} and T5 = {t5_2, t5_3}.
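The sets T2 to T5 follow from the listed frequencies by applying the "repeated" test f > F. The sketch below reproduces them, assuming F = 2; the slide does not state the threshold used in this example, so that value and the dictionary layout are assumptions for illustration.

```python
# Tree frequencies from the slide, grouped by tree size k.
tree_freq = {
    2: {"t2": 7},
    3: {"t3": 13},
    4: {"t4_1": 6, "t4_2": 17},
    5: {"t5_1": 1, "t5_2": 5, "t5_3": 7},
}

F = 2  # assumed frequency threshold; not given on the slide


def repeated_trees(freqs_by_size, threshold):
    """Keep, for each size k, the trees whose frequency exceeds the threshold."""
    return {
        k: sorted(name for name, f in trees.items() if f > threshold)
        for k, trees in freqs_by_size.items()
    }


T = repeated_trees(tree_freq, F)
```

Under this assumption t5_1 (frequency 1) is the only tree dropped, which matches T5 = {t5_2, t5_3} on the slide.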

Step 1.2: Use repeated size-k trees to partition graph
[Figure: occurrences of t4_1 in G]

Step 1.2: Use repeated size-k trees to partition graph (cont)
[Figure: occurrences of t4_2 in G]

Step 1.2: Use repeated size-k trees to partition graph (cont)
[Figure: set of graphs GD_4 = {G4_1, G4_2, G4_3, G4_4, G4_5}]

Step 1.3: Perform graph join operation to find repeated size-k graphs
Generate 3-edge subgraphs from size-4 trees
[Figure: t4_1 with h1, h2; t4_2 with h3, h4, h5]

Step 1.3: Perform graph join operation to find repeated size-k graphs (cont)
Examples of graph join operations for subgraphs
[Figure: t4_1 joined with h2 gives g1_2; t4_2 joined with h3 gives g1_1]
f_g1_1 = 2 and f_g1_2 = 5

Step 1.3: Perform graph join operation to find repeated size-k graphs (cont)
Use the subgraphs obtained to generate further subgraphs
[Figure: g1_2 with h6, h7; graph join of g1_2 with h6 gives g2]
f_g2 < 2, so the algorithm stops

Algorithm 1 NeMoFinder
Input: G - PPI network; N - Number of randomized networks; K - Maximal network motif size; F - Frequency threshold; S - Uniqueness threshold;
Output: U - Repeated and unique network motif set;
// Step 1: Discover repeated subgraphs
D ← ∅;
for motif-size k from 3 to K do
    T ← FindRepeatedTrees(k);           // Step 1.1: find repeated size-k trees
    GD_k ← GraphPartition(G, T);        // Step 1.2: use repeated size-k trees to partition graph
    D ← D ∪ T;
    D’ ← T;
    i ← k;
    while D’ ≠ ∅ and i ≤ k × (k − 1)/2 do
        D’ ← FindRepeatedGraphs(k, i, D’);  // Step 1.3: perform graph join operation to find repeated size-k graphs
        D ← D ∪ D’;
        i ← i + 1;
    end while
end for

Algorithm 1 NeMoFinder (cont)
// Step 2: Determine subgraph frequency in randomized networks
for counter i from 1 to N do
    G_rand ← RandomizedNetworkGeneration();
    for each g ∈ D do
        GetRandFrequency(g, G_rand);
    end for
end for
// Step 3: Compute uniqueness of subgraphs
U ← ∅;
for each g ∈ D do
    s ← GetUniquenessValue(g);
    if s ≥ S then
        U ← U ∪ {g};
    end if
end for
return U;
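The bound on the inner while loop of Algorithm 1 is worth spelling out: a tree on k vertices has k − 1 edges, so growth starts at i = k edges and can continue at most up to the complete graph with k(k − 1)/2 edges. A small sketch of that arithmetic follows (the function name is hypothetical, and the early exit when D’ becomes empty is not modelled):

```python
def edge_counts(k):
    """Edge counts i that the inner loop can visit for motif size k.

    Starts at k (one edge more than a size-k tree) and ends at the
    number of edges of the complete graph on k vertices.
    """
    return list(range(k, k * (k - 1) // 2 + 1))


levels_for_4 = edge_counts(4)  # a size-4 tree has 3 edges; K4 has 6
```

For k = 4 the loop can therefore run for 4, 5 and 6 edges, and for k = 3 only for the triangle (3 edges).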

Algorithm Steps (cont)
Step 2: Determine subgraph frequency in randomized networks
– Generate randomized networks G_rand_i (1 ≤ i ≤ N)
– Check the frequency of the subgraphs in each of the randomized networks G_rand_i
Step 3: Compute uniqueness of subgraphs
– Based on frequencies in the input PPI network and the randomized networks
– f_g_rand_i is the frequency of g in a randomized network G_rand_i, for 1 ≤ i ≤ N, where N is the number of randomized networks
– s_g is the number of times f_g ≥ f_g_rand_i; g is unique if s_g > S
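The slides do not say how the randomized networks in Step 2 are generated. One common choice in network motif work is degree-preserving edge switching, and the sketch below assumes that technique; it is not necessarily what NeMoFinder uses. The function names, the swap count and the seed are all illustrative, and the input is assumed to be a simple undirected graph given as a list of distinct edges.

```python
import random
from collections import Counter


def degrees(edges):
    """Degree of every vertex in an undirected edge list."""
    deg = Counter()
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    return deg


def randomize_network(edges, n_swaps, seed=0):
    """Shuffle edges by swaps (a-b, c-d) -> (a-d, c-b), keeping all degrees."""
    rng = random.Random(seed)
    edge_list = [tuple(sorted(e)) for e in edges]
    edge_set = set(edge_list)
    if len(edge_list) < 2:
        return edge_list
    done, attempts = 0, 0
    while done < n_swaps and attempts < 100 * n_swaps:
        attempts += 1
        i, j = rng.sample(range(len(edge_list)), 2)
        a, b = edge_list[i]
        c, d = edge_list[j]
        if rng.random() < 0.5:
            c, d = d, c  # pick one of the two possible rewirings
        if len({a, b, c, d}) < 4:
            continue  # would create a self-loop
        new1, new2 = tuple(sorted((a, d))), tuple(sorted((c, b)))
        if new1 in edge_set or new2 in edge_set:
            continue  # would create a duplicate edge
        edge_set.difference_update((edge_list[i], edge_list[j]))
        edge_set.update((new1, new2))
        edge_list[i], edge_list[j] = new1, new2
        done += 1
    return edge_list


# Toy network: a 6-cycle with one chord.
toy = [(1, 2), (2, 3), (3, 4), (4, 5), (5, 6), (6, 1), (1, 4)]
toy_rand = randomize_network(toy, n_swaps=10, seed=1)
```

Each successful swap leaves every vertex degree unchanged, so the randomized network has the same degree sequence as the input while its local wiring differs.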

Find repeated subgraphs
Algorithm 2 FindRepeatedGraphs(k, i, D’)
Input: D’ - Set of repeated subgraphs with k vertices and i − 1 edges;
Output: D’’ - Set of repeated subgraphs with k vertices and i edges;
C ← CandidateGeneration(k, i, D’);
D’’ ← FrequencyCounting(k, i, C);
return D’’;

Candidate generation using graph cousins
– Represent subgraphs by adjacency matrices
– Code(M): a sequence formed by linking the lower triangular entries of M in the following order: m_1,1 m_2,1 m_2,2 … m_n,1 m_n,2 … m_n,n
– Transform the adjacency matrix into the canonical adjacency matrix (CAM), which is the one with the maximal code
– Definition of subCAM of a graph: a matrix obtained by setting the last edge entry in CAM(g) to 0
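To make Code(M) and the CAM concrete, here is a brute-force sketch that tries every vertex ordering and keeps the matrix with the largest code. It assumes codes are compared lexicographically and that diagonal entries are 0 for unlabeled vertices; both are assumptions, as are the function names. Trying all n! orderings is only workable for the small motif sizes discussed here.

```python
from itertools import permutations


def code(M):
    """Lower-triangular entries row by row: m11, m21, m22, ..., mn1, ..., mnn."""
    return tuple(M[i][j] for i in range(len(M)) for j in range(i + 1))


def permute(M, order):
    """Adjacency matrix after relabeling vertices in the given order."""
    n = len(order)
    return [[M[order[i]][order[j]] for j in range(n)] for i in range(n)]


def cam(M):
    """Canonical adjacency matrix: the relabeling with the maximal code."""
    return max((permute(M, p) for p in permutations(range(len(M)))), key=code)


# Two labelings of the same 3-vertex path (center is vertex 1, then vertex 2).
path_a = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
path_b = [[0, 0, 1], [0, 0, 1], [1, 1, 0]]
code_a = code(cam(path_a))
code_b = code(cam(path_b))
```

Both labelings yield the same canonical code, (0, 1, 0, 1, 0, 0), which is what allows isomorphic subgraphs to be recognized by comparing codes.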

Candidate generation using graph cousins (cont)
Definition of cousin: given two subgraphs g and h, if subCAM(g) = subCAM(h), then h is a cousin of g.
Three types of cousin relationship between g and h:
– Type I: Direct Cousin. h is isomorphic to a subgraph g’ which has the same number of vertices and edges as g, and g’ ≠ g;
– Type II: Twin Cousin. h is isomorphic to subgraph g;
– Type III: Distant Cousin. h is a disconnected subgraph.
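The cousin test can be sketched directly from the definition. The code below reads "last edge entry" as the last nonzero off-diagonal entry in code order (the final row scanned right to left), which is an interpretation rather than something the slides spell out. The two example matrices are hand-written for a 4-vertex star and a 4-vertex path and are assumed to already be in canonical form; the function names are hypothetical.

```python
def sub_cam(cam_matrix):
    """Copy of the CAM with its last edge entry (and mirror entry) set to 0."""
    out = [row[:] for row in cam_matrix]
    n = len(out)
    for i in range(n - 1, -1, -1):          # last row first
        for j in range(i - 1, -1, -1):      # off-diagonal entries, right to left
            if out[i][j]:
                out[i][j] = 0
                out[j][i] = 0
                return out
    return out


def are_cousins(cam_g, cam_h):
    """h is a cousin of g if subCAM(g) == subCAM(h)."""
    return sub_cam(cam_g) == sub_cam(cam_h)


# Star: vertex 0 is joined to 1, 2 and 3.
star4 = [[0, 1, 1, 1], [1, 0, 0, 0], [1, 0, 0, 0], [1, 0, 0, 0]]
# Path 2-0-1-3, written with the same first three rows as the star.
path4 = [[0, 1, 1, 0], [1, 0, 0, 1], [1, 0, 0, 0], [0, 1, 0, 0]]
star_path_cousins = are_cousins(star4, path4)
```

Removing the last edge from either matrix leaves the same 2-edge matrix, so under this reading the star and the path are cousins, and a join of the two would add one graph's last edge to the other.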

Candidate generation using graph cousins (cont)
[Figure: adjacency matrices for the graphs in figure 6: t4_1, h1, h]

Candidate generation using graph cousins (cont)
[Figure: adjacency matrices for the graphs in figure 6: t4_2, h3, h4, h5]

Candidate generation using graph cousins (cont)
Observations from the above example
– h1 is a Type I direct cousin of t4_1
– h2 is a Type III distant cousin of t4_1
– h3 is a Type II twin cousin of t4_2
– h4 is a Type I direct cousin of t4_2
– h5 is a Type III distant cousin of t4_2

Candidate generation using graph cousins (cont)
Algorithm 3 CandidateGeneration(k, i, D’)
Input: D’ - Set of repeated subgraphs with k vertices and i − 1 edges;
Output: C - Set of candidates with k vertices and i edges;
C ← ∅;
for each g ∈ D’ do
    H ← GetCousin(g);          // Step 1: find set of cousins
    for each h ∈ H do
        g’ ← join(g, h);       // Step 2: join g with cousins to form new subgraph
        C ← C ∪ {g’};
    end for
end for
return C;

Frequency counting
Leveraging properties of the different types of cousins
– L_x: set of graphs in GD_k embedding x
– If h is a Type I direct cousin of g and g’ is the subgraph obtained by joining g and h, then L_g’ = L_g ∩ L_h and f_g’ = |L_g ∩ L_h|
– If h is a Type III distant cousin, then f_g’ = |L_g ∩ L_h|
– If h is a Type II twin cousin, then f_g’ = CheckAllOccurances(g)
Example
– L_t4_1 = {G4_1, G4_2, G4_3, G4_5}, L_h2 = {G4_1, G4_2, G4_3, G4_4, G4_5}
– L_g1_2 = L_t4_1 ∩ L_h2 = {G4_1, G4_2, G4_3, G4_5}, f_g1_2 = 4 > 2
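For direct and distant cousins the frequency of the joined subgraph is just the size of a set intersection, which the slide's own example illustrates. A minimal sketch using those sets (the variable and function names are illustrative, and the twin-cousin case, which needs a full occurrence check, is not covered):

```python
# Sets of partition graphs in GD_4 that embed t4_1 and h2, as given on the slide.
L_t4_1 = {"G4_1", "G4_2", "G4_3", "G4_5"}
L_h2 = {"G4_1", "G4_2", "G4_3", "G4_4", "G4_5"}


def joined_frequency(L_g, L_h):
    """Embedding set and frequency of g' = join(g, h) for Type I / Type III cousins."""
    L_joined = L_g & L_h
    return L_joined, len(L_joined)


L_g1_2, f_g1_2 = joined_frequency(L_t4_1, L_h2)
```

This gives f_g1_2 = 4 without searching the network again, which is the saving that the cousin relationship is meant to provide.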

Frequency counting
Algorithm 4 FrequencyCounting(k, i, C)
Input: GD_k - Set of graphs generated by partitioning G with size-k repeated trees; C - Set of subgraph candidates with k vertices and i edges; F - Frequency threshold;
Output: D’’ - Set of repeated subgraphs with k vertices and i edges;
D’’ ← ∅;
for each g’ ∈ C do
    Get the join parameter of g’: g and h;
    L_g ← set of graphs in GD_k embedding g;
    L_h ← set of graphs in GD_k embedding h;
    if f_g < F or f_h < F then
        f_g’ ← 0;
    else if type of h = Type I direct cousin then      // case: h is direct cousin
        f_g’ ← |L_g ∩ L_h|;
    else if type of h = Type III distant cousin then   // case: h is distant cousin
        f_g’ ← |L_g ∩ L_h|;
    else if type of h = Type II twin cousin then       // case: h is twin cousin
        f_g’ ← CheckAllOccurances(g);
    end if
    if f_g’ > F then
        D’’ ← D’’ ∪ {g’};
    end if
end for
return D’’;

Summary
– NeMoFinder: an efficient network motif discovery algorithm to discover larger-sized repeated and unique network motifs in PPI networks
– Uses repeated trees to partition the network into graphs
– Uses graph cousins for candidate generation and frequency counting

References (1)
T. Asai, et al., “Efficient substructure discovery from large semi-structured data”, SDM'02
C. Borgelt and M. R. Berthold, “Mining molecular fragments: Finding relevant substructures of molecules”, ICDM'02
D. Cai, Z. Shao, X. He, X. Yan, and J. Han, “Community Mining from Multi-Relational Networks”, PKDD'05
J. Chen, W. Hsu, M. Lee, and S. Ng, “NeMoFinder: Dissecting genome-wide protein-protein interactions with meso-scale network motifs”, SIGKDD 2006
M. Deshpande, M. Kuramochi, and G. Karypis, “Frequent Sub-structure Based Approaches for Classifying Chemical Compounds”, ICDM 2003
M. Deshpande, M. Kuramochi, and G. Karypis, “Automated approaches for classifying structures”, BIOKDD'02
C. Faloutsos, K. McCurley, and A. Tomkins, “Fast Discovery of Connection Subgraphs”, KDD'04
H. Fröhlich, J. Wegner, F. Sieker, and A. Zell, “Optimal Assignment Kernels For Attributed Molecular Graphs”, ICML’05

References (2)
L. Holder, D. Cook, and S. Djoko, “Substructure discovery in the subdue system”, KDD'94
J. Huan, W. Wang, D. Bandyopadhyay, J. Snoeyink, J. Prins, and A. Tropsha, “Mining spatial motifs from protein structure graphs”, RECOMB’04
J. Huan, W. Wang, and J. Prins, “Efficient mining of frequent subgraph in the presence of isomorphism”, ICDM'03
H. Hu, X. Yan, Yu, J. Han and X. J. Zhou, “Mining Coherent Dense Subgraphs across Massive Biological Networks for Functional Discovery”, ISMB'05
A. Inokuchi, T. Washio, and H. Motoda, “An apriori-based algorithm for mining frequent substructures from graph data”, PKDD'00
C. James, D. Weininger, and J. Delany, “Daylight Theory Manual Daylight Version 4.82”, Daylight Chemical Information Systems, Inc.,
G. Jeh and J. Widom, “Mining the Space of Graph Properties”, KDD'04
H. Kashima, K. Tsuda, and A. Inokuchi, “Marginalized Kernels Between Labeled Graphs”, ICML’03

References (3)
M. Koyuturk, A. Grama, and W. Szpankowski, “An efficient algorithm for detecting frequent subgraphs in biological networks”, Bioinformatics, 20:I200--I207,
T. Kudo, E. Maeda, and Y. Matsumoto, “An Application of Boosting to Graph Classification”, NIPS’04
M. Kuramochi and G. Karypis, “Frequent subgraph discovery”, ICDM'01
M. Kuramochi and G. Karypis, “GREW: A Scalable Frequent Subgraph Discovery Algorithm”, ICDM’04
C. Liu, X. Yan, H. Yu, J. Han, and P. S. Yu, “Mining Behavior Graphs for ‘Backtrace’ of Noncrashing Bugs”, SDM'05
P. Mahé, N. Ueda, T. Akutsu, J. Perret, and J. Vert, “Extensions of Marginalized Graph Kernels”, ICML’04
S. Nijssen and J. Kok, “A quickstart in frequent structure mining can make a difference”, KDD'04
J. Prins, J. Yang, J. Huan, and W. Wang, “Spin: Mining maximal frequent subgraphs from graph databases”, KDD'04

References (4)
D. Shasha, J. T.-L. Wang, and R. Giugno, “Algorithmics and applications of tree and graph searching”, PODS'02
J. R. Ullmann, “An algorithm for subgraph isomorphism”, J. ACM, 23:31--42,
N. Vanetik, E. Gudes, and S. E. Shimony, “Computing frequent graph patterns from semistructured data”, ICDM'02
C. Wang, W. Wang, J. Pei, Y. Zhu, and B. Shi, “Scalable mining of large disk-based graph databases”, KDD'04
T. Washio and H. Motoda, “State of the art of graph-based data mining”, SIGKDD Explorations, 5:59-68, 2003
X. Yan and J. Han, “gSpan: Graph-Based Substructure Pattern Mining”, ICDM'02
X. Yan and J. Han, “CloseGraph: Mining Closed Frequent Graph Patterns”, KDD'03
X. Yan, P. S. Yu, and J. Han, “Graph Indexing: A Frequent Structure-based Approach”, SIGMOD'04
X. Yan, X. J. Zhou, and J. Han, “Mining Closed Relational Graphs with Connectivity Constraints”, KDD'05
X. Yan, P. S. Yu, and J. Han, “Substructure Similarity Search in Graph Databases”, SIGMOD'05
X. Yan, F. Zhu, J. Han, and P. S. Yu, “Searching Substructures with Superimposed Distance”, ICDE'06
M. J. Zaki, “Efficiently mining frequent trees in a forest”, KDD'02