Presentation is loading. Please wait.

Presentation is loading. Please wait.

Mining, Indexing and Searching Graphs in Biological Databases

Similar presentations


Presentation on theme: "Mining, Indexing and Searching Graphs in Biological Databases"— Presentation transcript:

1 Mining, Indexing and Searching Graphs in Biological Databases
Jiawei Han Department of Computer Science & Institute of Genomic Biology University of Illinois at Urbana-Champaign In collaboration with Xifeng Yan (UIUC Ph.D.’06 and IBM Watson), Philip S. Yu (IBM Watson), et al. (Core material for tutorials at ICDM’05 & KDD’06) November 20, 2018

2 References: “Covering” Five Papers
X. Yan and J. Han, gSpan: Graph-Based Substructure Pattern Mining, Proc Int. Conf. on Data Mining (ICDM'02) (Google Scholar: ranked #3 out of 83,800 entries on “Graph Pattern Mining” on November 20, 2018) X. Yan and J. Han, CloseGraph: Mining Closed Frequent Graph Patterns, Proc ACM SIGKDD Int. Conf. Knowledge Discovery and Data Mining (KDD'03) (Google Scholar: ranked #1 out of 83,800 entries on “Graph Pattern Mining” on November 20, 2018November 20, 2018) X. Yan, P. S. Yu, and J. Han, Graph Indexing: A Frequent Structure-based Approach, Proc ACM-SIGMOD Int. Conf. Management of Data (SIGMOD'04) (invited to TODS and published 2005, Google Scholar: ranked #1 out of 39,300 entries on “Graph Indexing” on November 20, 2018) X. Yan, P. S. Yu, and J. Han, “Substructure Similarity Search in Graph Databases”, Proc ACM-SIGMOD Int. Conf. on Management of Data (SIGMOD'05) (invited and published in ACM TODS’06) H. Hu, X. Yan, H. Yu, J. Han and X. J. Zhou, “Mining Coherent Dense Subgraphs across Massive Biological Networks for Functional Discovery”, Proc Int. Conf. Intelligent Systems for Molecular Biology (ISMB'05) (Also in Bioinformatics, 2005) November 20, 2018

3 November 20, 2018

4 Graph, Graph, Everywhere
from H. Jeong et al Nature 411, 41 (2001) Aspirin Yeast protein interaction network November 20, 2018 Co-author network An Internet Web

5 Why Graph Mining and Searching?
Graphs are ubiquitous Chemical compounds (Cheminformatics) Protein structures, biological pathways/networks (Bioinformactics) Program control flow, traffic flow, and workflow analysis XML databases, Web, and social network analysis Graph is a general model Trees, lattices, sequences, and items are degenerated graphs Diversity of graphs Directed vs. undirected, labeled vs. unlabeled (edges & vertices), weighted, with angles & geometry (topological vs. 2-D/3-D) Complexity of algorithms: many problems are of high complexity! November 20, 2018

6 Outline Mining frequent graph patterns Graph indexing methods
Similairty search in graph databases Biological network analysis Some recent progress on graph mining November 20, 2018

7 Graph Pattern Mining Frequent subgraphs
A (sub)graph is frequent if its support (occurrence frequency) in a given dataset is no less than a minimum support threshold Applications of graph pattern mining Mining biochemical structures Program control flow analysis Mining XML structures or Web communities Building blocks for graph classification, clustering, comparison, and correlation analysis November 20, 2018

8 Example: Frequent Subgraphs
Graph Dataset (A) (B) (C) Frequent Patterns (min support is 2) (1) (2) November 20, 2018

9 Frequent Subgraph Mining Approaches
Apriori-based approach AGM/AcGM: Inokuchi, et al. (PKDD’00) FSG: Kuramochi and Karypis (ICDM’01) PATH: Vanetik and Gudes (ICDM’02, ICDM’04) FFSM: Huan, et al. (ICDM’03) Pattern growth-based approach MoFa, Borgelt and Berthold (ICDM’02) gSpan: Yan and Han (ICDM’02) Gaston: Nijssen and Kok (KDD’04) November 20, 2018

10 Properties of Graph Mining Algorithms
Search order breadth vs. depth Generation of candidate subgraphs apriori vs. pattern growth Elimination of duplicate subgraphs passive vs. active Support calculation embedding store or not Discover order of patterns path  tree  graph November 20, 2018

11 Apriori-Based Approach
(k+1)-edge k-edge G1 G G2 G’ G’’ Gn JOIN November 20, 2018

12 Pattern Growth-Based Span and Pruning
1-edge ... 2-edge ... ... If redundant, prune it! ... 3-edge G1 1: Show that G1=G2  Minimum DFS code of G1 should be equal to Minimum DFS code of G2, thus by calculating the minimum DFS code, we can tell whether two graphs are the same. 2: MOST IMPORTANT RESULT !!!: if two graphs G1 and G2 are the same, assume G1 is discovered first, then any graph which is right most extension of graph G2 must be discovered before. Therefore, we need not check the whole branch under G2 in the search tree. (This is the efficiency comes from) 3: ANOTHER MOST IMPORTANT RESULT: G2 is presented as a code C, thus we need not even calculate Min(G2), we need only to know whether C is equal to Min(G2). If not, we know G2 must be discovered before. Very critical for the efficiency. ... ... PRUNED ... November 20, 2018

13 gSpan (Yan and Han ICDM’02)
Right-Most Extension Theorem: Completeness The Enumeration of Graphs using Right-most Extension is COMPLETE November 20, 2018

14 DFS Code Flatten a graph into a sequence using depth first search
e2: (2,0) e4: (3,1) e1: (1,2) 1 2 e5: (2,4) e3: (2,3) 4 3 November 20, 2018

15 Graph Pattern Explosion Problem
If a graph is frequent, all of its subgraphs are frequent ─ the Apriori property An n-edge frequent graph may have 2n subgraphs Among 422 chemical compounds which are confirmed to be active in an AIDS antiviral screen dataset, there are 1,000,000 frequent graph patterns if the minimum support is 5% November 20, 2018

16 Closed Frequent Graphs
Motivation: Handling graph pattern explosion problem Closed frequent graph A frequent graph G is closed if there exists no supergraph of G that carries the same support as G If some of G’s subgraphs have the same support, it is unnecessary to output these subgraphs (nonclosed graphs) Lossless compression: still ensures that the mining result is complete November 20, 2018

17 CLOSEGRAPH (Yan & Han, KDD’03)
A Pattern-Growth Approach (k+1)-edge At what condition, can we stop searching their children i.e., early termination? G1 k-edge G2 G If G and G’ are frequent, G is a subgraph of G’. If in any part of the graph in the dataset where G occurs, G’ also occurs, then we need not grow G, since none of G’s children will be closed except those of G’. Apriori: Step1: Join two k-1 edge graphs (these two graphs share a same k-2 edge subgraph) to generate a k-edge graph Step2: Join the tid-list of these two k-1 edge graphs, then see whether its count is larger than the minimum support Step3: Check all k-1 subgraph of this k-edge graph to see whether all of them are frequent Step4: After G successfully pass Step1-3, do support computation of G in the graph dataset, See whether it is really frequent. gSpan: Step1: Right-most extend a k-1 edge graph to several k edge graphs. Step2: Enumerate the occurrence of this k-1 edge graph in the graph dataset, meanwhile, counting these k edge graphs. Step3: Output those k edge graphs whose support is larger than the minimum support. Pros: 1: gSpan avoid the costly candidate generation and testing some infrequent subgraphs. 2: No complicated graph operations, like joining two graphs and calculating its k-1 subgraphs. 3. gSpan is very simple The key is how to do right most extension efficiently in graph. We invented DFS code for graph. Gn November 20, 2018

18 Experimental Result The AIDS antiviral screen compound dataset from NCI/NIH The dataset contains 43,905 chemical compounds Among these 43,905 compounds, 423 of them belongs to CA, 1081 are of CM, and the remaining are in class CI November 20, 2018

19 Discovered Patterns 20% 10% 5% November 20, 2018

20 Number of Patterns: Frequent vs. Closed
CA Number of patterns minimum support November 20, 2018

21 Runtime: Frequent vs. Closed
CA runtime (sec) minimum support November 20, 2018

22 Do the Odds Beat the Curse of Complexity?
Potentially exponential number of frequent patterns The worst case complexty vs. the expected probability Ex.: Suppose Walmart has 104 kinds of products The chance to pick up one product 10-4 The chance to pick up a particular set of 10 products: 10-40 What is the chance this particular set of 10 products to be frequent 103 times in 109 transactions? Have we solved the NP-hard problem of subgraph isomorphism testing? No. But the real graphs in bio/chemistry is not so bad A carbon has only 4 bounds and most proteins in a network have distinct labels November 20, 2018

23 Outline Mining frequent graph patterns Graph indexing methods
Similairty search in graph databases Biological network analysis Some recent progress on graph mining November 20, 2018

24 Graph Search: Querying Graph Databases
Given a graph database and a query graph, find all graphs containing this query graph query graph graph database November 20, 2018

25 Scalability Issue Sequential scan Disk I/O
Subgraph isomorphism testing An indexing mechanism is needed DayLight: Daylight.com (commercial) GraphGrep: Dennis Shasha, et al. PODS'02 Grace: Srinath Srinivasa, et al. ICDE'03 (a) (b) (c) Query graph Sample database Two issues have to be solved: Disk I/O for false positive and subgraph isomorphism testing for false positive. Therefore, an indexing mechanism is needed. The list just shows a partial list of the previous research. DayLight system is a commercial product targeted for chemical information system. The following is copied from Once a large graph database is formed, how can we do database management, and further data warehouse and data mining. 2. GraphGrep uses a path-based approach to do graph indexing. Its implementation is available publicly. 3. Grace develops a hierarchical vector space method. It tries to extract features from the graphs in a hierarchical way. DayLight System “At Infinity Pharmaceuticals we use the Daylight toolkits for in-house development and have implemented DayCart (as a central component of our chemical registration system. This system has been fully integrated within all aspects of our drug discovery processes and software applications. We have successfully "pushed" the registration process to the chemists; they can quickly and easily register compounds within the context of their electronic notebooks. This has provided a significant enhancement to both the efficiency of data capture and the quality of the data within the chemistry database(s). DayCart integrates well into our overall application and data environment providing a flexible chemical data model which facilitates a data architecture that is tailored to Infinity's processes and scientific approach while at the same time allows speed and scalability. The benefit delivered to Infinity is measured in the strong capability to rapidly and efficiently develop intellectual property and access corporate knowledge effectively. “ November 20, 2018

26 Indexing Strategy Remarks
Query graph (Q) Graph (G) If graph G contains query graph Q, G should contain any substructure of Q Substructure Remarks Index substructures of a query graph to prune graphs that do not contain these substructures Our work, also with all the previous work follows this indexing strategy. November 20, 2018

27 Framework Two steps in processing graph queries
Step 1. Index Construction Enumerate structures in the graph database, build an inverted index between structures and graphs Step 2. Query Processing Enumerate structures in the query graph Calculate the candidate graphs containing these structures Prune the false positive answers by performing subgraph isomorphism test November 20, 2018

28 Cost Analysis Query Response Time Disk I/O time
Isomorphism testing time Graph index access time T_io is the time of fetching a graph from a disk, which may result in one disk block IO. I did not show it in the paper. If the graph dataset can not be held in the main memory, then we need also count the IO time Our goal is to minimize |Cq| Size of candidate answer set Remark: make |Cq| as small as possible November 20, 2018

29 Path-Based Approach Sample database (a) (b) (c) Paths
0-length: C, O, N, S 1-length: C-C, C-O, C-N, C-S, N-N, S-O 2-length: C-C-C, C-O-C, C-N-C, ... 3-length: ... Built an inverted index between paths and graphs November 20, 2018

30 Problems of Path-Based Approach
Sample database (a) (b) (c) Query graph Only graph (c) contains this query graph. However, if we only index paths: C, C-C, C-C-C, C-C-C-C, we cannot prune graph (a) and (b). Only graph (C) contains the query graph. Unfortunately, if we use path only, we cannot prune graphs (a) and (b) The question is why we use path. The reason is path is very simple, easier to manipulate. November 20, 2018

31 gIndex: Indexing Graphs by Data Mining
Our methodology on graph index: Identify frequent structures in the database, the frequent structures are subgraphs that appear quite often in the graph database Prune redundant frequent structures to maintain a small set of discriminative structures Create an inverted index between discriminative frequent structures and graphs in the database The reason of doing frequent graphs is just because the number of frequent graphs is much less than the number of all subgraphs in the graph database although this number is still too large for indexing. November 20, 2018

32 IDEAS: Indexing with Two Constraints
discriminative (~103) frequent (~105) structure (>106) November 20, 2018

33 Why Discriminative Subgraphs?
Sample database (a) (b) (c) All graphs contain structures: C, C-C, C-C-C Why bother indexing these redundant frequent structures? Only index structures that provide more information than existing structures November 20, 2018

34 Discriminative Structures
Pinpoint the most useful frequent structures Given a set of structures f1, f2, …, fn and a new structure x , we measure the extra indexing power provided by x, When P is small enough, x is a discriminative structure and should be included in the index Index discriminative frequent structures only Reduce the index size by an order of magnitude November 20, 2018

35 Why Frequent Structures?
We cannot index (or even search) all of substructures Large structures will likely be indexed well by their substructures Size-increasing support threshold minimum support threshold support size November 20, 2018

36 Experimental Setting The AIDS antiviral screen compound dataset from NCI/NIH, containing 43,905 chemical compounds Query graphs are randomly extracted from the dataset. GraphGrep: maximum length (edges) of paths is set at 10 gIndex: maximum size (edges) of structures is set at 10 The fingerprint set size is set as large as possible until the program takes 1G memory. Here it is 10K. Other parameter setting is provided in the paper November 20, 2018

37 Experiments: Index Size
# OF FEATURES Remarks: It shows the raw feature set of path-based approach and structure-based approach. Of course, we can compress the paths into an arbitrary size of index with performance loss. In our experiment setting, we do not compress the path-based index in order to illustrate the best performance it can achieve. It is also possible to apply discriminative filtering on paths. We have not done experiments on it. The number of frequent structures or discriminative frequent structures does not change over the different size of databases. DATABASE SIZE November 20, 2018

38 Experiments: Answer Set Size
# OF CANDIDATES 1. That is, the number of graphs containing the query graph is less than 50. 2. Query size are about the number of edges in a query. 3. Answer set size is the number of candidate graphs that may contain the query graph. 4. The actual match is the lower bound that all algorithms can achieve. That is, it is the actual number of graphs that really contain the query graph. 5. The performance of queries whose support is less than 50 QUERY SIZE November 20, 2018

39 Experiments: Incremental Maintenance
The fact that the number of frequent structures does not change over the different size of databases inspires us to design an algorithm which construct the index incrementally. 2. The black line shows the performance of indexing from the scratch, for example, from a database with 2,000 graphs, 4,000 graphs, 6,000 graphs, 8,000 graphs and 10,000 graphs. 3. The red line shows the performance of incremental indexing, for example, we first built an index for 2,000 graphs, then we add another 2,000 graphs into the database, update the indexing of the original discriminative frequent graphs, then we add another 2,000, … and so on Frequent structures are stable to database updating Index can be built based on a small portion of a graph database, but be used for the whole database November 20, 2018

40 Outline Mining frequent graph patterns Graph indexing methods
Similairty search in graph databases Biological network analysis Some recent progress on graph mining November 20, 2018

41 Structure Similarity Search
CHEMICAL COMPOUNDS (a) caffeine (b) diurobromine (c) viagra QUERY GRAPH November 20, 2018

42 Some “Straightforward” Methods
Method1: Directly compute the similarity between the graphs in the DB and the query graph Sequential scan Subgraph similarity computation Method 2: Form a set of subgraph queries from the original query graph and use the exact subgraph search Costly: If we allow 3 edges to be missed in a 20-edge query graph, it may generate 1,140 subgraphs November 20, 2018

43 Index: Precise vs. Approximate Search
Precise Search Use frequent patterns as indexing features Select features in the database space based on their selectivity Build the index Approximate Search Hard to build indices covering similar subgraphs—explosive number of subgraphs in databases Idea: (1) keep the index structure (2) select features in the query space November 20, 2018

44 Substructure Similarity Measure
Query relaxation measure The number of edges that can be relabeled or missed; but the position of these edges are not fixed QUERY GRAPH November 20, 2018

45 Substructure Similarity Measure
Feature-based similarity measure Each graph is represented as a feature vector X = {x1, x2, …, xn} The similarity is defined by the distance of their corresponding vectors Advantages Easy to index Fast Rough measure November 20, 2018

46 Intuition: Feature-Based Similarity Search
Graph (G1) If graph G contains the major part of a query graph Q, G should share a number of common features with Q Query (Q) Graph (G2) Given a relaxation ratio, calculate the maximal number of features that can be missed ! Suppose we can delete any red edge in the query graph Q. Substructure At least one of them should be contained November 20, 2018

47 Feature-Graph Matrix graphs in database G1 G2 G3 G4 G5 f1 1 f2 f3 f4 f5 features Assume a query graph has 5 features and at most 2 features to miss due to the relaxation threshold November 20, 2018

48 Edge Relaxation – Feature Misses
If we allow k edges to be relaxed, J is the maximum number of features to be hit by k edges—it becomes the maximum coverage problem NP-complete A greedy algorithm exists We design a heuristic to refine the bound of feature misses November 20, 2018

49 Query Processing Framework
Three steps in processing approximate graph queries Step 1. Index Construction Select small structures as features in a graph database, and build the feature-graph matrix between the features and the graphs in the database November 20, 2018

50 Framework (cont.) Step 2. Feature Miss Estimation
Determine the indexed features belonging to the query graph Calculate the upper bound of the number of features that can be missed for an approximate matching, denoted by J On the query graph, not the graph database November 20, 2018

51 Framework (cont.) Step 3. Query Processing
Use the feature-graph matrix to calculate the difference in the number of features between graph G and query Q, FG – FQ If FG – FQ > J, discard G. The remaining graphs constitute a candidate answer set November 20, 2018

52 Performance Study Database
Chemical compounds of Anti-Aids Drug from NCI/NIH, randomly select 10,000 compounds Query Randomly select 30 graphs with 16 and 20 edges as query graphs Competitive algorithms Grafil: Graph Filter—our algorithm Edge: use edges only All: use all the features November 20, 2018

53 Comparison of the Three Algorithms
# of candidates When the number of edge relaxations increases, the performance of Edge and Grafil will be similar, as expected. Queries with 16 edges edge relaxation November 20, 2018

54 Outline Mining frequent graph patterns Graph indexing methods
Similairty search in graph databases Biological network analysis Some recent progress on graph mining November 20, 2018

55 Biological Networks Protein-protein interaction network
Metabolic network Transcriptional regulatory network Co-expression network Genetic Interaction network November 20, 2018

56 Data Mining Across Multiple Networks
f f a b c d e f g h i j k j j a a h c c h e e b b k k d d i g i g f f f j j a j a h a h c h c c e e e b b k b k k d d g i d i g i g November 20, 2018

57 Data Mining Across Multiple Networks
b d e f g h i j k c a b c d e f g h i j k a b c d e f g h i j k a b c d e f g h i j k f f j a j c h a h c e e b k b k d g i d g i November 20, 2018

58 Identify Frequent Co-expression Clusters across Multiple Microarray Data Sets
b c d e f g h i j k . a b c d e f g h i j k . c1 c2… cm g … .2 g … .4 g … .2 g … .4 g … .1 g … .5 g … .8 g … .3 . November 20, 2018

59 Our Solution We develop a novel algorithm, called CODENSE, to mine frequent coherent dense subgraphs. The target subgraphs have three characteristics: All edges occur in >= k graphs (frequency) All edges should exhibit correlated occurrences in the given graph set (coherency) The subgraph is dense, where density d is higher than a threshold  and d=2m/(n(n-1)) (density) m: #edges, n: #nodes November 20, 2018

60 CODENSE: Mine Coherent Dense Subgraphs
(1) Builds a summary graph by eliminating infrequent edges f a b c d e g h i G3 G2 G6 G5 G4 f a b d e g h i c G1 a b d e g h i c f summary graph Ĝ November 20, 2018

61 CODENSE: Mine Coherent Dense Subgraphs
(2) Identify dense subgraphs of the summary graph a b d e g h i c f summary graph Ĝ Sub(Ĝ) Step 2 MODES Observation: If a frequent subgraph is dense, it must be a dense subgraph in the summary graph. However, the reverse is not true. November 20, 2018

62 CODENSE: Mine Coherent Dense Subgraphs
(3) Construct the edge occurrence profiles for each dense summary subgraph e g h i c f Sub(Ĝ) Step 3 1 e-f c-i c-h c-f c-e G6 G5 G4 G3 G2 G1 E edge occurrence profiles November 20, 2018

63 CODENSE: Mine Coherent Dense Subgraphs
(4) builds a second-order graph for each dense summary subgraph 1 e-f c-i c-h c-f c-e G6 G5 G4 G3 G2 G1 E edge occurrence profiles Step 4 e-h f-h e-i e-g g-i h-i second-order graph S g-h f-i November 20, 2018

64 CODENSE: Mine Coherent Dense Subgraphs
(5) Identify dense subgraphs of the second-order graph c-f c-h c-e e-h e-f f-h c-i e-i e-g g-i h-i second-order graph S g-h f-i Step 4 Sub(S) Observation: If a subgraph is coherent (its edges show high correlation in their occurrences across a graph set), then its 2nd-order graph must be dense November 20, 2018

65 CODENSE: Mine Coherent Dense Subgraphs
(6) Identify the coherent dense subgraphs c-f c-h c-e e-h e-f f-h e-i e-g g-i h-i Sub(S) g-h Step 5 c e f h g i Sub(G) November 20, 2018

66 CODENSE: Mine Coherent Dense Subgraphs
1 e-f c-i c-h c-f c-e G6 G5 G4 G3 G2 G1 E edge occurrence profiles c e f h g i Step 4 Step 5 Sub(G) a b d e-h f-h e-i e-g g-i h-i second-order graph S g-h f-i Step 1 Step 3 summary graph Ĝ Sub(Ĝ) Step 2 Sub(S) Step 6 MODES Add/Cut Restore G and MODES November 20, 2018

67 Applying CoDense to 39 Yeast Microarray Data Sets
f f a j h a h j c1 c2… cm g … .2 g … .4 c c e e b k b k d g i d g i f f c1 c2… cm g … .2 g … .4 a e j a j c h c h e b b k k d g i d g i c1 c2… cm g … .1 g … .5 f f a j h a j c c h e e b b k k d g i d g i f f c1 c2… cm g … .8 g … .3 a j a j h c h c e e b b k k d g i d g i November 20, 2018

68 Discovery of New Genes Based on Similar Genes
ATP12 MRPL38 MRPL37 MRPL39 FMC1 MRPS18 MRPL32 ACN9 MRPL51 MRP49 YDR115W PHB1 PET100 ATP17 November 20, 2018

69 Network of Known Similar Genes
ATP17 ATP12 MRPL38 MRPL39 FMC1 MRPS18 MRPL32 ACN9 MRPL51 MRP49 YDR115W PHB1 PET100 Brown: YDR115W, FMC1, ATP12, MRPL37, MRPS18 GO: (protein metabolism; pvalue = ) November 20, 2018

70 Network Involved in the New Genes
YDR115W MRP49 PHB1 MRPL51 PET100 ATP12 ATP17 MRPL37 MRPL38 ACN9 MRPL32 MRPL39 MRPS18 FMC1 Red:PHB1,ATP17,MRPL51,MRPL39, MRPL49, MRPL51,PET100 GO: (generation of precursor metabolites and energy; pvalue= ) November 20, 2018

71 Outline Mining frequent graph patterns Graph indexing methods
Similairty search in graph databases Biological network analysis Some recent progress on graph mining November 20, 2018

72 Recent Developments: Graph Mining
Colossal pattern mining: F. Zhu, X. Yan, J. Han, P. S. Yu, and H. Cheng, “Mining Colossal Frequent Patterns by Core Pattern Fusion”, in Proc Int. Conf. on Data Engineering (ICDE'07), April 2007 (Best student paper award) Constraint-based mining: F. Zhu, X. Yan, J. Han, and P. S. Yu, “gPrune: A Constraint Pushing Framework for Graph Pattern Mining”, in Proc Pacific-Asia Conf. on Knowledge Discovery and Data Mining (PAKDD'07), May 2007 (Best student paper award) Approximate graph mining: C. Chen, X. Yan, F. Zhu, and J. Han, “gApprox: Mining Frequent Approximate Patterns from a Massive Network”, Proc Int. Conf. on Data Mining (ICDM'07), Oct. 2007 November 20, 2018

73 Recent Developments: Graph Mining
Graph-containment indexing: C. Chen, X. Yan, P. S. Yu, J. Han, D. Zhang, and X. Gu, “Towards Graph Containment Search and Indexing”, in Proc Int. Conf. on Very Large Data Bases (VLDB'07), Vienna, Austria, Sept. 2007 Pattern-based classification: H. Cheng, X. Yan, J. Han, and C.-W. Hsu, “Discriminative Frequent Pattern Analysis for Effective Classification”, in Proc Int. Conf. on Data Engineering (ICDE'07), Istanbul, Turkey, April 2007 DDPMine: H. Cheng, X. Yan, J. Han, and P. S. Yu, "Direct Discriminative Pattern Mining for Effective Classification", Proc Int. Conf. on Data Engineering (ICDE'08), Cancun, Mexico, April 2008 November 20, 2018

74 Discriminative Frequent Pattern Analysis for Effective Classification [ICDE’07]
November 20, 2018

75 Conclusions Graph mining has wide applications
Frequent and closed subgraph mining methods gSpan and CloseGraph: pattern-growth depth-first search approach Graph indexing techniques: Frequent and discirminative subgraphs as indexing fatures Similairty search in graph databases Indexing and approximate matching help similar subgraph search Biological network analysis Mining coherent, dense, multiple biological networks Many new developments along the line of graph pattern mining November 20, 2018

76 Thanks and Questions November 20, 2018


Download ppt "Mining, Indexing and Searching Graphs in Biological Databases"

Similar presentations


Ads by Google