September 19, 2018.

Slides:



Advertisements
Similar presentations
Graph Mining Laks V.S. Lakshmanan
Advertisements

CrowdER - Crowdsourcing Entity Resolution
 Data mining has emerged as a critical tool for knowledge discovery in large data sets. It has been extensively used to analyze business, financial,
Mining Compressed Frequent- Pattern Sets Dong Xin, Jiawei Han, Xifeng Yan, Hong Cheng Department of Computer Science University of Illinois at Urbana-Champaign.
gSpan: Graph-based substructure pattern mining
Query Optimization of Frequent Itemset Mining on Multiple Databases Mining on Multiple Databases David Fuhry Department of Computer Science Kent State.
Lecture 24 Coping with NPC and Unsolvable problems. When a problem is unsolvable, that's generally very bad news: it means there is no general algorithm.
Mining Top-K Large Structural Patterns in a Massive Network Feida Zhu 1, Qiang Qu 2, David Lo 1, Xifeng Yan 3, Jiawei Han 4, and Philip S. Yu 5 1 Singapore.
1 Efficient Subgraph Search over Large Uncertain Graphs Ye Yuan 1, Guoren Wang 1, Haixun Wang 2, Lei Chen 3 1. Northeastern University, China 2. Microsoft.
IGraph: A Framework for Comparisons of Disk-Based Graph Indexing Techniques Jeffrey Xu Yu et. al. VLDB ‘10 Presented by Tao Yu.
33 rd International Conference on Very Large Data Bases, Sep. 2007, Vienna Towards Graph Containment Search and Indexing Chen Chen 1, Xifeng Yan 2, Philip.
Mining Graphs with Constrains on Symmetry and Diameter Natalia Vanetik Deutsche Telecom Laboratories at Ben-Gurion University IWGD10 workshop July 14th,
FAST FREQUENT FREE TREE MINING IN GRAPH DATABASES Marko Lazić 3335/2011 Department of Computer Engineering and Computer Science,
Slides are modified from Jiawei Han & Micheline Kamber
Graph Indexing Techniques Seoul National University IDB Lab. Kisung Kim
Subgraph Containment Search Dayu Yuan The Pennsylvania State University 1© Dayu Yuan9/7/2015.
Graph Indexing: A Frequent Structure­ based Approach Authors:Xifeng Yan†, Philip S‡. Yu, Jiawei Han†
Mining Frequent Itemsets with Constraints Takeaki Uno Takeaki Uno National Institute of Informatics, JAPAN Nov/2005 FJWCP.
Towards Robust Indexing for Ranked Queries Dong Xin, Chen Chen, Jiawei Han Department of Computer Science University of Illinois at Urbana-Champaign VLDB.
On Graph Query Optimization in Large Networks Alice Leung ICS 624 4/14/2011.
Xiangnan Kong,Philip S. Yu Department of Computer Science University of Illinois at Chicago KDD 2010.
1 Frequent Subgraph Mining Jianlin Feng School of Software SUN YAT-SEN UNIVERSITY June 12, 2010.
Graph Indexing: A Frequent Structure- based Approach Alicia Cosenza November 26 th, 2007.
Xiangnan Kong,Philip S. Yu Multi-Label Feature Selection for Graph Classification Department of Computer Science University of Illinois at Chicago.
Frequent Subgraph Discovery Michihiro Kuramochi and George Karypis ICDM 2001.
University at BuffaloThe State University of New York Lei Shi Department of Computer Science and Engineering State University of New York at Buffalo Frequent.
Mining Top-K Large Structural Patterns in a Massive Network Feida Zhu 1, Qiang Qu 2, David Lo 1, Xifeng Yan 3, Jiawei Han 4, and Philip S. Yu 5 1 Singapore.
1 Efficient Discovery of Frequent Approximate Sequential Patterns Feida Zhu, Xifeng Yan, Jiawei Han, Philip S. Yu ICDM 2007.
Graph Indexing: A Frequent Structure-­based Approach 指導老師:曾新穆 教授 組員:李彥寬、洪世敏、丁鏘巽、 黃冠霖、詹博丞 日期: 2013/11/ /11/141.
Graph Indexing From managing and mining graph data.
Data Mining: Principles and Algorithms Graph Pattern Mining Jiawei Han Department of Computer Science University of Illinois at Urbana-Champaign
1 Substructure Similarity Search in Graph Databases R 陳芃安.
Ning Jin, Wei Wang ICDE 2011 LTS: Discriminative Subgraph Mining by Learning from Search History.
Xifeng Yan Philip S. Yu Jiawei Han SIGMOD 2005 Substructure Similarity Search in Graph Databases.
Gspan: Graph-based Substructure Pattern Mining
Cohesive Subgraph Computation over Large Graphs
Finding Dense and Connected Subgraphs in Dual Networks
Outline Introduction State-of-the-art solutions
Semi-Supervised Clustering
Subject Name: File Structures
Optimizing Parallel Algorithms for All Pairs Similarity Search
Record Storage, File Organization, and Indexes
Mining in Graphs and Complex Structures
Database Management System
Rule Induction for Classification Using
A paper on Join Synopses for Approximate Query Answering
RE-Tree: An Efficient Index Structure for Regular Expressions
Physical Database Design for Relational Databases Step 3 – Step 8
Haim Kaplan and Uri Zwick
Mining Frequent Subgraphs
Evaluation of Relational Operations
Jiawei Han Department of Computer Science
Graph Search with Indexing
Data Mining: Concepts and Techniques — Chapter 9 — 9.1. Graph mining
TT-Join: Efficient Set Containment Join
Database Applications (15-415) DBMS Internals- Part III Lecture 15, March 11, 2018 Mohammad Hammoud.
On Efficient Graph Substructure Selection
Mining, Indexing and Searching Graphs in Biological Databases
Graph Database Mining and Its Applications
Mining Frequent Subgraphs
Mining and Searching Graphs in Biological Databases
Lectures on Graph Algorithms: searching, testing and sorting
Discriminative Frequent Pattern Analysis for Effective Classification
Efficient Subgraph Similarity All-Matching
Slides are modified from Jiawei Han & Micheline Kamber
Mining Frequent Subgraphs
Jongik Kim1, Dong-Hoon Choi2, and Chen Li3
Chapter 17 Designing Databases
Important Problem Types and Fundamental Data Structures
Approximate Graph Mining with Label Costs
Presentation transcript:

September 19, 2018

Mining, Indexing & Searching Graphs in Large Data Sets Jiawei Han Department of Computer Science, University of Illinois at Urbana-Champaign www.cs.uiuc.edu/~hanj In collaboration with Xifeng Yan (IBM Watson), Philip S. Yu (IBM Watson), Feida Zhu (UIUC), Chen Chen (UIUC) September 19, 2018

Research Papers Covered in this Talk X. Yan and J. Han, gSpan: Graph-Based Substructure Pattern Mining, ICDM'02 X. Yan and J. Han, CloseGraph: Mining Closed Frequent Graph Patterns, KDD'03 X. Yan, P. S. Yu, and J. Han, Graph Indexing: A Frequent Structure-based Approach, SIGMOD'04 (also in TODS’05, Google Scholar: ranked #1 out of 63,300 entries on “Graph Indexing”) X. Yan, P. S. Yu, and J. Han, “Substructure Similarity Search in Graph Databases”, SIGMOD'05 (also in TODS’06) F. Zhu, X. Yan, J. Han, and P. S. Yu, “gPrune: A Constraint Pushing Framework for Graph Pattern Mining”, PAKDD'07 (Best Student Paper Award) C. Chen, X. Yan, P. S. Yu, J. Han, D. Zhang, and X. Gu, “Towards Graph Containment Search and Indexing”, VLDB'07, Vienna, Austria, Sept. 2007 September 19, 2018

Graph, Graph, Everywhere from H. Jeong et al Nature 411, 41 (2001) Aspirin Yeast protein interaction network September 19, 2018 Co-author network An Internet Web

Why Graph Mining and Searching? Graphs are ubiquitous Chemical compounds (Cheminformatics) Protein structures, biological pathways/networks (Bioinformactics) Program control flow, traffic flow, and workflow analysis XML databases, Web, and social network analysis Graph is a general model Trees, lattices, sequences, and items are degenerated graphs Diversity of graphs Directed vs. undirected, labeled vs. unlabeled (edges & vertices), weighted, with angles & geometry (topological vs. 2-D/3-D) Complexity of algorithms: many problems are of high complexity! September 19, 2018

Outline Mining frequent graph patterns Constraint-based graph pattern mining Graph indexing methods Similairty search in graph databases Graph containment search and indexing September 19, 2018

Graph Pattern Mining Frequent subgraphs A (sub)graph is frequent if its support (occurrence frequency) in a given dataset is no less than a minimum support threshold Applications of graph pattern mining Mining biochemical structures Program control flow analysis Mining XML structures or Web communities Building blocks for graph classification, clustering, comparison, and correlation analysis September 19, 2018

Example: Frequent Subgraphs Graph Dataset (A) (B) (C) Frequent Patterns (min support is 2) (1) (2) September 19, 2018

Frequent Subgraph Mining Approaches Apriori-based approach AGM/AcGM: Inokuchi, et al. (PKDD’00) FSG: Kuramochi and Karypis (ICDM’01) PATH: Vanetik and Gudes (ICDM’02, ICDM’04) FFSM: Huan, et al. (ICDM’03) Pattern growth-based approach MoFa, Borgelt and Berthold (ICDM’02) gSpan: Yan and Han (ICDM’02) Gaston: Nijssen and Kok (KDD’04) September 19, 2018

Properties of Graph Mining Algorithms Search order breadth vs. depth Generation of candidate subgraphs apriori vs. pattern growth Elimination of duplicate subgraphs passive vs. active Support calculation embedding store or not Discover order of patterns path  tree  graph September 19, 2018

Apriori-Based Approach (k+1)-edge k-edge G1 G G2 G’ … G’’ Gn JOIN September 19, 2018

Pattern Growth-Based Span and Pruning 1-edge ... 2-edge ... ... If redundant, prune it! ... 3-edge G1 1: Show that G1=G2  Minimum DFS code of G1 should be equal to Minimum DFS code of G2, thus by calculating the minimum DFS code, we can tell whether two graphs are the same. 2: MOST IMPORTANT RESULT !!!: if two graphs G1 and G2 are the same, assume G1 is discovered first, then any graph which is right most extension of graph G2 must be discovered before. Therefore, we need not check the whole branch under G2 in the search tree. (This is the efficiency comes from) 3: ANOTHER MOST IMPORTANT RESULT: G2 is presented as a code C, thus we need not even calculate Min(G2), we need only to know whether C is equal to Min(G2). If not, we know G2 must be discovered before. Very critical for the efficiency. ... ... PRUNED ... September 19, 2018

gSpan (Yan and Han ICDM’02) Right-Most Extension Theorem: Completeness The Enumeration of Graphs using Right-most Extension is COMPLETE September 19, 2018

DFS Code Flatten a graph into a sequence using depth first search e2: (2,0) e4: (3,1) e1: (1,2) 1 2 e5: (2,4) e3: (2,3) 4 3 September 19, 2018

Graph Pattern Explosion Problem If a graph is frequent, all of its subgraphs are frequent ─ the Apriori property An n-edge frequent graph may have 2n subgraphs Among 422 chemical compounds which are confirmed to be active in an AIDS antiviral screen dataset, there are 1,000,000 frequent graph patterns if the minimum support is 5% September 19, 2018

Closed Frequent Graphs Motivation: Handling graph pattern explosion problem Closed frequent graph A frequent graph G is closed if there exists no supergraph of G that carries the same support as G If some of G’s subgraphs have the same support, it is unnecessary to output these subgraphs (nonclosed graphs) Lossless compression: still ensures that the mining result is complete September 19, 2018

CLOSEGRAPH (Yan & Han, KDD’03) A Pattern-Growth Approach (k+1)-edge At what condition, can we stop searching their children i.e., early termination? G1 k-edge G2 G If G and G’ are frequent, G is a subgraph of G’. If in any part of the graph in the dataset where G occurs, G’ also occurs, then we need not grow G, since none of G’s children will be closed except those of G’. Apriori: Step1: Join two k-1 edge graphs (these two graphs share a same k-2 edge subgraph) to generate a k-edge graph Step2: Join the tid-list of these two k-1 edge graphs, then see whether its count is larger than the minimum support Step3: Check all k-1 subgraph of this k-edge graph to see whether all of them are frequent Step4: After G successfully pass Step1-3, do support computation of G in the graph dataset, See whether it is really frequent. gSpan: Step1: Right-most extend a k-1 edge graph to several k edge graphs. Step2: Enumerate the occurrence of this k-1 edge graph in the graph dataset, meanwhile, counting these k edge graphs. Step3: Output those k edge graphs whose support is larger than the minimum support. Pros: 1: gSpan avoid the costly candidate generation and testing some infrequent subgraphs. 2: No complicated graph operations, like joining two graphs and calculating its k-1 subgraphs. 3. gSpan is very simple The key is how to do right most extension efficiently in graph. We invented DFS code for graph. … Gn September 19, 2018

Experimental Result The AIDS antiviral screen compound dataset from NCI/NIH The dataset contains 43,905 chemical compounds Among these 43,905 compounds, 423 of them belongs to CA, 1081 are of CM, and the remaining are in class CI September 19, 2018

Discovered Patterns 20% 10% 5% September 19, 2018

Number of Patterns: Frequent vs. Closed CA Number of patterns minimum support September 19, 2018

Runtime: Frequent vs. Closed CA runtime (sec) minimum support September 19, 2018

Outline Mining frequent graph patterns Constraint-based graph pattern mining Graph indexing methods Similairty search in graph databases Graph containment search and indexing September 19, 2018

Constraint-Based Graph Pattern Mining F. Zhu, X. Yan, J. Han, and P. S. Yu, “gPrune: A Constraint Pushing Framework for Graph Pattern Mining”, PAKDD'07 There are often various kinds of constraints specified for mining graph pattern P, e.g., max_degree(P) ≥ 10 diameter(P) ≥δ Most constraints can be pushed deeply into the mining process, thus greatly reduces search space Constraints can be classified into different categories Different categories require different pushing strategies 2007-5-23 September 19, 2018 32 32

Pattern Pruning vs. Data Pruning Pruning a pattern saves the mining associated with all the patterns that grow out of this pattern, which is DP Data Pruning Data pruning considers both the pattern P and a graph G ∈ DP, and data pruning saves a portion of DP DP is the data search space of a pattern P. ST,P is the portion of DP that can be pruned by data pruning. September 19, 2018

Pruning Properties Overview Pruning property: A property of the constraint that helps prune either the pattern search space or the data search space. Pruning Pattern Search Space Strong P-antimonotonicity Weak P-antimonotoniciy Pruning Data Search Space Pattern-separable D-antimonotonicity Pattern-inseparable D-antimonotonicity September 19, 2018 34 34

Pruning Pattern Search Space Strong P-antimonotonicity A constraint C is strong P-antimonotone if a pattern violates C, all of its super-patterns do so too E.g., C: “The pattern is acyclic” Weak P-antimonotoniciy A constraint C is weak P-antimonotone if a graph P (with at least k vertices) satisfies C, there is at least one subgraph of P with one vertex less that satisfies C E.g., C: “The density ratio of pattern P ≥ 0.1”, i.e., A densely connected graph can always be grown from a smaller densely connected graph with one vertex less September 19, 2018 35 35

Pruning Data Space (I): Pattern-Separable D-Antimonotonicity A constraint C is pattern-separable D-antimonotone if a graph G cannot make P satisfy C, then G cannot make any of P’s super-patterns satisfy C C: “the number of edges ≥ 10”, or “the pattern contains a benzol ring”. Use this property: recursive data reduction A graph is pruned from the data search space for pattern P if G cannot satisfy this C September 19, 2018

Pruning Data Space (II): Pattern-Inseparable D-Antimonotonicity The tested pattern is not separable from the graph E.g., “the vertex connectivity of the pattern ≥ 10” Idea: “put pattern P back to G” Embed the current pattern P into each G ∈ DP Compute by a measure function M, for all supergraphs P’ such that P ⊂ P’ ⊂ G, an upper/lower bound M(P,G) of the graph property This bound serves as a necessary condition for the existence of a constraint-satisfying supergragh P’. We discard G if this necessary condition is violated. G =M(P,G) M P September 19, 2018

Graph Constraints: A General Picture September 19, 2018 38 38

Outline Mining frequent graph patterns Constraint-based graph pattern mining Graph indexing methods Similairty search in graph databases Graph containment search and indexing September 19, 2018

Graph Search: Querying Graph Databases Given a graph database and a query graph, find all graphs containing this query graph query graph graph database September 19, 2018

Scalability Issue Sequential scan Disk I/O Subgraph isomorphism testing An indexing mechanism is needed DayLight: Daylight.com (commercial) GraphGrep: Dennis Shasha, et al. PODS'02 Grace: Srinath Srinivasa, et al. ICDE'03 (a) (b) (c) Query graph Sample database Two issues have to be solved: Disk I/O for false positive and subgraph isomorphism testing for false positive. Therefore, an indexing mechanism is needed. The list just shows a partial list of the previous research. DayLight system is a commercial product targeted for chemical information system. The following is copied from www.daylight.com. Once a large graph database is formed, how can we do database management, and further data warehouse and data mining. 2. GraphGrep uses a path-based approach to do graph indexing. Its implementation is available publicly. 3. Grace develops a hierarchical vector space method. It tries to extract features from the graphs in a hierarchical way. DayLight System “At Infinity Pharmaceuticals we use the Daylight toolkits for in-house development and have implemented DayCart (as a central component of our chemical registration system. This system has been fully integrated within all aspects of our drug discovery processes and software applications. We have successfully "pushed" the registration process to the chemists; they can quickly and easily register compounds within the context of their electronic notebooks. This has provided a significant enhancement to both the efficiency of data capture and the quality of the data within the chemistry database(s). DayCart integrates well into our overall application and data environment providing a flexible chemical data model which facilitates a data architecture that is tailored to Infinity's processes and scientific approach while at the same time allows speed and scalability. The benefit delivered to Infinity is measured in the strong capability to rapidly and efficiently develop intellectual property and access corporate knowledge effectively. “ September 19, 2018

Indexing Strategy Remarks Query graph (Q) Graph (G) If graph G contains query graph Q, G should contain any substructure of Q Substructure Our work, also with all the previous work follows this indexing strategy. Remarks Index substructures of a query graph to prune graphs that do not contain these substructures September 19, 2018

Framework Two steps in processing graph queries Step 1. Index Construction Enumerate structures in the graph database, build an inverted index between structures and graphs Step 2. Query Processing Enumerate structures in the query graph Calculate the candidate graphs containing these structures Prune the false positive answers by performing subgraph isomorphism test September 19, 2018

Cost Analysis Query Response Time Disk I/O time Isomorphism testing time Graph index access time T_io is the time of fetching a graph from a disk, which may result in one disk block IO. I did not show it in the paper. If the graph dataset can not be held in the main memory, then we need also count the IO time Our goal is to minimize |Cq| Size of candidate answer set Remark: make |Cq| as small as possible September 19, 2018

Path-Based Approach Sample database (a) (b) (c) Paths 0-length: C, O, N, S 1-length: C-C, C-O, C-N, C-S, N-N, S-O 2-length: C-C-C, C-O-C, C-N-C, ... 3-length: ... Built an inverted index between paths and graphs September 19, 2018

Problems of Path-Based Approach Sample database (a) (b) (c) Query graph Only graph (c) contains this query graph. However, if we only index paths: C, C-C, C-C-C, C-C-C-C, we cannot prune graph (a) and (b). Only graph (C) contains the query graph. Unfortunately, if we use path only, we cannot prune graphs (a) and (b) The question is why we use path. The reason is path is very simple, easier to manipulate. September 19, 2018

gIndex: Indexing Graphs by Data Mining Our methodology on graph index: Identify frequent structures in the database, the frequent structures are subgraphs that appear quite often in the graph database Prune redundant frequent structures to maintain a small set of discriminative structures Create an inverted index between discriminative frequent structures and graphs in the database The reason of doing frequent graphs is just because the number of frequent graphs is much less than the number of all subgraphs in the graph database although this number is still too large for indexing. September 19, 2018

IDEAS: Indexing with Two Constraints discriminative (~103) frequent (~105) structure (>106) September 19, 2018

Why Discriminative Subgraphs? Sample database (a) (b) (c) All graphs contain structures: C, C-C, C-C-C Why bother indexing these redundant frequent structures? Only index structures that provide more information than existing structures September 19, 2018

Discriminative Structures Pinpoint the most useful frequent structures Given a set of structures f1, f2, …, fn and a new structure x, we measure the extra indexing power provided by x, P (x|f1, f2, …, fn), where fi is contained in x When P is small enough, x is a discriminative structure and should be included in the index Index discriminative frequent structures only Reduce the index size by an order of magnitude September 19, 2018

Why Frequent Structures? We cannot index (or even search) all of substructures Large structures will likely be indexed well by their substructures Size-increasing support threshold minimum support threshold support size September 19, 2018

Experimental Setting The AIDS antiviral screen compound dataset from NCI/NIH, containing 43,905 chemical compounds Query graphs are randomly extracted from the dataset. GraphGrep: maximum length (edges) of paths is set at 10 gIndex: maximum size (edges) of structures is set at 10 The fingerprint set size is set as large as possible until the program takes 1G memory. Here it is 10K. Other parameter setting is provided in the paper September 19, 2018

Experiments: Index Size # OF FEATURES Remarks: It shows the raw feature set of path-based approach and structure-based approach. Of course, we can compress the paths into an arbitrary size of index with performance loss. In our experiment setting, we do not compress the path-based index in order to illustrate the best performance it can achieve. It is also possible to apply discriminative filtering on paths. We have not done experiments on it. The number of frequent structures or discriminative frequent structures does not change over the different size of databases. DATABASE SIZE September 19, 2018

Experiments: Answer Set Size # OF CANDIDATES 1. That is, the number of graphs containing the query graph is less than 50. 2. Query size are about the number of edges in a query. 3. Answer set size is the number of candidate graphs that may contain the query graph. 4. The actual match is the lower bound that all algorithms can achieve. That is, it is the actual number of graphs that really contain the query graph. 5. The performance of queries whose support is less than 50 QUERY SIZE September 19, 2018

Experiments: Incremental Maintenance The fact that the number of frequent structures does not change over the different size of databases inspires us to design an algorithm which construct the index incrementally. 2. The black line shows the performance of indexing from the scratch, for example, from a database with 2,000 graphs, 4,000 graphs, 6,000 graphs, 8,000 graphs and 10,000 graphs. 3. The red line shows the performance of incremental indexing, for example, we first built an index for 2,000 graphs, then we add another 2,000 graphs into the database, update the indexing of the original discriminative frequent graphs, then we add another 2,000, … and so on Frequent structures are stable to database updating Index can be built based on a small portion of a graph database, but be used for the whole database September 19, 2018

Outline Mining frequent graph patterns Constraint-based graph pattern mining Graph indexing methods Similairty search in graph databases Graph containment search and indexing September 19, 2018

Structure Similarity Search CHEMICAL COMPOUNDS (a) caffeine (b) diurobromine (c) viagra QUERY GRAPH September 19, 2018

Some “Straightforward” Methods Method1: Directly compute the similarity between the graphs in the DB and the query graph Sequential scan Subgraph similarity computation Method 2: Form a set of subgraph queries from the original query graph and use the exact subgraph search Costly: If we allow 3 edges to be missed in a 20-edge query graph, it may generate 1,140 subgraphs September 19, 2018

Index: Precise vs. Approximate Search Precise Search Use frequent patterns as indexing features Select features in the database space based on their selectivity Build the index Approximate Search Hard to build indices covering similar subgraphs—explosive number of subgraphs in databases Idea: (1) keep the index structure (2) select features in the query space September 19, 2018

Substructure Similarity Measure Query relaxation measure The number of edges that can be relabeled or missed; but the position of these edges are not fixed QUERY GRAPH … September 19, 2018

Substructure Similarity Measure Feature-based similarity measure Each graph is represented as a feature vector X = {x1, x2, …, xn} The similarity is defined by the distance of their corresponding vectors Advantages Easy to index Fast Rough measure September 19, 2018

Intuition: Feature-Based Similarity Search Graph (G1) If graph G contains the major part of a query graph Q, G should share a number of common features with Q Query (Q) Graph (G2) Given a relaxation ratio, calculate the maximal number of features that can be missed ! Suppose we can delete any red edge in the query graph Q. Substructure At least one of them should be contained September 19, 2018

Feature-Graph Matrix graphs in database G1 G2 G3 G4 G5 f1 1 f2 f3 f4 f5 features Assume a query graph has 5 features and at most 2 features to miss due to the relaxation threshold September 19, 2018

Edge Relaxation – Feature Misses If we allow k edges to be relaxed, J is the maximum number of features to be hit by k edges—it becomes the maximum coverage problem NP-complete A greedy algorithm exists We design a heuristic to refine the bound of feature misses September 19, 2018

Query Processing Framework Step 1. Index Construction Select small structures as features in a graph database, and build the feature-graph matrix between the features and the graphs in the database Step 2. Feature Miss Estimation Determine the indexed features belonging to the query graph Calculate the upper bound of the number of features that can be missed for an approximate matching, denoted by J On the query graph, not the graph database Step 3. Query Processing Use the feature-graph matrix to calculate the difference in the number of features between graph G and query Q, FG – FQ If FG – FQ > J, discard G. The remaining graphs constitute a candidate answer set September 19, 2018

Performance Study Database Chemical compounds of Anti-Aids Drug from NCI/NIH, randomly select 10,000 compounds Query Randomly select 30 graphs with 16 and 20 edges as query graphs Competitive algorithms Grafil: Graph Filter—our algorithm Edge: use edges only All: use all the features September 19, 2018

Comparison of the Three Algorithms # of candidates When the number of edge relaxations increases, the performance of Edge and Grafil will be similar, as expected. Queries with 16 edges edge relaxation September 19, 2018

Outline Mining frequent graph patterns Constraint-based graph pattern mining Graph indexing methods Similairty search in graph databases Graph containment search and indexing September 19, 2018

Graph Search vs. Graph Containment Search Given a graph DB and a query graph q, Graph search: Finds all graphs containing q Graph containment search: Finds all graphs contained by q Why graph containment search ? Chem-informatics: Searching for “descriptor” structures by full molecules Pattern Recognition: Searching for model objects by the captured scene Attributed Relational Graphs (ARGs) Object recognition search Cyber Security: Virus signature detection September 19, 2018

Example: Graph Search vs. Graph Containment Search Graph database Query graph We need index to search large datasets, but two searches need rather different index structures! Containment Traditional September 19, 2018 70

Different Philosophies in Two Searches Graph search: Feature-based pruning strategy Each query graph is represented as a vector of features, where features are subgraphs in the database If a graph in the database contains the query, it must also contain all the features of the query Different logics: Given a data graph g and a query graph q, (Traditional) graph search: inclusion logic If feature f is in q then the graphs not having f are pruned Graph containment search: exclusion logic If feature f is not in q then the graphs having f are pruned September 19, 2018

Contrast Features for C-Search Pruning Contrast Features: Those contained by many database graphs, but unlikely to be contained by query graphs Why contrast feature? ―because they can prune a lot in containment search! Challenges: There are nearly infinite number of subgraphs in the database that can be taken as features Contrast features should be those contained in many database graphs; thus, we only focus on those frequent subgraphs of the database 72 September 19, 2018

The Basic Framework Off-line index construction Generate and select a feature set F from the graph database D For feature f in F, Df records the set of graphs containing f, i.e., , which is stored as an inverted list on the disk Online search For each indexed feature , test it against the query q, pruning takes place iff. f is not contained in q Candidate answer set Verification Check each candidate in Cq by a graph isomorphism test September 19, 2018

Cost Analysis Given a query graph q and a set of features F, the search time can be formulated as A simplistic model: Of course, it can be extended Neglected because ID-list operations are cheap compared to isomorphism tests between graphs September 19, 2018

Feature Selection Core problem for index construction Carefully choose the set of indexed features F to maximize pruning capability, i.e., minimize for the query workload Q September 19, 2018

Feature-Graph Matrix The (i, j)-entry tells whether the jth model graph has the ith feature, i.e., if the ith feature is not contained in the query graph, then the jth model graph can be pruned iff. the (i, j)-entry is 1 September 19, 2018 76

Contrast Graph Matrix If the ith feature is contained in the query, then the corresponding row of the feature-graph matrix is set to 0, because the ith feature does not have any pruning power now September 19, 2018

Training by the Query Log Given a query log L = {q1, q2, . . . , qr}, we can concatenate the contrast graph matrix of all queries to form a contrast graph matrix for the whole query set What if there are no query logs? As the query graphs are usually not too different from database graphs, we can start the system by setting L = D, and then real queries will flow in Our experiments confirm the effectiveness of this alternative September 19, 2018

Maximum Coverage with Cost Including the ith feature Gain: the sum of the ith row, which is the number of (d-graph, q-graph) pairs it can prune Cost: |L| = r, because for each query q, we need to decide whether it contains the ith feature at first Select the optimal set of features that can maximize this gain-cost difference Maximum Coverage with Cost It is NP-complete September 19, 2018

The Basic Containment Search Index Greedy algorithm As the cost (|L| = r) is equal among features, the 1st feature is chosen as the one with greatest gain Update the contrast graph matrix, remove selected rows and pruned columns Stop if there are no features with gain over r cIndex-Basic A redundancy-aware fashion It can approximate the optimal index within a ratio of 1 − 1/e September 19, 2018

The Bottom-Up Hierarchical Index View indexed features as another database on which a second-level index can be built Iterate from the bottom of the tree The cascading effect: If f1 is not contained in q, then the whole tree rooted at f1 need not be examined September 19, 2018

The Top-Down Hierarchical Index Strongest features are put on the top The 2nd test takes messages from the 1st test The differentiating effect: index different features for different queries September 19, 2018

Experiment Setting Chemical Descriptor Search NCI/NIH AIDS anti-viral drugs 10,000 chemical compounds – queries Characteristic substructures - database Object Recognition Search TREC Video Retrieval Evaluation 3,000 key frame images – queries About 2,500 model objects – database Compare with: Naïve SCAN FB (Feature-Based): gIndex, state-of-art index built for (traditional) graph search OPT: corresp. to search database graphs really contained in the query September 19, 2018

Chemical Descriptor Search In terms of iso. test # In terms of processing time Trends are similar, meaning that our simplistic model is accurate enough September 19, 2018

Hierarchical Indices Space-time tradeoff September 19, 2018

Object Recognition Search September 19, 2018

Conclusions Graph mining has wide applications Frequent and closed subgraph mining methods gSpan and CloseGraph: pattern-growth depth-first search approach gPrune: Pruning graph mining search space with constraints gIndex: Graph indexing Frequent and discirminative subgraphs are high-quality indexing fatures Grafill: Similairty (subgraph) search in graph databases Graph indexing and feature-based approximate matching cIndex: Containment graph indexing A contrast feature-based indexing model September 19, 2018

Thanks and Questions September 19, 2018