33 rd International Conference on Very Large Data Bases, Sep. 2007, Vienna Towards Graph Containment Search and Indexing Chen Chen 1, Xifeng Yan 2, Philip S. Yu 2, Jiawei Han 1, Dong-Qing Zhang 3, Xiaohui Gu 2 1 University of Illinois at Urbana-Champaign 2 IBM T. J. Watson Research Center 3 Thomson Research
Chen Chen, Xifeng Yan, Philip S. Yu, Jiawei Han, Dong-Qing Zhang, Xiaohui Gu. 33 rd International Conference on Very Large Data Bases, Sep. 2007, Vienna Problem Definition Given a graph database D = {g 1, … g n }, and a graph query q, one could formulate two basic search problems: (1) (traditional) graph search: find all graphs g i in D s.t. q is a subgraph of g i. GraphGrep: D. Shasha, J. T.-L. Wang, and R. Giugno. PODS gIndex: X. Yan, P. S. Yu, and J. Han. SIGMOD C-Tree: H. He and A. K. Singh. ICDE Tree+Δ: P. Zhao, J. X. Yu, and P. S. Yu. VLDB (2) graph containment search: find all graphs g i in D s.t. q is a supergraph of g i
Chen Chen, Xifeng Yan, Philip S. Yu, Jiawei Han, Dong-Qing Zhang, Xiaohui Gu. 33 rd International Conference on Very Large Data Bases, Sep. 2007, Vienna Applications Chem-Informatics Pattern Recognition Cyber Security (Virus Signature Detection) Information Management (User-interest Mapping)
Chen Chen, Xifeng Yan, Philip S. Yu, Jiawei Han, Dong-Qing Zhang, Xiaohui Gu. 33 rd International Conference on Very Large Data Bases, Sep. 2007, Vienna Example
Chen Chen, Xifeng Yan, Philip S. Yu, Jiawei Han, Dong-Qing Zhang, Xiaohui Gu. 33 rd International Conference on Very Large Data Bases, Sep. 2007, Vienna Preliminary Definitions Subgraph Isomorphism: For two labeled graphs g and g’, a subgraph isomorphism is an injective function f : V(g) -> V(g’), s.t. ∀ v ∈ V(g), l(v) = l’(f(v)); ∀ (u, v) ∈ E(g), (f(u), f(v)) ∈ E(g’) and l(u, v) = l’(f(u), f(v)). f is called an embedding of g in g’ Subgraph and Supergraph If there exists an embedding of g in g’, then g is a subgraph of g’, denoted by g ⊆ g’, and g’ is a supergraph of g.
Chen Chen, Xifeng Yan, Philip S. Yu, Jiawei Han, Dong-Qing Zhang, Xiaohui Gu. 33 rd International Conference on Very Large Data Bases, Sep. 2007, Vienna Feature-based Indexing Methodology Naïve solution (SCAN): Examines the database D sequentially and compares each graph g i with the query graph q to decide whether q ⊇ g i. Subgraph isomorphism problem is NP-complete. Feature-based indexing: Similar model graphs g i and g j are likely to have similar isomorphism testing results w.r.t. the same query graph. Let f be a common substructure shared by g i and g j. If f ⊆ q, then g i ⊆ q and g j ⊆ q. Therefore, we can save on isomorphism test. Select a feature set F from graph database D. If feature f ∈ F is not a subgraph of q, then the graphs having f as subgraph are pruned.
Chen Chen, Xifeng Yan, Philip S. Yu, Jiawei Han, Dong-Qing Zhang, Xiaohui Gu. 33 rd International Conference on Very Large Data Bases, Sep. 2007, Vienna Basic Framework Off-line index construction: Generate and select a feature set F from the graph database D. For feature f ∈ F, D f = {g|f ⊆ g, g ∈ D}, which can be represented by an inverted list over D. Search: Test indexed features in F against the query q which returns all f ⊆ q, and compute the candidate query answer set, C q = D – ∪ f D f (f ⊆ q, f ∈ F). Verification: Check each graph g in the candidate set C q to see whether g is really a subgraph of q.
Chen Chen, Xifeng Yan, Philip S. Yu, Jiawei Han, Dong-Qing Zhang, Xiaohui Gu. 33 rd International Conference on Very Large Data Bases, Sep. 2007, Vienna Cost Model Search Time Formula: |F| + |C q | (negligible)
Chen Chen, Xifeng Yan, Philip S. Yu, Jiawei Han, Dong-Qing Zhang, Xiaohui Gu. 33 rd International Conference on Very Large Data Bases, Sep. 2007, Vienna Feature Graph Matrix gaga gbgb gcgc f1f1 111 f2f2 110 f3f3 110 f4f4 100
Chen Chen, Xifeng Yan, Philip S. Yu, Jiawei Han, Dong-Qing Zhang, Xiaohui Gu. 33 rd International Conference on Very Large Data Bases, Sep. 2007, Vienna Feature Generation Good features should be frequent, but not too frequent in the database. frequent: index more graphs in database too frequent: simple and easy to be contained by query graph Use frequent subgraph mining algorithms, e.g. gSpan[1], to generate an initial set of frequent subgraphs.
Chen Chen, Xifeng Yan, Philip S. Yu, Jiawei Han, Dong-Qing Zhang, Xiaohui Gu. 33 rd International Conference on Very Large Data Bases, Sep. 2007, Vienna Feature Selection Given a set of queries {q 1, q 2, …, q r }, an optimal index should be able to maximize the total gain from naïve SCAN:
Chen Chen, Xifeng Yan, Philip S. Yu, Jiawei Han, Dong-Qing Zhang, Xiaohui Gu. 33 rd International Conference on Very Large Data Bases, Sep. 2007, Vienna Feature Selection Set i-th row to 0 if the query has feature f i as its subgraph. Concatenate feature graph matrix to form a global matrix. f i covers a set of columns -> Maximum Coverage with Cost:
Chen Chen, Xifeng Yan, Philip S. Yu, Jiawei Han, Dong-Qing Zhang, Xiaohui Gu. 33 rd International Conference on Very Large Data Bases, Sep. 2007, Vienna Maximum Coverage with Cost Given a set of subsets S = {S 1, S 2, …, S m } of the universal set U = {1, 2, …, n} and a cost parameter λassociated with any S i ∈ S, find a subset T of S such that | ∪ Si ∈ T S i | - λ|T| is maximized. Can be reduced from set cover, and therefore is NP- complete. Greedy heuristic method, in each iteration: Select a row i with the most # of non-zero entries from global matrix M. Set j-th column to 0 if M ij = 1 Note that selecting a row is associate with a cost r, so stop the iteration if no rows have more than r non-zero entries. The greedy heuristic achieves an approximation ratio of 1-1/e.
Chen Chen, Xifeng Yan, Philip S. Yu, Jiawei Han, Dong-Qing Zhang, Xiaohui Gu. 33 rd International Conference on Very Large Data Bases, Sep. 2007, Vienna Algorithm: cIndex-Basic Input: Graph Matrix M over r queries Output: Selected Features F. 1: F = ∅ ; 2: while ∃ i, ∑ j M ij > r do 3: select row i with most non-zero entries in M; 4: F = F ∪ {f i }; 5: for each column j s.t. M ij is not zero do 6: delete column j from M 7: delete row i; 8: return F;
Chen Chen, Xifeng Yan, Philip S. Yu, Jiawei Han, Dong-Qing Zhang, Xiaohui Gu. 33 rd International Conference on Very Large Data Bases, Sep. 2007, Vienna Complexity Time Complexity: O(|F 0 ||D||r|), where |D| and |r| can be reduced by sampling and clustering on graph database and queries. Space Complexity: Use a compact matrix, reduce the space complexity from O(|F 0 ||D||r|) to O(|F 0 ||D| + |F 0 ||r|) q1q1 q2q2 q3q3 f1f1 030 f2f2 220 f3f3 022 f4f4 111
Chen Chen, Xifeng Yan, Philip S. Yu, Jiawei Han, Dong-Qing Zhang, Xiaohui Gu. 33 rd International Conference on Very Large Data Bases, Sep. 2007, Vienna Hierarchical Indexing Models The cIndex-Basic algorithm builds a flat index structure, where each feature is tested sequentially and deterministically against any input queries. Hierarchical index may improve the performance: Bottom-up Top-down
Chen Chen, Xifeng Yan, Philip S. Yu, Jiawei Han, Dong-Qing Zhang, Xiaohui Gu. 33 rd International Conference on Very Large Data Bases, Sep. 2007, Vienna cIndex-BottomUp Build index layer by layer staring from the bottom-level graphs. The first-level index L 1 is built on the original graph database by cIndex-Basic. The features in L 1 can be regarded as another graph database, where cIndex- Basic can be executed again to form second-level index L 2. Disadvantage: high-level features are simple and easy to be contained be queries.
Chen Chen, Xifeng Yan, Philip S. Yu, Jiawei Han, Dong-Qing Zhang, Xiaohui Gu. 33 rd International Conference on Very Large Data Bases, Sep. 2007, Vienna cIndex-TopDown Select feature f i that covers most columns in global graph matrix M. Divide queries into two groups: contain f i and do not contain f i. Divide M into two parts according to query groups. Run the above steps recursively on new matrices, until we reach a small number of queries in a group (to avoid overfitting).
Chen Chen, Xifeng Yan, Philip S. Yu, Jiawei Han, Dong-Qing Zhang, Xiaohui Gu. 33 rd International Conference on Very Large Data Bases, Sep. 2007, Vienna Experiments
Chen Chen, Xifeng Yan, Philip S. Yu, Jiawei Han, Dong-Qing Zhang, Xiaohui Gu. 33 rd International Conference on Very Large Data Bases, Sep. 2007, Vienna Thank You!