33 rd International Conference on Very Large Data Bases, Sep. 2007, Vienna Towards Graph Containment Search and Indexing Chen Chen 1, Xifeng Yan 2, Philip.

Towards Graph Containment Search and Indexing
Chen Chen (1), Xifeng Yan (2), Philip S. Yu (2), Jiawei Han (1), Dong-Qing Zhang (3), Xiaohui Gu (2)
(1) University of Illinois at Urbana-Champaign, (2) IBM T. J. Watson Research Center, (3) Thomson Research
33rd International Conference on Very Large Data Bases, Sep. 2007, Vienna

Problem Definition
Given a graph database D = {g_1, …, g_n} and a query graph q, one can formulate two basic search problems:
(1) (Traditional) graph search: find all graphs g_i in D such that q is a subgraph of g_i. Existing methods:
  GraphGrep: D. Shasha, J. T.-L. Wang, and R. Giugno. PODS
  gIndex: X. Yan, P. S. Yu, and J. Han. SIGMOD
  C-Tree: H. He and A. K. Singh. ICDE
  Tree+Δ: P. Zhao, J. X. Yu, and P. S. Yu. VLDB
(2) Graph containment search: find all graphs g_i in D such that q is a supergraph of g_i.

Applications
  Chem-informatics
  Pattern recognition
  Cyber security (virus signature detection)
  Information management (user-interest mapping)

Example

Preliminary Definitions
Subgraph isomorphism: for two labeled graphs g and g', a subgraph isomorphism is an injective function f : V(g) -> V(g') such that
  (1) for all v ∈ V(g), l(v) = l'(f(v));
  (2) for all (u, v) ∈ E(g), (f(u), f(v)) ∈ E(g') and l(u, v) = l'(f(u), f(v)).
f is called an embedding of g in g'.
Subgraph and supergraph: if there exists an embedding of g in g', then g is a subgraph of g', denoted by g ⊆ g', and g' is a supergraph of g.
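The definition above can be turned into a small backtracking search for an embedding. This is only a minimal sketch, assuming undirected graphs stored as plain dictionaries; the class name LabeledGraph and the function find_embedding are hypothetical names chosen for illustration, not part of the presented work, and no attempt is made at the optimizations a practical matcher would use.

```python
from typing import Dict, Hashable, Optional, Tuple

Vertex = Hashable


class LabeledGraph:
    """Undirected labeled graph (hypothetical helper for illustration)."""

    def __init__(self, vertex_labels: Dict[Vertex, str],
                 edge_labels: Dict[Tuple[Vertex, Vertex], str]):
        self.vertex_labels = dict(vertex_labels)
        # Store every undirected edge in both directions for easy lookup.
        self.edge_labels: Dict[Tuple[Vertex, Vertex], str] = {}
        for (u, v), label in edge_labels.items():
            self.edge_labels[(u, v)] = label
            self.edge_labels[(v, u)] = label


def find_embedding(g: LabeledGraph, h: LabeledGraph) -> Optional[Dict[Vertex, Vertex]]:
    """Return an injective map f: V(g) -> V(h) that preserves vertex labels,
    edges, and edge labels, or None if g is not a subgraph of h."""
    order = list(g.vertex_labels)
    mapping: Dict[Vertex, Vertex] = {}
    used = set()

    def consistent(u: Vertex, x: Vertex) -> bool:
        if g.vertex_labels[u] != h.vertex_labels[x]:
            return False
        # Every edge from u to an already mapped vertex must exist in h
        # with the same label.
        for w, y in mapping.items():
            if (u, w) in g.edge_labels:
                if h.edge_labels.get((x, y)) != g.edge_labels[(u, w)]:
                    return False
        return True

    def extend(k: int) -> bool:
        if k == len(order):
            return True
        u = order[k]
        for x in h.vertex_labels:
            if x not in used and consistent(u, x):
                mapping[u] = x
                used.add(x)
                if extend(k + 1):
                    return True
                del mapping[u]
                used.remove(x)
        return False

    return dict(mapping) if extend(0) else None


def is_subgraph(g: LabeledGraph, h: LabeledGraph) -> bool:
    return find_embedding(g, h) is not None
```

The search is exponential in the worst case, which is in line with the NP-completeness of subgraph isomorphism and is the reason an index that avoids such tests is useful.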

Feature-based Indexing Methodology
Naïve solution (SCAN): examine the database D sequentially and compare each graph g_i with the query graph q to decide whether q ⊇ g_i. Each comparison is a subgraph isomorphism test, and the subgraph isomorphism problem is NP-complete.
Feature-based indexing:
  Similar model graphs g_i and g_j are likely to have similar isomorphism testing results w.r.t. the same query graph.
  Let f be a common substructure shared by g_i and g_j. If f ⊈ q, then g_i ⊈ q and g_j ⊈ q. A single test on f therefore saves the isomorphism tests on g_i and g_j.
  Select a feature set F from the graph database D. If a feature f ∈ F is not a subgraph of q, then all graphs having f as a subgraph are pruned.

Basic Framework
Off-line index construction: generate and select a feature set F from the graph database D. For each feature f ∈ F, D_f = {g | f ⊆ g, g ∈ D}, which can be represented by an inverted list over D.
Search: test the indexed features in F against the query q to find all f ⊈ q, and compute the candidate answer set C_q = D − ∪_{f ∈ F, f ⊈ q} D_f.
Verification: check each graph g in the candidate set C_q to see whether g is really a subgraph of q.
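The three phases can be sketched end to end as follows. To keep the example short, a "graph" here is a frozenset of edge tokens and "subgraph" means subset; that toy model is an assumption made purely for illustration and stands in for a real subgraph isomorphism test. The function names build_index and containment_search are likewise hypothetical.

```python
from typing import Dict, FrozenSet, List, Set, Tuple

# Toy stand-in (assumption): a graph is a set of edge tokens, and
# "g is a subgraph of h" is modeled as g being a subset of h.
Graph = FrozenSet[str]


def is_subgraph(small: Graph, big: Graph) -> bool:
    return small <= big


def build_index(database: Dict[str, Graph],
                features: Dict[str, Graph]) -> Dict[str, Set[str]]:
    """Off-line phase: inverted list D_f for every feature f."""
    return {
        fid: {gid for gid, g in database.items() if is_subgraph(f, g)}
        for fid, f in features.items()
    }


def containment_search(query: Graph,
                       database: Dict[str, Graph],
                       features: Dict[str, Graph],
                       index: Dict[str, Set[str]]) -> Tuple[List[str], int]:
    """Return (answers, number of candidates that had to be verified)."""
    # Search phase: every feature NOT contained in the query prunes D_f.
    candidates = set(database)
    for fid, f in features.items():
        if not is_subgraph(f, query):
            candidates -= index[fid]
    # Verification phase: test the surviving graphs against the query.
    answers = sorted(gid for gid in candidates
                     if is_subgraph(database[gid], query))
    return answers, len(candidates)
```

With a database of three graphs and a query that misses one feature, only the graphs outside that feature's inverted list reach the verification phase.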

Cost Model
Search time is the cost of testing the |F| indexed features against the query plus the cost of verifying the |C_q| candidate graphs; any remaining cost is treated as negligible.

Feature Graph Matrix
       g_a  g_b  g_c
f_1     1    1    1
f_2     1    1    0
f_3     1    1    0
f_4     1    0    0
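Each row of the feature-graph matrix is the inverted list D_f of one feature in bitmap form. A short sketch of the conversion, using the values from the slide; the function name inverted_lists and the string identifiers are assumptions for illustration:

```python
from typing import Dict, List, Sequence


def inverted_lists(matrix: Sequence[Sequence[int]],
                   feature_ids: Sequence[str],
                   graph_ids: Sequence[str]) -> Dict[str, List[str]]:
    """Turn a 0/1 feature-graph matrix into one inverted list per feature."""
    return {
        fid: [gid for gid, bit in zip(graph_ids, row) if bit]
        for fid, row in zip(feature_ids, matrix)
    }


# Matrix from the slide: rows f_1..f_4, columns g_a..g_c.
FEATURE_GRAPH = [
    [1, 1, 1],
    [1, 1, 0],
    [1, 1, 0],
    [1, 0, 0],
]
LISTS = inverted_lists(FEATURE_GRAPH, ["f1", "f2", "f3", "f4"], ["ga", "gb", "gc"])
```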

Feature Generation
Good features should be frequent, but not too frequent, in the database.
  Frequent: the feature indexes more graphs in the database.
  Too frequent: the feature is simple and easily contained in the query graph, so it rarely prunes anything.
Use a frequent subgraph mining algorithm, e.g. gSpan [1], to generate an initial set of frequent subgraphs.

Feature Selection
Given a set of queries {q_1, q_2, …, q_r}, an optimal index should maximize the total gain over naïve SCAN across these queries.

Feature Selection (continued)
For each query, take a copy of the feature-graph matrix and set the i-th row to 0 if the query has feature f_i as its subgraph, since such a feature cannot prune anything for that query.
Concatenate the per-query feature-graph matrices to form a global matrix.
Each feature f_i then covers a set of columns, so feature selection becomes an instance of Maximum Coverage with Cost.
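A sketch of how the global matrix could be assembled is given below. It assumes the feature-graph matrix is a list of 0/1 rows and that, for every query, the set of feature row indices contained in that query is already known; the function name global_matrix and the sample query sets are hypothetical.

```python
from typing import List, Sequence, Set


def global_matrix(feature_graph: Sequence[Sequence[int]],
                  query_contains: Sequence[Set[int]]) -> List[List[int]]:
    """Concatenate one copy of the feature-graph matrix per query.

    query_contains[l] holds the row indices of the features that are
    subgraphs of query l; those rows are zeroed in query l's block.
    """
    rows: List[List[int]] = []
    for i, row in enumerate(feature_graph):
        out: List[int] = []
        for contained in query_contains:
            if i in contained:
                out.extend([0] * len(row))  # feature cannot prune for this query
            else:
                out.extend(row)             # feature prunes every graph in D_f
        rows.append(out)
    return rows


# Feature-graph matrix with 4 features and 3 graphs, and three made-up queries.
FEATURE_GRAPH = [[1, 1, 1], [1, 1, 0], [1, 1, 0], [1, 0, 0]]
QUERY_CONTAINS = [{0, 2}, set(), {0, 1}]
M = global_matrix(FEATURE_GRAPH, QUERY_CONTAINS)
```

A non-zero entry in row i means that selecting f_i saves one verification for that (query, graph) pair.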

Maximum Coverage with Cost
Given a set of subsets S = {S_1, S_2, …, S_m} of the universal set U = {1, 2, …, n} and a cost parameter λ associated with each S_i ∈ S, find a subset T of S such that |∪_{S_i ∈ T} S_i| − λ|T| is maximized.
By reduction from set cover, the problem is NP-complete.
Greedy heuristic, in each iteration:
  Select the row i with the largest number of non-zero entries in the global matrix M.
  For every column j with M_ij = 1, set the j-th column to 0.
  Selecting a row is associated with a cost r, so stop iterating when no row has more than r non-zero entries.
The greedy heuristic achieves an approximation ratio of 1 − 1/e.
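To make the objective concrete, the sketch below evaluates |∪ S_i| − λ|T| for a chosen T and finds the optimum of a tiny instance by exhaustive enumeration. The enumeration is exponential and is shown only as a reference point for small inputs, not as the method from the slides; the names objective and brute_force_best are hypothetical.

```python
from itertools import combinations
from typing import Sequence, Set, Tuple


def objective(subsets: Sequence[Set[int]], chosen: Sequence[int], lam: int) -> int:
    """Covered elements minus lam times the number of chosen subsets."""
    covered = set().union(*(subsets[i] for i in chosen))
    return len(covered) - lam * len(chosen)


def brute_force_best(subsets: Sequence[Set[int]], lam: int) -> Tuple[Tuple[int, ...], int]:
    """Try every T (exponential); the empty choice has value 0."""
    best: Tuple[int, ...] = ()
    best_value = 0
    for k in range(1, len(subsets) + 1):
        for chosen in combinations(range(len(subsets)), k):
            value = objective(subsets, chosen, lam)
            if value > best_value:
                best, best_value = chosen, value
    return best, best_value


SUBSETS = [{1, 2, 3}, {3, 4}, {4, 5, 6}, {1}]
```

For SUBSETS with λ = 1, choosing the first and third subsets covers six elements at a cost of two, for a value of 4; with λ = 3 no choice beats selecting nothing.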

Algorithm: cIndex-Basic
Input: graph matrix M over r queries
Output: selected features F
F = ∅;
while ∃ i, ∑_j M_ij > r do
    select row i with the most non-zero entries in M;
    F = F ∪ {f_i};
    for each column j s.t. M_ij is not zero do
        delete column j from M;
    delete row i;
return F;
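The pseudocode translates fairly directly into Python. In this sketch each row is kept as the set of column indices that are still non-zero, so "delete column j" becomes removing j from every remaining row. The function name cindex_basic and the tie-breaking rule (lowest row index wins) are assumptions; the slides do not specify how ties are resolved.

```python
from typing import Dict, List, Sequence, Set


def cindex_basic(matrix: Sequence[Sequence[int]], r: int) -> List[int]:
    """Greedy feature selection; returns the selected row indices in order."""
    # Live (non-zero, not yet covered) columns of every remaining row.
    rows: Dict[int, Set[int]] = {
        i: {j for j, value in enumerate(row) if value}
        for i, row in enumerate(matrix)
    }
    selected: List[int] = []
    while rows:
        # Row with the most non-zero entries; ties go to the lowest index.
        best = max(rows, key=lambda i: (len(rows[i]), -i))
        if len(rows[best]) <= r:
            break  # no row is worth its cost r any more
        selected.append(best)
        covered = rows.pop(best)      # delete row i
        for columns in rows.values():
            columns -= covered        # delete the covered columns
    return selected
```

For example, cindex_basic([[1, 1, 0, 0, 0], [0, 0, 1, 1, 1], [1, 0, 1, 0, 0]], 1) first picks row 1 (three columns), then row 0 (two columns), and stops because the last row has nothing left to cover.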

Complexity
Time complexity: O(|F_0||D||r|), where |D| and |r| can be reduced by sampling and clustering on the graph database and the queries.
Space complexity: using a compact matrix reduces the space complexity from O(|F_0||D||r|) to O(|F_0||D| + |F_0||r|).
Compact matrix example (rows: features, columns: queries):
       q_1  q_2  q_3
f_1     0    3    0
f_2     2    2    0
f_3     0    2    2
f_4     1    1    1
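One reading of the compact matrix, offered here as an assumption rather than as the authors' definition, is that entry (i, l) stores how many non-zero entries row f_i has inside the block of query q_l: 0 if q_l contains f_i, and |D_f_i| otherwise. Under that reading only the feature-graph matrix and one small feature-by-query table need to be stored. The function name compact_matrix and the sample query sets are hypothetical.

```python
from typing import List, Sequence, Set


def compact_matrix(feature_graph: Sequence[Sequence[int]],
                   query_contains: Sequence[Set[int]]) -> List[List[int]]:
    """Entry (i, l) = |D_f_i| if feature i is not contained in query l, else 0."""
    sizes = [sum(row) for row in feature_graph]
    return [
        [0 if i in contained else sizes[i] for contained in query_contains]
        for i in range(len(feature_graph))
    ]


FEATURE_GRAPH = [[1, 1, 1], [1, 1, 0], [1, 1, 0], [1, 0, 0]]
# Assumed containment: q_1 contains f_1 and f_3, q_2 contains none,
# q_3 contains f_1 and f_2 (row indices are zero-based).
QUERY_CONTAINS = [{0, 2}, set(), {0, 1}]
COMPACT = compact_matrix(FEATURE_GRAPH, QUERY_CONTAINS)

# Cells stored: full global matrix versus feature-graph matrix plus compact table.
n_features, n_graphs, n_queries = 4, 3, 3
FULL_CELLS = n_features * n_graphs * n_queries
COMPACT_CELLS = n_features * n_graphs + n_features * n_queries
```

With a 4-by-3 feature-graph matrix whose rows have 3, 2, 2 and 1 non-zero entries, this assumed containment yields the rows 0 3 0, 2 2 0, 0 2 2 and 1 1 1.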

Hierarchical Indexing Models
The cIndex-Basic algorithm builds a flat index structure, where each feature is tested sequentially and deterministically against any input query.
A hierarchical index may improve the performance:
  Bottom-up
  Top-down

cIndex-BottomUp
Build the index layer by layer, starting from the bottom-level graphs.
The first-level index L_1 is built on the original graph database by cIndex-Basic.
The features in L_1 can be regarded as another graph database, on which cIndex-Basic can be executed again to form the second-level index L_2.
Disadvantage: high-level features are simple and easily contained by queries.

cIndex-TopDown
Select the feature f_i that covers the most columns in the global graph matrix M.
Divide the queries into two groups: those that contain f_i and those that do not.
Divide M into two parts according to the query groups.
Run the above steps recursively on the new matrices, stopping when a group holds only a small number of queries (to avoid overfitting).
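The recursion can be sketched as a binary decision tree over features. Several details below are assumptions made for illustration and are not stated on the slide: the tree is a nested dictionary, a group with at most min_queries queries becomes a leaf, coverage is counted only over graphs not yet pruned on the current path, and ties go to the lowest feature index. The names build_top_down and candidates are hypothetical.

```python
from typing import Dict, FrozenSet, List, Optional, Sequence, Set


def build_top_down(feature_graph: Sequence[Sequence[int]],
                   query_contains: Sequence[Set[int]],
                   queries: List[int],
                   remaining: FrozenSet[int],
                   used: FrozenSet[int],
                   min_queries: int) -> Optional[Dict]:
    """Recursively pick the feature covering the most (query, graph) columns."""
    if len(queries) <= min_queries:
        return None  # small group: stop to avoid overfitting

    def coverage(i: int) -> int:
        live = sum(1 for g in remaining if feature_graph[i][g])
        misses = sum(1 for q in queries if i not in query_contains[q])
        return live * misses

    options = [i for i in range(len(feature_graph)) if i not in used]
    if not options:
        return None
    best = max(options, key=coverage)  # first maximum wins on ties
    if coverage(best) == 0:
        return None
    yes = [q for q in queries if best in query_contains[q]]
    no = [q for q in queries if best not in query_contains[q]]
    pruned = {g for g in remaining if feature_graph[best][g]}
    return {
        "feature": best,
        "contained": build_top_down(feature_graph, query_contains, yes,
                                    remaining, used | {best}, min_queries),
        "not_contained": build_top_down(feature_graph, query_contains, no,
                                        remaining - pruned, used | {best},
                                        min_queries),
    }


def candidates(tree: Optional[Dict], feature_graph: Sequence[Sequence[int]],
               contained: Set[int], n_graphs: int) -> Set[int]:
    """Walk the tree for one query and return the graphs left to verify."""
    remaining = set(range(n_graphs))
    node = tree
    while node is not None:
        i = node["feature"]
        if i in contained:
            node = node["contained"]
        else:
            remaining -= {g for g in remaining if feature_graph[i][g]}
            node = node["not_contained"]
    return remaining
```

Unlike the flat index, a query only tests the features on its own root-to-leaf path, so different queries may be checked against different features.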

Experiments

Thank You!