Connected Substructure Similarity Search Haichuan Shang The University of New South Wales & NICTA, Australia Joint Work: Xuemin Lin (The University of.


Connected Substructure Similarity Search
Haichuan Shang (The University of New South Wales & NICTA, Australia)
Joint work with:
Xuemin Lin (The University of New South Wales & NICTA, Australia)
Ying Zhang (The University of New South Wales, Australia)
Jeffrey Xu Yu (Chinese University of Hong Kong, China)
Wei Wang (The University of New South Wales & NICTA, Australia)

Outline
1. Motivation
2. Similarity Measure
3. Techniques
4. Experimental Study
5. Conclusion

Application
1. Chemistry
2. Bioinformatics
3. Software Engineering
4. Social Network
[Figure: chemical compounds]

Substructure Search

Substructure Similarity Search
Why similarity search?
1. Input mistakes
2. Exploration
Existing work:
SIGMOD’05 Grafil
ICDE’06 Closure-tree
ICDE’07 GDIndex
VLDB’09 Comparing Stars

Graph Similarity and Subgraph Similarity
Existing similarity measures:
1. Maximum Common Subgraph (MCS) (# of missing edges)
2. Edit Distance
3. Variants
None of these measures enforces connectivity.

Graph Similarity A New Similarity Measure. Maximum Connected Common Subgraph – MCCS (counting missing edges while retaining the connectivity)

Graph Similarity
Maximum Connected Common Subgraph (MCCS): Given two graphs g1 and g2, the maximum connected common subgraph of g1 and g2, denoted mccs(g1, g2), is the largest connected subgraph of g1 that is subgraph isomorphic to g2.
Subgraph Distance: Given a query graph q and a data graph g, the subgraph distance is defined as dist(q, g) = |q| − |mccs(q, g)|, where the size of a graph is its number of edges. That is, dist(q, g) is the number of edges missing from the query.
Substructure Similarity Search: Given a graph database D = {g1, g2, ..., gn}, a query graph q, and a subgraph distance threshold σ, the substructure similarity search retrieves all the graphs gi ∈ D with dist(q, gi) ≤ σ.
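To make the definition of dist(q, g) concrete, here is a minimal brute-force sketch in Python. It is not the algorithm from the talk: it simply tries every connected edge subset of the query, largest first, and checks by exhaustive search whether it embeds in the data graph, so it is only usable on very small graphs. The representation (a dict of vertex labels plus a list of undirected edges) and the function names are assumptions made for illustration.

```python
from itertools import combinations, permutations


def is_connected(edges):
    """True if the vertices touched by `edges` form a single connected component."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    if not adj:
        return True
    start = next(iter(adj))
    seen, stack = {start}, [start]
    while stack:
        x = stack.pop()
        for y in adj[x]:
            if y not in seen:
                seen.add(y)
                stack.append(y)
    return len(seen) == len(adj)


def embeds(edges, q_labels, g_labels, g_edges):
    """True if some injective, label-preserving vertex mapping sends every
    edge in `edges` to an edge of the data graph (exhaustive search)."""
    verts = sorted({v for e in edges for v in e})
    g_edge_set = {frozenset(e) for e in g_edges}
    for image in permutations(g_labels, len(verts)):
        m = dict(zip(verts, image))
        if all(q_labels[v] == g_labels[m[v]] for v in verts) and all(
            frozenset((m[u], m[v])) in g_edge_set for u, v in edges
        ):
            return True
    return False


def subgraph_distance(q_labels, q_edges, g_labels, g_edges):
    """dist(q, g) = |q| - |mccs(q, g)|, with graph size counted in edges."""
    for size in range(len(q_edges), 0, -1):
        for subset in combinations(q_edges, size):
            if is_connected(subset) and embeds(subset, q_labels, g_labels, g_edges):
                return len(q_edges) - size
    return len(q_edges)  # no edge of the query can be matched


# Query: a triangle with a tail; data graph: a path of four vertices.
labels = {0: "C", 1: "C", 2: "C", 3: "C"}
query_edges = [(0, 1), (1, 2), (2, 0), (2, 3)]
path_edges = [(0, 1), (1, 2), (2, 3)]
print(subgraph_distance(labels, query_edges, labels, path_edges))
```

In this example the largest connected part of the query that fits into the path is the three-edge path 0-1-2-3, so one query edge is missing.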

Framework

Feature-based exact subgraph search: overview
[Figure: query, feature (index), and data graphs, with one panel for Pruning and one for Validation]

Similarity Search (triangular inequality)
Does dist(Q,F) + dist(F,D) ≥ dist(Q,D) hold?
[Figure: query (Q), feature (index) (F), and data (D) graphs, with the distances dist(Q,F), dist(F,D), and dist(Q,D) between them]
In this example the inequality holds.

Triangular inequality: does not always hold
[Figure: query (Q), feature (index) (F), and data (D) graphs for which dist(Q,F) + dist(F,D) ≥ dist(Q,D) fails]

Connectivity Dominance
Connectivity Dominance: The connectivity of mccs(g1, g2) dominates the connectivity of g2 if there is a subgraph isomorphic mapping from mccs(g1, g2) to g2 such that, once all the edges of this mapping are removed from g2, all the vertices in the embedding are disconnected (i.e. the removal fully disconnects g2).

Connectivity Dominance
Theorem. Given three graphs g1, g2, and g3, if the connectivity of mccs(g1, g2) dominates g2 or the connectivity of mccs(g3, g2) dominates g2, then dist(g1, g3) ≤ dist(g1, g2) + dist(g2, g3).
Here g1 = Query, g2 = Feature (Index), g3 = Data.
Example 1: mccs(g2, g3) dominates g2.
Example 2: mccs(g1, g2) does not dominate g2, and mccs(g2, g3) does not dominate g2.
Checking dominance: count the number of disconnected components (a linear algorithm).
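The dominance check can be sketched as a single graph traversal. The Python sketch below assumes one particular reading of the definition: after the embedded edges are removed from g2, no two vertices of the embedding may remain in the same connected component. The function name, the edge-list representation, and the requirement that the embedding is passed in as a list of g2 edges are all illustrative assumptions.

```python
def dominates(g2_edges, embedded_edges):
    """Check connectivity dominance for one embedding of mccs(g1, g2) in g2.

    g2_edges: undirected edges of g2 as (u, v) pairs.
    embedded_edges: the edges of g2 used by the embedding (assumed to be a
    subset of g2_edges).
    Assumed reading: dominance holds if, after deleting the embedded edges,
    every embedded vertex lies in its own connected component.
    """
    removed = {frozenset(e) for e in embedded_edges}
    embedded_vertices = {v for e in embedded_edges for v in e}

    # Adjacency lists of g2 without the embedded edges.
    adj = {}
    for u, v in g2_edges:
        adj.setdefault(u, [])
        adj.setdefault(v, [])
        if frozenset((u, v)) not in removed:
            adj[u].append(v)
            adj[v].append(u)

    # Label connected components with an iterative DFS (linear time).
    component = {}
    for root in adj:
        if root in component:
            continue
        component[root] = root
        stack = [root]
        while stack:
            x = stack.pop()
            for y in adj[x]:
                if y not in component:
                    component[y] = root
                    stack.append(y)

    ids = [component[v] for v in embedded_vertices]
    return len(ids) == len(set(ids))


# g2: a triangle 0-1-2 with a pendant vertex 3 attached to vertex 2.
g2 = [(0, 1), (1, 2), (2, 0), (2, 3)]
print(dominates(g2, [(0, 1), (1, 2), (2, 0)]))  # embedding covers the triangle
print(dominates(g2, [(0, 1), (1, 2)]))          # edge 2-0 still links 0 and 2
```

The first call reports dominance, because removing the triangle leaves vertices 0, 1, and 2 in separate components; the second does not, because the remaining edge 2-0 keeps two embedded vertices connected.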

Filtering rules (σ is the subgraph distance threshold)
Validation Rule 1, from dist(Q,F) + dist(F,D) ≥ dist(Q,D):
dist(Q,F) + dist(F,D) ≤ σ => dist(Q,D) ≤ σ
Condition: mccs(Q, F) dominates F or mccs(F, D) dominates F.
Pruning Rule 1, from dist(Q,D) + dist(D,F) ≥ dist(Q,F):
dist(Q,F) − dist(D,F) > σ => dist(Q,D) > σ
Condition: mccs(D, F) dominates D.
Pruning Rule 2, from dist(F,Q) + dist(Q,D) ≥ dist(F,D):
dist(F,D) − dist(F,Q) > σ => dist(Q,D) > σ
Condition: mccs(F, Q) dominates Q.
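As a rough illustration of how the three rules could be combined for one feature F, the Python sketch below takes precomputed distances and dominance flags and decides whether a data graph D is validated, pruned, or left as a candidate for verification. The function name, parameter names, and return strings are hypothetical; the talk does not specify such an interface.

```python
def classify(sigma, d_qf, d_fd, d_df, d_fq,
             f_dominated=False, d_dominated=False, q_dominated=False):
    """Apply Validation Rule 1 and Pruning Rules 1 and 2 for one feature F.

    sigma: subgraph distance threshold.
    d_qf, d_fd, d_df, d_fq: dist(Q,F), dist(F,D), dist(D,F), dist(F,Q).
    f_dominated: mccs(Q,F) dominates F or mccs(F,D) dominates F.
    d_dominated: mccs(D,F) dominates D.
    q_dominated: mccs(F,Q) dominates Q.
    """
    # Validation Rule 1: dist(Q,D) <= dist(Q,F) + dist(F,D) <= sigma.
    if f_dominated and d_qf + d_fd <= sigma:
        return "validated"
    # Pruning Rule 1: dist(Q,D) >= dist(Q,F) - dist(D,F) > sigma.
    if d_dominated and d_qf - d_df > sigma:
        return "pruned"
    # Pruning Rule 2: dist(Q,D) >= dist(F,D) - dist(F,Q) > sigma.
    if q_dominated and d_fd - d_fq > sigma:
        return "pruned"
    # No rule applies: D must be checked by the verification algorithm.
    return "candidate"


print(classify(2, d_qf=1, d_fd=1, d_df=0, d_fq=0, f_dominated=True))
print(classify(1, d_qf=4, d_fd=0, d_df=1, d_fq=0, d_dominated=True))
print(classify(1, d_qf=4, d_fd=0, d_df=1, d_fq=0))
```

Each rule is applied only when its dominance condition holds, since the corresponding triangular inequality is not guaranteed otherwise.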

Verification Algorithm
Basic idea:
1. Enumerate sub-spanning trees of the query graph such that the number of missing edges is ≤ σ (the subgraph distance threshold), and try to terminate the algorithm as early as possible.
2. Share the enumeration costs in two ways:
a. Do not enumerate everything from scratch.
b. Once enumerated, keep the enumerated spanning trees.
Convert the query to a QI-Sequence [VLDB08] to favour earlier termination (prefix = induced subgraph):
1.1 Infrequent labels (in all data graphs) first
1.2 Higher-degree vertices (in the query graph) first
1.3 Dense induced subgraphs (in the query graph) first
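The ordering heuristics can be illustrated with a small Python sketch. It is not the QI-Sequence construction from [VLDB08]; it only orders query vertices so that rarer labels come first, ties go to higher-degree vertices, and each new vertex is adjacent to one already placed, so every prefix stays connected. The third heuristic (dense induced subgraphs first) is left out. The function name and the `label_freq` table of label counts over the data graphs are assumptions.

```python
def order_query_vertices(q_labels, q_edges, label_freq):
    """Order query vertices: infrequent labels first, then higher degree,
    extending the order only along edges so each prefix is connected."""
    adj = {v: set() for v in q_labels}
    for u, v in q_edges:
        adj[u].add(v)
        adj[v].add(u)

    def rank(v):
        # Smaller tuple = earlier: rare label, then high degree, then vertex id.
        return (label_freq.get(q_labels[v], 0), -len(adj[v]), v)

    order = [min(q_labels, key=rank)]
    placed = set(order)
    while len(order) < len(q_labels):
        frontier = {w for v in placed for w in adj[v]} - placed
        if not frontier:  # query is disconnected: restart from the best remaining vertex
            frontier = set(q_labels) - placed
        nxt = min(frontier, key=rank)
        order.append(nxt)
        placed.add(nxt)
    return order


# A star around vertex 1 with one nitrogen leaf; nitrogen is rare in the data.
q_labels = {0: "C", 1: "C", 2: "N", 3: "C"}
q_edges = [(0, 1), (1, 2), (1, 3)]
label_freq = {"C": 100, "N": 5}
print(order_query_vertices(q_labels, q_edges, label_freq))
```

Starting from the rare label means a mismatch is likely to be found after matching only a few vertices, which is the point of favouring early termination.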

Verification Algorithm
MCCS Detection Algorithm
1. Compute the QI-Sequence.
2. DFS: threshold-based DFS search (A-B-C matched).
3. Generate a new QI-Sequence from the existing one (shown in turn for removing edge B-D, edge B-E, and edge B-F).
4. DFS: threshold-based DFS search in the right subtree (the second A-B matched).
5. Generate a new QI-Sequence from the existing one (remove edge B-C).
6. Terminate (dist(q, g) ≤ 3).

Feature Selection
Pruning Rule 1: mccs(D, F) dominates D
Pruning Rule 2: mccs(F, Q) dominates Q
=> F should be dense.
=> Discriminative Frequent Induced Subgraphs
Validation Rule 1: mccs(F, D) dominates F or mccs(Q, F) dominates F
=> F nearly contains Q and F should be sparse.
=> Frequent Large Sparse Subgraphs
Algorithm: gSpan [ICDM02] with our on-the-fly feature selection.

Experiments
Settings:
CPU: Intel Xeon 2.40GHz
Memory: 4G
System: Debian Linux
Compiler: GNU GCC
Dataset: AIDS Antiviral dataset, a popular benchmark, 43k chemical bonds

Experiments

Conclusion
Connected Substructure Similarity Search
1. Measure: Maximum Connected Common Subgraph (MCCS)
2. Connectivity Dominance => Triangular inequality
3. MCCS Detection Algorithm (Index, Filtering & Validation, Verification Techniques)
Future Work: Large graphs? New measures?
Thanks