Similarity Search in Graph Databases: A Multi-Layered Indexing Approach
Yongjiang Liang, Peixiang Zhao
CS @ FSU
zhao@cs.fsu.edu

Outline
Introduction
State-of-the-art solutions
ML-Index & similarity search
Experiments
Conclusion

Introduction: Similarity Search
Graphs are ubiquitous.
How can we enable efficient access methods and flexible, structure-aware querying capabilities for a large collection of graphs?
Exact graph querying may be too rigid and limited:
  Graphs contain noise and distortions.
  Rank-based exploration is highly desirable.
Hence: similarity search.

Introduction: Graph Edit Distance
Applications of graph similarity search:
  Chemistry: new drug discovery and synthesis
  CV & PR (computer vision and pattern recognition): identity discovery, object detection, scene identification
  Bioinformatics: biological pathway enumeration
How to model similarity for graphs? Graph edit distance (GED):
  A general metric for fine-grained proximity of graph structure and content
  Computing it is NP-hard

Introduction: Graph Edit Distance (example)
Graph edit operations: vertex/edge insertion, deletion, and relabeling.
Graph edit distance GED(q, g): the minimum number of graph edit operations needed to transform q into g.
Example (figure: query graph q and data graph g; S = sulfur, P = phosphorus), GED(q, g) = 5:
  1. Relabel N to S
  2. Add new vertex P
  3. Add new edge (P, C1)
  4. Add new edge (P, C2)
  5. Relabel edge (C1, C2)
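
To make the definition concrete, here is a minimal brute-force sketch of unit-cost GED for very small graphs. It is an illustration only: the function name `ged`, the dictionary-based graph encoding, and the toy molecule fragments are assumptions, not the slide's actual figure, and the enumeration over all vertex correspondences is exponential, which is consistent with the NP-hardness noted above.

```python
from itertools import permutations

def ged(v1, e1, v2, e2):
    """Exact unit-cost GED between two small labeled undirected graphs.

    v1, v2: dict mapping vertex id -> vertex label.
    e1, e2: dict mapping frozenset({u, v}) -> edge label (never None).
    """
    a, b = list(v1), list(v2)
    n = max(len(a), len(b))
    a += [None] * (n - len(a))  # None marks a dummy (inserted/deleted) vertex
    b += [None] * (n - len(b))
    best = None
    for image in permutations(b):  # try every vertex correspondence
        cost = 0
        for x, y in zip(a, image):
            # vertex insertion, deletion, or relabeling costs 1
            if x is None or y is None or v1[x] != v2[y]:
                cost += 1
        for i in range(n):
            for j in range(i + 1, n):
                l1 = e1.get(frozenset((a[i], a[j])))
                l2 = e2.get(frozenset((image[i], image[j])))
                if l1 != l2:  # edge insertion, deletion, or relabeling
                    cost += 1
        if best is None or cost < best:
            best = cost
    return best

# Hypothetical toy graphs that mirror the five edit steps listed above.
q_v = {"C1": "C", "C2": "C", "N": "N"}
q_e = {frozenset(("C1", "C2")): "single", frozenset(("C1", "N")): "single"}
g_v = {"C1": "C", "C2": "C", "S": "S", "P": "P"}
g_e = {
    frozenset(("C1", "C2")): "double",
    frozenset(("C1", "S")): "single",
    frozenset(("P", "C1")): "single",
    frozenset(("P", "C2")): "single",
}
print(ged(q_v, q_e, g_v, g_e))  # 5
```

Padding the smaller graph with dummy vertices is sufficient here because, under unit costs, relabeling a vertex is never more expensive than deleting it and inserting another.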

Problem Formulation: Similarity Search in Graph Databases
Given a graph database G = {g1, g2, …, gn}, a query graph q, and a GED threshold 𝝉, find all data graphs gi ∈ G such that GED(gi, q) ≤ 𝝉.
The problem is NP-hard.

State-of-the-art Solutions
The filtering-verification framework:
  Filtering: prune unpromising graphs from G to form a candidate set C, using a lower bound on GED(gi, q); a graph whose lower bound exceeds 𝝉 cannot be an answer.
  Verification: check the GED constraint only on C, where ideally |C| << |G|.
Existing methods: K-AT [TKDE’12], SEGOS [ICDE’12], b-Tree [CIKM’13], Pars [VLDB’13]
Partition-based filtering (Pars): each data graph gi is decomposed into (𝝉 + 1) partitions; if no partition pi is contained in (subgraph-isomorphic to) q, gi is filtered.
A good GED lower bound should be:
  1. Cost-effective
  2. Cheap to compute
  3. Powerful in filtering
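
The filtering-verification loop can be sketched as follows. This is not one of the methods named on the slide: for brevity it swaps in a simple label-counting lower bound (each unmatched vertex or edge label needs at least one edit operation). The names `label_lower_bound`, `filter_candidates`, `similarity_search`, and the `exact_ged` callable are hypothetical, and graphs are assumed to be `(vertex_labels, edge_labels)` pairs of dictionaries.

```python
from collections import Counter

def label_lower_bound(g, q):
    """A cheap lower bound on unit-cost GED based only on label multisets."""
    (gv, ge), (qv, qe) = g, q
    common_v = sum((Counter(gv.values()) & Counter(qv.values())).values())
    common_e = sum((Counter(ge.values()) & Counter(qe.values())).values())
    return (max(len(gv), len(qv)) - common_v) + (max(len(ge), len(qe)) - common_e)

def filter_candidates(db, q, tau):
    """Filtering phase: keep graphs whose lower bound does not exceed tau."""
    return [gid for gid, g in db.items() if label_lower_bound(g, q) <= tau]

def similarity_search(db, q, tau, exact_ged):
    """Verification phase: run the expensive exact GED only on candidates.

    exact_ged is a user-supplied callable (g, q) -> int; it is an assumption
    of this sketch, not an API from any library.
    """
    return [gid for gid in filter_candidates(db, q, tau)
            if exact_ged(db[gid], q) <= tau]

# Hypothetical toy database.
q = ({1: "C", 2: "C", 3: "N"},
     {frozenset((1, 2)): "single", frozenset((1, 3)): "single"})
db = {
    "g1": ({1: "C", 2: "C", 3: "S"},
           {frozenset((1, 2)): "double", frozenset((1, 3)): "single"}),
    "g2": ({1: "O", 2: "O", 3: "O", 4: "O"},
           {frozenset((1, 2)): "double", frozenset((2, 3)): "double",
            frozenset((3, 4)): "double"}),
}
print(filter_candidates(db, q, tau=2))  # ['g1']
```

Only `g1` reaches verification; `g2` has a lower bound of 7 and is pruned without any exact GED computation.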

Technical Questions in Partition-based GED Similarity Search
1. How to choose the right number of partitions for each data graph?
  Existing: 𝝉 + 1
  ML-Index: 𝝉 + k (k ≥ 1)
2. How to partition each data graph?
  Existing: random partitioning
  ML-Index: selectivity-aware partitioning
3. How to guarantee the query performance?
  Existing: one-layer index, no performance guarantees
  ML-Index: multi-layered index with performance guarantees

Partition-based GED Lower Bounds
For each graph g in G, decompose it into (𝝉 + k) partitions, where k ≥ 1 is a variable.
If GED(g, q) ≤ 𝝉, at least k of these partitions must be contained in q.
Tighter GED bounds: when k > 1, the probability of filtering a false-positive graph from G is higher than when k = 1.
Example (figure: q and g): 𝝉 = 2, k = 2
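
The filter implied by this bound can be sketched as below, assuming that every edit operation touches at most one partition, so that 𝝉 operations leave at least k of the 𝝉 + k partitions intact. The brute-force containment test and the names `contains` and `survives_partition_filter` are illustrative assumptions; a real index would avoid enumerating mappings, and details such as how edges crossing partitions are handled are ignored here.

```python
from itertools import permutations

def contains(q, p):
    """True if pattern p is subgraph-isomorphic to q with matching labels."""
    qv, qe = q
    pv, pe = p
    p_nodes = list(pv)
    for image in permutations(list(qv), len(p_nodes)):
        m = dict(zip(p_nodes, image))
        if any(pv[x] != qv[m[x]] for x in p_nodes):
            continue  # vertex labels disagree under this mapping
        ok = True
        for edge, label in pe.items():
            u, v = tuple(edge)
            if qe.get(frozenset((m[u], m[v]))) != label:
                ok = False  # a pattern edge is missing or relabeled in q
                break
        if ok:
            return True
    return False

def survives_partition_filter(partitions, q, k):
    """Keep a data graph only if at least k of its partitions occur in q."""
    matched = sum(1 for p in partitions if contains(q, p))
    return matched >= k

# Hypothetical example with tau = 2 and k = 2, hence 4 partitions.
q = ({1: "A", 2: "B", 3: "C"},
     {frozenset((1, 2)): "x", frozenset((2, 3)): "x"})
partitions = [
    ({"a": "A", "b": "B"}, {frozenset(("a", "b")): "x"}),  # occurs in q
    ({"d": "D"}, {}),                                       # label D is absent
    ({"c": "C"}, {}),                                       # occurs in q
    ({"a": "A", "c": "C"}, {frozenset(("a", "c")): "x"}),  # no A-C edge in q
]
print(survives_partition_filter(partitions, q, k=2))  # True
```

With the same partitions and k = 3 the graph would be pruned, which illustrates why a larger k gives a stricter filter.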

Selectivity-aware Graph Partitioning
A motivating example (figure: q and g, 𝝉 = 2, k = 2)
Selectivity of partitions depends on:
  Partition size
  Vertex/edge label frequency
A linear-time, greedy selectivity-aware graph partitioning algorithm: assign each vertex to the partition with the maximum selectivity gain.
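
The slide does not give the selectivity formula, so the following is only a toy sketch of the greedy idea under a made-up model: the selectivity of a partition is taken to be 1 minus the product of the relative frequencies of its vertex labels, which grows with partition size and with label rarity. The function name `greedy_partition`, the `label_freq` table, and the model itself are assumptions; edge labels and connectivity are ignored.

```python
def greedy_partition(labels, label_freq, num_parts):
    """Assign each vertex to the partition with the largest selectivity gain.

    labels: dict vertex id -> label.
    label_freq: dict label -> relative frequency in the database (0..1).
    Toy model: selectivity(P) = 1 - product of label frequencies in P.
    """
    parts = [[] for _ in range(num_parts)]
    product = [1.0] * num_parts  # running product of frequencies per partition
    # Visit vertices with rare labels first (stable for equal frequencies).
    order = sorted(labels, key=lambda v: label_freq[labels[v]])
    for v in order:
        f = label_freq[labels[v]]
        # Gain of adding v to partition i under the toy model.
        gains = [product[i] * (1.0 - f) for i in range(num_parts)]
        best = max(range(num_parts), key=gains.__getitem__)
        parts[best].append(v)
        product[best] *= f
    return parts

# Hypothetical graph: three common C vertices and one rare N vertex.
labels = {1: "C", 2: "C", 3: "C", 4: "N"}
label_freq = {"C": 0.9, "N": 0.1}
print(greedy_partition(labels, label_freq, num_parts=2))  # [[4], [1, 2, 3]]
```

In this toy run the rare N vertex forms a partition on its own, while the three common C vertices are grouped together so that the second partition is not trivially unselective.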

Multi-Layered Indexing Framework
Idea: incorporate multiple GED lower bounds, as opposed to one, to strengthen the collaborative filtering capability.
  With high probability, a false-positive graph will be identified and filtered from G.
  Similarity search performance is theoretically guaranteed.
ML-Index (Multi-Layered Index): L distinct index layers. Each layer i has:
  A partition-based GED lower bound, characterized by ki
  A graph partitioning scheme
  The resulting graph partitions, used for filtering false-positive graphs

Multi-Layered Indexing Framework

ML-Index Based Similarity Search
Given a query q, explore ML-Index layer by layer for candidate generation.
Graphs passing the GED lower bounds of ALL layers constitute the candidate set C.
Time complexity has three components: initialization and set operations, candidate generation, and GED verification.
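
Layer-by-layer candidate generation can be sketched with plain set operations. The data layout is an assumption made for illustration: each layer is a pair `(k, parts_of)` where `parts_of` maps a graph id to the ids of its partitions in that layer, and `contained` is the set of partition ids already found to occur in q (for example, via an index lookup).

```python
def generate_candidates(layers, contained):
    """Return the graph ids that pass the lower bound of every layer.

    layers: list of (k, parts_of), parts_of: dict graph id -> list of
            partition ids produced by that layer's partitioning scheme.
    contained: set of partition ids that are subgraphs of the query.
    """
    if not layers:
        return set()
    candidates = set(layers[0][1])  # initialization: all indexed graphs
    for k, parts_of in layers:
        passed = set()
        for gid in candidates:  # only graphs that survived earlier layers
            hits = sum(1 for pid in parts_of.get(gid, []) if pid in contained)
            if hits >= k:
                passed.add(gid)
        candidates = passed
        if not candidates:
            break  # nothing left to verify
    return candidates

# Hypothetical two-layer index for tau = 2: k = 1 (3 parts) and k = 2 (4 parts).
layers = [
    (1, {"g1": ["a1", "a2", "a3"], "g2": ["b1", "b2", "b3"],
         "g3": ["c1", "c2", "c3"]}),
    (2, {"g1": ["a4", "a5", "a6", "a7"], "g2": ["b4", "b5", "b6", "b7"],
         "g3": ["c4", "c5", "c6", "c7"]}),
]
contained = {"a1", "a4", "a5", "b2", "b4", "c5", "c6"}
print(generate_candidates(layers, contained))  # {'g1'}
```

Here `g3` is removed by the first layer and `g2` by the second, so only `g1` is passed on to GED verification.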

Experiments
Evaluation methods: Pars [VLDB’13], Selectivity, ML-Index
Datasets: AIDS, Protein, GraphGen
Evaluation metrics: index construction cost, similarity search performance

Index Construction Cost
[Figures on the AIDS dataset: number of features, index size, index time]

Similarity Search Performance
[Figure: AIDS dataset]

Conclusions
Problem: enable GED-based similarity search in large graph databases
  Widely varying real-world applications
  NP-hard
ML-Index: a multi-layered graph indexing framework
  A generic, parameterized, tighter GED lower bound
  Selectivity-aware graph partitioning
  Multi-layered indexing with guaranteed search performance

Thank you! Q & A