Inves: Incremental Partitioning-based Verification for Graph Similarity Search
Jongik Kim1, Dong-Hoon Choi2, and Chen Li3
1Chonbuk National University, South Korea
2Korea Institute of Science and Technology Information, South Korea
3University of California, Irvine

Introduction - Graph Similarity Search
Graph Data Model
- Graphs are ubiquitous and abundant in real-world data
- Finding occurrences of a graph in a database is an essential operation
- We need to tolerate noise, distortion, and different representations of graphs
  ⇒ Calls for graph similarity search
Graph Similarity Search
- An important access method in many research areas
- Cheminformatics: predicting properties of chemicals, drug design
- Bioinformatics: similar DNA interactions
- CV&PR: object detection, fingerprint identification
- …

Introduction - Graph Edit Distance
Graph Edit Distance (GED)
- A general metric for measuring the similarity between two graphs
- The minimum number of graph edit operations needed to transform one graph into the other:
  - insertion of a single vertex or edge
  - deletion of a single vertex or edge
  - substitution of the label of a single vertex or edge
- GED computation is NP-hard
[Figure: two molecular graphs x and y with GED(x, y) = 3. Vertex labels denote atom symbols; edge labels (single and double lines) denote chemical bonds.]
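For very small graphs, the definition above can be checked directly by trying every vertex mapping. The sketch below only illustrates the definition under unit edit costs; it is not the method of the talk. The graph encoding (a pair of dicts holding vertex labels and edge labels, with no self-loops) and the names `edit_cost` and `ged_bruteforce` are assumptions made for this example.

```python
from itertools import permutations

def edit_cost(x, y, mapping):
    """Unit-cost edits induced by a vertex mapping (x vertex -> y vertex or None)."""
    (xv, xe), (yv, ye) = x, y
    cost = 0
    for u, v in mapping.items():
        if v is None or xv[u] != yv[v]:
            cost += 1                        # delete u, or substitute its label
    used = {v for v in mapping.values() if v is not None}
    cost += len(yv) - len(used)              # insert the unmatched vertices of y
    covered = set()
    for e, label in xe.items():
        a, b = tuple(e)
        image = frozenset((mapping[a], mapping[b]))
        if image in ye:
            covered.add(image)
            cost += int(label != ye[image])  # substitute the edge label if needed
        else:
            cost += 1                        # delete the edge
    cost += len(ye) - len(covered)           # insert the uncovered edges of y
    return cost

def ged_bruteforce(x, y):
    """Minimum edit cost over all injective mappings; exponential, tiny graphs only."""
    xs = list(x[0])
    targets = list(y[0]) + [None] * len(xs)  # None lets any vertex of x be deleted
    return min(edit_cost(x, y, dict(zip(xs, image)))
               for image in permutations(targets, len(xs)))

# Toy molecules: a C-C-O chain versus a C-C=N chain.
x = ({1: "C", 2: "C", 3: "O"}, {frozenset((1, 2)): "-", frozenset((2, 3)): "-"})
y = ({1: "C", 2: "C", 3: "N"}, {frozenset((1, 2)): "-", frozenset((2, 3)): "="})
print(ged_bruteforce(x, y))  # 2: relabel O as N and change one bond
```

The number of mappings grows factorially with the graph size, which is why practical systems rely on filtering and on pruned search instead of enumeration.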

Graph Similarity Search
Graph Similarity Search with a GED constraint
- Given a graph database and a query graph with a GED threshold τ, graph similarity search finds all graphs in the database whose GED from the query graph is within τ
Filtering-and-Verification Framework
- Filtering: use a feature-based index to filter data graphs and generate candidate graphs (the main focus of existing work)
- Verification: verify each candidate graph by computing its GED to the query graph

Previous Work - Partition-based Approach
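Given two graphs x and y, suppose x is decomposed into τ+1 partitions
- GED(x, y) ≤ τ ⇒ at least one partition of x is contained in y (i.e., no partition of x is contained in y ⇒ GED(x, y) > τ)
- Reason: each edit operation affects at most one partition, so τ operations leave at least one of the τ+1 partitions intact
Filtering with a Partition-based Index [Pars, MLIndex]
- Offline processing: decompose each data graph in the DB into τ+1 partitions and index the partitions
- At query time, a data graph is a candidate if at least one of its partitions is contained in the query graph q
[Figure: DB = {g1, g2}, τ = 1. Indexed partitions: p1 (O, N, C) with Ip1 = {g1}, p2 (S, C, F) with Ip2 = {g1}, p3 (O, C) with Ip3 = {g2}. p1 is contained in the query graph q, so g1 is a candidate graph.]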

Motivation of Our Work (1/2)
Problem with Existing Index-based Filtering
- An offline partitioning of a data graph cannot work well for all queries
  ⇒ Existing methods suffer from many candidate graphs and an expensive verification phase
[Figure: DB = {g1, g2}, τ = 1, query graph q. Under the original (indexed) partitioning of g1, p1 is contained in q, so g1 is a candidate graph; the figure contrasts this with an alternative partitioning of g1.]

Motivation of Our Work (2/2)
Refine each candidate by partitioning it based on the query graph
- Cost for partitioning and containment tests << Cost for GED computation
- Filtering Phase: Candidate Generation
- Verification Phase (scope of our work): Candidate Refinement → GED Computation

Candidate Verification Scheme
Partition-based GED Lower Bound
- Given two graphs x and y with a partitioning of x, P(x) = {p1, p2, …, pk}, a GED lower bound between x and y is
  lb(x, y) = |{p | p ∈ P(x) and p is not contained in y}|
- A partition p that is not contained in y is called a mismatching partition
Candidate Verification
- For a candidate x and a query y with a GED threshold τ:
    if lb(x, y) > τ then prune x
    else if GED(x, y) > τ then prune x
    else x is an answer of the query
  ⇒ We compute the GED only when the lower bound is not greater than τ
Goal
- Tighten the lower bound by developing a novel partitioning strategy
- Exploit partitioning results to accelerate GED computation
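The verification scheme can be sketched as follows. This is a minimal illustration, assuming graphs are encoded as a pair of dicts (vertex labels, edge labels), that the partitions are vertex- and edge-disjoint subgraphs of x, and that "contained" means label-preserving subgraph isomorphism. The helper names `is_contained`, `lower_bound`, and `verify` are hypothetical, and the GED routine is passed in as a callable rather than implemented here.

```python
def is_contained(p, y):
    """True if p is subgraph-isomorphic to y with matching labels (backtracking)."""
    (pv, pe), (yv, ye) = p, y
    order = list(pv)

    def extend(mapping):
        if len(mapping) == len(order):
            return True
        u = order[len(mapping)]
        for v in yv:
            if v in mapping.values() or pv[u] != yv[v]:
                continue
            # every edge from u to an already mapped vertex must exist in y
            # with the same label
            if all(ye.get(frozenset((v, mapping[w]))) == pe[frozenset((u, w))]
                   for w in mapping if frozenset((u, w)) in pe):
                mapping[u] = v
                if extend(mapping):
                    return True
                del mapping[u]
        return False

    return extend({})

def lower_bound(partitions, y):
    """lb(x, y): the number of partitions of x that are not contained in y."""
    return sum(1 for p in partitions if not is_contained(p, y))

def verify(x, partitions, y, tau, ged):
    """Prune with the lower bound first; call the expensive ged only if needed."""
    if lower_bound(partitions, y) > tau:
        return False
    return ged(x, y) <= tau

# Toy data: x is the chain N-S-C=O, split into {N, S} and {C, O}.
x = ({1: "N", 2: "S", 3: "C", 4: "O"},
     {frozenset((1, 2)): "-", frozenset((2, 3)): "-", frozenset((3, 4)): "="})
p1 = ({1: "N", 2: "S"}, {frozenset((1, 2)): "-"})
p2 = ({3: "C", 4: "O"}, {frozenset((3, 4)): "="})
y = ({"a": "N", "b": "C", "c": "O"},
     {frozenset(("a", "b")): "-", frozenset(("b", "c")): "="})

print(lower_bound([p1, p2], y))  # 1: only p1 (it contains S) is mismatching
```

With τ = 0 the candidate is pruned by the bound alone, so the GED callable is never invoked; with τ ≥ 1 the bound is inconclusive and the GED has to be computed.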

Tightening the Lower Bound - Measure for a Good Partitioning
For every mismatching partition p in P(x):
- C1: Edit errors in p are indivisible and minimal
  - indivisibility: p cannot be decomposed into two mismatching partitions
  - minimality: p becomes a matching partition if we remove any vertex in p
- C2: An edit error in a bridge of p is captured by p, while preserving C1
  - bridge: an edge connecting p to another partition
See the paper for a detailed analysis of the tightness of the partition-based lower bound
[Figure: four alternative partitionings of x against y, yielding lb(x, y) = 1, 2, 3, and 4.]

Tightening the Lower Bound - Incremental Partitioning
Incremental Partitioning Strategy
1. Perform a containment test of x against y by investigating the vertices of x one after another
2. As soon as the test fails, isolate the investigated vertices and the edges connecting them into a separate partition
3. Repeat with the remaining part of x
- A mismatching partition p produced this way cannot be decomposed into two mismatching partitions (see the paper for the proof)
  ⇒ exactly meets the indivisibility constraint of the measure
[Figure: running example with lb(x, y) = 2.]
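The three steps of the strategy can be sketched as below. This is a simplified illustration, not the authors' implementation: it reruns a containment test from scratch after each added vertex instead of extending a partial match, it takes the vertex order as an input, and it ignores the bridge constraint. The graph encoding and the helper names (`is_contained`, `induced`, `incremental_partition`) are assumptions.

```python
def is_contained(p, y):
    """True if p is subgraph-isomorphic to y with matching labels (backtracking)."""
    (pv, pe), (yv, ye) = p, y
    order = list(pv)

    def extend(mapping):
        if len(mapping) == len(order):
            return True
        u = order[len(mapping)]
        for v in yv:
            if v in mapping.values() or pv[u] != yv[v]:
                continue
            if all(ye.get(frozenset((v, mapping[w]))) == pe[frozenset((u, w))]
                   for w in mapping if frozenset((u, w)) in pe):
                mapping[u] = v
                if extend(mapping):
                    return True
                del mapping[u]
        return False

    return extend({})

def induced(x, vertices):
    """Subgraph of x on the given vertices, with the edges among them."""
    xv, xe = x
    keep = set(vertices)
    return ({u: xv[u] for u in vertices},
            {e: lab for e, lab in xe.items() if e <= keep})

def incremental_partition(x, y, order):
    """Return a list of (partition, is_matching) pairs for x against y."""
    partitions, current = [], []
    for u in order:
        current.append(u)                      # step 1: investigate the next vertex
        if not is_contained(induced(x, current), y):
            # step 2: the test failed, so isolate what was investigated so far
            partitions.append((induced(x, current), False))
            current = []                       # step 3: continue with the rest of x
    if current:
        partitions.append((induced(x, current), True))
    return partitions

# Toy data: x is the chain S-C=O-N-F, y is the chain C=O-N-C.
x = ({1: "S", 2: "C", 3: "O", 4: "N", 5: "F"},
     {frozenset((1, 2)): "-", frozenset((2, 3)): "=",
      frozenset((3, 4)): "-", frozenset((4, 5)): "-"})
y = ({"a": "C", "b": "O", "c": "N", "d": "C"},
     {frozenset(("a", "b")): "=", frozenset(("b", "c")): "-",
      frozenset(("c", "d")): "-"})

parts = incremental_partition(x, y, [1, 2, 3, 4, 5])
print([(sorted(p[0]), matching) for p, matching in parts])
# [([1], False), ([2, 3, 4, 5], False)], i.e. lb(x, y) = 2
```

In this toy run the second partition only fails when F is added, so it is indivisible but not minimal, which is the situation the rematch method is meant to improve.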

Tightening the Lower Bound - Bridge Constraint
Bridge: an edge connecting one partition to another partition
- B(p, o): edit errors in the bridges of p with respect to an occurrence o of p
- An error in a bridge can be counted twice
  ⇒ A partition can use a half of the errors in its bridges (see the paper for the proof)
- Bridge Constraint: if B(p, o) > 1, p is mismatching with o
- Pushing the bridge constraint into the containment test
  ⇒ approximately meets the bridge error condition in the measure
[Figure: partition p3 and its occurrence o (vertices u6, u7, u8 and v3, v7, v6 are labeled). The bridge difference between p3 and o is 3 + 0 + 0 = 3, so p3 changes from a matching to a mismatching partition and lb(x, y) changes from 2 to 3.]

Tightening the Lower Bound - Rematch Method
Edit errors in a mismatching partition p are mainly caused by the last vertex (without the last vertex, p is a matching partition)
Rematch Method
- Reorder the vertices in p
  - Fix the last vertex as the start vertex
  - Place infrequent vertices and edges first while preserving vertex connectivity
- Rematch p with the new vertex ordering
- We can expect the edit errors to be detected in a smaller substructure
Further optimization
- Repeat rematching while the size of p decreases
  ⇒ approximately meets the minimality constraint in the measure
[Figure: vertex ordering u4 ← u3 ← u2 ← u1; the labels lb(x, y) = 2 and lb(x, y) = 1 appear in the figure.]
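The reordering step might look like the sketch below. It is an assumption-laden illustration: it starts from the last vertex and greedily picks, among vertices adjacent to those already ordered, the one with the least frequent label. It ranks by vertex labels only (the slide also mentions edges), and the `label_freq` table and the function name `rematch_order` are hypothetical.

```python
def rematch_order(p, last, label_freq):
    """Order the vertices of partition p for rematching, starting from `last`."""
    pv, pe = p
    adj = {u: set() for u in pv}
    for e in pe:
        a, b = tuple(e)
        adj[a].add(b)
        adj[b].add(a)
    order, seen = [last], {last}
    while len(order) < len(pv):
        # candidates adjacent to the ordered part keep the ordering connected
        frontier = {w for u in order for w in adj[u]} - seen
        if not frontier:                       # disconnected p: take any vertex
            frontier = set(pv) - seen
        nxt = min(frontier, key=lambda w: (label_freq.get(pv[w], 0), str(w)))
        order.append(nxt)
        seen.add(nxt)
    return order

# Toy partition: a star with center C (vertex 1) and leaves O, N, S.
p = ({1: "C", 2: "O", 3: "N", 4: "S"},
     {frozenset((1, 2)): "-", frozenset((1, 3)): "-", frozenset((1, 4)): "-"})
freq = {"C": 10, "O": 5, "N": 2, "S": 1}       # made-up label frequencies
print(rematch_order(p, last=4, label_freq=freq))  # [4, 1, 3, 2]
```

Rerunning the containment test in this order should tend to hit the error after fewer vertices, which is what lets the mismatching partition shrink.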

Improving GED Computation - Exploiting information from partitioning
Existing GED Computation Method
- The most widely used GED computation method is based on A*
- It considers all possible vertex mappings between two graphs in a best-first fashion
- Each internal state of the state-space tree denotes a partial vertex mapping
- For each active state, it calculates an estimated distance as the sum of the existing distance in mapped vertices and edges and an estimated distance of the unmapped parts
- It selects a state having the minimum distance and expands the state-space tree
  ⇒ We have a method to accurately estimate the distance of a partial vertex mapping (see the paper for the details)
Place vertices in mismatching partitions first!
- Since the existing edit errors in mapped vertices and edges are exactly calculated, we can find many edit errors at higher levels of the state-space tree
  ⇒ Significantly reduces the search space of the A* algorithm
- This approach is accelerated by our incremental partitioning technique, because the technique makes each mismatching partition as small as possible
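The "mismatching partitions first" rule amounts to choosing the order in which A* maps the vertices of x. A minimal sketch, assuming the partitioning result is available as a list of (vertex list, is_matching) pairs; the function name `mapping_order` and this input shape are hypothetical, and the paper may order vertices within a partition differently.

```python
def mapping_order(vertices, partitions):
    """Vertex order for A*: vertices of mismatching partitions come first."""
    first = [u for verts, matching in partitions if not matching for u in verts]
    placed = set(first)
    return first + [u for u in vertices if u not in placed]

partitions = [([1], False), ([2, 3], True), ([4, 5], False)]
print(mapping_order([1, 2, 3, 4, 5], partitions))  # [1, 4, 5, 2, 3]
```

Mapping these vertices early moves their edit errors into the exactly computed part of the estimate, so states that exceed τ can be discarded near the root.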

Experiments - Experimental Setup
Platform
- Intel Core i7 at 3.4GHz with 32GB RAM, running a 64-bit Ubuntu OS
Datasets
  Dataset   # graphs  Avg. # vertices  Avg. # edges
  AIDS      42,687    25.6             27.6
  PubChem   22,794    48.1             50
  PROTEIN   600       32.6             62
- Synthetic dataset: see the paper for the details and results
Query Workloads
- Randomly selected from the datasets
- For each dataset, 100 queries are selected
- Results are reported on the basis of the 100 queries
Search Algorithms
- G: GSimSearch [ICDE 2012, VLDB J. 2013]
- P: Pars [PVLDB 2013]
- M: MLIndex [ICDE 2017]

Experiments - AIDS Dataset
✽ y axes are log scaled in all figures

Experiments - PubChem Dataset

Experiments - Protein Dataset

Conclusions
Observation:
- Online dynamic partitioning of a candidate graph can reduce the cost of verification
Key Idea:
- Judiciously and incrementally partitioning a candidate graph to tighten the GED lower bound
- Exploiting the information collected during partitioning to accelerate GED computation
Key Results:
- Significantly enhanced the performance of graph similarity search
Thank you!

Improving GED Computation - Existing A* Algorithm for GED
[Figure: graphs x and y and the state-space tree with the root and the state {u1→v1}, whose estimated distance is 1.]
Estimating the edit distance of the partial mapping {u1→v1}: g + h
- g: existing edit distance of the mapped part = 0
- h: estimated edit distance of the unmapped part = label differences of unmapped edges and unmapped vertices = 0 + 1 = 1
  - unmapped edges: x has 5 single bonds and 1 double bond; y has 5 single bonds and 1 double bond ⇒ no difference
  - unmapped vertices: x has 1 S, 2 C's, and 1 O; y has 1 S and 3 C's ⇒ 1 difference (substituting O with C)
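The arithmetic for h can be reproduced with a multiset difference. This small sketch assumes the usual definition of a label difference, max(|A|, |B|) minus the number of labels the two multisets share; the function name `label_difference` is hypothetical, and the label lists are the counts quoted on the slide.

```python
from collections import Counter

def label_difference(a, b):
    """Multiset label difference: max(|a|, |b|) minus the number of common labels."""
    common = sum((Counter(a) & Counter(b)).values())
    return max(len(a), len(b)) - common

# Unmapped parts after mapping u1 to v1, as listed on the slide.
x_vertices, y_vertices = ["S", "C", "C", "O"], ["S", "C", "C", "C"]
x_edges = ["-"] * 5 + ["="]
y_edges = ["-"] * 5 + ["="]

h = label_difference(x_edges, y_edges) + label_difference(x_vertices, y_vertices)
print(h)  # 0 + 1 = 1
```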

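[Figure: state-space tree after expanding the root and then {u1→v1}; the number after each state is its estimated distance.]
root
- {u1→v1}: 1
  - {u1→v1, u2→v2}: 1
  - {u1→v1, u2→v3}: 4
  - {u1→v1, u2→v4}: 4
  - {u1→v1, u2→v5}: 4
  - {u1→v1, u2→ε}: 5
- {u1→v2}: 3
- {u1→v3}: 2
- {u1→v4}: 2
- {u1→v5}: 2
- {u1→ε}: 3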

Pruning with a given threshold τ (e.g., τ = 2)
- States whose estimated distance exceeds τ are pruned: {u1→v2} (3), {u1→ε} (3), {u1→v1, u2→v3} (4), {u1→v1, u2→v4} (4), {u1→v1, u2→v5} (4), {u1→v1, u2→ε} (5)
- States that remain active: {u1→v1, u2→v2} (1), {u1→v3} (2), {u1→v4} (2), {u1→v5} (2)
- Repeat the same procedure until there is no active node or a leaf node is found
In general, the performance of A* depends on the accuracy of the estimated distance
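A compact version of this search is sketched below. It is a generic A* over vertex mappings with unit edit costs and the label-difference estimate, written for illustration rather than taken from the paper; the graph encoding (dicts of vertex and edge labels, no self-loops, `None` standing for ε) and the names `gamma` and `astar_ged` are assumptions.

```python
import heapq
from collections import Counter
from itertools import count

def gamma(a, b):
    """Multiset label difference between two label lists."""
    return max(len(a), len(b)) - sum((Counter(a) & Counter(b)).values())

def astar_ged(x, y, tau):
    """Return GED(x, y) if it is at most tau, else None (unit edit costs)."""
    (xv, xe), (yv, ye) = x, y
    order = list(xv)

    def estimate(mapping):
        mapped = set(mapping)
        used = {v for v in mapping.values() if v is not None}
        # g: exact cost of the mapped vertices and of the edges among them
        g = sum(1 for u, v in mapping.items() if v is None or xv[u] != yv[v])
        matched = set()
        for e, lab in xe.items():
            if e <= mapped:
                a, b = tuple(e)
                image = frozenset((mapping[a], mapping[b]))
                if image in ye:
                    matched.add(image)
                    g += int(lab != ye[image])
                else:
                    g += 1
        g += sum(1 for e in ye if e <= used and e not in matched)
        # h: label differences of the unmapped vertices and edges
        h = gamma([xv[u] for u in xv if u not in mapped],
                  [yv[v] for v in yv if v not in used])
        h += gamma([lab for e, lab in xe.items() if not e <= mapped],
                   [lab for e, lab in ye.items() if not e <= used])
        return g + h

    tie = count()                    # tie-breaker so the heap never compares dicts
    heap = [(estimate({}), next(tie), {})]
    while heap:
        f, _, mapping = heapq.heappop(heap)
        if f > tau:
            return None              # even the best active state exceeds tau
        if len(mapping) == len(order):
            return f                 # leaf: the estimate is the exact distance
        u = order[len(mapping)]
        used = {v for v in mapping.values() if v is not None}
        for v in [w for w in yv if w not in used] + [None]:
            child = {**mapping, u: v}
            f_child = estimate(child)
            if f_child <= tau:       # prune states whose estimate exceeds tau
                heapq.heappush(heap, (f_child, next(tie), child))
    return None

x = ({1: "C", 2: "C", 3: "O"}, {frozenset((1, 2)): "-", frozenset((2, 3)): "-"})
y = ({1: "C", 2: "C", 3: "N"}, {frozenset((1, 2)): "-", frozenset((2, 3)): "="})
print(astar_ged(x, y, tau=3))  # 2
print(astar_ged(x, y, tau=1))  # None: the distance exceeds the threshold
```

Because the estimate never overestimates the true remaining cost, the first complete mapping taken from the queue is optimal, and a tighter estimate means fewer states survive the `f_child <= tau` check.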

Improving GED Computation - Improved Estimated Distance
Consider the partial vertex mapping {u1→v1, u2→v2, u3→v3}
- The existing distance g = 0
Previous work: label differences in the unmapped part
- Unmapped edges: x has 3 single bonds and 1 double bond; y has 3 single bonds and 1 double bond ⇒ no difference
- Unmapped vertices: x has 1 C and 1 O; y has 2 C's ⇒ 1 difference (substituting O with C)
- h = 0 + 1 = 1
Our approach: distinguish bridges from unmapped edges
- Bridge difference: between u1 and v1 = 1, between u2 and v2 = 0, between u3 and v3 = 1 ⇒ 2 differences
- Unmapped edges: x has 1 single bond; y has 1 single bond ⇒ no difference
- Unmapped vertices: x has 1 C and 1 O; y has 2 C's ⇒ 1 difference
- h = 2 + 0 + 1 = 3
  ⇒ much more accurate estimation!
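The two estimates can be compared in code. The sketch below is one reading of the slide, not the paper's exact formulation: a bridge is taken to be an edge with exactly one mapped endpoint, and bridge labels are compared separately for each mapped vertex pair. The graphs are made up so that their label counts match the numbers above (the original figure is not recoverable from the transcript), mappings to ε are not handled, and the names `gamma`, `h_plain`, and `h_bridge` are hypothetical.

```python
from collections import Counter

def gamma(a, b):
    """Multiset label difference between two label lists."""
    return max(len(a), len(b)) - sum((Counter(a) & Counter(b)).values())

def h_plain(x, y, mapping):
    """Previous work: bridges are lumped together with the other unmapped edges."""
    (xv, xe), (yv, ye) = x, y
    mx, my = set(mapping), set(mapping.values())
    vertices = gamma([lab for u, lab in xv.items() if u not in mx],
                     [lab for v, lab in yv.items() if v not in my])
    edges = gamma([lab for e, lab in xe.items() if not e <= mx],
                  [lab for e, lab in ye.items() if not e <= my])
    return vertices + edges

def h_bridge(x, y, mapping):
    """Bridge-aware estimate: compare the bridges of each mapped pair separately."""
    (xv, xe), (yv, ye) = x, y
    mx, my = set(mapping), set(mapping.values())
    vertices = gamma([lab for u, lab in xv.items() if u not in mx],
                     [lab for v, lab in yv.items() if v not in my])
    # edges with both endpoints unmapped
    inner = gamma([lab for e, lab in xe.items() if not e & mx],
                  [lab for e, lab in ye.items() if not e & my])
    # bridges: edges with exactly one mapped endpoint, grouped by that endpoint
    bridges = sum(
        gamma([lab for e, lab in xe.items() if u in e and len(e & mx) == 1],
              [lab for e, lab in ye.items() if v in e and len(e & my) == 1])
        for u, v in mapping.items())
    return vertices + inner + bridges

def graph(labels, edges):
    return labels, {frozenset((a, b)): lab for a, b, lab in edges}

x = graph({"u1": "C", "u2": "S", "u3": "C", "u4": "C", "u5": "O"},
          [("u1", "u2", "-"), ("u2", "u3", "-"), ("u1", "u4", "="),
           ("u2", "u5", "-"), ("u3", "u4", "-"), ("u4", "u5", "-")])
y = graph({"v1": "C", "v2": "S", "v3": "C", "v4": "C", "v5": "C"},
          [("v1", "v2", "-"), ("v2", "v3", "-"), ("v1", "v4", "-"),
           ("v2", "v5", "-"), ("v3", "v4", "="), ("v4", "v5", "-")])
mapping = {"u1": "v1", "u2": "v2", "u3": "v3"}

print(h_plain(x, y, mapping))   # 1
print(h_bridge(x, y, mapping))  # 3
```

The gain comes from the double bond: lumped together, the bridge labels of x and y look identical, but per vertex pair the double bond sits at u1 in x and at v3 in y, which exposes two differences.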
