Computer Science and Engineering
TreeSpan: Efficiently Computing Similarity All-Matching
Gaoping Zhu #, Xuemin Lin #, Ke Zhu #, Wenjie Zhang #, Jeffrey Xu Yu †
# The University of New South Wales   † The Chinese University of Hong Kong
Outline
- Introduction
- State-of-the-Art
- Our Approach
- Experiments
- Conclusions
Introduction — Graph Data
- Chem-informatics: chemical compounds (small size)
- Bio-informatics: PPI networks (medium size)
- Internet: the World Wide Web (large size)
Introduction — Exact All-Matching (I)
Exact all-matching: enumerate all exact (i.e., isomorphic) matches of a query graph q in a data graph G.
Applications:
- Query biological patterns in PPI networks.
- Detect suspicious bugs in software programs.
[Figure: a query graph q, a data graph G, and the exact matches of q in G.]
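The exact all-matching primitive can be sketched as a plain backtracking search: a minimal illustration assuming a simple adjacency-dict graph encoding (the function and variable names are mine, not the paper's, and real systems use far more pruning).

```python
# Sketch of exact all-matching: enumerate every subgraph isomorphism
# of query q into data graph G by backtracking over query vertices.
def exact_all_matching(q_labels, q_edges, g_labels, g_edges):
    q_adj = {u: set() for u in q_labels}
    for a, b in q_edges:
        q_adj[a].add(b); q_adj[b].add(a)
    g_adj = {v: set() for v in g_labels}
    for a, b in g_edges:
        g_adj[a].add(b); g_adj[b].add(a)

    order = list(q_labels)          # fixed search order over V(q)
    matches = []

    def extend(i, f):
        if i == len(order):         # all query vertices mapped
            matches.append(dict(f))
            return
        u = order[i]
        for v in g_labels:
            if v in f.values() or g_labels[v] != q_labels[u]:
                continue            # injectivity and label check
            # every already-matched neighbour of u must map to a neighbour of v
            if all(f[w] in g_adj[v] for w in q_adj[u] if w in f):
                f[u] = v
                extend(i + 1, f)
                del f[u]

    extend(0, {})
    return matches
```

For example, a one-edge query A–B has two exact matches in a star with centre A and two B-leaves, and a triangle query has none in a 3-vertex path.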
Introduction — Exact All-Matching (II)
The dilemma of exact all-matching: if q is issued by a user for exploratory purposes, or if G is noisy due to imprecise data collection, no exact matches may be found.
Potential solutions:
- Modify q/G and run exact all-matching again and again.
- Ask the system to return approximate results (i.e., similarity all-matching).
[Figure: a data graph G and a query graph q′ with no exact matches in G.]
SAPPER [VLDB'10, Zhang et al.] (I)
Similarity all-matching: given a query graph q, a data graph G and a similarity threshold θ, enumerate all similarity matches of q in G (i.e., all connected subgraphs of G missing at most θ edges of q).
Framework:
- Enumerate a set of seeds Q_SAPPER (i.e., all connected subgraphs q′ of q missing θ edges of q).
- Run exact all-matching on each seed q′ to obtain its exact matches.
- Induce similarity matches from the exact matches of the seeds.
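SAPPER's seeding step can be sketched by brute force: drop every θ-subset of query edges and keep the subgraphs that remain connected. This is only an illustration of what the seed set contains (it assumes seeds keep all query vertices, and the real SAPPER enumerates seeds far more efficiently).

```python
from itertools import combinations

def sapper_seeds(vertices, edges, theta):
    """Enumerate SAPPER-style seeds: connected spanning subgraphs of q
    obtained by deleting exactly theta edges (brute-force sketch)."""
    def connected(edge_set):
        adj = {v: set() for v in vertices}
        for a, b in edge_set:
            adj[a].add(b); adj[b].add(a)
        seen, stack = set(), [next(iter(vertices))]
        while stack:                      # DFS reachability check
            v = stack.pop()
            if v in seen:
                continue
            seen.add(v)
            stack.extend(adj[v] - seen)
        return len(seen) == len(vertices)

    seeds = []
    for dropped in combinations(edges, theta):
        kept = [e for e in edges if e not in dropped]
        if connected(kept):
            seeds.append(kept)
    return seeds
```

On a 4-cycle with θ = 1, dropping any single edge leaves a connected path, so there are 4 seeds; with θ = 2 no connected spanning subgraph survives.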
SAPPER [VLDB'10, Zhang et al.] (II)
Cost model: |Q_SAPPER| = # of exact all-matching tests.
[Figure: query q (θ = 1), data graph G with vertices v1–v5, the SAPPER seeds q′1–q′4, and two exact matches F1 = {u1→v1, u2→v2, u3→v3, u4→v4} and F2 = {u1→v2, u2→v1, u3→v3, u4→v4}.]
Our Approach — Overview (I)
Tree-based spanning search paradigm (TSpan): enumerate a set of seeds Q_T (i.e., spanning trees of q that cover all connected subgraphs q′ of q missing at most θ edges of q).
Primary contributions:
- Reduce the # of exact all-matching tests (i.e., the # of seeds).
- Reduce the complexity of each exact all-matching test from graph-to-graph to tree-to-graph.
[Figure: for q with θ = 2, SAPPER needs 3 all-matching tests on connected subgraphs of q plus 7 more seeds, while TSpan needs only 1 all-matching test on a spanning tree of q.]
Our Approach — Overview (II)
Generating similarity maximal matches: producing only similarity maximal matches further reduces the # of exact all-matching tests.
[Figure: query q (θ = 1), data graph G with vertices v1–v5, and the similarity maximal matches F1 = {u1→v1, u2→v2, u3→v3, u4→v4} and F2 = {u1→v2, u2→v1, u3→v3, u4→v4}.]
Our Approach — Problem Statement
Similarity maximal all-matching: given a query graph q, a data graph G and a similarity threshold θ, enumerate all distinct similarity maximal matches of q in G conforming to θ.
Our Approach — Seeding (I)
PRIM order on spanning trees: similar to the basic idea behind minimum spanning trees (Prim's algorithm).
Given a total order on E(q), a spanning tree T = {T[0], T[1], …, T[|V(q)| − 1]} of q (where T[0] is the head vertex) conforms to the PRIM order if and only if each spanning edge T[i] is the smallest-order edge in E(q) − {T[1], …, T[i − 1]} that connects to {T[0], T[1], …, T[i − 1]}.
[Figure: query q with edges e1–e6 and its PRIM-order spanning tree T = {e1, e2, e3}.]
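The definition above translates directly into a Prim-style construction: grow the tree from the head vertex, always taking the smallest-order edge that reaches a new vertex. A minimal sketch (edge order and names are illustrative):

```python
def prim_order_tree(vertices, ordered_edges, head):
    """Build the PRIM-order spanning tree of q for a given total edge order.
    ordered_edges lists E(q) from smallest to largest order."""
    in_tree = {head}
    tree = []
    while len(in_tree) < len(vertices):
        for a, b in ordered_edges:               # smallest order first
            if (a in in_tree) != (b in in_tree): # connects tree to outside
                tree.append((a, b))
                in_tree.update((a, b))
                break
        else:
            raise ValueError("graph is disconnected")
    return tree
```

Fixing the edge order makes the spanning tree of a query unique for a given head vertex, which is what lets Q_T be enumerated systematically.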
Our Approach — Seeding (II)
Avoiding duplicate results: two spanning trees of q may induce duplicate similarity maximal matches.
Associate an edge exclusion set T.R with each T in Q_T: T.R is a set of edges in E(q) − E(T) that are enforced to be mismatched in the similarity maximal matches induced by T.
[Figure: query q (θ = 2), data graph G, and two spanning trees T1 with T1.R = ∅ and T2 with T2.R = {(A, D)}.]
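The exclusion semantics can be sketched as a post-check on an induced match: a result produced from tree T is kept only if every edge in T.R genuinely fails to map to an edge of G, so the same match is never reported by two different trees. (A minimal illustration; the graph encoding and names are assumptions, not the paper's API.)

```python
def respects_exclusion(match, exclusion_edges, g_edges):
    """Keep a match induced from spanning tree T only if every edge in
    T.R maps to a non-edge of G (edge exclusion semantics, sketch)."""
    g_edge_set = {frozenset(e) for e in g_edges}
    return all(frozenset((match[a], match[b])) not in g_edge_set
               for a, b in exclusion_edges)
```

So if T.R = {(A, D)} and the match sends A, D to adjacent data vertices, the match is rejected for this tree; the tree that spans edge (A, D) reports it instead.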
Our Approach — Seeding (III)
Q_T enumeration algorithm (via go-down and alternate-reorder steps) for q (θ = 2) with edges e1–e6:
T (spanning tree) / T.R (exclusion set)
1. e1 e2 e3 / { }
2. e1 e2 e4 / {e3}
3. e1 e2 e5 / {e3, e4}
4. e1 e4 e3 / {e2}
5. e1 e4 e6 / {e2, e3}
6. e1 e5 e3 / {e2, e4}
7. e4 e3 e2 / {e1}
8. e4 e3 e5 / {e1, e2}
9. e4 e5 e2 / {e1, e3}
10. e6 e2 e3 / {e1, e4}
Our Approach — Seeding (IV)
Q_T enumeration algorithm:
- Correctness: using Q_T to induce similarity maximal matches neither generates duplicate results nor misses valid results.
- Minimality of Q_T: removing any spanning tree from Q_T breaks the completeness of the results under the edge exclusion semantics.
- When |E(q)| = m and |V(q)| = n: (1) |Q_SAPPER| ≥ |Q_T|; (2) |Q_T| = |Q_SAPPER| only when θ = 0 or θ = m − n + 1.
Our Approach — Searching (I)
Effectively storing Q_T: use a DFS traversal tree, so that spanning trees sharing a common prefix of spanning edges share computation cost.
[Figure: the ten spanning trees of Q_T arranged as a DFS traversal tree rooted at R.]
Our Approach — Searching (II)
Similarity maximal all-matching algorithm sketch: traverse the DFS traversal tree in a depth-first backtracking search fashion.
- go-down: starting from the initial spanning tree, recursively drill down to extend the current partial match along the next spanning edge T[i] of the current spanning tree T.
- alternate: if T[i] cannot be extended from the current partial match and we can still afford to mismatch T[i] while conforming to θ, switch from T to the alternative spanning tree T′ enumerated by replacing T[i] with T′[i].
Our Approach — Optimizations
Optimization (I): EnumerateOnDemand strategy.
- Motivation: further reduce the number of seeds.
- Enumerate an alternative tree T′ from the current tree T only when it is feasible to extend the current partial similarity maximal match, conforming to θ, (1) on the next spanning edge T[i] or (2) on its replacement T′[i].
Optimization (II): effective search order.
- Motivation: terminate each all-matching test as early as possible.
- Decide the search order of the spanning edges in T based on the post-filtering candidate sets of the vertices of q.
Our Approach — Filtering & Ordering (I)
Neighborhood aggregate N(v, g): given a set of labels Σ_V = {L1, …, Lm}, N(v, g) = (x1, …, xm) where xi is the number of neighbors of v in g with label Li ∈ Σ_V.
Neighborhood-based filtering: compute the candidate set C(u) for each u in q.
[Figure: example vertices u ∈ q with N(u, q) = (2, 1, 0, 2) and v ∈ G with N(v, G) = (1, 2, 2, 0).]
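The neighborhood-based filter can be sketched with label multisets: a data vertex v is a candidate for query vertex u if it carries u's label and its neighborhood aggregate falls short of N(u, q) by at most θ labels. The exact tolerance rule here (total label deficit ≤ θ) is my assumption for illustration; the paper's precise condition may differ.

```python
from collections import Counter

def neighborhood_aggregate(v, adj, labels):
    """N(v, g): multiset of labels of v's neighbours, as a Counter."""
    return Counter(labels[w] for w in adj[v])

def candidate_set(u, q_adj, q_labels, g_adj, g_labels, theta):
    """C(u): data vertices with u's label whose neighbourhood aggregate
    is short of N(u, q) by at most theta (hedged sketch of the filter)."""
    n_u = neighborhood_aggregate(u, q_adj, q_labels)
    cands = set()
    for v in g_labels:
        if g_labels[v] != q_labels[u]:
            continue
        n_v = neighborhood_aggregate(v, g_adj, g_labels)
        deficit = sum(max(0, n_u[l] - n_v[l]) for l in n_u)
        if deficit <= theta:
            cands.add(v)
    return cands
```

E.g., a query vertex labeled A with two B-neighbors admits a data vertex with only one B-neighbor when θ = 1, but not when θ = 0.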
Our Approach — Filtering & Ordering (II)
QI search ordering [VLDB'08, Shang et al.]:
- Pick head vertex: the vertex u in q with minimum φ(u) (i.e., the number of vertices in G with label l(u)).
- Pick next spanning edge: the edge (u1, u2) with minimum φ(u1, u2) (i.e., the number of edges in G with label pair (l(u1), l(u2))), where u1 is incident on a previously picked spanning edge.
Filtering-based search ordering:
- Pick head vertex: the vertex u in q with the minimum number of candidates (i.e., |C(u)|).
- Pick next spanning edge: the edge (u1, u2) minimizing |C(u2)| × φ(u1, u2)/φ(u2), where u1 is incident on a previously picked spanning edge.
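The filtering-based rule amounts to a scoring function over frontier edges. A minimal sketch, assuming the candidate sets and the φ statistics are precomputed and passed in (all names are illustrative):

```python
def pick_next_edge(frontier_edges, cand, phi_edge, phi_vertex):
    """Filtering-based ordering: among edges (u1, u2) whose endpoint u1 is
    already covered, pick the one minimising |C(u2)| * phi(u1, u2) / phi(u2),
    a rough estimate of the number of surviving extensions."""
    def score(edge):
        u1, u2 = edge
        return len(cand[u2]) * phi_edge[edge] / phi_vertex[u2]
    return min(frontier_edges, key=score)
```

Intuitively, φ(u1, u2)/φ(u2) estimates how often a vertex with u2's label participates in a matching edge, so multiplying by |C(u2)| favors the edge expected to leave the fewest candidate extensions.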
Experiments — Experimental Settings
Data graphs:
- G_H: HPRD network (|V(G_H)| = 9,460, |E(G_H)| = 37,081).
- G_S: default synthetic data graph; other synthetic data graphs generated by varying the data graph settings.
Query graphs: randomly selected subgraphs of the corresponding data graphs.
Parameter settings (default settings in bold on the original slide):
- |V(G)|: 5k, 10k, 20k, 40k, 80k
- avg. deg(G): 4, 8, 12, 16, 20
- |Σ_V|: 20, 50, 100, 200
- |V(q)|: 20, 40, 60, 80, 100
- avg. deg(q): 3, 4, 5, 6
- θ: 1, 2, 3, 4
Experiments — # of Exact All-Matching Tests
- |Q_SAPPER|: # of exact all-matching tests by SAPPER [VLDB'10].
- |Q_T|: # of exact all-matching tests by the EnumerateAll paradigm.
- TSpan: # of exact all-matching tests by the EnumerateOnDemand paradigm.
Experiments — Total Processing Time
Similarity all-matching:
- SAPPER: generate all similarity matches.
- TSpan+: run TSpan first, then generate all similarity matches from the similarity maximal matches.
Similarity maximal all-matching:
- NaïveTSpan: similarity maximal all-matching with no computation sharing.
- TSpan: similarity maximal all-matching with computation sharing.
Experiments — Total Processing Time
Enumeration paradigms:
- PrecTSpan: similarity maximal all-matching by EnumerateAll.
- TSpan: similarity maximal all-matching by EnumerateOnDemand.
Filtering & ordering:
- TSpanQI: the TSpan algorithm with QI search ordering.
- TSpanNF: the TSpan algorithm with no filtering technique.
Experiments — Large-scale Data Graphs
TSpan on large-scale datasets.
Conclusions
- Tree-based spanning search paradigm
- EnumerateOnDemand strategy
- Filtering-based search ordering
Comparison (SAPPER vs. TSpan):
- # of all-matching tests: TSpan needs significantly fewer.
- Each all-matching test: graph-to-graph (SAPPER) vs. tree-to-graph (TSpan).
- Computation sharing: no (SAPPER) vs. yes (TSpan).
- Similarity results: non-maximal (SAPPER) vs. maximal (TSpan).
Thank You! Any Questions?