Computer Science and Engineering TreeSpan Efficiently Computing Similarity All-Matching Gaoping Zhu #, Xuemin Lin #, Ke Zhu #, Wenjie Zhang #, Jeffrey.

Presentation transcript:

Computer Science and Engineering
TreeSpan: Efficiently Computing Similarity All-Matching
Gaoping Zhu #, Xuemin Lin #, Ke Zhu #, Wenjie Zhang #, Jeffrey Xu Yu †
# The University of New South Wales
† The Chinese University of Hong Kong

Outline
• Introduction
• State-of-the-Art
• Our Approach
• Experiments
• Conclusions

Introduction: Graph Data
• Chem-informatics
  • Chemical Compounds (small size)
• Bio-informatics
  • PPI Networks (medium size)
• Internet
  • World Wide Web (large size)

Introduction: Exact All-Matching (I)
• Exact All-Matching
  • Enumerate all exact (i.e., isomorphic) matches of a query graph q in a data graph G.
• Applications
  • Query biological patterns in PPI networks.
  • Detect suspicious bugs in software programs.
[Figure: a query graph q and a data graph G over vertex labels A, B, C, D, with the exact matches of q in G highlighted]
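The definition of exact all-matching can be illustrated with a small backtracking enumerator. This is a minimal sketch, not the algorithm from the talk: it assumes a match is an injective, label-preserving vertex mapping that sends every query edge onto a data edge, and the function name `exact_all_matching` and the toy graphs are hypothetical.

```python
def exact_all_matching(q_labels, q_edges, g_labels, g_edges):
    """Return every injective, label-preserving mapping of q's vertices
    into G that maps each query edge onto a data edge."""
    g_edge_set = {frozenset(e) for e in g_edges}
    q_adj = {u: set() for u in q_labels}
    for a, b in q_edges:
        q_adj[a].add(b)
        q_adj[b].add(a)
    order = list(q_labels)  # fixed matching order over the query vertices
    matches = []

    def extend(mapping):
        if len(mapping) == len(order):
            matches.append(dict(mapping))
            return
        u = order[len(mapping)]
        for v in g_labels:
            if g_labels[v] != q_labels[u] or v in mapping.values():
                continue
            # every already-mapped query neighbor of u must be adjacent to v in G
            if all(frozenset((mapping[w], v)) in g_edge_set
                   for w in q_adj[u] if w in mapping):
                mapping[u] = v
                extend(mapping)
                del mapping[u]

    extend({})
    return matches


# Toy example (assumed data): q is a single A-B edge, G is the path A-B-A.
q_labels = {"u1": "A", "u2": "B"}
q_edges = [("u1", "u2")]
g_labels = {"v1": "A", "v2": "B", "v3": "A"}
g_edges = [("v1", "v2"), ("v2", "v3")]
matches = exact_all_matching(q_labels, q_edges, g_labels, g_edges)
```

Here the query edge can be placed on either data edge, so two matches are returned. If one data edge is removed to simulate noise, the enumerator still only reports matches that preserve every query edge, which is the limitation the next slide addresses.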

Introduction: Exact All-Matching (II)
• Dilemma of Exact All-Matching
  • If q is issued by a user for exploratory purposes …
  • If G is noisy due to imprecise data collection …
  • … then no exact matches can be found!
• Potential Solutions
  • Modify q or G and run exact all-matching again and again.
  • Ask the system to return approximate results (i.e., similarity all-matching).
[Figure: a data graph G and a modified query q' for which no exact match exists]

SAPPER [VLDB’10, Zhang et al.] (I)
• Similarity All-Matching
  • Given a query graph q, a data graph G and a similarity threshold θ, enumerate all similarity matches of q in G (i.e., all connected subgraphs of G that miss at most θ edges of q).
• Framework
  • Enumerate a set of seeds Q_SAPPER (i.e., all connected subgraphs q’ of q missing θ edges of q).
  • Run exact all-matching on each seed q’ to obtain its exact matches.
  • Induce similarity matches from the exact matches of the seeds.
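The seed-enumeration step of this framework can be sketched as follows. This is an illustrative reading, not SAPPER's actual implementation: it assumes "missing θ edges" means exactly θ deleted edges with all query vertices kept, and the names `is_connected` and `sapper_style_seeds` are hypothetical.

```python
from itertools import combinations


def is_connected(vertices, edges):
    """Depth-first check that `edges` connect all of `vertices`."""
    vertices = set(vertices)
    adj = {v: set() for v in vertices}
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    start = next(iter(vertices))
    seen, stack = {start}, [start]
    while stack:
        x = stack.pop()
        for y in adj[x] - seen:
            seen.add(y)
            stack.append(y)
    return seen == vertices


def sapper_style_seeds(vertices, edges, theta):
    """Every connected subgraph of q obtained by deleting exactly theta edges."""
    seeds = []
    for removed in combinations(edges, theta):
        kept = [e for e in edges if e not in removed]
        if is_connected(vertices, kept):
            seeds.append(kept)
    return seeds


# Assumed toy queries: a 4-cycle and the complete graph on 4 vertices.
cycle = [(0, 1), (1, 2), (2, 3), (3, 0)]
k4 = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)]
cycle_seeds = sapper_style_seeds(range(4), cycle, 1)
k4_seeds = sapper_style_seeds(range(4), k4, 2)
```

Each seed costs one exact all-matching test, so the number of seeds drives the cost: the 4-cycle with θ = 1 yields 4 seeds, while the complete 4-vertex query with θ = 2 already yields 15.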

SAPPER [VLDB’10, Zhang et al.] (II)
• Cost Model
  • |Q_SAPPER| = # of exact all-matching tests
[Figure: a data graph G with vertices v_1, …, v_5; a query q (θ = 1) with vertices u_1, …, u_4; the seeds q'_1, q'_2, q'_3, q'_4; and two matches:
  F_1 = {u_1 → v_1, u_2 → v_2, u_3 → v_3, u_4 → v_4}
  F_2 = {u_1 → v_2, u_2 → v_1, u_3 → v_3, u_4 → v_4}]

Our Approach: Overview (I)
• Tree-based Spanning Search Paradigm: TSpan
  • Enumerate a set of seeds Q_T (i.e., spanning trees of q that cover all connected subgraphs q’ of q missing θ edges of q).
• Primary Contribution
  • Reduce the # of exact all-matching tests (i.e., the # of seeds).
  • Reduce the complexity of each exact all-matching test from graph-to-graph to tree-to-graph.
[Figure: a query q (θ = 2) with example SAPPER seeds; callouts: "more SAPPER seeds", "3 all-matching tests on connected subgraphs of q", "1 all-matching test on a spanning tree of q"]

Our Approach: Overview (II)
• Generating Similarity Maximal Matches
  • Generating only similarity maximal matches can reduce the # of exact all-matching tests.
[Figure: a data graph G with vertices v_1, …, v_5 and a query q (θ = 1) with vertices u_1, …, u_4; the similarity maximal matches are:
  F_1 = {u_1 → v_1, u_2 → v_2, u_3 → v_3, u_4 → v_4}
  F_2 = {u_1 → v_2, u_2 → v_1, u_3 → v_3, u_4 → v_4}]

Our Approach: Problem Statement
• Similarity Maximal All-Matching
  • Given a query graph q, a data graph G and a similarity threshold θ, enumerate all distinct similarity maximal matches of q in G that conform to θ.

Our Approach: Seeding (I)
• PRIM Order on Spanning Trees
  • Similar to the basic idea of a minimum spanning tree.
  • Given a total order on E(q), a spanning tree T = {T[0], T[1], …, T[|V(q)| − 1]} of q conforms to PRIM order (T[0] is the head vertex) if and only if each spanning edge T[i] has the smallest order in E(q) − {T[1], ..., T[i − 1]} and connects to {T[0], T[1], ..., T[i − 1]}.
[Figure: a query q with edges e_1, …, e_6 and its PRIM-order spanning tree T with edges e_1, e_2, e_3]
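The PRIM-order condition can be checked mechanically. The sketch below is one reading of the definition, under these assumptions: "smallest order" is taken among the unused edges that join the already covered vertices to a new vertex, the dictionary order of `edges` is the total order on E(q), and the 4-vertex edge layout is hypothetical (the slide's figure is not recoverable). The name `conforms_prim_order` is also an assumption.

```python
def conforms_prim_order(tree, head, edges):
    """tree: edge names in T[1], T[2], ... order.
    edges: dict name -> (x, y); dict order is the total order on E(q)."""
    rank = {name: i for i, name in enumerate(edges)}
    covered, used = {head}, set()
    for name in tree:
        # unused edges with exactly one endpoint among the covered vertices
        frontier = [e for e, (x, y) in edges.items()
                    if e not in used and ((x in covered) != (y in covered))]
        if not frontier or name != min(frontier, key=rank.get):
            return False
        used.add(name)
        covered.update(edges[name])
    all_vertices = {v for pair in edges.values() for v in pair}
    return covered == all_vertices


# Hypothetical layout: complete graph on vertices 0..3, head vertex 0.
edges = {"e1": (0, 1), "e2": (1, 3), "e3": (2, 3),
         "e4": (0, 2), "e5": (1, 2), "e6": (0, 3)}
ok = conforms_prim_order(["e1", "e2", "e3"], 0, edges)
not_ok = conforms_prim_order(["e1", "e2", "e4"], 0, edges)
```

With this layout, e1 e2 e3 conforms, while e1 e2 e4 does not, because e3 has a smaller order than e4 when the third spanning edge is chosen.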

Our Approach: Seeding (II)
• Avoid Duplicate Results
  • Two spanning trees of q may induce duplicate similarity maximal matches.
  • Associate an edge exclusion set T.R with each T in Q_T.
  • T.R is a set of edges in E(q) − E(T) that are forced to be mismatched in the similarity maximal matches induced by T.
[Figure: a query q (θ = 2), a data graph G, and two spanning trees T_1 and T_2 of q, with T_1.R = ∅ and T_2.R = {(A, D)}]

Our Approach: Seeding (III)
• Q_T Enumeration Algorithm (query q with edges e_1, …, e_6, θ = 2)

   #   T           T.R
   1   e1 e2 e3    { }
   2   e1 e2 e4    {e3}
   3   e1 e2 e5    {e3, e4}
   4   e1 e4 e3    {e2}
   5   e1 e4 e6    {e2, e3}
   6   e1 e5 e3    {e2, e4}
   7   e4 e3 e2    {e1}
   8   e4 e3 e5    {e1, e2}
   9   e4 e5 e2    {e1, e3}
  10   e6 e2 e3    {e1, e4}

[Figure: the enumeration proceeds by "go down" and "alternate-reorder" steps; the edges crossed out when alternating are T_1[1] = e1, T_1[2] = e2, T_1[3] = e3, T_2[3] = e4, T_4[3] = e3, T_4[2] = e4, T_7[3] = e2, T_7[2] = e3, T_7[1] = e4]
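The table of ten spanning trees can be reproduced with a short recursive sketch. This is a reconstruction under stated assumptions, not the authors' algorithm: the edge layout of q (complete graph on vertices 0..3, head vertex 0) is hypothetical and chosen to be consistent with the table, each tree is taken to be the PRIM-order spanning tree of q without the edges in T.R, and a tree created by excluding its parent's edge at position i only alternates at positions i and later. The names `prim_tree` and `enumerate_qt` are hypothetical.

```python
# Assumed layout; dict order e1 < e2 < ... < e6 is the total order on E(q).
EDGES = {"e1": (0, 1), "e2": (1, 3), "e3": (2, 3),
         "e4": (0, 2), "e5": (1, 2), "e6": (0, 3)}
HEAD, N_VERTICES = 0, 4


def prim_tree(excluded):
    """PRIM-order spanning tree of q without the excluded edges (or None)."""
    covered, tree = {HEAD}, []
    while len(covered) < N_VERTICES:
        frontier = [e for e, (x, y) in EDGES.items()
                    if e not in excluded and ((x in covered) != (y in covered))]
        if not frontier:
            return None  # q minus the excluded edges is disconnected
        tree.append(frontier[0])  # first in dict order = smallest edge order
        covered.update(EDGES[frontier[0]])
    return tree


def enumerate_qt(theta, excluded=(), min_pos=0):
    """List (T, T.R) pairs; alternate only at positions >= min_pos."""
    tree = prim_tree(set(excluded))
    if tree is None:
        return []
    result = [(tree, set(excluded))]
    if len(excluded) < theta:
        for i in range(len(tree) - 1, min_pos - 1, -1):
            result += enumerate_qt(theta, excluded + (tree[i],), i)
    return result


qt = enumerate_qt(theta=2)
for number, (tree, r) in enumerate(qt, start=1):
    print(number, " ".join(tree), sorted(r))
```

Under these assumptions the sketch lists the same ten (T, T.R) pairs in the same order as the table. By comparison, deleting exactly two edges from this 6-edge query in every possible way gives 15 connected subgraphs.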

Our Approach: Seeding (IV)
• Q_T Enumeration Algorithm
  • Correctness: Using Q_T to induce similarity maximal matches neither generates duplicate results nor misses valid results.
  • Minimality of Q_T: If any spanning tree in Q_T is omitted, the completeness of the results is no longer guaranteed under the edge exclusion semantics.
  • When |E(q)| = m and |V(q)| = n:
    (1) |Q_SAPPER| ≥ |Q_T|;
    (2) |Q_T| = |Q_SAPPER| only when θ = 0 or m − n

Our Approach: Searching (I)
• Effectively Storing Q_T
  • Use a DFS Traversal Tree to share computation cost.
[Figure: the ten spanning trees of Q_T (e1 e2 e3, e1 e2 e4, e1 e2 e5, e1 e4 e3, e1 e4 e6, e1 e5 e3, e4 e3 e2, e4 e3 e5, e4 e5 e2, e6 e2 e3) merged into one DFS Traversal Tree with root R, so that trees with a common edge prefix share a path]

Our Approach: Searching (II)
• Similarity Maximal All-Matching Algorithm Sketch
  • Traverse the DFS Traversal Tree in a depth-first backtracking search.
  • go-down: Beginning from the initial spanning tree, recursively drill down to extend the current partial match to the next spanning edge T[i] in the current spanning tree T.
  • alternate: If T[i] cannot be extended based on the current partial match and we can still afford to mismatch T[i] while conforming to θ, switch from T to the alternative spanning tree T’ obtained by replacing T[i] with T’[i].

Our Approach: Optimizations
• Optimization (I): EnumerateOnDemand Strategy
  • Motivation: further reduce the number of seeds.
  • Enumerate an alternative tree T’ based on the current tree T only when it is feasible to extend the current partial similarity maximal match, conforming to θ, (1) on the next spanning edge T[i] or (2) on the next spanning edge T’[i].
• Optimization (II): Effective Search Order
  • Motivation: terminate each all-matching test as early as possible.
  • Decide the search order of the spanning edges in T based on the post-filtering candidate sets of each vertex in q.

Our Approach: Filtering & Ordering (I)
• Neighborhood Aggregate N(v, g)
  • Given a set of labels Σ_V = {L_1, ..., L_m}, N(v, g) = (x_1, ..., x_m), where x_i is the number of neighbors of v in g with label L_i ∈ Σ_V.
• Neighborhood-based Filtering
  • Compute the candidate set C(u) for each u in q.
[Figure: a query vertex u ∈ q and a data vertex v ∈ G with their labeled neighbors (labels A, B, C, D); N(u, q) = (2, 1, 0, 2), N(v, G) = (1, 2, 2, 0)]
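The neighborhood aggregate is a per-label neighbor count, which the sketch below computes for the two vectors shown on the slide. The slide does not state the pruning rule itself, so `passes_filter` is an assumption: it keeps v as a candidate for u when the total shortfall of v's counts against u's counts is at most θ. The toy graphs and all function names are hypothetical, and a full filter would also compare the labels of u and v themselves.

```python
from collections import Counter


def neighborhood_aggregate(v, adj, labels, label_order):
    """N(v, g): number of neighbors of v per label, in label_order."""
    counts = Counter(labels[w] for w in adj[v])
    return tuple(counts[label] for label in label_order)


def passes_filter(n_u, n_v, theta):
    """Assumed rule: total per-label shortfall of v against u is at most theta."""
    deficit = sum(max(0, xu - xv) for xu, xv in zip(n_u, n_v))
    return deficit <= theta


label_order = ("A", "B", "C", "D")

# Query side: u has neighbors labeled A, A, B, D, D.
q_labels = {"u": "A", "a1": "A", "a2": "A", "b1": "B", "d1": "D", "d2": "D"}
q_adj = {"u": ["a1", "a2", "b1", "d1", "d2"]}
n_u = neighborhood_aggregate("u", q_adj, q_labels, label_order)

# Data side: v has neighbors labeled A, B, B, C, C.
g_labels = {"v": "A", "x1": "A", "x2": "B", "x3": "B", "x4": "C", "x5": "C"}
g_adj = {"v": ["x1", "x2", "x3", "x4", "x5"]}
n_v = neighborhood_aggregate("v", g_adj, g_labels, label_order)
```

The two aggregates are (2, 1, 0, 2) and (1, 2, 2, 0), so v is short by one A-neighbor and two D-neighbors. Under the assumed rule, v stays a candidate for u only when θ is at least 3.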

Our Approach: Filtering & Ordering (II)
• QI Search Ordering [VLDB’08, Shang et al.]
  • Pick Head Vertex: the vertex u in q with minimum φ(u) (i.e., the number of vertices in G with label l(u)).
  • Pick Next Spanning Edge: the edge (u_1, u_2) with minimum φ(u_1, u_2) (i.e., the number of edges in G with labels (l(u_1), l(u_2))), where u_1 is a vertex incident on the previously picked spanning edges.
• Filtering-based Search Ordering
  • Pick Head Vertex: the vertex u in q with the minimum number of candidates (i.e., |C(u)|).
  • Pick Next Spanning Edge: the edge (u_1, u_2) minimizing |C(u_2)| × φ(u_1, u_2) / φ(u_2), where u_1 is a vertex incident on the previously picked spanning edges.
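The filtering-based ordering can be sketched as a greedy loop over the scoring formula above. This is an illustration only: the function name `filtering_based_order`, the candidate-set sizes and the φ statistics are made-up inputs, and ties are broken by edge list order, which the slide does not specify.

```python
def filtering_based_order(q_labels, q_edges, cand_size, phi_v, phi_e):
    """Return (head vertex, spanning edges in search order).
    cand_size[u] = |C(u)|; phi_v[label] and phi_e[frozenset of labels]
    are label and label-pair frequencies in the data graph."""
    head = min(q_labels, key=lambda u: cand_size[u])
    covered, order = {head}, []
    while len(covered) < len(q_labels):
        best = None
        for a, b in q_edges:
            for u1, u2 in ((a, b), (b, a)):
                if u1 in covered and u2 not in covered:
                    pair = frozenset((q_labels[u1], q_labels[u2]))
                    score = cand_size[u2] * phi_e[pair] / phi_v[q_labels[u2]]
                    if best is None or score < best[0]:
                        best = (score, u1, u2)
        if best is None:
            raise ValueError("query graph is not connected")
        order.append((best[1], best[2]))
        covered.add(best[2])
    return head, order


# Assumed toy statistics for a triangle query with labels A, B, C.
q_labels = {0: "A", 1: "B", 2: "C"}
q_edges = [(0, 1), (1, 2), (0, 2)]
cand_size = {0: 5, 1: 2, 2: 4}
phi_v = {"A": 10, "B": 4, "C": 8}
phi_e = {frozenset("AB"): 6, frozenset("BC"): 8, frozenset("AC"): 20}
head, order = filtering_based_order(q_labels, q_edges, cand_size, phi_v, phi_e)
```

Vertex 1 becomes the head because it has the fewest candidates. The edge to vertex 0 scores 5 × 6 / 10 = 3 and is picked before the edge to vertex 2, which scores 4 × 8 / 8 = 4.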

Experiments: Experimental Settings
• Data Graphs
  • G_H: HPRD network (|V(G_H)| = 9,460, |E(G_H)| = 37,081).
  • G_S: default synthetic data graph.
  • Other synthetic data graphs generated by varying the data graph settings.
• Query Graphs
  • Randomly selected subgraphs of the corresponding data graphs.
• Parameter Settings (default settings in bold on the original slide)

  Parameter      Values
  |V(G)|         5k, 10k, 20k, 40k, 80k
  avg. deg(G)    4, 8, 12, 16, 20
  |Σ_V|          20, 50, 100, 200
  |V(q)|         20, 40, 60, 80, 100
  avg. deg(q)    3, 4, 5, 6
  θ              1, 2, 3, 4

Experiments: # of exact all-matching tests
• |Q_SAPPER|: # of exact all-matching tests by SAPPER [VLDB’10].
• |Q_T|: # of exact all-matching tests by the EnumerateAll paradigm.
• TSpan: # of exact all-matching tests by the EnumerateOnDemand paradigm.

Experiments: Total Processing Time
• Similarity All-Matching
  • SAPPER: Generate all similarity matches.
  • TSpan+: Run TSpan first and then generate all similarity matches based on the similarity maximal matches.
• Similarity Maximal All-Matching
  • NaïveTSpan: Similarity maximal all-matching with no computation sharing.
  • TSpan: Similarity maximal all-matching with computation sharing.

Experiments: Total Processing Time
• Enumeration Paradigms
  • PrecTSpan: Similarity maximal all-matching by EnumerateAll.
  • TSpan: Similarity maximal all-matching by EnumerateOnDemand.
• Filtering & Ordering
  • TSpanQI: TSpan algorithm with QI search ordering.
  • TSpanNF: TSpan algorithm with no filtering technique.

Experiments: Large-scale Data Graphs
• TSpan on Large-scale Datasets

Conclusions
• Tree-based Spanning Search Paradigm
• EnumerateOnDemand Strategy
• Filtering-based Search Ordering

                            SAPPER            TSpan
  # of all-matching tests                     significantly less
  each all-matching test    graph to graph    tree to graph
  computation-sharing       no                yes
  similarity results        non-maximal       maximal

Thank You! Any Questions?