Parallel Subgraph Listing in a Large-Scale Graph
Yingxia Shao, Bin Cui, Lei Chen, Lin Ma, Junjie Yao, Ning Xu
School of EECS, Peking University
Hong Kong University of Science and Technology
Outline
o Subgraph listing operation
o Related work
o PSgL framework
o Evaluation
o Conclusion
Introduction - Motivation
o Motif detection in bioinformatics
o Cascades counting in RN
o Triangle counting in SN
Problem Definition
Subgraph listing operation
o Input: a pattern graph and a data graph (both undirected)
o Output: all occurrences of the pattern graph in the data graph
Goal of our work
o Efficiently list subgraphs in a large-scale graph
[Figure: pattern graph and data graph]
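To make the input and output concrete, here is a naive single-machine sketch that tries every injective mapping of pattern vertices onto data vertices. It only illustrates the problem statement and is not how PSgL works; the function name `list_subgraphs` and the edge-list representation are assumptions made for this example.

```python
from itertools import permutations

def list_subgraphs(pattern_edges, data_edges):
    """Naive listing: test every injective mapping of pattern vertices to data vertices."""
    p_vertices = sorted({v for e in pattern_edges for v in e})
    d_vertices = sorted({v for e in data_edges for v in e})
    d_edge_set = {frozenset(e) for e in data_edges}  # undirected edges
    matches = []
    for image in permutations(d_vertices, len(p_vertices)):
        mapping = dict(zip(p_vertices, image))
        # Keep the mapping only if every pattern edge lands on a data edge.
        if all(frozenset((mapping[u], mapping[v])) in d_edge_set
               for u, v in pattern_edges):
            matches.append(mapping)
    return matches

triangle = [(1, 2), (2, 3), (1, 3)]
data = [(1, 2), (2, 3), (1, 3), (3, 4)]
matches = list_subgraphs(triangle, data)
# Distinct occurrences: mappings that cover the same data edges are the same subgraph.
occurrences = {
    frozenset(frozenset((m[u], m[v])) for u, v in triangle) for m in matches
}
```

The single triangle in this data graph is found by several mappings, one per automorphism of the pattern, which is why duplicates have to be removed or avoided.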
Related Work
Centralized algorithms
o Enumerate occurrences one by one [Chiba ’85, Wernicke ’06, Grochow ’07]
Streaming algorithms
o Counting only, and the results are approximate [Buriol ’06, Bordino ’08, Zhao ’10]
MapReduce-based parallel algorithms
o Decompose pattern graph + explicit join operation [Afrati ’13]
o Fixed exploration plan + implicit join operation [Plantenga ’13]
Other efficient algorithms for specific pattern graphs
o Triangle [Suri ’11, Chu ’11, Hu ’13]
Drawbacks in existing parallel solutions
o MapReduce is not well suited to graph processing.
o The join operation is expensive.
o They do not balance the data distribution.
[Figure: data graph and intermediate results]
The novel PSgL framework lists subgraphs via graph traversal on the native graph stored in memory.
Contributions
o We propose an efficient parallel subgraph listing framework, PSgL.
o We introduce a cost model for subgraph listing in PSgL.
o We propose a simple but effective workload-aware distribution strategy, which helps PSgL achieve good workload balance.
o We design three independent mechanisms to reduce the size of intermediate results.
Preliminaries - Partial subgraph instance
[Figure: partial subgraph instances {?,?,?,?}, {2,3,4,5}, {1,5,6,?}]
Independence Property
PSgL: Parallel Subgraph Listing Framework
PSgL - Vertex program
Algorithm of Expanding a Gpsi - II
Main logic
o Change one GRAY vertex to BLACK.
o Validate the expanding vertex’s GRAY neighbors.
o Make the expanding vertex’s WHITE neighbors GRAY.
Two observations
o In each expansion, at least one pattern vertex is processed.
o All GRAY vertices are valid candidates for the next expansion.
[Figure: example of an expanding vertex]
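The three steps of the main logic can be sketched on a single machine as follows. This is a minimal illustration, assuming a connected pattern graph, adjacency stored as dicts of sets, and a Gpsi represented as a vertex mapping plus one color per pattern vertex. The names `expand` and `list_instances` are hypothetical, and the distribution of Gpsi across workers is left out.

```python
from itertools import permutations

WHITE, GRAY, BLACK = "WHITE", "GRAY", "BLACK"

def expand(pattern, data, mapping, color, vp):
    """Expand the GRAY pattern vertex vp of one Gpsi and yield the resulting Gpsi."""
    vd = mapping[vp]                      # data vertex that vp is mapped to
    color = dict(color)
    color[vp] = BLACK                     # step 1: GRAY -> BLACK
    for u in pattern[vp]:                 # step 2: validate GRAY neighbors
        if color[u] == GRAY and mapping[u] not in data[vd]:
            return                        # required edge is missing: drop this Gpsi
    white = [u for u in pattern[vp] if color[u] == WHITE]
    used = set(mapping.values())
    candidates = [d for d in data[vd] if d not in used]
    # Step 3: map WHITE neighbors onto unused neighbors of vd and turn them GRAY.
    for image in permutations(candidates, len(white)):
        new_mapping, new_color = dict(mapping), dict(color)
        for u, d in zip(white, image):
            new_mapping[u] = d
            new_color[u] = GRAY
        yield new_mapping, new_color

def list_instances(pattern, data, start):
    """Expand Gpsi until no GRAY vertex is left (pattern assumed connected)."""
    results = []
    frontier = [({start: d}, {**{v: WHITE for v in pattern}, start: GRAY})
                for d in data]
    while frontier:
        mapping, color = frontier.pop()
        gray = [v for v in pattern if color[v] == GRAY]
        if not gray:                      # every pattern vertex is BLACK
            results.append(mapping)
            continue
        frontier.extend(expand(pattern, data, mapping, color, gray[0]))
    return results

triangle = {1: {2, 3}, 2: {1, 3}, 3: {1, 2}}
data = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3}}
results = list_instances(triangle, data, start=1)
```

Each call to `expand` turns exactly one pattern vertex BLACK, so a Gpsi is complete after at most as many expansions as the pattern has vertices. Without automorphism breaking, the same triangle is reported once per automorphism.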
Analysis - Efficiency of PSgL
Total cost, expressed in terms of:
o # of iterations
o # of workers
o # of Gpsi processed by worker k
o cost of processing a Gpsi
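One way to combine the quantities labelled on this slide is shown below, assuming iterations are synchronized so that each iteration lasts as long as its most loaded worker. The symbols $S$ (number of iterations), $K$ (number of workers), $n_{sk}$ (number of Gpsi processed by worker $k$ in iteration $s$) and $c(\cdot)$ (cost of processing one Gpsi) are names chosen for this sketch, not taken from the slide.

```latex
T \;=\; \sum_{s=1}^{S} \; \max_{1 \le k \le K} \; \sum_{i=1}^{n_{sk}} c\bigl(G_{psi}^{(s,k,i)}\bigr)
```

Under this reading, the total cost drops when either the number of Gpsi shrinks or the per-iteration maximum over workers is brought closer to the average, which is what workload balancing aims at.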
Optimization - Workload balance - I
Workload balance - II
Workload-aware distribution strategy
o A general greedy-based heuristic rule.

α       | Description                                            | Drawbacks
1       |                                                        | local optimum
0       |                                                        | imbalance
0.5 (*) | Making a trade-off between local optimum and imbalance | -

All three strategies have the same worst-case bound, K*|OPT|. In practice, however, α = 0.5 performs best.
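A greedy rule of this kind can be sketched as follows. The scoring function `alpha * loads[k] + costs[k]` is an assumption chosen to reproduce the behaviour in the table (α = 0 ignores the current load, α = 1 minimizes the resulting load); the slide does not give the exact rule, and the names `choose_worker` and `distribute` are hypothetical.

```python
def choose_worker(loads, costs, alpha):
    """Pick the worker k with the smallest alpha * loads[k] + costs[k] (assumed rule)."""
    return min(range(len(loads)), key=lambda k: alpha * loads[k] + costs[k])

def distribute(items, num_workers, alpha):
    """Assign each Gpsi greedily, in arrival order, and track each worker's load."""
    loads = [0.0] * num_workers
    assignment = []
    for costs in items:  # costs[k]: estimated cost of this Gpsi on worker k
        k = choose_worker(loads, costs, alpha)
        loads[k] += costs[k]
        assignment.append(k)
    return assignment, loads

# Six Gpsi that are always slightly cheaper on worker 0.
items = [[1.0, 1.2]] * 6
_, loads_alpha0 = distribute(items, 2, alpha=0.0)   # looks at cost only
_, loads_alpha1 = distribute(items, 2, alpha=1.0)   # looks at resulting load
_, loads_half = distribute(items, 2, alpha=0.5)     # trade-off between the two
```

In this toy input, α = 0 sends everything to worker 0, while α = 1 and α = 0.5 alternate between the two workers.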
Comparison among various approaches
[Figure: compared approaches include Random and Roulette]
Partial subgraph instance reduction - I
Pattern graph automorphism breaking
o Use DFS to find the equivalent vertex groups.
o Assign a partial order to each equivalent vertex group.
Initial pattern vertex selection
o Introduce a cost model.
o General pattern graph: enumerate all possible selections based on the cost model.
o Cycle and clique: the vertex with the lowest rank is the best one.
[Figure: Initial Pattern Vertex Selection based on cost model; stages: Automorphism Breaking, Cost Model, Best Initial Pattern Vertex]
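The effect of automorphism breaking can be seen on the triangle, where all three pattern vertices are equivalent. The sketch below assumes integer vertex ids serve as the rank and imposes the order v1 < v2 < v3 on the mapped data vertices; the function name `triangle_mappings` is illustrative only.

```python
from itertools import permutations

def triangle_mappings(data, ordered):
    """List mappings (a, b, c) of the triangle's vertices (v1, v2, v3) into data."""
    found = []
    for a, b, c in permutations(sorted(data), 3):
        if b in data[a] and c in data[b] and a in data[c]:
            # With the partial order enabled, keep one representative per triangle.
            if not ordered or a < b < c:
                found.append((a, b, c))
    return found

# Complete graph on 4 vertices: it contains 4 distinct triangles.
data = {v: {0, 1, 2, 3} - {v} for v in range(4)}
without_order = triangle_mappings(data, ordered=False)
with_order = triangle_mappings(data, ordered=True)
```

Without the order, each triangle is reported once per automorphism of the pattern; with it, each triangle is reported once.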
Partial subgraph instance reduction - II

Data Graph  | PG           | Gpsi # w/ index | Gpsi # w/o index | Pruning Ratio
LiveJournal | PG1 (v1)     | 2.86 x [...]    | [...]            | [...] %
LiveJournal | PG4 (v1)     | 9.93 x 10^9     | OOM              | unknown
UsPatent    | PG5 (v1)     | 2.26 x [...]    | [...]            | [...] %
UsPatent    | PG5 (v3; v4) | 7.38 x [...]    | [...]            | [...] %

[Figure: pattern graphs PG1, PG4, PG5]
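The pruning effect of an edge index can be illustrated with a small sketch. It assumes the index is simply a set that answers edge-existence queries and that two neighbors of the expanding vertex must themselves be adjacent in the pattern. How the index is actually built is not stated on the slide, and `candidate_pairs` is a hypothetical helper.

```python
from itertools import permutations

def candidate_pairs(data, center, edge_index=None):
    """Map two mutually adjacent pattern vertices onto neighbors of `center`."""
    pairs = []
    for a, b in permutations(sorted(data[center]), 2):
        # With an index, drop the pair now if edge (a, b) does not exist;
        # without one, the check is deferred and the Gpsi is generated anyway.
        if edge_index is not None and frozenset((a, b)) not in edge_index:
            continue
        pairs.append((a, b))
    return pairs

# Vertex 0 has four neighbors, but only 1 and 2 are connected to each other.
data = {0: {1, 2, 3, 4}, 1: {0, 2}, 2: {0, 1}, 3: {0}, 4: {0}}
edge_index = {frozenset((u, v)) for u in data for v in data[u]}
without_index = candidate_pairs(data, 0)
with_index = candidate_pairs(data, 0, edge_index)
```

Here the index removes all candidate pairs except the two orientations of the edge between vertices 1 and 2, so far fewer Gpsi are generated.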
Evaluation - Comparing to MR solutions
o Afrati and SGIA-MR are the state-of-the-art MapReduce solutions.
o Ratios exceeding 100 times are not shown.
[Figure annotations: PSgL: 4302s; Afrati: 7291s]
Evaluation - Comparing to GraphLab

Data Graph  | Pattern Graph | Afrati | PowerGraph | PSgL
Twitter     |               | 432min | 2min       | 12.5min
Wikipedia   |               | 871s   | 36s        | 125s
WikiTalk    |               | 4402s  | 48s        | 318s
WikiTalk    |               | 13743s | 100s       | 494s
WikiTalk    |               | 13743s | OOM*       | 494s
WikiTalk    |               | 1785s  | 127s       | 38s
LiveJournal |               | 2749s  | OOM        | 1330s

* using a different traversal order.
Conclusion
o Subgraph listing is a fundamental operation for massive graph analysis.
o We propose an efficient parallel subgraph listing framework, PSgL.
  o Various distribution strategies
  o Cost model
  o Light-weight global edge index
o The workload-aware distribution strategy can be extended to other balance problems.
o A new execution engine is required for larger pattern graphs.
Thanks!
Backup Expr. – Scalability of PSgL
[Figure: Performance vs. Worker Number]
Backup Expr. – Initial pattern vertex selection
[Figure: Influences of the Initial Pattern Vertex on Various Data Graphs; panels: LiveJournal, Random graph]