If you were plowing a field, which would you rather use? Two oxen, or 1024 chickens? (Attributed to S. Cray)
Our ‘field’ to plow: graph processing (|V| = 1.4B, |E| = 6.6B)
Abdullah Gharaibeh, Lauro Beltrão Costa, Elizeu Santos-Neto, Matei Ripeanu
NetSysLab, The University of British Columbia
http://netsyslab.ece.ubc.ca
Graph Processing: The Challenges
- Data-dependent memory access patterns and poor locality
- Low compute-to-memory access ratio
- Large memory footprint: >128GB
- Varying degrees of parallelism (both intra- and inter-stage)
On CPUs, caches + summary data structures help cope with these characteristics.
Graph Processing: The GPU Opportunity
The same challenges (data-dependent memory access patterns, poor locality, low compute-to-memory access ratio, large memory footprint, varying degrees of intra- and inter-stage parallelism) meet a different resource mix on GPUs:
- Massive hardware multithreading
- Caches + summary data structures
- But only 6GB of memory (vs. the >128GB footprint that fits on the CPU side)
=> Assemble a heterogeneous platform.
Motivating Question
Can we efficiently use hybrid systems for large-scale graph processing?
YES WE CAN! 2x speedup (on a graph with 8 billion edges).
Methodology
- Performance model: predicts speedup; intuitive
- Totem: a graph processing engine for hybrid systems; applies algorithm-agnostic optimizations
- Evaluation: predicted vs. achieved; hybrid vs. symmetric
The Performance Model (I)
Goal: predict the speedup obtained from offloading part of the graph to the GPU (when compared to processing only on the host).
The slide's figure defines the model parameters α, β, r_cpu, c, and the host-GPU bus bandwidth b; example values appear on the next slide.
The Performance Model (II)
Example values:
- β = 20%: worst case (e.g., a bipartite graph)
- r_cpu = 0.5 BEPS: best reported single-node BFS performance [Agarwal, V. 2010] (|V| = 32M, |E| = 1B)
- c = 1 billion EPS: assuming a PCI-E bus with b ≈ 4 GB/sec and per-edge state m = 4 bytes
It is beneficial to process the graph on a hybrid system if the communication overhead is low.
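The closed-form expression of the model is not reproduced in this transcript. The sketch below assumes a simple form in which the host processes the edges that stay behind and the boundary edges must cross the bus, i.e. speedup ≈ 1 / (1 - α + β · r_cpu / c); the parameter names follow the slide, but the formula itself and the assumption that the GPU is never the bottleneck are mine.

```python
def predicted_speedup(alpha, beta, r_cpu, c):
    """Hedged sketch of the offloading speedup model (assumed form).

    Assumed, not taken verbatim from the slides:
      t_host   = |E| / r_cpu                 # host-only processing time
      t_hybrid = (1 - alpha) * |E| / r_cpu   # host processes what stays
               + beta * |E| / c              # boundary edges cross the bus
      speedup  = t_host / t_hybrid
    |E| cancels out, so only the fractions and rates matter. The GPU's
    own processing time is assumed to be hidden (never the bottleneck).
    """
    return 1.0 / ((1.0 - alpha) + beta * (r_cpu / c))

# Slide values: beta = 20% (worst case), r_cpu = 0.5 BEPS, c = 1 BEPS.
for alpha in (0.25, 0.50, 0.75):
    print(alpha, round(predicted_speedup(alpha, 0.20, 0.5e9, 1.0e9), 2))
```

Even under the worst-case β, low communication cost (c ≈ 2 · r_cpu here) keeps the offloading term small, which matches the slide's takeaway.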
Totem: Programming Model
- Bulk Synchronous Parallel (BSP): rounds of computation and communication phases
- Updates to remote vertices are delivered in the next round
- Partitions vote to terminate execution
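To make the round structure concrete, here is a minimal, hypothetical BSP driver loop in Python; the method names (compute, exchange_messages) and the partition objects are illustrative, not Totem's actual API.

```python
def bsp_engine(partitions):
    """Hypothetical BSP driver loop matching the slide's description.

    Each round: every partition computes on its local state, buffered
    updates are exchanged, and remote updates become visible only in
    the next round. Execution stops when all partitions vote to halt.
    """
    round_no = 0
    while True:
        # Computation phase: each partition updates its own vertices and
        # returns True if it votes to terminate.
        votes = [p.compute(round_no) for p in partitions]

        # Communication phase: deliver buffered updates for the next round.
        for p in partitions:
            p.exchange_messages()

        if all(votes):  # every partition voted to terminate
            break
        round_no += 1
```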
Totem: A BSP-based Engine
- Graph partitions stored in compressed sparse row (CSR) representation
- Computation: the kernel manipulates local state; updates to remote vertices are aggregated locally in an outbox buffer
- Comm1: transfer the outbox buffer to the remote partition's input buffer
- Comm2: merge the received updates with local state
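As a rough illustration only: the snippet below sketches a CSR partition whose outbox aggregates updates destined for remote vertices before they are shipped in Comm1. The field names and the negative-id convention for remote vertices are invented for this example and do not mirror Totem's data structures.

```python
from dataclasses import dataclass, field

@dataclass
class CsrPartition:
    """Toy CSR partition.

    Neighbours of local vertex v live in col_indices[row_offsets[v]:
    row_offsets[v+1]]. Entries referring to vertices owned by another
    partition are treated as "remote" (how they are encoded is left
    abstract here).
    """
    row_offsets: list
    col_indices: list
    state: list                                   # per-local-vertex algorithm state
    outbox: dict = field(default_factory=dict)    # remote vertex -> aggregated update

    def send(self, remote_v, value, combine=min):
        """Aggregate locally instead of emitting one message per edge.

        combine=min suits BFS-style levels; a PageRank-style kernel
        would pass an additive combiner instead.
        """
        if remote_v in self.outbox:
            self.outbox[remote_v] = combine(self.outbox[remote_v], value)
        else:
            self.outbox[remote_v] = value
```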
The Aggregation Opportunity
- Real-world graphs are mostly scale-free: skewed degree distribution
- On a sparse random graph (|E| = 512 million): ~5x reduction in communication from aggregation
- A denser graph offers a better opportunity for aggregation: ~50x reduction
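A quick way to quantify the opportunity: the reduction from aggregation is simply (number of boundary edges) / (number of distinct remote vertices they point to). The check below is illustrative and uses a made-up edge list, not the graphs from the slide.

```python
def aggregation_reduction(boundary_edges):
    """boundary_edges: iterable of (src_local_vertex, dst_remote_vertex).

    Without aggregation each boundary edge sends one message; with
    aggregation only one message per distinct remote destination
    crosses the bus.
    """
    edges = list(boundary_edges)
    distinct_targets = {dst for _, dst in edges}
    return len(edges) / max(1, len(distinct_targets))

# Skewed (scale-free-like) toy example: many edges hit the same hub vertex.
print(aggregation_reduction([(0, 9), (1, 9), (2, 9), (3, 9), (4, 7)]))  # 2.5
```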
Evaluation Setup
- Workload: R-MAT graphs, |V| = 32M, |E| = 1B, unless otherwise noted
- Algorithms: Breadth-First Search, PageRank
- Metric: speedup compared to processing on the host only
- Testbed: host is a dual-socket Intel Xeon with 16GB; GPU is an Nvidia Tesla C2050 with 3GB
Predicted vs. Achieved Speedup
- Speedup grows linearly with the offloaded part, up to the point where the GPU partition fills GPU memory
- After aggregation, β = 2%; such a low value is critical for BFS
Breakdown of Execution Time
- PageRank is dominated by the compute phase
- Aggregation significantly reduced the communication overhead
- The GPU is >5x faster than the host
So far …
- Performance modeling: simple, and useful for initial system provisioning
- Totem: a generic graph processing framework with algorithm-agnostic optimizations
- Evaluation (Graph500 scale-28): 2x speedup over a symmetric system; 1.13 billion TEPS (traversed edges per second) on a dual-socket, dual-GPU system
- But this used random partitioning. Can we do better?
Better partitioning strategies: the search space
We look for a strategy that:
- Handles large (billion-edge scale) graphs: low space and time complexity, ideally quasi-linear
- Handles scale-free graphs well
- Minimizes the algorithm's execution time by reducing computation time (rather than communication)
The strategies we explore
- HIGH: vertices with high degree are left on the host
- LOW: vertices with low degree are left on the host
- RAND: random placement
Figure: the percentage of vertices placed on the CPU for a scale-28 RMAT graph (|V| = 256M, |E| = 4B)
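A hedged sketch of the HIGH/LOW idea as stated on the slide: order vertices by degree and offload one end of the ordering to the GPU until its edge budget (e.g., bounded by GPU memory) is exhausted. The exact splitting rule Totem uses is not given in this transcript; the function below is an illustrative variant.

```python
def degree_partition(degrees, gpu_edge_budget, strategy="HIGH"):
    """Assign vertices to host/GPU by degree.

    degrees: list where degrees[v] is vertex v's degree.
    gpu_edge_budget: maximum number of edges the GPU partition may hold.
    strategy: "HIGH" keeps high-degree vertices on the host (so the
              low-degree tail is offloaded first); "LOW" keeps
              low-degree vertices on the host.
    Returns the set of vertex ids placed on the GPU.
    """
    order = sorted(range(len(degrees)), key=degrees.__getitem__)
    if strategy == "LOW":  # low-degree stays on host -> offload high-degree first
        order.reverse()
    gpu, used = set(), 0
    for v in order:
        if used + degrees[v] > gpu_edge_budget:
            break
        gpu.add(v)
        used += degrees[v]
    return gpu
```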
Evaluation platform

                      Intel Nehalem (Xeon X5650, 2 sockets)   Fermi GPU (Tesla C2075, 2 GPUs)
Core frequency        2.67 GHz                                1.15 GHz
Num cores (SMs)       6                                       14
HW threads per core   2x                                      48 warps (x32 threads/warp)
Last-level cache      12 MB                                   2 MB
Main memory           144 GB                                  6 GB
Memory bandwidth      32 GB/sec                               144 GB/sec
Total power (TDP)     95 W                                    225 W
BFS performance
Figure: BFS traversal rate for a scale-28 RMAT graph (|V| = 256M, |E| = 4B)
- 2x performance gain!
- LOW: no gain over random!
- Next: exploring the 75% data point
BFS performance: more details
The host is the bottleneck in all cases!
PageRank performance
Figure: PageRank processing rate for a scale-28 RMAT graph (|V| = 256M, |E| = 4B)
- 25% performance gain from better packing
- LOW: minimal gain over random!
Small graphs (scale-25 RMAT graphs: |V| = 32M, |E| = 512M)
Figures: BFS and PageRank
- Intelligent partitioning provides benefits
- Key for performance: load balancing
Uniform graphs (not scale-free)
Figures: BFS on a scale-25 uniform graph (|V| = 32M, |E| = 512M); BFS at scale 28
Hybrid techniques are not useful for uniform graphs.
Scalability
- Graph size: RMAT graphs, scale 25 to 29 (at scale 29: |V| = 512M, |E| = 8B)
- Platform size: 1, 2, and 4 sockets; 2 sockets + 2 GPUs
Figures: BFS and PageRank
Power
- Normalizing performance by power (TDP, thermal design power)
- Metric: million TEPS per watt
Figures: BFS and PageRank (the per-configuration bar labels range from 1.0 to 2.4 million TEPS/watt)
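For reference, the metric is simply the traversal rate divided by the platform's aggregate TDP. The helper below uses the TDP figures from the evaluation-platform table (2 x 95 W sockets, 2 x 225 W GPUs); the traversal-rate argument and the example value are only for illustration.

```python
def mteps_per_watt(teps, n_sockets=2, n_gpus=2, socket_tdp_w=95.0, gpu_tdp_w=225.0):
    """Million traversed edges per second per watt of TDP.

    TDP values default to the evaluation-platform table; pass n_gpus=0
    to model a CPU-only (symmetric) configuration.
    """
    total_watts = n_sockets * socket_tdp_w + n_gpus * gpu_tdp_w
    return (teps / 1e6) / total_watts

# Example with the earlier hybrid result of 1.13 billion TEPS:
print(round(mteps_per_watt(1.13e9), 2))  # ~1.77 MTEPS/watt on a 2-socket + 2-GPU box
```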
Conclusions
Q: Does it make sense to use a hybrid system?
A: Yes! (for large scale-free graphs)
Q: Can one design a processing engine for hybrid platforms that is both generic and efficient?
A: Yes.
Q: Are there near-linear complexity partitioning strategies that enable higher performance?
A: Yes. Partitioning strategies based on vertex connectivity provide better performance than random in all cases.
Q: Should one search for partitioning strategies that reduce the communication overheads (and hope for higher performance)?
A: No (for scale-free graphs).
Q: Which strategies work best?
A: It depends! Large graphs: shape the load. Small graphs: load balancing.
If you were plowing a field, which would you rather use?
- Two oxen, or 1024 chickens?
- Both!
Code available at: netsyslab.ece.ubc.ca
Papers:
- A Yoke of Oxen and a Thousand Chickens for Heavy Lifting Graph Processing. A. Gharaibeh, L. Costa, E. Santos-Neto, M. Ripeanu. PACT 2012.
- On Graphs, GPUs, and Blind Dating: A Workload to Processor Matchmaking Quest. A. Gharaibeh, L. Costa, E. Santos-Neto, M. Ripeanu. IPDPS 2013.
Networked Systems Laboratory (NetSysLab), University of British Columbia
A golf course … a (nudist) beach … (and 199 days of rain each year)