Slide 1
If you were plowing a field, which would you rather use: two oxen or 1024 chickens? (Attributed to Seymour Cray)
Slide 2
A Yoke of Oxen and a Thousand Chickens for Heavy Lifting Graph Processing
Abdullah Gharaibeh, Lauro Beltrão Costa, Elizeu Santos-Neto and Matei Ripeanu
NetSysLab, The University of British Columbia
http://netsyslab.ece.ubc.ca
Slide 3
Graphs are Everywhere
Slide 4
Graph Processing Challenges
- Poor locality and a low compute-to-memory-access ratio: CPUs cope with large caches and summary data structures, GPUs with massive hardware multithreading [Hong, S. 2011].
- Large memory footprint: the host offers large memory (>128 GB), while a GPU is limited to a few gigabytes (e.g., 6 GB).
Slide 5
Motivating Question
Can we efficiently use hybrid systems for large-scale graph processing? Yes we can: 2x speedup on a graph with 4 billion edges.
Slide 6
Methodology
- Performance model: predicts speedup; intuitive.
- Totem: a graph processing engine for hybrid systems; applies algorithm-agnostic optimizations.
- Evaluation: predicted vs. achieved speedup, and hybrid vs. symmetric systems.
Slide 7
The Performance Model (I)
Predicts the speedup obtained from offloading part of the graph to the GPU, compared to processing only on the host. Parameters:
- α: fraction of the edges offloaded to the GPU partition
- β: fraction of boundary edges (edges that cross the host-GPU partition)
- r_cpu: the host's processing rate, in edges per second
- b: host-GPU bus bandwidth
- c: communication rate over the bus, in edges per second
Slide 8
The Performance Model (II)
Worked example: |V| = 32M, |E| = 1B, β = 20%, and r_cpu = 0.5 billion edges per second (the best reported single-node BFS performance [Agarwal, V. 2010]). Assuming a PCI-E bus with b ≈ 4 GB/s and per-edge state m = 4 bytes gives c = b/m ≈ 1 billion edges per second. The worst case for β arises for, e.g., a bipartite graph.
Takeaway: it is beneficial to process the graph on a hybrid system if the communication overhead is kept low (see the worked calculation below).
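The slides give the parameter values but not a closed-form expression. The sketch below assumes one plausible form of the model, speedup(α) = 1 / ((1 − α) + β·r_cpu/c), i.e., the hybrid run is bounded by the host processing its remaining (1 − α) share of the edges plus the boundary-edge transfer; this formula is an assumption for illustration, not necessarily the authors' exact model.

```c
/* Hedged sketch of the offloading speedup model. Assumes the hybrid run is
 * bounded by the host processing its (1 - alpha) share of the edges plus
 * the boundary-edge transfer over the bus:
 *   speedup(alpha) = 1 / ((1 - alpha) + beta * r_cpu / c)
 * The authors' exact formula may differ; parameter values are from slide 8. */
#include <stdio.h>

int main(void) {
    double beta  = 0.20;   /* fraction of boundary edges */
    double r_cpu = 0.5e9;  /* host processing rate, edges per second */
    double b     = 4.0e9;  /* PCI-E bandwidth, bytes per second */
    double m     = 4.0;    /* per-edge state transferred, bytes */
    double c     = b / m;  /* communication rate: ~1 billion edges per second */

    for (double alpha = 0.1; alpha < 0.95; alpha += 0.2) {
        double speedup = 1.0 / ((1.0 - alpha) + beta * r_cpu / c);
        printf("alpha = %.1f -> predicted speedup = %.2fx\n", alpha, speedup);
    }
    return 0;
}
```

Under this assumed form and the slide's parameters, offloading α = 90% of the edges predicts a 5x speedup; with no communication cost (β = 0) the same offload would predict 10x.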
Slide 9
Totem: Programming Model
- Bulk Synchronous Parallel (BSP): rounds of computation and communication phases.
- Updates to remote vertices are delivered in the next round.
- Partitions vote to terminate execution (a sketch of the round loop follows).
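To make the round structure concrete, here is a minimal, self-contained sketch of a BSP driver loop; the partition_t contents and the compute/communicate helpers are illustrative assumptions, not Totem's actual API.

```c
/* Hypothetical BSP driver loop: alternate compute and communicate phases
 * until every partition votes to terminate. Names and types are
 * illustrative assumptions, not Totem's real API. */
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

typedef struct {
  int id;
  int rounds_left;  /* dummy local state: vote to stop when it hits zero */
} partition_t;

/* Computation phase: the kernel manipulates local state only; returns true
 * when this partition votes to terminate. */
static bool compute(partition_t *p) {
  if (p->rounds_left > 0) p->rounds_left--;
  return p->rounds_left == 0;
}

/* Communication phase: in Totem this would move outbox buffers to the
 * remote partitions' inboxes; updates become visible in the next round. */
static void communicate(partition_t *parts, size_t n) {
  (void)parts; (void)n;  /* nothing to exchange in this dummy example */
}

static void bsp_execute(partition_t *parts, size_t n) {
  bool done = false;
  int round = 0;
  while (!done) {
    done = true;
    for (size_t i = 0; i < n; i++)
      if (!compute(&parts[i])) done = false;  /* one dissent keeps going */
    communicate(parts, n);
    printf("finished round %d\n", ++round);
  }
}

int main(void) {
  partition_t parts[] = {{0, 2}, {1, 3}};  /* e.g., a CPU and a GPU partition */
  bsp_execute(parts, 2);
  return 0;
}
```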
Slide 10
Totem: A BSP-based Engine
- Partitions are stored in compressed sparse row (CSR) representation (a sketch follows below).
- Computation phase: the kernel manipulates local state; updates to remote vertices are aggregated locally.
- Communication phase 1: transfer the outbox buffer to the remote partition's input buffer.
- Communication phase 2: merge the received updates with local state.
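For reference, a minimal sketch of a compressed sparse row graph and a neighbour scan; the field names are generic, not Totem's actual definitions.

```c
/* Minimal compressed sparse row (CSR) graph sketch; field names are generic,
 * not Totem's actual definitions. */
#include <stdint.h>
#include <stdio.h>

typedef struct {
  uint32_t  vertex_count;
  uint64_t  edge_count;
  uint64_t *row_offsets;     /* length vertex_count + 1 */
  uint32_t *column_indices;  /* length edge_count: destination vertex ids */
} csr_graph_t;

/* Scan the out-neighbours of vertex v: edges row_offsets[v]..row_offsets[v+1]. */
static void print_neighbours(const csr_graph_t *g, uint32_t v) {
  for (uint64_t e = g->row_offsets[v]; e < g->row_offsets[v + 1]; e++)
    printf("%u -> %u\n", v, g->column_indices[e]);
}

int main(void) {
  /* Tiny 3-vertex example with edges 0->1, 0->2, 1->2. */
  uint64_t offsets[] = {0, 2, 3, 3};
  uint32_t columns[] = {1, 2, 2};
  csr_graph_t g = {3, 3, offsets, columns};
  for (uint32_t v = 0; v < g.vertex_count; v++) print_neighbours(&g, v);
  return 0;
}
```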
Slide 11
The Aggregation Opportunity
Real-world graphs are mostly scale-free, i.e., they have a skewed degree distribution, so many boundary edges point to the same remote vertices and their updates can be aggregated locally (sketched below). For random graphs with |E| = 512 million: a sparse graph sees a ~5x reduction, while a denser graph offers a better aggregation opportunity, a ~50x reduction.
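A minimal sketch of the aggregation idea, assuming an additive (PageRank-style) update; the outbox layout and helper names are illustrative assumptions, not Totem's.

```c
/* Sketch of per-remote-vertex aggregation of additive updates (e.g.,
 * PageRank contributions): instead of one message per boundary edge,
 * contributions to the same remote vertex are summed into a single outbox
 * slot, which is what enables the ~5x-50x reductions mentioned above.
 * Layout and names are illustrative, not Totem's. */
#include <stdint.h>
#include <stdio.h>

typedef struct {
  uint32_t  count;       /* number of distinct remote vertices */
  uint32_t *remote_ids;  /* remote vertex id for each outbox slot */
  double   *values;      /* one aggregated value per remote vertex */
} outbox_t;

/* Map a remote vertex id to its outbox slot (linear scan for brevity;
 * a real engine would precompute this mapping when partitioning). */
static uint32_t slot_of(const outbox_t *out, uint32_t remote_vertex) {
  for (uint32_t i = 0; i < out->count; i++)
    if (out->remote_ids[i] == remote_vertex) return i;
  return UINT32_MAX;  /* not a boundary vertex */
}

/* Aggregate a contribution locally instead of emitting a per-edge message. */
static void send_update(outbox_t *out, uint32_t remote_vertex, double value) {
  out->values[slot_of(out, remote_vertex)] += value;
}

int main(void) {
  uint32_t ids[]  = {7, 9};   /* two boundary vertices on the remote side */
  double values[] = {0.0, 0.0};
  outbox_t out = {2, ids, values};

  /* Three boundary edges, but only two aggregated slots get transferred. */
  send_update(&out, 7, 0.25);
  send_update(&out, 7, 0.10);
  send_update(&out, 9, 0.40);
  printf("vertex 7: %.2f, vertex 9: %.2f\n", out.values[0], out.values[1]);
  return 0;
}
```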
Slide 12
Evaluation Setup
- Workload: R-MAT graphs with |V| = 32M and |E| = 1B, unless otherwise noted.
- Algorithms: Breadth-First Search and PageRank.
- Metric: speedup compared to processing on the host only.
- Testbed: host is a dual-socket Intel Xeon with 16 GB of memory; GPU is an Nvidia Tesla C2050 with 3 GB.
Slide 13
Predicted vs. Achieved Speedup
- Linear speedup with respect to the offloaded part.
- The GPU partition fills GPU memory.
- After aggregation, β = 2%; such a low value is critical for BFS.
Slide 14
Breakdown of Execution Time
- PageRank is dominated by the compute phase.
- Aggregation significantly reduces the communication overhead.
- The GPU is more than 5x faster than the host.
Slide 15
Effect of Graph Density
- Sparser graphs have a higher β.
- The deviation from the model's prediction is due to not incorporating pre- and post-transfer overheads in the model.
Slide 16
Contributions
- Performance modeling: simple, and useful for initial system provisioning.
- Totem: a generic graph processing framework with algorithm-agnostic optimizations.
- Evaluation (Graph500 scale-28): 2x speedup over a symmetric system; 1.13 billion TEPS (traversed edges per second) on a dual-socket, dual-GPU system.
Slide 17
Questions? Code available at: netsyslab.ece.ubc.ca