Slide 1
If you were plowing a field, which would you rather use: two oxen or 1024 chickens? (Attributed to Seymour Cray)
Slide 2
A Yoke of Oxen and a Thousand Chickens for Heavy Lifting Graph Processing
Abdullah Gharaibeh, Lauro Beltrão Costa, Elizeu Santos-Neto and Matei Ripeanu
NetSysLab, The University of British Columbia
http://netsyslab.ece.ubc.ca
Slide 3
Graphs are Everywhere
Slide 4
Graph Processing Challenges
- Poor locality and a low compute-to-memory-access ratio: CPUs cope with large caches and summary data structures, GPUs with massive hardware multithreading [Hong, S. 2011].
- Large memory footprint: the host offers large memory (>128 GB), while a GPU is limited to a few gigabytes (e.g., 6 GB).
Slide 5
Motivating Question
Can we efficiently use hybrid systems for large-scale graph processing? Yes we can: 2x speedup on a graph with 4 billion edges.
Slide 6
Methodology
- Performance model: predicts speedup; intuitive.
- Totem: a graph processing engine for hybrid systems; applies algorithm-agnostic optimizations.
- Evaluation: predicted vs. achieved speedup, and hybrid vs. symmetric systems.
Slide 7
The Performance Model (I)
Predicts the speedup obtained from offloading part of the graph to the GPU, compared to processing only on the host. Parameters:
- α: fraction of the edges offloaded to the GPU partition
- β: fraction of boundary edges (edges that cross the host-GPU partition)
- r_cpu: the host's processing rate, in edges per second
- b: host-GPU bus bandwidth
- c: communication rate over the bus, in edges per second
Slide 8
The Performance Model (II)
Worked example: |V| = 32M, |E| = 1B, β = 20%, and r_cpu = 0.5 billion edges per second (the best reported single-node BFS performance [Agarwal, V. 2010]). Assuming a PCI-E bus with b ≈ 4 GB/s and per-edge state m = 4 bytes gives c = b/m ≈ 1 billion edges per second. The worst case for β arises for, e.g., a bipartite graph.
Takeaway: it is beneficial to process the graph on a hybrid system if the communication overhead is kept low (see the worked calculation below).
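The slides give the parameter values but not a closed-form expression. The sketch below assumes one plausible form of the model, speedup(α) = 1 / ((1 − α) + β·r_cpu/c), i.e., the hybrid run is bounded by the host processing its remaining (1 − α) share of the edges plus the boundary-edge transfer; this formula is an assumption for illustration, not necessarily the authors' exact model.

```c
/* Hedged sketch of the offloading speedup model. Assumes the hybrid run is
 * bounded by the host processing its (1 - alpha) share of the edges plus
 * the boundary-edge transfer over the bus:
 *   speedup(alpha) = 1 / ((1 - alpha) + beta * r_cpu / c)
 * The authors' exact formula may differ; parameter values are from slide 8. */
#include <stdio.h>

int main(void) {
    double beta  = 0.20;   /* fraction of boundary edges */
    double r_cpu = 0.5e9;  /* host processing rate, edges per second */
    double b     = 4.0e9;  /* PCI-E bandwidth, bytes per second */
    double m     = 4.0;    /* per-edge state transferred, bytes */
    double c     = b / m;  /* communication rate: ~1 billion edges per second */

    for (double alpha = 0.1; alpha < 0.95; alpha += 0.2) {
        double speedup = 1.0 / ((1.0 - alpha) + beta * r_cpu / c);
        printf("alpha = %.1f -> predicted speedup = %.2fx\n", alpha, speedup);
    }
    return 0;
}
```

Under this assumed form and the slide's parameters, offloading α = 90% of the edges predicts a 5x speedup; with no communication cost (β = 0) the same offload would predict 10x.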
Slide 9
Totem: Programming Model
- Bulk Synchronous Parallel (BSP): rounds of computation and communication phases.
- Updates to remote vertices are delivered in the next round.
- Partitions vote to terminate execution (a sketch of the round loop follows).
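To make the round structure concrete, here is a minimal, self-contained sketch of a BSP driver loop; the partition_t contents and the compute/communicate helpers are illustrative assumptions, not Totem's actual API.

```c
/* Hypothetical BSP driver loop: alternate compute and communicate phases
 * until every partition votes to terminate. Names and types are
 * illustrative assumptions, not Totem's real API. */
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

typedef struct {
  int id;
  int rounds_left;  /* dummy local state: vote to stop when it hits zero */
} partition_t;

/* Computation phase: the kernel manipulates local state only; returns true
 * when this partition votes to terminate. */
static bool compute(partition_t *p) {
  if (p->rounds_left > 0) p->rounds_left--;
  return p->rounds_left == 0;
}

/* Communication phase: in Totem this would move outbox buffers to the
 * remote partitions' inboxes; updates become visible in the next round. */
static void communicate(partition_t *parts, size_t n) {
  (void)parts; (void)n;  /* nothing to exchange in this dummy example */
}

static void bsp_execute(partition_t *parts, size_t n) {
  bool done = false;
  int round = 0;
  while (!done) {
    done = true;
    for (size_t i = 0; i < n; i++)
      if (!compute(&parts[i])) done = false;  /* one dissent keeps going */
    communicate(parts, n);
    printf("finished round %d\n", ++round);
  }
}

int main(void) {
  partition_t parts[] = {{0, 2}, {1, 3}};  /* e.g., a CPU and a GPU partition */
  bsp_execute(parts, 2);
  return 0;
}
```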
Slide 10
Totem: A BSP-based Engine
- Partitions are stored in compressed sparse row (CSR) representation (a sketch follows below).
- Computation phase: the kernel manipulates local state; updates to remote vertices are aggregated locally.
- Communication phase 1: transfer the outbox buffer to the remote partition's input buffer.
- Communication phase 2: merge the received updates with local state.
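For reference, a minimal sketch of a compressed sparse row graph and a neighbour scan; the field names are generic, not Totem's actual definitions.

```c
/* Minimal compressed sparse row (CSR) graph sketch; field names are generic,
 * not Totem's actual definitions. */
#include <stdint.h>
#include <stdio.h>

typedef struct {
  uint32_t  vertex_count;
  uint64_t  edge_count;
  uint64_t *row_offsets;     /* length vertex_count + 1 */
  uint32_t *column_indices;  /* length edge_count: destination vertex ids */
} csr_graph_t;

/* Scan the out-neighbours of vertex v: edges row_offsets[v]..row_offsets[v+1]. */
static void print_neighbours(const csr_graph_t *g, uint32_t v) {
  for (uint64_t e = g->row_offsets[v]; e < g->row_offsets[v + 1]; e++)
    printf("%u -> %u\n", v, g->column_indices[e]);
}

int main(void) {
  /* Tiny 3-vertex example with edges 0->1, 0->2, 1->2. */
  uint64_t offsets[] = {0, 2, 3, 3};
  uint32_t columns[] = {1, 2, 2};
  csr_graph_t g = {3, 3, offsets, columns};
  for (uint32_t v = 0; v < g.vertex_count; v++) print_neighbours(&g, v);
  return 0;
}
```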
Slide 11
The Aggregation Opportunity
Real-world graphs are mostly scale-free, i.e., they have a skewed degree distribution, so many boundary edges point to the same remote vertices and their updates can be aggregated locally (sketched below). For random graphs with |E| = 512 million: a sparse graph sees a ~5x reduction, while a denser graph offers a better aggregation opportunity, a ~50x reduction.
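A minimal sketch of the aggregation idea, assuming an additive (PageRank-style) update; the outbox layout and helper names are illustrative assumptions, not Totem's.

```c
/* Sketch of per-remote-vertex aggregation of additive updates (e.g.,
 * PageRank contributions): instead of one message per boundary edge,
 * contributions to the same remote vertex are summed into a single outbox
 * slot, which is what enables the ~5x-50x reductions mentioned above.
 * Layout and names are illustrative, not Totem's. */
#include <stdint.h>
#include <stdio.h>

typedef struct {
  uint32_t  count;       /* number of distinct remote vertices */
  uint32_t *remote_ids;  /* remote vertex id for each outbox slot */
  double   *values;      /* one aggregated value per remote vertex */
} outbox_t;

/* Map a remote vertex id to its outbox slot (linear scan for brevity;
 * a real engine would precompute this mapping when partitioning). */
static uint32_t slot_of(const outbox_t *out, uint32_t remote_vertex) {
  for (uint32_t i = 0; i < out->count; i++)
    if (out->remote_ids[i] == remote_vertex) return i;
  return UINT32_MAX;  /* not a boundary vertex */
}

/* Aggregate a contribution locally instead of emitting a per-edge message. */
static void send_update(outbox_t *out, uint32_t remote_vertex, double value) {
  out->values[slot_of(out, remote_vertex)] += value;
}

int main(void) {
  uint32_t ids[]  = {7, 9};   /* two boundary vertices on the remote side */
  double values[] = {0.0, 0.0};
  outbox_t out = {2, ids, values};

  /* Three boundary edges, but only two aggregated slots get transferred. */
  send_update(&out, 7, 0.25);
  send_update(&out, 7, 0.10);
  send_update(&out, 9, 0.40);
  printf("vertex 7: %.2f, vertex 9: %.2f\n", out.values[0], out.values[1]);
  return 0;
}
```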
Slide 12
Evaluation Setup
- Workload: R-MAT graphs with |V| = 32M and |E| = 1B, unless otherwise noted.
- Algorithms: Breadth-First Search and PageRank.
- Metric: speedup compared to processing on the host only.
- Testbed: host is a dual-socket Intel Xeon with 16 GB of memory; GPU is an Nvidia Tesla C2050 with 3 GB.
Slide 13
Predicted vs. Achieved Speedup
- Linear speedup with respect to the offloaded part.
- The GPU partition fills GPU memory.
- After aggregation, β = 2%; such a low value is critical for BFS.
Slide 14
Breakdown of Execution Time
- PageRank is dominated by the compute phase.
- Aggregation significantly reduces the communication overhead.
- The GPU is more than 5x faster than the host.
Slide 15
Effect of Graph Density
- Sparser graphs have a higher β.
- The deviation from the model's prediction is due to not incorporating pre- and post-transfer overheads in the model.
Slide 16
Contributions
- Performance modeling: simple, and useful for initial system provisioning.
- Totem: a generic graph processing framework with algorithm-agnostic optimizations.
- Evaluation (Graph500 scale-28): 2x speedup over a symmetric system; 1.13 billion TEPS (traversed edges per second) on a dual-socket, dual-GPU system.
Slide 17
Questions? Code available at: netsyslab.ece.ubc.ca