
1 If you were plowing a field, which would you rather use? Two oxen, or 1024 chickens? (Attributed to S. Cray)

2 Our ‘field’ to plow: graph processing. |V| = 1.4B, |E| = 6.6B

3 Abdullah Gharaibeh, Lauro Beltrão Costa, Elizeu Santos-Neto, Matei Ripeanu
NetSysLab, The University of British Columbia
http://netsyslab.ece.ubc.ca

4 Graph Processing: The Challenges
 Poor locality: data-dependent memory access patterns
 Large memory footprint: >128GB
 Varying degrees of parallelism (both intra- and inter-stage)
 Low compute-to-memory access ratio
CPUs cope via caches + summary data structures.

5 Graph Processing: The GPU Opportunity
 Poor locality, data-dependent memory access patterns: masked by massive hardware multithreading
 Low compute-to-memory access ratio: caches + summary data structures
 Large memory footprint: >128GB on CPUs, but only 6GB on GPUs!
 Varying degrees of parallelism (both intra- and inter-stage)
=> Assemble a heterogeneous platform

6 Motivating Question
Can we efficiently use hybrid systems for large-scale graph processing?
YES WE CAN! 2x speedup (8 billion edges)

7 Methodology
 Performance model: predicts speedup; intuitive
 Totem: a graph processing engine for hybrid systems; applies algorithm-agnostic optimizations
 Evaluation: predicted vs. achieved; hybrid vs. symmetric

8 The Performance Model (I)
Goal: predict the speedup obtained from offloading part of the graph to the GPU (when compared to processing only on the host).
Model parameters:
 α = fraction of edges processed on the host
 β = fraction of edges that cross the partition (communicated over the bus)
 r_cpu = host processing rate, in edges per second
 c = communication rate of the bus, in edges per second (c = b / m, for bus bandwidth b and per-edge state of m bytes)

9 The Performance Model (II)
 Assume a PCI-E bus with b ≈ 4 GB/sec and per-edge state m = 4 bytes => c = 1 billion EPS
 r_cpu = 0.5 BEPS: best reported single-node BFS performance [Agarwal, V. 2010] (|V| = 32M, |E| = 1B)
 β = 20%: worst case (e.g., a bipartite graph)
It is beneficial to process the graph on a hybrid system if communication overhead is low.
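
To make the model concrete, here is a minimal Python sketch of the speedup calculation these two slides imply. The closed form 1/(α + β·r_cpu/c) and the assumption that the GPU finishes its partition within the host's time are my reconstruction of how the slides' numbers fit together, not shown verbatim on the slide:

```python
def predicted_speedup(alpha, beta, r_cpu, c):
    """Predicted speedup of hybrid (host + GPU) over host-only processing.

    alpha: fraction of edges left on the host
    beta:  fraction of edges crossing the partition (communicated)
    r_cpu: host processing rate, edges per second
    c:     PCI-E communication rate, edges per second (bandwidth / per-edge state)

    Host-only time ~ |E|/r_cpu; hybrid time ~ alpha*|E|/r_cpu + beta*|E|/c,
    assuming the GPU finishes its (smaller) partition within the host's time.
    """
    return 1.0 / (alpha + beta * r_cpu / c)

# Slide 9 numbers: b ~ 4 GB/s, m = 4 bytes => c = 1e9 EPS; r_cpu = 0.5e9 EPS.
# Offloading half the edges (alpha = 0.5) with worst-case beta = 20% still
# predicts 1 / (0.5 + 0.2 * 0.5) = ~1.67x over host-only processing.
print(predicted_speedup(alpha=0.5, beta=0.2, r_cpu=0.5e9, c=1e9))
```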

10 Totem: Programming Model
Bulk Synchronous Parallel
 Rounds of computation and communication phases
 Updates to remote vertices are delivered in the next round
 Partitions vote to terminate execution
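
A schematic of this BSP loop, as a sketch only; `compute`, `exchange`, and `vote_to_terminate` are illustrative names, not Totem's actual API:

```python
def bsp_run(partitions):
    """Run BSP supersteps until every partition votes to terminate.

    Each round = computation phase + communication phase; updates sent
    to remote vertices become visible only in the NEXT round.
    """
    while True:
        for p in partitions:   # computation phase (parallel in practice)
            p.compute()        # read this round's inbox, fill outboxes
        for p in partitions:   # communication phase
            p.exchange()       # move outbox contents to remote inboxes
        if all(p.vote_to_terminate() for p in partitions):
            break
```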

11 Totem: A BSP-based Engine
 Compressed sparse row representation
 Computation: kernel manipulates local state; updates to remote vertices are aggregated locally
 Comm1: transfer outbox buffer to remote input buffer
 Comm2: merge with local state
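
The compressed sparse row layout stores the graph as two flat arrays, which is what makes partitions compact enough for GPU memory. A minimal self-contained sketch (class and field names are mine, not Totem's):

```python
class CSRGraph:
    """Compressed sparse row: the neighbors of vertex v live in
    column_indices[row_offsets[v] : row_offsets[v + 1]]."""

    def __init__(self, num_vertices, edges):
        # Count the out-degree of each vertex, then prefix-sum into offsets.
        counts = [0] * num_vertices
        for src, _ in edges:
            counts[src] += 1
        self.row_offsets = [0] * (num_vertices + 1)
        for v in range(num_vertices):
            self.row_offsets[v + 1] = self.row_offsets[v] + counts[v]
        # Scatter destinations into the flat adjacency array.
        self.column_indices = [0] * len(edges)
        cursor = list(self.row_offsets)
        for src, dst in edges:
            self.column_indices[cursor[src]] = dst
            cursor[src] += 1

    def neighbors(self, v):
        return self.column_indices[self.row_offsets[v]:self.row_offsets[v + 1]]

# CSRGraph(4, [(0, 1), (0, 2), (2, 3)]).neighbors(0) -> [1, 2]
```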

12 The Aggregation Opportunity
Real-world graphs are mostly scale-free: skewed degree distribution.
(Figure: |E| = 512 million, random partitioning.)
 Sparse graph: ~5x reduction
 A denser graph has a better opportunity for aggregation: ~50x reduction
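
Why skew helps: many boundary edges point at the same few high-degree remote vertices, so combining updates per destination collapses many messages into one. A hedged sketch; summing, as in PageRank, is one example combine operator, and the slide does not spell out the operator per algorithm:

```python
from collections import defaultdict

def aggregate_outbox(remote_updates):
    """Combine all updates destined for the same remote vertex into a
    single message before it crosses the PCI-E bus."""
    outbox = defaultdict(float)
    for remote_vertex, contribution in remote_updates:
        outbox[remote_vertex] += contribution  # PageRank-style sum
    return outbox  # one entry per distinct remote vertex

# The more boundary edges share a destination (denser / more skewed graph),
# the bigger the reduction: ~5x for the sparse graph, ~50x for the denser one.
```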

13 Evaluation Setup
 Workload: R-MAT graphs; |V|=32M, |E|=1B, unless otherwise noted
 Algorithms: Breadth-First Search, PageRank
 Metric: speedup compared to processing on the host only
 Testbed: host: dual-socket Intel Xeon with 16GB; GPU: NVIDIA Tesla C2050 with 3GB
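
R-MAT graphs are generated by recursively dropping each edge into one quadrant of the adjacency matrix with skewed probabilities, which yields the scale-free degree distributions this evaluation relies on. A minimal sketch; the quadrant probabilities a=0.57, b=c=0.19 (Graph500's defaults) are an assumption, since the slides do not state the values used:

```python
import random

def rmat_edge(scale, a=0.57, b=0.19, c=0.19):
    """Sample one edge of a 2**scale-vertex R-MAT graph by recursively
    picking an adjacency-matrix quadrant with skewed probabilities."""
    src = dst = 0
    for _ in range(scale):
        r = random.random()
        src, dst = src << 1, dst << 1
        if r < a:                # top-left quadrant: neither bit set
            pass
        elif r < a + b:          # top-right: destination bit set
            dst |= 1
        elif r < a + b + c:      # bottom-left: source bit set
            src |= 1
        else:                    # bottom-right: both bits set
            src |= 1
            dst |= 1
    return src, dst

# edges = [rmat_edge(25) for _ in range(10**9)]  # |V| = 32M, |E| = 1B, as above
```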

14 Predicted vs. Achieved Speedup
 Linear speedup with respect to the offloaded part
 The GPU partition fills GPU memory
 After aggregation, β = 2%; a low value is critical for BFS

15 Breakdown of Execution Time
 PageRank is dominated by the compute phase
 Aggregation significantly reduced communication overhead
 The GPU is >5x faster than the host

16 So far …
 Performance modeling: simple; useful for initial system provisioning
 Totem: generic graph processing framework; algorithm-agnostic optimizations
 Evaluation (Graph500 scale-28): 2x speedup over a symmetric system; 1.13 billion TEPS (traversed edges per second) on a dual-socket, dual-GPU system
But, random partitioning! Can we do better?

17 Better partitioning strategies: the search space
 Handles large (billion-edge scale) graphs
o Low space and time complexity; ideally, quasi-linear!
 Handles scale-free graphs well
 Minimizes the algorithm's execution time by reducing computation time
o (rather than communication)

18 The strategies we explore
 HIGH: vertices with high degree left on the host
 LOW: vertices with low degree left on the host
 RAND: random assignment
(Figure: the percentage of vertices placed on the CPU for a scale-28 RMAT graph, |V|=256M, |E|=4B.)
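
All three strategies reduce to a single degree-ordered split, which keeps the cost quasi-linear (a sort), matching the search-space constraints on slide 17. A sketch; the function shape and the choice to fill the host by a vertex fraction are illustrative assumptions:

```python
import random

def host_partition(degrees, host_fraction, strategy):
    """Return the set of vertices left on the host.

    HIGH: highest-degree vertices stay on the host.
    LOW:  lowest-degree vertices stay on the host.
    RAND: a uniformly random subset stays on the host.
    """
    vertices = list(range(len(degrees)))
    if strategy == "HIGH":
        vertices.sort(key=lambda v: degrees[v], reverse=True)
    elif strategy == "LOW":
        vertices.sort(key=lambda v: degrees[v])
    elif strategy == "RAND":
        random.shuffle(vertices)
    k = int(host_fraction * len(vertices))
    return set(vertices[:k])  # the rest is offloaded to the GPU(s)
```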

19 Evaluation platform

                        Intel Nehalem             Fermi GPU
                        Xeon X5650 (2x sockets)   Tesla C2075 (2x GPUs)
  Core frequency        2.67 GHz                  1.15 GHz
  Num cores (SMs)       6                         14
  HW threads per core   2x                        48 warps (x32 threads/warp)
  Last-level cache      12 MB                     2 MB
  Main memory           144 GB                    6 GB
  Memory bandwidth      32 GB/sec                 144 GB/sec
  Total power (TDP)     95 W                      225 W

20 BFS performance
BFS traversal rate for a scale-28 RMAT graph (|V|=256M, |E|=4B)
 HIGH: 2x performance gain!
 LOW: no gain over random!
The next slide explores the 75% data point.

21 BFS performance: more details
The host is the bottleneck in all cases!

22 PageRank performance
PageRank processing rate for a scale-28 RMAT graph (|V|=256M, |E|=4B)
 LOW: minimal gain over random!
 Better packing: 25% performance gain!

23 Small graphs (scale-25 RMAT graphs: |V|=32M, |E|=512M)
BFS and PageRank:
 Intelligent partitioning provides benefits
 Key for performance: load balancing

24 Uniform graphs (not scale-free)
BFS on a scale-25 uniform graph (|V|=32M, |E|=512M) and on scale-28.
 Hybrid techniques are not useful for uniform graphs

25 Scalability
 Graph size: RMAT graphs, scale 25 to 29 (up to |V|=512M, |E|=8B)
 Platform size: 1, 2, 4 sockets; 2x sockets + 2x GPUs
BFS and PageRank

26 Power
Normalizing by power (TDP: thermal design power). Metric: million TEPS per watt.
(Figure: MTEPS/watt for BFS and PageRank across configurations; values range from about 1.0 to 2.4.)
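
How the metric works out for the headline number, combining the 1.13 billion TEPS from slide 16 with the TDP figures from slide 19. A worked example, not a value read off the chart:

```python
# Power-normalized performance: million TEPS per watt of TDP.
teps = 1.13e9                    # slide 16: BFS rate, dual-socket + dual-GPU
tdp_watts = 2 * 95 + 2 * 225     # slide 19: two X5650 CPUs + two C2075 GPUs
mteps_per_watt = teps / tdp_watts / 1e6
print(round(mteps_per_watt, 2))  # ~1.77 MTEPS/W for this configuration
```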

27 Conclusions
Q: Does it make sense to use a hybrid system? A: Yes! (for large scale-free graphs)
Q: Can one design a processing engine for hybrid platforms that is both generic and efficient? A: Yes.
Q: Are there near-linear-complexity partitioning strategies that enable higher performance? A: Yes; partitioning strategies based on vertex connectivity provide better performance than random in all cases.
Q: Should one search for partitioning strategies that reduce the communication overheads (and hope for higher performance)? A: No (for scale-free graphs).
Q: Which strategies work best? A: It depends! Large graphs: shape the load. Small graphs: load balancing.

28 If you were plowing a field, which would you rather use?
- Two oxen, or 1024 chickens?
- Both!

29 Code available at: netsyslab.ece.ubc.ca
Papers:
 A Yoke of Oxen and a Thousand Chickens for Heavy Lifting Graph Processing. A. Gharaibeh, L. Costa, E. Santos-Neto, M. Ripeanu. PACT 2012.
 On Graphs, GPUs, and Blind Dating: A Workload to Processor Matchmaking Quest. A. Gharaibeh, L. Costa, E. Santos-Neto, M. Ripeanu. IPDPS 2013.

30 Networked Systems Laboratory (NetSysLab), University of British Columbia
A golf course … a (nudist) beach … (and 199 days of rain each year)

