Slide 1: "If you were plowing a field, which would you rather use? Two oxen, or 1024 chickens?" (attributed to S. Cray)
Abdullah Gharaibeh, Lauro Costa, Elizeu Santos-Neto and Matei Ripeanu, NetSysLab, The University of British Columbia
Slide 2: The 'Field' to Plow: Graph Processing
- The web: 1.4B pages, 6.6B links
- Social networks: 1B users, 150B friendships
- The human brain: 100B neurons, 700T connections
Slide 3: Graph Processing Challenges
- Poor locality and data-dependent memory access patterns, mitigated on CPUs by caches plus summary data structures
- Large memory footprint (>256GB), met by the CPUs' large main memory
- Low compute-to-memory access ratio
- Varying degrees of parallelism (both intra- and inter-stage)
Slide 4: The GPU Opportunity
The same challenges, two kinds of processors:
- CPUs: caches plus summary data structures for poor locality and data-dependent access patterns; large main memory (>256GB) for the footprint
- GPUs: massive hardware multithreading masks the low compute-to-memory access ratio, but only 6GB of memory!
- Varying degrees of parallelism (both intra- and inter-stage) suit different processors
Idea: assemble a hybrid platform.
Slide 5: Motivating Question
Can we efficiently use hybrid systems for large-scale graph processing?
YES WE CAN! 2x speedup (4 billion edges)
Slide 6: Roadmap
- Performance model: predicts speedup; intuitive
- Totem: a graph processing engine for hybrid systems; applies algorithm-agnostic optimizations
- Partitioning strategies: can one do better than random?
- Evaluation: partitioning strategies; hybrid vs. symmetric
Methodology published in: "A Yoke of Oxen and a Thousand Chickens for Heavy Lifting Graph Processing" (PACT 2012) and "On Graphs, GPUs, and Blind Dating: A Workload to Processor Matchmaking Quest" (IPDPS 2013)
Slide 7: Totem: Compute Model
Bulk Synchronous Parallel (BSP):
- Rounds of computation and communication phases
- Updates to remote vertices are delivered in the next round
- Partitions vote to terminate execution
(See the sketch below.)
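To make the round structure concrete, here is a minimal, self-contained sketch of a BSP driver loop with the semantics this slide describes: a computation phase, a communication phase, and per-partition termination votes. All names (partition_t, compute, communicate) are hypothetical illustrations, not Totem's actual API.

```c
/* Minimal, self-contained sketch of the BSP round structure described
 * above. All names are hypothetical illustrations, not Totem's API. */
#include <stdbool.h>
#include <stdio.h>

#define NUM_PARTITIONS 2

typedef struct {
  int id;
  int rounds_left;  /* stand-in for "this partition still has work" */
} partition_t;

/* Computation phase: run the algorithm's kernel on one partition.
 * Returns true if this partition votes to terminate. */
static bool compute(partition_t *p) {
  if (p->rounds_left == 0) return true;
  p->rounds_left--;
  return false;
}

/* Communication phase: in a real engine, updates buffered during
 * compute() are delivered here, so remote vertices observe them only
 * in the next round (BSP semantics). This stub just logs the phase. */
static void communicate(partition_t *parts, int n) {
  (void)parts;
  printf("exchanging messages among %d partitions\n", n);
}

int main(void) {
  partition_t parts[NUM_PARTITIONS] = {{0, 3}, {1, 2}};
  bool done = false;
  while (!done) {  /* one iteration per BSP round */
    done = true;
    for (int i = 0; i < NUM_PARTITIONS; i++)
      if (!compute(&parts[i])) done = false;  /* any partition can veto */
    if (!done) communicate(parts, NUM_PARTITIONS);
  }
  return 0;
}
```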
Slide 8: Totem: Programming Model
- A user-defined kernel runs simultaneously on each partition
- Vertices are processed in parallel within each partition
- Messages to remote vertices are combined if a combiner function (msg_combine_func) is defined
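Below is a hedged sketch of what a combiner might look like for a BFS-style traversal, where many messages headed for the same remote vertex can be folded into one by keeping the minimum. The slide names the hook msg_combine_func, but the signature and types here are illustrative only, not Totem's actual interface.

```c
/* Hypothetical sketch of a message combiner (the slide's
 * msg_combine_func). When a combiner is defined, many messages headed
 * for the same remote vertex are folded into one before transfer.
 * Names and signatures are illustrative, not Totem's actual API. */
#include <stdio.h>

typedef unsigned int msg_t;

/* Example combiner for a BFS-style traversal: keep the smallest level. */
static void msg_combine_min(msg_t *dst, msg_t src) {
  if (src < *dst) *dst = src;
}

int main(void) {
  msg_t inbox = 7;             /* buffered message for one remote vertex */
  msg_combine_min(&inbox, 3);  /* two more messages arrive... */
  msg_combine_min(&inbox, 5);
  printf("combined message: %u\n", inbox);  /* ...only "3" is sent */
  return 0;
}
```

Combining shrinks the communication phase's transfer volume, which matters because CPU-GPU traffic crosses the comparatively slow PCIe bus.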
Slide 9: Target Platform

                      Intel Nehalem             Fermi GPU
                      Xeon X5650 (2 sockets)    Tesla C2075 (2 GPUs)
  Core frequency      2.67GHz                   1.15GHz
  Num HW-threads      12                        448
  Last-level cache    12MB                      2MB
  Main memory         144GB                     6GB
  Memory bandwidth    32GB/sec                  144GB/sec
Slide 10: Use Case: Breadth-First Search
- Graph traversal algorithm
- Very little computation per vertex
- Sensitive to cache utilization
[Figure: example traversal showing the Level 0 and Level 1 frontiers and the visited bit-vector]
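As an illustration of why the visited bit-vector helps, here is a minimal, single-threaded sketch of level-synchronous BFS over a tiny CSR graph: one bit per vertex keeps the membership test compact and cache-friendly. The graph, arrays, and names are hypothetical, not Totem's implementation.

```c
/* Minimal sketch of level-synchronous BFS with a visited bit-vector.
 * Single-threaded for clarity; all names are hypothetical. */
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define NUM_V 5

/* Tiny example graph in CSR form: vertex v's neighbors are
 * adj[row[v] .. row[v+1]-1]. Edges: 0->1, 0->2, 1->3, 2->4. */
static const int row[NUM_V + 1] = {0, 2, 3, 4, 4, 4};
static const int adj[]          = {1, 2, 3, 4};

static int  test_bit(const uint64_t *bv, int v) { return (int)((bv[v >> 6] >> (v & 63)) & 1); }
static void set_bit(uint64_t *bv, int v)        { bv[v >> 6] |= 1ULL << (v & 63); }

int main(void) {
  uint64_t visited[(NUM_V + 63) / 64];  /* the visited bit-vector */
  int level[NUM_V];
  memset(visited, 0, sizeof visited);
  for (int v = 0; v < NUM_V; v++) level[v] = -1;

  set_bit(visited, 0);  /* source vertex */
  level[0] = 0;
  int cur = 0, active = 1;
  while (active) {                       /* one iteration per BFS level */
    active = 0;
    for (int v = 0; v < NUM_V; v++) {
      if (level[v] != cur) continue;     /* v is in the current frontier */
      for (int e = row[v]; e < row[v + 1]; e++) {
        int u = adj[e];
        if (!test_bit(visited, u)) {     /* cheap membership check */
          set_bit(visited, u);
          level[u] = cur + 1;
          active = 1;
        }
      }
    }
    cur++;
  }
  for (int v = 0; v < NUM_V; v++) printf("vertex %d: level %d\n", v, level[v]);
  return 0;
}
```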
Slide 11: BFS Performance
Workload: scale-28 Graph500 benchmark (|V|=256M, |E|=4B); 1S1G = 1 CPU socket + 1 GPU
- Limited GPU memory prevents offloading more
- Linear improvement with respect to the offloaded part
- Can we do better than random partitioning?
Slide 12: BFS Performance
Workload: scale-28 Graph500 benchmark (|V|=256M, |E|=4B); 1S1G = 1 CPU socket + 1 GPU
- The computation phase dominates run time
Slide 13: Workload-Processor Matchmaking
Observation: real-world graphs have power-law edge degree distributions (many low-degree vertices, few high-degree vertices)
Idea:
- Place the many low-degree vertices on the GPU
- Place the few high-degree vertices on the CPU
[Figure: log-log plot of the edge degree distribution for the LiveJournal social network]
(A sketch of this placement rule follows.)
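Here is a minimal sketch of the placement rule above, assuming a hypothetical degree threshold: vertices at or above it go to the CPU partition, the rest to the GPU partition. A real partitioner must also respect the GPU's limited memory (slide 11); the threshold and all names are illustrative only.

```c
/* Hypothetical sketch of degree-based vertex placement: the few
 * high-degree vertices go to the CPU, the many low-degree vertices go
 * to the GPU. The threshold and all names are illustrative only. */
#include <stdio.h>

#define NUM_V 8
#define DEGREE_THRESHOLD 100  /* assumed cut-off, tuned per workload */

typedef enum { PART_CPU, PART_GPU } part_t;

int main(void) {
  /* Power-law-like degrees: a couple of hubs, many low-degree vertices. */
  int degree[NUM_V] = {5000, 3, 1, 2, 950, 4, 1, 2};
  part_t place[NUM_V];
  for (int v = 0; v < NUM_V; v++)
    place[v] = (degree[v] >= DEGREE_THRESHOLD) ? PART_CPU : PART_GPU;
  for (int v = 0; v < NUM_V; v++)
    printf("vertex %d (degree %4d) -> %s\n", v, degree[v],
           place[v] == PART_CPU ? "CPU" : "GPU");
  return 0;
}
```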
Slide 14: BFS Performance
Workload: scale-28 Graph500 benchmark (|V|=256M, |E|=4B); 1S1G = 1 CPU socket + 1 GPU
- HIGH (high-degree vertices on the CPU): 2x performance gain!
- LOW: no gain over random!
- Exploring the 75% data point: specialization leads to superlinear speedup
Slide 15: BFS Performance: More Details
- The host is the bottleneck in all cases!
- HIGH leads to better load balancing
Slide 16: PageRank: A More Compute-Intensive Use Case
Workload: scale-28 Graph500 benchmark (|V|=256M, |E|=4B); 1S1G = 1 CPU socket + 1 GPU
- LOW: minimal gain over random!
- Better packing: 25% performance gain!
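To see why PageRank is more compute-intensive than BFS, note that every edge contributes a floating-point update in every round. Below is a minimal single-partition sketch of the textbook PageRank iteration over a tiny CSR graph; all arrays and names are hypothetical and unrelated to Totem's implementation.

```c
/* Minimal sketch of the textbook PageRank iteration on one partition.
 * Fixed iteration count for brevity; all names are hypothetical. */
#include <stdio.h>

#define NUM_V 4
#define DAMPING 0.85

/* Incoming-edge CSR: vertex v's in-neighbors are
 * in_adj[in_row[v] .. in_row[v+1]-1]. */
static const int in_row[NUM_V + 1] = {0, 1, 3, 4, 6};
static const int in_adj[]          = {1, 0, 2, 3, 1, 2};
static const int out_deg[NUM_V]    = {1, 2, 2, 1};

int main(void) {
  double rank[NUM_V], next[NUM_V];
  for (int v = 0; v < NUM_V; v++) rank[v] = 1.0 / NUM_V;
  for (int iter = 0; iter < 20; iter++) {  /* one BSP round per iteration */
    for (int v = 0; v < NUM_V; v++) {
      double sum = 0.0;
      /* Every in-edge costs a floating-point multiply-add per round. */
      for (int e = in_row[v]; e < in_row[v + 1]; e++)
        sum += rank[in_adj[e]] / out_deg[in_adj[e]];
      next[v] = (1.0 - DAMPING) / NUM_V + DAMPING * sum;
    }
    for (int v = 0; v < NUM_V; v++) rank[v] = next[v];
  }
  for (int v = 0; v < NUM_V; v++) printf("rank[%d] = %.4f\n", v, rank[v]);
  return 0;
}
```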
Slide 17: Same Analysis, Smaller Graph
Scale-25 RMAT graphs (|V|=32M, |E|=512M)
[Figures: BFS and PageRank results]
- Intelligent partitioning provides benefits
- Key for performance: load balancing
Slide 18: Worst Case: Uniform Graph (not scale-free)
BFS on a scale-25 uniform graph (|V|=32M, |E|=512M)
- Hybrid techniques are not useful for uniform graphs
Slide 19: Scalability: BFS
S = CPU socket, G = GPU
- Hybrid offers ~2x speedup vs. symmetric
- GPU memory is limited: the GPU's share is <1% of the edges!
- Comparable to the top 100 in the Graph500 challenge!
Slide 20: Scalability: PageRank
S = CPU socket, G = GPU
- Hybrid offers better gains for more compute-intensive algorithms
Slide 21: Summary
Problem: large-scale graph processing is challenging
- Heterogeneous workload: power-law degree distribution
Solution: the Totem framework
- Harnesses hybrid systems: utilizes the strengths of both CPUs and GPUs
- Powerful abstraction: simplifies implementing algorithms for hybrid systems
- Partitioning: workload-to-processor matchmaking
- Algorithms: BFS, SSSP, PageRank, centrality measures
Slide 22: Questions?
Code available at: netsyslab.ece.ubc.ca