Tanuj Kr Aasawat, Tahsin Reza, Matei Ripeanu


1 Tanuj Kr Aasawat, Tahsin Reza, Matei Ripeanu
How well do CPU, GPU and Hybrid Graph Processing Frameworks Perform? Tanuj Kr Aasawat, Tahsin Reza, Matei Ripeanu. Networked Systems Laboratory (NetSysLab), University of British Columbia. Hello everyone, I am Tanuj Kr Aasawat. I will be presenting our paper "How well do CPU, GPU and Hybrid Graph Processing Frameworks Perform?" This work was done with colleagues at UBC.

2 Just to give a little context: we are currently here at the JW Marriott, and our campus is barely an hour's journey from here by public transport.

3 A golf course … … a (nudist) beach
Networked Systems Laboratory (NetSysLab), University of British Columbia. A golf course … Our lab is here. We also have a golf course; you can come and play golf. And if you do not like golf, well, there is a nudist beach too. UBC is the only university in the world that has both a golf course and a nudist beach on campus. And you are here in the right season: it is summer, and you should enjoy it. … a (nudist) beach (… and 199 days of rain each year)

4 Graphs are Everywhere 1B users 150B friendships 100B neurons
As a motivating start: many interesting and relevant problems in society today can be modeled as graphs, and not just graphs but massively large graphs. Here are a few pictures to illustrate the point. We have online social networks (where people are represented by vertices and an edge between two vertices represents a friendship), roads that can be modeled as graphs, even larger problems like ranking pages on the web, and cargo ship networks where ports are the vertices and the costs to travel between them the edge weights. 1B users, 150B friendships, 100B neurons, 700T connections.

5 Challenges in Graph Processing
Poor locality. Data-dependent memory access patterns. Low compute-to-memory-access ratio. Large memory footprint: the Graph500 "mini" graph requires 128 GB, 2x if weighted. Varying degrees of parallelism (both intra- and inter-stage). These problems are challenging, primarily because graphs are sparse: neighbors are scattered in memory, and memory accesses are highly irregular since they are data dependent. Most graph algorithms have a very low compute-to-memory-access ratio; they are memory intensive and bound by memory latency. BFS, say, does not really process much information stored at a vertex itself, but rather spends its time finding the next vertex to visit. Graphs are large, and to process them efficiently you need to load them into memory. Parallelism also varies across the stages of an algorithm's execution. There is a lot of interest in processing graphs on a single node, and today's nodes are large enough for the medium-size workloads we see.

6 Processing Elements Characteristics

Challenge | CPUs | GPUs
Poor locality | Large caches | Caches
Data-dependent memory access patterns | — | Massive hardware multithreading
Low compute-to-memory access ratio | — | Massive hardware multithreading
Large memory footprint (Graph500 "mini" graph requires 128 GB) | >1 TB | ~16 GB
Varying degrees of parallelism (both intra- and inter-stage) | — | —

Let us see how the two processing elements, CPU and GPU, fare in the context of graph processing. CPUs have large caches and large memories. GPUs, on the other hand, address issues that CPUs do not: the low compute-to-memory-access ratio can be alleviated by the massive multithreading implemented in GPU hardware. Their drawback is that they do not have enough memory to store the massive graphs that CPUs can hold. Can we put the two together as a hybrid platform and leverage their respective strengths to process graphs efficiently? Assemble a hybrid platform?

7 Graph Processing Frameworks
Programming Model (Vertex Programming / Linear Algebra). Two essential elements make up a graph processing framework. The first is the choice of architecture: single node or distributed; I will focus on single node in this talk. For single node, we have multiple choices of processing units: CPU only, GPU only, and hybrid. The second essential element is the programming model: either vertex programming (which covers vertex-centric and edge-centric operations) or linear algebra (which covers SpMV-like operations). People use linear algebra because it makes adding applications much easier: users can write the code in three lines using existing linear algebra libraries. With vertex programming you have to program the application yourself, though it gives you the opportunity to hand-tune the code. High Performance Architecture (Single-node or Distributed; CPU/GPU/Hybrid).
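As an illustration of the linear-algebra model mentioned above (a hand-rolled sketch, not any framework's actual API), one PageRank iteration is just a sparse matrix-vector product plus a scale-and-shift, so a few library calls suffice. The tiny 3-vertex cycle below is our own toy example; real frameworks call an optimized SpMV instead:

```python
# Sparse column-stochastic adjacency matrix of a 3-vertex cycle,
# stored as (row, col, value) triples.
triples = [(0, 2, 1.0), (1, 0, 1.0), (2, 1, 1.0)]
n, d = 3, 0.85                 # vertex count, damping factor
rank = [1 / n] * n             # start from the uniform distribution

for _ in range(50):
    y = [0.0] * n
    for i, j, v in triples:    # y = A @ rank  (the SpMV step)
        y[i] += v * rank[j]
    # scale-and-shift: rank = (1 - d)/n + d * y
    rank = [(1 - d) / n + d * yi for yi in y]

# The cycle is symmetric, so every vertex converges to rank 1/3.
print([round(r, 3) for r in rank])  # [0.333, 0.333, 0.333]
```

The whole algorithm is three conceptual lines (SpMV, scale, shift), which is the ease-of-use argument for the linear-algebra model.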

8 Motivation How does the combination of architecture and programming model affect the performance and efficiency of the system as a whole? We explore this space to understand how the various combinations of the two perform.

9 Graph Processing Frameworks

Framework | Developer | Architecture | Programming Model
Galois | UT Austin | CPU | Vertex Programming
GraphMat | Intel | CPU + Distributed | Linear Algebra
Gunrock | UC Davis | Multi-GPU | Vertex Programming
Nvgraph | Nvidia | GPU | Linear Algebra
Totem | UBC | CPU + multi-GPU | Vertex Programming

We evaluate five state-of-the-art graph processing frameworks, targeting different architectures and embracing different programming models. Galois and GraphMat are CPU-based: Galois is by a group at UT Austin and targets multi-core CPUs; GraphMat is by Intel and targets both single-node and distributed CPU systems. Gunrock and Nvgraph are GPU-based: Gunrock is developed at UC Davis and supports multi-GPU processing; Nvgraph is by Nvidia and targets a single GPU. Totem is developed by our group and targets single-node hybrid systems consisting of the CPU and multiple GPUs; it supports a parallel CPU-only mode, a multi-GPU-only mode, and a hybrid mode in which it processes on both the CPU and the GPUs simultaneously. Galois follows the vertex-centric approach. Gunrock leverages both vertex-centric and edge-centric processing and alternates between them depending on the algorithm's execution state. Totem also uses the vertex programming model and follows the vertex-centric approach. GraphMat and Nvgraph, on the other hand, follow the linear-algebra-based programming model. We consider these frameworks because previous publications in this domain have shown that they are representative of their categories in terms of performance; for each of CPU and GPU, we picked one vertex programming framework and one linear-algebra-based framework.

10 Benchmark Algorithms PageRank, Single Source Shortest Paths (SSSP), Breadth-First Search (BFS)
PageRank: ranking web pages; compute intensive. SSSP: IP routing, transportation networks. BFS: finding connected components, a common subroutine; memory intensive. To compare the frameworks we use three popular graph algorithms that stress the system differently: PageRank, Single Source Shortest Paths (SSSP), and Breadth-First Search (BFS). PageRank is the well-known algorithm used by search engines for ranking web pages. SSSP is a graph traversal algorithm that finds the shortest paths between a given source vertex and all other vertices (within the same connected component). BFS is a graph traversal algorithm that determines the connected component reachable from a given source vertex as well as the level of each vertex in the resulting BFS tree; it is also used as a subroutine in many graph algorithms such as betweenness centrality and connected components. We select these algorithms because BFS and PageRank represent two ends of the spectrum: PageRank is compute intensive while BFS is memory intensive. We chose SSSP because BFS and PageRank keep data on vertices, while SSSP keeps data on edges and follows a different memory access pattern. Further, BFS and SSSP are used as benchmarks in the Graph500 challenge. Lastly, we have native implementations of all three algorithms on all five frameworks.
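For concreteness, here is a textbook sequential Dijkstra sketch of the SSSP benchmark described above; the frameworks in the talk use their own parallel variants, and the toy weighted graph is our own:

```python
import heapq

def sssp(adj, source):
    """Single-source shortest paths over a weighted digraph.
    'adj' maps a vertex to a list of (neighbor, weight) pairs."""
    dist = {source: 0}
    heap = [(0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry, a shorter path was already found
        for v, w in adj.get(u, []):
            if d + w < dist.get(v, float("inf")):
                dist[v] = d + w
                heapq.heappush(heap, (dist[v], v))
    return dist

adj = {0: [(1, 4), (2, 1)], 2: [(1, 2)], 1: [(3, 5)]}
print(dict(sorted(sssp(adj, 0).items())))  # {0: 0, 1: 3, 2: 1, 3: 8}
```

Note how the data of interest (the weights) lives on the edges, which is why SSSP exercises a different memory access pattern than BFS and PageRank.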

11 Evaluation Metrics Raw Performance, Energy Consumption, Scalability
Raw performance: Traversed Edges Per Second (TEPS) = traversed edges / execution time. Energy consumption: average power consumed * execution time. Scalability: strong scaling w.r.t. processing units. We evaluate the frameworks on the three algorithms with respect to three metrics: raw performance, energy consumption, and scalability. We measure raw performance as Traversed Edges Per Second, or TEPS, calculated by dividing the number of traversed edges by the execution time. Consistent with standard practice in the domain, "execution time" does not include time spent in pre- or post-processing steps. We report TEPS instead of execution time (though execution times are available in the paper) because TEPS normalizes the results: for small graphs the execution time is very low, but the TEPS figure is in the same ballpark as for large graphs. We measure energy consumption as the power consumed during the algorithm's execution time. We perform strong scaling to study how the frameworks respond to an increasing number of processing units; for the scalability experiments, please have a look at our paper.
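The two metric definitions above can be written down directly; the numbers in the example are made up for illustration, not measurements from the paper:

```python
def teps(traversed_edges, execution_time_s):
    """Traversed Edges Per Second. Per standard practice, the execution
    time excludes pre- and post-processing steps."""
    return traversed_edges / execution_time_s

def energy_joules(avg_power_w, execution_time_s):
    """Energy consumption = average power consumed * execution time."""
    return avg_power_w * execution_time_s

# e.g. traversing Orkut's 234M edges in a hypothetical 0.5 s at 180 W:
print(teps(234e6, 0.5) / 1e9)   # 0.468 billion TEPS
print(energy_joules(180, 0.5))  # 90.0 joules
```

Note why TEPS normalizes across workloads: halving both the edge count and the execution time leaves the TEPS figure unchanged, so small and large graphs land in the same ballpark.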

12 Testbed Characteristics

Component | System 1
CPU | 2x Intel Xeon E v3 (Haswell)
#CPU Cores | 28
Host Memory | 512 GB DDR4
L3 Cache | 70 MB
Interconnect | PCIe 3.0 x16
GPU | 2x Nvidia Tesla K40c
GPU Thread Count | 2880
GPU Memory | 12 GB

For all the experiments we use this system, which consists of a two-socket Intel Xeon CPU and two Nvidia Tesla K40c GPUs. Do note that this is a large-memory node, with 512 GB of memory available.

13 Datasets

Graph | #Vertices | #Edges | Max Degree | Avg. Degree
Real world:
Com-Orkut | 3 M | 234 M | 33,313 | 78
LiveJournal | 4.8 M | 68 M | 20,292 | 14
Road-USA | 28.8 M | 47.9 M | 9 | 1.6
Twitter | 52 M | 3.9 B | 3,691,240 | 75
Synthetic:
RMAT22 | 4 M | 128 M | 168,729 | 32
RMAT23 | 8 M | 256 M | 272,808 | 32
RMAT24 | 16 M | 512 M | 439,994 | 32
RMAT27 | — | 4 B | 3,910,241 | 32

We use diverse real-world and synthetic datasets: the top four are real-world graphs, and the bottom four are synthetic, generated with the Graph500 generator. All graphs are undirected, following the Graph500 standard, though directed versions of RMAT27 and Twitter are used for GraphMat, since it does not support 64-bit integers. All graphs except Road-USA are power-law graphs.

14 In power-law graphs, most of the vertices have very low degree while a few vertices have very high degree. (WDC, 2012)

15 Memory Consumption

Framework | Memory layout | PageRank | SSSP | BFS
Nvgraph | CSC (PageRank, SSSP) and CSR (BFS) | 1,159 (1.8x) | 1,111 (1.0x) | 683 (1.0x)
Gunrock | CSR and COO | 641 (1.0x) | 1,582 (1.4x) | 1,443 (2.1x)
Galois | CSR | 1,599 (2.5x) | 2,074 (1.9x) | 1,432 (2.1x)
GraphMat* | DCSC | 2,818 (4.4x) | 2,786 (2.5x) | 2,980 (4.4x)
Totem-2S | CSR | 1,275 (2.0x) | 2,198 (2.0x) | 1,282 (1.9x)
Totem-2S2G | CSR | 1,628 (2.5x) | 2,587 (2.3x) | 1,658 (2.4x)

Memory consumption (in MB) for the RMAT22 graph (edge list size: 512 MB). *GraphMat consumes 9,354 MB during its pre-processing step. Since these frameworks use different memory layouts for in-memory graph storage, and also leverage supplementary data structures to achieve high performance, we first list the total memory consumption for the three algorithms on the RMAT22 graph, whose edge list is 512 MB. Nvgraph stores the graph in either Compressed Sparse Column (CSC) or Compressed Sparse Row (CSR) format, depending on the algorithm. Gunrock stores the graph in CSR format for vertex-centric operations and in coordinate (COO) format for edge-centric operations. Galois and Totem use the CSR layout, while GraphMat uses the Doubly Compressed Sparse Column (DCSC) format. On average, Nvgraph requires the least memory and GraphMat the most; surprisingly, GraphMat consumes ~9 GB during its pre-processing step for this workload of only 512 MB. Totem stores both the original graph edge list and the respective CPU and GPU partitions in memory, and therefore ends up requiring memory about twice the size of the graph.

16 Experimental Results 1. Raw Performance - PageRank
Fastest: Totem-2S. Nvgraph vs GraphMat. We first evaluate the frameworks for raw performance. The x-axis shows the workload; RMAT27 and Twitter are the large graphs, while the rest fit into GPU memory. The y-axis shows billions of TEPS per PageRank iteration, the higher the better. Missing data points mean that the graph either did not fit into GPU memory or failed to run. The frameworks are listed at the top: the ones in green are GPU-based, the ones in blue (Galois and GraphMat) are CPU-based, and the ones in red are the different modes of Totem. Totem-1G means running Totem on one GPU, 2S means running on the CPU (on the two-socket machine), and 2S2G means running in hybrid mode with the two-socket CPU and two GPUs. Comparing the GPU-based frameworks first: Totem-1G is faster than both in most cases, while Nvgraph is faster for Orkut and LiveJournal. Comparing the CPU-based frameworks: Totem-2S is roughly 4x faster than both. Further, we observe that in all cases, even for smaller graphs where the overhead of coordinating a CPU and a GPU might become apparent, the hybrid solution offers a performance advantage. Totem-2S beats the GPU-based frameworks as well because it uses the large cache efficiently; the other CPU-based frameworks do not use the cache as well, otherwise I would have expected Galois and GraphMat to outperform the GPU-based frameworks. Among the linear-algebra-based systems, Nvgraph performs better than GraphMat on average. Both use the CSC format to store the graph, and each is faster than at least one of its counterparts: Nvgraph is at least faster than Gunrock, while GraphMat is at least faster than Galois.

17 Experimental Results 1. Raw Performance - SSSP
Fastest: Totem-2S. CSC is suitable for PageRank. For SSSP we see a similar trend: Totem-1G is faster than the GPU-based frameworks, Totem-2S is faster than the CPU-based frameworks, and Totem-Hybrid offers a performance advantage, except for the Road-USA graph, which has a very large diameter and is a challenging workload for SSSP. One thing to note here is that both Nvgraph and GraphMat use the CSC format, and neither performs well: we observe that for linear-algebra-based frameworks, CSC is more suitable for PageRank than for SSSP. In case you are not familiar with the CSC and CSR formats, I will describe them briefly.

18 Graph Layout in Memory: CSR Representation, CSC Representation
[Figure: an example graph with its CSR arrays (rowPtr, edgeList) and its CSC arrays (colPtr, edgeList).] Say we have this graph. In the CSR format, all the outgoing edges of a vertex are placed together in the vertex's adjacency list: vertex 1 has two outgoing edges, to vertices 2 and 3, so vertices 2 and 3 are placed together in vertex 1's adjacency list. In the CSC format, all the incoming edges are placed together instead. So for, say, PageRank, where a vertex computes its rank through its incoming edges, the CSC format places all the incoming edges together, and we get better prefetching and coalesced memory accesses.
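The two layouts can be built in a few lines. A minimal sketch, using a toy directed graph of our own rather than the one in the original figure; CSR groups each vertex's outgoing edges, and CSC is just CSR applied to the reversed edges:

```python
def compress(pairs, n):
    """Group the second element of each pair by the first.
    Returns (ptr, idx) such that idx[ptr[k]:ptr[k+1]] holds the
    neighbors of vertex k+1 (vertices are numbered 1..n)."""
    buckets = [[] for _ in range(n)]
    for a, b in pairs:
        buckets[a - 1].append(b)
    ptr, idx = [0], []
    for bucket in buckets:
        idx.extend(sorted(bucket))
        ptr.append(len(idx))
    return ptr, idx

edges = [(1, 2), (1, 3), (2, 3), (3, 4)]  # (src, dst) pairs, vertices 1..4
row_ptr, col_idx = compress(edges, 4)                       # CSR: outgoing edges
col_ptr, row_idx = compress([(d, s) for s, d in edges], 4)  # CSC: incoming edges
print(row_ptr, col_idx)  # [0, 2, 3, 4, 4] [2, 3, 3, 4]
print(col_ptr, row_idx)  # [0, 0, 1, 3, 4] [1, 1, 2, 3]
```

In the CSR output, vertex 1's out-neighbors (2 and 3) sit contiguously in `col_idx`; in the CSC output, vertex 3's in-neighbors (1 and 2) sit contiguously in `row_idx`, which is exactly the locality PageRank exploits.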

19 Experimental Results 1. Raw Performance - BFS
Fastest: Totem-2S. Nvgraph vs GraphMat. CSR suitable for BFS. Hybrid: ~2x. For BFS, all the vertex-programming-based frameworks implement Scott Beamer's direction-optimized BFS algorithm, which switches between a top-down and a bottom-up approach. Among the GPU-based frameworks, Gunrock is faster because its programming model (which switches between vertex- and edge-centric operations) maps well onto direction-optimized BFS. Among the CPU-based frameworks, Totem-2S is faster, and we observe that the hybrid mode provides a performance boost of up to 2x. Note also that Nvgraph stores the graph in CSR format for BFS, while GraphMat stores it in CSC format, and Nvgraph is faster than GraphMat in all cases.
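A simplified sequential sketch of the direction-optimized idea mentioned above: expand the frontier top-down while it is small, and switch to bottom-up (unvisited vertices look for any parent in the frontier) once it grows. The switch threshold below is a toy heuristic, not the tuned policy the frameworks use, and the graph is our own example:

```python
def bfs(adj, n, source, switch_fraction=0.1):
    """Direction-optimized BFS over an undirected graph given as an
    adjacency dict {vertex: [neighbors]}; returns {vertex: level}."""
    level = {source: 0}
    frontier = {source}
    depth = 0
    while frontier:
        depth += 1
        nxt = set()
        if len(frontier) < switch_fraction * n:
            # Top-down: scan the out-edges of each frontier vertex.
            for u in frontier:
                for v in adj[u]:
                    if v not in level:
                        level[v] = depth
                        nxt.add(v)
        else:
            # Bottom-up: each unvisited vertex checks for a frontier parent,
            # stopping at the first one found.
            for v in range(n):
                if v not in level and any(u in frontier for u in adj[v]):
                    level[v] = depth
                    nxt.add(v)
        frontier = nxt
    return level

# Undirected 5-vertex path 0-1-2-3-4:
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
print(bfs(adj, 5, 0))  # {0: 0, 1: 1, 2: 2, 3: 3, 4: 4}
```

The payoff on power-law graphs is that the bottom-up phase avoids re-scanning the huge edge lists of already-visited hub vertices when the frontier covers most of the graph.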

20 Experimental Results 2. Energy Consumption – GPU Frameworks – Orkut Workload
For energy consumption, we use the Orkut graph to measure the GPU-based frameworks. For PageRank, Nvgraph is the most energy efficient, consuming 9% less energy than Totem-1G. For SSSP, Totem-1G is the most energy efficient, consuming roughly 4x less energy than Nvgraph. For BFS, Gunrock outperforms Totem-1G by roughly 5x, because it is ~5x faster.

21 Experimental Results 2. Energy Consumption – GPU Frameworks – Orkut Workload
We further look at how the other modes of Totem perform on this small workload. We observe that both Totem-2S and the hybrid mode are "greener" than the GPU-based frameworks even on this small graph, except for BFS.

22 Experimental Results 2. Energy Consumption – CPU Frameworks – Twitter Workload
For the CPU frameworks, we use the Twitter graph. For all three algorithms, Totem outperforms both GraphMat and Galois by roughly an order of magnitude.

23 Experimental Results 2. Energy Consumption – CPU Frameworks – Twitter Workload
Energy efficient: Totem-2S. We further observe that, for all three algorithms, the hybrid mode of Totem consumes a similar amount of energy: even though the hybrid mode draws more power, the time-to-solution decreases accordingly.

24 Summary GPU + Linear Algebra | CPU + Vertex Programming = Good Match
GPU-based frameworks: no clear winner. CPU-based frameworks: Totem-2S. Totem Hybrid: greenest. CSC: PageRank. CSR: BFS, SSSP. To summarize: the linear-algebra-based programming model offers performance advantages on the GPU, while vertex programming is a better match for the CPU. Among the GPU-based frameworks there is no clear winner, though in most cases Totem-1G performs better. Among the CPU-based frameworks, Totem-2S performs best, and it also beats the GPU-only frameworks in most cases. Overall, we find that Totem Hybrid is the greenest of all. As for the memory layout, we observe that CSC is more suitable for algorithms like PageRank, while CSR is more suitable for traversal algorithms like BFS and SSSP.

25 Discussion The question now is where we should focus: on CPU-only frameworks, on GPU-only frameworks, or on a hybrid?

26 Does hybrid have future potential?
So far we have looked at small graphs that fit into GPU memory and shown that adding GPUs improves performance. We next look at the performance advantage for a graph workload an order of magnitude larger than the memory available on the GPUs. In this experiment we use the RMAT30 graph, whose edge list is 128 GB. We keep the number of processing elements constant and choose whether some of them should be GPUs: we compare our four-socket machine against the hybrid mode of Totem with two sockets and two GPUs. The left y-axis, in blue, shows execution time, while the secondary y-axis on the right, in red, shows energy. We observe that the hybrid mode is up to 3x faster than the CPU-only mode and up to 5x more energy efficient. This is an argument in favor of our hypothesis that a hybrid platform can work better for scale-free graphs. How? Totem-4S vs Totem-2S2G for RMAT30 (edge list size: 128 GB). 4S machine: 4x Intel Xeon E v2 (Ivy Bridge), with 1,536 GB memory.

27 Hybrid Graph Processing

Challenge | CPUs (high-degree vertices) | GPUs (low-degree vertices)
Poor locality | Large caches + summary data structures | Caches + summary data structures
Data-dependent memory access patterns | — | Massive hardware multithreading
Low compute-to-memory access ratio | — | Massive hardware multithreading
Large memory footprint | >1 TB | 16 GB
Varying degrees of parallelism (both intra- and inter-stage) | — | —

We can place the high-degree vertices on the CPU and the low-degree vertices on the GPU. On the CPU we benefit from the large cache and from summary data structures such as the bitmap used in BFS; on the GPU we benefit from the massive multithreading. And since the vertices on the GPU have low degree, they can also make good use of the GPU's small cache.
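The partitioning idea above can be sketched as a simple degree threshold. This is illustrative only: the threshold and the degree list below are made up, and Totem's actual partitioning strategy is more involved than a single cutoff:

```python
def partition_by_degree(degrees, threshold):
    """Assign each vertex id to the CPU partition if its degree is at or
    above the threshold, otherwise to the GPU partition."""
    cpu, gpu = [], []
    for v, d in enumerate(degrees):
        (cpu if d >= threshold else gpu).append(v)
    return cpu, gpu

degrees = [500, 2, 3, 1200, 7, 1]  # degree of each vertex in a toy graph
cpu, gpu = partition_by_degree(degrees, threshold=100)
print(cpu)  # [0, 3]  -> the few high-degree hubs go to the CPU
print(gpu)  # [1, 2, 4, 5]  -> the many low-degree vertices go to the GPU
```

On a power-law graph, a small CPU partition of hubs can cover a large fraction of all edges, which is what makes the small GPU memory sufficient for the remaining low-degree vertices.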

28 code@: netsyslab.ece.ubc.ca
Questions? Our code is available on our lab's web page: netsyslab.ece.ubc.ca. And now, your questions please.

