
Scale-Free Graph Processing on a NUMA Machine


1 Scale-Free Graph Processing on a NUMA Machine
Hello everyone. I am going to give a presentation on processing scale-free graphs on a NUMA machine. We explore the distributed shared-memory nature of NUMA systems to process graphs with more than 100 billion edges. Tanuj Kr Aasawat*, Tahsin Reza, Matei Ripeanu, Networked Systems Lab, The University of British Columbia. *Now at RIKEN AIP, Tokyo.

2 Graphs are Everywhere 3.5B pages 128B hyperlinks 0.7B users
As a motivating start: many interesting and relevant problems in society can be modeled as graphs, and not just graphs but massively large graphs. For example, in social network graphs we represent people as vertices and the relationships between them as edges. The web has about 3.5B pages and 128B hyperlinks; a social network like Facebook has 0.7B users and 137B friendships; the human brain has 100B neurons and 700T connections; petabytes of data. In all these high-impact applications, to get meaningful insights from this huge amount of data, these massively large graphs need to be processed fast and efficiently.

3 Challenges in Graph Processing
Highly irregular, data-dependent memory access pattern: neighbors are scattered in memory, which leads to poor data locality. Low compute-to-memory access ratio: memory bound. Large memory footprint: real-world graphs such as the current Facebook graph (137 billion edges) and the WDC web graph (128 billion edges) need at least 2 TB of memory, and the entire graph must be kept in memory for high performance. Hard to obtain balanced partitions, which are needed for good load balance and overall performance, because of the power-law degree distribution: a few vertices have very high degree, and most vertices have very low degree. These problems are challenging primarily because memory accesses are highly irregular and data dependent, since neighbors are scattered in memory; this leads to poor data locality. Most graph algorithms have a very low compute-to-memory access ratio: they mostly explore the neighbors of a vertex rather than processing information stored on the vertex itself. Graphs are large: the Facebook and WDC graphs have more than 100 billion edges, and to process them efficiently you need to keep the entire graph in memory. If we process them on a distributed system, it is hard to obtain balanced partitions (for good load balance and overall performance) because these graphs follow a power-law degree distribution, where a few vertices have very high degree and most vertices have very low degree.
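To make the access pattern concrete, here is a minimal C++ sketch (not the framework's code) of the Compressed Sparse Row (CSR) layout that later slides refer to, with a PageRank-style pass whose per-edge reads jump to scattered neighbor indices; all type and function names are illustrative.

    #include <cstdint>
    #include <vector>

    // Minimal CSR layout: the neighbors of vertex v occupy
    // columns[row_offsets[v] .. row_offsets[v+1]).
    struct CSRGraph {
      std::vector<uint64_t> row_offsets;  // |V|+1 offsets into columns
      std::vector<uint32_t> columns;      // |E| neighbor ids
    };

    // One PageRank-style pass: the read of rank[neighbor] lands on an arbitrary
    // index for every edge, i.e., the irregular, data-dependent access pattern
    // that dominates the runtime of most graph kernels.
    void gather_pass(const CSRGraph& g, const std::vector<double>& rank,
                     std::vector<double>& sum) {
      const size_t num_vertices = g.row_offsets.size() - 1;
      for (size_t v = 0; v < num_vertices; ++v)
        for (uint64_t e = g.row_offsets[v]; e < g.row_offsets[v + 1]; ++e)
          sum[v] += rank[g.columns[e]];  // neighbor state is scattered in memory
    }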

4 Sequential vs Random access: 6x to 9x
Hardware platform: Intel Xeon, 4 sockets, 1.5 TB memory. Slide labels: large shared-nothing cluster (requires explicit graph partitioning and explicit communication); shared-memory SMP machine (easy to program; doesn't require partitioning and communication; usually programmed as SMP); shared-memory NUMA machine (suffers a communication penalty, same as distributed memory); example frameworks: Polymer, Totem, Ligra. Can a distributed graph processing approach provide better performance on NUMA? Sequential vs random access: 6x to 9x. Usually, to process such huge graphs, frameworks like Google's Pregel have been developed that target shared-nothing clusters, since they offer large aggregate memory. The graph is partitioned among the processing units and, since clusters are not cache coherent, the partitions must communicate explicitly through message passing. This is in contrast with graph processing frameworks that target shared-memory systems and treat the shared-memory system as if it were based on an SMP architecture; in an SMP architecture the access time to any location in memory is uniform, so no partitioning is required. Modern shared-memory systems are embracing the NUMA architecture, which has proven more scalable in terms of memory and processing power. NUMA machines introduce a dilemma: on one side they provide shared memory, so graph processing frameworks developed for shared-memory systems can be used directly; on the other side, the cost of memory accesses is non-uniform. For example, the NUMA machine we used has 4 sockets; we observe that local throughput is up to 40% higher than remote throughput, and local latency is 52% lower than remote latency. We made one more interesting observation: sequential access is 6x to 9x faster than random access. This matters for graph processing, since we do a lot of random accesses. Given that NUMA has differential costs for accessing memory, it resembles a distributed system. So the question is: can a distributed processing approach provide better performance on NUMA as well?

5 Hypothesis: A distributed-memory like middleware provides better performance on NUMA Machines
Explicit graph partitioning and inter-partition communication. Advantages: controlled data placement, better locality, communication trade-offs. Disadvantages: memory overhead, communication overhead. We postulate the hypothesis that a distributed-memory like middleware provides better performance on NUMA machines. The intuition is that, with explicit partitioning, we get control over data placement, which helps in designing and experimenting with different partitioning strategies to achieve better load balance and overall performance on a shared-memory system. Further, explicit partitioning and explicit communication provide better locality, since all accesses can be served from local memory. And finally, since NUMA is a shared-memory system, we can explore different trade-offs to minimize communication overhead. All these advantages come at some cost: since we have different partitions, there is the memory overhead of storing buffers for remote partitions, and, unlike shared-memory frameworks, we have the overhead of inter-partition communication.

6 Bulk Synchronous Parallel (BSP) Processing Model
Preprocessing step: graph partitioning. Processing step: a sequence of supersteps, each consisting of a computation phase, a communication phase, and a synchronization phase (barrier synchronization). Most distributed graph processing frameworks follow the BSP processing model to process the graph in parallel on different processing units. It presumes the data is partitioned among the processing units. The processing consists of a sequence of rounds, or supersteps, and each superstep consists of three phases executed in order. In the computation phase, processing units execute on their respective partitions independently. In the communication phase, they exchange remote updates and apply them. The synchronization phase confirms message delivery. This process continues until a convergence or termination criterion is fulfilled.
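As a sketch of this control flow, here is a minimal BSP driver loop in C++ (illustrative only, not the framework's API); Partition stands in for a per-NUMA-node subgraph plus its state, and the two callbacks are the computation and communication phases.

    #include <cstddef>
    #include <functional>
    #include <vector>

    struct Partition { bool active = true; };  // per-node subgraph + vertex state would live here

    void bsp_run(std::vector<Partition>& parts,
                 const std::function<void(Partition&)>& compute,     // computation phase
                 const std::function<void(Partition&)>& exchange) {  // communication phase
      bool any_active = true;
      while (any_active) {                       // one iteration == one superstep
        #pragma omp parallel for                 // each partition computes on its local data
        for (std::size_t p = 0; p < parts.size(); ++p) compute(parts[p]);
                                                 // implicit barrier ends the phase
        #pragma omp parallel for                 // exchange and apply remote updates
        for (std::size_t p = 0; p < parts.size(); ++p) exchange(parts[p]);
        any_active = false;                      // synchronization: global termination check
        for (const Partition& p : parts) any_active = any_active || p.active;
      }
    }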

7 Benefits of BSP in Graph Processing on NUMA
Benefit 1: Explicit partitioning. Easy design of and experimentation with partitioning strategies; better load balance and overall performance; we explore two popular strategies and introduce a new one. Benefit 2: Choice of communication designs. Explicit communication embraces the distributed-systems design; communication trade-offs embrace the shared-memory nature of NUMA. On a NUMA machine, this approach provides two key benefits. The first is partitioning the graph explicitly, which enables easy implementation of and experimentation with partitioning strategies to improve load balance and overall performance; we explore two partitioning strategies popular in distributed systems and then introduce a new one. The second benefit is that we can explore different communication designs for inter-partition communication: in the first design we fully embrace the communication design of shared-nothing systems, and then we look at two other designs made possible by the shared-memory nature of NUMA systems.

8 Benefits of BSP in Graph Processing on NUMA
Benefit 1: Explicit partitioning. Easy design of and experimentation with partitioning strategies; better load balance and overall performance; we explore two popular strategies and introduce a new one. Benefit 2: Choice of communication designs. Explicit communication embraces the distributed-systems design; communication trade-offs embrace the shared-memory nature of NUMA. Roadmap: Benefit 1 (Explicit Partitioning), Benefit 2 (Communication Designs), Comparison against Related Work, Conclusion. For the rest of this presentation we follow this roadmap: we first describe and evaluate the benefits of explicit partitioning; next we describe our three communication designs and evaluate them against a state-of-the-art NUMA-oblivious framework; after that we compare our framework against a state-of-the-art NUMA-aware framework; and finally we conclude.

9 Benefit 1: Explicit Partitioning
Benefit 1: Explicit Partitioning. Makes designing and experimenting with partitioning strategies easier. Popular partitioning strategies: random partitioning (Google's Pregel, GraphLab) and sorted/degree-aware partitioning (Totem and others). New hybrid partitioning strategy: hybrid = partition randomly + sort each partition by degree. Success criteria: better load balance and overall performance improvement. With support for explicit partitioning, one can experiment with different partitioning schemes. We implement two popular strategies. Random partitioning is used by large-scale distributed systems like Google's Pregel: vertices are assigned randomly to the processing units, which leads to a uniform distribution of vertices. On the other hand, Totem (developed by our group) and other frameworks use sorted partitioning: the vertex list is first sorted by degree, and then chunks with an equal number of edges are assigned to the respective partitions. It has been shown that this strategy improves data locality. Based on the lessons from these two strategies, we developed our hybrid partitioning strategy: we first assign vertices randomly to the partitions, leading to a uniform distribution of vertices, and then sort the vertex list of each partition by degree, which improves data locality. The success criterion is that a partitioning strategy should provide better load balance and, more importantly, better overall performance.
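A minimal C++ sketch of the hybrid strategy (function and variable names are mine, not the framework's): random assignment first, then a per-partition sort by degree.

    #include <algorithm>
    #include <cstdint>
    #include <random>
    #include <vector>

    // degree[v] is the degree of vertex v; returns, per partition, the list of
    // vertex ids assigned to it.
    std::vector<std::vector<uint32_t>> hybrid_partition(
        const std::vector<uint32_t>& degree, int num_partitions, uint64_t seed) {
      std::vector<std::vector<uint32_t>> parts(num_partitions);
      std::mt19937_64 rng(seed);
      std::uniform_int_distribution<int> pick(0, num_partitions - 1);
      // Step 1: random assignment (Pregel/GraphLab-style), which balances the
      // number of vertices, and in expectation the edges, across partitions.
      for (uint32_t v = 0; v < degree.size(); ++v) parts[pick(rng)].push_back(v);
      // Step 2: sort each partition's vertex list by degree (as in sorted /
      // degree-aware partitioning) to improve locality within the partition.
      for (auto& p : parts)
        std::sort(p.begin(), p.end(),
                  [&](uint32_t a, uint32_t b) { return degree[a] > degree[b]; });
      return parts;
    }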

10 Impact of Partitioning: PageRank
Impact of Partitioning: PageRank. Load imbalance: 3%. Now we look at how these strategies perform for the PageRank algorithm. We use the RMAT31 graph, a synthetic graph generated following the Graph500 standard, with characteristics similar to real-world graphs. The x-axis shows supersteps; since the workload is fixed, we show only three supersteps. The random strategy leads to a load imbalance of only 1.03x (i.e., the slowest partition is only 3% slower than the fastest partition). Workload: RMAT-31 (64 billion undirected edges, 512 GB)

11 Impact of Partitioning: PageRank
Impact of Partitioning: PageRank. Load imbalance: 46%. For the sorted strategy, we observed a load imbalance of 46%. Workload: RMAT-31 (64 billion undirected edges, 512 GB)

12 Impact of Partitioning: PageRank
Impact of Partitioning: PageRank. Load imbalance: 5%. Our hybrid strategy leads to a load imbalance of only 5%. Workload: RMAT-31 (64 billion undirected edges, 512 GB)

13 Impact of Partitioning: PageRank
Impact of Partitioning: PageRank. Performance gain: 2x and 1.18x. The hybrid strategy improves performance by 2x compared to the random strategy and by 18% compared to the sorted one. With the sorted strategy, data locality increases performance, but since the first partition gets the dense subgraph and the last partition gets the sparse one, load imbalance increases. Workload: RMAT-31 (64 billion undirected edges, 512 GB)

14 Impact of Partitioning: BFS-DO
Impact of Partitioning: BFS-DO. Performance gain: 1.55x and 5.3x. Now we look at how these strategies perform for direction-optimized BFS, which has a dynamic workload per iteration; since the workload is dynamic, it needs more supersteps to converge. The hybrid strategy provides a performance gain of 55% compared to the random strategy and of more than 5x compared to the sorted one. The sorted strategy leads to a load imbalance of 10x: since the first partition is the densest, its frontier builds up quickly, and the first partition ends up spending most of its time processing the large frontier. Workload: RMAT-31 (64 billion undirected edges, 512 GB)

15 Benefit 1 (Explicit Partitioning)
Benefit 1: Summary. Better load balance (and balanced partitions) does not mean better performance (random vs sorted strategy). The hybrid strategy improves both load balance and performance. Our infrastructure enables easy implementation of and experimentation with different partitioning strategies. The hybrid strategy strikes the right balance between load balance and performance, and hence provides the best overall gain. The infrastructure is flexible enough that users can implement different partitioning schemes. From now on, all algorithms use this best partitioning strategy, and I will focus on how to do communication.

16 Benefit 2: Communication Designs
Benefit 2: Communication Designs. Explicit communication (embracing the distributed-systems design): NUMA 2-Box allocates a message box on both the source and the destination. Communication trade-offs (embracing the shared-memory nature of NUMA): NUMA 1-Box allocates a message box on one partition and passes a pointer to the box to the other partition; NUMA 0-Box gets rid of the entire communication infrastructure, since NUMA is a shared-memory system. As described earlier, on a NUMA machine there is variability between local and remote accesses, between sequential and random accesses, and in the number of edges explored by different algorithms. So we have the opportunity to explore different communication designs made possible by the distributed shared-memory nature of NUMA. We explore three designs, and this is just the roadmap. In the first we fully embrace the philosophy of a distributed system and allocate two message boxes, one at the source partition and the other at the destination partition. Since NUMA is a shared-memory system, we also have the option to allocate only one message box, on one of the partitions, and then pass a pointer. Finally, in the third design we get rid of the communication infrastructure entirely and do not use any message box.

17 Explicit Communication – NUMA 2-Box
Explicit Communication – NUMA 2-Box. Computation: the kernel manipulates local state. Assuming NUMA is a shared-nothing distributed system, the design maintains two message boxes (NUMA 2-Box). BSP supersteps: computation, communication, synchronization. Zero remote memory accesses; message reduction. In the NUMA 2-Box design we treat NUMA as a shared-nothing system. Each of the two partitions (PID-0, PID-1) holds V and E, the arrays that represent the subgraph in Compressed Sparse Row format, and S, the local state buffer that stores the state of the local vertices (such as the vertices' ranks for PageRank). The outbox is the message box for remote vertices, and the inbox is the message box for local vertices that are remote in the other partition. In the computation phase of BSP, the kernel manipulates the local state of local vertices, and updates to remote vertices are aggregated locally. Since a vertex can be associated with multiple edges, aggregating locally means sending only one message per remote vertex, regardless of the number of edges associated with it. In the communication phase, each partition transfers its outbox to the respective remote inbox (Comm1), and the remote updates are then merged with the local state (Comm2). Synchronization ensures that messages are delivered and applied before the next superstep starts.
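Here is a small C++ sketch of the 2-Box message flow for a PageRank-like kernel, under the assumption that each partition keeps one outbox and one inbox per remote partition, sized to the boundary vertices it shares with it; the types and field names are illustrative, not Totem's actual structures.

    #include <cstdint>
    #include <cstring>
    #include <vector>

    struct MsgBox { std::vector<double> values; };    // one slot per boundary vertex

    struct Partition2Box {
      std::vector<double> state;                      // local vertex state (e.g., rank sums)
      std::vector<MsgBox> outbox;                     // indexed by remote partition id
      std::vector<MsgBox> inbox;                      // indexed by source partition id
      std::vector<std::vector<uint32_t>> inbox_map;   // inbox slot -> local vertex id
    };

    // Computation phase: a remote update is aggregated locally in the outbox, so
    // only one message per boundary vertex is sent, regardless of its edge count.
    inline void push_remote(Partition2Box& p, int remote_pid, uint32_t slot, double delta) {
      p.outbox[remote_pid].values[slot] += delta;     // local write, no remote access
    }

    // Communication phase. Comm1: bulk-copy every outbox to the matching remote
    // inbox (sequential remote writes). Comm2: merge each inbox into local state.
    void communicate(std::vector<Partition2Box>& parts) {
      for (size_t src = 0; src < parts.size(); ++src)
        for (size_t dst = 0; dst < parts.size(); ++dst)
          if (src != dst)
            std::memcpy(parts[dst].inbox[src].values.data(),
                        parts[src].outbox[dst].values.data(),
                        parts[src].outbox[dst].values.size() * sizeof(double));
      for (auto& p : parts)
        for (size_t src = 0; src < p.inbox.size(); ++src)
          for (size_t slot = 0; slot < p.inbox[src].values.size(); ++slot)
            p.state[p.inbox_map[src][slot]] += p.inbox[src].values[slot];
    }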

18 Explicit Communication – NUMA 2-Box: Evaluation
Explicit Communication – NUMA 2-Box: Evaluation. 1.63x and 2.07x. We compare against Totem, a NUMA-oblivious framework developed by our group. numactl is a runtime command on Linux that allocates memory pages round-robin across the NUMA nodes, as opposed to the default first-touch policy. The y-axis shows the execution time per PageRank iteration, so lower is better. NUMA 2-Box performs more than 2x faster than NUMA-oblivious Totem, and since PageRank has a high compute-to-memory-access ratio, communication time is only 3% of the total execution time. For BFS we observed a performance gain of 63%; since BFS has a low compute-to-memory-access ratio, communication takes around 26% of the execution time. Workload: RMAT-31 (64 billion undirected edges, 512 GB)

19 Communication Trade-Off: NUMA 0-Box
Communication Trade-Off: NUMA 0-Box. Computation: the kernel manipulates local state. NUMA is a "distributed" shared-memory system: "distributed" means explicit partitioning; "shared-memory" means implicit communication. No communication infrastructure (NUMA 0-Box design); in the BSP supersteps, computation is overlapped with communication, followed by synchronization. Useful for applications that communicate via selected edges only. From our experiments as well as our analytical model, we observed that NUMA 1-Box does not perform better, so lastly we have a hybrid design that leverages the distributed shared-memory nature of NUMA. Here we do explicit partitioning, as if NUMA were a distributed system, but we access the state buffers as if we were in a shared-memory system, so no communication infrastructure is needed, and in the BSP supersteps we overlap computation with communication. Whenever the kernel sees a local vertex, it updates the local state buffer; if it is a remote vertex, it updates the local state buffer of the partition the remote vertex belongs to (Computation: kernel updates remote state). Atomic operations are used for the updates to ensure correctness. This design is useful for applications that communicate via selected edges only, like traversal-based algorithms such as BFS.
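A minimal C++ sketch of the 0-Box update path for a BFS-like kernel (names are illustrative): there is no message box, and an update that crosses partitions writes directly into the owning partition's state with an atomic operation.

    #include <atomic>
    #include <cstdint>
    #include <vector>

    struct Partition0Box {
      std::vector<std::atomic<int32_t>> level;  // one entry per local vertex, -1 = unvisited
    };

    // Try to set the BFS level of vertex `local_id` owned by partition `owner`.
    // Returns true if this thread discovered the vertex in this superstep.
    inline bool visit(std::vector<Partition0Box>& parts, int owner,
                      uint32_t local_id, int32_t new_level) {
      int32_t expected = -1;
      // For a local vertex this is a local atomic; across a boundary edge it is a
      // remote random access, which is why 0-Box pays off only when few edges
      // actually carry updates (traversal-style algorithms).
      return parts[owner].level[local_id].compare_exchange_strong(expected, new_level);
    }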

20 NUMA 0-Box: Evaluation 1.63x 2.07x
NUMA 0-Box: Evaluation. 1.63x, 2.07x. This design performs better than the others for BFS and other traversal-based algorithms, which communicate via selected edges only. But for algorithms like PageRank, where there is a message over each boundary edge, its performance degrades, as it ends up doing a remote access for each boundary edge. Workload: RMAT-31 (64 billion undirected edges, 512 GB)

21 Evaluation on Real-World Graphs
Evaluation on Real-World Graphs. 1.29x and 1.69x. We also evaluate our designs on real-world graphs. Twitter is an online social network, where most of the algorithms used are for community detection, clustering, and connected components, and BFS is used as a subroutine in all of them; so we evaluate the designs on BFS-TopDown, the classic level-synchronous BFS algorithm, and here as well we get better performance. The other real-world graph we consider is clueWeb, a web graph; PageRank is the popular algorithm used on web graphs to rank pages, and we observed a 69% speedup. Twitter (|V| = 51 million, |E| = 3.9 billion, 16 GB); clueWeb12 (|V| = 978 million, |E| = 74 billion, 286 GB)

22 Benefit 1 (Explicit Partitioning)
Strong Scaling. 3.7x, 2.9x, 2.7x, 2.8x. Finally, we show strong-scaling experiments for our designs: we compare the performance on 4 sockets against 1 socket for all the algorithms we have used, and we achieve scalability of up to 3.7x on our 4-socket machine. Workload: RMAT-30 (32B undirected edges, 256 GB); RMAT-29 (weighted, 16B undirected edges, 192 GB)

23 Evaluation against Polymer
Evaluation against Polymer: performance and memory consumption for three algorithms, on synthetic and real-world graphs. Polymer crashed ("Segmentation fault (core dumped)") on the larger workloads.

Algorithm | Workload | Polymer time (sec) | NUMA-xB time (sec) | Speedup | Polymer memory (GB) | NUMA-xB memory (GB) | Memory efficiency
PageRank | RMAT30 | 26.2 | 8.29 | 3.16x | 1401 | 330 | 4.24x
PageRank | RMAT31 | - | 21.40 | - | - | 674 | -
PageRank | Twitter | 1.53 | 0.75 | 2.04x | 144 | 21.8 | 6.61x
BFS | RMAT30 | 34.63 | 4.94 | 7.01x | 1302 | 322 | 4.04x
BFS | RMAT31 | - | 12.63 | - | - | 653 | -
BFS | Twitter | 13.1 | 0.94 | 13.9x | 93 | 19.2 | 4.84x
SSSP | RMAT29 | 24.95 | 9.56 | 2.61x | 886 | 232 | 3.82x
SSSP | RMAT30 | - | 21.79 | - | - | 452 | -
SSSP | Twitter | 5.3 | 8.4 | 0.63x | 115 | 41 | 2.8x

We evaluate against Polymer, the only NUMA-aware framework we are aware of, comparing performance as well as memory consumption, for three different algorithms, on both synthetic and real-world graphs. For PageRank we are up to 3x faster. Polymer did not run for RMAT31 onwards, as it ran out of memory. *We were able to run up to scale 32 as well as the clueWeb graph, but Polymer could not.

24 Conclusion Distributed-memory like middleware provides
Conclusion. A distributed-memory like middleware provides: the opportunity to design partitioning schemes on a single node, giving better load balance (improvement of up to 9x) and better overall performance (up to 5.3x); the opportunity to explore communication strategies, for which we introduced a hybrid design between distributed and SMP systems; better scalability (up to 3.7x on a 4-socket machine); and high performance (up to 2.25x faster than a NUMA-oblivious framework and 13.9x faster than a NUMA-aware framework). Graph500: SSSP: World Rank 2 (ISC, June 2018), World Rank 3 (SC, 2017); BFS: among the top 3 single nodes. So, to conclude, we have shown that a distributed-memory like middleware on a NUMA machine delivers these benefits.

25 Thank You

26 Graph Partitioning Strategies
Graph Partitioning Strategies [slide 10]. In random partitioning, vertices are randomly assigned to the processing units, which leads to partitions that are balanced with respect to the number of vertices and edges. In sorted partitioning, the vertex list is first sorted by degree, and then chunks with an equal number of edges are assigned to the respective partitions; this leads to one partition getting the dense subgraph and another getting the sparse subgraph, but it has been shown that this strategy improves data locality. Workload: RMAT-31 (64 billion undirected edges, 512 GB)

27 Conclusion
Publications: Tanuj Kr Aasawat, Tahsin Reza, Matei Ripeanu, "Scale-Free Graph Processing on a NUMA Machine," IEEE Workshop on Irregular Applications: Architectures and Algorithms (IA3), co-located with SC, November 2018. Tanuj Kr Aasawat, Tahsin Reza, Matei Ripeanu, "How well do CPU, GPU and Hybrid Graph Processing Frameworks Perform?," IEEE Workshop on High-Performance Big Data, Deep Learning, and Cloud Computing (HPBDC), co-located with IPDPS, Vancouver, May 2018. Graph500 submission: BFS: among the top 3 single nodes (submissions up to RMAT32: 128 billion undirected edges, edge list size 1 TB); SSSP: World Rank 2 (ISC, June 2018), World Rank 3 (SC, November 2017). Frameworks compared in the chart: Polymer (SJTU, China), Ligra (CMU), X-Stream (EPFL), Totem (UBC), GraphMat (Intel), Gunrock (UC Davis), Nvgraph (Nvidia), and a framework from UT Austin; up to 13.79x faster than Polymer and 2.25x faster than Totem.

28 High-level view of System Implementation
High-level view of the system implementation. BSP Engine. User inputs: the graph, the graph_kernel to run, and the partitioning and communication strategy. One parent thread is launched on each NUMA domain to initiate the superstep for the local partition; it processes graph_kernel(pid) using child threads within the NUMA domain, and this continues until the global finish flag is set. Graph partitioning: Random, Degree-aware, Hybrid; libnuma is used for NUMA allocation. Communication: NUMA 2-Box, NUMA 1-Box, NUMA 0-Box; barrier synchronization between supersteps. The user provides the graph/workload, the graph kernel they want to run, and the partitioning and communication strategy. We have implemented three graph partitioning strategies, and users can extend them further. Partitions are allocated on their respective NUMA nodes using the libnuma library. Finally, in the BSP engine, one parent thread is launched on each NUMA node to initiate the superstep on the local partition; each parent thread then spawns child threads to process the graph_kernel. In the communication phase there are three options to select from. This process continues until convergence.
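A minimal sketch (using libnuma and OpenMP; the buffer layout and sizes are illustrative, not the framework's actual structures) of how each partition's arrays can be pinned to its NUMA node and driven by one parent thread per node.

    #include <numa.h>      // link with -lnuma
    #include <omp.h>
    #include <cstdio>
    #include <vector>

    int main() {
      if (numa_available() < 0) { std::fprintf(stderr, "no NUMA support\n"); return 1; }
      const int nodes = numa_num_configured_nodes();
      const size_t vertices_per_part = size_t(1) << 20;

      // Allocate each partition's state buffer on its own node, instead of
      // relying on first-touch or numactl's round-robin interleaving.
      std::vector<double*> state(nodes);
      for (int n = 0; n < nodes; ++n)
        state[n] = static_cast<double*>(
            numa_alloc_onnode(vertices_per_part * sizeof(double), n));

      #pragma omp parallel num_threads(nodes)   // one parent thread per NUMA node
      {
        const int n = omp_get_thread_num();
        numa_run_on_node(n);                    // keep this thread on node n
        for (size_t v = 0; v < vertices_per_part; ++v) state[n][v] = 0.0;  // superstep work
      }

      for (int n = 0; n < nodes; ++n) numa_free(state[n], vertices_per_part * sizeof(double));
      return 0;
    }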

29 Challenges in Partitioning
Why not a specific partitioning algorithm like min-cut? It is an NP-complete problem, and the partitioning algorithm would take more time than the benchmark execution; partitioning is complex, and we are looking for simple partitioning strategies. Our infrastructure is flexible enough that one can plug in any partitioning scheme; we picked one that is simple and works very well. These simple techniques assume that no natural communities/clusters emerge in these graphs, and we are skeptical that natural clusters exist in many of them. Some real-world graphs are highly skewed and need more supersteps to converge. A possible solution is the edge-balanced, vertex-cut approach adopted by GraphLab and several other distributed frameworks; support for vertex-cut partitioning, however, would require additional changes to our core processing engine and is a candidate for future work.

30 Variability in Communication Designs
The variability comes from the very different memory access patterns of the 2-Box and 0-Box designs. NUMA 2-Box gains from local accesses; its remote accesses are sequential and cost very little (data is copied in bulk, followed by local random accesses). In NUMA 0-Box, all remote accesses are random, one per remote edge. Depending on the algorithm we have different volumes of data: some algorithms update all the neighbors and some only a subset, which explains why some algorithms do better with one design and some with the other. A bulk copy is cheaper than a set of individual remote accesses: we read sequentially from remote memory but apply the updates with local random writes; if both the read and the write were sequential it would be fast, but a sequential read followed by random writes is limited by the random writes.

31 Testbed Characteristics
System: CPU: 4x Intel Xeon E v2 (IvyBridge); #Cores: 60; Memory: 1536 GB DDR3; LLCache: 120 MB

Throughput (MB/s):
Access | Local | Remote
Read, Sequential | 2,464 | 2,069
Read, Random | 286 | 226
Write, Sequential | 1,438 | 1,024
Write, Random | 238 | 188

Latency (ns): Local 119, Remote 178

Local vs remote: local is up to 40% faster (throughput) and 52% faster (latency). Sequential vs random: 6x to 9x faster. Access pattern: local random is about 7x slower than remote sequential. For all the experiments we use this system, a 4-socket Intel Xeon machine; note that this is a large-memory node, with 1.5 TB of memory available. We also show the architecture of the NUMA machine, along with the memory latency (lower is better), measured using Intel's Memory Latency Checker v3.5.
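For illustration, here is a rough sketch (my own, not the benchmark actually used; the slide's numbers come from Intel's Memory Latency Checker and a proper streaming benchmark) of how local vs. remote sequential read throughput can be compared with libnuma.

    #include <numa.h>      // link with -lnuma
    #include <chrono>
    #include <cstdio>
    #include <cstring>

    int main() {
      if (numa_available() < 0 || numa_max_node() < 1) return 1;
      const size_t bytes = size_t(1) << 30;                         // 1 GiB buffer
      char* buf = static_cast<char*>(numa_alloc_onnode(bytes, 0));  // memory lives on node 0
      std::memset(buf, 1, bytes);
      const int reader_nodes[2] = {0, 1};                           // local reader, then remote reader
      for (int node : reader_nodes) {
        numa_run_on_node(node);
        volatile unsigned long sum = 0;
        auto t0 = std::chrono::steady_clock::now();
        for (size_t i = 0; i < bytes; i += 64) sum += buf[i];       // touch each cache line
        double s = std::chrono::duration<double>(std::chrono::steady_clock::now() - t0).count();
        std::printf("reader on node %d, memory on node 0: %.0f MB/s\n", node, (bytes / 1e6) / s);
      }
      numa_free(buf, bytes);
      return 0;
    }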


33 Latest Intel Xeon
System: CPU: 8x Intel Xeon E7-8800 v4; #Cores: 192; Memory: 24 TB DDR4; LLCache: 480 MB

34 Remote vertices vs Remote updates, in BFS-DO
22x difference. Here we show the number of remote vertices vs the number of remote updates in each superstep of BFS. Since remote vertices are determined during partitioning, they are fixed for the entire execution, but the number of remote updates varies in every superstep and is up to 22x smaller than the total number of remote vertices.

35 Applications and Workloads
Applications: PageRank (high computational intensity, stable workload per superstep); BFS - Top Down (memory bound, suffers from high write traffic); BFS - Direction Optimized (BFS-DO) (requires hand-tuning the switching parameters); BFS - Graph500; SSSP (requires a distance vector); SSSP - Graph500 (requires a distance vector and the SSSP tree).

Dataset | #Vertices | #Edges | Size (GB)
RMAT28 | 256 M | 8 B | 64
RMAT29 | 512 M | 16 B | 128
RMAT30 | 1 B | 32 B | 256
RMAT31 | 2 B | 64 B | 512
RMAT32 | 4 B | 128 B | 1024
Twitter | 51 M | 3.9 B | 15
clueWeb12 | 978 M | 74 B | 286

1. These are building blocks of other, more complex algorithms. 2. They are good representatives for studying the performance of any platform on irregular data structures. 3. There are two types of accesses on irregular data structures: one where you always visit all the neighbors (PageRank), and one where you visit selected neighbors only (traversal-based algorithms). 4. We use these applications since they have been widely studied in the context of high-performance graph processing systems and have been used in past studies [6]-[9]. 5. BFS and SSSP are also used as benchmarks in the Graph500 competition [14], to rank supercomputers for data-intensive applications.

36 Explicit Communication – NUMA 2-Box
Computation: the kernel manipulates local state. Assuming NUMA is a shared-nothing distributed system; maintains two message boxes (NUMA 2-Box design). BSP supersteps: computation, communication, synchronization (PID-0, PID-1). In this design we fully embrace the philosophy of a distributed system. We call explicit communication the 2-Box design, since we allocate two buffers, one at the source partition and the other at the destination partition, and we follow the BSP computation model, passing the buffers around. If we consider PageRank, each partition computes the ranks of its vertices (explain the PageRank equation). Coming to the processing step, we have come up with three designs to handle communication on the NUMA architecture. In the computation phase, processing units process their partitions independently and store the updates for remote vertices in their respective local buffers. In the communication phase, these outboxes are transferred to the respective remote partition's inboxes, where they are applied locally to the local state buffers. (In Totem, this bulk communication helps use the PCIe bus efficiently, and we merge the communication and synchronization phases.) The synchronization phase confirms message delivery, and the process continues until all partitions are done with processing. Assuming NUMA (a distributed shared-memory system) is a shared-nothing distributed system, where nodes are independent and connected through an interconnect, we extend the communication infrastructure available in Totem for communication between the CPU and GPUs, and create communication links between all the NUMA partitions. In the computation phase, each partition updates its local state buffer, and remote updates are aggregated and stored in the respective outbox buffer. In the communication phase, each partition transfers its outbox buffers to the corresponding remote inbox buffers and applies the remote updates received in its own inboxes, thereby requiring two message buffers (one at the source and another at the destination). This design leads to zero remote memory accesses: in the computation phase all memory accesses are local, and in the communication phase the message buffers are explicitly copied to the remote partition, after which each partition applies the remote updates locally, from its inbox to its local state buffer. Updates to remote vertices are aggregated locally. Comm1: transfer the outbox buffer to the remote inbox buffer. Comm2: merge with the local state. Cost model: computation (local random updates) = V * WriteSeqLocal + (E+E') * ReadRandLocal; communication (local random updates) = (N-1) * V' * (ReadRandLocal + WriteSeqLocal), plus a memcpy (sequential remote writes).

37 Communication Trade-Off: NUMA 1-Box
Computation: the kernel manipulates local state. Since NUMA is a shared-memory system, we can pass a pointer and physically allocate only one box (NUMA 1-Box design). BSP supersteps: computation, communication, synchronization (PID-0, PID-1). Since it is a shared-memory system, we can allocate only one box and, during communication, only pass the pointer. Alternatively, you could place the box on the destination, but in that case you end up accessing remote edges, which are roughly 40x more numerous than remote vertices, so we do not pursue that design. Updates to remote vertices are aggregated locally. Comm: merge with the local state. Cost model: computation (local random updates) = V * WriteSeqLocal + (E+E') * ReadRandLocal; communication (random updates plus a pointer copy) = (N-1) * V' * (ReadRandLocal + WriteSeqRemote).

38 Communication Trade-Off: NUMA 0-Box
NUMA is a "distributed" shared-memory system: "distributed" means explicit partitioning; "shared-memory" means implicit communication. No communication infrastructure (NUMA 0-Box design); computation is overlapped with communication. Lastly, we have a mixed design between distributed and shared memory: we do explicit partitioning, but we access the state buffers as if we were in a shared-memory system. This should work well where there are few updates, as in BFS (see the next slide). In this design we consider that NUMA is essentially a shared-memory system with a penalty for accessing certain memory regions, so we can break the constraint of using Totem's communication infrastructure between NUMA partitions. As shown in the figure, we do not use any communication infrastructure between NUMA partitions: during the computation phase, if a remote vertex is visited, the value is updated directly in the local state buffer of the respective remote partition, and an atomic operation is used on the updated memory location to ensure correctness. The reason we have this design is that certain algorithms, like BFS, communicate only via a selective set of edges in a superstep; the overhead of full message boxes becomes severe on NUMA, since each partition keeps (n - 1) outboxes, where n is the number of partitions (= #NUMA nodes + #GPUs). Computation: the kernel updates remote state. Cost model (overlapped computation and communication): local random updates = V * WriteSeqLocal + E * ReadRandLocal; remote random updates = E' * ReadRandRemote.

39 Remote vertices vs Remote updates, in BFS-DO
Remote vertices vs remote updates, in BFS-DO. 22x difference. Here we show the number of remote vertices vs the number of remote updates in each superstep of BFS. Since remote vertices are determined during partitioning, they are fixed for the entire execution, but the number of remote updates varies in every superstep and is up to 22x smaller than the total number of remote vertices.

40 Supersteps required for different algorithms
Workload | BFS-TD | BFS-DO | SSSP
RMAT30 | 7 | 8 | 11
Twitter | 15 | 9 | 33
clueWeb12 | 133 | 125 | 129

41 Explicit Partitioning and Implicit Communication
This design works best for algorithms like BFS, where there are few updates. For algorithms like PageRank, where there is a message for each vertex, 0-Box is not effective: it becomes memory-latency bound, since it accesses all the remote edges one at a time.


43 PageRank
Workload | Totem numactl | NUMA-2B | NUMA-1B | NUMA-0B
RMAT28 | 1x | - | - | -
Twitter | 1.28x | 1.63x | 1.56x | 1.50x
clueWeb | 1.64x | 1.59x | 1.49x | -


45 BFS-DO
Workload | Totem numactl | NUMA-2B | NUMA-1B | NUMA-0B
RMAT28 | 1x | 1.44x | - | -
Twitter | 1.04x | 0.91x | 0.86x | 1.24x
clueWeb | 1.14x | 0.55x | 0.48x | 0.62x


47 BFS-TD
Workload | Totem numactl | NUMA-2B | NUMA-1B | NUMA-0B
RMAT28 | 1x | 1.19x | - | -
Twitter | 0.90x | 1.23x | 1.58x | -
clueWeb | 1.04x | 0.72x | 0.81x | 1.10x


49 SSSP
Workload | Totem numactl | NUMA-2B | NUMA-1B | NUMA-0B
RMAT28 | 1x | 2.33x | - | -
Twitter | 1.35x | 1.11x | 1.14x | 1.73x
clueWeb | 1.05x | 0.88x | 1.26x | -

50 Supersteps required for different algorithms
Workload | BFS-TD | BFS-DO | SSSP
RMAT30 | 7 | 8 | 11
Twitter | 15 | 9 | 33
clueWeb12 | 133 | 125 | 129

51 Strong Scaling w.r.t Resources
Scaling factors shown in the chart (per algorithm, workload, and socket count): 3.7x, 3.38x, 3.43x, 3.15x, 2.9x, 2.73x, 2.77x, 2.67x, 2.72x, 2.51x, 2.39x, 2.2x, 2.33x, 2.02x, 2.14x, 2.07x, 1.74x, 1.74x, 1.31x, 1.31x.

52 84.4% of vertices have <50 degree
Max. degree: 3.69 M

53 65% of vertices have <50 degree
Max. degree: 75.6 M

54 86.9% of vertices have <50 degree
Max. degree: M

Workload | Remote Edges (%) | Local Edges (%) | Local Vertices (%) | Remote Vertices (%)
RMAT28 | 74.93 | - | - | 37.88
RMAT29 | 74.89 | - | - | 36.93
RMAT30 | 75.05 | - | - | 36.05
RMAT31 | 75.13 | - | - | 35.22

56 Performance Model for 3 designs: Use Case - PageRank
Notation: V = #local vertices, V' = #remote vertices, E = #local edges, E' = #remote edges, N = #partitions; each access term is Read or Write, Sequential or Random, Local or Remote.
NUMA 2-Box: computation (local random updates) = V * WriteSeqLocal + (E+E') * ReadRandLocal; communication (memcpy plus local random updates) = (N-1) * V' * (WriteSeqLocal + ReadRandLocal). Cost model: 1x; empirical: 1x.
NUMA 1-Box: computation = V * WriteSeqLocal + (E+E') * ReadRandLocal; communication (pointer copy plus random updates) = (N-1) * V' * (WriteSeqRemote + ReadRandLocal). Cost model: 1.04x; empirical: 1.06x.
NUMA 0-Box: overlapped computation and communication = V * WriteSeqLocal + E * ReadRandLocal (local random updates) + E' * ReadRandRemote (remote random updates). Cost model: 1.27x; empirical: 1.23x.
[TODO] Just include a plot showing model prediction vs actual gain, for RMAT28, 29, 30, 31; no need to show this model, since it has been described in the earlier slides with each design. At a very high level, this is how I modeled it and what the model predicts. We estimated these numbers on a different machine; on that machine, the 1-Box design should take 18% more time and the 0-Box design 42% more time, but by running PageRank on our machine we observed that 2-Box is 7% better than 1-Box and 10% better than 0-Box. The machines are different, but they are from the same Intel Xeon family. Finally, we look at our performance model to compare the expected performance of the three designs, taking PageRank as the use case; using the numbers from the Polymer paper, the model suggests that the 2-Box design is 18% faster than 1-Box and 42% faster than 0-Box, while our experiments show 2-Box is 7% better than 1-Box and 10% better than 0-Box.
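For reference, the model above can be mechanized as a small sketch; the AccessCosts values would come from measured per-access costs (e.g., the testbed measurements or the Polymer paper's numbers), and the struct and function names here are mine, not the paper's.

    // V/E: local vertices/edges, Vr/Er: remote (boundary) vertices/edges,
    // N: number of partitions. Costs are per-access times; fill in measured values.
    struct AccessCosts {
      double w_seq_local, w_seq_remote, r_rand_local, r_rand_remote, memcpy_per_elem;
    };

    double cost_2box(double V, double E, double Vr, double Er, int N, AccessCosts c) {
      double comp = V * c.w_seq_local + (E + Er) * c.r_rand_local;
      double comm = (N - 1) * Vr * (c.w_seq_local + c.r_rand_local) + Vr * c.memcpy_per_elem;
      return comp + comm;
    }

    double cost_1box(double V, double E, double Vr, double Er, int N, AccessCosts c) {
      double comp = V * c.w_seq_local + (E + Er) * c.r_rand_local;
      double comm = (N - 1) * Vr * (c.w_seq_remote + c.r_rand_local);  // pointer copy ~ free
      return comp + comm;
    }

    double cost_0box(double V, double E, double Vr, double Er, int N, AccessCosts c) {
      (void)N; (void)Vr;                            // no separate communication phase
      return V * c.w_seq_local + E * c.r_rand_local + Er * c.r_rand_remote;
    }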

57 Applications BFS-Top Down BFS-Direction Optimized (BFS-DO)
Applications: BFS-Top Down, BFS-Direction Optimized (BFS-DO), Graph500 BFS, PageRank, SSSP, Graph500 SSSP (illustrated on an example graph with vertices 1-9). BFS-Top Down goes level by level, finding children from their parents; it computes the level of each vertex.

58 Applications BFS-Top Down BFS-Direction Optimized (BFS-DO)
Frontier = 5; Next Frontier = 8, 1, 3

59 Applications BFS-Top Down BFS-Direction Optimized (BFS-DO)
Frontier = 8, 1, 3; Next Frontier = 7, 2, 9, 6

60 Applications BFS-Top Down BFS-Direction Optimized (BFS-DO)
Frontier = 7, 2, 9, 6; Next Frontier = 4. For scale-free graphs this can cause high write traffic, as many edges out of the current frontier can attempt to add the same vertex to the next frontier.

61 Applications BFS-Top Down BFS-Direction Optimized (BFS-DO)
Frontier = 4; Next Frontier = null

62 Applications BFS-Top Down BFS-Direction Optimized(BFS-DO) Graph500 BFS
BFS-DO switches between top-down and bottom-up: it switches when the frontier size grows larger than a threshold value.
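A tiny sketch of that switching rule (the threshold value and helper signatures are placeholders; the framework hand-tunes the switching parameters):

    #include <cstddef>
    #include <cstdint>
    #include <vector>

    // Run top-down while the frontier is small; switch to bottom-up once the
    // frontier grows past a threshold fraction of the vertices.
    void bfs_do_superstep(const std::vector<uint32_t>& frontier,
                          std::vector<uint32_t>& next_frontier,
                          std::size_t num_vertices, double threshold_fraction,
                          void (*top_down)(const std::vector<uint32_t>&, std::vector<uint32_t>&),
                          void (*bottom_up)(std::vector<uint32_t>&)) {
      if (frontier.size() < threshold_fraction * num_vertices)
        top_down(frontier, next_frontier);   // expand along the frontier's out-edges
      else
        bottom_up(next_frontier);            // unvisited vertices look for a visited parent
    }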

63 Applications BFS-Top Down BFS-Direction Optimized(BFS-DO) Graph500 BFS
Frontier = 5; Next Frontier = 8, 1, 3

64 Applications BFS-Top Down BFS-Direction Optimized(BFS-DO) Graph500 BFS
Frontier = 8, 1, 3; Next Frontier = 7, 2, 9, 6. Say the threshold is 3: since the next frontier's size is 4, the traversal switches and looks at the unvisited vertices in the graph that are neighbors of vertices in the next frontier.

65 Applications BFS-Top Down BFS-Direction Optimized(BFS-DO) Graph500 BFS
Frontier = 7, 2, 9, 6; Next Frontier = null. Once vertex 4 is visited, none of its neighbors are unvisited, so the next frontier is null and the traversal terminates.

66 Applications BFS-Top Down BFS-Direction Optimized (BFS-DO)
Graph500 BFS computes the parent of each vertex rather than its level.

67 Applications BFS-Top Down BFS-Direction Optimized (BFS-DO)
PageRank is compute intensive and calculates the rank of every vertex from its neighbors.

68 Applications BFS-Top Down BFS-Direction Optimized (BFS-DO)
SSSP finds the shortest path from a given source.

69 Applications BFS-Top Down BFS-Direction Optimized (BFS-DO)
(SSSP example, continued.)

70 Applications BFS-Top Down BFS-Direction Optimized (BFS-DO)
Graph500 SSSP computes the parent as well as the distance of each vertex (in the figure, each vertex is annotated with its distance and parent).

71 84.4% of vertices have <50 degree
0.1% of vertices have 25.5% of edges Max. degree: 3.69 M

72 65% of vertices have <50 degree
0.1% of vertices have 15.4% of edges Max. degree: 75.6 M

73 86.9% of vertices have <50 degree
0.1% of vertices have 48.7% of edges Max. degree: M

74 BFS-DO cache hit rates (chart): 47.1%, 62.0%, 30.7%, 30.7%, 23.4%

