A Pattern Specification and Optimizations Framework for Accelerating Scientific Computations on Heterogeneous Clusters
Linchuan Chen, Xin Huo, and Gagan Agrawal
Department of Computer Science and Engineering, The Ohio State University
Modern Parallel Computing Landscape
Modern supercomputers take the form of heterogeneous clusters.
Heterogeneous Clusters
Massive computation power: multiple levels of parallelism; a large number of heterogeneous nodes, each combining high-end CPUs with many-core accelerators, e.g., GPUs and Xeon Phi
An important role in scientific computations: 4 out of the top 10 supercomputers involve CPU-accelerator nodes
Programming Heterogeneous Clusters
Direct programming. Pros: performance, since application-specific optimizations can be applied. Cons: complexity, low productivity, and low portability, because developers must program different devices (MPI, CUDA, ...), manage communication at different levels, and partition the workload themselves.
General heterogeneous programming models. Pros: high programmability, since the runtime system handles the underlying issues. Cons: too general to be efficient, because application-specific optimizations cannot be applied.
A tradeoff between the two is worth considering.
Thus we have the following question: “Can high-level APIs be developed for several classes of popular scientific applications, to ease application development, while achieving high performance on clusters with accelerators?”
Our Solution
The Approach
Consider a reasonable variety of, but not all, scientific applications
Summarize scientific computation kernels by patterns
Provide pattern-specific APIs for each pattern, with motivation from MapReduce
Automatically conduct pattern-specific optimizations
The Approach
Communication patterns commonly appearing in scientific applications: generalized reductions, irregular reductions, and stencil computations. These cover a reasonable subset of the Berkeley Dwarfs (16 out of 23 applications in the Rodinia benchmark suite).
For each pattern individually:
Summarize its characteristics (computation pattern, communication pattern)
Design a high-level API
Execute on different devices and their combination
Conduct automatic pattern-specific optimizations (at the computation level and the communication level)
Communication Patterns
Generalized reductions: parallel accumulation using associative and commutative operations (e.g., sum, mul, max), as supported by, e.g., OpenMP; the reduction space is typically small
Irregular reductions: accumulation over an irregular reduction space, e.g., edge computations reducing into the nodes of an unstructured mesh
Stencil computations: structured grids, where elements are updated using their neighboring elements
(A minimal sketch contrasting the first two patterns follows.)
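To make the distinction concrete, here is a minimal sketch of the first two patterns in plain C++. This is illustrative code, not the framework's API: the generalized reduction accumulates into a small reduction object, while the irregular reduction scatters edge contributions into a large, indirectly addressed node array (the ±f updates mirror the Moldyn force kernel shown later).

#include <cstddef>
#include <utility>
#include <vector>

// Generalized reduction: e.g., a global sum; the reduction space is a
// single value, and += is associative and commutative.
double generalized_sum(const std::vector<double>& input) {
    double acc = 0.0;                      // small reduction object
    for (double v : input) acc += v;
    return acc;
}

// Irregular reduction: each edge (n0, n1) reduces into per-node values;
// the access pattern into node_val is data-dependent.
void irregular_reduce(const std::vector<std::pair<int, int>>& edges,
                      const std::vector<double>& edge_force,
                      std::vector<double>& node_val) {
    for (std::size_t e = 0; e < edges.size(); ++e) {
        node_val[edges[e].first]  += edge_force[e];
        node_val[edges[e].second] -= edge_force[e];
    }
}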
APIs
Pattern-specific: one set of user-defined functions for each pattern; each user-defined function processes a smallest unit of work
Flexible: C++ based, allowing the use of other parallel libraries, and supporting applications with mixed communication patterns
Example: Moldyn (irregular & generalized reductions)

User-defined functions:

// user-defined functions for the CF kernel (compute force, irregular reduction)
DEVICE void force_cmpt(Object *obj, EDGE edge, void *edge_data, void *node_data, void *parameter) {
  /* {compute the distance between nodes...} */
  if (dist < cutoff) {
    double f = compute_force((double*)node_data[edge[0]], (double*)node_data[edge[1]]);
    obj->insert(&edge[0], &f);
    f = -f;
    obj->insert(&edge[1], &f);
  }
}
DEVICE void force_reduce(VALUE *dst, VALUE *src) { *dst += *src; }
// user-defined functions for the KE kernel (kinetic energy, generalized reduction)
DEVICE void ke_emit(Object *object, void *input, size_t index, void *parameter) {...}
DEVICE void ke_reduce(VALUE *dst, VALUE *src) {...}
// user-defined functions for the AV kernel (average velocity, generalized reduction)
DEVICE void av_emit(...) {...}
DEVICE void av_reduce(...) {...}

Application driver code:

Runtime_env env;
env.init();
// runtime for irregular reduction (CF)
IReduction_runtime *ir = env.get_IR();
// runtime for generalized reductions (KE & AV)
GReduction_runtime *gr = env.get_GR();
// Compute Force (CF) kernel
ir->set_edge_comp_func(force_cmpt);    // use force_cmpt
ir->set_node_reduc_func(force_reduce); // use force_reduce
/* {set edge and node data filenames ...} */
for (int i = 0; i < n_tsteps; i++) {
  ir->start();
  // get local reduction result
  result = ir->get_local_reduction();
  // update local node data
  ir->update_nodedata(result);
}
/* {set input filename} */
// Kinetic Energy (KE) kernel
gr->set_emit_func(ke_emit);    // ke_emit as emit func
gr->set_reduc_func(ke_reduce); // ke_reduce as reduce func
...
gr->start();
double ke_output = gr->get_global_reduction();
// Average Velocity (AV) kernel
gr->set_emit_func(av_emit);    // av_emit as emit func
gr->set_reduc_func(av_reduce); // av_reduce as reduce func
env.finalize();
Example: Jacobi (stencil computation)

User-defined function:

DEVICE void jacobi(void *input, void *output, int *offset, int *size, void *param) {
  int k = offset[0], j = offset[1], i = offset[2];
  float total = GET_FLOAT3(input, i, j, k) +
                GET_FLOAT3(input, i, j, k+1) + ... +
                GET_FLOAT3(input, i-1, j, k);
  GET_FLOAT3(output, i, j, k) = total / 7;  // 7-point stencil
}

Application driver code:

Runtime_env env;
env.init();
Stencil_runtime *sr = env.get_SR();
/* {prepare input data & input data elements...} */
// Sconfig, DIM3, and GET_FLOAT3 are system-defined primitives
Sconfig<float> conf;
DIM3 grid_size(N, N, N), proc_size(2, 2, 2);
conf.grid_size = grid_size;
conf.proc_size = proc_size;
/* {configure stencil width, diagonal access, #iters...} */
sr->set_config(conf);
sr->set_stencil_func(jacobi);  // jacobi as user-defined func
sr->set_grid(input_pointer);
sr->start();
sr->copy_out_grid(output_pointer);
env.finalize();
Code Sizes
Compared with handwritten MPI codes (which are not able to use GPUs), using the framework reduces code size by 60%
At the same time, the framework is able to fully utilize the CPU and GPUs on each node
Runtime Implementation
Inter-node
Generalized reductions: evenly partition the input across all processes; no data exchange during execution; conduct a final combination
Irregular reductions: workload partitioning is based on the reduction space (the nodes), and edges are grouped according to the node partitioning; inter-process communication exchanges node data for crossing edges, overlapped with the computation of local edges (see the sketch below)
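A minimal sketch of the edge-grouping step in plain C++ (names such as owner_of and group_edges are illustrative, not the framework's internals): an edge with both endpoints in this process's node partition is local, while an edge spanning two partitions is a crossing edge that needs remote node data.

#include <vector>

struct Edge { int n0, n1; };

// Hypothetical helper: which process owns a node under an even
// partitioning of the reduction space (the nodes).
int owner_of(int node, int num_nodes, int num_procs) {
    int chunk = (num_nodes + num_procs - 1) / num_procs;
    return node / chunk;
}

// Group the edges relevant to process `rank`: local edges reduce without
// communication; crossing edges wait for remote node data, which the
// runtime overlaps with the local-edge computation.
void group_edges(const std::vector<Edge>& edges, int num_nodes,
                 int num_procs, int rank,
                 std::vector<Edge>& local, std::vector<Edge>& crossing) {
    for (const Edge& e : edges) {
        int o0 = owner_of(e.n0, num_nodes, num_procs);
        int o1 = owner_of(e.n1, num_nodes, num_procs);
        if (o0 == rank && o1 == rank)      local.push_back(e);
        else if (o0 == rank || o1 == rank) crossing.push_back(e);
        // edges owned entirely by other processes are handled there
    }
}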
Inter-node – cont.
Stencil computations: partition the grid according to the user-defined decomposition parameter; allocate sub-grids with halo regions in each process; exchange boundary data through the halo regions, with the boundary exchange overlapping the computation of inner elements (see the sketch below)
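This overlap can be realized with non-blocking MPI, as in the following minimal sketch of one time step under a 1D decomposition (illustrative code, not the framework's implementation): sub holds num_planes local planes of plane_size floats plus one halo plane on each side, and left/right are neighbor ranks (MPI_PROC_NULL at domain boundaries).

#include <mpi.h>
#include <vector>

void stencil_step(std::vector<float>& sub, int plane_size,
                  int num_planes, int left, int right, MPI_Comm comm) {
    MPI_Request reqs[4];
    // Post the halo exchange: receive into the halo planes, send the
    // owned boundary planes to the neighbors.
    MPI_Irecv(&sub[0], plane_size, MPI_FLOAT, left, 0, comm, &reqs[0]);
    MPI_Irecv(&sub[(num_planes + 1) * plane_size], plane_size, MPI_FLOAT,
              right, 1, comm, &reqs[1]);
    MPI_Isend(&sub[1 * plane_size], plane_size, MPI_FLOAT, left, 1, comm,
              &reqs[2]);
    MPI_Isend(&sub[num_planes * plane_size], plane_size, MPI_FLOAT, right,
              0, comm, &reqs[3]);

    // Compute the inner planes (those not touching the halos) while the
    // boundary exchange is in flight; compute_planes is a hypothetical
    // helper applying the stencil to a range of planes.
    // compute_planes(sub, /*from=*/2, /*to=*/num_planes - 1);

    // Wait for the halos, then compute the two boundary planes.
    MPI_Waitall(4, reqs, MPI_STATUSES_IGNORE);
    // compute_planes(sub, 1, 1);
    // compute_planes(sub, num_planes, num_planes);
}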
CPU-GPU Workload Partitioning
Goals
Load balance: the processing speeds of the CPU and the GPU are different
Low scheduling overhead: the cycles spent on scheduling should be small compared with those spent on computation
CPU-GPU Workload Partitioning – cont.
Generalized reductions: dynamic scheduling between the CPU and the GPUs; the GPU launches a kernel after each task-block fetch; multiple streams per GPU overlap data copies with kernel execution
Irregular reductions: adaptive partitioning, exploiting the fact that irregular reductions are iterative; evenly partition the reduction space for the first few iterations, profile the relative speeds of the devices during those iterations, and re-partition the nodes according to the relative speeds (see the sketch below)
Stencil computations: partition the grid along the highest dimension, also using adaptive partitioning
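The re-partitioning step amounts to a proportional split of the reduction space, as in this minimal sketch in plain C++ (illustrative, not the framework's code): after profiling, each device receives a share proportional to its measured speed.

#include <numeric>
#include <vector>

// speeds[d] is the profiled relative speed of device d; the result gives
// how many of the num_nodes reduction-space elements each device gets.
std::vector<int> repartition(const std::vector<double>& speeds,
                             int num_nodes) {
    double total = std::accumulate(speeds.begin(), speeds.end(), 0.0);
    std::vector<int> counts(speeds.size());
    int assigned = 0;
    for (std::size_t d = 0; d + 1 < speeds.size(); ++d) {
        counts[d] = static_cast<int>(num_nodes * speeds[d] / total);
        assigned += counts[d];
    }
    counts.back() = num_nodes - assigned;  // remainder to the last device
    return counts;
}

// Example: a GPU profiled at 2.69x the CPU's speed (as measured for
// Kmeans later) would receive roughly 73% of the nodes:
// repartition({1.0, 2.69}, num_nodes).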
GPU and CPU Execution Optimizations
Reduction localization (for generalized and irregular reductions): on the GPU, reductions are performed in shared memory first and then combined into device memory; on the CPU, each core reduces into a private reduction object, with a combination performed later (see the sketch below)
Grid tiling (for stencil computations): increases neighbor-access locality
Overlapped execution: inner tiles are processed concurrently with the exchange of boundary tiles; boundary tiles are processed afterwards
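On the CPU side, reduction localization amounts to per-thread private reduction objects, as in this minimal sketch (illustrative, not the framework's code); the GPU side follows the same idea, with per-block shared-memory copies combined into device memory.

#include <cstddef>
#include <thread>
#include <vector>

double localized_sum(const std::vector<double>& input, int num_threads) {
    std::vector<double> priv(num_threads, 0.0);  // one object per core
    std::vector<std::thread> workers;
    for (int t = 0; t < num_threads; ++t) {
        workers.emplace_back([&, t] {
            // Each thread accumulates into its own private object, so
            // the main loop needs no synchronization.
            for (std::size_t i = t; i < input.size(); i += num_threads)
                priv[t] += input[i];
        });
    }
    for (auto& w : workers) w.join();

    double result = 0.0;                         // final combination
    for (double p : priv) result += p;
    return result;
}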
Experiments
Experimental Setup
Platform: a GPU cluster of 32 heterogeneous nodes; each node has a 12-core Intel Xeon 5650 CPU and 2 NVIDIA Tesla M2070 GPUs; MVAPICH2 version 1.7, with 1 process per node plus pthread multithreading; CUDA 5.5
Applications: Kmeans (generalized reduction), Moldyn (irregular and generalized reduction), MiniMD (irregular and generalized reduction), Sobel (2D stencil), Heat3D (3D stencil)
Execution configurations: MPI (Hand), from widely distributed benchmark suites; CPU-ONLY, using only the multi-core CPU on each node; 1GPU-ONLY, using only 1 GPU on each node; CPU+1GPU, using the CPU plus 1 GPU on each node; CPU+2GPU, using the CPU plus 2 GPUs on each node
Single GPU Performance: Comparison with GPU Benchmarks
Handwritten benchmarks: Kmeans is from the Rodinia benchmark suite; Sobel is from the NVIDIA SDK
Using single-node, single-GPU execution, the framework is 6% and 15% slower for Kmeans and Sobel, respectively
Performance - Kmeans
A GPU is 2.69x faster than a CPU; CPU+1GPU is 1.2x faster than GPU only; CPU+2GPU is 1.92x faster than GPU only; 32 nodes are 1760x faster than the sequential version
Framework (CPU-ONLY) is faster than MPI (Hand) due to a difference in implementation: the handwritten code uses 1 MPI process per core, while the framework uses pthreads, requiring less communication
Performance - Moldyn
The GPU is 1.5x faster than the CPU; CPU+1GPU is 1.54x faster than GPU only; CPU+2GPU is 2.31x faster than GPU only; a 589x speedup is achieved using all 32 nodes
Performance – Heat3D
The GPU is 2.4x faster than the CPU; CPU+2GPU is 5.5x faster than CPU only; a 749x speedup is achieved using 32 nodes, comparable with handwritten code
Effect of Optimizations
Tiling (used for stencil computations) improves Sobel by up to 20%
Overlapped communication and computation (used for both irregular reductions and stencil computations) makes Moldyn and Sobel 37% and 11% faster, respectively
Conclusions
A programming model aiming to trade off programmability and performance
Pattern-based optimizations achieve considerable scalability and performance comparable with benchmarks
Reduces code sizes
Future work: cover more communication patterns; support more architectures, e.g., Intel MIC