A Pattern Specification and Optimizations Framework for Accelerating Scientific Computations on Heterogeneous Clusters
Linchuan Chen, Xin Huo, and Gagan Agrawal
Department of Computer Science and Engineering, The Ohio State University
Modern Parallel Computing Landscape
Supercomputers: today's supercomputers take the form of heterogeneous clusters.
Heterogeneous Clusters
Massive Computation Power
- Multiple levels of parallelism
- Large number of heterogeneous nodes: high-end CPUs; many-core accelerators, e.g., GPU, Xeon Phi
Play an Important Role in Scientific Computations
- 4 out of the top 10 supercomputers involve CPU-accelerator nodes
Programming Heterogeneous Clusters
Direct Programming
- Pros: performance (can conduct application-specific optimizations)
- Cons: complexity, low productivity, low portability (programming different devices: MPI, CUDA, ...; communications at different levels; workload partitioning)
General Heterogeneous Programming Models
- Pros: high programmability (the runtime system handles the underlying issues)
- Cons: general but not efficient (too general to apply application-specific optimizations)
A tradeoff is worth considering.
Thus we have the following question:
“Can high-level APIs be developed for several classes of popular scientific applications, to ease application development, while achieving high performance on clusters with accelerators?”
Our Solution
The Approach
- Consider a reasonable variety of, but not all, scientific applications
- Summarize scientific computation kernels by patterns
- Provide pattern-specific APIs for each pattern (motivation from MapReduce)
- Automatically conduct pattern-specific optimizations
The Approach
Commonly Occurring Communication Patterns in Scientific Applications
- Generalized Reductions
- Irregular Reductions
- Stencil Computations
- Cover a reasonable subset of the Berkeley Dwarfs (cover 16 out of 23 applications in the Rodinia Benchmark Suite)
Individually for Each Pattern
- Summarize its characteristics: computation pattern, communication pattern
- Design a high-level API: execute on different devices and their combinations
- Conduct automatic pattern-specific optimizations: at the computation level and the communication level
Communication Patterns
Generalized Reductions
- Parallel accumulation using associative and commutative operations, e.g., sum, mul, max
- E.g., supported by OpenMP
- Reduction space is typically small
Irregular Reductions
- Reductions over a space (e.g., mesh nodes) accessed through indirection (e.g., edges)
Stencil Computations
- Structured grids
- Update elements using neighbor elements
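As a rough illustration (not taken from the original slides), the three patterns correspond to loop skeletons like the ones below; the array names, sizes, and operations are hypothetical.

// Hypothetical loop skeletons for the three communication patterns (illustration only).
#include <cstddef>
#include <utility>
#include <vector>

int main() {
    // Generalized reduction: accumulate into a small reduction space (e.g., per-cluster sums).
    std::vector<double> points(1000, 1.0);
    std::vector<double> cluster_sum(4, 0.0);
    for (std::size_t i = 0; i < points.size(); i++)
        cluster_sum[i % 4] += points[i];              // associative, commutative update

    // Irregular reduction: update nodes through an edge (indirection) list.
    std::vector<double> node_val(100, 0.0);
    std::vector<std::pair<int, int>> edges = {{0, 1}, {1, 2}, {2, 0}};
    for (const auto& e : edges) {
        double f = 0.5;                               // stands in for a pairwise interaction
        node_val[e.first]  += f;                      // reduction space accessed via indirection
        node_val[e.second] -= f;
    }

    // Stencil: update each grid element from its neighbors (1D, 3-point stencil).
    std::vector<double> in(64, 1.0), out(64, 0.0);
    for (std::size_t i = 1; i + 1 < in.size(); i++)
        out[i] = (in[i - 1] + in[i] + in[i + 1]) / 3.0;

    return 0;
}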
APIs
Pattern-specific
- One set of user-defined functions for each pattern
- User-defined functions process the smallest unit of work
Flexible
- C++ based; allows the use of other parallel libraries
- Supports applications with mixed communication patterns
Example: Moldyn (irregular & generalized reductions)
//user-defined functions for CF kernel
DEVICE void force_cmpt(Object *obj, EDGE edge, void *edge_data, void *node_data, void *parameter) {
  /*{compute the distance between nodes...}*/
  if(dist < cutoff) {
    double f = compute_force((double*)node_data[edge[0]], (double*)node_data[edge[1]]);
    obj->insert(&edge[0], &f);
    f = -f;
    obj->insert(&edge[1], &f);
  }
}
DEVICE void force_reduce(VALUE *dst, VALUE *src) { *dst += *src; }

//user-defined functions for KE kernel
DEVICE void ke_emit(Object *object, void *input, size_t index, void *parameter) {...}
DEVICE void ke_reduce(VALUE *dst, VALUE *src) {...}

//user-defined functions for AV kernel
DEVICE void av_emit(...) {...}
DEVICE void av_reduce(...) {...}

//application driver code
Runtime_env env;
env.init();
//runtime for irregular reduction CF
IReduction_runtime *ir = env.get_IR();
//runtime for generalized reductions KE & AV
GReduction_runtime *gr = env.get_GR();

//Compute Force (CF) kernel
ir->set_edge_comp_func(force_cmpt);    // use force_cmpt
ir->set_node_reduc_func(force_reduce); // use force_reduce
/*{set edge and node data filenames ...}*/
for(int i = 0; i < n_tsteps; i++) {
  ir->start();
  result = ir->get_local_reduction(); // get local reduction result
  ir->update_nodedata(result);        // update local node data
}

/*{set input filename}*/
//Kinetic Energy (KE) kernel
gr->set_emit_func(ke_emit);    // ke_emit as emit func
gr->set_reduc_func(ke_reduce); // ke_reduce as reduce func
...
gr->start();
double ke_output = (gr->get_global_reduction());

//Average Velocity (AV) kernel
gr->set_emit_func(av_emit);   // av_emit as emit func
gr->set_reduc_func(av_reduce); // av_reduce as reduce func
env.finalize();

Kernels: compute force (irregular reduction), kinetic energy (generalized reduction), average velocity (generalized reduction). The first block shows the user-defined functions; the second shows the application driver code.
Example: Jacobi (stencil computation)
//user-defined function (7-point Jacobi stencil)
DEVICE void jacobi(void *input, void *output, int *offset, int *size, void *param) {
  int k = offset[0], j = offset[1], i = offset[2];
  float total = GET_FLOAT3(input,i,j,k)
              + GET_FLOAT3(input,i,j,k+1) + GET_FLOAT3(input,i,j,k-1)
              + GET_FLOAT3(input,i,j+1,k) + GET_FLOAT3(input,i,j-1,k)
              + GET_FLOAT3(input,i+1,j,k) + GET_FLOAT3(input,i-1,j,k);
  GET_FLOAT3(output,i,j,k) = total/7;
}

//application driver code
Runtime_env env;
env.init();
Stencil_runtime *sr = env.get_SR();
/* {prepare input data & input data eles...} */
Sconfig<float> conf;
DIM3 grid_size(N, N, N), proc_size(2, 2, 2);
conf.grid_size = grid_size;
conf.proc_size = proc_size;
/* {configure stencil width, diagonal access, #iters...} */
sr->set_config(conf);
sr->set_stencil_func(jacobi); // jacobi as user-defined func
sr->set_grid(input_pointer);
sr->start();
sr->copy_out_grid(output_pointer);
env.finalize();

DIM3, GET_FLOAT3, and Sconfig are system-defined primitives; jacobi is the user-defined function, followed by the application driver code.
Code Sizes
- Compared with handwritten MPI codes (which are not able to use GPUs), the framework reduces code size by about 60%
- The framework is able to fully utilize the CPU and GPUs on each node
Runtime Implementation
Inter-node Execution
Generalized Reductions
- Evenly partition the input among all processes
- No data exchange during execution
- Conduct a final combination (see the sketch after this list)
Irregular Reductions
- Workload partitioning: based on the reduction space (the nodes); edges are grouped according to the node partitioning
- Inter-process communication: node data is exchanged for crossing edges, overlapped with computation on local edges
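A minimal sketch, assuming MPI, of the inter-node flow for generalized reductions described above; the variable names and the sum operation are illustrative, not the framework's actual runtime code.

// Sketch: even input partitioning, purely local reduction, then one final combination.
#include <mpi.h>
#include <vector>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    // Stand-in for this process's even share of the input (no exchange during execution).
    std::vector<double> my_chunk(1000, 1.0);

    double local = 0.0;                        // local reduction object
    for (double x : my_chunk) local += x;      // user-defined reduce (here: sum)

    double global = 0.0;                       // final combination across processes
    MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

    MPI_Finalize();
    return 0;
}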
Inter-node – cont.: Stencil Computations
- Partition the grid according to the user-defined decomposition parameter
- Allocate sub-grids in each process, with halo regions
- Exchange boundary data through the halo regions
- Boundary data exchange is overlapped with computation on inner elements (sketched below)
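A rough sketch of how the boundary exchange can be overlapped with inner computation, assuming MPI nonblocking calls and a 1D decomposition; compute_inner and compute_boundary are hypothetical helpers, and the real framework handles general 3D decompositions.

// Sketch: overlap halo exchange with inner-element computation (1D decomposition).
#include <cstddef>
#include <mpi.h>
#include <vector>

// Hypothetical helpers: 3-point averaging on inner and boundary elements.
static void compute_inner(const std::vector<float>& in, std::vector<float>& out) {
    for (std::size_t i = 1; i + 1 < in.size(); i++)
        out[i] = (in[i - 1] + in[i] + in[i + 1]) / 3.0f;
}
static void compute_boundary(const std::vector<float>& in, std::vector<float>& out,
                             float halo_lo, float halo_hi) {
    out.front() = (halo_lo + in.front() + in[1]) / 3.0f;
    out.back()  = (in[in.size() - 2] + in.back() + halo_hi) / 3.0f;
}

void stencil_step(std::vector<float>& grid, std::vector<float>& next,
                  int lo_rank, int hi_rank, MPI_Comm comm) {
    const int n = (int)grid.size();
    float recv_lo = 0.0f, recv_hi = 0.0f;
    MPI_Request reqs[4];

    // Post the halo exchange for the two boundary elements.
    MPI_Irecv(&recv_lo, 1, MPI_FLOAT, lo_rank, 0, comm, &reqs[0]);
    MPI_Irecv(&recv_hi, 1, MPI_FLOAT, hi_rank, 1, comm, &reqs[1]);
    MPI_Isend(&grid[0],     1, MPI_FLOAT, lo_rank, 1, comm, &reqs[2]);
    MPI_Isend(&grid[n - 1], 1, MPI_FLOAT, hi_rank, 0, comm, &reqs[3]);

    compute_inner(grid, next);                  // inner elements need no halo data

    MPI_Waitall(4, reqs, MPI_STATUSES_IGNORE);  // halo data has arrived
    compute_boundary(grid, next, recv_lo, recv_hi);  // boundary elements use the halos
}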
CPU-GPU Workload Partitioning
Goals
- Need to consider load balance: processing speeds of the CPU and GPU are different
- Need to keep scheduling overhead low: the relative amount of cycles spent on scheduling should be small compared with computation
CPU-GPU Workload Partitioning – cont.
Generalized Reductions
- Dynamic scheduling between the CPU and GPUs: a GPU launches a kernel after each task block fetch
- Use multiple streams for each GPU to overlap data copies and kernel execution among streams
Irregular Reductions
- Adaptive partitioning (irregular reductions are iterative): evenly partition the reduction space for the first few iterations, profile the relative speeds of the devices during those iterations, then re-partition the nodes according to the relative speeds (see the sketch after this list)
Stencil Computations
- Partition the grid along the highest dimension
- Also use adaptive partitioning
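A simplified sketch of the adaptive partitioning idea for the iterative patterns; the profiling window, the timing model inside the workers, and the repartitioning rule are illustrative assumptions, not the framework's exact policy.

// Sketch: profile relative CPU/GPU speeds in the first iterations, then re-partition.
#include <cstddef>

// Hypothetical workers: each processes `work` items and returns elapsed seconds.
// Here they just model fixed per-item costs so the example is self-contained.
double run_on_cpu(std::size_t work) { return work * 2e-7 + 1e-4; }
double run_on_gpu(std::size_t work) { return work * 1e-7 + 5e-4; }

void adaptive_loop(std::size_t total_work, int n_iters, int profiling_iters) {
    double cpu_share = 0.5;                               // start from an even split
    for (int iter = 0; iter < n_iters; iter++) {
        std::size_t cpu_work = (std::size_t)(cpu_share * total_work);
        std::size_t gpu_work = total_work - cpu_work;

        double t_cpu = run_on_cpu(cpu_work);
        double t_gpu = run_on_gpu(gpu_work);

        if (iter < profiling_iters) {                     // profile during the first few iterations
            double cpu_speed = cpu_work / t_cpu;          // items per second
            double gpu_speed = gpu_work / t_gpu;
            cpu_share = cpu_speed / (cpu_speed + gpu_speed);  // re-partition by relative speed
        }
    }
}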
GPU and CPU Execution Optimizations
Reduction Localization (for Generalized Reductions and Irregular Reductions)
- GPU execution: reductions are performed in the GPU's shared memory first, then combined into device memory (see the sketch after this list)
- CPU execution: each core has a private reduction object; a combination is performed later
Grid Tiling (for Stencil Computations)
- Increases the locality of neighbor accesses
- Overlapped execution: inner tiles are processed concurrently with the exchange of boundary tiles; boundary tiles are processed later
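As a sketch of the reduction-localization idea on the GPU (written directly in CUDA rather than through the framework's DEVICE functions, and assuming a sum reduction with a fixed block size of 256 threads), each thread block reduces into shared memory first and only the per-block result touches device memory.

// Sketch: block-local reduction in shared memory, then one atomic combine per block.
// Launch with 256 threads per block.
__global__ void block_local_sum(const float* in, float* out, int n) {
    __shared__ float partial[256];                    // per-block reduction object in shared memory
    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + tid;
    partial[tid] = (i < n) ? in[i] : 0.0f;
    __syncthreads();
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {    // tree reduction within the block
        if (tid < s) partial[tid] += partial[tid + s];
        __syncthreads();
    }
    if (tid == 0) atomicAdd(out, partial[0]);         // combine the block result into device memory
}

On the CPU side, the analogous localization is a per-thread private copy of the reduction object that is merged once at the end, so threads never contend on shared reduction data during the main loop.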
Experiments
Experimental Results
Platform
- A GPU cluster with 32 heterogeneous nodes
- Each node: a 12-core Intel Xeon 5650 CPU and 2 Nvidia Tesla M2070 GPUs
- MVAPICH2 version 1.7; 1 process per node, plus pthread multithreading; CUDA 5.5
Applications
- Kmeans: generalized reduction
- Moldyn: irregular reduction and generalized reduction
- MiniMD: irregular reduction and generalized reduction
- Sobel: 2D stencil
- Heat3D: 3D stencil
Execution configurations
- MPI (Hand): from widely distributed benchmark suites
- CPU-ONLY: use only multi-core CPU execution on each node
- 1GPU-ONLY: use only 1 GPU on each node
- CPU+1GPU: use the CPU plus 1 GPU on each node
- CPU+2GPU: use the CPU plus 2 GPUs on each node
Single GPU Performance: Comparison with GPU Benchmarks
- Handwritten benchmarks: Kmeans is from the Rodinia benchmark suite; Sobel is from the NVIDIA SDK
- Single-node, single-GPU execution
- The framework is 6% and 15% slower for Kmeans and Sobel, respectively
Performance - Kmeans
- A GPU is 2.69x faster than a CPU
- CPU+1GPU is 1.2x faster than GPU only
- CPU+2GPU is 1.92x faster than GPU only
- 32 nodes are 1760x faster than the sequential version
- Framework (CPU-ONLY) is faster than MPI (Hand), due to the difference in implementation: the handwritten code uses 1 MPI process per core, while the framework uses pthreads and incurs less communication
Performance - Moldyn
- The GPU is 1.5x faster than the CPU
- CPU+1GPU is 1.54x faster than GPU only
- CPU+2GPU is 2.31x faster than GPU only
- A 589x speedup is achieved using all 32 nodes
Performance - Heat3D
- The GPU is 2.4x faster than the CPU
- CPU+2GPU is 5.5x faster than CPU only
- A 749x speedup is achieved using 32 nodes
- Comparable with handwritten code
Effect of Optimizations
- Tiling optimization was used for stencil computations
- Overlapped (communication & computation) execution was used for both irregular reductions and stencil computations
- Tiling improves Sobel by up to 20%
- Overlapped execution makes Moldyn and Sobel 37% and 11% faster, respectively
Conclusions
- A programming model aiming to trade off programmability and performance
- Pattern-based optimizations
- Achieves considerable scalability and performance comparable with benchmark codes
- Reduces code sizes
Future Work
- Cover more communication patterns
- Support more architectures, e.g., Intel MIC