A Pattern Specification and Optimizations Framework for Accelerating Scientific Computations on Heterogeneous Clusters
Linchuan Chen, Xin Huo, and Gagan Agrawal
Department of Computer Science and Engineering, The Ohio State University
Modern Parallel Computing Landscape
Modern supercomputers take the form of heterogeneous clusters.
Heterogeneous Clusters
Massive computation power: multiple levels of parallelism; a large number of heterogeneous nodes, each combining high-end CPUs with many-core accelerators, e.g., GPUs and Xeon Phi
An important role in scientific computations: 4 out of the top 10 supercomputers involve CPU-accelerator nodes
Programming Heterogeneous Clusters
Direct programming. Pros: performance, since application-specific optimizations can be applied. Cons: complexity, low productivity, and low portability, because developers must program different devices (MPI, CUDA, ...), manage communication at different levels, and partition the workload themselves.
General heterogeneous programming models. Pros: high programmability, since the runtime system handles the underlying issues. Cons: too general to be efficient, because application-specific optimizations cannot be applied.
A tradeoff between the two is worth considering.
Thus we have the following question: “Can high-level APIs be developed for several classes of popular scientific applications, to ease application development, while achieving high performance on clusters with accelerators?”
Our Solution
The Approach
Consider a reasonable variety of, but not all, scientific applications
Summarize scientific computation kernels by patterns
Provide pattern-specific APIs for each pattern, with motivation from MapReduce
Automatically conduct pattern-specific optimizations
The Approach
Communication patterns commonly appearing in scientific applications: generalized reductions, irregular reductions, and stencil computations. These cover a reasonable subset of the Berkeley Dwarfs (16 out of 23 applications in the Rodinia benchmark suite).
For each pattern individually:
Summarize its characteristics (computation pattern, communication pattern)
Design a high-level API
Execute on different devices and their combination
Conduct automatic pattern-specific optimizations (at the computation level and the communication level)
Communication Patterns
Generalized reductions: parallel accumulation using associative and commutative operations (e.g., sum, mul, max), as supported by, e.g., OpenMP; the reduction space is typically small
Irregular reductions: accumulation over an irregular reduction space, e.g., edge computations reducing into the nodes of an unstructured mesh
Stencil computations: structured grids, where elements are updated using their neighboring elements
(A minimal sketch contrasting the first two patterns follows.)
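To make the distinction concrete, here is a minimal sketch of the first two patterns in plain C++. This is illustrative code, not the framework's API: the generalized reduction accumulates into a small reduction object, while the irregular reduction scatters edge contributions into a large, indirectly addressed node array (the ±f updates mirror the Moldyn force kernel shown later).

#include <cstddef>
#include <utility>
#include <vector>

// Generalized reduction: e.g., a global sum; the reduction space is a
// single value, and += is associative and commutative.
double generalized_sum(const std::vector<double>& input) {
    double acc = 0.0;                      // small reduction object
    for (double v : input) acc += v;
    return acc;
}

// Irregular reduction: each edge (n0, n1) reduces into per-node values;
// the access pattern into node_val is data-dependent.
void irregular_reduce(const std::vector<std::pair<int, int>>& edges,
                      const std::vector<double>& edge_force,
                      std::vector<double>& node_val) {
    for (std::size_t e = 0; e < edges.size(); ++e) {
        node_val[edges[e].first]  += edge_force[e];
        node_val[edges[e].second] -= edge_force[e];
    }
}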
APIs
Pattern-specific: one set of user-defined functions for each pattern; each user-defined function processes a smallest unit of work
Flexible: C++ based, allowing the use of other parallel libraries, and supporting applications with mixed communication patterns
Example: Moldyn (irregular & generalized reductions)

User-defined functions:

// user-defined functions for the CF kernel (compute force, irregular reduction)
DEVICE void force_cmpt(Object *obj, EDGE edge, void *edge_data, void *node_data, void *parameter) {
  /* {compute the distance between nodes...} */
  if (dist < cutoff) {
    double f = compute_force((double*)node_data[edge[0]], (double*)node_data[edge[1]]);
    obj->insert(&edge[0], &f);
    f = -f;
    obj->insert(&edge[1], &f);
  }
}
DEVICE void force_reduce(VALUE *dst, VALUE *src) { *dst += *src; }
// user-defined functions for the KE kernel (kinetic energy, generalized reduction)
DEVICE void ke_emit(Object *object, void *input, size_t index, void *parameter) {...}
DEVICE void ke_reduce(VALUE *dst, VALUE *src) {...}
// user-defined functions for the AV kernel (average velocity, generalized reduction)
DEVICE void av_emit(...) {...}
DEVICE void av_reduce(...) {...}

Application driver code:

Runtime_env env;
env.init();
// runtime for irregular reduction (CF)
IReduction_runtime *ir = env.get_IR();
// runtime for generalized reductions (KE & AV)
GReduction_runtime *gr = env.get_GR();
// Compute Force (CF) kernel
ir->set_edge_comp_func(force_cmpt);    // use force_cmpt
ir->set_node_reduc_func(force_reduce); // use force_reduce
/* {set edge and node data filenames ...} */
for (int i = 0; i < n_tsteps; i++) {
  ir->start();
  // get local reduction result
  result = ir->get_local_reduction();
  // update local node data
  ir->update_nodedata(result);
}
/* {set input filename} */
// Kinetic Energy (KE) kernel
gr->set_emit_func(ke_emit);    // ke_emit as emit func
gr->set_reduc_func(ke_reduce); // ke_reduce as reduce func
...
gr->start();
double ke_output = gr->get_global_reduction();
// Average Velocity (AV) kernel
gr->set_emit_func(av_emit);    // av_emit as emit func
gr->set_reduc_func(av_reduce); // av_reduce as reduce func
env.finalize();
Example: Jacobi (stencil computation)

User-defined function:

DEVICE void jacobi(void *input, void *output, int *offset, int *size, void *param) {
  int k = offset[0], j = offset[1], i = offset[2];
  float total = GET_FLOAT3(input, i, j, k) +
                GET_FLOAT3(input, i, j, k+1) + ... +
                GET_FLOAT3(input, i-1, j, k);
  GET_FLOAT3(output, i, j, k) = total / 7;  // 7-point stencil
}

Application driver code:

Runtime_env env;
env.init();
Stencil_runtime *sr = env.get_SR();
/* {prepare input data & input data elements...} */
// Sconfig, DIM3, and GET_FLOAT3 are system-defined primitives
Sconfig<float> conf;
DIM3 grid_size(N, N, N), proc_size(2, 2, 2);
conf.grid_size = grid_size;
conf.proc_size = proc_size;
/* {configure stencil width, diagonal access, #iters...} */
sr->set_config(conf);
sr->set_stencil_func(jacobi);  // jacobi as user-defined func
sr->set_grid(input_pointer);
sr->start();
sr->copy_out_grid(output_pointer);
env.finalize();
Code Sizes
Compared with handwritten MPI codes (which are not able to use GPUs), using the framework reduces code size by 60%
At the same time, the framework is able to fully utilize the CPU and GPUs on each node
Runtime Implementation
Inter-node
Generalized reductions: evenly partition the input across all processes; no data exchange during execution; conduct a final combination
Irregular reductions: workload partitioning is based on the reduction space (the nodes), and edges are grouped according to the node partitioning; inter-process communication exchanges node data for crossing edges, overlapped with the computation of local edges (see the sketch below)
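A minimal sketch of the edge-grouping step in plain C++ (names such as owner_of and group_edges are illustrative, not the framework's internals): an edge with both endpoints in this process's node partition is local, while an edge spanning two partitions is a crossing edge that needs remote node data.

#include <vector>

struct Edge { int n0, n1; };

// Hypothetical helper: which process owns a node under an even
// partitioning of the reduction space (the nodes).
int owner_of(int node, int num_nodes, int num_procs) {
    int chunk = (num_nodes + num_procs - 1) / num_procs;
    return node / chunk;
}

// Group the edges relevant to process `rank`: local edges reduce without
// communication; crossing edges wait for remote node data, which the
// runtime overlaps with the local-edge computation.
void group_edges(const std::vector<Edge>& edges, int num_nodes,
                 int num_procs, int rank,
                 std::vector<Edge>& local, std::vector<Edge>& crossing) {
    for (const Edge& e : edges) {
        int o0 = owner_of(e.n0, num_nodes, num_procs);
        int o1 = owner_of(e.n1, num_nodes, num_procs);
        if (o0 == rank && o1 == rank)      local.push_back(e);
        else if (o0 == rank || o1 == rank) crossing.push_back(e);
        // edges owned entirely by other processes are handled there
    }
}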
Inter-node – cont.
Stencil computations: partition the grid according to the user-defined decomposition parameter; allocate sub-grids with halo regions in each process; exchange boundary data through the halo regions, with the boundary exchange overlapping the computation of inner elements (see the sketch below)
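This overlap can be realized with non-blocking MPI, as in the following minimal sketch of one time step under a 1D decomposition (illustrative code, not the framework's implementation): sub holds num_planes local planes of plane_size floats plus one halo plane on each side, and left/right are neighbor ranks (MPI_PROC_NULL at domain boundaries).

#include <mpi.h>
#include <vector>

void stencil_step(std::vector<float>& sub, int plane_size,
                  int num_planes, int left, int right, MPI_Comm comm) {
    MPI_Request reqs[4];
    // Post the halo exchange: receive into the halo planes, send the
    // owned boundary planes to the neighbors.
    MPI_Irecv(&sub[0], plane_size, MPI_FLOAT, left, 0, comm, &reqs[0]);
    MPI_Irecv(&sub[(num_planes + 1) * plane_size], plane_size, MPI_FLOAT,
              right, 1, comm, &reqs[1]);
    MPI_Isend(&sub[1 * plane_size], plane_size, MPI_FLOAT, left, 1, comm,
              &reqs[2]);
    MPI_Isend(&sub[num_planes * plane_size], plane_size, MPI_FLOAT, right,
              0, comm, &reqs[3]);

    // Compute the inner planes (those not touching the halos) while the
    // boundary exchange is in flight; compute_planes is a hypothetical
    // helper applying the stencil to a range of planes.
    // compute_planes(sub, /*from=*/2, /*to=*/num_planes - 1);

    // Wait for the halos, then compute the two boundary planes.
    MPI_Waitall(4, reqs, MPI_STATUSES_IGNORE);
    // compute_planes(sub, 1, 1);
    // compute_planes(sub, num_planes, num_planes);
}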
CPU-GPU Workload Partitioning
Goals
Load balance: the processing speeds of the CPU and the GPU are different
Low scheduling overhead: the cycles spent on scheduling should be small compared with those spent on computation
CPU-GPU Workload Partitioning – cont.
Generalized reductions: dynamic scheduling between the CPU and the GPUs; the GPU launches a kernel after each task-block fetch; multiple streams per GPU overlap data copies with kernel execution
Irregular reductions: adaptive partitioning, exploiting the fact that irregular reductions are iterative; evenly partition the reduction space for the first few iterations, profile the relative speeds of the devices during those iterations, and re-partition the nodes according to the relative speeds (see the sketch below)
Stencil computations: partition the grid along the highest dimension, also using adaptive partitioning
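The re-partitioning step amounts to a proportional split of the reduction space, as in this minimal sketch in plain C++ (illustrative, not the framework's code): after profiling, each device receives a share proportional to its measured speed.

#include <numeric>
#include <vector>

// speeds[d] is the profiled relative speed of device d; the result gives
// how many of the num_nodes reduction-space elements each device gets.
std::vector<int> repartition(const std::vector<double>& speeds,
                             int num_nodes) {
    double total = std::accumulate(speeds.begin(), speeds.end(), 0.0);
    std::vector<int> counts(speeds.size());
    int assigned = 0;
    for (std::size_t d = 0; d + 1 < speeds.size(); ++d) {
        counts[d] = static_cast<int>(num_nodes * speeds[d] / total);
        assigned += counts[d];
    }
    counts.back() = num_nodes - assigned;  // remainder to the last device
    return counts;
}

// Example: a GPU profiled at 2.69x the CPU's speed (as measured for
// Kmeans later) would receive roughly 73% of the nodes:
// repartition({1.0, 2.69}, num_nodes).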
GPU and CPU Execution Optimizations
Reduction localization (for generalized and irregular reductions): on the GPU, reductions are performed in shared memory first and then combined into device memory; on the CPU, each core reduces into a private reduction object, with a combination performed later (see the sketch below)
Grid tiling (for stencil computations): increases neighbor-access locality
Overlapped execution: inner tiles are processed concurrently with the exchange of boundary tiles; boundary tiles are processed afterwards
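On the CPU side, reduction localization amounts to per-thread private reduction objects, as in this minimal sketch (illustrative, not the framework's code); the GPU side follows the same idea, with per-block shared-memory copies combined into device memory.

#include <cstddef>
#include <thread>
#include <vector>

double localized_sum(const std::vector<double>& input, int num_threads) {
    std::vector<double> priv(num_threads, 0.0);  // one object per core
    std::vector<std::thread> workers;
    for (int t = 0; t < num_threads; ++t) {
        workers.emplace_back([&, t] {
            // Each thread accumulates into its own private object, so
            // the main loop needs no synchronization.
            for (std::size_t i = t; i < input.size(); i += num_threads)
                priv[t] += input[i];
        });
    }
    for (auto& w : workers) w.join();

    double result = 0.0;                         // final combination
    for (double p : priv) result += p;
    return result;
}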
Experiments
Experimental Setup
Platform: a GPU cluster of 32 heterogeneous nodes; each node has a 12-core Intel Xeon 5650 CPU and 2 NVIDIA Tesla M2070 GPUs; MVAPICH2 version 1.7, with 1 process per node plus pthread multithreading; CUDA 5.5
Applications: Kmeans (generalized reduction), Moldyn (irregular and generalized reduction), MiniMD (irregular and generalized reduction), Sobel (2D stencil), Heat3D (3D stencil)
Execution configurations: MPI (Hand), from widely distributed benchmark suites; CPU-ONLY, using only the multi-core CPU on each node; 1GPU-ONLY, using only 1 GPU on each node; CPU+1GPU, using the CPU plus 1 GPU on each node; CPU+2GPU, using the CPU plus 2 GPUs on each node
Single GPU Performance: Comparison with GPU Benchmarks
Handwritten benchmarks: Kmeans is from the Rodinia benchmark suite; Sobel is from the NVIDIA SDK
Using single-node, single-GPU execution, the framework is 6% and 15% slower for Kmeans and Sobel, respectively
Performance - Kmeans
A GPU is 2.69x faster than a CPU; CPU+1GPU is 1.2x faster than GPU only; CPU+2GPU is 1.92x faster than GPU only; 32 nodes are 1760x faster than the sequential version
Framework (CPU-ONLY) is faster than MPI (Hand) due to a difference in implementation: the handwritten code uses 1 MPI process per core, while the framework uses pthreads, requiring less communication
Performance - Moldyn
The GPU is 1.5x faster than the CPU; CPU+1GPU is 1.54x faster than GPU only; CPU+2GPU is 2.31x faster than GPU only; a 589x speedup is achieved using all 32 nodes
Performance – Heat3D
The GPU is 2.4x faster than the CPU; CPU+2GPU is 5.5x faster than CPU only; a 749x speedup is achieved using 32 nodes, comparable with handwritten code
Effect of Optimizations
Tiling (used for stencil computations) improves Sobel by up to 20%
Overlapped communication and computation (used for both irregular reductions and stencil computations) makes Moldyn and Sobel 37% and 11% faster, respectively
Conclusions
A programming model aiming to trade off programmability and performance
Pattern-based optimizations achieve considerable scalability and performance comparable with benchmarks
Reduces code sizes
Future work: cover more communication patterns; support more architectures, e.g., Intel MIC