Parallel Performance Measurement of Heterogeneous Parallel Systems with GPUs
Allen D. Malony, Scott Biersdorff, Sameer Shende, Heike Jagode†, Stanimire Tomov†, Guido Juckeland‡, Robert Dietrich‡, Duncan Poole§ and Christopher Lamb§
University of Oregon, Eugene, Department of Computer and Information Science
†University of Tennessee, Knoxville, Innovative Computing Laboratory (ICL)
‡Technische Universität Dresden, Center for Information Services and High Performance Computing (ZIH), Germany
§NVIDIA Corporation, Santa Clara, CA
ICPP Sep, 2011 ~16 Sep, 2011
Reporter: Shih-Meng Teng
Outline
Introduction
Heterogeneous Computation Model
CPU-GPU Operational Semantics
Heterogeneous Execution and Performance
CPU-GPU Measurement Approaches
▫ Synchronous method
▫ Event queue method
▫ Callback method
Heterogeneous Performance Tools
▫ Tool interoperability
Experiment
Conclusion
Introduction
The power of GPUs is giving rise to heterogeneous parallel computing, with new demands on programming environments, runtime systems, and tools to deliver high-performing applications.
The study focuses on:
▫ A heterogeneous computation model
▫ Alternative CPU-GPU measurement approaches
Performance tools:
▫ PAPI (Performance API)
▫ VampirTrace
▫ TAU (Tuning and Analysis Utilities)
Introduction (Cont.)
Basis of the measurement approach:
▫ Synchronous method
▫ Event queue method
▫ Callback method
Three experiments:
▫ Multiple GPU Test
▫ Symmetric Matrix Vector Product (SYMV)
▫ SHOC Benchmarks – Stencil2D
Heterogeneous Computation Model
CPU-GPU Operational Semantics
1. The controlling process (or thread) binds to one available GPU device.
2. It transfers the necessary input data into device memory.
3. It launches one or more kernels.
4. It copies the results back to the host.
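The four steps above can be sketched as a toy model in plain Python. No real GPU API is involved: `ToyDevice`, `run_on_device`, and the buffer names are hypothetical, introduced only to make the order of operations concrete.

```python
"""Toy model of the CPU-GPU operational steps; no real GPU API is used."""


class ToyDevice:
    """Hypothetical stand-in for a GPU device with its own memory."""

    def __init__(self, device_id):
        self.device_id = device_id
        self.memory = {}  # device memory: buffer name -> list of values

    def copy_to_device(self, name, host_data):
        # Step 2: transfer input data from host memory into device memory.
        self.memory[name] = list(host_data)

    def launch(self, kernel, src, dst):
        # Step 3: run a kernel on data that already resides on the device.
        self.memory[dst] = [kernel(x) for x in self.memory[src]]

    def copy_to_host(self, name):
        # Step 4: copy results from device memory back to host memory.
        return list(self.memory[name])


def run_on_device(devices, device_id, host_input, kernels):
    """Follow the four steps for one controlling thread."""
    device = devices[device_id]  # Step 1: bind to one available device.
    device.copy_to_device("buf0", host_input)
    src = "buf0"
    for i, kernel in enumerate(kernels, start=1):
        dst = f"buf{i}"
        device.launch(kernel, src, dst)  # kernels run one after another
        src = dst
    return device.copy_to_host(src)


if __name__ == "__main__":
    devices = [ToyDevice(0), ToyDevice(1)]
    print(run_on_device(devices, 1, [1, 2, 3], [lambda x: x * 2, lambda x: x + 1]))
```

In this model only the bound device's memory is touched; the other device stays empty, which mirrors the one-device-per-controlling-thread binding.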
Heterogeneous Execution and Performance
Two difficulties:
1. In a multi-threaded program, each thread can bind to the same GPU device, but each receives a different context, which limits interaction.
2. Communicating data between GPU devices in different physical hosts now requires three steps:
   1. Move the data from device memory to host memory on the sending host.
   2. Send the data to the receiving host.
   3. Move the data from host memory to device memory on the receiving host.
Evaluating heterogeneous execution performance involves several concerns:
▫ GPU kernel execution
▫ CPU-GPU interactions
▫ Intra-node execution
▫ Inter-node communication
CPU-GPU Measurement Approaches
Three assumptions:
1. User code is executed on a GPU device in the form of kernels, which run without direct access to performance information.
2. A given device can be logically divided into streams of execution, and in each stream kernels are executed sequentially in a pre-determined order.
3. Each kernel is executed after an associated kernel launch that runs on the CPU. These launches execute in the same sequence as the kernels (though not necessarily at the same time).
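Assumptions 2 and 3 can be illustrated with a small timeline calculation. This is a sketch under a simplifying assumption that is not stated on the slide: a kernel starts as soon as it has been launched and the previous kernel in the same stream has finished. The function name `stream_timeline` and the tick units are hypothetical.

```python
def stream_timeline(launches):
    """Compute device-side (start, end) times for kernels in one stream.

    launches: list of (launch_time, duration) pairs, given in the order
    the CPU issued the launches. Times are in arbitrary ticks.
    """
    timeline = []
    prev_end = 0
    for launch_time, duration in launches:
        # A kernel cannot start before its launch, nor before the
        # previous kernel in the same stream has completed.
        start = max(launch_time, prev_end)
        end = start + duration
        timeline.append((start, end))
        prev_end = end
    return timeline


if __name__ == "__main__":
    # The second kernel is launched at t=1 but must wait until t=5.
    print(stream_timeline([(0, 5), (1, 2), (10, 1)]))
```

The second kernel in the example shows why launch order and execution order match while launch time and execution time do not.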
Synchronous method
Event queue method
Callback method
Method support and implementation
1) Synchronous method
▫ CUDA and OpenCL
2) Event queue method
▫ CUDA and OpenCL
3) Callback method
▫ OpenCL only
4) CUPTI (CUDA Performance Tool Interface)
▫ CUPTI provides two APIs: the Callback API and the Event API.
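How the three methods differ in what they measure can be sketched with a toy timeline. The semantics below are an assumed reading of the methods, not code from PAPI, VampirTrace, or TAU: the synchronous method takes CPU timestamps around launch and synchronization, the event queue method reads device timestamps from events placed around the kernel, and the callback method receives timestamps through registered callbacks. `LAUNCH_OVERHEAD`, `run_kernel`, and the tick values are hypothetical.

```python
LAUNCH_OVERHEAD = 2  # assumed CPU-side cost of a launch, in arbitrary ticks


def run_kernel(cpu_now, stream_free_at, duration, on_begin=None, on_end=None):
    """Simulate one asynchronous launch followed by a synchronization.

    Returns (cpu_time_after_sync, device_start, device_end).
    """
    launched_at = cpu_now + LAUNCH_OVERHEAD
    device_start = max(launched_at, stream_free_at)  # wait for earlier kernels
    device_end = device_start + duration
    if on_begin is not None:
        on_begin(device_start)
    if on_end is not None:
        on_end(device_end)
    return device_end, device_start, device_end


def synchronous_method(cpu_now, stream_free_at, duration):
    # CPU timestamps before the launch and after the synchronization point.
    t0 = cpu_now
    t1, _, _ = run_kernel(cpu_now, stream_free_at, duration)
    return t1 - t0


def event_queue_method(cpu_now, stream_free_at, duration):
    # Device timestamps of events placed just before and after the kernel.
    _, begin_event, end_event = run_kernel(cpu_now, stream_free_at, duration)
    return end_event - begin_event


def callback_method(cpu_now, stream_free_at, duration):
    # Timestamps delivered to callbacks registered for kernel begin and end.
    stamps = {}

    def on_begin(t):
        stamps["begin"] = t

    def on_end(t):
        stamps["end"] = t

    run_kernel(cpu_now, stream_free_at, duration, on_begin, on_end)
    return stamps["end"] - stamps["begin"]


if __name__ == "__main__":
    args = (0, 5, 10)  # CPU time 0, stream busy until 5, kernel runs 10 ticks
    print(synchronous_method(*args), event_queue_method(*args), callback_method(*args))
```

In this model the synchronous measurement includes launch overhead and queue wait, while the other two isolate the kernel's own duration.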
Heterogeneous Performance Tools
▫ PAPI CUDA Component
▫ Vampir/VampirTrace
▫ TAU Performance System
Tool interoperability
Experiment - Multiple GPU Test
▫ Multiple GPUs are used by an application running on a single node.
▫ In the CUDA SDK simpleMultiGPU example, a main thread spawns multiple solverThreads.
▫ The test runs on one Keeneland node.
▫ Three C2070 GPUs are used to create the TAU profile and Fig. 7.
Experiment - Symmetric Matrix Vector Product (SYMV)
Use PAPI to measure the CUBLAS and MAGMA libraries on SYMV.
▫ MAGMA: Matrix Algebra on GPU and Multicore Architectures
▫ CUBLAS Library (CUDA Toolkit 3.2), NVIDIA
SYMV is a memory-bound kernel.
Use a "symmetry" method to reduce bank conflicts.
▫ Although N^2/2 element reads are saved, N^2/64 writes (and N^2/64 reads) are introduced.
Use an array-padding method to completely eliminate shared cache bank conflicts.
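The read and write counts on this slide can be turned into a quick net-traffic estimate. This is a back-of-the-envelope sketch that assumes, for illustration only, that a read and a write cost the same; the function name `symv_net_savings` is hypothetical.

```python
from fractions import Fraction


def symv_net_savings(n):
    """Net element accesses saved for an n x n matrix, using the slide's counts."""
    reads_saved = Fraction(n * n, 2)    # N^2/2 element reads avoided
    extra_writes = Fraction(n * n, 64)  # N^2/64 writes introduced
    extra_reads = Fraction(n * n, 64)   # N^2/64 reads introduced
    return reads_saved - extra_writes - extra_reads


if __name__ == "__main__":
    n = 1024
    net = symv_net_savings(n)
    print(net, net / (n * n))
```

Under that assumption the net saving is N^2/2 - 2 * N^2/64 = 15 N^2/32 element accesses, so the introduced traffic is small next to the reads avoided.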
Experiment - SHOC Benchmarks
SHOC (Scalable HeterOgeneous Computing) Benchmarks
▫ Provide tests for heterogeneous performance tools.
▫ This experiment uses the Stencil2D application.
CUDA version (Fig. 12, left), measured with VampirTrace
▫ 2-dimensional, 9-point stencil.
▫ 2 Keeneland nodes, with MPI processes on each node (one GPU per process).
OpenCL version (Fig. 12, right), measured with TAU
▫ 8 Keeneland nodes, running 24 MPI processes.
▫ Each MPI process is attached to a single GPU device.
Fig. 12 (left). Vampir trace display of Stencil2D execution on 4 MPI processes with 4 GPUs. Time-synchronized GPU counter rates convey important performance characteristics of the kernel execution.
Fig. 12 (right). TAU profile of the OpenCL version of the Stencil2D application run on the Keeneland platform with 24 MPI processes and GPUs. The kernel execution times are generally well-balanced across the GPUs.
Conclusion
Understanding the performance of scalable heterogeneous parallel systems and applications poses new challenges:
▫ Instrumentation
▫ Measurement
▫ Analysis of heterogeneous components
The research presented here demonstrates support for GPU performance measurement with CUDA and OpenCL in three well-known performance tools: PAPI, VampirTrace, and the TAU Performance System.
Thank you for listening, and have a nice day.
Q & A