*University of Utah † Lawrence Berkeley National Laboratory Roofline Model Toolkit : A Practical Tool for Architectural and Program Analysis Yu Jung Lo*, Samuel Williams†, Brian Van Straalen†, Terry Ligocki†, Matthew Cordery†, Nicholas Wright†, Mary Hall*, Leonid Oliker† *University of Utah † Lawrence Berkeley National Laboratory yujunglo@cs.utah.edu
Empirical benchmark-driven Roofline model Motivation Performance Model Architecture Characterization Application Performance Measurement Issues Hard to find technical specs for most HPC platforms to form “textbook” Roofline model. Even with technical specs, the real issue is achievable performance. Empirical benchmark-driven Roofline model
“Theoretical” Roofline Model Peak FP Performance Gflop/s = min Peak GFlop/s Memory BW ∗Arithmetic Intensity Peak Memory Bandwidth
Micro Benchmarks Init Compute Sync Driver int main () { #pragma omp parallel private(id) { uint64_t n, t; initialize(&A[nid]); for (n = 16; n < SIZE; n *= 1.1) { for (t = 1; t < TRIALS; t *= 2) { // start timer here Kernel(n, t, &A[nid]); // stop timer here #pragma omp barrier #pragma omp master { MPI_Barrier(MPI_COMM_WORLD); } }}} Bandwidth void Kernel (uint64_t size, uint64_t trials, double * __restrict__ A) { double alpha = 0.5; uint64_t i, j; for (j = 0; j < trials; ++j ) { for (i = 0; i < nsize; ++i) { A[i] = A[i] + alpha; } alpha = alpha * 0.5; }} Init Compute Sync double bytes = 2 * sizeof(double) * (double)n * (double)t;
Micro Benchmarks (cont’) Driver int main () { #pragma omp parallel private(id) { uint64_t n, t; for (n = 16; n < SIZE; n *= 1.1) { for (t = 1; t < TRIALS; t *= 2) { // start timer here Kernel(n, t, &A[nid]); // stop timer here #pragma omp barrier #pragma omp master { MPI_Barrier(MPI_COMM_WORLD); } }}} GFlops void Kernel (uint64_t size, uint64_t trials, double * __restrict__ A) { double alpha = 0.5; uint64_t i, j; for (j = 0; j < trials; ++j ) { for (i = 0; i < nsize; ++i) { double beta = 0.8; #if FLOPPERITER == 2 beta = beta * A[i] + alpha; #elif FLOPPERITER == 4 … #endif A[i] = beta; } alpha = alpha * 0.5; }} Compute double bytes = FLOPPERITER * (double)n * (double)t;
Architectural Platforms Mira (IBM Blue Gene/Q) Edison (Intel Xeon CPU) Babbage (Intel Xeon Phi) Titan (Nvidia K20x)
Bandwidth Benchmark Results Edison (Intel Xeon CPU) Mira (IBM Blue Gene/Q) Babbage (Intel Xeon Phi) Titan (Nvidia K20x) 1 MB
Bandwidth Benchmark Results (cont’) Titan (Nvidia K20x) dim3 gpuThreads(64); dim3 gpuBlocks(224); // start timer here #if defined (GLOBAL_TRIAL_INSIDE) global_trialInside <<<gpuBlocks, gpuThreads>>> (nsize, trials, d_buf); #elif defined(GLOBAL_TRIAL_OUTSIDE) for (uint64_t t = 0; t < trials; ++t) { global_trialOutside <<<gpuBlocks, gpuThreads>>> (nsize, d_buf, alpha); alpha = alpha * (1 - 1e-8); } #else sharedmem <<<gpuBlocks, gpuThreads>>> (nsize, trials, d_buf); #endif cudaDeviceSynchronize(); // stop timer here A B C (blocks, threads)
Optimized GFlops Benchmarks C Code AVX Code (Edison) double alpha = 0.5; for (j = 0; j < ntrials; ++j ) { for (i = 0; i < nsize; ++i) { double beta = 0.8; beta = beta * A[i] + alpha; A[i] = beta; } alpha = alpha * (1e-8); } for (j = 0 ; j < ntrials; ++j) { for (i = 0 ; i < nsize ; i += 8) { bv1 = _mm256_set1_pd(0.8); v1 = _mm256_load_pd(&A[i]); bv1 = _mm256_mul_pd(bv1, v1); bv1 = _mm256_add_pd(bv1, v1); _mm256_store_pd(&A[i], bv1); // repeat above operations for A[i+4] } alpha = alpha * (1e-8); av = _mm256_set1_pd(alpha); } Unroll by 8 2 Flops per Element QPX Code (Mira) AVX-512 Code (Babbage) for (j = 0 ; j < ntrials ; ++j){ for (i = 0 ; i < nsize ; i += 8){ bv1 = vec_splats(0.8); v1 = vec_ld(0L, &A[i]); bv1 = vec_madd(bv1,v1,av); vec_st(bv1, 0L, &A[i]); // repeat above operations for A[i+4] } alpha = alpha * (1e-8); vec_splats(alpha); } for (j = 0 ; j < ntrials ; ++j) { for (i = 0 ; i < nsize ; i += 8) { bv1 = _mm512_set1_pd(0.8); v1 = _mm512_load_pd(&A[i]); bv1 = _mm512_fmadd_pd(bv1,v1,av); _mm512_store_pd(&A[i], bv1); } alpha = alpha * (1e-8); av = _mm512_set1_pd(alpha); } Fused Multiply & Add Fused Multiply & Add
Gflops Performance Edison (Intel Xeon CPU), 8 FPE Mira (IBM Blue Gene/Q), 16 FPE Turbo Boost Theoretical Peak C code Optimized code Babbage (Intel Xeon Phi), 16 FPE 256 FPE, SIMD and unrolled by 16
Gflops Performance (cont’) Edison (Intel Xeon CPU) Mira (IBM Blue Gene/Q) Babbage (Intel Xeon Phi) Titan (Nvidia K20x)
Beyond the Roofline
Separate Address Spaces Unified Virtual Addressing (UVA) CUDA Unified Memory CUDA’s Memory Concept Four Approaches to Manage Memory Explicit Copy Separate Address Spaces Pageable Host with Explicit Copy 1 Page-locked Host with Explicit Copy 2 Unified Virtual Addressing (UVA) 3 Page-locked Host with Zero Copy Unified Memory with Zero Copy 4 Unified Memory Implicit Copy
CUDA Managed Memory Benchmark int main() { // start timer here… for (uint64_t j = 0; j < trials; ++j) { for (uint64_t k = 0; k < reuse; ++k) { GPUKERNEL <<<blocks, threads>>> (n, d_buf, alpha); alpha = alpha * (1e-8); } CPUKERNEL(n, h_buf, alpha); } // stop timer here… double bytes = 2 * sizeof(double) * (double)n * (double)trials * (double)(reuse + 1); } #if defined(_CUDA_ZEROCPY) || defined(_CUDA_UM) cudaDeviceSynchronize(); #else cudaMemcpy(d_buf, h_buf, SIZE, cudaMemcpyDefault); #endif K iterations 3 4 #if defined(_CUDA_ZEROCPY) || defined(_CUDA_UM) cudaDeviceSynchronize(); #else cudaMemcpy(h_buf, d_buf, SIZE, cudaMemcpyDefault); #endif 1 2 K + 1 iterations
CUDA Managed Memory Performance 1 Pageable host w/ explicit copy 2 Page-locked host w/ explicit copy 128 GB/s 156 GB/s 3 Page-locked host w/ zero copy 4 Unified Memory w/ zero copy * GPU driver version: 331.89; toolkit version: 6.0beta
Construct the Roofline Model
Empirical Roofline Model Edison (Intel Xeon CPU) Mira (IBM Blue Gene/Q) Babbage (Intel Xeon Phi) Titan (Nvidia K20x)
Application Analysis : MiniDFT Flat MPI MPI tasks x OpenMP threads
Conclusion Way to get high bandwidth on manycore and accelerated architectures: Massive parallelism on large working sets. Way to get high Gflops: Sufficiently SIMDized and unrolled code. At least 2 threads per core for in-order processors. High FPE for manycore and accelerators. Way to get high CUDA managed memory performance: Reuse the data heavily on the device, operate on large working sets, and copy explicitly between host and device.
Questions?
Appendix
Appendix