Stanford CS 193G, Lecture 15: Optimizing Parallel GPU Performance
John Nickolls
© 2010 NVIDIA Corporation

Optimizing parallel performance
- Understand how software maps to architecture
- Use heterogeneous CPU+GPU computing
- Use massive amounts of parallelism
- Understand SIMT instruction execution
- Enable global memory coalescing
- Understand cache behavior
- Use shared memory
- Optimize memory copies
- Understand PTX instructions

CUDA Parallel Threads and Memory
- Thread: per-thread registers and private local memory (float LocalVar;)
- Block: per-block shared memory (__shared__ float SharedVar;)
- Grids (Grid 0, Grid 1, ...) launched in sequence: per-application device global memory (__device__ float GlobalVar;)
[Diagram: threads grouped into blocks, blocks grouped into grids, with the per-thread, per-block, and per-application memory spaces]
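
The slide's three declarations can be pulled together into a minimal sketch showing where each variable lives. The variable names follow the slide; the kernel name demoSpaces and its body are illustrative assumptions, not part of the original deck.

    // Per-application device global memory: visible to all threads of all grids.
    __device__ float GlobalVar;

    __global__ void demoSpaces()
    {
        // Per-block shared memory: visible to all threads in this block.
        __shared__ float SharedVar;

        // Per-thread local variable: lives in a register, or in per-thread
        // local memory if it spills.
        float LocalVar = threadIdx.x * 1.0f;

        if (threadIdx.x == 0)
            SharedVar = LocalVar;      // one thread initializes the shared data
        __syncthreads();               // all threads in the block see SharedVar

        if (threadIdx.x == 0 && blockIdx.x == 0)
            GlobalVar = SharedVar + LocalVar;
    }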

Using CPU+GPU Architecture
- Heterogeneous system architecture: use the right processor and memory for each task
- CPU excels at executing a few serial threads: fast sequential execution, low-latency cached memory access
- GPU excels at executing many parallel threads: scalable parallel execution, high-bandwidth parallel memory access
[Diagram: CPU with cache and host memory connected over a PCIe bridge to a GPU with SMs, shared memory, caches, and device memory]

CUDA kernel maps to Grid of Blocks
- kernel_func<<< ... >>>(param, …);
- The host thread launches the kernel as a grid of thread blocks, which the GPU distributes across its SMs
[Diagram: host thread and host memory on the CPU side; grid of thread blocks spread across the GPU's SMs, shared memory, caches, and device memory]
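
As a concrete illustration of the launch syntax, here is a hedged host-side sketch. The kernel body, the problem size N, and the 256-thread block size are illustrative assumptions; only the kernel_func name and the <<< >>> launch form come from the slide.

    #include <cuda_runtime.h>

    __global__ void kernel_func(float* param)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        param[i] += 1.0f;
    }

    int main()
    {
        const int N = 1 << 20;                 // illustrative problem size
        float* d_param = nullptr;
        cudaMalloc(&d_param, N * sizeof(float));

        dim3 block(256);                       // threads per block
        dim3 grid(N / block.x);                // blocks in the grid (N divisible by 256 here)

        // The host thread launches a grid of thread blocks onto the GPU's SMs.
        kernel_func<<<grid, block>>>(d_param);
        cudaDeviceSynchronize();

        cudaFree(d_param);
        return 0;
    }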

Mapping onto the hardware
- Thread blocks execute on an SM; thread instructions execute on a core
- float myVar;             per-thread registers and local memory
- __shared__ float shVar;  per-block shared memory
- __device__ float glVar;  per-application device global memory
[Diagram: a thread block mapped onto an SM and its shared memory; threads mapped onto cores with their registers and local memory; device global memory shared by the whole application]

CUDA Parallelism
- CUDA virtualizes the physical hardware
- A thread is a virtualized scalar processor (registers, PC, state)
- A block is a virtualized multiprocessor (threads, shared memory)
- Threads and blocks are scheduled onto physical hardware without pre-emption; they launch and run to completion
- Blocks execute independently
[Diagram: host thread launching a grid of thread blocks onto the GPU's SMs]

Expose Massive Parallelism
- Use hundreds to thousands of thread blocks
  - A thread block executes on one SM
  - Need many blocks to use tens of SMs
  - An SM executes 2 to 8 concurrent blocks efficiently
  - Need many blocks to scale across different GPUs
  - Coarse-grained data parallelism, task parallelism
- Use hundreds of threads per thread block
  - A thread instruction executes on one core
  - Need 384-512 threads/SM to use all the cores all the time
  - Use a multiple of 32 threads (the warp size) per thread block
  - Fine-grained data parallelism, vector parallelism, thread parallelism, instruction-level parallelism
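
A common way to pick a launch configuration consistent with these guidelines is sketched below; the kernel processN, the doubling it performs, and the 256-thread block size are illustrative assumptions.

    __global__ void processN(const float* in, float* out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)                      // guard: the grid is rounded up past n
            out[i] = 2.0f * in[i];
    }

    void launchProcessN(const float* d_in, float* d_out, int n)
    {
        // A multiple of 32 (the warp size); 256 threads/block is a common starting point.
        int threadsPerBlock = 256;
        // Round up so every element is covered; many blocks keep many SMs busy.
        int blocksPerGrid = (n + threadsPerBlock - 1) / threadsPerBlock;
        processN<<<blocksPerGrid, threadsPerBlock>>>(d_in, d_out, n);
    }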

Scalable Parallel Architectures run thousands of concurrent threads
- 32 SP cores: 3,072 threads
- 128 SP cores: 12,288 threads
- 240 SP cores: 30,720 threads
[Diagram: three GPUs of increasing size, each running thousands of concurrent threads]

Fermi SM increases instruction-level parallelism
- 512 CUDA cores, 24,576 threads
[Diagram: Fermi chip with its SMs; one SM and one CUDA core highlighted]

SM parallel instruction execution
- SIMT (Single Instruction, Multiple Thread) execution
  - Threads run in groups of 32 called warps
  - Threads in a warp share an instruction unit (IU)
  - Hardware automatically handles branch divergence
- Hardware multithreading
  - Hardware resource allocation and thread scheduling
  - Hardware relies on threads to hide latency
- Threads have all resources needed to run
  - Any warp not waiting for something can run
  - Warp context switches are zero overhead
[Diagram: Fermi SM with instruction cache, dual warp schedulers and dispatch units, register file, 32 cores, 16 load/store units, 4 special function units, interconnect network, 64 KB configurable cache/shared memory, and uniform cache]
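
To make the branch-divergence point concrete, here is a hedged sketch contrasting a warp-divergent branch with a warp-uniform one; the kernel names and the arithmetic inside the branches are illustrative assumptions.

    // Divergent: even and odd lanes of the same warp take different paths,
    // so the SIMT hardware runs both paths serially with threads masked off.
    __global__ void divergentBranch(float* data)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (threadIdx.x % 2 == 0)
            data[i] = data[i] * 2.0f;
        else
            data[i] = data[i] + 1.0f;
    }

    // Uniform per warp: all 32 threads of a warp take the same path,
    // so no divergence penalty is paid.
    __global__ void warpUniformBranch(float* data)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if ((threadIdx.x / 32) % 2 == 0)   // same decision for a whole warp
            data[i] = data[i] * 2.0f;
        else
            data[i] = data[i] + 1.0f;
    }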

SIMT Warp Execution in the SM
- Warp: a set of 32 parallel threads that execute an instruction together
- SIMT (Single-Instruction, Multiple-Thread) applies each instruction to a warp of independent parallel threads
- The SM's dual-issue pipelines select two warps to issue to the parallel cores
- A SIMT warp executes each instruction for 32 threads
  - Predicates enable/disable execution of individual threads
  - A stack manages per-thread branching
  - Redundant regular computation is often faster than irregular branching

Enable Global Memory Coalescing
- Individual threads access independent addresses
- A thread loads/stores 1, 2, 4, 8, or 16 B per access
  - LD.sz / ST.sz, where sz = {8, 16, 32, 64, 128} bits per thread
- For the 32 parallel threads in a warp, the SM load/store units coalesce individual thread accesses into the minimum number of 128 B cache-line accesses or 32 B memory-block accesses
- Accesses serialize to distinct cache lines or memory blocks
- Use nearby addresses for threads in a warp
- Use unit-stride accesses when possible
- Use a Structure of Arrays (SoA) layout to get unit stride

Unit stride accesses coalesce well

    __global__ void kernel(float* arrayIn, float* arrayOut)
    {
        int i = blockDim.x * blockIdx.x + threadIdx.x;

        // Stride-1 coalesced load access
        float val = arrayIn[i];

        // Stride-1 coalesced store access
        arrayOut[i] = val + 1;
    }
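
The Structure-of-Arrays advice from the previous slide can be illustrated the same way. This hedged sketch contrasts an Array-of-Structures layout with an SoA layout; the ParticleAoS type, its fields, and the kernels are illustrative assumptions.

    // Array of Structures (AoS): threads reading only .x generate strided,
    // poorly coalesced accesses (stride = sizeof(ParticleAoS) = 16 B).
    struct ParticleAoS { float x, y, z, w; };

    __global__ void scaleX_AoS(ParticleAoS* p, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) p[i].x *= 2.0f;      // every 4th float: strided access
    }

    // Structure of Arrays (SoA): consecutive threads touch consecutive floats,
    // giving unit-stride, fully coalesced accesses.
    __global__ void scaleX_SoA(float* x, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) x[i] *= 2.0f;        // stride-1 access
    }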

Example Coalesced Memory Access
- 16 threads within a warp load 8 B per thread: LD.64 Rd, [Ra + offset];
- The 16 individual 8 B thread accesses fall in two 128 B cache lines
- LD.64 coalesces the 16 individual accesses into 2 cache-line accesses
- Coalescing implements parallel vector scatter/gather
  - Loads from the same address are broadcast
  - Stores to the same address select a winner
  - Atomics to the same address serialize
- Coalescing scales gracefully with the number of unique cache lines or memory blocks accessed
[Diagram: threads 0-15 each loading 8 B from consecutive addresses that span two 128 B cache lines]
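
A hedged sketch of the access pattern described above: each thread loads and stores one float2 (8 B), which the compiler typically turns into 64-bit loads/stores like the LD.64 on the slide. The kernel name is an illustrative assumption.

    __global__ void copyFloat2(const float2* in, float2* out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
        {
            // 8 B per thread; a full warp's 32 accesses span 256 B,
            // i.e. two 128 B cache lines, coalesced into two line accesses.
            float2 v = in[i];
            out[i] = v;
        }
    }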

Memory Access Pipeline
- Load/store/atomic memory accesses are pipelined
- Latency to DRAM is a few hundred clocks
- Batch load requests together, then use the returned values
- Latency to shared memory / L1 cache is 10-20 cycles
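
One way to "batch load requests together, then use the returned values" is to issue several independent loads before consuming any of them, as in this hedged sketch; the factor of four, the grid-stride spacing, and the kernel name are illustrative assumptions.

    __global__ void sum4(const float* in, float* out, int n)
    {
        // Each thread handles 4 elements spaced a grid-stride apart.
        int i      = blockIdx.x * blockDim.x + threadIdx.x;
        int stride = gridDim.x * blockDim.x;

        if (i + 3 * stride < n)
        {
            // Issue four independent loads; the pipelined memory system
            // overlaps their latencies instead of serializing them.
            float a = in[i];
            float b = in[i + stride];
            float c = in[i + 2 * stride];
            float d = in[i + 3 * stride];

            // Only now consume the returned values.
            out[i] = a + b + c + d;
        }
    }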

Fermi Cached Memory Hierarchy
- Configurable L1 cache per SM
  - 16 KB L1$ / 48 KB shared memory, or
  - 48 KB L1$ / 16 KB shared memory
- L1 caches per-thread local accesses (register spilling, stack access)
- L1 caches global load accesses; global stores bypass L1
- Shared 768 KB L2 cache; L2 speeds up atomic operations
- Caching captures locality, amplifies bandwidth, and reduces latency
- Caching aids irregular or unpredictable accesses
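
The L1/shared split can be requested per kernel through the CUDA runtime, as in this hedged sketch using cudaFuncSetCacheConfig; the kernel myKernel and its body are illustrative assumptions.

    #include <cuda_runtime.h>

    __global__ void myKernel(float* data)      // illustrative kernel
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        data[i] *= 2.0f;
    }

    void configureCache()
    {
        // Prefer the 48 KB L1 / 16 KB shared split for a kernel that spills
        // registers or reads global memory irregularly...
        cudaFuncSetCacheConfig(myKernel, cudaFuncCachePreferL1);

        // ...or the 16 KB L1 / 48 KB shared split for a kernel built around
        // large shared-memory tiles.
        // cudaFuncSetCacheConfig(myKernel, cudaFuncCachePreferShared);
    }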

Use per-Block Shared Memory
- Latency is an order of magnitude lower than L2 or DRAM
- Bandwidth is 4x-8x higher than L2 or DRAM
- Place data blocks or tiles in shared memory when the data is accessed multiple times
- Communicate among threads in a block using shared memory
- Use synchronization barriers between communication steps
  - __syncthreads() is a single bar.sync instruction - very fast
- Threads of a warp access shared memory banks in parallel via a fast crossbar network
- Bank conflicts can occur, incurring a minor performance impact
- Pad 2D tiles with an extra column for parallel column access when the tile width equals the number of banks (16 or 32)
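
The tiling and padding advice is commonly illustrated with a shared-memory matrix transpose. This hedged sketch assumes a 32x32 tile, a square n x n matrix with n a multiple of 32, and a launch of grid(n/32, n/32) and block(32, 32); none of these specifics come from the slide.

    #define TILE_DIM 32

    __global__ void transposeTiled(const float* in, float* out, int n)
    {
        // Extra column of padding so the column-wise reads in the second phase
        // hit different shared-memory banks (avoids 32-way bank conflicts).
        __shared__ float tile[TILE_DIM][TILE_DIM + 1];

        int x = blockIdx.x * TILE_DIM + threadIdx.x;
        int y = blockIdx.y * TILE_DIM + threadIdx.y;

        // Coalesced read from global memory into the shared tile.
        tile[threadIdx.y][threadIdx.x] = in[y * n + x];

        __syncthreads();   // one bar.sync: all loads complete before any reuse

        // Swap block indices so the write is also coalesced.
        x = blockIdx.y * TILE_DIM + threadIdx.x;
        y = blockIdx.x * TILE_DIM + threadIdx.y;
        out[y * n + x] = tile[threadIdx.x][threadIdx.y];
    }

The padding matters because a 32-wide tile without it would place an entire column in the same bank, serializing the 32 reads of the second phase.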

Using cudaMemcpy()
- cudaMemcpy() invokes a DMA copy engine
- Minimize the number of copies; use data as long as possible in a given place
- PCIe gen2 peak bandwidth = 6 GB/s
- GPU load/store DRAM peak bandwidth = 150 GB/s
[Diagram: cudaMemcpy() moving data between host memory and device memory across the PCIe bridge]
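
A hedged sketch of the typical copy pattern, keeping data resident on the device between kernel launches; the function name and sizes are illustrative assumptions.

    #include <cuda_runtime.h>

    void runOnGpu(const float* h_in, float* h_out, int n)
    {
        size_t bytes = n * sizeof(float);
        float* d_buf = nullptr;
        cudaMalloc(&d_buf, bytes);

        // One copy in over PCIe (~6 GB/s peak on gen2)...
        cudaMemcpy(d_buf, h_in, bytes, cudaMemcpyHostToDevice);

        // ...then as much work as possible against device DRAM (~150 GB/s peak),
        // launching kernels that read and write d_buf in place (omitted here)...

        // ...and one copy back out.
        cudaMemcpy(h_out, d_buf, bytes, cudaMemcpyDeviceToHost);

        cudaFree(d_buf);
    }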

Overlap computing and CPU-GPU transfers
- cudaMemcpy() invokes data transfer engines
- CPU-to-GPU and GPU-to-CPU data transfers
- Overlap the transfers with CPU and GPU processing
[Pipeline snapshot: kernels 0-3 staged so that while one kernel runs on the GPU, the copy engines move the next kernel's input in and the previous kernel's results out, and the CPU prepares further work]
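
A hedged sketch of the overlap pattern: data is split into chunks that are copied in, processed, and copied out in separate CUDA streams, so the copy engines and SMs work concurrently. It assumes pinned host memory (required for asynchronous copies to overlap), four streams, and illustrative names; none of these specifics come from the slide.

    #include <cuda_runtime.h>

    __global__ void process(float* data, int n)     // illustrative kernel
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) data[i] *= 2.0f;
    }

    void pipelined(float* h_data, int n, int nChunks)
    {
        int chunk = n / nChunks;                     // assume n divisible by nChunks
        size_t chunkBytes = chunk * sizeof(float);

        float* d_data = nullptr;
        cudaMalloc(&d_data, n * sizeof(float));
        // h_data must be pinned (cudaMallocHost / cudaHostRegister) for
        // cudaMemcpyAsync to actually overlap with kernel execution.

        cudaStream_t streams[4];                     // up to 4 in-flight chunks
        for (int s = 0; s < 4; ++s) cudaStreamCreate(&streams[s]);

        for (int c = 0; c < nChunks; ++c)
        {
            cudaStream_t s = streams[c % 4];
            float* d_chunk = d_data + c * chunk;
            float* h_chunk = h_data + c * chunk;

            // Copy in, compute, copy out: all asynchronous within one stream,
            // so chunk c+1's copy overlaps chunk c's kernel.
            cudaMemcpyAsync(d_chunk, h_chunk, chunkBytes, cudaMemcpyHostToDevice, s);
            process<<<(chunk + 255) / 256, 256, 0, s>>>(d_chunk, chunk);
            cudaMemcpyAsync(h_chunk, d_chunk, chunkBytes, cudaMemcpyDeviceToHost, s);
        }

        cudaDeviceSynchronize();
        for (int s = 0; s < 4; ++s) cudaStreamDestroy(streams[s]);
        cudaFree(d_data);
    }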

Fermi runs independent kernels in parallel
- Concurrent kernel execution plus faster context switching
[Diagram: serial kernel execution, with kernels 1-5 running one after another over time, versus parallel kernel execution, where independent kernels overlap]

Minimize thread runtime variance
- Warps executing a kernel with variable run time can leave SMs waiting on a single long-running warp
[Diagram: SMs over time, with most warps finishing early while one long-running warp delays completion]

PTX Instructions
- Generate a program.ptx file with nvcc -ptx; see ptx_isa_2.0.pdf
- PTX instructions: op.type dest, srcA, srcB;
  - type = .b32, .u32, .s32, .f32, .b64, .u64, .s64, .f64
  - memtype = .b8, .b16, .b32, .b64, .b128
- PTX instructions map directly to Fermi instructions
  - Some map to instruction sequences, e.g. div, rem, sqrt
- PTX virtual register operands map to SM registers
- Arithmetic: add, sub, mul, mad, fma, div, rcp, rem, abs, neg, min, max, setp.cmp, cvt
- Function: sqrt, sin, cos, lg2, ex2
- Logical: mov, selp, and, or, xor, not, cnot, shl, shr
- Memory: ld, st, atom.op, tex, suld, sust
- Control: bra, call, ret, exit, bar.sync

Optimizing Parallel GPU Performance
- Understand the parallel architecture
- Understand how the application maps to the architecture
- Use LOTS of parallel threads and blocks
- It is often better to redundantly compute in parallel
- Access memory in local regions
- Leverage high memory bandwidth
- Keep data in GPU device memory
- Experiment and measure

Questions?