Some things are naturally parallel
Sequential Execution Model / SISD

  int a[N]; // N is large
  for (i = 0; i < N; i++)
      out[i] = a[i] * fade;

One flow of control / one thread
One instruction at a time
Optimizations possible at the machine level
Data Parallel Execution Model / SIMD

  int a[N]; // N is large
  for all elements do in parallel
      out[i] = a[i] * fade;

This has been tried before: ILLIAC III, UIUC, 1966
http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=4038028&tag=1
http://ed-thelen.org/comp-hist/vs-illiac-iv.html
Single Program Multiple Data / SPMD

  int a[N]; // N is large
  for all elements do in parallel
      new = a[i] * fade
      if (new > 255) out[i] = new;

Code is statically identical across all threads
Execution path may differ
The model used in today's Graphics Processors
SPMD Execution Order: Undefined
  All in parallel
  Sequential
  Any other order
Programming Model

  for index in RANGE
      Kernel(index)

RANGE can be 1D, 2D, or 3D
Use 1D for the time being
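As a rough sketch (not from the slides), the abstract "for index in RANGE: Kernel(index)" maps onto a CUDA launch roughly as follows; the kernel body and variable names are placeholders:

  // Hypothetical sketch: the "for index in RANGE" model expressed in CUDA.
  __global__ void Kernel(float *data, int range)
  {
      int index = blockIdx.x * blockDim.x + threadIdx.x;  // this thread's point in RANGE
      if (index < range)
          data[index] = data[index] * 2.0f;               // placeholder per-index work
  }

  // Host side: one thread per index, 1D RANGE.
  // Kernel<<<(range + 255) / 256, 256>>>(d_data, range);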
1D Range – Execution Model

  int a[N]; // N is large
  for all elements of the array
      a[i] = a[i] * fade

Lots of independent computations
A Kernel is executed over a RANGE (grid) of THREADs
Exposing Locality to the Programmer

The RANGE (grid) is divided into blocks
Threads within a block can cooperate and coordinate
Intra-Block Communication and Synchronization

  thread 10          thread 11
  a[10] = in[10]     a[11] = in[11]    // communication through shared data
  barrier            barrier           // synchronization
  a[10] += a[11]

Within one block of the RANGE (grid)
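A minimal CUDA sketch of the pattern above (assuming a block size of 256 and illustrative names): each thread loads one element into block-local shared memory, the whole block synchronizes at a barrier, and only then do the even threads read what their odd neighbors wrote.

  __global__ void pairwise_add(const int *in, int *out)
  {
      __shared__ int a[256];                  // block-local (shared) memory; assumes blockDim.x == 256
      int t = threadIdx.x;

      a[t] = in[blockIdx.x * blockDim.x + t]; // communication: each thread fills one slot
      __syncthreads();                        // synchronization: barrier for the whole block

      if ((t & 1) == 0)                       // even threads add their odd neighbor's value
          out[blockIdx.x * blockDim.x + t] = a[t] + a[t + 1];
  }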
GPGPU Processor – Basic Unit

Shared Multiprocessor: a FETCH / DECODE / SCHEDULE front end,
a 32-wide WARP, per-thread REGisters, EXEC units,
LOCAL MEM, and a CACHE
WARP Execution and Control Flow Divergence

  if (in[i] == 0)
      out[i] = sqrt(x);
  else
      out[i] = 10;

Within a WARP: the threads where in[i] == 0 execute out[i] = sqrt(x)
while the others sit idle, then the roles flip for out[i] = 10
(the two paths are serialized over time)
Control Flow Divergence Cont'd

Bad scenario: in[i] == 0 differs within a single WARP, so the warp
serializes both paths while some lanes sit idle
Good scenario: the outcome is uniform within each warp (e.g., WARP #1
takes one path and WARP #2 the other), so no lanes sit idle
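A hedged illustration of the two scenarios (assuming 32-thread warps; the conditions are contrived): branching on a per-thread value can split a warp, while branching on a value that is uniform across each warp does not.

  __global__ void divergence_demo(const int *in, float *out, float x)
  {
      int i = blockIdx.x * blockDim.x + threadIdx.x;

      // Bad scenario: in[i] varies within a warp, so both paths execute serially.
      if (in[i] == 0) out[i] = sqrtf(x);
      else            out[i] = 10.0f;

      // Good scenario: the condition is identical for all 32 threads of a warp,
      // so each warp takes exactly one path and no lanes sit idle.
      int warp_id = i / 32;
      if (warp_id % 2 == 0) out[i] += 1.0f;
      else                  out[i] -= 1.0f;
  }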
GPGPU Processor – Overall Architecture

A Block Scheduler feeds multiple Shared Multiprocessors
(each with FETCH / DECODE / SCHEDULE, a 32-wide WARP,
registers, EXEC units, LOCAL MEM, and a CACHE),
backed by a Shared Cache and Global Memory
Why are threads useful? Parallelism

Concurrency: do multiple things in parallel
  Uses more hardware (needs more functional units)
  Gets higher performance
  Application must have parallelism
Why are threads useful #2 – Tolerating Stalls

Often a thread stalls, e.g., on a memory access
Multiplex the same functional unit across threads
Get more performance at a fraction of the cost
GPGPU Processor – Basic Unit

Shared Multiprocessor as before, plus a Warp Pool:
the FETCH / DECODE / SCHEDULE front end picks among many resident
warps to keep the EXEC units busy
GPU: bandwidth optimized – latencies are long

A GPU ADD takes 24 GPU cycles (true of the GTX280); a CPU ADD takes 1 cycle
The GPU cycle is roughly 4x a CPU cycle
For the systems in the lab (GTX480):
  Need ~100 threads to break even
  1000s of threads for the GPU to be better
Architecture Scalability
GF100 Architecture Overview – Compute (64-bit)
GF100 Architecture – Complete

512 CUDA cores
16 PolyMorph Engines
4 raster units
64 texture units
48 ROP units
384-bit GDDR5 (6 channels, 64 bits/channel)
SM Architecture

Streaming Multiprocessor (SM)
  32 Streaming Processors (SP): 32 INT or FP (32-bit), 16 DP (64-bit)
  4 Super Function Units (SFU)
  16 Load/Store units
Multi-threaded instruction dispatch
  Up to 1536 threads active per SM: 32 (threads/warp) x 48 (warps)
  24,576 threads across all SMs
  Up to 8 concurrent blocks, 1024 threads/block limit
  Shared instruction fetch per 32 threads
  Covers the latency of texture/memory loads
80+ GFLOPS
16 KB/48 KB shared memory, 48 KB/16 KB L1 cache (configurable split)
DRAM texture and memory access
GK110 Architecture (GTX 7xx series)
Why GPUs now? Why not before?
Programmer's view: GPU as a co-processor (CPU data is from 2008 – matches our lab machines)

CPU <-> GPU link: 3 GB/s – 8 GB/s
CPU <-> Memory: 6.4 GB/s – 31.92 GB/s, 8 B per transfer
GPU <-> GPU Memory: 177.4 GB/s / 288.4 GB/s
GPU Memory: 1 GB (GTX480) / 3 GB (GTX780; 6 of those)

Key suppliers: Nvidia and AMD
Execution Timeline (CPU / Host vs. GPU / Device)

1. Copy to GPU memory
2. Launch GPU kernel
2'. Synchronize with the GPU
3. Copy from GPU memory
Programmer's view: first create the data in CPU memory
Programmer's view: then copy it to GPU memory
Programmer's view: the GPU starts computation – it runs a kernel; the CPU can also continue
Programmer's view: the CPU and GPU synchronize
Programmer's view: copy the results back to the CPU
(GTX780: kernels can also spawn kernels)
Programming Languages

CUDA
  NVidia; has the market lead
OpenCL
  Backed by many vendors, including Nvidia
  A CUDA superset with somewhat different syntax
  Can target many different devices, e.g., CPUs + programmable accelerators
  Fairly new
Both are evolving
Computation Partitioning

At the highest level, think of the computation as a series of loops:

  for (i = 0; i < big_number; i++)
      a[i] = some function
  for (i = 0; i < big_number; i++)
      a[i] = some other function

Each such loop becomes a Kernel
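As a sketch of that partitioning (kernel names and bodies are placeholders, not from the slides): each independent loop becomes its own kernel, launched one after the other.

  // Loop 1 becomes one kernel ...
  __global__ void some_function(float *a, int n)
  {
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      if (i < n) a[i] = a[i] * 2.0f;          // "some function" placeholder
  }

  // ... and loop 2 becomes a second kernel, launched afterwards.
  __global__ void some_other_function(float *a, int n)
  {
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      if (i < n) a[i] = a[i] + 1.0f;          // "some other function" placeholder
  }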
Some things are naturally parallel
My first CUDA Program

GPU code:
  __global__ void arradd (float *a, float fade, int N)
  {
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      if (i < N) a[i] = a[i] * fade;
  }

CPU code:
  int main ()
  {
      float h[N];
      float *d;
      cudaMalloc ((void **) &d, SIZE);
      cudaMemcpy (d, h, SIZE, cudaMemcpyHostToDevice);
      arradd <<< n_blocks, block_size >>> (d, 10.0, N);
      cudaThreadSynchronize ();
      cudaMemcpy (h, d, SIZE, cudaMemcpyDeviceToHost);
      CUDA_SAFE_CALL (cudaFree (d));
  }
Per Kernel Computation Partitioning

Computation Grid: 2D case – a grid of blocks, each block a grid of threads

Threads within a block can communicate/synchronize
  They run on the same core
Threads across blocks can't communicate
  They shouldn't touch each other's data
  Behavior is undefined otherwise
Per Kernel Computation Partitioning

Computation Grid: 2D case
One thread can process multiple data elements
Other mappings are possible and often desirable – see the sketch below
More on this when we talk about how to optimize for performance
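One common way to let a thread handle several elements is a grid-stride loop; this is a generic sketch (not the slides' code), reusing the fade computation as the per-element work:

  __global__ void fade_strided(const unsigned char *in, unsigned char *out,
                               float f, int n)
  {
      int stride = gridDim.x * blockDim.x;                  // total number of threads launched
      for (int i = blockIdx.x * blockDim.x + threadIdx.x;   // this thread's first element
           i < n;
           i += stride)                                     // then every stride-th element
      {
          unsigned int v = in[i] * f;
          out[i] = (v > 255) ? 255 : v;
      }
  }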
Fade example

Each thread will process one pixel:

  for all elements do in parallel
      out[i] = a[i] * fade;
Code Skeleton

CPU:
  Initialize image from file
  Allocate IN and OUT buffers on the GPU
  Copy image to the GPU IN buffer
  Launch GPU kernel
    Reads IN, produces OUT
  Copy OUT back to the CPU
  Write image to a file

GPU:
  Launch a thread per pixel
GPU Kernel pseudo-code

  __global__ void fade (unsigned char *in, unsigned char *out,
                        float f, int xmax, int ymax)
  {
      unsigned int v = in[x][y];
      v = v * f;
      if (v > 255) v = 255;
      out[x][y] = v;
  }

This is the program for one thread
It processes one pixel
How does a thread know which pixel to process?
(Diagram: the grid and one block, labeled with gridDim.x/y, blockIdx.x/y, blockDim.x/y, threadIdx.x/y)
gridDim: how many blocks per dimension
  e.g., gridDim.x = 7, gridDim.y = 6
blockDim: how many threads in a block per dimension
  e.g., blockDim.x = 7, blockDim.y = 7
blockIdx: coordinates of a block in the grid, with (0,0) at the origin
  e.g., blockIdx.x = 2, blockIdx.y = 3 or blockIdx.x = 5, blockIdx.y = 1
threadIdx: coordinates of a thread in its block, with (0,0) at the origin
  e.g., threadIdx.x = 2, threadIdx.y = 3 or threadIdx.x = 5, threadIdx.y = 4
How does a thread know which pixel to process?

  x = blockIdx.x * blockDim.x + threadIdx.x
  y = blockIdx.y * blockDim.y + threadIdx.y
GPU Kernel pseudo-code

  __global__ void fade (unsigned char *in, unsigned char *out,
                        float f, int xmax, int ymax)
  {
      int x = blockDim.x * blockIdx.x + threadIdx.x;
      int y = blockDim.y * blockIdx.y + threadIdx.y;

      unsigned int v = in[x][y];
      v = v * f;
      if (v > 255) v = 255;
      out[x][y] = v;
  }
GPU Kernel pseudo-code w/ limits

  __global__ void fade (unsigned char *in, unsigned char *out,
                        float f, int xmax, int ymax)
  {
      int x = blockDim.x * blockIdx.x + threadIdx.x;
      int y = blockDim.y * blockIdx.y + threadIdx.y;

      if ((x >= xmax) || (y >= ymax)) return;

      unsigned int v = in[x][y];
      v = v * f;
      if (v > 255) v = 255;
      out[x][y] = v;
  }
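Since a plain unsigned char * cannot actually be indexed as in[x][y], a compilable variant of the pseudo-code above flattens the image in row-major order (an assumption about the layout, not stated in the slides):

  __global__ void fade (const unsigned char *in, unsigned char *out,
                        float f, int xmax, int ymax)
  {
      int x = blockDim.x * blockIdx.x + threadIdx.x;
      int y = blockDim.y * blockIdx.y + threadIdx.y;
      if (x >= xmax || y >= ymax) return;

      unsigned int v = in[y * xmax + x] * f;   // row-major: pixel (x, y)
      out[y * xmax + x] = (v > 255) ? 255 : v;
  }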
Programmer’s view: Memory Model
CUDA API: Example

  int a[N];
  for (i = 0; i < N; i++)
      a[i] = a[i] + x;

1. Allocate CPU data structure
2. Initialize data on the CPU
3. Allocate GPU data structure
4. Copy data from CPU to GPU
5. Define the execution configuration
6. Run the kernel
7. CPU synchronizes with the GPU
8. Copy data from GPU to CPU
9. De-allocate GPU and CPU memory
My first CUDA Program / Skeleton

GPU code:
  __global__ void arradd (float *a, float f, int N)
  {
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      if (i < N) a[i] = a[i] + f;
  }

CPU code:
  int main ()
  {
      float h_a[N];
      float *d_a;
      cudaMalloc ((void **) &d_a, SIZE);
      cudaMemcpy (d_a, h_a, SIZE, cudaMemcpyHostToDevice);
      arradd <<< n_blocks, block_size >>> (d_a, 10.0, N);
      cudaThreadSynchronize ();
      cudaMemcpy (h_a, d_a, SIZE, cudaMemcpyDeviceToHost);
      cudaFree (d_a);
  }
1. Allocate CPU Data container

  float *ha;

  main (int argc, char *argv[])
  {
      int N = atoi (argv[1]);
      ha = (float *) malloc (sizeof (float) * N);
      ...
  }

No memory is allocated on the GPU side
Pinned memory allocation (cudaMallocHost (…)) results in faster CPU to/from GPU copies,
but pinned memory cannot be paged out
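A hedged sketch of the pinned-memory variant (same size as above): cudaMallocHost allocates page-locked host memory, and cudaFreeHost releases it.

  float *ha;
  // Pinned (page-locked) host allocation: faster host<->GPU copies,
  // but this memory cannot be paged out, so use it sparingly.
  cudaMallocHost ((void **) &ha, sizeof (float) * N);
  ...
  cudaFreeHost (ha);   // pinned memory is freed with cudaFreeHost, not free()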
2. Initialize CPU Data (dummy)

  float *ha;
  int i;

  for (i = 0; i < N; i++)
      ha[i] = i;
3. Allocate GPU Data container

  float *da;
  cudaMalloc ((void **) &da, sizeof (float) * N);

Notice: no assignment of the result
  NOT: da = cudaMalloc (…)
The assignment is done internally – that's why we pass &da
Space is allocated in Global Memory on the GPU
The host manages GPU memory allocation:

  cudaMalloc (void **ptr, size_t nbytes)
    Must explicitly cast to (void **):
    cudaMalloc ((void **) &da, sizeof (float) * N);

  cudaFree (void *ptr);
    cudaFree (da);

  cudaMemset (void *ptr, int value, size_t nbytes);
    cudaMemset (da, 0, N * sizeof (float));

Check the CUDA Reference Manual
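All of these calls return a cudaError_t, so a common (optional) pattern is to check it; this is a generic sketch, not required by the slides, and assumes stdio.h is included:

  cudaError_t err = cudaMalloc ((void **) &da, sizeof (float) * N);
  if (err != cudaSuccess)
      fprintf (stderr, "cudaMalloc failed: %s\n", cudaGetErrorString (err));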
4. Copy Initialized CPU data to GPU

  float *da;
  float *ha;

  cudaMemcpy ((void *) da,             // DESTINATION
              (void *) ha,             // SOURCE
              sizeof (float) * N,      // #bytes
              cudaMemcpyHostToDevice); // DIRECTION
Host/Device Data Transfers

The host initiates all transfers:

  cudaMemcpy (void *dst, void *src, size_t nbytes,
              enum cudaMemcpyKind direction)

Asynchronous from the CPU's perspective
  The CPU thread continues
In-order processing with other CUDA requests

enum cudaMemcpyKind:
  cudaMemcpyHostToDevice
  cudaMemcpyDeviceToHost
  cudaMemcpyDeviceToDevice
5. Define Execution Configuration

How many blocks and threads per block?

  int threads_block = 64;
  int blocks = N / threads_block;
  if (N % threads_block != 0) blocks += 1;

Alternatively:

  blocks = (N + threads_block - 1) / threads_block;
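The same idea extends to 2D ranges using dim3; this is a sketch tied to the earlier fade example, with d_in, d_out, xmax, and ymax as assumed names and 16x16 as an example block size:

  // 2D execution configuration for the fade kernel (illustrative values).
  dim3 block (16, 16);
  dim3 grid ((xmax + block.x - 1) / block.x,
             (ymax + block.y - 1) / block.y);
  fade <<< grid, block >>> (d_in, d_out, 1.2f, xmax, ymax);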
6. Launch Kernel & 7. CPU/GPU Synchronization

Instructs the GPU to launch blocks x threads_block threads:

  arradd <<< blocks, threads_block >>> (da, 10.0f, N);
  cudaThreadSynchronize ();   // forces the CPU to wait

arradd: kernel name
<<< … >>>: execution configuration
(da, x, N): arguments
  256-byte limit / no variable arguments
  Not sure this is still true (will check)
CPU/GPU Synchronization

The CPU does not block on cuda…() calls
  Kernels/requests are queued and processed in order
  Control returns to the CPU immediately
  Good if there is other work to be done,
  e.g., preparing for the next kernel invocation
Eventually, the CPU must know when the GPU is done
  Then it can safely copy back the GPU results
cudaThreadSynchronize ()
  Blocks the CPU until all preceding cuda…() and kernel requests have completed
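A small host-side sketch of that pattern (the CPU work is a hypothetical placeholder): launch, do independent CPU work, then synchronize before touching the results.

  arradd <<< blocks, threads_block >>> (da, 10.0f, N);  // returns immediately

  do_unrelated_cpu_work ();    // hypothetical helper: overlaps with the GPU

  cudaThreadSynchronize ();    // wait until the kernel has finished
  cudaMemcpy (ha, da, sizeof (float) * N, cudaMemcpyDeviceToHost);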
8. Copy data from GPU to CPU & 9. De-allocate Memory

  float *da;
  float *ha;

  cudaMemcpy ((void *) ha,             // DESTINATION
              (void *) da,             // SOURCE
              sizeof (float) * N,      // #bytes
              cudaMemcpyDeviceToHost); // DIRECTION

  cudaFree (da);

  // display or process the results here

  free (ha);
The GPU Kernel

  __global__ void darradd (float *da, float x, int N)
  {
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      if (i < N) da[i] = da[i] + x;
  }

blockIdx: unique block ID, numerically ascending: 0, 1, …
blockDim: dimensions of the block = how many threads it has
  blockDim.x, blockDim.y, blockDim.z (unused dimensions default to 1)
threadIdx: unique per-block thread index: 0, 1, … within each block
My first CUDA Program

GPU code:
  __global__ void arradd (float *a, float f, int N)
  {
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      if (i < N) a[i] = a[i] + f;
  }

CPU code:
  int main ()
  {
      float h_a[N];
      float *d_a;
      cudaMalloc ((void **) &d_a, SIZE);
      cudaMemcpy (d_a, h_a, SIZE, cudaMemcpyHostToDevice);
      arradd <<< n_blocks, block_size >>> (d_a, 10.0, N);
      cudaThreadSynchronize ();
      cudaMemcpy (h_a, d_a, SIZE, cudaMemcpyDeviceToHost);
      CUDA_SAFE_CALL (cudaFree (d_a));
  }
Texture Unit – Painting with images