Programming Massively Parallel Graphics Processors

Andreas Moshovos Winter 2009

Goals
  Learn how to program GPUs
  Learn how to get performance out of GPUs
  Understand GPU architecture and limitations
  CUDA: Compute Unified Device Architecture / NVIDIA
How:
  Weekly assignments for the first few weeks
  A large team project
Ideal Scenario:
  Non-ECE, non-CS people will team up with CS/ECE people and attack an interesting problem

What is a GPU
  Specialized processor for graphics
  Embarrassingly parallel: lots of read data, calculate, write
  Used to be fixed function
  Are becoming more programmable
What is CUDA
  A C extension for programming NVIDIA GPUs
  Straightforward to learn
  Challenge is in getting performance

Sequential Execution Model
    int a[N]; // N is large
    for (i = 0; i < N; i++)
        a[i] = a[i] * fade;
Flow of control / thread: one instruction at a time
Optimizations possible at the machine level

Data Parallel Execution Model / SIMD
    int a[N]; // N is large
    for all elements do in parallel
        a[index] = a[index] * fade;

Single Program Multiple Data / SPMD
    int a[N]; // N is large
    for all elements do in parallel
        if (a[i] > threshold) a[i] *= fade;
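The SPMD idea above can be sketched in plain C. This is a sequential emulation, not GPU code: the hypothetical helper `spmd_body` stands in for the single program that every element runs, and the loop in `fade_above_threshold` stands in for the hardware launching one copy per element. The `float` element type and both function names are assumptions made for illustration.

```c
#include <stddef.h>

/* Per-element program: every "thread" runs this same code on its own
 * index, but each one may take a different branch (the SPMD case). */
static void spmd_body(float *a, size_t i, float threshold, float fade)
{
    if (a[i] > threshold)
        a[i] *= fade;
}

/* Sequential emulation: the loop visits the indices one at a time.
 * No iteration reads or writes another iteration's element, so the
 * iterations could run in any order, or all at once. */
void fade_above_threshold(float *a, size_t n, float threshold, float fade)
{
    for (size_t i = 0; i < n; i++)
        spmd_body(a, i, threshold, fade);
}
```

Because the iterations are independent, replacing the loop with one hardware thread per index would leave the result unchanged, which is what makes the problem embarrassingly parallel.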

Programmer’s view – Typical System
CPU and Memory
If you care about performance a lot: CPU regs, CPU caches, Memory
CPU to Memory: 12.8 GB/sec – 31.92 GB/sec, 8 B per transfer

Programmer’s view with GPU
CPU to Memory: 12.8 GB/sec – 31.92 GB/sec, 8 B per transfer
CPU to GPU: 3 GB/sec
GPU to GPU Memory: 141 GB/sec
GPU Memory: 1 GB on our systems

Programmer’s view with GPU
CPU and GPU steps, in time order:
  Copy to GPU mem
  Launch GPU threads
  Synchronize with GPU
  Copy from GPU mem

Structure: CPU vs. GPU

But what about performance?
Focus on PEAK performance first: what the manufacturer guarantees you’ll never exceed
Two aspects:
  Data access rate capability: bandwidth
  Data processing capability: how many ops per sec

Data Processing Capability
Focus on floating-point data
GFLOPS: billion floating-point operations per second
Caveat: FLOPs can be different, but today things are not as bad as before
High-end CPU today: 3.4 GHz x 8 FLOPs/cycle = 27 GFLOPS (assumes SSE)
High-end GPU today / GTX280: 933.1 GFLOPS, or 34x the capability
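The peak-rate arithmetic above is just a product and a ratio. A minimal sketch in C, where the function names `peak_gflops` and `peak_ratio` are hypothetical and only the input numbers come from the slide:

```c
/* Peak rate in GFLOPS = clock rate (GHz) x floating-point ops per cycle. */
double peak_gflops(double clock_ghz, double flops_per_cycle)
{
    return clock_ghz * flops_per_cycle;
}

/* How many times larger the GPU peak is than the CPU peak. */
double peak_ratio(double gpu_gflops, double cpu_gflops)
{
    return gpu_gflops / cpu_gflops;
}
```

With the slide's inputs, `peak_gflops(3.4, 8.0)` works out to 27.2, and `peak_ratio(933.1, 27.2)` to roughly 34.3, in line with the rounded 27 GFLOPS and 34x quoted above. These are peak figures only; they say nothing about what a real program sustains.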

Data Access Capability
High-end CPU today:
  31.92 GB/sec (Nehalem), 12.8 GB/sec (Harpertown)
  Bus width 64-bit
GPU / GTX280:
  141.7 GB/sec
  Bus width 512-bit
  4.39x – 11x the CPU bandwidth
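Peak bandwidth figures like the ones above come from multiplying the bus width by the transfer rate. The sketch below shows that arithmetic in C. The function name is hypothetical, and the transfer rates used in the note after it (1.6, 1.33 and 2.214 billion transfers per second, and three memory channels for the faster CPU figure) are assumptions back-computed to reproduce the slide's numbers; the slides themselves give only the bus widths and the resulting GB/sec.

```c
/* Peak bandwidth in GB/sec =
 *   bytes per transfer x billions of transfers per second x channels.
 * A 64-bit bus moves 8 bytes per transfer; a 512-bit bus moves 64. */
double peak_bandwidth_gbs(double bytes_per_transfer,
                          double gtransfers_per_sec,
                          int channels)
{
    return bytes_per_transfer * gtransfers_per_sec * channels;
}
```

Under those assumed rates, `peak_bandwidth_gbs(8.0, 1.6, 1)` is about 12.8, `peak_bandwidth_gbs(8.0, 1.33, 3)` about 31.92, and `peak_bandwidth_gbs(64.0, 2.214, 1)` about 141.7. The point of the exercise is that the GPU's advantage comes mostly from the much wider bus.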

GPU vs. CPU

What does the programmer need to know?
Many details about the architecture
But fortunately most of it is simple

Programmer’s view: GPU Architecture

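My first CUDA Program

GPU:
    // Each thread adds f to one element of a.
    __global__ void arradd(float *a, float f, int N)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < N) a[i] = a[i] + f;   // guard: the last block may overshoot N
    }

CPU:
    // N, SIZE, n_blocks and block_size are not defined on the slide;
    // SIZE is presumably N * sizeof(float).
    int main()
    {
        float h_a[N];    // host copy
        float *d_a;      // device copy
        cudaMalloc((void **) &d_a, SIZE);
        cudaMemcpy(d_a, h_a, SIZE, cudaMemcpyHostToDevice);
        arradd<<<n_blocks, block_size>>>(d_a, 10.0, N);
        cudaThreadSynchronize();   // wait for the kernel to finish
        cudaMemcpy(h_a, d_a, SIZE, cudaMemcpyDeviceToHost);
        CUDA_SAFE_CALL(cudaFree(d_a));
    }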

Threads / Blocks / Grid
Block size = 12, #blocks = 5
Block 0: a[0] … a[11]
…
Block 4: a[48] … a[59]
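The block-to-element mapping above follows from the index expression `blockIdx.x * blockDim.x + threadIdx.x`. The plain C sketch below emulates that arithmetic on the host; it is not CUDA code, and the function names `global_index` and `blocks_needed` are hypothetical.

```c
/* Global element index handled by one thread, mirroring the CUDA
 * expression blockIdx.x * blockDim.x + threadIdx.x. */
int global_index(int block_idx, int block_dim, int thread_idx)
{
    return block_idx * block_dim + thread_idx;
}

/* Number of blocks needed to cover n elements. The division rounds up,
 * so the last block may contain threads with no element to work on,
 * which is why a kernel guards its body with a bounds check. */
int blocks_needed(int n, int block_dim)
{
    return (n + block_dim - 1) / block_dim;
}
```

With a block size of 12, block 4 covers `global_index(4, 12, 0)` through `global_index(4, 12, 11)`, that is a[48] through a[59], and `blocks_needed(60, 12)` gives the 5 blocks shown above.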

Memory Hierarchy
Per-thread storage: anything declared inside the kernel
Per-block shared memory: __shared__ int…
Global (device) memory: __device__ int…

Performance Programmer’s view
Mark Silberstein, Technion

CUDA keywords, etc.
Declspecs: global, device, shared, local, constant
Keywords: threadIdx, blockIdx
Intrinsics: __syncthreads
Runtime API: memory, symbol, execution management
Function launch

    __device__ float filter[N];

    __global__ void convolve(float *image)
    {
        __shared__ float region[M];
        ...
        region[threadIdx.x] = image[i];
        __syncthreads();   // wait until every thread in the block has written region
        image[j] = result;
    }

    // Allocate GPU memory
    float *myimage;
    cudaMalloc((void **) &myimage, bytes);

    // 100 blocks, 10 threads per block
    convolve<<<100, 10>>>(myimage);
    cudaThreadSynchronize();

Floating-Point Caveats
Single-precision floating-point support is not 100% IEEE 754 compliant
  No denormals, fixed rounding modes
  Must check that SNR remains acceptable
  But there are lots of SP FP units
GTX280 supports double precision
  But there are very few of these units
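The slides do not say how the SNR check should be done. One common approach, sketched here in C under that assumption, is to compare single-precision results against a double-precision reference and take the ratio of signal power to error power. The function name `snr_ratio` and the choice of a double-precision reference are illustrative, not taken from the course material.

```c
#include <float.h>
#include <stddef.h>

/* Signal-to-noise ratio as a linear power ratio:
 *   sum(ref^2) / sum((ref - test)^2)
 * Convert to decibels with 10 * log10(ratio) if needed.
 * Returns DBL_MAX when the two arrays match exactly (zero noise). */
double snr_ratio(const double *ref, const float *test, size_t n)
{
    double signal = 0.0, noise = 0.0;
    for (size_t i = 0; i < n; i++) {
        double err = ref[i] - (double)test[i];
        signal += ref[i] * ref[i];
        noise += err * err;
    }
    return (noise == 0.0) ? DBL_MAX : signal / noise;
}
```

In practice `ref` would hold results computed on the CPU in double precision and `test` the values copied back from the GPU; what counts as an acceptable ratio depends on the application.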

Development Process
Course specific:
  Get an account on the eecg network
  Fill in your name/ID/current on the list
  Wait until confirmation is received
Machines: ug51.eecg through ug75.eecg.utoronto.ca
SF2204 Keycode: _______

Development Process
Once you are on a ugxx machine:
  source /cad1/CUDA/cuda.csh
  That will create an NVIDIA_CUDA_SDK directory
  Go in and type “make dbg=1”
  This builds several examples under bin/linux/debug
  The source code is in the projects subdir
We’ll post a handout soon on the course website

Development Process
Create a xxxx.cu file
Compile it with nvcc
  Makefile is provided by the SDK
  Nvcc is a preprocessor

So, why would Parallel Processing work?
Parallel processing and programming have been around for a while
  The golden age was the 80s
  It didn’t work:
    Programming is hard
    Hardware was expensive
    Single-processor performance was doubling every 18 months
Why would it work now?
  Cost / single processor
  Not a done deal at all
  Programming is still hard

Course Staff
Andreas Moshovos, EA310
TA: Hassan Shojania

Course Structure
Till the end of February / weekly assignments:
  CUDA programming
  GTX280 architecture
  CUDA performance
  Floating point
March / project proposal and work:
  Case studies
  General parallel programming guidelines
April:
  Project presentations
  Make-up lectures?
