Programming Massively Parallel Graphics Processors


1 Programming Massively Parallel Graphics Processors
Andreas Moshovos Winter 2009

2 Goals
Goals: Learn how to program GPUs (graphics processors). Learn how to get performance out of GPUs. Understand GPU architecture and its limitations. The platform is CUDA: Compute Unified Device Architecture, from NVIDIA.
How: Weekly assignments for the first few weeks, then a large team project.
Ideal Scenario: Non-ECE, non-CS people will team up with CS/ECE people and attack an interesting problem.

3 What is a GPU? What is CUDA?
What is a GPU: A specialized processor for graphics. Graphics work is embarrassingly parallel: lots of "read data, calculate, write". GPUs used to be fixed function and are becoming more programmable.
What is CUDA: A C extension for programming NVIDIA GPUs. Straightforward to learn; the challenge is in getting performance.

4 Sequential Execution Model
int a[N]; // N is large
for (i = 0; i < N; i++) a[i] = a[i] * fade;
Flow of control / Thread: one instruction at a time, in order. Optimizations are possible at the machine level.

5 Data Parallel Execution Model / SIMD
int a[N]; // N is large
for all elements do in parallel: a[index] = a[index] * fade;

6 Single Program Multiple Data / SPMD
int a[N]; // N is large
for all elements do in parallel: if (a[i] > threshold) a[i] *= fade;
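
The SPMD idea can be sketched in plain C, with an ordinary loop standing in for the parallel hardware. This is only an illustration: the function names `fade_element` and `fade_all` are hypothetical, and the array is assumed to hold floats rather than the `int` used on the slide.

```c
#include <stddef.h>

/* The per-element "program": every parallel worker would run this same
   function, each with its own index i. Elements at or below the threshold
   are left untouched, so different workers may take different paths. */
void fade_element(float *a, size_t i, float threshold, float fade)
{
    if (a[i] > threshold)
        a[i] *= fade;
}

/* Sequential stand-in for "for all elements do in parallel".
   The iterations are independent, so any order gives the same result. */
void fade_all(float *a, size_t n, float threshold, float fade)
{
    for (size_t i = 0; i < n; i++)
        fade_element(a, i, threshold, fade);
}
```

Because no iteration reads what another iteration writes, the loop body can be handed to N workers at once, which is what a GPU does.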

7 Programmer’s view – Typical System
CPU and Memory. If you care a lot about performance: CPU regs, CPU caches, Memory.
CPU to Memory: 12.8GB/sec – 31.92GB/sec, 8B per transfer

8 Programmer’s view with GPU
CPU to Memory: 12.8GB/sec – 31.92GB/sec, 8B per transfer
CPU to GPU: 3GB/s
GPU to GPU Memory: 141GB/sec. GPU Memory is 1GB on our systems.
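
A rough way to see why the 3GB/s link matters is to turn the peak rates into transfer times. The sketch below is a back-of-the-envelope model only: it assumes 1 GB = 1e9 bytes, ignores latency and driver overhead, and the helper names `transfer_seconds` and `copy_penalty` are hypothetical.

```c
/* Seconds needed to move `bytes` at a peak rate of `gb_per_sec`.
   Assumption: 1 GB = 1e9 bytes; latency and overheads are ignored. */
double transfer_seconds(double bytes, double gb_per_sec)
{
    return bytes / (gb_per_sec * 1e9);
}

/* How many times slower the CPU-to-GPU copy is than one pass over the
   same data in GPU memory, given the two peak rates. */
double copy_penalty(double link_gb_per_sec, double gpu_mem_gb_per_sec)
{
    return gpu_mem_gb_per_sec / link_gb_per_sec;
}
```

With the slide's figures, `transfer_seconds(1e9, 3.0)` is about a third of a second, and `copy_penalty(3.0, 141.0)` is about 47, so data should be copied to the GPU once and reused as much as possible.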

9 Programmer’s view with GPU
Steps the CPU performs, in time order: Copy to GPU mem. Launch GPU threads. Synchronize with GPU. Copy from GPU mem.

10 Structure: CPU vs. GPU

11 But what about performance?
Focus on PEAK performance first: what the manufacturer guarantees you’ll never exceed.
Two aspects: Data Access Rate Capability (bandwidth) and Data Processing Capability (how many ops per sec).

12 Data Processing Capability
Focus on floating-point data. GFLOPS: billion floating-point operations per second.
Caveat: floating-point operations can be different, but today things are not as bad as before.
High-end CPU today: 3.4 GHz x 8 FLOPs/cycle = 27 GFLOPS (assumes SSE).
High-end GPU today / GTX280: 933.1 GFLOPS, or 34x the capability.
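
The peak numbers above are simple products and ratios. As a small sketch of the arithmetic, using the slide's figures and the hypothetical helper names `peak_gflops` and `peak_ratio`:

```c
/* Peak GFLOPS = clock rate in GHz x floating-point operations per cycle. */
double peak_gflops(double clock_ghz, double flops_per_cycle)
{
    return clock_ghz * flops_per_cycle;
}

/* Ratio of two peak rates, for example GPU over CPU. */
double peak_ratio(double gpu_gflops, double cpu_gflops)
{
    return gpu_gflops / cpu_gflops;
}
```

`peak_gflops(3.4, 8.0)` gives 27.2, which the slide rounds to 27, and `peak_ratio(933.1, 27.2)` gives roughly 34.3, the slide's 34x.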

13 Data Access Capability
High-end CPU today: 31.92 GB/sec (Nehalem), GB/sec (Harpertown). Bus width 64-bit.
GPU / GTX280: 141.7 GB/sec. Bus width 512-bit.
GPU advantage: 4.39x – 11x

14 GPU vs. CPU

15 GPU vs. CPU

16 What does the programmer need to know?
Many details about the architecture, but fortunately most of it is simple.

17 Programmer’s view: GPU Architecture

18 My first CUDA Program
// GPU: each thread adds f to one element of a
__global__ void arradd (float *a, float f, int N) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < N) a[i] = a[i] + f;
}
// CPU: N, SIZE, n_blocks and block_size are defined elsewhere
int main() {
  float h_a[N];
  float *d_a;
  cudaMalloc ((void **) &d_a, SIZE);
  cudaMemcpy (d_a, h_a, SIZE, cudaMemcpyHostToDevice);
  arradd <<< n_blocks, block_size >>> (d_a, 10.0, N);
  cudaThreadSynchronize (); // wait for the kernel (cudaDeviceSynchronize in newer CUDA)
  cudaMemcpy (h_a, d_a, SIZE, cudaMemcpyDeviceToHost);
  CUDA_SAFE_CALL (cudaFree (d_a));
}

19 Threads / Blocks / Grid
Block size = 12, #blocks = 5
Block 0: a[0] … a[11]
Block 4: a[48] … a[59]
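
The block-to-element mapping can be checked on the CPU with ordinary integer arithmetic. The sketch below mirrors the usual CUDA index expression `blockIdx.x * blockDim.x + threadIdx.x`; the function names `global_index` and `blocks_needed` are hypothetical, and the round-up rule for the block count is a common convention rather than something stated on the slide.

```c
/* Global element index for one thread, mirroring
   blockIdx.x * blockDim.x + threadIdx.x. */
int global_index(int block_idx, int block_dim, int thread_idx)
{
    return block_idx * block_dim + thread_idx;
}

/* Number of blocks needed to cover n elements (rounds up, so the last
   block may be partly idle and the kernel needs an i < N guard). */
int blocks_needed(int n, int block_size)
{
    return (n + block_size - 1) / block_size;
}
```

With block size 12, `global_index(4, 12, 0)` is 48 and `global_index(4, 12, 11)` is 59, matching Block 4 above, and `blocks_needed(60, 12)` is 5.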

20 Memory Hierarchy
Anything declared inside the kernel
__shared__ int…
__global__ int…

21 Performance Programmer’s view
Mark Silberstein, Technion

22 CUDA keywords, etc.
Declspecs: global, device, shared, local, constant
Keywords: threadIdx, blockIdx
Intrinsics: __syncthreads
Runtime API: Memory, symbol, execution management
Function launch
__device__ float filter[N];
__global__ void convolve (float *image) {
  __shared__ float region[M];
  ...
  region[threadIdx.x] = image[i];
  __syncthreads();
  image[j] = result;
}
// Allocate GPU memory
float *myimage;
cudaMalloc ((void **) &myimage, bytes);
// 100 blocks, 10 threads per block
convolve<<<100, 10>>> (myimage);
cudaThreadSynchronize (); // wait for the kernel to finish

23 Floating-Point Caveats
Single-precision floating-point support is not 100% IEEE 754: no denormals, fixed rounding modes. Must check that SNR remains acceptable. But there are lots of SP FP units.
GTX280 supports double precision, but there are very few of these units.
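
One way to check that SNR remains acceptable is to compare the single-precision result against a double-precision reference computed on the CPU. The sketch below is an assumption about how such a check might look, not the course's method: `snr_ratio` is a hypothetical helper, it returns a plain power ratio rather than decibels, and the -1.0 return for a noise-free result is a sentinel chosen for this example.

```c
/* Signal-to-noise ratio of a single-precision result against a
   double-precision reference: sum(ref^2) / sum((ref - approx)^2).
   Returns -1.0 when the noise is exactly zero (a sentinel chosen for
   this sketch). */
double snr_ratio(const double *ref, const float *approx, int n)
{
    double signal = 0.0, noise = 0.0;
    for (int i = 0; i < n; i++) {
        double err = ref[i] - (double)approx[i];
        signal += ref[i] * ref[i];
        noise += err * err;
    }
    return noise == 0.0 ? -1.0 : signal / noise;
}
```

What counts as an acceptable ratio depends on the application, so the threshold is left to the caller.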

24 Development Process: Course Specific
Get an account on the eecg network: fill in your name/ID/current on the list, then wait until confirmation is received.
Machines: ug51.eecg through SF2204 Keycode: _______

25 Development Process
Once you are on a ugxx machine: source /cad1/CUDA/cuda.csh
That will create a NVIDIA_CUDA_SDK. Go in and type “make dbg=1”. This builds several examples under bin/linux/debug. The source code is in the projects subdir.
We’ll post a handout soon on the course website.

26 Development Process
Create a file. Compile it with nvcc. A Makefile is provided by the SDK. Nvcc is a preprocessor.

27 So, why would Parallel Processing work?
Parallel processing and programming have been around for a while. The golden age was the 80s. It didn’t work: programming is hard, hardware was expensive, and single-processor performance was doubling every 18 months.
Why would it work now? Cost / single processor. Not a done deal at all: programming is still hard.

28 Course Staff
Andreas Moshovos, EA310. TA: Hassan Shojania

29 Course Structure
Till the end of February / weekly assignments: CUDA programming, GTX280 architecture, CUDA performance, Floating Point.
March / Project Proposal and work: Case studies, General Parallel Programming guidelines.
April: Project Presentations. Make up lectures?
