Programming with CUDA WS 08/09 Lecture 9 Thu, 20 Nov, 2008

Previously
CUDA Runtime Component
–Common Component
–Device Component
–Host Component: runtime & driver APIs

Today
Memory & instruction optimizations
Final projects - reminder

Instruction Performance

Instruction Processing
To execute an instruction on a warp of threads, the SM
–Reads in the instruction operands for each thread
–Executes the instruction on all threads
–Writes the result of each thread

Instruction Throughput
Maximized when
–Use of low-throughput instructions is minimized
–Available memory bandwidth is used maximally
–The thread scheduler overlaps compute & memory operations
  Programs have a high arithmetic intensity per memory operation
  Each SM has many active threads

Instruction Throughput
Avoid low-throughput instructions
–Be aware of the clock cycles used per instruction
–There are often faster alternatives for math functions, e.g. __sinf instead of sinf
–The size of operands (24-bit vs. 32-bit) also makes a difference
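As a rough sketch (my example, not from the slides), the intrinsics __sinf and __mul24 are the kind of faster, lower-precision alternatives meant here:

    // Hypothetical kernel contrasting precise and fast variants.
    __global__ void fastMathDemo(float *out, const float *in, int n)
    {
        // 24-bit multiply: fine as long as blockIdx.x * blockDim.x fits in 24 bits.
        int tid = __mul24(blockIdx.x, blockDim.x) + threadIdx.x;
        if (tid < n)
            out[tid] = __sinf(in[tid]);   // fast intrinsic; sinf(in[tid]) is slower but more accurate
    }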

Instruction Throughput
Avoid low-throughput instructions
–Integer division and modulo are expensive
  Use bitwise operations (>>, &) instead when the divisor is a power of two
–Type conversions cost cycles
  char / short => int
  double => float
–Define float constants with the f suffix, e.g. 1.0f
–Use float functions, e.g. expf
–Some devices (compute capability <= 1.2) demote double to float
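A minimal sketch (mine, not from the slides) of these rewrites, assuming the divisor is a power of two:

    __global__ void lowCostOps(float *out, const int *in, int n)
    {
        int tid = blockIdx.x * blockDim.x + threadIdx.x;
        if (tid < n) {
            int q = in[tid] >> 4;          // instead of in[tid] / 16
            int r = in[tid] & 15;          // instead of in[tid] % 16
            // 1.0f and expf keep the computation in single precision;
            // 1.0 and exp would pull in double arithmetic where it is supported.
            out[tid] = expf(1.0f * q + r);
        }
    }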

Instruction Throughput
Avoid branching
–Diverging threads in a warp are serialized
–Try to minimize the number of divergent warps
–Loop unrolling by the compiler can be controlled using #pragma unroll
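A hedged illustration (not from the slides): branching on a warp-aligned predicate keeps all 32 threads of a warp on the same path, and #pragma unroll controls unrolling of a small fixed-count loop:

    __global__ void branchDemo(float *data)
    {
        int tid = blockIdx.x * blockDim.x + threadIdx.x;

        // Divergent version (avoid): threads of the same warp take different paths.
        //   if (tid % 2 == 0) ... else ...

        // Non-divergent: the predicate is constant within each 32-thread warp.
        if ((tid / 32) % 2 == 0) {
            #pragma unroll
            for (int i = 0; i < 4; ++i)   // fixed trip count, fully unrolled
                data[tid] += 1.0f;
        } else {
            data[tid] -= 1.0f;
        }
    }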

Instruction Throughput
Avoid high-latency memory instructions
–An SM takes 4 clock cycles to issue a memory instruction to a warp
–In the case of local/global memory, there is an additional overhead of 400 to 600 cycles

    __shared__ float shared;
    __device__ float device;
    shared = device;   // costs an extra 400-600 cycles

Instruction Throughput
Avoid high-latency memory instructions
–If local/global memory has to be accessed, surround the access with independent arithmetic instructions
  The SM can do math while the memory access is in flight
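A rough sketch of the idea (my own example): the global load below is issued first, and the independent arithmetic after it can execute while the loaded value is still in flight:

    __global__ void overlapDemo(float *out, const float *in, int n)
    {
        int tid = blockIdx.x * blockDim.x + threadIdx.x;
        if (tid < n) {
            float loaded = in[tid];            // global load, ~400-600 cycles of latency
            float indep  = tid * 0.5f + 1.0f;  // independent math, can overlap with the load
            out[tid] = loaded + indep;         // first real use of the loaded value
        }
    }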

Instruction Throughput
Cost of __syncthreads()
–The instruction itself takes 4 clock cycles per warp
–Additional cycles are spent waiting for threads to catch up

Instruction Throughput
Maximized when
–Use of low-throughput instructions is minimized
–Available memory bandwidth is used maximally
–The thread scheduler overlaps compute & memory operations
  Programs have a high arithmetic intensity per memory operation
  Each SM has many active threads

Instruction Throughput
The effective bandwidth of each memory space (global, local, shared) depends on the memory access pattern
Device memory has higher latency and lower bandwidth than on-chip memory
–Minimize use of device memory

Instruction Throughput
Typical execution
–Each thread loads data from device to shared memory
–Synchronize threads, if necessary
–Each thread processes data in shared memory
–Synchronize threads, if necessary
–Write data from shared back to device memory
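A skeletal kernel following that pattern (my own sketch; assumes it is launched with blockDim.x == TILE and an array length that is a multiple of TILE):

    #define TILE 256

    __global__ void typicalPattern(float *d_out, const float *d_in)
    {
        __shared__ float tile[TILE];
        int gid = blockIdx.x * blockDim.x + threadIdx.x;

        tile[threadIdx.x] = d_in[gid];    // 1. load from device (global) to shared memory
        __syncthreads();                  // 2. make the whole tile visible to the block

        // 3. work on shared data (here: read a neighbour's element)
        float v = tile[(threadIdx.x + 1) & (TILE - 1)] * 0.5f;
        __syncthreads();                  // 4. wait before shared memory could be reused

        d_out[gid] = v;                   // 5. write results back to device memory
    }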

Instruction Throughput
Global memory
–High latency, low bandwidth
–Not cached
–Using the right access patterns is crucial

Instruction Throughput
Global memory: alignment
–Supported word sizes: 4, 8, 16 bytes
–The access
    __device__ type device[32];
    type data = device[tid];
  compiles to a single load instruction if
  –type has a supported size
  –type variables are aligned to sizeof(type): the address of the variable must be a multiple of sizeof(type)

Instruction Throughput
Global memory: alignment
–The alignment requirement is automatically fulfilled for built-in types
–For self-defined structures, alignment can be forced:
    struct __align__(8)  { float a,b;   } myStruct8;
    struct __align__(16) { float a,b,c; } myStruct12;

Instruction Throughput
Global memory: alignment
–Addresses of global variables are aligned to 256 bytes
–Align structures cleverly:
    struct { float a,b,c,d,e; } myStruct20;
  Five 32-bit load instructions

Instruction Throughput
Global memory: alignment
–Addresses of global variables are aligned to 256 bytes
–Align structures cleverly:
    struct __align__(8) { float a,b,c,d,e; } myStruct20;
  Three 64-bit load instructions

Instruction Throughput
Global memory: alignment
–Addresses of global variables are aligned to 256 bytes
–Align structures cleverly:
    struct __align__(16) { float a,b,c,d,e; } myStruct20;
  Two 128-bit load instructions

Instruction Throughput
Global memory: coalescing
–The size of a memory transaction on global memory can be 32 (compute capability >= 1.2 only), 64 or 128 bytes
–Bandwidth is used most efficiently when simultaneous memory accesses by the threads of a half-warp can be coalesced into a single memory transaction
–The coalescing rules vary with compute capability

Instruction Throughput
Global memory: coalescing, compute capability <= 1.1
–Global memory accesses by the threads of a half-warp are coalesced if
  Each thread accesses words of size
  –4 bytes: one 64-byte memory transaction
  –8 bytes: one 128-byte memory transaction
  –16 bytes: two 128-byte memory transactions
  All 16 words lie in the same (aligned) segment of global memory
  Threads access the words in sequence (the k-th thread accesses the k-th word)

Instruction Throughput
Global memory: coalescing, compute capability <= 1.1
–If any of the conditions is violated by a half-warp, its threads' memory accesses are serialized
–Coalesced access of larger word sizes is slower than coalesced access of smaller word sizes
  Still a lot more efficient than non-coalesced access

Instruction Throughput
Global memory: coalescing, compute capability >= 1.2
–Global memory accesses by the threads of a half-warp are coalesced if the accessed words lie in the same aligned segment of the required size
  32 bytes for 1-byte words
  64 bytes for 2-byte words
  128 bytes for 4-, 8- and 16-byte words
–Any access pattern within the segment is allowed
  Lower compute capability cards restrict the access pattern

Instruction Throughput
Global memory: coalescing, compute capability >= 1.2
–If a half-warp addresses words in N different segments, N memory transactions are issued
  Lower compute capability cards would issue 16
–The hardware automatically detects and optimizes for unused words, e.g. if the requested words lie only in the lower or upper half of a 128-byte segment, a 64-byte transaction is issued

Instruction Throughput
Global memory: coalescing, compute capability >= 1.2
–Summary of how memory transactions are issued for a half-warp
  Find the memory segment containing the address requested by the lowest-numbered active thread
  Find all other active threads requesting addresses in the same segment
  Reduce the transaction size, if possible
  Do the transaction, mark those threads inactive
  Repeat until all threads of the half-warp are serviced
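A small illustration (mine, not from the slides) of an access pattern these rules reward and one they punish:

    // Coalesced: thread k of each half-warp reads word k of a contiguous, aligned
    // run, so the whole half-warp is serviced by one (or very few) transactions.
    __global__ void copyCoalesced(float *out, const float *in)
    {
        int tid = blockIdx.x * blockDim.x + threadIdx.x;
        out[tid] = in[tid];
    }

    // Strided: consecutive threads touch addresses many words apart, so the
    // half-warp's requests fall into many segments and need many transactions.
    __global__ void copyStrided(float *out, const float *in, int stride /* e.g. 32 */)
    {
        int tid = blockIdx.x * blockDim.x + threadIdx.x;
        out[tid] = in[tid * stride];
    }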

Instruction Throughput
Global memory: coalescing
–General patterns
    TYPE* BaseAddress;              // 1D array
    // thread tid reads BaseAddress + tid
  TYPE must meet the size and alignment requirements
  If TYPE is larger than 16 bytes, split it into smaller objects that meet the requirements

Instruction Throughput
Global memory: coalescing
–General patterns
    TYPE* BaseAddress;              // 2D array of size width x height
    // thread (tx,ty) reads BaseAddress + ty*width + tx
  The size and alignment requirements still hold

Instruction Throughput
Global memory: coalescing
–General patterns
  Memory coalescing is achieved for all half-warps of a block if
  –the width of the thread block is a multiple of 16
  –width (the array width) is a multiple of 16
  Arrays whose width is a multiple of 16 are accessed more efficiently
  –It is useful to pad arrays up to a multiple of 16
  –This is done automatically by the cuMemAllocPitch() (driver API) and cudaMallocPitch() (runtime API) functions
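A hedged sketch of the runtime-API route (my own example; function and variable names are hypothetical):

    __global__ void scaleRows(float *img, size_t pitch, int width, int height)
    {
        int tx = blockIdx.x * blockDim.x + threadIdx.x;
        int ty = blockIdx.y * blockDim.y + threadIdx.y;
        if (tx < width && ty < height) {
            // Rows are pitch bytes apart, so step by the pitch, not by width.
            float *row = (float *)((char *)img + ty * pitch);
            row[tx] *= 2.0f;
        }
    }

    void allocateAndScale(int width, int height)
    {
        float *d_img;
        size_t pitch;   // chosen by the runtime, >= width * sizeof(float), suitably padded
        cudaMallocPitch((void **)&d_img, &pitch, width * sizeof(float), height);

        dim3 block(16, 16);
        dim3 grid((width + 15) / 16, (height + 15) / 16);
        scaleRows<<<grid, block>>>(d_img, pitch, width, height);

        cudaFree(d_img);
    }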

Instruction Throughput
Local memory
–Used for some per-thread internal variables
–Not cached
–As expensive as global memory
–Since accesses are, by definition, per-thread, they are automatically coalesced

Instruction Throughput
Constant memory
–Cached
  Costs one read from device memory on a cache miss
  Otherwise, one cache read
–For the threads of a half-warp, the cost of reading the cache is proportional to the number of different addresses read
  Recommended: have all threads of a half-warp read the same address
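A minimal sketch (mine) of the recommended broadcast case, where all threads of a half-warp read the same __constant__ address:

    __constant__ float coeffs[4];   // filled from the host with cudaMemcpyToSymbol

    __global__ void applyPolynomial(float *out, const float *in, int n)
    {
        int tid = blockIdx.x * blockDim.x + threadIdx.x;
        if (tid < n) {
            float x = in[tid];
            // All threads read coeffs[k] for the same k at the same time:
            // one cached, broadcast read instead of 16 separate ones.
            out[tid] = ((coeffs[3] * x + coeffs[2]) * x + coeffs[1]) * x + coeffs[0];
        }
    }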

Instruction Throughput
Texture memory
–Cached
  Costs one read from device memory on a cache miss
  Otherwise, one cache read
–The texture cache is optimized for 2D spatial locality
  Recommended: have the threads of a warp read neighboring texture addresses
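A sketch using the texture reference API of that CUDA generation (my own example; the reference, array, and kernel names are hypothetical):

    // Texture reference bound to a 2D CUDA array of floats.
    texture<float, 2, cudaReadModeElementType> texImg;

    __global__ void blurRow(float *out, int width, int height)
    {
        int x = blockIdx.x * blockDim.x + threadIdx.x;
        int y = blockIdx.y * blockDim.y + threadIdx.y;
        if (x < width && y < height) {
            // Neighboring threads fetch neighboring texels: good 2D locality.
            float v = tex2D(texImg, x - 1, y) + tex2D(texImg, x, y) + tex2D(texImg, x + 1, y);
            out[y * width + x] = v / 3.0f;
        }
    }

    // Host side (sketch): copy the image into a cudaArray and bind the texture.
    //   cudaChannelFormatDesc desc = cudaCreateChannelDesc<float>();
    //   cudaArray *arr; cudaMallocArray(&arr, &desc, width, height);
    //   cudaMemcpyToArray(arr, 0, 0, h_img, width * height * sizeof(float), cudaMemcpyHostToDevice);
    //   cudaBindTextureToArray(texImg, arr, desc);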

Instruction Throughput
Shared memory
–On-chip
  As fast as registers, provided there are no bank conflicts between threads
–Divided into equally-sized modules, called banks
  If N requests fall into N separate banks, they are processed concurrently
  If N requests fall into the same bank, there is an N-way bank conflict
  –The N requests are serialized

Instruction Throughput
Shared memory: banks
–Successive 32-bit words are assigned to successive banks
–Bandwidth: 32 bits per bank per 2 clock cycles
–Requests from a warp are split by half-warp
  Threads in different half-warps cannot conflict with each other

Instruction Throughput
Shared memory: bank conflicts
    __shared__ char shared[32];
    char data = shared[BaseIndex + tId];
–Why?

Instruction Throughput
Shared memory: bank conflicts
    __shared__ char shared[32];
    char data = shared[BaseIndex + tId];
–Multiple array elements, e.g. shared[0], shared[1], shared[2] and shared[3], lie in the same 32-bit bank
–Can be resolved as
    char data = shared[BaseIndex + 4 * tId];

Instruction Throughput
Shared memory: bank conflicts
    __shared__ double shared[32];
    double data = shared[BaseIndex + tId];
–Why?

Instruction Throughput
Shared memory: bank conflicts
    __shared__ double shared[32];
    double data = shared[BaseIndex + tId];
–2-way bank conflict, because of a stride of two 32-bit words

Instruction Throughput
Shared memory: bank conflicts
    __shared__ TYPE shared[32];
    TYPE data = shared[BaseIndex + tId];

Instruction Throughput
Shared memory: bank conflicts
    __shared__ TYPE shared[32];
    TYPE data = shared[BaseIndex + tId];
–For
    struct TYPE { float x,y,z; };
  three separate memory reads with no bank conflicts
–Stride of three 32-bit words; since 3 is odd, the 16 threads of a half-warp hit 16 different banks

Instruction Throughput
Shared memory: bank conflicts
    __shared__ TYPE shared[32];
    TYPE data = shared[BaseIndex + tId];
–For
    struct TYPE { float x,y; };
  two separate memory reads, each with a 2-way bank conflict
–Stride of two 32-bit words, just as with double
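A common remedy, sketched here as my own example rather than taken from the slides: pad shared arrays so that per-thread strides become odd again, e.g. the extra column in a shared transpose tile (assumes a square width x width matrix with width a multiple of TILE):

    #define TILE 16

    __global__ void transposeTile(float *out, const float *in, int width)
    {
        // The +1 padding makes each row TILE+1 = 17 words long, so walking down a
        // column steps through banks with an odd stride and avoids bank conflicts.
        __shared__ float tile[TILE][TILE + 1];

        int x = blockIdx.x * TILE + threadIdx.x;
        int y = blockIdx.y * TILE + threadIdx.y;
        tile[threadIdx.y][threadIdx.x] = in[y * width + x];    // coalesced global read
        __syncthreads();

        int tx = blockIdx.y * TILE + threadIdx.x;
        int ty = blockIdx.x * TILE + threadIdx.y;
        out[ty * width + tx] = tile[threadIdx.x][threadIdx.y]; // conflict-free shared read
    }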

Final Projects
Reminder
–Form groups by the next lecture
–Think of project ideas for your group
  You are encouraged to submit several ideas
–For each idea, submit a short text
  describing the problem you want to solve
  explaining why you think it is suited for parallel computation
–Jens and I will assign you one of your suggested topics

Final Projects
Reminder
–If some people have not formed groups, Jens and I will assign them to groups randomly
–If you cannot think of any ideas, Jens and I will assign you some
–We will circulate write-ups of our own ideas; you may choose one of those

Final Projects
Timeline
–Thu, 20 Nov (today): write-ups of Jens & Waqar's ideas are circulated
–Tue, 25 Nov: suggest groups and topics
–Thu, 27 Nov: groups and topics assigned
–Tue, 2 Dec: last chance to change groups/topics; groups and topics finalized

All for today
Next time
–More on bank conflicts
–Other optimizations

See you next week!