Programming with CUDA WS 08/09 Lecture 10 Tue, 25 Nov, 2008

Previously
Optimizing Instruction Throughput
– Low throughput instructions: different versions of math functions; type conversions are costly
– Avoid warp divergence
– Accessing global memory is expensive; overlap memory ops with math ops

Previously
Optimizing Instruction Throughput
– Optimal use of memory bandwidth
  – Global memory: coalesce accesses
  – Local memory: coalesced automatically
  – Constant memory: cached; cost proportional to the number of addresses read
  – Texture memory: cached, optimized for 2D spatial locality
  – Shared memory: on chip, fast, but avoid bank conflicts

Today
Optimizing Instruction Throughput
– Optimal use of memory bandwidth
  – Shared memory: on chip, fast, but avoid bank conflicts
  – Registers
– Optimizing #threads per block
– Memory copies
– Texture vs. global vs. constant memory
– General optimizations

Shared Memory
Bank conflicts
– Shared memory is divided into 32-bit modules called banks
– Banks allow simultaneous reads
– An N-way bank conflict occurs when N threads try to read from the same bank
  – Leads to serialization of the reads
  – Not necessarily N serial reads
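As an illustration, here is a minimal sketch of the usual padding trick for a shared-memory tile, assuming the 16-bank shared memory of this hardware generation; the kernel name, tile size, and the assumption that width is a multiple of the tile size are only illustrative. Without the +1 padding, the column-wise read makes all threads of a half-warp hit the same bank.

#define TILE 16

// Hypothetical transpose kernel illustrating bank-conflict avoidance.
// With a TILE x TILE tile, the column read tile[threadIdx.x][threadIdx.y]
// would map all threads of a half-warp to one bank; the +1 padding
// shifts each row by one bank and makes the read conflict-free.
__global__ void transposeTile(const float *in, float *out, int width)
{
    __shared__ float tile[TILE][TILE + 1];   // +1 avoids bank conflicts

    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;
    tile[threadIdx.y][threadIdx.x] = in[y * width + x];   // coalesced load

    __syncthreads();

    int tx = blockIdx.y * TILE + threadIdx.x;
    int ty = blockIdx.x * TILE + threadIdx.y;
    out[ty * width + tx] = tile[threadIdx.x][threadIdx.y]; // column read, no conflict
}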

Shared Memory
Bank conflicts
– Broadcast mechanism
  – One word is chosen as a broadcast word
  – It is automatically passed to all other threads reading from that word
– The programmer cannot control which word is picked as the broadcast word
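A small hypothetical kernel where every thread reads the same shared-memory word: the read is served by the broadcast mechanism instead of being serialized as a bank conflict. The kernel name and the assumption of at least 32 threads per block are illustrative.

// Hypothetical example: all threads read coeff[0], the same 32-bit word.
// The hardware broadcasts that word to the threads reading it in one step,
// so this access pattern does not count as a bank conflict.
__global__ void scaleByFirst(const float *in, float *out, int n)
{
    __shared__ float coeff[32];
    if (threadIdx.x < 32)
        coeff[threadIdx.x] = in[threadIdx.x];
    __syncthreads();

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i] * coeff[0];   // same word for every thread: broadcast
}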

Registers
Generally 0 clock cycles
– The time to access registers is included in the instruction time
– There can still be delays

Registers
Delays may occur due to register memory bank conflicts
– Register memory banks are handled by the compiler and thread scheduler
  – They try to schedule instructions to avoid conflicts
  – This works best with a multiple of 64 threads per block
– The application has no other control

Registers
Delays may also occur due to read-after-write dependencies
– These may be hidden if each SM has at least 192 active threads

Optimizing #threads per block
Aim for 2 or more blocks per SM
– A waiting block (thread sync, memory copy) can then be overlapped with running blocks
– Shared memory per block should be less than half the shared memory per SM

Optimizing #threads per block
– A multiple of 32 threads per block fully populates warps
– A multiple of 64 threads per block allows the compiler and thread scheduler to avoid register memory bank conflicts

Optimizing #threads per block
More threads per block means fewer registers per kernel
– The compiler option --ptxas-options=-v reports the memory requirements of a kernel
– The number of registers per device varies with compute capability

Optimizing #threads per block
When optimizing, go for a multiple of 64 threads per block
– 192 or 256 threads are recommended
Occupancy of an SM = (#active warps) / (max. active warps)
– The compiler tries to maximize occupancy
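A minimal host-side sketch of such a launch configuration, using the 256-threads-per-block recommendation above; the kernel, array sizes, and variable names are only illustrative.

#include <cuda_runtime.h>

// Trivial illustrative kernel: copies input to output.
__global__ void copyKernel(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

int main()
{
    const int n = 1 << 20;
    float *d_in, *d_out;
    cudaMalloc((void**)&d_in,  n * sizeof(float));
    cudaMalloc((void**)&d_out, n * sizeof(float));

    // 256 threads per block: a multiple of 64, as recommended above.
    int threadsPerBlock = 256;
    int blocksPerGrid = (n + threadsPerBlock - 1) / threadsPerBlock;
    copyKernel<<<blocksPerGrid, threadsPerBlock>>>(d_in, d_out, n);
    cudaThreadSynchronize();   // CUDA 2.x-era synchronization call

    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}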

Optimizing Memory Copies
Host memory <-> device memory
– Low bandwidth
– Higher bandwidth can be achieved using page-locked (pinned) memory
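A minimal sketch of a pinned-memory transfer using cudaMallocHost; the buffer size and variable names are illustrative.

#include <cuda_runtime.h>

int main()
{
    const size_t bytes = 1 << 24;          // 16 MB, illustrative size
    float *h_pinned, *d_data;

    // Page-locked (pinned) host allocation: enables higher-bandwidth,
    // DMA-driven transfers than pageable malloc'ed memory.
    cudaMallocHost((void**)&h_pinned, bytes);
    cudaMalloc((void**)&d_data, bytes);

    cudaMemcpy(d_data, h_pinned, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(h_pinned, d_data, bytes, cudaMemcpyDeviceToHost);

    cudaFree(d_data);
    cudaFreeHost(h_pinned);
    return 0;
}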

Optimizing Memory Copies
Minimize such transfers
– Move more code to the device, even if it does not fully utilize parallelism
– Create intermediate data structures in device memory
– Group several transfers into one

Texture fetches vs. reading global/constant memory
– Cached, optimized for spatial locality
– No coalescing constraints
– Address calculation latency is better hidden
– Data can be packed
– Optional conversion of integers to normalized floats in [0.0, 1.0] or [-1.0, 1.0]

Texture fetches vs. reading global/constant memory
For textures stored in CUDA arrays:
– Filtering
– Normalized texture coordinates
– Addressing modes
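A minimal sketch of binding a CUDA array to a texture reference with the CUDA 2.x-era reference API; kernel name, dimensions, and the chosen filter/address modes are illustrative, and the upload of actual data into the array is omitted.

#include <cuda_runtime.h>

// Texture reference at file scope, as the reference API requires.
texture<float, 2, cudaReadModeElementType> texRef;

// Hypothetical kernel reading through the texture cache with tex2D.
__global__ void readTexture(float *out, int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < width && y < height)
        out[y * width + x] = tex2D(texRef, x + 0.5f, y + 0.5f);
}

int main()
{
    const int width = 256, height = 256;

    // CUDA array to hold the texture data (contents left uninitialized here).
    cudaChannelFormatDesc desc = cudaCreateChannelDesc<float>();
    cudaArray *cuArray;
    cudaMallocArray(&cuArray, &desc, width, height);

    // Filtering and addressing modes: only available for CUDA arrays.
    texRef.filterMode     = cudaFilterModeLinear;
    texRef.addressMode[0] = cudaAddressModeClamp;
    texRef.addressMode[1] = cudaAddressModeClamp;
    cudaBindTextureToArray(texRef, cuArray);

    float *d_out;
    cudaMalloc((void**)&d_out, width * height * sizeof(float));
    dim3 block(16, 16);
    dim3 grid(width / block.x, height / block.y);
    readTexture<<<grid, block>>>(d_out, width, height);

    cudaUnbindTexture(texRef);
    cudaFreeArray(cuArray);
    cudaFree(d_out);
    return 0;
}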

General Guidelines
– Maximize parallelism
– Maximize memory bandwidth
– Maximize instruction throughput

Maximize Parallelism
Build on data parallelism
– Broken in case of thread dependencies
– For threads in the same block:
  – __syncthreads()
  – Share data using shared memory
– For threads in different blocks:
  – Share data using global memory
  – Two kernel calls: the first writes the data, the second reads it (see the sketch below)
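A minimal sketch of the two-kernel pattern for sharing data across blocks; the kernel names and the computed values are hypothetical. The first launch writes per-block results to global memory, and because launches on the same stream execute in order, the second launch sees everything the first one wrote.

#include <cuda_runtime.h>

// First kernel: every block writes one value to global memory.
__global__ void produce(float *partial)
{
    if (threadIdx.x == 0)
        partial[blockIdx.x] = (float)blockIdx.x;   // placeholder per-block result
}

// Second kernel: threads read values produced by *other* blocks.
__global__ void consume(const float *partial, float *out, int numBlocks)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < numBlocks)
        out[i] = partial[numBlocks - 1 - i];       // cross-block read
}

int main()
{
    const int numBlocks = 64, threads = 64;
    float *d_partial, *d_out;
    cudaMalloc((void**)&d_partial, numBlocks * sizeof(float));
    cudaMalloc((void**)&d_out, numBlocks * sizeof(float));

    produce<<<numBlocks, threads>>>(d_partial);
    // Second launch starts only after the first has finished.
    consume<<<1, numBlocks>>>(d_partial, d_out, numBlocks);

    cudaFree(d_partial);
    cudaFree(d_out);
    return 0;
}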

Maximize Parallelism
– Build on data parallelism
– Choose kernel parameters accordingly
– Clever device use: streams
– Clever host use: async kernels

Maximize Memory Bandwidth
Minimize host <-> device memory copies
Minimize device <-> device memory data transfer
– Use shared memory
– It might even be better not to copy at all: just recompute on the device

Maximize Memory Bandwidth
Organize data for optimal memory access patterns
– Crucial for accesses to global memory
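A small hypothetical illustration of access-pattern organization: in the first kernel, consecutive threads read consecutive addresses, which coalesces on this hardware generation; in the second, each thread strides through its own chunk, so the accesses of a half-warp cannot be combined. Kernel names and the chunk parameter are illustrative.

// Coalesced: thread i of each block reads element i of its segment;
// consecutive threads touch consecutive 32-bit words.
__global__ void coalescedCopy(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

// Strided: each thread walks through its own chunk, so the threads of a
// half-warp read addresses far apart and the loads are serialized.
__global__ void stridedCopy(const float *in, float *out, int n, int chunk)
{
    int start = (blockIdx.x * blockDim.x + threadIdx.x) * chunk;
    for (int j = 0; j < chunk && start + j < n; ++j)
        out[start + j] = in[start + j];
}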

Maximize Instruction Throughput
For non-crucial cases, use higher-throughput arithmetic instructions
– Sacrifice accuracy for performance
– Replace double operations with float operations
Pay attention to warp divergence
– Try to arrange diverging threads per warp, e.g. branch on (threadIdx.x / warp_size) > n
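A small hypothetical pair of kernels contrasting the two branching granularities: branching on the thread index diverges within a warp, while branching on (threadIdx.x / warpSize) keeps whole warps on one side of the condition. Kernel names and the branch bodies are illustrative.

// Divergent: threads of the same warp take different branches,
// so both paths are executed serially for that warp.
__global__ void perThreadBranch(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        if (threadIdx.x % 2 == 0) data[i] *= 2.0f;
        else                      data[i] += 1.0f;
    }
}

// Warp-aligned: the condition depends only on the warp index, so every
// thread of a given warp takes the same branch and nothing diverges.
__global__ void perWarpBranch(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int warp = threadIdx.x / warpSize;   // same value for a whole warp
    if (i < n) {
        if (warp % 2 == 0) data[i] *= 2.0f;
        else               data[i] += 1.0f;
    }
}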

Final Projects
Time-line
– Thu, 20 Nov: float write-ups on ideas of Jens & Waqar
– Tue, 25 Nov (today): suggest groups and topics
– Thu, 27 Nov: groups and topics assigned
– Tue, 2 Dec: last chance to change groups/topics; groups and topics finalized

All for today
Next time
– A full-fledged example project

On to exercises!