CUDA Optimizations Sathish Vadhiyar Parallel Programming

SIMT Architecture and Warps
The multiprocessor creates, manages, schedules, and executes threads in groups of 32 parallel threads called warps.
Threads in a warp start at the same program address, but each has its own instruction address counter and register state, and is therefore free to branch independently.
When an SM is given one or more thread blocks to execute, it partitions them into warps that are scheduled by a warp scheduler for execution.

Warps and Warp Divergence
A warp executes one common instruction at a time, so maximum efficiency is achieved when all threads in a warp agree on a common execution path.
If threads of a warp diverge via a data-dependent conditional branch, the warp serially executes each branch path taken, disabling the threads that are not on that path - this is branch divergence.
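The following hypothetical kernels (not from the original slides) illustrate the difference: in the first, even and odd threads of the same warp take different paths, so the warp serializes both branches; in the second, the condition is uniform within each warp, so there is no divergence.

```
// Divergent: threads within the same warp take different paths,
// so the warp executes both branches serially.
__global__ void divergentKernel(float *data)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (threadIdx.x % 2 == 0)            // even/odd threads of one warp diverge
        data[tid] *= 2.0f;
    else
        data[tid] += 1.0f;
}

// Warp-aligned: the branch condition is uniform within each warp
// (all 32 threads of a warp evaluate it the same way), so no divergence.
__global__ void warpAlignedKernel(float *data)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if ((threadIdx.x / warpSize) % 2 == 0)   // whole warps take the same path
        data[tid] *= 2.0f;
    else
        data[tid] += 1.0f;
}
```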

Performance Optimization Strategies
Maximize parallel execution to achieve maximum utilization.
Optimize memory usage to achieve maximum memory throughput.
Optimize instruction usage to achieve maximum instruction throughput.

Maximize Parallel Execution
Launch the kernel with at least as many thread blocks as there are multiprocessors on the device.
Choose the number of threads per block as a multiple of the warp size, to avoid wasting compute resources on under-populated warps.
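A minimal sketch (not part of the original slides) of choosing the launch configuration from the device properties; the kernel, the data size, and the choice of 8 warps per block are illustrative assumptions.

```
#include <cstdio>
#include <cuda_runtime.h>

__global__ void myKernel(float *data, int n)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < n)
        data[tid] *= 2.0f;   // placeholder work
}

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    int n = 1 << 20;
    // Threads per block: a multiple of the warp size (here 8 warps = 256 threads).
    int threadsPerBlock = 8 * prop.warpSize;
    // Enough blocks to cover the data; should be at least prop.multiProcessorCount.
    int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;

    printf("SMs: %d, blocks: %d, threads/block: %d\n",
           prop.multiProcessorCount, blocks, threadsPerBlock);

    float *d_data;
    cudaMalloc(&d_data, n * sizeof(float));
    myKernel<<<blocks, threadsPerBlock>>>(d_data, n);
    cudaDeviceSynchronize();
    cudaFree(d_data);
    return 0;
}
```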

Maximize Parallel Execution for Maximum Utilization
At points where parallelism is broken and threads need to synchronize:
Use __syncthreads() if the threads belong to the same block.
Otherwise, synchronize through global memory using separate kernel invocations.
The second case should be avoided as much as possible; computations that require inter-thread communication should be performed within a single thread block wherever possible.
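A sketch of both cases, assuming a hypothetical two-phase computation (kernel and variable names are illustrative, and a block size of 256 threads is assumed).

```
// Phases share data only within a block: one kernel, with __syncthreads()
// separating the phases (assumes 256 threads per block).
__global__ void fusedKernel(float *data)
{
    __shared__ float tile[256];
    int tid = threadIdx.x;
    int gid = blockIdx.x * blockDim.x + tid;

    tile[tid] = data[gid];                   // phase 1: stage into shared memory
    __syncthreads();                         // block-wide barrier

    data[gid] = tile[blockDim.x - 1 - tid];  // phase 2: consume neighbours' results
}

// If phase 2 needed results produced by *other* blocks, a block-level barrier
// is not enough; the usual pattern is two kernel launches, with global memory
// (and the implicit ordering between launches on a stream) providing the
// synchronization.
__global__ void phase1(float *data) { /* produce results in global memory */ }
__global__ void phase2(float *data) { /* consume results from all blocks  */ }

void run(float *d_data, int blocks, int threads)
{
    phase1<<<blocks, threads>>>(d_data);
    phase2<<<blocks, threads>>>(d_data);     // starts only after phase1 completes
}
```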

Maximize Memory Throughput
Minimize data transfers between the host and the device:
Move more computation from the host to the device and produce intermediate data directly on the device.
Leave data on the GPU between kernel calls to avoid repeated transfers.
Batch many small host-device transfers into a single large transfer.
Minimize data transfers between global (device) memory and the multiprocessors:
Maximize the usage of shared memory.
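A small illustrative sketch of batching: instead of three separate cudaMemcpy calls for three small arrays, pack them into one structure and issue a single transfer (the struct layout and sizes are assumptions, not from the slides).

```
#include <cuda_runtime.h>

// Three small arrays packed into one contiguous host buffer.
struct Params {
    float a[256];
    float b[256];
    float c[256];
};

void upload(const Params &host, Params *devPtr)
{
    // One transfer instead of three; also a candidate for pinned host memory
    // (cudaMallocHost) if the copy is on the critical path.
    cudaMemcpy(devPtr, &host, sizeof(Params), cudaMemcpyHostToDevice);
}
```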

Maximize Memory Throughput: Other Performance Issues
Memory alignment and coalescing: make sure that all threads in a half-warp access contiguous portions of memory (refer to slides 30-34 of NVIDIA's Advanced CUDA slides).
Shared memory is divided into memory banks: multiple simultaneous accesses to the same bank result in a bank conflict and are serialized (refer to slides 37-40 of NVIDIA's Advanced CUDA slides).
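As an illustration of both points, here is a commonly used tiled-transpose pattern (a sketch, not taken from the referenced NVIDIA slides): the global loads and stores are coalesced, and the extra column of shared-memory padding avoids bank conflicts. It assumes 32x32 thread blocks and a matrix width that is a multiple of 32.

```
#define TILE 32

__global__ void transposeTiled(float *out, const float *in, int width)
{
    // +1 column of padding so that column accesses fall in different banks.
    __shared__ float tile[TILE][TILE + 1];

    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;

    // Coalesced read: consecutive threads read consecutive addresses.
    tile[threadIdx.y][threadIdx.x] = in[y * width + x];
    __syncthreads();

    int tx = blockIdx.y * TILE + threadIdx.x;
    int ty = blockIdx.x * TILE + threadIdx.y;

    // Coalesced write: consecutive threads again write consecutive addresses;
    // the column-wise shared-memory read is conflict-free thanks to the padding.
    out[ty * width + tx] = tile[threadIdx.x][threadIdx.y];
}
```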

Maximize Instruction Throughput
Minimize the use of arithmetic instructions with low throughput.
Trade precision for speed where it does not affect the result much, e.g., use single precision instead of double precision.
Minimize divergent warps: avoid conditional statements whose condition varies within a warp; e.g., instead of if (threadId > 2), use if (threadId / warpSize > 2), so that whole warps take the same path.
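A small hypothetical kernel (not from the slides) showing the precision point: keeping literals in single precision and using a fast single-precision intrinsic where the accuracy loss is acceptable.

```
__global__ void scaleKernel(float *data, int n)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < n) {
        // 0.5 (a double literal) would force a double-precision multiply;
        // 0.5f keeps the computation in single precision.
        float x = data[tid] * 0.5f;

        // Single-precision intrinsics (__sinf, __expf, ...) are faster than
        // sinf/expf at the cost of some accuracy; use them where precision allows.
        data[tid] = __sinf(x);
    }
}
```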

Maximize Instruction Throughput: Reducing Synchronization
__syncthreads() can impact performance.
A warp executes one common instruction at a time, so threads within a warp implicitly synchronize.
This can be exploited to omit __syncthreads() for better performance.

Example with __syncthreads(), and the warp-synchronous version it can be converted to (see the sketch below).
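The code figures from the original slide are not reproduced in this transcript; the sketch below shows the same kind of conversion on a standard shared-memory reduction (the kernel names and the 256-thread block size are assumptions). Note that the warp-synchronous form relies on lock-step warp execution; on Volta and later GPUs, __syncwarp() calls would be needed between the steps.

```
// Block-wide reduction with __syncthreads() at every step
// (assumes 256 threads per block).
__global__ void reduceSync(float *in, float *out)
{
    __shared__ float sdata[256];
    int tid = threadIdx.x;
    sdata[tid] = in[blockIdx.x * blockDim.x + tid];
    __syncthreads();

    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s)
            sdata[tid] += sdata[tid + s];
        __syncthreads();                 // barrier on every iteration
    }
    if (tid == 0) out[blockIdx.x] = sdata[0];
}

// Warp-synchronous version: once fewer than 32 active threads remain, they all
// belong to the same warp, so the explicit barriers can be dropped
// (shared memory is then accessed through a volatile pointer).
__global__ void reduceWarpSync(float *in, float *out)
{
    __shared__ float sdata[256];
    int tid = threadIdx.x;
    sdata[tid] = in[blockIdx.x * blockDim.x + tid];
    __syncthreads();

    for (int s = blockDim.x / 2; s > 32; s >>= 1) {
        if (tid < s)
            sdata[tid] += sdata[tid + s];
        __syncthreads();
    }
    if (tid < 32) {                      // last warp: no __syncthreads() needed
        volatile float *v = sdata;
        v[tid] += v[tid + 32]; v[tid] += v[tid + 16];
        v[tid] += v[tid + 8];  v[tid] += v[tid + 4];
        v[tid] += v[tid + 2];  v[tid] += v[tid + 1];
    }
    if (tid == 0) out[blockIdx.x] = sdata[0];
}
```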

Reducing Synchronization
In the converted version, the remaining active threads are guaranteed to belong to the same warp, so there is no need for explicit synchronization using __syncthreads().

Occupancy
Occupancy depends on resource usage (register and shared memory usage).
The higher the usage per thread, the fewer threads can be simultaneously active.
Registers cannot be used indiscriminately: if (registers used per thread x thread block size) > N, the launch will fail; for Tesla, N = 16384.
Very low occupancy, on the other hand, increases the exposed cost of global memory accesses, since there are too few active warps to hide the latency.
Maximizing occupancy can help cover latency during global memory loads.
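One practical way to keep register usage in check (an illustration, not from the slides) is the __launch_bounds__ qualifier, which asks the compiler to limit register usage so that a given number of blocks fits per SM; actual register and shared-memory usage can be inspected by compiling with nvcc --ptxas-options=-v.

```
// Ask the compiler to assume at most 256 threads per block and to limit
// register usage so that at least 4 blocks can be resident per SM.
__global__ void __launch_bounds__(256, 4) boundedKernel(float *data)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    data[tid] *= 2.0f;   // placeholder work
}
```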

Occupancy Calculation
Active threads per SM = active thread blocks per SM x threads per block
Active warps per SM = active thread blocks per SM x warps per block
Active thread blocks per SM = min(warp-based limit, register-based limit, shared-memory-based limit)
Occupancy = (active warps per SM) / (max warps per SM)
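These quantities can also be queried at runtime through the CUDA occupancy API; a minimal sketch (the kernel and the 256-thread block size are assumptions).

```
#include <cstdio>
#include <cuda_runtime.h>

__global__ void myKernel(float *data)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    data[tid] *= 2.0f;   // placeholder work
}

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    int threadsPerBlock = 256;
    int activeBlocksPerSM = 0;

    // Ask the runtime how many blocks of this kernel fit on one SM,
    // given its register and shared-memory usage.
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(
        &activeBlocksPerSM, myKernel, threadsPerBlock, 0 /* dynamic smem */);

    int activeWarps = activeBlocksPerSM * threadsPerBlock / prop.warpSize;
    int maxWarps    = prop.maxThreadsPerMultiProcessor / prop.warpSize;

    printf("Occupancy: %.0f%%\n", 100.0 * activeWarps / maxWarps);
    return 0;
}
```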