CUDA - 2


Arrays of Parallel Threads
A CUDA kernel is executed by a grid (array) of threads. All threads in a grid run the same kernel code (SPMD). Each thread has indexes that it uses to compute memory addresses and make control decisions, e.g. i = blockIdx.x * blockDim.x + threadIdx.x.
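
A minimal sketch of this idea (the kernel and array names are illustrative, not from the slides): each thread computes its global index and uses it to address one element, with a bounds check as the control decision.

__global__ void scaleKernel(float *data, float alpha, int n)
{
    // Global thread index: block offset plus thread offset within the block
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    // Control decision based on the index: guard against out-of-range threads
    if (i < n)
        data[i] = alpha * data[i];
}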

Thread Blocks: Scalable Cooperation
Divide the thread array into multiple blocks. Threads within a block cooperate via shared memory, atomic operations, and barrier synchronization. Threads in different blocks do not interact.
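
An illustrative sketch of block-level cooperation (kernel and buffer names are assumptions): threads accumulate into a shared counter with atomics, synchronize with a barrier, and one thread writes the block's result to global memory. No thread ever touches another block's counter.

__global__ void blockSum(const int *in, int *blockTotals, int n)
{
    __shared__ int total;                 // shared memory: visible to all threads in this block
    if (threadIdx.x == 0) total = 0;
    __syncthreads();                      // barrier: make the initialized value visible

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        atomicAdd(&total, in[i]);         // atomic operation: safe concurrent updates

    __syncthreads();                      // barrier: wait until every thread has added its element
    if (threadIdx.x == 0)
        blockTotals[blockIdx.x] = total;  // one result per block; blocks do not interact
}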

Transparent Scalability
Each block can execute in any order relative to the others. The hardware is free to assign blocks to any processor at any time, so a kernel scales to any number of parallel processors.

Example: Executing Thread Blocks
Threads are assigned to Streaming Multiprocessors (SMs) at block granularity: up to 8 blocks per SM, as resources allow. A Fermi SM can take up to 1536 threads; that could be 256 threads/block * 6 blocks, or 512 threads/block * 3 blocks, etc. The SM maintains thread/block index numbers and manages/schedules thread execution.

Warps as Scheduling Units
Each block is executed as 32-thread warps. This is an implementation decision, not part of the CUDA programming model. Warps are the scheduling units on an SM, and the threads in a warp execute in SIMD fashion.

Warp Example
If 3 blocks are assigned to an SM and each block has 256 threads, how many warps are there on the SM? Each block is divided into 256 / 32 = 8 warps, so there are 8 * 3 = 24 warps.

Example: Thread Scheduling (Cont.)
The SM implements zero-overhead warp scheduling. Warps whose next instruction has its operands ready for consumption are eligible for execution, and eligible warps are selected for execution according to a prioritized scheduling policy. All threads in a warp execute the same instruction when it is selected.

How thread blocks are partitioned
Thread blocks are partitioned into warps. Thread IDs within a warp are consecutive and increasing, and warp 0 starts with thread ID 0. The partitioning is always the same, so you can use this knowledge in control flow; however, the exact size of a warp may change from generation to generation. DO NOT rely on any ordering within or between warps: if there are any dependencies between threads, you must call __syncthreads() to get correct results (more later).
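
A small illustration of the partitioning rule (output array names are illustrative; warpSize is the built-in device variable, 32 on current hardware):

__global__ void warpInfo(int *warpIds, int *laneIds)
{
    int tid  = threadIdx.x;
    int warp = tid / warpSize;   // which warp within the block this thread belongs to
    int lane = tid % warpSize;   // position of this thread inside its warp

    int i = blockIdx.x * blockDim.x + tid;
    warpIds[i] = warp;
    laneIds[i] = lane;
}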

Partial Overview of CUDA Memories
Device code can read/write per-thread registers and read/write the all-shared global memory. Host code can transfer data to/from the per-grid global memory.

CUDA Device Memory Management API Functions
cudaMalloc(): allocates an object in device global memory. It takes two parameters: the address of a pointer to the allocated object, and the size of the allocated object in bytes.
cudaFree(): frees an object from device global memory. It takes the pointer to the freed object.
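
A minimal sketch of the allocation/deallocation pattern (the vector length and pointer name are illustrative):

int n = 1024;
size_t size = n * sizeof(float);
float *d_A = NULL;

cudaMalloc((void **)&d_A, size);  // pass the address of the pointer; d_A then points to device global memory
/* ... use d_A in kernel launches ... */
cudaFree(d_A);                    // release the device allocation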

Host-Device Data Transfer API Functions
cudaMemcpy(): memory data transfer. It requires four parameters: a pointer to the destination, a pointer to the source, the number of bytes copied, and the type/direction of the transfer. Transfer to the device is asynchronous.
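
A sketch of the typical host-side flow, assuming a host array h_A of n floats and the d_A allocated in the previous sketch:

cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);   // host -> device

scaleKernel<<<(n + 255) / 256, 256>>>(d_A, 2.0f, n);  // example launch (kernel from the earlier sketch)

cudaMemcpy(h_A, d_A, size, cudaMemcpyDeviceToHost);   // device -> host; returns once the result is on the host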

INTER-WARP/THREAD LEVEL SYNCHRONIZATION
Finer-grained control of execution order within a kernel. If synchronizations can be avoided easily, they should be; sometimes, however, you can't avoid them.

SYNCTHREADS
The simplest: __syncthreads(). When it is reached, a thread will block until all threads in the block have reached the call. It also maintains memory access order: the CUDA compiler will try to delay memory writes as long as it can, and the barrier ensures those writes are visible to the rest of the block.
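
A small sketch of a dependency that needs the barrier (array and kernel names are illustrative): each thread stages one element in shared memory, and every thread must reach the barrier before any thread reads its neighbour's element.

#define BLOCK 256

__global__ void reverseInBlock(int *data)   // launched with BLOCK threads per block
{
    __shared__ int tile[BLOCK];
    int t = threadIdx.x;
    int base = blockIdx.x * blockDim.x;

    tile[t] = data[base + t];               // each thread writes one element
    __syncthreads();                        // wait until every write above has completed

    data[base + t] = tile[BLOCK - 1 - t];   // now it is safe to read another thread's element
}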

Programmer View of CUDA Memories

Declaring CUDA Variables

Variable declaration                        Memory     Scope    Lifetime
int LocalVar;                               Register   Thread   Kernel
__device__ __shared__ int SharedVar;        Shared     Block    Kernel
__device__ int GlobalVar;                   Global     Grid     Application
__device__ __constant__ int ConstantVar;    Constant   Grid     Application

__device__ is optional when used with __shared__ or __constant__. Automatic variables reside in a register, except per-thread arrays, which reside in global memory.
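
A brief sketch showing each declaration in context (the kernel and the way the variables are combined are illustrative):

__device__ __constant__ int ConstantVar;        // constant memory: per grid, application lifetime
__device__ int GlobalVar;                       // global memory: per grid, application lifetime

__global__ void demoKernel(int *out)
{
    int LocalVar = threadIdx.x;                 // automatic variable: register, per thread, kernel lifetime
    __shared__ int SharedVar;                   // shared memory: per block, kernel lifetime

    if (threadIdx.x == 0)
        SharedVar = GlobalVar + ConstantVar;
    __syncthreads();

    out[blockIdx.x * blockDim.x + threadIdx.x] = SharedVar + LocalVar;
}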

Where to Declare Variables?

Shared Memory in CUDA
A special type of memory whose contents are explicitly declared and used in the source code. There is one in each SM, and it is accessed at much higher speed (in both latency and throughput) than global memory. It is still accessed by memory access instructions; it is a form of scratchpad memory in computer architecture terms.

Hardware View of CUDA Memories

A Common Programming Strategy
Partition the data into subsets or tiles that fit into shared memory. Use one thread block to handle each tile by: loading the tile from global memory into shared memory, using multiple threads; performing the computation on the subset from shared memory, reducing traffic to global memory; and, upon completion, writing the results from shared memory back to global memory.
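
A sketch of this load / compute / write-back pattern for a simple 1D averaging stencil (the tile size, radius, and kernel name are assumptions for illustration):

#define TILE   256
#define RADIUS 1

__global__ void stencil1D(const float *in, float *out, int n)   // launched with TILE threads per block
{
    __shared__ float tile[TILE + 2 * RADIUS];

    int g = blockIdx.x * blockDim.x + threadIdx.x;   // global index
    int s = threadIdx.x + RADIUS;                    // index into the shared tile

    // Step 1: load the tile (plus halo elements) from global memory into shared memory
    tile[s] = (g < n) ? in[g] : 0.0f;
    if (threadIdx.x < RADIUS) {
        tile[s - RADIUS] = (g >= RADIUS)    ? in[g - RADIUS] : 0.0f;
        tile[s + TILE]   = (g + TILE < n)   ? in[g + TILE]   : 0.0f;
    }
    __syncthreads();

    // Step 2: compute from shared memory, avoiding repeated global-memory reads
    // Step 3: write the result back to global memory
    if (g < n)
        out[g] = (tile[s - 1] + tile[s] + tile[s + 1]) / 3.0f;
}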