Lecture 16 Revisiting Strides, CUDA Threads…


CSCE 513 Advanced Computer Architecture, Lecture 16: Revisiting Strides, CUDA Threads… Topics: strides through memory; practical performance considerations; readings. November 6, 2017

Overview. Last time: intro to CUDA/GPU programming. Readings for today: Stanford CS 193G (iTunes), http://code.google.com/p/stanford-cs193g-sp2010/ and http://code.google.com/p/stanford-cs193g-sp2010/wiki/ClassSchedule; book (online) by David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2009, http://courses.engr.illinois.edu/ece498/al/Syllabus.html, Chapters 1-3. New: OpenMP examples – SC 2008 (link emailed Tuesday); Nvidia CUDA example.

Nvidia NVIDIA Developer Zone - http://developer.nvidia.com/ http://developer.nvidia.com/cuda-toolkit-41 CUDA Toolkit Downloads C/C++ compiler, CUDA-GDB, Visual Profiler, CUDA Memcheck,  GPU-accelerated libraries, Other tools & Documentation Developer Drivers Downloads GPU Computing SDK Downloads

Stanford CS 193G http://code.google.com/p/stanford-cs193g-sp2010/wiki/TutorialPrerequisites. Vincent Natoli, "Kudos for CUDA," HPC Wire (2010). Patterson, David A.; Hennessy, John L. (2011). Computer Architecture: A Quantitative Approach (The Morgan Kaufmann Series in Computer Architecture and Design), Kindle Locations 7530-7532. Elsevier Science. Kindle Edition.

Lessons from the Graphics Pipeline. Throughput is paramount: must paint every pixel within the frame time; scalability. Create, run, & retire lots of threads very rapidly: measured 14.8 Gthread/s on an increment() kernel. Use multithreading to hide latency: 1 stalled thread is OK if 100 are ready to run. Building bigger & better graphics processors has revealed the following lessons: video games have strict time requirements (bare minimum: 2 Mpixels * 60 fps * 2 = 240 Mthread/s), so throughput is paramount. The scale of these demands dictates that threads must be incredibly lightweight; on recent architectures, we've observed 15 billion threads created/run/destroyed per second. It also dictates multithreading/timesharing to hide latency: it's okay if one thread stalls if it means that 100 more are allowed to run immediately.

Why is this different from a CPU? Different goals produce different designs GPU assumes work load is highly parallel CPU must be good at everything, parallel or not CPU: minimize latency experienced by 1 thread big on-chip caches sophisticated control logic GPU: maximize throughput of all threads # threads in flight limited by resources => lots of resources (registers, bandwidth, etc.) multithreading can hide latency => skip the big caches share control logic across many threads You may ask “Why are these design decisions different from a CPU?” In fact, the GPU’s goals differ significantly from the CPU’s GPU evolved to solve problems on a highly parallel workload CPU evolved to be good at any problem whether it is parallel or not For example, the trip out to memory is long and painful The question for the chip architect: How to deal with latency? One way is to avoid it: the CPU’s computational logic sits in a sea of cache. The idea is that few memory transactions are unique, and the data is usually found quickly after a short trip to the cache. Another way is amortization: GPUs forgo cache in favor of parallelization. Here, the idea is that most memory transactions are unique, and can be processed efficiently in parallel. The cost of a trip to memory is amortized across several independent threads, which results in high throughput.

NVIDIA GPU Architecture: Fermi GF100. (Die diagram: DRAM interfaces, host interface, GigaThread scheduler, L2 cache.) A sea of green scalar cores (literally hundreds), a thin layer of blue on-chip memory, sandwiching a blue communication fabric out to memory, with some additional fixed-function logic specific to supporting graphics algorithms (i.e., rasterization).

SM (Streaming Multiprocessor). 32 CUDA cores per SM (512 total); each core executes an identical instruction or sleeps; 24-active-warp limit. 8x peak FP64 performance: 50% of peak FP32 performance. Direct load/store to memory: usual linear sequence of bytes; high bandwidth (hundreds of GB/sec). 64KB of fast, on-chip RAM: software- or hardware-managed, shared amongst CUDA cores, enables thread communication. The SM is at the heart of the NVIDIA GPU architecture: the individual scalar cores from the last slide are assembled into groups of 32 in an SM. SM = Streaming Multiprocessor, SP = Streaming Processor. (SM diagram: 32 cores, 16 load/store units, 4 special function units, interconnect network, 64K configurable cache/shared memory, uniform cache.)

Key Architectural Ideas. SIMT (Single Instruction Multiple Thread) execution: threads run in groups of 32 called warps; threads in a warp share an instruction unit (IU); HW automatically handles divergence. Hardware multithreading: HW resource allocation & thread scheduling; HW relies on threads to hide latency. Threads have all resources needed to run: any warp not waiting for something can run; context switching is (basically) free. SIMT: "software" threads are assembled into groups of 32 called "warps" which time-share the 32 hardware threads (CUDA cores). Warps share control logic (such as the current instruction), so at a HW level they are executed in SIMD; however, threads are also allowed to diverge, and resolving divergence is handled automatically by the HW through a HW-managed stack. The hardware likewise manages multithreaded allocation and scheduling, and because threads have all the resources they need to run, threads can launch and execute basically for free. (SM diagram: instruction cache, dual warp schedulers and dispatch units, register file, 32 cores, 16 load/store units, 4 special function units, interconnect network, 64K configurable cache/shared memory, uniform cache.)

C for CUDA Philosophy: provide minimal set of extensions necessary to expose power Function qualifiers: __global__ void my_kernel() { } __device__ float my_device_func() { } Variable qualifiers: __constant__ float my_constant_array[32]; __shared__ float my_shared_array[32]; Execution configuration: dim3 grid_dim(100, 50); // 5000 thread blocks dim3 block_dim(4, 8, 8); // 256 threads per block my_kernel <<< grid_dim, block_dim >>> (...); // Launch kernel Built-in variables and functions valid in device code: dim3 gridDim; // Grid dimension dim3 blockDim; // Block dimension dim3 blockIdx; // Block index dim3 threadIdx; // Thread index void __syncthreads(); // Thread synchronization With C for CUDA, our philosophy has been to provide the minimal set of extensions to the C programming language to enable the programmer to succinctly describe a parallel problem such that it can take advantage of the power of massively parallel platforms __global__ keyword tags a function as an entry point to a program that will run on a parallel compute device – think of this as the program’s main() function __device__ keyword tags a function to tell the compiler that the function should execute on the GPU __constant__ declares a read-only variable or array that lives in the GPU’s memory space __shared__ provides read-write access to a variable or array that lives in on-chip memory space. Neighboring threads in a block are allowed to communicate through such variables To launch a program, you call a function declared with the __global__ keyword and configure the “shape” of the computation. It may be convenient to describe the computation as a 1D (working on a linear array), 2D problem (think of doing some work on the pixels of an image), or 3D (think of doing some work on the voxels of a volume grid) While in a __global__ or __device__ function, threads need to know their “location” in the computation, and that is provided by built-in variables: gridDim & blockDim specify the shape of the current launch configuration, and blockIdx and threadIdx uniquely identify each thread in the grid Finally, the built-in function __syncthreads() provides a synchronization barrier for all threads operating in a block – all computation halts until all threads in a block have reached the barrier. This comes in handy when threads want to pass messages to each other to cooperate on a problem

Example: vector_addition
// compute vector sum c = a + b
// each thread performs one pair-wise addition
__global__ void vector_add(float* A, float* B, float* C)
{
    int i = threadIdx.x + blockDim.x * blockIdx.x;
    C[i] = A[i] + B[i];
}

int main()
{
    // elided initialization code
    ...
    // Run N/256 blocks of 256 threads each
    vector_add<<< N/256, 256 >>>(d_A, d_B, d_C);
}

Example: Initialization code for vector_addition
// allocate and initialize host (CPU) memory
float *h_A = …, *h_B = …;
// allocate device (GPU) memory
float *d_A, *d_B, *d_C;
cudaMalloc( (void**) &d_A, N * sizeof(float) );
cudaMalloc( (void**) &d_B, N * sizeof(float) );
cudaMalloc( (void**) &d_C, N * sizeof(float) );
// copy host memory to device
cudaMemcpy( d_A, h_A, N * sizeof(float), cudaMemcpyHostToDevice );
cudaMemcpy( d_B, h_B, N * sizeof(float), cudaMemcpyHostToDevice );
// launch N/256 blocks of 256 threads each
vector_add<<< N/256, 256 >>>(d_A, d_B, d_C);
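The slide stops at the launch; a minimal sketch of what typically follows (not from the slides), assuming a host result buffer h_C was heap-allocated alongside h_A and h_B:

// copy the result back to the host (this also waits for the kernel to finish)
cudaMemcpy( h_C, d_C, N * sizeof(float), cudaMemcpyDeviceToHost );
// release device and host buffers
cudaFree( d_A ); cudaFree( d_B ); cudaFree( d_C );
free( h_A ); free( h_B ); free( h_C );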

CUDA Programming Model Parallel code (kernel) is launched and executed on a device by many threads Launches are hierarchical Threads are grouped into blocks Blocks are grouped into grids Familiar serial code is written for a thread Each thread is free to execute a unique code path Built-in thread and block ID variables

DAXPY example in text
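The textbook's DAXPY (y = a*x + y) is essentially a scaled vector add; a sketch along those lines, with the bounds check covering the case where n is not a multiple of the block size:

// DAXPY: y[i] = a * x[i] + y[i], one element per thread
__global__ void daxpy(int n, double a, double *x, double *y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];
}

// host side: 256 threads per block, enough blocks to cover all n elements
// daxpy<<<(n + 255) / 256, 256>>>(n, 2.0, d_x, d_y);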

High-Level View. (Diagram: several SMs, each with its own shared memory (SMEM), attached to the GPU's global memory; the GPU is connected to the CPU and chipset over PCIe.)

Blocks of threads run on an SM. (Diagram: a streaming processor runs a thread, which has its own registers and per-thread memory; a streaming multiprocessor runs a thread block, which has per-block shared memory (SMEM).)

Whole grid runs on the GPU. (Diagram: many blocks of threads spread across the SMs, each SM with its shared memory, all sharing global memory.)

Thread Hierarchy Threads launched for a parallel section are partitioned into thread blocks Grid = all blocks for a given launch Thread block is a group of threads that can: Synchronize their execution Communicate via shared memory
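As a small illustration of both block-level properties (a sketch, not from the slides; the 256-thread block size and buffer names are assumptions), each thread stages one element in shared memory, the block synchronizes, and then each thread reads an element written by a different thread:

__global__ void reverse_each_block(const int *d_in, int *d_out)
{
    __shared__ int s[256];                    // per-block shared memory
    int base = blockIdx.x * blockDim.x;
    int t = threadIdx.x;
    s[t] = d_in[base + t];                    // every thread writes one element
    __syncthreads();                          // all writes to s[] are visible after this
    d_out[base + t] = s[blockDim.x - 1 - t];  // safely read a neighbour's element
}

// launch with 256-thread blocks: reverse_each_block<<<N/256, 256>>>(d_in, d_out);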

Memory Model. (Diagram: kernels launched on a device execute sequentially, Kernel 0, Kernel 1, …; all of them share the per-device global memory.)

IDs and Dimensions. Threads: 3D IDs, unique within a block. Blocks: 2D IDs, unique within a grid. Dimensions are set at launch and can be unique for each grid. Built-in variables: threadIdx, blockIdx, blockDim, gridDim. (Diagram: a device running Grid 1, a 3x2 arrangement of blocks; Block (1, 1) is expanded into a 5x3 arrangement of threads.)

Kernel with 2D Indexing __global__ void kernel( int *a, int dimx, int dimy ) { int ix = blockIdx.x*blockDim.x + threadIdx.x; int iy = blockIdx.y*blockDim.y + threadIdx.y; int idx = iy*dimx + ix; a[idx] = a[idx]+1; }

// The 2D-indexing kernel together with a complete host program
__global__ void kernel( int *a, int dimx, int dimy )
{
    int ix = blockIdx.x*blockDim.x + threadIdx.x;
    int iy = blockIdx.y*blockDim.y + threadIdx.y;
    int idx = iy*dimx + ix;
    a[idx] = a[idx]+1;
}

int main()
{
    int dimx = 16;
    int dimy = 16;
    int num_bytes = dimx*dimy*sizeof(int);
    int *d_a=0, *h_a=0; // device and host pointers
    h_a = (int*)malloc(num_bytes);
    cudaMalloc( (void**)&d_a, num_bytes );
    if( 0==h_a || 0==d_a ) { printf("couldn't allocate memory\n"); return 1; }
    cudaMemset( d_a, 0, num_bytes );
    dim3 grid, block;
    block.x = 4;
    block.y = 4;
    grid.x = dimx / block.x;
    grid.y = dimy / block.y;
    kernel<<<grid, block>>>( d_a, dimx, dimy );
    cudaMemcpy( h_a, d_a, num_bytes, cudaMemcpyDeviceToHost );
    for(int row=0; row<dimy; row++) {
        for(int col=0; col<dimx; col++)
            printf("%d ", h_a[row*dimx+col] );
        printf("\n");
    }
    free( h_a );
    cudaFree( d_a );
    return 0;
}

Control Flow Divergence. What happens if you have the following code? if(foo(threadIdx.x)) { do_A(); } else { do_B(); }

Control Flow Divergence. (Diagram: a branch splits a warp into Path A and Path B; the two paths execute one after the other before the warp reconverges.) From Fung et al., MICRO '07. http://code.google.com/p/stanford-cs193g-sp2010/wiki/ClassSchedule

Control Flow Divergence. Nested branches are handled as well: if(foo(threadIdx.x)) { if(bar(threadIdx.x)) do_A(); else do_B(); } else do_C();

Control Flow Divergence. (Diagram: the nested branches produce three serialized paths, Path A, Path B, and Path C, before the warp reconverges.)

Control Flow Divergence You don’t have to worry about divergence for correctness (*) You might have to think about it for performance Depends on your branch conditions * Mostly true, except corner cases (for example intra-warp locks) http://code.google.com/p/stanford-cs193g-sp2010/wiki/ClassSchedule

Control Flow Divergence Performance drops off with the degree of divergence switch(threadIdx.x % N) { case 0: ... case 1: }
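A hedged sketch of why the branch condition matters (do_A/do_B are the placeholder functions from the earlier slides): a condition that varies within a warp serializes both paths, while one that is uniform across each 32-thread warp does not.

// divergent: even and odd lanes of the same warp take different paths,
// so the warp executes do_A() and then do_B()
if (threadIdx.x % 2) do_A(); else do_B();

// warp-uniform: all 32 threads of a warp evaluate the condition identically,
// so each warp executes only one of the two paths
if ((threadIdx.x / 32) % 2) do_A(); else do_B();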

Divergence http://code.google.com/p/stanford-cs193g-sp2010/wiki/ClassSchedule

The Problem How do you do global communication? Finish a grid and start a new one http://code.google.com/p/stanford-cs193g-sp2010/wiki/ClassSchedule

Global Communication. Finish a kernel and start a new one; all writes from all threads complete before a kernel finishes:
step1<<<grid1,blk1>>>(...);
// The system ensures that all writes from step1 complete.
step2<<<grid2,blk2>>>(...);
http://code.google.com/p/stanford-cs193g-sp2010/wiki/ClassSchedule

Global Communication Would need to decompose kernels into before and after parts http://code.google.com/p/stanford-cs193g-sp2010/wiki/ClassSchedule

Race Conditions Or, write to a predefined memory location Race condition! Updates can be lost http://code.google.com/p/stanford-cs193g-sp2010/wiki/ClassSchedule

Race Conditions. // vector[0] was equal to 0
threadId:0      vector[0] += 5;  ...  a = vector[0];
threadId:1917   vector[0] += 1;  ...  a = vector[0];
What is the value of a in thread 0? What is the value of a in thread 1917? http://code.google.com/p/stanford-cs193g-sp2010/wiki/ClassSchedule

Race Conditions Thread 0 could have finished execution before 1917 started Or the other way around Or both are executing at the same time Answer: not defined by the programming model, can be arbitrary http://code.google.com/p/stanford-cs193g-sp2010/wiki/ClassSchedule

Atomics CUDA provides atomic operations to deal with this problem http://code.google.com/p/stanford-cs193g-sp2010/wiki/ClassSchedule

Atomics. An atomic operation guarantees that only a single thread has access to a piece of memory while the operation completes. The name atomic comes from the fact that it is uninterruptible. No dropped data, but ordering is still arbitrary. Different types of atomic instructions: atomic{Add, Sub, Exch, Min, Max, Inc, Dec, CAS, And, Or, Xor}. More types in Fermi. http://code.google.com/p/stanford-cs193g-sp2010/wiki/ClassSchedule

Example: Histogram
// Determine frequency of colors in a picture
// colors have already been converted into ints
// Each thread looks at one pixel and increments
// a counter atomically
__global__ void histogram(int* colors, int* buckets)
{
    int i = threadIdx.x + blockDim.x * blockIdx.x;
    int c = colors[i];
    atomicAdd(&buckets[c], 1);
}
http://code.google.com/p/stanford-cs193g-sp2010/wiki/ClassSchedule

Example: Workqueue
// For algorithms where the amount of work per item
// is highly non-uniform, it often makes sense for
// threads to continuously grab work from a queue
__global__ void workq(int* work_q, int* q_counter, int* output, int queue_max)
{
    int i = threadIdx.x + blockDim.x * blockIdx.x;
    int q_index = atomicInc(q_counter, queue_max);
    int result = do_work(work_q[q_index]);
    output[i] = result;
}

Atomics Atomics are slower than normal load/store You can have the whole machine queuing on a single location in memory Atomics unavailable on G80! http://code.google.com/p/stanford-cs193g-sp2010/wiki/ClassSchedule

Example: Global Min/Max (Naive) // If you require the maximum across all threads // in a grid, you could do it with a single global // maximum value, but it will be VERY slow __global__ void global_max(int* values, int* gl_max) { int i = threadIdx.x + blockDim.x * blockIdx.x; int val = values[i]; atomicMax(gl_max,val); } http://code.google.com/p/stanford-cs193g-sp2010/wiki/ClassSchedule

Example: Global Min/Max (Better)
// introduce intermediate maximum results, so that
// most threads do not try to update the global max
__global__ void global_max(int* values, int* max, int* regional_maxes, int num_regions)
{
    // i and val as before …
    int region = i % num_regions;
    if(atomicMax(&regional_maxes[region], val) < val)
        atomicMax(max, val);
}
http://code.google.com/p/stanford-cs193g-sp2010/wiki/ClassSchedule

Global Min/Max Single value causes serial bottleneck Create hierarchy of values for more parallelism Performance will still be slow, so use judiciously See next lecture for even better version! http://code.google.com/p/stanford-cs193g-sp2010/wiki/ClassSchedule

Summary Can’t use normal load/store for inter-thread communication because of race conditions Use atomic instructions for sparse and/or unpredictable global communication See next lectures for shared memory and scan for other communication patterns Decompose data (very limited use of single global sum/max/min/etc.) for more parallelism http://code.google.com/p/stanford-cs193g-sp2010/wiki/ClassSchedule

How an SM executes threads. Overview of how a Streaming Multiprocessor works: SIMT execution; divergence. http://code.google.com/p/stanford-cs193g-sp2010/wiki/ClassSchedule

Scheduling Blocks onto SMs. (Diagram: arbitrary thread blocks, e.g. 5, 27, 61, 2001, queued onto a Streaming Multiprocessor.) HW schedules thread blocks onto available SMs; there is no guarantee of ordering among thread blocks; HW will schedule a thread block as soon as a previous thread block finishes. http://code.google.com/p/stanford-cs193g-sp2010/wiki/ClassSchedule

Warps. A warp = 32 threads launched together; usually, they execute together as well. (Diagram: a design with one control unit per ALU contrasted with the SIMT design, where a single control unit drives many ALUs.) http://code.google.com/p/stanford-cs193g-sp2010/wiki/ClassSchedule

Mapping of Thread Blocks. Each thread block is mapped to one or more warps; the hardware schedules each warp independently. (Diagram: Thread Block N with 128 threads maps to four warps, TB N W1 through TB N W4.) http://code.google.com/p/stanford-cs193g-sp2010/wiki/ClassSchedule
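For a 1D block, the warp and lane of a thread follow directly from its index (warp size 32); a sketch:

int warp_id = threadIdx.x / 32;   // e.g. 0..3 for the 128-thread block above
int lane_id = threadIdx.x % 32;   // position of the thread within its warp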

Thread Scheduling Example. The SM implements zero-overhead warp scheduling. At any time, only one of the warps is executed by the SM. Warps whose next instruction has its inputs ready for consumption are eligible for execution; eligible warps are selected for execution by a prioritized scheduling policy; all threads in a warp execute the same instruction when selected. http://code.google.com/p/stanford-cs193g-sp2010/wiki/ClassSchedule

Atomics atomicAdd returns the previous value at a certain address Useful for grabbing variable amounts of data from a list http://code.google.com/p/stanford-cs193g-sp2010/wiki/ClassSchedule
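A sketch of that idiom (list_counter, out, and n_items are hypothetical names): each thread atomically advances a shared counter by the number of slots it needs, and the old value returned by atomicAdd marks the start of its private range.

__global__ void append_items(int *list_counter, int *out)
{
    int n_items = 3;                              // however many items this thread produces
    int base = atomicAdd(list_counter, n_items);  // old value = first slot reserved for us
    for (int k = 0; k < n_items; ++k)
        out[base + k] = threadIdx.x;              // fill the reserved slots without races
}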

Compare and Swap
int compare_and_swap(int* address, int oldval, int newval)
{
    int old_reg_val = *address;
    if(old_reg_val == oldval)
        *address = newval;
    return old_reg_val;
}
http://code.google.com/p/stanford-cs193g-sp2010/wiki/ClassSchedule

Compare and Swap Most general type of atomic Can emulate all others with CAS http://code.google.com/p/stanford-cs193g-sp2010/wiki/ClassSchedule
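For example, an add can be emulated with a CAS retry loop, in the spirit of the pattern shown in the CUDA programming guide (a sketch; my_atomic_add is a hypothetical name):

__device__ int my_atomic_add(int *address, int val)
{
    int old = *address, assumed;
    do {
        assumed = old;
        old = atomicCAS(address, assumed, assumed + val);  // succeeds only if nobody interfered
    } while (old != assumed);                              // otherwise retry with the new value
    return old;                                            // like atomicAdd, return the previous value
}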

Locks Use very judiciously Always include a max_iter in your spinloop! Decompose your data and your locks http://code.google.com/p/stanford-cs193g-sp2010/wiki/ClassSchedule
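A bounded spin-lock sketch in that spirit (lock is a hypothetical int in global memory, 0 = free, 1 = held); note that, per the earlier divergence caveat, threads of the same warp should not contend for the same lock.

__device__ bool try_lock(int *lock, int max_iter)
{
    for (int i = 0; i < max_iter; ++i)        // never spin forever
        if (atomicCAS(lock, 0, 1) == 0)       // 0 -> 1 succeeded: we own the lock
            return true;
    return false;                             // give up; the caller must handle failure
}

__device__ void unlock(int *lock)
{
    atomicExch(lock, 0);                      // release the lock
}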