
ITCS 6/8010 CUDA Programming, UNC-Charlotte, B. Wilkinson, Feb 10, 2011. Atomics.pptx

1 Atomics and Critical Sections
These notes will introduce:
- Accessing shared data by multiple threads
- Atomics
- Critical sections
- Compare-and-swap instruction and usage
- Memory fence instruction and usage

2 Accessing Shared Data
Accessing shared data needs careful control. Consider two threads, each of which is to add one to a shared data item, x. Location x is read, x + 1 is computed, and the result is written back to the same location:

Instruction: x = x + 1;

Each of Thread 1 and Thread 2 performs, over time:
  Read x
  Compute x + 1
  Write to x

3 Conflict in accessing shared variable
(Figure; not included in the transcript.)

4 One Possible Interleaving
For example, over time:
  Thread 1: Read x
  Thread 2: Read x
  Thread 1: Compute x + 1
  Thread 2: Compute x + 1
  Thread 1: Write to x
  Thread 2: Write to x

Suppose the initial value of x is 10. What is the final value? Both threads read 10 and both write back 11, so one update is lost: the final value is 11, not the intended 12.
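To make the lost update concrete, here is a minimal sketch (an addition, not from the original slides) in which many threads increment a device counter with no protection; the printed count is typically well below the number of threads. It uses the modern form of cudaMemcpyFromSymbol, which takes the symbol itself rather than a string.

#include <stdio.h>

__device__ int unsafe_count = 0;

__global__ void unsafe_counter() {
    unsafe_count = unsafe_count + 1;   // unprotected read-modify-write
}

int main(void) {
    int count;
    unsafe_counter<<<10, 10>>>();      // 100 threads all updating one word
    cudaDeviceSynchronize();
    cudaMemcpyFromSymbol(&count, unsafe_count, sizeof(int));
    printf("Count = %d (expected 100; typically much less)\n", count);
    return 0;
}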

5 Atomic Functions
Need to ensure that each thread has exclusive access to the shared variable while it completes its operation (if a write operation is involved).
Atomic functions perform a read-modify-write operation on a word in memory without interference by other threads.
Access to the memory location at the specified address is blocked until the atomic operation completes.

6 CUDA Atomic Operations
Perform a read-modify-write atomic operation on one word residing in global or shared GPU memory.
Operations on signed/unsigned integers: add, sub, min, max, and, or, xor, increment, decrement, exchange, and compare-and-swap.
Requires a GPU with compute capability 1.1+ (shared memory operations and 64-bit words require higher capability).
coit-grid06's Tesla C2050 has compute capability 2.0.
See the CUDA documentation for GPU compute capabilities.

7 Example CUDA atomics*

int atomicAdd(int* address, int val);
Atomically adds val to the memory location given by address (an atomic read-modify-write operation).

int atomicSub(int* address, int val);
Atomically subtracts val from the memory location given by address.

Each function returns the original value at address.

* See the CUDA C Programming Guide for the full list.
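The return value is often as useful as the update itself. As a hedged sketch (names here are illustrative, not from the slides), the old value returned by atomicAdd can be used to claim a unique slot in an output array, a common way to compact results in parallel:

__device__ int out_count = 0;          // next free slot in out[]

__global__ void keep_positive(const int *in, int *out, int N) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < N && in[tid] > 0) {
        int slot = atomicAdd(&out_count, 1);   // returned old value = unique index
        out[slot] = in[tid];
    }
}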

8 Example Code

#include <stdio.h>

__device__ int gpu_Count = 0;              // global variable in device

__global__ void gpu_Counter() {
    atomicAdd(&gpu_Count, 1);
}

int main(void) {
    int cpu_Count;
    ...
    gpu_Counter<<<B,T>>>();                // B blocks of T threads (setup elided above)
    cudaMemcpyFromSymbol(&cpu_Count, "gpu_Count", sizeof(int), 0,
                         cudaMemcpyDeviceToHost);
    printf("Count = %d\n", cpu_Count);
    ...
    return 0;
}

cudaMemcpyFromSymbol() is synchronous, so cudaThreadSynchronize() is not needed.
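A portability note (an addition to the slide): early CUDA toolkits accepted the symbol name as a string, as above, but string symbol names were removed in CUDA 5.0. On newer toolkits the symbol itself is passed:

// CUDA 5.0 and later: pass the symbol, not a string
cudaMemcpyFromSymbol(&cpu_Count, gpu_Count, sizeof(int), 0,
                     cudaMemcpyDeviceToHost);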

9 Compilation Notes
Atomics are only implemented on compute capability 1.1 and above; extra features such as floating-point add require later versions.
The previous code will need to be compiled with the -arch=sm_11 (or later) compile flag.

Makefile:

NVCC = /usr/local/cuda/bin/nvcc
CUDAPATH = /usr/local/cuda
NVCCFLAGS = -I$(CUDAPATH)/include -arch=sm_11
LFLAGS = -L$(CUDAPATH)/lib64 -lcuda -lcudart -lm

Counter:
	$(NVCC) $(NVCCFLAGS) $(LFLAGS) -o Counter Counter.cu

10 Another Example: Computing a Histogram

__device__ int gpu_hist[10];               // histogram computed on gpu, globally accessible on gpu

__global__ void gpu_histogram(int *a, int N) {
    int *ptr;
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    int numberThreads = blockDim.x * gridDim.x;

    if (tid == 0)                          // initialize histogram to all zeros
        for (int i = 0; i < 10; i++)       // (done by one thread; there may not be 10 tids)
            gpu_hist[i] = 0;

    while (tid < N) {
        ptr = &gpu_hist[a[tid]];
        atomicAdd(ptr, 1);
        tid += numberThreads;              // if number of threads is less than N, threads are reused
    }
}

11
int main(int argc, char *argv[]) {
    int T = 10, B = 10;                    // threads per block and blocks per grid
    int N = 10;                            // number of numbers
    int *a;                                // ptr to array holding numbers on host
    int *dev_a;                            // ptr to array holding numbers on device
    int hist[10];                          // final results from gpu

    printf("Enter number of numbers, currently %d\n", N);
    scanf("%d", &N);
    input_thread_values(&B, &T);           // keyboard input for number of threads and blocks
    if (N > B * T) printf("Note: number of threads less than number of numbers\n");

    int size = N * sizeof(int);            // number of bytes in total in list of numbers
    a = (int*) malloc(size);
    srand(1);                              // set rand() seed to 1 for repeatability
    for (int i = 0; i < N; i++)            // load array with digits
        a[i] = rand() % 10;

    cudaMalloc((void**)&dev_a, size);
    cudaMemcpy(dev_a, a, size, cudaMemcpyHostToDevice);   // copy numbers to device

    gpu_histogram<<<B,T>>>(dev_a, N);
    cudaThreadSynchronize();               // wait for all threads to complete (needed?)
    cudaMemcpyFromSymbol(&hist, "gpu_hist", sizeof(hist), 0, cudaMemcpyDeviceToHost);

    printf("Histogram, as computed on GPU\n");
    for (int i = 0; i < 10; i++)
        printf("Number of %d's = %d\n", i, hist[i]);

    free(a);                               // clean up
    cudaFree(dev_a);
    return 0;
}

12 Other Atomic Operations

int atomicSub(int* address, int val);
int atomicExch(int* address, int val);
int atomicMin(int* address, int val);
int atomicMax(int* address, int val);
unsigned int atomicInc(unsigned int* address, unsigned int val);
unsigned int atomicDec(unsigned int* address, unsigned int val);
int atomicCAS(int* address, int compare, int val);   // compare and swap
int atomicAnd(int* address, int val);
int atomicOr(int* address, int val);
int atomicXor(int* address, int val);

Source: NVIDIA CUDA C Programming Guide, version 3.2, 11/9/2010

13 Critical Sections
A mechanism for ensuring that only one process (or, in this context, thread) accesses a particular resource at a time.
Critical section: a section of code for accessing the resource.
Arrange that only one such critical section is executed at a time. This mechanism is known as mutual exclusion.
The concept also appears in operating systems.

14 Locks
The simplest mechanism for ensuring mutual exclusion of critical sections.
A lock is a 1-bit variable that is 1 to indicate that a process has entered the critical section and 0 to indicate that no process is in the critical section.
It operates much like a door lock: a process coming to the "door" of a critical section and finding it open may enter the critical section, locking the door behind it to prevent other processes from entering. Once the process has finished the critical section, it unlocks the door and leaves.

15 Control of critical sections through busy waiting
(Figure; not included in the transcript.)

16 Implementing Locks
Checking the lock and setting it if not set at the entrance to a critical section must be done indivisibly, i.e., atomically.
The usual way to achieve this is for the processor to have a special atomic machine instruction, notably one of:
- Test-and-set
- Fetch-and-add
- Compare-and-swap, CAS (or compare-and-exchange)
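As an illustration only (not from the slides), the semantics of test-and-set can be written in plain C as below; the point is that the hardware executes the whole read-and-set as one indivisible operation, which this C function by itself would not be:

// Semantics of test-and-set (illustrative; the hardware version is indivisible)
int test_and_set(int *lock) {
    int old = *lock;    // read the current value
    *lock = 1;          // set the lock
    return old;         // old == 0 means the caller acquired the lock
}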

17 Compare and Swap (CAS)
CAS compares the contents of a memory location to a given value and, only if they are the same, modifies the contents of the memory location to a specified new value, i.e.:

if (x == compare_value) x = new_value;   // else x is left unchanged

For a critical-section lock:
  x = lock variable
  compare_value = 0 (FALSE)
  new_value = 1 (TRUE)

18 CUDA Functions for Locks
Among the CUDA atomic functions is compare-and-swap:

int atomicCAS(int* address, int compare_value, int new_value);

Reads the 32- or 64-bit global or shared memory location at address, compares its contents with the first supplied value, compare_value, and if they are the same stores the second supplied value, new_value, in the memory location.
Returns the original value at address.
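atomicCAS can also be used to build atomic operations the hardware does not provide directly. For example, the CUDA C Programming Guide shows this pattern for double-precision atomic add on devices without native support (reproduced here as a sketch):

__device__ double atomicAddDouble(double *address, double val) {
    unsigned long long int *address_as_ull = (unsigned long long int *)address;
    unsigned long long int old = *address_as_ull, assumed;
    do {
        assumed = old;
        // swap in the new value only if no other thread changed it meanwhile
        old = atomicCAS(address_as_ull, assumed,
                        __double_as_longlong(val + __longlong_as_double(assumed)));
    } while (assumed != old);
    return __longlong_as_double(old);
}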

19 Coding Critical Sections with "Spin" Locks

__device__ int lock = 0;                   // unlocked

__global__ void kernel(...) {
    ...
    do {} while (atomicCAS(&lock, 0, 1));  // if lock == 0, set to 1 and enter
    ...                                    // critical section
    lock = 0;                              // free lock
    ...
}

To be tested. BW
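A caveat worth noting (an addition, not from the slides): if several threads in the same warp spin on the same lock, the winner and the spinning threads can deadlock on pre-Volta GPUs because of lockstep warp execution. A common workaround is to let only one thread per block contend for the lock:

__device__ int lock = 0;

__global__ void kernel_with_lock(/* ... */) {
    if (threadIdx.x == 0) {                    // one thread per block contends
        while (atomicCAS(&lock, 0, 1) != 0) ;  // spin until acquired
        // ... critical section (block-level work) ...
        atomicExch(&lock, 0);                  // release (see memory fences on a later slide)
    }
    __syncthreads();                           // rest of the block waits
}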

20 Critical Sections Serializing Code
High-performance programs should have as few critical sections as possible, because their use can serialize the code.
Suppose all processes happen to reach their critical sections together: they will execute their critical sections one after the other, and in that situation the execution time becomes almost that of a single processor.

21 Illustration
(Figure; not included in the transcript.)

22 Results from Histogram program
(Results figure; not included in the transcript.)

23 Observations:
- Performance seems to reach a maximum because of contention on the shared histogram array.
- More threads than numbers obviously will not help.
- Fewer threads than numbers causes threads to be reused in counting, so it is slower.
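One standard way to reduce that contention (a sketch added here, not from the slides) is to privatize the histogram: each block accumulates into its own copy in shared memory, where atomics are cheaper and contention is limited to the block, then merges into the global histogram once at the end. Shared-memory atomics need compute capability 1.2+; this sketch assumes at least 10 threads per block and that gpu_hist is zeroed beforehand.

__device__ int gpu_hist[10];                   // assumed zeroed by the host first

__global__ void gpu_histogram_shared(const int *a, int N) {
    __shared__ int block_hist[10];
    if (threadIdx.x < 10) block_hist[threadIdx.x] = 0;
    __syncthreads();

    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    int numberThreads = blockDim.x * gridDim.x;
    while (tid < N) {
        atomicAdd(&block_hist[a[tid]], 1);     // contention only within the block
        tid += numberThreads;
    }
    __syncthreads();

    if (threadIdx.x < 10)                      // one global atomic per bin per block
        atomicAdd(&gpu_hist[threadIdx.x], block_hist[threadIdx.x]);
}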

24 Memory Fences
Threads may observe the effects of a series of writes to memory executed by another thread in a different order than they were issued. To enforce ordering:

void __threadfence_block();

waits until all global and shared memory accesses made by the calling thread prior to __threadfence_block() are visible to all threads in the thread block.

Other routines:
void __threadfence();
void __threadfence_system();

25 Critical Sections with Memory Operations
Writes to device memory are not guaranteed to complete in any particular order, so global writes may not have completed by the time the lock is unlocked:

__global__ void kernel(...) {
    ...
    do {} while (atomicCAS(&lock, 0, 1));
    ...                                    // critical section
    __threadfence();                       // wait for writes to finish
    lock = 0;                              // free lock
}
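Putting the lock and the fence together, here is a hedged sketch of the full pattern; using atomicExch for the release (a choice added here, not on the slide) makes the unlocking write itself an atomic operation:

__device__ int lock = 0;

__global__ void kernel(/* ... */) {
    // ...
    while (atomicCAS(&lock, 0, 1) != 0) ;  // acquire
    // ... critical section: writes to global memory ...
    __threadfence();                       // make those writes visible first
    atomicExch(&lock, 0);                  // release
    // ...
}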

Questions