
Unit VI: Cloud and Mobile Computing Principles  CUDA Blocks and Threads  Memory handling with CUDA  Multi-CPU and Multi-GPU solutions.


1 Unit VI: Cloud and Mobile Computing Principles  CUDA Blocks and Threads  Memory handling with CUDA  Multi-CPU and Multi-GPU solutions

2 Cloud Computing Principles

3 Mobile Computing Principles

4 CUDA Blocks and Threads The CUDA programming model groups threads into special collections it calls warps, blocks, and grids. Thread: the fundamental building block of a parallel program; it is the single thread of execution through any serial piece of code.

5 Problem decomposition Parallelism in the CPU domain: either run more than one (single-threaded) program on a single CPU (task-level parallelism), or use the data-parallelism model and split the task into N parts, where N is the number of CPU cores available.
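As a rough sketch of this data-parallel split (not from the original slides; the OpenMP pragma and array names are illustrative assumptions), the 128-iteration loop used in the following slides could be divided among the available CPU cores like this:

#include <omp.h>
#define NUM_ELEM 128

void cpu_data_parallel(int *a, const int *b, const int *c)
{
    /* OpenMP splits the 128 iterations into N chunks, one per CPU core. */
    #pragma omp parallel for
    for (int i = 0; i < NUM_ELEM; i++)
        a[i] = b[i] * c[i];
}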

6 Problem decomposition GPU domain

7 Threading on GPUs A serial CPU version of the loop (a, b, and c are global arrays):
void some_func(void)
{
    int i;
    for (i = 0; i < 128; i++)
    {
        a[i] = b[i] * c[i];
    }
}

8 On a quad-core CPU the loop can be split into four blocks, where CPU core 1 handles indexes 0–31, core 2 indexes 32–63, core 3 indexes 64–95, and core 4 indexes 96–127.

9 CUDA translates this loop by creating a kernel function, which is a function that executes on the GPU only and cannot be executed directly on the CPU. The GPU is used to accelerate the computationally intensive sections of a program.

10 CUDA GPU kernel function
__global__ void some_kernel_func(int * const a, const int * const b, const int * const c)
{
    a[i] = b[i] * c[i];
}
__global__: tells the compiler to generate GPU code and to make that GPU code globally visible from, i.e., callable from, the CPU.

11 CUDA The CPU and GPU have separate memory spaces, so the global arrays a, b, and c at the CPU level are no longer visible at the GPU level. The variable i is not defined; its value is determined by the thread that is currently running, since we are launching 128 instances of this function. CUDA provides a special parameter, different for each thread, which defines the thread ID or number and can be used as an index into the array.

12 CUDA The thread information is provided in a structure, threadIdx. As it's a structure element, we store it in a variable, thread_idx. The code becomes:
__global__ void some_kernel_func(int * const a, const int * const b, const int * const c)
{
    const unsigned int thread_idx = threadIdx.x;
    a[thread_idx] = b[thread_idx] * c[thread_idx];
}
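A minimal sketch of how this kernel might be launched for the 128-element case, assuming device pointers gpu_a, gpu_b, and gpu_c have already been allocated with cudaMalloc and filled with cudaMemcpy (the pointer names are illustrative, not from the slides):

// 1 block of 128 threads: threadIdx.x runs from 0 to 127
some_kernel_func<<<1, 128>>>(gpu_a, gpu_b, gpu_c);
cudaDeviceSynchronize();   // wait for the kernel to finish before reading results back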

13 Threads are grouped into sets of 32. A group of 32 threads is called a warp, and 16 threads form a half warp; 128 threads therefore translate into four warps of 32 threads each.

14 A group of 32 threads forms a warp

15

16

17 CUDA kernels To invoke a kernel you use the following syntax: kernel_function<<<num_blocks, num_threads>>>(param1, param2, ...) num_blocks: the number of blocks you wish to launch; num_threads: the number of threads per block you wish to launch into the kernel (the hardware limits you to 512 threads per block on early hardware and 1024 on later hardware).

18 CUDA kernels Parameters can be passed via registers or constant memory. If registers are used, one register is needed for every thread per parameter passed: 128 threads with three parameters use 3 × 128 = 384 registers. With 8192 registers in each SM, a total of 64 registers per thread (8192 registers / 128 threads) are available.

19 CUDA BLOCKS
some_kernel_func<<<num_blocks, num_threads>>>(a, b, c);
__global__ void some_kernel_func(int * const a, const int * const b, const int * const c)
{
    const unsigned int thread_idx = (blockIdx.x * blockDim.x) + threadIdx.x;
    a[thread_idx] = b[thread_idx] * c[thread_idx];
}
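For instance, the same 128 elements could be covered by two blocks of 64 threads each; this is a hedged sketch (the 2 × 64 split and the pointer names are illustrative assumptions):

// blockIdx.x is 0 or 1, blockDim.x is 64, threadIdx.x runs 0..63,
// so (blockIdx.x * blockDim.x) + threadIdx.x covers indexes 0..127.
some_kernel_func<<<2, 64>>>(gpu_a, gpu_b, gpu_c);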

20 A simple kernel to add two integers:
__global__ void add( int *a, int *b, int *c )
{
    *c = *a + *b;
}
– As before, __global__ is a CUDA C keyword meaning add() will execute on the device
– add() will be called from the host

21 int main( void )
{
    int a, b, c;                // host copies of a, b, c
    int *dev_a, *dev_b, *dev_c; // device copies of a, b, c
    int size = sizeof( int );   // we need space for an integer

    // allocate device copies of a, b, c
    cudaMalloc( (void**)&dev_a, size );
    cudaMalloc( (void**)&dev_b, size );
    cudaMalloc( (void**)&dev_c, size );

    a = 2;
    b = 7;

22
    // copy inputs to device
    cudaMemcpy( dev_a, &a, size, cudaMemcpyHostToDevice );
    cudaMemcpy( dev_b, &b, size, cudaMemcpyHostToDevice );

    // launch add() kernel on GPU, passing parameters
    add<<< 1, 1 >>>( dev_a, dev_b, dev_c );

    // copy device result back to host copy of c
    cudaMemcpy( &c, dev_c, size, cudaMemcpyDeviceToHost );

    cudaFree( dev_a );
    cudaFree( dev_b );
    cudaFree( dev_c );
    return 0;
}

23 add<<< N, 1 >>>( dev_a, dev_b, dev_c ); Instead of executing add() once, add() is executed N times in parallel:
__global__ void add( int *a, int *b, int *c )
{
    c[blockIdx.x] = a[blockIdx.x] + b[blockIdx.x];
}

24 #define N 512
int main( void )
{
    int *a, *b, *c;               // host copies of a, b, c
    int *dev_a, *dev_b, *dev_c;   // device copies of a, b, c
    int size = N * sizeof( int ); // we need space for 512 integers

    // allocate device copies of a, b, c
    cudaMalloc( (void**)&dev_a, size );
    cudaMalloc( (void**)&dev_b, size );
    cudaMalloc( (void**)&dev_c, size );

    a = (int*)malloc( size );
    b = (int*)malloc( size );
    c = (int*)malloc( size );
    random_ints( a, N );
    random_ints( b, N );

25
    // copy inputs to device
    cudaMemcpy( dev_a, a, size, cudaMemcpyHostToDevice );
    cudaMemcpy( dev_b, b, size, cudaMemcpyHostToDevice );

    // launch add() kernel with N parallel blocks
    add<<< N, 1 >>>( dev_a, dev_b, dev_c );

    // copy device result back to host copy of c
    cudaMemcpy( c, dev_c, size, cudaMemcpyDeviceToHost );

    free( a );
    free( b );
    free( c );
    cudaFree( dev_a );
    cudaFree( dev_b );
    cudaFree( dev_c );
    return 0;
}

26

27 Programming Model SIMT (Single Instruction Multiple Threads) Threads run in groups of 32 called warps Every thread in a warp executes the same instruction at a time

28 Programming Model A single kernel executed by several threads Threads are grouped into ‘blocks’ Kernel launches a ‘grid’ of thread blocks

29 Programming Model © NVIDIA Corporation

30 Programming Model All threads within a block can – Share data through ‘Shared Memory’ – Synchronize using ‘__syncthreads()’ Threads and Blocks have unique IDs – Available through special variables
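A hedged sketch of what block-level cooperation can look like: a block of threads reverses its portion of an array through shared memory. The kernel, its name, and the fixed block size are illustrative assumptions, not from the slides.

#define BLOCK_SIZE 128

__global__ void reverse_in_block(int *data)
{
    __shared__ int tmp[BLOCK_SIZE];                 // visible to all threads in this block
    const unsigned int tid  = threadIdx.x;
    const unsigned int base = blockIdx.x * blockDim.x;

    tmp[tid] = data[base + tid];                    // each thread stages one element
    __syncthreads();                                // wait until every thread has written tmp
    data[base + tid] = tmp[BLOCK_SIZE - 1 - tid];   // write back in reversed order
}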

31 Thread Batching: Grids and Blocks A kernel is executed as a grid of thread blocks; all threads share the data memory space. A thread block is a batch of threads that can cooperate with each other by: – synchronizing their execution, for hazard-free shared memory accesses – efficiently sharing data through a low-latency shared memory. Two threads from two different blocks cannot cooperate (unless through slow global memory). Threads and blocks have IDs. [Figure: the host launches Kernel 1 and Kernel 2; each kernel runs as a grid of blocks on the device, and each block, e.g. Block (1, 1), contains a 2-D arrangement of threads.] Courtesy: NVIDIA

32 Memory Handling with CUDA

33 Memory Model The host (CPU) and the device (GPU) have separate memory spaces. The CPU side has two kinds of memory: main memory (DRAM) backed by the hard disk, which is large but slow, and cache, which is high-speed, expensive memory holding frequently used data. Caches come in three levels: L1 is the fastest (16 KB, 32 KB, or 64 KB, private to a single core); L2 is slower than L1 (256 KB to 512 KB, shared across cores); L3 is optional (several MB, shared across cores). The host manages memory on the device, using functions to allocate/set/copy/free device memory that are similar to the C functions.

34 Memory Model Types of device memory – Registers – read/write per-thread – Local Memory – read/write per-thread – Shared Memory – read/write per-block – Global Memory – read/write across grids – Constant Memory – read across grids – Texture Memory – read across grids

35 Memory Model © NVIDIA Corporation

36 CUDA Device Memory Space Overview Each thread can: – R/W per-thread registers – R/W per-thread local memory – R/W per-block shared memory – R/W per-grid global memory – Read only per-grid constant memory – Read only per-grid texture memory [Figure: device memory hierarchy — each thread has registers and local memory, each block has shared memory, and the whole grid shares global, constant, and texture memory.] The host can R/W global, constant, and texture memories.

37 Register Memory Fastest memory. Accessible only by its thread; lifetime is that of the thread. 8 K to 64 K registers per SM, shared among all threads within the SM.

38 Shared Memory  Shares 64 KB per SM with the L1 cache  Extremely fast  Highly parallel  Restricted to a block  Bank-switched architecture (each bank serves one operation per cycle)  Example: Fermi's aggregate shared/L1 bandwidth is 1+ TB/s

39 Global, Constant, and Texture Memories (Long-Latency Accesses) Global memory – Main means of communicating R/W data between host and device – Contents visible to all threads Texture and constant memories – Read only – Constants initialized by host – Contents visible to all threads [Figure: same device memory hierarchy as the previous slide — registers and local memory per thread, shared memory per block, global/constant/texture memory per grid, accessible by the host.] Courtesy: NVIDIA

40 Global Memory Global memory is writable from both the GPU and the CPU, and it can be accessed from any device on the PCI-E bus. GPU cards can transfer data to and from one another directly, without needing the CPU. GPU memory is accessible to the CPU host processor in one of three ways: 1. Explicitly with a blocking transfer. 2. Explicitly with a nonblocking transfer. 3. Implicitly using zero memory copy.
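A hedged sketch of the first two access styles, assuming dev_buf, host_buf, size, and stream already exist (for cudaMemcpyAsync to be truly asynchronous, host_buf should be page-locked memory):

// 1. Blocking transfer: the CPU waits here until the copy completes.
cudaMemcpy(dev_buf, host_buf, size, cudaMemcpyHostToDevice);

// 2. Nonblocking transfer: the call returns immediately; the copy runs in the stream.
cudaMemcpyAsync(dev_buf, host_buf, size, cudaMemcpyHostToDevice, stream);
cudaStreamSynchronize(stream);   // wait before reusing host_buf

// 3. Zero memory copy instead maps page-locked host memory into the device,
//    e.g. memory allocated with cudaHostAlloc(..., cudaHostAllocMapped).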

41 CONSTANT MEMORY Constant memory is a form of virtual addressing of global memory; there is no special reserved constant memory block. It has two special properties: 1. it is cached, and 2. it supports broadcasting a single value to all the threads within a warp. It is read-only memory (declared at compile time as read-only, or defined at runtime as read-only by the host), and its size is restricted to 64 KB.

42 CONSTANT MEMORY To declare a section of memory as constant at compile time, use the __constant__ keyword: __constant__ float my_array[1024] = { 0.0F, 1.0F, 1.34F, . }; To change the contents at runtime, use the cudaMemcpyToSymbol function call prior to invoking the GPU kernel.
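A minimal sketch of the runtime path, assuming a 64-element parameter array (the names and size are illustrative, not from the slides):

__constant__ float dev_params[64];        // resides in constant memory on the device

void update_params(const float *host_params)
{
    // Copy 64 floats from host memory into the __constant__ symbol
    // before launching any kernel that reads dev_params.
    cudaMemcpyToSymbol(dev_params, host_params, 64 * sizeof(float));
}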

43 TEXTURE MEMORY Texture memory is used for two primary purposes: caching on compute 1.x and 3.x hardware, and hardware-based manipulation of memory reads.

44 Memory Hierarchy [Figure: (1) each thread has per-thread local memory; (2) each block has per-block shared memory; (3) sequential kernels (Kernel 0, Kernel 1) share per-device global memory; (4) host memory and the memories of Device 0 and Device 1 are connected via cudaMemcpy().]

45 Hardware Implementation: Execution Model Each thread block of a grid is split into warps, each of which is executed by one multiprocessor (SM) – The device processes only one grid at a time Each thread block is executed by one multiprocessor – So that the shared memory space resides in the on-chip shared memory A multiprocessor can execute multiple blocks concurrently – Shared memory and registers are partitioned among the threads of all concurrent blocks – So, decreasing shared memory usage (per block) and register usage (per thread) increases the number of blocks that can run concurrently

46 Threads, Warps, Blocks There are (up to) 32 threads in a Warp – Only <32 when there are fewer than 32 total threads There are (up to) 32 Warps in a Block Each Block (and thus, each Warp) executes on a single SM GF110 has 16 SMs At least 16 Blocks required to “fill” the device More is better – If resources (registers, thread space, shared memory) allow, more than 1 Block can occupy each SM
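To reason about how many blocks are needed to fill a particular device, the SM count and per-block limits can be queried at runtime. This is a hedged sketch using the standard cudaGetDeviceProperties call (the printed fields are a small selection, not an exhaustive list):

#include <stdio.h>

void print_device_limits(int dev)
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, dev);
    printf("SMs: %d, max threads/block: %d, regs/block: %d, shared mem/block: %zu\n",
           prop.multiProcessorCount, prop.maxThreadsPerBlock,
           prop.regsPerBlock, prop.sharedMemPerBlock);
}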

47 More Terminology Review device = GPU = set of multiprocessors Multiprocessor = set of processors & shared memory Kernel = GPU program Grid = array of thread blocks that execute a kernel Thread block = group of SIMD threads that execute a kernel and can communicate via shared memory

Memory   | Location | Cached         | Access     | Who
Local    | Off-chip | No             | Read/write | One thread
Shared   | On-chip  | N/A (resident) | Read/write | All threads in a block
Global   | Off-chip | No             | Read/write | All threads + host
Constant | Off-chip | Yes            | Read       | All threads + host
Texture  | Off-chip | Yes            | Read       | All threads + host

48 Access Times Register – dedicated HW - single cycle Shared Memory – dedicated HW - single cycle Local Memory – DRAM, no cache - *slow* Global Memory – DRAM, no cache - *slow* Constant Memory – DRAM, cached, 1…10s…100s of cycles, depending on cache locality Texture Memory – DRAM, cached, 1…10s…100s of cycles, depending on cache locality Instruction Memory (invisible) – DRAM, cached

49 Multi-CPU and Multi-GPU Solutions MULTI-CPU SYSTEMS range from single-socket, multicore desktops, laptops, and media PCs; to workstations and low-end servers built as dual-socket machines, typically powered by multicore Xeon or Opteron CPUs; to data center–based servers with 4, 8, or 16 sockets, each with a multicore CPU, used to create a virtualized set of machines.

50 MULTI-CPU SYSTEMS One of the major problems with a multiprocessor system is memory coherency. To speed up access to memory locations, CPUs make extensive use of caches, which must be kept coherent. In a simple coherency model, when one core writes to a location x, the other cores mark the entry for x in their caches as invalid and must re-read it from main memory, non-cached, which is a huge performance hit. In complex coherency models, every write has to be distributed to N caches; as N grows, the time to synchronize the caches becomes impractical.

51 MULTI-GPU SYSTEMS Examples include the 9800GX2, GTX295, and GTX590 (cards that carry two GPUs); multi-GPU setups are typical for scientific applications or when working with known hardware.

52 Single-Node Systems (CUDA 4.0 SDK)
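Since the CUDA 4.0 SDK, a single host thread can drive several GPUs in one node by switching the current device. A hedged sketch of the basic pattern (per-device allocations and kernel launches are left as comments):

int num_devices = 0;
cudaGetDeviceCount(&num_devices);   // number of CUDA-capable GPUs visible to this process

for (int dev = 0; dev < num_devices; dev++)
{
    cudaSetDevice(dev);             // subsequent CUDA calls target this GPU
    // allocate per-device buffers, copy inputs, and launch kernels here...
}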

53 STREAMS Streams are virtual work queues on the GPU used for asynchronous operation, that is, when you want the GPU to operate separately from the CPU. By creating a stream you can push work and events into that stream. Streams and events are associated with the GPU context in which they were created.
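A hedged sketch of pushing work into a stream and waiting on it; the buffers and launch configuration are placeholders, and the host buffers would normally be page-locked for the asynchronous copies to overlap with CPU work:

cudaStream_t stream;
cudaStreamCreate(&stream);

// Queue copies and a kernel launch into the stream; the CPU does not wait here.
cudaMemcpyAsync(dev_a, host_a, size, cudaMemcpyHostToDevice, stream);
some_kernel_func<<<num_blocks, num_threads, 0, stream>>>(dev_a, dev_b, dev_c);
cudaMemcpyAsync(host_c, dev_c, size, cudaMemcpyDeviceToHost, stream);

// Block the CPU until everything queued in this stream has finished.
cudaStreamSynchronize(stream);
cudaStreamDestroy(stream);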

54 Example Refer to page 276 of Shane Cook, CUDA Programming.

55 Multiple-Node Systems

