Unit VI  Cloud and Mobile Computing Principles  CUDA Blocks and Threads  Memory Handling with CUDA  Multi-CPU and Multi-GPU Solutions

Cloud Computing Principles

Mobile Computing Principles

CUDA Blocks and Threads The CUDA programming model groups threads into special groups it calls warps, blocks, and grids. Thread: the fundamental building block of a parallel program; a single thread of execution, as through any serial piece of code.

Problem Decomposition Parallelism in the CPU domain: run more than one (single-threaded) program on a single CPU, i.e. task-level parallelism; or use the data-parallelism model and split the task into N parts, where N is the number of CPU cores available.

Problem decomposition GPU domain

Threading on GPUs Consider a simple serial loop (a, b, and c are global arrays of 128 elements):

void some_func(void)
{
    int i;
    for (i = 0; i < 128; i++)
    {
        a[i] = b[i] * c[i];
    }
}

On a quad-core CPU the loop could be split into four blocks: CPU core 1 handles indexes 0–31, core 2 indexes 32–63, core 3 indexes 64–95, and core 4 indexes 96–127.
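A minimal CPU-side sketch of this decomposition (not part of the original slides), using OpenMP as one possible mechanism; the static schedule with a chunk size of 32 gives each of the four threads one contiguous block of indexes:

#include <omp.h>

#define NUM_ELEM 128

void some_func(int *a, const int *b, const int *c)
{
    /* Four chunks of 32 indexes: thread 0 gets 0-31, thread 1 gets 32-63, ... */
    #pragma omp parallel for num_threads(4) schedule(static, 32)
    for (int i = 0; i < NUM_ELEM; i++)
        a[i] = b[i] * c[i];
}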

CUDA translates this loop by creating a kernel function, which is a function that executes on the GPU only and cannot be called directly by the CPU. The GPU is used to accelerate the computationally intensive sections of a program.

CUDA GPU kernel function:

__global__ void some_kernel_func(int * const a, const int * const b, const int * const c)
{
    a[i] = b[i] * c[i];
}

__global__: tells the compiler to generate GPU code and to make that GPU code globally visible from (i.e. callable by) the CPU.

CUDA The CPU and GPU have separate memory spaces, so the global arrays a, b, and c at the CPU level are no longer visible at the GPU level. Also, i is not defined; its value is determined by the thread that is currently running. We will be launching 128 instances of this function, and CUDA provides a special parameter, different for each thread, which defines the thread ID or number and can be used as an index into the array.

CUDA The thread information is provided in a structure, threadIdx. As it's a structure element, store it in a variable, thread_idx. The code becomes:

__global__ void some_kernel_func(int * const a, const int * const b, const int * const c)
{
    const unsigned int thread_idx = threadIdx.x;
    a[thread_idx] = b[thread_idx] * c[thread_idx];
}

Threads are grouped into sets of 32. A group of 32 threads is a warp, and 16 threads form a half warp. Thus 128 threads translate into four groups of 32 threads each.

A group of 32 threads forms a warp.

CUDA kernels To invoke a kernel you use the following syntax:

kernel_function<<<num_blocks, num_threads>>>(param1, param2, ...);

num_blocks: the number of blocks; num_threads: the number of threads you wish to launch into the kernel (the hardware limits you to 512 threads per block on the early hardware and 1024 on the later hardware).

CUDA kernels Parameters can be passed via registers or constant memory. If registers are used, one register is needed for every thread per parameter passed; 128 threads with three parameters use 3 × 128 = 384 registers. With 8192 registers in each SM, a total of 64 registers per thread (8192 registers / 128 threads) is available.

CUDA BLOCKS

some_kernel_func<<<num_blocks, num_threads>>>(a, b, c);

__global__ void some_kernel_func(int * const a, const int * const b, const int * const c)
{
    const unsigned int thread_idx = (blockIdx.x * blockDim.x) + threadIdx.x;
    a[thread_idx] = b[thread_idx] * c[thread_idx];
}
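A self-contained sketch of the above (not from the original slides): two blocks of 64 threads are illustrative values that together cover indexes 0–127; error checking is omitted for brevity.

#include <stdio.h>

__global__ void some_kernel_func(int * const a, const int * const b, const int * const c)
{
    const unsigned int thread_idx = (blockIdx.x * blockDim.x) + threadIdx.x;
    a[thread_idx] = b[thread_idx] * c[thread_idx];
}

int main(void)
{
    const int num_elem = 128;
    const int size = num_elem * sizeof(int);
    int host_a[128], host_b[128], host_c[128];
    int *dev_a, *dev_b, *dev_c;

    for (int i = 0; i < num_elem; i++) { host_b[i] = i; host_c[i] = 2; }

    cudaMalloc((void **)&dev_a, size);
    cudaMalloc((void **)&dev_b, size);
    cudaMalloc((void **)&dev_c, size);
    cudaMemcpy(dev_b, host_b, size, cudaMemcpyHostToDevice);
    cudaMemcpy(dev_c, host_c, size, cudaMemcpyHostToDevice);

    some_kernel_func<<<2, 64>>>(dev_a, dev_b, dev_c);   /* 2 blocks x 64 threads = 128 */

    cudaMemcpy(host_a, dev_a, size, cudaMemcpyDeviceToHost);
    printf("a[10] = %d\n", host_a[10]);                 /* prints 20 */

    cudaFree(dev_a); cudaFree(dev_b); cudaFree(dev_c);
    return 0;
}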

A simple kernel to add two integers:

__global__ void add(int *a, int *b, int *c)
{
    *c = *a + *b;
}

– As before, __global__ is a CUDA C keyword meaning add() will execute on the device
– add() will be called from the host

int main(void)
{
    int a, b, c;                // host copies of a, b, c
    int *dev_a, *dev_b, *dev_c; // device copies of a, b, c
    int size = sizeof(int);     // we need space for an integer

    // allocate device copies of a, b, c
    cudaMalloc((void **)&dev_a, size);
    cudaMalloc((void **)&dev_b, size);
    cudaMalloc((void **)&dev_c, size);

    a = 2;
    b = 7;

    // copy inputs to device
    cudaMemcpy(dev_a, &a, size, cudaMemcpyHostToDevice);
    cudaMemcpy(dev_b, &b, size, cudaMemcpyHostToDevice);

    // launch add() kernel on GPU, passing parameters
    add<<<1, 1>>>(dev_a, dev_b, dev_c);

    // copy device result back to host copy of c
    cudaMemcpy(&c, dev_c, size, cudaMemcpyDeviceToHost);

    cudaFree(dev_a);
    cudaFree(dev_b);
    cudaFree(dev_c);
    return 0;
}

add<<<N, 1>>>(dev_a, dev_b, dev_c);

Instead of executing add() once, add() is executed N times in parallel:

__global__ void add(int *a, int *b, int *c)
{
    c[blockIdx.x] = a[blockIdx.x] + b[blockIdx.x];
}

#define N 512

int main(void)
{
    int *a, *b, *c;             // host copies of a, b, c
    int *dev_a, *dev_b, *dev_c; // device copies of a, b, c
    int size = N * sizeof(int); // we need space for 512 integers

    // allocate device copies of a, b, c
    cudaMalloc((void **)&dev_a, size);
    cudaMalloc((void **)&dev_b, size);
    cudaMalloc((void **)&dev_c, size);

    a = (int *)malloc(size);
    b = (int *)malloc(size);
    c = (int *)malloc(size);
    random_ints(a, N);
    random_ints(b, N);

    // copy inputs to device
    cudaMemcpy(dev_a, a, size, cudaMemcpyHostToDevice);
    cudaMemcpy(dev_b, b, size, cudaMemcpyHostToDevice);

    // launch add() kernel with N parallel blocks
    add<<<N, 1>>>(dev_a, dev_b, dev_c);

    // copy device result back to host copy of c
    cudaMemcpy(c, dev_c, size, cudaMemcpyDeviceToHost);

    free(a); free(b); free(c);
    cudaFree(dev_a);
    cudaFree(dev_b);
    cudaFree(dev_c);
    return 0;
}
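The same addition can also be expressed with threads instead of (or combined with) blocks; a sketch, not from the original slides, assuming N is a multiple of the chosen block size:

#define THREADS_PER_BLOCK 128   /* assumed block size, for illustration */

__global__ void add(int *a, int *b, int *c)
{
    /* the global index combines the block ID and the thread ID within the block */
    int index = blockIdx.x * blockDim.x + threadIdx.x;
    c[index] = a[index] + b[index];
}

/* launch with N / THREADS_PER_BLOCK blocks of THREADS_PER_BLOCK threads:
   add<<<N / THREADS_PER_BLOCK, THREADS_PER_BLOCK>>>(dev_a, dev_b, dev_c);   */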

Programming Model SIMT (Single Instruction, Multiple Threads) Threads run in groups of 32 called warps. Every thread in a warp executes the same instruction at any given time.

Programming Model A single kernel is executed by many threads. Threads are grouped into 'blocks'. A kernel launch creates a 'grid' of thread blocks.

Programming Model (figure © NVIDIA Corporation)

Programming Model All threads within a block can – Share data through 'Shared Memory' – Synchronize using '__syncthreads()' Threads and blocks have unique IDs – Available through special variables (threadIdx, blockIdx)

Thread Batching: Grids and Blocks A kernel is executed as a grid of thread blocks – All threads share the global data memory space. A thread block is a batch of threads that can cooperate with each other by: – Synchronizing their execution, for hazard-free shared memory accesses – Efficiently sharing data through a low-latency shared memory. Two threads from two different blocks cannot cooperate – (unless through slow global memory). Threads and blocks have IDs. (Figure: the host launches Kernel 1 as Grid 1 and Kernel 2 as Grid 2 on the device; each grid is a 2-D arrangement of blocks, e.g. Block (0,0)…Block (2,1), and each block is a 2-D arrangement of threads, e.g. Thread (0,0)…Thread (4,2). Courtesy: NVIDIA)

Memory Handling with CUDA

Memory Model Host (CPU) and device (GPU) have separate memory spaces. On the CPU side there are two types of memory: – Main memory / hard disk (DRAM): large but slow – Cache: high-speed, expensive memory holding frequently used data, in 3 levels: L1 – fastest, 16 KB, 32 KB, or 64 KB, per core; L2 – slower than L1, 256 KB to 512 KB, shared between cores; L3 – optional, several MB, shared between cores. The host manages memory on the device – Functions are used to allocate/set/copy/free memory on the device – Similar to the C functions (see the sketch below).
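A minimal host-side sketch of that allocate/set/copy/free pattern (illustrative names and sizes; error checking omitted):

#include <stdio.h>

int main(void)
{
    const int N = 256;
    const size_t size = N * sizeof(float);
    float host_data[256];

    float *dev_data = NULL;
    cudaMalloc((void **)&dev_data, size);        /* allocate device memory */
    cudaMemset(dev_data, 0, size);               /* set device memory      */
    cudaMemcpy(host_data, dev_data, size,        /* copy device -> host    */
               cudaMemcpyDeviceToHost);
    printf("host_data[0] = %f\n", host_data[0]); /* prints 0.000000        */
    cudaFree(dev_data);                          /* free device memory     */
    return 0;
}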

Memory Model Types of device memory – Registers – read/write per-thread – Local Memory – read/write per-thread – Shared Memory – read/write per-block – Global Memory – read/write across grids – Constant Memory – read across grids – Texture Memory – read across grids
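A sketch (not from the slides) showing where variables of each kind are declared; all names and sizes are made up for illustration, texture memory is omitted because it needs extra setup, and the kernel assumes 128-thread blocks:

__constant__ float coeff[16];      /* constant memory: read-only across grids          */
__device__   float gains[1024];    /* statically declared global memory                 */

__global__ void memory_spaces_demo(const float *in, float *out)
{
    __shared__ float tile[128];    /* shared memory: read/write per block               */
    float scale = 2.0f;            /* automatic variable: normally held in a register   */
    float big[64];                 /* large per-thread arrays may spill to local memory */

    const int i = blockIdx.x * blockDim.x + threadIdx.x;
    big[threadIdx.x % 64] = in[i];                       /* illustrative use only       */
    tile[threadIdx.x] = in[i] * scale + coeff[0] + gains[i % 1024] + big[threadIdx.x % 64];
    __syncthreads();
    out[i] = tile[threadIdx.x];    /* 'in'/'out' point to global memory (cudaMalloc)    */
}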

Memory Model (figure © NVIDIA Corporation)

CUDA Device Memory Space Overview Each thread can: – R/W per-thread registers – R/W per-thread local memory – R/W per-block shared memory – R/W per-grid global memory – Read only per-grid constant memory – Read only per-grid texture memory. The host can R/W global, constant, and texture memories. (Figure: a device grid of blocks; each block has its own shared memory, and each thread its own registers and local memory; global, constant, and texture memories span the grid and are accessible to the host.)

Register Memory Fastest memory. Only accessible by a thread; lifetime of the thread. 8K to 64K registers per SM, shared by all threads within the SM.

Shared Memory / L1 Cache  64 KB per SM  Extremely fast  Highly parallel  Restricted to a block  Bank-switched architecture (each bank serves one operation per cycle)  Example: Fermi's aggregate shared/L1 bandwidth is 1+ TB/s
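A sketch of how a block uses shared memory together with __syncthreads() (not from the slides; assumes a power-of-two block size of 256 and an input length that is a multiple of it):

#define BLOCK_SIZE 256   /* assumed block size for this sketch */

/* Each block sums BLOCK_SIZE elements of 'in' into one element of 'block_sums'. */
__global__ void block_sum(const int *in, int *block_sums)
{
    __shared__ int cache[BLOCK_SIZE];

    const unsigned int tid = threadIdx.x;
    const unsigned int i   = blockIdx.x * blockDim.x + threadIdx.x;

    cache[tid] = in[i];
    __syncthreads();                        /* all loads done before reducing */

    for (unsigned int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (tid < stride)
            cache[tid] += cache[tid + stride];
        __syncthreads();                    /* wait for each reduction step   */
    }

    if (tid == 0)
        block_sums[blockIdx.x] = cache[0];  /* one partial sum per block      */
}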

Global, Constant, and Texture Memories (Long-Latency Accesses) Global memory – Main means of communicating R/W data between host and device – Contents visible to all threads. Texture and constant memories – Read only – Constants initialized by host – Contents visible to all threads. (Figure: same device memory space diagram as in the overview above. Courtesy: NVIDIA)

Global Memory Writable from both the GPU and the CPU. It can be accessed from any device on the PCI-E bus; GPU cards can transfer data to and from one another directly, without needing the CPU. The memory on the GPU is accessible to the CPU host processor in one of three ways (see the sketch below): 1. Explicitly with a blocking transfer. 2. Explicitly with a nonblocking transfer. 3. Implicitly using zero memory copy.
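A host-side sketch of the three access methods (illustrative buffer size; error checking omitted; the zero-copy path assumes the device supports mapped host memory):

#include <stdlib.h>

int main(void)
{
    const size_t size = 1024 * sizeof(float);   /* illustrative size */
    float *dev_buf, *host_buf, *pinned_buf;

    cudaSetDeviceFlags(cudaDeviceMapHost);      /* enable zero-copy mapping, before other runtime work */
    cudaMalloc((void **)&dev_buf, size);

    /* 1. Explicit blocking transfer: cudaMemcpy returns only when the copy is done. */
    host_buf = (float *)malloc(size);
    cudaMemcpy(host_buf, dev_buf, size, cudaMemcpyDeviceToHost);

    /* 2. Explicit nonblocking transfer: page-locked (pinned) host memory plus a stream;
          the CPU is free to work until the stream is synchronized. */
    cudaStream_t stream;
    cudaStreamCreate(&stream);
    cudaHostAlloc((void **)&pinned_buf, size, cudaHostAllocDefault);
    cudaMemcpyAsync(pinned_buf, dev_buf, size, cudaMemcpyDeviceToHost, stream);
    cudaStreamSynchronize(stream);

    /* 3. Zero-copy: map pinned host memory into the device address space;
          kernels then read/write it directly over PCI-E with no explicit copy. */
    float *zc_host, *zc_dev;
    cudaHostAlloc((void **)&zc_host, size, cudaHostAllocMapped);
    cudaHostGetDevicePointer((void **)&zc_dev, zc_host, 0);
    /* pass zc_dev to a kernel in place of a cudaMalloc'd pointer */

    cudaFreeHost(pinned_buf);
    cudaFreeHost(zc_host);
    cudaStreamDestroy(stream);
    cudaFree(dev_buf);
    free(host_buf);
    return 0;
}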

CONSTANT MEMORY A form of virtual addressing of global memory; there is no special reserved constant memory block. It has two special properties: 1. it is cached, and 2. it supports broadcasting a single value to all the threads within a warp. It is read-only memory (declared at compile time as read only, or defined at runtime as read only by the host). The size of constant memory is restricted to 64 KB.

CONSTANT MEMORY To declare a section of memory as constant at compile time, use the __constant__ keyword:

__constant__ float my_array[1024] = { 0.0F, 1.0F, 1.34F, ... };

To change the contents at runtime, use the cudaMemcpyToSymbol function call prior to invoking the GPU kernel.
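A sketch of the runtime update (not from the slides; the array and kernel names are illustrative):

#include <stdio.h>

__constant__ float my_array[1024];          /* constant memory, declared at compile time */

__global__ void scale(const float *in, float *out)
{
    const int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = in[i] * my_array[0];           /* all threads in a warp read the same element,
                                               so the value is broadcast                   */
}

int main(void)
{
    float host_values[1024];
    for (int i = 0; i < 1024; i++)
        host_values[i] = 1.0f + i;

    /* Update constant memory at runtime, before the kernel launch. */
    cudaMemcpyToSymbol(my_array, host_values, sizeof(host_values));

    /* ... allocate in/out with cudaMalloc, then launch scale<<<blocks, threads>>>(...) ... */
    return 0;
}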

TEXTURE MEMORY used for two primary purposes: Caching on compute 1.x and 3.x hardware. Hardware-based manipulation of memory reads.

Memory Hierarchy (Figure: per-thread local memory; per-block shared memory; per-device global memory shared by sequential kernels, Kernel 0 then Kernel 1; host memory exchanged with Device 0 and Device 1 memories via cudaMemcpy().)

Hardware Implementation: Execution Model Each thread block of a grid is split into warps, each gets executed by one multiprocessor (SM) – The device processes only one grid at a time Each thread block is executed by one multiprocessor – So that the shared memory space resides in the on-chip shared memory A multiprocessor can execute multiple blocks concurrently – Shared memory and registers are partitioned among the threads of all concurrent blocks – So, decreasing shared memory usage (per block) and register usage (per thread) increases number of blocks that can run concurrently

Threads, Warps, Blocks There are (up to) 32 threads in a Warp – Only <32 when there are fewer than 32 total threads There are (up to) 32 Warps in a Block Each Block (and thus, each Warp) executes on a single SM GF110 has 16 SMs At least 16 Blocks required to “fill” the device More is better – If resources (registers, thread space, shared memory) allow, more than 1 Block can occupy each SM
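The per-device figures mentioned above (SM count, registers, shared memory, threads per block) can be queried at runtime; a small sketch using the standard cudaDeviceProp fields:

#include <stdio.h>

int main(void)
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);                  /* device 0 */

    printf("Device               : %s\n",  prop.name);
    printf("Multiprocessors (SMs): %d\n",  prop.multiProcessorCount);
    printf("Registers per block  : %d\n",  prop.regsPerBlock);
    printf("Shared mem per block : %zu bytes\n", prop.sharedMemPerBlock);
    printf("Max threads per block: %d\n",  prop.maxThreadsPerBlock);
    printf("Warp size            : %d\n",  prop.warpSize);
    return 0;
}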

More Terminology Review device = GPU = set of multiprocessors. Multiprocessor = set of processors & shared memory. Kernel = GPU program. Grid = array of thread blocks that execute a kernel. Thread block = group of SIMD threads that execute a kernel and can communicate via shared memory.

Memory     Location   Cached           Access       Who
Local      Off-chip   No               Read/write   One thread
Shared     On-chip    N/A - resident   Read/write   All threads in a block
Global     Off-chip   No               Read/write   All threads + host
Constant   Off-chip   Yes              Read         All threads + host
Texture    Off-chip   Yes              Read         All threads + host

Access Times Register – dedicated HW - single cycle Shared Memory – dedicated HW - single cycle Local Memory – DRAM, no cache - *slow* Global Memory – DRAM, no cache - *slow* Constant Memory – DRAM, cached, 1…10s…100s of cycles, depending on cache locality Texture Memory – DRAM, cached, 1…10s…100s of cycles, depending on cache locality Instruction Memory (invisible) – DRAM, cached

Multi-CPU and Multi-GPU Solutions MULTI-CPU SYSTEMS range from: single-socket, multicore desktops, laptops, and media PCs; workstations and low-end servers, typically dual-socket machines powered by multicore Xeon or Opteron CPUs; and data center–based servers with 4, 8, or 16 sockets, each with a multicore CPU, used to create a virtualized set of machines.

MULTI-CPU SYSTEMS The major problem with multiprocessor systems is memory coherency. To speed up access to memory locations, CPUs make extensive use of caches, which must be kept coherent. In a simple coherency model, when one core writes to a location x, the other cores mark the entry for x in their caches as invalid and must re-read it from main memory (non-cached), a huge performance hit. In more complex coherency models, every write has to be distributed to N caches; as N grows, the time to synchronize the caches becomes impractical.

MULTI-GPU SYSTEMS Dual-GPU cards such as the 9800 GX2, GTX 295, and GTX 590; typically used for scientific applications or when working with known hardware.

Single-Node Systems (CUDA 4.0 SDK): see the device-selection sketch below.
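A minimal sketch of enumerating and driving each GPU in a single node from one host thread (the kernel is a placeholder):

#include <stdio.h>

__global__ void dummy_kernel(void) { }

int main(void)
{
    int num_gpus = 0;
    cudaGetDeviceCount(&num_gpus);
    printf("GPUs found: %d\n", num_gpus);

    for (int dev = 0; dev < num_gpus; dev++) {
        cudaSetDevice(dev);                   /* select this GPU for subsequent calls */
        dummy_kernel<<<1, 1>>>();             /* work is issued to device 'dev'       */
    }
    for (int dev = 0; dev < num_gpus; dev++) {
        cudaSetDevice(dev);
        cudaDeviceSynchronize();              /* wait for each device to finish       */
    }
    return 0;
}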

STREAMS Streams are virtual work queues on the GPU. They are used for asynchronous operation, that is, when you want the GPU to operate separately from the CPU. By creating a stream you can push work and events into it. Streams and events are associated with the GPU context in which they were created.
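A sketch of pushing copies, a kernel, and an event into one stream (illustrative names; pinned host memory is used so the copies can be truly asynchronous):

#include <stdio.h>

__global__ void process(float *data)
{
    data[threadIdx.x] *= 2.0f;                 /* trivial placeholder work */
}

int main(void)
{
    const int N = 256;
    const size_t size = N * sizeof(float);
    float *host_data, *dev_data;

    cudaHostAlloc((void **)&host_data, size, cudaHostAllocDefault);  /* pinned host memory */
    cudaMalloc((void **)&dev_data, size);
    for (int i = 0; i < N; i++)
        host_data[i] = (float)i;

    cudaStream_t stream;
    cudaEvent_t done;
    cudaStreamCreate(&stream);                 /* the virtual work queue */
    cudaEventCreate(&done);

    /* Push work and an event into the stream; none of these calls block the CPU. */
    cudaMemcpyAsync(dev_data, host_data, size, cudaMemcpyHostToDevice, stream);
    process<<<1, N, 0, stream>>>(dev_data);
    cudaMemcpyAsync(host_data, dev_data, size, cudaMemcpyDeviceToHost, stream);
    cudaEventRecord(done, stream);

    /* ... the CPU could do other work here, overlapping with the GPU ... */

    cudaEventSynchronize(done);                /* wait only for this stream's work */
    printf("host_data[1] = %f\n", host_data[1]);   /* prints 2.000000 */

    cudaEventDestroy(done);
    cudaStreamDestroy(stream);
    cudaFree(dev_data);
    cudaFreeHost(host_data);
    return 0;
}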

Example Refer to page 276 of Shane Cook, CUDA Programming.

Multiple-Node Systems