1
Tutorial on Distributed High Performance Computing
14:30 – 19:00 (2:30 pm – 7:00 pm), Wednesday November 17, 2010
Jornadas Chilenas de Computación 2010, INFONOR-CHILE 2010, November 15th - 19th, 2010, Antofagasta, Chile
Dr. Barry Wilkinson, University of North Carolina Charlotte
Nov 8, 2010 © Barry Wilkinson
2
Part 1b Emergence of GPU systems and clusters for general purpose High Performance Computing
3
3 Graphics Processing Units (GPUs) Over the last few years GPUs have developed from graphics cards into a platform for HPC. There is now great interest in using GPUs for scientific high performance computing, and GPUs are being designed with that application in mind. CUDA programming – C/C++ with a few additional features and routines to support GPU programming; uses the data-parallel paradigm.
4
4 http://www.hpcwire.com/blogs/New-China-GPGPU-Super-Outruns-Jaguar-105987389.html
5
Graphics Processing Units (GPUs) – Brief History (timeline figure, roughly 1970 to 2010):
- Atari 8-bit computer text/graphics chip
- IBM PC Professional Graphics Controller card
- S3 graphics cards – single-chip 2D accelerator
- OpenGL graphics API
- Hardware-accelerated 3D graphics
- DirectX graphics API
- PlayStation
- GPUs with programmable shading, e.g. NVIDIA GeForce 3 (2001)
- General-purpose computing on graphics processing units (GPGPU)
- GPU computing
Source of information: http://en.wikipedia.org/wiki/Graphics_Processing_Unit
6
NVIDIA products NVIDIA Corp. is the leader in GPUs for high performance computing. Timeline figure (1993 to 2010, with announced successors):
- Company established in 1993 by Jen-Hsun Huang, Chris Malachowsky, Curtis Priem
- NV1
- GeForce 1, GeForce 2 series, GeForce FX series
- GeForce 8 series, including the GeForce 8800 (G80 chip), NVIDIA's first GPU with general purpose processors
- Tesla and Quadro product lines (Tesla: C870, S870, C1060, S1070, C2050, …; the Tesla 2050 GPU has 448 thread processors)
- GeForce 200 series: GTX260/275/280/285/295
- GeForce 400 series (Fermi): GTX460/465/470/475/480/485
- Announced: Kepler (2011), Maxwell (2013)
http://en.wikipedia.org/wiki/GeForce
7
7 GPU performance gains over CPUs (chart): NVIDIA GPUs (NV30, NV40, G70, G80, GT200, T12) compared with Intel CPUs (3 GHz Dual Core P4, 3 GHz Core2 Duo, 3 GHz Xeon Quad, Westmere). Source © David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2009, ECE 498AL Spring 2010, University of Illinois, Urbana-Champaign
8
8 CPU-GPU architecture evolution Co-processors are a very old idea that appeared in the 1970s and 1980s with floating point co-processors attached to microprocessors that did not then have floating point capability. These co-processors simply executed floating point instructions that were fetched from memory. Around the same time there was interest in providing hardware support for displays, especially with the increasing use of graphics and PC games. This led to graphics processing units (GPUs) attached to the CPU to create the video display. Early design (figure): CPU with memory, connected to a graphics card that drives the display.
9
9 Birth of the general purpose programmable GPU Dedicated pipeline (late 1990s to early 2000s): by the late 1990s, graphics chips needed to support 3-D graphics, especially for games and for graphics APIs such as DirectX and OpenGL. Graphics chips generally had a pipeline structure, with individual stages performing specialized operations and finally loading the frame buffer for display. Individual stages may have access to graphics memory for storing intermediate computed data. Pipeline stages (figure): input stage, vertex shader stage, geometry shader stage, rasterizer stage, pixel shading stage, frame buffer, with access to graphics memory.
10
10 GeForce 6 Series Architecture (2004-5) From GPU Gems 2, Copyright 2005 by NVIDIA Corporation
11
11 General-Purpose GPU designs High performance pipelines call for high-speed (IEEE) floating point operations. People had been trying to use GPU cards to speed up scientific computations, an approach known as GPGPU (general-purpose computing on graphics processing units) -- difficult to do with specialized graphics pipelines, but possible. By the mid 2000s it was recognized that the individual stages of the graphics pipeline could be implemented by more general purpose processor cores (although with a data-parallel paradigm).
12
12 GPU design for general high performance computing 2006 -- the first GPU for general high performance computing as well as graphics processing: the NVIDIA G80 chip/GeForce 8800 card. Unified processors that could perform vertex, geometry, pixel, and general computing operations. Programs could now be written in C rather than through graphics APIs. Single-instruction multiple-thread (SIMT) programming model.
13
14
14 Evolving GPU design NVIDIA Fermi architecture (announced Sept 2009): 512 stream processing engines (SPEs), organized as 16 streaming multiprocessors (SMs) of 32 cores each. 3 GB or 6 GB GDDR5 memory. Many innovations including L1/L2 caches, unified device memory addressing, ECC memory, … Approximately a 3 billion transistor chip. First implementation: Tesla 20 series (single-chip C2050/2070, 4-chip S2050/2070). New Fermi chips planned (GT 300, GeForce 400 series).
15
15 Fermi Streaming Multiprocessor (SM) * Whitepaper NVIDIA’s Next Generation CUDA Compute Architecture: Fermi, NVIDIA, 2008
16
16 CUDA (Compute Unified Device Architecture) Architecture and programming model introduced by NVIDIA in 2007. Enables GPUs to execute programs written in C. Within C programs, one calls SIMT "kernel" routines that are executed on the GPU. A CUDA syntax extension to C identifies a routine as a kernel. Very easy to learn, although getting the highest possible execution performance requires an understanding of the hardware architecture.
17
17 Programming Model A program, once compiled, has code executed on the CPU and code executed on the GPU. There are separate memories on the CPU and GPU, so it is necessary to: explicitly transfer data from CPU memory to GPU memory for the GPU computation, and explicitly transfer results from GPU memory back to CPU memory. (Figure: CPU with main memory and GPU with global memory; copy from CPU to GPU, copy from GPU to CPU.)
18
18 Basic CUDA program structure
int main (int argc, char **argv ) {
1. Allocate memory space in device (GPU) for data
2. Allocate memory space in host (CPU) for data
3. Copy data to GPU
4. Call "kernel" routine to execute on GPU (with CUDA syntax that defines the number of threads and their physical structure)
5. Transfer results from GPU to CPU
6. Free memory space in device (GPU)
7. Free memory space in host (CPU)
return;
}
19
19 1. Allocating memory space in “device” (GPU) for data Use CUDA malloc routines:
int size = N * sizeof(int);   // space for N integers
int *devA, *devB, *devC;      // devA, devB, devC are pointers to device memory
cudaMalloc( (void**)&devA, size );
cudaMalloc( (void**)&devB, size );
cudaMalloc( (void**)&devC, size );
Derived from Jason Sanders, "Introduction to CUDA C," GPU Technology Conference, Sept. 20, 2010.
20
20 2. Allocating memory space in “host” (CPU) for data Use regular C malloc routines:
int *a, *b, *c;
…
a = (int*)malloc(size);
b = (int*)malloc(size);
c = (int*)malloc(size);
or statically declare variables:
#define N 256
…
int a[N], b[N], c[N];
21
21 3. Transferring data from host (CPU) to device (GPU) Use CUDA routine cudaMemcpy:
cudaMemcpy( devA, A, size, cudaMemcpyHostToDevice);
cudaMemcpy( devB, B, size, cudaMemcpyHostToDevice);
where devA and devB are pointers to the destinations in the device and A and B are pointers to the host data.
22
22 4. Declaring “kernel” routine to execute on device (GPU) CUDA introduces a syntax addition to C: triple angle brackets mark the call from host code to device code and contain the organization and number of threads in two parameters:
myKernel<<< n, m >>>(arg1, … );
n and m define the organization of thread blocks and of threads in a block. For now, we will set n = 1, which says one block, and m = N, which says N threads in this block. arg1, … are the arguments to routine myKernel, typically pointers to device memory obtained previously from cudaMalloc.
23
23 Declaring a Kernel Routine A kernel is defined using the CUDA specifier __global__. Example – adding two vectors A and B:
#define N 256
__global__ void vecAdd(int *A, int *B, int *C) {  // kernel definition
   int i = threadIdx.x;      // CUDA structure that provides the thread ID in the block
   C[i] = A[i] + B[i];
}
int main() {
   // allocate device memory &
   // copy data to device
   // device mem. ptrs devA, devB, devC
   vecAdd<<<1, N>>>(devA, devB, devC);   // grid of one block, block has N threads
   …
}
Each of the N threads performs one pair-wise addition:
Thread 0: devC[0] = devA[0] + devB[0];
Thread 1: devC[1] = devA[1] + devB[1];
…
Thread N-1: devC[N-1] = devA[N-1] + devB[N-1];
Loosely derived from CUDA C Programming Guide, v 3.2, 2010, NVIDIA
24
24 5. Transferring data from device (GPU) to host (CPU) Use CUDA routine cudaMemcpy:
cudaMemcpy( C, devC, size, cudaMemcpyDeviceToHost);
where devC is a pointer to the data in the device and C is a pointer to the destination in the host.
25
25 6. Free memory space in “device” (GPU) Use the CUDA cudaFree routine:
cudaFree( devA );
cudaFree( devB );
cudaFree( devC );
26
26 7. Free memory space in host (CPU) (if CPU memory was allocated with malloc) Use the regular C free routine to deallocate memory previously allocated with malloc:
free( a );
free( b );
free( c );
27
27 Complete CUDA program Adding two vectors, A and B. N elements in A and B, and N threads (without code to load the arrays with data):
#define N 256
__global__ void vecAdd(int *A, int *B, int *C) {
   int i = threadIdx.x;
   C[i] = A[i] + B[i];
}
int main (int argc, char **argv ) {
   int size = N * sizeof(int);
   int *a, *b, *c;             // host pointers
   int *devA, *devB, *devC;    // device pointers
   cudaMalloc( (void**)&devA, size );
   cudaMalloc( (void**)&devB, size );
   cudaMalloc( (void**)&devC, size );
   a = (int*)malloc(size); b = (int*)malloc(size); c = (int*)malloc(size);
   cudaMemcpy( devA, a, size, cudaMemcpyHostToDevice);
   cudaMemcpy( devB, b, size, cudaMemcpyHostToDevice);
   vecAdd<<<1, N>>>(devA, devB, devC);
   cudaMemcpy( c, devC, size, cudaMemcpyDeviceToHost);
   cudaFree( devA ); cudaFree( devB ); cudaFree( devC );
   free( a ); free( b ); free( c );
   return (0);
}
Derived from Jason Sanders, "Introduction to CUDA C," GPU Technology Conference, Sept. 20, 2010.
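The example above omits error checking, as is common in tutorial slides. A minimal sketch of one possible error-checking convention (not part of the tutorial's examples) wraps each CUDA runtime call in a macro that prints the error string and exits; the devA and a names are those of the program above:
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>
// Report any CUDA runtime error with file and line, then abort.
#define CHECK(call)                                              \
  do {                                                           \
    cudaError_t err = (call);                                    \
    if (err != cudaSuccess) {                                    \
      fprintf(stderr, "CUDA error %s at %s:%d\n",                \
              cudaGetErrorString(err), __FILE__, __LINE__);      \
      exit(EXIT_FAILURE);                                        \
    }                                                            \
  } while (0)
// Example use with the names from the program above:
//   CHECK( cudaMalloc((void**)&devA, size) );
//   CHECK( cudaMemcpy(devA, a, size, cudaMemcpyHostToDevice) );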
28
28 CUDA SIMT Thread Structure A kernel executes as a grid of thread blocks: the grid can be 1 or 2 dimensions, and each block can be 1, 2 or 3 dimensions. This allows flexibility and efficiency in processing 1-D, 2-D, and 3-D data on the GPU. The structure is linked to the internal organization of the GPU: threads in one block execute together. (CUDA C Programming Guide, v 3.2, 2010, NVIDIA)
29
29 Defining Grid/Block Structure Need to provide each kernel call with values for two key structures: the number of blocks in each dimension and the threads per block in each dimension:
myKernel<<< numBlocks, threadsPerBlock >>>(arg1, … );
numBlocks – number of blocks in the grid in each dimension (1-D or 2-D). An integer defines a 1-D grid of that size; otherwise use the CUDA structure, see next.
threadsPerBlock – number of threads in a block in each dimension (1-D, 2-D, or 3-D). An integer defines a 1-D block of that size; otherwise use the CUDA structure, see next.
Notes: The number of blocks is not limited by the specific GPU. The number of threads per block is limited by the specific GPU.
30
30 Built-in CUDA data types and structures CUDA Vector Types/Structures CUDA provides built-in variables and structures to define the number of blocks of threads in the grid in each dimension and the number of threads in a block in each dimension. uint3 and dim3 can be considered essentially as CUDA-defined structures of unsigned integers x, y, z, i.e.
struct uint3 { unsigned int x, y, z; };
struct dim3 { unsigned int x, y, z; };
They are used to define the grid of blocks and the threads in a block, see next. Unassigned dim3 structure components are automatically set to 1. There are other CUDA vector types.
31
31 Built-in Variables for Grid/Block Sizes dim3 gridDim -- size of the grid: gridDim.x * gridDim.y (z not used). dim3 blockDim -- size of a block: blockDim.x * blockDim.y * blockDim.z. Example:
dim3 grid(16, 16);    // Grid -- 16 x 16 blocks
dim3 block(32, 32);   // Block -- 32 x 32 threads
myKernel<<<grid, block>>>(...);
32
32 Built-in Variables for Grid/Block Indices uint3 blockIdx -- block index within the grid: blockIdx.x, blockIdx.y (z not used). uint3 threadIdx -- thread index within the block: threadIdx.x, threadIdx.y, threadIdx.z. The full global thread ID in the x and y dimensions can be computed by:
x = blockIdx.x * blockDim.x + threadIdx.x;
y = blockIdx.y * blockDim.y + threadIdx.y;
33
33 Example -- x direction A 1-D grid of 1-D blocks: 4 blocks (blockIdx.x = 0, 1, 2, 3), each having 8 threads (threadIdx.x = 0 … 7), so gridDim = 4 x 1 and blockDim = 8 x 1. For the thread with threadIdx.x = 2 in the block with blockIdx.x = 3:
Global thread ID = blockIdx.x * blockDim.x + threadIdx.x = 3 * 8 + 2 = thread 26 with linear global addressing (global ID 26).
Derived from Jason Sanders, "Introduction to CUDA C," GPU Technology Conference, Sept. 20, 2010.
34
34 Code example with a 1-D grid and 1-D blocks The number of blocks maps the whole vector across the grid, one element of each vector per thread:
#define N 2048   // size of vectors
#define T 256    // number of threads per block
__global__ void vecAdd(int *A, int *B, int *C) {
   int i = blockIdx.x*blockDim.x + threadIdx.x;
   C[i] = A[i] + B[i];
}
int main (int argc, char **argv ) {
   …
   vecAdd<<<N/T, T>>>(devA, devB, devC);   // assumes N/T is an integer
   …
   return (0);
}
35
35 If N/T is not necessarily an integer:
#define N 2048   // size of vectors
#define T 240    // number of threads per block
__global__ void vecAdd(int *A, int *B, int *C) {
   int i = blockIdx.x*blockDim.x + threadIdx.x;
   if (i < N) C[i] = A[i] + B[i];   // allows for more threads than vector elements
                                    // some threads unused
}
int main (int argc, char **argv ) {
   int blocks = (N + T - 1) / T;    // efficient way of rounding up to the next integer
   …
   vecAdd<<<blocks, T>>>(devA, devB, devC);
   …
   return (0);
}
36
36 Example using a 2-D grid and 2-D blocks Adding two arrays (matrices):
#define N 2048   // size of arrays
__global__ void addMatrix (int *a, int *b, int *c) {
   int i = blockIdx.x*blockDim.x + threadIdx.x;
   int j = blockIdx.y*blockDim.y + threadIdx.y;
   int index = i + j * N;
   if ( i < N && j < N) c[index] = a[index] + b[index];
}
int main() {
   ...
   dim3 dimBlock (16,16);
   dim3 dimGrid (N/dimBlock.x, N/dimBlock.y);
   addMatrix<<<dimGrid, dimBlock>>>(devA, devB, devC);
   …
}
37
37 Memory Structure within the GPU Local private memory -- per thread. Shared memory -- per block. Global memory -- per application. The GPU executes one or more kernel grids; a streaming multiprocessor (SM) executes one or more thread blocks; CUDA cores and other execution units in the SM execute threads. The SM executes threads in groups of 32 threads called a warp.* (* Whitepaper NVIDIA's Next Generation CUDA Compute Architecture: Fermi, NVIDIA, 2008)
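The slides above describe per-block shared memory but the deck's examples use only global memory. A minimal sketch (an addition here, with a hypothetical kernel name reverseBlock) shows how a block can stage its segment of an input vector in fast __shared__ memory and write it out reversed, assuming each block has exactly BLOCK_SIZE threads:
#define BLOCK_SIZE 256
__global__ void reverseBlock(int *d_out, const int *d_in)
{
   __shared__ int s[BLOCK_SIZE];             // shared memory: one copy per block
   int t = threadIdx.x;
   int base = blockIdx.x * blockDim.x;       // start of this block's segment
   s[t] = d_in[base + t];                    // each thread loads one element
   __syncthreads();                          // wait until the whole block has loaded
   d_out[base + t] = s[BLOCK_SIZE - 1 - t];  // read an element loaded by another thread
}
A launch such as reverseBlock<<<N/BLOCK_SIZE, BLOCK_SIZE>>>(devOut, devIn); would reverse each BLOCK_SIZE-element segment; __syncthreads() is needed because threads read elements written by other threads of the same block.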
38
38 Compiling code Linux: command line. CUDA provides nvcc (an NVIDIA "compiler-driver") to use instead of gcc:
nvcc –O3 –o <exe_file> <source_file>.cu –I/usr/local/cuda/include –L/usr/local/cuda/lib –lcudart
nvcc separates the compiled code for the CPU from the code for the GPU and compiles both; a regular C compiler must be installed for the CPU code. Make files are also provided. Windows: NVIDIA suggests using Microsoft Visual Studio.
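As a concrete illustration (the file name vecAdd.cu is chosen here for the example; it is not from the slides), the vector-addition program of slide 27 could be compiled and run with:
nvcc -O3 -o vecAdd vecAdd.cu
./vecAdd
On a standard CUDA installation nvcc finds the CUDA headers and runtime library itself, so the extra -I/-L/-lcudart options shown above are often unnecessary.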
39
39 Debugging NVIDIA has recently developed a debugging tool called Parallel Nsight, available for use with Visual Studio.
40
40 GPU Clusters GPU systems for HPC: GPU clusters, GPU grids, GPU clouds. With the advent of GPUs for scientific high performance computing, compute clusters can now incorporate GPUs, greatly increasing their compute capability.
41
42
42 Maryland CPU-GPU Cluster Infrastructure http://www.umiacs.umd.edu/research/GPU/facilities.html
43
43 Hybrid Programming Model for Clusters having Multicore Shared Memory Processors Combine MPI between nodes and OpenMP within nodes (or other thread libraries such as Pthreads). MPI/OpenMP compilation:
mpicc -o mpi_out mpi_test.c -fopenmp
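A minimal sketch of this hybrid pattern (illustrative only; names are arbitrary): each MPI process reports its rank, and each OpenMP thread within that process reports its thread number. It compiles with the mpicc command shown above.
#include <stdio.h>
#include <mpi.h>
#include <omp.h>
int main(int argc, char **argv)
{
   int rank;
   MPI_Init(&argc, &argv);                  // MPI: one process per node (or per core)
   MPI_Comm_rank(MPI_COMM_WORLD, &rank);
   #pragma omp parallel                     // OpenMP: threads within each process
   {
      printf("rank %d, thread %d of %d\n",
             rank, omp_get_thread_num(), omp_get_num_threads());
   }
   MPI_Finalize();
   return 0;
}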
44
44 Hybrid Programming Model for Clusters having Multicore Shared Memory Processors and GPUs Combine OpenMP and CUDA on one node, for the CPU and GPU respectively, with MPI between nodes. Note – all three are C-based, so they can be compiled together. NVIDIA provides a sample OpenMP/CUDA program.
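One common arrangement (a sketch under assumptions, not NVIDIA's sample; file names gpu_part.cu and cpu_part.c are hypothetical) compiles the device code with nvcc, the MPI/OpenMP host code with mpicc, and links the objects together with the CUDA runtime library:
/* gpu_part.cu  --  compile with:  nvcc -c gpu_part.cu                          */
__global__ void vecAdd(int *A, int *B, int *C)
{
   int i = blockIdx.x * blockDim.x + threadIdx.x;
   C[i] = A[i] + B[i];
}
/* C-callable wrapper so the MPI/OpenMP code never sees CUDA syntax */
extern "C" void gpu_vecAdd(int *devA, int *devB, int *devC, int blocks, int threads)
{
   vecAdd<<<blocks, threads>>>(devA, devB, devC);
   cudaDeviceSynchronize();   /* wait for the kernel before returning */
}
/* cpu_part.c  --  compile with:  mpicc -c cpu_part.c -fopenmp                  */
/* each MPI rank allocates device memory, copies its data, then calls:          */
/*    gpu_vecAdd(devA, devB, devC, N/T, T);                                     */
/* link everything:  mpicc cpu_part.o gpu_part.o -fopenmp -lcudart              */
The wrapper keeps the CUDA-specific syntax in the nvcc-compiled file, which is one way the three C-based models can be combined in a single executable.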
45
45 http://www.gpugrid.net/
46
46 Intel’s response to Nvidia and GPUs
47
Questions