The Missouri S&T CS GPU Cluster
Cyriac Kandoth

Pretext
NVIDIA is a manufacturer of graphics processor technologies that has begun to promote its GPUs as general-purpose devices (GPGPUs). They have donated 8 Tesla GPUs to the Missouri S&T Computer Science Department. This presentation will introduce you to the concept of creating applications that take advantage of the massive parallelism inherent in GPUs.

Tech Specs
4 desktop-style cases, each housing:
- Intel Core i7 (quad core with HyperThreading, i.e. 8 logical processors)
- 8 GB DDR3 (the Core i7 under-clocks the RAM to a speed that it supports)
- 500 GB 3.0 Gbps 7200 rpm SATA hard disk drive
- Two Tesla C1060 cards on PCI-Express 2.0 x16 (a Tesla C1060 is a compute-only GPU that has no video output ports)
- ATI Radeon HD2400 on standard PCI (for display)

The Tesla C1060
Form factor: 10.5" x 4.376", dual slot
Number of streaming processor cores: 240
Frequency of processor cores: 1.3 GHz
Single precision floating point performance (peak): 933 GFLOP/s
Double precision floating point performance (peak): 78 GFLOP/s
Floating point precision: IEEE 754 single & double
Total dedicated memory: 4 GB GDDR3
Memory speed: 800 MHz
Memory interface: 512-bit
Memory bandwidth: 102 GB/sec
Max power consumption: 200 W peak, 160 W typical
System interface: PCIe x16

Cluster networking
[Diagram: gpu0, gpu1, gpu2, and gpu3 connected to a Gigabit switch; the Missouri S&T network; interfaces labeled eth0 and eth1.]
The 4 nodes are named gpu0 through gpu3. gpu0 is the frontend that acts as the gateway into the cluster from the MST-USERS domain.

The CUDA programming model
CUDA (Compute Unified Device Architecture) is a C programming model and API (Application Programming Interface) introduced by NVIDIA to enable software developers to write general-purpose applications that run on the massively parallel hardware of GPUs. GPUs are optimal for data-parallel apps, aka SIMD (Single Instruction, Multiple Data). CUDA also allows us to code MIMD apps, but at reduced efficiency. Threads running in parallel use extremely fast shared memory for communication. There is no MPI_Send(), but the equivalent of MPI_Barrier() is __syncthreads().

The CUDA programming model
In your code, you can create a kernel (a function) that will run many instances of itself on parallel threads on the GPU. The threads running in parallel are collectively known as a grid. Kernels are run on the device (GPU) while the rest of the code runs on the host (CPU).

The CUDA programming model
A grid is organized into blocks, and each block is organized into threads. Only threads within the same block can communicate via shared memory and synchronize. This type of organization helps the GPU parallelize thread execution using its built-in hardware protocols.

Built-in Variables accessible in a Kernel
dim3 gridDim - Contains the dimensions of the grid as specified during kernel invocation: gridDim.x, gridDim.y (.z is unused).
uint3 blockIdx - Contains the block index within the grid: blockIdx.x, blockIdx.y (.z is unused).
dim3 blockDim - Contains the dimensions of the block: blockDim.x, blockDim.y, and blockDim.z.
uint3 threadIdx - Contains the thread index within the block: threadIdx.x, threadIdx.y, and threadIdx.z.
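
A minimal sketch (not from the slides) of how these built-in variables are typically combined: each thread computes a unique global index from its block index, the block dimensions, and its thread index, and uses it to pick the data element it owns.

// Sketch: computing a unique per-thread index from the built-in variables,
// assuming a 1D problem mapped onto a 1D grid of 1D blocks.
__global__ void indexDemo(int* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n)          // guard threads that fall beyond the end of the data
        out[i] = i;
}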

E.g. Host invokes kernel on a device

// Kernel definition, runs a copy on every thread
__global__ void vectorAdd( float* A, float* B, float* C )
{
    ...
}

int main(int argc, char** argv)
{
    dim3 blockSize(16, 16); // 256 threads per block (up to 3D)
    dim3 gridSize(4, 2);    // 8 blocks in the grid (up to 2D)

    // Invoke the kernel on the device (GPU)
    vectorAdd<<<gridSize, blockSize>>>(A, B, C);
    ... // Continue running on host (CPU) when device is done
}

CUDA Type Qualifiers
Function type qualifiers:
- __device__ : Executed on the device. Callable from the device only.
- __global__ : Executed on the device. Callable from the host only.
- __host__ : Executed on the host. Callable from the host only. Default type if unspecified.
Variable type qualifiers:
- __device__ : Resides in global memory space. Accessible from all the threads within the grid.
- __constant__ : Resides in constant memory space. Accessible from all the threads within the grid.
- __shared__ : Resides in the shared memory space of a thread block. Accessible only from the threads within that block.
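
To show how these qualifiers fit together, here is a small sketch with hypothetical names (scale, timesScale, scaleKernel, launch are illustrative, not from the slides):

// A __constant__ variable, a __device__ helper, a __global__ kernel, and a
// __host__ launcher side by side.
__constant__ float scale = 2.0f;             // constant memory, readable by every thread in the grid

__device__ float timesScale(float x)         // device function: callable only from device code
{
    return x * scale;
}

__global__ void scaleKernel(float* data, int n)  // kernel: runs on the device, launched from the host
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] = timesScale(data[i]);
}

__host__ void launch(float* d_data, int n)   // host function (the default if unspecified)
{
    scaleKernel<<<(n + 255) / 256, 256>>>(d_data, n);
}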

Template of a typical main()

int main(int argc, char** argv)
{
    // Allocate memory on the host for input data - malloc()
    // Initialize input data from file, user input, etc.
    // Allocate memory on the device - cudaMalloc()
    // Send input data to the device - cudaMemcpy()
    // Set up grid and block dimensions - dim3 variables
    // Invoke the kernel on the device (GPU) - kernelName<<<gridSize, blockSize>>>(input_params);
    // Copy results from device to host - cudaMemcpy()
    // Free up device memory - cudaFree()
    // Print results at the host, because the device can't.
    // (printf() from a kernel only works in emulation mode.)
}
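
As a concrete sketch of this template (assuming the vectorAdd kernel from the earlier slide, with a length parameter added, a 1D launch, and error checking omitted for brevity):

#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

__global__ void vectorAdd(float* A, float* B, float* C, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) C[i] = A[i] + B[i];
}

int main(int argc, char** argv)
{
    int n = 1024;
    size_t bytes = n * sizeof(float);

    // Allocate and initialize input data on the host
    float *A = (float*)malloc(bytes), *B = (float*)malloc(bytes), *C = (float*)malloc(bytes);
    for (int i = 0; i < n; i++) { A[i] = (float)i; B[i] = 2.0f * i; }

    // Allocate memory on the device
    float *dA, *dB, *dC;
    cudaMalloc((void**)&dA, bytes);
    cudaMalloc((void**)&dB, bytes);
    cudaMalloc((void**)&dC, bytes);

    // Send input data to the device
    cudaMemcpy(dA, A, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, B, bytes, cudaMemcpyHostToDevice);

    // Set up grid and block dimensions, then invoke the kernel
    dim3 blockSize(256);
    dim3 gridSize((n + 255) / 256);
    vectorAdd<<<gridSize, blockSize>>>(dA, dB, dC, n);

    // Copy results back to the host and print a sample at the host
    cudaMemcpy(C, dC, bytes, cudaMemcpyDeviceToHost);
    printf("C[0] = %f, C[%d] = %f\n", C[0], n - 1, C[n - 1]);

    // Free up device and host memory
    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    free(A); free(B); free(C);
    return 0;
}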

CUDA apps in emulation mode
Compile the program with the emu parameter enabled: make emu=1
The program emulates a GPU on the host CPU, which is usually much slower. It helps with debugging, because you are allowed to use printf() statements in device code (i.e. in CUDA kernels).

Types of GPU memory
Registers: The fastest form of memory on the GPU. Accessible only by individual threads, with the lifetime of a thread. We don't need to deal with them directly (but we can).
Shared memory: Can be as fast as a register when there are no bank conflicts (which occur when threads access different addresses that fall in the same memory bank). Accessible by any thread of the block from which it was created. Has the lifetime of the block.
Global memory: Potentially 150x slower than register or shared memory because of un-coalesced reads and writes. Accessible from either the host or the device. Has the lifetime of the application. Read-only global memory is called constant memory.
Local memory: Resides in global memory and can be 150x slower than register/shared memory. Accessible only by the owning thread. Has the lifetime of the thread.
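
To make the shared-memory case concrete, here is a small sketch (not from the slides) of a per-block sum: each block stages 256 elements in shared memory and reduces them, with __syncthreads() keeping the block's threads in step between phases.

// Sketch: per-block sum reduction in shared memory.
// Assumes blockDim.x == 256; hypothetical kernel name blockSum.
__global__ void blockSum(const float* in, float* blockSums, int n)
{
    __shared__ float cache[256];              // shared memory: visible to this block only
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    cache[threadIdx.x] = (i < n) ? in[i] : 0.0f;
    __syncthreads();                          // all loads done before anyone reads cache

    // Halve the number of active threads each step
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (threadIdx.x < stride)
            cache[threadIdx.x] += cache[threadIdx.x + stride];
        __syncthreads();
    }
    if (threadIdx.x == 0)
        blockSums[blockIdx.x] = cache[0];     // one partial sum per block, written to global memory
}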

A few CUDA API functions
cudaSetDevice(int dev) - Sets the device on which to run the kernel.
__syncthreads() - Blocks execution of all threads within a block until they synchronize.
cudaMalloc(void** devPtr, size_t count) - Allocates count bytes in GPU memory and returns a pointer to it in the parameter *devPtr.
cudaMemcpy(void* dst, const void* src, size_t count, enum cudaMemcpyKind kind) - Copies count bytes from src to dst, where kind is one of cudaMemcpyHostToDevice, cudaMemcpyDeviceToHost, cudaMemcpyDeviceToDevice, or cudaMemcpyHostToHost.
A complete listing of the CUDA API functions can be found in the Reference Manual.
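
These runtime calls all return a cudaError_t, so a common pattern (a sketch, not something from the slides; the CHECK macro name is our own) is to check every return value:

// Sketch: checking CUDA runtime return codes.
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

#define CHECK(call) do {                                          \
    cudaError_t err = (call);                                     \
    if (err != cudaSuccess) {                                     \
        fprintf(stderr, "CUDA error at %s:%d: %s\n",              \
                __FILE__, __LINE__, cudaGetErrorString(err));     \
        exit(1);                                                  \
    }                                                             \
} while (0)

// Usage:
//   CHECK( cudaSetDevice(0) );
//   CHECK( cudaMalloc((void**)&dA, bytes) );
//   CHECK( cudaMemcpy(dA, A, bytes, cudaMemcpyHostToDevice) );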

Tips for speedy code
- Have the kernel use the whole card: use a multiple of 32 threads per block and at least as many blocks as there are multiprocessors (30 on a Tesla C1060, which spreads its 240 cores across them).
- Access global memory properly. Coalescing: memory reads by consecutive threads are combined by the hardware into a few wide memory reads (see the sketch below).
- Avoid shared memory bank conflicts.
- Have as few branching conditionals and loops as possible.
- Have small loops unrolled.
- Have no unnecessary __syncthreads() calls.
See the CUDA Programming Guide for further discussion of all of the above.
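
A brief sketch (not from the slides) contrasting an access pattern that coalesces well with one that does not:

// In the coalesced version, consecutive threads touch consecutive addresses,
// so the hardware can combine them into wide memory transactions.
__global__ void copyCoalesced(const float* in, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];                 // thread k reads element k: coalesced
}

// A large stride scatters each warp's accesses across memory and defeats coalescing.
__global__ void copyStrided(const float* in, float* out, int n, int stride)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i * stride < n) out[i * stride] = in[i * stride];  // poor coalescing
}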

Demo: helloWorld using CUDA
Within the NVIDIA_CUDA_SDK projects folder, you will find the helloWorld project. Compile it in emulation mode using make emu=1. Execute the binary, which gets stored at ~/NVIDIA_CUDA_SDK/bin/linux/emurelease/helloWorld. Now go back to the code and take a closer look at it. Files: helloWorld.cu, helloWorld_kernel.cu.
Next week, we will see how to perform block matrix multiplication using CUDA (see the matrixMul project in the SDK).

Questions?