CUDA Introduction, Martin Kruliš (v1.1), 27.10.2016



GPU Programming Revision

GPU Device
- A separate device connected via PCIe
- Quite independent of the host system, but all high-level commands are issued from the host
- Tailored for data parallelism: originally designed to process many vertices/fragments with the same piece of code (a shader)
- Requires specific code optimizations and fine-tuning: a wrong approach to work decomposition or suboptimal memory access patterns can degrade performance significantly

CUDA (Compute Unified Device Architecture)
- NVIDIA's parallel computing platform, implemented solely for GPUs
- First API released in 2007
- Used by various libraries: cuFFT, cuBLAS, …
- Many additional features: OpenGL/DirectX interoperability, computing-cluster support, an integrated compiler, …
- Alternatives: OpenCL, AMD Stream SDK, C++ AMP

CUDA vs. OpenCL

CUDA
- GPU-specific platform, by NVIDIA
- Faster changes
- Limited to NVIDIA hardware only
- Host and device code are compiled together
- Easier to tune for peak performance

OpenCL
- Generic platform, by Khronos
- Slower changes
- Supported by various vendors and devices
- Device code is compiled at runtime
- Easier for writing portable applications

Terminology: some differences between CUDA and OpenCL

  CUDA                     | OpenCL
  -------------------------|------------------
  GPU                      | Device
  Multiprocessor           | Compute Unit
  CUDA Core                | Compute Element
  Thread                   | Work Item
  Thread Block             | Work Group
  Shared Memory            | Local Memory
  Local Memory, Registers  | Private Memory

Device Detection

  int deviceCount;
  cudaGetDeviceCount(&deviceCount);
  ...
  cudaSetDevice(deviceIdx);

The device index is from the range [0, deviceCount - 1].

Querying Device Information

  cudaDeviceProp deviceProp;
  cudaGetDeviceProperties(&deviceProp, deviceIdx);

Device Features

Compute Capability
- A prescribed set of technologies and constants that a GPU device must implement
- Incrementally defined
- Architecture dependent; CC for known architectures:
  - 1.0, 1.3 – Tesla
  - 2.0, 2.1 – Fermi (Tesla M2090 – CC 2.0)
  - 3.x – Kepler (Tesla K20m – CC 3.5)
  - 5.x – Maxwell (GTX 980 – CC 5.2)
  - 6.x – Pascal (most recent), 7.x – Volta (planned)
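The detection calls from the previous slide can be combined with the properties query to print each device's compute capability at runtime; a minimal sketch (error checking omitted for brevity):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int deviceCount = 0;
    cudaGetDeviceCount(&deviceCount);
    for (int i = 0; i < deviceCount; ++i) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, i);
        // prop.major and prop.minor together form the compute capability (e.g., 5.2)
        printf("Device %d: %s (CC %d.%d)\n", i, prop.name, prop.major, prop.minor);
    }
    return 0;
}
```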

Memory Allocation

Device Memory Allocation
- C-like allocation system
- The programmer must distinguish host and GPU pointers!

  float *vec;
  cudaMalloc((void**)&vec, count * sizeof(float));
  ...
  cudaFree(vec);

Host-Device Data Transfers
- Explicit blocking functions:

  cudaMemcpy(vec, localVec, count * sizeof(float), cudaMemcpyHostToDevice);

Note that C++ bindings (with type control) exist.
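Each of these runtime calls returns a cudaError_t, which the slide's snippets ignore. A common idiom (not from the slides; the macro name CUDA_CHECK is an assumption) wraps every call so failures surface immediately:

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper: abort with a readable message if a CUDA runtime call fails.
#define CUDA_CHECK(call)                                            \
    do {                                                            \
        cudaError_t err = (call);                                   \
        if (err != cudaSuccess) {                                   \
            fprintf(stderr, "CUDA error '%s' at %s:%d\n",           \
                    cudaGetErrorString(err), __FILE__, __LINE__);   \
            exit(EXIT_FAILURE);                                     \
        }                                                           \
    } while (0)

int main() {
    float *vec = nullptr;
    const size_t count = 1024;
    CUDA_CHECK(cudaMalloc((void**)&vec, count * sizeof(float)));
    CUDA_CHECK(cudaFree(vec));
    return 0;
}
```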

Kernel Execution

Special function declarations:

  __device__ void foo(...) { ... }
  __global__ void bar(...) { ... }

Kernel launch:

  bar<<<Dg, Db [, Ns [, S]]>>>(args);

- Dg – dimensions and size of the grid (how many blocks are spawned)
- Db – dimensions and size of each block (threads per block)
- Ns – dynamically allocated shared memory per block (in bytes)
- S – stream index
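The four launch parameters can be illustrated with dim3 values and an explicit stream; a hedged sketch (the kernel body is a placeholder, the values are arbitrary):

```cuda
#include <cuda_runtime.h>

__global__ void bar(int *data) { /* placeholder kernel body */ }

void launch_example(int *cuData)
{
    dim3 Dg(16, 16);      // grid: 16x16 blocks
    dim3 Db(8, 8, 4);     // block: 8x8x4 = 256 threads
    size_t Ns = 1024;     // 1 KiB of dynamic shared memory per block
    cudaStream_t S;
    cudaStreamCreate(&S);
    bar<<<Dg, Db, Ns, S>>>(cuData);   // all four launch parameters used
    cudaStreamSynchronize(S);
    cudaStreamDestroy(S);
}
```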

Working Threads

Grid
- Consists of blocks
- Up to 3 dimensions

Each block
- Consists of threads
- Same dimensionality

Kernel constants
- gridDim, blockDim
- blockIdx, threadIdx
- each with .x, .y, .z members

Note that logically, the threads (and blocks) are organized so that the x-coordinate is closest. In other words, the linearized index of a thread in a block is idx = x + y*blockDim.x + z*blockDim.x*blockDim.y.

Code Example

  __global__ void vec_mul(float *X, float *Y, float *res)
  {
      int idx = blockIdx.x * blockDim.x + threadIdx.x;
      res[idx] = X[idx] * Y[idx];
  }

  ...
  float *X, *Y, *res, *cuX, *cuY, *cuRes;
  cudaSetDevice(0);
  cudaMalloc((void**)&cuX, N * sizeof(float));
  cudaMalloc((void**)&cuY, N * sizeof(float));
  cudaMalloc((void**)&cuRes, N * sizeof(float));
  cudaMemcpy(cuX, X, N * sizeof(float), cudaMemcpyHostToDevice);
  cudaMemcpy(cuY, Y, N * sizeof(float), cudaMemcpyHostToDevice);
  vec_mul<<<(N/64), 64>>>(cuX, cuY, cuRes);  // assumes N is a multiple of 64
  cudaMemcpy(res, cuRes, N * sizeof(float), cudaMemcpyDeviceToHost);
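The launch above only covers all elements when N is a multiple of the block size. The usual pattern (not shown on the slide) rounds the grid size up and guards the kernel with a bounds check:

```cuda
#include <cuda_runtime.h>

__global__ void vec_mul(const float *X, const float *Y, float *res, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n)   // guard: the last block may be only partially filled
        res[idx] = X[idx] * Y[idx];
}

void launch_vec_mul(const float *cuX, const float *cuY, float *cuRes, int n)
{
    const int blockSize = 64;
    const int gridSize = (n + blockSize - 1) / blockSize;   // round up
    vec_mul<<<gridSize, blockSize>>>(cuX, cuY, cuRes, n);
}
```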

Compilation

The nvcc Compiler
- Used for compiling both host and device code
- Defers compilation of the host code to gcc (Linux) or Microsoft VCC (Windows)

  $> nvcc -cudart static code.cu -o myapp

Can be used for compilation only:

  $> nvcc -compile ... kernel.cu -o kernel.obj
  $> cc -lcudart kernel.obj main.obj -o myapp

Device code is generated for the target architecture:

  $> nvcc -arch sm_13 ...        (compile for a real GPU with compute capability 1.3)
  $> nvcc -arch compute_35 ...   (compile to PTX with compute capability 3.5)

NVIDIA Tools

System Management Interface
- nvidia-smi (CLI application)
- NVML library (C-based API)
- Query GPU details:
    $> nvidia-smi -q
- Set various properties (ECC, compute mode, ...):
    $> nvidia-smi -pm 1
  (sets the drivers to persistent mode, which is recommended)

NVIDIA Visual Profiler
- $> nvvp &
- Use X11 SSH forwarding from ubergrafik/knight

Note: for some reason, "nvidia-smi -pm 1" did not work with newer drivers under RHEL 7; "nvidia-persistenced --persistence-mode" has to be used instead.

Discussion