INF5063 – GPU & CUDA Håkon Kvale Stensland iAD-lab, Department for Informatics.

Slides:

Advertisements

Similar presentations

© David Kirk/NVIDIA and Wen-mei W. Hwu, ECE408/CS483, University of Illinois, Urbana-Champaign 1 ECE408 / CS483 Applied Parallel Programming.

Advertisements

CS179: GPU Programming Lecture 5: Memory. Today GPU Memory Overview CUDA Memory Syntax Tips and tricks for memory handling.

1 ITCS 6/8010 CUDA Programming, UNC-Charlotte, B. Wilkinson, Jan 25, 2011 DeviceRoutines.pptx Device Routines and device variables These notes will introduce:

GPU programming: CUDA Acknowledgement: the lecture materials are based on the materials in NVIDIA teaching center CUDA course materials, including materials.

Intermediate GPGPU Programming in CUDA

Complete Unified Device Architecture A Highly Scalable Parallel Programming Framework Submitted in partial fulfillment of the requirements for the Maryland.

GPU History CUDA. Graphics Pipeline Elements 1. A scene description: vertices, triangles, colors, lighting 2.Transformations that map the scene to a camera.

1 ITCS 5/4145 Parallel computing, B. Wilkinson, April 11, CUDAMultiDimBlocks.ppt CUDA Grids, Blocks, and Threads These notes will introduce: One.

GPU programming: CUDA Acknowledgement: the lecture materials are based on the materials in NVIDIA teaching center CUDA course materials, including materials.

INF5063 – GPU & CUDA Håkon Kvale Stensland Department for Informatics / Simula Research Laboratory.

INF5063 – GPU & CUDA Håkon Kvale Stensland iAD-lab, Department for Informatics.

Håkon Kvale Stensland Simula Research Laboratory

INF5062 – GPU & CUDA Håkon Kvale Stensland Simula Research Laboratory.

INF5062 – Competition Håkon Kvale Stensland & Håvard Espeland Simula Research Laboratory.

INF5063 – GPU & CUDA Håkon Kvale Stensland Simula Research Laboratory.

INF5063 – GPU & CUDA Håkon Kvale Stensland iAD-lab, Department for Informatics.

GPU Programming and CUDA Sathish Vadhiyar Parallel Programming.

CS 179: GPU Computing Lecture 2: The Basics. Recap Can use GPU to solve highly parallelizable problems – Performance benefits vs. CPU Straightforward.

CUDA Programming Model Xing Zeng, Dongyue Mou. Introduction Motivation Programming Model Memory Model CUDA API Example Pro & Contra Trend Outline.

Programming with CUDA, WS09 Waqar Saleem, Jens Müller Programming with CUDA and Parallel Algorithms Waqar Saleem Jens Müller.

Basic CUDA Programming Shin-Kai Chen VLSI Signal Processing Laboratory Department of Electronics Engineering National Chiao.

Parallel Programming using CUDA. Traditional Computing Von Neumann architecture: instructions are sent from memory to the CPU Serial execution: Instructions.

CUDA Programming Lei Zhou, Yafeng Yin, Yanzhi Ren, Hong Man, Yingying Chen.

Programming with CUDA WS 08/09 Lecture 5 Thu, 6 Nov, 2008.

CUDA (Compute Unified Device Architecture) Supercomputing for the Masses by Peter Zalutski.

Shekoofeh Azizi Spring  CUDA is a parallel computing platform and programming model invented by NVIDIA  With CUDA, you can send C, C++ and Fortran.

© David Kirk/NVIDIA and Wen-mei W. Hwu, , SSL 2014, ECE408/CS483, University of Illinois, Urbana-Champaign 1 ECE408 / CS483 Applied Parallel Programming.

An Introduction to Programming with CUDA Paul Richmond

GPU Programming and CUDA Sathish Vadhiyar High Performance Computing.

GPU History CUDA Intro. Graphics Pipeline Elements 1. A scene description: vertices, triangles, colors, lighting 2.Transformations that map the scene.

© David Kirk/NVIDIA and Wen-mei W. Hwu Taiwan, June 30-July 2, 2008 Taiwan 2008 CUDA Course Programming Massively Parallel Processors: the CUDA experience.

© David Kirk/NVIDIA and Wen-mei W. Hwu, ECE 498AL, University of Illinois, Urbana-Champaign 1 Programming Massively Parallel Processors CUDA.

Introduction to CUDA (1 of 2) Patrick Cozzi University of Pennsylvania CIS Spring 2012.

Introduction to CUDA 1 of 2 Patrick Cozzi University of Pennsylvania CIS Fall 2012.

First CUDA Program. #include "stdio.h" int main() { printf("Hello, world\n"); return 0; } #include __global__ void kernel (void) { } int main (void) {

1 ITCS 4/5010 GPU Programming, UNC-Charlotte, B. Wilkinson, Jan 14, 2013 CUDAProgModel.ppt CUDA Programming Model These notes will introduce: Basic GPU.

Basic CUDA Programming Computer Architecture 2015 (Prof. Chih-Wei Liu) Final Project – CUDA Tutorial TA Cheng-Yen Yang

CUDA All material not from online sources/textbook copyright © Travis Desell, 2012.

+ CUDA Antonyus Pyetro do Amaral Ferreira. + The problem The advent of multicore CPUs and manycore GPUs means that mainstream processor chips are now.

ME964 High Performance Computing for Engineering Applications CUDA Memory Model & CUDA API Sept. 16, 2008.

CIS 565 Fall 2011 Qing Sun

GPU Architecture and Programming

Parallel Processing1 GPU Program Optimization (CS 680) Parallel Programming with CUDA * Jeremy R. Johnson *Parts of this lecture was derived from chapters.

GPU Programming and CUDA Sathish Vadhiyar Parallel Programming.

Jie Chen. 30 Multi-Processors each contains 8 cores at 1.4 GHz 4GB GDDR3 memory offers ~100GB/s memory bandwidth.

Lecture 8-2 : CUDA Programming Slide Courtesy : Dr. David Kirk and Dr. Wen-Mei Hwu and Mulphy Stein.

Introduction to CUDA (1 of n*) Patrick Cozzi University of Pennsylvania CIS Spring 2011 * Where n is 2 or 3.

© David Kirk/NVIDIA and Wen-mei W. Hwu, ECE408/CS483, University of Illinois, Urbana-Champaign 1 Introduction to CUDA C (Part 2)

© David Kirk/NVIDIA and Wen-mei W. Hwu, ECE408/CS483, University of Illinois, Urbana-Champaign 1 ECE408 / CS483 Applied Parallel Programming.

Parallel Programming Basics  Things we need to consider:  Control  Synchronization  Communication  Parallel programming languages offer different.

© David Kirk/NVIDIA and Wen-mei W. Hwu, CS/EE 217 GPU Architecture and Programming Lecture 2: Introduction to CUDA C.

Introduction to CUDA 1 of 2 Patrick Cozzi University of Pennsylvania CIS Fall 2014.

© David Kirk/NVIDIA and Wen-mei W. Hwu, ECE408/CS483, University of Illinois, Urbana-Champaign 1 ECE 8823A GPU Architectures Module 2: Introduction.

CUDA Compute Unified Device Architecture. Agent Based Modeling in CUDA Implementation of basic agent based modeling on the GPU using the CUDA framework.

© David Kirk/NVIDIA and Wen-mei W. Hwu, ECE408/CS483, University of Illinois, Urbana-Champaign 1 ECE408 / CS483 Applied Parallel Programming.

GPU Programming and CUDA Sathish Vadhiyar High Performance Computing.

1 ITCS 4/5145GPU Programming, UNC-Charlotte, B. Wilkinson, Nov 4, 2013 CUDAProgModel.ppt CUDA Programming Model These notes will introduce: Basic GPU programming.

Summer School s-Science with Many-core CPU/GPU Processors Lecture 2 Introduction to CUDA © David Kirk/NVIDIA and Wen-mei W. Hwu Braga, Portugal, June 14-18,

Unit -VI  Cloud and Mobile Computing Principles  CUDA Blocks and Treads  Memory handling with CUDA  Multi-CPU and Multi-GPU solution.

Computer Engg, IIT(BHU)

CUDA C/C++ Basics Part 2 - Blocks and Threads

CUDA Programming Model

CS427 Multicore Architecture and Parallel Computing

Basic CUDA Programming

Slides from “PMPP” book

ECE 8823A GPU Architectures Module 2: Introduction to CUDA C

GPU Lab1 Discussion A MATRIX-MATRIX MULTIPLICATION EXAMPLE.

6- General Purpose GPU Programming

Parallel Computing 18: CUDA - I

Presentation transcript:

INF5063 – GPU & CUDA Håkon Kvale Stensland iAD-lab, Department for Informatics

INF5063, Pål Halvorsen, Carsten Griwodz, Håvard Espeland, Håkon Stensland University of Oslo Recap: The CUDA Programming Model  The GPU is viewed as a compute device that: −Is a coprocessor to the CPU, referred to as the host −Has its own DRAM called device memory −Runs many threads in parallel  Data-parallel parts of an application are executed on the device as kernels, which run in parallel on many threads  Differences between GPU and CPU threads −GPU threads are extremely lightweight Very little creation overhead −GPU needs 1000s of threads for full efficiency Multi-core CPU needs only a few

INF5063, Pål Halvorsen, Carsten Griwodz, Håvard Espeland, Håkon Stensland University of Oslo Recap: Terminology  device = GPU = Set of multiprocessors  Multiprocessor = Set of processors & shared memory  Kernel = Program running on the GPU  Grid = Array of thread blocks that execute a kernel  Thread block = Group of SIMD threads that execute a kernel and can communicate via shared memory MemoryLocationCachedAccessWho LocalOff-chipNoRead/writeOne thread SharedOn-chipN/A - residentRead/writeAll threads in a block GlobalOff-chipNoRead/writeAll threads + host ConstantOff-chipYesReadAll threads + host TextureOff-chipYesReadAll threads + host

INF5063, Pål Halvorsen, Carsten Griwodz, Håvard Espeland, Håkon Stensland University of Oslo Recap: Access Times  Register – Dedicated HW – Single cycle  Shared Memory – Dedicated HW – Single cycle  Local Memory – DRAM, no cache – “Slow”  Global Memory – DRAM, no cache – “Slow”  Constant Memory – DRAM, cached, 1…10s…100s of cycles, depending on cache locality  Texture Memory – DRAM, cached, 1…10s…100s of cycles, depending on cache locality

CUDA – API

INF5063, Pål Halvorsen, Carsten Griwodz, Håvard Espeland, Håkon Stensland University of Oslo CUDA Highlights  The API is an extension to the ANSI C programming language Low learning curve than OpenGL/Direct3D  The hardware is designed to enable lightweight runtime and driver High performance

INF5063, Pål Halvorsen, Carsten Griwodz, Håvard Espeland, Håkon Stensland University of Oslo CUDA Device Memory Allocation  cudaMalloc() −Allocates object in the device Global Memory −Requires two parameters Address of a pointer to the allocated object Size of allocated object  cudaFree() −Frees object from device Global Memory Pointer to the object (Device) Grid Constant Memory Texture Memory Global Memory Block (0, 0) Shared Memory Local Memor y Thread (0, 0) Register s Local Memor y Thread (1, 0) Register s Block (1, 0) Shared Memory Local Memor y Thread (0, 0) Register s Local Memor y Thread (1, 0) Register s Host

INF5063, Pål Halvorsen, Carsten Griwodz, Håvard Espeland, Håkon Stensland University of Oslo CUDA Device Memory Allocation  Code example: −Allocate a 64 * 64 single precision float array −Attach the allocated storage to Md.elements −“d” is often used to indicate a device data structure BLOCK_SIZE = 64; Matrix Md int size = BLOCK_SIZE * BLOCK_SIZE * sizeof(float); cudaMalloc((void**)&Md.elements, size); cudaFree(Md.elements);

INF5063, Pål Halvorsen, Carsten Griwodz, Håvard Espeland, Håkon Stensland University of Oslo CUDA Host-Device Data Transfer  cudaMemcpy() −memory data transfer −Requires four parameters Pointer to source Pointer to destination Number of bytes copied Type of transfer  Host to Host  Host to Device  Device to Host  Device to Device  Asynchronous operations available (Streams) (Device) Grid Constant Memory Texture Memory Global Memory Block (0, 0) Shared Memory Local Memor y Thread (0, 0) Register s Local Memor y Thread (1, 0) Register s Block (1, 0) Shared Memory Local Memor y Thread (0, 0) Register s Local Memor y Thread (1, 0) Register s Host

INF5063, Pål Halvorsen, Carsten Griwodz, Håvard Espeland, Håkon Stensland University of Oslo CUDA Host-Device Data Transfer  Code example: −Transfer a 64 * 64 single precision float array −M is in host memory and Md is in device memory −cudaMemcpyHostToDevice and cudaMemcpyDeviceToHost are symbolic constants cudaMemcpy(Md.elements, M.elements, size, cudaMemcpyHostToDevice); cudaMemcpy(M.elements, Md.elements, size, cudaMemcpyDeviceToHost);

INF5063, Pål Halvorsen, Carsten Griwodz, Håvard Espeland, Håkon Stensland University of Oslo CUDA Function Declarations Executed on the: Only callable from the: __device__ float DeviceFunc() device __global__ void KernelFunc() devicehost __host__ float HostFunc() host  __global__ defines a kernel function −Must return void  __device__ and __host__ can be used together

INF5063, Pål Halvorsen, Carsten Griwodz, Håvard Espeland, Håkon Stensland University of Oslo CUDA Function Declarations  __device__ functions cannot have their address taken  Limitations for functions executed on the device: − No recursion − No static variable declarations inside the function − No variable number of arguments

Example: Hello World

INF5063, Pål Halvorsen, Carsten Griwodz, Håvard Espeland, Håkon Stensland University of Oslo // Hello World CUDA - INF5063 // #include the entire body of the cuPrintf code (availible in the SDK) #include "util/cuPrintf.cu" #include __global__ void device_hello(void) { cuPrintf("Hello, world from the GPU!\n"); } int main(void) { // greet from the CPU printf("Hello, world from the CPU!\n"); // init cuPrintf cudaPrintfInit(); // launch a kernel with a single thread to say hi from the device device_hello >>(); // display the device's greeting cudaPrintfDisplay(); // clean up after cuPrintf cudaPrintfEnd(); return 0; } Example: Hello World

Example: Vector Addition