
CUDA-1
Computer Engg, IIT(BHU)
3/12/2013

CUDA is a set of development tools for creating applications that execute on the GPU (Graphics Processing Unit). The CUDA compiler uses a variation of C, with C++ support to follow. CUDA was developed by NVIDIA and as such runs only on NVIDIA GPUs of the G8x series and later.

GPU
It is a processor optimized for 2D/3D graphics, video, visual computing, and display. It is a highly parallel, highly multithreaded multiprocessor optimized for visual computing. It provides real-time visual interaction with computed objects via graphics, images, and video.

GPU
It serves as both a programmable graphics processor and a scalable parallel computing platform. Heterogeneous systems combine a GPU with a CPU.

GPU Evolution
1980s – No GPU; the PC used a VGA controller.
1990s – More functions added to the VGA controller.
1997 – 3D acceleration functions: hardware for triangle setup and rasterization, texture mapping, and shading.
2000 – A single-chip graphics processor (the beginning of the term "GPU").
2005 – Massively parallel programmable processors.
2007 – CUDA (Compute Unified Device Architecture).

GPU Graphics Trends
OpenGL – an open standard for 3D programming.
DirectX – a series of Microsoft multimedia programming interfaces.
New GPUs are developed every 12 to 18 months.
A new idea of visual computing: combining graphics processing and parallel computing.
Heterogeneous systems – CPU + GPU.

GPU Graphics Trends
The GPU evolves into a scalable parallel processor.
GPU computing: GPGPU and CUDA.
The GPU unifies graphics and computing.
GPU visual computing applications: OpenGL and DirectX.

Why CUDA
CUDA provides the ability to use high-level languages such as C to develop applications that take advantage of the high performance and scalability that GPU architectures offer.
GPUs allow the creation of a very large number of concurrently executing threads at very low system-resource cost.

Why CUDA
CUDA also exposes fast shared memory (16 KB) that can be shared among the threads of a block.
Full support for integer and bitwise operations.
Compiled code runs directly on the GPU.

CUDA Programming Model
The GPU is seen as a compute device that executes a portion of an application which:
  has to be executed many times,
  can be isolated as a function,
  works independently on different data.
Such a function can be compiled to run on the device. The resulting program is called a kernel.
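
As a minimal sketch of such a kernel (the name scale and its parameters are illustrative, not from the slides): every thread executes the same function, each on its own array element.

    // Each of the many launched threads runs this once, on its own element.
    __global__ void scale(float* data, float factor, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;  // unique global index
        if (i < n)                                      // guard: the grid may contain extra threads
            data[i] *= factor;
    }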

CUDA Programming Model
The batch of threads that executes a kernel is organized as a grid of thread blocks.

CUDA Programming Model
Thread block: a batch of threads that can cooperate with each other through
  fast shared memory,
  synchronization,
  thread IDs.
A block can be a one-, two-, or three-dimensional array of threads.

CUDA Programming Model
Grid of thread blocks:
  the number of threads in a single block is limited,
  a grid allows a much larger number of threads to execute the same kernel with one invocation,
  blocks are identifiable via a block ID,
  the trade-off is reduced thread cooperation (threads in different blocks cannot cooperate),
  blocks can form a one- or two-dimensional array.
A host-side sketch of such a launch follows below.
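
Assuming the scale kernel sketched earlier, a device pointer d_data, and an element count N (all illustrative), the host covers N elements by launching a grid of many blocks:

    int threadsPerBlock = 256;                                // within the per-block thread limit
    int blocks = (N + threadsPerBlock - 1) / threadsPerBlock; // round up so every element is covered
    scale<<<blocks, threadsPerBlock>>>(d_data, 2.0f, N);      // one invocation, blocks * 256 threads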

CUDA Programming Model (figure slide)

CUDA Memory Model (figure slide)

CUDA Memory Model
Shared memory is on-chip:
  much faster than local and global memory,
  as fast as a register when there are no bank conflicts,
  divided into equally sized memory banks.
Successive 32-bit words are assigned to successive banks, and each bank has a bandwidth of 32 bits per clock cycle.

CUDA Memory Model
Shared memory: a memory request takes two cycles for a warp,
  one cycle for the first half-warp and one for the second,
  so there are no bank conflicts between threads from the first and second halves.
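
As an illustration of threads cooperating through shared memory, here is a hypothetical block-wide sum (the name blockSum is not from the slides; it assumes a power-of-two launch such as blockSum<<<blocks, 256>>>(d_in, d_out)):

    __global__ void blockSum(const float* in, float* out)
    {
        __shared__ float buf[256];                      // on-chip, one 32-bit word per thread
        int tid = threadIdx.x;
        buf[tid] = in[blockIdx.x * blockDim.x + tid];   // each thread loads one element
        __syncthreads();                                // make all loads visible to the block
        for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
            if (tid < stride)
                buf[tid] += buf[tid + stride];          // pairwise sums within shared memory
            __syncthreads();                            // finish each round before the next
        }
        if (tid == 0)
            out[blockIdx.x] = buf[0];                   // thread 0 writes the block's total
    }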

CUDA API
An extension to the C programming language:
  function type qualifiers to specify execution on the host or the device,
  variable type qualifiers to specify the memory location on the device,
  a new directive to specify how to execute a kernel on the device,
  four built-in variables that specify the grid and block dimensions and the block and thread indices.

CUDA API
Function type qualifiers:
  __device__ – executed on the device, callable from the device only.
  __global__ – executed on the device, callable from the host only.
  __host__ – executed on the host, callable from the host only.
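
A short sketch showing all three qualifiers together (the function names are illustrative):

    __device__ float square(float x)               // device code, callable only from device code
    {
        return x * x;
    }

    __global__ void squareAll(float* data, int n)  // device code, launched from the host
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            data[i] = square(data[i]);             // a __global__ kernel may call __device__ functions
    }

    __host__ void runSquareAll(float* d_data, int n)  // host code (__host__ is the default)
    {
        squareAll<<<(n + 255) / 256, 256>>>(d_data, n);
    }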

CUDA API
Variable type qualifiers:
  __device__ – resides in global memory space, has the lifetime of the application, and is accessible from all the threads within the grid and from the host through the runtime library.
  __constant__ (optionally used together with __device__) – resides in constant memory space, has the lifetime of the application, and is readable from all the threads within the grid.
  __shared__ (optionally used together with __device__) – resides in the shared memory space of a thread block, has the lifetime of the block, and is accessible only from the threads within the block.
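
A sketch placing one variable in each space (the names are illustrative, and it assumes a launch with 128 threads per block; the host would initialise the first two through the runtime library, e.g. cudaMemcpyToSymbol):

    __device__   float d_scale;        // global memory, lives for the whole application
    __constant__ float c_offset;       // constant memory, read-only for the grid

    __global__ void transform(float* data)
    {
        __shared__ float tile[128];    // shared memory, lives as long as the block
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        tile[threadIdx.x] = data[i];   // stage the element in on-chip memory
        __syncthreads();               // make the staged values visible to the block
        data[i] = tile[threadIdx.x] * d_scale + c_offset;
    }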

CUDA API
Execution configuration (EC):
  must be specified for any call to a __global__ function,
  defines the dimensions of the grid and the blocks,
  specified by inserting an expression between the function name and the argument list.
A function declared as:
  __global__ void Func(float* parameter);
must be called like this:
  Func<<< Dg, Db, Ns >>>(parameter);

CUDA API
Execution configuration (EC), where Dg, Db, and Ns are:
  Dg is of type dim3 – the dimension and size of the grid; Dg.x * Dg.y = the number of blocks being launched.
  Db is of type dim3 – the dimension and size of each block; Db.x * Db.y * Db.z = the number of threads per block.
  Ns is of type size_t – the number of bytes of shared memory dynamically allocated per block, in addition to the statically allocated memory; Ns is an optional argument that defaults to 0.
A complete launch using all three values is sketched below.
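
Putting the three values together (a hypothetical launch; the kernel declares its shared memory as extern so the storage comes from Ns at launch time):

    __global__ void Func(float* parameter)
    {
        extern __shared__ float buf[];                  // Ns bytes, allocated at launch time
        int t = threadIdx.z * blockDim.y * blockDim.x
              + threadIdx.y * blockDim.x + threadIdx.x; // linear index within a 3D block
        buf[t] = parameter[t];                          // stage one element per thread
    }

    int main()
    {
        float* parameter;                    // illustrative device allocation of 256 floats
        cudaMalloc(&parameter, 256 * sizeof(float));
        dim3 Dg(16, 16);                     // Dg.x * Dg.y = 256 blocks in the grid
        dim3 Db(8, 8, 4);                    // Db.x * Db.y * Db.z = 256 threads per block
        size_t Ns = 256 * sizeof(float);     // dynamic shared memory per block, in bytes
        Func<<<Dg, Db, Ns>>>(parameter);
        cudaDeviceSynchronize();             // wait for the kernel to finish
        cudaFree(parameter);
        return 0;
    }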

CUDA API
Built-in variables:
  gridDim is of type dim3 – the dimensions of the grid.
  blockIdx is of type uint3 – the block index within the grid.
  blockDim is of type dim3 – the dimensions of the block.
  threadIdx is of type uint3 – the thread index within the block.
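
A sketch combining the built-in variables (the kernel name is illustrative and assumes a one-dimensional launch): each thread derives a unique global index from its block and thread indices.

    __global__ void whoAmI(int* blockOf, int* threadOf)
    {
        int g = blockIdx.x * blockDim.x + threadIdx.x;  // unique index across the grid
        blockOf[g]  = blockIdx.x;                       // which block this thread belongs to
        threadOf[g] = threadIdx.x;                      // its position within that block
        // gridDim.x * blockDim.x gives the total number of threads launched
    }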