Training Program on GPU Programming with CUDA, 31st July, 7th Aug and 14th Aug 2011, CUDA Teaching Center, UoM.

Training Program on GPU Programming with CUDA
Sanath Jayasena, CUDA Teaching Center, UoM
Day 1, Session 1: Introduction

Outline
- Training Program Description
- CUDA Teaching Center at UoM
- Subject Matter
  - Introduction to GPU Computing
  - GPU Computing with CUDA
  - CUDA Programming Basics

Overview of Training Program
- 3 Sundays, starting 31st July
- Schedule and program outline
- Main resource persons: Sanath Jayasena, Jayathu Samarawickrama, Kishan Wimalawarna, Lochandaka Ranathunga
- Dept. of Computer Science & Engineering and Dept. of Electronic & Telecommunication Engineering (Faculty of Engineering), and Faculty of IT

CUDA Teaching Center
- UoM was selected as a CUDA Teaching Center (CTC)
  - A group of people from multiple departments
- Benefits
  - Donation of hardware by NVIDIA (GeForce GTX 480s and Tesla C2070)
  - Access to other resources
- Expectations
  - Use of the resources for teaching/research and industry collaboration

GPU Computing: Introduction
- Graphics Processing Units (GPUs): high-performance many-core processors that can be used to accelerate a wide range of applications
- GPGPU: General-Purpose computation on Graphics Processing Units
- GPUs have led the race for floating-point performance since the start of the 21st century
- GPUs are now being used as general-purpose parallel processors

GPU Computing: Introduction
- Until the end of the 20th century, general computing relied on advances in hardware to increase the speed of software/applications
- Progress has slowed since then due to
  - Power consumption issues
  - Limited performance gains achievable within a single processor
- The switch to multi-core and many-core models
  - Multiple processing units (processor cores) are used in each chip to increase processing power
  - What is the impact on software developers?

GPU Computing: Introduction
- A sequential program will run on only one of the cores, and will not become any faster
- With each new generation of processors, the software that continues to enjoy performance improvements will be parallel programs
  - That is, programs in which multiple threads of execution cooperate to complete the work faster

CPU-GPU Performance Gap
(Figures: the growing gap between CPU and GPU floating-point performance and memory bandwidth. Source: CUDA Programming Guide 4.0)

GPGPU & CUDA
- The GPU is designed as a numeric computing engine
  - It will not perform as well as a CPU on some tasks
  - Most applications will use both CPUs and GPUs
- CUDA
  - NVIDIA's parallel computing architecture, aimed at increasing computing performance by harnessing the power of the GPU
  - Also a programming model

More Details on GPUs
- A GPU is typically an add-in card, installed in a PCI Express x16 slot
- Market leaders: NVIDIA, Intel, AMD (ATI)
- Example NVIDIA GPUs (donated to UoM): GeForce GTX 480 and Tesla C2070

Example Specifications

                                          GTX 480         Tesla C2070
  Peak double-precision FP performance    650 Gigaflops   515 Gigaflops
  Peak single-precision FP performance    1300 Gigaflops  1030 Gigaflops
  CUDA cores                              480             448
  Frequency of CUDA cores                 1.40 GHz        1.15 GHz
  Memory size (GDDR5)                     1536 MB         6 GB
  Memory bandwidth                        177.4 GB/s      150 GB/s
  ECC memory                              No              Yes
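Several of the properties in the table above can be queried at runtime through the CUDA runtime API. A minimal sketch (not part of the original slides; the fields shown are a small subset of `cudaDeviceProp`):

```cuda
#include <stdio.h>
#include <cuda_runtime.h>

int main(void) {
    int count = 0;
    cudaGetDeviceCount(&count);
    for (int d = 0; d < count; d++) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, d);
        // Report a few of the properties listed in the table above.
        // clockRate is reported in kHz, totalGlobalMem in bytes.
        printf("Device %d: %s\n", d, prop.name);
        printf("  Multiprocessors: %d\n", prop.multiProcessorCount);
        printf("  Clock rate:      %.2f GHz\n", prop.clockRate / 1e6);
        printf("  Global memory:   %zu MB\n", prop.totalGlobalMem >> 20);
        printf("  ECC enabled:     %s\n", prop.ECCEnabled ? "Yes" : "No");
    }
    return 0;
}
```

Running this on the donated hardware should report figures consistent with the table (e.g. clock rate and memory size).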

CPU vs. GPU Architecture
The GPU devotes more transistors to computation (and fewer to caching and control) than the CPU.

CPU-GPU Communication

CUDA Architecture
- CUDA is NVIDIA's solution for accessing the GPU
- It can be seen as an extension to C/C++
- (Figure: the CUDA software stack)

CUDA Architecture
There are two main parts:
1. Host (CPU part): Single Program, Single Data
2. Device (GPU part): Single Program, Multiple Data

CUDA Architecture: Grid Architecture
The Grid
1. A group of threads all running the same kernel
2. Multiple grids can run at once
The Block
1. Grids are composed of blocks
2. Each block is a logical unit containing a number of coordinating threads and some amount of shared memory

Some Applications of GPGPU
- Computational Structural Mechanics
- Bio-Informatics and Life Sciences
- Computational Electromagnetics and Electrodynamics
- Computational Finance

Some Applications (contd.)
- Computational Fluid Dynamics
- Data Mining, Analytics, and Databases
- Imaging and Computer Vision
- Medical Imaging

Some Applications (contd.)
- Molecular Dynamics
- Numerical Analytics
- Weather, Atmospheric, Ocean Modeling and Space Sciences

CUDA Programming Basics

Accessing/Using the CUDA GPUs
- You have been given access to our cluster
  - User accounts on x
  - It is a Linux system
- CUDA Toolkit and SDK for development
  - Includes the CUDA C/C++ compiler for GPUs ("nvcc")
  - A C/C++ compiler is still needed for the CPU code
- NVIDIA device drivers are needed to run programs
  - They allow programs to communicate with the hardware

Example Program 1

#include <stdio.h>

__global__ void kernel(void)
{
}

int main(void)
{
    kernel<<<1,1>>>();
    printf("Hello World!\n");
    return 0;
}

- "__global__" says the function is to be compiled to run on a "device" (GPU), not the "host" (CPU)
- The angle brackets "<<< >>>" pass parameters/arguments (here, the launch configuration) to the CUDA runtime
- A function executed on the GPU (device) is usually called a "kernel"

Example Program 2, Part 1
As can be seen on the next slide:
- We can pass parameters to a kernel as we would with any C function
- We need to allocate device memory to do anything useful, such as returning values to the host

Example Program 2, Part 2

__global__ void add(int a, int b, int *c)
{
    *c = a + b;    // the kernel: runs on the device
}

int main(void)
{
    int c, *dev_c;
    cudaMalloc((void **) &dev_c, sizeof(int));
    add<<<1,1>>>(2, 7, dev_c);
    cudaMemcpy(&c, dev_c, sizeof(int), cudaMemcpyDeviceToHost);
    printf("2 + 7 = %d\n", c);
    cudaFree(dev_c);
    return 0;
}
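The runtime calls in this example (cudaMalloc, cudaMemcpy, cudaFree) all return error codes that the slide ignores for brevity. A common teaching pattern, added here as a sketch rather than part of the original slides, is a small checking macro:

```cuda
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

// Wrap runtime calls so failures are reported instead of silently ignored
#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t err = (call);                                     \
        if (err != cudaSuccess) {                                     \
            fprintf(stderr, "CUDA error at %s:%d: %s\n",              \
                    __FILE__, __LINE__, cudaGetErrorString(err));     \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

__global__ void add(int a, int b, int *c) { *c = a + b; }

int main(void) {
    int c, *dev_c;
    CUDA_CHECK(cudaMalloc((void **) &dev_c, sizeof(int)));
    add<<<1,1>>>(2, 7, dev_c);
    CUDA_CHECK(cudaGetLastError());  // catches kernel launch errors
    CUDA_CHECK(cudaMemcpy(&c, dev_c, sizeof(int), cudaMemcpyDeviceToHost));
    printf("2 + 7 = %d\n", c);
    CUDA_CHECK(cudaFree(dev_c));
    return 0;
}
```

Kernel launches themselves do not return an error code, which is why cudaGetLastError() is checked immediately after the launch.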

Example Program 3
Within host (CPU) code, call the kernel using <<<gridSize, blockSize>>>, specifying the grid size (number of blocks) and/or the block size (number of threads per block). (More details later.)

Example Program 3 (contd.)
Note: details on threads and thread IDs will come later.
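The code images for this example are not reproduced in the transcript. A minimal vector-addition sketch in the same spirit, with an illustrative array size and launch configuration (one block of N threads, each thread handling one element), might look like this:

```cuda
#include <stdio.h>
#include <cuda_runtime.h>

#define N 16

// Each thread adds one pair of elements; threadIdx.x selects the element
__global__ void vecAdd(const int *a, const int *b, int *c) {
    int i = threadIdx.x;
    if (i < N)
        c[i] = a[i] + b[i];
}

int main(void) {
    int a[N], b[N], c[N];
    for (int i = 0; i < N; i++) { a[i] = i; b[i] = 2 * i; }

    int *da, *db, *dc;
    cudaMalloc((void **) &da, sizeof(a));
    cudaMalloc((void **) &db, sizeof(b));
    cudaMalloc((void **) &dc, sizeof(c));
    cudaMemcpy(da, a, sizeof(a), cudaMemcpyHostToDevice);
    cudaMemcpy(db, b, sizeof(b), cudaMemcpyHostToDevice);

    // Grid size 1, block size N: one block of N threads
    vecAdd<<<1, N>>>(da, db, dc);

    cudaMemcpy(c, dc, sizeof(c), cudaMemcpyDeviceToHost);
    for (int i = 0; i < N; i++)
        printf("%d + %d = %d\n", a[i], b[i], c[i]);

    cudaFree(da); cudaFree(db); cudaFree(dc);
    return 0;
}
```

For arrays larger than one block can hold, the launch would use multiple blocks and index with blockIdx and blockDim, which is covered later.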

Example Program 4

Grids, Blocks and Threads
(Figure: a grid of size 6, arranged as 3x2 blocks; each block has 12 threads, arranged 4x3.)
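The layout in the figure can be expressed directly with CUDA's dim3 launch configuration. A sketch (added here for illustration; the kernel simply reports each thread's position):

```cuda
#include <stdio.h>
#include <cuda_runtime.h>

// Compute a unique global index for each thread in a 2-D grid of 2-D blocks
__global__ void whoAmI(void) {
    int blockId  = blockIdx.y * gridDim.x + blockIdx.x;
    int threadId = threadIdx.y * blockDim.x + threadIdx.x;
    int globalId = blockId * (blockDim.x * blockDim.y) + threadId;
    printf("block %d, thread %d, global %d\n", blockId, threadId, globalId);
}

int main(void) {
    dim3 grid(3, 2);   // 3x2 = 6 blocks, as in the figure
    dim3 block(4, 3);  // 4x3 = 12 threads per block
    whoAmI<<<grid, block>>>();
    cudaDeviceSynchronize();  // wait for the device-side printf output
    return 0;
}
```

This launches 6 x 12 = 72 threads in total; device-side printf requires compute capability 2.0 or higher (which the GTX 480 and Tesla C2070 both have).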

Conclusion
In this session we discussed:
- Introduction to GPU Computing
- GPU Computing with CUDA
- CUDA Programming Basics
Next session:
- Data Parallelism
- CUDA Programming Model
- CUDA Threads

References for this Session
- Chapters 1 and 2 of: D. Kirk and W. Hwu, Programming Massively Parallel Processors, Morgan Kaufmann, 2010
- Chapters 1-4 of: J. Sanders and E. Kandrot, CUDA by Example, Addison-Wesley, 2010
- Chapters 1-2 of: NVIDIA CUDA C Programming Guide, NVIDIA Corporation (versions 3.2 and 4.0)