Introduction to CUDA (1 of 2) Patrick Cozzi University of Pennsylvania CIS 565 - Spring 2012.


Introduction to CUDA (1 of 2) Patrick Cozzi University of Pennsylvania CIS 565, Spring 2012

Announcements IBM Almaden Readings listed on our website Clone your git repository

Acknowledgements Many slides are from David Kirk and Wen-mei Hwu’s UIUC course.

Agenda Parallelism Review GPU Architecture Review CUDA

Parallelism Review Pipeline Parallel  Pipelined processors  Graphics pipeline

Parallelism Review Task Parallel  Spell checker  Game engines  Virtual globes

Parallelism Review Data Parallel  Cloth simulation  Particle system  Matrix multiply

Matrix Multiply Reminder Vectors Dot products Row major or column major? Dot product per output element

GPU Architecture Review GPUs are:  Parallel  Multithreaded  Many-core GPUs have:  Tremendous computational horsepower  High memory bandwidth

GPU Architecture Review GPUs are specialized for  Compute-intensive, highly parallel computation  Graphics! Transistors are devoted to:  Processing  Not: data caching or flow control

GPU Architecture Review [Figure: transistor usage on the CPU vs. the GPU]

Let’s program this thing!

GPU Computing History 2001/2002 – researchers see GPU as data-parallel coprocessor  The GPGPU field is born 2007 – NVIDIA releases CUDA  CUDA – Compute Unified Device Architecture  GPGPU shifts to GPU Computing 2008 – Khronos releases OpenCL specification

CUDA Abstractions A hierarchy of thread groups Shared memories Barrier synchronization

CUDA Terminology Host – typically the CPU  Code written in ANSI C Device – typically the GPU (data-parallel)  Code written in extended ANSI C Host and device have separate memories CUDA Program  Contains both host and device code

CUDA Terminology Kernel – data-parallel function  Invoking a kernel creates lightweight threads on the device Threads are generated and scheduled with hardware Similar to a shader in OpenGL?

CUDA Kernels Executed N times in parallel by N different CUDA threads [Code figure: callouts mark the thread ID, the execution configuration, and the declaration specifier]
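
The annotated code figure did not survive the transcript. As a stand-in, here is a minimal sketch in the style of the canonical vector-addition kernel such slides annotate; VecAdd and N are assumed names, not from the deck.

// Declaration specifier: __global__ marks a kernel that runs on the
// device and is callable from the host.
__global__ void VecAdd(const float* A, const float* B, float* C)
{
    int i = threadIdx.x;   // thread ID: each thread picks one element
    C[i] = A[i] + B[i];
}

// Execution configuration: launch one block of N threads, so the
// kernel body executes N times in parallel.
// VecAdd<<<1, N>>>(d_A, d_B, d_C);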

CUDA Program Execution [Figure: serial host code alternating with parallel device kernels]

Thread Hierarchies Grid – one or more thread blocks  1D or 2D Block – array of threads  1D, 2D, or 3D  Each block in a grid has the same number of threads  Each thread in a block can Synchronize Access shared memory

Thread Hierarchies [Figure: a grid of thread blocks, each block an array of threads]

Thread Hierarchies Block – 1D, 2D, or 3D  Example: Index into vector, matrix, volume

Thread Hierarchies Thread ID: scalar thread identifier Thread Index: threadIdx 1D: Thread ID == Thread Index 2D with size (Dx, Dy)  Thread ID of index (x, y) == x + y * Dx 3D with size (Dx, Dy, Dz)  Thread ID of index (x, y, z) == x + y * Dx + z * Dx * Dy
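
The same flattening written as a hedged kernel sketch, with Dx = blockDim.x and Dy = blockDim.y (the kernel name is illustrative):

__global__ void flattenThreadId()
{
    // 2D block: Thread ID = x + y * Dx
    int tid2d = threadIdx.x + threadIdx.y * blockDim.x;

    // 3D block: Thread ID = x + y * Dx + z * Dx * Dy
    int tid3d = threadIdx.x
              + threadIdx.y * blockDim.x
              + threadIdx.z * blockDim.x * blockDim.y;
}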

Thread Hierarchies [Figure: a single thread block; a 2D block with a 2D thread index]

Thread Hierarchies Thread Block  Group of threads G80 and GT200: Up to 512 threads Fermi: Up to 1024 threads  Reside on same processor core  Share memory of that core

Thread Hierarchies Block Index: blockIdx Dimension: blockDim  1D or 2D

Thread Hierarchies 2D Thread Block 16x16 Threads per block

Thread Hierarchies Example: N = 32  16x16 threads per block (independent of N) threadIdx ([0, 15], [0, 15])  2x2 thread blocks in grid blockIdx ([0, 1], [0, 1]) blockDim = 16 i = [0, 1] * 16 + [0, 15]
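
A hedged sketch of this indexing inside a kernel (function and parameter names are illustrative):

// N = 32, 16x16 threads per block, 2x2 blocks in the grid.
__global__ void index2D(float* data, int N)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // [0, 1] * 16 + [0, 15]
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    if (i < N && j < N)
        data[j * N + i] = 0.0f;  // illustrative write to element (j, i)
}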

Thread Hierarchies Thread blocks execute independently  In any order: parallel or series  Scheduled in any order by any number of cores Allows code to scale with core count

Thread Hierarchies [Figure: automatic scalability, the same grid of blocks scheduled across GPUs with different numbers of cores]

Thread Hierarchies Threads in a block  Share (limited) low-latency memory  Synchronize execution To coordinate memory accesses __syncthreads()  Barrier – threads in a block wait until all threads reach it  Lightweight
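
A minimal sketch of the barrier in use, assuming a one-dimensional 256-thread block (the kernel is illustrative, not from the deck):

__global__ void reverseInBlock(float* d)
{
    __shared__ float s[256];       // low-latency memory shared by the block
    int t = threadIdx.x;
    s[t] = d[t];
    __syncthreads();               // barrier: all writes to s are now visible
    d[t] = s[blockDim.x - 1 - t];  // safe to read another thread's element
}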

CUDA Memory Transfers [Figure: host and device memory spaces]

CUDA Memory Transfers Host can transfer to/from device  Global memory  Constant memory

CUDA Memory Transfers cudaMalloc()  Allocate global memory on device cudaFree()  Frees memory

CUDA Memory Transfers [Code figure: a cudaMalloc() call; callouts mark the pointer to device memory and the size in bytes]
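
The code itself is missing from the transcript; a hedged reconstruction matching the callouts (Md and WIDTH follow the deck's naming):

int size = WIDTH * WIDTH * sizeof(float);  // size in bytes
float* Md;                                 // pointer to device memory
cudaMalloc((void**)&Md, size);             // allocate device global memory
// ... use Md ...
cudaFree(Md);                              // free it when done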

CUDA Memory Transfers cudaMemcpy()  Memory transfer Host to host Host to device Device to host Device to device [Figure: host and device global memory] Similar to buffer objects in OpenGL

CUDA Memory Transfers [Code figure: a host-to-device cudaMemcpy(); callouts mark the destination (device), the source (host), and the direction]
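
A hedged reconstruction of the calls the callouts describe (Md, Pd, M, P, and size as in the surrounding example):

// cudaMemcpy(destination, source, size in bytes, direction)
cudaMemcpy(Md, M, size, cudaMemcpyHostToDevice);  // host to device
// ... after the kernel runs, read results back:
cudaMemcpy(P, Pd, size, cudaMemcpyDeviceToHost);  // device to host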

Matrix Multiply P = M * N Assume M and N are square for simplicity

Matrix Multiply 1,000 x 1,000 matrix 1,000,000 dot products Each with 1,000 multiplies and 1,000 adds

Matrix Multiply: CPU Implementation

void MatrixMulOnHost(float* M, float* N, float* P, int width)
{
    for (int i = 0; i < width; ++i)
        for (int j = 0; j < width; ++j)
        {
            float sum = 0;
            for (int k = 0; k < width; ++k)
            {
                float a = M[i * width + k];
                float b = N[k * width + j];
                sum += a * b;
            }
            P[i * width + j] = sum;
        }
}

Matrix Multiply: CUDA Skeleton [Code figure: the host-side function with the three steps stubbed out]
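
A hedged reconstruction of the skeleton, following the three-step structure the deck walks through next (the function name is assumed from the deck's conventions):

void MatrixMulOnDevice(float* M, float* N, float* P, int width)
{
    int size = width * width * sizeof(float);

    // 1. Transfer M and N to device memory
    // 2. Kernel invocation
    // 3. Transfer P from the device, free device memory
}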

Matrix Multiply Step 1  Add CUDA memory transfers to the skeleton

Matrix Multiply: Data Transfer [Code figure: the skeleton with cudaMalloc()/cudaMemcpy() calls filled in; callouts mark allocating the inputs, allocating the output, and reading back from the device] Similar to GPGPU with GLSL.
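
A hedged reconstruction of the filled-in transfers (device pointers Md, Nd, Pd follow the deck's naming):

void MatrixMulOnDevice(float* M, float* N, float* P, int width)
{
    int size = width * width * sizeof(float);
    float *Md, *Nd, *Pd;

    // Allocate input matrices on the device and copy them over
    cudaMalloc((void**)&Md, size);
    cudaMemcpy(Md, M, size, cudaMemcpyHostToDevice);
    cudaMalloc((void**)&Nd, size);
    cudaMemcpy(Nd, N, size, cudaMemcpyHostToDevice);

    // Allocate the output matrix on the device
    cudaMalloc((void**)&Pd, size);

    // Kernel invocation (step 2) goes here

    // Read back from the device, then free device memory
    cudaMemcpy(P, Pd, size, cudaMemcpyDeviceToHost);
    cudaFree(Md); cudaFree(Nd); cudaFree(Pd);
}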

Matrix Multiply Step 2  Implement the kernel in CUDA C

Matrix Multiply: CUDA Kernel Accessing a matrix, so using a 2D block
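
A hedged reconstruction of the kernel the next few callouts refer to (MatrixMulKernel and the Md/Nd/Pd names follow the deck's figures):

__global__ void MatrixMulKernel(float* Md, float* Nd, float* Pd, int width)
{
    // 2D thread index within the (single) block picks the output element
    int tx = threadIdx.x;
    int ty = threadIdx.y;

    // Only the innermost k loop remains; the two outer loops of the
    // CPU version are replaced by the grid of threads.
    float sum = 0;
    for (int k = 0; k < width; ++k)
        sum += Md[ty * width + k] * Nd[k * width + tx];

    Pd[ty * width + tx] = sum;  // one output element per thread
}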

Matrix Multiply: CUDA Kernel Each thread computes one output element

Matrix Multiply: CUDA Kernel Where did the two outer for loops in the CPU implementation go?

Matrix Multiply: CUDA Kernel No locks or synchronization, why?

Matrix Multiply Step 3  Invoke the kernel in CUDA C

Matrix Multiply: Invoke Kernel One block with width by width threads
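
A hedged sketch of the launch (the dimGrid/dimBlock names are conventional, assumed rather than taken from the deck):

dim3 dimGrid(1, 1);           // a single block...
dim3 dimBlock(width, width);  // ...with width x width threads
MatrixMulKernel<<<dimGrid, dimBlock>>>(Md, Nd, Pd, width);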

Matrix Multiply One block of threads computes matrix Pd  Each thread computes one element of Pd Each thread  Loads a row of matrix Md  Loads a column of matrix Nd  Performs one multiply and one addition for each pair of Md and Nd elements  Compute to off-chip memory access ratio close to 1:1 (not very high) Size of matrix limited by the number of threads allowed in a thread block [Figure: Grid 1 with Block 1; Thread (2, 2) reads a row of Md and a column of Nd to produce one element of Pd] © David Kirk/NVIDIA and Wen-mei W. Hwu, ECE 498AL Spring 2010, University of Illinois, Urbana-Champaign

Matrix Multiply What is the major performance problem with our implementation? What is the major limitation?