Parallel Programming Basics

Things we need to consider:
- Control
- Synchronization
- Communication

Parallel programming languages offer different ways of dealing with the above.

CUDA Devices and Threads

A compute device:
- Is a coprocessor to the CPU, or host
- Has its own DRAM (device memory)
- Runs many threads in parallel
- Is typically a GPU but can also be another type of parallel processing device

Data-parallel portions of an application are expressed as device kernels, which run on many threads.

Differences between GPU and CPU threads:
- GPU threads are extremely lightweight, with very little creation overhead
- A GPU needs 1000s of threads for full efficiency; a multi-core CPU needs only a few

Extended C

- Type qualifiers: __global__, __device__, __shared__, __local__, __constant__
- Keywords: threadIdx, blockIdx
- Intrinsics: __syncthreads
- Runtime API: memory, symbol, and execution management; function launch

__device__ float filter[N];

__global__ void convolve(float *image)
{
    __shared__ float region[M];
    ...
    region[threadIdx.x] = image[i];
    __syncthreads();
    ...
    image[j] = result;
}

// Allocate GPU memory
void *myimage;
cudaMalloc(&myimage, bytes);

// 100 blocks, 10 threads per block
convolve<<<100, 10>>>(myimage);

Arrays of Parallel Threads

A CUDA kernel is executed by an array of threads:
- All threads run the same code (SPMD)
- Each thread has an ID that it uses to compute memory addresses and make control decisions

...
float x = input[threadID];
float y = func(x);
output[threadID] = y;
...

[Figure: an array of threads numbered by threadID, all executing the snippet above]
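A concrete sketch of this SPMD pattern, assuming a placeholder func() and a kernel name that the slide elides:

// Every thread executes the same kernel body; the global thread ID
// selects which element each thread processes.
__device__ float func(float x) { return 2.0f * x; }   // placeholder computation

__global__ void apply(const float *input, float *output, int n)
{
    int threadID = blockIdx.x * blockDim.x + threadIdx.x;  // unique global ID
    if (threadID < n) {                                    // guard excess threads
        float x = input[threadID];
        float y = func(x);
        output[threadID] = y;
    }
}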

Thread Blocks: Scalable Cooperation

- Divide the monolithic thread array into multiple blocks
- Threads within a block cooperate via shared memory, atomic operations, and barrier synchronization (sketched below)
- Threads in different blocks cannot cooperate

[Figure: Thread Block 0, Thread Block 1, ..., Thread Block N, each executing the same snippet as above over its own range of threadID values]
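A minimal sketch of intra-block cooperation, assuming an illustrative BLOCK_SIZE and a block-local reversal as the example (launched with blockDim.x == BLOCK_SIZE):

#define BLOCK_SIZE 256

__global__ void reverse_within_block(const float *input, float *output)
{
    __shared__ float tile[BLOCK_SIZE];            // visible to all threads in the block
    int t    = threadIdx.x;
    int base = blockIdx.x * blockDim.x;

    tile[t] = input[base + t];                    // each thread stages one element
    __syncthreads();                              // barrier: all writes now visible

    output[base + t] = tile[blockDim.x - 1 - t];  // safely read a neighbor's element
}

The barrier is what makes the cross-thread read safe: without it, a thread might read a shared-memory slot before its owner has written it.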

Block IDs and Thread IDs

Each thread uses its IDs to decide what data to work on:
- Block ID: 1D or 2D
- Thread ID: 1D, 2D, or 3D

This simplifies memory addressing when processing multidimensional data, e.g.:
- Image processing
- Solving PDEs on volumes
- ...
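An illustrative 2D mapping for image processing, where block and thread IDs translate directly into pixel coordinates (the kernel name, width/height parameters, and the brightening operation are assumptions for the sketch):

__global__ void brighten(unsigned char *image, int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;   // pixel column
    int y = blockIdx.y * blockDim.y + threadIdx.y;   // pixel row
    if (x < width && y < height) {
        int idx = y * width + x;                     // row-major pixel index
        image[idx] = (unsigned char)min((int)image[idx] + 40, 255);
    }
}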

CUDA Memory Model Overview

Global memory:
- Main means of communicating R/W data between host and device
- Contents visible to all threads
- Long latency access

We will focus on global memory for now; constant and texture memory will come later.

[Figure: the host connected to device global memory; the grid contains blocks, each with its own shared memory and per-thread registers]
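A sketch of the host-side round trip through global memory that this model implies (error checking omitted; h_in and h_out are assumed host buffers, and apply is the kernel sketched earlier):

int n = 1 << 20;
size_t bytes = n * sizeof(float);
float *d_in, *d_out;

cudaMalloc(&d_in,  bytes);                        // allocate device global memory
cudaMalloc(&d_out, bytes);
cudaMemcpy(d_in, h_in, bytes, cudaMemcpyHostToDevice);

apply<<<(n + 255) / 256, 256>>>(d_in, d_out, n);  // kernel reads/writes global memory

cudaMemcpy(h_out, d_out, bytes, cudaMemcpyDeviceToHost);
cudaFree(d_in);
cudaFree(d_out);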

CUDA Function Declarations

                                  Executed on the:    Only callable from the:
  __device__ float DeviceFunc()   device              device
  __global__ void  KernelFunc()   device              host
  __host__   float HostFunc()     host                host

- __global__ defines a kernel function; it must return void
- __device__ and __host__ can be used together
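A hedged illustration of these rules, including __host__ and __device__ combined so one function compiles for both sides (the function names and grid-size assumption are illustrative):

__host__ __device__ float square(float x) { return x * x; }  // compiled for both

__device__ float DeviceFunc(float x) { return square(x) + 1.0f; }

__global__ void KernelFunc(float *data)          // kernels must return void
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    data[i] = DeviceFunc(data[i]);               // device code calls device code
}

__host__ void HostFunc(float *d_data, int n)     // host code launches the kernel
{
    KernelFunc<<<n / 256, 256>>>(d_data);        // assumes n is a multiple of 256
}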

Transparent Scalability

- Hardware is free to assign blocks to any processor at any time
- A kernel therefore scales across any number of parallel processors
- Each block can execute in any order relative to other blocks

[Figure: the same 8-block kernel grid scheduled onto a small device a couple of blocks at a time, and onto a larger device with more blocks running concurrently, finishing in fewer time steps]

Language Extensions: Built-in Variables

- dim3 gridDim: dimensions of the grid in blocks (gridDim.z unused)
- dim3 blockDim: dimensions of the block in threads
- dim3 blockIdx: block index within the grid
- dim3 threadIdx: thread index within the block
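How these variables get their values, as a short sketch: the host picks dim3 launch parameters and the runtime fills in the built-ins on the device. The 16x16 shape is an illustrative choice, and width, height, and d_image are assumed to be in scope for the brighten kernel sketched earlier:

dim3 block(16, 16);                      // blockDim will be (16, 16, 1)
dim3 grid((width  + 15) / 16,            // gridDim.x: blocks across columns
          (height + 15) / 16);           // gridDim.y: blocks across rows

brighten<<<grid, block>>>(d_image, width, height);
// Inside the kernel, blockIdx ranges over grid and threadIdx over block.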

Common Runtime Component: Mathematical Functions

- pow, sqrt, cbrt, hypot
- exp, exp2, expm1
- log, log2, log10, log1p
- sin, cos, tan, asin, acos, atan, atan2
- sinh, cosh, tanh, asinh, acosh, atanh
- ceil, floor, trunc, round
- etc.

When executed on the host, a given function uses the C runtime implementation if available. These functions are only supported for scalar types, not vector types.
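As an illustrative example, a function marked for both sides can use these names unchanged; on the device the CUDA implementation is used, while on the host the call falls back to the C runtime (the Gaussian itself is an assumption for the sketch):

__host__ __device__ float gaussian(float x, float sigma)
{
    // expf/sqrtf are the single-precision variants of exp/sqrt
    return expf(-x * x / (2.0f * sigma * sigma))
         / (sigma * sqrtf(2.0f * 3.14159265f));
}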

Device Runtime Component: Mathematical Functions

Some mathematical functions (e.g. sinf(x)) have a less accurate but faster device-only version (e.g. __sinf(x)):
- __powf
- __logf, __log2f, __log10f
- __expf
- __sinf, __cosf, __tanf
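A hedged sketch of the accuracy/speed trade-off (the kernel name and scaling are illustrative):

__global__ void wave(float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float x = i * 0.001f;
        out[i] = __sinf(x);    // fast hardware approximation, device code only
        // out[i] = sinf(x);   // slower, more accurate alternative
    }
}

Compiling with nvcc's -use_fast_math flag maps calls such as sinf to the fast intrinsics automatically.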