CUDA programming (continued)
Acknowledgement: the lecture materials are based on NVIDIA Teaching Center CUDA course materials, including materials from Wisconsin (Negrut), North Carolina Charlotte (Wilkinson/Li) and NCSA (Kindratenko).

Topics
Implementing matrix multiplication (MM) on the GPU
– Memory hierarchy
– Synchronization

More about threads/block
See matrixmul.cu. Its execution trace (time ./a.out run with different numbers of threads per block) shows user/system times ranging from roughly 3.1 s to 3.9 s; the times vary noticeably with the block size. A warp can only contain threads from one block, so we need at least 32 threads in one block!

CUDA extensions to declare kernel routines
__global__ indicates a routine that can only be called from the host and only executed on the device
__device__ indicates a routine that can only be called from the device and only executed on the device
__host__ indicates a routine that can only be called from the host and only executed on the host

Routines for the device
A __global__ routine must have a void return value. Device code generally cannot call C library routines, except for CUDA built-in math routines such as sin, cos, etc. – check the NVIDIA CUDA Programming Guide for details. CUDA also has device-only routines (declared __device__).
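
A minimal sketch (not from the course code; all names are made up) showing the three qualifiers together, including a __device__ helper that uses the built-in sinf:

  #include <stdio.h>

  // Called from device code only, executed on the device.
  __device__ float scale(float x) { return 2.0f * sinf(x); }   // built-in math is OK

  // Kernel: called from the host, executed on the device; must return void.
  __global__ void apply(float *data, int n) {
      int idx = blockIdx.x * blockDim.x + threadIdx.x;
      if (idx < n) data[idx] = scale(data[idx]);
  }

  // Called from host code only, executed on the host (the default for plain C routines).
  __host__ void report(int n) { printf("launching %d threads\n", n); }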

Example for 2D grid/blocks
Matrix multiply:
  for (i=0; i<N; i++)
    for (j=0; j<K; j++)
      for (k=0; k<M; k++)
        c[i][j] += a[i][k] * b[k][j];
The 2D matrices must be stored in linear (1D) arrays in column-major order:
  c[i][j] = c[i+N*j] = *(c+i+N*j);
  a[i][k] = a[i+N*k] = *(a+i+N*k);
  b[k][j] = b[k+M*j] = *(b+k+M*j);
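
For reference, a host-side version of the same computation on flat column-major arrays (a sketch; a is N x M, b is M x K, c is N x K):

  void mm_host(const float *a, const float *b, float *c, int N, int M, int K) {
      for (int j = 0; j < K; j++)
          for (int i = 0; i < N; i++) {
              float sum = 0.0f;
              for (int k = 0; k < M; k++)
                  sum += a[i + N*k] * b[k + M*j];   // column-major indexing
              c[i + N*j] = sum;
          }
  }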

First cut
Use one thread to compute one c[i][j]; a total of N*K threads are needed.
– N*K blocks of threads, 1 thread per block
– See mm0.cu
  // kernel MM routine
  __global__ void mmkernel(float *a, float *b, float *c, int N, int M, int K)
  {
      int i = blockIdx.x, j = blockIdx.y;
      float sum = 0.0f;
      for (int k = 0; k < M; k++)
          sum += a[i+N*k] * b[k+M*j];
      c[i+N*j] = sum;
  }

  dim3 dimBlock(1);
  dim3 dimGrid(N, K);
  mmkernel<<<dimGrid, dimBlock>>>(dev_A, dev_B, dev_C, N, M, K);
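
The host-side setup around the launch is not shown on the slide; a typical driver looks roughly like this (a sketch; A, B, C are assumed to be the host arrays):

  float *dev_A, *dev_B, *dev_C;
  size_t sa = N*M*sizeof(float), sb = M*K*sizeof(float), sc = N*K*sizeof(float);
  cudaMalloc((void**)&dev_A, sa);
  cudaMalloc((void**)&dev_B, sb);
  cudaMalloc((void**)&dev_C, sc);
  cudaMemcpy(dev_A, A, sa, cudaMemcpyHostToDevice);   // copy inputs to the device
  cudaMemcpy(dev_B, B, sb, cudaMemcpyHostToDevice);
  dim3 dimBlock(1);
  dim3 dimGrid(N, K);
  mmkernel<<<dimGrid, dimBlock>>>(dev_A, dev_B, dev_C, N, M, K);
  cudaMemcpy(C, dev_C, sc, cudaMemcpyDeviceToHost);   // copy the result back
  cudaFree(dev_A); cudaFree(dev_B); cudaFree(dev_C);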

Another try
– See mm0_1.cu
  // kernel MM routine
  __global__ void mmkernel(float *a, float *b, float *c, int N, int M, int K)
  {
      int i = threadIdx.x, j = threadIdx.y;
      float sum = 0.0f;
      for (int k = 0; k < M; k++)
          sum += a[i+N*k] * b[k+M*j];
      c[i+N*j] = sum;
  }

  dim3 dimBlock(1);
  dim3 dimGrid(N, K);
  mmkernel<<<dimGrid, dimBlock>>>(dev_A, dev_B, dev_C, N, M, K);
Another thing wrong here?

Second try
Add threads to blocks to exploit the SIMT (SIMD) support
– we need at least 32 threads per block to fill one 32-thread warp.
– The more threads the better (the GPU has more scheduling options).
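
The warp size and the per-block thread limit can be queried at run time with the standard CUDA runtime API; a small self-contained sketch:

  #include <stdio.h>
  #include <cuda_runtime.h>

  int main(void) {
      cudaDeviceProp prop;
      cudaGetDeviceProperties(&prop, 0);          // properties of device 0
      printf("warp size: %d\n", prop.warpSize);   // 32 on NVIDIA GPUs
      printf("max threads per block: %d\n", prop.maxThreadsPerBlock);
      return 0;
  }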

MM with blocks of threads
  __global__ void mmkernel(float *a, float *b, float *c, int N, int M, int K)
  {
      int i = blockIdx.x * BLOCK_SIZE + threadIdx.x, j = blockIdx.y;
      float sum = 0.0f;
      for (int k = 0; k < M; k++)
          sum += a[i+N*k] * b[k+M*j];
      c[i+N*j] = sum;
  }

  dim3 dimBlock(BLOCK_SIZE);
  dim3 dimGrid(N/BLOCK_SIZE, K);
  mmkernel<<<dimGrid, dimBlock>>>(dev_A, dev_B, dev_C, N, M, K);
Notice the relationship between the index calculation and the kernel invocation. Try mm1.cu with different BLOCK_SIZE values.

CUDA memory hierarchy
Register: per-thread basis
– Private to each thread
– Can spill into local memory (performance hit)
Shared memory: per-block basis
– Shared by the threads of the same block
– Used for inter-thread communication within a block
Global memory: per-application basis
– Available to all threads
– Used for inter-thread communication, and also for communication between sequential grids
(Diagram: each thread has its registers; each block has its shared memory; global device memory is visible to all blocks and persists across grids that execute sequentially in time, e.g. Grid 0 then Grid 1.)

CUDA memory allocation
  Memory     Declaration                        Scope    Lifetime
  Registers  auto variables other than arrays   Thread   Kernel
  Local      auto arrays                        Thread   Kernel
  Shared     __shared__                         Block    Kernel
  Global     __device__                         Grid     Application
  Constant   __constant__                       Grid     Application

An example
  __device__ float A[1000];   // per the table above: __device__, grid scope

  __global__ void mmkernel(float *a, float *b, float *c, int N, int M, int K)
  {
      int i = blockIdx.x * BLOCK_SIZE + threadIdx.x;
      int j = blockIdx.y;
      int tx = threadIdx.x;
      __shared__ float cb[BLOCK_SIZE];
      int workb[BLOCK_SIZE];
      ……
  }
Which type of memory do the variables A, i, j, cb, and workb live in?
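
The remaining row of the table, constant memory, is not covered by the example; a minimal sketch (the coefs array is hypothetical) of declaring it and filling it from the host:

  __constant__ float coefs[16];   // constant memory: grid scope, application lifetime

  // Host side, before launching any kernel that reads coefs:
  float h_coefs[16] = {0};        // fill with the real coefficients
  cudaMemcpyToSymbol(coefs, h_coefs, sizeof(h_coefs));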

MM with shared memory
In mm1.cu, threads use register variables and global arrays.
A block of BLOCK_SIZE threads computes BLOCK_SIZE elements of c: c[0][0], c[1][0], c[2][0], …, c[BLOCK_SIZE-1][0]
– The calculation:
  c[0][0] = a[0][0]*b[0][0] + a[0][1]*b[1][0] + a[0][2]*b[2][0] + …
  c[1][0] = a[1][0]*b[0][0] + a[1][1]*b[1][0] + a[1][2]*b[2][0] + …
  c[2][0] = a[2][0]*b[0][0] + a[2][1]*b[1][0] + a[2][2]*b[2][0] + …
– Each thread reads different elements of the a matrix – shared memory cannot help there
– All threads read the same elements of the b matrix, so putting b in shared memory may reduce the (global) memory traffic.
Shared memory in the GPU is limited and cannot hold the whole column of b: we need to reduce the memory footprint. How?
  for (k=0; k<M; k++) c[i][j] += a[i][k]*b[k][j];

MM with shared memory
The original loop:
  for (k=0; k<M; k++)
      c[i][j] += a[i][k]*b[k][j];
Strip-mine the k loop into tiles of TSIZE iterations:
  for (ks=0; ks<M; ks+=TSIZE)
      for (k=ks; k<ks+TSIZE; k++)
          c[i][j] += a[i][k] * b[k][j];
Then copy each tile of b into shared memory before using it (forall: all threads of the block copy in parallel):
  for (ks=0; ks<M; ks+=TSIZE) {
      forall (k=ks; k<ks+TSIZE; k++) workB[k][j] = b[k][j];
      for (k=ks; k<ks+TSIZE; k++)
          c[i][j] += a[i][k] * workB[k][j];
  }

MM with shared memory
  __global__ void mmkernel(float *a, float *b, float *c, int N, int M, int K)
  {
      int i = blockIdx.x * BLOCK_SIZE + threadIdx.x;
      int j = blockIdx.y;
      int tx = threadIdx.x;
      __shared__ float cb[BLOCK_SIZE];
      float sum = 0.0f;
      for (int ks = 0; ks < M; ks += BLOCK_SIZE) {
          cb[tx] = b[ks+tx+M*j];   // copy from global to shared; all threads read in parallel
          for (int k = ks; k < ks+BLOCK_SIZE; k++)
              sum += a[i+N*k] * cb[k-ks];
      }
      c[i+N*j] = sum;
  }
Any problem here?

MM with shared memory
  __global__ void mmkernel(float *a, float *b, float *c, int N, int M, int K)
  {
      int i = blockIdx.x * BLOCK_SIZE + threadIdx.x;
      int j = blockIdx.y;
      int tx = threadIdx.x;
      __shared__ float cb[BLOCK_SIZE];
      float sum = 0.0f;
      for (int ks = 0; ks < M; ks += BLOCK_SIZE) {
          cb[tx] = b[ks+tx+M*j];   // all BLOCK_SIZE threads read in parallel
          for (int k = ks; k < ks+BLOCK_SIZE; k++)
              sum += a[i+N*k] * cb[k-ks];
      }
      c[i+N*j] = sum;
  }
True dependence through shared memory: a thread may read cb[k-ks] before the thread responsible for that element has written it. Anti-dependence: a thread may overwrite cb[tx] in the next ks iteration while other threads are still reading the old tile.

MM with shared memory
  __global__ void mmkernel(float *a, float *b, float *c, int N, int M, int K)
  {
      int i = blockIdx.x * BLOCK_SIZE + threadIdx.x;
      int j = blockIdx.y;
      int tx = threadIdx.x;
      __shared__ float cb[BLOCK_SIZE];
      float sum = 0.0f;
      for (int ks = 0; ks < M; ks += BLOCK_SIZE) {
          cb[tx] = b[ks+tx+M*j];   // all BLOCK_SIZE threads read in parallel
          __syncthreads();         // barrier among all threads in a block
          for (int k = ks; k < ks+BLOCK_SIZE; k++)
              sum += a[i+N*k] * cb[k-ks];
          __syncthreads();         // barrier among all threads in a block
      }
      c[i+N*j] = sum;
  }
See mm2.cu

More schemes to improve MM performance
– Compute multiple elements in each thread – see mm3.cu (and the sketch below)
– Use 2D blocks and a 2D grid.
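
A sketch of the multiple-elements-per-thread idea (an illustration, not necessarily what mm3.cu does): each thread computes two columns of c, so each a element is loaded once and used twice.

  __global__ void mmkernel2(float *a, float *b, float *c, int N, int M, int K)
  {
      int i = blockIdx.x * BLOCK_SIZE + threadIdx.x;
      int j = blockIdx.y * 2;                 // two columns of c per thread (assumes K is even)
      float sum0 = 0.0f, sum1 = 0.0f;
      for (int k = 0; k < M; k++) {
          float av = a[i + N*k];              // loaded once, reused for both columns
          sum0 += av * b[k + M*j];
          sum1 += av * b[k + M*(j+1)];
      }
      c[i + N*j]     = sum0;
      c[i + N*(j+1)] = sum1;
  }
  // launch with half as many blocks in y:
  // dim3 dimBlock(BLOCK_SIZE); dim3 dimGrid(N/BLOCK_SIZE, K/2);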

More information about __syncthreads()
All threads must reach the barrier before any thread can move on.
– Threads that arrive early must wait.
__syncthreads() can only be used in kernel (device) code.

More information about __syncthreads()
It only synchronizes within a block; barriers in different blocks are independent.
(Diagram: block 0 and block n-1 each reach their own barrier and then continue – separate barriers.)

More information about __syncthreads()
CUDA requires threads to synchronize using exactly the same __syncthreads() call. You cannot write:
  if (...) __syncthreads();
  else __syncthreads();
What if we want to synchronize among all threads?
– Make separate kernel invocations: consecutive grids act as a global barrier, as sketched below.
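
A minimal sketch (kernel and variable names are made up) of global synchronization via separate launches: kernels launched in the same (default) stream execute in order, so every thread of phase1 finishes before any thread of phase2 starts.

  __global__ void phase1(float *d) {
      int t = blockIdx.x * blockDim.x + threadIdx.x;
      d[t] += 1.0f;
  }
  __global__ void phase2(float *d) {
      int t = blockIdx.x * blockDim.x + threadIdx.x;
      d[t] *= 2.0f;                  // sees the values written by phase1
  }

  // host code
  phase1<<<numBlocks, threadsPerBlock>>>(dev_data);
  phase2<<<numBlocks, threadsPerBlock>>>(dev_data);   // launch boundary = global barrier
  cudaDeviceSynchronize();           // host waits for both grids to complete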