
1 λλ Fernando Magno Quintão Pereira, Programming Languages Laboratory, Universidade Federal de Minas Gerais, Department of Computer Science. Program Analysis and Optimization – DCC888: Memory Optimizations for Graphics Processing Units. The material in these slides has been taken from the NVIDIA manuals (Best Practices Guide & Optimizing Matrix Transpose in CUDA) and from a paper by Ryoo et al. [Ryoo12]. See "A bit of History" in the last slide.

2 DCC 888 λλ Universidade Federal de Minas Gerais – Department of Computer Science – Programming Languages Laboratory. What are Graphics Processing Units?

3 The Wheel of Reincarnation. In the good old days, the graphics hardware was just the VGA; all the processing was done in software. People started complaining: software is slow… But what do you want to run at the hardware level? A scandalously brief history of GPUs. 1) Do you know how the frame buffer works? 2) Can you program the VGA standard in any way?

4 The Wheel of Reincarnation. Some functions, like the rasterizer, are heavily used. (What is rasterization?) It is better to implement these functions in hardware. A scandalously brief history of GPUs. 1) How can we implement a function at the hardware level? 2) What is the advantage of implementing a function at the hardware level? 3) Is there any drawback? 4) Can we program (in any way) this hardware used to implement a specific function?

5 Graphics Pipeline. Graphics can be processed in a pipeline: transform, project, clip, display, etc. Some functions, although different, can be implemented by very similar hardware. Shading is an example. (Do you know what a shader is?) Add a graphics API to program the shaders. But this API is so specific… and the hardware is so powerful… what a waste! A scandalously brief history of GPUs.

6 General Purpose Hardware. Let's add an instruction set to the shader. Let's augment this hardware with general purpose integer operations. What about adding some branching machinery too? Hmm… let's add a high-level language on top of this stuff, plus a lot of documentation. Advertise it! It should look cool! Oh boy: we now have two general purpose processors. We should unify them. The rant starts all over again… A scandalously brief history of GPUs.

7 1.5 turns around the wheel. Let's add a display processor to the display processor. After all, there are some operations that are really specific, and performance critical… Dedicated rasterizer. A scandalously brief history of GPUs.

8 Brief Timeline. A scandalously brief history of GPUs.
Year | Transistors | Model | Tech
1999 | 25M | GeForce 256 | DX7, OpenGL
2001 | 60M | GeForce 3 | Programmable shaders
2002 | 125M | GeForce FX | Cg programs
2006 | 681M | GeForce 8800 | C for CUDA
2008 | 1.4G | GeForce GTX 280 | IEEE FP
2010 | 3.0G | Fermi | Cache, C++

9 Computer Organization. GPUs exhibit different types of parallelism: Single Instruction Multiple Data (SIMD) and Single Program Multiple Data (SPMD). In the end, we have MSIMD hardware (multiple SIMD units). 1) Why are GPUs so parallel? 2) Why do traditional CPUs not exhibit all this parallelism? We can think of SIMD hardware as a firing squad: we have a captain and a row of soldiers. The captain issues orders, such as set, aim, fire! And all the soldiers, upon hearing one of these orders, perform an action. They all do the same action, yet they use different guns and bullets. An outrageously concise overview of the programming model.

10 The Programming Environment. An outrageously concise overview of the programming model. There are two main programming languages used to program graphics processing units today: OpenCL and C for CUDA. These are not the first languages developed for GPUs; they came after Cg and HLSL, for instance, but they are much more general and expressive. We will focus on C for CUDA. This language lets the programmer explicitly write code that will run on the CPU and code that will run on the GPU: it is a heterogeneous programming language.

11 From C to CUDA in one Step. An outrageously concise overview of the programming model.

void saxpy_serial(int n, float alpha, float *x, float *y) {
  for (int i = 0; i < n; i++)
    y[i] = alpha*x[i] + y[i];
}
// Invoke the serial function:
saxpy_serial(n, 2.0, x, y);

This program, written in C, performs a typical vector operation: it reads two arrays and updates one of them. We will translate this program to C for CUDA. 1) What is the asymptotic complexity of this program? 2) How much can we parallelize this program? In a world with many – really many – processors, e.g., the PRAM world, what would be the complexity of this program?

12 The first CUDA program. An outrageously concise overview of the programming model.

void saxpy_serial(int n, float alpha, float *x, float *y) {
  for (int i = 0; i < n; i++)
    y[i] = alpha*x[i] + y[i];
}
// Invoke the serial function:
saxpy_serial(n, 2.0, x, y);

__global__ void saxpy_parallel(int n, float alpha, float *x, float *y) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) y[i] = alpha * x[i] + y[i];
}
// Invoke the parallel kernel:
int nblocks = (n + 255) / 256;
saxpy_parallel<<<nblocks, 256>>>(n, 2.0, x, y);

What happened to the loop in the CUDA program?
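A minimal host-side sketch (not from the slides; names and sizes are illustrative) of how saxpy_parallel might be set up and launched, assuming x_h and y_h already hold the input data:

#include <cuda_runtime.h>

void saxpy_host(int n, float alpha, float *x_h, float *y_h) {
  float *x_d, *y_d;
  size_t nBytes = n * sizeof(float);
  cudaMalloc((void**)&x_d, nBytes);                        // device copies of x and y
  cudaMalloc((void**)&y_d, nBytes);
  cudaMemcpy(x_d, x_h, nBytes, cudaMemcpyHostToDevice);
  cudaMemcpy(y_d, y_h, nBytes, cudaMemcpyHostToDevice);
  int nblocks = (n + 255) / 256;                           // one thread per element, 256 per block
  saxpy_parallel<<<nblocks, 256>>>(n, alpha, x_d, y_d);
  cudaMemcpy(y_h, y_d, nBytes, cudaMemcpyDeviceToHost);    // bring the result back
  cudaFree(x_d); cudaFree(y_d);
}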

13 Understanding the Code. An outrageously concise overview of the programming model. Threads are grouped in warps, blocks and grids. Threads in different grids do not talk to each other. Grids are divided into blocks: threads in the same block share memory and barriers. Blocks are divided into warps: threads in the same warp follow the SIMD model.

__global__ void saxpy_parallel(int n, float alpha, float *x, float *y) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) y[i] = alpha * x[i] + y[i];
}
// Invoke the parallel kernel:
int nblocks = (n + 255) / 256;
saxpy_parallel<<<nblocks, 256>>>(n, 2.0, x, y);
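Blocks and grids can also be two-dimensional, which is convenient for matrices. A sketch (not from the slides; kernel name and sizes are illustrative) of a 2D launch configuration and the usual index computation:

__global__ void scale2D(float *data, int width, int height) {
  int col = blockIdx.x * blockDim.x + threadIdx.x;   // x runs over columns
  int row = blockIdx.y * blockDim.y + threadIdx.y;   // y runs over rows
  if (col < width && row < height)
    data[row * width + col] *= 2.0f;                 // any per-element work
}

void launch2D(float *data_d, int width, int height) {
  dim3 dimBlock(16, 16);                                // 16 x 16 = 256 threads per block
  dim3 dimGrid((width + 15) / 16, (height + 15) / 16);  // enough blocks to cover the matrix
  scale2D<<<dimGrid, dimBlock>>>(data_d, width, height);
}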

14 Raising the level. An outrageously concise overview of the programming model. CUDA programs contain CPU code plus kernels. Kernels are called via a special syntax: kernel<<<dimGrid, dimBlock>>>(A, B, w, C); The C part of the program is compiled as traditional C. The kernel part is first translated into PTX, and then this high-level assembly is translated into SASS.

__global__ void matMul1(float* B, float* C, float* A, int w) {
  float Pvalue = 0.0;
  for (int k = 0; k < w; ++k) {
    Pvalue += B[threadIdx.y * w + k] * C[k * w + threadIdx.x];
  }
  A[threadIdx.x + threadIdx.y * w] = Pvalue;
}

void Mul(const float* A, const float* B, int width, float* C) {
  int size = width * width * sizeof(float);
  // Load A and B to the device
  float* Ad; cudaMalloc((void**)&Ad, size);
  cudaMemcpy(Ad, A, size, cudaMemcpyHostToDevice);
  float* Bd; cudaMalloc((void**)&Bd, size);
  cudaMemcpy(Bd, B, size, cudaMemcpyHostToDevice);
  // Allocate C on the device
  float* Cd; cudaMalloc((void**)&Cd, size);
  // Compute the execution configuration assuming
  // the matrix dimensions are multiples of BLOCK_SIZE
  dim3 dimBlock(BLOCK_SIZE, BLOCK_SIZE);
  dim3 dimGrid(width / dimBlock.x, width / dimBlock.y);
  // Launch the device computation
  Muld<<<dimGrid, dimBlock>>>(Ad, Bd, width, Cd);
  // Read C from the device
  cudaMemcpy(C, Cd, size, cudaMemcpyDeviceToHost);
  // Free device memory
  cudaFree(Ad); cudaFree(Bd); cudaFree(Cd);
}

15 Lowering the level. An outrageously concise overview of the programming model. CUDA assembly is called Parallel Thread Execution (PTX). What do you think an assembly language for parallel programming should have?

__global__ void saxpy_parallel(int n, float a, float *x, float *y) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) y[i] = a * x[i] + y[i];
}

.entry saxpy_GPU (n, a, x, y) {
  .reg .u16 %rh;
  .reg .u32 %r;
  .reg .u64 %rd;
  .reg .f32 %f;
  .reg .pred %p;
$LBB1__Z9saxpy_GPUifPfS_:
  mov.u16       %rh1, %ctaid.x;
  mov.u16       %rh2, %ntid.x;
  mul.wide.u16  %r1, %rh1, %rh2;
  cvt.u32.u16   %r2, %tid.x;
  add.u32       %r3, %r2, %r1;
  ld.param.s32  %r4, [n];
  setp.le.s32   %p1, %r4, %r3;
  @%p1 bra      $Lt_0_770;
  .loc 28 13 0
  cvt.u64.s32   %rd1, %r3;
  mul.lo.u64    %rd2, %rd1, 4;
  ld.param.u64  %rd3, [y];
  add.u64       %rd4, %rd3, %rd2;
  ld.global.f32 %f1, [%rd4+0];
  ld.param.u64  %rd5, [x];
  add.u64       %rd6, %rd5, %rd2;
  ld.global.f32 %f2, [%rd6+0];
  ld.param.f32  %f3, [alpha];
  mad.f32       %f4, %f2, %f3, %f1;
  st.global.f32 [%rd4+0], %f4;
$Lt_0_770:
  exit;
}

16 DCC 888 λλ Universidade Federal de Minas Gerais – Department of Computer Science – Programming Languages Laboratory. Memory Optimizations

17 A Brief Overview of the GPU Threading Model. Each thread has local registers and local memory. Threads are grouped in warps: warps run in SIMD execution. Warps are grouped in blocks: shared memory plus barrier synchronization. Blocks are grouped in grids: each grid represents a kernel.

18 1) How do different grids communicate? 2) How do threads in the same block communicate? 3) What determines the size of the block of threads? 4) What determines the size of the grid of threads? 5) What is the effect of branches in the warp execution? A Brief Overview of the GPU Threading Model

19 Going to the Archives. GPUs are much more memory intensive than traditional CPUs. Let's look at an example: the GeForce 8800 processes 32 pixels per clock. Each pixel contains a color (3 bytes) and a depth (4 bytes), which are read and written. On average, 16 extra bytes of information are read for each pixel. How many bytes are processed per clock? To put these numbers in perspective, how much data is processed in each cycle of an ordinary x86 CPU?
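One plausible reading of these numbers (an assumption, not an official figure) gives the arithmetic below; the exact total depends on how the reads and writes are counted:

#include <stdio.h>

int main(void) {
  // Color (3 B) and depth (4 B) are each read and written, plus 16 B of extra reads,
  // and 32 pixels are processed per clock.
  int bytes_per_pixel = (3 + 4) * 2 + 16;                 // = 30 bytes per pixel
  printf("%d bytes per clock\n", 32 * bytes_per_pixel);   // = 960 bytes per clock
  return 0;
}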

20 The GPU Archive. Registers: fast, yet few; private to each thread. Shared memory: used by threads in the same block. Local memory: off-chip and slow; private to each thread. Global memory: off-chip and slow; used to provide communication between blocks and grids.

21 The GPU Archive. 1) Why can't we leave all the data in registers? 2) Why can't we leave all the data in shared memory? 3) The CPU also has a memory hierarchy. Do you remember what this hierarchy looks like? 4) Why do we have a memory hierarchy in the CPU as well? Registers: fast, yet few; private to each thread. Shared memory: used by threads in the same block. Local memory: off-chip and slow; private to each thread. Global memory: off-chip and slow; used to provide communication between blocks and grids.

22 The Traveler’s Journal. The interplanetary trip: commuting data between host and device. The silk road trip: commuting data between global and shared memory. The bakery walk: commuting data between shared memory and registers. Reaching a book on the table: reading/writing data in registers.

23 DCC 888 λλ Universidade Federal de Minas Gerais – Department of Computer Science – Programming Languages Laboratory. Inter-Device Communication

24 The Interplanetary Trip. Copying data between the GPU and the CPU is pretty slow. CUDA provides some library functions for this: cudaMalloc allocates data in the GPU memory space; cudaMemset fills a memory area with a value; cudaFree frees data in the GPU memory space; cudaMemcpy copies data from CPU to GPU, or vice versa.

25 The interplanetary trip. What is each of the calls below doing?

int nbytes = 1024 * sizeof(int);
int *a_d = 0;
cudaMalloc((void**)&a_d, nbytes);
cudaMemset(a_d, 0, nbytes);
cudaFree(a_d);

26 The interplanetary trip. This program copies data from the host to the device, then moves this data inside the device, and finally brings the data back to the host memory.

int main(int argc, char** argv) {
  float *a_h, *b_h;   // host pointers
  float *a_d, *b_d;   // device pointers
  int N = 14, nBytes, i;
  nBytes = N * sizeof(float);
  a_h = (float*)malloc(nBytes);
  b_h = (float*)malloc(nBytes);
  cudaMalloc((void**)&a_d, nBytes);
  cudaMalloc((void**)&b_d, nBytes);
  cudaMemset(a_d, 0, nBytes);
  cudaMemcpy(a_d, a_h, nBytes, cudaMemcpyHostToDevice);
  cudaMemcpy(b_d, a_d, nBytes, cudaMemcpyDeviceToDevice);
  cudaMemcpy(b_h, b_d, nBytes, cudaMemcpyDeviceToHost);
  for (i = 0; i < N; i++) { ASSERT(a_h[i] == b_h[i]); }
  free(a_h); free(b_h);
  cudaFree(a_d); cudaFree(b_d);
  return EXIT_SUCCESS;
}
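The listing above ignores the status codes that the CUDA runtime returns. A common defensive pattern (not part of the original slides) wraps every runtime call in a small checking macro:

#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

#define CUDA_CHECK(call)                                          \
  do {                                                            \
    cudaError_t err = (call);                                     \
    if (err != cudaSuccess) {                                     \
      fprintf(stderr, "CUDA error: %s at %s:%d\n",                \
              cudaGetErrorString(err), __FILE__, __LINE__);       \
      exit(EXIT_FAILURE);                                         \
    }                                                             \
  } while (0)

// Usage, for instance: CUDA_CHECK(cudaMalloc((void**)&a_d, nBytes));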

27 The interplanetary trip. Declare the host and device pointers: float *a_h, *b_h; float *a_d, *b_d; int N = 14, nBytes, i; [Figure: host and device memories, both still empty.]

28 The interplanetary trip. Allocate the host arrays: nBytes = N * sizeof(float); a_h = (float*)malloc(nBytes); b_h = (float*)malloc(nBytes); [Figure: a_h and b_h now live in host memory.]

29 The interplanetary trip. Allocate the device arrays: cudaMalloc((void**)&a_d, nBytes); cudaMalloc((void**)&b_d, nBytes); [Figure: a_d and b_d now live in device memory.]

30 The interplanetary trip. Clear the device array: cudaMemset(a_d, 0, nBytes);

31 The interplanetary trip. Copy from host to device: cudaMemcpy(a_d, a_h, nBytes, cudaMemcpyHostToDevice);

32 The interplanetary trip. Copy within the device: cudaMemcpy(b_d, a_d, nBytes, cudaMemcpyDeviceToDevice);

33 The interplanetary trip. Copy from device back to host: cudaMemcpy(b_h, b_d, nBytes, cudaMemcpyDeviceToHost);

34 The interplanetary trip. Check the round trip: for (i = 0; i < N; i++) { ASSERT(a_h[i] == b_h[i]); }

35 The interplanetary trip. Free the host memory: free(a_h); free(b_h); [Figure: only the device arrays remain.]

36 The interplanetary trip. Free the device memory: cudaFree(a_d); cudaFree(b_d); [Figure: both memories are empty again.]

(Each of these slides repeats the full listing from slide 26, highlighting the step shown, next to a diagram of the host and device memories.)

37 Inter-device Communication. Inter-device communication, i.e., between the CPU and the GPU, should be minimized as much as possible. It is orders of magnitude slower than reading data from shared memory, for instance. That is why GPUs are not good for interactive applications.

38 Avoid traveling whenever you can. From maxSort, available on the course webpage.

cudaMalloc((void**) &d_vec, mem_size);
cudaMemcpy(d_vec, h_vec, mem_size, cudaMemcpyHostToDevice);
kernel0<<<...>>>(d_vec, vec_size);
kernel1<<<...>>>(d_vec, vec_size);
cudaMemcpy(h_vec, d_vec, mem_size, cudaMemcpyDeviceToHost);
cudaFree(d_vec);

d_vec does not change between the kernel calls; therefore, there is no need to send it again!

39 Keep data on the GPU. Once data is sent to the GPU, it stays in DRAM, even after the kernel is done executing. Try invoking kernels on data that is already on the GPU. 1) Can you think of a situation in which it is better to leave a kernel, do some computation on the CPU, and then call another kernel? 2) By the way, can you think of a problem that is inherently sequential?

40 The GPU deserves complex work.

__global__ void matSumKernel(float* S, float* A, float* B, int side) {
  int ij = threadIdx.x + threadIdx.y * side;
  S[ij] = A[ij] + B[ij];
}

__global__ void matMul1(float* B, float* C, float* A, int w) {
  float v = 0.0;
  for (int k = 0; k < w; ++k) {
    v += B[threadIdx.y * w + k] * C[k * w + threadIdx.x];
  }
  A[threadIdx.x + threadIdx.y * w] = v;
}

1) What is the complexity of copying data from the CPU to the GPU? 2) Is it worth doing matrix sum on the GPU? 3) Is it worth doing matrix multiplication on the GPU?

41 Matrix Sum × Matrix Mul. Matrix Sum: O(n²) words are copied between host and device to do only O(n²) additions, so the transfer dominates. Matrix Mul: O(n²) words are copied to do O(n³) multiply-adds, so the computation can amortize the transfer.

42 The ballerina’s waltz. Start working as soon as data is available: use a pipeline! C for CUDA has an API for asynchronous transfers:

cudaMemcpy(dst, src, N * sizeof(float), dir);
kernel<<<N / nThreads, nThreads>>>(dst);

sz = N * sizeof(float) / nStreams;
for (i = 0; i < nStreams; i++) {
  offset = i * N / nStreams;
  cudaMemcpyAsync(dst + offset, src + offset, sz, dir, stream[i]);
}
for (i = 0; i < nStreams; i++) {
  gridSize = N / (nThreads * nStreams);
  offset = i * N / nStreams;
  kernel<<<gridSize, nThreads, 0, stream[i]>>>(dst + offset);
}

What is the glue between data and computation in this example?
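The snippet above assumes the stream array already exists. A minimal sketch (not from the slides) of the missing setup and teardown; for the copies to actually overlap with kernel execution, the host buffer typically has to be pinned, e.g. allocated with cudaMallocHost:

void run_pipelined(float *dst, float *src, int N, int nThreads) {
  const int nStreams = 4;                   // illustrative value
  cudaStream_t stream[nStreams];
  for (int i = 0; i < nStreams; i++)
    cudaStreamCreate(&stream[i]);           // each stream is an independent work queue

  // ... issue the cudaMemcpyAsync calls and kernel launches on stream[i] as shown above ...

  cudaDeviceSynchronize();                  // wait for every stream to drain
  for (int i = 0; i < nStreams; i++)
    cudaStreamDestroy(stream[i]);
}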

43 The ballerina’s waltz. Asynchronous memory copy overlaps data transfer with GPU processing. This way of obtaining parallelism, i.e., pipeline parallelism, is a pattern used in many different scenarios. Could you name other examples of pipeline parallelism?

44 DCC 888 λλ Universidade Federal de Minas Gerais – Department of Computer Science – Programming Languages Laboratory. Global Memory Access

45 The Silk Road Trip. Reading from or writing to the global memory is also slow, though not as slow as moving data between host and device: the global memory is on-board.

46 The Matrix Multiplication Kernel.

void matmult(float* B, float* C, float* A, int w) {
  for (unsigned int i = 0; i < w; ++i) {
    for (unsigned int j = 0; j < w; ++j) {
      A[i * w + j] = 0.0;
      for (unsigned int k = 0; k < w; ++k) {
        A[i * w + j] += B[i * w + k] * C[k * w + j];
      }
    }
  }
}

1) In the PRAM model, what is the asymptotic complexity of the matrix multiplication problem? 2) Could you translate this program to C for CUDA?

47 Matrix Multiplication Kernel. From matMul, available on the course webpage. In this example, each thread is responsible for multiplying one line of B by one column of C to produce one element of A.

__global__ void matMul1(float* B, float* C, float* A, int Width) {
  float Pvalue = 0.0;
  int tx = blockIdx.x * blockDim.x + threadIdx.x;
  int ty = blockIdx.y * blockDim.y + threadIdx.y;
  for (int k = 0; k < Width; ++k) {
    Pvalue += B[tx * Width + k] * C[k * Width + ty];
  }
  A[ty + tx * Width] = Pvalue;
}

1) What is the asymptotic complexity of this program? 2) Given Width = 10, how many accesses to the global memory does this program perform? 3) How can we know how many floating-point operations per second this program will perform?

48 GFLOPS.

  mov.f32       %f1, 0f00000000;
$Lt_0_1282:
  cvt.u64.u32   %rd3, %r7;
  mul.lo.u64    %rd4, %rd3, 4;
  ld.param.u64  %rd2, [B];
  add.u64       %rd5, %rd2, %rd4;
  ld.global.f32 %f2, [%rd5+0];
  cvt.u64.u32   %rd6, %r9;
  mul.lo.u64    %rd7, %rd6, 4;
  ld.param.u64  %rd1, [C];
  add.u64       %rd8, %rd1, %rd7;
  ld.global.f32 %f3, [%rd8+0];
  mad.f32       %f1, %f2, %f3, %f1;
  add.u32       %r7, %r7, 1;
  ld.param.s32  %r3, [w];
  add.u32       %r9, %r3, %r9;
  setp.ne.s32   %p2, %r7, %r8;
  @%p2 bra      $Lt_0_1282;

1) How many instructions do we find in block $Lt_0_1282 of this code? 2) How many floating-point operations does this program perform in the inner loop? 3) What, then, is the proportion of floating-point operations per GPU operation? 4) If the GTX 8800 can perform 172.8 GFlops, how many GFlops could we expect from this code? 5) And if we get a much lower number, what could be the reasons for this bad performance?

49 Coalesced Access to the Global Memory. The global memory is divided into segments of 16 cells. If the 16 threads of a half-warp read data from the same segment, the memory access takes only one trip. However, if each thread reads from a different segment, we may have a slow access to the global memory. In order to know if we are accessing the same segment, we can do the following check: take a thread t whose id is 16 × n; find the segment s that thread t is accessing; for every thread t + i, 1 ≤ i ≤ 15, see if t + i is accessing segment s.
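A host-side sketch (not from the slides) of this check: given the word index touched by each thread of a half-warp, test whether all 16 accesses fall into the same 16-word segment.

#include <stdbool.h>

bool is_coalesced(const int word_index[16]) {
  int segment = word_index[0] / 16;        // segment of the first thread (id 16 * n)
  for (int i = 1; i < 16; i++)
    if (word_index[i] / 16 != segment)     // a different segment forces another trip
      return false;
  return true;
}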

50 The Anatomy of a Block. [Figure: a block of threads laid out by their (tid.x, tid.y) coordinates; each row of the layout forms a warp.] Warps should read data in the same segment.

51 Checking for coalesced access.

__global__ void matMul1(float* B, float* C, float* A, int Width) {
  float Pvalue = 0.0;
  int tx = blockIdx.x * blockDim.x + threadIdx.x;
  int ty = blockIdx.y * blockDim.y + threadIdx.y;
  for (int k = 0; k < Width; ++k) {
    Pvalue += B[tx * Width + k] * C[k * Width + ty];
  }
  A[ty + tx * Width] = Pvalue;
}

1) What is the work that each thread (tx, ty) is doing? 2) Which accesses to the global memory are coalesced, and which ones are not?

52 Checking Memory Accesses. Assuming Width = 640 and k = 0, and the warp (0, 0), (1, 0), (2, 0), …, (31, 0):
B[tx * Width + k]: B[0], B[640], B[1280], …, B[19840]
C[k * Width + ty]: C[0], C[0], C[0], …, C[0]
A[ty + tx * Width]: A[0], A[640], A[1280], …, A[19840]

int tx = blockIdx.x * blockDim.x + threadIdx.x;
int ty = blockIdx.y * blockDim.y + threadIdx.y;
for (int k = 0; k < Width; ++k) {
  Pvalue += B[tx * Width + k] * C[k * Width + ty];
}
A[ty + tx * Width] = Pvalue;

We have a lot of uncoalesced accesses. How can we improve this situation?

53 Change the indexes a tiny bit…

Old code:
for (int k = 0; k < Width; ++k) {
  Pvalue += B[tx * Width + k] * C[k * Width + ty];
}
A[ty + tx * Width] = Pvalue;

New code:
__global__ void matMul5(float* B, float* C, float* A, int Width) {
  float Pvalue = 0.0;
  int tx = blockIdx.x * blockDim.x + threadIdx.x;
  int ty = blockIdx.y * blockDim.y + threadIdx.y;
  for (int k = 0; k < Width; ++k) {
    Pvalue += B[ty * Width + k] * C[k * Width + tx];
  }
  A[tx + ty * Width] = Pvalue;
}

54 Why is it so much better? Assuming Width = 640 and k = 0, and the warp (0, 0), (1, 0), (2, 0), …, (31, 0):
B[ty * Width + k]: B[0], B[0], B[0], …, B[0]
C[k * Width + tx]: C[0], C[1], C[2], …, C[31]
A[tx + ty * Width]: A[0], A[1], A[2], …, A[31]

int tx = blockIdx.x * blockDim.x + threadIdx.x;
int ty = blockIdx.y * blockDim.y + threadIdx.y;
for (int k = 0; k < Width; ++k) {
  Pvalue += B[ty * Width + k] * C[k * Width + tx];
}
A[tx + ty * Width] = Pvalue;

How much speed do you think you have gotten with this small change?

55 DCC 888 λλ Universidade Federal de Minas Gerais – Department of Computer Science – Programming Languages Laboratory. The Shared Memory

56 Second law of performance: avoid going to the global or local memory. Ideally, threads should share as much data as possible in shared memory. Remember: the shared memory is on-chip; the global memory is off-chip. Can we improve this matrix multiplication by sharing data among threads?

57 Moving data to the shared memory.

__global__ void matMul5(float* B, float* C, float* A, int Width) {
  float Pvalue = 0.0;
  int tx = blockIdx.x * blockDim.x + threadIdx.x;
  int ty = blockIdx.y * blockDim.y + threadIdx.y;
  for (int k = 0; k < Width; ++k) {
    Pvalue += B[ty * Width + k] * C[k * Width + tx];
  }
  A[tx + ty * Width] = Pvalue;
}

1) How many accesses to the global memory take place in this kernel? 2) Again: what is the shared memory? 3) How can we use the shared memory to improve this program?

58 Here TILE_WIDTH = blockDim.x = blockDim.y. Why do we have to synchronize threads at the __syncthreads() calls?

__global__ void matMul6(float* B, float* C, float* A, int Width) {
  __shared__ float Bs[TILE_WIDTH][TILE_WIDTH];
  __shared__ float Cs[TILE_WIDTH][TILE_WIDTH];
  int tx = threadIdx.x;
  int ty = threadIdx.y;
  int Row = blockIdx.x * TILE_WIDTH + tx;
  int Col = blockIdx.y * TILE_WIDTH + ty;
  float Pvalue = 0;
  for (int m = 0; m < Width/TILE_WIDTH; ++m) {
    Bs[ty][tx] = B[Col * Width + (m * TILE_WIDTH + tx)];
    Cs[ty][tx] = C[Row + (m * TILE_WIDTH + ty) * Width];
    __syncthreads();
    for (int k = 0; k < TILE_WIDTH; ++k)
      Pvalue += Bs[ty][k] * Cs[k][tx];
    __syncthreads();
  }
  A[Col * Width + Row] = Pvalue;
}
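A sketch (not from the slides) of the launch configuration that matMul6 expects: square blocks of TILE_WIDTH × TILE_WIDTH threads, with Width assumed to be a multiple of TILE_WIDTH.

#define TILE_WIDTH 16   // illustrative; must match blockDim.x and blockDim.y

void launch_matMul6(float *Bd, float *Cd, float *Ad, int Width) {
  dim3 dimBlock(TILE_WIDTH, TILE_WIDTH);
  dim3 dimGrid(Width / TILE_WIDTH, Width / TILE_WIDTH);
  matMul6<<<dimGrid, dimBlock>>>(Bd, Cd, Ad, Width);
}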

59 Tiling Can you tell if each of these indices gives us coalesced access or not?

60 The Tiles of Scheherazade. 1) How many accesses to the global memory does the original program perform? Answer in terms of N, the width of the matrix. 2) What about the tiled program? Answer in terms of N, the matrix width, and T, the tile width. 3) What if we assume that T = N/2?

61 The Tiles of Scheherazade. [Figure: an N × N matrix divided into tiles of width N/2.] How many reads with no tiling? How many reads with tiling? Size of each tile: … Number of times each tile is read: … Number of tiles: …

62 The Tiles of Scheherazade. [Figure: an N × N matrix divided into tiles of width N/2.] How many reads with no tiling? 2N × N² = 2N³. How many reads with tiling? Size of each tile: … Number of times each tile is read: … Number of tiles: …

63 The Tiles of Scheherazade. [Figure: an N × N matrix divided into tiles of width N/2.] How many reads with no tiling? 2N × N² = 2N³. How many reads with tiling? Size of each tile: N²/4. Number of times each tile is read: N/(N/2) = 2. Number of tiles: (N/(N/2))² × 2 = 8. So: N²/4 × 2 × 8 = 4N².
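In general, with tiles of width T the tiled kernel loads each tile of B and C once per pass, which cuts the global reads from 2N³ down to roughly 2N³/T. A small helper (illustrative only) that evaluates both counts:

long long reads_no_tiling(long long N) {
  return 2 * N * N * N;                 // every thread reads a row of B and a column of C
}

long long reads_with_tiling(long long N, long long T) {
  return 2 * N * N * N / T;             // for T = N/2 this is 4 * N^2, matching the slide
}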

64 Loop Unrolling.

__global__ void matMul6(float* B, float* C, float* A, int Width) {
  __shared__ float Bs[TILE_WIDTH][TILE_WIDTH];
  __shared__ float Cs[TILE_WIDTH][TILE_WIDTH];
  int tx = threadIdx.x;
  int ty = threadIdx.y;
  int Row = blockIdx.x * TILE_WIDTH + tx;
  int Col = blockIdx.y * TILE_WIDTH + ty;
  float Pvalue = 0;
  for (int m = 0; m < Width/TILE_WIDTH; ++m) {
    Bs[ty][tx] = B[Col * Width + (m * TILE_WIDTH + tx)];
    Cs[ty][tx] = C[Row + (m * TILE_WIDTH + ty) * Width];
    __syncthreads();
    for (int k = 0; k < TILE_WIDTH; ++k)
      Pvalue += Bs[ty][k] * Cs[k][tx];
    __syncthreads();
  }
  A[Col * Width + Row] = Pvalue;
}

1) Can we optimize this program with loop unrolling? 2) We have two loops. Which loop is more reasonable to unroll?

65 Loop Unrolling. This version assumes a tile width of 8:

float Pvalue = 0;
for (int m = 0; m < w/8; ++m) {
  Mds[tx][ty] = Md[Row * w + (m * 8 + ty)];
  Nds[tx][ty] = Nd[Col + (m * 8 + tx) * w];
  __syncthreads();
  Pvalue += Mds[tx][0] * Nds[0][ty];
  Pvalue += Mds[tx][1] * Nds[1][ty];
  Pvalue += Mds[tx][2] * Nds[2][ty];
  Pvalue += Mds[tx][3] * Nds[3][ty];
  Pvalue += Mds[tx][4] * Nds[4][ty];
  Pvalue += Mds[tx][5] * Nds[5][ty];
  Pvalue += Mds[tx][6] * Nds[6][ty];
  Pvalue += Mds[tx][7] * Nds[7][ty];
  __syncthreads();
}
Pd[Row*Width+Col] = Pvalue;

1) What is the proportion of floating-point to non-floating-point operations in the innermost loop of the optimized program? 2) What do we gain, at the hardware level, from having fewer branches to execute? 3) About ½ of the instructions in the innermost loop are floating-point operations. What are the new GFlops on the GTX 8800, which has an upper limit of 172.8 GFlops?
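The same effect can often be requested from the compiler with #pragma unroll instead of unrolling by hand; whether it pays off still depends on the code the compiler generates. A sketch of the inner loop of matMul6 written that way:

// Drop-in replacement for the innermost loop of matMul6 (sketch):
#pragma unroll
for (int k = 0; k < TILE_WIDTH; ++k)    // TILE_WIDTH is a compile-time constant,
  Pvalue += Bs[ty][k] * Cs[k][tx];      // so the compiler can fully unroll the loop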

66 Shared Bank Conflicts. The shared memory has 16 access doors (banks). If two threads of a half-warp read different addresses that fall behind the same door, a conflict happens; if they read exactly the same address, the value is broadcast and there is no conflict. [Figure: 16 banks under three access patterns: two ideal cases with no conflict, the same-address case, and a very bad case in which the accesses pile up on one door.]
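A sketch (assuming 16 banks of 4-byte words, as above) contrasting a conflict-free and a conflicting access pattern in shared memory:

__global__ void bank_demo(float *out) {
  __shared__ float s[16 * 16];
  int t = threadIdx.x;          // thread id within the half-warp, 0..15

  s[t] = (float)t;              // no conflict: thread i writes to bank i
  __syncthreads();

  float a = s[t];               // no conflict: consecutive words fall in consecutive banks
  float b = s[t * 16];          // 16-way conflict: every thread hits bank 0
  out[t] = a + b;
}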

67 Matrix transpose. From transpose, available on the course webpage, with BLOCK_WIDTH = 16. 1) Is access to the global memory coalesced? 2) Where is the bank conflict in the access to shared memory?

__global__ void transpose1(float* In, float* Out, int Width) {
  __shared__ float tile[BLOCK_WIDTH][BLOCK_WIDTH];
  // Compute the index of the data in the input matrix:
  int xIndex = blockIdx.x * blockDim.x + threadIdx.x;
  int yIndex = blockIdx.y * blockDim.y + threadIdx.y;
  int index_in = yIndex * Width + xIndex;
  // Compute the index of the data in the output matrix:
  xIndex = blockIdx.y * blockDim.x + threadIdx.x;
  yIndex = blockIdx.x * blockDim.y + threadIdx.y;
  int index_out = yIndex * Width + xIndex;
  // Copy the data:
  tile[threadIdx.x][threadIdx.y] = In[index_in];
  __syncthreads();
  Out[index_out] = tile[threadIdx.y][threadIdx.x];
}

68 Let's assume a warp with 4 threads and 4 access doors. Each color is an access door; the red square marks a warp. If the warp falls on blocks having the same color, then we have a conflict. [Figure: two 4 × 4 grids of thread coordinates, one for tile[threadIdx.x][threadIdx.y] and one for tile[threadIdx.y][threadIdx.x], colored by access door, with a warp marked on each.] P.S.: in the figure, (x, y) is the thread index, not the data index.

tile[threadIdx.x][threadIdx.y] = In[index_in];
__syncthreads();
Out[index_out] = tile[threadIdx.y][threadIdx.x];

69 A very simple solution (you may not quite believe me…):

__global__ void transpose2(float* In, float* Out, int Width) {
  __shared__ float tile[BLOCK_WIDTH][BLOCK_WIDTH + 1];
  // Compute the index of the data in the input matrix:
  int xIndex = blockIdx.x * blockDim.x + threadIdx.x;
  int yIndex = blockIdx.y * blockDim.y + threadIdx.y;
  int index_in = yIndex * Width + xIndex;
  // Compute the index of the data in the output matrix:
  xIndex = blockIdx.y * blockDim.x + threadIdx.x;
  yIndex = blockIdx.x * blockDim.y + threadIdx.y;
  int index_out = yIndex * Width + xIndex;
  // Copy the data:
  tile[threadIdx.x][threadIdx.y] = In[index_in];
  __syncthreads();
  Out[index_out] = tile[threadIdx.y][threadIdx.x];
}

70 [Figure: the same two 4 × 4 grids as in slide 68, for tile[threadIdx.x][threadIdx.y] and tile[threadIdx.y][threadIdx.x], now with the padded tile; the colors no longer line up inside a warp.] The extra column breaks the bad alignment in both cases.
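Why the extra column helps (a sketch of the arithmetic, here assuming 16 banks of 4-byte words): walking down a column of tile[16][16] touches addresses 16 words apart, which all map to the same bank, while with tile[16][17] they are 17 words apart and spread over all 16 banks.

int bank_unpadded(int row, int col) { return (row * 16 + col) % 16; }  // = col: same bank for every row
int bank_padded(int row, int col)   { return (row * 17 + col) % 16; }  // = (row + col) % 16: varies with row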

71 Stay Tuned. We have seen a suite of memory optimizations that speed up code meant to run on a GPU. There is another phenomenon that plays a very important role in the execution of GPU programs: control flow divergence. Control flow divergences are the subject of our next class.

