Introduction to CUDA and CELL SpursEngine Multi-core Programming
References:
1. NVIDIA CUDA (Compute Unified Device Architecture) documents
2. Presentations of David Kirk/NVIDIA and Professor Wen-mei W. Hwu
3. IBM CELL documents
4. Toshiba SpursEngine documents
鄭羽伸, 麗臺科技股份有限公司 (Leadtek Research Inc.), Nov. 13th, 2009
CUDA Programming CUDA : Compute Unified Device Architecture 2
Floating-Point Operations per Second for CPU and GPU 3
Memory Bandwidth for CPU and GPU 4
CUDA Architecture A set of SIMT multiprocessors with on-chip shared memory. 5
Note: Each TPC (Texture Processing Cluster) contains a group of Streaming Multiprocessors (SMs), and each SM is made up of eight individual Streaming Processors (SPs). The original G80 had eight TPCs, with two SMs inside each TPC. Each TPC was therefore composed of 16 individual streaming processors, for 128 SPs across the entire die. 13
Grid of Thread Blocks A kernel is executed by a grid of thread blocks; each block is an array of threads that can cooperate through shared memory. 14
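To make the grid-of-blocks idea concrete, here is a minimal sketch (not from the slides) of a kernel launched over a 1D grid; the names vecAdd, n, and the 256-thread block size are illustrative assumptions.

// Minimal sketch: each thread of the grid handles one array element.
__global__ void vecAdd(const float* a, const float* b, float* c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n)                                      // guard the partially filled last block
        c[i] = a[i] + b[i];
}

// Host-side launch: enough 256-thread blocks to cover n elements.
// vecAdd<<<(n + 255) / 256, 256>>>(d_a, d_b, d_c, n);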
Automatic Scalability A device with more multiprocessors will automatically execute a kernel in less time than a device with fewer multiprocessors. 15
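As an aside (not from the slides), the multiprocessor count that drives this scaling can be queried at run time with the CUDA runtime API; a minimal sketch:

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);     // properties of device 0
    printf("%s: %d multiprocessors, max %d threads per block\n",
           prop.name, prop.multiProcessorCount, prop.maxThreadsPerBlock);
    return 0;
}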
G80 CUDA mode 16 © David Kirk/NVIDIA and Wen-mei W. Hwu Taiwan, June 30-July 2, 2008
Block IDs and Thread IDs
Each thread uses IDs to decide what data to work on:
– Block ID: 1D or 2D
– Thread ID: 1D, 2D, or 3D
This simplifies memory addressing when processing multidimensional data:
– Image processing
– Solving PDEs on volumes
– …
(Figure: a device grid of 2D blocks, each block a 2D array of threads.)
PDE: Partial Differential Equations
17 © David Kirk/NVIDIA and Wen-mei W. Hwu Taiwan, June 30-July 2, 2008
Block IDs and Thread IDs
The grid/block/thread hierarchy corresponds to a nest of loops over blocks and threads:
for (bi = 0; bi < M; bi++)        // block index y
  for (bj = 0; bj < N; bj++)      // block index x
    for (ti = 0; ti < W; ti++)    // thread index y
      for (tj = 0; tj < Z; tj++)  // thread index x
        ThreadProcessing(bi, bj, ti, tj);
(Figure: the same device grid of blocks and threads as on the previous slide.)
Note: the number of threads in a block is limited (at most 512 on G80-class hardware).
18
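A hedged sketch (not from the slides) of how the four loop indices above map onto the CUDA built-in variables inside a kernel; ThreadProcessing is a hypothetical per-element function:

__global__ void GridKernel(/* data arguments */)
{
    // The hardware effectively runs the four nested loops in parallel:
    int bi = blockIdx.y;    // outer block loop
    int bj = blockIdx.x;
    int ti = threadIdx.y;   // inner thread loop
    int tj = threadIdx.x;
    // ThreadProcessing(bi, bj, ti, tj);  // hypothetical per-element work
}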
Programming Model: Square Matrix Multiplication Example
P = M * N, with all matrices of size WIDTH x WIDTH
Without tiling:
– One thread calculates one element of P
– M and N are loaded WIDTH times from global memory
(Figure: matrices M, N, and P, each WIDTH x WIDTH.)
19 © David Kirk/NVIDIA and Wen-mei W. Hwu Taiwan, June 30-July 2, 2008
Step 1: Matrix Multiplication, A Simple Host Version in C
// Matrix multiplication on the (CPU) host in double precision
void MatrixMulOnHost(float* M, float* N, float* P, int Width)
{
    for (int i = 0; i < Width; ++i)
        for (int j = 0; j < Width; ++j) {
            double sum = 0;
            for (int k = 0; k < Width; ++k) {
                double a = M[i * Width + k];
                double b = N[k * Width + j];
                sum += a * b;
            }
            P[i * Width + j] = sum;
        }
}
(Figure: element (i, j) of P is the dot product of row i of M and column j of N.)
20 © David Kirk/NVIDIA and Wen-mei W. Hwu Taiwan, June 30-July 2, 2008
Step 2: Input Matrix Data Transfer (Host-side Code)
void MatrixMulOnDevice(float* M, float* N, float* P, int Width)
{
    int size = Width * Width * sizeof(float);
    float *Md, *Nd, *Pd;
    …
    // 1. Allocate and load M and N into device memory
    cudaMalloc((void**)&Md, size);
    cudaMemcpy(Md, M, size, cudaMemcpyHostToDevice);
    cudaMalloc((void**)&Nd, size);
    cudaMemcpy(Nd, N, size, cudaMemcpyHostToDevice);
    // Allocate P on the device
    cudaMalloc((void**)&Pd, size);
21 © David Kirk/NVIDIA and Wen-mei W. Hwu Taiwan, June 30-July 2, 2008
Step 3: Output Matrix Data Transfer (Host-side Code)
    // 2. Kernel invocation code – to be shown later
    …
    // 3. Read P from the device
    cudaMemcpy(P, Pd, size, cudaMemcpyDeviceToHost);
    // Free device matrices
    cudaFree(Md); cudaFree(Nd); cudaFree(Pd);
}
22 © David Kirk/NVIDIA and Wen-mei W. Hwu Taiwan, June 30-July 2, 2008
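The slides omit error handling for brevity; in practice each runtime call returns a cudaError_t that is worth checking. A minimal sketch (the CHECK macro is an illustrative assumption, not part of the course code):

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Abort with a readable message if a CUDA runtime call fails.
#define CHECK(call)                                                   \
    do {                                                              \
        cudaError_t err = (call);                                     \
        if (err != cudaSuccess) {                                     \
            fprintf(stderr, "CUDA error %s at %s:%d\n",               \
                    cudaGetErrorString(err), __FILE__, __LINE__);     \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

// Usage: CHECK(cudaMalloc((void**)&Md, size));
//        CHECK(cudaMemcpy(Md, M, size, cudaMemcpyHostToDevice));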
Step 4: Kernel Function
// Matrix multiplication kernel – per thread code
__global__ void MatrixMulKernel(float* Md, float* Nd, float* Pd, int Width)
{
    // 2D thread ID
    int tx = threadIdx.x;
    int ty = threadIdx.y;
    // Pvalue is used to store the element of the matrix
    // that is computed by the thread
    float Pvalue = 0;
23 © David Kirk/NVIDIA and Wen-mei W. Hwu Taiwan, June 30-July 2, 2008
Step 4: Kernel Function (cont.)
    for (int k = 0; k < Width; ++k) {
        float Melement = Md[ty * Width + k];
        float Nelement = Nd[k * Width + tx];
        Pvalue += Melement * Nelement;
    }
    // Write the matrix to device memory;
    // each thread writes one element
    Pd[ty * Width + tx] = Pvalue;
}
(Figure: thread (tx, ty) reads row ty of Md and column tx of Nd.)
24 © David Kirk/NVIDIA and Wen-mei W. Hwu Taiwan, June 30-July 2, 2008
Step 5: Kernel Invocation (Host-side Code)
    // Setup the execution configuration
    dim3 dimBlock(Width, Width);
    dim3 dimGrid(1, 1);
    // Launch the device computation threads!
    MatrixMulKernel<<<dimGrid, dimBlock>>>(Md, Nd, Pd, Width);
25 © David Kirk/NVIDIA and Wen-mei W. Hwu Taiwan, June 30-July 2, 2008
How about performance on G80?
All threads access global memory for their input matrix elements:
– Two memory accesses (8 bytes) per floating-point multiply-add
– Since a multiply-add counts as two FLOPs, that is 4 bytes of memory traffic per FLOP
– 4 * 346.5 = 1386 GB/s would be required to achieve the peak FLOP rating
– The available 86.4 GB/s limits the code to 21.6 GFLOPS
The actual code runs at about 15 GFLOPS.
We need to drastically cut down memory accesses to get closer to the peak GFLOPS.
(Figure: CUDA memory hierarchy: host, per-grid global and constant memory, per-block shared memory, per-thread registers.)
© David Kirk/NVIDIA and Wen-mei W. Hwu Taiwan, June 30-July 2, 2008
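The bandwidth arithmetic above can be written out explicitly; a small host-side sketch (the 346.5 GFLOPS and 86.4 GB/s figures are the G80 numbers quoted on the slide):

#include <cstdio>

int main()
{
    const double peak_gflops    = 346.5;  // G80 peak single-precision rate
    const double bytes_per_flop = 4.0;    // one 4-byte operand load per FLOP
    const double mem_bw_gbs     = 86.4;   // G80 global-memory bandwidth

    // Bandwidth needed to keep the ALUs busy, and the FLOP rate the
    // real bandwidth can actually sustain for this untiled kernel.
    printf("required bandwidth: %.0f GB/s\n", peak_gflops * bytes_per_flop);      // ~1386
    printf("bandwidth-limited rate: %.1f GFLOPS\n", mem_bw_gbs / bytes_per_flop); // 21.6
    return 0;
}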
G80 Implementation of CUDA Memories
Each thread can:
– Read/write per-thread registers
– Read/write per-thread local memory
– Read/write per-block shared memory
– Read/write per-grid global memory
– Read only per-grid constant memory
(Figure: the same memory hierarchy: host, global and constant memory, per-block shared memory, per-thread registers.)
© David Kirk/NVIDIA and Wen-mei W. Hwu Taiwan, June 30-July 2, 2008
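A hedged sketch (not from the slides) of how these memory spaces appear in CUDA C source; the variable names and the 16x16 block size are illustrative assumptions:

__constant__ float coeff[4];          // per-grid constant memory (read-only in kernels)
__device__   float globalBuf[256];    // per-grid global memory, visible to all threads

__global__ void MemorySpacesDemo(const float* gmemIn)   // gmemIn also resides in global memory
{
    __shared__ float tile[16][16];    // per-block shared memory
    int tx = threadIdx.x, ty = threadIdx.y;   // assume a 16x16 thread block
    float r = gmemIn[ty * 16 + tx];   // 'r' lives in a per-thread register
    tile[ty][tx] = r * coeff[0];      // stage the value in shared memory
    __syncthreads();
    globalBuf[ty * 16 + tx] = tile[ty][tx];   // write back to global memory
    // (Large per-thread arrays and register spills are placed in per-thread local memory.)
}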
Matrix Multiplication using Shared Memory
© David Kirk/NVIDIA and Wen-mei W. Hwu Taiwan, June 30-July 2, 2008
Idea: Use Shared Memory to Reuse Global Memory Data
Each input element is read by WIDTH threads.
Load each element into shared memory and have several threads use the local copy to reduce the memory bandwidth:
– Tiled algorithms
(Figure: matrices M, N, and P; thread (tx, ty) reuses tiles of M and N.)
© David Kirk/NVIDIA and Wen-mei W. Hwu Taiwan, June 30-July 2, 2008
Tiled Multiply
Each block computes one square sub-matrix Pdsub of size TILE_WIDTH x TILE_WIDTH.
Each thread computes one element of Pdsub.
Assume that the dimensions of Md and Nd are multiples of TILE_WIDTH.
(Figure: Md, Nd, and Pd partitioned into TILE_WIDTH x TILE_WIDTH tiles; block (bx, by), thread (tx, ty).)
© David Kirk/NVIDIA and Wen-mei W. Hwu Taiwan, June 30-July 2, 2008
First-order Size Considerations in G80
Each thread block should have many threads:
– TILE_WIDTH of 16 gives 16*16 = 256 threads
There should be many thread blocks:
– A 1024*1024 Pd gives 64*64 = 4096 thread blocks
Each thread block performs 2*256 = 512 float loads from global memory for 256 * (2*16) = 8,192 mul/add operations.
– Memory bandwidth is no longer a limiting factor.
© David Kirk/NVIDIA and Wen-mei W. Hwu Taiwan, June 30-July 2, 2008
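One way to see the benefit: without tiling each input element is fetched WIDTH times, with tiling only WIDTH/TILE_WIDTH times, so a TILE_WIDTH of 16 cuts global-memory traffic by a factor of 16. A small sketch of that arithmetic (the WIDTH value of 1024 matches the example above):

#include <cstdio>

int main()
{
    const long long WIDTH = 1024, TILE_WIDTH = 16;
    // Untiled: every one of the WIDTH*WIDTH threads reads a full row of M and column of N.
    long long untiled_loads = 2 * WIDTH * WIDTH * WIDTH;
    // Tiled: each element is loaded once per tile phase instead of once per thread.
    long long tiled_loads   = untiled_loads / TILE_WIDTH;
    printf("untiled loads: %lld, tiled loads: %lld (%lldx reduction)\n",
           untiled_loads, tiled_loads, TILE_WIDTH);
    return 0;
}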
CUDA Code – Kernel Execution Configuration
// Setup the execution configuration
dim3 dimBlock(TILE_WIDTH, TILE_WIDTH);
dim3 dimGrid(Width / TILE_WIDTH, Width / TILE_WIDTH);
© David Kirk/NVIDIA and Wen-mei W. Hwu Taiwan, June 30-July 2, 2008
CUDA Code – Kernel Overview
// Block index
int bx = blockIdx.x;
int by = blockIdx.y;
// Thread index
int tx = threadIdx.x;
int ty = threadIdx.y;
// Pvalue stores the element of the block sub-matrix
// that is computed by the thread – automatic variable!
float Pvalue = 0;
// Loop over all the sub-matrices of M and N
// required to compute the block sub-matrix
for (int m = 0; m < Width/TILE_WIDTH; ++m) {
    // code from the next few slides
}
© David Kirk/NVIDIA and Wen-mei W. Hwu Taiwan, June 30-July 2, 2008
Tiled Multiply (cont.)
Each block computes one square sub-matrix Pdsub of size TILE_WIDTH x TILE_WIDTH; each thread computes one element of Pdsub.
(Figure: the tiling diagram again, now marking the loop index m that selects the current tiles of Md and Nd and the index k within a tile.)
© David Kirk/NVIDIA and Wen-mei W. Hwu Taiwan, June 30-July 2, 2008
CUDA Code - Load Data to Shared Memory
// Get a pointer to the current sub-matrix Msub of M
float* Mdsub = GetSubMatrix(Md, m, by, Width);
// Get a pointer to the current sub-matrix Nsub of N
float* Ndsub = GetSubMatrix(Nd, bx, m, Width);
__shared__ float Mds[TILE_WIDTH][TILE_WIDTH];
__shared__ float Nds[TILE_WIDTH][TILE_WIDTH];
// Each thread loads one element of the sub-matrix
Mds[ty][tx] = GetMatrixElement(Mdsub, tx, ty);
// Each thread loads one element of the sub-matrix
Nds[ty][tx] = GetMatrixElement(Ndsub, tx, ty);
© David Kirk/NVIDIA and Wen-mei W. Hwu Taiwan, June 30-July 2, 2008
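GetSubMatrix and GetMatrixElement are not defined on these slides; a minimal sketch of what they could look like, assuming row-major storage and the argument order used above. These definitions are an assumption, not the course's actual helpers, and since the slide's GetMatrixElement call does not pass a row stride, the sketch assumes the matrix width is a compile-time constant:

#define MATRIX_WIDTH 1024   // assumed full-matrix row stride
#define TILE_WIDTH   16     // as on the configuration slide

// Pointer to the top-left element of the tile at (tileCol, tileRow).
__device__ float* GetSubMatrix(float* Mat, int tileCol, int tileRow, int Width)
{
    return &Mat[tileRow * TILE_WIDTH * Width + tileCol * TILE_WIDTH];
}

// Element (tx, ty) of a sub-matrix whose rows are MATRIX_WIDTH floats apart.
__device__ float GetMatrixElement(const float* Sub, int tx, int ty)
{
    return Sub[ty * MATRIX_WIDTH + tx];
}

// Counterpart used on the "Save Result" slide.
__device__ void SetMatrixElement(float* Sub, int tx, int ty, float value)
{
    Sub[ty * MATRIX_WIDTH + tx] = value;
}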
CUDA Code - Compute Result
// Synchronize to make sure the sub-matrices are loaded
// before starting the computation
__syncthreads();
// Each thread computes one element of the block sub-matrix
for (int k = 0; k < TILE_WIDTH; ++k)
    Pvalue += Mds[ty][k] * Nds[k][tx];
// Synchronize to make sure that the preceding
// computation is done before loading two new
// sub-matrices of M and N in the next iteration
__syncthreads();
© David Kirk/NVIDIA and Wen-mei W. Hwu Taiwan, June 30-July 2, 2008
CUDA Code - Save Result
// Get a pointer to the block sub-matrix of Pd
float* Pdsub = GetSubMatrix(Pd, bx, by, Width);
// Write the block sub-matrix to device memory;
// each thread writes one element
SetMatrixElement(Pdsub, tx, ty, Pvalue);
This code runs at about 45 GFLOPS on G80.
© David Kirk/NVIDIA and Wen-mei W. Hwu Taiwan, June 30-July 2, 2008
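Putting the fragments from the preceding slides together, here is one self-contained version of the tiled kernel as a sketch: it inlines the sub-matrix indexing instead of using the GetSubMatrix/GetMatrixElement helpers, and assumes Width is a multiple of TILE_WIDTH as stated earlier.

#define TILE_WIDTH 16

__global__ void MatrixMulTiledKernel(float* Md, float* Nd, float* Pd, int Width)
{
    __shared__ float Mds[TILE_WIDTH][TILE_WIDTH];
    __shared__ float Nds[TILE_WIDTH][TILE_WIDTH];

    int bx = blockIdx.x,  by = blockIdx.y;
    int tx = threadIdx.x, ty = threadIdx.y;

    // Row and column of the Pd element this thread computes.
    int row = by * TILE_WIDTH + ty;
    int col = bx * TILE_WIDTH + tx;

    float Pvalue = 0;
    // Loop over the tiles of Md and Nd needed for this Pd tile.
    for (int m = 0; m < Width / TILE_WIDTH; ++m) {
        // Each thread loads one element of each tile into shared memory.
        Mds[ty][tx] = Md[row * Width + (m * TILE_WIDTH + tx)];
        Nds[ty][tx] = Nd[(m * TILE_WIDTH + ty) * Width + col];
        __syncthreads();                      // wait until both tiles are loaded

        for (int k = 0; k < TILE_WIDTH; ++k)  // partial dot product from this tile
            Pvalue += Mds[ty][k] * Nds[k][tx];
        __syncthreads();                      // finish before overwriting the tiles
    }
    Pd[row * Width + col] = Pvalue;           // one element per thread
}

// Launch (host side), assuming Width % TILE_WIDTH == 0:
// dim3 dimBlock(TILE_WIDTH, TILE_WIDTH);
// dim3 dimGrid(Width / TILE_WIDTH, Width / TILE_WIDTH);
// MatrixMulTiledKernel<<<dimGrid, dimBlock>>>(Md, Nd, Pd, Width);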