Graphics Processing Units
References:
Computer Architecture, 5th edition, Hennessy and Patterson, 2012
http://www.nvidia.com/content/PDF/fermi_white_papers/NVIDIA_Fermi_Compute_Architecture_Whitepaper.pdf
http://www.realworldtech.com/page.cfm?ArticleID=RWT093009110932&p=1
http://www.moderngpu.com/intro/performance.html
http://heather.cs.ucdavis.edu/parprocbook
CPU vs. GPU CPU: small fraction of chip used for arithmetic http://chip-architect.com/news/2003_04_20_Looking_at_Intels_Prescott_part2.html
CPU vs GPU GPU: large fraction of chip used for arithmetic http://www.pcper.com/reviews/Graphics-Cards/NVIDIA-GT200-Revealed -GeForce-GTX-280-and-GTX-260-Review/NVIDIA-GT200-Archite
CPU vs GPU
Intel Haswell: 170 GFlops on quad-core at 3.4GHz
AMD Radeon R9 290: 4800 GFlops at 0.95GHz
Nvidia GTX 970: 5000 GFlops at 1.05GHz
GPGPU
General-Purpose GPU programming
Massively parallel
Scientific computing, brain simulations, etc.
In supercomputers: 53 of the top500.org supercomputers used NVIDIA/AMD GPUs (Nov 2014 ranking), including 2nd and 6th places
OpenCL vs CUDA
Both for GPGPU, with similar performance
OpenCL: open standard; supported on AMD, NVIDIA, Intel, Altera, …
CUDA: proprietary (Nvidia); losing ground to OpenCL?
CUDA
Programming on Parallel Machines, Norm Matloff, Chapter 5
http://www.nvidia.com/content/PDF/fermi_white_papers/NVIDIA_Fermi_Compute_Architecture_Whitepaper.pdf
Uses a thread hierarchy: Thread, Block, Grid
Thread
Executes an instance of a kernel (program)
Has a ThreadID (within its block), program counter, registers, private memory, input and output parameters
Private memory used for register spills, function calls, array variables
Nvidia Fermi Whitepaper pg 6
Block
Set of concurrently executing threads
Threads cooperate via barrier synchronization and shared memory (fast but small)
Has a BlockID (within its grid)
Nvidia Fermi Whitepaper pg 6
Grid
Array of thread blocks running the same kernel
Blocks read and write global memory (slow – hundreds of cycles)
Synchronize between dependent kernel calls
Nvidia Fermi Whitepaper pg 6
Hardware Mapping
GPU executes 1+ kernel (program) grids
Streaming Multiprocessor (SM) executes 1+ thread blocks
CUDA core executes thread
Fermi Architecture
Debuted in 2010; 3 billion transistors
512 CUDA cores: 32 CUDA cores per SM, 16 SMs per GPU
Each CUDA core executes 1 FP or integer instruction per cycle
6 64-bit memory ports
PCI-Express interface to CPU
GigaThread scheduler distributes blocks to SMs
Each SM has a thread scheduler (in hardware) with fast context switch
Nvidia Fermi Whitepaper pg 7
CUDA core
Pipelined integer and FP units
FP unit: IEEE 754-2008, fused multiply-add
Integer unit: boolean, shift, move, compare, ...
Nvidia Fermi Whitepaper pg 8
Streaming Multiprocessor (SM)
32 CUDA cores
16 ld/st units to calculate source/destination addresses
4 Special Function Units: sine, cosine, reciprocal, sqrt
Nvidia Fermi Whitepaper pg 8
Warps
32 threads from a block are bundled into warps which execute the same instruction each cycle
This becomes the minimum size of SIMD data
Warps are implicitly synchronized
If threads branch in different directions, they step through both paths using predicated instructions
Two warp schedulers each select 1 instruction from a warp to issue to 16 cores, 16 ld/st units, or 4 SFUs
Maxwell Architecture (2014)
16 streaming multiprocessors * 128 cores/SM
Programming CUDA
C code:
daxpy(n,2.0,x,y); // invoke
void daxpy(int n, double a, double *x, double *y)
{
  for(int i=0; i<n; i++)
    y[i] = a*x[i] + y[i];
}
Programming CUDA
CUDA code:
__host__
int nblocks=(n+511)/512; // grid size
daxpy<<<nblocks,512>>>(n,2.0,x,y); // 512 threads/block
__global__
void daxpy(int n, double a, double *x, double *y)
{
  int i=blockIdx.x*blockDim.x + threadIdx.x;
  if(i<n) y[i] = a*x[i] + y[i];
}
n=8192, 512 threads/block: the grid holds 16 blocks, each block holds 16 warps of 32 threads
(figure: grid → block0 → warp0 → Y[0]=A*X[0]+Y[0], ...)
Moving data between host and GPU
int main() {
  double *x, *y, a, *dx, *dy;
  x = (double *)malloc(sizeof(double)*n);
  y = (double *)malloc(sizeof(double)*n);
  // initialize x and y…
  cudaMalloc((void **)&dx, n*sizeof(double));
  cudaMalloc((void **)&dy, n*sizeof(double));
  cudaMemcpy(dx, x, n*sizeof(double), cudaMemcpyHostToDevice);
  cudaMemcpy(dy, y, n*sizeof(double), cudaMemcpyHostToDevice);
  …
  daxpy<<<nblocks,512>>>(n,2.0,dx,dy); // kernel gets device pointers
  cudaThreadSynchronize();
  cudaMemcpy(y, dy, n*sizeof(double), cudaMemcpyDeviceToHost);
  cudaFree(dx); cudaFree(dy);
  free(x); free(y);
}