Graphics Processing Units


1 Graphics Processing Units
References:
Hennessy and Patterson, Computer Architecture: A Quantitative Approach, 5th edition, 2012
NVIDIA_Fermi_Compute_Architecture_Whitepaper.pdf

2 CPU vs. GPU
CPU: small fraction of chip used for arithmetic

3 CPU vs GPU
GPU: large fraction of chip used for arithmetic

4 CPU vs GPU
Intel Haswell: 170 GFlops (quad-core at 3.4 GHz)
AMD Radeon R9 290: 4800 GFlops (at 0.95 GHz)
Nvidia GTX 970: 5000 GFlops (at 1.05 GHz)
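These peak figures follow roughly from cores × FLOPs per cycle × clock; for example, the R9 290's 2560 stream processors × 2 FLOPs per cycle (one fused multiply-add) × 0.95 GHz ≈ 4800 GFlops.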

5 GPGPU
General-purpose GPU programming
Massively parallel: scientific computing, brain simulations, etc.
In supercomputers: 53 of the top500.org supercomputers used NVIDIA/AMD GPUs (Nov 2014 ranking), including the 2nd and 6th places

6 OpenCL vs CUDA
Both for GPGPU, with similar performance
OpenCL: open standard; supported on AMD, NVIDIA, Intel, Altera, …
CUDA: proprietary (Nvidia); losing ground to OpenCL?

7 CUDA
Programming on Parallel Machines, Norm Matloff, Chapter 5
NVIDIA_Fermi_Compute_Architecture_Whitepaper.pdf
Uses a thread hierarchy: Thread, Block, Grid (see the sketch below)
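A minimal sketch of how the hierarchy shows up in code (kernel and variable names are illustrative, not from the slides):

__global__ void where_am_i(int *out)   // runs once per thread
{
    int t = threadIdx.x;               // thread ID within its block
    int b = blockIdx.x;                // block ID within the grid
    int g = b * blockDim.x + t;        // unique ID across the whole grid
    out[g] = g;
}
// Launch: a grid of 4 blocks of 256 threads each covers out[0..1023]
// where_am_i<<<4, 256>>>(d_out);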

8 Thread
Executes an instance of a kernel (program)
Has a ThreadID (within its block), program counter, registers, private memory, input and output parameters
Private memory is used for register spills, function calls, array variables
Nvidia Fermi Whitepaper pg 6

9 Block
Set of concurrently executing threads
Threads cooperate via barrier synchronization and shared memory (fast but small); see the sketch below
Has a BlockID (within its grid)
Nvidia Fermi Whitepaper pg 6
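A minimal sketch of block-level cooperation (kernel and buffer names are illustrative; it assumes blockDim.x is a power of two, at most 512): threads stage values in shared memory, then meet at a barrier before reading each other's data.

__global__ void block_sum(double *in, double *out)
{
    __shared__ double buf[512];               // fast per-block shared memory
    int t = threadIdx.x;
    buf[t] = in[blockIdx.x*blockDim.x + t];   // each thread stages one value
    __syncthreads();                          // barrier: all stores now visible
    for (int s = blockDim.x/2; s > 0; s /= 2) {  // tree reduction in the block
        if (t < s)
            buf[t] += buf[t + s];
        __syncthreads();
    }
    if (t == 0)
        out[blockIdx.x] = buf[0];             // one result per block
}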

10 Grid
Array of thread blocks running the same kernel
Blocks read and write global memory (slow – hundreds of cycles)
Synchronize between dependent kernel calls (sketch below)
Nvidia Fermi Whitepaper pg 6
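A sketch of the grid-level rules (kernel names are illustrative): kernels launched on the same stream run in order, and the host can also wait explicitly.

step1<<<nblocks, 512>>>(d_data);   // writes global memory
step2<<<nblocks, 512>>>(d_data);   // same (default) stream: starts after step1 finishes
cudaDeviceSynchronize();           // host blocks until all device work is done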

11 Hardware Mapping
GPU: executes 1+ kernel (program) grids
Streaming Multiprocessor (SM): executes 1+ thread blocks
CUDA core: executes a thread

12 Fermi Architecture
Debuted in 2010; 3 billion transistors
512 CUDA cores, each executing 1 FP or integer instruction per cycle
32 CUDA cores per SM, 16 SMs per GPU
6 64-bit memory ports
PCI-Express interface to CPU
GigaThread scheduler distributes blocks to SMs
Each SM has a hardware thread scheduler with fast context switch
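These per-chip figures can be read back at run time with the standard cudaGetDeviceProperties call; a minimal sketch:

#include <stdio.h>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp p;
    cudaGetDeviceProperties(&p, 0);   // properties of device 0
    printf("%s: %d SMs at %d MHz\n",
           p.name, p.multiProcessorCount, p.clockRate/1000);
    return 0;
}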

13 (Fermi chip diagram) Nvidia Fermi Whitepaper pg 7

14 CUDA core
Pipelined integer and FP units
FP unit: IEEE 754-2008, fused multiply-add
Integer unit: boolean, shift, move, compare, ...
Nvidia Fermi Whitepaper pg 8

15 Streaming Multiprocessor (SM)
32 CUDA cores
16 ld/st units to calculate source/destination addresses
Special Function Units: sin, cosine, reciprocal, sqrt (see the sketch below)
Nvidia Fermi Whitepaper pg 8
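The SFUs are reached through CUDA's fast-math intrinsics; a minimal sketch (kernel and array names are illustrative):

__global__ void fast_trig(float *x, int n)
{
    int i = blockIdx.x*blockDim.x + threadIdx.x;
    if (i < n)
        x[i] = __sinf(x[i]) + __cosf(x[i]);   // hardware-approximated on the SFUs
}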

16 Warps
32 threads from a block are bundled into a warp, which executes the same instruction each cycle
This makes the warp the minimum size of SIMD data
Warps are implicitly synchronized
If threads branch in different directions, the warp steps through both paths using predicated instructions (sketch below)
Two warp schedulers each select 1 instruction from a warp to issue to 16 cores, 16 ld/st units, or 4 SFUs
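A minimal sketch of divergence within a warp (kernel name is illustrative): even and odd threads take different branches, so the warp executes both paths, with the non-taken side predicated off.

__global__ void diverge(int *out)
{
    int i = threadIdx.x;
    if (i % 2 == 0)
        out[i] = 1;    // the warp steps through this path first...
    else
        out[i] = 2;    // ...then this one, under predication
}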

17 Maxwell Architecture
Debuted in 2014
16 streaming multiprocessors × 128 cores/SM

18 Programming CUDA
C code:
daxpy(n, 2.0, x, y); // invoke

void daxpy(int n, double a, double *x, double *y)
{
  for (int i = 0; i < n; i++)
    y[i] = a*x[i] + y[i];
}

19 Programming CUDA
CUDA code:
__host__
int nblocks = (n+511)/512;              // grid size
daxpy<<<nblocks,512>>>(n, 2.0, x, y);   // 512 threads/block

__global__
void daxpy(int n, double a, double *x, double *y)
{
  int i = blockIdx.x*blockDim.x + threadIdx.x;
  if (i < n)                            // guard: the last block may overhang n
    y[i] = a*x[i] + y[i];
}

20 n=8192, 512 threads/block
(Diagram: the grid holds 8192/512 = 16 blocks; each block holds 512/32 = 16 warps; warp0 of block0 computes Y[0]=A*X[0]+Y[0], ...)

21 Moving data between host and GPU

int main()
{
  int n = 8192, nblocks = (n+511)/512;
  double *x, *y, *dx, *dy;                // host and device arrays
  x = (double *)malloc(sizeof(double)*n);
  y = (double *)malloc(sizeof(double)*n);
  // initialize x and y…
  cudaMalloc(&dx, n*sizeof(double));      // allocate device memory
  cudaMalloc(&dy, n*sizeof(double));
  cudaMemcpy(dx, x, n*sizeof(double), cudaMemcpyHostToDevice);
  cudaMemcpy(dy, y, n*sizeof(double), cudaMemcpyHostToDevice);
  daxpy<<<nblocks,512>>>(n, 2.0, dx, dy); // kernel works on device pointers
  cudaDeviceSynchronize();                // wait for the kernel to finish
  cudaMemcpy(y, dy, n*sizeof(double), cudaMemcpyDeviceToHost);
  cudaFree(dx); cudaFree(dy);
  free(x); free(y);
}
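The CUDA calls above all return status codes that the slide ignores; a minimal checking sketch using the standard cudaGetLastError/cudaGetErrorString API:

cudaError_t err = cudaGetLastError();   // e.g. right after the kernel launch
if (err != cudaSuccess)
    fprintf(stderr, "daxpy launch failed: %s\n", cudaGetErrorString(err));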


Download ppt "Graphics Processing Units"

Similar presentations


Ads by Google