Graphics Processing Units


1 Graphics Processing Units
References:
Hennessy and Patterson, Computer Architecture: A Quantitative Approach, 5th edition, 2012
NVIDIA_Fermi_Compute_Architecture_Whitepaper.pdf

2 CPU vs. GPU
CPU: small fraction of chip used for arithmetic

3 CPU vs GPU
GPU: large fraction of chip used for arithmetic

4 CPU vs GPU
Intel Haswell: 170 GFlops (quad-core at 3.4 GHz)
AMD Radeon R9 290: 4800 GFlops (at 0.95 GHz)
Nvidia GTX 970: 5000 GFlops (at 1.05 GHz)
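These peak figures follow roughly from cores × FLOPs per cycle × clock; for example, the R9 290's 2560 stream processors × 2 FLOPs per cycle (one fused multiply-add) × 0.95 GHz ≈ 4800 GFlops.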

5 GPGPU
General-purpose GPU programming
Massively parallel: scientific computing, brain simulations, etc.
In supercomputers: 53 of the top500.org supercomputers used NVIDIA/AMD GPUs (Nov 2014 ranking), including the 2nd and 6th places

6 OpenCL vs CUDA
Both for GPGPU, with similar performance
OpenCL: open standard; supported on AMD, NVIDIA, Intel, Altera, …
CUDA: proprietary (Nvidia); losing ground to OpenCL?

7 CUDA
Programming on Parallel Machines, Norm Matloff, Chapter 5
NVIDIA_Fermi_Compute_Architecture_Whitepaper.pdf
Uses a thread hierarchy: Thread, Block, Grid (see the sketch below)
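A minimal sketch of how the hierarchy shows up in code (kernel and variable names are illustrative, not from the slides):

__global__ void where_am_i(int *out)   // runs once per thread
{
    int t = threadIdx.x;               // thread ID within its block
    int b = blockIdx.x;                // block ID within the grid
    int g = b * blockDim.x + t;        // unique ID across the whole grid
    out[g] = g;
}
// Launch: a grid of 4 blocks of 256 threads each covers out[0..1023]
// where_am_i<<<4, 256>>>(d_out);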

8 Thread
Executes an instance of a kernel (program)
Has a ThreadID (within its block), program counter, registers, private memory, input and output parameters
Private memory is used for register spills, function calls, array variables
Nvidia Fermi Whitepaper pg 6

9 Block
Set of concurrently executing threads
Threads cooperate via barrier synchronization and shared memory (fast but small); see the sketch below
Has a BlockID (within its grid)
Nvidia Fermi Whitepaper pg 6
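A minimal sketch of block-level cooperation (kernel and buffer names are illustrative; it assumes blockDim.x is a power of two, at most 512): threads stage values in shared memory, then meet at a barrier before reading each other's data.

__global__ void block_sum(double *in, double *out)
{
    __shared__ double buf[512];               // fast per-block shared memory
    int t = threadIdx.x;
    buf[t] = in[blockIdx.x*blockDim.x + t];   // each thread stages one value
    __syncthreads();                          // barrier: all stores now visible
    for (int s = blockDim.x/2; s > 0; s /= 2) {  // tree reduction in the block
        if (t < s)
            buf[t] += buf[t + s];
        __syncthreads();
    }
    if (t == 0)
        out[blockIdx.x] = buf[0];             // one result per block
}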

10 Grid
Array of thread blocks running the same kernel
Blocks read and write global memory (slow – hundreds of cycles)
Synchronize between dependent kernel calls (sketch below)
Nvidia Fermi Whitepaper pg 6
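A sketch of the grid-level rules (kernel names are illustrative): kernels launched on the same stream run in order, and the host can also wait explicitly.

step1<<<nblocks, 512>>>(d_data);   // writes global memory
step2<<<nblocks, 512>>>(d_data);   // same (default) stream: starts after step1 finishes
cudaDeviceSynchronize();           // host blocks until all device work is done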

11 Hardware Mapping
GPU: executes 1+ kernel (program) grids
Streaming Multiprocessor (SM): executes 1+ thread blocks
CUDA core: executes a thread

12 Fermi Architecture
Debuted in 2010; 3 billion transistors
512 CUDA cores, each executing 1 FP or integer instruction per cycle
32 CUDA cores per SM, 16 SMs per GPU
6 64-bit memory ports
PCI-Express interface to CPU
GigaThread scheduler distributes blocks to SMs
Each SM has a hardware thread scheduler with fast context switch
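These per-chip figures can be read back at run time with the standard cudaGetDeviceProperties call; a minimal sketch:

#include <stdio.h>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp p;
    cudaGetDeviceProperties(&p, 0);   // properties of device 0
    printf("%s: %d SMs at %d MHz\n",
           p.name, p.multiProcessorCount, p.clockRate/1000);
    return 0;
}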

13 (Fermi chip diagram) Nvidia Fermi Whitepaper pg 7

14 CUDA core
Pipelined integer and FP units
FP unit: IEEE 754-2008, fused multiply-add
Integer unit: boolean, shift, move, compare, ...
Nvidia Fermi Whitepaper pg 8

15 Streaming Multiprocessor (SM)
32 CUDA cores
16 ld/st units to calculate source/destination addresses
Special Function Units: sin, cosine, reciprocal, sqrt (see the sketch below)
Nvidia Fermi Whitepaper pg 8
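The SFUs are reached through CUDA's fast-math intrinsics; a minimal sketch (kernel and array names are illustrative):

__global__ void fast_trig(float *x, int n)
{
    int i = blockIdx.x*blockDim.x + threadIdx.x;
    if (i < n)
        x[i] = __sinf(x[i]) + __cosf(x[i]);   // hardware-approximated on the SFUs
}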

16 Warps
32 threads from a block are bundled into a warp, which executes the same instruction each cycle
This makes the warp the minimum size of SIMD data
Warps are implicitly synchronized
If threads branch in different directions, the warp steps through both paths using predicated instructions (sketch below)
Two warp schedulers each select 1 instruction from a warp to issue to 16 cores, 16 ld/st units, or 4 SFUs
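A minimal sketch of divergence within a warp (kernel name is illustrative): even and odd threads take different branches, so the warp executes both paths, with the non-taken side predicated off.

__global__ void diverge(int *out)
{
    int i = threadIdx.x;
    if (i % 2 == 0)
        out[i] = 1;    // the warp steps through this path first...
    else
        out[i] = 2;    // ...then this one, under predication
}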

17 Maxwell Architecture
Debuted in 2014
16 streaming multiprocessors × 128 cores/SM

18 Programming CUDA
C code:
daxpy(n, 2.0, x, y); // invoke

void daxpy(int n, double a, double *x, double *y)
{
  for (int i = 0; i < n; i++)
    y[i] = a*x[i] + y[i];
}

19 Programming CUDA
CUDA code:
__host__
int nblocks = (n+511)/512;              // grid size
daxpy<<<nblocks,512>>>(n, 2.0, x, y);   // 512 threads/block

__global__
void daxpy(int n, double a, double *x, double *y)
{
  int i = blockIdx.x*blockDim.x + threadIdx.x;
  if (i < n)                            // guard: the last block may overhang n
    y[i] = a*x[i] + y[i];
}

20 n=8192, 512 threads/block
(Diagram: the grid holds 8192/512 = 16 blocks; each block holds 512/32 = 16 warps; warp0 of block0 computes Y[0]=A*X[0]+Y[0], ...)

21 Moving data between host and GPU

int main()
{
  int n = 8192, nblocks = (n+511)/512;
  double *x, *y, *dx, *dy;                // host and device arrays
  x = (double *)malloc(sizeof(double)*n);
  y = (double *)malloc(sizeof(double)*n);
  // initialize x and y…
  cudaMalloc(&dx, n*sizeof(double));      // allocate device memory
  cudaMalloc(&dy, n*sizeof(double));
  cudaMemcpy(dx, x, n*sizeof(double), cudaMemcpyHostToDevice);
  cudaMemcpy(dy, y, n*sizeof(double), cudaMemcpyHostToDevice);
  daxpy<<<nblocks,512>>>(n, 2.0, dx, dy); // kernel works on device pointers
  cudaDeviceSynchronize();                // wait for the kernel to finish
  cudaMemcpy(y, dy, n*sizeof(double), cudaMemcpyDeviceToHost);
  cudaFree(dx); cudaFree(dy);
  free(x); free(y);
}
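The CUDA calls above all return status codes that the slide ignores; a minimal checking sketch using the standard cudaGetLastError/cudaGetErrorString API:

cudaError_t err = cudaGetLastError();   // e.g. right after the kernel launch
if (err != cudaSuccess)
    fprintf(stderr, "daxpy launch failed: %s\n", cudaGetErrorString(err));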


Download ppt "Graphics Processing Units"

Similar presentations


Ads by Google