1
Graphics Processing Units
References:
Computer Architecture, 5th edition, Hennessy and Patterson, 2012
rmi_Compute_Architecture_Whitepaper.pdf
0932&p=1
2
CPU vs. GPU
CPU: small fraction of chip used for arithmetic
3
CPU vs. GPU
GPU: large fraction of chip used for arithmetic
-GeForce-GTX-280-and-GTX-260-Review/NVIDIA-GT200-Archite
4
CPU vs. GPU
Intel Haswell: 170 GFlops on quad-core at 3.4GHz
AMD Radeon R9 290: 4800 GFlops at 0.95GHz
Nvidia GTX 970: 5000 GFlops at 1.05GHz
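To see where the GPU advantage comes from, the quoted peak throughput can be divided by the clock rate. The sketch below is only illustrative: `flops_per_cycle` is a hypothetical helper, the inputs are the slide's figures, and the R9 290 clock is taken as 0.95 GHz (an assumption of this sketch).

```c
#include <stdio.h>

/* Peak GFlops divided by clock in GHz gives floating-point operations
   completed per clock cycle across the whole chip. */
double flops_per_cycle(double gflops, double ghz) {
    return gflops / ghz;
}

/* Prints the ratio for the three chips on the slide.
   The 0.95 GHz clock for the R9 290 is an assumption of this sketch. */
void print_flops_per_cycle(void) {
    printf("Intel Haswell:     %.0f flops/cycle\n", flops_per_cycle(170.0, 3.4));
    printf("AMD Radeon R9 290: %.0f flops/cycle\n", flops_per_cycle(4800.0, 0.95));
    printf("Nvidia GTX 970:    %.0f flops/cycle\n", flops_per_cycle(5000.0, 1.05));
}
```

The GPUs run at a lower clock than the CPU, so their higher peak rate comes from doing far more arithmetic per cycle, not from cycling faster.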
5
GPGPU
General-purpose GPU programming
Massively parallel
Scientific computing, brain simulations, etc.
In supercomputers: 53 of top500.org supercomputers used NVIDIA/AMD GPUs (Nov 2014 ranking), including 2nd and 6th places
6
OpenCL vs CUDA
Both for GPGPU
OpenCL: open standard; supported on AMD, NVIDIA, Intel, Altera, …
CUDA: proprietary (Nvidia); losing ground to OpenCL?
Similar performance
7
CUDA
References: Programming on Parallel Machines, Norm Matloff, Chapter 5; _papers/NVIDIA_Fermi_Compute_Architecture_Whitepaper.pdf
Uses a thread hierarchy: Thread, Block, Grid
8
Thread
Executes an instance of a kernel (program)
Has a ThreadID (within block), program counter, registers, private memory, input and output parameters
Private memory is used for register spills, function calls, array variables
Nvidia Fermi Whitepaper pg 6
9
Block
Set of concurrently executing threads
Threads cooperate via barrier synchronization and shared memory (fast but small)
Has a BlockID (within grid)
Nvidia Fermi Whitepaper pg 6
10
Grid
Array of thread blocks running same kernel
Blocks read and write global memory (slow: hundreds of cycles)
Synchronize between dependent kernel calls
Nvidia Fermi Whitepaper pg 6
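The thread, block, and grid levels can be sketched in plain C by running the "threads" one after another. This is only an emulation under stated assumptions: `thread_ctx`, `launch`, and `record_index` are hypothetical names standing in for CUDA's built-in index variables and launch syntax, and real GPU threads run concurrently, not in loop order.

```c
/* Hypothetical stand-in for the per-thread indices CUDA provides. */
typedef struct {
    int blockIdx;   /* which block within the grid   */
    int threadIdx;  /* which thread within the block */
    int blockDim;   /* threads per block             */
    int gridDim;    /* blocks per grid               */
} thread_ctx;

/* Unique index of a thread across the whole grid. */
int global_index(thread_ctx t) {
    return t.blockIdx * t.blockDim + t.threadIdx;
}

/* Runs the kernel once per thread, block by block, and
   returns the total number of threads executed. */
int launch(int gridDim, int blockDim,
           void (*kernel)(thread_ctx, void *), void *arg) {
    int launched = 0;
    for (int b = 0; b < gridDim; b++) {
        for (int t = 0; t < blockDim; t++) {
            thread_ctx ctx = { b, t, blockDim, gridDim };
            kernel(ctx, arg);
            launched++;
        }
    }
    return launched;
}

/* Example kernel: every thread writes its own global index. */
void record_index(thread_ctx t, void *arg) {
    int *out = (int *)arg;
    out[global_index(t)] = global_index(t);
}
```

Calling `launch(2, 4, record_index, out)` on an 8-element array runs 2 blocks of 4 threads, and each thread touches exactly one element.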
11
Hardware Mapping
GPU: executes 1+ kernel (program) grids
Streaming Multiprocessor (SM): executes 1+ thread blocks
CUDA core: executes thread
12
Fermi Architecture
Debuted in 2010
512 CUDA cores; each executes 1 FP or integer instruction per cycle
32 CUDA cores per SM
16 SMs per GPU
6 64-bit memory ports
PCI-Express interface to CPU
GigaThread scheduler distributes blocks to SMs
each SM has a thread scheduler (in hardware) with fast context switch
3 billion transistors
13
Nvidia Fermi Whitepaper pg 7
14
CUDA core
pipelined integer and FP units
FP unit: IEEE 754-2008, fused multiply-add
integer unit: boolean, shift, move, compare, ...
Nvidia Fermi Whitepaper pg 8
15
Streaming Multiprocessor (SM)
32 CUDA cores
16 ld/st units: calculate source/destination addresses
Special Function Units: sin, cosine, reciprocal, sqrt
Nvidia Fermi Whitepaper pg 8
16
Warps
32 threads from a block are bundled into a warp; all threads of a warp execute the same instruction each cycle
this becomes the minimum size of SIMD data
warps are implicitly synchronized
if threads branch in different directions, they step through both paths using predicated instructions
two warp schedulers each select 1 instruction from a warp to issue to 16 cores, 16 ld/st units, or 4 SFUs
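Branch divergence inside a warp can be illustrated with a small C emulation. The function `warp_branch` and the example branch are hypothetical; the sketch assumes a path is skipped only when no lane needs it, and it models lanes with a loop where hardware would run them in lockstep.

```c
#define WARP_SIZE 32

/* Emulates one warp executing
       if (x < 0) x = -x; else x = 2 * x;
   Each lane computes a predicate. The warp then steps through both
   paths, and the predicate decides which lanes commit a result on
   each path. Returns the number of passes the warp had to issue. */
int warp_branch(int x[WARP_SIZE]) {
    int pred[WARP_SIZE];
    int any_taken = 0, any_not_taken = 0, passes = 0;

    for (int lane = 0; lane < WARP_SIZE; lane++) {
        pred[lane] = x[lane] < 0;
        if (pred[lane]) any_taken = 1; else any_not_taken = 1;
    }
    if (any_taken) {                    /* "then" path, predicate true */
        for (int lane = 0; lane < WARP_SIZE; lane++)
            if (pred[lane]) x[lane] = -x[lane];
        passes++;
    }
    if (any_not_taken) {                /* "else" path, predicate false */
        for (int lane = 0; lane < WARP_SIZE; lane++)
            if (!pred[lane]) x[lane] = 2 * x[lane];
        passes++;
    }
    return passes;
}
```

When the lanes disagree the warp pays for both paths; when all lanes agree only one pass is issued, which is why divergent branches cost throughput.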
17
Maxwell Architecture
2014
16 streaming multiprocessors * 128 cores/SM
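The core counts quoted for the two architectures follow from one multiplication. A minimal sketch, with `total_cores` as a hypothetical helper name:

```c
/* Total CUDA cores = number of streaming multiprocessors
   multiplied by the number of cores in each SM. */
int total_cores(int num_sms, int cores_per_sm) {
    return num_sms * cores_per_sm;
}
```

With the slide figures, `total_cores(16, 32)` gives the 512 cores of Fermi, and `total_cores(16, 128)` gives 2048 for the Maxwell configuration listed here.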
18
Programming CUDA
C code
daxpy(n, 2.0, x, y); // invoke
void daxpy(int n, double a, double *x, double *y) {
    for (int i = 0; i < n; i++)
        y[i] = a*x[i] + y[i];
}
19
Programming CUDA CUDA code
__host__ int nblocks=(n+511)/512; // grid size daxpy<<<nblocks,512>>(n,2.0,x,y); // 512 threads/block __global__ void daxpy(int n, double a, double *x double *y) { int i=blockIdx.x*blockDim.x threadIdx.x; if(i<n) y[i] = a*x[i] + y[i]; }
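The grid-size formula and the `i < n` guard can be checked with a sequential C stand-in for the kernel launch. The names `grid_size`, `daxpy_thread`, and `daxpy_emulated` are hypothetical, and the loops only imitate what the GPU does in parallel.

```c
#define THREADS_PER_BLOCK 512

/* Blocks needed to cover n elements: n / 512 rounded up. */
int grid_size(int n) {
    return (n + THREADS_PER_BLOCK - 1) / THREADS_PER_BLOCK;
}

/* Work done by one "thread": same arithmetic as the CUDA kernel. */
void daxpy_thread(int blockIdx, int threadIdx, int n,
                  double a, const double *x, double *y) {
    int i = blockIdx * THREADS_PER_BLOCK + threadIdx;
    if (i < n)              /* threads past the end do nothing */
        y[i] = a * x[i] + y[i];
}

/* Sequential stand-in for a launch with grid_size(n) blocks of
   512 threads. Returns the number of threads launched, which can
   exceed n when n is not a multiple of 512. */
int daxpy_emulated(int n, double a, const double *x, double *y) {
    int nblocks = grid_size(n);
    for (int b = 0; b < nblocks; b++)
        for (int t = 0; t < THREADS_PER_BLOCK; t++)
            daxpy_thread(b, t, n, a, x, y);
    return nblocks * THREADS_PER_BLOCK;
}
```

For n = 1000 this launches 2 blocks, so 1024 threads, and the guard keeps the last 24 from writing outside the arrays.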
20
n=8192, 512 threads/block
grid, block0, warp0: Y[0]=A*X[0]+Y[0] ...
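The mapping from an element index to its block, warp, and lane can be worked out with integer division. This sketch assumes warps are formed from consecutive thread IDs within a block; `location` and `locate` are hypothetical names.

```c
#define WARP_SIZE 32

typedef struct {
    int block;  /* block within the grid   */
    int warp;   /* warp within the block   */
    int lane;   /* thread within the warp  */
} location;

/* Finds which block, warp, and lane handles element i when each
   thread handles one element and blocks hold threads_per_block threads. */
location locate(int i, int threads_per_block) {
    location loc;
    int thread = i % threads_per_block;  /* thread ID within its block */
    loc.block = i / threads_per_block;
    loc.warp  = thread / WARP_SIZE;
    loc.lane  = thread % WARP_SIZE;
    return loc;
}
```

With n = 8192 and 512 threads per block there are 16 blocks of 16 warps each; element 0 lands in block 0, warp 0, lane 0, and element 8191 in block 15, warp 15, lane 31.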
21
Moving data between host and GPU
int main() {
    double *x, *y, a, *dx, *dy;
    x = (double *)malloc(sizeof(double)*n);
    y = (double *)malloc(sizeof(double)*n);
    // initialize x and y…
    cudaMalloc((void **)&dx, n*sizeof(double)); // pass the address of the device pointer
    cudaMalloc((void **)&dy, n*sizeof(double));
    cudaMemcpy(dx, x, n*sizeof(double), cudaMemcpyHostToDevice);
    …
    daxpy<<<nblocks,512>>>(n, 2.0, dx, dy); // kernel works on the device copies
    cudaDeviceSynchronize(); // wait for the kernel to finish
    cudaMemcpy(y, dy, n*sizeof(double), cudaMemcpyDeviceToHost);
    cudaFree(dx);
    cudaFree(dy);
    free(x);
    free(y);
}