GPU baseline architecture and GPGPU-Sim
Presented by 王建飞, 2017.9.28
A typical GPGPU:
Related terminology:
GPC: graphics processing cluster, a cluster of SMs
SM: streaming multiprocessor (a SIMT core in GPGPU-Sim terms)
SIMT core: single instruction, multiple threads; similar in spirit to SIMD, but each thread keeps its own control-flow state
On-chip memory:
RF: register file, large
L1D cache: private per SM, weakly coherent
Shared memory: programmer-controlled scratchpad
Runtime of a GPGPU (1):
Runtime of a GPGPU (2):
Warp scheduler: LRR (loose round-robin) or GTO (greedy-then-oldest)
SIMT stack: handles branch divergence, reconverging at the immediate post-dominator (sketch below)
Operand collector: buffers source operands and arbitrates accesses to the RF
Lanes / execution units: SP, SFU, MEM
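A minimal host-side sketch of a post-dominator SIMT stack, assuming a 32-thread warp and that each branch's reconvergence PC (RPC) is its immediate post-dominator. The SimtEntry/SimtStack names and the diverge/advance interface are illustrative, not GPGPU-Sim's actual simt_stack class:

#include <cstdint>
#include <stack>

// Illustrative stack entry: next PC to fetch, reconvergence PC, and the
// active mask of the threads executing this path (32-thread warp).
struct SimtEntry {
    uint64_t pc;
    uint64_t rpc;         // reconvergence point (immediate post-dominator)
    uint32_t active_mask;
};

class SimtStack {
public:
    explicit SimtStack(uint64_t entry_pc) {
        m_stack.push({entry_pc, ~0ull, 0xFFFFFFFFu});  // all 32 threads active
    }

    uint64_t pc() const          { return m_stack.top().pc; }
    uint32_t active_mask() const { return m_stack.top().active_mask; }

    // On a divergent branch: the top entry becomes the reconvergence entry,
    // then both paths are pushed; the taken path ends up on top and runs first.
    void diverge(uint64_t taken_pc, uint64_t not_taken_pc, uint64_t rpc,
                 uint32_t taken_mask) {
        SimtEntry parent = m_stack.top();
        m_stack.top() = {rpc, parent.rpc, parent.active_mask};
        m_stack.push({not_taken_pc, rpc, parent.active_mask & ~taken_mask});
        m_stack.push({taken_pc,     rpc, parent.active_mask &  taken_mask});
    }

    // After executing an instruction, pop once the warp reaches the RPC of
    // the top entry; otherwise keep following the current path.
    void advance(uint64_t next_pc) {
        if (next_pc == m_stack.top().rpc) m_stack.pop();
        else                              m_stack.top().pc = next_pc;
    }

private:
    std::stack<SimtEntry> m_stack;
};

On divergence the full parent mask is saved in the reconvergence entry, so when both paths have been popped at the RPC the warp resumes with all of its threads active again.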
A typical code study (1):
Constant within a launch: gridDim.x, blockDim.x
Variable per block / per thread: blockIdx.x, threadIdx.x
With blocksPerGrid = 32 and threadsPerBlock = 256, the launch gives gridDim.x = 32 and blockDim.x = 256.
__global__: called from the host, runs on the device
__device__: called from device code
Source: CUDA by Example
A typical code study (2): kernel sketch below
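The slides cite CUDA by Example; a minimal vector-add kernel in that spirit, using the blocksPerGrid = 32, threadsPerBlock = 256 launch above. The add/dev_a/dev_b/dev_c names follow the book's style; the array size, initialization, and omitted error checking are illustrative:

#include <cstdio>

#define N (32 * 256)   // blocksPerGrid * threadsPerBlock

// __global__: runs on the device, launched from the host.
__global__ void add(const int *a, const int *b, int *c) {
    // Global thread index: one array element per thread.
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    if (tid < N)
        c[tid] = a[tid] + b[tid];
}

int main() {
    int a[N], b[N], c[N];
    int *dev_a, *dev_b, *dev_c;

    for (int i = 0; i < N; ++i) { a[i] = i; b[i] = 2 * i; }

    cudaMalloc((void **)&dev_a, N * sizeof(int));
    cudaMalloc((void **)&dev_b, N * sizeof(int));
    cudaMalloc((void **)&dev_c, N * sizeof(int));
    cudaMemcpy(dev_a, a, N * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(dev_b, b, N * sizeof(int), cudaMemcpyHostToDevice);

    // gridDim.x = 32, blockDim.x = 256, as in the launch configuration above.
    add<<<32, 256>>>(dev_a, dev_b, dev_c);

    cudaMemcpy(c, dev_c, N * sizeof(int), cudaMemcpyDeviceToHost);
    printf("c[100] = %d\n", c[100]);

    cudaFree(dev_a); cudaFree(dev_b); cudaFree(dev_c);
    return 0;
}

Inside the kernel, gridDim.x and blockDim.x are the launch constants (32 and 256), while blockIdx.x and threadIdx.x vary per block and per thread, which is exactly the distinction on the previous slide.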
GPGPU-Sim: a cycle-level GPU performance simulator that focuses on "GPU computing" (general-purpose computation on GPUs).
It replaces the CUDA API and supplies a configurable GPU model.
Simulation model: functional simulation (cuda-sim.h/cc) and timing simulation (shader.h/cc).
gpu-cache.h/cc: the cache model (an illustrative tag-lookup sketch follows).
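To show what a cache model has to track, here is a minimal set-associative tag array with LRU replacement. The CacheModel name and the 64-set, 4-way, 128-byte-line parameters are assumptions for illustration, not the structures in gpu-cache.h/cc:

#include <cstdint>
#include <vector>

// Illustrative set-associative tag array with LRU replacement.
class CacheModel {
public:
    CacheModel(unsigned sets = 64, unsigned ways = 4, unsigned line_bytes = 128)
        : m_sets(sets), m_ways(ways), m_line(line_bytes),
          // ~0 marks an empty (invalid) way; good enough for an illustration.
          m_tags(sets * ways, ~0ull), m_lru(sets * ways, 0), m_time(0) {}

    // Returns true on a hit; on a miss, fills the LRU way of the set.
    bool access(uint64_t addr) {
        ++m_time;
        uint64_t block = addr / m_line;
        unsigned set   = block % m_sets;
        uint64_t tag   = block / m_sets;
        unsigned lru_way = 0;
        for (unsigned w = 0; w < m_ways; ++w) {
            unsigned idx = set * m_ways + w;
            if (m_tags[idx] == tag) { m_lru[idx] = m_time; return true; }  // hit
            if (m_lru[idx] < m_lru[set * m_ways + lru_way]) lru_way = w;
        }
        unsigned victim = set * m_ways + lru_way;  // miss: replace LRU way
        m_tags[victim] = tag;
        m_lru[victim]  = m_time;
        return false;
    }

private:
    unsigned m_sets, m_ways, m_line;
    std::vector<uint64_t> m_tags;
    std::vector<uint64_t> m_lru;
    uint64_t m_time;
};

A real cache model additionally handles allocation and write policies and MSHRs for outstanding misses; this sketch only classifies each access as a hit or a miss.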
Simulation pipeline:
register_set: temporary buffer holding instructions between pipeline stages
m_fu: the functional units: sp, sfu, ldst_unit (simplified sketch below)
References: GPGPU-Sim manual; NVIDIA Fermi and Kepler architecture whitepapers
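A simplified sketch of how an instruction moves from a pipeline register set into a functional unit each cycle. The names register_set, sp, sfu, and ldst_unit come from the slide; the warp_inst struct, unit_kind enum, and execute_cycle loop are illustrative and far simpler than shader.h/cc:

#include <cstddef>
#include <vector>

// Illustrative decoded instruction: which unit it needs and its latency.
enum class unit_kind { SP, SFU, MEM };          // MEM stands in for ldst_unit
struct warp_inst { unit_kind unit; unsigned latency; bool valid; };

// A pipeline register set: a small buffer of instruction slots between stages.
struct register_set {
    std::vector<warp_inst> slots;
    explicit register_set(std::size_t n) : slots(n) {}
    warp_inst *find_ready() {
        for (auto &s : slots) if (s.valid) return &s;
        return nullptr;
    }
};

// Illustrative functional unit: accepts one instruction when idle,
// then counts down its latency.
struct function_unit {
    unit_kind kind;
    unsigned busy;
    bool can_issue(const warp_inst &i) const { return busy == 0 && i.unit == kind; }
    void issue(warp_inst &i) { busy = i.latency; i.valid = false; }
    void cycle() { if (busy) --busy; }
};

// One execute-stage cycle: drain a ready slot from the issue-side register
// set into whichever unit (sp, sfu, ldst) can accept it.
void execute_cycle(register_set &issue_rs, std::vector<function_unit> &m_fu) {
    for (auto &fu : m_fu) fu.cycle();
    if (warp_inst *inst = issue_rs.find_ready())
        for (auto &fu : m_fu)
            if (fu.can_issue(*inst)) { fu.issue(*inst); break; }
}

int main() {
    register_set issue_rs(4);
    issue_rs.slots[0] = {unit_kind::SP, 4, true};
    std::vector<function_unit> m_fu = {
        {unit_kind::SP, 0}, {unit_kind::SFU, 0}, {unit_kind::MEM, 0}};
    for (int c = 0; c < 8; ++c) execute_cycle(issue_rs, m_fu);
    return 0;
}

Each cycle the execute stage pulls an instruction out of the issue-side register set and hands it to an idle unit of the matching type, which is the role m_fu plays in the timing model.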
Instruction Set Architecture:
PTX: Parallel Thread eXecution, a virtual (pseudo-assembly) instruction set
ptxas: compiles PTX to SASS, applying optimizations such as strength reduction, instruction scheduling, and register allocation
SASS: the native GPU ISA
PTXPlus: extends PTX with the features required to provide a one-to-one mapping to SASS
Instruction Set Architecture: corresponding instructions in SASS, PTX, and PTXPlus:

// SASS
S2R R0, SR_CTAid_X;
S2R R2, SR_Tid_X;

// PTX
mov.u32 %r3, %ctaid.x;
mov.u32 %r5, %tid.x;

// PTXPlus
mad.lo.u16 $r0, %ctaid.x, 0x00000200, $r0;
mov.u16 $r4.lo, 0x00000000;
Thanks