1
GPU baseline architecture and GPGPU-Sim
Presented by 王建飞
2
A typical GPGPU:
Related terminology:
GPC: SM cluster
SM: streaming multiprocessor
SIMT core: single instruction, multiple threads (cf. SIMD)
On-chip memory:
RF: register file, large
L1D cache: private, weak coherence
Shared memory: programmer-controlled
3
Runtime of GPGPU 1:
4
Runtime of GPGPU 2:
Scheduler: LRR (loose round-robin), GTO (greedy-then-oldest)
SIMT stack: reconvergence at the post-dominator
Operand collector: accesses the RF
Lane: SP, SFU, MEM
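The two scheduler policies named on this slide, LRR and GTO, can be sketched as two selection functions. This is a minimal illustration, not GPGPU-Sim's scheduler code: the `Warp` struct, its `ready` and `age` fields, and the names `pick_lrr` and `pick_gto` are assumptions made for this example.

```cpp
#include <cassert>
#include <vector>

// Hypothetical per-warp state, invented for this sketch.
struct Warp {
    bool ready;    // true if the warp could issue an instruction this cycle
    unsigned age;  // smaller value = launched earlier = older
};

// Loose round-robin: start at the warp after the last issued one and
// return the first ready warp found. Returns -1 if no warp is ready.
// Pass last_issued = -1 when nothing has issued yet.
int pick_lrr(const std::vector<Warp>& warps, int last_issued) {
    const int n = static_cast<int>(warps.size());
    assert(last_issued >= -1 && last_issued < n);
    for (int step = 1; step <= n; ++step) {
        const int w = (last_issued + step) % n;
        if (warps[w].ready) return w;
    }
    return -1;
}

// Greedy-then-oldest: keep issuing from the last issued warp while it is
// ready; otherwise fall back to the oldest ready warp. Returns -1 if no
// warp is ready.
int pick_gto(const std::vector<Warp>& warps, int last_issued) {
    const int n = static_cast<int>(warps.size());
    assert(last_issued >= -1 && last_issued < n);
    if (last_issued >= 0 && warps[last_issued].ready) return last_issued;
    int best = -1;
    for (int w = 0; w < n; ++w) {
        if (!warps[w].ready) continue;
        if (best < 0 || warps[w].age < warps[best].age) best = w;
    }
    return best;
}
```

With the same set of ready warps, LRR rotates through them in turn, while GTO stays on one warp until it stalls and then picks the oldest ready warp.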
5
A typical code study 1:
Constant: gridDim.x, blockDim.x
Variable: blockIdx.x, threadIdx.x
blocksPerGrid = 32
threadsPerBlock = 256
So: gridDim.x = 32, blockDim.x = 256
__global__: called from the host
__device__: called from the device
Source: CUDA by Example
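How the constant and variable built-ins combine into a per-thread index can be emulated on the host in plain C++. This is a sketch under stated assumptions, not the code shown on the slide: the `Dim` struct and the functions `global_thread_id` and `emulate_vector_add` are hypothetical stand-ins, and the nested loops play the role of the hardware running every (block, thread) pair.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical host-side stand-in for CUDA's gridDim / blockDim (x only).
struct Dim {
    unsigned x;
};

// Global index of one thread: threadIdx.x + blockIdx.x * blockDim.x.
unsigned global_thread_id(unsigned block_idx, unsigned thread_idx, Dim block_dim) {
    assert(thread_idx < block_dim.x);
    return thread_idx + block_idx * block_dim.x;
}

// Emulates a grid-stride vector add. Each emulated thread starts at its
// global index and advances by the total thread count
// (gridDim.x * blockDim.x) until it runs past the end of the data.
std::vector<float> emulate_vector_add(const std::vector<float>& a,
                                      const std::vector<float>& b,
                                      Dim grid_dim, Dim block_dim) {
    assert(a.size() == b.size());
    std::vector<float> c(a.size(), 0.0f);
    const std::size_t stride = static_cast<std::size_t>(grid_dim.x) * block_dim.x;
    for (unsigned blk = 0; blk < grid_dim.x; ++blk) {        // blockIdx.x
        for (unsigned thr = 0; thr < block_dim.x; ++thr) {   // threadIdx.x
            for (std::size_t i = global_thread_id(blk, thr, block_dim);
                 i < c.size(); i += stride) {
                c[i] = a[i] + b[i];
            }
        }
    }
    return c;
}
```

With gridDim.x = 32 and blockDim.x = 256 there are 8192 emulated threads, so any input longer than 8192 elements is covered by the stride loop rather than by launching more blocks.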
6
A typical code study 2:
7
GPGPU-Sim: a cycle-level GPU performance simulator that focuses on "GPU computing" (general-purpose computation on GPUs)
Replaces the CUDA API and supplies a configurable GPU model
Simulation model: functional simulation (cuda-sim.h/cc) and timing simulation (shader.h/cc)
gpu-cache.h/cc: cache model
8
Simulation line:
register_set: temporary instruction buffer
m_fu: sp, sfu, ldst_unit
Reference: GPGPU-Sim manual; NVIDIA Fermi/Kepler architecture whitepaper
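The idea of a temporary instruction buffer feeding the sp, sfu and ldst_unit function units can be illustrated with a small sketch. Everything here is an assumption made for illustration: `Inst`, `StageBuffer`, `Unit` and `unit_for` are hypothetical names, not GPGPU-Sim's classes, and the opcode-to-unit mapping is a deliberately rough one based on PTX opcode prefixes.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical in-flight instruction; not GPGPU-Sim's instruction type.
struct Inst {
    int warp_id;
    std::string opcode;
};

// Simplified pipeline-register buffer: a fixed number of slots, each
// either empty or holding one instruction between two pipeline stages.
class StageBuffer {
public:
    explicit StageBuffer(std::size_t slots) : slots_(slots, Slot{false, Inst{-1, ""}}) {
        assert(slots > 0);
    }
    bool has_free() const {
        for (const Slot& s : slots_) if (!s.valid) return true;
        return false;
    }
    bool has_ready() const {
        for (const Slot& s : slots_) if (s.valid) return true;
        return false;
    }
    // Stores inst in the first empty slot; returns false if the buffer is full.
    bool push(const Inst& inst) {
        for (Slot& s : slots_) {
            if (!s.valid) { s.valid = true; s.inst = inst; return true; }
        }
        return false;
    }
    // Moves the first occupied slot into out; returns false if the buffer is empty.
    bool pop(Inst& out) {
        for (Slot& s : slots_) {
            if (s.valid) { out = s.inst; s.valid = false; return true; }
        }
        return false;
    }
private:
    struct Slot { bool valid; Inst inst; };
    std::vector<Slot> slots_;
};

enum class Unit { SP, SFU, LDST };

// Illustrative routing only: loads/stores go to the load/store unit, a few
// transcendental opcodes go to the SFU, everything else goes to the SP lanes.
Unit unit_for(const std::string& opcode) {
    if (opcode.rfind("ld.", 0) == 0 || opcode.rfind("st.", 0) == 0) return Unit::LDST;
    if (opcode.rfind("sin.", 0) == 0 || opcode.rfind("cos.", 0) == 0 ||
        opcode.rfind("rsqrt.", 0) == 0) return Unit::SFU;
    return Unit::SP;
}
```

A full buffer rejects further pushes, which is how a later stage that cannot accept instructions stalls the stage feeding it.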
9
Instruction Set Architecture:
PTX: Parallel Thread eXecution, a pseudo-assembly instruction set
ptxas: compiles PTX to SASS (strength reduction, instruction scheduling, register allocation)
SASS: a native GPU ISA
PTXPlus: extends PTX with the features required to provide a one-to-one mapping to SASS
10
Instruction Set Architecture:
11
Instruction Set Architecture:
//SASS
S2R R0, SR_CTAid_X;
S2R R2, SR_Tid_X;
//PTX
mov.u32 %r3, %ctaid.x;
mov.u32 %r5, %tid.x;
//PTXPlus
mad.lo.u16 $r0, %ctaid.x, 0x , $r0;
mov.u16 $r4.lo, 0x ;
12
Thanks