D2MA: Accelerating Coarse-Grained Data Transfer for GPUs D. Anoushe Jamshidi, Mehrzad Samadi, and Scott Mahlke University of Michigan PACT-23 August 27th, 2014
Achieving Peak GPU Performance: Theory and Practice
Even for matrix multiplication, it is not easy to fully utilize GPU capabilities!
[Chart: achieved performance of the CUDA SDK and CUBLAS matrix multiply implementations vs. theoretical peak]
A Quick Overview of GPUs
[Diagram: a GPU chip contains SMs connected through an interconnect to the L2 cache and DRAM. Each SM has fetch/decode/issue stages, a register file, SPs, LD/ST units with writeback, shared memory, and an L1D cache; data and results flow between the SPs and the memory hierarchy.]
A Quick Overview of GPUs
[Diagram: the same SM organization, highlighting that a load missing on-chip must cross the interconnect and L2 to DRAM, taking ~100s of cycles.]
How do GPUs Achieve Great Performance?
- Effectively use available memory bandwidth
- Exploit data reuse when possible
- Issue regular, well-coalesced memory accesses
[Diagram: stores from four SPs; coalesced accesses combine into a single cache-line transaction, while uncoalesced accesses span multiple cache lines.]
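As a minimal CUDA sketch of the access patterns compared above (hypothetical kernels, not from the talk): when consecutive threads of a warp touch consecutive addresses, the hardware merges the warp's accesses into a few 128-byte cache-line transactions, while a strided pattern can touch a separate line per thread.

```cuda
// Hypothetical illustration of coalescing (not code from the talk).
// Coalesced: thread i of a warp accesses element i, so consecutive
// addresses fall within the same 128B cache line(s).
__global__ void coalesced_copy(float *dst, const float *src, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        dst[i] = src[i];        // one warp covers 32 * 4B = one 128B line
}

// Strided: thread i accesses element i * stride, so each thread can
// touch a different cache line and one warp may generate up to 32
// separate memory transactions.
__global__ void strided_copy(float *dst, const float *src, int n, int stride) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i * stride < n)
        dst[i * stride] = src[i * stride];
}
```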
Buffering to Optimize Bandwidth
Buffer data in fast shared memory: DRAM accesses take ~100s of cycles, while shared memory accesses take <10 cycles.
[Diagram: tiles Tile[0], Tile[1], Tile[2] are staged from DRAM into the SM's shared memory.]
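The buffering idea above is the standard CUDA tiling idiom: each thread block stages a tile of global memory into shared memory, synchronizes, then computes out of the fast on-chip copy. A minimal sketch (hypothetical kernel; TILE is an assumed compile-time constant, and the grid is assumed to cover a width that is a multiple of TILE):

```cuda
#define TILE 32

// Hypothetical tiled-buffering kernel illustrating the pattern:
// stage a tile in shared memory (<10-cycle access) instead of
// re-reading DRAM (~100s of cycles) for every reuse.
__global__ void buffered_kernel(float *out, const float *in, int width) {
    __shared__ float tile[TILE][TILE];

    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;

    // One coalesced load per thread fills the tile.
    tile[threadIdx.y][threadIdx.x] = in[y * width + x];
    __syncthreads();            // tile is complete before anyone reads it

    // All further reads hit shared memory, exploiting reuse.
    out[y * width + x] = tile[threadIdx.y][threadIdx.x] * 2.0f;
}
```

Note that the staged data still travels DRAM → caches → registers → shared memory, which is exactly the roundabout path the next slide identifies as a problem.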
Buffering Problem 1: Wasted Storage
The tile takes a roundabout path to shared memory: data travels from DRAM through the caches and register file before finally being stored, leaving duplicated copies of each tile in shared memory, the caches, and the register file.
[Diagram: Tile[0], Tile[1], Tile[2] duplicated across DRAM, L2, L1D, the register file, and shared memory.]
Buffering Problem 2: Code Expansion

    __global__ void CUDAkernel2DCT(float *dst, float *src, int ImgStride) {
        __shared__ float tile[TILE_HEIGHT * STRIDE];
        // Preliminary address calculations ...
        float *tile_ptr = tile + <offset>;

        // Buffer into shared memory
        #pragma unroll
        for (unsigned int i = 0; i < TILE_SIZE; i++)
            tile_ptr[i * STRIDE] = src[i * ImgStride];
        __syncthreads();

        // Processing data ...
    }

These 4 lines of buffering CUDA compile to 59 PTX instructions and 73 SASS instructions.
[Listings: the generated PTX (per-element address arithmetic feeding ld.global.f32/st.shared.f32 pairs, ending in bar.sync) and SASS (IADD/SHL/IMAD/ISCADD address arithmetic feeding LD.E/STS pairs, ending in BAR.SYNC)]
Each tile transfer requires many arithmetic operations to calculate addresses; address generation consumes ~50% of tile transfer cycles.
Objective
A tool to help achieve better memory performance, inspired by Direct Memory Access (DMA): on CPUs, a DMA engine copies data to and from DRAM on the CPU's behalf while the CPU keeps executing.
[Diagram: a CPU offloads a DRAM transfer to a DMA engine.]
Objective
Directly porting DMA to a GPU SM is problematic: an SM is not interruptible to service transfer completions, and managing transfers for many thread blocks requires heavy bookkeeping.
[Diagram: a DMA engine attached to a GPU SM, annotated with these two problems.]
D2MA: The Big Picture
[Diagram: a D2MA engine inside the GPU transfers data from DRAM directly into the SM's on-chip memory.]
D2MA: Data-Parallel Direct Memory Access
- Take advantage of regular memory accesses and the unified L1D/shared memory space
- Decouple tile transfers from SM resources
- Simplify address generation
- Improve memory pipelining
- Provide a direct path to shared memory
[Diagram: the D2MA engine sits alongside the SM pipeline, moving Tile[0] into shared memory via the L1D MSHRs.]
D2MA Programming Model

Original Code (CUDA: 4 lines of buffering -> PTX: 59 instructions):

    __global__ void CUDAkernel2DCT(float *dst, float *src, int ImgStride) {
        __shared__ float tile[T_HEIGHT * T_STRIDE];
        int OffsThreadInRow = threadIdx.y * T_SIZE + threadIdx.x;
        int OffsThreadInCol = threadIdx.z * T_SIZE;
        src += FMUL(blockIdx.y * T_HEIGHT + OffsThreadInCol, ImgStride) +
               blockIdx.x * T_WIDTH + OffsThreadInRow;
        dst += FMUL(blockIdx.y * T_HEIGHT + OffsThreadInCol, ImgStride) +
               blockIdx.x * T_WIDTH + OffsThreadInRow;
        float *tile_ptr = tile + OffsThreadInCol * T_STRIDE + OffsThreadInRow;

        // Buffer into shared memory
        #pragma unroll
        for (unsigned int i = 0; i < T_SIZE; i++)
            tile_ptr[i * T_STRIDE] = src[i * ImgStride];
        __syncthreads();

        // Process rows, then columns
        CUDAsubroutineInplaceDCTvector(tile + (OffsThreadInCol + threadIdx.x) * T_STRIDE +
                                       OffsThreadInRow - threadIdx.x, 1);
        CUDAsubroutineInplaceDCTvector(tile_ptr, T_STRIDE);

        for (unsigned int i = 0; i < T_SIZE; i++)
            dst[i * ImgStride] = tile_ptr[i * T_STRIDE];
    }

D2MA-Optimized Code (kernel D2MAkernel2DCT; CUDA: 4 lines of buffering -> PTX: 12 instructions). The kernel is identical except that src is offset to the tile base only:

        src += FMUL(blockIdx.y * T_HEIGHT, ImgStride) + blockIdx.x * T_WIDTH;

and the buffering loop plus __syncthreads() are replaced by:

        d2ma_configure_matrix(tile, src, T_HEIGHT, T_WIDTH, ImgStride);
        d2ma_set_datatype_float();
        d2ma_enable_shmem_blank_col();
        d2ma_ignite_buffer(0);
D2MA Overview
The D2MA engine is added to each SM. A controller holds configuration state for four buffers (global address, shared address, number of elements, element size, stride), AGEN logic generates addresses, and a consistency checker guards loads; the engine connects to the LD/ST path, the L1D MSHRs, and shared memory.
[Diagram: the D2MA engine integrated into the SM pipeline alongside the register file, SPs, shared memory, and L1D cache.]
D2MA Operation: Configuration
The SM decodes the configuration instructions and the controller fills in a buffer entry. Example for Buf. 0: global address 0x1020, shared address 0x20, 64 elements, 4-byte element size, stride 1.

    d2ma_configure_matrix(tile, src, T_HEIGHT, T_WIDTH, ImgStride);
    d2ma_set_datatype_float();
    d2ma_enable_shmem_blank_col();
    d2ma_ignite_buffer(0);
D2MA Operation: Address Generation
The ignite instruction (d2ma_ignite_buffer(0)) starts the transfer: the controller hands Buf. 0's configuration to the AGEN logic, whose global-memory AGEN produces source addresses (starting at 0x1020) while the shared-memory AGEN produces matching destination addresses (starting at 0x20), without consuming SM registers or ALU cycles.
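A host-side software model of the request stream the AGEN logic would emit for a configured buffer (a sketch under assumptions: 128-byte memory transactions, row-major tiles; the struct and function names are hypothetical, not the talk's hardware interface):

```cuda
#include <stdio.h>

// Hypothetical software model of D2MA address generation.  For each
// row of the configured tile, it walks the row in 128B cache-line-
// sized requests, pairing each global source address with the
// shared-memory destination it should be written to.
struct D2MABuf {
    unsigned long glob_addr;   // e.g. 0x1020
    unsigned long shr_addr;    // e.g. 0x20
    int  n_elems;              // elements per row, e.g. 64
    int  elem_size;            // bytes per element, e.g. 4
    int  n_rows;               // tile height
    long row_stride;           // global row pitch, in elements
    long shr_stride;           // shared-memory row pitch, in elements
};

static void agen_walk(const struct D2MABuf *b) {
    const long LINE = 128;     // assumed transaction size
    long row_bytes = (long)b->n_elems * b->elem_size;
    for (int r = 0; r < b->n_rows; r++) {
        unsigned long g = b->glob_addr + (unsigned long)(r * b->row_stride * b->elem_size);
        unsigned long s = b->shr_addr  + (unsigned long)(r * b->shr_stride * b->elem_size);
        for (long off = 0; off < row_bytes; off += LINE)
            printf("req: glob 0x%lX -> shr 0x%lX\n", g + off, s + off);
    }
}
```

With Buf. 0's configuration from the slide (global 0x1020, shared 0x20, 64 four-byte elements per row), the first row yields the two request pairs that appear on the memory-transfer slides: 0x1020 -> 0x20 and 0x10A0 -> 0xA0.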
D2MA Operation: Memory Transfer
Generated requests are issued to the memory system and tracked in the L1D MSHRs, which are extended with a shared-memory destination field: e.g. global 0x1020 -> shared 0x20, and global 0x10A0 -> shared 0xA0 for the next 128-byte line.
D2MA Operation: Memory Transfer
As each fill returns, its MSHR entry supplies the recorded shared-memory address and the data is written directly into shared memory, never passing through the register file.
D2MA Operation: Enforcing Synchronization
Without D2MA: a thread block starts transfer 1, and a __syncthreads() barrier must be satisfied before any thread may load from the buffer; the programmer must guarantee consistency.
With D2MA: no __syncthreads() is needed. A thread block ignites transfer 1 and continues executing independent code. If a load targets a buffer whose transfer is still in flight, the consistency checker stalls the warp (no warp ready to schedule) and re-executes the load once the transfer ends. Synchronization is handled transparently by hardware.
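Sketched as code, the two synchronization regimes look as follows. The d2ma_* calls are the talk's intrinsics (this will not compile with stock nvcc), and the surrounding kernels are hypothetical:

```cuda
// Without D2MA: the programmer stages the tile and must insert an
// explicit barrier before any thread may read it.
__global__ void manual_buffering(float *out, float *in) {
    __shared__ float tile[64];
    tile[threadIdx.x] = in[blockIdx.x * 64 + threadIdx.x];
    __syncthreads();                 // programmer-enforced consistency
    out[blockIdx.x * 64 + threadIdx.x] = tile[63 - threadIdx.x];
}

// With D2MA: the transfer is ignited and no barrier is written.  The
// consistency checker stalls and re-executes the first load that
// targets a still-in-flight buffer, so independent code can run in
// the meantime.
__global__ void d2ma_buffering(float *out, float *in) {
    __shared__ float tile[64];
    d2ma_configure_matrix(tile, in, 1, 64, 64);   // height 1, width 64
    d2ma_set_datatype_float();
    d2ma_ignite_buffer(0);
    // ... independent work may execute here ...
    out[blockIdx.x * 64 + threadIdx.x] = tile[63 - threadIdx.x];  // stalls until TX done
}
```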
Experimental Evaluation
Simulator: GPGPU-Sim v3.2.1
Benchmarks: NVIDIA CUDA SDK and Rodinia (those that perform shared-memory buffering)

Number of SMs: 15
Thread blocks/SM: 8
Shared memory/SM: 48 KB
D2MA engines/SM: 1
Buffers/controller: 4
Warp scheduling policy: greedy-then-oldest
L1 cache (size/assoc/block size): 16 KB / 4-way / 128 B
L2 cache (size/assoc/block size): 768 KB / 16-way / 128 B
L2/DRAM latency: 240/440 cycles
Results: Performance Geomean speedup: 1.36x
Results: Cycle Breakdown
Compared to the baseline, D2MA improves address generation cycles by 98% and reduces memory transfer cycles by 66%; average transfer cycles drop ~5x.
Results: Overheads
A model of the D2MA engine was synthesized using Synopsys tools and compared against an NVIDIA GTX 480 (die area: 529 mm², TDP: 250 W). With one D2MA engine per SM (15 SMs), the area overhead is 0.016% and the power overhead is 0.022%.
Conclusion
Programmers must optimize memory traffic to achieve good performance on GPUs: shared-memory buffering improves bandwidth utilization, but buffering itself still has overheads.
D2MA decouples tiled data buffering from existing SM resources:
- Reduces the cost of address generation by 98%
- Improves memory transfer times by 66%
- Improves performance by 1.36x (geomean)
- Reduces dynamic instructions executed by 7%
- Enforces synchronization transparently
- Low area and power overheads (<0.03%)
Thank You! Questions? Image credits: http://www.opengraphicdesign.com/web/ajax-loading-graphics/ http://www.barriersandbollards.com/html-pages/mb50-1.png
Special Addressing Modes
Blank Column Mode: inserts an unused column into each shared-memory row, which helps alleviate shared memory bank conflicts.
Halo Addressing Mode: fetches the halo (boundary) region around a tile, making the use of halos simpler for the programmer.
Both modes deliver a complete tile of data into shared memory.
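The effect of blank column mode can be sketched in plain CUDA (hypothetical kernel; assumes a square matrix whose side is a multiple of TILE). Shared memory has 32 four-byte banks, so in a 32-wide tile every element of a column maps to the same bank; padding each row by one unused column shifts successive rows to different banks. D2MA's blank column mode inserts this padding during the transfer instead of in programmer code:

```cuda
#define TILE 32

// Classic shared-memory transpose.  Without the +1 pad, a warp
// reading a column of the tile would hit one bank 32 times; the
// blank column makes column reads conflict-free.
__global__ void transpose_tile(float *out, const float *in, int width) {
    __shared__ float tile[TILE][TILE + 1];   // +1 = the blank column

    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;
    tile[threadIdx.y][threadIdx.x] = in[y * width + x];
    __syncthreads();

    int tx = blockIdx.y * TILE + threadIdx.x;
    int ty = blockIdx.x * TILE + threadIdx.y;
    out[ty * width + tx] = tile[threadIdx.x][threadIdx.y];  // conflict-free column read
}
```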
Results: Dynamic Instruction Count Reduction
[Chart: per-benchmark reduction in dynamic instructions executed; 7% on average]