D2MA: Accelerating Coarse-Grained Data Transfer for GPUs D. Anoushe Jamshidi, Mehrzad Samadi, and Scott Mahlke University of Michigan PACT-23 August 27th, 2014
Achieving Peak GPU Performance: Theory and Practice
Even for matrix multiplication, it is not easy to fully utilize GPU capabilities!
[Chart: achieved performance of the CUDA SDK and CUBLAS matrix multiply implementations vs. theoretical peak]
A Quick Overview of GPUs
[Diagram: a GPU chip contains SMs connected through an interconnect to the L2 cache and DRAM. Each SM has fetch/decode/issue stages, a register file, SPs, LD/ST units with writeback, shared memory, and an L1D cache; data and results flow between the SPs and the memory hierarchy.]
A Quick Overview of GPUs
[Diagram: the same SM organization, highlighting that a load missing on-chip must cross the interconnect and L2 to DRAM, taking ~100s of cycles.]
How do GPUs Achieve Great Performance?
- Effectively use available memory bandwidth
- Exploit data reuse when possible
- Issue regular, well-coalesced memory accesses
[Diagram: stores from four SPs; coalesced accesses combine into a single cache-line transaction, while uncoalesced accesses span multiple cache lines.]
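As a minimal CUDA sketch of the access patterns compared above (hypothetical kernels, not from the talk): when consecutive threads of a warp touch consecutive addresses, the hardware merges the warp's accesses into a few 128-byte cache-line transactions, while a strided pattern can touch a separate line per thread.

```cuda
// Hypothetical illustration of coalescing (not code from the talk).
// Coalesced: thread i of a warp accesses element i, so consecutive
// addresses fall within the same 128B cache line(s).
__global__ void coalesced_copy(float *dst, const float *src, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        dst[i] = src[i];        // one warp covers 32 * 4B = one 128B line
}

// Strided: thread i accesses element i * stride, so each thread can
// touch a different cache line and one warp may generate up to 32
// separate memory transactions.
__global__ void strided_copy(float *dst, const float *src, int n, int stride) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i * stride < n)
        dst[i * stride] = src[i * stride];
}
```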
Buffering to Optimize Bandwidth
Buffer data in fast shared memory: DRAM accesses take ~100s of cycles, while shared memory accesses take <10 cycles.
[Diagram: tiles Tile[0], Tile[1], Tile[2] are staged from DRAM into the SM's shared memory.]
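The buffering idea above is the standard CUDA tiling idiom: each thread block stages a tile of global memory into shared memory, synchronizes, then computes out of the fast on-chip copy. A minimal sketch (hypothetical kernel; TILE is an assumed compile-time constant, and the grid is assumed to cover a width that is a multiple of TILE):

```cuda
#define TILE 32

// Hypothetical tiled-buffering kernel illustrating the pattern:
// stage a tile in shared memory (<10-cycle access) instead of
// re-reading DRAM (~100s of cycles) for every reuse.
__global__ void buffered_kernel(float *out, const float *in, int width) {
    __shared__ float tile[TILE][TILE];

    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;

    // One coalesced load per thread fills the tile.
    tile[threadIdx.y][threadIdx.x] = in[y * width + x];
    __syncthreads();            // tile is complete before anyone reads it

    // All further reads hit shared memory, exploiting reuse.
    out[y * width + x] = tile[threadIdx.y][threadIdx.x] * 2.0f;
}
```

Note that the staged data still travels DRAM → caches → registers → shared memory, which is exactly the roundabout path the next slide identifies as a problem.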
Buffering Problem 1: Wasted Storage
The tile takes a roundabout path to shared memory: data travels from DRAM through the caches and register file before finally being stored, leaving duplicated copies of each tile in shared memory, the caches, and the register file.
[Diagram: Tile[0], Tile[1], Tile[2] duplicated across DRAM, L2, L1D, the register file, and shared memory.]
Buffering Problem 2: Code Expansion

    __global__ void CUDAkernel2DCT(float *dst, float *src, int ImgStride) {
        __shared__ float tile[TILE_HEIGHT * STRIDE];
        // Preliminary address calculations ...
        float *tile_ptr = tile + <offset>;

        // Buffer into shared memory
        #pragma unroll
        for (unsigned int i = 0; i < TILE_SIZE; i++)
            tile_ptr[i * STRIDE] = src[i * ImgStride];
        __syncthreads();

        // Processing data ...
    }

These 4 lines of buffering CUDA compile to 59 PTX instructions and 73 SASS instructions.
[Listings: the generated PTX (per-element address arithmetic feeding ld.global.f32/st.shared.f32 pairs, ending in bar.sync) and SASS (IADD/SHL/IMAD/ISCADD address arithmetic feeding LD.E/STS pairs, ending in BAR.SYNC)]
Each tile transfer requires many arithmetic operations to calculate addresses; address generation consumes ~50% of tile transfer cycles.
Objective
A tool to help achieve better memory performance, inspired by Direct Memory Access (DMA): on CPUs, a DMA engine copies data to and from DRAM on the CPU's behalf while the CPU keeps executing.
[Diagram: a CPU offloads a DRAM transfer to a DMA engine.]
Objective
Directly porting DMA to a GPU SM is problematic: an SM is not interruptible to service transfer completions, and managing transfers for many thread blocks requires heavy bookkeeping.
[Diagram: a DMA engine attached to a GPU SM, annotated with these two problems.]
D2MA: The Big Picture
[Diagram: a D2MA engine inside the GPU transfers data from DRAM directly into the SM's on-chip memory.]
D2MA: Data-Parallel Direct Memory Access
- Take advantage of regular memory accesses and the unified L1D/shared memory space
- Decouple tile transfers from SM resources
- Simplify address generation
- Improve memory pipelining
- Provide a direct path to shared memory
[Diagram: the D2MA engine sits alongside the SM pipeline, moving Tile[0] into shared memory via the L1D MSHRs.]
D2MA Programming Model

Original Code (CUDA: 4 lines of buffering -> PTX: 59 instructions):

    __global__ void CUDAkernel2DCT(float *dst, float *src, int ImgStride) {
        __shared__ float tile[T_HEIGHT * T_STRIDE];
        int OffsThreadInRow = threadIdx.y * T_SIZE + threadIdx.x;
        int OffsThreadInCol = threadIdx.z * T_SIZE;
        src += FMUL(blockIdx.y * T_HEIGHT + OffsThreadInCol, ImgStride) +
               blockIdx.x * T_WIDTH + OffsThreadInRow;
        dst += FMUL(blockIdx.y * T_HEIGHT + OffsThreadInCol, ImgStride) +
               blockIdx.x * T_WIDTH + OffsThreadInRow;
        float *tile_ptr = tile + OffsThreadInCol * T_STRIDE + OffsThreadInRow;

        // Buffer into shared memory
        #pragma unroll
        for (unsigned int i = 0; i < T_SIZE; i++)
            tile_ptr[i * T_STRIDE] = src[i * ImgStride];
        __syncthreads();

        // Process rows, then columns
        CUDAsubroutineInplaceDCTvector(tile + (OffsThreadInCol + threadIdx.x) * T_STRIDE +
                                       OffsThreadInRow - threadIdx.x, 1);
        CUDAsubroutineInplaceDCTvector(tile_ptr, T_STRIDE);

        for (unsigned int i = 0; i < T_SIZE; i++)
            dst[i * ImgStride] = tile_ptr[i * T_STRIDE];
    }

D2MA-Optimized Code (kernel D2MAkernel2DCT; CUDA: 4 lines of buffering -> PTX: 12 instructions). The kernel is identical except that src is offset to the tile base only:

        src += FMUL(blockIdx.y * T_HEIGHT, ImgStride) + blockIdx.x * T_WIDTH;

and the buffering loop plus __syncthreads() are replaced by:

        d2ma_configure_matrix(tile, src, T_HEIGHT, T_WIDTH, ImgStride);
        d2ma_set_datatype_float();
        d2ma_enable_shmem_blank_col();
        d2ma_ignite_buffer(0);
D2MA Overview
The D2MA engine is added to each SM. A controller holds configuration state for four buffers (global address, shared address, number of elements, element size, stride), AGEN logic generates addresses, and a consistency checker guards loads; the engine connects to the LD/ST path, the L1D MSHRs, and shared memory.
[Diagram: the D2MA engine integrated into the SM pipeline alongside the register file, SPs, shared memory, and L1D cache.]
D2MA Operation: Configuration
The SM decodes the configuration instructions and the controller fills in a buffer entry. Example for Buf. 0: global address 0x1020, shared address 0x20, 64 elements, 4-byte element size, stride 1.

    d2ma_configure_matrix(tile, src, T_HEIGHT, T_WIDTH, ImgStride);
    d2ma_set_datatype_float();
    d2ma_enable_shmem_blank_col();
    d2ma_ignite_buffer(0);
D2MA Operation: Address Generation
The ignite instruction (d2ma_ignite_buffer(0)) starts the transfer: the controller hands Buf. 0's configuration to the AGEN logic, whose global-memory AGEN produces source addresses (starting at 0x1020) while the shared-memory AGEN produces matching destination addresses (starting at 0x20), without consuming SM registers or ALU cycles.
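A host-side software model of the request stream the AGEN logic would emit for a configured buffer (a sketch under assumptions: 128-byte memory transactions, row-major tiles; the struct and function names are hypothetical, not the talk's hardware interface):

```cuda
#include <stdio.h>

// Hypothetical software model of D2MA address generation.  For each
// row of the configured tile, it walks the row in 128B cache-line-
// sized requests, pairing each global source address with the
// shared-memory destination it should be written to.
struct D2MABuf {
    unsigned long glob_addr;   // e.g. 0x1020
    unsigned long shr_addr;    // e.g. 0x20
    int  n_elems;              // elements per row, e.g. 64
    int  elem_size;            // bytes per element, e.g. 4
    int  n_rows;               // tile height
    long row_stride;           // global row pitch, in elements
    long shr_stride;           // shared-memory row pitch, in elements
};

static void agen_walk(const struct D2MABuf *b) {
    const long LINE = 128;     // assumed transaction size
    long row_bytes = (long)b->n_elems * b->elem_size;
    for (int r = 0; r < b->n_rows; r++) {
        unsigned long g = b->glob_addr + (unsigned long)(r * b->row_stride * b->elem_size);
        unsigned long s = b->shr_addr  + (unsigned long)(r * b->shr_stride * b->elem_size);
        for (long off = 0; off < row_bytes; off += LINE)
            printf("req: glob 0x%lX -> shr 0x%lX\n", g + off, s + off);
    }
}
```

With Buf. 0's configuration from the slide (global 0x1020, shared 0x20, 64 four-byte elements per row), the first row yields the two request pairs that appear on the memory-transfer slides: 0x1020 -> 0x20 and 0x10A0 -> 0xA0.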
D2MA Operation: Memory Transfer
Generated requests are issued to the memory system and tracked in the L1D MSHRs, which are extended with a shared-memory destination field: e.g. global 0x1020 -> shared 0x20, and global 0x10A0 -> shared 0xA0 for the next 128-byte line.
D2MA Operation: Memory Transfer
As each fill returns, its MSHR entry supplies the recorded shared-memory address and the data is written directly into shared memory, never passing through the register file.
D2MA Operation: Enforcing Synchronization
Without D2MA: a thread block starts transfer 1, and a __syncthreads() barrier must be satisfied before any thread may load from the buffer; the programmer must guarantee consistency.
With D2MA: no __syncthreads() is needed. A thread block ignites transfer 1 and continues executing independent code. If a load targets a buffer whose transfer is still in flight, the consistency checker stalls the warp (no warp ready to schedule) and re-executes the load once the transfer ends. Synchronization is handled transparently by hardware.
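Sketched as code, the two synchronization regimes look as follows. The d2ma_* calls are the talk's intrinsics (this will not compile with stock nvcc), and the surrounding kernels are hypothetical:

```cuda
// Without D2MA: the programmer stages the tile and must insert an
// explicit barrier before any thread may read it.
__global__ void manual_buffering(float *out, float *in) {
    __shared__ float tile[64];
    tile[threadIdx.x] = in[blockIdx.x * 64 + threadIdx.x];
    __syncthreads();                 // programmer-enforced consistency
    out[blockIdx.x * 64 + threadIdx.x] = tile[63 - threadIdx.x];
}

// With D2MA: the transfer is ignited and no barrier is written.  The
// consistency checker stalls and re-executes the first load that
// targets a still-in-flight buffer, so independent code can run in
// the meantime.
__global__ void d2ma_buffering(float *out, float *in) {
    __shared__ float tile[64];
    d2ma_configure_matrix(tile, in, 1, 64, 64);   // height 1, width 64
    d2ma_set_datatype_float();
    d2ma_ignite_buffer(0);
    // ... independent work may execute here ...
    out[blockIdx.x * 64 + threadIdx.x] = tile[63 - threadIdx.x];  // stalls until TX done
}
```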
Experimental Evaluation
Simulator: GPGPU-Sim v3.2.1
Benchmarks: NVIDIA CUDA SDK and Rodinia (those that perform shared-memory buffering)

Number of SMs: 15
Thread blocks/SM: 8
Shared memory/SM: 48 KB
D2MA engines/SM: 1
Buffers/controller: 4
Warp scheduling policy: greedy-then-oldest
L1 cache (size/assoc/block size): 16 KB / 4-way / 128 B
L2 cache (size/assoc/block size): 768 KB / 16-way / 128 B
L2/DRAM latency: 240/440 cycles
Results: Performance Geomean speedup: 1.36x
Results: Cycle Breakdown
Compared to the baseline, D2MA improves address generation cycles by 98% and reduces memory transfer cycles by 66%; average transfer cycles drop ~5x.
Results: Overheads
A model of the D2MA engine was synthesized using Synopsys tools and compared against an NVIDIA GTX 480 (die area: 529 mm², TDP: 250 W). With one D2MA engine per SM (15 SMs), the area overhead is 0.016% and the power overhead is 0.022%.
Conclusion
Programmers must optimize memory traffic to achieve good performance on GPUs: shared-memory buffering improves bandwidth utilization, but buffering itself still has overheads.
D2MA decouples tiled data buffering from existing SM resources:
- Reduces the cost of address generation by 98%
- Improves memory transfer times by 66%
- Improves performance by 1.36x (geomean)
- Reduces dynamic instructions executed by 7%
- Enforces synchronization transparently
- Low area and power overheads (<0.03%)
Thank You! Questions? Image credits: http://www.opengraphicdesign.com/web/ajax-loading-graphics/ http://www.barriersandbollards.com/html-pages/mb50-1.png
Special Addressing Modes
Blank Column Mode: inserts an unused column into each shared-memory row, which helps alleviate shared memory bank conflicts.
Halo Addressing Mode: fetches the halo (boundary) region around a tile, making the use of halos simpler for the programmer.
Both modes deliver a complete tile of data into shared memory.
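The effect of blank column mode can be sketched in plain CUDA (hypothetical kernel; assumes a square matrix whose side is a multiple of TILE). Shared memory has 32 four-byte banks, so in a 32-wide tile every element of a column maps to the same bank; padding each row by one unused column shifts successive rows to different banks. D2MA's blank column mode inserts this padding during the transfer instead of in programmer code:

```cuda
#define TILE 32

// Classic shared-memory transpose.  Without the +1 pad, a warp
// reading a column of the tile would hit one bank 32 times; the
// blank column makes column reads conflict-free.
__global__ void transpose_tile(float *out, const float *in, int width) {
    __shared__ float tile[TILE][TILE + 1];   // +1 = the blank column

    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;
    tile[threadIdx.y][threadIdx.x] = in[y * width + x];
    __syncthreads();

    int tx = blockIdx.y * TILE + threadIdx.x;
    int ty = blockIdx.x * TILE + threadIdx.y;
    out[ty * width + tx] = tile[threadIdx.x][threadIdx.y];  // conflict-free column read
}
```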
Results: Dynamic Instruction Count Reduction
[Chart: per-benchmark reduction in dynamic instructions executed; 7% on average]