VSCSE Summer School
Proven Algorithmic Techniques for Many-core Processors
Lecture 5: Data Layout for Grid Applications
©Wen-mei W. Hwu and David Kirk/NVIDIA
Urbana, Illinois, August 2-5, 2010
Optimizing for the GPU memory hierarchy
- Many-core processors need more care in memory access patterns to fully exploit the parallelism in the memory hierarchy
- Memory accesses from a warp can be combined into DRAM bursts if they are coalesced
- Memory accesses across warps can exploit memory-level parallelism (MLP) if they are well distributed across all DRAM channels/banks
- Optimized data layout helps with both
Memory Controller Organization of a Many-Core Processor
- GTX280: 30 Streaming Multiprocessors (SMs) connected to 8-channel DRAM controllers through an interconnect
- DRAM controllers (channels) are interleaved for incoming memory requests
- Within each DRAM controller, DRAM banks are also interleaved
- We approximate its DRAM channel/bank interleaving scheme through micro-benchmarking
The Memory Controller Organization of a Massively Parallel Architecture: GTX280 GPU
- Address bit field A[5:2], together with the thread index, controls memory coalescing
- Address bit field A[13:6] helps steer a request to a specific memory channel/bank; we call these the steering bits
Data Layout Transformation for the GPU
- To exploit parallelism in the memory hierarchy, simultaneous global memory accesses from or across warps need the right address bit patterns at the critical bit fields
- For accesses from the same (half-)warp: the bit fields that control coalescing should match the thread index
- For accesses across warps: the steering bit fields, which are decoded into channels/banks, should be as distinct as possible
Data Layout Transformation for the GPU
- Simultaneously active CUDA threads/blocks are assigned distinct thread/block index values
  - The number of distinct indices is the span; log2(span) gives the number of distinct bit patterns in the LSBs
- To create the desired address patterns: the LSBs of CUDA thread/block index bits with span > 1 should appear at the critical address bit fields that decide coalescing and steer DRAM channel/bank interleaving
- Data layout transformation changes how offsets are calculated from indices when accessing grids
Data Layout Transformation: A Motivating Example
- For a grid G[512][512][4] of 32-bit floats, with grid-accessing expression G[by][bx][0]:
  - Assume roughly 4 * 8 active thread blocks along blockIdx.x and blockIdx.y
- Simultaneous accesses have the following word-address bit patterns in a row-major layout:
  - A[22:16]: by[8:2], mostly identical
  - A[15:13]: by[1:0], mostly distinct
  - A[12:7]: bx[8:3], mostly identical
  - A[6:4]: bx[2:0], mostly distinct
  - A[3:0]: 0, identical
- What would a better address pattern look like, given that A[11:4] (word address) are the steering bits?
AoS vs. SoA
- Array of Structures: [z][y][x][e]
  F(z, y, x, e) = z * |Y| * |X| * |E| + y * |X| * |E| + x * |E| + e
- CFD calculations differ from Coulombic potential in that each grid point carries much more state: 20X more in LBM
- Structure of Arrays: [e][z][y][x]
  F(z, y, x, e) = e * |Z| * |Y| * |X| + z * |Y| * |X| + y * |X| + x
  - 4X faster than AoS on GTX280
Parallel PDE Solvers for Structured Grids
- Target application: a class of PDE solvers on structured grids
  - Structured grid: data arranged/indexed in a multidimensional grid
  - Arises from discretizing physical space with finite difference methods, finite volume methods, or alternatives such as the lattice-Boltzmann method
- Their computation patterns are memory-intensive and data-parallel
LBM: Lattice-Boltzmann Method
- A class of fluid dynamics simulation methods
- Our version is based on SPEC 470.lbm
- Each cell needs 20 words of state (18 flows + self + flags)
LBM Stream-Collision Pseudocode
- The simulation grid is laid out as a 3D array d of cell structures
  - Or, as a 4D array [|Z|][|Y|][|X|][20], if we treat the structure field name of a cell as the index of the fourth dimension
- The output for a cell is one flow element to each adjacent cell

for each cell (z, y, x) in grid
    rho = d_in(z, y, x).e                 where e in [0, 20)
    for each neighbor cell (z+dz, y+dy, x+dx) where |dx|+|dy|+|dz| = 2
        d_out(z, y, x).g(dx, dy, dz) = f(rho)
    end for
end for
LBM: A DLT Example
- Lattice-Boltzmann method for fluid dynamics
- Compute grid for LBM:
  - Thread block dimension = 100
  - Grid dimension = 100 * 130
- "Physical" grid for LBM: 130 (Z) * 100 (Y) * 100 (X) * 20 (e) floats
- Grid-accessing expression in LN-CUDA:
  A[blockIdx.y (Type B)][blockIdx.x (Type B)][threadIdx.x (Type T)][e (Type I)]
Layout Transformation
- Grid: an n-dimensional rectangular index space
- Flattening function (FF) w.r.t. a grid: N^n -> N, injective
  - Why injective? Injectivity separates the subscripting of array elements from their addressing, which is what lets us change the layout safely
- For the grid A0 in the running example, its row-major layout FF is
  RML_A0(k, j, i) = k * ny * nx + j * nx + i
- Transformations on an FF:
  - Split an index at bit k (to chop off LSBs)
  - Interchange two indices/dimensions (to shift LSBs to the right position)
- A transformed FF is still injective, while the bit pattern of FF_g(i) changes for grid g and index vector i
- Span: the number of distinct, concurrent values of an index at any instant at runtime
Deciding the Layout
- Classify each index expression in a grid-accessing expression as:
  - Type T: thread indices
  - Type B: block indices
  - Type I: indices that belong to neither
- Determine the span of each thread/block index from the classification above
  - The number of active thread blocks can be computed from occupancy (i.e., the fraction of hardware warp contexts in use on an SM)
- Choose the indices with span > 1
  - Shift only their log2(span) LSBs to the address bit fields that decide coalescing and channel/bank interleaving
  - Fill all critical address bit fields with index bit fields that are known to be distinct simultaneously
The Best Layout Is Neither SoA nor AoS
- Tiled Array of Structures, using the lower bits of the x and y indices, i.e. x[3:0] and y[3:0], as the lowest dimensions: [z][y[31:4]][x[31:4]][e][y[3:0]][x[3:0]]
  F(z, y, x, e) = z * |Y|/2^4 * |X|/2^4 * |E| * 2^4 * 2^4
                + y[31:4] * |X|/2^4 * |E| * 2^4 * 2^4
                + x[31:4] * |E| * 2^4 * 2^4
                + e * 2^4 * 2^4
                + y[3:0] * 2^4
                + x[3:0]
- 6.4X faster than AoS and 1.6X faster than SoA on GTX280: better utilization of data by neighboring cells
- This is a scalable layout: the same layout works for very large objects
- A data layout optimal for one architecture might be harmful on another
ANY MORE QUESTIONS?