Published by Elly Maurer. Modified over 6 years ago.
1
VSCSE Summer School
Proven Algorithmic Techniques for Many-core Processors
Lecture 5: Data Layout for Grid Applications
©Wen-mei W. Hwu and David Kirk/NVIDIA, Urbana, Illinois, August 2-5, 2010
2
Optimizing for the GPU memory hierarchy
Many-core processors need more care with memory access patterns to fully exploit the parallelism in the memory hierarchy.
Memory accesses from a warp can exploit DRAM bursts if they are coalesced.
Memory accesses across warps can exploit memory-level parallelism (MLP) if they are well distributed over all DRAM channels/banks.
An optimized data layout helps with both.
3
Memory Controller Organization of a Many-Core Processor
GTX280: 30 Streaming Multiprocessors (SMs) connected to 8-channel DRAM controllers through an interconnect.
DRAM controllers (channels) are interleaved.
Within each DRAM controller (channel), DRAM banks are interleaved for incoming memory requests.
We approximate its DRAM channel/bank interleaving scheme through micro-benchmarking.
4
The Memory Controller Organization of a Massively Parallel Architecture: GTX280 GPU
Address bit field A[5:2], together with the thread index, controls memory coalescing.
Address bit field A[13:6] helps steer a request to a specific memory channel/bank; we call these the steering bits.
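The two bit fields can be pulled out of an address with a shift and a mask. The following is a minimal sketch, assuming A denotes a byte address of a 32-bit word (so A[1:0] is the byte offset within the word); the helper names `bit_field`, `coalescing_bits`, and `steering_bits` are hypothetical and not part of the lecture.

```cpp
#include <cstdint>

// Extract address bits A[hi:lo] (inclusive) from a byte address.
// Assumption: hi >= lo and the field is narrower than 32 bits.
constexpr std::uint32_t bit_field(std::uint64_t addr, unsigned hi, unsigned lo) {
    return static_cast<std::uint32_t>(
        (addr >> lo) & ((std::uint64_t{1} << (hi - lo + 1)) - 1));
}

// A[5:2]: which 32-bit word inside a 64-byte segment the address touches.
constexpr std::uint32_t coalescing_bits(std::uint64_t addr) {
    return bit_field(addr, 5, 2);
}

// A[13:6]: the steering bits that help select a DRAM channel/bank.
constexpr std::uint32_t steering_bits(std::uint64_t addr) {
    return bit_field(addr, 13, 6);
}
```

Two addresses that differ only in A[5:2] fall in the same 64-byte segment, while two addresses that differ in A[13:6] may be steered to different channels or banks.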
5
Data Layout Transformation for the GPU
To exploit parallelism in the memory hierarchy, simultaneous global memory accesses from within a warp or across warps need to have the right address bit patterns at critical bit fields.
For accesses from the same (half-)warp: the patterns at the bit fields that control coalescing should match the thread index.
For accesses across warps: the patterns at the steering bit fields, which are used to decode channels/banks, should be as distinct as possible.
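Both conditions can be checked on a list of addresses. The sketch below is a simplified model, not the hardware's actual rule set: it assumes byte addresses, 32-bit accesses, a 16-thread half-warp, A[5:2] as the coalescing field, and A[13:6] as the steering field. The function names are hypothetical.

```cpp
#include <cstddef>
#include <cstdint>
#include <set>
#include <vector>

// Simplified coalescing check for one half-warp: thread t must touch word t
// of a single 64-byte segment, i.e. A[5:2] == t and identical upper bits.
inline bool half_warp_coalesced(const std::vector<std::uint64_t>& addrs) {
    for (std::size_t t = 0; t < addrs.size(); ++t) {
        if (((addrs[t] >> 2) & 0xF) != (t & 0xF)) return false;
        if ((addrs[t] >> 6) != (addrs[0] >> 6)) return false;
    }
    return true;
}

// Count how many distinct steering patterns A[13:6] appear among addresses
// issued at about the same time (for example, one address per active warp).
inline std::size_t distinct_steering_patterns(const std::vector<std::uint64_t>& addrs) {
    std::set<std::uint64_t> seen;
    for (std::uint64_t a : addrs) seen.insert((a >> 6) & 0xFF);
    return seen.size();
}
```

A higher count from `distinct_steering_patterns` suggests that the requests are spread over more channels/banks; a count of 1 means that every request carries the same steering pattern.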
6
Data Layout Transformation for the GPU
Simultaneously active CUDA threads/blocks are assigned distinct thread/block index values.
The number of distinct indices is the span; roughly log2(span) LSBs of the index take distinct bit patterns.
To create the desired address patterns: the LSBs of CUDA thread/block indices with span > 1 should appear at the critical address bit fields that decide coalescing and steer DRAM channel/bank interleaving.
Data layout transformation changes how offsets are calculated from indices when accessing grids.
A small example follows.
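The relation between span and the number of useful index LSBs can be sketched as a ceiling of log2. The function name `span_bits` is hypothetical, and rounding up is an assumption for spans that are not powers of two.

```cpp
#include <cstdint>

// Number of low-order index bits that can differ among `span` concurrently
// active index values, i.e. ceil(log2(span)). A span of 1 yields 0 bits,
// which is why only indices with span > 1 are worth moving.
inline unsigned span_bits(std::uint32_t span) {
    unsigned bits = 0;
    while ((std::uint64_t{1} << bits) < span) ++bits;
    return bits;
}
```

For instance, an index with a span of 8 contributes 3 distinct LSBs, while an index with a span of 1 contributes none.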
7
Data Layout Transformation: A motivating example
For a grid G[512][512][4] of 32-bit floats, with grid-accessing expression G[by][bx][0]:
Assume roughly 4 * 8 simultaneously active thread blocks across blockIdx.y and blockIdx.x.
Simultaneous accesses have the following word-address bit patterns in a row-major layout:

  Address bits   Index bits   Pattern across simultaneous accesses
  A[22:16]       by[8:2]      mostly identical
  A[15:13]       by[1:0]      mostly distinct
  A[12:7]        bx[8:3]      mostly identical
  A[6:4]         bx[2:0]      mostly distinct
  A[3:0]         0            identical

What would a better address pattern look like, given that A[11:4] (word address) are the steering bits?
8
AoS vs. SoA
Array of Structures (AoS): [z][y][x][e]
  F(z, y, x, e) = z * |Y| * |X| * |E| + y * |X| * |E| + x * |E| + e
Structure of Arrays (SoA): [e][z][y][x]
  F(z, y, x, e) = e * |Z| * |Y| * |X| + z * |Y| * |X| + y * |X| + x
  4X faster than AoS on GTX280
CFD calculations differ from the Coulombic potential calculation in that each grid point contains much more state: 20X more in LBM.
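The two flattening functions translate directly into code. This is a minimal sketch in element (not byte) offsets; the struct `GridDims` and the function names are hypothetical.

```cpp
#include <cstddef>

// Extents of the grid: |Z|, |Y|, |X| and the number of elements per cell |E|.
struct GridDims {
    std::size_t Z, Y, X, E;
};

// Array of Structures, layout [z][y][x][e]: the fields of one cell are contiguous.
constexpr std::size_t aos_offset(GridDims d, std::size_t z, std::size_t y,
                                 std::size_t x, std::size_t e) {
    return ((z * d.Y + y) * d.X + x) * d.E + e;
}

// Structure of Arrays, layout [e][z][y][x]: the same field of x-adjacent cells is contiguous.
constexpr std::size_t soa_offset(GridDims d, std::size_t z, std::size_t y,
                                 std::size_t x, std::size_t e) {
    return ((e * d.Z + z) * d.Y + y) * d.X + x;
}
```

With threads mapped to consecutive x values and a fixed e, AoS places neighboring threads |E| elements apart, whereas SoA places them one element apart, which is the pattern that coalescing needs.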
9
Parallel PDE solvers for structured grids
Target application: a class of PDE solvers.
Structured grid: data is arranged/indexed in a multidimensional grid.
Such grids come from discretizing physical space with Finite Difference Methods, Finite Volume Methods, or alternative methods such as the lattice-Boltzmann Method.
Their computation patterns are considered memory-intensive and data-parallel.
10
LBM: Lattice-Boltzmann Method
A class of fluid dynamics simulation methods.
Our version is based on SPEC 470.lbm.
Each cell needs 20 words of state (18 flows + self + flags).
11
LBM Stream Collision Pseudocode
The simulation grid is laid out as a 3D array d of cell structures.
Or, as a 4D array [|Z|][|Y|][|X|][20] if we treat the structure field name of a cell as the index of the fourth dimension.
The output for a cell is one flow element to each adjacent cell.

  for each cell (x, y, z) in grid
    rho = d_in(z, y, x).e where e in [0, 20)
    for each neighbor cell (z+dz, y+dy, x+dx) where |dx|+|dy|+|dz| =
      d_out(z, y, x).g(dx, dy, dz) = f(rho)
    end for
  end for
12
LBM: A DLT Example
Lattice-Boltzmann Method for fluid dynamics.
Compute grid for LBM:
  Thread block dimension = 100
  Grid dimension = 100 * 130
"Physical" grid for LBM: 130 (Z) * 100 (Y) * 100 (X) * 20 (e) floats.
Grid-accessing expression in LN-CUDA:
  A[blockIdx.y (Type B)][blockIdx.x (Type B)][threadIdx.x (Type T)][e (Type I)]
13
Layout transformation
Grid: an n-dimensional rectangular index space.
Flattening function (FF) w.r.t. a grid: N^n -> N, injective.
For the grid A0 in the running example, its row-major layout FF is RML_A0(k, j, i) = k * ny * nx + j * nx + i.
Transformations for an FF:
  Split an index at bit k (to chop off its LSBs).
  Interchange two indices and dimensions (to shift the LSBs to the right position).
The transformed FF is still injective, while the bit pattern of FF_g(i) changes for grid g and index vector i.
Why injective? So that distinct index vectors never map to the same address.
Before we can do layout transformation, we need a means of separating the subscripting of array elements from their addressing.
Span: the number of distinct values an index takes concurrently at any instant at run time.
14
Deciding Layout
Classify each index expression of a grid-accessing expression as:
  Type T: thread indices
  Type B: block indices
  Type I: indices that belong to neither of the above
Determine the span of each thread/block index based on the classification above.
  The number of active thread blocks can be computed from occupancy (i.e. % of used HW warp contexts in an SM).
Choose the indices that have span > 1.
Shift only their log2(span) LSBs to the address bit fields that are used to decide channel/bank/coalescing.
Fill all critical address bit fields with index bit fields that are known to be distinct simultaneously.
15
The best layout is neither SoA nor AoS
Tiled Array of Structures, using the lower bits of the x and y indices, i.e. x[3:0] and y[3:0], as the lowest dimensions: [z][y[31:4]][x[31:4]][e][y[3:0]][x[3:0]]
  F(z, y, x, e) = z * (|Y|/2^4) * (|X|/2^4) * |E| * 2^4 * 2^4 + y[31:4] * (|X|/2^4) * |E| * 2^4 * 2^4 + x[31:4] * |E| * 2^4 * 2^4 + e * 2^4 * 2^4 + y[3:0] * 2^4 + x[3:0]
6.4X faster than AoS and 1.6X faster than SoA on GTX280: better utilization of data by neighboring cells.
This is a scalable layout: the same layout works for very large objects.
A data layout that is optimal for one architecture might be harmful for another, for example CPU vs. GPU.
16
ANY MORE QUESTIONS?