VSCSE Summer School: Proven Algorithmic Techniques for Many-core Processors
Lecture 5: Data Layout for Grid Applications
©Wen-mei W. Hwu and David Kirk/NVIDIA Urbana, Illinois, August 2-5, 2010

Optimizing for the GPU Memory Hierarchy
Many-core processors need careful attention to memory access patterns to fully exploit the parallelism in the memory hierarchy.
Memory accesses from a warp can exploit DRAM bursts if they are coalesced.
Memory accesses across warps can exploit memory-level parallelism (MLP) if they are well distributed over all DRAM channels/banks.
An optimized data layout helps with both.
©Wen-mei W. Hwu and David Kirk/NVIDIA Urbana, Illinois, August 2-5, 2010
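To make the coalescing point concrete, here is a minimal CUDA sketch, not from the original slides, contrasting a coalesced copy, where consecutive threads of a warp touch consecutive words, with a strided one that breaks the accesses across many DRAM bursts; the kernel names and the stride parameter are illustrative assumptions.

    // Coalesced: thread k of a warp reads word k of a contiguous segment,
    // so one warp-wide access maps onto a few DRAM bursts.
    __global__ void copy_coalesced(const float *in, float *out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = in[i];
    }

    // Strided: neighboring threads touch words far apart, so the warp's
    // accesses scatter over many bursts and effective bandwidth drops.
    __global__ void copy_strided(const float *in, float *out, int n, int stride) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        long j = (long)i * stride;
        if (j < n) out[j] = in[j];
    }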

Memory Controller Organization of a Many-core Processor
GTX280: 30 Streaming Multiprocessors (SMs) connected to 8 DRAM controllers (channels) through an interconnect.
Incoming memory requests are interleaved across the DRAM controllers (channels), and within each channel they are interleaved across the DRAM banks.
We approximate the GTX280's DRAM channel/bank interleaving scheme through micro-benchmarking.
©Wen-mei W. Hwu and David Kirk/NVIDIA Urbana, Illinois, August 2-5, 2010

The Memory Controller Organization of a Massively Parallel Architecture: the GTX280 GPU
Address bit field A[5:2], together with the thread index, controls memory coalescing.
Address bit field A[13:6] steers a request to a specific channel/bank; we call these the steering bits.
©Wen-mei W. Hwu and David Kirk/NVIDIA Urbana, Illinois, August 2-5, 2010
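As an illustration only, the two bit fields named above can be pulled out of a byte address with simple shifts and masks; the helper names below are invented for this sketch, and the field positions are the ones quoted on the slide.

    // Extract the coalescing and steering bit fields of a byte address,
    // using the GTX280 field positions quoted on this slide.
    static inline unsigned coalescing_bits(unsigned long addr) {
        return (unsigned)((addr >> 2) & 0xF);    // A[5:2], 4 bits
    }
    static inline unsigned steering_bits(unsigned long addr) {
        return (unsigned)((addr >> 6) & 0xFF);   // A[13:6], 8 bits
    }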

Data Layout Transformation for the GPU
To exploit the parallelism in the memory hierarchy, simultaneous global memory accesses from within or across warps need the right address bit patterns at the critical bit fields.
For accesses from the same (half-)warp: the patterns at the bit fields that control coalescing should match the thread index.
For accesses across warps: the patterns at the steering bit fields, which are used to decode channels/banks, should be as distinct as possible.
©Wen-mei W. Hwu and David Kirk/NVIDIA Urbana, Illinois, August 2-5, 2010

Data Layout Transformation for the GPU
Simultaneously active CUDA threads/blocks are assigned distinct thread/block index values. The number of distinct index values is the span; log2(span) gives roughly the number of distinct bit patterns in the LSBs.
To create the desired address patterns, the LSBs of the CUDA thread/block index bits with span > 1 should appear at the critical address bit fields that decide coalescing and steer DRAM channel/bank interleaving.
Data layout transformation changes how offsets are calculated from indices when accessing grids.
©Wen-mei W. Hwu and David Kirk/NVIDIA Urbana, Illinois, August 2-5, 2010

Data Layout Transformation: A Motivating Example
For a grid G[512][512][4] of 32-bit floats, with the grid-accessing expression G[by][bx][0]:
Assume 4 * ~8 active thread blocks across blockIdx.x and blockIdx.y.
Simultaneous accesses then have the following address bit patterns in a row-major layout:
    A[22:16]: by[8:2], mostly identical
    A[15:13]: by[1:0], mostly distinct
    A[12:7]:  bx[8:3], mostly identical
    A[6:4]:   bx[2:0], mostly distinct
    A[3:0]:   0, identical
What would a better address pattern look like, given that A[11:4] (word address) are the steering bits?
©Wen-mei W. Hwu and David Kirk/NVIDIA Urbana, Illinois, August 2-5, 2010
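A minimal sketch, not on the original slide, of the row-major flattening behind the table above; the function name is made up.

    // Row-major flattening of G[by][bx][0] for a grid G[512][512][4] of
    // 32-bit floats: the word offset of an access is (by * 512 + bx) * 4.
    unsigned word_offset(unsigned by, unsigned bx) {
        return (by * 512u + bx) * 4u;
    }
    // With only a handful of active blocks in each dimension, only the low
    // few bits of bx and by differ across simultaneous accesses; the higher
    // bits, which cover most of the steering bit field, are identical.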

AoS vs. SoA
Array of Structures: [z][y][x][e]
F(z, y, x, e) = z * |Y| * |X| * |E| + y * |X| * |E| + x * |E| + e
CFD calculations differ from Coulombic potential in that each grid point carries much more state: 20x more in LBM.
Structure of Arrays: [e][z][y][x]
F(z, y, x, e) = e * |Z| * |Y| * |X| + z * |Y| * |X| + y * |X| + x
SoA is 4x faster than AoS on the GTX280.
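The two flattening functions above translate directly into index helpers; this sketch is my addition and assumes grid extents NZ, NY, NX and NE state words per cell.

    #include <stddef.h>

    // Array of Structures: element e of cell (z, y, x) lives at [z][y][x][e].
    size_t aos_index(size_t z, size_t y, size_t x, size_t e,
                     size_t NY, size_t NX, size_t NE) {
        return ((z * NY + y) * NX + x) * NE + e;
    }

    // Structure of Arrays: one array per state word e, laid out [e][z][y][x].
    size_t soa_index(size_t z, size_t y, size_t x, size_t e,
                     size_t NZ, size_t NY, size_t NX) {
        return ((e * NZ + z) * NY + y) * NX + x;
    }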

Parallel PDE Solvers for Structured Grids
Target application: a class of PDE solvers.
Structured grid: data arranged and indexed in a multidimensional grid.
Such grids come from discretizing physical space with finite difference methods, finite volume methods, or alternatives such as the lattice-Boltzmann method.
Their computation patterns are memory-intensive and data-parallel.
©Wen-mei W. Hwu and David Kirk/NVIDIA Urbana, Illinois, August 2-5, 2010

LBM: Lattice-Boltzmann Method
A class of fluid dynamics simulation methods.
Our version is based on SPEC 470.lbm.
Each cell needs 20 words of state (18 flows + self + flags).
©Wen-mei W. Hwu and David Kirk/NVIDIA Urbana, Illinois, August 2-5, 2010
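As a hypothetical C sketch of that per-cell state in the array-of-structures view, one cell might look like the struct below; the field names are made up, and only the 20-word total comes from the slide.

    // One LBM cell in the AoS view: 20 32-bit words of state.
    typedef struct {
        float flow[18];   // distribution values toward the 18 neighbor directions
        float self;       // the "rest" (self) distribution value
        unsigned flags;   // cell flags (e.g., obstacle/boundary markers)
    } LBMCell;            // 20 words total, matching the slide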

LBM Stream-Collision Pseudocode
The simulation grid is laid out as a 3D array d of cell structures, or equivalently as a 4D array [|Z|][|Y|][|X|][20] if we treat the structure field name of a cell as the index of the fourth dimension.
The output for a cell is one flow element to each adjacent cell:
    for each cell (x, y, z) in grid
        rho = d_in(z, y, x).e            where e is in [0, 20)
        for each neighbor cell (z+dz, y+dy, x+dx) where |dx|+|dy|+|dz| = 2
            d_out(z, y, x).g(dx, dy, dz) = f(rho)
        end for
    end for
©Wen-mei W. Hwu and David Kirk/NVIDIA Urbana, Illinois, August 2-5, 2010
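A heavily simplified CUDA skeleton of this loop nest, added here for illustration: one thread per cell, AoS indexing, a placeholder collide() standing in for f(rho), no boundary handling, and a simplified treatment of how rho is formed from the 20 state words.

    #define NE 20
    __device__ float collide(float rho) { return rho; }  // stand-in for f(rho)

    __global__ void stream_collide(const float *d_in, float *d_out,
                                   int NZ, int NY, int NX) {
        int x = blockIdx.x * blockDim.x + threadIdx.x;
        int y = blockIdx.y;
        int z = blockIdx.z;
        if (x >= NX || y >= NY || z >= NZ) return;
        size_t cell = ((size_t)(z * NY + y) * NX + x) * NE;  // base of [z][y][x][*]
        float rho = 0.0f;
        for (int e = 0; e < NE; ++e) rho += d_in[cell + e];  // read the cell's state
        for (int e = 0; e < NE; ++e) d_out[cell + e] = collide(rho);  // one output per flow
    }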

LBM: A DLT Example
Lattice-Boltzmann Method for fluid dynamics.
Compute grid for LBM: thread block dimension = 100; grid dimension = 100 * 130.
"Physical" grid for LBM: 130 (Z) * 100 (Y) * 100 (X) * 20 (e) floats.
Grid-accessing expression in LN-CUDA: A[blockIdx.y (Type B)][blockIdx.x (Type B)][threadIdx.x (Type T)][e (Type I)]
©Wen-mei W. Hwu and David Kirk/NVIDIA Urbana, Illinois, August 2-5, 2010
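A small sketch, with assumed names, of how that compute grid and accessing expression might look in plain CUDA; the kernel body is a placeholder.

    #define NZ 130   // blockIdx.y ranges over Z
    #define NY 100   // blockIdx.x ranges over Y
    #define NX 100   // threadIdx.x ranges over X
    #define NE 20    // e ranges over the 20 state words

    __global__ void lbm_kernel(const float *A, float *out) {
        for (int e = 0; e < NE; ++e) {
            // A[blockIdx.y][blockIdx.x][threadIdx.x][e], flattened row-major:
            size_t idx = (((size_t)blockIdx.y * NY + blockIdx.x) * NX
                          + threadIdx.x) * NE + e;
            out[idx] = A[idx];   // placeholder body
        }
    }

    // Launch with thread block dimension 100 and grid dimension 100 x 130:
    //   lbm_kernel<<<dim3(100, 130), 100>>>(A, out);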

Layout Transformation
Grid: an n-dimensional rectangular index space.
Flattening function (FF) w.r.t. a grid: an injective map from N^n to N.
For the grid A0 in the running example, its row-major layout FF is RML_A0(k, j, i) = k * ny * nx + j * nx + i.
Transformations on an FF:
Split an index at bit k (to separate out its LSBs).
Interchange two indices/dimensions (to shift LSBs to the right position).
A transformed FF is still injective, while the bit pattern of FF_g(i) changes for grid g and index vector i.
Why injective? Before we can do layout transformation, we must have a means of separating the subscripting of array elements from their addressing.
Span: the number of distinct, concurrent values of an index at any instant at runtime.
©Wen-mei W. Hwu and David Kirk/NVIDIA Urbana, Illinois, August 2-5, 2010
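A sketch of mine, with made-up names, of a row-major flattening function and of a split-then-interchange transformation of it, here splitting the two fastest indices at bit 2 and moving their low bits innermost.

    #include <stddef.h>

    // Row-major flattening function for a 3D grid of extent nz x ny x nx.
    size_t rml(size_t k, size_t j, size_t i, size_t ny, size_t nx) {
        return (k * ny + j) * nx + i;
    }

    // Split j and i at bit 2 (4-wide tiles) and interchange so that their
    // low bits become the fastest-varying dimensions:
    // [k][j_hi][i_hi][j_lo][i_lo]. Still injective (assuming ny and nx are
    // multiples of 4); only the bit pattern of the flattened offset changes.
    size_t rml_tiled(size_t k, size_t j, size_t i, size_t ny, size_t nx) {
        size_t j_hi = j >> 2, j_lo = j & 3;
        size_t i_hi = i >> 2, i_lo = i & 3;
        return ((((k * (ny >> 2) + j_hi) * (nx >> 2) + i_hi) * 4 + j_lo) * 4) + i_lo;
    }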

Deciding the Layout
Classify each index expression of a grid-accessing expression as:
Type T: thread indices
Type B: block indices
Type I: indices that belong to neither of the above
Determine the span of each thread/block index based on the classification above; the number of active thread blocks can be computed from occupancy (i.e., the fraction of hardware warp contexts in use on an SM).
Choose the indices that have span > 1 and shift only their log2(span) LSBs into the address bit fields that decide channel/bank interleaving and coalescing.
Fill all critical address bit fields with index bit fields that are known to be distinct simultaneously.
©Wen-mei W. Hwu and David Kirk/NVIDIA Urbana, Illinois, August 2-5, 2010
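A hypothetical host-side helper, not from the slides, that illustrates the last two steps: given the type and span of each index, it reports how many of that index's LSBs should be steered into the critical bit fields; the spans used here are illustrative only.

    #include <stdio.h>

    typedef struct { const char *name; char type; unsigned span; } IndexInfo;

    static int lsb_bits(unsigned span) {     // ceil(log2(span))
        int b = 0;
        while ((1u << b) < span) ++b;
        return b;
    }

    int main(void) {
        IndexInfo idx[] = {                  // illustrative spans
            { "blockIdx.y",  'B', 8 },
            { "blockIdx.x",  'B', 4 },
            { "threadIdx.x", 'T', 16 },      // e.g., a half-warp's worth
        };
        for (int i = 0; i < 3; ++i)
            if (idx[i].span > 1)             // only indices with span > 1 matter
                printf("%s: steer %d LSB(s) into the critical bit fields\n",
                       idx[i].name, lsb_bits(idx[i].span));
        return 0;
    }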

The Best Layout Is Neither SoA nor AoS
Tiled Array of Structures, using the lower bits of the x and y indices, i.e. x[3:0] and y[3:0], as the lowest dimensions: [z][y[31:4]][x[31:4]][e][y[3:0]][x[3:0]]
F(z, y, x, e) = z * |Y|/2^4 * |X|/2^4 * |E| * 2^4 * 2^4 + y[31:4] * |X|/2^4 * |E| * 2^4 * 2^4 + x[31:4] * |E| * 2^4 * 2^4 + e * 2^4 * 2^4 + y[3:0] * 2^4 + x[3:0]
6.4x faster than AoS and 1.6x faster than SoA on the GTX280: better utilization of data by neighboring cells.
This is a scalable layout: the same layout works for very large objects.
A data layout that is optimal for one architecture may be harmful for other architectures.
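A direct sketch of that tiled flattening function in C, added for illustration; the function name is assumed, and |X| and |Y| are assumed to be multiples of 16 (the 2^4 tile width).

    #include <stddef.h>

    // Tiled Array of Structures: [z][y>>4][x>>4][e][y&15][x&15].
    // NY and NX must be multiples of 16.
    size_t tiled_aos_index(size_t z, size_t y, size_t x, size_t e,
                           size_t NY, size_t NX, size_t NE) {
        size_t y_hi = y >> 4, y_lo = y & 15;
        size_t x_hi = x >> 4, x_lo = x & 15;
        return ((((z * (NY >> 4) + y_hi) * (NX >> 4) + x_hi) * NE + e) * 16 + y_lo) * 16 + x_lo;
    }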

ANY MORE QUESTIONS? ©Wen-mei W. Hwu and David Kirk/NVIDIA Urbana, Illinois, August 2-5, 2010