Download presentation
Presentation is loading. Please wait.
Published byKeyon Lester Modified over 9 years ago
1
© John A. Stratton 2009 ECE 498AL, University of Illinois, Urbana-Champaign 1 ECE 498AL Lecture 23: Kernel and Algorithm Patterns for CUDA
2
© John A. Stratton 2009 ECE 498AL, University of Illinois, Urbana-Champaign 2 Objective Learn about algorithm patterns and principles –What are my threads? Where does my data go? This lecture goes over several patterns that have seen significant success with CUDA, and how they got there. –Input/Output Convolution –Bounded Input/Output Convolution –Stencil computation –Input/Input Convolution –Bounded Input/Input Convolution
3
© John A. Stratton 2009 ECE 498AL, University of Illinois, Urbana-Champaign 3 Two Questions For every application, the start of a CUDA implementation begins with these two questions: What work does each thread do? What memory space should each piece of data go?
4
© John A. Stratton 2009 ECE 498AL, University of Illinois, Urbana-Champaign 4 Assumptions Computation is independent (parallel) unless otherwise stated –i.e. Reductions are the only real presense of serialization An work unit (as presented) is reasonably small for one CUDA thread. –We won't be discussing cases where the “tasks” are just too large to fit into a thread. Global memory is big enough to hold your entire dataset –There are another level of issues to address for this case
5
© John A. Stratton 2009 ECE 498AL, University of Illinois, Urbana-Champaign 5 A Few Commonalities: Reductions and Memory Patterns
6
© John A. Stratton 2009 ECE 498AL, University of Illinois, Urbana-Champaign 6 Reduction patterns in CUDA Local –One thread performs an entire, unique reduction Matrix Multiplication In-Block –Threads only within a block contribute to a reduction Only slightly less efficient than local reduction, esp. on new hardware Global –Every thread contributes to the reduction: two subtypes Blocked: threads within a block contribute to the same reduction, allowing some component of block reduction –The more reduction you can do within a block, the better Scattered: threads in a block do not contribute to the same reduction
7
© John A. Stratton 2009 ECE 498AL, University of Illinois, Urbana-Champaign 7 Mapping data into CUDA's memories Output must finally end up in global memory –No other way to communicate results to the rest of the world –Intermediate outputs can (and should) be stored in registers or shared memory Globally-shared input goes in constant memory –Run several kernels to process chunks at a time Input shared only by adjacent threads should be tiled into the shared memory –Matrix Multiplication tiles
8
© John A. Stratton 2009 ECE 498AL, University of Illinois, Urbana-Champaign 8 Mapping data continued... Input not shared by adjacent threads should just be loaded from global memory –There are cases where shared memory is still useful. E.g. coalescing data structure loads from global memory Texture memory should really only be used if its specialized indexing features are useful –Just accessing it “normally” is usually not worth it –Applications needing specific features might find it helpful Linear interpolation (good for FP function lookup tables) Array bounds clipping or wraparound Subword type unpacking
9
© John A. Stratton 2009 ECE 498AL, University of Illinois, Urbana-Champaign 9 Input/Output Convolution: e.g. MRI, Direct Summation CP
10
© John A. Stratton 2009 ECE 498AL, University of Illinois, Urbana-Champaign 10 Generic Algorithm Description. Every input element contributes to every output element Each output element is dependent on all input elements Input contributions are combined through some reduction operation Assumptions: All input contributions and output elements are independent An interaction is reasonably small for one CUDA thread. 0 N-1M-1 0
11
© John A. Stratton 2009 ECE 498AL, University of Illinois, Urbana-Champaign 11 What could each thread be assigned? Input elements O(N) threads Each contributes to M global reductions (scattered reduce) Output elements ~M threads Each reads N input elements, local reduction only Input/Output pairs O(N*M) threads Each thread contributes to one of M global reductions Pros and Cons to each possibility!. 0 N-1M-1 0
12
© John A. Stratton 2009 ECE 498AL, University of Illinois, Urbana-Champaign 12 Thread Assignment Tradeoffs Input elements / global reductions –Usually ineffective, as it requires M global reductions Input/Output pairs –Effective when you need the extra parallelism –You can group threads into blocks based on input or output elements Basically a choice between blocked and scattered reductions Output elements / local reductions –Very effective if a reasonable amount of input can fit into the constant memory
13
© John A. Stratton 2009 ECE 498AL, University of Illinois, Urbana-Champaign 13 What memory space does the data use? Output has to be in global memory If input is globally shared (threads assigned Output elements), constant memory is best. –Again, it's likely that the whole input won't fit in constant memory at once. –Break up your implementation into “mini-kernels”, each reading a chunk of the input at a time from constant memory. Even if constant memory doesn't make sense (threads assigned Input/Output pairs), shared memory can probably help some.
14
© John A. Stratton 2009 ECE 498AL, University of Illinois, Urbana-Champaign 14 Bounded Input/Output Convolution e.g. Cutoff Summation (and Matrix Multiplication)
15
© John A. Stratton 2009 ECE 498AL, University of Illinois, Urbana-Champaign 15 Generic problem description Input elements will contribute to a bounded range of output elements –Conversely, each output element is affected by a limited range/set of input elements Usually arises from cutoff distances in spatial representations –O(# output elements) instead of O(# output elements * # input elements) Cutoff
16
© John A. Stratton 2009 ECE 498AL, University of Illinois, Urbana-Champaign 16 Revisiting thread-assignment tradeoffs Input elements / “global” reductions –The reductions that an input element affects are restricted –Might be reasonable, if the the “global” reduction can become mostly an in-block reduction. Input/Output pairs –Still most effective when you need the extra parallelism –Try as much as possible to keep conceptually “global” reductions as in-block reductions in reality If it works, this strategy will likely be very competetive Output elements / local reductions –Still most effective if feasible
17
© John A. Stratton 2009 ECE 498AL, University of Illinois, Urbana-Champaign 17 Data Mapping? Input isn't globally shared anymore –Constant memory doesn't make sense because most threads won't need a particular input element Read “tiles” of input data relevant to a tile of output data into shared memory –Not all threads in the grid will need the data, but adjacent threads will with high probability
18
© John A. Stratton 2009 ECE 498AL, University of Illinois, Urbana-Champaign 18 Stencil Computation: Fluid Dynamics, Image Convolution
19
Generic Algorithm Description Class of in-place applications where the next “step” for an element depends on a predetermined set of other (usually adjacent) elements. Dataset should either be double- buffered or red-black colored to prevent dependencies. T=0 T=1
20
© John A. Stratton 2009 ECE 498AL, University of Illinois, Urbana-Champaign 20 Basic Questions again What does each thread do? One input component for the element? –Thread blocks compute one or a small number of elements The whole computation for one element? –Thread blocks compute tiles of elements Again, tradeoffs: mostly determined by how much work goes into an element for the next timestep –Directly related to the size of the stencil
21
© John A. Stratton 2009 ECE 498AL, University of Illinois, Urbana-Champaign 21 What memory space? Depends on the app. –If adjacent elements share input values, ideal case for shared memory tiling –If entire thread blocks compute single elements, tiling doesn't help No intrablock sharing T=0 Overlapping input tiles Non-Overlapping output tiles T=1
22
© John A. Stratton 2009 ECE 498AL, University of Illinois, Urbana-Champaign 22 What if basic tiling isn't good enough? Sometimes, the bandwidth of loading and storing tiles far outweighs the needed computation Multi-step kernels Larger input tiles, multiple steps within a block –Means there's some redundant computation for the edges of T=1 intermediate tiles T=0 Input tiles Overlapping intermediate tiles of T=1 T=2
23
© John A. Stratton 2009 ECE 498AL, University of Illinois, Urbana-Champaign 23 Input/Input Convolution e.g. N-body Interaction
24
© John A. Stratton 2009 ECE 498AL, University of Illinois, Urbana-Champaign 24 Generic Algorithm Description All input elements interact –Pairwise most common, sometimes even higher-degree Interactions usually contribute to reductions –Either a per-element reduction or a global reduction, depending on the app Examples: gravitational or electrical points in space, two-point autocorrelation function in astronomy Threads and data storage?
25
© John A. Stratton 2009 ECE 498AL, University of Illinois, Urbana-Champaign 25 What does each thread do? If the reduction is per-element, this looks a lot like the input/output convolution case –Input pair is the new “element” –Apply tradeoffs from that case If the reduction is global, it looks a lot like a simple reduction. –Again: Input pair is the new “element” –Try to load and reduce many “elements” in-block
26
© John A. Stratton 2009 ECE 498AL, University of Illinois, Urbana-Champaign 26 What memory space does the data use? Per-element reductions handled the same way as input/output convolution –Tile one copy of input through constant memory, read in the other from global memory Global reductions could be handled by loading input tile pairs into shared memory, and doing as much in-block reduction as possible –N^2 interactions per block for 2 N-element input tiles Should prevent memory bandwidth from being a bottleneck if N is reasonable. Input tiles Output (tiles or partial reduction)
27
© John A. Stratton 2009 ECE 498AL, University of Illinois, Urbana-Champaign 27 Bounded Input/Input Convolutions e.g. N-body approximations (NAMD)
28
© John A. Stratton 2009 ECE 498AL, University of Illinois, Urbana-Champaign 28 Generic Algorithm Input/Input convolution with some cutoff –O(N) instead of O(N^2) Similar approach to the bounded input/output convolution approach –Use some kind of spatial binning to reduce algorithmic complexity
29
© John A. Stratton 2009 ECE 498AL, University of Illinois, Urbana-Champaign 29 Modifications from Unbounded Case An input “tile” is naturally defined by the binning process –A tile is all input elements in one (or several) bins Each tile has a limited number of other tiles to interact –For per-element reductions, this looks a lot like the bounded input/output convolution case Load a tile, and interact every other relevant tile within one block –For global reductions, this is essentially the unchanged from the unbounded case, except that fewer input-pairs are considered for reduction contributions
Similar presentations
© 2025 SlidePlayer.com. Inc.
All rights reserved.