© John A. Stratton 2009 ECE 498AL, University of Illinois, Urbana-Champaign 1 ECE 498AL Lecture 23: Kernel and Algorithm Patterns for CUDA

© John A. Stratton 2009 ECE 498AL, University of Illinois, Urbana-Champaign 2 Objective Learn about algorithm patterns and principles –What are my threads? Where does my data go? This lecture goes over several patterns that have seen significant success with CUDA, and how they got there. –Input/Output Convolution –Bounded Input/Output Convolution –Stencil computation –Input/Input Convolution –Bounded Input/Input Convolution

© John A. Stratton 2009 ECE 498AL, University of Illinois, Urbana-Champaign 3 Two Questions For every application, a CUDA implementation begins with these two questions: What work does each thread do? In what memory space should each piece of data go?

© John A. Stratton 2009 ECE 498AL, University of Illinois, Urbana-Champaign 4 Assumptions Computation is independent (parallel) unless otherwise stated –i.e., reductions are the only real source of serialization A work unit (as presented) is reasonably small for one CUDA thread. –We won't be discussing cases where the “tasks” are just too large to fit into a thread. Global memory is big enough to hold your entire dataset –There is another level of issues to address for that case

© John A. Stratton 2009 ECE 498AL, University of Illinois, Urbana-Champaign 5 A Few Commonalities: Reductions and Memory Patterns

© John A. Stratton 2009 ECE 498AL, University of Illinois, Urbana-Champaign 6 Reduction patterns in CUDA Local –One thread performs an entire, unique reduction Matrix Multiplication In-Block –Threads only within a block contribute to a reduction Only slightly less efficient than local reduction, esp. on new hardware Global –Every thread contributes to the reduction: two subtypes Blocked: threads within a block contribute to the same reduction, allowing some component of block reduction –The more reduction you can do within a block, the better Scattered: threads in a block do not contribute to the same reduction
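A minimal sketch (not from the lecture) of the in-block and blocked-global patterns described above: each thread computes one contribution, the block reduces those contributions in shared memory, and each block then adds a single partial result to a global total. It assumes a power-of-two block size and floating-point atomicAdd (compute capability 2.0 or later); the kernel and variable names are illustrative.

__global__ void blockedSumKernel(const float *in, float *globalSum, int n)
{
    extern __shared__ float partial[];            // one slot per thread

    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + tid;

    // Local step: each thread produces its own contribution.
    partial[tid] = (i < n) ? in[i] : 0.0f;
    __syncthreads();

    // In-block reduction: tree reduction in shared memory.
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (tid < stride)
            partial[tid] += partial[tid + stride];
        __syncthreads();
    }

    // Global (blocked) reduction: one contribution per block.
    if (tid == 0)
        atomicAdd(globalSum, partial[0]);
}

A launch such as blockedSumKernel<<<numBlocks, 256, 256 * sizeof(float)>>>(d_in, d_sum, n) does as much of the reduction as possible in-block, leaving only one atomic operation per block for the scattered/global part.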

© John A. Stratton 2009 ECE 498AL, University of Illinois, Urbana-Champaign 7 Mapping data into CUDA's memories Output must finally end up in global memory –No other way to communicate results to the rest of the world –Intermediate outputs can (and should) be stored in registers or shared memory Globally-shared input goes in constant memory –Run several kernels to process chunks at a time Input shared only by adjacent threads should be tiled into the shared memory –Matrix Multiplication tiles
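The placement guidance above can be summarized in one toy kernel; this is a sketch with invented names, not code from the course, and it assumes blockDim.x <= 256 and that c_weights has been filled with cudaMemcpyToSymbol before launch. Constant memory holds a small globally shared read-only table, shared memory stages input that neighboring threads reuse, a register holds the intermediate result, and only the final output is written to global memory.

__constant__ float c_weights[256];                // globally shared, read-only input

__global__ void weightedAverage(const float *gIn, float *gOut, int n)
{
    __shared__ float tile[256];                   // staged input, reused by neighbors

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = (i < n) ? gIn[i] : 0.0f;  // coalesced load from global memory
    __syncthreads();

    // Adjacent threads share each other's loaded values through shared memory;
    // the intermediate result lives in a register.
    int next = (threadIdx.x + 1) % blockDim.x;
    float acc = 0.5f * (tile[threadIdx.x] + tile[next]) * c_weights[threadIdx.x & 255];

    if (i < n)
        gOut[i] = acc;                            // only the final output is written back
}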

© John A. Stratton 2009 ECE 498AL, University of Illinois, Urbana-Champaign 8 Mapping data continued... Input not shared by adjacent threads should just be loaded from global memory –There are cases where shared memory is still useful. E.g. coalescing data structure loads from global memory Texture memory should really only be used if its specialized indexing features are useful –Just accessing it “normally” is usually not worth it –Applications needing specific features might find it helpful Linear interpolation (good for FP function lookup tables) Array bounds clipping or wraparound Subword type unpacking

© John A. Stratton 2009 ECE 498AL, University of Illinois, Urbana-Champaign 9 Input/Output Convolution: e.g. MRI, Direct Summation CP

© John A. Stratton 2009 ECE 498AL, University of Illinois, Urbana-Champaign 10 Generic Algorithm Description Every input element contributes to every output element Each output element is dependent on all input elements Input contributions are combined through some reduction operation Assumptions: All input contributions and output elements are independent An interaction is reasonably small for one CUDA thread. [Figure: every element of the input array (indices 0 to N-1) feeds every element of the output array (indices 0 to M-1)]

© John A. Stratton 2009 ECE 498AL, University of Illinois, Urbana-Champaign 11 What could each thread be assigned? Input elements: O(N) threads, each contributing to M global reductions (scattered reduce). Output elements: ~M threads, each reading N input elements, local reduction only. Input/Output pairs: O(N*M) threads, each contributing to one of M global reductions. Pros and cons to each possibility!
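As a sketch of the second option (one thread per output element, local reduction only), here is an assumed direct-summation-style kernel; the InputElem structure and the distance-weighted contribution are invented for illustration and are not the course's MRI or CP code.

struct InputElem { float x, y, value; };

__global__ void outputPerThread(const InputElem *in, int numIn,
                                const float2 *outPos, float *out, int numOut)
{
    int o = blockIdx.x * blockDim.x + threadIdx.x;
    if (o >= numOut) return;

    float2 p = outPos[o];
    float acc = 0.0f;                      // local reduction held in a register

    // Every input element contributes to this thread's single output element.
    for (int i = 0; i < numIn; ++i) {
        float dx = in[i].x - p.x;
        float dy = in[i].y - p.y;
        acc += in[i].value * rsqrtf(dx * dx + dy * dy + 1e-6f);
    }
    out[o] = acc;                          // one write of the final result
}

Each thread keeps its running sum in a register, so no inter-thread reduction is needed; the cost is that every thread reads all N input elements.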

© John A. Stratton 2009 ECE 498AL, University of Illinois, Urbana-Champaign 12 Thread Assignment Tradeoffs Input elements / global reductions –Usually ineffective, as it requires M global reductions Input/Output pairs –Effective when you need the extra parallelism –You can group threads into blocks based on input or output elements Basically a choice between blocked and scattered reductions Output elements / local reductions –Very effective if a reasonable amount of input can fit into the constant memory

© John A. Stratton 2009 ECE 498AL, University of Illinois, Urbana-Champaign 13 What memory space does the data use? Output has to be in global memory If input is globally shared (threads assigned Output elements), constant memory is best. –Again, it's likely that the whole input won't fit in constant memory at once. –Break up your implementation into “mini-kernels”, each reading a chunk of the input at a time from constant memory. Even if constant memory doesn't make sense (threads assigned Input/Output pairs), shared memory can probably help some.
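A sketch of the "mini-kernel" approach, under the same invented direct-summation setup as the previous example: the host streams the input through constant memory one chunk at a time, and each launch accumulates into the output already resident in global memory (d_out is assumed to be zeroed before the first chunk). CHUNK, InputElem, and the contribution formula are illustrative.

#include <cuda_runtime.h>

#define CHUNK 2048

struct InputElem { float x, y, value; };
__constant__ InputElem c_in[CHUNK];

__global__ void accumulateChunk(int chunkSize, const float2 *outPos,
                                float *out, int numOut)
{
    int o = blockIdx.x * blockDim.x + threadIdx.x;
    if (o >= numOut) return;

    float2 p = outPos[o];
    float acc = out[o];                            // running total from earlier chunks
    for (int i = 0; i < chunkSize; ++i) {
        float dx = c_in[i].x - p.x;
        float dy = c_in[i].y - p.y;
        acc += c_in[i].value * rsqrtf(dx * dx + dy * dy + 1e-6f);
    }
    out[o] = acc;
}

// Host side: one "mini-kernel" launch per constant-memory-sized chunk of input.
void convolve(const InputElem *h_in, int numIn,
              const float2 *d_outPos, float *d_out, int numOut)
{
    dim3 block(256), grid((numOut + 255) / 256);
    for (int base = 0; base < numIn; base += CHUNK) {
        int n = (numIn - base < CHUNK) ? (numIn - base) : CHUNK;
        cudaMemcpyToSymbol(c_in, h_in + base, n * sizeof(InputElem));
        accumulateChunk<<<grid, block>>>(n, d_outPos, d_out, numOut);
    }
}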

© John A. Stratton 2009 ECE 498AL, University of Illinois, Urbana-Champaign 14 Bounded Input/Output Convolution e.g. Cutoff Summation (and Matrix Multiplication)

© John A. Stratton 2009 ECE 498AL, University of Illinois, Urbana-Champaign 15 Generic problem description Input elements will contribute to a bounded range of output elements –Conversely, each output element is affected by a limited range/set of input elements Usually arises from cutoff distances in spatial representations –O(# output elements) instead of O(# output elements * # input elements) [Figure: cutoff radius around an output element]

© John A. Stratton 2009 ECE 498AL, University of Illinois, Urbana-Champaign 16 Revisiting thread-assignment tradeoffs Input elements / “global” reductions –The reductions that an input element affects are restricted –Might be reasonable, if the “global” reduction can become mostly an in-block reduction. Input/Output pairs –Still most effective when you need the extra parallelism –Try as much as possible to keep conceptually “global” reductions as in-block reductions in reality If it works, this strategy will likely be very competitive Output elements / local reductions –Still most effective if feasible

© John A. Stratton 2009 ECE 498AL, University of Illinois, Urbana-Champaign 17 Data Mapping? Input isn't globally shared anymore –Constant memory doesn't make sense because most threads won't need a particular input element Read “tiles” of input data relevant to a tile of output data into shared memory –Not all threads in the grid will need the data, but adjacent threads will with high probability
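Here is a sketch of that tiling idea for a cutoff-summation-style problem, reusing the invented InputElem setup from earlier. It assumes blockDim.x == TILE; each block stages one input tile at a time in shared memory, and contributions beyond the cutoff are simply skipped. A real implementation would also use spatial binning so each block only visits nearby tiles rather than scanning them all.

#define TILE 256

struct InputElem { float x, y, value; };

__global__ void cutoffSum(const InputElem *in, int numIn,
                          const float2 *outPos, float *out, int numOut,
                          float cutoff2)                   // squared cutoff distance
{
    __shared__ InputElem tile[TILE];

    int o = blockIdx.x * blockDim.x + threadIdx.x;
    float2 p = (o < numOut) ? outPos[o] : make_float2(0.0f, 0.0f);
    float acc = 0.0f;

    for (int base = 0; base < numIn; base += TILE) {
        // All threads cooperate to stage one input tile in shared memory.
        int i = base + threadIdx.x;
        if (i < numIn)
            tile[threadIdx.x] = in[i];
        __syncthreads();

        int limit = min(TILE, numIn - base);
        for (int j = 0; j < limit; ++j) {
            float dx = tile[j].x - p.x;
            float dy = tile[j].y - p.y;
            float d2 = dx * dx + dy * dy;
            if (d2 < cutoff2)                              // bounded contribution
                acc += tile[j].value * rsqrtf(d2 + 1e-6f);
        }
        __syncthreads();                                   // don't overwrite a tile still in use
    }
    if (o < numOut)
        out[o] = acc;
}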

© John A. Stratton 2009 ECE 498AL, University of Illinois, Urbana-Champaign 18 Stencil Computation: Fluid Dynamics, Image Convolution

Generic Algorithm Description Class of in-place applications where the next “step” for an element depends on a predetermined set of other (usually adjacent) elements. Dataset should either be double-buffered or red-black colored to prevent dependencies. [Figure: a grid of elements updated from T=0 to T=1]
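A minimal double-buffered sketch (a 5-point averaging stencil on a 2D grid, invented for illustration): the kernel reads only from in and writes only to out, and the host swaps the two buffers between steps, so no update can observe a partially updated neighbor.

__global__ void stencilStep(const float *in, float *out, int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x <= 0 || y <= 0 || x >= width - 1 || y >= height - 1)
        return;                                   // boundary cells left untouched

    int idx = y * width + x;
    out[idx] = 0.2f * (in[idx] + in[idx - 1] + in[idx + 1] +
                       in[idx - width] + in[idx + width]);
}

// Host loop: double buffering avoids read-after-write hazards within a step.
// for (int t = 0; t < steps; ++t) {
//     stencilStep<<<grid, block>>>(d_a, d_b, W, H);
//     float *tmp = d_a; d_a = d_b; d_b = tmp;    // swap buffers
// }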

© John A. Stratton 2009 ECE 498AL, University of Illinois, Urbana-Champaign 20 Basic Questions again What does each thread do? One input component for the element? –Thread blocks compute one or a small number of elements The whole computation for one element? –Thread blocks compute tiles of elements Again, tradeoffs: mostly determined by how much work goes into an element for the next timestep –Directly related to the size of the stencil

© John A. Stratton 2009 ECE 498AL, University of Illinois, Urbana-Champaign 21 What memory space? Depends on the app. –If adjacent elements share input values, ideal case for shared memory tiling –If entire thread blocks compute single elements, tiling doesn't help: no intra-block sharing [Figure: overlapping input tiles at T=0 map to non-overlapping output tiles at T=1]
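Here is a sketch of the tiled case, assuming blockDim = (TILE, TILE) and the same invented 5-point stencil as before: each block stages a (TILE+2) x (TILE+2) input tile, including a one-element halo, in shared memory so adjacent threads share their neighbors' values. Corner halo cells are not loaded because a 5-point stencil never reads them; out-of-range reads are clamped to the grid edge.

#define TILE 16

__global__ void stencilTiled(const float *in, float *out, int width, int height)
{
    __shared__ float s[TILE + 2][TILE + 2];

    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;
    int tx = threadIdx.x + 1, ty = threadIdx.y + 1;

    // Load this thread's element, clamping out-of-range reads to the grid edge.
    int cx = min(x, width - 1);
    int cy = min(y, height - 1);
    s[ty][tx] = in[cy * width + cx];

    // Threads on the block boundary also load the one-element halo.
    if (threadIdx.x == 0)        s[ty][0]        = in[cy * width + max(cx - 1, 0)];
    if (threadIdx.x == TILE - 1) s[ty][TILE + 1] = in[cy * width + min(cx + 1, width - 1)];
    if (threadIdx.y == 0)        s[0][tx]        = in[max(cy - 1, 0) * width + cx];
    if (threadIdx.y == TILE - 1) s[TILE + 1][tx] = in[min(cy + 1, height - 1) * width + cx];
    __syncthreads();

    if (x < width && y < height)
        out[y * width + x] = 0.2f * (s[ty][tx] + s[ty][tx - 1] + s[ty][tx + 1] +
                                     s[ty - 1][tx] + s[ty + 1][tx]);
}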

© John A. Stratton 2009 ECE 498AL, University of Illinois, Urbana-Champaign 22 What if basic tiling isn't good enough? Sometimes, the bandwidth of loading and storing tiles far outweighs the needed computation Multi-step kernels: larger input tiles, multiple steps within a block –Means there's some redundant computation for the edges of the T=1 intermediate tiles [Figure: input tiles at T=0, overlapping intermediate tiles at T=1, output at T=2]

© John A. Stratton 2009 ECE 498AL, University of Illinois, Urbana-Champaign 23 Input/Input Convolution e.g. N-body Interaction

© John A. Stratton 2009 ECE 498AL, University of Illinois, Urbana-Champaign 24 Generic Algorithm Description All input elements interact –Pairwise most common, sometimes even higher-degree Interactions usually contribute to reductions –Either a per-element reduction or a global reduction, depending on the app Examples: gravitational or electrical points in space, two-point autocorrelation function in astronomy Threads and data storage?

© John A. Stratton 2009 ECE 498AL, University of Illinois, Urbana-Champaign 25 What does each thread do? If the reduction is per-element, this looks a lot like the input/output convolution case –Input pair is the new “element” –Apply tradeoffs from that case If the reduction is global, it looks a lot like a simple reduction. –Again: Input pair is the new “element” –Try to load and reduce many “elements” in-block

© John A. Stratton 2009 ECE 498AL, University of Illinois, Urbana-Champaign 26 What memory space does the data use? Per-element reductions handled the same way as input/output convolution –Tile one copy of input through constant memory, read in the other from global memory Global reductions could be handled by loading input tile pairs into shared memory, and doing as much in-block reduction as possible –N^2 interactions per block for 2 N-element input tiles Should prevent memory bandwidth from being a bottleneck if N is reasonable. [Figure: pairs of input tiles combine into output tiles or a partial reduction]
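A sketch of the tile-pair idea for a global reduction, using an invented example (counting pairs of 2D points closer than a cutoff, roughly in the spirit of a two-point correlation). Block (bx, by) of a 2D grid loads tiles bx and by into shared memory, forms all TILE x TILE pairs, reduces in-block, and issues one atomicAdd per block. For brevity the sketch does not correct for self-pairs or for each pair being visited twice.

#define TILE 128

__global__ void pairCount(const float2 *in, int n, float cutoff2,
                          unsigned long long *globalCount)
{
    __shared__ float2 tileA[TILE], tileB[TILE];
    __shared__ unsigned int partial[TILE];

    int a = blockIdx.x * TILE + threadIdx.x;
    int b = blockIdx.y * TILE + threadIdx.x;
    tileA[threadIdx.x] = (a < n) ? in[a] : make_float2( 1e30f,  1e30f);  // padding lands
    tileB[threadIdx.x] = (b < n) ? in[b] : make_float2(-1e30f, -1e30f);  // outside any cutoff
    __syncthreads();

    // Each thread pairs its tileA element with every tileB element: TILE*TILE
    // interactions per block, all fed from shared memory.
    unsigned int count = 0;
    float2 p = tileA[threadIdx.x];
    for (int j = 0; j < TILE; ++j) {
        float dx = p.x - tileB[j].x, dy = p.y - tileB[j].y;
        if (dx * dx + dy * dy < cutoff2)
            ++count;
    }
    partial[threadIdx.x] = count;
    __syncthreads();

    // In-block tree reduction, then a single global contribution per block.
    for (int s = TILE / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s)
            partial[threadIdx.x] += partial[threadIdx.x + s];
        __syncthreads();
    }
    if (threadIdx.x == 0)
        atomicAdd(globalCount, (unsigned long long)partial[0]);
}

A launch would use dim3 grid((n + TILE - 1) / TILE, (n + TILE - 1) / TILE) with TILE threads per block, so each block performs TILE*TILE interactions for just 2*TILE global loads.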

© John A. Stratton 2009 ECE 498AL, University of Illinois, Urbana-Champaign 27 Bounded Input/Input Convolutions e.g. N-body approximations (NAMD)

© John A. Stratton 2009 ECE 498AL, University of Illinois, Urbana-Champaign 28 Generic Algorithm Input/Input convolution with some cutoff –O(N) instead of O(N^2) Similar to the bounded input/output convolution approach –Use some kind of spatial binning to reduce algorithmic complexity

© John A. Stratton 2009 ECE 498AL, University of Illinois, Urbana-Champaign 29 Modifications from Unbounded Case An input “tile” is naturally defined by the binning process –A tile is all input elements in one (or several) bins Each tile has a limited number of other tiles to interact with –For per-element reductions, this looks a lot like the bounded input/output convolution case Load a tile, and interact with every other relevant tile within one block –For global reductions, this is essentially unchanged from the unbounded case, except that fewer input pairs are considered for reduction contributions
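For the per-element case, a simplified sketch of what binning buys (one thread per particle, no shared-memory tiling, to keep it short): assuming particles have already been sorted by bin, with binStart and binCount describing a binsX x binsY grid of cells whose side equals the cutoff, each thread only visits the 3 x 3 block of neighboring bins instead of the whole input. All names and the contribution formula are invented for illustration.

struct Particle { float x, y, value; };

__global__ void binnedCutoffSum(const Particle *p, int n,
                                const int *binStart, const int *binCount,
                                int binsX, int binsY, float cutoff,
                                float *out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    Particle me = p[i];
    // Which bin am I in? (assumes 0 <= x < binsX*cutoff and 0 <= y < binsY*cutoff)
    int bx = min((int)(me.x / cutoff), binsX - 1);
    int by = min((int)(me.y / cutoff), binsY - 1);
    float cutoff2 = cutoff * cutoff;
    float acc = 0.0f;

    // Only particles in the 3x3 neighborhood of bins can be within the cutoff.
    for (int dy = -1; dy <= 1; ++dy) {
        for (int dx = -1; dx <= 1; ++dx) {
            int nx = bx + dx, ny = by + dy;
            if (nx < 0 || ny < 0 || nx >= binsX || ny >= binsY) continue;
            int bin = ny * binsX + nx;
            for (int j = binStart[bin]; j < binStart[bin] + binCount[bin]; ++j) {
                if (j == i) continue;                      // skip self-interaction
                float ddx = p[j].x - me.x, ddy = p[j].y - me.y;
                float d2 = ddx * ddx + ddy * ddy;
                if (d2 < cutoff2)
                    acc += p[j].value * rsqrtf(d2 + 1e-6f);
            }
        }
    }
    out[i] = acc;                                          // per-element reduction result
}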