CUDA Tricks Presented by Damodaran Ramani

Synopsis
- Scan Algorithm
- Applications
- Specialized Libraries
  - CUDPP: CUDA Data Parallel Primitives Library
  - Thrust: a Template Library for CUDA Applications
  - CUDA FFT and BLAS libraries for the GPU

References
- Shubhabrata Sengupta, Mark Harris, Yao Zhang, and John D. Owens. "Scan Primitives for GPU Computing."
- Presentation on scan primitives by Gary J. Katz, based on "Parallel Prefix Sum (Scan) with CUDA" by Harris, Sengupta, and Owens (GPU Gems 3, Chapter 39).

Introduction GPUs are massively parallel processors. The programmable parts of the graphics pipeline operate on primitives (vertices, fragments). These primitive programs spawn a thread for each primitive to keep the parallel processors full.

Stream programming model (particle systems, image processing, grid-based fluid simulations, and dense matrix algebra): a fragment program operating on n fragments performs O(n) memory accesses. A problem arises when the access requirements are more complex, e.g. prefix-sum, which costs O(n^2) accesses if each output reads all of its preceding inputs.

Prefix-Sum Example (exclusive scan with +), e.g.: in: 3 1 7 0 4 1 6 3  ->  out: 0 3 4 11 11 15 16 22

Trivial Sequential Implementation

// Sequential exclusive scan: out[i] = sum of in[0..i-1]
void scan(int* in, int* out, int n)
{
    out[0] = 0;
    for (int i = 1; i < n; i++)
        out[i] = in[i-1] + out[i-1];
}

Scan: An Efficient Parallel Primitive We are interested in finding efficient solutions to parallel problems in which each output requires global knowledge of the inputs. Why CUDA? A general load-store memory architecture, on-chip shared memory, and thread synchronization.

Threads & Blocks GeForce 8800 GTX (16 multiprocessors, 8 processors each). CUDA structures GPU programs into parallel thread blocks of up to 512 SIMD-parallel threads. Programmers specify the number of thread blocks and threads per block, and the hardware and drivers map thread blocks to parallel multiprocessors on the GPU. Within a thread block, threads can communicate through shared memory and cooperate through barrier synchronization. Because only threads within the same block can cooperate via shared memory and thread synchronization, programmers must partition computation into multiple blocks (more complex programming, but large performance benefits).
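As a reminder of how this block/thread decomposition looks in code, here is a minimal launch sketch; the kernel name, the 256-thread block size, and the doubling operation are illustrative choices, not taken from the slides.

// Hypothetical kernel: one thread per element, 256 threads per block.
__global__ void myKernel(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // global thread index
    if (i < n)
        data[i] *= 2.0f;                             // each thread handles one element
}

void launch(float *d_data, int n)
{
    int threadsPerBlock = 256;                       // <= 512 on a GeForce 8800 GTX
    int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
    myKernel<<<blocks, threadsPerBlock>>>(d_data, n);
}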

The Scan Operator Definition: the scan operation takes a binary associative operator (+) with identity I and an array of n elements [a_0, a_1, ..., a_(n-1)], and returns the array [I, a_0, (a_0 + a_1), ..., (a_0 + a_1 + ... + a_(n-2))]. Types: inclusive, exclusive, forward, backward.

Parallel Scan (naive)

for d = 1 to log2 n
    for all k in parallel
        if k >= 2^(d-1)
            x[out][k] = x[in][k - 2^(d-1)] + x[in][k]
        else
            x[out][k] = x[in][k]

Complexity: O(n log2 n) additions.
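A minimal CUDA sketch of this naive scan for a single thread block, assuming n is a power of two, the block is launched with n threads and 2*n*sizeof(float) bytes of dynamic shared memory; as written it produces an inclusive scan (shift the result, or load in[tid-1], for the exclusive version). Names are illustrative, not the slides' code.

// Naive (Hillis-Steele) scan within one block, double-buffered in shared memory
// to avoid read-after-write hazards between iterations.
__global__ void scan_naive(float *g_out, const float *g_in, int n)
{
    extern __shared__ float temp[];            // 2*n floats: two buffers
    int tid = threadIdx.x;
    int pout = 0, pin = 1;

    temp[pout * n + tid] = g_in[tid];          // load input into shared memory
    __syncthreads();

    for (int offset = 1; offset < n; offset *= 2)
    {
        pout = 1 - pout;                       // swap the double buffers
        pin  = 1 - pout;
        if (tid >= offset)
            temp[pout * n + tid] = temp[pin * n + tid] + temp[pin * n + tid - offset];
        else
            temp[pout * n + tid] = temp[pin * n + tid];
        __syncthreads();
    }
    g_out[tid] = temp[pout * n + tid];         // inclusive scan result
}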

A work-efficient parallel scan Goal: a parallel scan that performs O(n) work instead of O(n log2 n). Solution: balanced trees. Build a binary tree on the input data and sweep it to and from the root. A binary tree with n leaves has log2 n levels, and level d has 2^d nodes. One add is performed per node, so a single traversal of the tree costs O(n) additions.

O(n) unsegmented scan

Reduce/Up-Sweep:
for d = 0 to log2 n - 1
    for all k = 0 to n-1 by 2^(d+1) in parallel
        x[k + 2^(d+1) - 1] = x[k + 2^d - 1] + x[k + 2^(d+1) - 1]

Down-Sweep:
x[n-1] = 0
for d = log2 n - 1 down to 0
    for all k = 0 to n-1 by 2^(d+1) in parallel
        t = x[k + 2^d - 1]
        x[k + 2^d - 1] = x[k + 2^(d+1) - 1]
        x[k + 2^(d+1) - 1] = t + x[k + 2^(d+1) - 1]
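A plain C rendering of the same up-sweep/down-sweep can help check the index arithmetic before porting it to a kernel. This is a sequential sketch assuming n is a power of two; the function name is illustrative.

// In-place work-efficient (Blelloch) exclusive scan, mirroring the pseudocode above.
void blelloch_scan(int *x, int n)
{
    // Up-sweep (reduce): build partial sums at the internal tree nodes.
    for (int d = 1; d < n; d *= 2)               // d plays the role of 2^level
        for (int k = 0; k < n; k += 2 * d)
            x[k + 2 * d - 1] += x[k + d - 1];

    // Down-sweep: clear the root, then push partial sums back down
    // to produce the exclusive scan.
    x[n - 1] = 0;
    for (int d = n / 2; d >= 1; d /= 2)
        for (int k = 0; k < n; k += 2 * d)
        {
            int t = x[k + d - 1];
            x[k + d - 1] = x[k + 2 * d - 1];
            x[k + 2 * d - 1] += t;
        }
}

For example, blelloch_scan on [3 1 7 0] yields [0 3 4 11].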

Tree analogy [Figure: after the up-sweep the array holds x0, sum(x0..x1), x2, sum(x0..x3), x4, sum(x4..x5), x6, sum(x0..x7); the down-sweep sets the last element to 0 and swaps/adds partial sums back down the tree until the array holds the exclusive scan 0, x0, sum(x0..x1), sum(x0..x2), sum(x0..x3), sum(x0..x4), sum(x0..x5), sum(x0..x6).]

O(n) Segmented Scan Up-Sweep

Down-Sweep

Features of segmented scan About 3 times slower than unsegmented scan. Useful for building a broad variety of applications that are not possible with unsegmented scan alone.

Primitives built on scan
- Enumerate: enumerate([t f f t f t t]) = [0 1 1 1 2 2 3]; an exclusive scan of the 0/1 flag vector (a minimal sequential sketch follows below).
- Distribute (copy): distribute([a b c][d e]) = [a a a][d d]; an inclusive scan of the input vector.
- Split and split-and-segment: split divides the input vector into two pieces, with all the elements marked false on the left side of the output vector and all the elements marked true on the right.
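A sequential sketch that pins down the enumerate semantics (in a parallel implementation the loop is replaced by the scan primitive); the function and parameter names are illustrative.

// enumerate: each element receives the count of true flags before it,
// i.e. an exclusive scan of the 0/1 flag vector.
void enumerate(const int *flags, int *out, int n)   // flags[i] is 0 or 1
{
    int running = 0;
    for (int i = 0; i < n; i++)
    {
        out[i] = running;        // exclusive: trues strictly before position i
        running += flags[i];
    }
}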

Applications
- Quicksort
- Sparse Matrix-Vector Multiply
- Tridiagonal Matrix Solvers and Fluid Simulation
- Radix Sort
- Stream Compaction
- Summed-Area Tables

Quicksort

Sparse Matrix-Vector Multiplication

Stream Compaction Definition: extracts the "interesting" elements from an array and places them contiguously in a new array. Uses: collision detection, sparse matrix compression. [Figure: example array with its gray (interesting) elements compacted into a contiguous output.]

Stream Compaction using scan [Figure: input array A B A D D E C F B with its gray elements scattered to the front of the output.]
Input: we want to preserve the gray elements.
1. Set a '1' in each gray input (0 elsewhere).
2. Exclusive scan the flags.
3. Scatter the gray inputs to the output, using the scan result as the scatter address.
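A minimal CUDA sketch of the scatter step, assuming the 0/1 flags and their exclusive scan have already been computed (e.g. by a scan kernel like the one above, or by CUDPP); names are illustrative.

// Scatter step of stream compaction: element i goes to position scanned[i]
// when flags[i] is set. scanned[] is the exclusive scan of flags[].
__global__ void compact_scatter(float *g_out, const float *g_in,
                                const unsigned *flags, const unsigned *scanned,
                                int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && flags[i])
        g_out[scanned[i]] = g_in[i];   // scan result is the scatter address
}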

Radix Sort Using Scan (split operation, applied once per bit of the input array)
- b = the current sort bit of each key (least significant bit in the first pass)
- e = insert a 1 for all false sort keys (b = 0), 0 otherwise
- f = exclusive scan of the 1s in e
- Total Falses = e[n-1] + f[n-1]
- t = index - f + Total Falses
- d = b ? t : f
- Scatter each input element using d as the scatter address
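A sequential C sketch of one split pass following the recipe above (in a GPU implementation the scan and scatter run in parallel); the function name and the caller-provided scratch buffer f are illustrative.

// One stable split pass of radix sort on bit `bit`:
// false keys (bit == 0) are packed first, true keys after them.
void radix_split(const unsigned *in, unsigned *out, unsigned *f, int n, int bit)
{
    // f = exclusive scan of e, where e[i] = 1 if the key's bit is 0
    int running = 0;
    for (int i = 0; i < n; i++)
    {
        f[i] = running;
        running += ((in[i] >> bit) & 1) ? 0 : 1;
    }
    int totalFalses = running;              // e[n-1] + f[n-1]

    for (int i = 0; i < n; i++)
    {
        int b = (in[i] >> bit) & 1;
        int d = b ? (i - f[i] + totalFalses) : f[i];
        out[d] = in[i];                     // scatter using d as the address
    }
}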

Specialized Libraries CUDPP: CUDA Data Parallel Primitives Library. CUDPP is a library of data-parallel algorithm primitives such as parallel prefix sum ("scan"), parallel sort, and parallel reduction.

CUDPP_DLL CUDPPResult cudppSparseMatrixVectorMultiply(CUDPPHandle sparseMatrixHandle, void *d_y, const void *d_x)
Performs the matrix-vector multiply y = A*x for an arbitrary sparse matrix A and vector x.

CUDPPScanConfig config;
config.direction = CUDPP_SCAN_FORWARD;
config.exclusivity = CUDPP_SCAN_EXCLUSIVE;
config.op = CUDPP_ADD;
config.datatype = CUDPP_FLOAT;
config.maxNumElements = numElements;
config.maxNumRows = 1;
config.rowPitch = 0;

cudppInitializeScan(&config);
cudppScan(d_odata, d_idata, numElements, &config);

CUFFT For fewer than 8192 elements, CUFFT is slower than FFTW; above 8192 elements, it achieves roughly a 5x speedup over threaded FFTW and 10x over serial FFTW.

CUBLAS CUDA-based Linear Algebra Subroutines. Saxpy, conjugate gradient, linear solvers. Example application: 3D reconstruction of planetary nebulae (bs.de/publications/Fernandez08TechReport.pdf).

The GPU variant is about 100 times faster than the CPU version. Matrix size is limited by graphics card memory and texture size. Although taking advantage of sparse matrices would help reduce memory consumption, sparse matrix storage is not implemented by CUBLAS.

Useful Links chFFT/ /2_0/docs/CUBLAS_Library_2.0.pdf