Parallel Prefix Sum (Scan) GPU Graphics Gary J. Katz University of Pennsylvania CIS 665 Adapted from articles taken from GPU Gems III.

Scan
• Definition: The all-prefix-sums operation takes a binary associative operator ⊕ with identity I and an array of n elements [a_0, a_1, …, a_{n-1}], and returns the array [I, a_0, (a_0 ⊕ a_1), …, (a_0 ⊕ a_1 ⊕ … ⊕ a_{n-2})]
• Example: [ ] [ ]

Sequential Scan
out[0] = 0;
for (k = 1; k < n; k++)
    out[k] = in[k-1] + out[k-1];
• Performs n - 1 adds for an array of length n
• Work complexity is O(n)
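The loop above can be wrapped into a small function. This is a minimal sketch for integer addition only; the name exclusive_scan and its signature are illustrative choices, not taken from the slides, and in and out are assumed to be distinct arrays.

```c
#include <assert.h>
#include <stddef.h>

/* Exclusive prefix sum: out[k] = in[0] + ... + in[k-1], and out[0] = 0
   (0 is the identity for addition). `in` and `out` must not overlap. */
void exclusive_scan(const int *in, int *out, size_t n)
{
    assert(in != NULL && out != NULL);
    if (n == 0)
        return;
    out[0] = 0;
    for (size_t k = 1; k < n; k++)
        out[k] = in[k - 1] + out[k - 1];
}
```

For the input {2, 5, 1, 4, 3} this produces {0, 2, 7, 8, 12}.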

Parallel Scan
• Performs O(n log2 n) addition operations
• Assumes there are as many processors as data elements
for (d = 1; d <= log2 n; d++)
    for all k in parallel
        if (k >= 2^(d-1))
            x[k] = x[k - 2^(d-1)] + x[k]

Parallel Scan (n = 8)
Input:  x0         x1         x2         x3         x4         x5         x6         x7
d = 1:  ∑(x0..x0)  ∑(x0..x1)  ∑(x1..x2)  ∑(x2..x3)  ∑(x3..x4)  ∑(x4..x5)  ∑(x5..x6)  ∑(x6..x7)
d = 2:  ∑(x0..x0)  ∑(x0..x1)  ∑(x0..x2)  ∑(x0..x3)  ∑(x1..x4)  ∑(x2..x5)  ∑(x3..x6)  ∑(x4..x7)
d = 3:  ∑(x0..x0)  ∑(x0..x1)  ∑(x0..x2)  ∑(x0..x3)  ∑(x0..x4)  ∑(x0..x5)  ∑(x0..x6)  ∑(x0..x7)
for (d = 1; d <= log2 n; d++)
    for all k in parallel
        if (k >= 2^(d-1))
            x[k] = x[k - 2^(d-1)] + x[k]

Parallel Scan
• What's the problem with this algorithm for the GPU?
for (d = 1; d <= log2 n; d++)
    for all k in parallel
        if (k >= 2^(d-1))
            x[k] = x[k - 2^(d-1)] + x[k]

Parallel Scan
• The GPU needs to double buffer the array, because one thread may overwrite x[k] before another thread has read it
for (d = 1; d <= log2 n; d++)
    for all k in parallel
        if (k >= 2^(d-1))
            x[out][k] = x[in][k - 2^(d-1)] + x[in][k]
        else
            x[out][k] = x[in][k]
    swap(in, out)
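The double-buffered loop can be simulated on the CPU to see what it computes. This is a sketch only: the inner loop over k stands in for "for all k in parallel", the name naive_scan is hypothetical, and the result is an inclusive scan (each output includes its own input element), which matches the diagram for n = 8.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* CPU simulation of the naive double-buffered scan (inclusive result).
   `offset` plays the role of 2^(d-1) in the pseudocode. */
void naive_scan(const int *in, int *out, size_t n)
{
    assert(n > 0);
    int *buf[2];
    buf[0] = malloc(n * sizeof(int));
    buf[1] = malloc(n * sizeof(int));
    assert(buf[0] != NULL && buf[1] != NULL);
    memcpy(buf[0], in, n * sizeof(int));

    int src = 0;
    for (size_t offset = 1; offset < n; offset *= 2) {
        int dst = 1 - src;
        /* Every k reads only from buf[src] and writes only to buf[dst],
           so the iterations are independent of one another. */
        for (size_t k = 0; k < n; k++) {
            if (k >= offset)
                buf[dst][k] = buf[src][k - offset] + buf[src][k];
            else
                buf[dst][k] = buf[src][k];
        }
        src = dst; /* swap the roles of the two buffers */
    }

    memcpy(out, buf[src], n * sizeof(int));
    free(buf[0]);
    free(buf[1]);
}
```

For the input {1, 2, 3, 4, 5, 6, 7, 8} the passes produce {1, 3, 5, 7, 9, 11, 13, 15}, then {1, 3, 6, 10, 14, 18, 22, 26}, then {1, 3, 6, 10, 15, 21, 28, 36}.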

Issues with Current Implementation?
• Only works for 512 elements (one thread block)
• The GPU version performs O(n log2 n) work, while the sequential CPU version performs O(n)

A Work-Efficient Parallel Scan
• Goal: a parallel scan that performs O(n) work instead of O(n log2 n)
• Solution: balanced trees. Build a binary tree on the input data and sweep it to and from the root.
• A binary tree with n leaves has d = log2 n levels, and each level d has 2^d nodes
• One add is performed per node, so a single traversal of the tree performs O(n) adds

Balanced Binary Trees
• A binary tree with n leaves has d = log2 n levels, and each level d has 2^d nodes
• One add is performed per node, so a single traversal of the tree performs O(n) adds
[Figure: tree for n = 8 with levels d = 0, 1, 2, 3]
Two-Phase Algorithm
1. Up-sweep phase
2. Down-sweep phase
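The O(n) bound can be checked by summing the nodes per level. This assumes one add per internal node, with the internal levels numbered d = 0 (the root) through log2 n - 1; the leaves hold the input and need no add.

```latex
\sum_{d=0}^{\log_2 n - 1} 2^{d} = 2^{\log_2 n} - 1 = n - 1
```

Each sweep traverses the tree once, so the two phases together perform roughly 2(n - 1) adds, which is still O(n).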

The Up-Sweep Phase
for (d = 0; d <= log2 n - 1; d++)
    for all k = 0; k < n - 1; k += 2^(d+1) in parallel
        x[k + 2^(d+1) - 1] = x[k + 2^d - 1] + x[k + 2^(d+1) - 1]
Where have we seen this before?

The Down-Sweep Phase
x[n-1] = 0;
for (d = log2 n - 1; d >= 0; d--)
    for all k = 0; k < n - 1; k += 2^(d+1) in parallel
        t = x[k + 2^d - 1]
        x[k + 2^d - 1] = x[k + 2^(d+1) - 1]
        x[k + 2^(d+1) - 1] = t + x[k + 2^(d+1) - 1]
After up-sweep:  x0  ∑(x0..x1)  x2         ∑(x0..x3)  x4         ∑(x4..x5)  x6         ∑(x0..x7)
Zero the last:   x0  ∑(x0..x1)  x2         ∑(x0..x3)  x4         ∑(x4..x5)  x6         0
d = 2:           x0  ∑(x0..x1)  x2         0          x4         ∑(x4..x5)  x6         ∑(x0..x3)
d = 1:           x0  0          x2         ∑(x0..x1)  x4         ∑(x0..x3)  x6         ∑(x0..x5)
d = 0:           0   x0         ∑(x0..x1)  ∑(x0..x2)  ∑(x0..x3)  ∑(x0..x4)  ∑(x0..x5)  ∑(x0..x6)
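The two phases can be combined into one in-place routine. The sketch below runs them sequentially on the CPU, with each inner loop standing in for one parallel step; the name work_efficient_scan is hypothetical, stride corresponds to 2^(d+1) in the pseudocode, and n is assumed to be a power of two.

```c
#include <assert.h>
#include <stddef.h>

/* In-place exclusive scan using an up-sweep followed by a down-sweep.
   n must be a power of two. */
void work_efficient_scan(int *x, size_t n)
{
    assert(n > 0 && (n & (n - 1)) == 0);

    /* Up-sweep (reduce): build partial sums up the tree.
       stride = 2^(d+1) for d = 0 .. log2(n) - 1. */
    for (size_t stride = 2; stride <= n; stride *= 2)
        for (size_t k = 0; k < n; k += stride)
            x[k + stride - 1] += x[k + stride / 2 - 1];

    /* Replace the total at the root with the identity. */
    x[n - 1] = 0;

    /* Down-sweep: push prefix sums back down the tree.
       stride = 2^(d+1) for d = log2(n) - 1 .. 0. */
    for (size_t stride = n; stride >= 2; stride /= 2)
        for (size_t k = 0; k < n; k += stride) {
            int t = x[k + stride / 2 - 1];
            x[k + stride / 2 - 1] = x[k + stride - 1];
            x[k + stride - 1] += t;
        }
}
```

Starting from {1, 2, 3, 4, 5, 6, 7, 8}, the up-sweep leaves {1, 3, 3, 10, 5, 11, 7, 36}, and after the down-sweep the array holds the exclusive scan {0, 1, 3, 6, 10, 15, 21, 28}.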

Current Limitations
• Array sizes are limited to 1024 elements
• Array sizes must be a power of two

Alterations for Arbitrary Sized Arrays
1. Divide the large array into blocks that can each be scanned by a single thread block
2. Scan each block and write the total sum of each block to another array of block sums
3. Scan the block sums, generating an array of block increments
4. Add each block's increment to every element of that block
[Figure: initial array of values → Scan Block 0, Scan Block 1, Scan Block 2, Scan Block 3 → block sums → scan block sums → final array of scanned values]
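The four steps can be sketched on the CPU as follows. The names blocked_scan and scan_block and the block_size parameter are illustrative assumptions; the per-block scan here is a plain sequential loop standing in for the single-thread-block GPU scan, and the last block is allowed to be shorter than the rest.

```c
#include <assert.h>
#include <stdlib.h>

/* Exclusive scan of in[0..n) into out[0..n); returns the sum of all inputs.
   Stand-in for the scan performed by one thread block. */
static int scan_block(const int *in, int *out, size_t n)
{
    int sum = 0;
    for (size_t k = 0; k < n; k++) {
        out[k] = sum;
        sum += in[k];
    }
    return sum;
}

/* Exclusive scan of an array of arbitrary length, processed in blocks. */
void blocked_scan(const int *in, int *out, size_t n, size_t block_size)
{
    assert(block_size > 0);
    if (n == 0)
        return;
    size_t num_blocks = (n + block_size - 1) / block_size;
    int *block_sums = malloc(num_blocks * sizeof(int));
    int *block_incr = malloc(num_blocks * sizeof(int));
    assert(block_sums != NULL && block_incr != NULL);

    /* Steps 1 and 2: scan each block and record its total. */
    for (size_t b = 0; b < num_blocks; b++) {
        size_t start = b * block_size;
        size_t len = (n - start < block_size) ? n - start : block_size;
        block_sums[b] = scan_block(in + start, out + start, len);
    }

    /* Step 3: scan the block sums to get one increment per block. */
    scan_block(block_sums, block_incr, num_blocks);

    /* Step 4: add each block's increment to all of its elements. */
    for (size_t k = 0; k < n; k++)
        out[k] += block_incr[k / block_size];

    free(block_sums);
    free(block_incr);
}
```

With the input {1, 2, …, 10} and block_size 4, the block sums are {10, 26, 19}, the block increments are {0, 10, 36}, and the final result is {0, 1, 3, 6, 10, 15, 21, 28, 36, 45}.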

Applications
• Stream Compaction
• Summed-Area Tables
• Radix Sort

Stream Compaction
• Definition: Extracts the elements of interest from an array and places them contiguously in a new array
• Uses: collision detection, sparse matrix compression
[Figure: input array A B A D D E C F B compacted into a shorter output array]

Stream Compaction
Input: A B A D D E C F B (we want to preserve the gray elements)
1. Set a '1' in each gray input
2. Scan
3. Scatter gray inputs to output using the scan result as the scatter address
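The flag, scan, and scatter steps can be sketched like this. The predicate is_nonzero is a hypothetical stand-in for "the element is gray", and compact is an illustrative name; the scan is written as a sequential loop.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical predicate: keep every non-zero element. */
int is_nonzero(int v)
{
    return v != 0;
}

/* Copies the elements of `in` for which keep() is true into `out`,
   preserving their order. Returns the number of elements kept. */
size_t compact(const int *in, int *out, size_t n, int (*keep)(int))
{
    assert(keep != NULL);
    if (n == 0)
        return 0;
    int *flags = malloc(n * sizeof(int));
    size_t *addr = malloc(n * sizeof(size_t));
    assert(flags != NULL && addr != NULL);

    /* Step 1: set a 1 for each element to preserve. */
    for (size_t k = 0; k < n; k++)
        flags[k] = keep(in[k]) ? 1 : 0;

    /* Step 2: exclusive scan of the flags gives each kept element
       its position in the output. */
    size_t count = 0;
    for (size_t k = 0; k < n; k++) {
        addr[k] = count;
        count += (size_t)flags[k];
    }

    /* Step 3: scatter the kept elements to their scan addresses. */
    for (size_t k = 0; k < n; k++)
        if (flags[k])
            out[addr[k]] = in[k];

    free(flags);
    free(addr);
    return count;
}
```

For the input {3, 0, 4, 0, 0, 5, 1} with is_nonzero, the flags are {1, 0, 1, 0, 0, 1, 1}, the scan addresses are {0, 1, 1, 2, 2, 2, 3}, and the output is {3, 4, 5, 1} with a count of 4.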

Summed Area Tables
• Definition: A 2D table generated from an input image in which each entry stores the sum of all pixels between the entry location and the lower-left corner of the input image
• Uses: Filters of different widths can be applied at every pixel in the image in constant time per pixel

Summed Area Tables
1. Apply a sum scan to all rows of the image
2. Transpose the image
3. Apply a sum scan to all rows of the result
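The three steps can be sketched on a small integer image. Several things here are assumptions for illustration: the names summed_area_table, scan_rows, and transpose, the row-major layout, and the use of an inclusive scan so that each entry includes its own pixel. Row 0, column 0 is treated as the reference corner, and the result is left in transposed form, as produced by step 3.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Inclusive sum scan of every row of a rows x cols row-major matrix. */
static void scan_rows(int *m, size_t rows, size_t cols)
{
    for (size_t r = 0; r < rows; r++)
        for (size_t c = 1; c < cols; c++)
            m[r * cols + c] += m[r * cols + c - 1];
}

/* dst (cols x rows) = transpose of src (rows x cols). */
static void transpose(const int *src, int *dst, size_t rows, size_t cols)
{
    for (size_t r = 0; r < rows; r++)
        for (size_t c = 0; c < cols; c++)
            dst[c * rows + r] = src[r * cols + c];
}

/* Writes the transposed summed-area table into sat: the sum of all pixels
   in rows 0..r and columns 0..c is stored at sat[c * height + r]. */
void summed_area_table(const int *image, int *sat, size_t height, size_t width)
{
    assert(height > 0 && width > 0);
    int *tmp = malloc(height * width * sizeof(int));
    assert(tmp != NULL);
    memcpy(tmp, image, height * width * sizeof(int));

    scan_rows(tmp, height, width);      /* step 1: scan the rows        */
    transpose(tmp, sat, height, width); /* step 2: transpose            */
    scan_rows(sat, width, height);      /* step 3: scan the rows again  */

    free(tmp);
}
```

For the 2 x 3 image with rows {1, 2, 3} and {4, 5, 6}, the transposed table has rows {1, 5}, {3, 12}, and {6, 21}; the last entry, 21, is the sum of the whole image.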

Radix Sort
[Figure: initial array followed by the array after each of five sorting passes]

Radix Sort Using Scan
Input array
b = least significant bit
e = insert a 1 for all false sort keys
f = scan the 1s
Total Falses = e[n-1] + f[n-1]
t = index - f + Total Falses
d = b ? t : f
Scatter input using d as scatter address
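The split step above can be sketched directly from these definitions, reusing the slide's names b, e, f, t, and d. The function names split_by_bit and radix_sort are hypothetical, unsigned is assumed to be at least 32 bits wide, and the scan is a sequential loop. Repeating the split from the least significant bit upward sorts the keys.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* One split pass: keys whose chosen bit is 0 come first, keys whose bit
   is 1 follow, and the relative order inside each group is preserved. */
void split_by_bit(const unsigned *in, unsigned *out, size_t n, unsigned bit)
{
    assert(bit < 32);
    if (n == 0)
        return;
    size_t *e = malloc(n * sizeof(size_t));
    size_t *f = malloc(n * sizeof(size_t));
    assert(e != NULL && f != NULL);

    /* e = 1 for every "false" key (bit is 0). */
    for (size_t i = 0; i < n; i++)
        e[i] = ((in[i] >> bit) & 1u) ? 0 : 1;

    /* f = exclusive scan of e. */
    size_t sum = 0;
    for (size_t i = 0; i < n; i++) {
        f[i] = sum;
        sum += e[i];
    }

    size_t total_falses = e[n - 1] + f[n - 1];

    /* d = b ? t : f, then scatter. */
    for (size_t i = 0; i < n; i++) {
        unsigned b = (in[i] >> bit) & 1u;
        size_t t = i - f[i] + total_falses;
        size_t d = b ? t : f[i];
        out[d] = in[i];
    }

    free(e);
    free(f);
}

/* Sorts keys on their lowest num_bits bits; tmp is scratch space of size n. */
void radix_sort(unsigned *keys, unsigned *tmp, size_t n, unsigned num_bits)
{
    assert(num_bits <= 32);
    for (unsigned bit = 0; bit < num_bits; bit++) {
        split_by_bit(keys, tmp, n, bit);
        memcpy(keys, tmp, n * sizeof(unsigned));
    }
}
```

For the keys {4, 7, 2, 6, 3, 5, 1, 0}, a split on bit 0 gives e = {1, 0, 1, 1, 0, 0, 0, 1}, f = {0, 1, 1, 2, 3, 3, 3, 3}, Total Falses = 4, and the output {4, 2, 6, 0, 7, 3, 5, 1}. Three passes (bits 0 to 2) sort the array.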

Radix Sort Using GPU
• A partial radix sort is performed once for each block
• A scan needs to be performed once for each bit
• The partial sorts are then sorted together using bitonic sort

References
These slides are directly based upon the following resource and are meant for educational purposes only.
• Mark Harris, Shubhabrata Sengupta, John D. Owens, "Parallel Prefix Sum (Scan) with CUDA," GPU Gems 3, Chapter 39