Split Primitive on the GPU

Split Primitive
- Split can be defined as performing append(x, List[category(x)]) for each element x
- Each List holds the elements of one category together

Split Sequential Algorithm

I. Count the number of elements falling into each bin:
    for each element x of list L do
        histogram[category(x)]++          [possible clashes on a category]

II. Find the starting index of each bin (prefix sum):
    for each category m do
        startIndex[m] = startIndex[m-1] + histogram[m-1]

III. Assign each element to the output:
    [initialize localIndex[m] = 0 for all m]
    for each element x of list L do
        itemIndex = localIndex[category(x)]++   [possible clashes on a category]
        globalIndex = startIndex[category(x)]
        outArray[globalIndex + itemIndex] = x
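The three steps above can be sketched as a CPU reference implementation; the function and variable names follow the pseudocode, and `category(x)` is modeled as a precomputed category array (an illustrative choice, not from the slides):

```cpp
#include <vector>
#include <cstddef>

// CPU reference for the three-step split: histogram, prefix sum, scatter.
std::vector<int> splitSequential(const std::vector<int>& elems,
                                 const std::vector<int>& category,
                                 int numBins)
{
    // I. Count the number of elements falling into each bin
    std::vector<int> histogram(numBins, 0);
    for (std::size_t i = 0; i < elems.size(); ++i)
        histogram[category[i]]++;

    // II. Find the starting index of each bin (exclusive prefix sum)
    std::vector<int> startIndex(numBins, 0);
    for (int m = 1; m < numBins; ++m)
        startIndex[m] = startIndex[m - 1] + histogram[m - 1];

    // III. Scatter each element to its slot; localIndex tracks the
    // per-bin offset of the next element of that category
    std::vector<int> localIndex(numBins, 0);
    std::vector<int> outArray(elems.size());
    for (std::size_t i = 0; i < elems.size(); ++i) {
        int c = category[i];
        outArray[startIndex[c] + localIndex[c]++] = elems[i];
    }
    return outArray;
}
```

Because elements are scattered in input order, elements of the same category keep their relative order (the split is stable).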

Split Operation in Parallel
- To parallelize the split algorithm above, we require a clash-free method for building the histogram on the GPU
- This can be achieved on a parallel machine using one of the following two methods:
  - A personal histogram for each processor, followed by merging the histograms
  - Atomic operations on the histogram array(s)

Global Memory Atomic

Split Code:

__global__ void globalHist( unsigned int *histogram, int *gArray, int *category )
{
    for ( int i = 0; i < ELEMENTS_PER_THREAD; i++ )
    {
        int curElement  = gArray[blockIdx.x * blockDim.x * ELEMENTS_PER_THREAD + i * blockDim.x + threadIdx.x];
        int curCategory = category[curElement];
        atomicInc( &histogram[curCategory], 99999 );
    }
}

- Global memory is too slow to access
- A single histogram in global memory (the number of clashes is data dependent)
- Overusing shared memory limits the maximum number of categories to 64
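The same single-shared-histogram idea can be sketched on the CPU with `std::atomic` counters and one software thread per CUDA thread; all names here are illustrative, and the grid-stride indexing stands in for the kernel's per-thread striding:

```cpp
#include <atomic>
#include <thread>
#include <vector>
#include <cstddef>

// CPU analogue of the global-atomic histogram: every worker increments a
// single shared histogram with an atomic add, so clashes are safe but
// serialized by the hardware.
std::vector<unsigned> atomicHistogram(const std::vector<int>& category,
                                      int numBins, int numThreads)
{
    std::vector<std::atomic<unsigned>> hist(numBins);
    for (auto& h : hist) h.store(0);

    std::vector<std::thread> workers;
    for (int t = 0; t < numThreads; ++t) {
        workers.emplace_back([&, t] {
            // Each worker strides over the input, like a grid-stride loop
            for (std::size_t i = t; i < category.size(); i += numThreads)
                hist[category[i]].fetch_add(1);  // clash-free increment
        });
    }
    for (auto& w : workers) w.join();

    std::vector<unsigned> out(numBins);
    for (int b = 0; b < numBins; ++b) out[b] = hist[b].load();
    return out;
}
```

As on the GPU, the cost of this approach is data dependent: many elements of the same category contend on one counter.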

Non-Atomic Approach (He et al.)
- A histogram for each thread
- Combine all the histograms to get the final histogram

__global__ void nonAtomicHistogram( int *gArray, int *category, unsigned int *tHistGlobal )
{
    __shared__ unsigned int tHist[NUMBINS * NUMTHREADS];
    int tx = threadIdx.x;

    for ( int i = 0; i < NUMBINS; i++ )
        tHist[tx * NUMBINS + i] = 0;

    for ( int i = 0; i < ELEMENTS_PER_THREAD; i++ )
    {
        int curElement  = gArray[blockIdx.x * NUMTHREADS * ELEMENTS_PER_THREAD + i * NUMTHREADS + threadIdx.x];
        int curCategory = category[curElement];
        tHist[tx * NUMBINS + curCategory]++;
    }

    for ( int i = 0; i < NUMBINS; i++ )
        tHistGlobal[i * NUMBLOCKS * NUMTHREADS + blockIdx.x * NUMTHREADS + tx] = tHist[tx * NUMBINS + i];
}
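The final write above lays the per-thread histograms out bin-major in global memory (all thread counts for bin 0, then bin 1, ...), so merging is a reduction over each bin's contiguous stripe. A host-side sketch of that merge, with illustrative names:

```cpp
#include <vector>

// Merge per-thread histograms written in the bin-major layout
// tHistGlobal[bin * numThreads + tid], summing each bin's stripe.
std::vector<unsigned> mergeThreadHistograms(
    const std::vector<std::vector<unsigned>>& tHist,  // [thread][bin]
    int numBins)
{
    int numThreads = static_cast<int>(tHist.size());

    // Write out bin-major, as the kernel does
    std::vector<unsigned> tHistGlobal(numBins * numThreads);
    for (int tid = 0; tid < numThreads; ++tid)
        for (int b = 0; b < numBins; ++b)
            tHistGlobal[b * numThreads + tid] = tHist[tid][b];

    // Reduce each bin's contiguous stripe to get the final histogram
    std::vector<unsigned> histogram(numBins, 0);
    for (int b = 0; b < numBins; ++b)
        for (int tid = 0; tid < numThreads; ++tid)
            histogram[b] += tHistGlobal[b * numThreads + tid];
    return histogram;
}
```

The bin-major layout keeps each bin's counts contiguous, which also makes the per-bin prefix sum of the next split step a scan over consecutive memory.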

Shared Memory Atomic
- Global atomic does not use the fast shared memory available
- The non-atomic approach overuses the shared memory
- Incorporating atomic operations on fast shared memory may perform better than the above two approaches
- Shared memory atomic can be performed using one of the techniques below:
  - H/W atomic operations
  - Clash-serial atomic operations
  - Thread-serial atomic operations

SM Atomic :: H/W Atomic

Recent GPUs (G2xx and later) support atomic operations on shared memory.

__global__ void histkernel( unsigned int *blockHists, int *gArray, unsigned int *category )
{
    extern __shared__ unsigned int s_Hist[];

    for ( int pos = threadIdx.x; pos < NUMBINS; pos += blockDim.x )
        s_Hist[pos] = 0;
    __syncthreads();

    for ( int i = 0; i < ELEMENTS_PER_THREAD; i++ )
    {
        unsigned int curElement  = gArray[blockIdx.x * NUMTHREADS * ELEMENTS_PER_THREAD + i * NUMTHREADS + threadIdx.x];
        unsigned int curCategory = category[curElement];
        atomicInc( &s_Hist[curCategory], 9999999 );
    }
    __syncthreads();

    for ( int pos = threadIdx.x; pos < NUMBINS; pos += blockDim.x )
        blockHists[blockIdx.x + gridDim.x * pos] = s_Hist[pos];
}

SM Atomic :: Thread Serial

Threads can be serialized within a warp in order to avoid clashes.

............
for ( int i = 0; i < ELEMENTS_PER_THREAD; i++ )
{
    curElement  = gArray[blockIdx.x * NUMTHREADS * ELEMENTS_PER_THREAD + i * NUMTHREADS + threadIdx.x];
    curCategory = category[curElement];
    for ( int j = 0; j < WARPSIZE; j++ )
        if ( threadIdx.x % WARPSIZE == j )
            s_Hist[curCategory]++;
}

SM Atomic :: Clash Serial
- Each thread writes to the common histogram of the block until it succeeds
- A thread tags its write with its thread ID in order to find out whether it successfully updated the histogram

// Main
for ( int pos = globalTid; pos < NUMELEMENTS; pos += numThreads )
{
    unsigned int curElement  = gArray[pos];
    unsigned int curCategory = category[curElement];
    addData256( s_Hist, curCategory, threadTag );
}

// Clash-serializing function for a warp
__device__ void addData256( volatile unsigned int *s_WarpHist, unsigned int data, unsigned int threadTag )
{
    unsigned int count;
    do {
        count = s_WarpHist[data] & 0x07FFFFFFU;
        count = threadTag | ( count + 1 );
        s_WarpHist[data] = count;
    } while ( s_WarpHist[data] != count );
}
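The trick hinges on packing a 5-bit lane tag into the top bits of each 32-bit counter while the low 27 bits (mask 0x07FFFFFF) hold the count: after writing, a thread re-reads the cell, and if its tag survived, its increment won the clash. A minimal sketch of the packing arithmetic, with hypothetical helper names:

```cpp
#include <cstdint>

// Low 27 bits hold the count; the top 5 bits hold the writer's lane tag.
constexpr std::uint32_t COUNT_MASK = 0x07FFFFFFu;

// Tag for a warp lane: lane id shifted into the top 5 bits
std::uint32_t tagOf(std::uint32_t lane) { return lane << 27; }

// What a thread writes: strip the previous tag, increment, re-tag
std::uint32_t packTagged(std::uint32_t cell, std::uint32_t tag)
{
    std::uint32_t count = cell & COUNT_MASK;
    return tag | (count + 1);
}

// Recover the count regardless of which lane wrote last
std::uint32_t countOf(std::uint32_t cell) { return cell & COUNT_MASK; }
```

Whichever lane's write lands last, the count advances by exactly one per retry loop iteration that terminates, so clashing increments are applied one at a time.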

Comparison of Histogram Methods for 16 Million Elements

Split using Shared Atomic
- Shared atomic is used to build block-level histograms
- A parallel prefix sum is used to compute the starting indices
- The split is performed by each block for the same set of elements used in Step 1
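The middle step is an exclusive prefix sum over the block histograms; a minimal sequential reference of the scan that the GPU computes in parallel (names illustrative), matching Step II of the sequential algorithm:

```cpp
#include <vector>
#include <cstddef>

// Exclusive prefix sum: startIndex[m] = sum of hist[0..m-1].
// Each bin's output slice begins at its startIndex.
std::vector<unsigned> exclusiveScan(const std::vector<unsigned>& hist)
{
    std::vector<unsigned> startIndex(hist.size(), 0);
    for (std::size_t m = 1; m < hist.size(); ++m)
        startIndex[m] = startIndex[m - 1] + hist[m - 1];
    return startIndex;
}
```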

Comparison of Split Methods
- Global atomic suffers for a low number of categories
- Non-atomic can handle a maximum of 64 categories in one pass (multiple passes for more categories)
- Shared atomic performs better than the other two GPU methods and the CPU for a wide range of categories
- Shared memory limits the maximum number of bins to 2048 (for power-of-2 bin counts)

Multi-Level Split
- Bin counts higher than 2K are broken into sub-bins
- A hierarchy of bins is created, and split is performed at each level for the different sub-bins
- The number of splits to be performed grows exponentially with the number of levels
- With 2 levels we can perform a split for up to 4 million bins
- Example: a 32-bit bin is broken into 4 sub-bins of 8 bits each
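Extracting the sub-bins of the example above is simple byte slicing; a sketch with an illustrative helper name, taking level 0 as the most significant byte for a left-to-right hierarchy:

```cpp
#include <cstdint>
#include <array>

// Split a 32-bit bin id into four 8-bit sub-bins, one per level.
std::array<std::uint8_t, 4> subBins(std::uint32_t bin)
{
    return { static_cast<std::uint8_t>(bin >> 24),   // level 0 (most significant)
             static_cast<std::uint8_t>(bin >> 16),   // level 1
             static_cast<std::uint8_t>(bin >> 8),    // level 2
             static_cast<std::uint8_t>(bin) };       // level 3 (least significant)
}
```

Each level's split then only ever sees 256 sub-bins, well under the 2048-bin shared-memory limit.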

Results for Bins up to 4 Million

Multi-level split performed on a GTX 280. Bins from 4K to 512K are handled with 2 passes; results for 1M and 2M bins over 1M elements are computed using 3 passes for better performance.

MLS :: Right to Left
- Using an iterative approach requires a constant number of splits at each level
- Highly scalable due to its iterative nature; an ideal number of bins can be chosen for best performance
- Dividing the bins from right to left requires preserving the order of elements from the previous pass
- The complete list of elements is rearranged at each level
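One right-to-left pass is a stable split on an 8-bit digit; stability is exactly what "preserving the order of elements from the previous pass" requires, since equal digits must keep their relative order. A CPU sketch with illustrative names:

```cpp
#include <vector>
#include <cstdint>

// Stable split of 32-bit keys on the byte at position byteIdx (0 = least
// significant). In-order scatter makes the pass stable.
std::vector<std::uint32_t> stableSplitByByte(
    const std::vector<std::uint32_t>& keys, int byteIdx)
{
    const int NUMBINS = 256;
    auto digit = [byteIdx](std::uint32_t k) { return (k >> (8 * byteIdx)) & 0xFFu; };

    std::vector<int> histogram(NUMBINS, 0), startIndex(NUMBINS, 0);
    for (auto k : keys) histogram[digit(k)]++;
    for (int m = 1; m < NUMBINS; ++m)
        startIndex[m] = startIndex[m - 1] + histogram[m - 1];

    std::vector<std::uint32_t> out(keys.size());
    for (auto k : keys)              // scanning in input order => stable
        out[startIndex[digit(k)]++] = k;
    return out;
}
```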

Ordered Atomic
- Atomic operations perform safe reads/writes by serializing the clashes, but do not guarantee a required order of operation
- An ordered atomic serializes the clashes in a fixed order provided by the user
- In case of a clash at higher levels in the right-to-left split, elements should be inserted in the order of their existing positions in the list

Split on 4 Billion Bins
- The right-to-left split can be used for splitting integers into 4 billion bins (sorting?)
- Integers can be sorted to a desired number of bits (keys can be 8, 16, 24, or 32 bits long; 64 bits too)
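Chaining stable right-to-left passes over successive 8-bit digits yields a full sort over 2^32 bins (an LSD-radix-style scheme); a variable bit-length sort simply stops after fewer passes. A CPU sketch, with illustrative names:

```cpp
#include <vector>
#include <cstdint>

// Sort 32-bit keys by repeated stable splits on 8-bit digits, least
// significant digit first. numPasses = 2 would sort 16-bit keys, etc.
std::vector<std::uint32_t> splitSort32(std::vector<std::uint32_t> keys,
                                       int numPasses = 4)
{
    for (int p = 0; p < numPasses; ++p) {
        auto digit = [p](std::uint32_t k) { return (k >> (8 * p)) & 0xFFu; };

        std::vector<int> hist(256, 0), start(256, 0);
        for (auto k : keys) hist[digit(k)]++;
        for (int m = 1; m < 256; ++m) start[m] = start[m - 1] + hist[m - 1];

        std::vector<std::uint32_t> out(keys.size());
        for (auto k : keys)          // stable scatter, as in each split pass
            out[start[digit(k)]++] = k;
        keys.swap(out);
    }
    return keys;
}
```

Correctness rests on each pass being stable: later (more significant) passes never disturb the order established by earlier ones among equal digits.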

SplitSort Comparison with other GPU Sorting Implementations

Sorting 64 Bit numbers on the GPU

Conclusion
- Various histogram methods implemented on shared memory
- The split operation now handles millions and even billions of bins using the Left-to-Right and Right-to-Left methods of Multi-Level Split
- The shared memory split operation is faster and more scalable than the previous implementation (He et al.)
- Fastest sorting achieved with the extension of split to billions of bins
- Variable bit-length sorting is helpful with keys of varying size (bit length)