Split Primitive on the GPU


1 Split Primitive on the GPU

2 Split Primitive
Split can be defined as performing append(x, List[category(x)]) for each element x, so that each List holds all elements of the same category together. For example, splitting [a2, b1, c2, d1] on the numeric category gives List[1] = [b1, d1] and List[2] = [a2, c2].

3 Split Sequential Algorithm
I. Count the number of elements falling into each bin
    for each element x of list L do
        histogram[category(x)]++                    [possible clashes on a category]
II. Find the starting index for each bin (prefix sum)
    for each category m do
        startIndex[m] = startIndex[m-1] + histogram[m-1]
III. Assign each element to the output
    [initialize localIndex[m] = 0 for every category m]
    for each element x of list L do
        itemIndex = localIndex[category(x)]++       [possible clashes on a category]
        globalIndex = startIndex[category(x)]
        outArray[globalIndex + itemIndex] = x
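A minimal sequential sketch of the three steps in plain C (the function name, array layout, and the element-to-category mapping via a category[] array are illustrative assumptions, not the authors' exact code):

#include <stdlib.h>

/* Sequential split: scatter the elements of in[] into out[] so that
   elements of the same category end up contiguous in the output. */
void split( const int *in, int *out, const int *category, int n, int numBins )
{
    int *histogram  = calloc( numBins, sizeof(int) );
    int *startIndex = calloc( numBins, sizeof(int) );
    int *localIndex = calloc( numBins, sizeof(int) );

    for ( int i = 0; i < n; i++ )               /* I.   histogram   */
        histogram[category[in[i]]]++;
    for ( int m = 1; m < numBins; m++ )         /* II.  prefix sum  */
        startIndex[m] = startIndex[m-1] + histogram[m-1];
    for ( int i = 0; i < n; i++ ) {             /* III. scatter     */
        int c = category[in[i]];
        out[startIndex[c] + localIndex[c]++] = in[i];
    }
    free( histogram ); free( startIndex ); free( localIndex );
}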

4 Split Operation in Parallel
In order to parallelize the above split algorithm, we require a clash-free method for building the histogram on the GPU. This can be achieved on a parallel machine using one of the following two methods:
- A personal histogram for each processor, followed by merging the histograms
- Atomic operations on the histogram array(s)

5 Global Memory Atomic Split
Code:

__global__ void globalHist( unsigned int *histogram, int *gArray, int *category )
{
    int curElement, curCategory;
    for ( int i = 0; i < ELEMENTS_PER_THREAD; i++ )
    {
        // Coalesced read: consecutive threads read consecutive elements
        curElement = gArray[blockIdx.x * blockDim.x * ELEMENTS_PER_THREAD + ( i * blockDim.x ) + threadIdx.x];
        curCategory = category[curElement];
        atomicInc( &histogram[curCategory], 0xFFFFFFFFU );   // atomic increment, never wraps
    }
}

- Global memory is too slow to access
- With a single histogram in global memory, the number of clashes is data dependent
- The shared-memory alternative of per-thread histograms (next slide) limits the maximum number of categories to 64

6 Non-Atomic Approach (He et al.)
A histogram for each thread; all the histograms are then combined to get the final histogram.

__global__ void nonAtomicHistogram( int *gArray, int *category, unsigned int *tHistGlobal )
{
    int curElement, curCategory;
    // One private histogram per thread in shared memory
    __shared__ unsigned int tHist[NUMBINS * NUMTHREADS];
    for ( int i = 0; i < NUMBINS; i++ )
        tHist[threadIdx.x * NUMBINS + i] = 0;
    for ( int i = 0; i < ELEMENTS_PER_THREAD; i++ )
    {
        curElement = gArray[blockIdx.x * NUMTHREADS * ELEMENTS_PER_THREAD + ( i * NUMTHREADS ) + threadIdx.x];
        curCategory = category[curElement];
        tHist[threadIdx.x * NUMBINS + curCategory]++;   // no clashes: private histogram
    }
    // Write the per-thread histograms out for a later merge step
    for ( int i = 0; i < NUMBINS; i++ )
        tHistGlobal[i * NUMBLOCKS * NUMTHREADS + blockIdx.x * NUMTHREADS + threadIdx.x] = tHist[threadIdx.x * NUMBINS + i];
}

7 Shared Memory Atomic
Global Atomic does not use the fast shared memory available, while the Non-Atomic approach overuses it. Incorporating atomic operations on fast shared memory may therefore perform better than either of the above two approaches. Shared-memory atomics can be performed using one of the following techniques:
- H/W atomic operations
- Clash-serial atomic operations
- Thread-serial atomic operations

8 SM Atomic :: H/W Atomic
Latest GPUs (G2xx and later) support atomic operations on the shared memory.

__global__ void histKernel( unsigned int *blockHists, int *gArray, unsigned int *category )
{
    extern __shared__ unsigned int s_Hist[];   // one histogram per block
    unsigned int curElement, curCategory;
    // Cooperatively clear the block histogram
    for ( int pos = threadIdx.x; pos < NUMBINS; pos += blockDim.x )
        s_Hist[pos] = 0;
    __syncthreads();
    for ( int i = 0; i < ELEMENTS_PER_THREAD; i++ )
    {
        curElement = gArray[blockIdx.x * NUMTHREADS * ELEMENTS_PER_THREAD + ( i * NUMTHREADS ) + threadIdx.x];
        curCategory = category[curElement];
        atomicInc( &s_Hist[curCategory], 0xFFFFFFFFU );   // h/w shared-memory atomic
    }
    __syncthreads();
    // Write the block histogram to global memory
    for ( int pos = threadIdx.x; pos < NUMBINS; pos += blockDim.x )
        blockHists[blockIdx.x + gridDim.x * pos] = s_Hist[pos];
}

9 SM Atomic :: Thread Serial
Threads can be serialized within a warp in order to avoid clashes.

…
for ( int i = 0; i < ELEMENTS_PER_THREAD; i++ )
{
    curElement = gArray[blockIdx.x * NUMTHREADS * ELEMENTS_PER_THREAD + ( i * NUMTHREADS ) + threadIdx.x];
    curCategory = category[curElement];
    // Lanes of the warp take turns, one per iteration, so no two threads
    // update the histogram at once (assumes one histogram per warp)
    for ( int j = 0; j < WARPSIZE; j++ )
        if ( ( threadIdx.x % WARPSIZE ) == j )
            s_Hist[curCategory]++;
}

10 SM Atomic :: Clash Serial
Each thread writes to the common histogram of the block, retrying until it succeeds. A thread tags its write with its thread ID in order to find out whether it was the one that successfully updated the histogram.

// Main loop: threadTag holds the warp lane ID in the upper 5 bits
for ( int pos = globalTid; pos < NUMELEMENTS; pos += numThreads )
{
    unsigned int curElement  = gArray[pos];
    unsigned int curCategory = category[curElement];
    addData256( s_Hist, curCategory, threadTag );
}

// Clash-serializing function for a warp
__device__ void addData256( volatile unsigned int *s_WarpHist, unsigned int data, unsigned int threadTag )
{
    unsigned int count;
    do {
        count = s_WarpHist[data] & 0x07FFFFFFU;   // read the count, drop the old tag
        count = threadTag | ( count + 1 );        // increment and stamp our own tag
        s_WarpHist[data] = count;
        // If another lane overwrote us, its tag differs from ours: retry
    } while ( s_WarpHist[data] != count );
}

11 Comparison of Histogram Methods for 16 Million Elements

12 Split using Shared Atomic
1. Shared-memory atomics are used to build block-level histograms
2. A parallel prefix sum is used to compute the starting index of each bin
3. The split (scatter) is performed by each block on the same set of elements used in Step 1, as sketched below
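A minimal sketch of the scatter step (Step 3), assuming startIndex[] holds the per-block, per-bin starting offsets produced by the prefix sum, laid out the same way as the block histograms of slide 8; the kernel name splitScatter and the details are illustrative, not the authors' exact code:

__global__ void splitScatter( int *gArray, int *category, unsigned int *startIndex, int *outArray )
{
    // Per-block running offsets, seeded from the global prefix sum
    __shared__ unsigned int s_Offset[NUMBINS];
    for ( int pos = threadIdx.x; pos < NUMBINS; pos += blockDim.x )
        s_Offset[pos] = startIndex[blockIdx.x + gridDim.x * pos];
    __syncthreads();
    for ( int i = 0; i < ELEMENTS_PER_THREAD; i++ )
    {
        // Same elements, same access pattern as in the histogram step
        int curElement  = gArray[blockIdx.x * blockDim.x * ELEMENTS_PER_THREAD + ( i * blockDim.x ) + threadIdx.x];
        int curCategory = category[curElement];
        // Atomically claim the next output slot for this bin; note that a
        // plain atomic does not fix the order among clashing threads
        unsigned int idx = atomicInc( &s_Offset[curCategory], 0xFFFFFFFFU );
        outArray[idx] = curElement;
    }
}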

13 Comparison of Split Methods
- Global Atomic suffers for a low number of categories
- Non-Atomic can handle at most 64 categories in one pass (multiple passes for more categories)
- Shared Atomic performs better than the other two GPU methods and the CPU for a wide range of categories
- Shared memory limits the maximum number of bins to 2048 (for power-of-2 bin counts)

14 Multi Level Split
Bin counts higher than 2K are handled by breaking bins into sub-bins. A hierarchy of bins is created and a split is performed at each level for the different sub-bins. The number of splits to be performed grows exponentially with the number of levels. With 2 levels we can perform a split for up to 4 million bins.
[Figure: a 32-bit bin broken into 4 sub-bins of 8 bits each]
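As an illustration of the figure's decomposition, a hypothetical helper that extracts the sub-bin used at a given level (the 8-bit width matches the figure; the actual implementation may choose a different width, e.g. up to 11 bits for 2048 bins per level):

// Extract the 8-bit sub-bin of 'bin' used at the given level, counting
// levels from the most significant byte (left to right): level 0 uses
// bits 31..24, level 1 uses bits 23..16, and so on.
__host__ __device__ unsigned int subBin( unsigned int bin, int level )
{
    return ( bin >> ( 24 - 8 * level ) ) & 0xFFU;
}

The first pass splits all elements on subBin(bin, 0); each resulting bucket is then split on subBin(bin, 1), and so on down the hierarchy.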

15 Results for Bins up to 4 Million
Multi Level Split performed on a GTX 280. Bins from 4K to 512K are handled with 2 passes; results for 1M and 2M bins on 1M elements are computed using 3 passes for better performance.

16 MLS :: Right to Left
Using an iterative approach requires a constant number of splits at each level. It is highly scalable due to its iterative nature, and an ideal number of bins can be chosen for best performance. Dividing the bins from right to left requires preserving the order of elements from the previous pass, so the complete list of elements is re-arranged at each level.
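A minimal host-side sketch of the right-to-left iteration, assuming a hypothetical splitPass that performs one stable split on the given bit range (a stable pass is what preserves the order established by earlier passes; see the ordered atomic on the next slide):

// Right-to-left multi-level split: split on the least significant
// BITS_PER_PASS bits first, then the next BITS_PER_PASS bits, etc.
void multiLevelSplitRL( int *d_in, int *d_out, int keyBits )
{
    const int BITS_PER_PASS = 8;   // 256 bins per pass (illustrative choice)
    for ( int shift = 0; shift < keyBits; shift += BITS_PER_PASS )
    {
        // Stable split of all elements on bits [shift, shift + BITS_PER_PASS)
        splitPass( d_in, d_out, shift, BITS_PER_PASS );
        // The output of this pass becomes the input of the next
        int *tmp = d_in; d_in = d_out; d_out = tmp;
    }
}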

17 Ordered Atomic
Atomic operations perform safe reads/writes by serializing the clashes, but they do not guarantee any particular order of operation. An ordered atomic serializes the clashes in a fixed order provided by the user. In case of a clash at the higher levels of the right-to-left split, elements should be inserted in the order of their existing position in the list.
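A minimal sketch of one way to realize an ordered atomic within a warp, by letting lanes take their turns in increasing lane order (this assumes thread i holds an earlier element than thread i+1 and one offset array per warp; it illustrates the idea, not the authors' exact code):

// Ordered increment: clashing lanes of a warp update s_Offset in
// increasing lane order, so earlier elements claim earlier output slots.
__device__ unsigned int orderedInc( volatile unsigned int *s_Offset, unsigned int bin )
{
    unsigned int myIdx = 0;
    for ( int lane = 0; lane < WARPSIZE; lane++ )
        if ( ( threadIdx.x % WARPSIZE ) == lane )
            myIdx = s_Offset[bin]++;   // only one lane active: order is fixed
    return myIdx;
}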

18 Split on 4 Billion Bins
The right-to-left split can be used to split integers into up to 4 billion bins (in effect, sorting). Integers can be sorted on the desired number of bits (keys can be 8, 16, 24, or 32 bits long; 64-bit keys work too).

19 SplitSort Comparison with other GPU Sorting Implementations

20 Sorting 64 Bit numbers on the GPU

21 Conclusion
- Various histogram methods implemented on shared memory
- The split operation now handles millions and billions of bins using the Left-to-Right and Right-to-Left methods of Multi-Level Split
- The shared-memory split operation is faster and more scalable than the previous implementation (He et al.)
- The fastest sorting is achieved by extending split to billions of bins
- Variable bit-length sorting helps with keys of varying size (bit length)

