Parallel Programming – OpenMP, Scan, Work Complexity, and Step Complexity David Monismith CS599 Based upon notes from GPU Gems 3, Chapter 39 - http://http.developer.nvidia.com/GPUGems3/gpugems3_ch39.html
Work Complexity In terms of data how many or what order of operations (Big-Oh) are required to perform a particular algorithm Example – summation Given an array of n elements, it takes O(n) operations to find the sum of those elements on a single core processor. We want parallel algorithms to be work efficient – to take the same amount of work (operations) as the serial version. This means that a parallelized version of sum should take O(n) operations (additions) to finish.
Reduction Work Complexity Recall – we have studied reduction In class we have seen that a sum reduction takes O(n) work to run to completion. The sum reduction takes the same order of work (operations) to complete as the serial version. Therefore, we say a sum reduction is work-efficient.
Prefix Sum (Scan) The prefix sum or scan operation is useful for computing cumulative sums. Example: Given the following array, compute the prefix sum. +---+---+---+---+---+---+ | 7 | 2 | 3 | 1 | 9 | 2 | 0 1 2 3 4 5
Uses for Scan Sorting Lexical Analysis String Comparison Polynomial Evaluation Stream Compaction Building Histograms and Data Structures
Exclusive Prefix Sum (Exclusive Scan) Example: Given the following array, compute the prefix sum. +---+---+---+---+---+---+ | 7 | 2 | 3 | 1 | 9 | 2 | 0 1 2 3 4 5 | 0 | 7 | 9 | 12| 13| 22|
Inclusive Prefix Sum (Inclusive Scan) Example: Given the following array, compute the prefix sum. +---+---+---+---+---+---+ | 7 | 2 | 3 | 1 | 9 | 2 | 0 1 2 3 4 5 | 7 | 9 | 12| 13| 22| 24|
Naïve Prefix Sum (Exclusive Scan) Algorithm int arr[N] = {7, 2, 3, 1, 9, 2}; int prefixSum[N]; int i, j; prefixSum[0] = 0; for(i = 1; i < N; i++) prefixSum[i]=prefixSum[i-1]+arr[j-1];
Naïve Parallel Inclusive Scan (Hillis and Steele 1986) +---+---+---+---+---+---+ | 7 | 2 | 3 | 1 | 9 | 2 | 0 1 2 3 4 5 | \ | \ | \ | \ | \ | V V V V V V | 7 | 9 | 5 | 4 | 10| 11| \ \ | \ | \ | | \ \| \| \| | \ \ \ \ | \ |\ |\ |\ | \ | \ | \ | \ | \| \| \| \| V V V V | 7 | 9 | 12| 13| 15| 15|
Parallel Scan Example Continued +---+---+---+---+---+---+ | 7 | 9 | 12| 13| 15| 15| 0 1 2 3 4 5 \ \ | | \ \ | | \ \ | | \ \ | | \ \ | | \ \ | | \ \ | | \ \ | | \ \ | | \ \| | \ \ | \ |\ | \ | \ | \| \| V V | 7 | 9 | 12| 13| 22| 24|
Naïve Parallel Scan const int N = 6; int log_of_N = log2(8); int arr[N] = {7, 2, 3, 1, 9, 2}; int scanSum[2][N]; int i, j; int in = 1, out = 0; memcpy(scanSum[out], arr, N*sizeof(int)); memset(scanSum[in], 0, int two_i = 0, two_i_1 = 0;
Naïve Parallel Scan for(i = 1; i <= log_of_N; i++) { two_i_1 = 1 << (i-1); two_i = 1 << i; out = 1 - out; in = 1 - out; for(j = 0; j < N; j++) if(j >= two_i_1) scanSum[out][j] = scanSum[in][j] + scanSum[in][j - two_i_1]; else scanSum[out][j] = scanSum[in][j]; }
Adding OpenMP Parallelism for(i = 1; i <= log_of_N; i++) { two_i_1 = 1 << (i-1); two_i = 1 << i; out = 1 - out; in = 1 - out; #pragma omp parallel for private(j) shared(scanSum, in, out) for(j = 0; j < N; j++) if(j >= two_i_1) scanSum[out][j] = scanSum[in][j] + scanSum[in][j - two_i_1]; else scanSum[out][j] = scanSum[in][j]; }
Naïve Parallel Scan Analysis Not work efficient Performs O(n lg n) additions Note that the serial version performs O(n) additions Requires O(lg n) steps This means we can complete a scan in O(lg n) time if we have O(n lg n) processors
Work Efficient Parallel Scan (Blelloch 1990) Goal – build a scan algorithm that avoids the extra O(lg n) work Solution – make use of balanced binary trees Build a balanced binary tree on the input data and sweep the tree to compute the prefix sum Consists of two parts Up Sweep Traverse tree from leaves to root computing partial sums during the traversal The root node (last element in the array) will hold the sum of all elements Down Sweep Traverse from root to leaves using the partial sums to compute the cumulative sums in place. For the exclusive prefix sum, we replace the root with zero
+---+---+---+---+---+---+---+---+ | 7 | 9 | 3 | 13| 9 | 11| 8 | 36| ^ /| / | / | / | / | | 7 | 9 | 3 | 13| 9 | 11| 8 | 23| ^ ^ /| /| / | / | / | / | | 7 | 9 | 3 | 4 | 9 | 11| 8 | 12| ^ ^ ^ ^ / | / | / | / | / | / | / | / | | 7 | 2 | 3 | 1 | 9 | 2 | 8 | 4 | 0 1 2 3 4 5 6 7 Blelloch Up Sweep
Blelloch Algorithm – Up Sweep for(i = 0; i <= log_of_N_1; i++) { two_i_p1 = 1 << (i+1); two_i = 1 << i; for(j = 0; j < N; j+=two_i_p1) scanSum[j + two_i_p1 - 1] = scanSum[j + two_i - 1] + scanSum[j + two_i_p1 - 1]; }
+---+---+---+---+---+---+---+---+ | 7 | 9 | 3 | 13| 9 | 11| 8 | 36| zero| V | 7 | 9 | 3 | 13| 9 | 11| 8 | 0 | \ / | \ / | / \ | zero / \ | V V | 7 | 9 | 3 | 0 | 9 | 11| 8 | 13| \ / | \ / | / \ | / \ | V V V V | 7 | 0 | 3 | 9 | 9 | 13| 8 | 24| \ /| \ /| \ /| \ /| / \| / \| / \| / \| V V V V V V V V | 0 | 7 | 9 | 12| 13| 22| 24| 32| 0 1 2 3 4 5 6 7 Blelloch Down Sweep
Blelloch Algorithm – Down Sweep scanSum[N-1] = 0; for(i = log_of_N_1; i >= 0; i--) { two_i_p1 = 1 << (i+1); two_i = 1 << i; for(j = 0; j < N; j+=two_i_p1) { long t = scanSum[j + two_i - 1]; scanSum[j + two_i - 1] = scanSum[j + two_i_p1 - 1]; scanSum[j + two_i_p1 - 1] = t + }
Analysis This algorithm does complete the summation in O(n) operations, giving the appearance of efficiency. Actually, the algorithm requires 2*(n-1) additions and n-1 swaps to run to completion. For large arrays, this algorithm will indeed out-perform the Hillis and Steele Algorithm.