Download presentation
Presentation is loading. Please wait.
Published byElinor Johnson Modified over 9 years ago
1
Parallel Programming – OpenMP, Scan, Work Complexity, and Step Complexity
David Monismith CS599 Based upon notes from GPU Gems 3, Chapter
2
Work Complexity In terms of data how many or what order of operations (Big-Oh) are required to perform a particular algorithm Example – summation Given an array of n elements, it takes O(n) operations to find the sum of those elements on a single core processor. We want parallel algorithms to be work efficient – to take the same amount of work (operations) as the serial version. This means that a parallelized version of sum should take O(n) operations (additions) to finish.
3
Reduction Work Complexity
Recall – we have studied reduction In class we have seen that a sum reduction takes O(n) work to run to completion. The sum reduction takes the same order of work (operations) to complete as the serial version. Therefore, we say a sum reduction is work-efficient.
4
Prefix Sum (Scan) The prefix sum or scan operation is useful for computing cumulative sums. Example: Given the following array, compute the prefix sum. | 7 | 2 | 3 | 1 | 9 | 2 |
5
Uses for Scan Sorting Lexical Analysis String Comparison
Polynomial Evaluation Stream Compaction Building Histograms and Data Structures
6
Exclusive Prefix Sum (Exclusive Scan)
Example: Given the following array, compute the prefix sum. | 7 | 2 | 3 | 1 | 9 | 2 | | 0 | 7 | 9 | 12| 13| 22|
7
Inclusive Prefix Sum (Inclusive Scan)
Example: Given the following array, compute the prefix sum. | 7 | 2 | 3 | 1 | 9 | 2 | | 7 | 9 | 12| 13| 22| 24|
8
Naïve Prefix Sum (Exclusive Scan) Algorithm
int arr[N] = {7, 2, 3, 1, 9, 2}; int prefixSum[N]; int i, j; prefixSum[0] = 0; for(i = 1; i < N; i++) prefixSum[i]=prefixSum[i-1]+arr[j-1];
9
Naïve Parallel Inclusive Scan (Hillis and Steele 1986)
| 7 | 2 | 3 | 1 | 9 | 2 | | \ | \ | \ | \ | \ | V V V V V V | 7 | 9 | 5 | 4 | 10| 11| \ \ | \ | \ | | \ \| \| \| | \ \ \ \ | \ |\ |\ |\ | \ | \ | \ | \ | \| \| \| \| V V V V | 7 | 9 | 12| 13| 15| 15|
10
Parallel Scan Example Continued
| 7 | 9 | 12| 13| 15| 15| \ \ | | \ \ | | \ \ | | \ \ | | \ \ | | \ \ | | \ \ | | \ \ | | \ \ | | \ \| | \ \ | \ |\ | \ | \ | \| \| V V | 7 | 9 | 12| 13| 22| 24|
11
Naïve Parallel Scan const int N = 6; int log_of_N = log2(8);
int arr[N] = {7, 2, 3, 1, 9, 2}; int scanSum[2][N]; int i, j; int in = 1, out = 0; memcpy(scanSum[out], arr, N*sizeof(int)); memset(scanSum[in], 0, int two_i = 0, two_i_1 = 0;
12
Naïve Parallel Scan for(i = 1; i <= log_of_N; i++) { two_i_1 = 1 << (i-1); two_i = 1 << i; out = 1 - out; in = 1 - out; for(j = 0; j < N; j++) if(j >= two_i_1) scanSum[out][j] = scanSum[in][j] + scanSum[in][j - two_i_1]; else scanSum[out][j] = scanSum[in][j]; }
13
Adding OpenMP Parallelism
for(i = 1; i <= log_of_N; i++) { two_i_1 = 1 << (i-1); two_i = 1 << i; out = 1 - out; in = 1 - out; #pragma omp parallel for private(j) shared(scanSum, in, out) for(j = 0; j < N; j++) if(j >= two_i_1) scanSum[out][j] = scanSum[in][j] + scanSum[in][j - two_i_1]; else scanSum[out][j] = scanSum[in][j]; }
14
Naïve Parallel Scan Analysis
Not work efficient Performs O(n lg n) additions Note that the serial version performs O(n) additions Requires O(lg n) steps This means we can complete a scan in O(lg n) time if we have O(n lg n) processors
15
Work Efficient Parallel Scan (Blelloch 1990)
Goal – build a scan algorithm that avoids the extra O(lg n) work Solution – make use of balanced binary trees Build a balanced binary tree on the input data and sweep the tree to compute the prefix sum Consists of two parts Up Sweep Traverse tree from leaves to root computing partial sums during the traversal The root node (last element in the array) will hold the sum of all elements Down Sweep Traverse from root to leaves using the partial sums to compute the cumulative sums in place. For the exclusive prefix sum, we replace the root with zero
16
| 7 | 9 | 3 | 13| 9 | 11| 8 | 36| ^ /| / | / | / | / | | 7 | 9 | 3 | 13| 9 | 11| 8 | 23| ^ ^ /| /| / | / | / | / | | 7 | 9 | 3 | 4 | 9 | 11| 8 | 12| ^ ^ ^ ^ / | / | / | / | / | / | / | / | | 7 | 2 | 3 | 1 | 9 | 2 | 8 | 4 | Blelloch Up Sweep
17
Blelloch Algorithm – Up Sweep
for(i = 0; i <= log_of_N_1; i++) { two_i_p1 = 1 << (i+1); two_i = 1 << i; for(j = 0; j < N; j+=two_i_p1) scanSum[j + two_i_p1 - 1] = scanSum[j + two_i - 1] + scanSum[j + two_i_p1 - 1]; }
18
| 7 | 9 | 3 | 13| 9 | 11| 8 | 36| zero| V | 7 | 9 | 3 | 13| 9 | 11| 8 | 0 | \ / | \ / | / \ | zero / \ | V V | 7 | 9 | 3 | 0 | 9 | 11| 8 | 13| \ / | \ / | / \ | / \ | V V V V | 7 | 0 | 3 | 9 | 9 | 13| 8 | 24| \ /| \ /| \ /| \ /| / \| / \| / \| / \| V V V V V V V V | 0 | 7 | 9 | 12| 13| 22| 24| 32| Blelloch Down Sweep
19
Blelloch Algorithm – Down Sweep
scanSum[N-1] = 0; for(i = log_of_N_1; i >= 0; i--) { two_i_p1 = 1 << (i+1); two_i = 1 << i; for(j = 0; j < N; j+=two_i_p1) { long t = scanSum[j + two_i - 1]; scanSum[j + two_i - 1] = scanSum[j + two_i_p1 - 1]; scanSum[j + two_i_p1 - 1] = t + }
20
Analysis This algorithm does complete the summation in O(n) operations, giving the appearance of efficiency. Actually, the algorithm requires 2*(n-1) additions and n-1 swaps to run to completion. For large arrays, this algorithm will indeed out-perform the Hillis and Steele Algorithm.
Similar presentations
© 2025 SlidePlayer.com. Inc.
All rights reserved.