Lecture 7
Announcements: Section
Have you been to section? Why or why not?
A. I have class and cannot make either time
B. I have work and cannot make either time
C. I went and found section helpful
D. I went and did not find section helpful
What else can you say about section?
A. It's not clear what the purpose of section is
B. There are other things I'd like to see covered in section
C. I didn't go
D. Both A and B
E. Both A and C
Parallel Sorting
- Sorting is a fundamental algorithm in data processing.
- Given an unordered set of keys x0, x1, …, xN-1, return the keys in sorted order.
- The keys may be character strings, floating point numbers, integers, or any object for which the relations >, <, and == hold. We'll assume integers.
- In practice we sort on external media, i.e. disk, but we'll consider in-memory sorting.
- There are many parallel sorts. We'll implement Merge Sort in A2.
Serial Merge Sort algorithm
- A divide and conquer algorithm.
- We stop the recursion when we reach a certain size g, and sort each piece with a fast local sort.
- We merge data in odd-even pairs. Each partner gets the smallest (largest) N/P values and discards the rest.
- The merge runs in O(m+n), assuming two vectors of size m and n.
[Figure: recursion tree for 4 2 7 8 5 1 3 6, split down to the g limit and merged back up to 1 2 3 4 5 6 7 8. Dan Harvey, S. Oregon Univ.]
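To make the structure concrete, here is a minimal serial merge sort sketch in C++ following these bullets; the cutoff value and the use of std::sort and std::inplace_merge are our assumptions, not the course's provided code.

#include <algorithm>
#include <vector>

const int g = 32;   // recursion cutoff (assumed value)

// Sorts a[lo:hi). Recurse down to blocks of size g, sort those with a fast
// local sort, then merge sibling halves on the way back up.
void mergeSort(std::vector<int>& a, int lo, int hi) {
    if (hi - lo <= g) {
        std::sort(a.begin() + lo, a.begin() + hi);   // fast local sort
        return;
    }
    int mid = lo + (hi - lo) / 2;
    mergeSort(a, lo, mid);
    mergeSort(a, mid, hi);
    // O(m+n) merge of the two sorted halves
    std::inplace_merge(a.begin() + lo, a.begin() + mid, a.begin() + hi);
}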
Why might the lists to be merged have different sizes?
A. Because the median value might not be in the middle
B. Because the mean value might not be in the middle
C. Both A & B
D. Not sure
Parallel Merge Sort algorithm
- At each level of recursion we use twice the number of cores as at the previous level.
- When we are running on all the cores, we stop spawning threads and switch to the serial merge sort algorithm.
- As with the serial algorithm, we stop the recursion when we reach a certain size g.
- Threads merge data in odd-even pairs.
- The simplest algorithm uses a sequential merge, but we'll implement a parallel merge. Start with the serial merge, though!
[Figure: recursion tree for 4 2 7 8 5 1 3 6 with a thread limit, merging back up to 1 2 3 4 5 6 7 8. Dan Harvey, S. Oregon Univ.]
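A hedged sketch of this strategy, building on the serial mergeSort above (same includes and g); the thread budget passed down through the `threads` parameter is our assumption about one way to enforce the limit.

#include <thread>

// Spawn a thread for one half at each level, halving the budget, until the
// thread budget is exhausted; from there on, run the serial algorithm.
void parallelMergeSort(std::vector<int>& a, int lo, int hi, int threads) {
    if (hi - lo <= g) {
        std::sort(a.begin() + lo, a.begin() + hi);
        return;
    }
    int mid = lo + (hi - lo) / 2;
    if (threads > 1) {
        std::thread left(parallelMergeSort, std::ref(a), lo, mid, threads / 2);
        parallelMergeSort(a, mid, hi, threads - threads / 2);
        left.join();   // wait for the partner half before merging
    } else {
        mergeSort(a, lo, mid);   // thread limit reached: serial from here down
        mergeSort(a, mid, hi);
    }
    std::inplace_merge(a.begin() + lo, a.begin() + mid, a.begin() + hi);
}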
Merge Sort with different values of g
[Figure: two runs on 4 2 7 8 5 1 3 6, both with a thread limit of 2: one with g=2 (serial sort on blocks of two, then merges) and one with g=1 (recursion all the way down, then merges).]
- In general g << N/#threads, so you'll reach the thread limit before the 'g' limit.
Serial Merge
[Figure: Thread 0 holds -1 3 7 9 11 and Thread 1 holds 2 4 8 12 14; the merge step combines them.]
- The leftmost thread does the merging and sorts the merged list.
- Parallelism diminishes as we move up the recursion tree.
- There is only O(log n) parallelism, but if we stop the recursion before reaching the bottom of the tree, it's much smaller.
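For reference, a minimal two-pointer version of the O(m+n) serial merge (the function name and vector interface are our own):

#include <vector>

std::vector<int> serialMerge(const std::vector<int>& A, const std::vector<int>& B) {
    std::vector<int> C;
    C.reserve(A.size() + B.size());
    size_t i = 0, j = 0;
    while (i < A.size() && j < B.size())              // each iteration emits one key,
        C.push_back(A[i] <= B[j] ? A[i++] : B[j++]);  // so the loop body runs m+n times
    while (i < A.size()) C.push_back(A[i++]);         // drain whichever list remains
    while (j < B.size()) C.push_back(B[j++]);
    return C;
}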
What is the running time of parallel merge sort (with serial merge)?
A. O(N)
B. O(N log N)
C. O(N²)
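One way to reason about it (our sketch, not from the slides): even with unlimited threads the two recursive sorts can run concurrently, but the final merge is serial and costs $\Theta(N)$, so the critical path satisfies

$$T_\infty(N) = T_\infty(N/2) + \Theta(N) = \Theta(N)$$

while the total work remains the usual $\Theta(N \log N)$.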
What should be done if the maximum limit on the number of threads is reached and the block size is still greater than g?
A. We should continue to split the work further until we reach a block size of g, but without spawning any more threads
B. We should switch to the serial MergeSort algorithm
C. We should stop splitting the work and use some other sorting algorithm
D. A & B
E. A & C
Merge
- Recall that we handled the merge with just 1 thread.
- As we return from the recursion we use fewer and fewer threads: at the top level, we are merging the entire list on just 1 thread.
- As a result, there is Θ(lg N) parallelism.
- There is a parallel merge algorithm that can do better.
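A hedged aside on where the Θ(lg N) figure comes from: parallelism is total work divided by the critical path, and with a serial top-level merge the critical path is Θ(N), so

$$\text{parallelism} = \frac{T_1(N)}{T_\infty(N)} = \frac{\Theta(N \lg N)}{\Theta(N)} = \Theta(\lg N).$$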
Parallel Merge - Preliminaries
- Assume we are merging N = m+n elements stored in two arrays A and B of length m and n, respectively.
- Assume m ≥ n (switch A and B if necessary).
- Locate the median of A.
[Figure: arrays A (length m) and B (length n).]
Parallel Merge Strategy
- Search for the B[j] closest to, but not larger than, A[m/2] (assumes no duplicates).
- Thus, when we insert A[m/2] between B[0:j-1] and B[j:n-1], the list remains sorted.
- Recursively merge into a new array C[]:
  C[0 : j+m/2-2] ← (A[0:m/2-1], B[0:j-1])
  C[j+m/2 : N-1] ← (A[m/2+1:m-1], B[j:n-1])
  C[j+m/2-1] ← A[m/2]
[Figure: A is split at m/2 into A[0:m/2-1] and A[m/2:m-1]; a binary search locates B[j], splitting B around it; the two halves are merged recursively. Charles Leiserson]
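A minimal sketch of the binary-search step (the name and interface are our own): find j, the number of keys in B that are not larger than the pivot A[m/2].

// Returns j such that B[0:j-1] <= pivot < B[j:n-1]
// (assumes pivot is not duplicated in B, as the slide does).
int binarySearch(int pivot, const int* B, int n) {
    int lo = 0, hi = n;                      // search window [lo, hi)
    while (lo < hi) {
        int mid = lo + (hi - lo) / 2;
        if (B[mid] <= pivot) lo = mid + 1;   // B[mid] belongs left of the split
        else hi = mid;
    }
    return lo;
}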
Assuming that B[j] holds the value that is closest to the median of A (A[m/2]), which are true?
A. All of A[0:m/2-1] are smaller than all of B[0:j]
B. All of A[0:m/2-1] are smaller than all of B[j+1:n-1]
C. All of B[0:j-1] are smaller than all of A[m/2:m-1]
D. A & B
E. B & C
[Figure: binary search locates B[j]; A is split at m/2 into A[0:m/2-1] and A[m/2:m-1]. Charles Leiserson]
Recursive Parallel Merge Performance
- If there are N = m+n elements (m ≥ n), then the larger of the merges can merge as many as kN elements, 0 ≤ k ≤ 1.
- What is k, and what is the worst case that establishes this bound?
[Figure: binary search on B and the two recursive merges. Charles Leiserson]
Recursive Parallel Merge Performance - II
- If there are N = m+n elements (m ≥ n), then the larger of the recursive merges processes at most ¾N elements.
- What is the worst case that establishes this bound? In the worst case, we merge m/2 elements of A with all of B.
- Since m ≥ n: n = 2n/2 ≤ (m+n)/2 = N/2.
[Figure: binary search on B and the two recursive merges. Charles Leiserson]
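Putting those two bullets together (our sketch of the arithmetic), the worst-case larger merge handles

$$\frac{m}{2} + n \;=\; \frac{m+n}{2} + \frac{n}{2} \;\le\; \frac{N}{2} + \frac{N}{4} \;=\; \frac{3N}{4}$$

elements, which is where k = 3/4 comes from.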
Recursive Parallel Merge Algorithm

void P_Merge(int *C, int *A, int *B, int m, int n) {
    if (m < n) {
        … thread(P_Merge, C, B, A, n, m);   // swap so the longer array comes first
    } else if (m + n is small enough) {
        SerialMerge(C, A, B, m, n);
    } else {
        int m2 = m/2;
        int j = BinarySearch(A[m2], B, n);
        … thread(P_Merge, C, A, B, m2, j);
        … thread(P_Merge, C+m2+j, A+m2, B+j, m-m2, n-j);
    }
}
Charles Leiserson
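For completeness, a self-contained, hedged rendering of this routine with std::thread; the serial cutoff, SerialMerge via std::merge, and BinarySearch via std::upper_bound are our assumptions for the parts the slide elides with "…".

#include <algorithm>
#include <thread>

const int SMALL = 64;   // assumed serial cutoff

void SerialMerge(int* C, const int* A, const int* B, int m, int n) {
    std::merge(A, A + m, B, B + n, C);   // standard O(m+n) merge
}

int BinarySearch(int pivot, const int* B, int n) {
    // index of the first element of B greater than pivot
    return int(std::upper_bound(B, B + n, pivot) - B);
}

void P_Merge(int* C, const int* A, const int* B, int m, int n) {
    if (m < n) { P_Merge(C, B, A, n, m); return; }        // keep the longer array first
    if (m + n <= SMALL) { SerialMerge(C, A, B, m, n); return; }
    int m2 = m / 2;
    int j = BinarySearch(A[m2], B, n);
    std::thread lower(P_Merge, C, A, B, m2, j);           // merge the lower parts...
    P_Merge(C + m2 + j, A + m2, B + j, m - m2, n - j);    // ...and the upper in parallel
    lower.join();
}

Called on two sorted arrays as P_Merge(C.data(), A.data(), B.data(), m, n), where C has room for m+n keys.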
Assignment #1
- Parallelize the provided serial merge sort code.
- Once it is running correctly, and you have conducted a strong scaling study, implement parallel merge and determine how much it helps.
- Do the merges without recursion, just parallelizing by a factor of 2. If there is time, do the merge recursively.
Performance Programming tips
- Parallelism diminishes as we move up the recursion tree, so parallel merge will likely help much more at the higher levels (at the leaves, it's not possible to merge in parallel).
- The payoff from parallelizing the divide and conquer will likely exceed that of replacing the serial merge with a parallel merge.
- Stop the recursion at a threshold value g. There is an optimal g, which depends on P: for P = 1 it is N; for P > 1 it is < N.
- The parallel part of the divide and conquer will usually stop before we reach the g limit.
What factors limit the benefit of parallel merge, assuming the non-recursive merge?
A. We get at most a factor of 2 speedup
B. We move a lot of data relative to the work we do when merging
C. Both
Today's lecture
- Merge Sort
- Barrier synchronization
Other kinds of data races

int64_t global_sum = 0;
void sumIt(int TID) {
    mtx.lock();
    global_sum += (TID+1);
    mtx.unlock();
    if (TID == 0)
        cout << "Sum of 1 : " << NT << " = " << global_sum << endl;
}

% ./sumIt 5
# threads: 5
The sum of 1 to 5 is 1
After join returns, the sum of 1 to 5 is: 15
Why do we have a race condition?

int64_t global_sum = 0;
void sumIt(int TID) {
    mtx.lock();
    global_sum += (TID+1);
    mtx.unlock();
    if (TID == 0) cout << "Sum… ";
}

A. Threads are able to print out the sum before all have contributed to it
B. The critical section cannot fix this problem
C. The critical section should be removed
D. A & B
E. A & C
Fixing the race - barrier synchronization
- The sum was reported incorrectly because it was possible for thread 0 to read the value before the other threads got a chance to add their contributions (a true dependence).
- The barrier repairs this defect: no thread can move past the barrier until all have arrived, and hence all have contributed to the sum.

int64_t global_sum = 0;
void sumIt(int TID) {
    mtx.lock();
    global_sum += (TID+1);
    mtx.unlock();
    barrier();
    if (TID == 0) cout << "Sum… ";
}

% ./sumIt 5
# threads: 5
The sum of 1 to 5 is 15
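For a self-contained version you can compile, here is a hedged sketch that substitutes C++20 std::barrier for the course-provided barrier(); the thread-spawning scaffolding and the value of NT are our own.

#include <barrier>
#include <cstdint>
#include <iostream>
#include <mutex>
#include <thread>
#include <vector>

const int NT = 5;          // number of threads (assumed)
int64_t global_sum = 0;
std::mutex mtx;
std::barrier bar(NT);      // no thread passes until all NT have arrived

void sumIt(int TID) {
    mtx.lock();
    global_sum += TID + 1;
    mtx.unlock();
    bar.arrive_and_wait(); // every contribution is in before anyone proceeds
    if (TID == 0)
        std::cout << "The sum of 1 to " << NT << " is " << global_sum << std::endl;
}

int main() {
    std::vector<std::thread> ts;
    for (int t = 0; t < NT; t++) ts.emplace_back(sumIt, t);
    for (auto& th : ts) th.join();
}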
Barrier synchronization
[Figure: barrier illustrations. Sources: wikipedia, theknightskyracing.wordpress.com]
Today's lecture
- Merge Sort
- Barrier synchronization
- An application of barrier synchronization
Compare and exchange sorts
- The simplest sort, AKA bubble sort.
- The fundamental operation is compare-exchange: Compare-exchange(a[j], a[j+1]) swaps its arguments if they are in decreasing order: (7,4) → (4,7). It satisfies the post-condition that a[j] ≤ a[j+1], and returns false if a swap was made.

for i = 1 to N-2 do
    done = true;
    for j = 0 to N-i-1 do
        // Compare-exchange(a[j], a[j+1])
        if (a[j] > a[j+1]) { a[j] ↔ a[j+1]; done = false; }
    end do
    if (done) break;
end do
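The same loop in compilable C++ (a sketch; compareExchange mirrors the slide's return convention, and we run up to N-1 passes to cover small arrays):

#include <utility>
#include <vector>

// Swap if out of order; return false when a swap was made (slide convention).
bool compareExchange(int& x, int& y) {
    if (x > y) { std::swap(x, y); return false; }
    return true;
}

void bubbleSort(std::vector<int>& a) {
    int N = (int)a.size();
    for (int i = 1; i < N; i++) {
        bool done = true;
        for (int j = 0; j <= N - i - 1; j++)
            done &= compareExchange(a[j], a[j + 1]);  // loop carried dependence here
        if (done) break;   // a full pass with no swaps means sorted
    }
}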
Loop carried dependencies
- We cannot parallelize bubble sort, owing to the loop carried dependence in the inner loop: the value of a[j] computed in iteration j depends on the values computed in iterations 0, 1, …, j-1.

for i = 1 to N-2 do
    done = true;
    for j = 0 to N-i-1 do
        done &= Compare-exchange(a[j], a[j+1])
    end do
    if (done) break;
end do
Odd/Even sort
- If we re-order the comparisons, we can parallelize the algorithm:
  - number the points as even and odd
  - alternate between sorting the odd and even points
- This algorithm parallelizes since there are no loop carried dependences: all the odd (even) points are decoupled.
Odd/Even sort in action
[Figure: successive odd and even phases sorting an example array. Introduction to Parallel Computing, Grama et al., 2nd Ed.]
The algorithm

// Odd/Even sort
for i = 0 to N-2 do
    done = true;
    for j = 0 to N-2 by 2 do   // Even phase
        done &= Compare-exchange(a[j], a[j+1]);
    end do
    for j = 1 to N-2 by 2 do   // Odd phase
        done &= Compare-exchange(a[j], a[j+1]);
    end do
    if (done) break;
end do

// Bubble sort, for comparison
for i = 1 to N-1 do
    done = true;
    for j = 0 to N-i-1 do
        done &= Compare-exchange(a[j], a[j+1])
    end do
    if (done) break;
end do
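And the odd/even loop in compilable C++, reusing compareExchange from the bubble sort sketch above; within each phase the pairs are disjoint, which is what makes the phases parallelizable.

#include <vector>

void oddEvenSort(std::vector<int>& a) {
    int N = (int)a.size();
    for (int i = 0; i < N; i++) {             // at most N phases are needed
        bool done = true;
        for (int j = 0; j + 1 < N; j += 2)    // even phase: pairs (0,1), (2,3), ...
            done &= compareExchange(a[j], a[j + 1]);
        for (int j = 1; j + 1 < N; j += 2)    // odd phase: pairs (1,2), (3,4), ...
            done &= compareExchange(a[j], a[j + 1]);
        if (done) break;   // neither phase swapped: sorted
    }
}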
What costs does odd/even sort add to the serial code?
A. More memory accesses
B. More comparisons
C. Both A & B
Odd/Even Sort Code - Where do we need synchronization?

Global bool AllDone;
int OE = lo % 2;
for (s = 0; s < MaxIter; s++) {
    int done = Sweep(Keys, OE, lo, hi);    /* Odd phase */
    done &= Sweep(Keys, 1-OE, lo, hi);     /* Even phase */
    AllDone &= done;
    if (AllDone) break;
}

bool Sweep(int *Keys, int OE, int lo, int hi) {
    int Hi = hi;
    if (TID == NT-1) Hi--;
    bool myDone = true;
    for (int i = OE+lo; i <= Hi; i += 2) {
        if (Keys[i] > Keys[i+1]) {
            Keys[i] ↔ Keys[i+1];
            myDone = false;
        }
    }
    return myDone;
}

[The slide marks nine candidate synchronization points, (1)-(9), alongside these lines.]
Which barrier synchronization points can we remove?

Global bool AllDone;
int OE = lo % 2;
for (s = 0; s < MaxIter; s++) {
    barr.sync();                               // A
    if (!TID) AllDone = true;
    int done = Sweep(Keys, OE, lo, hi);        // Odd phase
    barr.sync();                               // B
    done &= Sweep(Keys, 1-OE, lo, hi);         // Even phase
    mtx.lock(); AllDone &= done; mtx.unlock();
    if (AllDone) break;
}

[The slide labels four candidate barrier points A-D alongside this code.]
Building a linear time barrier with locks

#include <mutex>
using std::mutex;

class barrier {
    int count, _NT;
    mutex arrival;     // guards the arrival phase; starts unlocked
    mutex departure;   // guards the departure phase; starts locked
public:
    barrier(int NT = 2) : count(0), _NT(NT) {
        departure.lock();              // departure begins locked
    }
    void bsync() {
        arrival.lock();                // threads file in one at a time
        if (++count < _NT)
            arrival.unlock();          // not last to arrive: admit the next thread
        else
            departure.unlock();        // last arrival opens the departure phase
        departure.lock();              // threads file out one at a time
        if (--count > 0)
            departure.unlock();        // not last to leave: release the next thread
        else
            arrival.unlock();          // last out re-opens arrival for reuse
    }
};
// Note: the two "mutexes" are used as binary semaphores - a thread may unlock
// a lock that a different thread acquired.
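A brief usage sketch of the class above (our own smoke test; thread count and output are arbitrary): no "past barrier" line can print before every "at barrier" line.

#include <iostream>
#include <thread>
#include <vector>

int main() {
    const int NT = 4;
    barrier barr(NT);
    std::vector<std::thread> ts;
    for (int t = 0; t < NT; t++)
        ts.emplace_back([&barr, t] {
            std::cout << "thread " << t << " at barrier\n";
            barr.bsync();   // wait for all NT threads
            std::cout << "thread " << t << " past barrier\n";
        });
    for (auto& th : ts) th.join();
}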