CS 140 (Feb 19, 2015): Cilk Scheduling & Applications. Analyzing quicksort. Optional: master method for solving divide-and-conquer recurrences. Tips on parallelism.

Presentation transcript:

1 CS 140 : Feb 19, 2015
Cilk Scheduling & Applications
∙Analyzing quicksort
∙Optional: Master method for solving divide-and-conquer recurrences
∙Tips on parallelism and overheads
∙Greedy scheduling and parallel slackness
∙Cilk runtime
Thanks to Charles E. Leiserson for some of these slides

2 Potential Parallelism
Because the Span Law dictates that T_P ≥ T_∞, the maximum possible speedup given T_1 and T_∞ is
T_1/T_∞ = potential parallelism = the average amount of work per step along the span.

3 Sorting
∙Sorting is possibly the most frequently executed operation in computing!
∙Quicksort is the fastest sorting algorithm in practice, with an average running time of O(N log N) but O(N²) worst-case performance.
∙Mergesort has worst-case performance of O(N log N) for sorting N elements.
∙Both are based on the recursive divide-and-conquer paradigm.

4 QUICKSORT
∙Basic quicksort, sorting an array S, works as follows:
- If the number of elements in S is 0 or 1, then return.
- Pick any element v in S. Call this the pivot.
- Partition the set S−{v} into two disjoint groups:
 ♦ S_1 = {x ∈ S−{v} | x ≤ v}
 ♦ S_2 = {x ∈ S−{v} | x ≥ v}
- Return quicksort(S_1) followed by v followed by quicksort(S_2).

5 QUICKSORT Select Pivot

6 QUICKSORT Partition around Pivot

7 QUICKSORT Quicksort recursively

8 Parallelizing Quicksort
∙Serial quicksort sorts an array S as follows:
- If the number of elements in S is 0 or 1, then return.
- Pick any element v in S. Call this the pivot.
- Partition the set S−{v} into two disjoint groups:
 ♦ S_1 = {x ∈ S−{v} | x ≤ v}
 ♦ S_2 = {x ∈ S−{v} | x ≥ v}
- Return quicksort(S_1) followed by v followed by quicksort(S_2).

9 Parallel Quicksort (Basic)
The second recursive call to qsort does not depend on the results of the first recursive call, so we have an opportunity to speed up the computation by making both calls in parallel, as in the sketch below.
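A minimal sketch of this idea in Cilk++-style C++, along the lines of the sample_qsort routine referenced two slides later (the partitioning details here are illustrative, not the exact course code):

 #include <algorithm>    // std::partition
 #include <utility>      // std::swap
 #include <cilk/cilk.h>  // cilk_spawn / cilk_sync (Cilk Plus header; an assumption)

 // Sort the range [begin, end) in place, using the last element as the pivot.
 void sample_qsort(int* begin, int* end) {
     if (end - begin < 2) return;   // 0 or 1 elements: nothing to do
     --end;                         // *end is now the pivot
     int* middle = std::partition(begin, end,
                                  [end](int x) { return x < *end; });
     std::swap(*end, *middle);      // put the pivot in its final position
     cilk_spawn sample_qsort(begin, middle);  // sort S_1 in parallel ...
     sample_qsort(middle + 1, end + 1);       // ... with S_2
     cilk_sync;                               // wait for the spawned half
 }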

10 Actual Performance
∙./qsort cilk_set_worker_count 1 >> 0.083 seconds
∙./qsort cilk_set_worker_count 16 >> 0.014 seconds
∙Speedup = T_1/T_16 = 0.083/0.014 = 5.93

11 Actual Performance
∙./qsort cilk_set_worker_count 1 >> 0.083 seconds
∙./qsort cilk_set_worker_count 16 >> 0.014 seconds
∙Speedup = T_1/T_16 = 0.083/0.014 = 5.93
∙./qsort cilk_set_worker_count 1 >> 10.57 seconds
∙./qsort cilk_set_worker_count 16 >> 1.58 seconds
∙Speedup = T_1/T_16 = 10.57/1.58 = 6.67
Why not better???

12 Measure Work/Span Empirically
∙cilkview -w ./qsort
 Work =
 Span =
 Burdened span =
 Parallelism =
 Burdened parallelism =
 #Spawn =
 #Atomic instructions = 14
∙cilkview -w ./qsort
 Work =
 Span =
 Burdened span =
 Parallelism =
 Burdened parallelism =
 #Spawn =
 #Atomic instructions = 8
The measured region is bracketed in the code:
 workspan ws;
 ws.start();
 sample_qsort(a, a + n);
 ws.stop();
 ws.report(std::cout);

13 Analyzing Quicksort
Quicksort recursively. Assume we have a "great" partitioner that always generates two balanced sets.

14 Analyzing Quicksort
∙Work recurrence: T_1(n) = 2 T_1(n/2) + Θ(n)
 2 T_1(n/2) = 4 T_1(n/4) + 2 Θ(n/2)
 ...
 (n/2) T_1(2) = n T_1(1) + (n/2) Θ(2)
 Summing all levels: T_1(n) = Θ(n lg n)
∙Span recurrence: T_∞(n) = T_∞(n/2) + Θ(n)
 Solves to T_∞(n) = Θ(n)
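The same work bound, written as a single summation over the lg n levels of the recursion tree (a standard check, not on the slide):

\[
T_1(n) \;=\; \sum_{i=0}^{\lg n - 1} 2^i \cdot \Theta\!\left(\frac{n}{2^i}\right)
\;=\; \sum_{i=0}^{\lg n - 1} \Theta(n)
\;=\; \Theta(n \lg n).
\]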

15 Analyzing Quicksort
∙Parallelism: T_1(n)/T_∞(n) = Θ(lg n). Not much!
∙Indeed, partitioning (i.e., constructing the array S_1 = {x ∈ S−{v} | x ≤ v}) can be accomplished in parallel in time Θ(lg n),
∙which gives a span T_∞(n) = Θ(lg² n)
∙and parallelism Θ(n/lg n). Way better!
∙Basic parallel qsort can be found under $cilkpath/examples/qsort
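To check the improved span, one can solve the span recurrence with the Θ(lg n) parallel partitioner (a sketch of the arithmetic; this is also Case 2 of the master method on the next slides):

\[
T_\infty(n) = T_\infty(n/2) + \Theta(\lg n)
= \Theta\!\Bigl(\sum_{i=0}^{\lg n} \lg\frac{n}{2^i}\Bigr)
= \Theta(\lg^2 n),
\qquad
\frac{T_1(n)}{T_\infty(n)} = \frac{\Theta(n \lg n)}{\Theta(\lg^2 n)} = \Theta\!\left(\frac{n}{\lg n}\right).
\]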

16 The Master Method (Optional)
The Master Method for solving recurrences applies to recurrences of the form
 T(n) = a T(n/b) + f(n), *
where a ≥ 1, b > 1, and f is asymptotically positive.
IDEA: Compare n^(log_b a) with f(n).
* The base case is always T(n) = Θ(1) for sufficiently small n.

17 Master Method — CASE 1
T(n) = a T(n/b) + f(n)
n^(log_b a) ≫ f(n): specifically, f(n) = O(n^(log_b a − ε)) for some constant ε > 0.
Solution: T(n) = Θ(n^(log_b a)).
Strassen matrix multiplication: a = 7, b = 2, f(n) = n² ⇒ T_1(n) = Θ(n^(log_2 7)) = O(n^2.81).

18 Master Method — CASE 2
T(n) = a T(n/b) + f(n)
n^(log_b a) ≈ f(n): specifically, f(n) = Θ(n^(log_b a) lg^k n) for some constant k ≥ 0.
Solution: T(n) = Θ(n^(log_b a) lg^(k+1) n).
Quicksort work: a = 2, b = 2, f(n) = n, k = 0 ⇒ T_1(n) = Θ(n lg n).
Qsort span: a = 1, b = 2, f(n) = lg n, k = 1 ⇒ T_∞(n) = Θ(lg² n).

19 Master Method — CASE 3
T(n) = a T(n/b) + f(n)
n^(log_b a) ≪ f(n): specifically, f(n) = Ω(n^(log_b a + ε)) for some constant ε > 0, and (regularity) a·f(n/b) ≤ c·f(n) for some constant c < 1.
Solution: T(n) = Θ(f(n)).
E.g., qsort span (bad version): a = 1, b = 2, f(n) = n ⇒ T_∞(n) = Θ(n).

20 Master Method Summary
T(n) = a T(n/b) + f(n)
CASE 1: f(n) = O(n^(log_b a − ε)), constant ε > 0 ⇒ T(n) = Θ(n^(log_b a)).
CASE 2: f(n) = Θ(n^(log_b a) lg^k n), constant k ≥ 0 ⇒ T(n) = Θ(n^(log_b a) lg^(k+1) n).
CASE 3: f(n) = Ω(n^(log_b a + ε)), constant ε > 0, and regularity condition ⇒ T(n) = Θ(f(n)).
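As one more worked instance of Case 2 (an illustrative exercise, not from the deck), the binary search recurrence fits the template:

\[
T(n) = T(n/2) + \Theta(1):\quad a = 1,\; b = 2,\; n^{\log_b a} = n^0 = 1,\; f(n) = \Theta(1 \cdot \lg^0 n)
\;\Rightarrow\; T(n) = \Theta(\lg^{0+1} n) = \Theta(\lg n).
\]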

21 Potential Parallelism
Because the Span Law dictates that T_P ≥ T_∞, the maximum possible speedup given T_1 and T_∞ is
T_1/T_∞ = potential parallelism = the average amount of work per step along the span.

22 Three Tips on Parallelism
1. Minimize span to maximize parallelism. Try to generate 10 times more parallelism than processors for near-perfect linear speedup.
2. If you have plenty of parallelism, try to trade some of it off for reduced work overheads.
3. Use divide-and-conquer recursion or parallel loops rather than spawning one small thing off after another.
Do this:
 cilk_for (int i=0; i<n; ++i) {
     foo(i);
 }
Not this:
 for (int i=0; i<n; ++i) {
     cilk_spawn foo(i);
 }
 cilk_sync;

23 Three Tips on Overheads
1. Make sure that work/#spawns is not too small. Coarsen by using function calls and inlining near the leaves of recursion rather than spawning (see the sketch below).
2. Parallelize outer loops if you can, not inner loops; otherwise you'll have high burdened parallelism, which includes runtime and scheduling overhead. If you must parallelize an inner loop, coarsen it, but not too much: 500 iterations should be plenty coarse for even the most meager loop, and fewer iterations should suffice for "fatter" loops.
3. Use reducers only in sufficiently fat loops.
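A minimal sketch of tip 1, assuming a simple divide-and-conquer array sum (the function names and the cutoff value are illustrative assumptions, not course code): below the cutoff, the recursion becomes an ordinary serial call, so the work per spawn stays large.

 #include <cilk/cilk.h>  // cilk_spawn / cilk_sync (Cilk Plus header; an assumption)

 const int CUTOFF = 2048;  // illustrative grain size; tune per machine

 // Serial leaf: a plain function call, no spawn overhead.
 long serial_sum(const int* a, int n) {
     long s = 0;
     for (int i = 0; i < n; ++i) s += a[i];
     return s;
 }

 long parallel_sum(const int* a, int n) {
     if (n <= CUTOFF) return serial_sum(a, n);         // coarsened base case
     long left = cilk_spawn parallel_sum(a, n / 2);    // first half in parallel ...
     long right = parallel_sum(a + n / 2, n - n / 2);  // ... with the second half
     cilk_sync;
     return left + right;
 }

With the cutoff in place, work/#spawns is at least on the order of CUTOFF, so the spawn overhead is amortized over plenty of useful work.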

24 Scheduling
∙Cilk allows the programmer to express potential parallelism in an application.
∙The Cilk scheduler maps strands onto processors dynamically at runtime.
∙Since on-line schedulers are complicated, we'll explore the ideas with an off-line scheduler.
A strand is a sequence of instructions that doesn't contain any parallel constructs.
[Figure: multiple processors (P), each with a cache ($), connected by a network to shared memory and I/O.]

25 Greedy Scheduling
IDEA: Do as much as possible on every step.
Definition: A strand is ready if all its predecessors have executed.

26 Greedy Scheduling
IDEA: Do as much as possible on every step.
Definition: A strand is ready if all its predecessors have executed.
Complete step (P = 3):
∙≥ P strands ready.
∙Run any P.

27 Greedy Scheduling
IDEA: Do as much as possible on every step.
Definition: A strand is ready if all its predecessors have executed.
Complete step (P = 3):
∙≥ P strands ready.
∙Run any P.
Incomplete step:
∙< P strands ready.
∙Run all of them.

28 Analysis of Greedy
Theorem: Any greedy scheduler achieves T_P ≤ T_1/P + T_∞.
Proof.
∙# complete steps ≤ T_1/P, since each complete step performs P work.
∙# incomplete steps ≤ T_∞, since each incomplete step reduces the span of the unexecuted dag by 1. ■
(Illustrated with P = 3.)
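As a concrete instance of the bound (made-up numbers): take T_1 = 1000, T_∞ = 20, and P = 8. Then

\[
T_P \;\le\; \frac{T_1}{P} + T_\infty \;=\; \frac{1000}{8} + 20 \;=\; 145,
\]

while perfect linear speedup would give T_1/P = 125, so greedy is guaranteed within 16% of the ideal on this dag.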

29 Optimality of Greedy
Theorem. Any greedy scheduler achieves within a factor of 2 of optimal.
Proof. Let T_P* be the execution time produced by the optimal scheduler. Since T_P* ≥ max{T_1/P, T_∞} by the Work and Span Laws, we have
T_P ≤ T_1/P + T_∞ ≤ 2·max{T_1/P, T_∞} ≤ 2 T_P*. ■

30 Linear Speedup
Theorem. Any greedy scheduler achieves near-perfect linear speedup whenever P ≪ T_1/T_∞.
Proof. Since P ≪ T_1/T_∞ is equivalent to T_∞ ≪ T_1/P, the Greedy Scheduling Theorem gives us T_P ≤ T_1/P + T_∞ ≈ T_1/P. Thus, the speedup is T_1/T_P ≈ P. ■
Definition. The quantity T_1/(P·T_∞) is called the parallel slackness.
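Continuing the made-up numbers from the greedy bound, T_1 = 1000 and T_∞ = 20 give a slackness at P = 8 of

\[
\frac{T_1}{P\,T_\infty} = \frac{1000}{8 \cdot 20} = 6.25,
\]

and the guaranteed T_P ≤ 145 is 16% above T_1/P. At P = 2 the slackness rises to 25, the bound becomes T_P ≤ 520, and we are within 4% of perfect linear speedup: larger slackness pulls the guarantee closer to T_1/P.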

31-41 Cilk Runtime System
Each worker (processor) maintains a work deque of ready strands, and it manipulates the bottom of the deque like a stack:
∙Calls and spawns push frames onto the bottom of the worker's own deque, and returns pop them off the bottom.
∙When a worker runs out of work, it steals from the top of a random victim's deque.
[Slides 31-41 animate four workers' deques through a sequence of Call!, Spawn!, Return!, and Steal! steps.]
Theorem: With sufficient parallelism, workers steal infrequently ⇒ linear speed-up. A simplified sketch of this policy appears below.
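A highly simplified sketch of the stealing policy in plain C++ (illustration only: the Strand and Worker types are invented here, a real worker loops continuously, and the actual Cilk runtime uses a carefully engineered lock-free deque protocol rather than a mutex):

 #include <deque>
 #include <functional>
 #include <mutex>
 #include <random>
 #include <vector>

 using Strand = std::function<void()>;  // a "strand": a run of serial work (invented type)

 struct Worker {
     std::deque<Strand> dq;  // bottom = back (own pushes/pops), top = front (steals)
     std::mutex m;           // the real runtime avoids locking on the common path
 };

 // One scheduling step for worker `self`.
 void step(std::vector<Worker>& workers, int self, std::mt19937& rng) {
     Strand s;
     {   // First try our own deque: pop from the BOTTOM, like a stack.
         std::lock_guard<std::mutex> g(workers[self].m);
         if (!workers[self].dq.empty()) {
             s = std::move(workers[self].dq.back());
             workers[self].dq.pop_back();
         }
     }
     if (!s) {  // Out of work: pick a random victim and steal from the TOP.
         std::uniform_int_distribution<int> pick(0, (int)workers.size() - 1);
         int victim = pick(rng);
         if (victim != self) {
             std::lock_guard<std::mutex> g(workers[victim].m);
             if (!workers[victim].dq.empty()) {
                 s = std::move(workers[victim].dq.front());
                 workers[victim].dq.pop_front();
             }
         }
     }
     if (s) s();  // Execute the strand we obtained, if any.
 }

Because spawned continuations accumulate toward the top of a deque, thieves tend to take the oldest, largest pieces of work; that is why, with sufficient parallel slackness, steals are rare and speedup stays near-linear.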