CS 240A : Examples with Cilk++
Divide & Conquer Paradigm
Solving recurrences
Sorting: Quicksort and Mergesort
Graph traversal: Breadth-First Search
Thanks to Charles E. Leiserson for some of these slides.

Work and Span (Recap)
TP = execution time on P processors
T1 = work
T∞ = span*
Speedup on P processors: T1/TP
Parallelism: T1/T∞
* Also called critical-path length or computational depth.
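A worked instance of these definitions (numbers made up for illustration): if a computation has T1 = 100 seconds of work and a span of T∞ = 10 seconds, its parallelism is T1/T∞ = 10, and since TP ≥ T∞ the speedup T1/TP can never exceed 10, no matter how many processors we use.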

Sorting
Sorting is possibly the most frequently executed operation in computing!
Quicksort is the fastest sorting algorithm in practice, with an average running time of O(N log N) (but O(N²) worst-case performance).
Mergesort has worst-case performance of O(N log N) for sorting N elements.
Both are based on the recursive divide-and-conquer paradigm.

QUICKSORT
Basic Quicksort sorting an array S works as follows:
If the number of elements in S is 0 or 1, then return.
Pick any element v in S. Call this element the pivot.
Partition the set S−{v} into two disjoint groups:
  S1 = {x ∈ S−{v} | x ≤ v}
  S2 = {x ∈ S−{v} | x ≥ v}
Return quicksort(S1) followed by v followed by quicksort(S2).
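A minimal serial sketch of this recipe in C++ (an illustration, not the course's reference code; it picks the first element as the pivot and uses std::partition for the S1/S2 split):

#include <algorithm>

// Serial quicksort sketch: 0- or 1-element ranges are already sorted;
// otherwise partition around the first element and recurse on both sides.
template <typename It>
void serial_quicksort(It begin, It end) {
  if (end - begin <= 1) return;                  // base case
  auto pivot = *begin;
  It middle = std::partition(begin + 1, end,     // S1 = {x < pivot}, S2 = {x >= pivot}
                             [&](const auto& x) { return x < pivot; });
  std::iter_swap(begin, middle - 1);             // place pivot between S1 and S2
  serial_quicksort(begin, middle - 1);           // quicksort(S1)
  serial_quicksort(middle, end);                 // quicksort(S2)
}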

QUICKSORT (example)
[Figures: select a pivot in the array 14 13 45 56 34 31 32 21 78, partition around the pivot, then quicksort the two groups recursively, yielding the sorted array 13 14 21 31 32 34 45 56 78.]

Parallelizing Quicksort
Serial Quicksort sorts an array S as follows:
If the number of elements in S is 0 or 1, then return.
Pick any element v in S. Call this element the pivot.
Partition the set S−{v} into two disjoint groups:
  S1 = {x ∈ S−{v} | x ≤ v}
  S2 = {x ∈ S−{v} | x ≥ v}
Return quicksort(S1) followed by v followed by quicksort(S2) — not necessarily in that serial order!

Parallel Quicksort (Basic)
The second recursive call to qsort does not depend on the results of the first recursive call, so we have an opportunity to speed things up by making both calls in parallel.

template <typename T>
void qsort(T begin, T end) {
  if (begin != end) {
    // Partition around the first element; elements less than the
    // pivot end up in [begin, middle).
    T middle = partition(begin, end,
        bind2nd(less<typename iterator_traits<T>::value_type>(), *begin));
    cilk_spawn qsort(begin, middle);       // sort left part in parallel
    qsort(max(begin + 1, middle), end);    // sort right part
    cilk_sync;                             // wait for the spawned call
  }
}
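For context, a minimal driver of the kind the timings on the next slide suggest — a sketch only; the actual example program ships under $cilkpath/examples/qsort:

#include <cstdlib>
#include <vector>
#include <algorithm>

// Hypothetical driver: sort n random ints with the parallel qsort above.
int main(int argc, char* argv[]) {
  int n = (argc > 1) ? std::atoi(argv[1]) : 500000;
  std::vector<int> a(n);
  std::generate(a.begin(), a.end(), std::rand);
  qsort(a.begin(), a.end());
  return 0;
}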

Performance
./qsort 500000 -cilk_set_worker_count 1  >> 0.083 seconds
./qsort 500000 -cilk_set_worker_count 16 >> 0.014 seconds
Speedup = T1/T16 = 0.083/0.014 = 5.93
./qsort 50000000 -cilk_set_worker_count 1  >> 10.57 seconds
./qsort 50000000 -cilk_set_worker_count 16 >> 1.58 seconds
Speedup = T1/T16 = 10.57/1.58 = 6.69

Measure Work/Span Empirically

cilkscreen -w ./qsort 50000000
Work = 21593799861
Span = 1261403043
Burdened span = 1261600249
Parallelism = 17.1189
Burdened parallelism = 17.1162
#Spawn = 50000000
#Atomic instructions = 14

cilkscreen -w ./qsort 500000
Work = 178835973
Span = 14378443
Burdened span = 14525767
Parallelism = 12.4378
Burdened parallelism = 12.3116
#Spawn = 500000
#Atomic instructions = 8

Instrumenting a region by hand:
workspan ws;
ws.start();
sample_qsort(a, a + n);
ws.stop();
ws.report(std::cout);

Analyzing Quicksort
[Figure: the partitioned array 56 13 31 21 45 34 32 14 78 is quicksorted recursively into 13 14 21 31 32 34 45 56 78.]
Assume we have a "great" partitioner that always generates two balanced sets.

Analyzing Quicksort
Work recurrence: T1(n) = 2 T1(n/2) + Θ(n), which solves to T1(n) = Θ(n lg n).
Span recurrence: T∞(n) = T∞(n/2) + Θ(n) — the partitioning is not parallel!
This solves to T∞(n) = Θ(n).

Analyzing Quicksort
Parallelism: T1(n)/T∞(n) = Θ(lg n) — not much!
Indeed, partitioning (i.e., constructing the array S1 = {x ∈ S−{v} | x ≤ v}) can be accomplished in parallel in time Θ(lg n),
which gives a span T∞(n) = Θ(lg²n) and parallelism Θ(n/lg n) — way better!
Basic parallel qsort can be found under $cilkpath/examples/qsort.

The Master Method (Optional)
The Master Method for solving recurrences applies to recurrences of the form
  T(n) = a T(n/b) + f(n),
where a ≥ 1, b > 1, and f is asymptotically positive.*
Idea: Compare n^(log_b a) with f(n).
* The unstated base case is T(n) = Θ(1) for sufficiently small n.

Master Method — CASE 1
T(n) = a T(n/b) + f(n)
n^(log_b a) ≫ f(n): specifically, f(n) = O(n^(log_b a − ε)) for some constant ε > 0.
Solution: T(n) = Θ(n^(log_b a)).

Master Method — CASE 2
T(n) = a T(n/b) + f(n)
n^(log_b a) ≈ f(n): specifically, f(n) = Θ(n^(log_b a) lg^k n) for some constant k ≥ 0.
Solution: T(n) = Θ(n^(log_b a) lg^(k+1) n).
Ex (qsort work): a = 2, b = 2, k = 0 ⇒ T1(n) = Θ(n lg n).

Master Method — CASE 3
T(n) = a T(n/b) + f(n)
n^(log_b a) ≪ f(n): specifically, f(n) = Ω(n^(log_b a + ε)) for some constant ε > 0, and f(n) satisfies the regularity condition that a f(n/b) ≤ c f(n) for some constant c < 1.
Solution: T(n) = Θ(f(n)).
Example: span of qsort.

Master Method Summary
T(n) = a T(n/b) + f(n)
CASE 1: f(n) = O(n^(log_b a − ε)), constant ε > 0 ⇒ T(n) = Θ(n^(log_b a)).
CASE 2: f(n) = Θ(n^(log_b a) lg^k n), constant k ≥ 0 ⇒ T(n) = Θ(n^(log_b a) lg^(k+1) n).
CASE 3: f(n) = Ω(n^(log_b a + ε)), constant ε > 0, and regularity condition ⇒ T(n) = Θ(f(n)).
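To make the cases concrete, here are worked instances (the first and third are the quicksort work and span recurrences from these slides; the middle one is an extra illustration):
  T(n) = 2 T(n/2) + Θ(n): n^(log_2 2) = n ≈ f(n), so CASE 2 with k = 0 gives T(n) = Θ(n lg n).
  T(n) = 4 T(n/2) + Θ(n): n^(log_2 4) = n² ≫ n, so CASE 1 gives T(n) = Θ(n²).
  T(n) = T(n/2) + Θ(n): n^(log_2 1) = 1 ≪ n, and regularity holds with c = 1/2, so CASE 3 gives T(n) = Θ(n).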

MERGESORT
Mergesort is an example of a recursive sorting algorithm.
It is based on the divide-and-conquer paradigm.
It uses the merge operation as its fundamental component (which takes in two sorted sequences and produces a single sorted sequence).
Drawback of mergesort: not in-place (uses an extra temporary array).

Merging Two Sorted Arrays

template <typename T>
void Merge(T *C, T *A, T *B, int na, int nb) {
  // Repeatedly copy the smaller head element into C.
  while (na > 0 && nb > 0) {
    if (*A <= *B) { *C++ = *A++; na--; }
    else          { *C++ = *B++; nb--; }
  }
  // One input is exhausted; copy the remainder of the other.
  while (na > 0) { *C++ = *A++; na--; }
  while (nb > 0) { *C++ = *B++; nb--; }
}

Time to merge n elements = Θ(n).
[Figure: merging 3 12 19 46 with 4 14 21 23.]
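A quick usage sketch with the arrays from the figure:

int A[] = {3, 12, 19, 46};
int B[] = {4, 14, 21, 23};
int C[8];
Merge(C, A, B, 4, 4);   // C = {3, 4, 12, 14, 19, 21, 23, 46}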

Parallel Merge Sort
A: input (unsorted); B: output (sorted); C: temporary array.

template <typename T>
void MergeSort(T *B, T *A, int n) {
  if (n==1) {
    B[0] = A[0];
  } else {
    T* C = new T[n];
    cilk_spawn MergeSort(C, A, n/2);    // sort left half in parallel
    MergeSort(C+n/2, A+n/2, n-n/2);     // sort right half
    cilk_sync;
    Merge(B, C, C+n/2, n/2, n-n/2);     // serial merge
    delete[] C;
  }
}

[Figure: merge-tree example on the array 46 14 3 4 12 19 21 33.]
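A usage sketch (hypothetical driver; note the output buffer is distinct from the input):

#include <vector>

int main() {
  std::vector<int> A = {46, 14, 3, 4, 12, 19, 21, 33};
  std::vector<int> B(A.size());
  MergeSort(B.data(), A.data(), (int)A.size());
  // B now holds 3 4 12 14 19 21 33 46; A is unchanged.
  return 0;
}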

Work of Merge Sort
Work: T1(n) = 2 T1(n/2) + Θ(n) = Θ(n lg n)
CASE 2: n^(log_b a) = n^(log_2 2) = n, and f(n) = Θ(n^(log_b a) lg^0 n).

Span of Merge Sort
Span: T∞(n) = T∞(n/2) + Θ(n) = Θ(n)
CASE 3: n^(log_b a) = n^(log_2 1) = 1, and f(n) = Θ(n).

Parallelism of Merge Sort
Work: T1(n) = Θ(n lg n)
Span: T∞(n) = Θ(n)
Parallelism: T1(n)/T∞(n) = Θ(lg n) — PUNY!
We need to parallelize the merge!

Parallel Merge
Given two sorted arrays A (length na) and B (length nb) with na ≥ nb:
  Let ma = na/2, and binary-search B for the index mb with B[mb−1] ≤ A[ma] ≤ B[mb].
  Recursively P_Merge the elements ≤ A[ma] (A[0..ma) with B[0..mb)) and the elements ≥ A[ma] (A[ma+1..na) with B[mb..nb)).
[Figure: A split around A[ma]; B split at mb by binary search.]
Key Idea: If the total number of elements to be merged in the two arrays is n = na + nb, the total number of elements in the larger of the two recursive merges is at most (3/4)n, because each recursive merge throws away at least na/2 ≥ n/4 elements.

Parallel Merge

template <typename T>
void P_Merge(T *C, T *A, T *B, int na, int nb) {
  if (na < nb) {
    P_Merge(C, B, A, nb, na);             // ensure A is the larger array
  } else if (na==0) {
    return;                               // both arrays are empty
  } else {
    int ma = na/2;
    int mb = BinarySearch(A[ma], B, nb);  // #elements of B less than A[ma]
    C[ma+mb] = A[ma];                     // A[ma] lands in its final slot
    cilk_spawn P_Merge(C, A, B, ma, mb);  // merge the elements <= A[ma]
    P_Merge(C+ma+mb+1, A+ma+1, B+mb, na-ma-1, nb-mb);  // merge the rest
    cilk_sync;
  }
}

Coarsen base cases for efficiency.
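The helper BinarySearch(key, B, nb) is not shown on the slide; the code above assumes it returns the number of elements of B[0..nb) that are less than key (a lower-bound search). A minimal sketch under that assumption:

template <typename T>
int BinarySearch(const T& key, const T* B, int nb) {
  // Invariant: B[0..lo) < key and B[hi..nb) >= key.
  int lo = 0, hi = nb;
  while (lo < hi) {
    int mid = lo + (hi - lo) / 2;
    if (B[mid] < key) lo = mid + 1;
    else hi = mid;
  }
  return lo;   // index of the first element >= key
}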

Span of Parallel Merge
Span: T∞(n) = T∞(3n/4) + Θ(lg n) = Θ(lg²n)
CASE 2: a = 1, b = 4/3, so n^(log_b a) = n^(log_{4/3} 1) = 1, and f(n) = Θ(n^(log_b a) lg^1 n).
(Try it in MATLAB — log_b a = 0 whenever a = 1 and b > 1:
  >> x = 0.1:0.1:10;
  >> y = log2(x);
  >> plot(x,y) )

Work of Parallel Merge
Work: T1(n) = T1(αn) + T1((1−α)n) + Θ(lg n), where 1/4 ≤ α ≤ 3/4 — HAIRY!
Claim: T1(n) = Θ(n).

Analysis of Work Recurrence
Work: T1(n) = T1(αn) + T1((1−α)n) + Θ(lg n), where 1/4 ≤ α ≤ 3/4.
Substitution method: the inductive hypothesis is T1(k) ≤ c1·k − c2·lg k, where c1, c2 > 0. Prove that the relation holds, and solve for c1 and c2.
T1(n) = T1(αn) + T1((1−α)n) + Θ(lg n)
      ≤ c1(αn) − c2 lg(αn) + c1(1−α)n − c2 lg((1−α)n) + Θ(lg n)
      = c1·n − c2 (lg(αn) + lg((1−α)n)) + Θ(lg n)
Using c2 lg(αn) + c2 lg((1−α)n) = c2 lg(α(1−α)n²) = c2 (lg(α(1−α)) + 2 lg n):
      = c1·n − c2 lg n − (c2 (lg n + lg(α(1−α))) − Θ(lg n))
      ≤ c1·n − c2 lg n, by choosing c2 large enough.
Choose c1 large enough to handle the base case.

Parallelism of P_Merge
Work: T1(n) = Θ(n)
Span: T∞(n) = Θ(lg²n)
Parallelism: T1(n)/T∞(n) = Θ(n/lg²n)

Parallel Merge Sort

template <typename T>
void P_MergeSort(T *B, T *A, int n) {
  if (n==1) {
    B[0] = A[0];
  } else {
    T* C = new T[n];
    cilk_spawn P_MergeSort(C, A, n/2);
    P_MergeSort(C+n/2, A+n/2, n-n/2);
    cilk_sync;
    P_Merge(B, C, C+n/2, n/2, n-n/2);  // parallel merge replaces serial Merge
    delete[] C;
  }
}

Work: T1(n) = 2 T1(n/2) + Θ(n) = Θ(n lg n)
CASE 2: n^(log_b a) = n^(log_2 2) = n, and f(n) = Θ(n^(log_b a) lg^0 n).

Parallel Merge Sort
Span: T∞(n) = T∞(n/2) + Θ(lg²n) = Θ(lg³n)
CASE 2: n^(log_b a) = n^(log_2 1) = 1, and f(n) = Θ(n^(log_b a) lg²n).

Parallelism of P_MergeSort
Work: T1(n) = Θ(n lg n)
Span: T∞(n) = Θ(lg³n)
Parallelism: T1(n)/T∞(n) = Θ(n/lg²n)
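Pulling together the analyses so far (all bounds are from the slides above):

Algorithm                        Work        Span      Parallelism
Quicksort (serial partition)     Θ(n lg n)   Θ(n)      Θ(lg n)
Quicksort (parallel partition)   Θ(n lg n)   Θ(lg²n)   Θ(n/lg n)
MergeSort (serial Merge)         Θ(n lg n)   Θ(n)      Θ(lg n)
P_MergeSort (with P_Merge)       Θ(n lg n)   Θ(lg³n)   Θ(n/lg²n)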

Breadth First Search
Level-by-level graph traversal.
Serial complexity: Θ(m+n).
Graph: G(E,V) — E: set of edges (size m); V: set of vertices (size n).
[Figure: an example graph on vertices 1–19.]

Breadth First Search
Level-by-level graph traversal. Serial complexity: Θ(m+n).
[Figures: animation of the traversal — Level 1: {1}; Level 2: {2, 5}; Level 3: {3, 6, 9}; Level 4: {4, 7, 10, 16}; Level 5: {8, 11, 17}; Level 6: {12, 18}.]

Breadth First Search
Who is parent(19) if we use a queue for expanding the frontier? Does it actually matter?

Parallel BFS
Way #1: a custom reducer. Bag<T> has an associative reduce function that merges two sets.

void BFS(Graph *G, Vertex root) {
  Bag<Vertex> frontier(root);
  while ( ! frontier.isEmpty() ) {
    cilk::hyperobject< Bag<Vertex> > succbag;
    cilk_for (int i=0; i < frontier.size(); i++) {
      for ( Vertex v in frontier[i].adjacency() ) {
        if ( v.unvisited() )
          succbag() += v;   // operator+=(Vertex & rhs) also marks rhs "visited"
      }
    }
    frontier = succbag.getValue();
  }
}
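The slide does not show how the Bag reducer combines per-strand results; here is a minimal sketch of the idea, with hypothetical Bag<T> and merge() names (this is not the Cilk++ reducer API, just the shape of the monoid it needs):

// Sketch: a reducer needs an identity element and an associative reduce.
// Because bag union is associative, each parallel strand can accumulate
// into a private bag and the runtime may combine the bags in any order.
template <typename T>
struct BagMonoid {
  static Bag<T> identity() { return Bag<T>(); }   // the empty bag
  static void reduce(Bag<T>& left, Bag<T>& right) {
    left.merge(right);                            // set union into 'left'
  }
};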

Parallel BFS
Way #2: concurrent writes + a list reducer.

void BFS(Graph *G, Vertex root) {
  list<Vertex> frontier(root);
  Vertex * parent = new Vertex[n];
  while ( ! frontier.isEmpty() ) {
    cilk_for (int i=0; i < frontier.size(); i++) {
      for ( Vertex v in frontier[i].adjacency() ) {
        if ( ! v.visited() )
          parent[v] = frontier[i];   // an intentional data race
      }
    }
    ...
  }
}

How to generate the new frontier?

Parallel BFS
Run the cilk_for loop again:

void BFS(Graph *G, Vertex root) {
  ...
  while ( ! frontier.isEmpty() ) {
    hyperobject< reducer_list_append<Vertex> > succlist;
    cilk_for (int i=0; i < frontier.size(); i++) {
      for ( Vertex v in frontier[i].adjacency() ) {
        if ( parent[v] == frontier[i] ) {
          succlist.push_back(v);
          v.visit();   // mark "visited"
        }
      }
    }
    frontier = succlist.getValue();
  }
}

The !v.visited() check is not necessary. Why?

Parallel BFS
Each level is explored with Θ(1) span.
Graph G has at most d and at least d/2 levels, depending on the location of root, where d = diameter(G).
Work: T1(n) = Θ(m+n)
Span: T∞(n) = Θ(d)
Parallelism: T1(n)/T∞(n) = Θ((m+n)/d)
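A worked instance with made-up (illustrative) numbers: for a graph with n = 10⁶ vertices, m = 10⁷ edges, and diameter d = 20, the parallelism is Θ((m+n)/d) ≈ (10⁷ + 10⁶)/20 = 5.5 × 10⁵ — far more than any realistic processor count, so on small-diameter graphs the span is not the bottleneck.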

Parallel BFS Caveats
d is usually small: d = lg(n) for scale-free graphs. But the degrees are not bounded ⇒ parallel scaling will be memory-bound.
Lots of burdened parallelism: the loops are skinny, especially near the root and leaves of the BFS tree.
You are not "expected" to parallelize the BFS part of Homework #4. You may do it for extra credit though :)