
1 Futures, Scheduling, and Work Distribution Companion slides for The Art of Multiprocessor Programming by Maurice Herlihy & Nir Shavit (Some images in this lecture courtesy of Charles Leiserson)

2 Art of Multiprocessor Programming2 2 How to write Parallel Apps? Split a program into parallel parts In an effective way Thread management

3 Art of Multiprocessor Programming3 3 Matrix Multiplication

4 Art of Multiprocessor Programming4 4 Matrix Multiplication

5 Art of Multiprocessor Programming5 Matrix Multiplication c_ij = Σ_{k=0}^{N-1} a_ik * b_kj

6 Art of Multiprocessor Programming6 Matrix Multiplication class Worker extends Thread { int row, col; Worker(int row, int col) { this.row = row; this.col = col; } public void run() { double dotProduct = 0.0; for (int i = 0; i < n; i++) dotProduct += a[row][i] * b[i][col]; c[row][col] = dotProduct; }}

7 Art of Multiprocessor Programming7 Matrix Multiplication class Worker extends Thread { int row, col; Worker(int row, int col) { this.row = row; this.col = col; } public void run() { double dotProduct = 0.0; for (int i = 0; i < n; i++) dotProduct += a[row][i] * b[i][col]; c[row][col] = dotProduct; }} a thread

8 Art of Multiprocessor Programming8 Matrix Multiplication class Worker extends Thread { int row, col; Worker(int row, int col) { this.row = row; this.col = col; } public void run() { double dotProduct = 0.0; for (int i = 0; i < n; i++) dotProduct += a[row][i] * b[i][col]; c[row][col] = dotProduct; }} Which matrix entry to compute

9 Art of Multiprocessor Programming9 Matrix Multiplication class Worker extends Thread { int row, col; Worker(int row, int col) { this.row = row; this.col = col; } public void run() { double dotProduct = 0.0; for (int i = 0; i < n; i++) dotProduct += a[row][i] * b[i][col]; c[row][col] = dotProduct; }} Actual computation

10 Art of Multiprocessor Programming10 Matrix Multiplication void multiply() { Worker[][] worker = new Worker[n][n]; for (int row …) for (int col …) worker[row][col] = new Worker(row,col); for (int row …) for (int col …) worker[row][col].start(); for (int row …) for (int col …) worker[row][col].join(); }

11 Art of Multiprocessor Programming11 Matrix Multiplication void multiply() { Worker[][] worker = new Worker[n][n]; for (int row …) for (int col …) worker[row][col] = new Worker(row,col); for (int row …) for (int col …) worker[row][col].start(); for (int row …) for (int col …) worker[row][col].join(); } Create n x n threads

12 Art of Multiprocessor Programming12 Matrix Multiplication void multiply() { Worker[][] worker = new Worker[n][n]; for (int row …) for (int col …) worker[row][col] = new Worker(row,col); for (int row …) for (int col …) worker[row][col].start(); for (int row …) for (int col …) worker[row][col].join(); } Start them

13 Art of Multiprocessor Programming13 Matrix Multiplication void multiply() { Worker[][] worker = new Worker[n][n]; for (int row …) for (int col …) worker[row][col] = new Worker(row,col); for (int row …) for (int col …) worker[row][col].start(); for (int row …) for (int col …) worker[row][col].join(); } Wait for them to finish Start them

14 Art of Multiprocessor Programming14 Matrix Multiplication void multiply() { Worker[][] worker = new Worker[n][n]; for (int row …) for (int col …) worker[row][col] = new Worker(row,col); for (int row …) for (int col …) worker[row][col].start(); for (int row …) for (int col …) worker[row][col].join(); } Wait for them to finish Start them What’s wrong with this picture?

15 Art of Multiprocessor Programming15 Thread Overhead Threads require resources –Memory for stacks –Setup, teardown –Scheduler overhead Short-lived threads –Bad ratio of work to overhead

16 Art of Multiprocessor Programming16 Thread Pools More sensible to keep a pool of long-lived threads Threads assigned short-lived tasks –Run the task –Rejoin pool –Wait for next assignment

17 Art of Multiprocessor Programming17 Thread Pool = Abstraction Insulate programmer from platform –Big machine, big pool –Small machine, small pool Portable code –Works across platforms –Worry about algorithm, not platform
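The "big machine, big pool; small machine, small pool" idea can be made concrete with a short sketch. This is an assumption-laden illustration, not the book's code: the helper `parallelSquareSum` is hypothetical, and the pool is sized to `Runtime.getRuntime().availableProcessors()` so the identical code adapts to the platform.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class PoolDemo {
    // Hypothetical demo: a few long-lived pool threads run many
    // short-lived tasks, so the per-task thread overhead disappears.
    static int parallelSquareSum(int n) throws Exception {
        // Size the pool to the machine it runs on.
        ExecutorService exec =
            Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors());
        try {
            Future<Integer>[] fs = new Future[n];
            for (int i = 0; i < n; i++) {
                final int k = i;
                fs[i] = exec.submit(() -> k * k);  // short-lived task
            }
            int sum = 0;
            for (Future<Integer> f : fs) sum += f.get();  // collect results
            return sum;
        } finally {
            exec.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(parallelSquareSum(10));  // prints 285
    }
}
```

The same source runs unchanged on a 4-core laptop or a 64-core server; only the pool size differs.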

18 Art of Multiprocessor Programming18 ExecutorService Interface In java.util.concurrent –Task = Runnable object If no result value expected Calls run() method. –Task = Callable object If result value of type T expected Calls T call() method.

19 Art of Multiprocessor Programming19 Future Callable<T> task = …; … Future<T> future = executor.submit(task); … T value = future.get();

20 Art of Multiprocessor Programming20 Future Callable<T> task = …; … Future<T> future = executor.submit(task); … T value = future.get(); Submitting a Callable<T> task returns a Future<T> object

21 Art of Multiprocessor Programming21 Future Callable<T> task = …; … Future<T> future = executor.submit(task); … T value = future.get(); The Future’s get() method blocks until the value is available

22 Art of Multiprocessor Programming22 Future Runnable task = …; … Future<?> future = executor.submit(task); … future.get();

23 Art of Multiprocessor Programming23 Future Runnable task = …; … Future<?> future = executor.submit(task); … future.get(); Submitting a Runnable task returns a Future<?> object

24 Art of Multiprocessor Programming24 Future Runnable task = …; … Future<?> future = executor.submit(task); … future.get(); The Future’s get() method blocks until the computation is complete
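The fragments above elide the checked exceptions that get() can throw. A minimal self-contained version, with a hypothetical square task standing in for real work:

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class FutureDemo {
    // Submit a Callable, block on its Future, and handle the checked
    // exceptions the slides omit.  square() is this sketch's example task.
    static int square(int x) {
        ExecutorService exec = Executors.newCachedThreadPool();
        try {
            Callable<Integer> task = () -> x * x;        // Callable<T>: call() returns T
            Future<Integer> future = exec.submit(task);  // submit returns a Future<T>
            return future.get();                         // blocks until the value is available
        } catch (InterruptedException | ExecutionException e) {
            throw new RuntimeException(e);
        } finally {
            exec.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println(square(6));  // prints 36
    }
}
```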

25 Art of Multiprocessor Programming25 Note Executor Service submissions –Like New England traffic signs –Are purely advisory in nature The executor –Like the Boston driver –Is free to ignore any such advice –And could execute tasks sequentially …

26 Art of Multiprocessor Programming26 Matrix Addition

27 Art of Multiprocessor Programming27 Matrix Addition 4 parallel additions

28 Art of Multiprocessor Programming28 Matrix Addition Task class AddTask implements Runnable { Matrix a, b; // add this! public void run() { if (a.dim == 1) { c[0][0] = a[0][0] + b[0][0]; // base case } else { (partition a, b into half-size matrices a_ij and b_ij) Future<?> f00 = exec.submit(addTask(a00, b00)); … Future<?> f11 = exec.submit(addTask(a11, b11)); f00.get(); …; f11.get(); … }}}

29 Art of Multiprocessor Programming29 Matrix Addition Task class AddTask implements Runnable { Matrix a, b; // add this! public void run() { if (a.dim == 1) { c[0][0] = a[0][0] + b[0][0]; // base case } else { (partition a, b into half-size matrices a_ij and b_ij) Future<?> f00 = exec.submit(addTask(a00, b00)); … Future<?> f11 = exec.submit(addTask(a11, b11)); f00.get(); …; f11.get(); … }}} Constant-time operation

30 Art of Multiprocessor Programming30 Matrix Addition Task class AddTask implements Runnable { Matrix a, b; // add this! public void run() { if (a.dim == 1) { c[0][0] = a[0][0] + b[0][0]; // base case } else { (partition a, b into half-size matrices a_ij and b_ij) Future<?> f00 = exec.submit(addTask(a00, b00)); … Future<?> f11 = exec.submit(addTask(a11, b11)); f00.get(); …; f11.get(); … }}} Submit 4 tasks

31 Art of Multiprocessor Programming31 Matrix Addition Task class AddTask implements Runnable { Matrix a, b; // add this! public void run() { if (a.dim == 1) { c[0][0] = a[0][0] + b[0][0]; // base case } else { (partition a, b into half-size matrices a_ij and b_ij) Future<?> f00 = exec.submit(addTask(a00, b00)); … Future<?> f11 = exec.submit(addTask(a11, b11)); f00.get(); …; f11.get(); … }}} Base case: add directly

32 Art of Multiprocessor Programming32 Matrix Addition Task class AddTask implements Runnable { Matrix a, b; // add this! public void run() { if (a.dim == 1) { c[0][0] = a[0][0] + b[0][0]; // base case } else { (partition a, b into half-size matrices a_ij and b_ij) Future<?> f00 = exec.submit(addTask(a00, b00)); … Future<?> f11 = exec.submit(addTask(a11, b11)); f00.get(); …; f11.get(); … }}} Let them finish
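The AddTask pseudocode above can be filled in as a runnable sketch. Assumptions of this sketch, not the slides: plain `double[][]` arrays with offsets replace the abstract `Matrix` class, `n` is a power of two, and the helper names (`add`, `task`, `sum`) are invented here.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class MatrixAdd {
    // Recursively split the n x n problem into four quadrants and add
    // them in parallel, joining on the four futures before returning.
    static void add(ExecutorService exec, double[][] a, double[][] b,
                    double[][] c, int row, int col, int dim) throws Exception {
        if (dim == 1) {
            c[row][col] = a[row][col] + b[row][col];     // base case: add directly
        } else {
            int h = dim / 2;                             // partition into quadrants
            Future<?> f00 = exec.submit(task(exec, a, b, c, row,     col,     h));
            Future<?> f01 = exec.submit(task(exec, a, b, c, row,     col + h, h));
            Future<?> f10 = exec.submit(task(exec, a, b, c, row + h, col,     h));
            Future<?> f11 = exec.submit(task(exec, a, b, c, row + h, col + h, h));
            f00.get(); f01.get(); f10.get(); f11.get();  // let them finish
        }
    }

    static Callable<Void> task(ExecutorService exec, double[][] a, double[][] b,
                               double[][] c, int row, int col, int dim) {
        return () -> { add(exec, a, b, c, row, col, dim); return null; };
    }

    public static double[][] sum(double[][] a, double[][] b) throws Exception {
        // Cached pool grows on demand, so nested blocking get() calls
        // cannot starve themselves of threads.
        ExecutorService exec = Executors.newCachedThreadPool();
        try {
            double[][] c = new double[a.length][a.length];
            add(exec, a, b, c, 0, 0, a.length);
            return c;
        } finally {
            exec.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        double[][] c = sum(new double[][]{{1, 2}, {3, 4}},
                           new double[][]{{10, 20}, {30, 40}});
        System.out.println(c[0][0] + " " + c[1][1]);  // prints 11.0 44.0
    }
}
```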

33 Art of Multiprocessor Programming33 Dependencies Matrix example is not typical Tasks are independent –Don’t need results of one task … –To complete another Often tasks are not independent

34 Art of Multiprocessor Programming34 Fibonacci F(n) = 1 if n = 0 or 1; F(n) = F(n-1) + F(n-2) otherwise Note –Potential parallelism –Dependencies

35 Art of Multiprocessor Programming35 Disclaimer This Fibonacci implementation is –Egregiously inefficient So don’t try this at home or job! –But illustrates our point How to deal with dependencies Exercise: –Make this implementation efficient!

36 Art of Multiprocessor Programming36 class FibTask implements Callable<Integer> { static ExecutorService exec = Executors.newCachedThreadPool(); int arg; public FibTask(int n) { arg = n; } public Integer call() throws Exception { if (arg > 1) { Future<Integer> left = exec.submit(new FibTask(arg-1)); Future<Integer> right = exec.submit(new FibTask(arg-2)); return left.get() + right.get(); } else { return 1; }}} Multithreaded Fibonacci

37 Art of Multiprocessor Programming37 class FibTask implements Callable<Integer> { static ExecutorService exec = Executors.newCachedThreadPool(); int arg; public FibTask(int n) { arg = n; } public Integer call() throws Exception { if (arg > 1) { Future<Integer> left = exec.submit(new FibTask(arg-1)); Future<Integer> right = exec.submit(new FibTask(arg-2)); return left.get() + right.get(); } else { return 1; }}} Multithreaded Fibonacci Parallel calls

38 Art of Multiprocessor Programming38 class FibTask implements Callable<Integer> { static ExecutorService exec = Executors.newCachedThreadPool(); int arg; public FibTask(int n) { arg = n; } public Integer call() throws Exception { if (arg > 1) { Future<Integer> left = exec.submit(new FibTask(arg-1)); Future<Integer> right = exec.submit(new FibTask(arg-2)); return left.get() + right.get(); } else { return 1; }}} Multithreaded Fibonacci Pick up & combine results
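Assembled into one compilable class, with the generics (eaten by the transcript) restored, the base case aligned with F(0) = F(1) = 1, and the executor's lifecycle managed; passing the executor through the constructor is this sketch's choice, and the code remains egregiously inefficient by design:

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class FibDemo {
    static class FibTask implements Callable<Integer> {
        final ExecutorService exec;
        final int arg;
        FibTask(ExecutorService exec, int n) { this.exec = exec; arg = n; }
        public Integer call() throws Exception {
            if (arg > 1) {
                // Two parallel sub-calls with a dependency on both results.
                Future<Integer> left  = exec.submit(new FibTask(exec, arg - 1));
                Future<Integer> right = exec.submit(new FibTask(exec, arg - 2));
                return left.get() + right.get();   // pick up & combine results
            } else {
                return 1;                          // F(0) = F(1) = 1
            }
        }
    }

    public static int fib(int n) throws Exception {
        // Cached pool grows on demand, so the nested blocking get()
        // calls cannot deadlock for lack of threads.
        ExecutorService exec = Executors.newCachedThreadPool();
        try {
            return exec.submit(new FibTask(exec, n)).get();
        } finally {
            exec.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(fib(6));  // prints 13: 1 1 2 3 5 8 13
    }
}
```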

39 Art of Multiprocessor Programming39 The Blumofe-Leiserson DAG Model Multithreaded program is –A directed acyclic graph (DAG) –That unfolds dynamically Each node is –A single unit of work

40 Art of Multiprocessor Programming40 Fibonacci DAG fib(4)

41 Art of Multiprocessor Programming41 Fibonacci DAG fib(4) fib(3)

42 Art of Multiprocessor Programming42 Fibonacci DAG fib(4) fib(3) fib(2)

43 Art of Multiprocessor Programming43 Fibonacci DAG fib(4) fib(3) fib(2) fib(1)

44 Art of Multiprocessor Programming44 Fibonacci DAG fib(4) fib(3) fib(2) call get fib(2) fib(1)

45 Art of Multiprocessor Programming45 How Parallel is That? Define work: –Total time on one processor Define critical-path length: –Longest dependency path –Can’t beat that!

46 Unfolded DAG Art of Multiprocessor Programming46

47 Parallelism? Art of Multiprocessor Programming47 Serial fraction = 3/18 = 1/6 … Amdahl’s Law says speedup cannot exceed 6.

48 Work? Art of Multiprocessor Programming48 T_1: time needed on one processor Just count the nodes …. T_1 = 18

49 Critical Path? Art of Multiprocessor Programming49 T_∞: time needed on as many processors as you like

50 Critical Path? Art of Multiprocessor Programming50 Longest path …. T_∞ = 9 T_∞: time needed on as many processors as you like

51 Art of Multiprocessor Programming51 Notation Watch T_P = time on P processors T_1 = work (time on 1 processor) T_∞ = critical path length (time on ∞ processors)

52 Art of Multiprocessor Programming52 Simple Laws Work Law: T_P ≥ T_1/P –In one step, can’t do more than P work Critical Path Law: T_P ≥ T_∞ –Can’t beat infinite resources

53 Art of Multiprocessor Programming53 Performance Measures Speedup on P processors –Ratio T_1/T_P –How much faster with P processors Linear speedup –T_1/T_P = Θ(P) Max speedup (average parallelism) –T_1/T_∞

54 Sequential Composition Art of Multiprocessor Programming54 A then B Work: T_1(A) + T_1(B) Critical Path: T_∞(A) + T_∞(B)

55 Parallel Composition Art of Multiprocessor Programming55 A and B in parallel Work: T_1(A) + T_1(B) Critical Path: max{T_∞(A), T_∞(B)}

56 Art of Multiprocessor Programming56 Matrix Addition

57 Art of Multiprocessor Programming57 Matrix Addition 4 parallel additions

58 Art of Multiprocessor Programming58 Addition Let A_P(n) be running time –For an n x n matrix –On P processors For example –A_1(n) is work –A_∞(n) is critical path length

59 Art of Multiprocessor Programming59 Addition Work is A_1(n) = 4 A_1(n/2) + Θ(1) 4 spawned additions Partition, synch, etc

60 Art of Multiprocessor Programming60 Addition Work is A_1(n) = 4 A_1(n/2) + Θ(1) = Θ(n²) Same as double-loop summation

61 Art of Multiprocessor Programming61 Addition Critical Path length is A_∞(n) = A_∞(n/2) + Θ(1) Spawned additions in parallel Partition, synch, etc

62 Art of Multiprocessor Programming62 Addition Critical Path length is A_∞(n) = A_∞(n/2) + Θ(1) = Θ(log n)

63 Art of Multiprocessor Programming63 Matrix Multiplication Redux

64 Art of Multiprocessor Programming64 Matrix Multiplication Redux

65 Art of Multiprocessor Programming65 First Phase … 8 multiplications

66 Art of Multiprocessor Programming66 Second Phase … 4 additions

67 Art of Multiprocessor Programming67 Multiplication Work is M_1(n) = 8 M_1(n/2) + A_1(n) 8 parallel multiplications Final addition

68 Art of Multiprocessor Programming68 Multiplication Work is M_1(n) = 8 M_1(n/2) + Θ(n²) = Θ(n³) Same as serial triple-nested loop

69 Art of Multiprocessor Programming69 Multiplication Critical path length is M_∞(n) = M_∞(n/2) + A_∞(n) Half-size parallel multiplications Final addition

70 Art of Multiprocessor Programming70 Multiplication Critical path length is M_∞(n) = M_∞(n/2) + A_∞(n) = M_∞(n/2) + Θ(log n) = Θ(log² n)

71 Art of Multiprocessor Programming71 Parallelism M_1(n)/M_∞(n) = Θ(n³/log² n) To multiply two 1000 x 1000 matrices –1000³/log²1000 ≈ 10⁹/10² = 10⁷ Much more than the number of processors on any real machine

72 Art of Multiprocessor Programming72 Shared-Memory Multiprocessors Parallel applications –Do not have direct access to HW processors Mix of other jobs –All run together –Come & go dynamically

73 Art of Multiprocessor Programming73 Ideal Scheduling Hierarchy Tasks Processors User-level scheduler

74 Art of Multiprocessor Programming74 Realistic Scheduling Hierarchy Tasks Threads Processors User-level scheduler Kernel-level scheduler

75 Art of Multiprocessor Programming75 For Example Initially, –All P processors available for application Serial computation –Takes over one processor –Leaving P-1 for us –Waits for I/O –We get that processor back ….

76 Art of Multiprocessor Programming76 Speedup Map threads onto P processors Cannot get P-fold speedup –What if the kernel doesn’t cooperate? Can try for speedup proportional to P

77 Art of Multiprocessor Programming77 Scheduling Hierarchy User-level scheduler –Tells kernel which threads are ready Kernel-level scheduler –Synchronous (for analysis, not correctness!) –Picks p_i threads to schedule at step i

78 Greedy Scheduling A node is ready if predecessors are done Greedy: schedule as many ready nodes as possible Optimal scheduler is greedy (why?) But not every greedy scheduler is optimal Art of Multiprocessor Programming78

79 Greedy Scheduling Art of Multiprocessor Programming79 There are P processors Complete step: ≥ P nodes ready, run any P Incomplete step: < P nodes ready, run them all

80 Art of Multiprocessor Programming80 Theorem For any greedy scheduler, T_P ≤ T_1/P + T_∞

81 Art of Multiprocessor Programming81 Theorem For any greedy scheduler, Actual time T_P ≤ T_1/P + T_∞

82 Art of Multiprocessor Programming82 Theorem For any greedy scheduler, No better than work divided among processors T_P ≤ T_1/P + T_∞

83 Art of Multiprocessor Programming83 Theorem For any greedy scheduler, No better than critical path length T_P ≤ T_1/P + T_∞

84 Art of Multiprocessor Programming84 Proof: Number of complete steps ≤ T_1/P … … because each performs P work. Number of incomplete steps ≤ T_∞ … … because each shortens the unexecuted critical path by 1 QED

85 Art of Multiprocessor Programming85 Near-Optimality Theorem: any greedy scheduler is within a factor of 2 of optimal. Remark: Optimality is NP-hard!

86 Art of Multiprocessor Programming86 Proof of Near-Optimality Let T_P* be the optimal time. T_P* ≥ max{T_1/P, T_∞} From work and critical path laws T_P ≤ T_1/P + T_∞ Theorem T_P ≤ 2 max{T_1/P, T_∞} T_P ≤ 2 T_P* QED
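The greedy bound can be checked mechanically on a small DAG. The simulator below is a sketch (the method names and the example fork-join DAG are this illustration's, not the book's): it schedules unit-time nodes greedily on p processors and verifies T_P ≤ T_1/P + T_∞.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

public class GreedyBound {
    // Greedy schedule of a unit-time DAG on p processors: at each step
    // run up to p ready nodes.  succ[i] lists the nodes depending on i.
    static int greedyTime(int[][] succ, int p) {
        int n = succ.length;
        int[] pred = new int[n];
        for (int[] s : succ) for (int v : s) pred[v]++;
        Deque<Integer> ready = new ArrayDeque<>();
        for (int i = 0; i < n; i++) if (pred[i] == 0) ready.add(i);
        int steps = 0, done = 0;
        while (done < n) {
            steps++;
            List<Integer> run = new ArrayList<>();
            for (int i = 0; i < p && !ready.isEmpty(); i++) run.add(ready.remove());
            for (int u : run) {
                done++;
                for (int v : succ[u]) if (--pred[v] == 0) ready.add(v);
            }
        }
        return steps;
    }

    // T_inf: the longest path through the DAG, counted in nodes.
    static int criticalPath(int[][] succ) {
        Integer[] memo = new Integer[succ.length];
        int best = 0;
        for (int i = 0; i < succ.length; i++) best = Math.max(best, depth(succ, i, memo));
        return best;
    }
    static int depth(int[][] succ, int u, Integer[] memo) {
        if (memo[u] != null) return memo[u];
        int d = 0;
        for (int v : succ[u]) d = Math.max(d, depth(succ, v, memo));
        return memo[u] = d + 1;
    }

    public static void main(String[] args) {
        // A small fork-join DAG: node 0 fans out to 1..4, which all feed 5.
        int[][] succ = { {1, 2, 3, 4}, {5}, {5}, {5}, {5}, {} };
        int t1 = succ.length;            // work: one unit per node
        int tinf = criticalPath(succ);   // critical path: 0 -> 1 -> 5
        for (int p = 1; p <= 4; p++) {
            int tp = greedyTime(succ, p);
            if (tp > (double) t1 / p + tinf)   // the theorem's bound
                throw new AssertionError("bound violated at p=" + p);
            System.out.println("p=" + p + "  T_p=" + tp);
        }
    }
}
```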

87 Art of Multiprocessor Programming87 Work Distribution zzz…

88 Art of Multiprocessor Programming88 Work Dealing Yes!

89 Art of Multiprocessor Programming89 The Problem with Work Dealing D’oh!

90 Art of Multiprocessor Programming90 Work Stealing No work… Yes!

91 Art of Multiprocessor Programming91 Lock-Free Work Stealing Each thread has a pool of ready work Remove work without synchronizing If you run out of work, steal someone else’s Choose victim at random

92 Art of Multiprocessor Programming92 Local Work Pools Each work pool is a Double-Ended Queue

93 Art of Multiprocessor Programming93 Work DEQueue (Double-Ended Queue) pushBottom popBottom

94 Art of Multiprocessor Programming94 Obtain Work Obtain work Run task until Blocks or terminates popBottom

95 Art of Multiprocessor Programming95 New Work Unblock node Spawn node pushBottom

96 Art of Multiprocessor Programming96 Whatcha Gonna do When the Well Runs Dry? @&%$!! empty

97 Art of Multiprocessor Programming97 Steal Work from Others Pick a random thread’s DEQueue

98 Art of Multiprocessor Programming98 Steal this Task! popTop

99 Art of Multiprocessor Programming99 Task DEQueue Methods –pushBottom –popBottom –popTop Never happen concurrently

100 Art of Multiprocessor Programming100 Task DEQueue Methods –pushBottom –popBottom –popTop Most common – make them fast (minimize use of CAS)

101 Art of Multiprocessor Programming101 Ideal Wait-Free Linearizable Constant time Fortune Cookie: “It is better to be young, rich and beautiful, than old, poor, and ugly”

102 Art of Multiprocessor Programming102 Compromise Method popTop may fail if –Concurrent popTop succeeds, or a –Concurrent popBottom takes last task Blame the victim!

103 Art of Multiprocessor Programming103 Dreaded ABA Problem top CAS

104 Art of Multiprocessor Programming104 Dreaded ABA Problem top

105 Art of Multiprocessor Programming105 Dreaded ABA Problem top

106 Art of Multiprocessor Programming106 Dreaded ABA Problem top

107 Art of Multiprocessor Programming107 Dreaded ABA Problem top

108 Art of Multiprocessor Programming108 Dreaded ABA Problem top

109 Art of Multiprocessor Programming109 Dreaded ABA Problem top

110 Art of Multiprocessor Programming110 Dreaded ABA Problem top CAS Uh-Oh … Yes!

111 Art of Multiprocessor Programming111 Fix for Dreaded ABA stamp top bottom

112 Art of Multiprocessor Programming112 Bounded DEQueue public class BDEQueue { AtomicStampedReference<Integer> top; volatile int bottom; Runnable[] tasks; … }

113 Art of Multiprocessor Programming113 Bounded DEQueue public class BDEQueue { AtomicStampedReference<Integer> top; volatile int bottom; Runnable[] tasks; … } Index & stamp (updated together, atomically)

114 Art of Multiprocessor Programming114 Bounded DEQueue public class BDEQueue { AtomicStampedReference<Integer> top; volatile int bottom; Runnable[] tasks; … } Index of bottom task; written only by the owner, so no synchronization needed, but volatile supplies the memory barrier

115 Art of Multiprocessor Programming115 Bounded DEQueue public class BDEQueue { AtomicStampedReference<Integer> top; volatile int bottom; Runnable[] tasks; … } Array holding tasks

116 Art of Multiprocessor Programming116 pushBottom() public class BDEQueue { … void pushBottom(Runnable r){ tasks[bottom] = r; bottom++; } … }

117 Art of Multiprocessor Programming117 pushBottom() public class BDEQueue { … void pushBottom(Runnable r){ tasks[bottom] = r; bottom++; } … } Bottom is the index to store the new task in the array

118 Art of Multiprocessor Programming118 pushBottom() public class BDEQueue { … void pushBottom(Runnable r){ tasks[bottom] = r; bottom++; } … } Adjust the bottom index stamp top bottom

119 Art of Multiprocessor Programming119 Steal Work public Runnable popTop() { int[] stamp = new int[1]; int oldTop = top.get(stamp), newTop = oldTop + 1; int oldStamp = stamp[0], newStamp = oldStamp + 1; if (bottom <= oldTop) return null; Runnable r = tasks[oldTop]; if (top.CAS(oldTop, newTop, oldStamp, newStamp)) return r; return null; }

120 Art of Multiprocessor Programming120 Steal Work public Runnable popTop() { int[] stamp = new int[1]; int oldTop = top.get(stamp), newTop = oldTop + 1; int oldStamp = stamp[0], newStamp = oldStamp + 1; if (bottom <= oldTop) return null; Runnable r = tasks[oldTop]; if (top.CAS(oldTop, newTop, oldStamp, newStamp)) return r; return null; } Read top (value & stamp)

121 Art of Multiprocessor Programming121 Steal Work public Runnable popTop() { int[] stamp = new int[1]; int oldTop = top.get(stamp), newTop = oldTop + 1; int oldStamp = stamp[0], newStamp = oldStamp + 1; if (bottom <= oldTop) return null; Runnable r = tasks[oldTop]; if (top.CAS(oldTop, newTop, oldStamp, newStamp)) return r; return null; } Compute new value & stamp

122 Art of Multiprocessor Programming122 Steal Work public Runnable popTop() { int[] stamp = new int[1]; int oldTop = top.get(stamp), newTop = oldTop + 1; int oldStamp = stamp[0], newStamp = oldStamp + 1; if (bottom <= oldTop) return null; Runnable r = tasks[oldTop]; if (top.CAS(oldTop, newTop, oldStamp, newStamp)) return r; return null; } Quit if queue is empty stamp top bottom

123 Art of Multiprocessor Programming123 Steal Work public Runnable popTop() { int[] stamp = new int[1]; int oldTop = top.get(stamp), newTop = oldTop + 1; int oldStamp = stamp[0], newStamp = oldStamp + 1; if (bottom <= oldTop) return null; Runnable r = tasks[oldTop]; if (top.CAS(oldTop, newTop, oldStamp, newStamp)) return r; return null; } Try to steal the task stamp top CAS bottom

124 Art of Multiprocessor Programming124 Steal Work public Runnable popTop() { int[] stamp = new int[1]; int oldTop = top.get(stamp), newTop = oldTop + 1; int oldStamp = stamp[0], newStamp = oldStamp + 1; if (bottom <= oldTop) return null; Runnable r = tasks[oldTop]; if (top.CAS(oldTop, newTop, oldStamp, newStamp)) return r; return null; } Give up if conflict occurs

125 Art of Multiprocessor Programming125 Runnable popBottom() { if (bottom == 0) return null; bottom--; Runnable r = tasks[bottom]; int[] stamp = new int[1]; int oldTop = top.get(stamp), newTop = 0; int oldStamp = stamp[0], newStamp = oldStamp + 1; if (bottom > oldTop) return r; if (bottom == oldTop) { bottom = 0; if (top.CAS(oldTop, newTop, oldStamp, newStamp)) return r; } top.set(newTop, newStamp); bottom = 0; return null; } Take Work

126 Art of Multiprocessor Programming126 Runnable popBottom() { if (bottom == 0) return null; bottom--; Runnable r = tasks[bottom]; int[] stamp = new int[1]; int oldTop = top.get(stamp), newTop = 0; int oldStamp = stamp[0], newStamp = oldStamp + 1; if (bottom > oldTop) return r; if (bottom == oldTop) { bottom = 0; if (top.CAS(oldTop, newTop, oldStamp, newStamp)) return r; } top.set(newTop, newStamp); bottom = 0; return null; } Take Work Make sure queue is non-empty

127 Art of Multiprocessor Programming127 Runnable popBottom() { if (bottom == 0) return null; bottom--; Runnable r = tasks[bottom]; int[] stamp = new int[1]; int oldTop = top.get(stamp), newTop = 0; int oldStamp = stamp[0], newStamp = oldStamp + 1; if (bottom > oldTop) return r; if (bottom == oldTop) { bottom = 0; if (top.CAS(oldTop, newTop, oldStamp, newStamp)) return r; } top.set(newTop, newStamp); bottom = 0; return null; } Take Work Prepare to grab bottom task

128 Art of Multiprocessor Programming128 Runnable popBottom() { if (bottom == 0) return null; bottom--; Runnable r = tasks[bottom]; int[] stamp = new int[1]; int oldTop = top.get(stamp), newTop = 0; int oldStamp = stamp[0], newStamp = oldStamp + 1; if (bottom > oldTop) return r; if (bottom == oldTop) { bottom = 0; if (top.CAS(oldTop, newTop, oldStamp, newStamp)) return r; } top.set(newTop, newStamp); bottom = 0; return null; } Take Work Read top, & prepare new values

129 Art of Multiprocessor Programming129 Runnable popBottom() { if (bottom == 0) return null; bottom--; Runnable r = tasks[bottom]; int[] stamp = new int[1]; int oldTop = top.get(stamp), newTop = 0; int oldStamp = stamp[0], newStamp = oldStamp + 1; if (bottom > oldTop) return r; if (bottom == oldTop) { bottom = 0; if (top.CAS(oldTop, newTop, oldStamp, newStamp)) return r; } top.set(newTop, newStamp); bottom = 0; return null; } Take Work If top & bottom are one or more apart, no conflict stamp top bottom

130 Art of Multiprocessor Programming130 Runnable popBottom() { if (bottom == 0) return null; bottom--; Runnable r = tasks[bottom]; int[] stamp = new int[1]; int oldTop = top.get(stamp), newTop = 0; int oldStamp = stamp[0], newStamp = oldStamp + 1; if (bottom > oldTop) return r; if (bottom == oldTop) { bottom = 0; if (top.CAS(oldTop, newTop, oldStamp, newStamp)) return r; } top.set(newTop, newStamp); bottom = 0; return null; } Take Work At most one item left stamp top bottom

131 Art of Multiprocessor Programming131 Runnable popBottom() { if (bottom == 0) return null; bottom--; Runnable r = tasks[bottom]; int[] stamp = new int[1]; int oldTop = top.get(stamp), newTop = 0; int oldStamp = stamp[0], newStamp = oldStamp + 1; if (bottom > oldTop) return r; if (bottom == oldTop) { bottom = 0; if (top.CAS(oldTop, newTop, oldStamp, newStamp)) return r; } top.set(newTop, newStamp); bottom = 0; return null; } Take Work Try to steal last task. Reset bottom because the DEQueue will be empty even if unsuccessful (why?)

132 Art of Multiprocessor Programming132 Runnable popBottom() { if (bottom == 0) return null; bottom--; Runnable r = tasks[bottom]; int[] stamp = new int[1]; int oldTop = top.get(stamp), newTop = 0; int oldStamp = stamp[0], newStamp = oldStamp + 1; if (bottom > oldTop) return r; if (bottom == oldTop) { bottom = 0; if (top.CAS(oldTop, newTop, oldStamp, newStamp)) return r; } top.set(newTop, newStamp); bottom = 0; return null; } Take Work stamp top CAS bottom I win CAS

133 Art of Multiprocessor Programming133 Runnable popBottom() { if (bottom == 0) return null; bottom--; Runnable r = tasks[bottom]; int[] stamp = new int[1]; int oldTop = top.get(stamp), newTop = 0; int oldStamp = stamp[0], newStamp = oldStamp + 1; if (bottom > oldTop) return r; if (bottom == oldTop) { bottom = 0; if (top.CAS(oldTop, newTop, oldStamp, newStamp)) return r; } top.set(newTop, newStamp); bottom = 0; return null; } Take Work stamp top CAS bottom If I lose CAS, thief must have won…

134 Art of Multiprocessor Programming134 Runnable popBottom() { if (bottom == 0) return null; bottom--; Runnable r = tasks[bottom]; int[] stamp = new int[1]; int oldTop = top.get(stamp), newTop = 0; int oldStamp = stamp[0], newStamp = oldStamp + 1; if (bottom > oldTop) return r; if (bottom == oldTop) { bottom = 0; if (top.CAS(oldTop, newTop, oldStamp, newStamp)) return r; } top.set(newTop, newStamp); bottom = 0; return null; } Take Work Failed to get last task (bottom could be less than top) Must still reset top and bottom since deque is empty
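The popTop and popBottom walkthroughs assemble into one class. A sketch with some choices of its own: CAS is spelled out as AtomicStampedReference.compareAndSet, capacity is a constructor argument, and the top index is a boxed Integer, whose comparison inside compareAndSet is by reference; for indices below 128 the autoboxing cache makes that behave like int equality, and a production version would pack index and stamp into a single AtomicLong instead.

```java
import java.util.concurrent.atomic.AtomicStampedReference;

public class BDEQueue {
    // Stamped top defeats the ABA problem: every steal bumps the stamp.
    final AtomicStampedReference<Integer> top =
        new AtomicStampedReference<>(0, 0);
    volatile int bottom = 0;       // owner-only writes; volatile is the barrier
    final Runnable[] tasks;

    public BDEQueue(int capacity) { tasks = new Runnable[capacity]; }

    public void pushBottom(Runnable r) {
        tasks[bottom] = r;         // store the task, then publish it
        bottom++;
    }

    public Runnable popTop() {                  // called by thieves
        int[] stamp = new int[1];
        int oldTop = top.get(stamp), newTop = oldTop + 1;
        int oldStamp = stamp[0], newStamp = oldStamp + 1;
        if (bottom <= oldTop) return null;      // nothing to steal
        Runnable r = tasks[oldTop];
        if (top.compareAndSet(oldTop, newTop, oldStamp, newStamp))
            return r;
        return null;                            // lost a race: blame the victim
    }

    public Runnable popBottom() {               // called by the owner
        if (bottom == 0) return null;
        bottom--;
        Runnable r = tasks[bottom];
        int[] stamp = new int[1];
        int oldTop = top.get(stamp), newTop = 0;
        int oldStamp = stamp[0], newStamp = oldStamp + 1;
        if (bottom > oldTop) return r;          // thieves cannot reach this task
        if (bottom == oldTop) {                 // at most one task left
            bottom = 0;
            if (top.compareAndSet(oldTop, newTop, oldStamp, newStamp))
                return r;
        }
        top.set(newTop, newStamp);              // deque is empty: reset both ends
        bottom = 0;
        return null;
    }
}
```

A single-threaded smoke test: push two tasks, steal one from the top, take one from the bottom, and find the deque empty.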

135 Art of Multiprocessor Programming135 Old English Proverb “May as well be hanged for stealing a sheep as a goat” From which we conclude: –Stealing was punished severely –Sheep were worth more than goats

136 Art of Multiprocessor Programming136 Variations Stealing is expensive –Pay CAS –Only one task taken What if –Move more than one task –Randomly balance loads?

137 Art of Multiprocessor Programming137 Work Balancing Queues of size 5 and 2 are balanced to ⌊(2+5)/2⌋ = 3 and ⌈(2+5)/2⌉ = 4

138 Art of Multiprocessor Programming138 Work-Balancing Thread public void run() { int me = ThreadID.get(); while (true) { Runnable task = queue[me].deq(); if (task != null) task.run(); int size = queue[me].size(); if (random.nextInt(size+1) == size) { int victim = random.nextInt(queue.length); int min = …, max = …; synchronized (queue[min]) { synchronized (queue[max]) { balance(queue[min], queue[max]); }}}}}

139 Art of Multiprocessor Programming139 Work-Balancing Thread public void run() { int me = ThreadID.get(); while (true) { Runnable task = queue[me].deq(); if (task != null) task.run(); int size = queue[me].size(); if (random.nextInt(size+1) == size) { int victim = random.nextInt(queue.length); int min = …, max = …; synchronized (queue[min]) { synchronized (queue[max]) { balance(queue[min], queue[max]); }}}}} Keep running tasks

140 Art of Multiprocessor Programming140 Work-Balancing Thread public void run() { int me = ThreadID.get(); while (true) { Runnable task = queue[me].deq(); if (task != null) task.run(); int size = queue[me].size(); if (random.nextInt(size+1) == size) { int victim = random.nextInt(queue.length); int min = …, max = …; synchronized (queue[min]) { synchronized (queue[max]) { balance(queue[min], queue[max]); }}}}} With probability 1/|queue|

141 Art of Multiprocessor Programming141 Work-Balancing Thread public void run() { int me = ThreadID.get(); while (true) { Runnable task = queue[me].deq(); if (task != null) task.run(); int size = queue[me].size(); if (random.nextInt(size+1) == size) { int victim = random.nextInt(queue.length); int min = …, max = …; synchronized (queue[min]) { synchronized (queue[max]) { balance(queue[min], queue[max]); }}}}} Choose random victim

142 Art of Multiprocessor Programming142 Work-Balancing Thread public void run() { int me = ThreadID.get(); while (true) { Runnable task = queue[me].deq(); if (task != null) task.run(); int size = queue[me].size(); if (random.nextInt(size+1) == size) { int victim = random.nextInt(queue.length); int min = …, max = …; synchronized (queue[min]) { synchronized (queue[max]) { balance(queue[min], queue[max]); }}}}} Lock queues in canonical order

143 Art of Multiprocessor Programming143 Work-Balancing Thread public void run() { int me = ThreadID.get(); while (true) { Runnable task = queue[me].deq(); if (task != null) task.run(); int size = queue[me].size(); if (random.nextInt(size+1) == size) { int victim = random.nextInt(queue.length); int min = …, max = …; synchronized (queue[min]) { synchronized (queue[max]) { balance(queue[min], queue[max]); }}}}} Rebalance queues
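The balance() call the slides leave abstract can be sketched simply: move tasks from the larger queue to the smaller until their sizes differ by at most one. Assumptions of this sketch: a plain Deque stands in for the slides' work queue, and the caller (as in the slides) holds both locks while balancing.

```java
import java.util.ArrayDeque;
import java.util.Deque;

public class Balancer {
    // Move tasks from max to min until sizes differ by at most one.
    static void balance(Deque<Runnable> min, Deque<Runnable> max) {
        while (max.size() > min.size() + 1)
            min.addLast(max.pollLast());
    }

    public static void main(String[] args) {
        Deque<Runnable> a = new ArrayDeque<>(), b = new ArrayDeque<>();
        for (int i = 0; i < 7; i++) b.add(() -> {});  // 7 tasks vs 0
        balance(a, b);
        System.out.println(a.size() + " " + b.size()); // prints 3 4
    }
}
```

With queues of size 2 and 5 this yields 3 and 4, matching the Work Balancing slide.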

144 Art of Multiprocessor Programming144 Work Stealing & Balancing Clean separation between app & scheduling layer Works well when number of processors fluctuates. Works on “black-box” operating systems

145 Art of Multiprocessor Programming145 This work is licensed under a Creative Commons Attribution-ShareAlike 2.5 License. You are free: –to Share — to copy, distribute and transmit the work –to Remix — to adapt the work Under the following conditions: –Attribution. You must attribute the work to “The Art of Multiprocessor Programming” (but not in any way that suggests that the authors endorse you or your use of the work). –Share Alike. If you alter, transform, or build upon this work, you may distribute the resulting work only under the same, similar or a compatible license. For any reuse or distribution, you must make clear to others the license terms of this work. The best way to do this is with a link to http://creativecommons.org/licenses/by-sa/3.0/. Any of the above conditions can be waived if you get permission from the copyright holder. Nothing in this license impairs or restricts the author's moral rights.


