Futures, Scheduling, and Work Distribution Companion slides for The Art of Multiprocessor Programming by Maurice Herlihy & Nir Shavit.



Art of Multiprocessor Programming© Herlihy – Shavit How to write Parallel Apps? How to –split a program into parallel parts –in an effective way –manage the resulting threads

Art of Multiprocessor Programming© Herlihy – Shavit Matrix Multiplication

Art of Multiprocessor Programming© Herlihy – Shavit Matrix Multiplication c_ij = Σ_{k=0}^{N-1} a_ik * b_kj

Art of Multiprocessor Programming© Herlihy – Shavit Matrix Multiplication class Worker extends Thread { int row, col; Worker(int row, int col) { this.row = row; this.col = col; } public void run() { double dotProduct = 0.0; for (int i = 0; i < n; i++) dotProduct += a[row][i] * b[i][col]; c[row][col] = dotProduct; }}

Art of Multiprocessor Programming© Herlihy – Shavit Matrix Multiplication class Worker extends Thread { int row, col; Worker(int row, int col) { this.row = row; this.col = col; } public void run() { double dotProduct = 0.0; for (int i = 0; i < n; i++) dotProduct += a[row][i] * b[i][col]; c[row][col] = dotProduct; }} a thread

Art of Multiprocessor Programming© Herlihy – Shavit Matrix Multiplication class Worker extends Thread { int row, col; Worker(int row, int col) { this.row = row; this.col = col; } public void run() { double dotProduct = 0.0; for (int i = 0; i < n; i++) dotProduct += a[row][i] * b[i][col]; c[row][col] = dotProduct; }} Which matrix entry to compute

Art of Multiprocessor Programming© Herlihy – Shavit Matrix Multiplication class Worker extends Thread { int row, col; Worker(int row, int col) { this.row = row; this.col = col; } public void run() { double dotProduct = 0.0; for (int i = 0; i < n; i++) dotProduct += a[row][i] * b[i][col]; c[row][col] = dotProduct; }} Actual computation

Art of Multiprocessor Programming© Herlihy – Shavit Matrix Multiplication void multiply() { Worker[][] worker = new Worker[n][n]; for (int row …) for (int col …) worker[row][col] = new Worker(row,col); for (int row …) for (int col …) worker[row][col].start(); for (int row …) for (int col …) worker[row][col].join(); }

Art of Multiprocessor Programming© Herlihy – Shavit Matrix Multiplication void multiply() { Worker[][] worker = new Worker[n][n]; for (int row …) for (int col …) worker[row][col] = new Worker(row,col); for (int row …) for (int col …) worker[row][col].start(); for (int row …) for (int col …) worker[row][col].join(); } Create nxn threads

Art of Multiprocessor Programming© Herlihy – Shavit Matrix Multiplication void multiply() { Worker[][] worker = new Worker[n][n]; for (int row …) for (int col …) worker[row][col] = new Worker(row,col); for (int row …) for (int col …) worker[row][col].start(); for (int row …) for (int col …) worker[row][col].join(); } Start them

Art of Multiprocessor Programming© Herlihy – Shavit Matrix Multiplication void multiply() { Worker[][] worker = new Worker[n][n]; for (int row …) for (int col …) worker[row][col] = new Worker(row,col); for (int row …) for (int col …) worker[row][col].start(); for (int row …) for (int col …) worker[row][col].join(); } Wait for them to finish Start them

Art of Multiprocessor Programming© Herlihy – Shavit Matrix Multiplication void multiply() { Worker[][] worker = new Worker[n][n]; for (int row …) for (int col …) worker[row][col] = new Worker(row,col); for (int row …) for (int col …) worker[row][col].start(); for (int row …) for (int col …) worker[row][col].join(); } Wait for them to finish What’s wrong with this picture? Start them
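For reference, here is a self-contained version of the one-thread-per-entry multiply sketched in the slides above (a hedged reconstruction: the shared fields a, b, c, n, the concrete loop bounds, and the main() driver are assumptions added so the fragment compiles and runs as written):

import static java.lang.System.out;

public class ThreadPerEntryMultiply {
    static int n = 4;
    static double[][] a = new double[n][n], b = new double[n][n], c = new double[n][n];

    static class Worker extends Thread {
        int row, col;
        Worker(int row, int col) { this.row = row; this.col = col; }
        public void run() {
            double dotProduct = 0.0;                    // dot product of row of a and column of b
            for (int i = 0; i < n; i++)
                dotProduct += a[row][i] * b[i][col];
            c[row][col] = dotProduct;
        }
    }

    static void multiply() throws InterruptedException {
        Worker[][] worker = new Worker[n][n];
        for (int row = 0; row < n; row++)               // create n x n threads
            for (int col = 0; col < n; col++)
                worker[row][col] = new Worker(row, col);
        for (int row = 0; row < n; row++)               // start them
            for (int col = 0; col < n; col++)
                worker[row][col].start();
        for (int row = 0; row < n; row++)               // wait for them to finish
            for (int col = 0; col < n; col++)
                worker[row][col].join();
    }

    public static void main(String[] args) throws InterruptedException {
        for (int i = 0; i < n; i++) { a[i][i] = 1.0; b[i][i] = 2.0; }   // simple test data
        multiply();
        out.println(c[0][0]);   // prints 2.0
    }
}

Note this is exactly the one-thread-per-entry design the next slides critique: n × n short-lived threads.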

Art of Multiprocessor Programming© Herlihy – Shavit Thread Overhead Threads Require resources –Memory for stacks –Setup, teardown Scheduler overhead Worse for short-lived threads

Art of Multiprocessor Programming© Herlihy – Shavit Thread Pools More sensible to keep a pool of long-lived threads Threads assigned short-lived tasks –Runs the task –Rejoins pool –Waits for next assignment

Art of Multiprocessor Programming© Herlihy – Shavit Thread Pool = Abstraction Insulate programmer from platform –Big machine, big pool –And vice-versa Portable code –Runs well on any platform –No need to mix algorithm/platform concerns

Art of Multiprocessor Programming© Herlihy – Shavit ExecutorService Interface In java.util.concurrent –Task = Runnable object If no result value expected Calls run() method. –Task = Callable object If result value of type T expected Calls T call() method.
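A minimal sketch of the earlier matrix multiplication driven through an ExecutorService instead of one thread per entry (hedged: the pool size, the use of invokeAll, and the names here are assumptions added for illustration, not slide code):

import java.util.*;
import java.util.concurrent.*;

class PoolMultiply {
    // The same n*n dot-product tasks, but submitted to a fixed pool of
    // long-lived worker threads instead of spawning one thread per entry.
    static void multiply(double[][] a, double[][] b, double[][] c, int n)
            throws InterruptedException {
        ExecutorService exec =
            Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors());
        List<Callable<Void>> tasks = new ArrayList<>();
        for (int row = 0; row < n; row++) {
            for (int col = 0; col < n; col++) {
                final int r = row, k = col;
                tasks.add(() -> {                       // one short-lived task per entry
                    double dot = 0.0;
                    for (int i = 0; i < n; i++)
                        dot += a[r][i] * b[i][k];
                    c[r][k] = dot;
                    return null;
                });
            }
        }
        exec.invokeAll(tasks);                          // blocks until every entry is computed
        exec.shutdown();                                // release the pool's threads
    }
}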

Art of Multiprocessor Programming© Herlihy – Shavit Future Callable<T> task = …; … Future<T> future = executor.submit(task); … T value = future.get();

Art of Multiprocessor Programming© Herlihy – Shavit Future Callable<T> task = …; … Future<T> future = executor.submit(task); … T value = future.get(); Submitting a Callable<T> task returns a Future<T> object

Art of Multiprocessor Programming© Herlihy – Shavit Future Callable<T> task = …; … Future<T> future = executor.submit(task); … T value = future.get(); The Future’s get() method blocks until the value is available

Art of Multiprocessor Programming© Herlihy – Shavit Future Runnable task = …; … Future<?> future = executor.submit(task); … future.get();

Art of Multiprocessor Programming© Herlihy – Shavit Future Runnable task = …; … Future<?> future = executor.submit(task); … future.get(); Submitting a Runnable task returns a Future<?> object

Art of Multiprocessor Programming© Herlihy – Shavit Future Runnable task = …; … Future<?> future = executor.submit(task); … future.get(); The Future’s get() method blocks until the computation is complete

Art of Multiprocessor Programming© Herlihy – Shavit Note Executor Service submissions –Like New England traffic signs –Are purely advisory in nature The executor –Like the New England driver –Is free to ignore any such advice –And could execute tasks sequentially …

Art of Multiprocessor Programming© Herlihy – Shavit Matrix Addition

Art of Multiprocessor Programming© Herlihy – Shavit Matrix Addition 4 parallel additions

Art of Multiprocessor Programming© Herlihy – Shavit Matrix Addition Task class AddTask implements Runnable { Matrix a, b; // multiply this! public void run() { if (a.dim == 1) { c[0][0] = a[0][0] + b[0][0]; // base case } else { (partition a, b into half-size matrices aij and bij) Future<?> f00 = exec.submit(add(a00,b00)); … Future<?> f11 = exec.submit(add(a11,b11)); f00.get(); …; f11.get(); … }} This is not real Java code (see notes)

Art of Multiprocessor Programming© Herlihy – Shavit Matrix Addition Task class AddTask implements Runnable { Matrix a, b; // multiply this! public void run() { if (a.dim == 1) { c[0][0] = a[0][0] + b[0][0]; // base case } else { (partition a, b into half-size matrices aij and bij) Future<?> f00 = exec.submit(add(a00,b00)); … Future<?> f11 = exec.submit(add(a11,b11)); f00.get(); …; f11.get(); … }} Base case: add directly

Art of Multiprocessor Programming© Herlihy – Shavit Matrix Addition Task class AddTask implements Runnable { Matrix a, b; // multiply this! public void run() { if (a.dim == 1) { c[0][0] = a[0][0] + b[0][0]; // base case } else { (partition a, b into half-size matrices aij and bij) Future<?> f00 = exec.submit(add(a00,b00)); … Future<?> f11 = exec.submit(add(a11,b11)); f00.get(); …; f11.get(); … }} Constant-time operation

Art of Multiprocessor Programming© Herlihy – Shavit Matrix Addition Task class AddTask implements Runnable { Matrix a, b; // multiply this! public void run() { if (a.dim == 1) { c[0][0] = a[0][0] + b[0][0]; // base case } else { (partition a, b into half-size matrices aij and bij) Future<?> f00 = exec.submit(add(a00,b00)); … Future<?> f11 = exec.submit(add(a11,b11)); f00.get(); …; f11.get(); … }} Submit 4 tasks

Art of Multiprocessor Programming© Herlihy – Shavit Matrix Addition Task class AddTask implements Runnable { Matrix a, b; // multiply this! public void run() { if (a.dim == 1) { c[0][0] = a[0][0] + b[0][0]; // base case } else { (partition a, b into half-size matrices aij and bij) Future<?> f00 = exec.submit(add(a00,b00)); … Future<?> f11 = exec.submit(add(a11,b11)); f00.get(); …; f11.get(); … }} Let them finish
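The slides note this is not real Java. Below is a hedged, compilable rendering of the same idea: the Matrix wrapper (backing array plus quadrant views), the explicit result matrix c, and all names are assumptions added for illustration, not slide code, and dim is assumed to be a power of 2.

import java.util.concurrent.*;

class Matrix {
    final double[][] data;
    final int rowOff, colOff, dim;
    Matrix(double[][] data, int rowOff, int colOff, int dim) {
        this.data = data; this.rowOff = rowOff; this.colOff = colOff; this.dim = dim;
    }
    Matrix(int dim) { this(new double[dim][dim], 0, 0, dim); }
    double get(int i, int j) { return data[rowOff + i][colOff + j]; }
    void set(int i, int j, double v) { data[rowOff + i][colOff + j] = v; }
    Matrix quad(int i, int j) {          // quadrant view, no copying
        int half = dim / 2;
        return new Matrix(data, rowOff + i * half, colOff + j * half, half);
    }
}

class AddTask implements Runnable {
    static final ExecutorService exec = Executors.newCachedThreadPool();
    final Matrix a, b, c;                // compute c = a + b
    AddTask(Matrix a, Matrix b, Matrix c) { this.a = a; this.b = b; this.c = c; }
    public void run() {
        try {
            if (a.dim == 1) {
                c.set(0, 0, a.get(0, 0) + b.get(0, 0));          // base case: add directly
            } else {
                Future<?>[] f = new Future<?>[4];
                int k = 0;
                for (int i = 0; i < 2; i++)                      // submit the 4 quadrant additions
                    for (int j = 0; j < 2; j++)
                        f[k++] = exec.submit(new AddTask(a.quad(i, j), b.quad(i, j), c.quad(i, j)));
                for (Future<?> fut : f) fut.get();               // let them finish
            }
        } catch (InterruptedException | ExecutionException e) {
            throw new RuntimeException(e);
        }
    }
}

A driver would allocate three Matrix objects of the same power-of-two dimension, submit new AddTask(a, b, c), wait on the returned Future, and shut the pool down. A cached pool is used here so that tasks blocked in get() do not starve their children of threads.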

Art of Multiprocessor Programming© Herlihy – Shavit Dependencies Matrix example is not typical Tasks are independent –Don’t need results of one task … –To complete another Often tasks are not independent

Art of Multiprocessor Programming© Herlihy – Shavit Fibonacci F(n) = 1 if n = 0 or 1; F(n) = F(n-1) + F(n-2) otherwise Note –potential parallelism –Dependencies

Art of Multiprocessor Programming© Herlihy – Shavit Disclaimer This Fibonacci implementation is –Egregiously inefficient So don’t deploy it! –But illustrates our point How to deal with dependencies Exercise: –Make this implementation efficient!

Art of Multiprocessor Programming© Herlihy – Shavit Multithreaded Fibonacci class FibTask implements Callable<Integer> { static ExecutorService exec = Executors.newCachedThreadPool(); int arg; public FibTask(int n) { arg = n; } public Integer call() { if (arg > 2) { Future<Integer> left = exec.submit(new FibTask(arg-1)); Future<Integer> right = exec.submit(new FibTask(arg-2)); return left.get() + right.get(); } else { return 1; }}}

Art of Multiprocessor Programming© Herlihy – Shavit Multithreaded Fibonacci class FibTask implements Callable<Integer> { static ExecutorService exec = Executors.newCachedThreadPool(); int arg; public FibTask(int n) { arg = n; } public Integer call() { if (arg > 2) { Future<Integer> left = exec.submit(new FibTask(arg-1)); Future<Integer> right = exec.submit(new FibTask(arg-2)); return left.get() + right.get(); } else { return 1; }}} Parallel calls

Art of Multiprocessor Programming© Herlihy – Shavit Multithreaded Fibonacci class FibTask implements Callable<Integer> { static ExecutorService exec = Executors.newCachedThreadPool(); int arg; public FibTask(int n) { arg = n; } public Integer call() { if (arg > 2) { Future<Integer> left = exec.submit(new FibTask(arg-1)); Future<Integer> right = exec.submit(new FibTask(arg-2)); return left.get() + right.get(); } else { return 1; }}} Pick up & combine results
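A small usage sketch (hedged: it assumes call() is declared as public Integer call() throws Exception, which the Callable interface permits, and the class and pool names are as in the slide):

public class FibDemo {
    public static void main(String[] args) throws Exception {
        System.out.println(new FibTask(10).call());   // prints 55 under this F(1) = F(2) = 1 convention
        FibTask.exec.shutdown();                      // otherwise idle pool threads keep the JVM alive for up to a minute
    }
}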

Art of Multiprocessor Programming© Herlihy – Shavit Dynamic Behavior Multithreaded program is –A directed acyclic graph (DAG) –That unfolds dynamically Each node is –A single unit of work

Art of Multiprocessor Programming© Herlihy – Shavit Fib DAG fib(4) fib(3)fib(2) submit get fib(2)fib(1)

Art of Multiprocessor Programming© Herlihy – Shavit Arrows Reflect Dependencies fib(4) fib(3)fib(2) submit get fib(2)fib(1)

Art of Multiprocessor Programming© Herlihy – Shavit How Parallel is That? Define work: –Total time on one processor Define critical-path length: –Longest dependency path –Can’t beat that!

Art of Multiprocessor Programming© Herlihy – Shavit Fib Work fib(4)fib(3)fib(2) fib(1)

Art of Multiprocessor Programming© Herlihy – Shavit Fib Work work is

Art of Multiprocessor Programming© Herlihy – Shavit Fib Critical Path fib(4)

Art of Multiprocessor Programming© Herlihy – Shavit Fib Critical Path fib(4) Critical path length is

Art of Multiprocessor Programming© Herlihy – Shavit Notation Watch T_P = time on P processors T_1 = work (time on 1 processor) T_∞ = critical path length (time on ∞ processors)

Art of Multiprocessor Programming© Herlihy – Shavit Simple Bounds T_P ≥ T_1/P –In one step, can’t do more than P work T_P ≥ T_∞ –Can’t beat infinite resources

Art of Multiprocessor Programming© Herlihy – Shavit More Notation Watch Speedup on P processors –Ratio T_1/T_P –How much faster with P processors Linear speedup –T_1/T_P = Θ(P) Max speedup (average parallelism) –T_1/T_∞
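To make the definitions concrete, a worked example with illustrative numbers (not from the slides): suppose a program has work T_1 = 10^7 and critical path T_∞ = 10^3. Its average parallelism is T_1/T_∞ = 10^4, so up to about 10^4 processors can be kept busy. On P = 100 processors the bounds give T_P ≥ max(T_1/P, T_∞) = max(10^5, 10^3) = 10^5, so speedup is capped at 100 (linear); on P = 10^5 processors the T_∞ bound dominates and speedup can never exceed 10^4.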

Art of Multiprocessor Programming© Herlihy – Shavit Matrix Addition

Art of Multiprocessor Programming© Herlihy – Shavit Matrix Addition 4 parallel additions

Art of Multiprocessor Programming© Herlihy – Shavit Addition Let A_P(n) be running time –For n x n matrix –on P processors For example –A_1(n) is work –A_∞(n) is critical path length

Art of Multiprocessor Programming© Herlihy – Shavit Addition Work is A_1(n) = 4 A_1(n/2) + Θ(1) 4 spawned additions Partition, synch, etc

Art of Multiprocessor Programming© Herlihy – Shavit Addition Work is A_1(n) = 4 A_1(n/2) + Θ(1) = Θ(n²) Same as double-loop summation

Art of Multiprocessor Programming© Herlihy – Shavit Addition Critical Path length is A_∞(n) = A_∞(n/2) + Θ(1) spawned additions in parallel Partition, synch, etc

Art of Multiprocessor Programming© Herlihy – Shavit Addition Critical Path length is A_∞(n) = A_∞(n/2) + Θ(1) = Θ(log n)

Art of Multiprocessor Programming© Herlihy – Shavit Matrix Multiplication Redux

Art of Multiprocessor Programming© Herlihy – Shavit Matrix Multiplication Redux

Art of Multiprocessor Programming© Herlihy – Shavit First Phase … 8 multiplications

Art of Multiprocessor Programming© Herlihy – Shavit Second Phase … 4 additions

Art of Multiprocessor Programming© Herlihy – Shavit Multiplication Work is M_1(n) = 8 M_1(n/2) + A_1(n) 8 parallel multiplications Final addition

Art of Multiprocessor Programming© Herlihy – Shavit Multiplication Work is M_1(n) = 8 M_1(n/2) + Θ(n²) = Θ(n³) Same as serial triple-nested loop

Art of Multiprocessor Programming© Herlihy – Shavit Multiplication Critical path length is M_∞(n) = M_∞(n/2) + A_∞(n) Half-size parallel multiplications Final addition

Art of Multiprocessor Programming© Herlihy – Shavit Multiplication Critical path length is M_∞(n) = M_∞(n/2) + A_∞(n) = M_∞(n/2) + Θ(log n) = Θ(log² n)
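One way to see the last step (a sketch added here, not from the slides): unrolling the recurrence contributes one Θ(log)-sized addition term per halving level,

$$M_\infty(n) \;=\; \sum_{i=0}^{\log_2 n} \Theta\!\Big(\log \frac{n}{2^i}\Big) \;=\; \Theta\!\Big(\sum_{j=0}^{\log_2 n} j\Big) \;=\; \Theta(\log^2 n).$$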

Art of Multiprocessor Programming© Herlihy – Shavit Parallelism M_1(n)/M_∞(n) = Θ(n³/log² n) To multiply two 1000 x 1000 matrices –10^9/10^2 = 10^7 Much more than number of processors on any real machine

Art of Multiprocessor Programming© Herlihy – Shavit Shared-Memory Multiprocessors Parallel applications –Do not have direct access to HW processors Mix of other jobs –All run together –Come & go dynamically

Art of Multiprocessor Programming© Herlihy – Shavit Ideal Scheduling Hierarchy Tasks Processors User-level scheduler

Art of Multiprocessor Programming© Herlihy – Shavit Realistic Scheduling Hierarchy Tasks Threads Processors User-level scheduler Kernel-level scheduler

Art of Multiprocessor Programming© Herlihy – Shavit For Example Initially, –All P processors available for application Serial computation –Takes over one processor –Leaving P-1 for us –Waits for I/O –We get that processor back ….

Art of Multiprocessor Programming© Herlihy – Shavit Speedup Map threads onto P processors Cannot get P-fold speedup –What if the kernel doesn’t cooperate? Can try for speedup proportional to –time-averaged number of processors the kernel gives us

Art of Multiprocessor Programming© Herlihy – Shavit Scheduling Hierarchy User-level scheduler –Tells kernel which threads are ready Kernel-level scheduler –Synchronous (for analysis, not correctness!) –Picks p_i threads to schedule at step i –Processor average over T steps is: P_A = (1/T) Σ_{i=1}^{T} p_i

Art of Multiprocessor Programming© Herlihy – Shavit Greed is Good Greedy scheduler –Schedules as much as it can –At each time step Optimal schedule is greedy (why?) But not every greedy schedule is optimal

Art of Multiprocessor Programming© Herlihy – Shavit Theorem Greedy scheduler ensures that T ≤ T_1/P_A + T_∞(P-1)/P_A

Art of Multiprocessor Programming© Herlihy – Shavit Deconstructing T ≤ T_1/P_A + T_∞(P-1)/P_A

Art of Multiprocessor Programming© Herlihy – Shavit Deconstructing T ≤ T_1/P_A + T_∞(P-1)/P_A Actual time

Art of Multiprocessor Programming© Herlihy – Shavit Deconstructing T ≤ T_1/P_A + T_∞(P-1)/P_A Work divided by processor average

Art of Multiprocessor Programming© Herlihy – Shavit Deconstructing T ≤ T_1/P_A + T_∞(P-1)/P_A Cannot do better than critical path length

Art of Multiprocessor Programming© Herlihy – Shavit Deconstructing T ≤ T_1/P_A + T_∞(P-1)/P_A The higher the average the better it is …
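Plugging in illustrative numbers (not from the slides): with T_1 = 10^6, T_∞ = 10^3, P = 10 physical processors, and a processor average of P_A = 6, the bound gives T ≤ 10^6/6 + 10^3·(10-1)/6 ≈ 166,667 + 1,500 ≈ 168,000 steps. That is essentially T_1/P_A, since the work term dominates whenever the program has plenty of parallelism.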

Art of Multiprocessor Programming© Herlihy – Shavit Proof Strategy Bound this!

Art of Multiprocessor Programming© Herlihy – Shavit Put Tokens in Buckets Two buckets: work – processor found work; idle – processor available but couldn’t find work

Art of Multiprocessor Programming© Herlihy – Shavit At the end …. the work and idle buckets together hold Total #tokens = P_A · T (each scheduled processor drops one token per step)

Art of Multiprocessor Programming© Herlihy – Shavit At the end …. the work bucket holds T_1 tokens

Art of Multiprocessor Programming© Herlihy – Shavit Must Show the idle bucket holds ≤ T_∞(P-1) tokens

Art of Multiprocessor Programming© Herlihy – Shavit Idle Steps An idle step is one where there is at least one idle processor Only time idle tokens are generated Focus on idle steps

Art of Multiprocessor Programming© Herlihy – Shavit Every Move You Make … Scheduler is greedy At least one node ready Number of idle threads in one idle step –At most p_i - 1 ≤ P-1 How many idle steps?

Art of Multiprocessor Programming© Herlihy – Shavit Unexecuted sub-DAG fib(4) fib(3)fib(2) submit get fib(2) fib(1)

Art of Multiprocessor Programming© Herlihy – Shavit fib(1) Unexecuted sub-DAG fib(4) fib(3)fib(2) fib(1) Longest path

Art of Multiprocessor Programming© Herlihy – Shavit Unexecuted sub-DAG fib(4) fib(3)fib(2) fib(1) Last node ready to execute

Art of Multiprocessor Programming© Herlihy – Shavit Every Step You Take … Consider longest path in unexecuted sub-DAG at step i At least one node in path ready Length of path shrinks by at least one at each idle step Initially, path is T_∞ So there are at most T_∞ idle steps

Art of Multiprocessor Programming© Herlihy – Shavit Counting Tokens At most P-1 idle threads per idle step At most T_∞ idle steps So idle bucket contains at most –T_∞(P-1) tokens Both buckets together contain at most –T_1 + T_∞(P-1) tokens

Art of Multiprocessor Programming© Herlihy – Shavit Recapitulating Total #tokens = P_A·T ≤ T_1 + T_∞(P-1), and dividing by P_A gives T ≤ T_1/P_A + T_∞(P-1)/P_A

Art of Multiprocessor Programming© Herlihy – Shavit Turns Out This bound is within a factor of 2 of optimal Finding the actual optimal schedule is NP-complete

Art of Multiprocessor Programming© Herlihy – Shavit Work Distribution zzz…

Art of Multiprocessor Programming© Herlihy – Shavit Work Dealing Yes!

Art of Multiprocessor Programming© Herlihy – Shavit The Problem with Work Dealing D’oh!

Art of Multiprocessor Programming© Herlihy – Shavit Work Stealing No work… zzz Yes!

Art of Multiprocessor Programming© Herlihy – Shavit Lock-Free Work Stealing Each thread has a pool of ready work Remove work without synchronizing If you run out of work, steal someone else’s Choose victim at random
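The overall thread loop these bullets describe can be sketched as follows (a hedged reconstruction, not slide code: it uses the bounded DEQueue developed in the slides that follow, and the thread index, the shared queue[] array, and the isEmpty() check are assumptions):

import java.util.Random;

// Work-stealing worker: drain local tasks from the bottom of your own DEQueue;
// when it runs dry, repeatedly pick a random victim and try to steal from the
// top of its DEQueue.
class WorkStealingThread extends Thread {
    static BDEQueue[] queue;          // one DEQueue per thread (assumed shared array)
    final int me;                     // this thread's index (stands in for ThreadID.get())
    final Random random = new Random();

    WorkStealingThread(int me) { this.me = me; }

    public void run() {
        Runnable task = queue[me].popBottom();
        while (true) {
            while (task != null) {            // run local work
                task.run();
                task = queue[me].popBottom();
            }
            while (task == null) {            // out of work: steal
                Thread.yield();
                int victim = random.nextInt(queue.length);
                if (!queue[victim].isEmpty())
                    task = queue[victim].popTop();
            }
        }
    }
}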

Art of Multiprocessor Programming© Herlihy – Shavit Local Work Pools Each work pool is a Double-Ended Queue

Art of Multiprocessor Programming© Herlihy – Shavit Work DEQueue* pushBottom popBottom (*Double-Ended Queue)

Art of Multiprocessor Programming© Herlihy – Shavit Obtain Work Obtain work Run thread until Blocks or terminates popBottom

Art of Multiprocessor Programming© Herlihy – Shavit New Work Unblock node Spawn node pushBottom

Art of Multiprocessor Programming© Herlihy – Shavit Whatcha Gonna do When the Well Runs empty

Art of Multiprocessor Programming© Herlihy – Shavit Steal Work from Others Pick a random guy’s DEQueue

Art of Multiprocessor Programming© Herlihy – Shavit Steal this Thread! popTop

Art of Multiprocessor Programming© Herlihy – Shavit Thread DEQueue Methods –pushBottom –popBottom –popTop pushBottom and popBottom are called only by the owner, so they never happen concurrently with each other

Art of Multiprocessor Programming© Herlihy – Shavit Thread DEQueue Methods –pushBottom –popBottom –popTop pushBottom and popBottom are the most common – make them fast (minimize use of CAS)

Art of Multiprocessor Programming© Herlihy – Shavit Ideal Wait-Free Linearizable Constant time Fortune Cookie: “It is better to be young, rich and beautiful, than old, poor, and ugly”

Art of Multiprocessor Programming© Herlihy – Shavit Compromise Method popTop may fail if –Concurrent popTop succeeds, or a –Concurrent popBottom takes last work Blame the victim!

Art of Multiprocessor Programming© Herlihy – Shavit Dreaded ABA Problem top CAS

Art of Multiprocessor Programming© Herlihy – Shavit Dreaded ABA Problem top

Art of Multiprocessor Programming© Herlihy – Shavit Dreaded ABA Problem top

Art of Multiprocessor Programming© Herlihy – Shavit Dreaded ABA Problem top

Art of Multiprocessor Programming© Herlihy – Shavit Dreaded ABA Problem top

Art of Multiprocessor Programming© Herlihy – Shavit Dreaded ABA Problem top

Art of Multiprocessor Programming© Herlihy – Shavit Dreaded ABA Problem top

Art of Multiprocessor Programming© Herlihy – Shavit Dreaded ABA Problem top CAS Uh-Oh … Yes!

Art of Multiprocessor Programming© Herlihy – Shavit Fix for Dreaded ABA stamp top bottom

Art of Multiprocessor Programming© Herlihy – Shavit Bounded DEQueue public class BDEQueue { AtomicStampedReference<Integer> top; volatile int bottom; Runnable[] tasks; … }

Art of Multiprocessor Programming© Herlihy – Shavit Bounded DEQueue public class BDEQueue { AtomicStampedReference<Integer> top; volatile int bottom; Runnable[] tasks; … } Index & Stamp (updated together, atomically)

Art of Multiprocessor Programming© Herlihy – Shavit Bounded DEQueue public class BDEQueue { AtomicStampedReference<Integer> top; volatile int bottom; Runnable[] tasks; … } Index of the bottom task (no CAS needed, but writes must become visible to thieves – hence volatile, which acts as a memory barrier)

Art of Multiprocessor Programming© Herlihy – Shavit Bounded DEQueue public class BDEQueue { AtomicStampedReference<Integer> top; volatile int bottom; Runnable[] tasks; … } Array holding tasks

Art of Multiprocessor Programming© Herlihy – Shavit pushBottom() public class BDEQueue { … void pushBottom(Runnable r){ tasks[bottom] = r; bottom++; } … }

Art of Multiprocessor Programming© Herlihy – Shavit pushBottom() public class BDEQueue { … void pushBottom(Runnable r){ tasks[bottom] = r; bottom++; } … } Bottom is the index to store the new task in the array

Art of Multiprocessor Programming© Herlihy – Shavit pushBottom() public class BDEQueue { … void pushBottom(Runnable r){ tasks[bottom] = r; bottom++; } … } Adjust the bottom index stamp top bottom

Art of Multiprocessor Programming© Herlihy – Shavit Steal Work public Runnable popTop() { int[] stamp = new int[1]; int oldTop = top.get(stamp), newTop = oldTop + 1; int oldStamp = stamp[0], newStamp = oldStamp + 1; if (bottom <= oldTop) return null; Runnable r = tasks[oldTop]; if (top.CAS(oldTop, newTop, oldStamp, newStamp)) return r; return null; }

Art of Multiprocessor Programming© Herlihy – Shavit Steal Work public Runnable popTop() { int[] stamp = new int[1]; int oldTop = top.get(stamp), newTop = oldTop + 1; int oldStamp = stamp[0], newStamp = oldStamp + 1; if (bottom <= oldTop) return null; Runnable r = tasks[oldTop]; if (top.CAS(oldTop, newTop, oldStamp, newStamp)) return r; return null; } Read top (value & stamp)

Art of Multiprocessor Programming© Herlihy – Shavit Steal Work public Runnable popTop() { int[] stamp = new int[1]; int oldTop = top.get(stamp), newTop = oldTop + 1; int oldStamp = stamp[0], newStamp = oldStamp + 1; if (bottom <= oldTop) return null; Runnable r = tasks[oldTop]; if (top.CAS(oldTop, newTop, oldStamp, newStamp)) return r; return null; } Compute new value & stamp

Art of Multiprocessor Programming© Herlihy – Shavit Steal Work public Runnable popTop() { int[] stamp = new int[1]; int oldTop = top.get(stamp), newTop = oldTop + 1; int oldStamp = stamp[0], newStamp = oldStamp + 1; if (bottom <= oldTop) return null; Runnable r = tasks[oldTop]; if (top.CAS(oldTop, newTop, oldStamp, newStamp)) return r; return null; } Quit if queue is empty stamp top bottom

Art of Multiprocessor Programming© Herlihy – Shavit Steal Work public Runnable popTop() { int[] stamp = new int[1]; int oldTop = top.get(stamp), newTop = oldTop + 1; int oldStamp = stamp[0], newStamp = oldStamp + 1; if (bottom <= oldTop) return null; Runnable r = tasks[oldTop]; if (top.CAS(oldTop, newTop, oldStamp, newStamp)) return r; return null; } Try to steal the task stamp top CAS bottom

Art of Multiprocessor Programming© Herlihy – Shavit Steal Work public Runnable popTop() { int[] stamp = new int[1]; int oldTop = top.get(stamp), newTop = oldTop + 1; int oldStamp = stamp[0], newStamp = oldStamp + 1; if (bottom <= oldTop) return null; Runnable r = tasks[oldTop]; if (top.CAS(oldTop, newTop, oldStamp, newStamp)) return r; return null; } Give up if conflict occurs

Art of Multiprocessor Programming© Herlihy – Shavit Take Work Runnable popBottom() { if (bottom == 0) return null; bottom--; Runnable r = tasks[bottom]; int[] stamp = new int[1]; int oldTop = top.get(stamp), newTop = 0; int oldStamp = stamp[0], newStamp = oldStamp + 1; if (bottom > oldTop) return r; if (bottom == oldTop){ bottom = 0; if (top.CAS(oldTop, newTop, oldStamp, newStamp)) return r; } top.set(newTop,newStamp); return null; }

Art of Multiprocessor Programming© Herlihy – Shavit Runnable popBottom() { if (bottom == 0) return null; bottom--; Runnable r = tasks[bottom]; int[] stamp = new int[1]; int oldTop = top.get(stamp), newTop = 0; int oldStamp = stamp[0], newStamp = oldStamp + 1; if (bottom > oldTop) return r; if (bottom == oldTop){ bottom = 0; if (top.CAS(oldTop, newTop, oldStamp, newStamp)) return r; } top.set(newTop,newStamp); return null; } Take Work Make sure queue is non-empty

Art of Multiprocessor Programming© Herlihy – Shavit Runnable popBottom() { if (bottom == 0) return null; bottom--; Runnable r = tasks[bottom]; int[] stamp = new int[1]; int oldTop = top.get(stamp), newTop = 0; int oldStamp = stamp[0], newStamp = oldStamp + 1; if (bottom > oldTop) return r; if (bottom == oldTop){ bottom = 0; if (top.CAS(oldTop, newTop, oldStamp, newStamp)) return r; } top.set(newTop,newStamp); return null; } Take Work Prepare to grab bottom task

Art of Multiprocessor Programming© Herlihy – Shavit Runnable popBottom() { if (bottom == 0) return null; bottom--; Runnable r = tasks[bottom]; int[] stamp = new int[1]; int oldTop = top.get(stamp), newTop = 0; int oldStamp = stamp[0], newStamp = oldStamp + 1; if (bottom > oldTop) return r; if (bottom == oldTop){ bottom = 0; if (top.CAS(oldTop, newTop, oldStamp, newStamp)) return r; } top.set(newTop,newStamp); return null; } Take Work Read top, & prepare new values

Art of Multiprocessor Programming© Herlihy – Shavit Runnable popBottom() { if (bottom == 0) return null; bottom--; Runnable r = tasks[bottom]; int[] stamp = new int[1]; int oldTop = top.get(stamp), newTop = 0; int oldStamp = stamp[0], newStamp = oldStamp + 1; if (bottom > oldTop) return r; if (bottom == oldTop){ bottom = 0; if (top.CAS(oldTop, newTop, oldStamp, newStamp)) return r; } return null; } Take Work If top & bottom 1 or more apart, no conflict stamp top bottom

Art of Multiprocessor Programming© Herlihy – Shavit Runnable popBottom() { if (bottom == 0) return null; bottom--; Runnable r = tasks[bottom]; int[] stamp = new int[1]; int oldTop = top.get(stamp), newTop = 0; int oldStamp = stamp[0], newStamp = oldStamp + 1; if (bottom > oldTop) return r; if (bottom == oldTop){ bottom = 0; if (top.CAS(oldTop, newTop, oldStamp, newStamp)) return r; } top.set(newTop,newStamp); return null; } Take Work At most one item left stamp top bottom

Art of Multiprocessor Programming© Herlihy – Shavit Runnable popBottom() { if (bottom == 0) return null; bottom--; Runnable r = tasks[bottom]; int[] stamp = new int[1]; int oldTop = top.get(stamp), newTop = 0; int oldStamp = stamp[0], newStamp = oldStamp + 1; if (bottom > oldTop) return r; if (bottom == oldTop){ bottom = 0; if (top.CAS(oldTop, newTop, oldStamp, newStamp)) return r; } top.set(newTop,newStamp); return null; } Take Work Try to steal last item. In any case reset Bottom because the DEQueue will be empty even if unsuccessful (why?)

Art of Multiprocessor Programming© Herlihy – Shavit Runnable popBottom() { if (bottom == 0) return null; bottom--; Runnable r = tasks[bottom]; int[] stamp = new int[1]; int oldTop = top.get(stamp), newTop = 0; int oldStamp = stamp[0], newStamp = oldStamp + 1; if (bottom > oldTop) return r; if (bottom == oldTop){ bottom = 0; if (top.CAS(oldTop, newTop, oldStamp, newStamp)) return r; } top.set(newTop,newStamp); return null; } Take Work stamp top CAS bottom I win CAS

Art of Multiprocessor Programming© Herlihy – Shavit Runnable popBottom() { if (bottom == 0) return null; bottom--; Runnable r = tasks[bottom]; int[] stamp = new int[1]; int oldTop = top.get(stamp), newTop = 0; int oldStamp = stamp[0], newStamp = oldStamp + 1; if (bottom > oldTop) return r; if (bottom == oldTop){ bottom = 0; if (top.CAS(oldTop, newTop, oldStamp, newStamp)) return r; } top.set(newTop,newStamp); return null; } Take Work stamp top CAS bottom I lose CAS Thief must have won…

Art of Multiprocessor Programming© Herlihy – Shavit Runnable popBottom() { if (bottom == 0) return null; bottom--; Runnable r = tasks[bottom]; int[] stamp = new int[1]; int oldTop = top.get(stamp), newTop = 0; int oldStamp = stamp[0], newStamp = oldStamp + 1; if (bottom > oldTop) return r; if (bottom == oldTop){ bottom = 0; if (top.CAS(oldTop, newTop, oldStamp, newStamp)) return r; } top.set(newTop,newStamp); return null; } Take Work failed to get last item Must still reset top
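Putting the preceding slides together, here is a consolidated sketch of the bounded DEQueue (hedged: the capacity constructor, the field initializers, and the isEmpty() helper are assumptions added so the class compiles, and the slides' top.CAS is written as the real AtomicStampedReference.compareAndSet):

import java.util.concurrent.atomic.AtomicStampedReference;

public class BDEQueue {
    // Caveat: compareAndSet compares the boxed Integer by reference, so this
    // sketch behaves as intended only while indices stay within the default
    // Integer.valueOf cache (0..127); a production deque packs index and
    // stamp into a single word instead.
    AtomicStampedReference<Integer> top = new AtomicStampedReference<>(0, 0);
    volatile int bottom = 0;
    Runnable[] tasks;

    public BDEQueue(int capacity) {        // assumed constructor (not in the slides)
        tasks = new Runnable[capacity];
    }

    public boolean isEmpty() {             // assumed helper (not in the slides)
        return bottom <= top.getReference();
    }

    // Owner only: add work at the bottom. No CAS needed; the volatile write
    // to bottom publishes the new task to thieves.
    public void pushBottom(Runnable r) {
        tasks[bottom] = r;
        bottom++;
    }

    // Thieves: try to steal the task at the top.
    public Runnable popTop() {
        int[] stamp = new int[1];
        int oldTop = top.get(stamp), newTop = oldTop + 1;
        int oldStamp = stamp[0], newStamp = oldStamp + 1;
        if (bottom <= oldTop)              // queue looks empty
            return null;
        Runnable r = tasks[oldTop];
        if (top.compareAndSet(oldTop, newTop, oldStamp, newStamp))
            return r;                      // stole it
        return null;                       // lost the race: give up
    }

    // Owner only: take work from the bottom.
    public Runnable popBottom() {
        if (bottom == 0)
            return null;
        bottom--;
        Runnable r = tasks[bottom];
        int[] stamp = new int[1];
        int oldTop = top.get(stamp), newTop = 0;
        int oldStamp = stamp[0], newStamp = oldStamp + 1;
        if (bottom > oldTop)               // at least two tasks: no conflict possible
            return r;
        if (bottom == oldTop) {            // exactly one task left: race any thief for it
            bottom = 0;
            if (top.compareAndSet(oldTop, newTop, oldStamp, newStamp))
                return r;
        }
        top.set(newTop, newStamp);         // queue is empty either way: reset top (and bump its stamp)
        return null;
    }
}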

Art of Multiprocessor Programming© Herlihy – Shavit Old English Proverb “May as well be hanged for stealing a sheep as a goat” From which we conclude –Stealing was punished severely –Sheep were worth more than goats

Art of Multiprocessor Programming© Herlihy – Shavit Variations Stealing is expensive –Pay CAS –Only one thread taken What if –Randomly balance loads?

Art of Multiprocessor Programming© Herlihy – Shavit Work Balancing Queues of size 5 and 2 rebalance to ⌊(2+5)/2⌋ = 3 and ⌈(2+5)/2⌉ = 4 tasks

Art of Multiprocessor Programming© Herlihy – Shavit Work-Balancing Thread public void run() { int me = ThreadID.get(); while (true) { Runnable task = queue[me].deq(); if (task != null) task.run(); int size = queue[me].size(); if (random.nextInt(size+1) == size) { int victim = random.nextInt(queue.length); int min = …, max = …; synchronized (queue[min]) { synchronized (queue[max]) { balance(queue[min], queue[max]); }}}}}

Art of Multiprocessor Programming© Herlihy – Shavit Work-Balancing Thread public void run() { int me = ThreadID.get(); while (true) { Runnable task = queue[me].deq(); if (task != null) task.run(); int size = queue[me].size(); if (random.nextInt(size+1) == size) { int victim = random.nextInt(queue.length); int min = …, max = …; synchronized (queue[min]) { synchronized (queue[max]) { balance(queue[min], queue[max]); }}}}} Keep running tasks

Art of Multiprocessor Programming© Herlihy – Shavit Work-Balancing Thread public void run() { int me = ThreadID.get(); while (true) { Runnable task = queue[me].deq(); if (task != null) task.run(); int size = queue[me].size(); if (random.nextInt(size+1) == size) { int victim = random.nextInt(queue.length); int min = …, max = …; synchronized (queue[min]) { synchronized (queue[max]) { balance(queue[min], queue[max]); }}}}} With probability 1/|queue|

Art of Multiprocessor Programming© Herlihy – Shavit Work-Balancing Thread public void run() { int me = ThreadID.get(); while (true) { Runnable task = queue[me].deq(); if (task != null) task.run(); int size = queue[me].size(); if (random.nextInt(size+1) == size) { int victim = random.nextInt(queue.length); int min = …, max = …; synchronized (queue[min]) { synchronized (queue[max]) { balance(queue[min], queue[max]); }}}}} Choose random victim

Art of Multiprocessor Programming© Herlihy – Shavit Work-Balancing Thread public void run() { int me = ThreadID.get(); while (true) { Runnable task = queue[me].deq(); if (task != null) task.run(); int size = queue[me].size(); if (random.nextInt(size+1) == size) { int victim = random.nextInt(queue.length); int min = …, max = …; synchronized (queue[min]) { synchronized (queue[max]) { balance(queue[min], queue[max]); }}}}} Lock queues in canonical order

Art of Multiprocessor Programming© Herlihy – Shavit Work-Balancing Thread public void run() { int me = ThreadID.get(); while (true) { Runnable task = queue[me].deq(); if (task != null) task.run(); int size = queue[me].size(); if (random.nextInt(size+1) == size) { int victim = random.nextInt(queue.length); int min = …, max = …; synchronized (queue[min]) { synchronized (queue[max]) { balance(queue[min], queue[max]); }}}}} Rebalance queues
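The slides call balance(queue[min], queue[max]) but never show it. A minimal sketch of that step (an assumption, not slide code), using java.util.Queue for concreteness where the slides' queues expose deq() and size():

import java.util.Queue;

class Balancer {
    // Move tasks from the longer queue to the shorter one until their sizes
    // differ by at most one. qMin is assumed to be the shorter queue; the
    // caller already holds both queues' locks in canonical order.
    static void balance(Queue<Runnable> qMin, Queue<Runnable> qMax) {
        while (qMax.size() > qMin.size() + 1) {
            Runnable task = qMax.poll();   // take from the longer queue
            if (task == null) break;       // it emptied under us
            qMin.add(task);                // hand the task to the shorter queue
        }
    }
}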

Art of Multiprocessor Programming© Herlihy – Shavit Work Stealing & Balancing Clean separation between app & scheduling layer Works well when number of processors fluctuates. Works on “black-box” operating systems

Art of Multiprocessor Programming© Herlihy – Shavit TOM MARVOLO RIDDLE

Art of Multiprocessor Programming© Herlihy – Shavit This work is licensed under a Creative Commons Attribution-ShareAlike 2.5 License. You are free: –to Share — to copy, distribute and transmit the work –to Remix — to adapt the work Under the following conditions: –Attribution. You must attribute the work to “The Art of Multiprocessor Programming” (but not in any way that suggests that the authors endorse you or your use of the work). –Share Alike. If you alter, transform, or build upon this work, you may distribute the resulting work only under the same, similar or a compatible license. For any reuse or distribution, you must make clear to others the license terms of this work. The best way to do this is with a link to – Any of the above conditions can be waived if you get permission from the copyright holder. Nothing in this license impairs or restricts the author's moral rights.