Futures, Scheduling, and Work Distribution: Companion slides for The Art of Multiprocessor Programming by Maurice Herlihy & Nir Shavit


Art of Multiprocessor Programming1 Futures, Scheduling, and Work Distribution Companion slides for The Art of Multiprocessor Programming by Maurice Herlihy & Nir Shavit Modified by Rajeev Alur for CIS 640 University of Pennsylvania

Art of Multiprocessor Programming2 How to write Parallel Apps? How to –split a program into parallel parts –in an effective way –and manage the resulting threads

Art of Multiprocessor Programming3 Matrix Multiplication

Art of Multiprocessor Programming4 Matrix Multiplication c_ij = Σ_{k=0}^{N-1} a_ik · b_kj

Art of Multiprocessor Programming5 Matrix Multiplication class Worker extends Thread { int row, col; Worker(int row, int col) { this.row = row; this.col = col; } public void run() { double dotProduct = 0.0; for (int i = 0; i < n; i++) dotProduct += a[row][i] * b[i][col]; c[row][col] = dotProduct; }}

Art of Multiprocessor Programming6 Matrix Multiplication class Worker extends Thread { int row, col; Worker(int row, int col) { this.row = row; this.col = col; } public void run() { double dotProduct = 0.0; for (int i = 0; i < n; i++) dotProduct += a[row][i] * b[i][col]; c[row][col] = dotProduct; }} a thread

Art of Multiprocessor Programming7 Matrix Multiplication class Worker extends Thread { int row, col; Worker(int row, int col) { this.row = row; this.col = col; } public void run() { double dotProduct = 0.0; for (int i = 0; i < n; i++) dotProduct += a[row][i] * b[i][col]; c[row][col] = dotProduct; }} Which matrix entry to compute

Art of Multiprocessor Programming8 Matrix Multiplication class Worker extends Thread { int row, col; Worker(int row, int col) { this.row = row; this.col = col; } public void run() { double dotProduct = 0.0; for (int i = 0; i < n; i++) dotProduct += a[row][i] * b[i][col]; c[row][col] = dotProduct; }} Actual computation

Art of Multiprocessor Programming9 Matrix Multiplication void multiply() { Worker[][] worker = new Worker[n][n]; for (int row …) for (int col …) worker[row][col] = new Worker(row,col); for (int row …) for (int col …) worker[row][col].start(); for (int row …) for (int col …) worker[row][col].join(); }

Art of Multiprocessor Programming10 Matrix Multiplication void multiply() { Worker[][] worker = new Worker[n][n]; for (int row …) for (int col …) worker[row][col] = new Worker(row,col); for (int row …) for (int col …) worker[row][col].start(); for (int row …) for (int col …) worker[row][col].join(); } Create n x n threads

Art of Multiprocessor Programming11 Matrix Multiplication void multiply() { Worker[][] worker = new Worker[n][n]; for (int row …) for (int col …) worker[row][col] = new Worker(row,col); for (int row …) for (int col …) worker[row][col].start(); for (int row …) for (int col …) worker[row][col].join(); } Start them

Art of Multiprocessor Programming12 Matrix Multiplication void multiply() { Worker[][] worker = new Worker[n][n]; for (int row …) for (int col …) worker[row][col] = new Worker(row,col); for (int row …) for (int col …) worker[row][col].start(); for (int row …) for (int col …) worker[row][col].join(); } Wait for them to finish Start them

Art of Multiprocessor Programming13 Matrix Multiplication void multiply() { Worker[][] worker = new Worker[n][n]; for (int row …) for (int col …) worker[row][col] = new Worker(row,col); for (int row …) for (int col …) worker[row][col].start(); for (int row …) for (int col …) worker[row][col].join(); } Wait for them to finish Start them What’s wrong with this picture?
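
Before answering: for reference, a self-contained, runnable rendering of the sketch above. It is a minimal sketch only; the slides elide the loop bounds and the declarations of a, b, c, and n, so the static fields, the tiny size, and the main() below are assumptions made for illustration.

import java.util.Random;

public class MatrixMultiply {
    static int n = 4;                        // tiny size, for illustration only
    static double[][] a = new double[n][n];
    static double[][] b = new double[n][n];
    static double[][] c = new double[n][n];

    static class Worker extends Thread {
        int row, col;
        Worker(int row, int col) { this.row = row; this.col = col; }
        public void run() {
            double dotProduct = 0.0;
            for (int i = 0; i < n; i++)
                dotProduct += a[row][i] * b[i][col];
            c[row][col] = dotProduct;        // each worker owns one entry of c
        }
    }

    static void multiply() throws InterruptedException {
        Worker[][] worker = new Worker[n][n];
        for (int row = 0; row < n; row++)    // create n x n threads
            for (int col = 0; col < n; col++)
                worker[row][col] = new Worker(row, col);
        for (int row = 0; row < n; row++)    // start them
            for (int col = 0; col < n; col++)
                worker[row][col].start();
        for (int row = 0; row < n; row++)    // wait for them to finish
            for (int col = 0; col < n; col++)
                worker[row][col].join();
    }

    public static void main(String[] args) throws InterruptedException {
        Random r = new Random(0);
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++) { a[i][j] = r.nextDouble(); b[i][j] = r.nextDouble(); }
        multiply();
        System.out.println(c[0][0]);         // spot-check one entry
    }
}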

Art of Multiprocessor Programming14 Thread Overhead Threads Require resources –Memory for stacks –Setup, teardown Scheduler overhead Worse for short-lived threads

Art of Multiprocessor Programming15 Thread Pools More sensible to keep a pool of long-lived threads Each thread is assigned short-lived tasks –Runs the task –Rejoins the pool –Waits for the next assignment

Art of Multiprocessor Programming16 Thread Pool = Abstraction Insulate programmer from platform –Big machine, big pool –And vice-versa Portable code –Runs well on any platform –No need to mix algorithm/platform concerns

Art of Multiprocessor Programming17 ExecutorService Interface In java.util.concurrent –Task = Runnable object If no result value expected Calls run() method –Task = Callable<T> object If result value of type T expected Calls T call() method

Art of Multiprocessor Programming18 Future Callable<T> task = …; … Future<T> future = executor.submit(task); … T value = future.get();

Art of Multiprocessor Programming19 Future Callable<T> task = …; … Future<T> future = executor.submit(task); … T value = future.get(); Submitting a Callable<T> task returns a Future<T> object

Art of Multiprocessor Programming20 Future Callable<T> task = …; … Future<T> future = executor.submit(task); … T value = future.get(); The Future's get() method blocks until the value is available

Art of Multiprocessor Programming21 Future Runnable task = …; … Future<?> future = executor.submit(task); … future.get();

Art of Multiprocessor Programming22 Future Runnable task = …; … Future<?> future = executor.submit(task); … future.get(); Submitting a Runnable task returns a Future<?> object

Art of Multiprocessor Programming23 Future Runnable task = …; … Future<?> future = executor.submit(task); … future.get(); The Future's get() method blocks until the computation is complete
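
Putting the two patterns together: a minimal, self-contained sketch (the pool size, the task bodies, and the Integer result type are illustrative choices, not fixed by the slides).

import java.util.concurrent.*;

public class FutureDemo {
    public static void main(String[] args) throws Exception {
        ExecutorService executor = Executors.newFixedThreadPool(2);

        Callable<Integer> task = () -> 6 * 7;               // task with a result
        Future<Integer> future = executor.submit(task);
        int value = future.get();                           // blocks until the value is available
        System.out.println(value);

        Runnable chore = () -> System.out.println("done");  // task with no result
        Future<?> f = executor.submit(chore);
        f.get();                                            // blocks until the computation completes

        executor.shutdown();
    }
}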

Art of Multiprocessor Programming24 Matrix Addition

Art of Multiprocessor Programming25 Matrix Addition 4 parallel additions

Art of Multiprocessor Programming26 Matrix Addition Task class AddTask implements Runnable { Matrix a, b; // add these! public void run() { if (a.dim == 1) { c[0][0] = a[0][0] + b[0][0]; // base case } else { (partition a, b into half-size matrices a_ij and b_ij) Future<?> f00 = exec.submit(add(a00,b00)); … Future<?> f11 = exec.submit(add(a11,b11)); f00.get(); …; f11.get(); … }}

Art of Multiprocessor Programming27 Matrix Addition Task class AddTask implements Runnable { Matrix a, b; // add these! public void run() { if (a.dim == 1) { c[0][0] = a[0][0] + b[0][0]; // base case } else { (partition a, b into half-size matrices a_ij and b_ij) Future<?> f00 = exec.submit(add(a00,b00)); … Future<?> f11 = exec.submit(add(a11,b11)); f00.get(); …; f11.get(); … }} Base case: add directly

Art of Multiprocessor Programming28 Matrix Addition Task class AddTask implements Runnable { Matrix a, b; // add these! public void run() { if (a.dim == 1) { c[0][0] = a[0][0] + b[0][0]; // base case } else { (partition a, b into half-size matrices a_ij and b_ij) Future<?> f00 = exec.submit(add(a00,b00)); … Future<?> f11 = exec.submit(add(a11,b11)); f00.get(); …; f11.get(); … }} Constant-time operation

Art of Multiprocessor Programming29 Matrix Addition Task class AddTask implements Runnable { Matrix a, b; // add these! public void run() { if (a.dim == 1) { c[0][0] = a[0][0] + b[0][0]; // base case } else { (partition a, b into half-size matrices a_ij and b_ij) Future<?> f00 = exec.submit(add(a00,b00)); … Future<?> f11 = exec.submit(add(a11,b11)); f00.get(); …; f11.get(); … }} Submit 4 tasks

Art of Multiprocessor Programming30 Matrix Addition Task class AddTask implements Runnable { Matrix a, b; // add these! public void run() { if (a.dim == 1) { c[0][0] = a[0][0] + b[0][0]; // base case } else { (partition a, b into half-size matrices a_ij and b_ij) Future<?> f00 = exec.submit(add(a00,b00)); … Future<?> f11 = exec.submit(add(a11,b11)); f00.get(); …; f11.get(); … }} Let them finish
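
A runnable rendering of this task, as a sketch only: it assumes square matrices whose dimension is a power of two, and it partitions by offsets into shared arrays rather than building Matrix objects; the offset fields and the main() are invented here for illustration.

import java.util.concurrent.*;

public class MatrixAdd {
    static ExecutorService exec = Executors.newCachedThreadPool();

    // Adds the dim x dim blocks of a and b rooted at (row, col) into c.
    static class AddTask implements Runnable {
        double[][] a, b, c;
        int row, col, dim;
        AddTask(double[][] a, double[][] b, double[][] c, int row, int col, int dim) {
            this.a = a; this.b = b; this.c = c;
            this.row = row; this.col = col; this.dim = dim;
        }
        public void run() {
            if (dim == 1) {
                c[row][col] = a[row][col] + b[row][col];    // base case: add directly
            } else {
                int h = dim / 2;                            // partition into half-size blocks
                Future<?> f00 = exec.submit(new AddTask(a, b, c, row,     col,     h));
                Future<?> f01 = exec.submit(new AddTask(a, b, c, row,     col + h, h));
                Future<?> f10 = exec.submit(new AddTask(a, b, c, row + h, col,     h));
                Future<?> f11 = exec.submit(new AddTask(a, b, c, row + h, col + h, h));
                try {                                       // let them finish
                    f00.get(); f01.get(); f10.get(); f11.get();
                } catch (Exception e) {
                    throw new RuntimeException(e);
                }
            }
        }
    }

    public static void main(String[] args) {
        int n = 4;
        double[][] a = new double[n][n], b = new double[n][n], c = new double[n][n];
        a[1][2] = 3; b[1][2] = 4;
        new AddTask(a, b, c, 0, 0, n).run();
        System.out.println(c[1][2]);                        // 7.0
        exec.shutdown();
    }
}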

Art of Multiprocessor Programming31 Dependencies Matrix example is not typical Tasks are independent –Don’t need results of one task … –To complete another Often tasks are not independent

Art of Multiprocessor Programming32 Fibonacci F(n) = 1 if n = 0 or 1; F(n) = F(n-1) + F(n-2) otherwise Note –potential parallelism –dependencies

Art of Multiprocessor Programming33 Disclaimer This Fibonacci implementation is –Egregiously inefficient So don’t deploy it! –But illustrates our point How to deal with dependencies Exercise: –Make this implementation efficient!

Art of Multiprocessor Programming34 class FibTask implements Callable<Integer> { static ExecutorService exec = Executors.newCachedThreadPool(); int arg; public FibTask(int n) { arg = n; } public Integer call() throws Exception { if (arg > 1) { Future<Integer> left = exec.submit(new FibTask(arg-1)); Future<Integer> right = exec.submit(new FibTask(arg-2)); return left.get() + right.get(); } else { return 1; }}} Multithreaded Fibonacci

Art of Multiprocessor Programming35 class FibTask implements Callable<Integer> { static ExecutorService exec = Executors.newCachedThreadPool(); int arg; public FibTask(int n) { arg = n; } public Integer call() throws Exception { if (arg > 1) { Future<Integer> left = exec.submit(new FibTask(arg-1)); Future<Integer> right = exec.submit(new FibTask(arg-2)); return left.get() + right.get(); } else { return 1; }}} Multithreaded Fibonacci Parallel calls

Art of Multiprocessor Programming36 class FibTask implements Callable<Integer> { static ExecutorService exec = Executors.newCachedThreadPool(); int arg; public FibTask(int n) { arg = n; } public Integer call() throws Exception { if (arg > 1) { Future<Integer> left = exec.submit(new FibTask(arg-1)); Future<Integer> right = exec.submit(new FibTask(arg-2)); return left.get() + right.get(); } else { return 1; }}} Multithreaded Fibonacci Pick up & combine results
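
Usage follows the same submit/get pattern; a sketch (and mind the disclaimer above):

Future<Integer> result = FibTask.exec.submit(new FibTask(10));
System.out.println(result.get());   // 89 under this convention, since F(0) = F(1) = 1
FibTask.exec.shutdown();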

Art of Multiprocessor Programming37 Dynamic Behavior Multithreaded program is –A directed acyclic graph (DAG) –That unfolds dynamically Each node is –A single unit of work

Art of Multiprocessor Programming38 Fib DAG [figure: the DAG for fib(4); fib(4) submits fib(3) and fib(2), fib(3) submits fib(2) and fib(1), and get edges carry the results back]

Art of Multiprocessor Programming39 Arrows Reflect Dependencies [same figure: the submit and get arrows are the dependencies]

Art of Multiprocessor Programming40 How Parallel is That? Define work: –Total time on one processor Define critical-path length: –Longest dependency path –Can’t beat that!

Art of Multiprocessor Programming41 Fib Work [figure: the fib(4) DAG with every node counted]

Art of Multiprocessor Programming42 Fib Work work is the total number of nodes in the DAG

Art of Multiprocessor Programming43 Fib Critical Path [figure: the fib(4) DAG with the longest dependency path highlighted]

Art of Multiprocessor Programming44 Fib Critical Path critical-path length is the number of nodes on that longest path

Art of Multiprocessor Programming45 Notation Watch T_P = time on P processors T_1 = work (time on 1 processor) T_∞ = critical-path length (time on ∞ processors)

Art of Multiprocessor Programming46 Simple Bounds T_P ≥ T_1/P –In one step, can't do more than P work T_P ≥ T_∞ –Can't beat infinite resources

Art of Multiprocessor Programming47 More Notation Watch Speedup on P processors –Ratio T_1/T_P –How much faster with P processors Linear speedup –T_1/T_P = Θ(P) Max speedup (average parallelism) –T_1/T_∞
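
A quick worked example (the numbers are illustrative, not from the slides): if T_1 = 10⁶ and T_∞ = 10³, the average parallelism is T_1/T_∞ = 10³. With P = 10 processors, the bound T_P ≥ T_1/P = 10⁵ still permits near-linear speedup; but no number of processors can bring T_P below T_∞ = 10³, so speedup can never exceed 10³.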

Art of Multiprocessor Programming48 Matrix Addition

Art of Multiprocessor Programming49 Matrix Addition 4 parallel additions

Art of Multiprocessor Programming50 Addition Let A_P(n) be the running time –for an n x n matrix –on P processors For example –A_1(n) is work –A_∞(n) is critical-path length

Art of Multiprocessor Programming51 Addition Work is A_1(n) = 4 A_1(n/2) + Θ(1) (4 spawned additions, plus partition, synch, etc.)

Art of Multiprocessor Programming52 Addition Work is A_1(n) = 4 A_1(n/2) + Θ(1) = Θ(n²) Same as double-loop summation

Art of Multiprocessor Programming53 Addition Critical-path length is A_∞(n) = A_∞(n/2) + Θ(1) (the spawned additions run in parallel, plus partition, synch, etc.)

Art of Multiprocessor Programming54 Addition Critical-path length is A_∞(n) = A_∞(n/2) + Θ(1) = Θ(log n)
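
For reference, both recurrences unroll directly; this is a standard divide-and-conquer calculation:

$$A_1(n) = \sum_{i=0}^{\log n} 4^i\,\Theta(1) = \Theta(4^{\log n}) = \Theta(n^2), \qquad A_\infty(n) = \sum_{i=0}^{\log n} \Theta(1) = \Theta(\log n)$$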

Art of Multiprocessor Programming55 Matrix Multiplication Redux

Art of Multiprocessor Programming56 Matrix Multiplication Redux

Art of Multiprocessor Programming57 First Phase … 8 multiplications

Art of Multiprocessor Programming58 Second Phase … 4 additions

Art of Multiprocessor Programming59 Multiplication Work is M_1(n) = 8 M_1(n/2) + A_1(n) (8 parallel multiplications, plus the final addition)

Art of Multiprocessor Programming60 Multiplication Work is M_1(n) = 8 M_1(n/2) + Θ(n²) = Θ(n³) Same as serial triple-nested loop

Art of Multiprocessor Programming61 Multiplication Critical-path length is M_∞(n) = M_∞(n/2) + A_∞(n) (half-size parallel multiplications, plus the final addition)

Art of Multiprocessor Programming62 Multiplication Critical-path length is M_∞(n) = M_∞(n/2) + A_∞(n) = M_∞(n/2) + Θ(log n) = Θ(log² n)
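
The same unrolling applies here: the work recurrence has $8^{\log n} = n^3$ leaves, which dominate the $\Theta(n^2)$ addition work per level, and the critical-path recurrence sums one $\Theta(\log \cdot)$ term per level:

$$M_1(n) = \Theta(8^{\log n}) = \Theta(n^3), \qquad M_\infty(n) = \sum_{i=0}^{\log n} \Theta\big(\log(n/2^i)\big) = \Theta(\log^2 n)$$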

Art of Multiprocessor Programming63 Parallelism M_1(n)/M_∞(n) = Θ(n³/log² n) To multiply two 1000 x 1000 matrices –10⁹/10² = 10⁷ Much more than the number of processors on any real machine

Art of Multiprocessor Programming64 Work Distribution zzz…

Art of Multiprocessor Programming65 Work Dealing Yes!

Art of Multiprocessor Programming66 The Problem with Work Dealing If the dealer is the one overloaded, it must also do the work of dealing. D'oh!

Art of Multiprocessor Programming67 Work Stealing No work… zzz Yes!

Art of Multiprocessor Programming68 Lock-Free Work Stealing Each thread has a pool of ready work Remove work without synchronizing If you run out of work, steal someone else’s Choose victim at random
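
A sketch of the worker loop this implies: pushBottom, popBottom, and popTop are the DEQueue operations developed on the slides that follow, while the class name WorkStealingWorker and its fields are invented here for illustration.

import java.util.Random;

// One worker in a work-stealing pool: drain your own DEQueue from the
// bottom; when it runs dry, steal from a random victim's top.
class WorkStealingWorker extends Thread {
    final BDEQueue[] queues;            // queues[i] belongs to worker i
    final int me;
    final Random random = new Random();

    WorkStealingWorker(BDEQueue[] queues, int me) {
        this.queues = queues;
        this.me = me;
    }

    public void run() {
        while (true) {
            Runnable task = queues[me].popBottom();   // local work: owner-only, cheap
            while (task == null) {
                Thread.yield();                       // be polite while scrounging
                int victim = random.nextInt(queues.length);
                if (victim != me)
                    task = queues[victim].popTop();   // steal from a random victim
            }
            task.run();
        }
    }
}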

Art of Multiprocessor Programming69 Local Work Pools Each work pool is a Double-Ended Queue

Art of Multiprocessor Programming70 Work DEQueue(1) [figure: one thread's pool of tasks; pushBottom and popBottom operate at the bottom end] (1) Double-Ended Queue

Art of Multiprocessor Programming71 Obtain Work Call popBottom to obtain work, then run the task until it blocks or terminates

Art of Multiprocessor Programming72 New Work When a node unblocks or a new node is spawned, pushBottom adds it to the DEQueue

Art of Multiprocessor Programming73 Whatcha Gonna Do When the Well Runs Empty?

Art of Multiprocessor Programming74 Steal Work from Others Pick a random thread's DEQueue

Art of Multiprocessor Programming75 Steal this Task! popTop

Art of Multiprocessor Programming76 Task DEQueue Methods –pushBottom –popBottom –popTop pushBottom and popBottom are called only by the queue's owner, so they never happen concurrently with each other

Art of Multiprocessor Programming77 Task DEQueue Methods –pushBottom –popBottom –popTop pushBottom and popBottom are the most common: make them fast (minimize use of CAS)

Art of Multiprocessor Programming78 Ideal Wait-Free Linearizable Constant time

Art of Multiprocessor Programming79 Compromise Method popTop may fail if –a concurrent popTop succeeds, or –a concurrent popBottom takes the last task Blame the victim!

Art of Multiprocessor Programming80-87 Dreaded ABA Problem [figure sequence: a thief reads top and prepares a CAS; meanwhile the owner pops every task and pushes new ones, so top returns to the very index the thief read; the thief's CAS then succeeds ("Uh-Oh … Yes!"), claiming a slot whose task is no longer the one it observed]

Art of Multiprocessor Programming88 Fix for Dreaded ABA [figure: top now carries a stamp that is incremented on every change, so a returning top index no longer matches; bottom is unchanged]

Art of Multiprocessor Programming89 Bounded DEQueue public class BDEQueue { AtomicStampedReference<Integer> top; volatile int bottom; Runnable[] tasks; … }

Art of Multiprocessor Programming90 Bounded DEQueue public class BDEQueue { AtomicStampedReference<Integer> top; volatile int bottom; Runnable[] tasks; … } Index & stamp (updated together, atomically)

Art of Multiprocessor Programming91 Bounded DEQueue public class BDEQueue { AtomicStampedReference<Integer> top; volatile int bottom; Runnable[] tasks; … } Index of the bottom task: no need to synchronize, but we do need a memory barrier (hence volatile)

Art of Multiprocessor Programming92 Bounded DEQueue public class BDEQueue { AtomicStampedReference<Integer> top; volatile int bottom; Runnable[] tasks; … } Array holding tasks

Art of Multiprocessor Programming93 pushBottom() public class BDEQueue { … void pushBottom(Runnable r){ tasks[bottom] = r; bottom++; } … }

Art of Multiprocessor Programming94 pushBottom() public class BDEQueue { … void pushBottom(Runnable r){ tasks[bottom] = r; bottom++; } … } Bottom is the index to store the new task in the array

Art of Multiprocessor Programming95 pushBottom() public class BDEQueue { … void pushBottom(Runnable r){ tasks[bottom] = r; bottom++; } … } Adjust the bottom index

Art of Multiprocessor Programming96 Steal Work public Runnable popTop() { int[] stamp = new int[1]; int oldTop = top.get(stamp), newTop = oldTop + 1; int oldStamp = stamp[0], newStamp = oldStamp + 1; if (bottom <= oldTop) return null; Runnable r = tasks[oldTop]; if (top.CAS(oldTop, newTop, oldStamp, newStamp)) return r; return null; }

Art of Multiprocessor Programming97 Steal Work public Runnable popTop() { int[] stamp = new int[1]; int oldTop = top.get(stamp), newTop = oldTop + 1; int oldStamp = stamp[0], newStamp = oldStamp + 1; if (bottom <= oldTop) return null; Runnable r = tasks[oldTop]; if (top.CAS(oldTop, newTop, oldStamp, newStamp)) return r; return null; } Read top (value & stamp)

Art of Multiprocessor Programming98 Steal Work public Runnable popTop() { int[] stamp = new int[1]; int oldTop = top.get(stamp), newTop = oldTop + 1; int oldStamp = stamp[0], newStamp = oldStamp + 1; if (bottom <= oldTop) return null; Runnable r = tasks[oldTop]; if (top.CAS(oldTop, newTop, oldStamp, newStamp)) return r; return null; } Compute new value & stamp

Art of Multiprocessor Programming99 Steal Work public Runnable popTop() { int[] stamp = new int[1]; int oldTop = top.get(stamp), newTop = oldTop + 1; int oldStamp = stamp[0], newStamp = oldStamp + 1; if (bottom <= oldTop) return null; Runnable r = tasks[oldTop]; if (top.CAS(oldTop, newTop, oldStamp, newStamp)) return r; return null; } Quit if queue is empty

Art of Multiprocessor Programming100 Steal Work public Runnable popTop() { int[] stamp = new int[1]; int oldTop = top.get(stamp), newTop = oldTop + 1; int oldStamp = stamp[0], newStamp = oldStamp + 1; if (bottom <= oldTop) return null; Runnable r = tasks[oldTop]; if (top.CAS(oldTop, newTop, oldStamp, newStamp)) return r; return null; } Try to steal the task via CAS on top

Art of Multiprocessor Programming101 Steal Work public Runnable popTop() { int[] stamp = new int[1]; int oldTop = top.get(stamp), newTop = oldTop + 1; int oldStamp = stamp[0], newStamp = oldStamp + 1; if (bottom <= oldTop) return null; Runnable r = tasks[oldTop]; if (top.CAS(oldTop, newTop, oldStamp, newStamp)) return r; return null; } Give up if conflict occurs

Art of Multiprocessor Programming102 Runnable popBottom() { if (bottom == 0) return null; bottom--; Runnable r = tasks[bottom]; int[] stamp = new int[1]; int oldTop = top.get(stamp), newTop = 0; int oldStamp = stamp[0], newStamp = oldStamp + 1; if (bottom > oldTop) return r; if (bottom == oldTop){ bottom = 0; if (top.CAS(oldTop, newTop, oldStamp, newStamp)) return r; } top.set(newTop,newStamp); return null; } Take Work

Art of Multiprocessor Programming103 Runnable popBottom() { if (bottom == 0) return null; bottom--; Runnable r = tasks[bottom]; int[] stamp = new int[1]; int oldTop = top.get(stamp), newTop = 0; int oldStamp = stamp[0], newStamp = oldStamp + 1; if (bottom > oldTop) return r; if (bottom == oldTop){ bottom = 0; if (top.CAS(oldTop, newTop, oldStamp, newStamp)) return r; } top.set(newTop,newStamp); return null; } Take Work Make sure queue is non-empty

Art of Multiprocessor Programming104 Runnable popBottom() { if (bottom == 0) return null; bottom--; Runnable r = tasks[bottom]; int[] stamp = new int[1]; int oldTop = top.get(stamp), newTop = 0; int oldStamp = stamp[0], newStamp = oldStamp + 1; if (bottom > oldTop) return r; if (bottom == oldTop){ bottom = 0; if (top.CAS(oldTop, newTop, oldStamp, newStamp)) return r; } top.set(newTop,newStamp); return null; } Take Work Prepare to grab bottom task

Art of Multiprocessor Programming105 Runnable popBottom() { if (bottom == 0) return null; bottom--; Runnable r = tasks[bottom]; int[] stamp = new int[1]; int oldTop = top.get(stamp), newTop = 0; int oldStamp = stamp[0], newStamp = oldStamp + 1; if (bottom > oldTop) return r; if (bottom == oldTop){ bottom = 0; if (top.CAS(oldTop, newTop, oldStamp, newStamp)) return r; } top.set(newTop,newStamp); return null; } Take Work Read top, & prepare new values

Art of Multiprocessor Programming106 Runnable popBottom() { if (bottom == 0) return null; bottom--; Runnable r = tasks[bottom]; int[] stamp = new int[1]; int oldTop = top.get(stamp), newTop = 0; int oldStamp = stamp[0], newStamp = oldStamp + 1; if (bottom > oldTop) return r; if (bottom == oldTop){ bottom = 0; if (top.CAS(oldTop, newTop, oldStamp, newStamp)) return r; } top.set(newTop,newStamp); return null; } Take Work If top & bottom are 1 or more apart, no conflict

Art of Multiprocessor Programming107 Runnable popBottom() { if (bottom == 0) return null; bottom--; Runnable r = tasks[bottom]; int[] stamp = new int[1]; int oldTop = top.get(stamp), newTop = 0; int oldStamp = stamp[0], newStamp = oldStamp + 1; if (bottom > oldTop) return r; if (bottom == oldTop){ bottom = 0; if (top.CAS(oldTop, newTop, oldStamp, newStamp)) return r; } top.set(newTop,newStamp); return null; } Take Work At most one item left

Art of Multiprocessor Programming108 Runnable popBottom() { if (bottom == 0) return null; bottom--; Runnable r = tasks[bottom]; int[] stamp = new int[1]; int oldTop = top.get(stamp), newTop = 0; int oldStamp = stamp[0], newStamp = oldStamp + 1; if (bottom > oldTop) return r; if (bottom == oldTop){ bottom = 0; if (top.CAS(oldTop, newTop, oldStamp, newStamp)) return r; } top.set(newTop,newStamp); return null; } Take Work Try to steal the last task. Always reset bottom because the DEQueue will be empty even if unsuccessful (why?)

Art of Multiprocessor Programming109 Runnable popBottom() { if (bottom == 0) return null; bottom--; Runnable r = tasks[bottom]; int[] stamp = new int[1]; int oldTop = top.get(stamp), newTop = 0; int oldStamp = stamp[0], newStamp = oldStamp + 1; if (bottom > oldTop) return r; if (bottom == oldTop){ bottom = 0; if (top.CAS(oldTop, newTop, oldStamp, newStamp)) return r; } top.set(newTop,newStamp); return null; } Take Work If I win the CAS, the last task is mine

Art of Multiprocessor Programming110 Runnable popBottom() { if (bottom == 0) return null; bottom--; Runnable r = tasks[bottom]; int[] stamp = new int[1]; int oldTop = top.get(stamp), newTop = 0; int oldStamp = stamp[0], newStamp = oldStamp + 1; if (bottom > oldTop) return r; if (bottom == oldTop){ bottom = 0; if (top.CAS(oldTop, newTop, oldStamp, newStamp)) return r; } top.set(newTop,newStamp); return null; } Take Work If I lose the CAS, a thief must have won…

Art of Multiprocessor Programming111 Runnable popBottom() { if (bottom == 0) return null; bottom--; Runnable r = tasks[bottom]; int[] stamp = new int[1]; int oldTop = top.get(stamp), newTop = 0; int oldStamp = stamp[0], newStamp = oldStamp + 1; if (bottom > oldTop) return r; if (bottom == oldTop){ bottom = 0; if (top.CAS(oldTop, newTop, oldStamp, newStamp)) return r; } top.set(newTop,newStamp); return null; } Take Work Failed to get the last task: must still reset top
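
For reference, the pieces above assembled into one compilable class: a sketch of the bounded DEQueue, with the slides' CAS shorthand spelled out as AtomicStampedReference.compareAndSet. The capacity constructor is an assumption (the slides elide it), and because compareAndSet compares references, the code keeps the exact Integer object it read from top.

import java.util.concurrent.atomic.AtomicStampedReference;

public class BDEQueue {
    // Index of the top task plus an ABA-avoiding stamp, updated atomically.
    private final AtomicStampedReference<Integer> top =
        new AtomicStampedReference<>(0, 0);
    private volatile int bottom = 0;    // owner-only; volatile for visibility
    private final Runnable[] tasks;

    public BDEQueue(int capacity) {
        tasks = new Runnable[capacity];
    }

    // Owner only: store the new task at the bottom and advance the index.
    public void pushBottom(Runnable r) {
        tasks[bottom] = r;
        bottom++;
    }

    // Thieves: try to take the top task; may return null under contention.
    public Runnable popTop() {
        int[] stamp = new int[1];
        // Keep the exact Integer we read: compareAndSet compares references.
        Integer oldTop = top.get(stamp);
        Integer newTop = oldTop + 1;
        int oldStamp = stamp[0], newStamp = oldStamp + 1;
        if (bottom <= oldTop)           // queue looks empty
            return null;
        Runnable r = tasks[oldTop];
        if (top.compareAndSet(oldTop, newTop, oldStamp, newStamp))
            return r;
        return null;                    // lost a race: blame the victim
    }

    // Owner only: take a task from the bottom.
    public Runnable popBottom() {
        if (bottom == 0)
            return null;
        bottom--;
        Runnable r = tasks[bottom];
        int[] stamp = new int[1];
        Integer oldTop = top.get(stamp);
        Integer newTop = 0;
        int oldStamp = stamp[0], newStamp = oldStamp + 1;
        if (bottom > oldTop)            // top and bottom at least one apart: no conflict
            return r;
        if (bottom == oldTop) {         // at most one task left: race the thieves
            bottom = 0;                 // queue will be empty either way
            if (top.compareAndSet(oldTop, newTop, oldStamp, newStamp))
                return r;
        }
        top.set(newTop, newStamp);      // lost the last task; must still reset top
        return null;
    }
}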

Art of Multiprocessor Programming112 Variations Stealing is expensive –Pay for a CAS –Only one task taken What if we randomly balanced loads instead?

Art of Multiprocessor Programming113 Work Stealing & Balancing Clean separation between app & scheduling layer Works well when number of processors fluctuates. Works on “black-box” operating systems