Load Balancing and Multithreaded Programming

Nir Shavit Multiprocessor Synchronization Spring 2003
Presentation transcript:

Load Balancing and Multithreaded Programming Nir Shavit Multiprocessor Synchronization Spring 2003

How to Write Parallel Apps? Multithreaded programming: a programming model and a programming language (Cilk), with well-developed theory and successful practice. 30-Nov-18 M. Herlihy & N. Shavit (c) 2003

Why We Care: Interesting in its own right. The scheduler is an ideal application for lock-free data structures.

Multithreaded Fibonacci (Cilk code; Java code in notes):

    int fib(int n) {
      if (n < 2) {
        return n;
      } else {
        int x = spawn fib(n-1);  // parallel method call
        int y = spawn fib(n-2);  // parallel method call
        sync();                  // wait for children to complete
        return x + y;            // safe to use children's values
      }
    }
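The Java version from the slide notes is not part of this transcript. As a stand-in, here is a minimal sketch of the same spawn/sync pattern using the standard fork/join framework (`java.util.concurrent.RecursiveTask`). The class name `Fib` and the choice to fork only one child and compute the other in place are assumptions of this sketch, not the authors' code.

```java
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

// Hypothetical fork/join analogue of the Cilk fib above.
final class Fib extends RecursiveTask<Integer> {
    private final int n;

    Fib(int n) { this.n = n; }

    @Override
    protected Integer compute() {
        if (n < 2) {
            return n;
        }
        Fib x = new Fib(n - 1);
        Fib y = new Fib(n - 2);
        x.fork();               // like "spawn": another worker may steal this child
        int yv = y.compute();   // run the second child in the current worker
        int xv = x.join();      // like "sync": wait for the forked child
        return xv + yv;         // safe to use the children's values
    }

    // Convenience entry point that runs the task on the common pool.
    static int fib(int n) {
        return ForkJoinPool.commonPool().invoke(new Fib(n));
    }
}
```

As with spawn and sync, `fork` only makes work available. The pool's scheduler decides where and when it actually runs.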

Note: The spawn & sync operators, like Israeli traffic signs, are purely advisory in nature. The scheduler, like the Israeli driver, has complete freedom to decide.

Dynamic Behavior: A multithreaded program is a directed acyclic graph (DAG) that unfolds dynamically. A thread is a maximal sequence of instructions without spawn, sync, or return.

Fib DAG: [figure: the DAG for fib(4), with spawn and sync edges connecting the calls fib(4), fib(3), fib(2), and fib(1)]

Arrows Reflect Dependencies: [figure: the same fib(4) DAG; each spawn and sync arrow is a dependency]

How Parallel is That? Define work: total time on one processor. Define critical-path length: longest dependency path. Can’t beat that!

Fib Work: [figure: the fib(4) DAG with its nodes numbered 1 through 17] Work is 17.

Fib Critical Path: [figure: the fib(4) DAG with the nodes on its longest dependency path numbered 1 through 8] Critical path length is 8.
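The two numbers for fib(4) can be reproduced with a short recurrence. This sketch assumes that each call with n ≥ 2 contributes three threads (up to the first spawn, up to the second spawn, and from the sync to the return) and that each base-case call is a single thread. That node-counting convention is an assumption made here to match the figures. `FibDag` is a hypothetical name.

```java
// Counts threads (DAG nodes) in the fib(n) computation under the
// three-threads-per-internal-call assumption stated above.
final class FibDag {
    // Work T_1: total number of threads executed.
    static long work(int n) {
        if (n < 2) {
            return 1;
        }
        return 3 + work(n - 1) + work(n - 2);
    }

    // Critical path length T_inf: number of threads on the longest dependency path.
    static long criticalPath(int n) {
        if (n < 2) {
            return 1;
        }
        // first thread -> fib(n-1) -> final thread
        long viaFirstChild = 2 + criticalPath(n - 1);
        // first thread -> second thread -> fib(n-2) -> final thread
        long viaSecondChild = 3 + criticalPath(n - 2);
        return Math.max(viaFirstChild, viaSecondChild);
    }

    public static void main(String[] args) {
        System.out.println(work(4));         // 17
        System.out.println(criticalPath(4)); // 8
    }
}
```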

Notation Watch: T_P = time on P processors. T_1 = work (time on 1 processor). T_∞ = critical path length (time on ∞ processors).

Simple Bounds: T_P ≥ T_1/P (in one step, can’t do more than P work). T_P ≥ T_∞ (can’t beat infinite resources).

More Notation Watch: Speedup on P processors is the ratio T_1/T_P (how much faster with P processors). Linear speedup: T_1/T_P = Θ(P). Max speedup (average parallelism): T_1/T_∞.

Remarks: Graph nodes have out-degree ≤ 2. There is a unique starting node and a unique ending node.

Matrix Multiplication: Each n-by-n matrix multiplication takes 8 multiplications and 4 additions of n/2-by-n/2 submatrices.

Addition:

    void add(Matrix C, Matrix T, int n) {
      if (n == 1) {
        C[1,1] = C[1,1] + T[1,1];
      } else {
        partition C, T into half-size submatrices;
        spawn add(C11,T11,n/2);
        spawn add(C12,T12,n/2);
        spawn add(C21,T21,n/2);
        spawn add(C22,T22,n/2);
        sync();
      }
    }

Addition: Let A_P(n) be the running time for an n x n matrix on P processors. For example, A_1(n) is the work and A_∞(n) is the critical path length.

Addition: Work is A_1(n) = 4 A_1(n/2) + Θ(1) = Θ(n^2), where 4 A_1(n/2) accounts for the 4 spawned additions and Θ(1) for partition, sync, etc. Same as double-loop summation.

Addition: Critical path length is A_∞(n) = A_∞(n/2) + Θ(1) = Θ(log n), where A_∞(n/2) accounts for the spawned additions running in parallel and Θ(1) for partition, sync, etc.
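The recursive addition can be sketched in Java with the fork/join framework. This is an illustration, not the slides' code: it assumes square `double[][]` matrices whose side is a power of two, and it represents a submatrix by a (row, col, n) window instead of physically partitioning. `MatrixAdd` is a hypothetical name.

```java
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveAction;

// Adds T into C in place, quadrant by quadrant (C = C + T).
final class MatrixAdd extends RecursiveAction {
    private final double[][] c;
    private final double[][] t;
    private final int row, col, n;   // top-left corner and side of the window

    MatrixAdd(double[][] c, double[][] t, int row, int col, int n) {
        this.c = c;
        this.t = t;
        this.row = row;
        this.col = col;
        this.n = n;
    }

    @Override
    protected void compute() {
        if (n == 1) {
            c[row][col] += t[row][col];
        } else {
            int h = n / 2;
            // Four "spawned" half-size additions; invokeAll returns once all finish ("sync").
            invokeAll(
                new MatrixAdd(c, t, row, col, h),
                new MatrixAdd(c, t, row, col + h, h),
                new MatrixAdd(c, t, row + h, col, h),
                new MatrixAdd(c, t, row + h, col + h, h));
        }
    }

    // Assumes c and t are square with the same power-of-two side.
    static void add(double[][] c, double[][] t) {
        ForkJoinPool.commonPool().invoke(new MatrixAdd(c, t, 0, 0, c.length));
    }
}
```

Recursing all the way to n == 1 mirrors the analysis. A practical version would likely switch to a plain loop below some block size.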

Multiplication:

    void mult(Matrix C, Matrix A, Matrix B, int n) {
      if (n == 1) {
        C[1,1] = A[1,1]·B[1,1];
      } else {
        allocate temporary n-by-n matrix T;
        partition A,B,C,T into half-size submatrices;
        // C gets the first term of each block product, T the second
        spawn mult(C11,A11,B11,n/2);
        spawn mult(C12,A11,B12,n/2);
        spawn mult(C21,A21,B11,n/2);
        spawn mult(C22,A21,B12,n/2);
        spawn mult(T11,A12,B21,n/2);
        spawn mult(T12,A12,B22,n/2);
        spawn mult(T21,A22,B21,n/2);
        spawn mult(T22,A22,B22,n/2);
        sync();
        spawn add(C,T,n);
        sync();
      }
    }

Multiplication: Work is M_1(n) = 8 M_1(n/2) + A_1(n) = 8 M_1(n/2) + Θ(n^2) = Θ(n^3), where 8 M_1(n/2) accounts for the 8 spawned multiplications and A_1(n) for the final addition. Same as serial triple-nested loop.

Multiplication: Critical path length is M_∞(n) = M_∞(n/2) + A_∞(n) = M_∞(n/2) + Θ(log n) = Θ(log^2 n), where M_∞(n/2) accounts for the half-size parallel multiplications and A_∞(n) for the final addition.
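The last step of the critical-path recurrence can be filled in by unrolling it. This is a worked sketch, assuming n is a power of two and a constant-time base case:

```latex
\begin{align*}
M_\infty(n) &= M_\infty(n/2) + \Theta(\log n) \\
            &= \sum_{i=0}^{\log_2 n - 1} \Theta\!\left(\log \frac{n}{2^i}\right) + \Theta(1) \\
            &= \Theta\!\left(\sum_{j=1}^{\log_2 n} j\right)
             = \Theta(\log^2 n).
\end{align*}
```

Each of the log n levels of recursion adds one addition's critical path, and those paths have lengths log n, log n - 1, ..., 1.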

Parallelism: M_1(n)/M_∞(n) = Θ(n^3/log^2 n). To multiply two 1000 x 1000 matrices: 1000^3/10^2 = 10^7. Much more than the number of processors on any real machine.

Shared-Memory Multiprocessors: Parallel applications (Java, Cilk, etc.) and a mix of other jobs all run together and come & go dynamically.

Scheduling: Ideally, a user-level scheduler maps threads to dedicated processors. In real life, the user-level scheduler maps threads to a fixed number of processes, and a kernel-level scheduler maps processes to a dynamic pool of processors.

For Example: Initially, all P processors are available for the application. A serial computation takes over one processor, leaving P-1 for us. When it waits for I/O, we get that processor back ….

Speedup: Map threads onto P processes. Cannot get P-fold speedup: what if the kernel doesn’t cooperate? Can try for P_A-fold speedup, where P_A is the time-averaged number of processors the kernel gives us.

Static Load Balancing: [figure: speedup (1 to 8) versus number of processes (1 to 32) on an 8-processor Sun Ultra Enterprise 5000; curves: ideal, mm(1024), lu(2048), barnes(16K,10), heat(4K,512,100)]

Dynamic Load Balancing: [figure: speedup (1 to 8) versus number of processes (1 to 32) on an 8-processor Sun Ultra Enterprise 5000; curves: ideal, mm(1024), lu(2048), barnes(16K,10), heat(4K,512,100), msort(32M), ray()]

Scheduling Hierarchy: The user-level scheduler tells the kernel which processes are ready. The kernel-level scheduler is synchronous (for analysis, not correctness!) and picks p_i threads to schedule at step i. Over T steps, the time-weighted average is P_A = (1/T) Σ_{i=1..T} p_i.

Greed is Good: A greedy scheduler schedules as much as it can at each time step.

Theorem: A greedy scheduler ensures actual time T ≤ T_1/P_A + T_∞(P-1)/P_A.
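To get a feel for the theorem, the bounds can be evaluated numerically. The helper below is only arithmetic on the formulas from the slides, with the work and critical path of the fib(4) example as sample inputs. `GreedyBound` and its method names are hypothetical.

```java
// Evaluates the simple lower bounds and the greedy-scheduler upper bound.
final class GreedyBound {
    // T_P >= max(T_1 / P, T_inf)
    static double lowerBound(double t1, double tInf, int p) {
        return Math.max(t1 / p, tInf);
    }

    // T <= T_1 / P_A + T_inf * (P - 1) / P_A
    static double greedyUpperBound(double t1, double tInf, int p, double pA) {
        return t1 / pA + tInf * (p - 1) / pA;
    }

    public static void main(String[] args) {
        // fib(4): T_1 = 17, T_inf = 8, on P = 2 processes with P_A = 2.
        System.out.println(lowerBound(17, 8, 2));            // 8.5
        System.out.println(greedyUpperBound(17, 8, 2, 2.0)); // 12.5
    }
}
```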

Proof Strategy: [figure: the quantity to bound, labeled "Bound this!"]

Put Tokens in Buckets: A thread scheduled and executed puts a token in the work bucket. A thread scheduled but not executed puts a token in the idle bucket.

At the end ….: The total number of tokens is the sum of p_i over all steps, split between the work and idle buckets.

At the end ….: The work bucket holds T_1 tokens.

Must Show: The idle bucket holds ≤ T_∞(P-1) tokens.

Every Move You Make …: The scheduler is greedy and at least one node is ready, so the number of idle threads in one step is at most p_i - 1 ≤ P - 1.

Every Step You Take …: Consider the longest path in the unexecuted sub-DAG at step i. At least one node in the path is ready. At a step with an idle thread, the greedy scheduler executes every ready node, so the length of the path shrinks by at least one. Initially, the path is T_∞, so there are at most T_∞ idle steps.

Counting Tokens: There are at most P-1 idle threads per step and at most T_∞ idle steps, so the idle bucket contains at most T_∞(P-1) tokens. Both buckets together contain at most T_1 + T_∞(P-1) tokens.

Recapitulating: The total number of tokens is T · P_A, which is at most T_1 + T_∞(P-1), so T ≤ T_1/P_A + T_∞(P-1)/P_A.

Turns Out: This bound is within a constant factor of optimal. Finding the actual optimal schedule is NP-complete.

Work Sharing: A process that generates new threads migrates them elsewhere, in hopes of balancing the load.

Work Stealing: If a process runs out of work, it steals work from another. If everyone is busy, there is no migration. Only the idle process incurs the synchronization cost.

Lock-Free Work Stealing: Each process has a pool of ready threads and removes a thread from it without synchronizing. If you run out of threads, steal someone else’s, choosing the victim at random.

Work DEQueue (double-ended queue): [figure: a DEQueue of threads; the owner calls pushBottom and popBottom at the bottom end]

Obtain Work: popBottom to obtain work, then run the thread until it blocks or terminates.

New Work: pushBottom when a node is unblocked or spawned.

Whatcha Gonna Do When the Well Runs Dry? [figure: an empty DEQueue] @&%$!!

Steal this Thread! [figure: a thief takes a thread from the top of a victim's DEQueue with popTop]

Thread DEQueue Methods: pushBottom, popBottom, popTop. pushBottom and popBottom never happen concurrently.

Yield: Processes may spin trying to steal while all DEQueues are empty. Each process therefore yields the processor between steal attempts, which gives victims a chance to do work.
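The steps a process takes to find work (take from its own pool, otherwise pick a random victim and steal from the top, yielding after a failed attempt) can be sketched as follows. `ConcurrentLinkedDeque` is only a stand-in here: it synchronizes at both ends, unlike the DEQueue developed in these slides, and `WorkStealingSketch` and `findWork` are hypothetical names.

```java
import java.util.List;
import java.util.Random;
import java.util.concurrent.ConcurrentLinkedDeque;

final class WorkStealingSketch {
    // Returns a task for worker `self`, or null if this attempt found none.
    static Runnable findWork(int self,
                             List<ConcurrentLinkedDeque<Runnable>> pools,
                             Random rnd) {
        Runnable task = pools.get(self).pollLast();   // popBottom on our own pool
        if (task != null || pools.size() < 2) {
            return task;
        }
        // Choose a victim uniformly at random among the other workers.
        int victim = rnd.nextInt(pools.size() - 1);
        if (victim >= self) {
            victim++;
        }
        task = pools.get(victim).pollFirst();         // popTop on the victim's pool
        if (task == null) {
            Thread.yield();   // failed steal: give victims a chance to do work
        }
        return task;
    }
}
```

A worker would call `findWork` in a loop, run whatever it returns, and push newly spawned or unblocked tasks with `addLast` (pushBottom).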

Performance Without Yield: [figure: speedup (1 to 8) versus number of processes (1 to 32); curves: ideal, mm(1024), lu(2048), barnes(16K,10), heat(4K,512,100), msort(32M), ray()]

Ideal: Wait-free, linearizable, constant time. Fortune cookie: “It is better to be young, rich and beautiful, than old, poor, and ugly!”

Compromise: Method popTop may signal abort if a concurrent popTop succeeds or a concurrent popBottom takes the last thread. Blame the victim!

Dreaded ABA Problem: [figure sequence: seven frames of a DEQueue and its top pointer] A thief reads top and is delayed. Meanwhile the DEQueue is emptied and refilled, so top returns to the same index. Uh-oh … the thief's CAS on top succeeds (CAS: Yes!) even though the thread it read has already been taken.

Dreaded ABA Fix: [figure: the top field carries a tag alongside the index; bottom is a separate field]

Code:

    public class DEQueue {
      longRMWregister top;  // tag & top: half index & half tag to avoid ABA
      int bottom;           // bottom thread index
      Thread[] deq;         // array of threads
      …
    }

Dreaded ABA Problem Fix (example: in 0x00210032 the tag is 0x0021 and the index is 0x0032):

    // extract tag field from top
    private int TAG_MASK = 0xFFFF0000;
    private int TAG_SHIFT = 16;
    private int getTag(int i) {
      // unsigned shift, so a tag with its high bit set is not sign-extended
      return ((i & TAG_MASK) >>> TAG_SHIFT);
    }
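The later slides call getIndex, setIndex, and makeTop without showing them. One way such helpers could look, assuming top is a 64-bit word with the tag in the high 32 bits and the index in the low 32 bits (the slide's example uses a 16/16 split of a 32-bit word instead), is sketched below. `TaggedTop` is a hypothetical name and this is not the authors' implementation.

```java
// Packs a (tag, index) pair into one long so both change in a single CAS.
final class TaggedTop {
    static long makeTop(int tag, int index) {
        // Mask the index so a negative int does not overwrite the tag bits.
        return ((long) tag << 32) | (index & 0xFFFFFFFFL);
    }

    static int getTag(long top) {
        return (int) (top >>> 32);
    }

    static int getIndex(long top) {
        return (int) top;   // low 32 bits
    }

    // Same tag, new index.
    static long setIndex(long top, int index) {
        return makeTop(getTag(top), index);
    }
}
```

For example, `makeTop(0x21, 0x32)` yields `0x0000002100000032L`.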

Code:

    public class DEQueue {
      …
      void pushBottom(Thread t) {
        this.deq[this.bottom] = t;
        this.bottom++;
      }
      …

Code:

    Thread popTop() throws Abort {
      long oldTop = this.top.read();
      int bottom = this.bottom;
      if (bottom <= getIndex(oldTop))  // make sure queue is non-empty
        return null;
      Thread t = this.deq[getIndex(oldTop)];
      // get old and new top values
      long newTop = setIndex(oldTop, getIndex(oldTop)+1);
      if (this.top.CAS(oldTop, newTop))  // install new top value
        return t;
      throw new Abort();
    }
    …}

Code:

    Thread popBottom() {
      if (this.bottom == 0)  // make sure queue is non-empty
        return null;
      this.bottom--;
      Thread t = this.deq[this.bottom];  // grab bottom thread
      long oldTop = this.top.read();
      if (this.bottom > getIndex(oldTop))  // if not near top, we're done
        return t;
      int oldBottom = this.bottom;  // remember the index before the reset
      // reset top & bottom, bumping the tag so a stale CAS fails
      long newTop = makeTop(getTag(oldTop)+1, 0);
      this.bottom = 0;
      if (oldBottom == getIndex(oldTop))
        if (this.top.CAS(oldTop, newTop))
          return t;
      this.top.write(newTop);
      return null;  // the last thread was stolen
    }
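Putting the three methods together, a runnable approximation of the DEQueue might look like the sketch below. It makes several assumptions that are not in the slides: `AtomicLong` stands in for `longRMWregister`, the tag and index each take 32 bits, popTop returns null instead of throwing Abort when it loses a race, and the array is bounded with no overflow check. `BoundedDEQueue` is a hypothetical name.

```java
import java.util.concurrent.atomic.AtomicLong;

// Sketch of the lock-free work-stealing DEQueue: the owner calls pushBottom/popBottom,
// thieves call popTop. Not the authors' implementation.
final class BoundedDEQueue<T> {
    private final AtomicLong top = new AtomicLong(0); // high 32 bits: tag, low 32 bits: index
    private volatile int bottom = 0;
    private final T[] deq;

    @SuppressWarnings("unchecked")
    BoundedDEQueue(int capacity) {
        deq = (T[]) new Object[capacity];
    }

    private static long makeTop(int tag, int index) {
        return ((long) tag << 32) | (index & 0xFFFFFFFFL);
    }
    private static int getTag(long top) { return (int) (top >>> 32); }
    private static int getIndex(long top) { return (int) top; }

    void pushBottom(T t) {
        deq[bottom] = t;        // no capacity check in this sketch
        bottom = bottom + 1;
    }

    // Returns null if the queue looks empty or if the CAS loses a race (an "abort").
    T popTop() {
        long oldTop = top.get();
        int index = getIndex(oldTop);
        if (bottom <= index) {
            return null;
        }
        T t = deq[index];
        long newTop = makeTop(getTag(oldTop), index + 1);
        return top.compareAndSet(oldTop, newTop) ? t : null;
    }

    T popBottom() {
        if (bottom == 0) {
            return null;
        }
        int b = bottom - 1;
        bottom = b;
        T t = deq[b];
        long oldTop = top.get();
        if (b > getIndex(oldTop)) {
            return t;                                     // not near top: no CAS needed
        }
        long newTop = makeTop(getTag(oldTop) + 1, 0);     // reset index, bump tag
        bottom = 0;
        if (b == getIndex(oldTop) && top.compareAndSet(oldTop, newTop)) {
            return t;                                     // we won the last thread
        }
        top.set(newTop);
        return null;                                      // a thief took it
    }
}
```

Used from a single thread, pushing a, b, c and then calling popTop, popBottom, popBottom returns a, c, b, after which both pop methods return null.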

Summary so Far: Multithreaded structures (work, critical path length, parallelism). Scheduling (work stealing, lock-free DEQueue).

Lock-Free Work Stealing: OK even if the number of processes exceeds the number of processors, or when the number of processors grows and shrinks over time. No need for “non-commercial” operating-system support, such as gang scheduling or process control.

Old English Proverb: “May as well be hanged for stealing a sheep as a goat.” From which we conclude: stealing was punished severely, and sheep were worth more than goats.

But Wait, There’s More! Stealing is expensive: it costs a CAS, and only one thread is taken. What if we could steal more each time? Say, up to half?

Review: Double-ended queue (DEQueue). The local thread removes/adds a thread without CAS if top and bottom are > 1 apart.

Consensus: If top and bottom are close, the local thread and the thief contend and need consensus to resolve. In a sequence of k pushes or pops, the number of CAS operations is Θ(1).

Consensus: Stealing half increases uncertainty. Consensus on half the queue? In a sequence of k pushes or pops, the number of CAS operations is Θ(k).

New Idea: We can get down to Θ(log k). How: limit uncertainty to when the queue size passes a power of 2! Keep a “half-point” counter. The thief resets the counter; the local thread changes the counter at a power-of-2 boundary.

The Big Picture: [figure: a DEQueue annotated with the steal-range (tag, top, last), the previous steal-range, and bottom] Up to 2^i threads can be stolen atomically. At least 2^i threads lie outside the steal range. Bottom is somewhere in a group of 2^(i+1).

Steal Range: stealRange holds three fields (tag, top, last). tag: defeats the ABA problem. top: index of the top-most item in the DEQueue. stealLast: last item to be stolen.

When to Steal?

    if (shouldBalance()) {
      Process victim = randomProcess();
      tryToSteal(victim);
    }

Options: steal on empty; steal probabilistically (probability decreases as queue increases); steal when queue size passes a threshold.
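The slides do not show shouldBalance. One possible probabilistic policy, chosen here purely for illustration, balances with probability 1/(size+1): an empty queue always tries to steal, and the chance falls as the queue grows. The class name `BalancePolicy` and the explicit parameters are assumptions.

```java
import java.util.Random;

final class BalancePolicy {
    // True with probability 1 / (queueSize + 1); always true for an empty queue.
    static boolean shouldBalance(int queueSize, Random rnd) {
        return rnd.nextInt(queueSize + 1) == queueSize;
    }
}
```

This covers both "steal on empty" and "steal probabilistically" with a single test. A threshold-based policy would compare queueSize against a fixed bound instead.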

Before PushBottom: [figure: stealRange fields tag, top, last with last = 3; bottom = 14]

After PushBottom: [figure: stealRange fields tag, top, last with last = 7; bottom = 15]

Update stealRange:

    boolean updateStealRange() {
      // readjust when the queue size is a power of two,
      // or when a thief has taken some threads
      if (size is a power of two || theft occurred) {
        // new range size is roughly half
        int newSize = Math.max(1, power of 2 closest to half);
        long oldRange = this.stealRange;
        int tag = getTag(oldRange);
        int top = getTop(oldRange);
        long newRange = makeStealRange(tag+1, top, top+newSize-1);
        // try to update stealRange to reflect the new size
        boolean ok = this.stealRange.CAS(oldRange, newRange);
        // if update succeeded, save a copy of the updated range, to identify future thefts
        if (ok) this.prevStealRange = newRange;
        return ok;
      }
      return true;
    }
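Two parts of updateStealRange are left as pseudocode: the "size is a power of two" test and the "power of 2 closest to half" computation. A possible reading, taking "closest to half" to mean the largest power of two not exceeding size/2 (an assumption), is sketched below with hypothetical names.

```java
final class StealRangeSize {
    // True when size is 1, 2, 4, 8, ...
    static boolean isPowerOfTwo(int size) {
        return size > 0 && (size & (size - 1)) == 0;
    }

    // Largest power of two <= size / 2, but never less than 1.
    static int newRangeSize(int size) {
        return Math.max(1, Integer.highestOneBit(size / 2));
    }
}
```

With 16 threads queued, the new range covers 8 of them, so `last` would be `top + 7`.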

pushBottom Code:

    public void pushBottom(Thread t) throws Full {  // t: thread to push
      if (this.getSize() == QUEUE_SIZE)  // are we full?
        throw new Full();
      this.deq[this.bottom] = t;  // push thread
      this.bottom = (this.bottom + 1) % QUEUE_SIZE;
      updateStealRange();  // update stealRange, if required
    }

Before PopBottom: [figure: stealRange fields tag, top, last with last = 7; bottom = 15]

After PopBottom: [figure: stealRange fields tag, top, last with last = 3; bottom = 14]

popBottom (Part One):

    public Object popBottom() throws Abort {
      if (this.getSize() == 0)  // bail if queue is empty
        return null;
      if (!updateStealRange())  // panic if unable to fix stealRange
        throw new Abort();
      // tentatively pop a thread
      if (this.bottom == 0)
        this.bottom = QUEUE_SIZE-1;
      else
        --this.bottom;
      Object t = this.deq[this.bottom];
      …

popBottom (Part Two):

    public Object popBottom() throws Abort {
      …
      long oldStealRange = this.stealRange;  // deconstruct stealRange
      int rangeTop = getTop(oldStealRange);
      int rangeBot = getLast(oldStealRange);
      if (rangeBot == EMPTY) {  // if queue is empty, start over
        this.bottom = 0;  // last thread already stolen
        return null;
      } else if (this.bottom != rangeBot)
        // tentatively-popped thread not in stealRange: no need to synchronize
        return t;
      else {
        …

popBottom (Part Three)
    public Object popBottom() throws Abort {
      …
      } else {
        // Try to make stealRange empty
        int rangeTag = getTag(oldStealRange);
        if (this.stealRange.CAS(oldStealRange, makeStealRange(rangeTag+1, 0, EMPTY))) {
          this.bottom = 0;
          return t; // thread not stolen yet
        } else {
          this.bottom = 0;
          return null; // thread stolen
        }
      }
    }

popBottom (Part Three): Queue has at most one thread (the final else branch, reached when this.bottom == rangeBot).

popBottom (Part Three): Try to zero out the steal range (the CAS that installs an empty stealRange with an incremented tag).

popBottom (Part Three): If the CAS succeeded, the thread is ours (and the deque is now empty).

popBottom (Part Three): If the CAS failed, our last thread was stolen.
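
The slides use getTag, getTop, getLast and makeStealRange without showing them. A minimal sketch of how such helpers might pack a tag and two indices into one long, so that a single CAS covers all three, is given below. The bit layout, the EMPTY value, the class name and the tryEmpty helper are assumptions made for illustration, and java.util.concurrent.atomic.AtomicLong stands in for the slides' CAS call.

```java
import java.util.concurrent.atomic.AtomicLong;

final class StealRangeSketch {
    // Hypothetical layout: [ tag: 32 bits | top: 16 bits | last: 16 bits ].
    static final int EMPTY = 0xFFFF; // assumed sentinel meaning "range is empty"

    static long makeStealRange(int tag, int top, int last) {
        return ((long) tag << 32) | ((long) (top & 0xFFFF) << 16) | (long) (last & 0xFFFF);
    }
    static int getTag(long range)  { return (int) (range >>> 32); }
    static int getTop(long range)  { return (int) ((range >>> 16) & 0xFFFF); }
    static int getLast(long range) { return (int) (range & 0xFFFF); }

    // Owner's final step in popBottom: try to replace the snapshot with an empty range.
    static boolean tryEmpty(AtomicLong stealRange, long snapshot) {
        long empty = makeStealRange(getTag(snapshot) + 1, 0, EMPTY);
        return stealRange.compareAndSet(snapshot, empty);
    }

    public static void main(String[] args) {
        AtomicLong stealRange = new AtomicLong(makeStealRange(7, 3, 3));
        long snapshot = stealRange.get();

        // Uncontended case: nobody touched the range, so the owner's CAS wins.
        System.out.println(tryEmpty(stealRange, snapshot)); // true

        // Contended case: a thief updates the range between the owner's read and CAS.
        stealRange.set(makeStealRange(9, 5, 5));
        long stale = stealRange.get();
        stealRange.compareAndSet(stale, makeStealRange(getTag(stale) + 1, 6, EMPTY)); // thief
        System.out.println(tryEmpty(stealRange, stale)); // false
    }
}
```

Bumping the tag on every update is what lets a stale snapshot be detected even if top and last happen to return to their old values.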

stealTop (Part One)
    public int stealTop(EDEQueue victim) {
      long oldStealRange = victim.stealRange;
      int oldLast = getLast(oldStealRange);
      int oldTop = getTop(oldStealRange);
      int oldTag = getTag(oldStealRange);
      int deqBot = victim.bottom;
      int rangeLen = oldStealRange.getSize(); // pseudocode: length of the packed steal range
      int diff = 2*rangeLen - this.deq.length;
      if (diff <= 1) return 0;
      else {
        int numToSteal = diff/2;
        for (int i = 0; i < numToSteal; i++)
          this.deq[(this.bottom+i) % QUEUE_SIZE] = victim.deq[(oldTop+i) % QUEUE_SIZE];
      }
      …

stealTop (Part One): The victim parameter is the victim DEQueue.

stealTop (Part One): The return value is the number of threads actually stolen.

stealTop (Part One): Deconstruct the victim's steal range (the getLast, getTop and getTag calls).

stealTop (Part One): Compute the length of the victim's stealRange (the victim's DEQueue length is at least twice as much).

stealTop (Part One): diff is a lower bound on the difference in lengths between victim and thief.

stealTop (Part One): If we can't equalize by stealing (diff <= 1), don't steal! On this slide the thief's length is passed explicitly: the signature is stealTop(EDEQueue victim, int thiefLen) and diff = 2*rangeLen - thiefLen.

stealTop (Part One): Try to steal half the guaranteed difference: copy the threads to be stolen into the thief's deque.
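
The arithmetic in Part One can be tried in isolation. The sketch below follows the thiefLen variant of the rule (diff = 2*rangeLen - thiefLen, steal diff/2 unless diff <= 1) and copies entries between two circular arrays. The class name, the helper names, the tiny QUEUE_SIZE and the sample contents are assumptions for illustration only; nothing here is synchronized.

```java
final class StealHalfSketch {
    static final int QUEUE_SIZE = 8; // small capacity chosen for illustration

    // Number of threads to steal under the slides' rule.
    static int numToSteal(int rangeLen, int thiefLen) {
        int diff = 2 * rangeLen - thiefLen;
        return (diff <= 1) ? 0 : diff / 2;
    }

    // Copy n entries from the victim (starting at top) into the thief
    // (starting at bottom), wrapping around both circular arrays.
    static void copyWrapped(Object[] victim, int top, Object[] thief, int bottom, int n) {
        for (int i = 0; i < n; i++)
            thief[(bottom + i) % QUEUE_SIZE] = victim[(top + i) % QUEUE_SIZE];
    }

    public static void main(String[] args) {
        Object[] victim = {"a", "b", "c", "d", "e", "f", "g", "h"};
        Object[] thief = new Object[QUEUE_SIZE];
        int n = numToSteal(4, 2);            // diff = 6, so n = 3
        copyWrapped(victim, 6, thief, 7, n); // victim slots 6,7,0 -> thief slots 7,0,1
        System.out.println(n + " " + thief[7] + thief[0] + thief[1]); // 3 gha
    }
}
```

The parentheses around (bottom + i) and (top + i) matter: without them, % binds tighter than + and the index no longer wraps.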

stealTop (Part Two)
    public int stealTop(EDEQueue victim) {
      …
      int newRangeLen = max(1, power of 2 closest to half the remaining threads); // pseudocode
      int newTop = (oldTop+numToSteal) % QUEUE_SIZE;
      int newLast = (newTop + newRangeLen - 1) % QUEUE_SIZE;
      long newRange = makeStealRange(oldTag+1, newTop, newLast);
      if (victim.stealRange.CAS(oldStealRange, newRange)) {
        this.bottom = (this.bottom + numToSteal) % QUEUE_SIZE;
        this.updateStealRange();
        return numToSteal;
      }
      return 0;
    }

stealTop (Part Two): The new length of the victim's stealRange is about half the remaining number of threads.

stealTop (Part Two): Try to update the victim's stealRange (with a CAS) to reflect the theft and the new range length.

stealTop (Part Two): If the CAS succeeded, update the thief's bottom and stealRange to include the new threads, and return the number of stolen threads.
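
The newRangeLen line in Part Two is pseudocode. One possible reading is sketched below, taking "closest" to mean the largest power of two not exceeding half the remaining threads, with a floor of 1. That rounding rule, the class name and the sample numbers are assumptions; the slides do not say how ties or rounding are handled.

```java
final class RangeLenSketch {
    static final int QUEUE_SIZE = 8; // small capacity chosen for illustration

    // Assumed reading: round half the remaining count DOWN to a power of two.
    // Integer.highestOneBit(0) is 0, so the max() supplies the floor of 1.
    static int newRangeLen(int remaining) {
        return Math.max(1, Integer.highestOneBit(remaining / 2));
    }

    public static void main(String[] args) {
        int oldTop = 6, numToSteal = 3, remaining = 4;
        int len = newRangeLen(remaining);                 // 2
        int newTop = (oldTop + numToSteal) % QUEUE_SIZE;  // 1 (wraps past slot 7)
        int newLast = (newTop + len - 1) % QUEUE_SIZE;    // 2
        System.out.println(len + " " + newTop + " " + newLast); // 2 1 2
    }
}
```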

Details: Works even if someone steals from the thief. The thief may fail to update its own stealRange, but it will still update bottom, making the theft happen.

Big Picture: This code steals as much as it can. More sensible to split the difference? May depend on the stealing strategy.

Vulnerability: If the queue size hovers around a power of 2, performance will be lousy. Extra credit: can we avoid this problem?
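
The slide does not spell out the mechanism. One plausible reading is that the steal-range length is tied to a power of two, so a queue whose size oscillates across a power-of-two boundary forces the range to be recomputed (and re-installed with a CAS) on nearly every operation. The sketch below counts such changes for two size traces; the rounding rule, class name and traces are assumptions for illustration.

```java
final class HoverSketch {
    // Assumed rule: range length is half the size, rounded down to a power of two.
    static int rangeLen(int size) {
        return Math.max(1, Integer.highestOneBit(size / 2));
    }

    // Count how often the range length changes along a trace of queue sizes.
    static int countChanges(int[] sizes) {
        int changes = 0;
        for (int i = 1; i < sizes.length; i++)
            if (rangeLen(sizes[i]) != rangeLen(sizes[i - 1])) changes++;
        return changes;
    }

    public static void main(String[] args) {
        int[] hovering = {15, 16, 15, 16, 15, 16}; // straddles the boundary at 16
        int[] steady   = {11, 12, 11, 12, 11, 12}; // stays within one band
        System.out.println(countChanges(hovering)); // 5
        System.out.println(countChanges(steady));   // 0
    }
}
```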

Conclusions: "Boutique" lock-free structures. Not general purpose. Customized for work-stealing. Non-trivial correctness issues.

Alternative: Gang Scheduling. [Figure: timeline of processors 1 to 4 over time.] Bad example: a 4-process computation sharing a 4-processor machine with a 1-process computation. Good example: data-parallel programs with large working sets.

Alternative: Process Control. [Figure: timeline of processors 1 to 4 over time, marking where a process is killed and where a new process is created.] Each computation creates and kills processes dynamically so that its process count equals the number of processors assigned to it.

