Load Balancing and Multithreaded Programming
Nir Shavit
Multiprocessor Synchronization, Spring 2003
How to Write Parallel Apps?
Multithreaded programming:
- Programming model
- Programming language (Cilk)
- Well-developed theory
- Successful practice
30-Nov-18 M. Herlihy & N. Shavit (c) 2003
Why We Care
- Interesting in its own right
- The scheduler is an ideal application for lock-free data structures
Multithreaded Fibonacci (Cilk code; Java code in notes)
    int fib(int n) {
      if (n < 2) {
        return n;
      } else {
        int x = spawn fib(n-1);   // parallel method call
        int y = spawn fib(n-2);
        sync();                   // wait for children to complete
        return x + y;             // safe to use children's values
      }
    }
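For readers who want something runnable, here is one possible Java rendering of the same program on the standard fork/join framework in java.util.concurrent. It is a sketch, not the Java code from the notes: RecursiveTask.fork() plays the role of spawn, join() plays the role of sync, and ForkJoinPool is itself a work-stealing scheduler. The class name FibTask is hypothetical.

```java
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

/** Sketch: the fib example on Java's fork/join framework (hypothetical class name). */
public class FibTask extends RecursiveTask<Integer> {
  private final int n;

  public FibTask(int n) { this.n = n; }

  @Override
  protected Integer compute() {
    if (n < 2) {
      return n;
    }
    FibTask left = new FibTask(n - 1);
    left.fork();                            // like "spawn": may run in parallel
    int y = new FibTask(n - 2).compute();   // run the second call in this thread
    int x = left.join();                    // like "sync": wait for the child
    return x + y;                           // safe to use children's values
  }

  public static int fib(int n) {
    return ForkJoinPool.commonPool().invoke(new FibTask(n));
  }

  public static void main(String[] args) {
    System.out.println(fib(20)); // 6765
  }
}
```

Running the second recursive call directly with compute(), rather than forking it too, is a common fork/join idiom: it avoids creating a task that the current thread would immediately wait for.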
Note
- The spawn and sync operators, like Israeli traffic signs, are purely advisory in nature.
- The scheduler, like the Israeli driver, has complete freedom to decide.
Dynamic Behavior
- A multithreaded program is a directed acyclic graph (DAG) that unfolds dynamically.
- A thread is a maximal sequence of instructions without spawn, sync, or return.
Fib DAG
[Figure: the DAG of fib(4), with spawn and sync edges connecting the fib(4), fib(3), fib(2), and fib(1) calls. Arrows reflect dependencies.]
How Parallel Is That?
- Define work: total time on one processor.
- Define critical-path length: the longest dependency path. Can't beat that!
Fib Work
[Figure: the nodes of the fib(4) DAG numbered 1 through 17; the work is 17.]
Fib Critical Path
[Figure: the longest path through the fib(4) DAG, with its nodes numbered 1 through 8; the critical-path length is 8.]
Notation Watch
- $T_P$ = time on $P$ processors
- $T_1$ = work (time on 1 processor)
- $T_\infty$ = critical-path length (time on $\infty$ processors)
Simple Bounds
- $T_P \ge T_1/P$: in one step, we can't do more than $P$ work.
- $T_P \ge T_\infty$: we can't beat infinite resources.
More Notation Watch
- Speedup on $P$ processors: the ratio $T_1/T_P$, i.e., how much faster we run with $P$ processors.
- Linear speedup: $T_1/T_P = \Theta(P)$.
- Max speedup (average parallelism): $T_1/T_\infty$.
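The fib(4) numbers can be checked with a short calculation of work and critical-path length, and hence of the average parallelism $T_1/T_\infty$. The sketch assumes a thread model inferred from the fib(4) figures: a call with n ≥ 2 is three unit-time threads (before the first spawn, between the spawns, after the sync), and a call with n < 2 is a single unit-time thread. Under that assumption it reproduces work 17 and critical-path length 8. The class name FibDag is hypothetical.

```java
/** Sketch: work T1 and critical-path length Tinf of the fib(n) DAG, under an assumed thread model. */
public class FibDag {
  // Assumed model: a call with n >= 2 is three unit-time threads
  // (t1 before the first spawn, t2 between the spawns, t3 after the sync);
  // a call with n < 2 is one unit-time thread.

  /** T1: total number of threads in the DAG of fib(n). */
  public static long work(int n) {
    return n < 2 ? 1 : 3 + work(n - 1) + work(n - 2);
  }

  /** Tinf: number of threads on the longest dependency path in the DAG of fib(n). */
  public static long span(int n) {
    if (n < 2) return 1;
    long viaFirstChild = 1 + span(n - 1) + 1;       // t1 -> fib(n-1) -> t3
    long viaSecondChild = 1 + 1 + span(n - 2) + 1;  // t1 -> t2 -> fib(n-2) -> t3
    return Math.max(viaFirstChild, viaSecondChild);
  }

  public static void main(String[] args) {
    System.out.println("T1 = " + work(4) + ", Tinf = " + span(4)); // T1 = 17, Tinf = 8
    System.out.println("average parallelism = " + (double) work(4) / span(4)); // 2.125
  }
}
```

For fib(4) the average parallelism is only 17/8, so small inputs cannot keep many processors busy; the ratio grows with n because work grows much faster than the critical path.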
Remarks
- Graph nodes have out-degree ≤ 2.
- There is a unique starting node and a unique ending node.
Matrix Multiplication
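Each n-by-n matrix multiplication takes 8 multiplications and 4 additions of n/2-by-n/2 submatrices:
$\begin{pmatrix} C_{11} & C_{12} \\ C_{21} & C_{22} \end{pmatrix} = \begin{pmatrix} A_{11}B_{11} + A_{12}B_{21} & A_{11}B_{12} + A_{12}B_{22} \\ A_{21}B_{11} + A_{22}B_{21} & A_{21}B_{12} + A_{22}B_{22} \end{pmatrix}$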
Addition
    void add(Matrix C, Matrix T, int n) {   // C = C + T
      if (n == 1) {
        C[1,1] = C[1,1] + T[1,1];
      } else {
        partition C, T into half-size submatrices;
        spawn add(C11, T11, n/2);
        spawn add(C12, T12, n/2);
        spawn add(C21, T21, n/2);
        spawn add(C22, T22, n/2);
        sync();
      }
    }
Addition
Let $A_P(n)$ be the running time for an n × n matrix on $P$ processors. For example:
- $A_1(n)$ is the work
- $A_\infty(n)$ is the critical-path length
Addition Work
$A_1(n) = 4A_1(n/2) + \Theta(1) = \Theta(n^2)$
- $4A_1(n/2)$: the 4 spawned additions
- $\Theta(1)$: partition, sync, etc.
Same as double-loop summation.
Addition Critical Path
$A_\infty(n) = A_\infty(n/2) + \Theta(1) = \Theta(\log n)$
- $A_\infty(n/2)$: the spawned additions run in parallel
- $\Theta(1)$: partition, sync, etc.
Multiplication
    void mult(Matrix C, Matrix A, Matrix B, int n) {   // C = A·B
      if (n == 1) {
        C[1,1] = A[1,1]·B[1,1];
      } else {
        allocate temporary n·n matrix T;
        partition A, B, C, T into half-size submatrices;
        spawn mult(C11, A11, B11, n/2);
        spawn mult(C12, A11, B12, n/2);
        spawn mult(C21, A21, B11, n/2);
        spawn mult(C22, A21, B12, n/2);
        spawn mult(T11, A12, B21, n/2);
        spawn mult(T12, A12, B22, n/2);
        spawn mult(T21, A22, B21, n/2);
        spawn mult(T22, A22, B22, n/2);
        sync();
        spawn add(C, T, n);   // C = C + T
        sync();
      }
    }
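The mult/add recursion can be tried out directly on Java's fork/join framework. This is a sketch under stated assumptions: n is a power of two, matrices are double[][], invokeAll stands in for "spawn ... sync()", and the Block view class, MatMul, Add, and Mult are hypothetical names. It uses the eight block products C11 = A11·B11, C12 = A11·B12, C21 = A21·B11, C22 = A21·B12 and T11 = A12·B21, T12 = A12·B22, T21 = A22·B21, T22 = A22·B22, then adds T into C.

```java
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveAction;

/** Sketch: divide-and-conquer C = A*B with fork/join; n must be a power of two. */
public class MatMul {
  /** A view of the square sub-matrix of m whose top-left corner is (row, col). */
  static final class Block {
    final double[][] m; final int row, col;
    Block(double[][] m, int row, int col) { this.m = m; this.row = row; this.col = col; }
    /** Quadrant (i, j), with i, j in {0, 1}, of a block of size 2h. */
    Block quad(int i, int j, int h) { return new Block(m, row + i * h, col + j * h); }
    double get(int i, int j) { return m[row + i][col + j]; }
    void set(int i, int j, double v) { m[row + i][col + j] = v; }
  }

  /** C = C + T on n-by-n blocks: four parallel half-size additions. */
  static final class Add extends RecursiveAction {
    final Block c, t; final int n;
    Add(Block c, Block t, int n) { this.c = c; this.t = t; this.n = n; }
    @Override protected void compute() {
      if (n == 1) { c.set(0, 0, c.get(0, 0) + t.get(0, 0)); return; }
      int h = n / 2;
      invokeAll(new Add(c.quad(0, 0, h), t.quad(0, 0, h), h), new Add(c.quad(0, 1, h), t.quad(0, 1, h), h),
                new Add(c.quad(1, 0, h), t.quad(1, 0, h), h), new Add(c.quad(1, 1, h), t.quad(1, 1, h), h));
    }
  }

  /** C = A * B on n-by-n blocks: eight parallel half-size products, then one addition. */
  static final class Mult extends RecursiveAction {
    final Block c, a, b; final int n;
    Mult(Block c, Block a, Block b, int n) { this.c = c; this.a = a; this.b = b; this.n = n; }
    @Override protected void compute() {
      if (n == 1) { c.set(0, 0, a.get(0, 0) * b.get(0, 0)); return; }
      int h = n / 2;
      Block t = new Block(new double[n][n], 0, 0);                          // temporary n-by-n matrix T
      invokeAll(
          new Mult(c.quad(0, 0, h), a.quad(0, 0, h), b.quad(0, 0, h), h),  // C11 = A11*B11
          new Mult(c.quad(0, 1, h), a.quad(0, 0, h), b.quad(0, 1, h), h),  // C12 = A11*B12
          new Mult(c.quad(1, 0, h), a.quad(1, 0, h), b.quad(0, 0, h), h),  // C21 = A21*B11
          new Mult(c.quad(1, 1, h), a.quad(1, 0, h), b.quad(0, 1, h), h),  // C22 = A21*B12
          new Mult(t.quad(0, 0, h), a.quad(0, 1, h), b.quad(1, 0, h), h),  // T11 = A12*B21
          new Mult(t.quad(0, 1, h), a.quad(0, 1, h), b.quad(1, 1, h), h),  // T12 = A12*B22
          new Mult(t.quad(1, 0, h), a.quad(1, 1, h), b.quad(1, 0, h), h),  // T21 = A22*B21
          new Mult(t.quad(1, 1, h), a.quad(1, 1, h), b.quad(1, 1, h), h)); // T22 = A22*B22
      new Add(c, t, n).invoke();                                           // C = C + T
    }
  }

  /** Multiplies two n-by-n matrices; n must be a power of two. */
  public static double[][] multiply(double[][] a, double[][] b) {
    int n = a.length;
    double[][] c = new double[n][n];
    ForkJoinPool.commonPool().invoke(new Mult(new Block(c, 0, 0), new Block(a, 0, 0), new Block(b, 0, 0), n));
    return c;
  }
}
```

The eight products write to disjoint quadrants of C and T, so they need no synchronization among themselves; only the final addition must wait for all of them.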
Multiplication Work
$M_1(n) = 8M_1(n/2) + A_1(n) = 8M_1(n/2) + \Theta(n^2) = \Theta(n^3)$
- $8M_1(n/2)$: the 8 spawned multiplications
- $A_1(n)$: the final addition
Same as the serial triple-nested loop.
Multiplication Critical Path
$M_\infty(n) = M_\infty(n/2) + A_\infty(n) = M_\infty(n/2) + \Theta(\log n) = \Theta(\log^2 n)$
- $M_\infty(n/2)$: the half-size multiplications run in parallel
- $A_\infty(n)$: the final addition
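One way to see the last step of the critical-path recurrence is to unroll it: level $j$ of the recursion contributes the critical path of an addition on $(n/2^j) \times (n/2^j)$ submatrices, and there are about $\log n$ levels (logarithms base 2).

```latex
M_\infty(n) = \sum_{j=0}^{\log n} \Theta\!\left(\log \frac{n}{2^j}\right)
            = \Theta\!\left(\sum_{j=0}^{\log n} (\log n - j)\right)
            = \Theta\!\left(\frac{\log n \, (\log n + 1)}{2}\right)
            = \Theta(\log^2 n)
```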
Parallelism
$M_1(n)/M_\infty(n) = \Theta(n^3/\log^2 n)$
To multiply two 1000 × 1000 matrices: $1000^3/10^2 = 10^7$ (since $\log_2 1000 \approx 10$).
That is much more than the number of processors on any real machine.
Shared-Memory Multiprocessors
- Parallel applications (Java, Cilk, etc.)
- A mix of other jobs
- All run together and come and go dynamically
Scheduling
- Ideally, a user-level scheduler maps threads to dedicated processors.
- In real life, the user-level scheduler maps threads to a fixed number of processes, and a kernel-level scheduler maps processes to a dynamic pool of processors.
For Example
- Initially, all P processors are available for the application.
- A serial computation takes over one processor, leaving P-1 for us.
- When it waits for I/O, we get that processor back...
Speedup
- Map threads onto P processes.
- We cannot get a P-fold speedup: what if the kernel doesn't cooperate?
- We can try for a $P_A$-fold speedup, where $P_A$ is the time-averaged number of processors the kernel gives us.
Static Load Balancing
[Figure: speedup (1 to 8) vs. number of processes (1 to 32) on an 8-processor Sun Ultra Enterprise 5000, for mm(1024), lu(2048), barnes(16K,10), and heat(4K,512,100), compared with ideal speedup.]
Dynamic Load Balancing
[Figure: speedup (1 to 8) vs. number of processes (1 to 32) on an 8-processor Sun Ultra Enterprise 5000, for mm(1024), lu(2048), barnes(16K,10), heat(4K,512,100), msort(32M), and ray(), compared with ideal speedup.]
Scheduling Hierarchy
- The user-level scheduler tells the kernel which processes are ready.
- The kernel-level scheduler is synchronous (for analysis, not correctness!) and picks $p_i$ threads to schedule at step $i$.
- Over an execution of $T$ steps, the time-weighted average is $P_A = \frac{1}{T}\sum_{i=1}^{T} p_i$.
Greed Is Good
A greedy scheduler schedules as much as it can at each time step.
Theorem
A greedy scheduler ensures actual time $T \le \frac{T_1}{P_A} + \frac{T_\infty(P-1)}{P_A}$.
Proof Strategy
Since $P_A = \frac{1}{T}\sum_{i=1}^{T} p_i$, the running time is $T = \frac{1}{P_A}\sum_{i=1}^{T} p_i$. Bound this sum!
Put Tokens in Buckets
At each step, each scheduled thread puts one token in a bucket:
- Work bucket: thread scheduled and executed
- Idle bucket: thread scheduled but not executed
At the End...
- Total number of tokens = $\sum_i p_i$ = work tokens + idle tokens.
- The work bucket holds exactly $T_1$ tokens.
Must Show
The idle bucket holds at most $T_\infty(P-1)$ tokens.
Every Move You Make...
- The scheduler is greedy, and at least one node is ready.
- So the number of idle threads in one step is at most $p_i - 1 \le P - 1$.
Every Step You Take...
- Consider the longest path in the unexecuted sub-DAG at step i.
- At least one node in that path is ready.
- In any step with an idle thread, the greedy scheduler executes every ready node, so the length of the path shrinks by at least one.
- Initially, the path length is $T_\infty$.
- So there are at most $T_\infty$ steps with idle threads.
Counting Tokens
- At most P-1 idle threads per step.
- At most $T_\infty$ steps with idle threads.
- So the idle bucket contains at most $T_\infty(P-1)$ tokens.
- Both buckets together contain at most $T_1 + T_\infty(P-1)$ tokens.
Recapitulating
$T = \frac{1}{P_A}\sum_{i=1}^{T} p_i \le \frac{T_1 + T_\infty(P-1)}{P_A} = \frac{T_1}{P_A} + \frac{T_\infty(P-1)}{P_A}$
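The theorem can be spot-checked by simulation. This sketch builds the fib(4) DAG (using the same assumed three-threads-per-call model as the fib work and critical-path figures), runs a greedy schedule that executes min(p, #ready) nodes per step, and compares the step count with the bound. Assumptions: processors are dedicated, so $P_A = P$, and every node takes one step. The class and method names are hypothetical.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;

/** Sketch: greedy scheduling of the fib DAG on p dedicated processors (so P_A = p). */
public class GreedySim {
  private final List<List<Integer>> succ = new ArrayList<>(); // successor lists

  private int newNode() { succ.add(new ArrayList<>()); return succ.size() - 1; }

  /** Adds the DAG of fib(n); returns {first node, last node}. */
  public int[] build(int n) {
    if (n < 2) { int v = newNode(); return new int[] {v, v}; }
    int t1 = newNode(), t2 = newNode(), t3 = newNode(); // before spawns, between spawns, after sync
    int[] left = build(n - 1), right = build(n - 2);
    succ.get(t1).add(left[0]);  succ.get(t1).add(t2);     // spawn fib(n-1), continue
    succ.get(t2).add(right[0]); succ.get(t2).add(t3);     // spawn fib(n-2), continue
    succ.get(left[1]).add(t3);  succ.get(right[1]).add(t3); // sync waits for both children
    return new int[] {t1, t3};
  }

  /** Greedy: each step runs min(p, #ready) ready nodes. Returns the number of steps. */
  public int greedySteps(int p) {
    int n = succ.size(), done = 0, steps = 0;
    int[] indeg = new int[n];
    for (List<Integer> s : succ) for (int w : s) indeg[w]++;
    ArrayDeque<Integer> ready = new ArrayDeque<>();
    for (int v = 0; v < n; v++) if (indeg[v] == 0) ready.add(v);
    while (done < n) {
      List<Integer> batch = new ArrayList<>();
      while (batch.size() < p && !ready.isEmpty()) batch.add(ready.poll());
      for (int v : batch) {
        done++;
        for (int w : succ.get(v)) if (--indeg[w] == 0) ready.add(w); // ready next step
      }
      steps++;
    }
    return steps;
  }

  public static void main(String[] args) {
    GreedySim g = new GreedySim();
    g.build(4);
    int work = g.greedySteps(1), span = g.greedySteps(Integer.MAX_VALUE); // 17 and 8
    for (int p = 1; p <= 8; p++) {
      double bound = (work + span * (p - 1)) / (double) p; // T1/P + Tinf(P-1)/P
      System.out.println("p=" + p + " steps=" + g.greedySteps(p) + " bound=" + bound);
    }
  }
}
```

With one processor the greedy schedule takes exactly $T_1$ steps, and with unboundedly many it takes exactly $T_\infty$ steps, which is why the sketch reads the work and critical-path length off those two runs.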
Turns Out
- This bound is within a constant factor of optimal.
- Finding the actual optimal schedule is NP-complete.
Work Sharing
A process that generates new threads migrates them elsewhere, in hopes of balancing the load.
Work Stealing
- If a process runs out of work, it steals work from another.
- If everyone is busy, there is no migration.
- The idle process (the thief) incurs the synchronization cost.
Lock-Free Work Stealing
- Each process has a pool of ready threads.
- It removes a thread from its own pool without synchronizing.
- If you run out of threads, steal someone else's.
- Choose the victim at random.
Work DEQueue (Double-Ended Queue)
Holds threads; the owner adds and removes them at the bottom with pushBottom and popBottom.
Obtain Work
Obtain work with popBottom, then run the thread until it blocks or terminates.
New Work
Call pushBottom when a node is unblocked or spawned.
Whatcha Gonna Do When the Well Runs Dry?
Our own DEQueue is empty: @&%$!!
Steal This Thread!
Take a thread from the top of another process's DEQueue with popTop.
Thread DEQueue Methods
- pushBottom
- popBottom
- popTop
pushBottom and popBottom are called only by the owner, so they never happen concurrently.
Yield
- Processes spin trying to steal, but all DEQueues are empty.
- Each process yields the processor between steal attempts.
- This gives victims a chance to do work.
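The worker loop described so far (pop your own pool, otherwise pick a random victim, and yield between failed steal attempts) can be sketched in a few lines of Java. Assumptions: java.util.concurrent.ConcurrentLinkedDeque stands in for the lock-free DEQueue (it is not the same algorithm), work items are Runnable, tasks do not spawn further tasks, and the StealLoop class and run method are hypothetical.

```java
import java.util.concurrent.ConcurrentLinkedDeque;
import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.atomic.AtomicInteger;

/** Sketch: each worker pops its own pool; when empty, it tries a random victim, then yields. */
public class StealLoop {
  /** Runs `tasks` unit tasks, all initially in worker 0's pool; returns how many ran. */
  public static int run(int workers, int tasks) throws InterruptedException {
    // Stand-in pools, not the slides' DEQueue.
    @SuppressWarnings("unchecked")
    ConcurrentLinkedDeque<Runnable>[] pools = new ConcurrentLinkedDeque[workers];
    for (int i = 0; i < workers; i++) pools[i] = new ConcurrentLinkedDeque<>();
    AtomicInteger remaining = new AtomicInteger(tasks);
    AtomicInteger executed = new AtomicInteger();
    for (int k = 0; k < tasks; k++) pools[0].addLast(() -> executed.incrementAndGet());

    Thread[] threads = new Thread[workers];
    for (int i = 0; i < workers; i++) {
      final int me = i;
      threads[i] = new Thread(() -> {
        while (remaining.get() > 0) {
          Runnable r = pools[me].pollLast();           // own pool: take from the bottom
          if (r == null && workers > 1) {
            int victim = ThreadLocalRandom.current().nextInt(workers - 1);
            if (victim >= me) victim++;                // uniform over the other workers
            r = pools[victim].pollFirst();             // steal from the victim's top
          }
          if (r == null) {
            Thread.yield();                            // give victims a chance to run
            continue;
          }
          r.run();
          remaining.decrementAndGet();
        }
      });
      threads[i].start();
    }
    for (Thread t : threads) t.join();
    return executed.get();
  }
}
```

Thread.yield() is only a hint to the JVM and operating system, which matches the advisory spirit of the slide: it makes it more likely, not certain, that a descheduled victim gets to run.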
Performance Without Yield
[Figure: speedup (1 to 8) vs. number of processes (1 to 32) for mm(1024), lu(2048), barnes(16K,10), heat(4K,512,100), msort(32M), and ray(), compared with ideal speedup.]
Ideal
An ideal DEQueue would be:
- Wait-free
- Linearizable
- Constant time
Fortune cookie: "It is better to be young, rich and beautiful, than old, poor, and ugly!"
Compromise
Method popTop may signal abort if:
- a concurrent popTop succeeds, or
- a concurrent popBottom takes the last thread.
Blame the victim!
Dreaded ABA Problem
[Figure sequence: successive states of the DEQueue's top pointer; in the final frame a delayed CAS on top succeeds even though the DEQueue has changed in between ("Uh-Oh... CAS Yes!").]
Dreaded ABA Fix
[Figure: top is extended with a tag field, shown alongside bottom.]
Code
    public class DEQueue {
      longRMWregister top;  // tag & top: half index, half tag, to avoid ABA
      int bottom;           // bottom thread index
      Thread[] deq;         // array of threads
      …
    }
Dreaded ABA Problem Fix
    // extract tag field from top
    private static final int TAG_MASK = 0xFFFF0000;
    private static final int TAG_SHIFT = 16;
    private int getTag(int i) {
      return (i & TAG_MASK) >>> TAG_SHIFT;   // unsigned shift: no sign extension
    }
For example, in 0x00210032 the tag is 0x0021 and the index is 0x0032.
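The tag layout and the ABA scenario can be replayed with java.util.concurrent.atomic.AtomicInteger standing in for the top register. In this sketch the helper names makeTop and getIndex and the class TaggedTop are hypothetical; the history in staleCasSucceeds is a simplified stand-in (two plain writes) for a real interleaving of pops, resets, and pushes that brings the index back to the same value.

```java
import java.util.concurrent.atomic.AtomicInteger;

/** Sketch: 32-bit top word, tag in the high 16 bits and index in the low 16 bits. */
public class TaggedTop {
  static final int TAG_MASK = 0xFFFF0000;
  static final int INDEX_MASK = 0x0000FFFF;
  static final int TAG_SHIFT = 16;

  public static int getTag(int top)   { return (top & TAG_MASK) >>> TAG_SHIFT; } // unsigned shift
  public static int getIndex(int top) { return top & INDEX_MASK; }
  public static int makeTop(int tag, int index) {
    return ((tag << TAG_SHIFT) & TAG_MASK) | (index & INDEX_MASK);
  }

  /** Replays an A-B-A history of the index; returns whether the stale CAS succeeds. */
  public static boolean staleCasSucceeds() {
    AtomicInteger top = new AtomicInteger(makeTop(0, 5));
    int stale = top.get();          // a thief reads top (index 5), then is delayed
    top.set(makeTop(1, 0));         // owner empties the deque: index reset, tag bumped
    top.set(makeTop(1, 5));         // after more pushes and steals, the index is 5 again
    return top.compareAndSet(stale, makeTop(0, 6)); // the thief's delayed CAS
  }

  public static void main(String[] args) {
    System.out.println(Integer.toHexString(getTag(0x00210032)));   // 21
    System.out.println(Integer.toHexString(getIndex(0x00210032))); // 32
    System.out.println(staleCasSucceeds());                        // false
  }
}
```

Without the tag, the index alone would read 5 both before and after, and the thief's CAS would succeed on a stale view. The unsigned shift matters once the tag's top bit is set: with a signed shift, getTag(0x80010002) would come out negative.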
Code
    public class DEQueue {
      …
      void pushBottom(Thread t) {
        this.deq[this.bottom] = t;
        this.bottom++;
      }
Code
    Thread popTop() throws Abort {
      int oldTop = this.top.read();              // get old top value
      int bottom = this.bottom;
      if (bottom <= getIndex(oldTop))            // make sure queue non-empty
        return null;
      Thread t = this.deq[getIndex(oldTop)];
      int newTop = setIndex(oldTop, getIndex(oldTop) + 1);   // new top value, same tag
      if (this.top.CAS(oldTop, newTop))          // install new top value
        return t;
      throw new Abort();
    }
    …
    }
Code
    Thread popBottom() {
      if (this.bottom == 0)                      // make sure queue non-empty
        return null;
      this.bottom--;                             // grab bottom thread
      int localBot = this.bottom;
      Thread t = this.deq[localBot];
      int oldTop = this.top.read();
      if (localBot > getIndex(oldTop))           // if not near top, we're done
        return t;
      int newTop = makeTop(getTag(oldTop) + 1, 0);   // reset top & bottom; bump tag to defeat ABA
      this.bottom = 0;
      if (localBot == getIndex(oldTop))          // last thread: race any thief for it
        if (this.top.CAS(oldTop, newTop))
          return t;
      this.top.write(newTop);                    // a thief got the last thread
      return null;
    }
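To see popTop and popBottom working end to end, here is a minimal runnable Java sketch of the same DEQueue algorithm. Assumptions not in the slides: java.util.concurrent.atomic.AtomicInteger stands in for the RMW register, bottom is a volatile int so thieves see newly pushed threads, work items are Runnable rather than Thread, the array is bounded with no wraparound or overflow handling, and the class name WorkDeque is hypothetical.

```java
import java.util.concurrent.atomic.AtomicInteger;

/** Sketch of the lock-free work DEQueue: the owner works at the bottom, thieves at the top. */
public class WorkDeque {
  /** Signalled by popTop when it loses a race with another thread. */
  public static class Abort extends Exception {}

  private static final int TAG_SHIFT = 16;
  private static final int INDEX_MASK = 0xFFFF;

  // Packed word: tag in the high 16 bits, top index in the low 16 bits.
  private final AtomicInteger top = new AtomicInteger(0);
  // Next free slot. Written only by the owner; volatile so thieves see new threads.
  private volatile int bottom = 0;
  private final Runnable[] deq;   // work items (the slides' "threads")

  public WorkDeque(int capacity) { deq = new Runnable[capacity]; }

  static int getIndex(int t) { return t & INDEX_MASK; }
  static int getTag(int t)   { return t >>> TAG_SHIFT; }
  static int makeTop(int tag, int index) { return (tag << TAG_SHIFT) | (index & INDEX_MASK); }

  /** Owner only. No overflow check: pushing past capacity throws. */
  public void pushBottom(Runnable r) {
    deq[bottom] = r;
    bottom = bottom + 1;
  }

  /** Owner only. */
  public Runnable popBottom() {
    if (bottom == 0) return null;               // make sure queue non-empty
    int localBot = bottom - 1;
    bottom = localBot;                          // grab bottom thread
    Runnable r = deq[localBot];
    int oldTop = top.get();
    if (localBot > getIndex(oldTop)) return r;  // not near top: no CAS needed
    bottom = 0;                                 // reset top & bottom
    int newTop = makeTop(getTag(oldTop) + 1, 0);
    if (localBot == getIndex(oldTop) && top.compareAndSet(oldTop, newTop)) return r;
    top.set(newTop);                            // a thief took the last thread
    return null;
  }

  /** Thieves. Returns null if empty; throws Abort if it loses a race. */
  public Runnable popTop() throws Abort {
    int oldTop = top.get();
    int localBot = bottom;
    if (localBot <= getIndex(oldTop)) return null;
    Runnable r = deq[getIndex(oldTop)];
    int newTop = makeTop(getTag(oldTop), getIndex(oldTop) + 1);
    if (top.compareAndSet(oldTop, newTop)) return r;
    throw new Abort();
  }
}
```

Only the last-thread case in popBottom and every popTop use a CAS; the common popBottom path touches nothing but the owner's own bottom field and a read of top.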
Summary So Far
- Multithreaded structures: work, critical-path length, parallelism
- Scheduling: work stealing, lock-free DEQueue
Lock-Free Work Stealing
- Works even if the number of processes exceeds the number of processors, or when the number of processors grows and shrinks over time.
- No need for "non-commercial" operating-system support, such as gang scheduling or process control.
Old English Proverb
"May as well be hanged for stealing a sheep as a goat"
From which we conclude:
- Stealing was punished severely
- Sheep were worth more than goats
But Wait, There's More!
- Stealing is expensive: it costs a CAS, and only one thread is taken.
- What if we could steal more each time? Say, up to half?
Review
Double-ended queue (DEQueue): the local thread can remove or add a thread without a CAS if top and bottom are more than 1 apart.
Consensus
- If top and bottom are close, the local thread and a thief contend, and consensus is needed to resolve the conflict.
- In a sequence of k pushes or pops, the number of CAS operations is $\Theta(1)$.
Consensus
- Stealing half increases uncertainty: we would need consensus on half the queue.
- In a sequence of k pushes or pops, the number of CAS operations becomes $\Theta(k)$.
New Idea
- We can get down to $\Theta(\log k)$.
- How: limit uncertainty to the moments when the queue size passes a power of 2!
- Keep a "half-point" counter.
- A thief resets the counter.
- The local thread changes the counter at power-of-2 boundaries.
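The $\Theta(\log k)$ claim comes down to counting power-of-2 boundaries. This sketch counts how many of k pushes onto an empty queue land on a power-of-two size, which is where the local thread would touch the shared counter in the scheme above. It is only an illustration under that assumption (no thefts, pushes only); the class HalfPoint is hypothetical.

```java
/** Sketch: how often the owner touches the shared half-point counter during k pushes. */
public class HalfPoint {
  /** True when size is a power of two (1, 2, 4, ...). */
  public static boolean isPowerOfTwo(int size) {
    return size > 0 && Integer.bitCount(size) == 1;
  }

  /** Number of pushes, out of k pushes onto an empty queue, that land on a power-of-two size. */
  public static int boundaryCrossings(int k) {
    int count = 0;
    for (int size = 1; size <= k; size++) {
      if (isPowerOfTwo(size)) count++;   // only here would the counter (and a CAS) be touched
    }
    return count;
  }

  public static void main(String[] args) {
    System.out.println(boundaryCrossings(1000)); // 10, i.e. floor(log2 1000) + 1
  }
}
```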
The Big Picture
[Figure: the DEQueue with the current steal range and the previous steal range, each described by (tag, top, last), and bottom.]
- Steal range: up to $2^i$ threads can be stolen atomically.
- At least $2^i$ threads lie outside the steal range.
- Bottom is somewhere in a group of $2^{i+1}$ threads.
Steal Range
stealRange fields:
- tag: defeats the ABA problem
- top: index of the top-most item in the DEQueue
- stealLast: the last item to be stolen
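The code that follows uses makeStealRange, getTag, getTop, and getLast on a single long, but the slides do not give the bit layout. The sketch below shows one hypothetical layout (32-bit tag, 16-bit top, 16-bit last) and an assumed EMPTY sentinel, so that the whole range can be updated by one CAS on a 64-bit word. It also computes the number of items in a range on a circular array, treating last as inclusive. None of these choices come from the original code.

```java
/** Sketch: one hypothetical way to pack stealRange into a 64-bit word. */
public class StealRange {
  // Assumed layout: | tag (32 bits) | top (16 bits) | last (16 bits) |
  public static final int EMPTY = 0xFFFF;   // hypothetical sentinel for "no last item"

  public static long makeStealRange(int tag, int top, int last) {
    return ((long) tag << 32) | ((long) (top & 0xFFFF) << 16) | (last & 0xFFFF);
  }
  public static int getTag(long range)  { return (int) (range >>> 32); }
  public static int getTop(long range)  { return (int) ((range >>> 16) & 0xFFFF); }
  public static int getLast(long range) { return (int) (range & 0xFFFF); }

  /** Items in a non-empty range, top..last inclusive, on a circular array of the given capacity. */
  public static int size(long range, int capacity) {
    return (getLast(range) - getTop(range) + capacity) % capacity + 1;
  }

  public static void main(String[] args) {
    long r = makeStealRange(7, 3, 10);
    System.out.println(getTag(r) + " " + getTop(r) + " " + getLast(r) + " " + size(r, 16)); // 7 3 10 8
  }
}
```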
When to Steal?
    if (shouldBalance()) {
      Process victim = randomProcess();
      tryToSteal(victim);
    }
Options:
- Steal on empty.
- Steal probabilistically: the probability decreases as the queue grows.
- Steal when the queue size passes a threshold.
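As an illustration of the probabilistic option, here is one possible shouldBalance policy: always steal when the local queue is empty, and otherwise steal with probability 1/(size + 1), which decreases as the queue grows. The specific formula, the explicit size and Random parameters, and the class BalancePolicy are assumptions made for this sketch, not the policy used in the slides.

```java
import java.util.Random;

/** Sketch: one possible probabilistic stealing policy (hypothetical). */
public class BalancePolicy {
  /**
   * Always true when the local queue is empty (size == 0); otherwise true with
   * probability 1/(size + 1), which decreases as the queue grows.
   */
  public static boolean shouldBalance(int size, Random rng) {
    return rng.nextInt(size + 1) == 0;
  }

  public static void main(String[] args) {
    Random rng = new Random(42);
    int steals = 0;
    for (int i = 0; i < 10_000; i++) if (shouldBalance(9, rng)) steals++;
    System.out.println(steals); // expected value is 1000 (probability 1/10)
  }
}
```

Passing the Random in explicitly keeps the policy testable; a real scheduler would more likely use a per-thread generator such as ThreadLocalRandom.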
Before PushBottom
[Figure: the stealRange fields (tag, top, last), with the value 3, and bottom = 14.]
After PushBottom
[Figure: the stealRange fields (tag, top, last), with last = 7, and bottom = 15.]
Update stealRange
    boolean updateStealRange() {
      // Readjust when the queue size is a power of two,
      // or when a thief has taken some threads
      if (size is a power of two || theft occurred) {
        // New range size is roughly half the queue
        int newSize = Math.max(1, power of 2 closest to half);
        long oldRange = this.stealRange;
        int tag = getTag(oldRange);
        int top = getTop(oldRange);
        long newRange = makeStealRange(tag + 1, top, (top + newSize - 1) % QUEUE_SIZE);
        // Try to update stealRange to reflect the new size
        boolean ok = this.stealRange.CAS(oldRange, newRange);
        // If the update succeeded, save a copy of the updated range, to identify future thefts
        if (ok) this.prevStealRange = newRange;
        return ok;
      }
      return true;
    }
pushBottom Code
    public void pushBottom(Thread t) throws Full {   // thread to push
      if (this.getSize() == QUEUE_SIZE)              // are we full?
        throw new Full();
      this.deq[this.bottom] = t;                     // push thread
      this.bottom = (this.bottom + 1) % QUEUE_SIZE;
      updateStealRange();                            // update stealRange, if required
    }
Before PopBottom
[Figure: the stealRange fields (tag, top, last), with last = 7, and bottom = 15.]
After PopBottom
[Figure: the stealRange fields (tag, top, last), with the value 3, and bottom = 14.]
popBottom (Part One)
    public Object popBottom() throws Abort {
      if (this.getSize() == 0)          // bail if queue is empty
        return null;
      if (!updateStealRange())          // panic if unable to fix stealRange
        throw new Abort();
      if (this.bottom == 0)             // tentatively pop a thread
        this.bottom = QUEUE_SIZE - 1;
      else
        --this.bottom;
      Object t = this.deq[this.bottom];
      …
popBottom (Part Two)
    public Object popBottom() throws Abort {
      …
      // Deconstruct stealRange
      long oldStealRange = this.stealRange;
      int rangeTop = getTop(oldStealRange);
      int rangeBot = getLast(oldStealRange);
      if (rangeBot == EMPTY) {          // queue is empty: last thread already stolen,
        this.bottom = 0;                // so start over
        return null;
      } else if (this.bottom != rangeBot) {
        return t;                       // tentatively-popped thread not in stealRange:
                                        // no need to synchronize
      } else {
        …
popBottom (Part Three)
    public Object popBottom() throws Abort {
      …
      } else {
        // Queue has at most one thread: try to zero out the steal range
        int rangeTag = getTag(oldStealRange);
        if (this.stealRange.CAS(oldStealRange, makeStealRange(rangeTag + 1, 0, EMPTY))) {
          this.bottom = 0;
          return t;      // we succeeded: the thread is ours (and the deque is now empty)
        } else {
          this.bottom = 0;
          return null;   // we failed: our last thread was stolen
        }
      }
    }
stealTop (Part One) public int stealTop(EDEQueue victim) { long oldStealRange = victim.stealRange; int oldLast = getLast(oldStealRange); int oldTop = getTop(oldStealRange); int oldTag = getTag(oldStealRange); int deqBot = victim.bot; int rangeLen = oldStealRange.getSize(); int diff = 2*rangeLen – this.deq.length; if (diff <= 1) return 0; else { int numToSteal = diff/2 for (int i = 0; i < numToSteal; i++) this.deq[this.bottom+i % QUEUE_SIZE] = victim.deq[oldTop+i % QUEUE_SIZE]; }… 30-Nov-18 M. Herlihy & N. Shavit (c) 2003
stealTop (Part One) Victim DEQueue public int stealTop(EDEQueue victim) { long oldStealRange = victim.stealRange; int oldLast = getLast(oldStealRange); int oldTop = getTop(oldStealRange); int oldTag = getTag(oldStealRange); int deqBot = victim.bot; int rangeLen = oldStealRange.getSize(); int diff = 2*rangeLen – this.deq.length; if (diff <= 1) return 0; else { int numToSteal = diff/2 for (int i = 0; i < numToSteal; i++) this.deq[this.bottom+i % QUEUE_SIZE] = victim.deq[oldTop+i % QUEUE_SIZE]; }… Victim DEQueue 30-Nov-18 M. Herlihy & N. Shavit (c) 2003
stealTop (Part One)

public int stealTop(EDEQueue victim, int thiefLen) {  // returns the number of threads actually stolen
    // Deconstruct the victim's steal range
    long oldStealRange = victim.stealRange.get();
    int oldLast = getLast(oldStealRange);
    int oldTop = getTop(oldStealRange);
    int oldTag = getTag(oldStealRange);
    int deqBot = victim.bottom;
    // Compute the length of the victim's stealRange
    // (the victim's deque length is at least twice as much)
    int rangeLen = getSize(oldStealRange);
    // diff is a lower bound on the difference in length between victim and thief
    // (thiefLen = number of threads currently in the thief's deque)
    int diff = 2 * rangeLen - thiefLen;
    // If we can't equalize by stealing, don't steal!
    if (diff <= 1) return 0;
    // Try to steal half the guaranteed difference:
    // copy the threads to be stolen into the thief's deque.
    // The copies only become part of the thief's deque when bottom advances (Part Two).
    int numToSteal = diff / 2;
    for (int i = 0; i < numToSteal; i++)
        this.deq[(this.bottom + i) % DEQUE_SIZE] = victim.deq[(oldTop + i) % DEQUE_SIZE];
    …
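To make "deconstruct the victim's steal range" concrete, here is a minimal sketch of how tag, top, and last could share one 64-bit word, so that a single CAS can update all three at once. The bit layout (tag in the high 32 bits, top and last in 16 bits each), DEQUE_SIZE, and getSize are assumptions for illustration; the slides do not specify them. The sketch also includes the Part One arithmetic that decides how many threads to steal.

```java
// Hypothetical packing of a steal range into one 64-bit word (layout assumed, not from the slides):
// bits 32..63 = tag, bits 16..31 = top, bits 0..15 = last.
final class StealRange {
    static final int DEQUE_SIZE = 1 << 10; // assumed capacity; indices must fit in 16 bits

    static long makeStealRange(int tag, int top, int last) {
        return ((long) tag << 32) | ((long) (top & 0xFFFF) << 16) | (last & 0xFFFF);
    }

    static int getTag(long range)  { return (int) (range >>> 32); }
    static int getTop(long range)  { return (int) ((range >>> 16) & 0xFFFF); }
    static int getLast(long range) { return (int) (range & 0xFFFF); }

    // Number of slots from top to last inclusive, allowing wrap-around in the circular array.
    static int getSize(long range) {
        return Math.floorMod(getLast(range) - getTop(range), DEQUE_SIZE) + 1;
    }

    // Part One's decision: the victim holds at least 2 * rangeLen threads, so
    // diff = 2 * rangeLen - thiefLen is a lower bound on (victim length - thief length).
    // Stealing diff / 2 threads equalizes the two deques in the worst case.
    static int numToSteal(int rangeLen, int thiefLen) {
        int diff = 2 * rangeLen - thiefLen;
        return diff <= 1 ? 0 : diff / 2;
    }
}
```

For example, a victim steal range of length 8 means the victim holds at least 16 threads; a thief holding 4 computes diff = 12 and steals 6, leaving the victim with at least 10 and the thief with 10.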
stealTop (Part Two)

public int stealTop(EDEQueue victim, int thiefLen) {
    …
    // The new length of the victim's stealRange is about half the remaining number of threads
    int newRangeLen = max(1, power of 2 closest to half the remaining threads);
    int newTop = (oldTop + numToSteal) % DEQUE_SIZE;
    int newLast = (newTop + newRangeLen - 1) % DEQUE_SIZE;
    // Try to update the victim's stealRange to reflect the theft and the new range length.
    // Incrementing the tag makes the CAS fail for any thief whose snapshot is stale.
    long newRange = makeStealRange(oldTag + 1, newTop, newLast);
    if (victim.stealRange.compareAndSet(oldStealRange, newRange)) {
        // If it succeeded, update the thief's bottom and stealRange to include
        // the new threads, and return the number of stolen threads
        this.bottom = (this.bottom + numToSteal) % DEQUE_SIZE;
        this.updateStealRange();
        return numToSteal;
    }
    return 0;  // CAS failed: someone else changed the victim's stealRange first
}
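The commit step of Part Two can be sketched with java.util.concurrent.atomic.AtomicLong standing in for the victim's stealRange. This is a sketch under stated assumptions, not the slides' implementation: the 64-bit layout (tag in the high 32 bits, top and last in 16 bits each), DEQUE_SIZE, the tie-breaking rule in the hypothetical closestPowerOfTwo helper, and the caller-supplied remaining count are all illustrative choices. The copy into the thief's deque and the update of the thief's bottom are omitted.

```java
import java.util.concurrent.atomic.AtomicLong;

// Sketch of stealTop's commit step. Layout, DEQUE_SIZE and tie-breaking are assumptions.
final class StealCommit {
    static final int DEQUE_SIZE = 1 << 10;

    static long makeStealRange(int tag, int top, int last) {
        return ((long) tag << 32) | ((long) (top & 0xFFFF) << 16) | (last & 0xFFFF);
    }

    static int getTag(long r)  { return (int) (r >>> 32); }
    static int getTop(long r)  { return (int) ((r >>> 16) & 0xFFFF); }
    static int getLast(long r) { return (int) (r & 0xFFFF); }

    // Power of 2 closest to n; ties go to the smaller power (assumed). Returns 0 for n == 0.
    static int closestPowerOfTwo(int n) {
        int lo = Integer.highestOneBit(n);
        int hi = lo << 1;
        return (n - lo <= hi - n) ? lo : hi;
    }

    // Tries to remove numToSteal threads from the top of the victim's steal range.
    // remaining = threads left in the victim's deque after the theft (hypothetical parameter).
    static boolean commitSteal(AtomicLong victimRange, long oldRange, int numToSteal, int remaining) {
        int newRangeLen = Math.max(1, closestPowerOfTwo(remaining / 2));
        int newTop = (getTop(oldRange) + numToSteal) % DEQUE_SIZE;
        int newLast = (newTop + newRangeLen - 1) % DEQUE_SIZE;
        // Bumping the tag means a thief holding a stale snapshot cannot succeed,
        // even if top and last return to earlier values (the ABA problem).
        long newRange = makeStealRange(getTag(oldRange) + 1, newTop, newLast);
        return victimRange.compareAndSet(oldRange, newRange);
    }
}
```

Because the tag changes on every successful steal, a second thief that read the range before the first theft fails its compareAndSet and returns 0, and the entries it already copied are simply ignored, since its bottom never advances.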
Details
- Works even if someone steals from the thief
- The thief may fail to update its own stealRange
- But it still updates bottom, so the theft takes effect
Big Picture
- This code steals as much as it can
- Would it be more sensible to split the difference?
- The answer may depend on the stealing strategy
Vulnerability
- If the queue size hovers around a power of 2, performance will be lousy
- Extra credit: can we avoid this problem?
Conclusions
- "Boutique" lock-free structures
- Not general purpose
- Customized for work-stealing
- Non-trivial correctness issues
Alternative: Gang Scheduling
[Figure: timeline of processors 1 to 4 over time]
- Bad example: a 4-process computation and a 1-process computation on a 4-processor machine. The two gangs cannot run together (4 + 1 > 4 processors), so they take turns, and three processors sit idle whenever the 1-process computation runs.
- Good example: data-parallel programs with large working sets.
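The bad example can be quantified with a small calculation. This sketch assumes the gangs simply alternate equal time slices, each taking the whole machine in turn; the slide does not specify the time-slicing policy.

```java
// Worked example for gang scheduling, assuming round-robin equal time slices
// in which each gang gets the whole machine for one slice (an assumption).
final class GangUtilization {
    // Fraction of processor-slices doing useful work over one full round.
    // gangSizes[i] = number of processes in computation i.
    static double utilization(int processors, int[] gangSizes) {
        int busy = 0;
        for (int size : gangSizes) {
            busy += Math.min(size, processors); // a gang can use at most all processors
        }
        return (double) busy / ((long) processors * gangSizes.length);
    }
}
```

With 4 processors and gangs of sizes 4 and 1, one round offers 8 processor-slices but only 5 do useful work, so utilization is 5/8 = 62.5%; two gangs of size 4 would keep the machine fully busy.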
Alternative: Process Control
[Figure: timeline of processors 1 to 4 over time, marking where a process is killed and where a new process is created]
- Each computation dynamically creates and kills processes so that its number of processes equals the number of processors assigned to it.
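A minimal sketch of the process-control bookkeeping follows. The allocation policy is an assumption: processors are split evenly among active computations (equipartition), with any remainder going to the first ones. The slide only says that each computation matches its process count to its assignment; the class and method names are hypothetical.

```java
// Sketch of process control under an assumed equipartition policy.
final class ProcessControl {
    // Even split of processors among computations; the first (processors % computations)
    // computations receive one extra processor.
    static int[] assign(int processors, int computations) {
        int[] share = new int[computations];
        for (int i = 0; i < computations; i++) {
            share[i] = processors / computations + (i < processors % computations ? 1 : 0);
        }
        return share;
    }

    // Number of processes each computation must create (positive) or kill (negative)
    // so that its process count equals its assignment.
    static int[] adjustments(int[] current, int[] assigned) {
        int[] delta = new int[current.length];
        for (int i = 0; i < current.length; i++) {
            delta[i] = assigned[i] - current[i];
        }
        return delta;
    }
}
```

For example, if one computation runs 4 processes on a 4-processor machine and a second computation arrives, each is assigned 2 processors: the first kills 2 processes and the second creates 2.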