Spin Locks and Contention Management Multiprocessor Synchronization Nir Shavit Spring 2003
Focus so far: Correctness Models Accurate (we never lied to you) But idealized (so we forgot to mention a few things) Protocols Elegant Important But naïve 24-Nov-18 © 2003 Herlihy & Shavit
New Focus: Performance Models More complicated (not the same as complex!) Still focus on principles (not soon obsolete) Protocols Elegant (in their fashion) Important (why else would we pay attention) And realistic (your mileage may vary)
Kinds of Architectures SISD (Uniprocessor): single instruction stream, single data stream. SIMD (Vector): single instruction, multiple data. MIMD (Multiprocessors): multiple instruction, multiple data. Our space: MIMD.
MIMD Architectures Shared Bus (with memory) and Distributed. Issues: Memory Contention, Communication Contention, Communication Latency
Let's Look Again at Mutual Exclusion This time using synchronization operations stronger than reads or writes And on the way learn more about hardware and about: Memory Contention Communication Contention Communication Latency
Real World: What should a thread do if it can’t get the lock? Keep trying (“spin”, “busy-wait”): good if delays are short (our focus). Give up the processor: good if delays are long, always good on uniprocessor.
Basic Spin-Lock Threads (P2 … Pn) spin on the lock (0/1), enter the critical section (CS), and reset the lock upon exit. …the lock suffers from contention. Let's try to understand this phenomenon.
Review: Test-and-Set
public class RMW extends Register {
  int value;
  public synchronized int TAS() {
    int result = value;  // remember old value
    value = 1;           // new value is 1
    return result;       // return old value
  }
}
Test-and-Set Atomically: returns previous value and sets current value to 1 (an atomic swap). Use write method to reset.
Test-and-Set Locks Lock is free: value is 0 Lock is taken: value is 1 Acquire lock by calling TAS If result is 0, you win If result is 1, you lose Release lock by writing 0
TASLock
public class TASLock implements Lock {
  TASRegister lock = new TASRegister(0);
  public void acquire(int i) {
    while (lock.TAS() == 1) {}  // keep trying until lock acquired
  }
  public void release(int i) {
    lock.write(0);              // simple write to release
  }
}
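The TASLock slide can be tried out on a real JVM with a sketch like the one below. It is not the slide's `TASRegister` code: it assumes `java.util.concurrent.atomic.AtomicBoolean`, whose `getAndSet(true)` stands in for TAS, and the `TASLockDemo` and `run` names are made up for illustration. The `run` helper also performs the shared-counter increment used in the performance experiment that follows.

```java
import java.util.concurrent.atomic.AtomicBoolean;

class TASLockDemo {
    /** Test-and-set spin lock: false = free, true = taken. */
    static class TASLock {
        private final AtomicBoolean state = new AtomicBoolean(false);

        void acquire() {
            // getAndSet(true) returns the old value and leaves the flag set,
            // so we spin until the old value was "free".
            while (state.getAndSet(true)) { }
        }

        void release() {
            state.set(false); // a simple write resets the lock
        }
    }

    static int counter = 0; // only touched while holding the lock

    /** Starts `threads` workers; each increments the shared counter `perThread` times. */
    static int run(int threads, int perThread) {
        counter = 0;
        TASLock lock = new TASLock();
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            workers[t] = new Thread(() -> {
                for (int k = 0; k < perThread; k++) {
                    lock.acquire();
                    try { counter++; } finally { lock.release(); }
                }
            });
            workers[t].start();
        }
        for (Thread w : workers) {
            try { w.join(); } catch (InterruptedException e) { throw new RuntimeException(e); }
        }
        return counter;
    }

    public static void main(String[] args) {
        System.out.println(run(4, 100_000)); // no increment is lost, so this prints 400000
    }
}
```

Timing `run` for increasing thread counts is one way to reproduce the shape of the experiment, although absolute numbers will depend heavily on the machine.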
Performance Experiment n threads Increment shared counter 1 million times How long should it take? How long does it take?
Graph (time vs. threads): ideal curve. Initial speedup: loop overhead in parallel. Work independent of number of threads.
Mystery #1 (time vs. threads): TAS lock vs. ideal. Houston, we have a problem …
Test-and-Test-and-Set Locks Lurking stage: wait until lock “looks” free; spin while read returns 1 (lock taken). Pouncing stage: as soon as lock “looks” available (read returns 0, lock free), call TAS to acquire lock. If TAS loses, back to lurking.
TTASLock
public class TTASLock implements Lock {
  TASRegister lock = new TASRegister(0);
  public void acquire(int i) {
    while (true) {
      while (lock.read() == 1) {}    // wait until lock looks free
      if (lock.TAS() == 0) return;   // then try to acquire lock
    }
  }
}
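A runnable counterpart of the TTASLock slide might look as follows. As before, `AtomicBoolean` is an assumed stand-in for the slide's `TASRegister`, and `TTASLockDemo` and `run` are hypothetical names. The only change from a plain TAS lock is the inner read-only loop.

```java
import java.util.concurrent.atomic.AtomicBoolean;

class TTASLockDemo {
    /** Test-and-test-and-set lock: read until the lock looks free, then try TAS. */
    static class TTASLock {
        private final AtomicBoolean state = new AtomicBoolean(false);

        void acquire() {
            while (true) {
                while (state.get()) { }             // lurking: spin on reads only
                if (!state.getAndSet(true)) return; // pouncing: TAS, return if we won
                // TAS lost the race: go back to lurking
            }
        }

        void release() {
            state.set(false);
        }
    }

    static int counter = 0; // only touched while holding the lock

    /** Starts `threads` workers; each increments the shared counter `perThread` times. */
    static int run(int threads, int perThread) {
        counter = 0;
        TTASLock lock = new TTASLock();
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            workers[t] = new Thread(() -> {
                for (int k = 0; k < perThread; k++) {
                    lock.acquire();
                    try { counter++; } finally { lock.release(); }
                }
            });
            workers[t].start();
        }
        for (Thread w : workers) {
            try { w.join(); } catch (InterruptedException e) { throw new RuntimeException(e); }
        }
        return counter;
    }

    public static void main(String[] args) {
        System.out.println(run(4, 100_000)); // prints 400000
    }
}
```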
Mystery #2 (time vs. threads): ideal, TAS lock, TTAS lock
Mystery Both TAS and TTAS do the same thing (in our model). Except that TTAS performs much better than TAS. Neither approaches ideal.
Opinion Our memory abstraction is broken TAS & TTAS methods Are provably the same (in our model) Except they aren’t (in field tests) Need a more detailed model … 24-Nov-18 © 2003 Herlihy & Shavit
Bus-Based Architectures cache cache cache Bus memory 24-Nov-18 © 2003 Herlihy & Shavit
Bus-Based Architectures Random access memory (10s of cycles) cache cache cache Bus memory 24-Nov-18 © 2003 Herlihy & Shavit
Bus-Based Architectures Shared Bus broadcast medium One broadcaster at a time Processors and memory all “snoop” cache cache cache Bus memory 24-Nov-18 © 2003 Herlihy & Shavit
Bus-Based Architectures Per-Processor Caches Small Fast: 1 or 2 cycles Address & state information cache cache cache Bus memory
Jargon Watch Cache hit: “I found what I wanted in my cache” Good Thing™ Cache miss: “I had to shlep all the way to memory for that data” Bad Thing™
Cave Canem This model is still a simplification But not in any essential way Illustrates basic principles Will discuss complexities later 24-Nov-18 © 2003 Herlihy & Shavit
Processor Issues Load Request Gimme data cache cache cache Bus Bus memory data 24-Nov-18 © 2003 Herlihy & Shavit (1)
Memory Responds memory cache cache cache I got data data data Bus Bus 24-Nov-18 © 2003 Herlihy & Shavit (3)
Processor Issues Load Request Gimme data data cache cache Bus Bus memory data 24-Nov-18 © 2003 Herlihy & Shavit (2)
Other Processor Responds I got data data data cache cache Bus Bus memory data 24-Nov-18 © 2003 Herlihy & Shavit (2)
Modify Cached Data memory data data data data cache Bus 24-Nov-18 © 2003 Herlihy & Shavit (1)
Modify Cached Data What’s up with the other copies? data data cache Bus memory data
Cache Coherence We have lots of copies of data Original copy in memory Cached copies at processors Some processor modifies its own copy What do we do with the others? How to avoid confusion? Generic version is a fundamental problem™ in Computer Science 24-Nov-18 © 2003 Herlihy & Shavit
Fundamental Problem™ Managing replicated data This is a fundamental problem™ in Computer Science Multiprocessor architecture Distributed file systems Distributed databases … 24-Nov-18 © 2003 Herlihy & Shavit
Write-Through Cache memory Listen to me! data data data data cache Bus Bus memory data 24-Nov-18 © 2003 Herlihy & Shavit (5)
Write-Through Caches Immediately broadcast changes. Good: memory and caches always agree; more read hits, maybe. Bad (“show stoppers”): bus traffic on all writes, yet most writes are to unshared data, for example loop indexes …
Write-Back Caches Accumulate changes in cache Write back when needed Need the cache for something else Another processor wants it On first modification Invalidate other entries Requires non-trivial protocol … 24-Nov-18 © 2003 Herlihy & Shavit
Write-Back Caches Cache entry has three states Invalid: contains raw seething bits Valid: I can read but I can’t write Dirty: Data has been modified Intercept other load requests Write back to memory before using cache 24-Nov-18 © 2003 Herlihy & Shavit
Invalidate memory Mine, all mine! Uh,oh cache data data cache data Bus 24-Nov-18 © 2003 Herlihy & Shavit (4)
Invalidate memory Other caches lose read permission data cache Bus This cache acquires write permission memory data 24-Nov-18 © 2003 Herlihy & Shavit (2)
Invalidate Memory provides data only if not present in any cache, so no need to change it now (expensive) cache data cache Bus memory data 24-Nov-18 © 2003 Herlihy & Shavit (2)
Another Processor Asks for Data cache data cache Bus Bus memory data 24-Nov-18 © 2003 Herlihy & Shavit (2)
Owner Responds memory Here it is! cache data data cache data Bus Bus 24-Nov-18 © 2003 Herlihy & Shavit (2)
End of the Day … Reading OK, no writing data data data cache Bus memory data
Bus-Based Architectures data data cache Bus memory data 24-Nov-18 © 2003 Herlihy & Shavit
Mutual Exclusion What do we want to optimize? Bus bandwidth used by spinning threads Release/Acquire latency Acquire latency for idle lock 24-Nov-18 © 2003 Herlihy & Shavit
Simple TAS TAS invalidates cache lines Spinners Miss in cache Go to bus Thread wants to release lock delayed behind spinners 24-Nov-18 © 2003 Herlihy & Shavit
Test-and-test-and-set Wait until lock “looks” free Spin on local cache No bus use while lock busy Problem: when lock is released Invalidation storm … 24-Nov-18 © 2003 Herlihy & Shavit
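The difference between the two spinning styles can be made concrete with a toy model of the three-state write-back protocol (invalid, valid, dirty) from the earlier slides. This is a deliberately crude sketch, not a simulation of real hardware: it tracks one cache line, treats every TAS as a write that needs exclusive access, and counts bus transactions while one processor holds the lock. All names here (`ToyBusModel`, `spin`) are made up for illustration.

```java
import java.util.Arrays;

class ToyBusModel {
    enum State { INVALID, VALID, DIRTY }

    final State[] line;       // state of the lock's cache line in each processor's cache
    int busTransactions = 0;

    ToyBusModel(int processors) {
        line = new State[processors];
        Arrays.fill(line, State.INVALID);
    }

    /** Load: a hit unless this cache's copy is INVALID. */
    void read(int p) {
        if (line[p] != State.INVALID) return;   // cache hit, no bus use
        busTransactions++;                      // miss: go to the bus
        for (int q = 0; q < line.length; q++) {
            if (line[q] == State.DIRTY) line[q] = State.VALID; // owner supplies the data
        }
        line[p] = State.VALID;
    }

    /** Store or TAS: needs write permission, so other copies are invalidated. */
    void write(int p) {
        if (line[p] == State.DIRTY) return;     // already holds write permission
        busTransactions++;
        for (int q = 0; q < line.length; q++) {
            if (q != p) line[q] = State.INVALID;
        }
        line[p] = State.DIRTY;
    }

    /** Processor 0 holds the lock; processors 1..n-1 each make `rounds` attempts. */
    static int spin(int n, int rounds, boolean testFirst) {
        ToyBusModel m = new ToyBusModel(n);
        m.write(0);                             // holder's successful TAS
        for (int r = 0; r < rounds; r++) {
            for (int p = 1; p < n; p++) {
                if (testFirst) m.read(p);       // TTAS: spin on the cached copy
                else m.write(p);                // TAS: every attempt is a write
            }
        }
        return m.busTransactions;
    }

    public static void main(String[] args) {
        System.out.println("TAS  bus transactions: " + spin(4, 1000, false)); // 3001
        System.out.println("TTAS bus transactions: " + spin(4, 1000, true));  // 4
    }
}
```

In this model the three TAS spinners keep stealing the line from one another, so every attempt uses the bus, while the TTAS spinners each miss once and then hit locally. The model says nothing about what happens at release time, which is where TTAS pays its cost.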
Local Spinning while Lock is Busy memory busy 24-Nov-18 © 2003 Herlihy & Shavit
On Release memory free invalid invalid free Bus 24-Nov-18 © 2003 Herlihy & Shavit
On Release Everyone misses, rereads miss miss invalid invalid free Bus memory free
On Release Everyone tries TAS TAS(…) TAS(…) free invalid Bus memory free
Problems Everyone misses: reads satisfied sequentially. Everyone does TAS: invalidates others’ caches. Eventually reaches quiescence after lock acquired. How long does this take?
Measuring Quiescence Time X = time of ops that don’t use the bus. Y = time of ops that cause intensive bus traffic. Threads P1, P2, …, Pn contend for a spin lock (0/1) guarding the critical section (CS). In critical section, run ops X then ops Y. As long as quiescence time is less than X, no drop in performance. By gradually varying X, can determine the exact time to quiesce.
Quiescence Time (time vs. threads): increases linearly with the number of processors for bus architecture
Mystery Explained time threads Ideal TAS lock TTAS lock 24-Nov-18 © 2003 Herlihy & Shavit
Solution: Introduce Delay The place where the delay is inserted: After lock release After every lock reference The way the size is set: 1. Static 2. Dynamic (Figure: spin-lock timeline with delays d, r1d, r2d)
Static Example: Slotted Delays Split time into slots. Each process delays by the amount that will place it in a predetermined slot. (Figure: spin-lock timeline with delays d, 2d, 3d)
Dynamic Example: Exponential Backoff If I fail to get lock, wait random duration before retry. Each subsequent failure doubles expected wait. (Figure: spin-lock timeline with delays d, 2d, 4d)
Delay Strategies The place where the delay is inserted: After lock release After every lock reference The way the size is set: 1. Static 2. Dynamic Usually most effective: exponential backoff
Exponential Backoff Lock
public class Backoff implements Lock {
  public void acquire() {
    int delay = MIN_DELAY;             // fix minimum delay
    while (true) {
      while (lock.read() == 1) {}      // wait until lock looks free
      if (lock.TAS() == 0) return;     // if we win, return
      sleep(random() % delay);         // back off for random duration
      if (delay < MAX_DELAY)
        delay = 2 * delay;             // double max delay, within reason
    }
  }
}
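One way to turn the backoff slide into runnable Java is sketched below. Several details are assumptions rather than part of the slide: the delay bounds are arbitrary tuning values, `LockSupport.parkNanos` replaces the slide's `sleep`, `ThreadLocalRandom` replaces `random() % delay`, and the `BackoffLockDemo` and `run` names are hypothetical.

```java
import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.locks.LockSupport;

class BackoffLockDemo {
    /** TTAS lock that backs off for a random, exponentially growing time after a lost TAS. */
    static class BackoffLock {
        static final long MIN_DELAY_NANOS = 1_000;       // assumed value, tune per machine
        static final long MAX_DELAY_NANOS = 1_000_000;   // assumed value, tune per machine
        private final AtomicBoolean state = new AtomicBoolean(false);

        void acquire() {
            long limit = MIN_DELAY_NANOS;                 // fix minimum delay
            while (true) {
                while (state.get()) { }                   // wait until lock looks free
                if (!state.getAndSet(true)) return;       // if we win, return
                // Lost the race: pause for a random duration in [1, limit] nanoseconds.
                long pause = ThreadLocalRandom.current().nextLong(limit) + 1;
                LockSupport.parkNanos(pause);
                if (limit < MAX_DELAY_NANOS) limit *= 2;  // double the bound, within reason
            }
        }

        void release() {
            state.set(false);
        }
    }

    static int counter = 0; // only touched while holding the lock

    /** Starts `threads` workers; each increments the shared counter `perThread` times. */
    static int run(int threads, int perThread) {
        counter = 0;
        BackoffLock lock = new BackoffLock();
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            workers[t] = new Thread(() -> {
                for (int k = 0; k < perThread; k++) {
                    lock.acquire();
                    try { counter++; } finally { lock.release(); }
                }
            });
            workers[t].start();
        }
        for (Thread w : workers) {
            try { w.join(); } catch (InterruptedException e) { throw new RuntimeException(e); }
        }
        return counter;
    }

    public static void main(String[] args) {
        System.out.println(run(4, 100_000)); // prints 400000
    }
}
```

`parkNanos` may return early, which is harmless here because the loop simply rechecks the lock.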
Spin-Waiting Overhead TTS exp-backoff 24-Nov-18 © 2003 Herlihy & Shavit
Can We Improve On This? Optimize “slot” size before trying to enter the CS and avoid useless invalidations How? By keeping a queue of threads Each thread Notifies next in line Without bothering the others 24-Nov-18 © 2003 Herlihy & Shavit
Anderson’s Queue Lock
class ALock implements Lock {
  int[] flags = new int[n];    // initially {ENTER, WAIT, …, WAIT}
  int[] myslot = new int[n];   // initially {0, …, 0}; entry i belongs to thread i
  RMW tail = new RMW(0);

  public void acquire() {
    myslot[i] = tail.fetchInc();               // thread i gets slot
    while (flags[myslot[i] % n] == WAIT) {}    // wait while busy
    flags[myslot[i] % n] = WAIT;               // reset slot for next round
  }
  public void release() {
    flags[(myslot[i] + 1) % n] = ENTER;        // release next thread
  }
}
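A self-contained version of the array-based queue lock could look like this. It is a sketch under stated assumptions: `AtomicInteger.getAndIncrement` plays the role of `fetchInc`, an `AtomicIntegerArray` holds the flags so that spinning threads see updates, a `ThreadLocal` replaces the `myslot[i]` array, and the `Thread.yield()` in the wait loop is only there so the demo finishes quickly when threads outnumber cores. The lock is only correct for at most `n` concurrent threads.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicIntegerArray;

class ALockDemo {
    /** Array-based queue lock for at most n concurrent threads. */
    static class ALock {
        private static final int WAIT = 0, ENTER = 1;
        private final int n;
        private final AtomicIntegerArray flags;                 // one flag per slot
        private final AtomicInteger tail = new AtomicInteger(0);
        private final ThreadLocal<Integer> mySlot = ThreadLocal.withInitial(() -> 0);

        ALock(int n) {
            this.n = n;
            flags = new AtomicIntegerArray(n);                  // all WAIT (0) by default
            flags.set(0, ENTER);                                // slot 0 may enter first
        }

        void acquire() {
            int slot = Math.floorMod(tail.getAndIncrement(), n); // take the next slot
            mySlot.set(slot);
            while (flags.get(slot) == WAIT) { Thread.yield(); }  // wait while busy
            flags.set(slot, WAIT);                               // reset slot for next round
        }

        void release() {
            flags.set((mySlot.get() + 1) % n, ENTER);            // release next thread in line
        }
    }

    static int counter = 0; // only touched while holding the lock

    /** Starts `threads` workers; each increments the shared counter `perThread` times. */
    static int run(int threads, int perThread) {
        counter = 0;
        ALock lock = new ALock(threads);
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            workers[t] = new Thread(() -> {
                for (int k = 0; k < perThread; k++) {
                    lock.acquire();
                    try { counter++; } finally { lock.release(); }
                }
            });
            workers[t].start();
        }
        for (Thread w : workers) {
            try { w.join(); } catch (InterruptedException e) { throw new RuntimeException(e); }
        }
        return counter;
    }

    public static void main(String[] args) {
        System.out.println(run(4, 10_000)); // prints 40000
    }
}
```

Note the parenthesized `(slot + 1) % n` in `release`: without the parentheses, `%` binds tighter than `+` and the index runs off the end of the array.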
Performance tts Curve is practically flat Scalable performance queue 24-Nov-18 © 2003 Herlihy & Shavit
Observations Need to allocate size n-array no matter how many threads actually access the lock Cache line granularity? Read the whole array into cache? Can we do better? 24-Nov-18 © 2003 Herlihy & Shavit
CLH Lock FIFO order Small, Constant-size overhead per thread 24-Nov-18 © 2003 Herlihy & Shavit
CLH Queue Lock 1 Critical Section release Swap into queue tail, wait for a “0” in Predecessor’s node 24-Nov-18 © 2003 Herlihy & Shavit
CLH Queue Lock
class Qnode {
  boolean locked = true;   // I have not released yet
}
CLH Queue Lock
class CLHLock implements Lock {
  RMW queue;
  public void acquire(Qnode mynode) {
    // mynode.locked = true;  (already the default for a fresh Qnode)
    Qnode pred = queue.swap(mynode);   // get pred and point tail to me
    while (pred.locked) {}             // wait until unlocked
  }
}
CLH Queue Lock
class CLHLock implements Lock {
  RMW queue;
  public void release(Qnode mynode) {
    mynode.locked = false;   // notify successor
  }
}
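The two CLH slides can be combined into a runnable sketch such as the one below. It departs from the slides in ways that should be read as assumptions: nodes live in `ThreadLocal`s instead of being passed as arguments, the tail starts at a dummy unlocked node, each thread recycles its predecessor's node after release (a common textbook arrangement), and `Thread.yield()` in the wait loop is only a courtesy for machines with fewer cores than threads.

```java
import java.util.concurrent.atomic.AtomicReference;

class CLHLockDemo {
    static class QNode {
        volatile boolean locked = false; // true while the owner holds or wants the lock
    }

    /** CLH queue lock: each thread spins on its predecessor's node. */
    static class CLHLock {
        private final AtomicReference<QNode> tail = new AtomicReference<>(new QNode());
        private final ThreadLocal<QNode> myNode = ThreadLocal.withInitial(QNode::new);
        private final ThreadLocal<QNode> myPred = new ThreadLocal<>();

        void acquire() {
            QNode node = myNode.get();
            node.locked = true;                      // I have not released yet
            QNode pred = tail.getAndSet(node);       // get pred and point tail to me
            myPred.set(pred);
            while (pred.locked) { Thread.yield(); }  // wait until predecessor unlocks
        }

        void release() {
            QNode node = myNode.get();
            node.locked = false;                     // notify successor
            myNode.set(myPred.get());                // reuse predecessor's node next time
        }
    }

    static int counter = 0; // only touched while holding the lock

    /** Starts `threads` workers; each increments the shared counter `perThread` times. */
    static int run(int threads, int perThread) {
        counter = 0;
        CLHLock lock = new CLHLock();
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            workers[t] = new Thread(() -> {
                for (int k = 0; k < perThread; k++) {
                    lock.acquire();
                    try { counter++; } finally { lock.release(); }
                }
            });
            workers[t].start();
        }
        for (Thread w : workers) {
            try { w.join(); } catch (InterruptedException e) { throw new RuntimeException(e); }
        }
        return counter;
    }

    public static void main(String[] args) {
        System.out.println(run(4, 10_000)); // prints 40000
    }
}
```

Recycling the predecessor's node matters: a thread must not reuse its own node right away, because its successor may still be reading it.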
Initially acquire idle queue false 24-Nov-18 © 2003 Herlihy & Shavit
Purple Acquires Lock acquire idle queue false true 24-Nov-18 © 2003 Herlihy & Shavit
Red Wants Lock acquire want Enter CS queue true false true 24-Nov-18 © 2003 Herlihy & Shavit
NUMA Machines Distributed Shared Memory Machine Non-Uniform Memory Access (NUMA) Shared local memory is fast Shared remote memory is slow 24-Nov-18 © 2003 Herlihy & Shavit
MCS Lock On NUMA machine without caches CLH is problematic because it spins on remote location MCS Queue Lock: FIFO order Small, Constant-size overhead Local spinning! 24-Nov-18 © 2003 Herlihy & Shavit
MCS Queue Lock 1 Critical Section 1 1 Swap into tail of list, wait for a “0” in local node 24-Nov-18 © 2003 Herlihy & Shavit
MCS Queue Lock
class Qnode {
  boolean locked = false;
  Qnode next = null;
}
MCS Queue Lock
class MCSLock implements Lock {
  RMW queue;
  public void acquire(Qnode mynode) {
    Qnode pred = queue.swap(mynode);   // point to my qnode
    if (pred != null) {
      mynode.locked = true;
      pred.next = mynode;              // point pred to my node
      while (mynode.locked) {}         // wait until unlocked
    }
  }
}
Purple Acquires Lock locked idle false 24-Nov-18 © 2003 Herlihy & Shavit
Red Wants Lock locked allocate qnode true false 24-Nov-18 © 2003 Herlihy & Shavit
Red Wants Lock locked spinning true false 24-Nov-18 © 2003 Herlihy & Shavit
MCS Queue Lock
class MCSLock implements Lock {
  RMW queue;
  public void release(Qnode mynode) {
    if (mynode.next == null) {                  // no successor?
      if (queue.CAS(mynode, null)) return;
      while (mynode.next == null) {}            // wait for successor
    }
    mynode.next.locked = false;                 // notify successor
  }
}
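Putting the MCS acquire and release slides together gives a sketch like the following. The assumptions are similar to the earlier ones: `AtomicReference.getAndSet` and `compareAndSet` stand in for the slide's `swap` and `CAS`, each thread keeps its node in a `ThreadLocal` rather than passing it in, and `Thread.yield()` in the wait loops only keeps the demo fast on machines with few cores.

```java
import java.util.concurrent.atomic.AtomicReference;

class MCSLockDemo {
    static class QNode {
        volatile boolean locked = false;
        volatile QNode next = null;
    }

    /** MCS queue lock: each thread spins on a flag in its own node. */
    static class MCSLock {
        private final AtomicReference<QNode> tail = new AtomicReference<>(null);
        private final ThreadLocal<QNode> myNode = ThreadLocal.withInitial(QNode::new);

        void acquire() {
            QNode node = myNode.get();
            QNode pred = tail.getAndSet(node);           // swap myself into the tail
            if (pred != null) {
                node.locked = true;                      // set before linking in
                pred.next = node;                        // point pred to my node
                while (node.locked) { Thread.yield(); }  // spin on my own node
            }
        }

        void release() {
            QNode node = myNode.get();
            if (node.next == null) {                            // no successor?
                if (tail.compareAndSet(node, null)) return;     // really nobody waiting
                while (node.next == null) { Thread.yield(); }   // successor is mid-enqueue
            }
            node.next.locked = false;                           // notify successor
            node.next = null;                                   // clean up for reuse
        }
    }

    static int counter = 0; // only touched while holding the lock

    /** Starts `threads` workers; each increments the shared counter `perThread` times. */
    static int run(int threads, int perThread) {
        counter = 0;
        MCSLock lock = new MCSLock();
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            workers[t] = new Thread(() -> {
                for (int k = 0; k < perThread; k++) {
                    lock.acquire();
                    try { counter++; } finally { lock.release(); }
                }
            });
            workers[t].start();
        }
        for (Thread w : workers) {
            try { w.join(); } catch (InterruptedException e) { throw new RuntimeException(e); }
        }
        return counter;
    }

    public static void main(String[] args) {
        System.out.println(run(4, 10_000)); // prints 40000
    }
}
```

The failed `compareAndSet` in `release` corresponds to the case where another thread has already swapped itself into the tail but has not yet set `pred.next`, so the releaser waits for the link to appear.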
Purple Release (releasing; another thread has swapped itself in): By looking at the queue, I see another thread is active. I have to wait for that thread to finish.
Purple Release releasing spinning Enter CS false true false 24-Nov-18 © 2003 Herlihy & Shavit
Performance Test&Test&set A-Lock Exp-backoff MCS NUMA No Coherence (c. 1982) 24-Nov-18 © 2003 Herlihy & Shavit
How Different are Modern Machines? TAS with backoff vs. MCS on Sun Wildfire (c. 1998). Experiments courtesy of Scott.
Contention Eliminated We reduced contention by slotting thread access to a lock over time. We saw that queue locks provide very tight slotting and limit invalidation traffic, thus lowering contention with minimal latency.
Java Synchronization synchronized (exp) {… actions …}: lock an object. wait: release lock and suspend thread. notify, notifyAll: wake one or all to resume execution where it was suspended.
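As a small illustration of these three primitives, here is a one-slot mailbox guarded by the object's own monitor. The `Mailbox` class and the `sumThroughMailbox` helper are invented for this example; only `synchronized`, `wait`, and `notifyAll` come from the slide.

```java
class MonitorDemo {
    /** One-slot mailbox: put blocks while full, take blocks while empty. */
    static class Mailbox {
        private Integer item = null;

        synchronized void put(int v) throws InterruptedException {
            while (item != null) wait();   // release the monitor and suspend until emptied
            item = v;
            notifyAll();                   // wake any thread waiting in take()
        }

        synchronized int take() throws InterruptedException {
            while (item == null) wait();   // release the monitor and suspend until filled
            int v = item;
            item = null;
            notifyAll();                   // wake any thread waiting in put()
            return v;
        }
    }

    /** A producer thread sends 1..count through the mailbox; the caller sums what arrives. */
    static int sumThroughMailbox(int count) {
        Mailbox box = new Mailbox();
        Thread producer = new Thread(() -> {
            try {
                for (int v = 1; v <= count; v++) box.put(v);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        producer.start();
        int sum = 0;
        try {
            for (int k = 0; k < count; k++) sum += box.take();
            producer.join();
        } catch (InterruptedException e) {
            throw new RuntimeException(e);
        }
        return sum;
    }

    public static void main(String[] args) {
        System.out.println(sumThroughMailbox(100)); // 1 + 2 + ... + 100 = 5050
    }
}
```

The `while` loops around `wait()` recheck the condition after every wake-up, which is needed because a woken thread resumes where it was suspended and must reacquire the lock first.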
Locks in Java Frequent: benchmark shows 765,000/second. Ubiquitous: every object has a (potential) lock. Space overhead? Potentially huge; actual small (6% in Javac).
Paradox? Frequency requires time efficiency. Ubiquity requires space efficiency.
Solution Create lock only when needed Fast path for common case The Meta Lock: 2 bits in header Local spinning only 24-Nov-18 © 2003 Herlihy & Shavit
Java Synchronization Java compiled to byte code Must respect block structure Must deal with exceptions Nested locks OK Locks need to count 24-Nov-18 © 2003 Herlihy & Shavit
Jargon Watch Monitor lock: protects object. Meta lock: protects monitor lock. Modus Operandi: acquire meta lock, manipulate monitor lock, release meta lock.
Java Objects Class pointer Object header Multi-use word User-defined fields 24-Nov-18 © 2003 Herlihy & Shavit
Meta-Lock meta lock other stuff 2 bits 30 bits Multi-use word 24-Nov-18 © 2003 Herlihy & Shavit
Meta-Lock 2 bits = 4 states: Neutral, Locked, Waiters, Busy
Neutral State hash code age Usual state: nothing happening
Locked State lock record pointer to lock records 1 Object is monitor-locked
Lock Record Owner thread Lock count Hash and age (displaced) Next lock record in queue Free list for unused records lock record pointer to lock record 1 Object is monitor-locked 24-Nov-18 © 2003 Herlihy & Shavit
Waiters State pointer to lock records 1 Monitor lock released, but other threads waiting to get in 24-Nov-18 © 2003 Herlihy & Shavit
Busy State environment pointer 1 1 Metalock is locked environment 24-Nov-18 © 2003 Herlihy & Shavit
Acquire Meta-Lock
BitField getMetaLock(ExecEnv *ee, Object *obj) {
  BitField busyBits = ee | BUSY;                              // prepare new value
  BitField lockBits = SWAP(busyBits, multiUseWordAddr(obj));  // swap it in
  if (getLockState(lockBits) != BUSY)
    return lockBits;                                          // return if not already locked
  else
    return getMetaLockSlow(ee, lockBits);                     // otherwise take slow path
}
Slow Path Acquire First thread knows it’s first Didn’t see BUSY bits Later threads know predecessor From result of SWAP 24-Nov-18 © 2003 Herlihy & Shavit
Release Meta-Lock
void releaseMetaLock(ExecEnv *ee, Object *obj, BitField releaseBits) {
  BitField busyBits = ee | BUSY;                              // value we expect
  // try to replace it (this CAS returns the old value)
  BitField lockBits = CAS(releaseBits, busyBits, multiUseWordAddr(obj));
  if (lockBits != busyBits)
    releaseMetaLockSlow(ee, lockBits);                        // take slow path if unsuccessful
}
Release Slow Path Hand off the metalock to next waiting thread Synchronize via successor’s environment structure …
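The fast paths above can be mimicked in Java to see how the swap and the compare-and-swap interact. This is only an analogy built on assumptions: the numeric values of the four states are chosen arbitrarily (the slides do not fix them), an integer `envId` stands in for the execution-environment pointer, an `AtomicInteger` stands in for the multi-use word, and the slow paths are left out entirely. Every name in the block is hypothetical.

```java
import java.util.concurrent.atomic.AtomicInteger;

class MetaLockSketch {
    // Assumed encoding of the 2 low-order state bits; the actual values may differ.
    static final int NEUTRAL = 0, LOCKED = 1, WAITERS = 2, BUSY = 3;
    static final int STATE_MASK = 3;

    /** Packs a 30-bit payload and a 2-bit state into one 32-bit word. */
    static int pack(int payload30, int state) { return (payload30 << 2) | state; }
    static int state(int word)   { return word & STATE_MASK; }
    static int payload(int word) { return word >>> 2; }

    /**
     * Fast path of acquire: swap in (envId | BUSY) and return the old word.
     * If the old state is not BUSY, the caller now holds the meta-lock;
     * otherwise the old payload identifies the predecessor (slow path, not shown).
     */
    static int getMetaLockFast(AtomicInteger word, int envId) {
        return word.getAndSet(pack(envId, BUSY));
    }

    /**
     * Fast path of release: succeeds only if the word still holds our own busy bits,
     * meaning no other thread swapped itself in while we held the meta-lock.
     */
    static boolean releaseMetaLockFast(AtomicInteger word, int envId, int releaseBits) {
        return word.compareAndSet(pack(envId, BUSY), releaseBits);
    }

    public static void main(String[] args) {
        AtomicInteger word = new AtomicInteger(pack(12345, NEUTRAL));

        int seenByA = getMetaLockFast(word, 7);   // thread A (env 7) arrives first
        System.out.println(state(seenByA) != BUSY);                          // true: A holds the meta-lock

        int seenByB = getMetaLockFast(word, 9);   // thread B (env 9) arrives while A holds it
        System.out.println(state(seenByB) == BUSY && payload(seenByB) == 7); // true: B learns its predecessor

        // A's fast release fails because B's busy bits replaced A's: A must hand off.
        System.out.println(releaseMetaLockFast(word, 7, pack(12345, LOCKED))); // false
    }
}
```

Unlike the CAS on the slide, Java's `compareAndSet` returns a boolean rather than the old value, so the sketch reports success directly instead of comparing the returned word.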
Locking Objects Common cases: Neutral, Waiters, Recursively locked. No thread interaction needed.
Locking Objects Mutex object (lock) Suspends on Condition variable Release processor Until condition is signalled Not a spin lock When it awakes, takes slow path Locked: go back to sleep Unlocked: update object and go for it 24-Nov-18 © 2003 Herlihy & Shavit
Unlocking Objects Common cases: Recursive lock, No other threads. No thread interactions needed.
Unlocking Object Obtain metalock Remove own lock record Wake up successor Release metalock Shorter queue Waiters state 24-Nov-18 © 2003 Herlihy & Shavit
Wait Acquire metalock Sets isWaitingForNotify field in execution environment Release metalock Wait for bit to be set Not a busy wait Can time out 24-Nov-18 © 2003 Herlihy & Shavit
Notify Acquire metalock Walk through queue Notify: wake first waiting thread NotifyAll: wake all waiting threads Release metalock
Locking…not so easy after all Principles Create lock only when needed Fast path vs slow path Optimize the common case