1
Spin Locks and Contention Management
Multiprocessor Synchronization Nir Shavit Spring 2003
2
Focus so far: Correctness
Models: accurate (we never lied to you) but idealized (so we forgot to mention a few things). Protocols: elegant, important, but naïve. 24-Nov-18 © Herlihy & Shavit
3
New Focus: Performance
Models: more complicated (not the same as complex!), still focused on principles (not soon obsolete). Protocols: elegant (in their fashion), important (why else would we pay attention), and realistic (your mileage may vary).
4
Kinds of Architectures
SISD (uniprocessor): single instruction stream, single data stream. SIMD (vector): single instruction, multiple data. MIMD (multiprocessors): multiple instruction, multiple data. This is our space.
5
MIMD Architectures
[Figure: shared-bus and distributed-memory organizations] Issues: memory contention, communication contention, communication latency.
6
Let's Look Again at Mutual Exclusion
This time using synchronization operations stronger than reads or writes. On the way we learn more about hardware and about memory contention, communication contention, and communication latency.
7
Real World: What should a thread do if it can’t get the lock?
Keep trying ("spin", "busy-wait"): good if delays are short. This is our focus. Give up the processor: good if delays are long, and always good on a uniprocessor.
8
Basic Spin-Lock
[Figure: threads (P2 … Pn) spin on a 0/1 lock word, enter the critical section (CS), and reset the lock upon exit] … the lock suffers from contention. Let's try to understand this phenomenon.
9
Review: Test-and-Set
public class RMW extends Register {
  int value;
  public synchronized int TAS() {
    int result = value;  // remember old value
    value = 1;           // new value is 1
    return result;       // return old value
  }
}
10
Test-and-Set
Atomically returns the previous value and sets the current value to 1 (an atomic swap). Use the write method to reset.
11
Test-and-Set Locks
Locking: the lock is free when the value is 0 and taken when the value is 1. Acquire the lock by calling TAS: if the result is 0, you win; if the result is 1, you lose. Release the lock by writing 0.
12
TASLock
public class TASLock implements Lock {
  TASRegister lock = new TASRegister(0);
  public void acquire(int i) {
    while (lock.TAS() == 1) {}  // keep trying until lock acquired
  }
  public void release(int i) {
    lock.write(0);              // simple write to release
  }
}
13
Performance Experiment
n threads increment a shared counter 1 million times. How long should it take? How long does it take?
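The experiment can be tried with a minimal Java sketch along these lines. It is an approximation, not the deck's code: `AtomicBoolean.getAndSet` stands in for the `TAS()` register method, and the names `TASSpinLock`, `TASCounterDemo`, and `run` are made up for illustration. Timing is left out so that the sketch stays deterministic.

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Test-and-set spin lock: false = free, true = taken.
class TASSpinLock {
    private final AtomicBoolean state = new AtomicBoolean(false);

    public void acquire() {
        // getAndSet(true) plays the role of TAS(): it returns the old value.
        while (state.getAndSet(true)) { /* lost: keep trying */ }
    }

    public void release() {
        state.set(false); // simple write to release
    }
}

// Hypothetical harness: n threads each increment a shared counter.
class TASCounterDemo {
    static int counter = 0; // protected by the lock

    static int run(int n, int itersPerThread) {
        counter = 0;
        TASSpinLock lock = new TASSpinLock();
        Thread[] workers = new Thread[n];
        for (int t = 0; t < n; t++) {
            workers[t] = new Thread(() -> {
                for (int k = 0; k < itersPerThread; k++) {
                    lock.acquire();
                    try { counter++; } finally { lock.release(); }
                }
            });
            workers[t].start();
        }
        for (Thread w : workers) {
            try { w.join(); } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        }
        return counter;
    }
}
```

Wrapping calls to `run` in `System.nanoTime()` measurements for increasing `n` would give a rough version of the time-versus-threads curve.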
14
Graph
[Plot: time vs. threads, ideal curve] Initial speedup: loop overhead in parallel. Work is independent of the number of threads.
15
Mystery #1
[Plot: time vs. threads, TAS lock compared with ideal] Houston, we have a problem …
16
Test-and-Test-and-Set Locks
Lurking stage: wait until the lock "looks" free, spinning while read returns 1 (lock taken). Pouncing stage: as soon as the lock "looks" available (read returns 0, lock free), call TAS to acquire the lock. If TAS loses, go back to lurking.
17
TTASLock
public class TTASLock implements Lock {
  TASRegister lock = new TASRegister(0);
  public void acquire(int i) {
    while (true) {
      while (lock.read() == 1) {}   // wait until lock looks free
      if (lock.TAS() == 0) return;  // then try to acquire lock
    }
  }
}
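A rough Java equivalent of the lurk-then-pounce loop is sketched below. As an assumption, an `AtomicBoolean` replaces the deck's `TASRegister`; `TTASSpinLock` and `isLocked` are illustrative names, not part of the original slides.

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Test-and-test-and-set lock: spin on plain reads, call TAS only when the lock looks free.
class TTASSpinLock {
    private final AtomicBoolean state = new AtomicBoolean(false);

    public void acquire() {
        while (true) {
            while (state.get()) { /* lurk: lock looks taken */ }
            if (!state.getAndSet(true)) return; // pounce: old value false means we won
        }
    }

    public void release() {
        state.set(false);
    }

    // Helper for inspection only; not needed by the algorithm.
    public boolean isLocked() {
        return state.get();
    }
}
```

The inner loop only reads, which is the behavior the later cache slides rely on: reads can be served from a local cached copy while the lock is held.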
18
Mystery #2
[Plot: time vs. threads for TAS lock, TTAS lock, and ideal]
19
Mystery
Both TAS and TTAS do the same thing (in our model). Except that TTAS performs much better than TAS, and neither approaches ideal.
20
Opinion
Our memory abstraction is broken. TAS & TTAS methods are provably the same (in our model), except they aren't (in field tests). Need a more detailed model …
21
Bus-Based Architectures
cache cache cache Bus memory 24-Nov-18 © Herlihy & Shavit
22
Bus-Based Architectures
Random access memory (10s of cycles) cache cache cache Bus memory 24-Nov-18 © Herlihy & Shavit
23
Bus-Based Architectures
Shared Bus broadcast medium One broadcaster at a time Processors and memory all “snoop” cache cache cache Bus memory 24-Nov-18 © Herlihy & Shavit
24
Bus-Based Architectures
Per-Processor Caches Small Fast: 1 or 2 cycles Address & state information Bus-Based Architectures cache cache cache Bus memory 24-Nov-18 © Herlihy & Shavit
25
Jargon Watch
Cache hit: "I found what I wanted in my cache." Good Thing™. Cache miss: "I had to shlep all the way to memory for that data." Bad Thing™.
26
Cave Canem This model is still a simplification
But not in any essential way Illustrates basic principles Will discuss complexities later 24-Nov-18 © Herlihy & Shavit
27
Processor Issues Load Request
Gimme data cache cache cache Bus Bus memory data 24-Nov-18 © Herlihy & Shavit (1)
28
Memory Responds memory cache cache cache I got data data data Bus Bus
24-Nov-18 © Herlihy & Shavit (3)
29
Processor Issues Load Request
Gimme data data cache cache Bus Bus memory data 24-Nov-18 © Herlihy & Shavit (2)
30
Other Processor Responds
I got data data data cache cache Bus Bus memory data 24-Nov-18 © Herlihy & Shavit (2)
31
Modify Cached Data memory data data data data cache Bus 24-Nov-18
© Herlihy & Shavit (1)
32
Modify Cached Data
[Figure: one cache has modified its copy of the data] What's up with the other copies?
33
Cache Coherence We have lots of copies of data
Original copy in memory Cached copies at processors Some processor modifies its own copy What do we do with the others? How to avoid confusion? Generic version is a fundamental problem™ in Computer Science 24-Nov-18 © Herlihy & Shavit
34
Fundamental Problem™ Managing replicated data
This is a fundamental problem™ in Computer Science Multiprocessor architecture Distributed file systems Distributed databases … 24-Nov-18 © Herlihy & Shavit
35
Write-Through Cache memory Listen to me! data data data data cache
Bus Bus memory data 24-Nov-18 © Herlihy & Shavit (5)
36
Write-Through Caches
Immediately broadcast changes. Good: memory and caches always agree; more read hits, maybe. Bad ("show stoppers"): bus traffic on all writes, yet most writes are to unshared data (for example, loop indexes …).
37
Write-Back Caches
Accumulate changes in cache. Write back when needed: when the cache is needed for something else, or when another processor wants the data. On first modification, invalidate other entries. Requires a non-trivial protocol …
38
Write-Back Caches
Cache entry has three states. Invalid: contains raw seething bits. Valid: I can read but I can't write. Dirty: data has been modified; intercept other load requests and write back to memory before using cache.
39
Invalidate memory Mine, all mine! Uh,oh cache data data cache data Bus
24-Nov-18 © Herlihy & Shavit (4)
40
Invalidate memory Other caches lose read permission
data cache Bus This cache acquires write permission memory data 24-Nov-18 © Herlihy & Shavit (2)
41
Invalidate Memory provides data only if not present in any cache, so no need to change it now (expensive) cache data cache Bus memory data 24-Nov-18 © Herlihy & Shavit (2)
42
Another Processor Asks for Data
cache data cache Bus Bus memory data 24-Nov-18 © Herlihy & Shavit (2)
43
Owner Responds memory Here it is! cache data data cache data Bus Bus
24-Nov-18 © Herlihy & Shavit (2)
44
End of the Day … memory Reading OK, no writing data data data cache
Bus memory data Reading OK, no writing 24-Nov-18 © Herlihy & Shavit (1)
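The invalidate and owner-responds walkthrough can be mimicked with a toy model of a single cache line. This is a sketch under loud assumptions: the class `ToyBus`, its fields, and the rule "one bus transaction per miss or permission upgrade" are illustrative simplifications, not a description of any real coherence protocol.

```java
import java.util.Arrays;

// Toy model: one shared line, one entry per cache, three states as on the slides.
class ToyBus {
    enum State { INVALID, VALID, DIRTY }

    final State[] line;      // state of the line in each cache
    final int[] copy;        // cached value in each cache
    int memory = 0;          // the original copy in memory
    int busTransactions = 0; // counts every use of the shared bus

    ToyBus(int caches) {
        line = new State[caches];
        Arrays.fill(line, State.INVALID);
        copy = new int[caches];
    }

    int read(int p) {
        if (line[p] == State.INVALID) {          // miss: go to the bus
            busTransactions++;
            for (int q = 0; q < line.length; q++) {
                if (line[q] == State.DIRTY) {    // owner intercepts and writes back
                    memory = copy[q];
                    line[q] = State.VALID;
                }
            }
            copy[p] = memory;
            line[p] = State.VALID;
        }
        return copy[p];                          // hit: no bus traffic
    }

    void write(int p, int value) {
        if (line[p] != State.DIRTY) {            // must acquire write permission
            busTransactions++;
            for (int q = 0; q < line.length; q++) {
                if (q == p) continue;
                if (line[q] == State.DIRTY) memory = copy[q];
                line[q] = State.INVALID;         // other caches lose read permission
            }
            line[p] = State.DIRTY;
        }
        copy[p] = value;                         // later writes by the owner stay local
    }
}
```

In this model, repeated reads of a valid line cost nothing, while every write by a non-owner invalidates all other copies, which is the distinction the TAS and TTAS slides build on.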
45
Bus-Based Architectures
data data cache Bus memory data 24-Nov-18 © Herlihy & Shavit
46
Mutual Exclusion What do we want to optimize?
Bus bandwidth used by spinning threads Release/Acquire latency Acquire latency for idle lock 24-Nov-18 © Herlihy & Shavit
47
Simple TAS
TAS invalidates cache lines. Spinners miss in cache and go to the bus. A thread that wants to release the lock is delayed behind the spinners.
48
Test-and-test-and-set
Wait until the lock "looks" free: spin on the local cache, with no bus use while the lock is busy. Problem: when the lock is released, there is an invalidation storm …
49
Local Spinning while Lock is Busy
memory busy 24-Nov-18 © Herlihy & Shavit
50
On Release memory free invalid invalid free Bus 24-Nov-18
© Herlihy & Shavit
51
On Release
[Figure: spinners' cached copies are invalid; memory shows the lock free] Everyone misses, rereads.
52
On Release memory Everyone tries TAS TAS(…) TAS(…) free invalid
Bus memory free 24-Nov-18 © Herlihy & Shavit (1)
53
Problems
Everyone misses: reads are satisfied sequentially. Everyone does TAS: this invalidates the others' caches. The system eventually reaches quiescence after the lock is acquired. How long does this take?
54
Measuring Quiescence Time
X = time of ops that don't use the bus. Y = time of ops that cause intensive bus traffic. [Figure: threads P1, P2, …, Pn spin on a 0/1 lock guarding the critical section (CS)] In the critical section, run ops X and then ops Y. As long as the quiescence time is less than X, there is no drop in performance. By gradually varying X, we can determine the exact time to quiesce.
55
Quiescence Time
[Plot: time vs. threads] Increases linearly with the number of processors for a bus architecture.
56
Mystery Explained time threads Ideal TAS lock TTAS lock 24-Nov-18
© Herlihy & Shavit
57
Solution: Introduce Delay
The place where the delay is inserted: after lock release, or after every lock reference. The way the size is set: 1. static, 2. dynamic. [Figure: timeline of spin-lock attempts separated by delays d, r1d, r2d]
58
Static Example: Slotted Delays
[Figure: timeline of spin-lock attempts in slots d, 2d, 3d] Split time into slots. Each process delays by the amount that will place it in a predetermined slot.
59
Dynamic Example: Exponential Backoff
[Figure: timeline of spin-lock attempts with delays d, 2d, 4d] If I fail to get the lock, wait a random duration before retrying. Each subsequent failure doubles the expected wait.
60
Delay Strategies
The place where the delay is inserted: after lock release, or after every lock reference. The way the size is set: 1. static, 2. dynamic. Usually most effective: exponential backoff.
61
Exponential Backoff Lock
public class Backoff implements Lock {
  public void acquire() {
    int delay = MIN_DELAY;
    while (true) {
      while (lock.read() == 1) {}
      if (lock.TAS() == 0) return;
      sleep(random() % delay);
      if (delay < MAX_DELAY) delay = 2 * delay;
    }
  }
}
62
Exponential Backoff Lock
public class Backoff implements Lock {
  public void acquire() {
    int delay = MIN_DELAY;  // fix minimum delay
    while (true) {
      while (lock.read() == 1) {}
      if (lock.TAS() == 0) return;
      sleep(random() % delay);
      if (delay < MAX_DELAY) delay = 2 * delay;
    }
  }
}
63
Exponential Backoff Lock
public class Backoff implements Lock {
  public void acquire() {
    int delay = MIN_DELAY;
    while (true) {
      while (lock.read() == 1) {}  // wait until lock looks free
      if (lock.TAS() == 0) return;
      sleep(random() % delay);
      if (delay < MAX_DELAY) delay = 2 * delay;
    }
  }
}
64
Exponential Backoff Lock
public class Backoff implements Lock {
  public void acquire() {
    int delay = MIN_DELAY;
    while (true) {
      while (lock.read() == 1) {}
      if (lock.TAS() == 0) return;  // if we win, return
      sleep(random() % delay);
      if (delay < MAX_DELAY) delay = 2 * delay;
    }
  }
}
65
Exponential Backoff Lock
public class Backoff implements Lock {
  public void acquire() {
    int delay = MIN_DELAY;
    while (true) {
      while (lock.read() == 1) {}
      if (lock.TAS() == 0) return;
      sleep(random() % delay);  // back off for random duration
      if (delay < MAX_DELAY) delay = 2 * delay;
    }
  }
}
66
Exponential Backoff Lock
public class Backoff implements Lock {
  public void acquire() {
    int delay = MIN_DELAY;
    while (true) {
      while (lock.read() == 1) {}
      if (lock.TAS() == 0) return;
      sleep(random() % delay);
      if (delay < MAX_DELAY) delay = 2 * delay;  // double max delay, within reason
    }
  }
}
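A runnable Java approximation of the backoff loop might look like the following. The concrete delay bounds, the use of `LockSupport.parkNanos` in place of `sleep`, and the names `ExpBackoffLock` and `BackoffCounterDemo` are all assumptions made for this sketch; the slides do not fix any of them.

```java
import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.locks.LockSupport;

class ExpBackoffLock {
    private static final long MIN_DELAY_NANOS = 1_000;     // assumed value
    private static final long MAX_DELAY_NANOS = 1_000_000; // assumed value
    private final AtomicBoolean state = new AtomicBoolean(false);

    public void acquire() {
        long limit = MIN_DELAY_NANOS;
        while (true) {
            while (state.get()) { /* wait until lock looks free */ }
            if (!state.getAndSet(true)) return;             // if we win, return
            // Back off for a random duration in [0, limit).
            LockSupport.parkNanos(ThreadLocalRandom.current().nextLong(limit));
            if (limit < MAX_DELAY_NANOS) limit = 2 * limit; // double the bound, within reason
        }
    }

    public void release() {
        state.set(false);
    }
}

// Hypothetical harness: n threads increment a shared counter under the lock.
class BackoffCounterDemo {
    static int counter = 0;

    static int run(int n, int itersPerThread) {
        counter = 0;
        ExpBackoffLock lock = new ExpBackoffLock();
        Thread[] workers = new Thread[n];
        for (int t = 0; t < n; t++) {
            workers[t] = new Thread(() -> {
                for (int k = 0; k < itersPerThread; k++) {
                    lock.acquire();
                    try { counter++; } finally { lock.release(); }
                }
            });
            workers[t].start();
        }
        for (Thread w : workers) {
            try { w.join(); } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        }
        return counter;
    }
}
```

Only a thread that lost a TAS backs off, so an uncontended acquire pays no delay at all.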
67
Spin-Waiting Overhead
[Plot: spin-waiting overhead for TTS and exponential backoff]
68
Can We Improve On This?
Optimize "slot" size before trying to enter the CS and avoid useless invalidations. How? By keeping a queue of threads. Each thread notifies the next in line without bothering the others.
69
Anderson's Queue Lock
class ALock implements Lock {
  int[] flags = {ENTER, WAIT, …, WAIT};  // n slots
  int[] myslot = {0, …, 0};              // one entry per thread
  RMW tail = new RMW(0);
  // acquire() and release() follow on the next slide
}
70
Anderson's Queue Lock
public void acquire() {
  myslot[i] = tail.fetchInc();             // thread i gets slot
  while (flags[myslot[i] % n] == WAIT) {}  // wait while busy
  flags[myslot[i] % n] = WAIT;             // reset slot for next round
}
public void release() {
  flags[(myslot[i] + 1) % n] = ENTER;      // release next thread
}
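The array-based queue lock can be sketched in Java roughly as follows. Several choices here are assumptions rather than the deck's code: a `ThreadLocal` replaces the per-thread `myslot[i]` array, `AtomicIntegerArray` provides the needed visibility for the flags, and `Thread.yield()` is added in the spin loop so the sketch still makes progress when threads outnumber cores. The lock is only correct while at most `capacity` threads contend and the tail counter has not overflowed.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicIntegerArray;

class ArrayQueueLock {
    private static final int WAIT = 0, ENTER = 1;
    private final int n;
    private final AtomicIntegerArray flags;
    private final AtomicInteger tail = new AtomicInteger(0);
    private final ThreadLocal<Integer> mySlot = ThreadLocal.withInitial(() -> 0);

    ArrayQueueLock(int capacity) {
        n = capacity;
        flags = new AtomicIntegerArray(n); // all slots start as WAIT (0)
        flags.set(0, ENTER);               // the first arrival may enter
    }

    public void acquire() {
        int slot = tail.getAndIncrement() % n; // this thread gets a slot
        mySlot.set(slot);
        while (flags.get(slot) == WAIT) { Thread.yield(); } // wait while busy
        flags.set(slot, WAIT);                 // reset slot for next round
    }

    public void release() {
        flags.set((mySlot.get() + 1) % n, ENTER); // release next thread
    }
}

// Hypothetical harness: one slot per thread.
class ArrayQueueLockDemo {
    static int counter = 0;

    static int run(int n, int itersPerThread) {
        counter = 0;
        ArrayQueueLock lock = new ArrayQueueLock(n);
        Thread[] workers = new Thread[n];
        for (int t = 0; t < n; t++) {
            workers[t] = new Thread(() -> {
                for (int k = 0; k < itersPerThread; k++) {
                    lock.acquire();
                    try { counter++; } finally { lock.release(); }
                }
            });
            workers[t].start();
        }
        for (Thread w : workers) {
            try { w.join(); } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        }
        return counter;
    }
}
```

Each waiter spins on its own slot, so a release changes only the one flag that the next thread in line is watching.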
71
Performance
[Plot: TTS lock vs. queue lock] The queue lock curve is practically flat: scalable performance.
72
Observations
Need to allocate a size-n array no matter how many threads actually access the lock. Cache line granularity? Read the whole array into cache? Can we do better?
73
CLH Lock
FIFO order. Small, constant-size overhead per thread.
74
CLH Queue Lock
[Figure: queue of nodes; the thread at the head is in the critical section] Swap into the queue tail, then wait for a "0" in the predecessor's node.
75
CLH Queue Lock
class Qnode {
  boolean locked = true;  // I have not released yet
}
76
CLH Queue Lock
class CLHLock implements Lock {
  RMW queue;
  public void acquire(Qnode mynode) {
    /** mynode.locked = true; */
    Qnode pred = queue.swap(mynode);  // get pred and point tail to me
    while (pred.locked) {}            // wait until unlocked
  }
}
77
CLH Queue Lock
class CLHLock implements Lock {
  RMW queue;
  public void release(Qnode mynode) {
    mynode.locked = false;  // notify successor
  }
}
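A self-contained Java sketch of the CLH idea follows. It departs from the slides in ways that should be read as assumptions: nodes are kept in `ThreadLocal`s instead of being passed as parameters, the tail starts at an unlocked dummy node, each thread recycles its predecessor's node after release (a common textbook variant), and `Thread.yield()` is added to the spin loop for oversubscribed machines.

```java
import java.util.concurrent.atomic.AtomicReference;

class CLHQueueLock {
    static final class QNode {
        volatile boolean locked; // true while the owner holds or wants the lock
    }

    // Tail starts at a dummy node whose locked flag is false.
    private final AtomicReference<QNode> tail = new AtomicReference<>(new QNode());
    private final ThreadLocal<QNode> myNode = ThreadLocal.withInitial(QNode::new);
    private final ThreadLocal<QNode> myPred = new ThreadLocal<>();

    public void acquire() {
        QNode node = myNode.get();
        node.locked = true;                     // I have not released yet
        QNode pred = tail.getAndSet(node);      // get pred and point tail to me
        myPred.set(pred);
        while (pred.locked) { Thread.yield(); } // wait until predecessor unlocks
    }

    public void release() {
        myNode.get().locked = false;            // notify successor
        myNode.set(myPred.get());               // reuse predecessor's node next time
    }
}

// Hypothetical harness: n threads increment a shared counter under the lock.
class CLHCounterDemo {
    static int counter = 0;

    static int run(int n, int itersPerThread) {
        counter = 0;
        CLHQueueLock lock = new CLHQueueLock();
        Thread[] workers = new Thread[n];
        for (int t = 0; t < n; t++) {
            workers[t] = new Thread(() -> {
                for (int k = 0; k < itersPerThread; k++) {
                    lock.acquire();
                    try { counter++; } finally { lock.release(); }
                }
            });
            workers[t].start();
        }
        for (Thread w : workers) {
            try { w.join(); } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        }
        return counter;
    }
}
```

Recycling the predecessor's node matters because a successor may still be spinning on the node just released, so the releasing thread cannot safely reuse its own.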
78
Initially acquire idle queue false 24-Nov-18 © Herlihy & Shavit
79
Purple Acquires Lock acquire idle queue false true 24-Nov-18
© Herlihy & Shavit
80
Red Wants Lock acquire want Enter CS queue true false true 24-Nov-18
© Herlihy & Shavit
81
NUMA Machines Distributed Shared Memory Machine
Non-Uniform Memory Access (NUMA) Shared local memory is fast Shared remote memory is slow 24-Nov-18 © Herlihy & Shavit
82
MCS Lock
On a NUMA machine without caches, CLH is problematic because it spins on a remote location. MCS queue lock: FIFO order; small, constant-size overhead; local spinning!
83
MCS Queue Lock 1 Critical Section
1 1 Swap into tail of list, wait for a “0” in local node 24-Nov-18 © Herlihy & Shavit
84
MCS Queue Lock
class Qnode {
  boolean locked = false;
  Qnode next = null;
}
85
MCS Queue Lock
class MCSLock implements Lock {
  RMW queue;
  public void acquire(Qnode mynode) {
    Qnode pred = queue.swap(mynode);  // point to my qnode
    if (pred != null) {
      mynode.locked = true;
      pred.next = mynode;             // point pred to my node
      while (mynode.locked) {}        // wait until unlocked
    }
  }
}
86
Purple Acquires Lock locked idle false 24-Nov-18
© Herlihy & Shavit
87
Red Wants Lock locked allocate qnode true false 24-Nov-18
© Herlihy & Shavit
88
Red Wants Lock locked spinning true false 24-Nov-18
© Herlihy & Shavit
89
MCS Queue Lock
class MCSLock implements Lock {
  RMW queue;
  public void release(Qnode mynode) {
    if (mynode.next == null) {              // no successor?
      if (queue.CAS(mynode, null)) return;
      while (mynode.next == null) {}        // wait for successor
    }
    mynode.next.locked = false;             // notify successor
  }
}
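The acquire and release halves fit together as in the Java sketch below. As assumptions, `AtomicReference` stands in for the `RMW` queue (`getAndSet` for `swap`, `compareAndSet` for `CAS`), nodes live in a `ThreadLocal`, `next` is cleared after release so the node can be reused, and `Thread.yield()` is added to the spin loops.

```java
import java.util.concurrent.atomic.AtomicReference;

class MCSQueueLock {
    static final class QNode {
        volatile boolean locked;
        volatile QNode next;
    }

    private final AtomicReference<QNode> tail = new AtomicReference<>(null);
    private final ThreadLocal<QNode> myNode = ThreadLocal.withInitial(QNode::new);

    public void acquire() {
        QNode node = myNode.get();
        QNode pred = tail.getAndSet(node);          // point tail to my node
        if (pred != null) {
            node.locked = true;
            pred.next = node;                       // point pred to my node
            while (node.locked) { Thread.yield(); } // spin on my own node
        }
    }

    public void release() {
        QNode node = myNode.get();
        if (node.next == null) {                    // no successor?
            if (tail.compareAndSet(node, null)) return;
            // Another thread swapped itself in but has not linked yet.
            while (node.next == null) { Thread.yield(); }
        }
        node.next.locked = false;                   // notify successor
        node.next = null;                           // make the node reusable
    }
}

// Hypothetical harness: n threads increment a shared counter under the lock.
class MCSCounterDemo {
    static int counter = 0;

    static int run(int n, int itersPerThread) {
        counter = 0;
        MCSQueueLock lock = new MCSQueueLock();
        Thread[] workers = new Thread[n];
        for (int t = 0; t < n; t++) {
            workers[t] = new Thread(() -> {
                for (int k = 0; k < itersPerThread; k++) {
                    lock.acquire();
                    try { counter++; } finally { lock.release(); }
                }
            });
            workers[t].start();
        }
        for (Thread w : workers) {
            try { w.join(); } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        }
        return counter;
    }
}
```

Unlike the CLH sketch, each waiter here spins on a field of its own node, which is the local-spinning property the NUMA slide asks for.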
90
Purple Release
[Figure: purple is releasing while another thread performs its swap] By looking at the queue, I see another thread is active. I have to wait for that thread to finish.
91
Purple Release releasing spinning Enter CS false true false 24-Nov-18
© Herlihy & Shavit
92
Performance Test&Test&set A-Lock Exp-backoff MCS
NUMA No Coherence (c. 1982) 24-Nov-18 © Herlihy & Shavit
93
How Different are Modern Machines
[Plot: TAS with backoff vs. MCS at 16, 32, and 48 processors] Sun Wildfire (c. 1998) experiments courtesy of Scott
94
Contention Eliminated
We reduced contention by slotting thread access to a lock over time. We saw that queue locks provide very tight slotting and limit invalidation traffic, thus lowering contention with minimal latency.
95
Java Synchronization
synchronized (exp) {… actions …}: lock an object. wait: release lock and suspend thread. notify, notifyAll: wake one or all threads to resume execution where they were suspended.
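These three primitives are usually seen together. The sketch below is a made-up example (the `OneSlotMailbox` class and its demo are not from the slides): a one-slot mailbox in which `wait` releases the object's lock while the caller is suspended and `notifyAll` wakes the waiters so that they recheck their condition.

```java
// One-slot mailbox guarded by the object's own monitor lock.
class OneSlotMailbox {
    private Integer slot = null; // null means empty

    public synchronized void put(int value) throws InterruptedException {
        while (slot != null) wait(); // release lock and suspend until emptied
        slot = value;
        notifyAll();                 // wake any thread waiting in take()
    }

    public synchronized int take() throws InterruptedException {
        while (slot == null) wait(); // release lock and suspend until filled
        int value = slot;
        slot = null;
        notifyAll();                 // wake any thread waiting in put()
        return value;
    }
}

// Hypothetical harness: a producer sends 1..count, the caller sums what it receives.
class MailboxDemo {
    static int run(int count) {
        OneSlotMailbox box = new OneSlotMailbox();
        Thread producer = new Thread(() -> {
            try {
                for (int i = 1; i <= count; i++) box.put(i);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        producer.start();
        int sum = 0;
        try {
            for (int i = 0; i < count; i++) sum += box.take();
            producer.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return sum;
    }
}
```

The `while` loops around `wait` matter: a woken thread resumes where it was suspended and must recheck the condition before proceeding.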
96
Locks in Java
Frequent: benchmark shows 765,000/second. Ubiquitous: every object has a (potential) lock. Space overhead? Potentially huge, actually small (6% in Javac).
97
Paradox?
Frequency requires time efficiency. Ubiquity requires space efficiency.
98
Solution
Create lock only when needed. Fast path for common case. The meta lock: 2 bits in header, local spinning only.
99
Java Synchronization
Java is compiled to byte code. Must respect block structure. Must deal with exceptions. Nested locks are OK, so locks need to count.
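The "locks need to count" point can be observed directly. The sketch below uses `java.util.concurrent.locks.ReentrantLock` as a stand-in for a monitor lock, which is an assumption made purely because it exposes its hold count through `getHoldCount()`; `NestedLockDemo` and its methods are illustrative names.

```java
import java.util.concurrent.locks.ReentrantLock;

class NestedLockDemo {
    // Acquire the same lock `depth` times from one thread and report
    // the hold count observed at the innermost level.
    static int holdCountAtDepth(int depth) {
        ReentrantLock lock = new ReentrantLock();
        int seen = descend(lock, depth);
        if (lock.isLocked()) throw new IllegalStateException("lock should be fully released");
        return seen;
    }

    private static int descend(ReentrantLock lock, int remaining) {
        lock.lock();                       // nested acquire by the owner succeeds
        try {
            if (remaining == 1) return lock.getHoldCount();
            return descend(lock, remaining - 1);
        } finally {
            lock.unlock();                 // released even if an exception propagates
        }
    }

    // The same nesting with monitor locks: re-entering does not deadlock.
    static int nestedSynchronized(Object monitor) {
        synchronized (monitor) {
            synchronized (monitor) {
                return 2;                  // reached only if the inner acquire succeeded
            }
        }
    }
}
```

The `try`/`finally` pattern mirrors what the slide asks of compiled `synchronized` blocks: the lock is released along every exit path, including exceptional ones.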
100
Jargon Watch
Monitor lock: protects object. Meta lock: protects monitor lock. Modus operandi: acquire meta lock, manipulate monitor lock, release meta lock.
101
Java Objects Class pointer Object header Multi-use word
User-defined fields 24-Nov-18 © Herlihy & Shavit
102
Meta-Lock meta lock other stuff 2 bits 30 bits Multi-use word
24-Nov-18 © Herlihy & Shavit
103
Meta-Lock
2 bits = 4 states: Neutral, Locked, Waiters, Busy.
104
Neutral State
[Figure: multi-use word holds hash code and age] Usual state: nothing happening.
105
pointer to lock records
Locked State lock record pointer to lock records 1 Object is monitor-locked 24-Nov-18 © Herlihy & Shavit
106
Lock Record Owner thread Lock count Hash and age (displaced)
Next lock record in queue Free list for unused records lock record pointer to lock record 1 Object is monitor-locked 24-Nov-18 © Herlihy & Shavit
107
Waiters State pointer to lock records 1
Monitor lock released, but other threads waiting to get in 24-Nov-18 © Herlihy & Shavit
108
Busy State environment pointer 1 1 Metalock is locked environment
24-Nov-18 © Herlihy & Shavit
109
Acquire Meta-Lock
BitField getMetaLock(ExecEnv *ee, Object *obj) {
  BitField busyBits = ee | BUSY;                              // prepare new value
  BitField lockBits = SWAP(busyBits, multiUseWordAddr(obj));  // swap it in
  if (getLockState(lockBits) != BUSY)
    return lockBits;                                          // return if not already locked
  else
    return getMetaLockSlow(ee, lockBits);                     // otherwise take slow path
}
110
Slow Path Acquire
The first thread knows it's first: it didn't see BUSY bits. Later threads know their predecessor from the result of SWAP.
111
Release Meta-Lock
void releaseMetaLock(ExecEnv *ee, Object *obj, BitField releaseBits) {
  BitField busyBits = ee | BUSY;                                          // value we expect
  BitField lockBits = CAS(releaseBits, busyBits, multiUseWordAddr(obj));  // try to replace it (CAS in C returns old value)
  if (lockBits != busyBits)
    releaseMetaLockSlow(ee, lockBits);                                    // take slow path if unsuccessful
}
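The two fast paths can be modelled in Java with a single atomic word. Everything concrete here is an assumption for illustration: the slides name the four states but not their bit patterns, so the encodings below are invented, a small integer `envId` stands in for the execution-environment pointer, and the slow paths are omitted entirely.

```java
import java.util.concurrent.atomic.AtomicInteger;

class MetaLockWord {
    // Assumed encodings of the 2-bit state field.
    static final int NEUTRAL = 0, LOCKED = 1, WAITERS = 2, BUSY = 3;
    static final int STATE_MASK = 0b11;

    private final AtomicInteger word; // 30 bits of payload, 2 bits of state

    MetaLockWord(int payload, int state) {
        word = new AtomicInteger(pack(payload, state));
    }

    static int pack(int payload, int state) { return (payload << 2) | (state & STATE_MASK); }
    static int stateOf(int w) { return w & STATE_MASK; }
    static int payloadOf(int w) { return w >>> 2; }

    // Fast-path acquire: swap in (envId, BUSY) and hand back the old word.
    // If the old state is not BUSY the caller holds the meta lock; otherwise
    // the old payload identifies the predecessor.
    int swapInBusy(int envId) {
        return word.getAndSet(pack(envId, BUSY));
    }

    // Fast-path release: succeeds only if the word still holds our own busy
    // bits, meaning nobody queued up behind us.
    boolean tryRelease(int envId, int releaseWord) {
        return word.compareAndSet(pack(envId, BUSY), releaseWord);
    }
}
```

A failed `tryRelease` corresponds to the case where the real code would take the release slow path and hand the meta lock to the waiting thread.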
112
Release Slow Path
Hand off the metalock to the next waiting thread. Synchronize via the successor's environment structure …
113
Locking Objects
Common cases: neutral, waiters, recursively locked. No thread interaction needed.
114
Locking Objects
Mutex object (lock): the thread suspends on a condition variable, releasing the processor until the condition is signalled. Not a spin lock. When it awakes, it takes the slow path. Locked: go back to sleep. Unlocked: update object and go for it.
115
Unlocking Objects
Common cases: recursive lock, no other threads. No thread interactions needed.
116
Unlocking Object
Obtain metalock. Remove own lock record. Wake up successor. Release metalock. Result: shorter queue, Waiters state.
117
Wait
Acquire metalock. Set the isWaitingForNotify field in the execution environment. Release metalock. Wait for the bit to be set: not a busy wait, and it can time out.
118
Notify
Acquire metalock. Walk through the queue. Notify: wake the first waiting thread. NotifyAll: wake all waiting threads. Release metalock.
119
Principles
Create lock only when needed. Fast path vs. slow path. Optimize the common case. Locking … not so easy after all.