Spin Locks and Contention Management


1 Spin Locks and Contention Management
Multiprocessor Synchronization Nir Shavit Spring 2003

2 Focus so far: Correctness
Models: accurate (we never lied to you) but idealized (so we forgot to mention a few things). Protocols: elegant, important, but naïve. 24-Nov-18 © Herlihy & Shavit

3 New Focus: Performance
Models: more complicated (not the same as complex!); still focus on principles (not soon obsolete). Protocols: elegant (in their fashion), important (why else would we pay attention), and realistic (your mileage may vary).

4 Kinds of Architectures
SISD (uniprocessor): single instruction stream, single data stream. SIMD (vector): single instruction, multiple data. MIMD (multiprocessors): multiple instruction, multiple data. MIMD is our space.

5 MIMD Architectures
[Diagram: shared-bus and distributed architectures.] Issues: memory contention, communication contention, communication latency.

6 Let’s Look Again at Mutual Exclusion
This time using synchronization operations stronger than reads or writes. On the way, learn more about hardware and about memory contention, communication contention, and communication latency.

7 Real World: What should a thread do if it can’t get the lock?
Keep trying (“spin”, “busy-wait”): good if delays are short; this is our focus. Give up the processor: good if delays are long; always good on a uniprocessor.

8 Basic Spin-Lock
[Diagram: threads (P2 … Pn) spin on a 0/1 lock word, enter the critical section (CS), and reset the lock upon exit.] … the lock suffers from contention. Let’s try to understand this phenomenon.

9 Review: Test-and-Set
public class RMW extends Register {
  int value;
  public synchronized int TAS() {
    int result = value;   // remember old value
    value = 1;            // new value is 1
    return result;        // return old value
  }
}

10 Test-and-Set
Atomically returns the previous value and sets the current value to 1. This is an atomic swap. Use the write method to reset.

11 Test-and-Set Locks
Locking: the lock is free when the value is 0 and taken when the value is 1. Acquire the lock by calling TAS: if the result is 0, you win; if the result is 1, you lose. Release the lock by writing 0.

12 TASLock
public class TASLock implements Lock {
  TASRegister lock = new TASRegister(0);
  public void acquire(int i) {
    while (lock.TAS() == 1) {}   // keep trying until lock acquired
  }
  public void release(int i) {
    lock.write(0);               // simple write to release
  }
}
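
The TASLock slide can be tried out on a real JVM with a minimal sketch like the one below. It assumes `AtomicBoolean.getAndSet(true)` as a stand-in for the slide's `TAS()`; the names `TasLock` and `TasLockDemo` are hypothetical, and the harness mirrors the shared-counter experiment rather than reproducing the authors' benchmark.

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Test-and-set spin lock; AtomicBoolean.getAndSet(true) stands in for TAS().
class TasLock {
    private final AtomicBoolean taken = new AtomicBoolean(false); // false = free

    public void acquire() {
        // getAndSet returns the old value: true means another thread holds the lock.
        while (taken.getAndSet(true)) { /* keep trying */ }
    }

    public void release() {
        taken.set(false); // simple write to release
    }
}

// Hypothetical harness: several threads increment one counter under the lock.
class TasLockDemo {
    private static int counter;

    static int run(int threads, int perThread) {
        counter = 0;
        TasLock lock = new TasLock();
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            workers[t] = new Thread(() -> {
                for (int k = 0; k < perThread; k++) {
                    lock.acquire();
                    try { counter++; } finally { lock.release(); }
                }
            });
            workers[t].start();
        }
        for (Thread w : workers) {
            try { w.join(); } catch (InterruptedException e) { throw new IllegalStateException(e); }
        }
        return counter;
    }

    public static void main(String[] args) {
        // Equals threads * perThread when mutual exclusion holds.
        System.out.println(run(4, 100_000));
    }
}
```

Timing `run` for growing thread counts is one way to reproduce the shape of the experiment described on the next slides.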

13 Performance Experiment
n threads increment a shared counter 1 million times. How long should it take? How long does it take?

14 Graph
[Graph: time vs. threads, ideal curve.] Initial speedup: loop overhead in parallel. Work is independent of the number of threads.

15 Mystery #1
[Graph: time vs. threads, TAS lock against ideal.] Houston, we have a problem …

16 Test-and-Test-and-Set Locks
Lurking stage: wait until the lock “looks” free; spin while read returns 1 (lock taken). Pouncing stage: as soon as the lock “looks” available (read returns 0, lock free), call TAS to acquire the lock. If TAS loses, go back to lurking.

17 TTASLock
public class TTASLock implements Lock {
  TASRegister lock = new TASRegister(0);
  public void acquire(int i) {
    while (true) {
      while (lock.read() == 1) {}    // wait until lock looks free
      if (lock.TAS() == 0) return;   // then try to acquire lock
    }
  }
}
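
The lurk-then-pounce loop of the TTASLock slide could look as follows in runnable Java. This is a sketch under the assumption that `AtomicBoolean.get()` plays the role of `read()` and `getAndSet(true)` the role of `TAS()`; `TtasLock` and `TtasLockDemo` are hypothetical names.

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Test-and-test-and-set spin lock.
class TtasLock {
    private final AtomicBoolean taken = new AtomicBoolean(false); // false = free

    public void acquire() {
        while (true) {
            while (taken.get()) { /* lurk: spin on reads only */ }
            if (!taken.getAndSet(true)) return; // pounce: TAS won
            // TAS lost: go back to lurking
        }
    }

    public void release() {
        taken.set(false);
    }
}

// Hypothetical harness: several threads increment one counter under the lock.
class TtasLockDemo {
    private static int counter;

    static int run(int threads, int perThread) {
        counter = 0;
        TtasLock lock = new TtasLock();
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            workers[t] = new Thread(() -> {
                for (int k = 0; k < perThread; k++) {
                    lock.acquire();
                    try { counter++; } finally { lock.release(); }
                }
            });
            workers[t].start();
        }
        for (Thread w : workers) {
            try { w.join(); } catch (InterruptedException e) { throw new IllegalStateException(e); }
        }
        return counter;
    }

    public static void main(String[] args) {
        System.out.println(run(4, 100_000));
    }
}
```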

18 Mystery #2
[Graph: time vs. threads for ideal, TAS lock, and TTAS lock.]

19 Mystery
Both TAS and TTAS do the same thing (in our model), except that TTAS performs much better than TAS. Neither approaches ideal.

20 Opinion
Our memory abstraction is broken. TAS & TTAS methods are provably the same (in our model), except they aren’t (in field tests). Need a more detailed model …

21 Bus-Based Architectures
[Diagram: caches connected by a shared bus to memory.]

22 Bus-Based Architectures
Random access memory: 10s of cycles.

23 Bus-Based Architectures
Shared bus: a broadcast medium with one broadcaster at a time. Processors and memory all “snoop”.

24 Bus-Based Architectures
Per-processor caches: small and fast (1 or 2 cycles), holding address & state information.

25 Jargon Watch
Cache hit: “I found what I wanted in my cache.” Good Thing™. Cache miss: “I had to shlep all the way to memory for that data.” Bad Thing™.

26 Cave Canem
This model is still a simplification, but not in any essential way. It illustrates basic principles. Will discuss complexities later.

27 Processor Issues Load Request
[Diagram: a processor puts “Gimme data” on the bus; memory holds the data.]

28 Memory Responds
[Diagram: memory answers “I got data” and the data travels over the bus to the cache.]

29 Processor Issues Load Request
[Diagram: another processor puts “Gimme data” on the bus; one cache already holds the data.]

30 Other Processor Responds
[Diagram: the cache holding the data answers “I got data” and supplies it over the bus.]

31 Modify Cached Data
[Diagram: caches and memory hold copies of the data; one cached copy is modified.]

32 Modify Cached Data
What’s up with the other copies?

33 Cache Coherence
We have lots of copies of data: the original copy in memory and cached copies at processors. Some processor modifies its own copy. What do we do with the others? How to avoid confusion? The generic version is a Fundamental Problem™ in Computer Science.

34 Fundamental Problem™
Managing replicated data is a Fundamental Problem™ in Computer Science: multiprocessor architecture, distributed file systems, distributed databases, …

35 Write-Through Cache
[Diagram: the writing cache announces “Listen to me!” and the new data goes over the bus to memory.]

36 Write-Through Caches
Immediately broadcast changes. Good: memory and caches always agree; more read hits, maybe. Bad: bus traffic on all writes, yet most writes are to unshared data (for example, loop indexes). These are “show stoppers”.

37 Write-Back Caches
Accumulate changes in the cache. Write back when needed: when the cache is needed for something else, or when another processor wants the data. On first modification, invalidate other entries. Requires a non-trivial protocol …

38 Write-Back Caches
A cache entry has three states. Invalid: contains raw seething bits. Valid: I can read but I can’t write. Dirty: data has been modified; intercept other load requests and write back to memory before using the cache.
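
The three states of a write-back cache entry can be read as a small transition table. The sketch below is one possible reading of the slides, not a definitive protocol: the event names (`localRead`, `localWrite`, `remoteRead`, `remoteWrite`) and the class `CacheLineModel` are assumptions introduced for illustration, and real coherence protocols carry more states and bus messages.

```java
// States of one cache entry, as named on the write-back slide.
enum LineState { INVALID, VALID, DIRTY }

// Hypothetical transition table for a single entry in one cache.
final class CacheLineModel {
    private CacheLineModel() {}

    // This processor reads: a miss fetches the data, a hit changes nothing.
    static LineState localRead(LineState s) {
        return s == LineState.INVALID ? LineState.VALID : s;
    }

    // This processor writes: it acquires write permission (other copies are invalidated).
    static LineState localWrite(LineState s) {
        return LineState.DIRTY;
    }

    // Another processor's load appears on the bus: a dirty owner supplies the
    // data and keeps a read-only copy; other states are unaffected.
    static LineState remoteRead(LineState s) {
        return s == LineState.DIRTY ? LineState.VALID : s;
    }

    // Another processor's invalidation appears on the bus: this copy is lost.
    static LineState remoteWrite(LineState s) {
        return LineState.INVALID;
    }

    public static void main(String[] args) {
        LineState s = LineState.INVALID;
        s = localRead(s);   System.out.println(s);
        s = localWrite(s);  System.out.println(s);
        s = remoteRead(s);  System.out.println(s);
        s = remoteWrite(s); System.out.println(s);
    }
}
```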

39 Invalidate
[Diagram: the writing cache announces “Mine, all mine!” on the bus; another cache holding the data reacts “Uh, oh”.]

40 Invalidate
Other caches lose read permission. This cache acquires write permission.

41 Invalidate
Memory provides data only if it is not present in any cache, so there is no need to change memory now (expensive).

42 Another Processor Asks for Data
[Diagram: another processor’s load request goes out on the bus.]

43 Owner Responds
[Diagram: the owning cache answers “Here it is!” and supplies the data over the bus.]

44 End of the Day …
Reading OK, no writing.

45 Bus-Based Architectures
[Diagram: caches holding the data, bus, and memory.]

46 Mutual Exclusion
What do we want to optimize? Bus bandwidth used by spinning threads; release/acquire latency; acquire latency for an idle lock.

47 Simple TAS
TAS invalidates cache lines. Spinners miss in cache and go to the bus. A thread that wants to release the lock is delayed behind the spinners.

48 Test-and-Test-and-Set
Wait until the lock “looks” free: spin on the local cache, with no bus use while the lock is busy. Problem: when the lock is released, there is an invalidation storm …

49 Local Spinning while Lock is Busy
[Diagram: spinners read “busy” locally; memory holds “busy”.]

50 On Release
[Diagram: the lock becomes “free”; the other cached copies become invalid.]

51 On Release
Everyone misses and rereads.

52 On Release
Everyone tries TAS.

53 Problems
Everyone misses: reads are satisfied sequentially. Everyone does TAS: this invalidates others’ caches. Eventually reaches quiescence after the lock is acquired. How long does this take?

54 Measuring Quiescence Time
X = time of ops that don’t use the bus. Y = time of ops that cause intensive bus traffic. In the critical section, run ops X then ops Y. As long as the quiescence time is less than X, there is no drop in performance. By gradually varying X, we can determine the exact time to quiesce. [Diagram: P1, P2, …, Pn spin on a 0/1 lock guarding the critical section (CS).]

55 Quiescence Time
Increases linearly with the number of processors for a bus architecture. [Graph: time vs. threads.]

56 Mystery Explained
[Graph: time vs. threads for ideal, TAS lock, and TTAS lock.]

57 Solution: Introduce Delay
The place where the delay is inserted: after lock release, or after every lock reference. The way the size is set: (1) static, (2) dynamic. [Timeline: spinners retry the lock after delays d, r1d, r2d.]

58 Static Example: Slotted Delays
Split time into slots. Each process delays by the amount that will place it in a predetermined slot. [Timeline: delays d, 2d, 3d.]

59 Dynamic Example: Exponential Backoff
If I fail to get the lock, wait a random duration before retrying. Each subsequent failure doubles the expected wait. [Timeline: delays d, 2d, 4d.]

60 Delay Strategies
The place where the delay is inserted: after lock release, or after every lock reference. The way the size is set: (1) static, (2) dynamic. Usually most effective: exponential backoff.

61-66 Exponential Backoff Lock
public class Backoff implements Lock {
  public void acquire() {
    int delay = MIN_DELAY;                        // fix minimum delay
    while (true) {
      while (lock.read() == 1) {}                 // wait until lock looks free
      if (lock.TAS() == 0) return;                // if we win, return
      sleep(random() % delay);                    // back off for random duration
      if (delay < MAX_DELAY) delay = 2 * delay;   // double max delay, within reason
    }
  }
}
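
A runnable approximation of the exponential backoff lock is sketched below. The delay bounds, the use of `LockSupport.parkNanos` for the pause, and the names `BackoffLock` and `BackoffLockDemo` are all assumptions made for illustration; suitable bounds depend on the machine.

```java
import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.locks.LockSupport;

// TTAS lock with randomized exponential backoff after a lost TAS.
class BackoffLock {
    private static final long MIN_DELAY_NANOS = 1_000;      // assumed value
    private static final long MAX_DELAY_NANOS = 1_000_000;  // assumed value
    private final AtomicBoolean taken = new AtomicBoolean(false);

    public void acquire() {
        long limit = MIN_DELAY_NANOS;                        // fix minimum delay
        while (true) {
            while (taken.get()) { /* wait until lock looks free */ }
            if (!taken.getAndSet(true)) return;              // if we win, return
            // Back off for a random duration in [1, limit].
            long pause = ThreadLocalRandom.current().nextLong(limit) + 1;
            LockSupport.parkNanos(pause);
            if (limit < MAX_DELAY_NANOS) limit *= 2;         // double max delay, within reason
        }
    }

    public void release() {
        taken.set(false);
    }
}

// Hypothetical harness: several threads increment one counter under the lock.
class BackoffLockDemo {
    private static int counter;

    static int run(int threads, int perThread) {
        counter = 0;
        BackoffLock lock = new BackoffLock();
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            workers[t] = new Thread(() -> {
                for (int k = 0; k < perThread; k++) {
                    lock.acquire();
                    try { counter++; } finally { lock.release(); }
                }
            });
            workers[t].start();
        }
        for (Thread w : workers) {
            try { w.join(); } catch (InterruptedException e) { throw new IllegalStateException(e); }
        }
        return counter;
    }

    public static void main(String[] args) {
        System.out.println(run(4, 100_000));
    }
}
```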

67 Spin-Waiting Overhead
[Graph: TTS vs. exp-backoff.]

68 Can We Improve On This?
Optimize the “slot” size before trying to enter the CS, and avoid useless invalidations. How? By keeping a queue of threads: each thread notifies the next in line without bothering the others.

69 Anderson’s Queue Lock
class ALock implements Lock {
  int[] flags = {ENTER, WAIT, …, WAIT};   // n entries
  int[] myslot = {0, …, 0};               // n entries
  RMW tail = new RMW(0);

70 Anderson’s Queue Lock
  public void acquire() {
    myslot[i] = tail.fetchInc();               // thread i gets slot
    while (flags[myslot[i] % n] == WAIT) {}    // wait while busy
    flags[myslot[i] % n] = WAIT;               // reset slot for next round
  }
  public void release() {
    flags[(myslot[i] + 1) % n] = ENTER;        // release next thread
  }
}
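
Anderson's array-based queue lock might be written in runnable Java along these lines. This is a sketch with assumptions: `AtomicIntegerArray` stands in for the flags array, a `ThreadLocal` replaces the per-thread `myslot[i]` entry, and `Thread.yield()` in the spin loop is a courtesy so the demo finishes on machines with fewer cores than threads. `ArrayQueueLock` and `ArrayQueueLockDemo` are hypothetical names.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicIntegerArray;

// Array-based queue lock; at most `capacity` threads may contend at once.
class ArrayQueueLock {
    private static final int WAIT = 0, ENTER = 1;
    private final int n;
    private final AtomicIntegerArray flags;                 // starts as all WAIT
    private final AtomicInteger tail = new AtomicInteger(0);
    private final ThreadLocal<Integer> mySlot = ThreadLocal.withInitial(() -> 0);

    ArrayQueueLock(int capacity) {
        n = capacity;
        flags = new AtomicIntegerArray(n);
        flags.set(0, ENTER);                                // first arrival may enter
    }

    public void acquire() {
        int slot = Math.floorMod(tail.getAndIncrement(), n); // thread gets slot
        mySlot.set(slot);
        while (flags.get(slot) == WAIT) { Thread.yield(); }  // wait while busy
        flags.set(slot, WAIT);                               // reset slot for next round
    }

    public void release() {
        flags.set((mySlot.get() + 1) % n, ENTER);            // release next thread
    }
}

// Hypothetical harness: several threads increment one counter under the lock.
class ArrayQueueLockDemo {
    private static int counter;

    static int run(int threads, int perThread) {
        counter = 0;
        ArrayQueueLock lock = new ArrayQueueLock(threads);
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            workers[t] = new Thread(() -> {
                for (int k = 0; k < perThread; k++) {
                    lock.acquire();
                    try { counter++; } finally { lock.release(); }
                }
            });
            workers[t].start();
        }
        for (Thread w : workers) {
            try { w.join(); } catch (InterruptedException e) { throw new IllegalStateException(e); }
        }
        return counter;
    }

    public static void main(String[] args) {
        System.out.println(run(4, 10_000));
    }
}
```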

71 Performance
[Graph: TTS vs. queue lock.] The queue lock’s curve is practically flat: scalable performance.

72 Observations
Need to allocate a size-n array no matter how many threads actually access the lock. Cache line granularity? Read the whole array into cache? Can we do better?

73 CLH Lock
FIFO order. Small, constant-size overhead per thread.

74 CLH Queue Lock
Swap into the queue tail, then wait for a “0” in the predecessor’s node. [Diagram: queue of nodes leading to the critical section; release.]

75 CLH Queue Lock
class Qnode {
  boolean locked = true;   // I have not released yet
}

76 CLH Queue Lock
class CLHLock implements Lock {
  RMW queue;
  public void acquire(Qnode mynode) {
    /** mynode.locked = true; */
    Qnode pred = queue.swap(mynode);   // get pred and point tail to me
    while (pred.locked) {}             // wait until unlocked
  }
}

77 CLH Queue Lock
class CLHLock implements Lock {
  RMW queue;
  public void release(Qnode mynode) {
    mynode.locked = false;             // notify successor
  }
}
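
The CLH acquire and release slides can be combined into one runnable sketch. It assumes `AtomicReference.getAndSet` for the slide's `swap`, thread-local node bookkeeping with predecessor-node recycling (a common way to reuse nodes, not shown on the slides), and `Thread.yield()` in the spin loop so the demo completes when threads outnumber cores. `ClhLock` and `ClhLockDemo` are hypothetical names.

```java
import java.util.concurrent.atomic.AtomicReference;

// CLH queue lock: each thread spins on its predecessor's node.
class ClhLock {
    static final class QNode { volatile boolean locked; }

    // The queue starts with one unlocked dummy node.
    private final AtomicReference<QNode> tail = new AtomicReference<>(new QNode());
    private final ThreadLocal<QNode> myNode = ThreadLocal.withInitial(QNode::new);
    private final ThreadLocal<QNode> myPred = new ThreadLocal<>();

    public void acquire() {
        QNode node = myNode.get();
        node.locked = true;                       // I have not released yet
        QNode pred = tail.getAndSet(node);        // get pred and point tail to me
        myPred.set(pred);
        while (pred.locked) { Thread.yield(); }   // wait until unlocked
    }

    public void release() {
        QNode node = myNode.get();
        node.locked = false;                      // notify successor
        myNode.set(myPred.get());                 // reuse predecessor's node next time
    }
}

// Hypothetical harness: several threads increment one counter under the lock.
class ClhLockDemo {
    private static int counter;

    static int run(int threads, int perThread) {
        counter = 0;
        ClhLock lock = new ClhLock();
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            workers[t] = new Thread(() -> {
                for (int k = 0; k < perThread; k++) {
                    lock.acquire();
                    try { counter++; } finally { lock.release(); }
                }
            });
            workers[t].start();
        }
        for (Thread w : workers) {
            try { w.join(); } catch (InterruptedException e) { throw new IllegalStateException(e); }
        }
        return counter;
    }

    public static void main(String[] args) {
        System.out.println(run(4, 10_000));
    }
}
```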

78 Initially
[Diagram: the lock is idle; the queue tail points to a node holding false.]

79 Purple Acquires Lock
[Diagram: purple swaps its node (true) into the queue; the previous node holds false.]

80 Red Wants Lock
[Diagram: red swaps its node (true) in behind purple’s (true) and waits to enter the CS.]

81 NUMA Machines
Distributed shared-memory machine with Non-Uniform Memory Access (NUMA). Shared local memory is fast. Shared remote memory is slow.

82 MCS Lock
On a NUMA machine without caches, CLH is problematic because it spins on a remote location. MCS queue lock: FIFO order; small, constant-size overhead; local spinning!

83 MCS Queue Lock
Swap into the tail of the list, then wait for a “0” in the local node. [Diagram: list of nodes leading to the critical section.]

84 MCS Queue Lock
class Qnode {
  boolean locked = false;
  Qnode next = null;
}

85 MCS Queue Lock
class MCSLock implements Lock {
  RMW queue;
  public void acquire(Qnode mynode) {
    Qnode pred = queue.swap(mynode);   // point to my qnode
    if (pred != null) {
      mynode.locked = true;
      pred.next = mynode;              // point pred to my node
      while (mynode.locked) {}         // wait until unlocked
    }
  }
}

86 Purple Acquires Lock
[Diagram: the lock goes from idle to locked; purple’s node holds false.]

87 Red Wants Lock
[Diagram: red allocates a qnode; labels: locked, true, false.]

88 Red Wants Lock
[Diagram: red is spinning; labels: locked, true, false.]

89 MCS Queue Lock
class MCSLock implements Lock {
  RMW queue;
  public void release(Qnode mynode) {
    if (mynode.next == null) {                 // no successor?
      if (queue.CAS(mynode, null)) return;
      while (mynode.next == null) {}           // wait for successor
    }
    mynode.next.locked = false;                // notify successor
  }
}
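
Putting the MCS acquire and release slides together gives a sketch like the following. It assumes `AtomicReference.getAndSet` and `compareAndSet` for the slides' `swap` and `CAS`, one thread-local node per thread, and `Thread.yield()` in the spin loops so the demo completes when threads outnumber cores. `McsLock` and `McsLockDemo` are hypothetical names.

```java
import java.util.concurrent.atomic.AtomicReference;

// MCS queue lock: each thread spins on a flag in its own node.
class McsLock {
    static final class QNode {
        volatile boolean locked;
        volatile QNode next;
    }

    private final AtomicReference<QNode> tail = new AtomicReference<>(null);
    private final ThreadLocal<QNode> myNode = ThreadLocal.withInitial(QNode::new);

    public void acquire() {
        QNode node = myNode.get();
        QNode pred = tail.getAndSet(node);             // point tail to my node
        if (pred != null) {
            node.locked = true;
            pred.next = node;                          // point pred to my node
            while (node.locked) { Thread.yield(); }    // wait until unlocked (local spin)
        }
    }

    public void release() {
        QNode node = myNode.get();
        if (node.next == null) {                       // no successor?
            if (tail.compareAndSet(node, null)) return;
            while (node.next == null) { Thread.yield(); } // wait for successor to link in
        }
        node.next.locked = false;                      // notify successor
        node.next = null;                              // reset node for reuse
    }
}

// Hypothetical harness: several threads increment one counter under the lock.
class McsLockDemo {
    private static int counter;

    static int run(int threads, int perThread) {
        counter = 0;
        McsLock lock = new McsLock();
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            workers[t] = new Thread(() -> {
                for (int k = 0; k < perThread; k++) {
                    lock.acquire();
                    try { counter++; } finally { lock.release(); }
                }
            });
            workers[t].start();
        }
        for (Thread w : workers) {
            try { w.join(); } catch (InterruptedException e) { throw new IllegalStateException(e); }
        }
        return counter;
    }

    public static void main(String[] args) {
        System.out.println(run(4, 10_000));
    }
}
```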

90 Purple Release
By looking at the queue, I see another thread is active. I have to wait for that thread to finish. [Diagram: purple is releasing while another thread’s swap is in progress.]

91 Purple Release
[Diagram: purple is releasing; the spinning thread may now enter the CS.]

92 Performance
[Graph: Test&Test&Set, A-Lock, exp-backoff, and MCS on a NUMA machine with no coherence (c. 1982).]

93 How Different are Modern Machines
[Graph: TAS with backoff vs. MCS; axis ticks 16, 32, 48; Sun Wildfire (c. 1998).] Experiments courtesy of Scott.

94 Contention Eliminated
We reduced contention by slotting thread access to a lock over time. We saw that queue locks provide very tight slotting and limit invalidation traffic, thus lowering contention with minimal latency.

95 Java Synchronization
synchronized (exp) {… actions …}: lock an object. wait: release the lock and suspend the thread. notify, notifyAll: wake one or all to resume execution where it was suspended.
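
The three primitives on this slide fit together as in the sketch below: a one-slot mailbox in which `put` and `take` are synchronized methods that `wait` while the slot is in the wrong state and call `notifyAll` after changing it. `Mailbox`, `MailboxDemo`, and `relay` are hypothetical names introduced for illustration.

```java
// One-slot mailbox built on synchronized, wait, and notifyAll.
class Mailbox {
    private String message; // null means the slot is empty

    public synchronized void put(String m) throws InterruptedException {
        while (message != null) wait();   // releases the monitor while suspended
        message = m;
        notifyAll();                      // wake threads waiting in take()
    }

    public synchronized String take() throws InterruptedException {
        while (message == null) wait();
        String m = message;
        message = null;
        notifyAll();                      // wake threads waiting in put()
        return m;
    }
}

class MailboxDemo {
    // A producer thread puts each word; the caller takes them and concatenates.
    static String relay(String... words) {
        Mailbox box = new Mailbox();
        Thread producer = new Thread(() -> {
            try {
                for (String w : words) box.put(w);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        producer.start();
        StringBuilder out = new StringBuilder();
        try {
            for (int k = 0; k < words.length; k++) out.append(box.take());
            producer.join();
        } catch (InterruptedException e) {
            throw new IllegalStateException(e);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(relay("a", "b", "c"));
    }
}
```

The `while` loops around `wait()` re-check the condition after every wake-up, which guards against spurious wake-ups and against another thread changing the slot first.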

96 Locks in Java
Frequent: benchmark shows 765,000/second. Ubiquitous: every object has a (potential) lock. Space overhead? Potentially huge; actually small (6% in Javac).

97 Paradox?
Frequency requires time efficiency. Ubiquity requires space efficiency.

98 Solution
Create the lock only when needed. Fast path for the common case. The meta lock: 2 bits in the header; local spinning only.

99 Java Synchronization
Java is compiled to byte code. Must respect block structure. Must deal with exceptions. Nested locks are OK, so locks need to count.

100 Jargon Watch
Monitor lock: protects the object. Meta lock: protects the monitor lock. Modus operandi: acquire meta lock, manipulate monitor lock, release meta lock.

101 Java Objects
Object header: class pointer and multi-use word. Then user-defined fields.

102 Meta-Lock
Multi-use word: meta lock (2 bits) and other stuff (30 bits).

103 Meta-Lock
2 bits = 4 states: Neutral, Locked, Waiters, Busy.

104 Neutral State
Usual state: nothing happening. The multi-use word holds the hash code and age.

105 Locked State
Object is monitor-locked. The multi-use word holds a pointer to the lock records.

106 Lock Record
Owner thread. Lock count. Hash and age (displaced). Next lock record in queue. Free list for unused records.

107 Waiters State
Monitor lock released, but other threads are waiting to get in. The multi-use word holds a pointer to the lock records.

108 Busy State
Metalock is locked. The multi-use word holds an environment pointer.

109 Acquire Meta-Lock
BitField getMetaLock(ExecEnv *ee, Object *obj) {
  BitField busyBits = ee | BUSY;                               /* prepare new value */
  BitField lockBits = SWAP(busyBits, multiUseWordAddr(obj));   /* swap it in */
  if (getLockState(lockBits) != BUSY)
    return lockBits;                                           /* return if not already locked */
  else
    return getMetaLockSlow(ee, lockBits);                      /* otherwise take slow path */
}

110 Slow Path Acquire
The first thread knows it’s first: it didn’t see BUSY bits. Later threads know their predecessor from the result of SWAP.

111 Release Meta-Lock
BitField releaseMetaLock(ExecEnv *ee, Object *obj, BitField releaseBits) {
  BitField busyBits = ee | BUSY;                                           /* value we expect */
  BitField lockBits = CAS(releaseBits, busyBits, multiUseWordAddr(obj));   /* try to replace it (CAS in C returns old value) */
  if (lockBits != busyBits)
    releaseMetaLockSlow(ee, lockBits);                                     /* take slow path if unsuccessful */
}
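
The fast paths of the meta-lock can be mimicked in Java on a 32-bit word. The sketch below is illustrative only: the slides name the four states but not their bit patterns, so the encodings, the choice of the low 2 bits for the state, the use of a small integer owner id in place of the execution-environment pointer, and the names `MetaLockWord`, `swapInBusy`, and `casRelease` are all assumptions. The slow paths are omitted.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Multi-use word: 2 state bits plus a 30-bit payload.
class MetaLockWord {
    // Assumed encodings of the four states.
    static final int NEUTRAL = 0, LOCKED = 1, WAITERS = 2, BUSY = 3;
    static final int STATE_MASK = 0b11;

    static int pack(int payload, int state) { return (payload << 2) | state; }
    static int state(int word)   { return word & STATE_MASK; }
    static int payload(int word) { return word >>> 2; }

    private final AtomicInteger word;

    MetaLockWord(int initialWord) { word = new AtomicInteger(initialWord); }

    // Acquire fast path: swap in (owner | BUSY) and hand back the old word.
    // The caller holds the meta-lock outright only if state(old) != BUSY.
    int swapInBusy(int ownerId) {
        return word.getAndSet(pack(ownerId, BUSY));
    }

    // Release fast path: succeeds only if the word still holds our busy bits,
    // that is, if no other thread swapped itself in behind us.
    boolean casRelease(int ownerId, int releaseWord) {
        return word.compareAndSet(pack(ownerId, BUSY), releaseWord);
    }

    public static void main(String[] args) {
        MetaLockWord w = new MetaLockWord(pack(0x1234, NEUTRAL));
        int old = w.swapInBusy(7);
        System.out.println(state(old) != BUSY);   // fast path applies when true
        System.out.println(w.casRelease(7, old));
    }
}
```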

112 Release Slow Path
Hand off the metalock to the next waiting thread. Synchronize via the successor’s environment structure …

113 Locking Objects
Common cases: Neutral, Waiters, recursively locked. No thread interaction needed.

114 Locking Objects
The thread suspends on a mutex object (lock) and condition variable: it releases the processor until the condition is signalled. Not a spin lock. When it awakes, it takes the slow path. Locked: go back to sleep. Unlocked: update the object and go for it.

115 Unlocking Objects
Common cases: recursive lock, no other threads. No thread interactions needed.

116 Unlocking Object
Obtain metalock. Remove own lock record. Wake up successor. Release metalock. Shorter queue. Waiters state.

117 Wait
Acquire metalock. Set isWaitingForNotify field in execution environment. Release metalock. Wait for bit to be set: not a busy wait; can time out.

118 Notify
Acquire metalock. Walk through queue. Notify: wake first waiting thread. NotifyAll: wake all waiting threads. Release metalock.

119 Locking…not so easy after all
Principles: create the lock only when needed; fast path vs. slow path; optimize the common case.

