Spin Locks and Contention Companion slides for The Art of Multiprocessor Programming by Maurice Herlihy & Nir Shavit.

Slides:

Advertisements

Similar presentations

CS 603 Process Synchronization: The Colored Ticket Algorithm February 13, 2002.

Advertisements

1 Synchronization A parallel computer is a collection of processing elements that cooperate and communicate to solve large problems fast. Types of Synchronization.

Synchronization without Contention

Synchronization. How to synchronize processes? – Need to protect access to shared data to avoid problems like race conditions – Typical example: Updating.

CS492B Analysis of Concurrent Programs Lock Basics Jaehyuk Huh Computer Science, KAIST.

ECE 454 Computer Systems Programming Parallel Architectures and Performance Implications (II) Ding Yuan ECE Dept., University of Toronto

Ch. 7 Process Synchronization (1/2) I Background F Producer - Consumer process :  Compiler, Assembler, Loader, · · · · · · F Bounded buffer.

Spin Locks and Contention Management The Art of Multiprocessor Programming Spring 2007.

Barrier Synchronization Companion slides for The Art of Multiprocessor Programming by Maurice Herlihy & Nir Shavit.

Introduction Companion slides for

Synchronization without Contention John M. Mellor-Crummey and Michael L. Scott+ ECE 259 / CPS 221 Advanced Computer Architecture II Presenter : Tae Jun.

Spin Locks and Contention Based on slides by by Maurice Herlihy & Nir Shavit Tomer Gurevich.

Parallel Processing (CS526) Spring 2012(Week 6).  A parallel algorithm is a group of partitioned tasks that work with each other to solve a large problem.

Transactional Memory Yujia Jin. Lock and Problems Lock is commonly used with shared data Priority Inversion –Lower priority process hold a lock needed.

The Performance of Spin Lock Alternatives for Shared-Memory Microprocessors Thomas E. Anderson Presented by David Woodard.

Shared Counters and Parallelism Companion slides for The Art of Multiprocessor Programming by Maurice Herlihy & Nir Shavit.

CS510 Concurrent Systems Class 1b Spin Lock Performance.

CS510 Advanced OS Seminar Class 10 A Methodology for Implementing Highly Concurrent Data Objects by Maurice Herlihy.

CMPT 401 Summer 2007 Dr. Alexandra Fedorova Lecture IV: OS Support.

Companion slides for The Art of Multiprocessor Programming by Maurice Herlihy & Nir Shavit Concurrent Skip Lists.

Synchronization Todd C. Mowry CS 740 November 1, 2000 Topics Locks Barriers Hardware primitives.

Computer Science Lecture 12, page 1 CS677: Distributed OS Last Class Vector timestamps Global state –Distributed Snapshot Election algorithms.

Multiprocessor Cache Coherency

Memory Consistency Models Some material borrowed from Sarita Adve’s (UIUC) tutorial on memory consistency models.

More on Locks: Case Studies

Art of Multiprocessor Programming1 This work is licensed under a Creative Commons Attribution- ShareAlike 2.5 License.Creative Commons Attribution- ShareAlike.

Multicore Programming

Multiprocessor Architecture Basics Companion slides for The Art of Multiprocessor Programming by Maurice Herlihy & Nir Shavit Ahmed Khademzadeh Azad University.

Process Synchronization Continued 7.2 Critical-Section Problem 7.3 Synchronization Hardware 7.4 Semaphores.

The Performance of Spin Lock Alternatives for Shared-Memory Multiprocessors THOMAS E. ANDERSON Presented by Daesung Park.

ECE200 – Computer Organization Chapter 9 – Multiprocessors.

Spin Locks and Contention

Mutual Exclusion Companion slides for The Art of Multiprocessor Programming by Maurice Herlihy & Nir Shavit.

MULTIVIE W Slide 1 (of 23) The Performance of Spin Lock Alternatives for Shared-Memory Multiprocessors Paper: Thomas E. Anderson Presentation: Emerson.

Operating Systems ECE344 Ashvin Goel ECE University of Toronto Mutual Exclusion.

Caltech CS184 Spring DeHon 1 CS184b: Computer Architecture (Abstractions and Optimizations) Day 12: May 3, 2003 Shared Memory.

Spin Locks Companion slides for The Art of Multiprocessor Programming by Maurice Herlihy & Nir Shavit.

CALTECH cs184c Spring DeHon CS184c: Computer Architecture [Parallel and Multithreaded] Day 10: May 8, 2001 Synchronization.

Fundamentals of Parallel Computer Architecture - Chapter 71 Chapter 7 Introduction to Shared Memory Multiprocessors Yan Solihin Copyright.

Mutual Exclusion Companion slides for The Art of Multiprocessor Programming by Maurice Herlihy & Nir Shavit.

Operating Systems CMPSC 473 Mutual Exclusion Lecture 11: October 5, 2010 Instructor: Bhuvan Urgaonkar.

Monitors and Blocking Synchronization Dalia Cohn Alperovich Based on “The Art of Multiprocessor Programming” by Herlihy & Shavit, chapter 8.

Concurrent Queues Companion slides for The Art of Multiprocessor Programming by Maurice Herlihy & Nir Shavit.

Multiprocessor Architecture Basics Companion slides for The Art of Multiprocessor Programming by Maurice Herlihy & Nir Shavit.

1 Lecture 19: Scalable Protocols & Synch Topics: coherence protocols for distributed shared-memory multiprocessors and synchronization (Sections )

(Superficial!) Review of Uniprocessor Architecture Parallel Architectures and Related concepts CS 433 Laxmikant Kale University of Illinois at Urbana-Champaign.

SYNAR Systems Networking and Architecture Group CMPT 886: The Art of Scalable Synchronization Dr. Alexandra Fedorova School of Computing Science SFU.

Concurrent Stacks Companion slides for The Art of Multiprocessor Programming by Maurice Herlihy & Nir Shavit.

Companion slides for The Art of Multiprocessor Programming by Maurice Herlihy & Nir Shavit Concurrent Skip Lists.

Distributed shared memory u motivation and the main idea u consistency models F strict and sequential F causal F PRAM and processor F weak and release.

Local-Spin Mutual Exclusion Multiprocessor synchronization algorithms ( ) Lecturer: Danny Hendler This presentation is based on the book “Synchronization.

Shared Counters and Parallelism Companion slides for The Art of Multiprocessor Programming by Maurice Herlihy & Nir Shavit.

Spin Locks and Contention Companion slides for The Art of Multiprocessor Programming by Maurice Herlihy & Nir Shavit Modified by Rajeev Alur for CIS 640,

Spin Locks and Contention Companion slides for The Art of Multiprocessor Programming by Maurice Herlihy & Nir Shavit.

Spin Locks and Contention Companion slides for The Art of Multiprocessor Programming by Maurice Herlihy & Nir Shavit.

Queue Locks and Local Spinning Some Slides based on: The Art of Multiprocessor Programming by Maurice Herlihy & Nir Shavit.

The Relative Power of Synchronization Operations Companion slides for The Art of Multiprocessor Programming by Maurice Herlihy & Nir Shavit.

Companion slides for The Art of Multiprocessor Programming by Maurice Herlihy & Nir Shavit Concurrent Skip Lists.

Spin Locks and Contention

Atomic Operations in Hardware

Atomic Operations in Hardware

Multiprocessor Cache Coherency

Spin Locks and Contention

CS510 Concurrent Systems Jonathan Walpole.

Designing Parallel Algorithms (Synchronization)

Spin Locks and Contention Management

CS533 Concepts of Operating Systems

CS510 Concurrent Systems Jonathan Walpole.

CSE 153 Design of Operating Systems Winter 19

Lecture 18: Coherence and Synchronization

Presentation transcript:

Spin Locks and Contention Companion slides for The Art of Multiprocessor Programming by Maurice Herlihy & Nir Shavit

Art of Multiprocessor Programming2 Focus so far: Correctness and Progress Models –Accurate (we never lied to you) –But idealized (so we forgot to mention a few things) Protocols –Elegant –Important –But naïve

Art of Multiprocessor Programming3 New Focus: Performance Models –More complicated (not the same as complex!) –Still focus on principles (not soon obsolete) Protocols –Elegant (in their fashion) –Important (why else would we pay attention) –And realistic (your mileage may vary)

Art of Multiprocessor Programming4 Kinds of Architectures SISD (Uniprocessor) –Single instruction stream –Single data stream SIMD (Vector) –Single instruction –Multiple data MIMD (Multiprocessors) –Multiple instruction –Multiple data.

Art of Multiprocessor Programming5 Kinds of Architectures SISD (Uniprocessor) –Single instruction stream –Single data stream SIMD (Vector) –Single instruction –Multiple data MIMD (Multiprocessors) –Multiple instruction –Multiple data. Our space (1)

Art of Multiprocessor Programming6 MIMD Architectures Memory Contention Communication Contention Communication Latency Shared Bus memory Distributed

Art of Multiprocessor Programming7 Today: Revisit Mutual Exclusion Think of performance, not just correctness and progress Begin to understand how performance depends on our software properly utilizing the multiprocessor machine’s hardware And get to know a collection of locking algorithms… (1)

Art of Multiprocessor Programming8 What Should you do if you can’t get a lock? Keep trying –“spin” or “busy-wait” –Good if delays are short Give up the processor –Good if delays are long –Always good on uniprocessor (1)

Art of Multiprocessor Programming9 What Should you do if you can’t get a lock? Keep trying –“spin” or “busy-wait” –Good if delays are short Give up the processor –Good if delays are long –Always good on uniprocessor our focus

Art of Multiprocessor Programming10 Basic Spin-Lock CS Resets lock upon exit spin lock critical section...

Art of Multiprocessor Programming11 Basic Spin-Lock CS Resets lock upon exit spin lock critical section... …lock introduces sequential bottleneck

Art of Multiprocessor Programming12 Basic Spin-Lock CS Resets lock upon exit spin lock critical section... …lock suffers from contention

Art of Multiprocessor Programming13 Basic Spin-Lock CS Resets lock upon exit spin lock critical section... Notice: these are distinct phenomena …lock suffers from contention

Art of Multiprocessor Programming14 Basic Spin-Lock CS Resets lock upon exit spin lock critical section... …lock suffers from contention Seq Bottleneck  no parallelism

Art of Multiprocessor Programming15 Basic Spin-Lock CS Resets lock upon exit spin lock critical section... Contention  ??? …lock suffers from contention

Art of Multiprocessor Programming16 Review: Test-and-Set Boolean value Test-and-set (TAS) –Swap true with current value –Return value tells if prior value was true or false Can reset just by writing false TAS aka “getAndSet”

Art of Multiprocessor Programming17 Review: Test-and-Set public class AtomicBoolean { boolean value; public synchronized boolean getAndSet(boolean newValue) { boolean prior = value; value = newValue; return prior; } (5)

Art of Multiprocessor Programming18 Review: Test-and-Set public class AtomicBoolean { boolean value; public synchronized boolean getAndSet(boolean newValue) { boolean prior = value; value = newValue; return prior; } Package java.util.concurrent.atomic

Art of Multiprocessor Programming19 Review: Test-and-Set public class AtomicBoolean { boolean value; public synchronized boolean getAndSet(boolean newValue) { boolean prior = value; value = newValue; return prior; } Swap old and new values

Art of Multiprocessor Programming20 Review: Test-and-Set AtomicBoolean lock = new AtomicBoolean(false) … boolean prior = lock.getAndSet(true)

Art of Multiprocessor Programming21 Review: Test-and-Set AtomicBoolean lock = new AtomicBoolean(false) … boolean prior = lock.getAndSet(true) (5) Swapping in true is called “test-and-set” or TAS

Art of Multiprocessor Programming22 Test-and-Set Locks Locking –Lock is free: value is false –Lock is taken: value is true Acquire lock by calling TAS –If result is false, you win –If result is true, you lose Release lock by writing false

Art of Multiprocessor Programming23 Test-and-set Lock class TASlock { AtomicBoolean state = new AtomicBoolean(false); void lock() { while (state.getAndSet(true)) {} } void unlock() { state.set(false); }}

Art of Multiprocessor Programming24 Test-and-set Lock class TASlock { AtomicBoolean state = new AtomicBoolean(false); void lock() { while (state.getAndSet(true)) {} } void unlock() { state.set(false); }} Lock state is AtomicBoolean

Art of Multiprocessor Programming25 Test-and-set Lock class TASlock { AtomicBoolean state = new AtomicBoolean(false); void lock() { while (state.getAndSet(true)) {} } void unlock() { state.set(false); }} Keep trying until lock acquired

Art of Multiprocessor Programming26 Test-and-set Lock class TASlock { AtomicBoolean state = new AtomicBoolean(false); void lock() { while (state.getAndSet(true)) {} } void unlock() { state.set(false); }} Release lock by resetting state to false

Art of Multiprocessor Programming27 Space Complexity TAS spin-lock has small “footprint” N thread spin-lock uses O(1) space As opposed to O(n) Peterson/Bakery How did we overcome the  (n) lower bound? We used a RMW operation…

Art of Multiprocessor Programming28 Performance Experiment –n threads –Increment shared counter 1 million times How long should it take? How long does it take?

Art of Multiprocessor Programming29 Graph ideal time threads no speedup because of sequential bottleneck

Art of Multiprocessor Programming30 Mystery #1 time threads TAS lock Ideal (1) What is going on?

Art of Multiprocessor Programming31 Test-and-Test-and-Set Locks Lurking stage –Wait until lock “looks” free –Spin while read returns true (lock taken) Pouncing state –As soon as lock “looks” available –Read returns false (lock free) –Call TAS to acquire lock –If TAS loses, back to lurking

Art of Multiprocessor Programming32 Test-and-test-and-set Lock class TTASlock { AtomicBoolean state = new AtomicBoolean(false); void lock() { while (true) { while (state.get()) {} if (!state.getAndSet(true)) return; }

Art of Multiprocessor Programming33 Test-and-test-and-set Lock class TTASlock { AtomicBoolean state = new AtomicBoolean(false); void lock() { while (true) { while (state.get()) {} if (!state.getAndSet(true)) return; } Wait until lock looks free

Art of Multiprocessor Programming34 Test-and-test-and-set Lock class TTASlock { AtomicBoolean state = new AtomicBoolean(false); void lock() { while (true) { while (state.get()) {} if (!state.getAndSet(true)) return; } Then try to acquire it

Art of Multiprocessor Programming35 Mystery #2 TAS lock TTAS lock Ideal time threads

Art of Multiprocessor Programming36 Mystery Both –TAS and TTAS –Do the same thing (in our model) Except that –TTAS performs much better than TAS –Neither approaches ideal

Art of Multiprocessor Programming37 Opinion Our memory abstraction is broken TAS & TTAS methods –Are provably the same (in our model) –Except they aren’t (in field tests) Need a more detailed model …

Art of Multiprocessor Programming38 Bus-Based Architectures Bus cache memory cache

Art of Multiprocessor Programming39 Bus-Based Architectures Bus cache memory cache Random access memory (10s of cycles)

Art of Multiprocessor Programming40 Bus-Based Architectures cache memory cache Shared Bus Broadcast medium One broadcaster at a time Processors and memory all “snoop” Bus

Art of Multiprocessor Programming41 Bus-Based Architectures Bus cache memory cache Per-Processor Caches Small Fast: 1 or 2 cycles Address & state information

Art of Multiprocessor Programming42 Jargon Watch Cache hit –“I found what I wanted in my cache” –Good Thing™

Art of Multiprocessor Programming43 Jargon Watch Cache hit –“I found what I wanted in my cache” –Good Thing™ Cache miss –“I had to shlep all the way to memory for that data” –Bad Thing™

Art of Multiprocessor Programming44 Cave Canem This model is still a simplification –But not in any essential way –Illustrates basic principles Will discuss complexities later

Art of Multiprocessor Programming45 Bus Processor Issues Load Request cache memory cache data

Art of Multiprocessor Programming46 Bus Processor Issues Load Request Bus cache memory cache data Gimme data

Art of Multiprocessor Programming47 cache Bus Memory Responds Bus memory cache data Got your data right here data

Art of Multiprocessor Programming48 Bus Processor Issues Load Request memory cache data Gimme data

Art of Multiprocessor Programming49 Bus Processor Issues Load Request Bus memory cache data Gimme data

Art of Multiprocessor Programming50 Bus Processor Issues Load Request Bus memory cache data I got data

Art of Multiprocessor Programming51 Bus Other Processor Responds memory cache data I got data data Bus

Art of Multiprocessor Programming52 Bus Other Processor Responds memory cache data Bus

Art of Multiprocessor Programming53 Modify Cached Data Bus data memory cachedata (1)

Art of Multiprocessor Programming54 Modify Cached Data Bus data memory cachedata (1)

Art of Multiprocessor Programming55 memory Bus data Modify Cached Data cachedata

Art of Multiprocessor Programming56 memory Bus data Modify Cached Data cache What’s up with the other copies? data

Art of Multiprocessor Programming57 Cache Coherence We have lots of copies of data –Original copy in memory –Cached copies at processors Some processor modifies its own copy –What do we do with the others? –How to avoid confusion?

Art of Multiprocessor Programming58 Write-Back Caches Accumulate changes in cache Write back when needed –Need the cache for something else –Another processor wants it On first modification –Invalidate other entries –Requires non-trivial protocol …

Art of Multiprocessor Programming59 Write-Back Caches Cache entry has three states –Invalid: contains raw seething bits –Valid: I can read but I can’t write –Dirty: Data has been modified Intercept other load requests Write back to memory before using cache

Art of Multiprocessor Programming60 Bus Invalidate memory cachedata

Art of Multiprocessor Programming61 Bus Invalidate Bus memory cachedata Mine, all mine!

Art of Multiprocessor Programming62 Bus Invalidate Bus memory cachedata cache Uh,oh

Art of Multiprocessor Programming63 cache Bus Invalidate memory cachedata Other caches lose read permission

Art of Multiprocessor Programming64 cache Bus Invalidate memory cachedata Other caches lose read permission This cache acquires write permission

Art of Multiprocessor Programming65 cache Bus Invalidate memory cachedata Memory provides data only if not present in any cache, so no need to change it now (expensive) (2)

Art of Multiprocessor Programming66 cache Bus Another Processor Asks for Data memory cachedata (2) Bus

Art of Multiprocessor Programming67 cache data Bus Owner Responds memory cachedata (2) Bus Here it is!

Art of Multiprocessor Programming68 Bus End of the Day … memory cachedata (1) Reading OK, no writing data

Art of Multiprocessor Programming69 Mutual Exclusion What do we want to optimize? –Bus bandwidth used by spinning threads –Release/Acquire latency –Acquire latency for idle lock

Art of Multiprocessor Programming70 Simple TASLock TAS invalidates cache lines Spinners –Miss in cache –Go to bus Thread wants to release lock –delayed behind spinners

Art of Multiprocessor Programming71 Test-and-test-and-set Wait until lock “looks” free –Spin on local cache –No bus use while lock busy Problem: when lock is released –Invalidation storm …

Art of Multiprocessor Programming72 Local Spinning while Lock is Busy Bus memory busy

Art of Multiprocessor Programming73 Bus On Release memory freeinvalid free

Art of Multiprocessor Programming74 On Release Bus memory freeinvalid free miss Everyone misses, rereads (1)

Art of Multiprocessor Programming75 On Release Bus memory freeinvalid free TAS(…) Everyone tries TAS (1)

Art of Multiprocessor Programming76 Problems Everyone misses –Reads satisfied sequentially Everyone does TAS –Invalidates others’ caches Eventually quiesces after lock acquired –How long does this take?

Art of Multiprocessor Programming77 Measuring Quiescence Time P1P1 P2P2 PnPn X = time of ops that don’t use the bus Y = time of ops that cause intensive bus traffic In critical section, run ops X then ops Y. As long as Quiescence time is less than X, no drop in performance. By gradually varying X, can determine the exact time to quiesce.

Art of Multiprocessor Programming78 Quiescence Time Increses linearly with the number of processors for bus architecture time threads

Art of Multiprocessor Programming79 Mystery Explained TAS lock TTAS lock Ideal time threads Better than TAS but still not as good as ideal

Art of Multiprocessor Programming80 Solution: Introduce Delay spin lock time d r1dr1d r2dr2d If the lock looks free But I fail to get it There must be lots of contention Better to back off than to collide again

Art of Multiprocessor Programming81 Dynamic Example: Exponential Backoff time d 2d4d spin lock If I fail to get lock –wait random duration before retry –Each subsequent failure doubles expected wait

Art of Multiprocessor Programming82 Exponential Backoff Lock public class Backoff implements lock { public void lock() { int delay = MIN_DELAY; while (true) { while (state.get()) {} if (!lock.getAndSet(true)) return; sleep(random() % delay); if (delay < MAX_DELAY) delay = 2 * delay; }}}

Art of Multiprocessor Programming83 Exponential Backoff Lock public class Backoff implements lock { public void lock() { int delay = MIN_DELAY; while (true) { while (state.get()) {} if (!lock.getAndSet(true)) return; sleep(random() % delay); if (delay < MAX_DELAY) delay = 2 * delay; }}} Fix minimum delay

Art of Multiprocessor Programming84 Exponential Backoff Lock public class Backoff implements lock { public void lock() { int delay = MIN_DELAY; while (true) { while (state.get()) {} if (!lock.getAndSet(true)) return; sleep(random() % delay); if (delay < MAX_DELAY) delay = 2 * delay; }}} Wait until lock looks free

Art of Multiprocessor Programming85 Exponential Backoff Lock public class Backoff implements lock { public void lock() { int delay = MIN_DELAY; while (true) { while (state.get()) {} if (!lock.getAndSet(true)) return; sleep(random() % delay); if (delay < MAX_DELAY) delay = 2 * delay; }}} If we win, return

Art of Multiprocessor Programming86 Exponential Backoff Lock public class Backoff implements lock { public void lock() { int delay = MIN_DELAY; while (true) { while (state.get()) {} if (!lock.getAndSet(true)) return; sleep(random() % delay); if (delay < MAX_DELAY) delay = 2 * delay; }}} Back off for random duration

Art of Multiprocessor Programming87 Exponential Backoff Lock public class Backoff implements lock { public void lock() { int delay = MIN_DELAY; while (true) { while (state.get()) {} if (!lock.getAndSet(true)) return; sleep(random() % delay); if (delay < MAX_DELAY) delay = 2 * delay; }}} Double max delay, within reason

Art of Multiprocessor Programming88 Spin-Waiting Overhead TTAS Lock Backoff lock time threads

Art of Multiprocessor Programming89 Backoff: Other Issues Good –Easy to implement –Beats TTAS lock Bad –Must choose parameters carefully –Not portable across platforms

Art of Multiprocessor Programming90 Idea Avoid useless invalidations –By keeping a queue of threads Each thread –Notifies next in line –Without bothering the others

Art of Multiprocessor Programming91 Anderson Queue Lock flags next TFFFFFFF idle

Art of Multiprocessor Programming92 Anderson Queue Lock flags next TFFFFFFF acquiring getAndIncrement

Art of Multiprocessor Programming93 Anderson Queue Lock flags next TFFFFFFF acquiring getAndIncrement

Art of Multiprocessor Programming94 Anderson Queue Lock flags next TFFFFFFF acquired Mine!

Art of Multiprocessor Programming95 Anderson Queue Lock flags next TFFFFFFF acquired acquiring

Art of Multiprocessor Programming96 Anderson Queue Lock flags next TFFFFFFF acquired acquiring getAndIncrement

Art of Multiprocessor Programming97 Anderson Queue Lock flags next TFFFFFFF acquired acquiring getAndIncrement

Art of Multiprocessor Programming98 acquired Anderson Queue Lock flags next TFFFFFFF acquiring

Art of Multiprocessor Programming99 released Anderson Queue Lock flags next TTFFFFFF acquired

Art of Multiprocessor Programming100 released Anderson Queue Lock flags next TTFFFFFF acquired Yow!

Art of Multiprocessor Programming101 Anderson Queue Lock class ALock implements Lock { boolean[] flags={true,false,…,false}; AtomicInteger next = new AtomicInteger(0); int[] slot = new int[n];

Art of Multiprocessor Programming102 Anderson Queue Lock class ALock implements Lock { boolean[] flags={true,false,…,false}; AtomicInteger next = new AtomicInteger(0); int[] slot = new int[n]; One flag per thread

Art of Multiprocessor Programming103 Anderson Queue Lock class ALock implements Lock { boolean[] flags={true,false,…,false}; AtomicInteger next = new AtomicInteger(0); int[] slot = new int[n]; Next flag to use

Art of Multiprocessor Programming104 Anderson Queue Lock class ALock implements Lock { boolean[] flags={true,false,…,false}; AtomicInteger next = new AtomicInteger(0); ThreadLocal mySlot; Thread-local variable

Art of Multiprocessor Programming105 Anderson Queue Lock public lock() { mySlot = next.getAndIncrement(); while (!flags[mySlot % n]) {}; flags[mySlot % n] = false; } public unlock() { flags[(mySlot+1) % n] = true; }

Art of Multiprocessor Programming106 Anderson Queue Lock public lock() { mySlot = next.getAndIncrement(); while (!flags[mySlot % n]) {}; flags[mySlot % n] = false; } public unlock() { flags[(mySlot+1) % n] = true; } Take next slot

Art of Multiprocessor Programming107 Anderson Queue Lock public lock() { mySlot = next.getAndIncrement(); while (!flags[mySlot % n]) {}; flags[mySlot % n] = false; } public unlock() { flags[(mySlot+1) % n] = true; } Spin until told to go

Art of Multiprocessor Programming108 Anderson Queue Lock public lock() { myslot = next.getAndIncrement(); while (!flags[myslot % n]) {}; flags[myslot % n] = false; } public unlock() { flags[(myslot+1) % n] = true; } Prepare slot for re-use

Art of Multiprocessor Programming109 Anderson Queue Lock public lock() { mySlot = next.getAndIncrement(); while (!flags[mySlot % n]) {}; flags[mySlot % n] = false; } public unlock() { flags[(mySlot+1) % n] = true; } Tell next thread to go

Art of Multiprocessor Programming110 Performance Shorter handover than backoff Curve is practically flat Scalable performance FIFO fairness queue TTAS

Art of Multiprocessor Programming111 Anderson Queue Lock Good –First truly scalable lock –Simple, easy to implement Bad –Space hog –One bit per thread Unknown number of threads? Small number of actual contenders?

Art of Multiprocessor Programming112 CLH Lock FIFO order Small, constant-size overhead per thread

Art of Multiprocessor Programming113 Initially false tail idle

Art of Multiprocessor Programming114 Initially false tail idle Queue tail

Art of Multiprocessor Programming115 Initially false tail idle Lock is free

Art of Multiprocessor Programming116 Initially false tail idle

Art of Multiprocessor Programming117 Purple Wants the Lock false tail acquiring

Art of Multiprocessor Programming118 Purple Wants the Lock false tail acquiring true

Art of Multiprocessor Programming119 Purple Wants the Lock false tail acquiring true Swap

Art of Multiprocessor Programming120 Purple Has the Lock false tail acquired true

Art of Multiprocessor Programming121 Red Wants the Lock false tail acquired acquiring true

Art of Multiprocessor Programming122 Red Wants the Lock false tail acquired acquiring true Swap true

Art of Multiprocessor Programming123 Red Wants the Lock false tail acquired acquiring true

Art of Multiprocessor Programming124 Red Wants the Lock false tail acquired acquiring true

Art of Multiprocessor Programming125 Red Wants the Lock false tail acquired acquiring true Implicitely Linked list

Art of Multiprocessor Programming126 Red Wants the Lock false tail acquired acquiring true

Art of Multiprocessor Programming127 Red Wants the Lock false tail acquired acquiring true Actually, it spins on cached copy

Art of Multiprocessor Programming128 Purple Releases false tail release acquiring false true false Bingo!

Art of Multiprocessor Programming129 Purple Releases tail released acquired true

Art of Multiprocessor Programming130 Space Usage Let –L = number of locks –N = number of threads ALock –O(LN) CLH lock –O(L+N)

Art of Multiprocessor Programming131 CLH Queue Lock class Qnode { AtomicBoolean locked = new AtomicBoolean(true); }

Art of Multiprocessor Programming132 CLH Queue Lock class Qnode { AtomicBoolean locked = new AtomicBoolean(true); } Not released yet

Art of Multiprocessor Programming133 CLH Queue Lock class CLHLock implements Lock { AtomicReference tail; ThreadLocal myNode = new Qnode(); public void lock() { Qnode pred = tail.getAndSet(myNode); while (pred.locked) {} }} (3)

Art of Multiprocessor Programming134 CLH Queue Lock class CLHLock implements Lock { AtomicReference tail; ThreadLocal myNode = new Qnode(); public void lock() { Qnode pred = tail.getAndSet(myNode); while (pred.locked) {} }} (3) Tail of the queue

Art of Multiprocessor Programming135 CLH Queue Lock class CLHLock implements Lock { AtomicReference tail; ThreadLocal myNode = new Qnode(); public void lock() { Qnode pred = tail.getAndSet(myNode); while (pred.locked) {} }} (3) Thread-local Qnode

Art of Multiprocessor Programming136 CLH Queue Lock class CLHLock implements Lock { AtomicReference tail; ThreadLocal myNode = new Qnode(); public void lock() { Qnode pred = tail.getAndSet(myNode); while (pred.locked) {} }} (3) Swap in my node

Art of Multiprocessor Programming137 CLH Queue Lock class CLHLock implements Lock { AtomicReference tail; ThreadLocal myNode = new Qnode(); public void lock() { Qnode pred = tail.getAndSet(myNode); while (pred.locked) {} }} (3) Spin until predecessor releases lock

Art of Multiprocessor Programming138 CLH Queue Lock Class CLHLock implements Lock { … public void unlock() { myNode.locked.set(false); myNode = pred; } (3)

Art of Multiprocessor Programming139 CLH Queue Lock Class CLHLock implements Lock { … public void unlock() { myNode.locked.set(false); myNode = pred; } (3) Notify successor

Art of Multiprocessor Programming140 CLH Queue Lock Class CLHLock implements Lock { … public void unlock() { myNode.locked.set(false); myNode = pred; } (3) Recycle predecessor’s node

Art of Multiprocessor Programming141 CLH Queue Lock Class CLHLock implements Lock { … public void unlock() { myNode.locked.set(false); myNode = pred; } (3) (notice that we actually don’t reuse myNode. Code in book shows how its done.)

Art of Multiprocessor Programming142 CLH Lock Good –Lock release affects predecessor only –Small, constant-sized space Bad –Doesn’t work for uncached NUMA architectures

Art of Multiprocessor Programming143 NUMA Architecturs Acronym: –Non-Uniform Memory Architecture Illusion: –Flat shared memory Truth: –No caches (sometimes) –Some memory regions faster than others

Art of Multiprocessor Programming144 NUMA Machines Spinning on local memory is fast

Art of Multiprocessor Programming145 NUMA Machines Spinning on remote memory is slow

Art of Multiprocessor Programming146 CLH Lock Each thread spin’s on predecessor’s memory Could be far away …

Art of Multiprocessor Programming147 MCS Lock FIFO order Spin on local memory only Small, Constant-size overhead

Art of Multiprocessor Programming148 Initially false tail false idle

Art of Multiprocessor Programming149 Acquiring false queue false true acquiring (allocate Qnode)

Art of Multiprocessor Programming150 Acquiring false tail false true acquired swap

Art of Multiprocessor Programming151 Acquiring false tail false true acquired

Art of Multiprocessor Programming152 Acquired false tail false true acquired

Art of Multiprocessor Programming153 Acquiring tail false acquired acquiring true swap

Art of Multiprocessor Programming154 Acquiring tail acquired acquiring true false

Art of Multiprocessor Programming155 Acquiring tail acquired acquiring true false

Art of Multiprocessor Programming156 Acquiring tail acquired acquiring true false

Art of Multiprocessor Programming157 Acquiring tail acquired acquiring true false

Art of Multiprocessor Programming158 Acquiring tail acquired acquiring true Yes! false

Art of Multiprocessor Programming159 MCS Queue Lock class Qnode { boolean locked = false; qnode next = null; }

Art of Multiprocessor Programming160 MCS Queue Lock class MCSLock implements Lock { AtomicReference tail; public void lock() { Qnode qnode = new Qnode(); Qnode pred = tail.getAndSet(qnode); if (pred != null) { qnode.locked = true; pred.next = qnode; while (qnode.locked) {} }}} (3)

Art of Multiprocessor Programming161 MCS Queue Lock class MCSLock implements Lock { AtomicReference tail; public void lock() { Qnode qnode = new Qnode(); Qnode pred = tail.getAndSet(qnode); if (pred != null) { qnode.locked = true; pred.next = qnode; while (qnode.locked) {} }}} (3) Make a QNode

Art of Multiprocessor Programming162 MCS Queue Lock class MCSLock implements Lock { AtomicReference tail; public void lock() { Qnode qnode = new Qnode(); Qnode pred = tail.getAndSet(qnode); if (pred != null) { qnode.locked = true; pred.next = qnode; while (qnode.locked) {} }}} (3) add my Node to the tail of queue

Art of Multiprocessor Programming163 MCS Queue Lock class MCSLock implements Lock { AtomicReference tail; public void lock() { Qnode qnode = new Qnode(); Qnode pred = tail.getAndSet(qnode); if (pred != null) { qnode.locked = true; pred.next = qnode; while (qnode.locked) {} }}} (3) Fix if queue was non-empty

Art of Multiprocessor Programming164 MCS Queue Lock class MCSLock implements Lock { AtomicReference tail; public void lock() { Qnode qnode = new Qnode(); Qnode pred = tail.getAndSet(qnode); if (pred != null) { qnode.locked = true; pred.next = qnode; while (qnode.locked) {} }}} (3) Wait until unlocked

Art of Multiprocessor Programming165 MCS Queue Unlock class MCSLock implements Lock { AtomicReference tail; public void unlock() { if (qnode.next == null) { if (tail.CAS(qnode, null) return; while (qnode.next == null) {} } qnode.next.locked = false; }} (3)

Art of Multiprocessor Programming166 MCS Queue Lock class MCSLock implements Lock { AtomicReference tail; public void unlock() { if (qnode.next == null) { if (tail.CAS(qnode, null) return; while (qnode.next == null) {} } qnode.next.locked = false; }} (3) Missing successor?

Art of Multiprocessor Programming167 MCS Queue Lock class MCSLock implements Lock { AtomicReference tail; public void unlock() { if (qnode.next == null) { if (tail.CAS(qnode, null) return; while (qnode.next == null) {} } qnode.next.locked = false; }} (3) If really no successor, return

Art of Multiprocessor Programming168 MCS Queue Lock class MCSLock implements Lock { AtomicReference tail; public void unlock() { if (qnode.next == null) { if (tail.CAS(qnode, null) return; while (qnode.next == null) {} } qnode.next.locked = false; }} (3) Otherwise wait for successor to catch up

Art of Multiprocessor Programming169 MCS Queue Lock class MCSLock implements Lock { AtomicReference queue; public void unlock() { if (qnode.next == null) { if (tail.CAS(qnode, null) return; while (qnode.next == null) {} } qnode.next.locked = false; }} (3) Pass lock to successor

Art of Multiprocessor Programming170 Purple Release false releasing swap false (2)

Art of Multiprocessor Programming171 Purple Release false releasing swap false By looking at the queue, I see another thread is active (2)

Art of Multiprocessor Programming172 Purple Release false releasing swap false By looking at the queue, I see another thread is active I have to wait for that thread to finish (2)

Art of Multiprocessor Programming173 Purple Release false releasing prepare to spin true

Art of Multiprocessor Programming174 Purple Release false releasing spinning true

Art of Multiprocessor Programming175 Purple Release false releasing spinning true false

Art of Multiprocessor Programming176 Purple Release false releasing true Acquired lock false

Art of Multiprocessor Programming177 Abortable Locks What if you want to give up waiting for a lock? For example –Timeout –Database transaction aborted by user

Art of Multiprocessor Programming178 Back-off Lock Aborting is trivial –Just return from lock() call Extra benefit: –No cleaning up –Wait-free –Immediate return

Art of Multiprocessor Programming179 Queue Locks Can’t just quit –Thread in line behind will starve Need a graceful way out

Art of Multiprocessor Programming180 Queue Locks spinning true spinning true spinning

Art of Multiprocessor Programming181 Queue Locks spinning true spinning true false locked

Art of Multiprocessor Programming182 Queue Locks spinning true locked false

Art of Multiprocessor Programming183 Queue Locks locked false

Art of Multiprocessor Programming184 Queue Locks spinning true spinning true spinning

Art of Multiprocessor Programming185 Queue Locks spinning true spinning

Art of Multiprocessor Programming186 Queue Locks spinning true false locked

Art of Multiprocessor Programming187 Queue Locks spinning true false

Art of Multiprocessor Programming188 Queue Locks pwned true false

Art of Multiprocessor Programming189 Abortable CLH Lock When a thread gives up –Removing node in a wait-free way is hard Idea: –let successor deal with it.

Art of Multiprocessor Programming190 Initially tail idle Pointer to predecessor (or null) A

Art of Multiprocessor Programming191 Initially tail idle Distinguished available node means lock is free A

Art of Multiprocessor Programming192 Acquiring tail acquiring A

Art of Multiprocessor Programming193 Acquiring acquiring A Null predecessor means lock not released or aborted

Art of Multiprocessor Programming194 Acquiring acquiring A Swap

Art of Multiprocessor Programming195 Acquiring acquiring A

Art of Multiprocessor Programming196 Acquired locked A Pointer to AVAILABLE means lock is free.

spinning locked Art of Multiprocessor Programming197 Normal Case Null means lock is not free & request not aborted

Art of Multiprocessor Programming198 One Thread Aborts spinning Timed out locked

Art of Multiprocessor Programming199 Successor Notices spinning Timed out locked Non-Null means predecessor aborted

Art of Multiprocessor Programming200 Recycle Predecessor’s Node spinning locked

Art of Multiprocessor Programming201 Spin on Earlier Node spinning locked

Art of Multiprocessor Programming202 Spin on Earlier Node spinning released A The lock is now mine

Art of Multiprocessor Programming203 Time-out Lock public class TOLock implements Lock { static Qnode AVAILABLE = new Qnode(); AtomicReference tail; ThreadLocal myNode;

Art of Multiprocessor Programming204 Time-out Lock public class TOLock implements Lock { static Qnode AVAILABLE = new Qnode(); AtomicReference tail; ThreadLocal myNode; Distinguished node to signify free lock

Art of Multiprocessor Programming205 Time-out Lock public class TOLock implements Lock { static Qnode AVAILABLE = new Qnode(); AtomicReference tail; ThreadLocal myNode; Tail of the queue

Art of Multiprocessor Programming206 Time-out Lock public class TOLock implements Lock { static Qnode AVAILABLE = new Qnode(); AtomicReference tail; ThreadLocal myNode; Remember my node …

Art of Multiprocessor Programming207 Time-out Lock public boolean lock(long timeout) { Qnode qnode = new Qnode(); myNode.set(qnode); qnode.prev = null; Qnode myPred = tail.getAndSet(qnode); if (myPred== null || myPred.prev == AVAILABLE) { return true; } …

Art of Multiprocessor Programming208 Time-out Lock public boolean lock(long timeout) { Qnode qnode = new Qnode(); myNode.set(qnode); qnode.prev = null; Qnode myPred = tail.getAndSet(qnode); if (myPred == null || myPred.prev == AVAILABLE) { return true; } Create & initialize node

Art of Multiprocessor Programming209 Time-out Lock public boolean lock(long timeout) { Qnode qnode = new Qnode(); myNode.set(qnode); qnode.prev = null; Qnode myPred = tail.getAndSet(qnode); if (myPred == null || myPred.prev == AVAILABLE) { return true; } Swap with tail

Art of Multiprocessor Programming210 Time-out Lock public boolean lock(long timeout) { Qnode qnode = new Qnode(); myNode.set(qnode); qnode.prev = null; Qnode myPred = tail.getAndSet(qnode); if (myPred == null || myPred.prev == AVAILABLE) { return true; }... If predecessor absent or released, we are done

Art of Multiprocessor Programming211 Time-out Lock … long start = now(); while (now()- start < timeout) { Qnode predPred = myPred.prev; if (predPred == AVAILABLE) { return true; } else if (predPred != null) { myPred = predPred; } … spinning locked

Art of Multiprocessor Programming212 Time-out Lock … long start = now(); while (now()- start < timeout) { Qnode predPred = myPred.prev; if (predPred == AVAILABLE) { return true; } else if (predPred != null) { myPred = predPred; } … Keep trying for a while …

Art of Multiprocessor Programming213 Time-out Lock … long start = now(); while (now()- start < timeout) { Qnode predPred = myPred.prev; if (predPred == AVAILABLE) { return true; } else if (predPred != null) { myPred = predPred; } … Spin on predecessor’s prev field

Art of Multiprocessor Programming214 Time-out Lock … long start = now(); while (now()- start < timeout) { Qnode predPred = myPred.prev; if (predPred == AVAILABLE) { return true; } else if (predPred != null) { myPred = predPred; } … Predecessor released lock

Art of Multiprocessor Programming215 Time-out Lock … long start = now(); while (now()- start < timeout) { Qnode predPred = myPred.prev; if (predPred == AVAILABLE) { return true; } else if (predPred != null) { myPred = predPred; } … Predecessor aborted, advance one

Art of Multiprocessor Programming216 Time-out Lock … if (!tail.compareAndSet(qnode, myPred)) qnode.prev = myPred; return false; } What do I do when I time out?

Art of Multiprocessor Programming217 Time-out Lock … if (!tail.compareAndSet(qnode, myPred)) qnode.prev = myPred; return false; } Do I have a successor? If CAS fails: I do have a successor, tell it about myPred

Art of Multiprocessor Programming218 Time-out Lock … if (!tail.compareAndSet(qnode, myPred)) qnode.prev = myPred; return false; } If CAS succeeds: no successor, simply return false

Art of Multiprocessor Programming219 Time-Out Unlock public void unlock() { Qnode qnode = myNode.get(); if (!tail.compareAndSet(qnode, null)) qnode.prev = AVAILABLE; }

Art of Multiprocessor Programming220 public void unlock() { Qnode qnode = myNode.get(); if (!tail.compareAndSet(qnode, null)) qnode.prev = AVAILABLE; } Time-out Unlock If CAS failed: exists successor, notify successor it can enter

Art of Multiprocessor Programming221 public void unlock() { Qnode qnode = myNode.get(); if (!tail.compareAndSet(qnode, null)) qnode.prev = AVAILABLE; } Timing-out Lock CAS successful: set tail to null, no clean up since no successor waiting

Art of Multiprocessor Programming222 One Lock To Rule Them All? TTAS+Backoff, CLH, MCS, ToLock… Each better than others in some way There is no one solution Lock we pick really depends on: – the application – the hardware – which properties are important

Art of Multiprocessor Programming223 This work is licensed under a Creative Commons Attribution- ShareAlike 2.5 License.Creative Commons Attribution- ShareAlike 2.5 License You are free: –to Share — to copy, distribute and transmit the work –to Remix — to adapt the work Under the following conditions: –Attribution. You must attribute the work to “The Art of Multiprocessor Programming” (but not in any way that suggests that the authors endorse you or your use of the work). –Share Alike. If you alter, transform, or build upon this work, you may distribute the resulting work only under the same, similar or a compatible license. For any reuse or distribution, you must make clear to others the license terms of this work. The best way to do this is with a link to – Any of the above conditions can be waived if you get permission from the copyright holder. Nothing in this license impairs or restricts the author's moral rights.