Spin Locks and Contention

Presentation transcript:

Spin Locks and Contention Companion slides for The Art of Multiprocessor Programming by Maurice Herlihy & Nir Shavit

Focus so far: Correctness and Progress. Models: accurate (we never lied to you) but idealized (so we forgot to mention a few things). Protocols: elegant and important, but naïve. So far we have focused on idealized models because we were interested in understanding the foundations of the field. Understanding how to build a bakery lock isn’t all that useful if you really want to implement a lock directly, but you now know more about the underlying problems than you would if we had started looking at complicated practical techniques right away. Art of Multiprocessor Programming

New Focus: Performance. Models: more complicated (not the same as complex!), still focused on principles (not soon obsolete). Protocols: elegant (in their fashion), important (why else would we pay attention), and realistic (your mileage may vary). Now we are going to transition over to the “real” world and look at the kinds of protocols you might actually use. These models are necessarily more complicated, in the sense of having more details and exceptions and things to remember, but not necessarily more complex, in the sense of encompassing deeper ideas. We are still going to focus on issues that are important, that is, things that matter for all kinds of platforms and applications, and avoid other optimizations that might sometimes work in practice but that don’t teach us anything.

Kinds of Architectures. SISD (uniprocessor): single instruction stream, single data stream. SIMD (vector): single instruction, multiple data. MIMD (multiprocessors): multiple instruction, multiple data. You can classify processors by how they manage data and instruction streams. A single-instruction single-data stream architecture is just a uniprocessor. A few years ago, single-instruction multiple-data stream architectures were popular, for example the Connection Machine CM2. These have fallen out of favor, at least for the time being.

Kinds of Architectures: our space. Instead, most modern multiprocessors provide multiple instruction streams, meaning that processors execute independent sequences of instructions, and multiple data streams, meaning that processors issue independent sequences of memory reads and writes. Such architectures are usually called “MIMD”.

MIMD Architectures: shared bus and distributed. There are two basic kinds of MIMD architectures. In a shared-bus architecture, processors and memory are connected by a shared broadcast medium called a bus (like a tiny Ethernet). Both the processors and the memory controller can broadcast on the bus. Only one processor (or memory) can broadcast on the bus at a time, but all processors (and memory) can listen. Bus-based architectures are the most common today because they are easy to build. The principal things that affect performance are (1) contention for the memory: not all the processors can usually get at the same memory location at the same time, and if they try, they will have to queue up; (2) contention for the communication medium: if everyone wants to communicate at the same time, or with the same processor, then the processors will have to wait for one another; and (3) a growing communication latency, the time it takes for a processor to communicate with memory or with another processor.

Today: Revisit Mutual Exclusion. Think of performance, not just correctness and progress. Begin to understand how performance depends on our software properly utilizing the multiprocessor machine’s hardware. And get to know a collection of locking algorithms… When programming uniprocessors, one can generally ignore the exact structure and properties of the underlying system. Unfortunately, multiprocessor programming has yet to reach that state, and at present it is crucial that the programmer understand the properties and limitations of the underlying machine architecture. This will be our goal in this chapter. We revisit the familiar mutual exclusion problem, this time with the aim of devising mutual exclusion protocols that work well on today’s multiprocessors.

What should you do if you can’t get a lock? Keep trying (“spin” or “busy-wait”): good if delays are short. Give up the processor: good if delays are long, and always good on a uniprocessor. Any mutual exclusion protocol poses the question: what do you do if you cannot acquire the lock? There are two alternatives. If you keep trying, the lock is called a spin lock, and repeatedly testing the lock is called spinning, or busy-waiting. We will use these terms interchangeably. The filter and bakery algorithms are spin locks. Spinning is sensible when you expect the lock delay to be short. For obvious reasons, spinning makes sense only on multiprocessors. The alternative is to suspend yourself and ask the operating system’s scheduler to schedule another thread on your processor, which is sometimes called blocking. Java’s built-in synchronization is blocking. Because switching from one thread to another is expensive, blocking makes sense only if you expect the lock delay to be long. Many operating systems mix both strategies, spinning for a short time and then blocking.

What should you do if you can’t get a lock? Both spinning and blocking are important techniques. In this lecture, we turn our attention to locks that use spinning, which will be our focus.
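The mixed strategy just mentioned (spin for a short time, then give up the processor) can be sketched as follows. This is an illustrative sketch, not code from the slides: the class name SpinThenYieldLock and the SPIN_LIMIT value are hypothetical, and Thread.yield() stands in for a true scheduler block.

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Sketch of the mixed strategy: spin briefly, then give up the processor.
// The name SpinThenYieldLock and the SPIN_LIMIT value are illustrative only.
class SpinThenYieldLock {
    static final int SPIN_LIMIT = 1000; // arbitrary short-spin budget
    final AtomicBoolean state = new AtomicBoolean(false);

    void lock() {
        int spins = 0;
        while (state.getAndSet(true)) {
            if (++spins > SPIN_LIMIT) {
                Thread.yield(); // delay looks long: let another thread run
                spins = 0;
            }
        }
    }

    void unlock() {
        state.set(false);
    }
}
```

A real blocking lock would park the thread in the scheduler rather than merely yielding, but the structure (short spin, then back off to the OS) is the same.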

Basic Spin-Lock. With spin locks, synchronization usually looks like this: some set of threads contend for the lock. One of them acquires it, and the others spin. The winner enters the critical section, does its job, and resets the lock upon exit.

Basic Spin-Lock: the lock introduces a sequential bottleneck, and the lock suffers from contention. Contention occurs when multiple threads try to acquire a lock at the same time. High contention means there are many such threads, and low contention means the opposite. Notice that these are distinct phenomena: the sequential bottleneck means no parallelism, while the cost of contention is the question we turn to next. Our goal is to understand how contention works, and to develop a set of techniques that can avoid or alleviate it. These techniques provide a useful toolkit for all kinds of synchronization problems.

Review: Test-and-Set. A Boolean value. Test-and-set (TAS) swaps true with the current value, and the return value tells whether the prior value was true or false. You can reset just by writing false. The test-and-set machine instruction atomically stores true in that word and returns the word’s previous value, swapping the value true for the word’s current value. Note that in Java TAS is called getAndSet; we will use the terms interchangeably.

Review: Test-and-Set. Here we are going to build one out of the AtomicBoolean class from the package java.util.concurrent.atomic, which is provided as part of Java’s standard library of atomic primitives. You can think of this object as a box holding a Boolean value. The getAndSet method swaps a Boolean value with the current contents of the box:

  public class AtomicBoolean {
    boolean value;
    public synchronized boolean getAndSet(boolean newValue) {
      boolean prior = value;
      value = newValue;
      return prior;
    }
  }

The test-and-set machine instruction, with consensus number two, was the principal synchronization instruction provided by many early multiprocessors.

Review: Test-and-Set. At first glance, the AtomicBoolean class seems ideal for implementing a spin lock:

  AtomicBoolean lock = new AtomicBoolean(false);
  …
  boolean prior = lock.getAndSet(true);

Swapping in true is called “test-and-set”, or TAS: if we call getAndSet(true), then we have a test-and-set.

Test-and-Set Locks. The lock is free when the word’s value is false, and taken (busy) when it is true. Acquire the lock by calling TAS: if the result is false, you win; if the result is true, you lose. Release the lock by writing false.

Test-and-Set Lock. Here is what it looks like in more detail. The lock state is just an AtomicBoolean initialized to false. The lock() method repeatedly applies test-and-set to the location until that instruction returns false (that is, until the lock is free), so it keeps trying until the lock is acquired. The unlock() method releases the lock by simply writing the value false to that word:

  class TASlock {
    AtomicBoolean state = new AtomicBoolean(false);
    void lock() {
      while (state.getAndSet(true)) {}
    }
    void unlock() {
      state.set(false);
    }
  }

Space Complexity. The TAS spin lock has a small “footprint”: an N-thread spin lock uses O(1) space, as opposed to the O(n) of Peterson/Bakery. How did we overcome the Ω(n) lower bound? We used an RMW (read-modify-write) operation… We call real-world space complexity the “footprint”. By using TAS we are able to reduce the footprint from linear to constant.

Performance Experiment. Take n threads and have them increment a shared counter one million times. How long should it take? How long does it take? Let’s do an experiment on a real machine: the n threads collectively acquire a lock, increment the counter, and release the lock, one million times in all. Before we look at any curves, let’s try to reason about how long it should take them.
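The experiment described above can be sketched as follows, using a test-and-set lock like the TASlock from these slides. The class name CounterExperiment is hypothetical, and the thread and iteration counts are scaled down for illustration; the slides use one million increments.

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Sketch of the counter experiment: n threads collectively increment a
// shared counter under a test-and-set lock, and we time the whole run.
class CounterExperiment {
    static class TASLock {
        final AtomicBoolean state = new AtomicBoolean(false);
        void lock()   { while (state.getAndSet(true)) {} }
        void unlock() { state.set(false); }
    }

    // Returns the final counter value; prints the elapsed wall-clock time.
    static long run(int nThreads, int totalIncrements) throws InterruptedException {
        final TASLock lock = new TASLock();
        final long[] counter = {0};
        int perThread = totalIncrements / nThreads; // assume divisibility
        Thread[] threads = new Thread[nThreads];
        long start = System.nanoTime();
        for (int i = 0; i < nThreads; i++) {
            threads[i] = new Thread(() -> {
                for (int j = 0; j < perThread; j++) {
                    lock.lock();
                    try { counter[0]++; } finally { lock.unlock(); }
                }
            });
            threads[i].start();
        }
        for (Thread t : threads) t.join();
        long elapsed = System.nanoTime() - start;
        System.out.println(nThreads + " threads: " + elapsed / 1_000_000 + " ms");
        return counter[0];
    }

    public static void main(String[] args) throws InterruptedException {
        // Doubling the thread count lets you plot time vs. threads yourself.
        for (int n = 1; n <= 8; n *= 2) run(n, 100_000);
    }
}
```

On real hardware the curve this produces is exactly the mystery the next slides discuss: the time grows with the thread count instead of staying flat.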

Graph (time vs. number of threads): the ideal curve shows no speedup because of the sequential bottleneck. Ideally the curve should stay flat. Why? Because we have a sequential bottleneck, so no matter how many threads we add running in parallel, we will not get any speedup (remember Amdahl’s Law).

Mystery #1 (TAS lock vs. ideal): what is going on? The curve for the TAS lock looks like this. In fact, if you do the experiment, beyond a certain number of processors you have to give up because it takes so long. What is happening? Let us try this again.

Test-and-Test-and-Set Locks. Lurking stage: wait until the lock “looks” free, spinning while the read returns true (lock taken). Pouncing stage: as soon as the lock “looks” available (the read returns false, lock free), call TAS to acquire the lock; if TAS loses, go back to lurking. Let’s try a slightly different approach. Instead of repeatedly trying to test-and-set the lock, let’s split the locking method into two phases. In the lurking phase, we wait until the lock looks like it’s free, and when it’s free, we pounce, attempting to acquire the lock by a call to test-and-set. If we win, we’re in; if we lose, we go back to lurking.

Test-and-Test-and-Set Lock. Here is what the code looks like. First we spin on the value, repeatedly reading it until it looks like the lock is free; we don’t try to modify it, we just read it. As soon as it looks like the lock is free, we call test-and-set to try to acquire it. If we are first and we succeed, the lock() method returns; otherwise, if someone else got there before us, we go back to lurking, to repeatedly rereading the variable:

  class TTASlock {
    AtomicBoolean state = new AtomicBoolean(false);
    void lock() {
      while (true) {
        while (state.get()) {}        // wait until lock looks free
        if (!state.getAndSet(true))   // then try to acquire it
          return;
      }
    }
    void unlock() {
      state.set(false);
    }
  }

Mystery #2 (TAS lock vs. TTAS lock vs. ideal): the difference is dramatic. The TTAS lock performs much better than the TAS lock, but still much worse than we would expect from an ideal lock.

Mystery. Both TAS and TTAS do the same thing (in our model), except that TTAS performs much better than TAS, and neither approaches the ideal. There are two mysteries here: why is the TTAS lock so good (that is, so much better than TAS), and why is it so bad (so much worse than ideal)?

Opinion: our memory abstraction is broken. The TAS and TTAS methods are provably the same (in our model), except they aren’t (in field tests). We need a more detailed model… From everything we told you in the earlier section on mutual exclusion, you would expect the TAS and TTAS locks to be the same; after all, they are logically equivalent programs. In fact, they are equivalent with respect to correctness (they both work), but very different with respect to performance. The problem here is that the shared-memory abstraction is broken with respect to performance. If you don’t understand the underlying architecture, you will never understand why your reasonable-looking programs are so slow. It’s a little like not being able to drive a car unless you know how an automatic transmission works, but there it is.

Bus-Based Architectures. We are going to do yet another review of multiprocessor architecture: processors with per-processor caches, connected to a shared memory over a bus.

Bus-Based Architectures: random-access memory (tens of cycles). The processors share a memory that has a high latency, say 50 to 100 cycles to read or write a value. This means that while you are waiting for the memory to respond, you could have executed that many instructions.

Bus-Based Architectures: the shared bus. The bus is a broadcast medium with one broadcaster at a time; processors and memory all “snoop”. Processors communicate with the memory and with one another over the shared bus, meaning that only one processor at a time can send a message, although everyone can passively listen.

Bus-Based Architectures: per-processor caches. They are small, fast (1 or 2 cycles), and hold address and state information. Each processor has a cache, a small high-speed memory where the processor keeps data likely to be of interest. A cache access typically requires one or two machine cycles, while a memory access typically requires tens of machine cycles. Technology trends are making this contrast more extreme: although both processor cycle times and memory access times are becoming faster, the cycle times are improving faster than the memory access times, so cache performance is critical to the overall performance of a multiprocessor architecture.

Jargon Watch. Cache hit: “I found what I wanted in my cache.” Good Thing™. If a processor finds data in its cache, then it doesn’t have to go all the way to memory. Cache miss: “I had to shlep all the way to memory for that data.” Bad Thing™. If the processor doesn’t find what it wants in its cache, then we have a cache miss, which is very time-consuming. How well a synchronization protocol or concurrent algorithm performs is largely determined by its cache behavior: how many hits and misses.

Cave Canem. This model is still a simplification, but not in any essential way: it illustrates the basic principles, and we will discuss the complexities later. Real cache coherence protocols can be very complex. For example, modern multiprocessors have multi-level caches, where each processor has an on-chip level-one (L1) cache, and clusters of processors share a level-two (L2) cache. The L2 cache is on-chip in some modern architectures, and off-chip in others, a detail that greatly changes the observed performance. We are going to avoid going into too much detail here because the basic principles don’t depend on that level of detail.

Processor Issues Load Request. Here is one thing that can happen when a processor issues a load request. It broadcasts a message over the bus asking for the data it needs (“Gimme data”). Notice that while it is broadcasting, no one else can use the bus.

Memory Responds. In this case, the memory responds to the request (“Got your data right here”), also over the bus.

Processor Issues Load Request. Now suppose another processor issues a load request for the same data. It broadcasts its request over the bus. This time, however, the request is picked up by the first processor, which has the data in its cache (“I got data”). Usually, when a processor has the data cached, it, rather than the memory, will respond to load requests.

Other Processor Responds. The processor puts the data on the bus. Now both processors have the same data cached.

Modify Cached Data. Now what happens if the red processor decides to modify the cached data? It changes the copy in its cache. Now we have a problem: the data cached at the red processor disagrees with the copies of that memory location stored both at the other processors and in the memory itself. What’s up with the other copies?

Cache Coherence. We have lots of copies of the data: the original copy in memory, and cached copies at the processors. Some processor modifies its own copy. What do we do with the others? How do we avoid confusion? The problem of keeping track of multiple copies of the same data is called the cache coherence problem, and ways to solve it are called cache coherence protocols.

Write-Back Caches. Accumulate changes in the cache, and write back when needed: when the cache is needed for something else, or when another processor wants the data. On the first modification, invalidate the other entries; this requires a non-trivial protocol… A write-back coherence protocol sends out an invalidation message when the value is first modified, instructing the other processors to discard that value from their caches. Once the processor has invalidated the other cached values, it can make subsequent modifications without further bus traffic. A value that has been modified in the cache but not written back is called dirty. If the processor needs to use the cache for another value, however, it must remember to write back any dirty values.

Write-Back Caches. Each cache entry has three states. Invalid: contains raw seething bits; its contents are meaningless. Valid: the processor can read the value, but does not have permission to write it, because it may be cached elsewhere. Dirty: the data has been modified, and must be written back before that cache entry can be reused. The cache also intercepts other processors’ load requests, and writes back to memory before reusing the entry.
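The three cache-entry states described above can be sketched as a tiny state machine. This is an illustrative simplification in the spirit of the slides, not a real protocol (real coherence protocols such as MESI have more states and model bus traffic), and all names here are hypothetical.

```java
// Illustrative sketch of the three cache-line states described above.
// Real coherence protocols (e.g. MESI) are richer; all names are hypothetical.
class CacheLine {
    enum State { INVALID, VALID, DIRTY }

    State state = State.INVALID;
    int data;

    // Load: on a miss, fetch the value (from memory or another cache,
    // which would require a bus broadcast) and become VALID.
    int load(int fetchedValue) {
        if (state == State.INVALID) {
            data = fetchedValue;   // cache miss: go to the bus
            state = State.VALID;
        }
        return data;               // cache hit otherwise: no bus traffic
    }

    // Store: the first modification would broadcast an invalidation
    // (not modeled here) and marks the line DIRTY; later stores to a
    // dirty line cause no bus traffic at all.
    void store(int newValue) {
        data = newValue;
        state = State.DIRTY;
    }

    // Eviction, or a remote request for the data, forces a dirty line
    // to surrender its value and be written back.
    int writeBack() {
        state = State.INVALID;
        return data;
    }
}
```

The point of the sketch is the slides’ observation: once a line is dirty, repeated writes are free, and the write-back is deferred until someone else actually needs the value.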

Art of Multiprocessor Programming Invalidate data data cache Let’s rewind back to the moment when the red processor updated its cached data. Bus memory data Art of Multiprocessor Programming

Art of Multiprocessor Programming Invalidate Mine, all mine! data data cache It broadcasts an invalidation message warning the other processors to invalidate, or discard their cached copies of that data. Bus Bus memory data Art of Multiprocessor Programming

Art of Multiprocessor Programming Invalidate Uh-oh cache data data cache When the other processors hear the invalidation message, they set their caches to the invalid state. Bus Bus memory data Art of Multiprocessor Programming

Invalidate memory Other caches lose read permission cache data cache From this point on, the red processor can update that data value without causing any bus traffic, because it knows that it has the only cached copy. This is much more efficient than a write-through cache because it produces much less bus traffic. Bus memory data Art of Multiprocessor Programming

Invalidate memory Other caches lose read permission data cache From this point on, the red processor can update that data value without causing any bus traffic, because it knows that it has the only cached copy. This is much more efficient than a write-through cache because it produces much less bus traffic. Bus This cache acquires write permission memory data Art of Multiprocessor Programming

Art of Multiprocessor Programming Invalidate Memory provides data only if not present in any cache, so no need to change it now (expensive) cache data cache Finally, there is no need to update memory until the processor wants to use that cache space for something else. Any other processor that asks for the data will get it from the red processor. Bus memory data Art of Multiprocessor Programming (2)

Another Processor Asks for Data cache data cache If another processor wants the data, it asks for it over the bus. Bus Bus memory data Art of Multiprocessor Programming (2)

Art of Multiprocessor Programming Owner Responds Here it is! cache data data cache And the owner responds by sending the data over. Bus Bus memory data Art of Multiprocessor Programming (2)

Art of Multiprocessor Programming End of the Day … data data data cache Bus memory data Reading OK, no writing Art of Multiprocessor Programming (1)

Art of Multiprocessor Programming Mutual Exclusion What do we want to optimize? Bus bandwidth used by spinning threads Release/Acquire latency Acquire latency for idle lock Note that optimizing a spin lock is not a simple question, because we have to figure out exactly what we want to optimize: whether it’s the bus bandwidth used by spinning threads, the latency of lock acquisition or release, or whether we mostly care about uncontended locks. Art of Multiprocessor Programming

Art of Multiprocessor Programming Simple TASLock TAS invalidates cache lines Spinners Miss in cache Go to bus Thread wants to release lock delayed behind spinners We now consider how the simple test-and-set algorithm performs using a bus-based write-back cache (the most common case in practice). Each test-and-set call goes over the bus, and since all of the waiting threads are continually using the bus, all threads, even those not waiting for the lock, must wait to use the bus for each memory access. Even worse, the test-and-set call invalidates all cached copies of the lock, so every spinning thread encounters a cache miss almost every time, and has to use the bus to fetch the new, but unchanged value. Adding insult to injury, when the thread holding the lock tries to release it, it may be delayed waiting to use the bus that is monopolized by the spinners. We now understand why the TAS lock performs so poorly. Art of Multiprocessor Programming
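The simple TAS lock under discussion can be sketched in a few lines of Java, with `AtomicBoolean` standing in for the hardware test-and-set primitive (the `isLocked` helper is added here only for testing):

```java
import java.util.concurrent.atomic.AtomicBoolean;

class TASLock {
    private final AtomicBoolean state = new AtomicBoolean(false);

    public void lock() {
        // Every getAndSet call is a bus transaction and invalidates all
        // cached copies of the lock, whether or not the call succeeds --
        // this is exactly the behavior the slide describes.
        while (state.getAndSet(true)) {}
    }

    public void unlock() {
        state.set(false);
    }

    // Exposed for testing only.
    boolean isLocked() { return state.get(); }
}
```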

Test-and-test-and-set Wait until lock “looks” free Spin on local cache No bus use while lock busy Problem: when lock is released Invalidation storm … Now consider the behavior of the TTAS lock algorithm while the lock is held by a thread A. The first time thread B reads the lock it takes a cache miss, forcing B to block while the value is loaded into B's cache. As long as A holds the lock, B repeatedly rereads the value, but each time B hits in its cache (finding the desired value). B thus produces no bus traffic, and does not slow down other threads' memory accesses. Moreover, a thread that releases a lock is not delayed by threads spinning on that lock. Art of Multiprocessor Programming
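The test-and-test-and-set idea described above can be sketched as follows (again with `AtomicBoolean` standing in for the hardware primitive, and `isLocked` added only for testing):

```java
import java.util.concurrent.atomic.AtomicBoolean;

class TTASLock {
    private final AtomicBoolean state = new AtomicBoolean(false);

    public void lock() {
        while (true) {
            // Spin on the locally cached copy: plain reads cause no bus
            // traffic while the lock is held.
            while (state.get()) {}
            // The lock looked free: only now pay for a test-and-set.
            // If we lose the race, go back to local spinning.
            if (!state.getAndSet(true)) return;
        }
    }

    public void unlock() {
        state.set(false);
    }

    // Exposed for testing only.
    boolean isLocked() { return state.get(); }
}
```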

Local Spinning while Lock is Busy While the lock is held, all the contenders spin in their caches, rereading cached data without causing any bus traffic. Bus memory busy Art of Multiprocessor Programming

Art of Multiprocessor Programming On Release invalid invalid free Things deteriorate, however, when the lock is released. The lock holder releases the lock by writing false to the lock variable… Bus memory free Art of Multiprocessor Programming

On Release memory Everyone misses, rereads miss miss free invalid … which immediately invalidates the spinners' cached copies. Each one takes a cache miss, rereads the new value, Bus memory free Art of Multiprocessor Programming (1)

Art of Multiprocessor Programming On Release Everyone tries TAS TAS(…) TAS(…) invalid invalid free and they all (more-or-less simultaneously) call test-and-set to acquire the lock. The first to succeed invalidates the others, who must then reread the value, causing a storm of bus traffic. Bus memory free Art of Multiprocessor Programming (1)

Art of Multiprocessor Programming Problems Everyone misses Reads satisfied sequentially Everyone does TAS Invalidates others’ caches Eventually quiesces after lock acquired How long does this take? Eventually, the processors settle down once again to local spinning. This explains why the TTAS lock still falls short of the ideal lock. How long does this take? Art of Multiprocessor Programming

Measuring Quiescence Time X = time of ops that don’t use the bus Y = time of ops that cause intensive bus traffic P1 P2 Pn In critical section, run ops X then ops Y. As long as quiescence time is less than X, no drop in performance. This is the classical experiment conducted by Anderson. We decrease X until the bus-intensive operations in Y interleave with the quiescing of the lock release operation, at which point we will see a drop in throughput or an increase in latency. By gradually varying X, we can determine the exact time to quiesce. Art of Multiprocessor Programming

Art of Multiprocessor Programming Quiescence Time Increases linearly with the number of processors on a bus architecture time threads Art of Multiprocessor Programming

Mystery Explained time threads TAS lock TTAS lock Ideal time So now we understand why the TTAS lock performs much better than the TAS lock, but still much worse than an ideal lock. threads Better than TAS but still not as good as ideal Art of Multiprocessor Programming

Solution: Introduce Delay If the lock looks free But I fail to get it There must be lots of contention Better to back off than to collide again Another approach is to introduce a delay. If the lock looks free but you can’t get it, then instead of trying again right away, it makes sense to back off a little and let things calm down. There are several places we could introduce a delay. Art of Multiprocessor Programming

Dynamic Example: Exponential Backoff If I fail to get lock wait random duration before retry Each subsequent failure doubles expected wait For how long should the thread back off before retrying? A good rule of thumb is that the larger the number of unsuccessful tries, the higher the likely contention, and the longer the thread should back off. Here is a simple approach. Whenever the thread sees the lock has become free but fails to acquire it, it backs off before retrying. To ensure that concurrent conflicting threads do not fall into ``lock-step'', all trying to acquire the lock at the same time, the thread backs off for a random duration. Each time the thread tries and fails to get the lock, it doubles the expected time it backs off, up to a fixed maximum. Art of Multiprocessor Programming

Exponential Backoff Lock public class Backoff implements Lock { public void lock() { int delay = MIN_DELAY; while (true) { while (state.get()) {} if (!state.getAndSet(true)) return; sleep(random() % delay); if (delay < MAX_DELAY) delay = 2 * delay; }}} Here is the code for an exponential back-off lock. Art of Multiprocessor Programming

Exponential Backoff Lock public class Backoff implements Lock { public void lock() { int delay = MIN_DELAY; while (true) { while (state.get()) {} if (!state.getAndSet(true)) return; sleep(random() % delay); if (delay < MAX_DELAY) delay = 2 * delay; }}} The constant MIN_DELAY indicates the initial, shortest limit (it makes no sense for the thread to back off for too short a duration), Fix minimum delay Art of Multiprocessor Programming

Exponential Backoff Lock public class Backoff implements Lock { public void lock() { int delay = MIN_DELAY; while (true) { while (state.get()) {} if (!state.getAndSet(true)) return; sleep(random() % delay); if (delay < MAX_DELAY) delay = 2 * delay; }}} As in the TTAS algorithm, the thread spins testing the lock until the lock appears to be free. Wait until lock looks free Art of Multiprocessor Programming

Exponential Backoff Lock public class Backoff implements Lock { public void lock() { int delay = MIN_DELAY; while (true) { while (state.get()) {} if (!state.getAndSet(true)) return; sleep(random() % delay); if (delay < MAX_DELAY) delay = 2 * delay; }}} Then the thread tries to acquire the lock. If we win, return Art of Multiprocessor Programming

Exponential Backoff Lock public class Backoff implements Lock { public void lock() { int delay = MIN_DELAY; while (true) { while (state.get()) {} if (!state.getAndSet(true)) return; sleep(random() % delay); if (delay < MAX_DELAY) delay = 2 * delay; }}} Back off for random duration If it fails, then it computes a random delay between zero and the current limit, and then sleeps for that delay before retrying. Art of Multiprocessor Programming

Exponential Backoff Lock public class Backoff implements Lock { public void lock() { int delay = MIN_DELAY; while (true) { while (state.get()) {} if (!state.getAndSet(true)) return; sleep(random() % delay); if (delay < MAX_DELAY) delay = 2 * delay; }}} Double max delay, within reason It doubles the limit for the next back-off, up to MAX_DELAY. It is important to note that the thread backs off only when it fails to acquire a lock that it had immediately before observed to be free. Observing that the lock is held by another thread says nothing about the level of contention. Art of Multiprocessor Programming
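Assembled into a self-contained class, the slide code might look like the sketch below. The MIN_DELAY and MAX_DELAY values are placeholders; as the next slides note, real values must be tuned per machine. The `isLocked` helper is added only for testing.

```java
import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.atomic.AtomicBoolean;

class BackoffLock {
    // Tuning constants; the values here are illustrative only.
    private static final int MIN_DELAY = 1, MAX_DELAY = 64;
    private final AtomicBoolean state = new AtomicBoolean(false);

    public void lock() {
        int limit = MIN_DELAY;
        while (true) {
            while (state.get()) {}               // wait until lock looks free
            if (!state.getAndSet(true)) return;  // won the race: done
            try {                                // lost: back off for a random duration
                Thread.sleep(ThreadLocalRandom.current().nextInt(limit) + 1);
            } catch (InterruptedException ignored) {}
            if (limit < MAX_DELAY) limit *= 2;   // double the ceiling, within reason
        }
    }

    public void unlock() {
        state.set(false);
    }

    // Exposed for testing only.
    boolean isLocked() { return state.get(); }
}
```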

Spin-Waiting Overhead TTAS Lock time Backoff lock The graph shows that the backoff lock outperforms the TTAS lock, though it is far from the ideal curve which is flat. The slope of the backoff curve varies greatly from one machine to another, but is invariably better than that of a TTAS lock. threads Art of Multiprocessor Programming

Art of Multiprocessor Programming Backoff: Other Issues Good Easy to implement Beats TTAS lock Bad Must choose parameters carefully Not portable across platforms The Backoff lock is easy to implement, and typically performs significantly better than TTAS lock on many architectures. Unfortunately, its performance is sensitive to the choice of minimum and maximum delay constants. To deploy this lock on a particular architecture, it is easy to experiment with different values, and choose the ones that work best. Experience shows, however, that these optimal values are sensitive to the number of processors and their speed, so it is not easy to tune the back-off lock class to be portable across a range of different machines. Art of Multiprocessor Programming

Art of Multiprocessor Programming Idea Avoid useless invalidations By keeping a queue of threads Each thread Notifies next in line Without bothering the others We now explore a different approach to implementing spin locks, one that is a little more complicated than backoff locks, but inherently more portable. One can overcome these drawbacks by having threads form a line, or queue. In a queue, each thread can learn if its turn has arrived by checking whether its predecessor has been served. Invalidation traffic is reduced by having each thread spin on a different location. A queue also allows for better utilization of the critical section since there is no need to guess when to attempt to access it: each thread is notified directly by its predecessor in the queue. Finally, a queue provides first-come-first-served fairness, the same high level of fairness achieved by the Bakery algorithm. We now explore different ways to implement queue locks, a family of locking algorithms that exploit these insights. Art of Multiprocessor Programming

Art of Multiprocessor Programming Anderson Queue Lock idle next flags Here is the Anderson queue lock, a simple array-based queue lock. The threads share an atomic integer tail field, initially zero. To acquire the lock, each thread atomically increments the tail field. Call the resulting value the thread's slot. The slot is used as an index into a Boolean flag array. If flag[j] is true, then the thread with slot j has permission to acquire the lock. Initially, flag[0] is true. T F F F F F F F Art of Multiprocessor Programming

Art of Multiprocessor Programming Anderson Queue Lock acquiring next getAndIncrement flags To acquire the lock, each thread atomically increments the tail field. Call the resulting value the thread's slot. T F F F F F F F Art of Multiprocessor Programming

Art of Multiprocessor Programming Anderson Queue Lock acquiring next getAndIncrement flags T F F F F F F F Art of Multiprocessor Programming

Art of Multiprocessor Programming Anderson Queue Lock acquired next Mine! flags The slot is used as an index into a Boolean flag array. If flag[j] is true, then the thread with slot j has permission to acquire the lock. Initially, flag[0] is true. To acquire the lock, a thread spins until the flag at its slot becomes true. T F F F F F F F Art of Multiprocessor Programming

Art of Multiprocessor Programming Anderson Queue Lock acquired acquiring next flags Here another thread wants to acquire the lock. T F F F F F F F Art of Multiprocessor Programming

Art of Multiprocessor Programming Anderson Queue Lock acquired acquiring next getAndIncrement flags It applies get-and-increment to the next pointer. T F F F F F F F Art of Multiprocessor Programming

Art of Multiprocessor Programming Anderson Queue Lock acquired acquiring next getAndIncrement flags Advances the next pointer to acquire its own slot. T F F F F F F F Art of Multiprocessor Programming

Art of Multiprocessor Programming Anderson Queue Lock acquired acquiring next flags Then it spins until the flag variable at that slot becomes true. T F F F F F F F Art of Multiprocessor Programming

Art of Multiprocessor Programming Anderson Queue Lock released acquired next flags The first thread releases the lock by setting the next slot to true. T T F F F F F F Art of Multiprocessor Programming

Art of Multiprocessor Programming Anderson Queue Lock released acquired next Yow! flags The second thread notices the change, and enters its critical section. T T F F F F F F Art of Multiprocessor Programming

Art of Multiprocessor Programming Anderson Queue Lock class ALock implements Lock { boolean[] flags={true,false,…,false}; AtomicInteger next = new AtomicInteger(0); int[] slot = new int[n]; Here is what the code looks like. Art of Multiprocessor Programming

Art of Multiprocessor Programming Anderson Queue Lock class ALock implements Lock { boolean[] flags={true,false,…,false}; AtomicInteger next = new AtomicInteger(0); int[] slot = new int[n]; One flag per thread We have one flag per thread (this means we have to know how many threads there are). Art of Multiprocessor Programming

Art of Multiprocessor Programming Anderson Queue Lock class ALock implements Lock { boolean[] flags={true,false,…,false}; AtomicInteger next = new AtomicInteger(0); int[] slot = new int[n]; The next field tells us which flag to use. Next flag to use Art of Multiprocessor Programming

Art of Multiprocessor Programming Anderson Queue Lock class ALock implements Lock { boolean[] flags={true,false,…,false}; AtomicInteger next = new AtomicInteger(0); ThreadLocal<Integer> mySlot; Each thread has a thread-local variable that keeps track of its slot. Thread-local variable Art of Multiprocessor Programming

Art of Multiprocessor Programming Anderson Queue Lock public lock() { mySlot = next.getAndIncrement(); while (!flags[mySlot % n]) {}; flags[mySlot % n] = false; } public unlock() { flags[(mySlot+1) % n] = true; Here is the code for the lock and unlock methods. Art of Multiprocessor Programming

Art of Multiprocessor Programming Anderson Queue Lock public lock() { mySlot = next.getAndIncrement(); while (!flags[mySlot % n]) {}; flags[mySlot % n] = false; } public unlock() { flags[(mySlot+1) % n] = true; First, claim a slot by atomically incrementing the next field. Take next slot Art of Multiprocessor Programming

Art of Multiprocessor Programming Anderson Queue Lock public lock() { mySlot = next.getAndIncrement(); while (!flags[mySlot % n]) {}; flags[mySlot % n] = false; } public unlock() { flags[(mySlot+1) % n] = true; Next wait until my predecessor has released the lock. Spin until told to go Art of Multiprocessor Programming

Art of Multiprocessor Programming Anderson Queue Lock public lock() { mySlot = next.getAndIncrement(); while (!flags[mySlot % n]) {}; flags[mySlot % n] = false; } public unlock() { flags[(mySlot+1) % n] = true; Reset my slot to false so that it can be used the next time around. Prepare slot for re-use Art of Multiprocessor Programming

Art of Multiprocessor Programming Anderson Queue Lock public lock() { mySlot = next.getAndIncrement(); while (!flags[mySlot % n]) {}; flags[mySlot % n] = false; } public unlock() { flags[(mySlot+1) % n] = true; Tell next thread to go To release the lock, set the slot after mine to true, being careful to wrap around. Art of Multiprocessor Programming
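The slide fragments above, assembled into one compilable class, look roughly like this sketch. The constructor parameter is an assumption (the slides fix the bound n elsewhere), and the plain boolean array is a simplification: production code needs volatile/padded flag entries so that updates are visible across threads and slots don't share a cache line.

```java
import java.util.concurrent.atomic.AtomicInteger;

class ALock {
    private final AtomicInteger next = new AtomicInteger(0);
    // flags[i] == true means the thread holding slot i may enter.
    // NOTE: a plain boolean[] gives no cross-thread visibility guarantees;
    // this is adequate only as a single-threaded sketch of the logic.
    private final boolean[] flags;
    private final int capacity;   // must bound the number of concurrent threads
    private final ThreadLocal<Integer> mySlot = new ThreadLocal<>();

    ALock(int capacity) {
        this.capacity = capacity;
        this.flags = new boolean[capacity];
        flags[0] = true;          // slot 0 holds the lock initially
    }

    public void lock() {
        int slot = next.getAndIncrement() % capacity;  // claim the next slot
        mySlot.set(slot);
        while (!flags[slot]) {}   // spin on my own slot only
        flags[slot] = false;      // reset it for the next trip around the array
    }

    public void unlock() {
        flags[(mySlot.get() + 1) % capacity] = true;   // tell the next slot to go
    }
}
```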

Art of Multiprocessor Programming Performance TTAS Shorter handover than backoff Curve is practically flat Scalable performance FIFO fairness The Anderson queue lock improves on back-off locks because it reduces invalidations to a minimum and schedules access to the critical section tightly, minimizing the interval between when a lock is freed by one thread and when it is acquired by another. There is also a theoretical benefit: unlike the TTAS and backoff lock, this algorithm guarantees that there is no lockout, and in fact provides first-come-first-served fairness, which we actually lose in TTAS and TTAS with backoff. queue Art of Multiprocessor Programming

Art of Multiprocessor Programming Anderson Queue Lock Good First truly scalable lock Simple, easy to implement Bad Space hog One bit per thread Unknown number of threads? Small number of actual contenders? The Anderson lock has two disadvantages. First, it is not space-efficient. It requires knowing a bound N on the maximum number of concurrent threads, and it allocates an array of that size per lock. Thus, L locks will require O(LN) space even if a thread accesses only one lock at a given time. Second, the lock is poorly suited for uncached architectures, since any thread may end up spinning on any array location, and in the absence of caches, spinning on a remote location may be very expensive. Art of Multiprocessor Programming

Art of Multiprocessor Programming CLH Lock FIFO order Small, constant-size overhead per thread The CLH queue lock (by Travis Craig, Erik Hagersten, and Anders Landin) is much more space-efficient, since it incurs a small constant-size overhead per thread. Art of Multiprocessor Programming

Art of Multiprocessor Programming Initially idle This algorithm records each thread's status in a QNode object, which has a Boolean locked field. If that field is true, then the corresponding thread either has acquired the lock or is waiting for the lock. If that field is false, then the thread has released the lock. The lock itself is represented as a virtual linked list of QNode objects. We use the term ``virtual'' because the list is implicit: each thread points to its predecessor through a thread-local pred variable. The public tail variable points to the last node in the queue. tail false Art of Multiprocessor Programming

Art of Multiprocessor Programming Initially idle tail false Queue tail Art of Multiprocessor Programming

Art of Multiprocessor Programming Initially idle Lock is free tail false Art of Multiprocessor Programming

Art of Multiprocessor Programming Initially idle tail false Art of Multiprocessor Programming

Art of Multiprocessor Programming Purple Wants the Lock acquiring tail false Art of Multiprocessor Programming

Art of Multiprocessor Programming Purple Wants the Lock acquiring To acquire the lock, a thread sets the locked field of its QNode to true, meaning that the thread is not ready to release the lock. tail false true Art of Multiprocessor Programming

Art of Multiprocessor Programming Purple Wants the Lock acquiring Swap The thread applies swap to the tail to make its own node the tail of the queue, simultaneously acquiring a reference to its predecessor's QNode. tail false true Art of Multiprocessor Programming

Art of Multiprocessor Programming Purple Has the Lock acquired Because it sees that its predecessor’s QNode is false, this thread now has the lock. tail false true Art of Multiprocessor Programming

Art of Multiprocessor Programming Red Wants the Lock acquired acquiring Another thread that wants the lock does the same sequence of steps … tail false true true Art of Multiprocessor Programming

Art of Multiprocessor Programming Red Wants the Lock acquired acquiring Swap tail false true true Art of Multiprocessor Programming

Art of Multiprocessor Programming Red Wants the Lock acquired acquiring Notice that the list is ordered implicitly; there are no real pointers between the nodes… tail false true true Art of Multiprocessor Programming

Art of Multiprocessor Programming Red Wants the Lock acquired acquiring Notice that the list is ordered implicitly; there are no real pointers between the nodes… tail false true true Art of Multiprocessor Programming

Art of Multiprocessor Programming Red Wants the Lock acquired acquiring Implicitly linked list Notice that the list is ordered implicitly; there are no real pointers between the nodes… tail false true true Art of Multiprocessor Programming

Art of Multiprocessor Programming Red Wants the Lock acquired acquiring The thread then spins on the predecessor's QNode until the predecessor releases the lock. tail false true true Art of Multiprocessor Programming

Art of Multiprocessor Programming Red Wants the Lock acquired acquiring Actually, it spins on a cached copy true The thread actually spins on a cached copy of purple's node, which is very efficient in terms of interconnect traffic. tail false true true Art of Multiprocessor Programming

Art of Multiprocessor Programming Purple Releases release acquiring Bingo! false The spinning thread actually spins on a cached copy of purple's node, which is very efficient in terms of interconnect traffic. In fact, in some coherence protocols shared memory might not be updated at all, only the cached copy. This is very efficient. tail false false true Art of Multiprocessor Programming

Art of Multiprocessor Programming Purple Releases released acquired When a thread acquires a lock, it can reuse its predecessor's QNode as its new node for future lock accesses. Note that it can do so since at this point the thread's predecessor's QNode will no longer be used by the predecessor, and the thread's old QNode is pointed-to either by the thread's successor or by the tail. tail true Art of Multiprocessor Programming

Art of Multiprocessor Programming Space Usage Let L = number of locks N = number of threads ALock O(LN) CLH lock O(L+N) The reuse of the QNode's means that for $L$ locks, if each thread accesses at most one lock at a time, we only need O(L+N) space as compared with O(LN) for the ALock. Art of Multiprocessor Programming

Art of Multiprocessor Programming CLH Queue Lock class Qnode { AtomicBoolean locked = new AtomicBoolean(true); } Here is what the node looks like as a Java object. Art of Multiprocessor Programming

Art of Multiprocessor Programming CLH Queue Lock class Qnode { AtomicBoolean locked = new AtomicBoolean(true); } If the locked field is true, the lock has not been released yet (it may also not have been acquired yet either). Not released yet Art of Multiprocessor Programming

Art of Multiprocessor Programming CLH Queue Lock class CLHLock implements Lock { AtomicReference<Qnode> tail; ThreadLocal<Qnode> myNode = new Qnode(); public void lock() { Qnode pred = tail.getAndSet(myNode); while (pred.locked.get()) {} }} Here is the code for lock(). Art of Multiprocessor Programming (3)

Art of Multiprocessor Programming CLH Queue Lock class CLHLock implements Lock { AtomicReference<Qnode> tail; ThreadLocal<Qnode> myNode = new Qnode(); public void lock() { Qnode pred = tail.getAndSet(myNode); while (pred.locked.get()) {} }} Here is the code for lock(). Tail of the queue Art of Multiprocessor Programming (3)

Art of Multiprocessor Programming CLH Queue Lock class CLHLock implements Lock { AtomicReference<Qnode> tail; ThreadLocal<Qnode> myNode = new Qnode(); public void lock() { Qnode pred = tail.getAndSet(myNode); while (pred.locked.get()) {} }} Here is the code for lock(). Thread-local Qnode Art of Multiprocessor Programming (3)

Art of Multiprocessor Programming CLH Queue Lock class CLHLock implements Lock { AtomicReference<Qnode> tail; ThreadLocal<Qnode> myNode = new Qnode(); public void lock() { Qnode pred = tail.getAndSet(myNode); while (pred.locked.get()) {} }} Swap in my node Here is the code for lock(). Art of Multiprocessor Programming (3)

CLH Queue Lock Spin until predecessor releases lock class CLHLock implements Lock { AtomicReference<Qnode> tail; ThreadLocal<Qnode> myNode = new Qnode(); public void lock() { Qnode pred = tail.getAndSet(myNode); while (pred.locked.get()) {} }} Spin until predecessor releases lock Here is the code for lock(). Art of Multiprocessor Programming (3)

Art of Multiprocessor Programming CLH Queue Lock class CLHLock implements Lock { … public void unlock() { myNode.locked.set(false); myNode = pred; } Art of Multiprocessor Programming (3)

Art of Multiprocessor Programming CLH Queue Lock class CLHLock implements Lock { … public void unlock() { myNode.locked.set(false); myNode = pred; } Notify successor Art of Multiprocessor Programming (3)

CLH Queue Lock Recycle predecessor’s node class CLHLock implements Lock { … public void unlock() { myNode.locked.set(false); myNode = pred; } Recycle predecessor’s node Art of Multiprocessor Programming (3)

Art of Multiprocessor Programming CLH Queue Lock class CLHLock implements Lock { … public void unlock() { myNode.locked.set(false); myNode = pred; } (notice that we actually don’t reuse myNode. Code in book shows how it’s done.) Art of Multiprocessor Programming (3)
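Since the slide code omits the node recycling it mentions, here is a complete sketch along the lines of the book's version. A second thread-local (`myPred`) remembers the predecessor so unlock can recycle it; a volatile boolean replaces the slides' AtomicBoolean, which is equivalent for this purpose:

```java
import java.util.concurrent.atomic.AtomicReference;

class CLHLock {
    static class QNode {
        volatile boolean locked;   // true: holder or waiter; false: released
    }

    // The tail starts as a dummy node with locked == false (lock is free).
    private final AtomicReference<QNode> tail = new AtomicReference<>(new QNode());
    private final ThreadLocal<QNode> myNode = ThreadLocal.withInitial(QNode::new);
    private final ThreadLocal<QNode> myPred = new ThreadLocal<>();

    public void lock() {
        QNode node = myNode.get();
        node.locked = true;                  // I am holding or waiting for the lock
        QNode pred = tail.getAndSet(node);   // splice myself onto the virtual queue
        myPred.set(pred);
        while (pred.locked) {}               // spin on my predecessor's node
    }

    public void unlock() {
        myNode.get().locked = false;         // notify my successor
        myNode.set(myPred.get());            // recycle my predecessor's node
    }
}
```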

Art of Multiprocessor Programming CLH Lock Good Lock release affects successor only Small, constant-sized space Bad Doesn’t work for uncached NUMA architectures Like the ALock, this algorithm has each thread spin on a distinct location, so when one thread releases its lock, it invalidates its successor's cache only, and does not invalidate any other thread's cache. It does so with a lower space overhead, and, importantly, without requiring prior knowledge of the number of threads accessing a lock. It also provides first-come-first-served fairness. Art of Multiprocessor Programming

Art of Multiprocessor Programming NUMA Architectures Acronym: Non-Uniform Memory Architecture Illusion: Flat shared memory Truth: No caches (sometimes) Some memory regions faster than others The principal disadvantage of this lock algorithm is that it performs poorly on cacheless NUMA architectures. Each thread spins waiting for its predecessor's node to become false. If this memory location is remote, then performance will suffer. On cache-coherent architectures, however, this approach should work well. Art of Multiprocessor Programming

Spinning on local memory is fast NUMA Machines Spinning on local memory is fast Art of Multiprocessor Programming

Spinning on remote memory is slow NUMA Machines Spinning on remote memory is slow Art of Multiprocessor Programming

Art of Multiprocessor Programming CLH Lock Each thread spins on predecessor’s memory Could be far away … Art of Multiprocessor Programming

Art of Multiprocessor Programming MCS Lock FIFO order Spin on local memory only Small, Constant-size overhead The MCS lock is another kind of queue lock that ensures that processes always spin on a fixed location, so this one works well for cacheless architectures. Like the CLH lock, it uses only a small fixed-size overhead per thread. Art of Multiprocessor Programming

Art of Multiprocessor Programming Initially idle Here, too, the lock is represented as a linked list of QNodes, where each QNode represents either a lock holder or a thread waiting to acquire the lock. Unlike the CLH lock, the list is explicit, not virtual. tail false false Art of Multiprocessor Programming

Art of Multiprocessor Programming Acquiring acquiring (allocate Qnode) true To acquire the lock, a thread places its own QNode at the tail of the list. (Notice that the setting of the value to true is a simplification of the presentation. In the code, we allocate the node with its value false so that we can get out of the while loop immediately if there is no predecessor. But if there is a predecessor, a thread sets the value of its flag to true before it redirects the other thread's pointer to it, so in effect it's true whenever it's used... thus, in order not to confuse the viewer with this minor issue, we diverge slightly from the code and describe the node as being true here.) queue false false Art of Multiprocessor Programming

Art of Multiprocessor Programming Acquiring acquired true swap The thread then swaps in a reference to its own QNode. (Notice that the setting of the value to true is a simplification of the presentation. In the code, we allocate the node with its value false so that we can get out of the while loop immediately if there is no predecessor. But if there is a predecessor, a thread sets the value of its flag to true before it redirects the other thread's pointer to it, so in effect it's true whenever it's used... thus, in order not to confuse the viewer with this minor issue, we diverge slightly from the code and describe the node as being true here.) tail false false Art of Multiprocessor Programming

Art of Multiprocessor Programming Acquiring acquired true At this point the swap is completed, and the queue variable points to the tail of the queue. tail false false Art of Multiprocessor Programming

Art of Multiprocessor Programming Acquired acquired true To acquire the lock, a thread places its own QNode at the tail of the list. If it has a predecessor, it modifies the predecessor's node to refer back to its own QNode. tail false false Art of Multiprocessor Programming

Art of Multiprocessor Programming Acquiring acquiring acquired false tail true swap Art of Multiprocessor Programming

Art of Multiprocessor Programming Acquiring acquiring acquired false tail true Art of Multiprocessor Programming

Art of Multiprocessor Programming Acquiring acquiring acquired false tail true Art of Multiprocessor Programming

Art of Multiprocessor Programming Acquiring acquiring acquired false tail true Art of Multiprocessor Programming

Art of Multiprocessor Programming Acquiring acquiring acquired true tail false true Art of Multiprocessor Programming

Art of Multiprocessor Programming Acquiring acquiring acquired Yes! true tail false true Art of Multiprocessor Programming

Art of Multiprocessor Programming MCS Queue Lock class Qnode { boolean locked = false; qnode next = null; } Art of Multiprocessor Programming

Art of Multiprocessor Programming MCS Queue Lock class MCSLock implements Lock { AtomicReference tail; public void lock() { Qnode qnode = new Qnode(); Qnode pred = tail.getAndSet(qnode); if (pred != null) { qnode.locked = true; pred.next = qnode; while (qnode.locked) {} }}} Art of Multiprocessor Programming (3)

Art of Multiprocessor Programming MCS Queue Lock class MCSLock implements Lock { AtomicReference<Qnode> tail; public void lock() { Qnode qnode = new Qnode(); Qnode pred = tail.getAndSet(qnode); if (pred != null) { qnode.locked = true; pred.next = qnode; while (qnode.locked) {} }}} Make a QNode

MCS Queue Lock class MCSLock implements Lock { AtomicReference<Qnode> tail; public void lock() { Qnode qnode = new Qnode(); Qnode pred = tail.getAndSet(qnode); if (pred != null) { qnode.locked = true; pred.next = qnode; while (qnode.locked) {} }}} Add my node to the tail of the queue

MCS Queue Lock class MCSLock implements Lock { AtomicReference<Qnode> tail; public void lock() { Qnode qnode = new Qnode(); Qnode pred = tail.getAndSet(qnode); if (pred != null) { qnode.locked = true; pred.next = qnode; while (qnode.locked) {} }}} Fix if queue was non-empty

Art of Multiprocessor Programming MCS Queue Lock class MCSLock implements Lock { AtomicReference<Qnode> tail; public void lock() { Qnode qnode = new Qnode(); Qnode pred = tail.getAndSet(qnode); if (pred != null) { qnode.locked = true; pred.next = qnode; while (qnode.locked) {} }}} Wait until unlocked

Art of Multiprocessor Programming MCS Queue Unlock class MCSLock implements Lock { AtomicReference<Qnode> tail; public void unlock() { if (qnode.next == null) { if (tail.CAS(qnode, null)) return; while (qnode.next == null) {} } qnode.next.locked = false; }}

Art of Multiprocessor Programming MCS Queue Unlock class MCSLock implements Lock { AtomicReference<Qnode> tail; public void unlock() { if (qnode.next == null) { if (tail.CAS(qnode, null)) return; while (qnode.next == null) {} } qnode.next.locked = false; }} Missing successor?

MCS Queue Unlock class MCSLock implements Lock { AtomicReference<Qnode> tail; public void unlock() { if (qnode.next == null) { if (tail.CAS(qnode, null)) return; while (qnode.next == null) {} } qnode.next.locked = false; }} If really no successor, return

MCS Queue Unlock class MCSLock implements Lock { AtomicReference<Qnode> tail; public void unlock() { if (qnode.next == null) { if (tail.CAS(qnode, null)) return; while (qnode.next == null) {} } qnode.next.locked = false; }} Otherwise wait for successor to catch up

Art of Multiprocessor Programming MCS Queue Unlock class MCSLock implements Lock { AtomicReference<Qnode> tail; public void unlock() { if (qnode.next == null) { if (tail.CAS(qnode, null)) return; while (qnode.next == null) {} } qnode.next.locked = false; }} Pass lock to successor
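Assembled from the fragments above, a minimal runnable sketch of the MCS lock might look like the following. The ThreadLocal myNode (which the unlock() slides leave implicit), the volatile fields needed for visibility in Java, and the run() test harness are additions here, not part of the slides:

```java
import java.util.concurrent.atomic.AtomicReference;

public class MCSDemo {
    static class QNode {
        volatile boolean locked = false;
        volatile QNode next = null;
    }

    static class MCSLock {
        final AtomicReference<QNode> tail = new AtomicReference<>(null);
        // the slides leave this implicit: each thread remembers its own node
        final ThreadLocal<QNode> myNode = ThreadLocal.withInitial(QNode::new);

        public void lock() {
            QNode qnode = myNode.get();
            QNode pred = tail.getAndSet(qnode);  // swap myself in as the tail
            if (pred != null) {
                qnode.locked = true;
                pred.next = qnode;               // link predecessor to me
                while (qnode.locked) {}          // spin on my OWN node
            }
        }

        public void unlock() {
            QNode qnode = myNode.get();
            if (qnode.next == null) {
                // no visible successor: try to reset tail to null
                if (tail.compareAndSet(qnode, null)) return;
                // a successor swapped the tail but has not linked in yet: wait
                while (qnode.next == null) {}
            }
            qnode.next.locked = false;           // pass the lock on
            qnode.next = null;                   // recycle my node
        }
    }

    // run `threads` threads, each incrementing a shared counter `iters` times
    static int run(int threads, int iters) {
        MCSLock lock = new MCSLock();
        int[] counter = {0};
        Thread[] ts = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            ts[i] = new Thread(() -> {
                for (int j = 0; j < iters; j++) {
                    lock.lock();
                    counter[0]++;                // protected by the lock
                    lock.unlock();
                }
            });
            ts[i].start();
        }
        try {
            for (Thread t : ts) t.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return counter[0];
    }

    public static void main(String[] args) {
        System.out.println(run(4, 10000));       // expect 40000
    }
}
```

Note the key property the slides emphasize: each thread spins on the locked field of its own node, so on a cache-coherent machine each spinner waits on a distinct cache line.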

Art of Multiprocessor Programming Purple Release releasing swap false false Art of Multiprocessor Programming (2)

Purple Release releasing swap By looking at the queue, I see another thread is active releasing swap false false Art of Multiprocessor Programming (2)

Purple Release releasing swap By looking at the queue, I see another thread is active releasing swap false false I have to wait for that thread to finish Art of Multiprocessor Programming (2)

Art of Multiprocessor Programming Purple Release releasing prepare to spin true false Art of Multiprocessor Programming

Art of Multiprocessor Programming Purple Release releasing spinning true false Art of Multiprocessor Programming

Art of Multiprocessor Programming Purple Release releasing spinning true false false Art of Multiprocessor Programming

Art of Multiprocessor Programming Purple Release releasing Acquired lock true false false Art of Multiprocessor Programming

Art of Multiprocessor Programming Abortable Locks What if you want to give up waiting for a lock? For example: a timeout, or a database transaction aborted by the user. The spin-lock constructions we have studied until now provide first-come-first-served access with little contention, making them useful in many applications. In real-time systems, however, threads may require the ability to abort, canceling a pending request to acquire a lock in order to meet some other real-time deadline.

Art of Multiprocessor Programming Back-off Lock Aborting is trivial: just return from the lock() call. Extra benefits: no cleaning up, wait-free, immediate return. Aborting a backoff lock request is trivial: if the request takes too long, simply return from the lock() method. The lock's state is unaffected by any thread's abort. A nice property of this lock is that the abort itself is immediate: it is wait-free, requiring only a constant number of steps.
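To illustrate why aborting a backoff lock is wait-free, here is a hypothetical timed tryLock on a TTAS-style backoff lock (the class name, method signature, and backoff parameters are illustrative, not from the slides). Giving up is just a return: the aborting thread touches no shared state, so no cleanup is needed:

```java
import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.atomic.AtomicBoolean;

public class BackoffLock {
    private final AtomicBoolean state = new AtomicBoolean(false);

    // Try to acquire for at most timeoutMillis; aborting is just returning.
    public boolean tryLock(long timeoutMillis) {
        long deadline = System.currentTimeMillis() + timeoutMillis;
        int limit = 1;                            // backoff limit in ms
        while (System.currentTimeMillis() < deadline) {
            while (state.get()) {                 // spin on the cached value
                if (System.currentTimeMillis() >= deadline) {
                    return false;                 // abort: wait-free, no cleanup
                }
            }
            if (!state.getAndSet(true)) {
                return true;                      // acquired
            }
            try {                                 // lost the race: back off
                Thread.sleep(ThreadLocalRandom.current().nextInt(limit) + 1);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
            if (limit < 64) limit *= 2;           // exponential backoff, capped
        }
        return false;                             // timed out between attempts
    }

    public void unlock() {
        state.set(false);
    }
}
```

Contrast this with the queue locks that follow: there, a waiting thread occupies a position in a linked list, so it cannot simply walk away.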

Art of Multiprocessor Programming Queue Locks Can’t just quit Thread in line behind will starve Need a graceful way out Art of Multiprocessor Programming

Art of Multiprocessor Programming Queue Locks spinning spinning spinning true true true Art of Multiprocessor Programming

Art of Multiprocessor Programming Queue Locks locked spinning spinning false true true Art of Multiprocessor Programming

Art of Multiprocessor Programming Queue Locks locked spinning false true Art of Multiprocessor Programming

Art of Multiprocessor Programming Queue Locks locked false Art of Multiprocessor Programming

Art of Multiprocessor Programming Queue Locks spinning spinning spinning true true true Art of Multiprocessor Programming

Art of Multiprocessor Programming Queue Locks spinning spinning true true true Art of Multiprocessor Programming

Art of Multiprocessor Programming Queue Locks locked spinning false true true Art of Multiprocessor Programming

Art of Multiprocessor Programming Queue Locks spinning false true Art of Multiprocessor Programming

Art of Multiprocessor Programming Queue Locks pwned false true Art of Multiprocessor Programming

Art of Multiprocessor Programming Abortable CLH Lock When a thread gives up Removing node in a wait-free way is hard Idea: let successor deal with it. Removing a node in a wait-free manner from the middle of the list is difficult if we want that thread to reuse its own node. Instead, the aborting thread marks the node so that the successor in the list will reuse the abandoned node, and wait for that node's predecessor. Art of Multiprocessor Programming

Initially idle Pointer to predecessor (or null) tail A Art of Multiprocessor Programming

Initially idle Distinguished available node means lock is free tail A Art of Multiprocessor Programming

Art of Multiprocessor Programming Acquiring acquiring tail A Art of Multiprocessor Programming

Acquiring Null predecessor means lock not released or aborted Each thread allocates a new QNode for each lock access. A Null predecessor value means that the QNode is not aborted and has not yet completed executing the critical section. A Art of Multiprocessor Programming

Art of Multiprocessor Programming Acquiring acquiring Swap Each thread allocates a new QNode for each lock access. A Null predecessor value means that the QNode is not aborted and has not yet completed executing the critical section. A Art of Multiprocessor Programming

Art of Multiprocessor Programming Acquiring acquiring Each thread allocates a new QNode for each lock access. A Null predecessor value means that the QNode is not aborted and has not yet completed executing the critical section. A Art of Multiprocessor Programming

Pointer to AVAILABLE means lock is free. Acquired Pointer to AVAILABLE means lock is free. locked Each thread allocates a new QNode for each lock access. A Null predecessor value means that the QNode is not aborted and has not yet completed executing the critical section. A Art of Multiprocessor Programming

Null means lock is not free & request not aborted Normal Case spinning locked Null means lock is not free & request not aborted Art of Multiprocessor Programming

Art of Multiprocessor Programming One Thread Aborts locked Timed out spinning Art of Multiprocessor Programming

Non-Null means predecessor aborted Successor Notices locked Timed out spinning Non-Null means predecessor aborted Art of Multiprocessor Programming

Recycle Predecessor’s Node locked spinning Art of Multiprocessor Programming

Art of Multiprocessor Programming Spin on Earlier Node locked spinning Art of Multiprocessor Programming

Art of Multiprocessor Programming Spin on Earlier Node released spinning A The lock is now mine Art of Multiprocessor Programming

Art of Multiprocessor Programming Time-out Lock public class TOLock implements Lock { static Qnode AVAILABLE = new Qnode(); AtomicReference<Qnode> tail; ThreadLocal<Qnode> myNode; Art of Multiprocessor Programming

Time-out Lock Distinguished node to signify free lock public class TOLock implements Lock { static Qnode AVAILABLE = new Qnode(); AtomicReference<Qnode> tail; ThreadLocal<Qnode> myNode; Distinguished node to signify free lock Art of Multiprocessor Programming

Art of Multiprocessor Programming Time-out Lock public class TOLock implements Lock { static Qnode AVAILABLE = new Qnode(); AtomicReference<Qnode> tail; ThreadLocal<Qnode> myNode; Tail of the queue Art of Multiprocessor Programming

Art of Multiprocessor Programming Time-out Lock public class TOLock implements Lock { static Qnode AVAILABLE = new Qnode(); AtomicReference<Qnode> tail; ThreadLocal<Qnode> myNode; Remember my node … Art of Multiprocessor Programming

Art of Multiprocessor Programming Time-out Lock public boolean lock(long timeout) { Qnode qnode = new Qnode(); myNode.set(qnode); qnode.prev = null; Qnode myPred = tail.getAndSet(qnode); if (myPred == null || myPred.prev == AVAILABLE) { return true; } …

Time-out Lock Create & initialize node public boolean lock(long timeout) { Qnode qnode = new Qnode(); myNode.set(qnode); qnode.prev = null; Qnode myPred = tail.getAndSet(qnode); if (myPred == null || myPred.prev == AVAILABLE) { return true; } Create & initialize node Art of Multiprocessor Programming

Art of Multiprocessor Programming Time-out Lock public boolean lock(long timeout) { Qnode qnode = new Qnode(); myNode.set(qnode); qnode.prev = null; Qnode myPred = tail.getAndSet(qnode); if (myPred == null || myPred.prev == AVAILABLE) { return true; } Swap with tail Art of Multiprocessor Programming

Time-out Lock If predecessor absent or released, we are done public boolean lock(long timeout) { Qnode qnode = new Qnode(); myNode.set(qnode); qnode.prev = null; Qnode myPred = tail.getAndSet(qnode); if (myPred == null || myPred.prev == AVAILABLE) { return true; } ... If predecessor absent or released, we are done Art of Multiprocessor Programming

Art of Multiprocessor Programming Time-out Lock spinning locked … long start = now(); while (now()- start < timeout) { Qnode predPred = myPred.prev; if (predPred == AVAILABLE) { return true; } else if (predPred != null) { myPred = predPred; } Art of Multiprocessor Programming

Time-out Lock Keep trying for a while … … long start = now(); while (now()- start < timeout) { Qnode predPred = myPred.prev; if (predPred == AVAILABLE) { return true; } else if (predPred != null) { myPred = predPred; } Keep trying for a while … Art of Multiprocessor Programming

Time-out Lock Spin on predecessor’s prev field … long start = now(); while (now()- start < timeout) { Qnode predPred = myPred.prev; if (predPred == AVAILABLE) { return true; } else if (predPred != null) { myPred = predPred; } Spin on predecessor’s prev field Art of Multiprocessor Programming

Time-out Lock Predecessor released lock … long start = now(); while (now()- start < timeout) { Qnode predPred = myPred.prev; if (predPred == AVAILABLE) { return true; } else if (predPred != null) { myPred = predPred; } Predecessor released lock Art of Multiprocessor Programming

Time-out Lock Predecessor aborted, advance one … long start = now(); while (now()- start < timeout) { Qnode predPred = myPred.prev; if (predPred == AVAILABLE) { return true; } else if (predPred != null) { myPred = predPred; } Predecessor aborted, advance one Art of Multiprocessor Programming

Time-out Lock What do I do when I time out? … if (!tail.compareAndSet(qnode, myPred)) qnode.prev = myPred; return false; } What do I do when I time out? Art of Multiprocessor Programming

Art of Multiprocessor Programming Time-out Lock … if (!tail.compareAndSet(qnode, myPred)) qnode.prev = myPred; return false; } When timing out, a thread first checks whether tail still points to its own node. Do I have a successor? If the CAS fails: I do have a successor, so tell it about myPred

Time-out Lock … if (!tail.compareAndSet(qnode, myPred)) qnode.prev = myPred; return false; } Redirecting prev to the predecessor tells the successor that this node belongs to an aborted lock() request and that it must spin on the node's predecessor instead. If CAS succeeds: no successor, simply return false

Art of Multiprocessor Programming Time-Out Unlock public void unlock() { Qnode qnode = myNode.get(); if (!tail.compareAndSet(qnode, null)) qnode.prev = AVAILABLE; } Art of Multiprocessor Programming

Time-out Unlock public void unlock() { Qnode qnode = myNode.get(); if (!tail.compareAndSet(qnode, null)) qnode.prev = AVAILABLE; } If the CAS fails, there is a successor, so I must notify it that the lock is now free. If CAS failed: a successor exists; notify it that it can enter

Art of Multiprocessor Programming Time-out Unlock public void unlock() { Qnode qnode = myNode.get(); if (!tail.compareAndSet(qnode, null)) qnode.prev = AVAILABLE; } If tail points to me, then no one is waiting, and setting it to null completes the unlock(). If CAS succeeds: tail set to null; no cleanup needed since no successor is waiting
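Assembling the TOLock fragments into a runnable sketch: the QNode class with a volatile prev field is implied by the slides, now() is taken to be System.currentTimeMillis(), and the method is named tryLock here rather than implementing the Lock interface (all assumptions of this sketch):

```java
import java.util.concurrent.atomic.AtomicReference;

public class TOLockDemo {
    static class QNode {
        volatile QNode prev = null;
    }

    static class TOLock {
        static final QNode AVAILABLE = new QNode();  // sentinel: lock is free
        final AtomicReference<QNode> tail = new AtomicReference<>(null);
        final ThreadLocal<QNode> myNode = new ThreadLocal<>();

        public boolean tryLock(long timeoutMillis) {
            long start = System.currentTimeMillis();
            QNode qnode = new QNode();               // fresh node per attempt
            myNode.set(qnode);
            qnode.prev = null;
            QNode myPred = tail.getAndSet(qnode);
            if (myPred == null || myPred.prev == AVAILABLE) {
                return true;                         // lock was free
            }
            while (System.currentTimeMillis() - start < timeoutMillis) {
                QNode predPred = myPred.prev;        // spin on predecessor's prev
                if (predPred == AVAILABLE) {
                    return true;                     // predecessor released lock
                } else if (predPred != null) {
                    myPred = predPred;               // predecessor aborted: skip it
                }
            }
            // timed out: unlink if I am the tail, else tell successor to skip me
            if (!tail.compareAndSet(qnode, myPred)) {
                qnode.prev = myPred;
            }
            return false;
        }

        public void unlock() {
            QNode qnode = myNode.get();
            if (!tail.compareAndSet(qnode, null)) {
                qnode.prev = AVAILABLE;              // hand off to successor
            }
        }
    }

    public static void main(String[] args) {
        TOLock lock = new TOLock();
        System.out.println(lock.tryLock(10));        // uncontended: prints true
        lock.unlock();
    }
}
```

A fresh node per acquisition attempt is what makes the wait-free abort possible: an abandoned node is simply left in the list for the successor to bypass and reclaim, as the earlier slides describe.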

One Lock To Rule Them All? TTAS+Backoff, CLH, MCS, TOLock… Each is better than the others in some way. There is no one solution; the lock we pick depends on: the application, the hardware, and which properties are important. In this lecture we saw a variety of spin locks that vary in characteristics and performance. Such a variety is useful, because no single algorithm is ideal for all applications. For some applications, complex algorithms work best; for others, simple algorithms are preferable. The best choice usually depends on specific aspects of the application and the target architecture.

Art of Multiprocessor Programming This work is licensed under a Creative Commons Attribution-ShareAlike 2.5 License. You are free: to Share — to copy, distribute and transmit the work; to Remix — to adapt the work. Under the following conditions: Attribution. You must attribute the work to “The Art of Multiprocessor Programming” (but not in any way that suggests that the authors endorse you or your use of the work). Share Alike. If you alter, transform, or build upon this work, you may distribute the resulting work only under the same, similar or a compatible license. For any reuse or distribution, you must make clear to others the license terms of this work. The best way to do this is with a link to http://creativecommons.org/licenses/by-sa/3.0/. Any of the above conditions can be waived if you get permission from the copyright holder. Nothing in this license impairs or restricts the author's moral rights.