CS510 Concurrent Systems Class 1b Spin Lock Performance.

Introduction
• Shared-memory multiprocessors
  o Various different architectures
  o All have hardware support for mutual exclusion
    - Various flavors of atomic read-modify-write instruction
    - Can be used directly or to build higher-level abstractions
• This paper focuses on spin locks
  o Used to protect short critical sections
  o Arguably the simplest of the higher-level abstractions
• The challenge
  o How to implement scalable, low-latency spin locks on multiprocessors

Multiprocessor Architecture Overview
• Two dimensions:
  o Interconnect type (bus or multistage network)
  o Cache coherence strategy
• Six architectures considered:
  o Bus: no cache coherence
  o Bus: snoopy write-through invalidation cache coherence
  o Bus: snoopy write-back invalidation cache coherence
  o Bus: snoopy distributed-write cache coherence
  o Multistage network: no cache coherence
  o Multistage network: invalidation-based cache coherence

Mutual Exclusion and Atomic Instructions
• Example: test-and-set instruction
• A lock is a single-word variable with two values
  o 0 = FALSE = not locked
  o 1 = TRUE = locked
• Test-and-set does the following atomically:
  o Load the (old) value of lock
  o Store TRUE in lock
• The caller then inspects the loaded value
  o If the loaded value was FALSE, you got the lock (so continue)
  o If the loaded value was TRUE, someone else has the lock (so try again)
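
The test-and-set semantics described on this slide can be sketched in C11, assuming `atomic_exchange` from `<stdatomic.h>` stands in for the hardware instruction. The names `lock_word`, `test_and_set`, `try_acquire` and `test_and_set_demo` are illustrative and do not come from the slides.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Lock word as on the slide: false = not locked, true = locked. */
typedef atomic_bool lock_word;

/* Atomically store TRUE into the lock and return the value it held before. */
bool test_and_set(lock_word *lock) {
    return atomic_exchange(lock, true);
}

/* One attempt: an old value of FALSE means the caller now holds the lock. */
bool try_acquire(lock_word *lock) {
    return !test_and_set(lock);
}

/* Single-threaded walk-through of the two outcomes listed on the slide. */
void test_and_set_demo(void) {
    lock_word lock;
    atomic_init(&lock, false);
    assert(try_acquire(&lock));     /* old value FALSE: got the lock */
    assert(!try_acquire(&lock));    /* old value TRUE: already held */
}
```

Note that a failed attempt still writes TRUE into the lock word, which is why every attempt costs a write on the interconnect even when the lock is busy.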

Using Test-and-Set in a Spin Lock
• Spin on test-and-set
    Lock:    while (TestAndSet(lock) = BUSY);
    Unlock:  lock := CLEAR;
• Tradeoff: frequent polling gets you the lock faster, but slows everyone else down!
  o Why?
• If you fix this problem using a more complex algorithm, latency may become an issue
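
A minimal C11 sketch of the spin-on-test-and-set lock, assuming `atomic_flag` is an acceptable stand-in for the lock word. The type and function names are hypothetical. The acquire routine also counts its attempts, purely to make visible how many write operations a waiter issues.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical lock type; the flag is set while the lock is BUSY. */
typedef struct { atomic_flag busy; } tas_lock;

void tas_init(tas_lock *l) {
    atomic_flag_clear(&l->busy);    /* CLEAR = unlocked */
}

/* Spin on test-and-set. Returns the number of test-and-set operations
 * issued; each one is a write that travels over the bus or network. */
unsigned long tas_acquire(tas_lock *l) {
    unsigned long attempts = 1;
    while (atomic_flag_test_and_set_explicit(&l->busy, memory_order_acquire))
        attempts++;
    return attempts;
}

void tas_release(tas_lock *l) {
    atomic_flag_clear_explicit(&l->busy, memory_order_release);
}

/* Single-threaded usage: an uncontended acquire needs exactly one attempt. */
void tas_demo(void) {
    tas_lock l;
    tas_init(&l);
    assert(tas_acquire(&l) == 1);
    tas_release(&l);
}
```

Under contention the attempt count grows with the length of the critical section, and those attempts compete with the lock holder for the same memory path.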

Spin on Read Approach
• Spin on read (test-and-test-and-set)
    Lock:    while (lock = BUSY or TestAndSet(lock) = BUSY);
    Unlock:  lock := CLEAR;
• Intended for architectures with per-CPU caches
• Why should it perform much better?
• Why doesn’t it perform much better?
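
The spin-on-read loop might look as follows in C11. This is a sketch under the assumption of coherent per-CPU caches, and `ttas_lock` and its functions are made-up names. The memory orders chosen here are one reasonable option, not something the slides specify.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

typedef struct { atomic_bool busy; } ttas_lock;   /* hypothetical name */

void ttas_init(ttas_lock *l) {
    atomic_init(&l->busy, false);
}

void ttas_acquire(ttas_lock *l) {
    for (;;) {
        /* Spin on read: with coherent caches these loads hit the local copy. */
        while (atomic_load_explicit(&l->busy, memory_order_relaxed)) { }
        /* The lock looked free, so issue a single test-and-set. */
        if (!atomic_exchange_explicit(&l->busy, true, memory_order_acquire))
            return;
    }
}

void ttas_release(ttas_lock *l) {
    assert(atomic_load(&l->busy));  /* the lock must be held when released */
    atomic_store_explicit(&l->busy, false, memory_order_release);
}
```

While the lock is held, waiters generate no interconnect traffic. The cost moves to the moment of release, when every waiter's cached copy is invalidated at once.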

Why Quiescence is Slow for Spin on Read
• When the lock is released its value is modified, hence all cached copies of it are invalidated
• Subsequent reads on all processors miss in cache, hence generating bus contention
• Many see the lock free at the same time because there is a delay in satisfying the cache miss of the one that will eventually succeed in getting the lock next
• Many attempt to set it using TSL
• Each attempt generates contention and invalidates all copies
• All but one attempt fails, causing the CPU to revert to reading
• The first read misses in the cache!
• By the time all this is over, the critical section has completed and the lock has been freed again!

[Figure: Spin on TSL vs Spin on Read]

[Figure: Quiescence Time for Spin on Read]

Strategies for Improving Performance
• Author presents 5 alternative approaches
  o 4 are based on CSMA-CD network strategies
  o Approaches differ by:
    - Where to wait
    - Whether wait time is determined statically or dynamically
• Where to wait
  o Delay only on attempted set
    - Spin on read, notice release, then delay before setting
  o Delay after every memory access
    - Better for architectures where spin on read generates contention!

Delay Only on Attempted Set
    while (lock = BUSY or TestAndSet(lock) = BUSY)
    begin
      while (lock = BUSY);   /* spin on read without delay */
      delay();               /* delay before TestAndSet */
    end;
• Cuts contention and invalidations by adding latency between retries
• Performance is good if:
  o Delay is short and there are few other spinners
  o Delay is long but there are many spinners
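
The loop above translates almost line for line into C11. In this sketch the slide's `delay()` is approximated by a crude busy-wait whose length is passed as a parameter. `delay_lock`, `spin_delay` and `delay_iters` are hypothetical names, and the iteration count is an arbitrary unit of time.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

typedef struct { atomic_bool busy; } delay_lock;  /* hypothetical name */

/* Crude busy-wait standing in for the slide's delay(); the volatile
 * counter keeps the compiler from deleting the loop. */
void spin_delay(unsigned iterations) {
    for (volatile unsigned i = 0; i < iterations; i++) { }
}

void delay_lock_acquire(delay_lock *l, unsigned delay_iters) {
    /* || short-circuits, so TestAndSet runs only when the lock looks free. */
    while (atomic_load(&l->busy) || atomic_exchange(&l->busy, true)) {
        while (atomic_load(&l->busy)) { }   /* spin on read without delay */
        spin_delay(delay_iters);            /* delay before the next TestAndSet */
    }
}

void delay_lock_release(delay_lock *l) {
    assert(atomic_load(&l->busy));          /* must be held when released */
    atomic_store(&l->busy, false);
}
```

An uncontended acquire never reaches `spin_delay`, so the added latency is paid only after the lock has been seen busy or a test-and-set has failed.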

Delay in Spin on Read (every access)
    while (lock = BUSY or TestAndSet(lock) = BUSY)
      delay();
• Basically, just check the lock less frequently
• Good for architectures in which spin on read generates contention
  o I.e., those without caches

How Long to Delay?
• Statically determined
  o There is no single “right” answer
    - Sometimes there are many contending threads and sometimes there are few/none
  o If all processors are given the same delay and they conflict once, they will conflict repeatedly!
    - Except that one succeeds in the event of a conflict (unlike CSMA-CD networks!)
• Dynamically determined
  o Based on what?
  o How can we estimate the number of contending threads?

Static Delay on Lock Release
• When a processor notices the lock has been released, it waits a fixed amount of time before trying a test-and-set
• Each processor is assigned a different static delay (slot)
• Few empty slots means good latency
• Few crowded slots means little contention
• Good performance with:
  o Fewer slots when there are fewer spinning processors
  o Many slots when there are more spinning processors
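
To make the slot tradeoff concrete, here is a small C sketch of one possible slot assignment. The modulo mapping from processor id to slot, and the names `slot_delay`, `spinners_in_slot` and `slot_iters`, are assumptions for illustration. The slides do not say how slots are assigned.

```c
#include <assert.h>

/* Fixed delay for one processor: its slot number times the slot length.
 * Mapping cpu_id to a slot with a modulo is an assumption. */
unsigned slot_delay(unsigned cpu_id, unsigned num_slots, unsigned slot_iters) {
    assert(num_slots > 0);
    return (cpu_id % num_slots) * slot_iters;
}

/* How many of the spinners 0..num_spinners-1 land in the given slot.
 * A count above 1 is a crowded slot; a count of 0 is an empty slot. */
unsigned spinners_in_slot(unsigned slot, unsigned num_spinners,
                          unsigned num_slots) {
    assert(slot < num_slots);
    return num_spinners / num_slots
         + (slot < num_spinners % num_slots ? 1u : 0u);
}
```

With 10 spinners and 4 slots, every slot holds two or three processors that will collide on the test-and-set. With 2 spinners and 4 slots, half the slots are empty, so a release may be followed by idle waiting. The value returned by `slot_delay` would take the place of `delay()` in the delay-only-on-attempted-set loop.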

[Figure: Overhead vs. Number of Slots]

Variable Delay
• Like Ethernet backoff
• If a processor “collides” with another processor, it backs off for a greater random interval each time
  o Indirectly, processors base the backoff interval on the number of spinning processors
    while (lock = BUSY or TestAndSet(lock) = BUSY)
    begin
      delay();
      delay += randomBackoff();
    end;

Problems with Backoff
• Both dynamic and static backoff are bad when the critical section is long: they just keep backing off while the lock is being held
  o Failing in test-and-set is not necessarily a sign of many spinning threads!
• Maximum time to delay should be bounded
• Initial delay on arrival should be a fraction of the last delay
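
The last two bullets can be sketched in C11 as a backoff lock with a capped delay and a remembered starting point. Everything specific here is an assumption made for illustration: the bounds `MIN_DELAY` and `MAX_DELAY`, the choice of one half as the "fraction", doubling as the growth rule, and the xorshift generator used for the random interval.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

enum { MIN_DELAY = 4, MAX_DELAY = 1024 };   /* assumed bounds, in loop iterations */

typedef struct { atomic_bool busy; } backoff_lock;                    /* hypothetical */
typedef struct { unsigned last_delay; uint32_t rng; } backoff_state;  /* one per thread */

/* Start from half of the previous delay (the fraction is an assumption). */
unsigned initial_delay(unsigned last_delay) {
    unsigned d = last_delay / 2;
    return d < MIN_DELAY ? MIN_DELAY : d;
}

/* Double the delay after a failed attempt, but never beyond MAX_DELAY. */
unsigned next_delay(unsigned delay) {
    return delay >= MAX_DELAY / 2 ? MAX_DELAY : delay * 2;
}

/* xorshift32 step; rng must be seeded with a non-zero value. */
unsigned random_below(backoff_state *s, unsigned bound) {
    assert(s->rng != 0 && bound > 0);
    s->rng ^= s->rng << 13;
    s->rng ^= s->rng >> 17;
    s->rng ^= s->rng << 5;
    return s->rng % bound;
}

void spin_delay(unsigned iterations) {
    for (volatile unsigned i = 0; i < iterations; i++) { }
}

void backoff_acquire(backoff_lock *l, backoff_state *s) {
    unsigned delay = initial_delay(s->last_delay);
    while (atomic_load(&l->busy) || atomic_exchange(&l->busy, true)) {
        spin_delay(random_below(s, delay) + 1); /* random wait within the bound */
        delay = next_delay(delay);              /* back off further next time */
    }
    s->last_delay = delay;                      /* remembered for the next arrival */
}
```

Because the delay saturates at `MAX_DELAY`, a long critical section can no longer push a waiter into an arbitrarily long sleep, which is the failure mode named in the first bullet.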

A Different Approach - Queueing
• Delay-based approaches separate contending accesses in time
• Queueing separates contending accesses in space
• Naïve approach
  o Insert each waiting process into a queue
  o Each process spins on the flag of the process ahead of it
    - All are spinning on different locations!
    - No cache or bus contention
  o But the queue insertion and deletion operations require locks
    - Not good for small critical sections (such as queue ops!)

Queueing
• A more efficient approach
  o Each arriving process uses an atomic read-and-increment instruction to get a unique sequence number
  o On completion of the critical section, a process releases the process with the next highest sequence number
    - How? Use a sequenced array of flags
    - Each process spins reading its own flag (in a separate cache line), chosen by its sequence number
    - On release, a process sets the flag of the process behind it in the logical queue (next sequence number)
  o ... But you need an atomic read-and-increment instruction!

Queueing
    Init:    flags[0] := HAS_LOCK;
             flags[1..P-1] := MUST_WAIT;
             queueLast := 0;
    Lock:    myPlace := ReadAndIncrement(queueLast);
             while (flags[myPlace mod P] = MUST_WAIT);
    Unlock:  flags[myPlace mod P] := MUST_WAIT;
             flags[(myPlace+1) mod P] := HAS_LOCK;
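
The pseudocode above could be written in C11 roughly as follows, with `atomic_fetch_add` playing the role of `ReadAndIncrement`. This is a sketch: the value of `P`, the 64-byte cache line size, and the `queue_*` names are assumptions. It is only correct while at most `P` processes contend at once.

```c
#include <assert.h>
#include <stdatomic.h>

/* P is a power of two so that "mod P" stays consistent when the
 * sequence counter wraps around. */
enum { P = 8, MUST_WAIT = 0, HAS_LOCK = 1 };

typedef struct {
    /* One flag per queue position, each on its own (assumed 64-byte) line. */
    struct { _Alignas(64) atomic_int flag; } flags[P];
    atomic_uint queue_last;
} queue_lock;

void queue_init(queue_lock *l) {
    atomic_init(&l->flags[0].flag, HAS_LOCK);
    for (int i = 1; i < P; i++)
        atomic_init(&l->flags[i].flag, MUST_WAIT);
    atomic_init(&l->queue_last, 0);
}

/* Returns the caller's place, which it must pass back to queue_release. */
unsigned queue_acquire(queue_lock *l) {
    unsigned my_place = atomic_fetch_add(&l->queue_last, 1) % P;
    while (atomic_load_explicit(&l->flags[my_place].flag,
                                memory_order_acquire) == MUST_WAIT) { }
    return my_place;
}

void queue_release(queue_lock *l, unsigned my_place) {
    assert(atomic_load(&l->flags[my_place].flag) == HAS_LOCK);
    /* Reset our own slot for reuse, then hand the lock to the next place. */
    atomic_store_explicit(&l->flags[my_place].flag, MUST_WAIT,
                          memory_order_relaxed);
    atomic_store_explicit(&l->flags[(my_place + 1) % P].flag, HAS_LOCK,
                          memory_order_release);
}
```

Each release writes exactly one waiter's flag, so a handoff invalidates a single cache line instead of waking every spinner.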

Queueing Performance
• Works especially well for multistage networks: each flag can be on a separate module, so a single memory location isn’t saturated with requests
• Works less well if there’s a bus without caches, because we still have the problem that each process has to poll for a single value in one place (memory)
• Lock latency is increased due to overhead, so it has poor performance relative to other approaches when there’s no contention

Costs on Different Hardware
• Distributed-write coherence
  o All processors can share the same global “next” counter
• Invalidation-based coherence
  o Each processor should spin on a different cache line
• Non-coherent multistage network
  o Processes should poll locations in different memory modules
• Non-coherent bus
  o Polling can swamp the bus
  o Needs a delay, based on how close to the front a process is

[Figure: Benchmark Spin-lock Alternatives]

[Figure: Spin-waiting Overhead for a Burst]

Network Hardware Solutions
• Combining networks
  o Combine requests to the same lock (forward one, return the other)
  o Combining benefit increases with increase in contention
• Hardware queuing
  o Blocking enter and exit instructions queue processes at the memory module
  o Eliminates polling across the network
• Goodman’s queue links
  o Store the name of the next processor in the queue directly in each processor’s cache
  o Inform the next processor asynchronously (via inter-processor interrupt?)

Bus Hardware Solutions
• Use an additional bus with write-broadcast coherence for TSL (push the new value)
• Invalidate cache copies only when test-and-set succeeds
• Read broadcast
  o Whenever some other processor reads a value which I know is invalid, I get a copy of that value too (piggyback)
  o Eliminates the cascade of read misses
• Special handling of test-and-set
  o Cache and bus controllers don’t mess with the bus if the lock is busy

Conclusions
• Spin-locking performance doesn’t scale easily
• A variant of Ethernet backoff has good results when there is little lock contention
• Queuing (parallelizing lock handoff) has good results when there is a lot of contention
• A little supportive hardware goes a long way towards a healthy multiprocessor relationship