Spin Locks and Contention Based on slides by by Maurice Herlihy & Nir Shavit Tomer Gurevich.

Slides:



Advertisements
Similar presentations
Synchronization without Contention
Advertisements

CMPT 431 Dr. Alexandra Fedorova Lecture IV: OS Support.
Operating Systems Part III: Process Management (Process Synchronization)
Synchronization. How to synchronize processes? – Need to protect access to shared data to avoid problems like race conditions – Typical example: Updating.
ECE 454 Computer Systems Programming Parallel Architectures and Performance Implications (II) Ding Yuan ECE Dept., University of Toronto
Ch. 7 Process Synchronization (1/2) I Background F Producer - Consumer process :  Compiler, Assembler, Loader, · · · · · · F Bounded buffer.
Process Synchronization. Module 6: Process Synchronization Background The Critical-Section Problem Peterson’s Solution Synchronization Hardware Semaphores.
Spin Locks and Contention Management The Art of Multiprocessor Programming Spring 2007.
Barrier Synchronization Companion slides for The Art of Multiprocessor Programming by Maurice Herlihy & Nir Shavit.
Introduction Companion slides for
Synchronization without Contention John M. Mellor-Crummey and Michael L. Scott+ ECE 259 / CPS 221 Advanced Computer Architecture II Presenter : Tae Jun.
Parallel Processing (CS526) Spring 2012(Week 6).  A parallel algorithm is a group of partitioned tasks that work with each other to solve a large problem.
Transactional Memory Yujia Jin. Lock and Problems Lock is commonly used with shared data Priority Inversion –Lower priority process hold a lock needed.
The Performance of Spin Lock Alternatives for Shared-Memory Microprocessors Thomas E. Anderson Presented by David Woodard.
Shared Counters and Parallelism Companion slides for The Art of Multiprocessor Programming by Maurice Herlihy & Nir Shavit.
ESE Einführung in Software Engineering X. CHAPTER Prof. O. Nierstrasz Wintersemester 2005 / 2006.
CS510 Concurrent Systems Class 1b Spin Lock Performance.
CS510 Advanced OS Seminar Class 10 A Methodology for Implementing Highly Concurrent Data Objects by Maurice Herlihy.
CS510 Concurrent Systems Class 2 A Lock-Free Multiprocessor OS Kernel.
CMPT 401 Summer 2007 Dr. Alexandra Fedorova Lecture IV: OS Support.
Companion slides for The Art of Multiprocessor Programming by Maurice Herlihy & Nir Shavit Concurrent Skip Lists.
Synchronization Todd C. Mowry CS 740 November 1, 2000 Topics Locks Barriers Hardware primitives.
CS533 - Concepts of Operating Systems 1 Class Discussion.
Art of Multiprocessor Programming1 This work is licensed under a Creative Commons Attribution- ShareAlike 2.5 License.Creative Commons Attribution- ShareAlike.
Multicore Programming
CS510 Concurrent Systems Jonathan Walpole. A Lock-Free Multiprocessor OS Kernel.
The Performance of Spin Lock Alternatives for Shared-Memory Multiprocessors THOMAS E. ANDERSON Presented by Daesung Park.
ECE200 – Computer Organization Chapter 9 – Multiprocessors.
Spin Locks and Contention
Mutual Exclusion Companion slides for The Art of Multiprocessor Programming by Maurice Herlihy & Nir Shavit.
MULTIVIE W Slide 1 (of 23) The Performance of Spin Lock Alternatives for Shared-Memory Multiprocessors Paper: Thomas E. Anderson Presentation: Emerson.
Operating Systems ECE344 Ashvin Goel ECE University of Toronto Mutual Exclusion.
Caltech CS184 Spring DeHon 1 CS184b: Computer Architecture (Abstractions and Optimizations) Day 12: May 3, 2003 Shared Memory.
Spin Locks Companion slides for The Art of Multiprocessor Programming by Maurice Herlihy & Nir Shavit.
CALTECH cs184c Spring DeHon CS184c: Computer Architecture [Parallel and Multithreaded] Day 10: May 8, 2001 Synchronization.
Mutual Exclusion Companion slides for The Art of Multiprocessor Programming by Maurice Herlihy & Nir Shavit.
Operating Systems CMPSC 473 Mutual Exclusion Lecture 11: October 5, 2010 Instructor: Bhuvan Urgaonkar.
CS510 Concurrent Systems Jonathan Walpole. A Methodology for Implementing Highly Concurrent Data Objects.
Monitors and Blocking Synchronization Dalia Cohn Alperovich Based on “The Art of Multiprocessor Programming” by Herlihy & Shavit, chapter 8.
Concurrent Queues Companion slides for The Art of Multiprocessor Programming by Maurice Herlihy & Nir Shavit.
Caltech CS184 Spring DeHon 1 CS184b: Computer Architecture (Abstractions and Optimizations) Day 22: May 20, 2005 Synchronization.
Programming Language Basics Companion slides for The Art of Multiprocessor Programming by Maurice Herlihy & Nir Shavit.
Multiprocessor Architecture Basics Companion slides for The Art of Multiprocessor Programming by Maurice Herlihy & Nir Shavit.
1 Lecture 19: Scalable Protocols & Synch Topics: coherence protocols for distributed shared-memory multiprocessors and synchronization (Sections )
SYNAR Systems Networking and Architecture Group CMPT 886: The Art of Scalable Synchronization Dr. Alexandra Fedorova School of Computing Science SFU.
Concurrent Stacks Companion slides for The Art of Multiprocessor Programming by Maurice Herlihy & Nir Shavit.
Companion slides for The Art of Multiprocessor Programming by Maurice Herlihy & Nir Shavit Concurrent Skip Lists.
Spin Locks and Contention Companion slides for The Art of Multiprocessor Programming by Maurice Herlihy & Nir Shavit.
CS 2200 Presentation 18b MUTEX. Questions? Our Road Map Processor Networking Parallel Systems I/O Subsystem Memory Hierarchy.
Spin Locks and Contention Companion slides for The Art of Multiprocessor Programming by Maurice Herlihy & Nir Shavit Modified by Rajeev Alur for CIS 640,
Ch4. Multiprocessors & Thread-Level Parallelism 4. Syn (Synchronization & Memory Consistency) ECE562 Advanced Computer Architecture Prof. Honggang Wang.
Spin Locks and Contention Companion slides for The Art of Multiprocessor Programming by Maurice Herlihy & Nir Shavit.
Spin Locks and Contention Companion slides for The Art of Multiprocessor Programming by Maurice Herlihy & Nir Shavit.
Queue Locks and Local Spinning Some Slides based on: The Art of Multiprocessor Programming by Maurice Herlihy & Nir Shavit.
The Relative Power of Synchronization Operations Companion slides for The Art of Multiprocessor Programming by Maurice Herlihy & Nir Shavit.
Companion slides for The Art of Multiprocessor Programming by Maurice Herlihy & Nir Shavit Concurrent Skip Lists.
Multiprocessors – Locks
Spin Locks and Contention
Maurice Herlihy and J. Eliot B. Moss,  ISCA '93
Concurrency 2 CS 2110 – Spring 2016.
Atomic Operations in Hardware
Atomic Operations in Hardware
Concurrent Objects Companion slides for
Spin Locks and Contention
CS510 Concurrent Systems Jonathan Walpole.
Designing Parallel Algorithms (Synchronization)
Spin Locks and Contention Management
CS533 Concepts of Operating Systems
CS510 Concurrent Systems Jonathan Walpole.
CSE 153 Design of Operating Systems Winter 19
Presentation transcript:

Spin Locks and Contention Based on slides by by Maurice Herlihy & Nir Shavit Tomer Gurevich

Mutual Exclusion Most programs aren’t embarrassingly parallel “critical sections” of the code must be executed by one thread at a time to ensure correctness use locks for mutual exclusion Art of Multiprocessor Programming2

Example: concurrent counter Art of Multiprocessor Programming3 Thread 2Thread 1 R1 W2

Art of Multiprocessor Programming4 Locks CS Resets lock upon exit lock critical section... …lock introduces sequential bottleneck

Art of Multiprocessor Programming5 What Should you do if you can’t get a lock? Keep trying –“spin” or “busy-wait” –Good if delays are short Give up the processor –Good if delays are long –Always good on uniprocessor (1)

Outline Spinlock review TAS-lock optimizations Queue locks Abortable locks Art of Multiprocessor Programming6

7 Review: Test-and-Set Atomic operation Test-and-set (addr,new_val) –Set the current value of the word addr to new_val –Return the old value TAS aka “getAndSet”

Art of Multiprocessor Programming8 Review: Test-and-Set public class AtomicBoolean { boolean value; public synchronized boolean getAndSet(boolean newValue) { boolean prior = value; value = newValue; return prior; } (5)

Art of Multiprocessor Programming9 Test-and-Set Locks Locking –Lock is free: value is false –Lock is taken: value is true Acquire lock by calling TAS –If result is false, you win –If result is true, you lose Release lock by writing false

Art of Multiprocessor Programming10 Test-and-set Lock class TASlock { AtomicBoolean state = new AtomicBoolean(false); void lock() { while (state.getAndSet(true)) {} } void unlock() { state.set(false); }}

Art of Multiprocessor Programming11 Test-and-set Lock class TASlock { AtomicBoolean state = new AtomicBoolean(false); void lock() { while (state.getAndSet(true)) {} } void unlock() { state.set(false); }} Lock state is AtomicBoolean

Art of Multiprocessor Programming12 Test-and-set Lock class TASlock { AtomicBoolean state = new AtomicBoolean(false); void lock() { while (state.getAndSet(true)) {} } void unlock() { state.set(false); }} Keep trying until lock acquired

Art of Multiprocessor Programming13 Test-and-set Lock class TASlock { AtomicBoolean state = new AtomicBoolean(false); void lock() { while (state.getAndSet(true)) {} } void unlock() { state.set(false); }} Release lock by resetting state to false

Art of Multiprocessor Programming14 Space Complexity TAS spin-lock has small “footprint” N thread spin-lock uses O(1) space

Art of Multiprocessor Programming15 Performance Experiment –n threads –Increment shared counter 1 million times How long should it take? How long does it take?

Art of Multiprocessor Programming16 Mystery #1 time threads TAS lock Ideal (1) What is going on?

Art of Multiprocessor Programming17 Bus-Based Architectures Bus cache memory cache

Art of Multiprocessor Programming18 Bus Processor Issues Load Request cache memory cache data

Art of Multiprocessor Programming19 Bus Processor Issues Load Request Bus cache memory cache data Gimme data

Art of Multiprocessor Programming20 cache Bus Memory Responds Bus memory cache data Got your data right here data

Art of Multiprocessor Programming21 Bus Processor Issues Load Request memory cache data Gimme data

Art of Multiprocessor Programming22 Bus Processor Issues Load Request Bus memory cache data Gimme data

Art of Multiprocessor Programming23 Bus Processor Issues Load Request Bus memory cache data I got data

Art of Multiprocessor Programming24 Bus Other Processor Responds memory cache data I got data data Bus

Art of Multiprocessor Programming25 Bus Other Processor Responds memory cache data Bus

Art of Multiprocessor Programming26 Cache Coherence We have lots of copies of data –Original copy in memory –Cached copies at processors Some processor modifies its own copy –What do we do with the others? –How to avoid confusion?

Art of Multiprocessor Programming27 Modify Cached Data Bus data memory cachedata (1)

Art of Multiprocessor Programming28 Modify Cached Data Bus data memory cachedata (1)

Art of Multiprocessor Programming29 memory Bus data Modify Cached Data cachedata

Art of Multiprocessor Programming30 memory Bus data Modify Cached Data cache What’s up with the other copies? data

Art of Multiprocessor Programming31 cache Bus Modified cache data memory cachedata Other caches invalidate data This cache acquires write permission

Art of Multiprocessor Programming32 cache Bus Modified cache data memory cachedata Memory can be updated later

Art of Multiprocessor Programming33 What’s wrong with TASLock? TAS invalidates cache lines Spinners –Miss in cache –Go to bus Thread wants to release lock –delayed behind spinners

Art of Multiprocessor Programming34 Test-and-Test-and-Set Locks Lurking stage –Wait until lock “looks” free –Spin while read returns true (lock taken) Pouncing state –As soon as lock “looks” available –Read returns false (lock free) –Call TAS to acquire lock –If TAS loses, back to lurking

Art of Multiprocessor Programming35 Test-and-test-and-set Lock class TTASlock { AtomicBoolean state = new AtomicBoolean(false); void lock() { while (true) { while (state.get()) {} if (!state.getAndSet(true)) return; }

Art of Multiprocessor Programming36 Test-and-test-and-set Lock class TTASlock { AtomicBoolean state = new AtomicBoolean(false); void lock() { while (true) { while (state.get()) {} if (!state.getAndSet(true)) return; } Wait until lock looks free

Art of Multiprocessor Programming37 Test-and-test-and-set Lock class TTASlock { AtomicBoolean state = new AtomicBoolean(false); void lock() { while (true) { while (state.get()) {} if (!state.getAndSet(true)) return; } Then try to acquire it

Art of Multiprocessor Programming38 Graph TAS lock TTAS lock Ideal time threads

Art of Multiprocessor Programming39 Test-and-test-and-set Wait until lock “looks” free –Spin on local cache –No bus use while lock busy Problem: when lock is released –Invalidation storm …

Art of Multiprocessor Programming40 Local Spinning while Lock is Busy Bus memory busy

Art of Multiprocessor Programming41 Bus On Release memory freeinvalid free

Art of Multiprocessor Programming42 On Release Bus memory freeinvalid free miss Everyone misses, rereads (1)

Art of Multiprocessor Programming43 On Release Bus memory freeinvalid free TAS(…) Everyone tries TAS (1)

Art of Multiprocessor Programming44 An important observation spin lock time d r1dr1d r2dr2d If the lock looks free But I fail to get it There must be contention Better to back off than to collide again

Art of Multiprocessor Programming45 Solution: delay time d 2d4d spin lock If I fail to get lock –wait random duration before retry –Each subsequent failure doubles expected wait

Art of Multiprocessor Programming46 Exponential Backoff Lock public class Backoff implements lock { public void lock() { int delay = MIN_DELAY; while (true) { while (state.get()) {} if (!lock.getAndSet(true)) return; sleep(random() % delay); if (delay < MAX_DELAY) delay = 2 * delay; }}}

Art of Multiprocessor Programming47 Exponential Backoff Lock public class Backoff implements lock { public void lock() { int delay = MIN_DELAY; while (true) { while (state.get()) {} if (!lock.getAndSet(true)) return; sleep(random() % delay); if (delay < MAX_DELAY) delay = 2 * delay; }}} Fix minimum delay

Art of Multiprocessor Programming48 Exponential Backoff Lock public class Backoff implements lock { public void lock() { int delay = MIN_DELAY; while (true) { while (state.get()) {} if (!lock.getAndSet(true)) return; sleep(random() % delay); if (delay < MAX_DELAY) delay = 2 * delay; }}} Wait until lock looks free

Art of Multiprocessor Programming49 Exponential Backoff Lock public class Backoff implements lock { public void lock() { int delay = MIN_DELAY; while (true) { while (state.get()) {} if (!lock.getAndSet(true)) return; sleep(random() % delay); if (delay < MAX_DELAY) delay = 2 * delay; }}} If we win, return

Art of Multiprocessor Programming50 Exponential Backoff Lock public class Backoff implements lock { public void lock() { int delay = MIN_DELAY; while (true) { while (state.get()) {} if (!lock.getAndSet(true)) return; sleep(random() % delay); if (delay < MAX_DELAY) delay = 2 * delay; }}} Back off for random duration

Art of Multiprocessor Programming51 Exponential Backoff Lock public class Backoff implements lock { public void lock() { int delay = MIN_DELAY; while (true) { while (state.get()) {} if (!lock.getAndSet(true)) return; sleep(random() % delay); if (delay < MAX_DELAY) delay = 2 * delay; }}} Double max delay, within reason

Art of Multiprocessor Programming52 Spin-Waiting Overhead TTAS Lock Backoff lock time threads

Art of Multiprocessor Programming53 Backoff: Other Issues Good –Easy to implement –Beats TTAS lock Bad –Must choose parameters carefully –Not portable across platforms

Summary: basic TAS-Lock Perform well for low contention, but basic spinlocks aren’t scalable All thread spin on the same shared memory location, causing a lot of bus traffic No fairness, so a thread might starve Art of Multiprocessor Programming54

Queue locks Keep FIFO Order Scalable locks Harder to implement Hurt performance for low contention Art of Multiprocessor Programming55

Art of Multiprocessor Programming56 Anderson Queue Lock flags next TFFFFFFF idle

Art of Multiprocessor Programming57 Anderson Queue Lock flags next TFFFFFFF acquiring getAndIncrement

Art of Multiprocessor Programming58 Anderson Queue Lock flags next TFFFFFFF acquiring getAndIncrement

Art of Multiprocessor Programming59 Anderson Queue Lock flags next TFFFFFFF acquired Mine!

Art of Multiprocessor Programming60 Anderson Queue Lock flags next TFFFFFFF acquired acquiring

Art of Multiprocessor Programming61 Anderson Queue Lock flags next TFFFFFFF acquired acquiring getAndIncrement

Art of Multiprocessor Programming62 Anderson Queue Lock flags next TFFFFFFF acquired acquiring getAndIncrement

Art of Multiprocessor Programming63 acquired Anderson Queue Lock flags next TFFFFFFF acquiring

Art of Multiprocessor Programming64 released Anderson Queue Lock flags next FTFFFFFF acquired

Problem: false sharing Each thread spins on different variable, so there is no reason for contention. But adjacent Array elements are contained within the same cacheline… Art of Multiprocessor Programming65

66 released The Solution: Padding flags next T///F/// acquired Line 1 Line 2 Art of Multiprocessor Programming Spin on my line

Art of Multiprocessor Programming67 Performance Shorter handover than backoff Curve is practically flat Scalable performance queue TTAS

Art of Multiprocessor Programming68 Anderson Queue Lock Good - Easy to implement Queue lock Bad –Not Space efficient What if unknown number of threads? What if small number of actual contenders?

Art of Multiprocessor Programming69 CLH Lock FIFO order Small, constant-size overhead per thread

Art of Multiprocessor Programming70 CLH Queue Lock class Qnode { AtomicBoolean locked = new AtomicBoolean(true); }

Art of Multiprocessor Programming71 CLH Queue Lock class CLHLock implements Lock { AtomicReference tail; ThreadLocal myNode = new Qnode(); public void lock() { Qnode pred = tail.getAndSet(myNode); while (pred.locked) {} }}

Art of Multiprocessor Programming72 CLH Queue Lock class CLHLock implements Lock { AtomicReference tail; ThreadLocal myNode = new Qnode(); public void lock() { Qnode pred = tail.getAndSet(myNode); while (pred.locked) {} }} Queue tail

Art of Multiprocessor Programming73 CLH Queue Lock class CLHLock implements Lock { AtomicReference tail; ThreadLocal myNode = new Qnode(); public void lock() { Qnode pred = tail.getAndSet(myNode); while (pred.locked) {} }} Thread-local Qnode

Art of Multiprocessor Programming74 CLH Queue Lock class CLHLock implements Lock { AtomicReference tail; ThreadLocal myNode = new Qnode(); public void lock() { Qnode pred = tail.getAndSet(myNode); while (pred.locked) {} }} Swap in my node

Art of Multiprocessor Programming75 CLH Queue Lock class CLHLock implements Lock { AtomicReference tail; ThreadLocal myNode = new Qnode(); public void lock() { Qnode pred = tail.getAndSet(myNode); while (pred.locked) {} }} Spin until predecessor releases lock

Art of Multiprocessor Programming76 Initially false tail idle

Art of Multiprocessor Programming77 Initially false tail idle

Art of Multiprocessor Programming78 Purple Wants the Lock false tail acquiring

Art of Multiprocessor Programming79 Purple Wants the Lock false tail acquiring true

Art of Multiprocessor Programming80 Purple Wants the Lock false tail acquiring true Swap

Art of Multiprocessor Programming81 Purple Has the Lock false tail acquired true

Art of Multiprocessor Programming82 Red Wants the Lock false tail acquired acquiring true

Art of Multiprocessor Programming83 Red Wants the Lock false tail acquired acquiring true Swap true

Art of Multiprocessor Programming84 Red Wants the Lock false tail acquired acquiring true

Art of Multiprocessor Programming85 Red Wants the Lock false tail acquired acquiring true

Art of Multiprocessor Programming86 Red Wants the Lock false tail acquired acquiring true Implicit Linked list

Art of Multiprocessor Programming87 CLH Queue Lock Class CLHLock implements Lock { … public void unlock() { myNode.locked.set(false); myNode = pred; }

Art of Multiprocessor Programming88 CLH Queue Lock Class CLHLock implements Lock { … public void unlock() { myNode.locked.set(false); myNode = pred; } Notify successor

Art of Multiprocessor Programming89 CLH Queue Lock Class CLHLock implements Lock { … public void unlock() { myNode.locked.set(false); myNode = pred; } Recycle predecessor’s node

Art of Multiprocessor Programming90 Purple Releases false tail release acquiring false true false Bingo!

Art of Multiprocessor Programming91 Purple Releases tail released acquired true

Art of Multiprocessor Programming92 Space Usage Let –L = number of locks –N = number of threads ALock –O(LN) CLH lock –O(L+N)

Art of Multiprocessor Programming93 CLH Lock Good –Lock release affects predecessor only –Small, constant-sized space Bad –Doesn’t work for uncached NUMA architectures

Art of Multiprocessor Programming94 NUMA Architecturs Acronym: –Non-Uniform Memory Architecture Illusion: –Flat shared memory Truth: –No caches (sometimes) –Some memory regions faster than others

Art of Multiprocessor Programming95 MCS Lock FIFO order, list based Queue lock Similar to CLH Spin on local memory only, solving the NUMA problem

MCS lock Each node contains now a “next” field. Each node spins locally on its own “Locked” field upon release, notify next node you finished Art of Multiprocessor Programming96

Art of Multiprocessor Programming97 Abortable Locks What if you want to give up waiting for a lock? For example –Timeout –Database transaction aborted by user

Art of Multiprocessor Programming98 Back-off Lock Aborting is trivial –Just return from lock() call Extra benefit: –No cleaning up –Immediate return

Art of Multiprocessor Programming99 Queue Locks Can’t just quit –Thread in line behind will starve Need a graceful way out

Art of Multiprocessor Programming100 Abortable CLH Lock When a thread gives up –Removing node in a wait-free way is hard Idea: –let successor deal with it.

Art of Multiprocessor Programming101 Queue Locks locked true spinning true spinning

Art of Multiprocessor Programming102 Queue Locks locked true abor true spinning Time-out

Art of Multiprocessor Programming103 Queue Locks locked true abor true spinning Predecessor aborted

Art of Multiprocessor Programming104 Queue Locks locked true spinning

Art of Multiprocessor Programming105 One Lock To Rule Them All? TTAS+Backoff, CLH, MCS, ToLock… Each better than others in some way There is no one solution Lock we pick really depends on: – the application – the hardware – which properties are important

Art of Multiprocessor Programming106 This work is licensed under a Creative Commons Attribution- ShareAlike 2.5 License.Creative Commons Attribution- ShareAlike 2.5 License You are free: –to Share — to copy, distribute and transmit the work –to Remix — to adapt the work Under the following conditions: –Attribution. You must attribute the work to “The Art of Multiprocessor Programming” (but not in any way that suggests that the authors endorse you or your use of the work). –Share Alike. If you alter, transform, or build upon this work, you may distribute the resulting work only under the same, similar or a compatible license. For any reuse or distribution, you must make clear to others the license terms of this work. The best way to do this is with a link to – Any of the above conditions can be waived if you get permission from the copyright holder. Nothing in this license impairs or restricts the author's moral rights.