SYNAR Systems Networking and Architecture Group CMPT 886: The Art of Scalable Synchronization Dr. Alexandra Fedorova School of Computing Science SFU.

Slides:

Advertisements

Similar presentations

CMPT 431 Dr. Alexandra Fedorova Lecture IV: OS Support.

Advertisements

Synchronization. How to synchronize processes? – Need to protect access to shared data to avoid problems like race conditions – Typical example: Updating.

Global Environment Model. MUTUAL EXCLUSION PROBLEM The operations used by processes to access to common resources (critical sections) must be mutually.

CS492B Analysis of Concurrent Programs Lock Basics Jaehyuk Huh Computer Science, KAIST.

ECE 454 Computer Systems Programming Parallel Architectures and Performance Implications (II) Ding Yuan ECE Dept., University of Toronto

Ch. 7 Process Synchronization (1/2) I Background F Producer - Consumer process :  Compiler, Assembler, Loader, · · · · · · F Bounded buffer.

Chapter 6: Process Synchronization

Process Synchronization. Module 6: Process Synchronization Background The Critical-Section Problem Peterson’s Solution Synchronization Hardware Semaphores.

Spin Locks and Contention Management The Art of Multiprocessor Programming Spring 2007.

Spin Locks and Contention Based on slides by by Maurice Herlihy & Nir Shavit Tomer Gurevich.

Parallel Processing (CS526) Spring 2012(Week 6).  A parallel algorithm is a group of partitioned tasks that work with each other to solve a large problem.

Review: Multiprocessor Systems (MIMD)

Synchronization. Shared Memory Thread Synchronization Threads cooperate in multithreaded environments – User threads and kernel threads – Share resources.

Operating Systems ECE344 Ding Yuan Synchronization (I) -- Critical region and lock Lecture 5: Synchronization (I) -- Critical region and lock.

CS444/CS544 Operating Systems Synchronization 2/21/2006 Prof. Searleman

The Performance of Spin Lock Alternatives for Shared-Memory Microprocessors Thomas E. Anderson Presented by David Woodard.

CS444/CS544 Operating Systems Synchronization 2/16/2007 Prof. Searleman

CS510 Concurrent Systems Class 1b Spin Lock Performance.

CMPT 401 Summer 2007 Dr. Alexandra Fedorova Lecture IV: OS Support.

CPS110: Implementing threads/locks on a uni-processor Landon Cox.

User-Level Interprocess Communication for Shared Memory Multiprocessors Brian N. Bershad, Thomas E. Anderson, Edward D. Lazowska, and Henry M. Levy Presented.

U NIVERSITY OF M ASSACHUSETTS, A MHERST Department of Computer Science Emery Berger University of Massachusetts, Amherst Operating Systems CMPSCI 377 Lecture.

1 Lecture 20: Protocols and Synchronization Topics: distributed shared-memory multiprocessors, synchronization (Sections )

Synchronization CSCI 444/544 Operating Systems Fall 2008.

More on Locks: Case Studies

Implementing Synchronization. Synchronization 101 Synchronization constrains the set of possible interleavings: Threads “agree” to stay out of each other’s.

The Performance of Spin Lock Alternatives for Shared-Memory Multiprocessors THOMAS E. ANDERSON Presented by Daesung Park.

ECE200 – Computer Organization Chapter 9 – Multiprocessors.

Spin Locks and Contention

Chapter 28 Locks Chien-Chung Shen CIS, UD

MULTIVIE W Slide 1 (of 23) The Performance of Spin Lock Alternatives for Shared-Memory Multiprocessors Paper: Thomas E. Anderson Presentation: Emerson.

COMP 111 Threads and concurrency Sept 28, Tufts University Computer Science2 Who is this guy? I am not Prof. Couch Obvious? Sam Guyer New assistant.

Operating Systems ECE344 Ashvin Goel ECE University of Toronto Mutual Exclusion.

SYNAR Systems Networking and Architecture Group CMPT 886: Computer Architecture Primer Dr. Alexandra Fedorova School of Computing Science SFU.

Spin Locks Companion slides for The Art of Multiprocessor Programming by Maurice Herlihy & Nir Shavit.

COSC 3407: Operating Systems Lecture 7: Implementing Mutual Exclusion.

CALTECH cs184c Spring DeHon CS184c: Computer Architecture [Parallel and Multithreaded] Day 10: May 8, 2001 Synchronization.

Mutual Exclusion Companion slides for The Art of Multiprocessor Programming by Maurice Herlihy & Nir Shavit.

CS399 New Beginnings Jonathan Walpole. 2 Concurrent Programming & Synchronization Primitives.

1 Lecture 19: Scalable Protocols & Synch Topics: coherence protocols for distributed shared-memory multiprocessors and synchronization (Sections )

Spin Locks and Contention Companion slides for The Art of Multiprocessor Programming by Maurice Herlihy & Nir Shavit.

CS 2200 Presentation 18b MUTEX. Questions? Our Road Map Processor Networking Parallel Systems I/O Subsystem Memory Hierarchy.

August 13, 2001Systems Architecture II1 Systems Architecture II (CS ) Lecture 11: Multiprocessors: Uniform Memory Access * Jeremy R. Johnson Monday,

Implementing Lock. From the Previous Lecture  The “too much milk” example shows that writing concurrent programs directly with load and store instructions.

Spin Locks and Contention Companion slides for The Art of Multiprocessor Programming by Maurice Herlihy & Nir Shavit Modified by Rajeev Alur for CIS 640,

Spin Locks and Contention Companion slides for The Art of Multiprocessor Programming by Maurice Herlihy & Nir Shavit.

Spin Locks and Contention Companion slides for The Art of Multiprocessor Programming by Maurice Herlihy & Nir Shavit.

Queue Locks and Local Spinning Some Slides based on: The Art of Multiprocessor Programming by Maurice Herlihy & Nir Shavit.

SYNAR Systems Networking and Architecture Group CMPT 886: Computer Architecture Primer Dr. Alexandra Fedorova School of Computing Science SFU.

Implementing Mutual Exclusion Andy Wang Operating Systems COP 4610 / CGS 5765.

December 1, 2006©2006 Craig Zilles1 Threads & Atomic Operations in Hardware  Previously, we introduced multi-core parallelism & cache coherence —Today.

Multiprocessors – Locks

Spin Locks and Contention

CS703 – Advanced Operating Systems

Atomic Operations in Hardware

Atomic Operations in Hardware

Chapter 5: Process Synchronization

CSCI 511 Operating Systems Chapter 5 (Part C) Monitor

Spin Locks and Contention

CS510 Concurrent Systems Jonathan Walpole.

CMPT 886: Computer Architecture Primer

Designing Parallel Algorithms (Synchronization)

Implementing Mutual Exclusion

CS533 Concepts of Operating Systems

Implementing Mutual Exclusion

CS510 Concurrent Systems Jonathan Walpole.

CSE 153 Design of Operating Systems Winter 19

CS333 Intro to Operating Systems

Chapter 6: Synchronization Tools

Lecture 18: Coherence and Synchronization

Presentation transcript:

SYNAR Systems Networking and Architecture Group CMPT 886: The Art of Scalable Synchronization Dr. Alexandra Fedorova School of Computing Science SFU

SYNAR Systems Networking and Architecture Group Synchronization if(account_balance >= amount) { account_balance -= amount; } Thread 1: perform a withdrawal if(account_balance >= service_fee) { account_balance -= service_fee; } Thread 2: subtract service fee Unsynchronized Access Account balanced has changed between steps 2 and 4!!! Synchronized Access lock_aquire(account_balance_lock); if(account_balance >= amount) { account_balance -= amount; } lock_release(account_balance); lock_aquire(account_balance_lock); if(account_balance >= service_fee) { account_balance -= service_fee; } lock_release(account_balance);

SYNAR Systems Networking and Architecture Group Synchronization Primitives (SP) Synchronization primitives provide atomic access to a critical section Types of synchronization primitives – mutex – semaphore – lock – condition variable – etc. Synchronization primitives are provided by the OS Can also be implemented by a library (e.g., pthreads) or by the application Hardware provides special atomic instructions for implementation of synchronization primitives (test-and-set, compare-and-swap, etc.)

SYNAR Systems Networking and Architecture Group Implementation of SP Performance of applications that use SP is determined by an implementation of the SP A SP must be scalable – must continue to perform well as the number of contending threads increases We will look at several implementations of locks to understand how to create a scalable implementation

SYNAR Systems Networking and Architecture Group What should you do if you can’t get a lock? Keep trying – “spin” or “busy-wait” – Good if delays are short Give up the processor – Good if delays are long – Always good on uniprocessor Systems usually use a combination: – Spin for a while, then give up the processor We will focus on multiprocessors, so we’ll look at spinlock implementations © Herlihy-Shavit 2007

SYNAR Systems Networking and Architecture Group A Shared Memory Multiprocessor Bus cache memory cache © Herlihy-Shavit 2007

SYNAR Systems Networking and Architecture Group Basic Spinlock CS Resets lock upon exit spin lock critical section... …lock suffers from contention Sequential Bottleneck  no parallelism © Herlihy-Shavit 2007

SYNAR Systems Networking and Architecture Group Review: Test-and-Set We have a boolean value in memory Test-and-set (TAS) – Swap true with prior value – Return value tells if prior value was true or false Can reset just by writing false © Herlihy-Shavit 2007

SYNAR Systems Networking and Architecture Group TAS Provided by the hardware Example SPARC: an assembly instruction load-store unsigned byte ldstub public class AtomicBoolean { boolean value; public synchronized boolean getAndSet(boolean newValue) { boolean prior = value; value = newValue; return prior; } Swap old and new values loads a byte from memory to a return register writes the value 0xFF into the addressed byte atomically. © Herlihy-Shavit 2007 TAS can be implemented in a high-level language. Example in Java:

SYNAR Systems Networking and Architecture Group TAS Locks Value of TAS’ed memory shows lock state: – Lock is free: value is false – Lock is taken: value is true Acquire lock by calling TAS: – If result is false, you win – If result is true, you lose Release lock by writing false

SYNAR Systems Networking and Architecture Group TAS Lock in SPARC Assembly spin_lock: busy_loop: ldstub [%o0],%o1 tst %o1 bne busy_loop nop ! delay slot for branch retl nop ! delay slot for branch loads old value into reg. o1. Writes “1” into memory at address in %o0. Test if %o1 equals to zero. If %o1 is not zero (old value is true), spin.

SYNAR Systems Networking and Architecture Group TAS Lock in Java class TASlock { AtomicBoolean state = new AtomicBoolean(false); void lock() { while (state.getAndSet(true)) {} } void unlock() { state.set(false); } Initialize lock state to false (unlocked) While lock is taken (true) spin. Release the lock – set state to false © Herlihy-Shavit 2007

SYNAR Systems Networking and Architecture Group Performance of TAS Lock Experiment – N threads on a multiprocessor – Increment shared counter 1 million times (total) – The thread acquires a lock before incrementing the counter – Each thread does 1,000,000/N increments N does not exceed the number of processors  no thread switching overhead How long should it take? How long does it take?

SYNAR Systems Networking and Architecture Group Expected performance ideal Total time Number of threads no speedup because there is no parallelism © Herlihy-Shavit 2007 lock_acquire increment lock_release lock_acquire increment lock_release lock_acquire increment lock_release lock_acquire increment lock_release Thread 1 Thread 2 same as sequential execution

SYNAR Systems Networking and Architecture Group Actual Performance TAS lock Ideal Much worse than ideal Total time Number of threads © Herlihy-Shavit 2007

SYNAR Systems Networking and Architecture Group Reasons for Bad TAS Lock Performance Has to do with cache behaviour on the multiprocessor system TAS causes a lot of invalidation misses – This hurts performance To understand what this means, let’s review how caches work

SYNAR Systems Networking and Architecture Group Processor Issues Load Request Bus cache memory cache data © Herlihy-Shavit 2007

SYNAR Systems Networking and Architecture Group Another Processor Issues Load Request Bus cache memory cache data Bus I got data data Bus I want data © Herlihy-Shavit 2007

SYNAR Systems Networking and Architecture Group memory Bus Processor Modifies Data cache data Now other copies are invalid data © Herlihy-Shavit 2007

SYNAR Systems Networking and Architecture Group Send Invalidation Message to Others memory Bus cache data Invalidate ! Bus Other caches lose read permission No need to change now: other caches can provide valid data © Herlihy-Shavit 2007

SYNAR Systems Networking and Architecture Group Processor Asks for Data memory Bus cache data Bus I want data data © Herlihy-Shavit 2007

SYNAR Systems Networking and Architecture Group Multiprocessor Caches: Summary Simultaneous reads and writes of shared data: – Make data invalid Invalidation is bad for performance On next data request: – Data must be fetched from another cache This slows down performance

SYNAR Systems Networking and Architecture Group What This Has to Do with TAS Locks Recall that TAS lock had bad performance Invalidations were the cause TAS lock Ideal Total time Number of threads Here is why: All spinners do load/store in a loop They all read/write the same location Cause lots of invalidations

SYNAR Systems Networking and Architecture Group A Solution: Test-And-Test-And-Set Lock Wait until lock “looks” free – Spin on local cache – No bus use while lock busy class TTASlock { AtomicBoolean state = new AtomicBoolean(false); void lock() { while (true) { while (state.get()) {} if (!state.getAndSet(true)) return; } Wait until the lock looks free. We read the lock instead of TASing it. We avoid repeated invalidations. Now try to acquire it. © Herlihy-Shavit 2007

SYNAR Systems Networking and Architecture Group TTAS Lock Performance TAS lock TTAS lock Ideal Better, but still far from ideal Total time Number of threads © Herlihy-Shavit 2007

SYNAR Systems Networking and Architecture Group The Problem with TTAS Lock When the lock is released: – Everyone tries to acquire is – Everyone does TAS – There is a storm of invalidations Only one processor can use the bus at a time So all processors queue up, waiting for the bus, so they can perform the TAS

SYNAR Systems Networking and Architecture Group A Solution: TTAS Lock with Backoff Intuition: If I fail to get the lock there must be contention So I should back off before trying again Introduce a random “sleep” delay before trying to acquire the lock again © Herlihy-Shavit 2007

SYNAR Systems Networking and Architecture Group TTAS Lock with Backoff: Performance TAS lock TTAS lock Backoff lock Ideal Total time Number of threads © Herlihy-Shavit 2007

SYNAR Systems Networking and Architecture Group Backoff Locks Better performance than TAS and TTAS Caveats: – Performance is sensitive to the choice of delay parameter – The delay parameter depends on the number of processors and their speed – Easy to tune for one platform – Difficult to write an implementation that will work well across multiple platforms © Herlihy-Shavit 2007

SYNAR Systems Networking and Architecture Group An Idea Avoid useless invalidations – By keeping a queue of threads Each thread – Notifies next in line – Without bothering the others © Herlihy-Shavit 2007

SYNAR Systems Networking and Architecture Group Anderson Queue Lock flags next TFFFFFFF idle locations on which thread spin, one per thread Points to the next unused “spin” location acquiring getAndIncrement: atomically get value of “next”, and increment “next” pointer If “next” was TRUE, lock is acquired © Herlihy-Shavit 2007

SYNAR Systems Networking and Architecture Group Acquiring a Held Lock flags next TFFFFFFF acquired acquiring getAndIncrement T released acquired © Herlihy-Shavit 2007

SYNAR Systems Networking and Architecture Group Anderson Lock: Performance TAS lock TTAS lock Ideal Anderson lock Total time Number of threads Almost ideal. We avoid all unnecessary invalidations. Portable – no tunable parameters. © Herlihy-Shavit 2007

SYNAR Systems Networking and Architecture Group Scalable Synchronization: Summary Making synchronization primitives scalable is tricky Performance tied to the hardware architecture We looked at these spinlocks: – TAS – poor performance due to invalidations – TTAS – avoids constant invalidations, but a storm of invalidations on lock release – TTAS with backoff – eliminates the storm of invalidations on release – Anderson Queue Lock – completely eliminates all useless invalidations One could think of other optimizations… For more information, look at the references in the syllabus