Synchronization without Contention
John M. Mellor-Crummey and Michael L. Scott
ECE 259 / CPS 221 Advanced Computer Architecture II
Presenter: Tae Jun Ham

Problem
- Busy-waiting synchronization incurs heavy memory/network contention: the lock becomes a hot spot, and every write to it invalidates cached copies, degrading performance.
- Possible approach: add special-purpose hardware for synchronization, e.g. synchronization variables in the interconnect switching nodes, or lock-queuing mechanisms in the cache controller.
- This paper's suggestion: use a scalable synchronization algorithm (MCS) instead of special-purpose hardware.

Review of Synchronization Algorithms: Test-and-Set Lock
- Requires: test_and_set (atomic operation)
- Problems: 1. heavy cache/memory contention; 2. no fairness (acquisition order is random)
- Pseudocode:
  LOCK:   while (test_and_set(x) == 1) ;
  UNLOCK: x = 0;
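To make the pattern concrete, here is a minimal C sketch of a test-and-set spin lock. It assumes GCC/Clang __atomic built-ins and is an illustration of the slide above, not code from the paper.

/* Test-and-set spin lock: every failed attempt writes the lock word,
   which is exactly what generates the cache-line invalidation traffic. */
typedef struct { volatile char locked; } tas_lock;     /* 0 = free, 1 = held */

static void tas_acquire(tas_lock *l) {
    while (__atomic_test_and_set(&l->locked, __ATOMIC_ACQUIRE))
        ;                                               /* spin until the lock is free */
}

static void tas_release(tas_lock *l) {
    __atomic_clear(&l->locked, __ATOMIC_RELEASE);       /* x = 0 */
}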

Review of Synchronization Algorithms: Test-and-Set with Backoff
- Same as test-and-set, but inserts a delay between attempts
- Delay policies: 1. linear (delay = delay + constant); 2. exponential (delay = delay * constant)
- Performance: contention is reduced, but the lock is still unfair
- Pseudocode:
  LOCK:   while (test_and_set(x) == 1) { delay(time); }
  UNLOCK: x = 0;
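A sketch of the exponential-backoff variant, reusing the tas_lock type from the previous sketch; the cap of 1024 iterations is an arbitrary illustrative value, not one from the paper.

static void tas_backoff_acquire(tas_lock *l) {
    unsigned delay = 1;                                 /* backoff period in spin iterations */
    while (__atomic_test_and_set(&l->locked, __ATOMIC_ACQUIRE)) {
        for (volatile unsigned i = 0; i < delay; i++)
            ;                                           /* wait out the backoff period */
        if (delay < 1024)
            delay *= 2;                                 /* exponential: delay = delay * constant */
    }
}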

Review of Synchronization Algorithms: Ticket Lock
- Requires: fetch_and_increment (atomic operation)
- Advantage: fair (FIFO)
- Disadvantage: memory/network contention (all waiters spin on the same now_serving word)
- Pseudocode:
  LOCK:   myticket = fetch_and_increment(&(L->next_ticket));
          while (myticket != L->now_serving)
              delay(time * (myticket - L->now_serving));
  UNLOCK: L->now_serving = L->now_serving + 1;
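A minimal ticket-lock sketch in the same style, again assuming GCC/Clang __atomic built-ins for fetch_and_increment. Note that all waiters still poll the single now_serving word, which is the contention the slide mentions.

typedef struct {
    volatile unsigned next_ticket;                      /* next ticket to hand out */
    volatile unsigned now_serving;                      /* ticket currently allowed in */
} ticket_lock;

static unsigned ticket_acquire(ticket_lock *L) {
    unsigned my = __atomic_fetch_add(&L->next_ticket, 1, __ATOMIC_RELAXED);
    while (__atomic_load_n(&L->now_serving, __ATOMIC_ACQUIRE) != my)
        ;   /* optionally: delay proportional to (my - now_serving), as in the slide */
    return my;
}

static void ticket_release(ticket_lock *L) {
    /* Only the lock holder writes now_serving, so a plain increment is sufficient. */
    __atomic_store_n(&L->now_serving, L->now_serving + 1, __ATOMIC_RELEASE);
}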

Review of Synchronization Algorithms: Anderson Lock (array-based queue lock)
- Requires: fetch_and_increment (atomic operation)
- Advantages: fair (FIFO); each processor spins on its own array slot, so no cache contention
- Disadvantages: requires a coherent cache; O(P) space per lock
- Pseudocode:
  LOCK:   myplace = fetch_and_increment(&(L->next_location));   // taken modulo the array size
          while (L->location[myplace] == must_wait) ;
          L->location[myplace] = must_wait;
  UNLOCK: L->location[myplace + 1] = has_lock;
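A sketch of the array-based queue lock; NPROCS and the slot encoding are assumptions made for the example. Each processor spins only on its own slot, so a release invalidates only the successor's cache line.

#define NPROCS 4                       /* number of contending processors (assumed) */
enum { MUST_WAIT = 0, HAS_LOCK = 1 };

typedef struct {
    volatile int slots[NPROCS];        /* ideally each slot padded to its own cache line */
    volatile unsigned next_slot;
} array_lock;

static void array_lock_init(array_lock *L) {
    L->slots[0] = HAS_LOCK;
    for (int i = 1; i < NPROCS; i++)
        L->slots[i] = MUST_WAIT;
    L->next_slot = 0;
}

static unsigned array_acquire(array_lock *L) {
    unsigned my = __atomic_fetch_add(&L->next_slot, 1, __ATOMIC_RELAXED) % NPROCS;
    while (__atomic_load_n(&L->slots[my], __ATOMIC_ACQUIRE) == MUST_WAIT)
        ;                              /* spin on our own slot only */
    L->slots[my] = MUST_WAIT;          /* reset the slot for the next trip around the array */
    return my;                         /* caller passes this back to array_release */
}

static void array_release(array_lock *L, unsigned my) {
    __atomic_store_n(&L->slots[(my + 1) % NPROCS], HAS_LOCK, __ATOMIC_RELEASE);
}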

MCS Lock – based on a linked list
Acquire (a C sketch combining acquire and release appears after the release slides below):
1. fetch_and_store on the tail pointer: get the predecessor and install this processor's node as the new tail
2. Set the arriving processor's node to locked
3. Set the predecessor's next pointer to the arriving node
4. Spin until locked == false
[Diagram: four queued processor nodes; node 1 holds the lock (Locked: false) and runs, nodes 2-4 spin on their own nodes (Locked: true); tail points to the last arrival]

MCS Lock – based on a linked list
Release (successor present):
- Check whether this node's next pointer is set, i.e. whether the successor has completed its acquisition step of linking itself
- If it is set, clear the successor's locked flag so the successor can run
[Diagram: before and after release; node 1 finishes, node 2's locked flag is cleared and it runs, nodes 3-4 keep spinning; tail still points to node 4]

MCS Lock – based on a linked list
Release (no successor visible):
- If the next pointer is not set, check whether tail still points to this node (compare_and_swap with nil)
- If the swap succeeds, no one is waiting and the lock becomes free
- If it fails, a successor is in the middle of acquiring: wait until it sets the next pointer, then clear its locked flag
[Diagram: with two nodes, the releaser either compare_and_swaps tail back to nil, or waits for node 2 to link itself and then hands it the lock]
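Putting the acquire and release slides together, here is a compact C sketch of the MCS list-based queue lock, assuming GCC/Clang __atomic built-ins; the names and memory orders are choices made for the example, but the structure follows the steps described above.

typedef struct mcs_node {
    struct mcs_node *next;
    int locked;
} mcs_node;

typedef struct { mcs_node *tail; } mcs_lock;            /* tail == NULL means the lock is free */

static void mcs_acquire(mcs_lock *L, mcs_node *I) {
    I->next = NULL;
    /* 1. Swap ourselves into the tail; the old tail is our predecessor. */
    mcs_node *pred = __atomic_exchange_n(&L->tail, I, __ATOMIC_ACQ_REL);
    if (pred != NULL) {
        I->locked = 1;                                          /* 2. mark our node as waiting   */
        __atomic_store_n(&pred->next, I, __ATOMIC_RELEASE);     /* 3. link behind the predecessor */
        while (__atomic_load_n(&I->locked, __ATOMIC_ACQUIRE))
            ;                                                   /* 4. spin on our own node only   */
    }
}

static void mcs_release(mcs_lock *L, mcs_node *I) {
    if (__atomic_load_n(&I->next, __ATOMIC_ACQUIRE) == NULL) {
        /* No successor visible: try to swing the tail back to NULL. */
        mcs_node *expected = I;
        if (__atomic_compare_exchange_n(&L->tail, &expected, NULL, 0,
                                        __ATOMIC_ACQ_REL, __ATOMIC_ACQUIRE))
            return;                                             /* nobody was waiting */
        /* A successor is mid-acquire: wait for it to finish linking itself. */
        while (__atomic_load_n(&I->next, __ATOMIC_ACQUIRE) == NULL)
            ;
    }
    __atomic_store_n(&I->next->locked, 0, __ATOMIC_RELEASE);    /* hand the lock to the successor */
}

Each contending processor supplies its own mcs_node (typically thread-local), so all spinning happens on locally accessible memory rather than on a shared lock word.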

MCS Lock – Concurrent Read Version

MCS Lock – Concurrent Read Version
- Start_Read: if the predecessor is nil or an active reader, atomically increment reader_count and proceed; otherwise spin until another Start_Read or an End_Write unblocks this reader, then unblock the successor reader (if any)
- End_Read: if the successor is a writer, set next_writer to it; atomically decrement reader_count; if this was the last reader (reader_count == 0), unblock next_writer
- Start_Write: if the predecessor is nil and there are no active readers (reader_count == 0), proceed; otherwise spin until the last End_Read unblocks this writer
- End_Write: if the successor is a reader, atomically increment reader_count and unblock it
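The full queue-based reader-writer lock is intricate; the following is only a simplified, centralized sketch of the reader_count bookkeeping (an atomic reader count plus a writer flag). It is not the scalable MCS version described above: it provides neither FIFO fairness nor local-only spinning, and the names are assumptions for the example.

typedef struct {
    volatile int reader_count;
    volatile int writer_active;                         /* 0 or 1 */
} simple_rw_lock;

static void start_read(simple_rw_lock *l) {
    for (;;) {
        while (__atomic_load_n(&l->writer_active, __ATOMIC_SEQ_CST))
            ;                                           /* wait for the writer to leave */
        __atomic_fetch_add(&l->reader_count, 1, __ATOMIC_SEQ_CST);
        if (!__atomic_load_n(&l->writer_active, __ATOMIC_SEQ_CST))
            return;                                     /* no writer raced in: proceed */
        __atomic_fetch_sub(&l->reader_count, 1, __ATOMIC_SEQ_CST);   /* back off and retry */
    }
}

static void end_read(simple_rw_lock *l) {
    /* The last reader leaving is what lets a waiting writer proceed. */
    __atomic_fetch_sub(&l->reader_count, 1, __ATOMIC_SEQ_CST);
}

static void start_write(simple_rw_lock *l) {
    while (__atomic_exchange_n(&l->writer_active, 1, __ATOMIC_SEQ_CST))
        ;                                               /* claim the writer flag */
    while (__atomic_load_n(&l->reader_count, __ATOMIC_SEQ_CST) != 0)
        ;                                               /* drain the active readers */
}

static void end_write(simple_rw_lock *l) {
    __atomic_store_n(&l->writer_active, 0, __ATOMIC_SEQ_CST);
}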

Review of Barriers: Centralized Counter Barrier
- Each arriving processor atomically updates a shared counter, then spins on a centralized location until all processors have arrived
- Advantage: simplicity
- Disadvantages: the shared counter/flag becomes a hot spot; heavy contention
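A minimal sense-reversing centralized barrier sketch; NTHREADS and the sense-reversal detail are assumptions for the example (the slide only describes the shared counter). All processors spin on the same shared flag, which is exactly the hot spot noted above.

#define NTHREADS 4                                      /* number of participants (assumed) */

typedef struct {
    volatile int count;
    volatile int sense;
} central_barrier;

static void central_barrier_wait(central_barrier *b, int *local_sense) {
    *local_sense = !*local_sense;                       /* flip the phase for this episode */
    if (__atomic_add_fetch(&b->count, 1, __ATOMIC_ACQ_REL) == NTHREADS) {
        b->count = 0;                                   /* last arrival resets the counter */
        __atomic_store_n(&b->sense, *local_sense, __ATOMIC_RELEASE);   /* release everyone */
    } else {
        while (__atomic_load_n(&b->sense, __ATOMIC_ACQUIRE) != *local_sense)
            ;                                           /* all waiters spin on one shared location */
    }
}

Each thread keeps its own local_sense variable, initialized to 0 like the barrier's sense field.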

Review of Barriers: Combining Tree Barrier
- Processors are organized into a tree, so arrivals are combined up the tree and the fetch_and_increment traffic is parallelized
- Advantages: simple; less contention than a single counter
- Disadvantage: processors still spin on non-local (remote) locations

Review of Barriers: Bidirectional Tournament Barrier
- The "winner" of each round is statically determined
- Advantages: no fetch_and_op operations needed; processors spin only on local locations

Review of Barriers: Dissemination Barrier
- Can be viewed as a variation on the tournament barrier with statically determined partners: in round k, processor i signals processor (i + 2^k) mod P
- Also suitable for message-passing (MPI-style) systems
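A sketch of a dissemination barrier for a small fixed processor count; P, ROUNDS, and the flag layout are assumptions made for the example. In round k, processor i signals processor (i + 2^k) mod P and waits to be signalled itself; parity and sense reversal allow the flags to be reused safely across barrier episodes.

#define P 4                                             /* number of processors (assumed) */
#define ROUNDS 2                                        /* ceil(log2(P)) */

typedef struct {
    volatile int flags[2][ROUNDS];                      /* [parity][round], locally allocated */
    int parity;
    int sense;
} dissem_node;

static dissem_node nodes[P];                            /* nodes[i] lives in processor i's memory */

static void dissem_init(void) {
    for (int i = 0; i < P; i++) {
        nodes[i].parity = 0;
        nodes[i].sense = 1;
        for (int p = 0; p < 2; p++)
            for (int r = 0; r < ROUNDS; r++)
                nodes[i].flags[p][r] = 0;
    }
}

static void dissemination_barrier(int me) {
    dissem_node *n = &nodes[me];
    for (int r = 0; r < ROUNDS; r++) {
        int partner = (me + (1 << r)) % P;
        /* Signal the partner for this round ... */
        __atomic_store_n(&nodes[partner].flags[n->parity][r], n->sense, __ATOMIC_RELEASE);
        /* ... and wait to be signalled by processor (me - 2^r) mod P. */
        while (__atomic_load_n(&n->flags[n->parity][r], __ATOMIC_ACQUIRE) != n->sense)
            ;
    }
    if (n->parity == 1)
        n->sense = !n->sense;                           /* reverse sense every other episode */
    n->parity = 1 - n->parity;
}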

MCS Barriers: Arrival Tree
- Similar to the combining tree barrier, but each processor spins only on locally allocated flags
- Local spinning / O(P) space / 2P - 2 network transactions in total / O(log P) critical path

MCS Barriers: Wakeup Tree
- The wakeup (release) phase uses its own tree, again with purely local spinning
- Local spinning / O(P) space / 2P - 2 network transactions in total / O(log P) critical path

Spin Lock Evaluation: Butterfly Machine Results
- Three of the locks scaled badly and four scaled well; MCS performed best
- Backoff was effective

Spin Lock Evaluation: Butterfly Machine Results (methodology)
- The benchmark measures the time between consecutive lock acquisitions on separate processors, rather than the latency of a single acquire/release pair from start to finish

Spin Lock Evaluation: Symmetry Machine Results
- MCS and the Anderson lock scale well
- The ticket lock could not be implemented on the Symmetry, which lacks a fetch_and_increment operation
- The Symmetry results appear to be more reliable

Spin Lock Evaluation: Network Latency
- MCS greatly reduces the growth in network latency under contention
- Local spinning is what keeps contention down

Barrier Evaluation: Butterfly Machine
- The dissemination barrier performed best
- The bidirectional tournament and MCS tree barriers performed acceptably
- Remote memory accesses degrade performance significantly

Barrier Evaluation: Symmetry Machine
- The centralized counter method performed best, and the dissemination barrier was worst
- A bus-based architecture makes broadcast cheap, which favors the centralized approach
- The MCS arrival tree outperforms the counter beyond 16 processors

Local Memory Evaluation

- Having local memory is extremely important: it affects both performance and network contention
- A dance-hall system (all memory remote from all processors) is not really scalable

Summary
- The paper proposes a scalable spin-lock algorithm (MCS) that avoids network contention
- The paper proposes scalable barrier algorithms
- The paper shows that network contention due to busy-wait synchronization need not be a problem
- The paper argues that special-purpose hardware such as the QOSB lock would not be cost-effective compared with the MCS lock
- The paper recommends distributed shared memory or coherent caches rather than dance-hall memory without coherent caches

Discussion
- What is the primary disadvantage of the MCS lock?
- In what cases would the MCS lock perform worse than other locks?
- What do you think of special-purpose hardware locks?
- Does the space usage of a lock matter?
- Can we benefit from a dance-hall style memory architecture (e.g., disaggregated memory)?
- Is there a way to implement an energy-efficient spin lock?