
1 Synchronization without Contention, John M. Mellor-Crummey and Michael L. Scott. ECE 259 / CPS 221 Advanced Computer Architecture II. Presenter: Tae Jun Ham, 2012. 2. 16

2 Problem
• Busy-waiting synchronization incurs high memory/network contention
- Creates hot spots, which degrades performance
- Causes cache-line invalidations (on every write to the lock)
• Possible approach: add special-purpose hardware for synchronization
- Add synchronization variables to the switching nodes of the interconnection network
- Implement lock queuing mechanisms in the cache controller
• Suggestion in this paper: use scalable synchronization algorithms (MCS) instead of special-purpose hardware

3 Review of Synchronization Algorithms
• Test and Set
- Requires: test_and_set (atomic operation)
- Problems: 1. High cache/memory contention 2. Lack of fairness (acquisition order is random)
LOCK: while (test&set(x) == 1) ;
UNLOCK: x = 0;
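A minimal C sketch of this test-and-set spinlock using C11 atomics; the names tas_lock, tas_acquire, and tas_release are illustrative, not from the paper:

#include <stdatomic.h>

typedef struct { atomic_flag held; } tas_lock;
/* initialize with: tas_lock l = { ATOMIC_FLAG_INIT };  (flag clear == lock free) */

static void tas_acquire(tas_lock *l) {
    /* Every iteration is a write (test&set), so all waiters keep invalidating
       each other's copy of the lock line -- the contention problem above. */
    while (atomic_flag_test_and_set_explicit(&l->held, memory_order_acquire))
        ;  /* spin */
}

static void tas_release(tas_lock *l) {
    atomic_flag_clear_explicit(&l->held, memory_order_release);
}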

4 Review of Synchronization Algorithms
• Test and Set with Backoff
- Same as Test and Set, but inserts a delay between attempts
- Delay growth: 1. Linear: time = time + constant 2. Exponential: time = time * constant
- Performance: reduced contention, but still not fair
LOCK: while (test&set(x) == 1) { delay(time); }
UNLOCK: x = 0;
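A sketch of the exponential-backoff variant, again with C11 atomics; the busy-delay loop and the backoff cap are illustrative assumptions:

#include <stdatomic.h>

typedef struct { atomic_flag held; } tas_lock;   /* init with { ATOMIC_FLAG_INIT } */

static void cpu_delay(unsigned iters) {
    /* Crude busy delay; a real port would use a pause/yield hint instead. */
    for (volatile unsigned i = 0; i < iters; i++) ;
}

static void tas_backoff_acquire(tas_lock *l) {
    unsigned delay = 1;
    while (atomic_flag_test_and_set_explicit(&l->held, memory_order_acquire)) {
        cpu_delay(delay);
        if (delay < 1024) delay *= 2;   /* exponential backoff, capped (assumed cap) */
    }
}

static void tas_backoff_release(tas_lock *l) {
    atomic_flag_clear_explicit(&l->held, memory_order_release);
}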

5 Review of Synchronization Algorithms
• Ticket Lock
- Requires: fetch_and_increment (atomic operation)
- Advantage: fair (FIFO)
- Disadvantage: contention (memory/network), since all waiters spin on the same now_serving word
LOCK: myticket = fetch_and_increment(&L->next_ticket);
      while (myticket != L->now_serving) { delay(time * (myticket - L->now_serving)); }
UNLOCK: L->now_serving = L->now_serving + 1;
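A C11-atomics sketch of the ticket lock, without the proportional backoff shown above; names are illustrative:

#include <stdatomic.h>

typedef struct {
    atomic_uint next_ticket;   /* next ticket to hand out */
    atomic_uint now_serving;   /* ticket currently allowed to enter */
} ticket_lock;

static void ticket_acquire(ticket_lock *l) {
    unsigned my = atomic_fetch_add(&l->next_ticket, 1);   /* fetch&increment */
    while (atomic_load_explicit(&l->now_serving, memory_order_acquire) != my)
        ;  /* all waiters spin (read-only) on the shared now_serving word */
}

static void ticket_release(ticket_lock *l) {
    /* Only the lock holder writes now_serving, so a plain increment is safe. */
    unsigned cur = atomic_load_explicit(&l->now_serving, memory_order_relaxed);
    atomic_store_explicit(&l->now_serving, cur + 1, memory_order_release);
}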

6 Review of Synchronization Algorithms
• Anderson Lock (array-based queue lock)
- Requires: fetch_and_increment (atomic operation)
- Advantages: fair (FIFO), no spinning contention (each processor spins on its own array slot)
- Disadvantages: requires a coherent cache; O(P) space per lock
LOCK: myplace = fetch_and_increment(&L->next_location);
      while (L->location[myplace] == must_wait) ;
      L->location[myplace] = must_wait;
UNLOCK: L->location[myplace + 1] = has_lock;
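A sketch of the array-based queue lock in C11 atomics, assuming a fixed maximum of NPROCS contending threads; the slot count, wrap-around indexing, and padding note are assumptions added for illustration:

#include <stdatomic.h>

#define NPROCS 64
enum { MUST_WAIT = 0, HAS_LOCK = 1 };

typedef struct {
    atomic_uint next_location;       /* next slot to hand out */
    atomic_int  location[NPROCS];    /* ideally padded to one cache line per slot */
} anderson_lock;

static void anderson_init(anderson_lock *l) {
    atomic_store(&l->next_location, 0);
    atomic_store(&l->location[0], HAS_LOCK);   /* first arriver may proceed */
    for (int i = 1; i < NPROCS; i++) atomic_store(&l->location[i], MUST_WAIT);
}

/* Returns the slot index; the caller passes it back to anderson_release(). */
static unsigned anderson_acquire(anderson_lock *l) {
    unsigned myplace = atomic_fetch_add(&l->next_location, 1) % NPROCS;
    while (atomic_load_explicit(&l->location[myplace], memory_order_acquire) == MUST_WAIT)
        ;  /* spin on a private slot */
    atomic_store_explicit(&l->location[myplace], MUST_WAIT, memory_order_relaxed);
    return myplace;
}

static void anderson_release(anderson_lock *l, unsigned myplace) {
    atomic_store_explicit(&l->location[(myplace + 1) % NPROCS], HAS_LOCK,
                          memory_order_release);
}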

7 MCS Lock – based on a linked list
• Acquire:
1. fetch_and_store on the tail pointer: get the predecessor and make this node the new tail
2. Set the arriving node's locked flag to true
3. Set the predecessor's next pointer to the arriving node
4. Spin until locked == false
[Diagram: nodes 1-4 linked in queue order with the tail pointer on node 4; the head node runs (locked = false) while the others spin (locked = true)]
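A C11-atomics sketch of MCS acquire; each thread passes in its own queue node, and the struct and function names are illustrative rather than the paper's:

#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct mcs_node {
    struct mcs_node *_Atomic next;
    atomic_bool locked;
} mcs_node;

typedef struct { mcs_node *_Atomic tail; } mcs_lock;   /* tail == NULL means free */

static void mcs_acquire(mcs_lock *l, mcs_node *me) {
    atomic_store_explicit(&me->next, NULL, memory_order_relaxed);
    /* 1. swap ourselves in as the new tail and learn our predecessor */
    mcs_node *pred = atomic_exchange_explicit(&l->tail, me, memory_order_acq_rel);
    if (pred != NULL) {
        /* 2. mark ourselves as waiting, 3. link behind the predecessor */
        atomic_store_explicit(&me->locked, true, memory_order_relaxed);
        atomic_store_explicit(&pred->next, me, memory_order_release);
        /* 4. spin on our own node only */
        while (atomic_load_explicit(&me->locked, memory_order_acquire))
            ;
    }
}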

8 MCS Lock – based on a linked list
• Release (fast path): check whether the next pointer is set (i.e., a successor has finished linking itself)
- If it is set, clear the successor's locked flag so it can run
[Diagram: the head node finishes; tail still points to node 4; the successor's locked flag becomes false and it runs while the rest keep spinning]

9 MCS Lock – based on a linked list
• Release (slow path): if the next pointer is not yet set
- Check whether the tail still points to this node (compare_and_swap with NULL); if so, the lock is now free
- Otherwise, a successor is in the middle of linking: wait until its next pointer is set
- Then clear the successor's locked flag
[Diagram: two-node queue; the releasing node either swings tail back to NULL or hands the lock to the newly linked successor]
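The matching release sketch, covering both paths described above; it repeats the same illustrative types as the acquire sketch so it stands on its own:

#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct mcs_node {
    struct mcs_node *_Atomic next;
    atomic_bool locked;
} mcs_node;

typedef struct { mcs_node *_Atomic tail; } mcs_lock;

static void mcs_release(mcs_lock *l, mcs_node *me) {
    mcs_node *succ = atomic_load_explicit(&me->next, memory_order_acquire);
    if (succ == NULL) {
        /* No visible successor: if the tail is still us, swing it back to NULL
           and we are done (nobody else is queued). */
        mcs_node *expected = me;
        if (atomic_compare_exchange_strong_explicit(&l->tail, &expected, NULL,
                memory_order_acq_rel, memory_order_acquire))
            return;
        /* A successor has swapped the tail but has not linked itself yet;
           wait for its next pointer to appear. */
        while ((succ = atomic_load_explicit(&me->next, memory_order_acquire)) == NULL)
            ;
    }
    /* Hand the lock over by ending the successor's local spin. */
    atomic_store_explicit(&succ->locked, false, memory_order_release);
}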

10 MCS Lock – Concurrent Read Version

11 MCS Lock – Concurrent Read Version
• Start_Read:
- If the predecessor is nil or an active reader, atomically increment reader_count and proceed
- Else, spin until another Start_Read or an End_Write unblocks this reader; it then unblocks its successor reader (if any)
• End_Read:
- If the successor is a writer, set next_writer to it
- Atomically decrement reader_count
- If this was the last reader (reader_count == 0), check next_writer and unblock it
• Start_Write:
- If the predecessor is nil and there are no active readers (reader_count == 0), proceed
- Else, spin until the last End_Read unblocks this writer
• End_Write:
- If the successor is a reader, atomically increment reader_count and unblock it

12 Review of Barriers
• Centralized counter barrier
- Each arriving processor atomically updates a shared counter and then spins until everyone has arrived
• Advantage: simplicity
• Disadvantages: hot spot, contention
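A minimal sense-reversing centralized barrier in C11 atomics; the sense-reversal detail is the standard way to make the counter reusable across episodes, and all names are illustrative:

#include <stdatomic.h>
#include <stdbool.h>

typedef struct {
    atomic_int  count;    /* number of processors that have arrived */
    atomic_bool sense;    /* flips each barrier episode */
    int         total;    /* number of participating processors */
} central_barrier;

/* local_sense is a per-thread variable, initially false. */
static void central_barrier_wait(central_barrier *b, bool *local_sense) {
    *local_sense = !*local_sense;
    if (atomic_fetch_add(&b->count, 1) == b->total - 1) {
        /* Last arriver: reset the counter and release everyone. */
        atomic_store(&b->count, 0);
        atomic_store_explicit(&b->sense, *local_sense, memory_order_release);
    } else {
        /* Everyone else spins on the single shared flag -- the hot spot. */
        while (atomic_load_explicit(&b->sense, memory_order_acquire) != *local_sense)
            ;
    }
}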

13 Review of Barriers
• Combining Tree Barrier
• Advantages: simplicity, less contention, parallelized fetch_and_increment (arrivals are combined up a tree of counters)
• Disadvantage: processors still spin on non-local locations
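A rough sketch of a software combining-tree barrier: arrival counters are spread over tree nodes so no single counter sees all P updates. The recursion, the node layout, and the sense-reversal reuse are assumptions added for illustration:

#include <stdatomic.h>
#include <stdbool.h>

typedef struct tree_node {
    atomic_int        count;    /* arrivals at this node so far */
    int               total;    /* children (or leaf processors) expected here */
    atomic_bool       sense;    /* release flag for this episode */
    struct tree_node *parent;   /* NULL at the root */
} tree_node;

static void combining_arrive(tree_node *node, bool my_sense) {
    if (atomic_fetch_add(&node->count, 1) == node->total - 1) {
        /* Last arriver at this node: propagate the arrival upward first. */
        if (node->parent != NULL)
            combining_arrive(node->parent, my_sense);
        atomic_store(&node->count, 0);                      /* re-arm for reuse */
        atomic_store_explicit(&node->sense, my_sense, memory_order_release);
    } else {
        /* Spin on this node's flag -- closer than the root, but possibly
           still a non-local memory location. */
        while (atomic_load_explicit(&node->sense, memory_order_acquire) != my_sense)
            ;
    }
}

/* Called by each processor on its assigned leaf node; local_sense starts false. */
static void combining_barrier(tree_node *leaf, bool *local_sense) {
    *local_sense = !*local_sense;
    combining_arrive(leaf, *local_sense);
}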

14 Review of Barriers
• Bidirectional Tournament Barrier
- The winner of each round is statically determined
• Advantages: no fetch_and_op needed; each processor spins on a local location

15 Review of Barriers
• Dissemination Barrier
- Can be viewed as a variation of the tournament barrier (the communication pattern is statically determined)
- Also suitable for message-passing (MPI-style) systems
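A shared-memory sketch of the dissemination pattern: in round k, processor i signals processor (i + 2^k) mod P and waits for a signal from (i - 2^k) mod P, for ceil(log2 P) rounds. The flag layout (per processor, per round, two parities plus a flipping sense so flags can be reused) follows the usual construction; the fixed P and all names are assumptions:

#include <stdatomic.h>
#include <stdbool.h>

#define P        8                      /* number of processors (assumed fixed) */
#define ROUNDS   3                      /* ceil(log2(P)) */

/* flags[parity][round][proc]: proc has been signalled in this round/episode */
static atomic_bool flags[2][ROUNDS][P];

/* Per-thread state: parity alternates 0/1, sense flips every two episodes. */
typedef struct { int parity; bool sense; } dissem_state;   /* init: {0, true} */

static void dissemination_barrier(int me, dissem_state *st) {
    for (int r = 0; r < ROUNDS; r++) {
        int partner = (me + (1 << r)) % P;
        /* Signal this round's partner... */
        atomic_store_explicit(&flags[st->parity][r][partner], st->sense,
                              memory_order_release);
        /* ...and wait to be signalled ourselves (by (me - 2^r) mod P). */
        while (atomic_load_explicit(&flags[st->parity][r][me],
                                    memory_order_acquire) != st->sense)
            ;
    }
    if (st->parity == 1) st->sense = !st->sense;
    st->parity = 1 - st->parity;
}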

16 MCS Barriers
• MCS Barrier (Arrival)
- Similar to the combining tree barrier, but arrival uses a static tree and each processor spins only on locally-stored flags
- Local spinning / O(P) space / 2P-2 communication events / O(log P) critical path

17 MCS Barriers
• MCS Barrier (Wakeup)
- Wakeup propagates down a separate tree, again with purely local spinning
- Local spinning / O(P) space / 2P-2 communication events / O(log P) critical path
[Diagram: wakeup tree over processors 0-5]
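A sketch of the tree barrier following the structure the paper describes (fan-in-4 arrival tree, binary wakeup tree, one statically allocated node per processor); the C11-atomics rendering, field names, and fixed P are assumptions of this sketch:

#include <stdatomic.h>
#include <stdbool.h>

#define P 8   /* number of processors, assumed fixed */

typedef struct {
    atomic_bool  childnotready[4];   /* arrival fan-in 4: cleared by children   */
    bool         havechild[4];       /* which arrival children actually exist   */
    atomic_bool *parentflag;         /* my slot in my arrival parent's array    */
    atomic_bool  parentsense;        /* set by my wakeup parent to release me   */
    atomic_bool *childsense[2];      /* wakeup fan-out 2: my children's flags   */
    atomic_bool  dummy;              /* target for nonexistent parents/children */
} barrier_node;

static barrier_node nodes[P];

static void tree_barrier_init(void) {
    for (int i = 0; i < P; i++) {
        barrier_node *n = &nodes[i];
        for (int j = 0; j < 4; j++) {
            n->havechild[j] = (4 * i + j + 1) < P;
            atomic_store(&n->childnotready[j], n->havechild[j]);
        }
        n->parentflag = (i == 0) ? &n->dummy
                                 : &nodes[(i - 1) / 4].childnotready[(i - 1) % 4];
        n->childsense[0] = (2 * i + 1 < P) ? &nodes[2 * i + 1].parentsense : &n->dummy;
        n->childsense[1] = (2 * i + 2 < P) ? &nodes[2 * i + 2].parentsense : &n->dummy;
        atomic_store(&n->parentsense, false);
    }
}

/* sense is a per-thread variable, initially true. */
static void tree_barrier(int me, bool *sense) {
    barrier_node *n = &nodes[me];
    /* Arrival: wait for all existing children to report, then tell the parent. */
    for (int j = 0; j < 4; j++)
        while (atomic_load_explicit(&n->childnotready[j], memory_order_acquire))
            ;
    for (int j = 0; j < 4; j++)   /* re-arm for the next barrier episode */
        atomic_store(&n->childnotready[j], n->havechild[j]);
    atomic_store_explicit(n->parentflag, false, memory_order_release);
    /* Wakeup: the root falls through; everyone else spins on its own flag. */
    if (me != 0)
        while (atomic_load_explicit(&n->parentsense, memory_order_acquire) != *sense)
            ;
    atomic_store_explicit(n->childsense[0], *sense, memory_order_release);
    atomic_store_explicit(n->childsense[1], *sense, memory_order_release);
    *sense = !*sense;
}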

18 Spin Lock Evaluation
• Butterfly machine results
- Three scaled badly and four scaled well; MCS was the best
- Backoff was effective

19 Spin Lock Evaluation
• Butterfly machine results
- Measured the time between consecutive lock acquisitions on separate processors, rather than a single acquire/release pair from start to finish

20 Spin Lock Evaluation
• Symmetry machine results
- MCS and Anderson locks scale well
- The ticket lock could not be implemented on the Symmetry because it lacks a fetch_and_increment operation
- The Symmetry results appear more reliable

21 Spin Lock Evaluation
• Network latency
- MCS greatly reduces the increase in network latency caused by synchronization
- Local spinning reduces contention

22 Barrier Evaluation
• Butterfly machine
- The dissemination barrier was best
- The bidirectional tournament and MCS tree barriers were acceptable
- Remote memory accesses degrade performance significantly

23 Barrier Evaluation
• Symmetry machine
- The centralized counter barrier was best; the dissemination barrier was worst
- A bus-based architecture provides cheap broadcast
- The MCS arrival tree outperforms the counter beyond 16 processors

24 Local Memory Evaluation

25
• Having local memory is extremely important: it affects both performance and network contention
• A dancehall system (all memory remote from every processor) is not really scalable

26 Summary
• This paper proposed a scalable spin-lock algorithm (MCS) that avoids network contention
• It also proposed scalable barrier algorithms
• It showed that, with appropriate algorithms, network contention due to busy-wait synchronization need not be a problem
• It argued that special-purpose hardware such as a QOSB lock would not be cost-effective compared with the MCS lock
• It suggests using distributed memory or coherent caches rather than dance-hall memory without coherent caches

27 Discussion
• What would be the primary disadvantage of the MCS lock?
• In what cases would the MCS lock perform worse than other locks?
• What do you think about special-purpose hardware-based locks?
• Is the space usage of a lock important?
• Can we benefit from a dancehall-style memory architecture (e.g., disaggregated memory)?
• Is there a way to implement an energy-efficient spin lock?

