1 Algorithms for Scalable Synchronization on Shared-Memory Multiprocessors
John M. Mellor-Crummey, Michael L. Scott
Presented by Joseph Garvey & Joshua San Miguel

2 Dance Hall Machines?

3 Atomic Instructions
Various instructions are known collectively as fetch_and_Φ instructions: test_and_set, fetch_and_store, fetch_and_add, compare_and_swap. Some can be used to simulate others, but often with overhead. Some lock types require a particular primitive in order to be implemented at all, or to be implemented efficiently.
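For concreteness, the fetch_and_Φ family maps onto C11 atomics (atomic_fetch_add, atomic_exchange, atomic_compare_exchange). The slide's point that one primitive can simulate another at some cost can be sketched by building fetch_and_add out of compare_and_swap (the function name is ours):

```c
#include <stdatomic.h>

// Simulating fetch_and_add with compare_and_swap: retry the CAS until
// no other processor has changed the word between our read and our swap.
// The retry loop is the "overhead" the slide mentions.
int fetch_and_add_via_cas(atomic_int *addr, int delta) {
    int old = atomic_load(addr);
    while (!atomic_compare_exchange_weak(addr, &old, old + delta))
        ;  // on failure, `old` is refreshed with the current value; retry
    return old;  // fetch_and_add returns the value before the addition
}
```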

4 Test_and_set: Basic

type lock = (unlocked, locked)

procedure acquire_lock (lock *L)
    while test_and_set (L) == locked
        ;  // spin

procedure release_lock (lock *L)
    *L = unlocked
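A minimal C11 rendering of this spin lock, with atomic_flag_test_and_set standing in for the test_and_set primitive (the type and function names are ours):

```c
#include <stdatomic.h>

typedef atomic_flag ts_lock;  // clear = unlocked, set = locked

static inline void ts_acquire(ts_lock *L) {
    // test_and_set returns the previous value; spin while it was already set
    while (atomic_flag_test_and_set_explicit(L, memory_order_acquire))
        ;  // spin -- every iteration is a write, hence the bus traffic
}

static inline void ts_release(ts_lock *L) {
    atomic_flag_clear_explicit(L, memory_order_release);
}
```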

5 [Diagram: three processors (P), each with a cache ($), connected to shared memory]

6 Test_and_set: test_and_test_and_set

type lock = (unlocked, locked)

procedure acquire_lock (lock *L)
    while 1
        if *L == unlocked
            if test_and_set (L) == unlocked
                return

procedure release_lock (lock *L)
    *L = unlocked
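In C11 the same idea reads as follows (names ours): waiters spin on an ordinary load, which hits in their own cache, and only attempt the invalidating test_and_set once the lock looks free.

```c
#include <stdatomic.h>
#include <stdbool.h>

typedef atomic_bool tts_lock;  // false = unlocked

void tts_acquire(tts_lock *L) {
    for (;;) {
        // Read-only spin: the cache line stays shared, so waiting
        // processors generate no interconnect traffic.
        while (atomic_load_explicit(L, memory_order_relaxed))
            ;
        // Lock looks free -- now try the (invalidating) test_and_set.
        if (!atomic_exchange_explicit(L, true, memory_order_acquire))
            return;
    }
}

void tts_release(tts_lock *L) {
    atomic_store_explicit(L, false, memory_order_release);
}
```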

7 [Diagram: three processors (P) with caches ($) and shared memory, showing the cache traffic for test_and_test_and_set]

8 Test_and_set: test_and_set with backoff

type lock = (unlocked, locked)

procedure acquire_lock (lock *L)
    delay = 1
    while test_and_set (L) == locked
        pause (delay)
        delay = delay * 2

procedure release_lock (lock *L)
    *L = unlocked
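A C11 sketch of the backoff variant; the delay cap and the busy-wait pause_for loop are tuning assumptions of ours, not part of the slide:

```c
#include <stdatomic.h>
#include <stdbool.h>

enum { BO_DELAY_MAX = 1 << 10 };  // cap the backoff (assumed value)

static void pause_for(unsigned spins) {
    for (volatile unsigned i = 0; i < spins; i++)
        ;  // burn cycles locally instead of hammering the lock line
}

void bo_acquire(atomic_bool *L) {
    unsigned delay = 1;
    while (atomic_exchange_explicit(L, true, memory_order_acquire)) {
        pause_for(delay);
        if (delay < BO_DELAY_MAX)
            delay *= 2;  // double the wait after each failed attempt
    }
}

void bo_release(atomic_bool *L) {
    atomic_store_explicit(L, false, memory_order_release);
}
```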

9 [Diagram: three processors (P) with caches ($) and shared memory, showing reduced traffic under backoff]

10 Ticket Lock

type lock = record
    next_ticket = 0
    now_serving = 0

procedure acquire_lock (lock *L)
    my_ticket = fetch_and_increment (L->next_ticket)
    while 1
        if L->now_serving == my_ticket
            return

procedure release_lock (lock *L)
    L->now_serving = L->now_serving + 1
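The ticket lock translates directly to C11, with atomic_fetch_add playing the role of fetch_and_increment (names ours):

```c
#include <stdatomic.h>

typedef struct {
    atomic_uint next_ticket;  // fetch_and_increment'ed by each arrival
    atomic_uint now_serving;  // advanced by release; grants FIFO order
} ticket_lock;

void ticket_acquire(ticket_lock *L) {
    unsigned my_ticket =
        atomic_fetch_add_explicit(&L->next_ticket, 1, memory_order_relaxed);
    // Read-only spin: waiting costs no stores, only the release invalidates
    while (atomic_load_explicit(&L->now_serving, memory_order_acquire)
           != my_ticket)
        ;
}

void ticket_release(ticket_lock *L) {
    atomic_fetch_add_explicit(&L->now_serving, 1, memory_order_release);
}
```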

11 [Diagram: next_ticket and now_serving live in shared memory; each processor (P, $) holds its own my_ticket]

12 Array-Based Queuing Locks

type lock = record
    slots = array [0…numprocs – 1] of (has_lock, must_wait)
        // initialized to (has_lock, must_wait, …, must_wait)
    next_slot = 0

procedure acquire_lock (lock *L)
    my_place = fetch_and_increment (L->next_slot)
    // various modulo work to handle overflow
    while L->slots[my_place] == must_wait
        ;  // spin on my own slot
    L->slots[my_place] = must_wait  // reset the slot for future reuse

procedure release_lock (lock *L)
    L->slots[my_place + 1] = has_lock
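A C11 sketch of Anderson's array lock for a fixed number of contenders; NPROCS is an assumption of this sketch, and my_place is returned to the caller here rather than kept in a per-process variable as the pseudocode implies:

```c
#include <stdatomic.h>
#include <stdbool.h>

enum { NPROCS = 8 };  // maximum simultaneous contenders (assumed)

typedef struct {
    atomic_bool slots[NPROCS];  // true = has_lock, false = must_wait
    atomic_uint next_slot;
} array_lock;

void array_lock_init(array_lock *L) {
    atomic_store(&L->slots[0], true);  // first arrival proceeds at once
    for (int i = 1; i < NPROCS; i++)
        atomic_store(&L->slots[i], false);
    atomic_store(&L->next_slot, 0);
}

// Returns my_place, which the caller must hand back to array_release.
unsigned array_acquire(array_lock *L) {
    unsigned my_place =
        atomic_fetch_add_explicit(&L->next_slot, 1, memory_order_relaxed)
        % NPROCS;  // the "modulo work" for wraparound
    while (!atomic_load_explicit(&L->slots[my_place], memory_order_acquire))
        ;  // each process spins on its own slot -- no shared hot spot
    atomic_store_explicit(&L->slots[my_place], false, memory_order_relaxed);
    return my_place;
}

void array_release(array_lock *L, unsigned my_place) {
    atomic_store_explicit(&L->slots[(my_place + 1) % NPROCS], true,
                          memory_order_release);
}
```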

13 [Diagram: next_slot and the slots array live in shared memory; each processor (P, $) holds its own my_place]

14 MCS Locks

type qnode = record
    qnode *next
    bool locked

type lock = qnode*

procedure acquire_lock (lock *L, qnode *I)
    I->next = Null
    qnode *predecessor = fetch_and_store (L, I)
    if predecessor != Null
        I->locked = true
        predecessor->next = I
        while I->locked
            ;  // spin

procedure release_lock (lock *L, qnode *I)
    if I->next == Null
        if compare_and_swap (L, I, Null)
            return
        while I->next == Null
            ;  // spin
    I->next->locked = false
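A C11 sketch of the MCS lock, with atomic_exchange for fetch_and_store and atomic_compare_exchange for compare_and_swap (names and memory orders are ours):

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct qnode {
    struct qnode *_Atomic next;
    atomic_bool locked;
} qnode;

typedef struct { qnode *_Atomic tail; } mcs_lock;  // lock = tail pointer

void mcs_acquire(mcs_lock *L, qnode *I) {
    atomic_store_explicit(&I->next, NULL, memory_order_relaxed);
    // fetch_and_store: swap ourselves in as the new tail of the queue
    qnode *pred = atomic_exchange_explicit(&L->tail, I, memory_order_acq_rel);
    if (pred != NULL) {  // queue was non-empty: link in and wait our turn
        atomic_store_explicit(&I->locked, true, memory_order_relaxed);
        atomic_store_explicit(&pred->next, I, memory_order_release);
        while (atomic_load_explicit(&I->locked, memory_order_acquire))
            ;  // spin on our own qnode only -- the key to MCS scalability
    }
}

void mcs_release(mcs_lock *L, qnode *I) {
    if (atomic_load_explicit(&I->next, memory_order_acquire) == NULL) {
        // compare_and_swap: if we are still the tail, empty the queue
        qnode *expected = I;
        if (atomic_compare_exchange_strong_explicit(
                &L->tail, &expected, NULL,
                memory_order_acq_rel, memory_order_relaxed))
            return;
        // A new arrival swapped the tail but hasn't linked in yet: wait
        while (atomic_load_explicit(&I->next, memory_order_acquire) == NULL)
            ;
    }
    qnode *succ = atomic_load_explicit(&I->next, memory_order_relaxed);
    atomic_store_explicit(&succ->locked, false, memory_order_release);
}
```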

15 MCS Locks
[Animation: step-by-step trace of release_lock as queued processes link in and the lock is handed from node to node]

16 MCS Locks
[Diagram: the lock (tail pointer) and qnodes (next, locked) in shared memory, cached by each processor (P, $)]

17 Results: Scalability – Distributed Memory Architecture

18 Results: Scalability – Cache Coherent Architecture

19 Results: Single Processor Lock/Release Time

Times in μs                 Test_and_set   Ticket   Anderson (Queue)   MCS
Butterfly (Distributed)         34.9        38.7         65.7          71.3
Symmetry (Cache coherent)        7.0         NA          10.6           9.2

Butterfly’s atomic insns are very expensive; the Butterfly also can’t handle 24-bit pointers.

20 Results: Network Congestion

                                 Increase in Network Latency Measured From
Busy-wait Lock                   Lock Node      Idle Node
test_and_set                       1420%           96%
test_and_set w/ linear backoff      882%           67%
test_and_set w/ exp. backoff         32%            4%
ticket                              992%           97%
ticket w/ prop. backoff              53%            8%
Anderson                             75%           67%
MCS                                   4%            2%

21 Which lock should I use?

If atomic insns are much more expensive than normal insns and single-processor latency is very important → don’t use MCS.
If processes might be preempted → test_and_set with exponential backoff.
Otherwise:
    fetch_and_store supported?
        Yes → MCS
        No → fetch_and_increment supported?
            Yes → Ticket
            No → test_and_set w/ exp. backoff

22 Centralized Barrier
[Animation: P0–P3 increment a shared counter from 0 to 4; the last arrival releases all waiters]
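The centralized barrier is usually made reusable with sense reversal, so the structure need not be reinitialized between episodes. A single-address-space C11 sketch (names and the sense-reversal packaging are ours):

```c
#include <stdatomic.h>
#include <stdbool.h>

// Sense-reversing centralized barrier: each process keeps a private
// local_sense that flips on every barrier episode.
typedef struct {
    atomic_int count;   // processes still to arrive this episode
    int nprocs;
    atomic_bool sense;  // flips once per episode to release waiters
} central_barrier;

void barrier_wait(central_barrier *B, bool *local_sense) {
    *local_sense = !*local_sense;  // this episode's target sense
    if (atomic_fetch_sub_explicit(&B->count, 1, memory_order_acq_rel) == 1) {
        // Last arrival: reset the counter and release everyone
        atomic_store_explicit(&B->count, B->nprocs, memory_order_relaxed);
        atomic_store_explicit(&B->sense, *local_sense, memory_order_release);
    } else {
        // Everyone spins on the single shared flag -- this hot spot is
        // exactly why the later tree-based barriers exist.
        while (atomic_load_explicit(&B->sense, memory_order_acquire)
               != *local_sense)
            ;
    }
}
```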

23 Software Combining Tree Barrier
[Animation: P0–P3 update counters at the leaves of a combining tree; the last arrival at each node continues up toward the root]

24 Tournament Barrier
[Animation: P0–P3 play pairwise rounds; losers (L) wait, winners (W) advance, and the champion (C) triggers wakeup]

25 Dissemination Barrier
[Animation: in round k, each of P0–P3 signals the process 2^k positions away; after log p rounds all have heard from all]

26 New Tree-Based Barrier
[Animation: P0–P3 signal arrival up a static tree of flags; wakeup propagates back down the tree]

27 Summary

Barrier                  Space       Wakeup     Local Spinning   Network Txns
Centralized              O(1)        broadcast  no               O(p) or O(∞)
Software Combining Tree  O(p)        tree       no               O(p × fan-in) or O(∞)
Tournament               O(p log p)  tree       yes              O(p)
Dissemination            O(p log p)  none       yes              O(p log p)
New Tree-Based           O(p)        tree       yes              2p – 2

28 Results – Distributed Shared Memory
[Graph slide; the transcript repeats the summary table from slide 27 in place of the graph]

29 Results – Broadcast-Based Cache-Coherent
[Graph slide; the transcript repeats the summary table from slide 27 in place of the graph]

30 Results – Local vs. Remote Spinning

Barrier         Network Latency (local)   Network Latency (remote)
New Tree-Based  10% increase              124% increase
Dissemination   18% increase              117% increase

31 Barrier Decision Tree

Multiprocessor?
    Distributed Shared Memory → Dissemination Barrier, or New Tree-Based Barrier (tree wakeup)
    Broadcast-Based Cache-Coherent → Centralized Barrier, or New Tree-Based Barrier (central wakeup)

32 Architectural Recommendations
No dance hall machines. No need for complicated hardware synchronization. Need a full set of fetch_and_Φ primitives.

