Algorithms for Scalable Synchronization on Shared-Memory Multiprocessors
John M. Mellor-Crummey and Michael L. Scott
Presented by Joseph Garvey & Joshua San Miguel
Dance Hall Machines?
Atomic Instructions

Various instructions are known as fetch_and_Φ instructions: test_and_set, fetch_and_store, fetch_and_add, compare_and_swap. Some can be used to simulate others, but often with overhead. Some lock types require a particular primitive in order to be implemented at all, or to be implemented efficiently.
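As a sketch, these fetch_and_Φ primitives map directly onto C11's <stdatomic.h> operations; the wrapper names below are illustrative, not from the paper:

```c
#include <stdatomic.h>
#include <stdbool.h>

// Each wrapper returns the *old* value, the defining property of a
// fetch_and_Phi primitive.
int test_and_set(atomic_int *loc) {              // set to 1, return old
    return atomic_exchange(loc, 1);
}
int fetch_and_store(atomic_int *loc, int val) {  // swap in val, return old
    return atomic_exchange(loc, val);
}
int fetch_and_add(atomic_int *loc, int val) {    // add val, return old
    return atomic_fetch_add(loc, val);
}
bool compare_and_swap(atomic_int *loc, int expected, int desired) {
    // Install desired only if *loc == expected; report success.
    return atomic_compare_exchange_strong(loc, &expected, desired);
}
```

For example, fetch_and_store can simulate test_and_set (store a 1), and compare_and_swap can simulate the others in a retry loop, which is exactly the simulation overhead mentioned above.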
Test_and_set: Basic

type lock = (unlocked, locked)

procedure acquire_lock (lock *L)
    while test_and_set (L) == locked
        ; // spin

procedure release_lock (lock *L)
    *L = unlocked
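A minimal C sketch of this basic spinlock, using C11's atomic_flag (the standard library's test-and-set primitive); the names ts_acquire/ts_release are illustrative:

```c
#include <stdatomic.h>

typedef struct { atomic_flag locked; } ts_lock;  // flag clear = unlocked

void ts_acquire(ts_lock *l) {
    // test_and_set returns the previous value: spin until it was clear.
    while (atomic_flag_test_and_set(&l->locked))
        ;  // every iteration writes the lock line -> heavy traffic
}

void ts_release(ts_lock *l) {
    atomic_flag_clear(&l->locked);
}
```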
[Diagram: three processors, each with a cache ($), connected to shared memory]
Test_and_set: test_and_test_and_set

type lock = (unlocked, locked)

procedure acquire_lock (lock *L)
    while 1
        if *L == unlocked
            if test_and_set (L) == unlocked
                return

procedure release_lock (lock *L)
    *L = unlocked
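A C sketch of test-and-test-and-set under C11 atomics: waiters spin on a plain load, so they hit in their own caches and only attempt the atomic write once the lock looks free. Names are illustrative:

```c
#include <stdatomic.h>
#include <stdbool.h>

typedef struct { atomic_bool locked; } ttas_lock;  // false = unlocked

void ttas_acquire(ttas_lock *l) {
    for (;;) {
        // "Test": read-only spin; no bus traffic while the line is cached.
        while (atomic_load(&l->locked))
            ;
        // "Test_and_set": only one contender wins the atomic exchange.
        if (!atomic_exchange(&l->locked, true))
            return;
    }
}

void ttas_release(ttas_lock *l) {
    atomic_store(&l->locked, false);
}
```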
[Diagram: three processors, each with a cache ($), connected to shared memory]
Test_and_set: test_and_set with backoff

type lock = (unlocked, locked)

procedure acquire_lock (lock *L)
    delay = 1
    while test_and_set (L) == locked
        pause (delay)
        delay = delay * 2

procedure release_lock (lock *L)
    *L = unlocked
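A C sketch of exponential backoff, assuming a simple spin-loop pause and an assumed cap on the delay (the paper likewise bounds the backoff); pause_for and the cap value are illustrative:

```c
#include <stdatomic.h>
#include <stdbool.h>

typedef struct { atomic_bool locked; } eb_lock;  // false = unlocked

// Illustrative pause: burn roughly `delay` iterations off the bus.
static void pause_for(int delay) {
    for (volatile int i = 0; i < delay; i++)
        ;
}

void eb_acquire(eb_lock *l) {
    int delay = 1;
    const int cap = 1 << 16;                 // assumed cap on the delay
    while (atomic_exchange(&l->locked, true)) {
        pause_for(delay);
        if (delay < cap)
            delay *= 2;                      // exponential backoff
    }
}

void eb_release(eb_lock *l) {
    atomic_store(&l->locked, false);
}
```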
[Diagram: three processors, each with a cache ($), connected to shared memory]
Ticket Lock

type lock = record
    next_ticket = 0
    now_serving = 0

procedure acquire_lock (lock *L)
    my_ticket = fetch_and_increment (L->next_ticket)
    while 1
        if L->now_serving == my_ticket
            return

procedure release_lock (lock *L)
    L->now_serving = L->now_serving + 1
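A C sketch of the ticket lock, with fetch_and_increment expressed as atomic_fetch_add; names are illustrative:

```c
#include <stdatomic.h>

typedef struct {
    atomic_uint next_ticket;   // next ticket to hand out
    atomic_uint now_serving;   // ticket currently holding the lock
} ticket_lock;                 // initialize both fields to 0

unsigned ticket_acquire(ticket_lock *l) {
    // fetch_and_increment: take a ticket; exactly one waiter per value.
    unsigned my_ticket = atomic_fetch_add(&l->next_ticket, 1);
    // Spin read-only until our number comes up (FIFO order).
    while (atomic_load(&l->now_serving) != my_ticket)
        ;
    return my_ticket;
}

void ticket_release(ticket_lock *l) {
    // Only the holder advances now_serving, handing off to the next ticket.
    atomic_fetch_add(&l->now_serving, 1);
}
```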
[Diagram: next_ticket and now_serving in shared memory; each processor ($) holds its own my_ticket]
Array-Based Queuing Locks (Anderson)

type lock = record
    slots = array [0 … numprocs – 1] of (has_lock, must_wait)
        // slots[0] starts as has_lock, the rest as must_wait
    next_slot = 0

procedure acquire_lock (lock *L)
    my_place = fetch_and_increment (L->next_slot)
    // various modulo work to handle overflow
    while L->slots[my_place] == must_wait
        ;
    L->slots[my_place] = must_wait  // reset slot for reuse

procedure release_lock (lock *L)
    L->slots[my_place + 1] = has_lock
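A C sketch of Anderson's array lock. NPROCS and the function names are assumptions for illustration; each waiter spins on its own slot, so release touches only the successor's location:

```c
#include <stdatomic.h>

#define NPROCS 4  // assumed processor count; the slots array is sized to match

typedef struct {
    atomic_int slots[NPROCS];   // 1 = has_lock, 0 = must_wait
    atomic_uint next_slot;
} array_lock;

void array_lock_init(array_lock *l) {
    for (int i = 0; i < NPROCS; i++)
        atomic_store(&l->slots[i], 0);
    atomic_store(&l->slots[0], 1);   // slot 0 starts with the lock
    atomic_store(&l->next_slot, 0);
}

unsigned array_acquire(array_lock *l) {
    // Take a slot; the modulo is the "modulo work" noted in the pseudocode.
    unsigned my_place = atomic_fetch_add(&l->next_slot, 1) % NPROCS;
    while (!atomic_load(&l->slots[my_place]))
        ;                                   // spin on our OWN slot only
    atomic_store(&l->slots[my_place], 0);   // reset slot for reuse
    return my_place;                        // caller passes this to release
}

void array_release(array_lock *l, unsigned my_place) {
    atomic_store(&l->slots[(my_place + 1) % NPROCS], 1);
}
```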
[Diagram: next_slot and the slots array in shared memory; each processor ($) holds its own my_place]
MCS Locks

type qnode = record
    qnode *next
    bool locked

type lock = qnode*

procedure acquire_lock (lock *L, qnode *I)
    I->next = Null
    qnode *predecessor = fetch_and_store (L, I)
    if predecessor != Null
        I->locked = true
        predecessor->next = I
        while I->locked
            ;

procedure release_lock (lock *L, qnode *I)
    if I->next == Null
        if compare_and_swap (L, I, Null)
            return
        while I->next == Null
            ;
    I->next->locked = false
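A C sketch of the MCS lock using C11 atomics, with fetch_and_store as atomic_exchange and compare_and_swap as atomic_compare_exchange_strong; names are illustrative:

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct qnode {
    struct qnode *_Atomic next;
    atomic_bool locked;
} qnode;

typedef struct { qnode *_Atomic tail; } mcs_lock;  // NULL = lock free

void mcs_acquire(mcs_lock *l, qnode *i) {
    atomic_store(&i->next, NULL);
    // fetch_and_store: atomically append ourselves at the queue tail.
    qnode *pred = atomic_exchange(&l->tail, i);
    if (pred != NULL) {
        atomic_store(&i->locked, true);
        atomic_store(&pred->next, i);     // link in behind our predecessor
        while (atomic_load(&i->locked))
            ;                             // spin on our OWN node only
    }
}

void mcs_release(mcs_lock *l, qnode *i) {
    if (atomic_load(&i->next) == NULL) {
        // compare_and_swap: if we are still the tail, the queue empties.
        qnode *expected = i;
        if (atomic_compare_exchange_strong(&l->tail, &expected, NULL))
            return;
        // A successor is mid-enqueue; wait for it to link itself in.
        while (atomic_load(&i->next) == NULL)
            ;
    }
    qnode *succ = atomic_load(&i->next);
    atomic_store(&succ->locked, false);   // hand the lock to our successor
}
```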
MCS Locks: release

[Animation residue: processors 1–5 enqueueing, running the critical section, and releasing the lock in FIFO order, stepping through the release_lock pseudocode on the previous slide]
[Diagram: the lock tail pointer and per-processor qnodes (next, locked) in memory; each processor ($) spins on its own qnode's locked flag]
Results: Scalability – Distributed Memory Architecture
Results: Scalability – Cache Coherent Architecture
Results: Single Processor Lock/Release Time

Times are in μs:

                            Test_and_set   Ticket   Anderson (Queue)   MCS
Butterfly (Distributed)     34.9           38.7     65.7               71.3
Symmetry (Cache coherent)   7.0            NA       10.6               9.2

The Butterfly's atomic insns are very expensive, and the Butterfly can't handle 24-bit pointers.
Results: Network Congestion

Increase in network latency measured from:

Busy-wait lock                    Lock node   Idle node
test_and_set                      1420%       96%
test_and_set w/ linear backoff    882%        67%
test_and_set w/ exp. backoff      32%         4%
ticket                            992%        97%
ticket w/ prop. backoff           53%         8%
Anderson                          75%         67%
MCS                               4%          2%
Which lock should I use?

If atomic insns >> normal insns and single-processor latency is very important: don't use MCS.
If processes might be preempted: use test_and_set with exponential backoff.
Otherwise:
    fetch_and_store supported?
        Yes → MCS
        No  → fetch_and_increment supported?
                  Yes → Ticket
                  No  → test_and_set w/ exp. backoff
Centralized Barrier

[Diagram: processors P0–P3 atomically increment a shared counter from 0 to 4; the last arrival releases the rest]
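A C sketch of a centralized counter barrier with sense reversal so the barrier is reusable across episodes; passing nthreads as a parameter and the function names are illustrative choices, not from the paper:

```c
#include <stdatomic.h>
#include <stdbool.h>

typedef struct {
    atomic_int count;    // arrivals so far in this episode
    atomic_bool sense;   // flips each episode when the last thread arrives
} central_barrier;       // initialize: count = 0, sense = false

// Each thread keeps a private local_sense, initially true.
void barrier_wait(central_barrier *b, int nthreads, bool *local_sense) {
    bool my_sense = *local_sense;
    if (atomic_fetch_add(&b->count, 1) == nthreads - 1) {
        // Last arrival: reset the counter and release everyone.
        atomic_store(&b->count, 0);
        atomic_store(&b->sense, my_sense);
    } else {
        // Everyone else spins on the shared flag -- the hot spot that
        // motivates the tree-based alternatives below.
        while (atomic_load(&b->sense) != my_sense)
            ;
    }
    *local_sense = !my_sense;  // prepare for the next episode
}
```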
Software Combining Tree Barrier

[Diagram: processors P0–P3 combine their arrivals up a tree of counters; the root then releases them back down the tree]
Tournament Barrier

[Diagram: processors P0–P3 paired into rounds; statically determined winners (W) advance, losers (L) spin, and the champion (C) triggers wakeup]
Dissemination Barrier

[Diagram: in round k, each processor Pi signals processor P(i + 2^k) mod p; after ⌈log2 p⌉ rounds every processor has transitively heard from every other]
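The signaling pattern can be sketched by computing, for each round, which partner each processor notifies; these helper functions are hypothetical, written only to make the pattern concrete:

```c
// For p processors, in round k (0-based), processor i signals
// processor (i + 2^k) mod p.
int dissemination_partner(int i, int k, int p) {
    return (i + (1 << k)) % p;
}

// ceil(log2 p) rounds are needed so every processor transitively
// hears from every other.
int dissemination_rounds(int p) {
    int rounds = 0;
    while ((1 << rounds) < p)
        rounds++;
    return rounds;
}
```

With p = 4 this gives two rounds: in round 0 each Pi signals its neighbor, in round 1 each signals the processor two away, matching the O(p log p) network transactions in the summary table.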
New Tree-Based Barrier

[Diagram: processors P0–P3 arranged in an arrival tree; each processor waits for its children, signals its parent, and spins only on locally accessible flags]
Summary

Barrier                  Space        Wakeup      Local Spinning   Network Txns
Centralized              O(1)         broadcast   no               O(p) or O(∞)
Software Combining Tree  O(p)         tree        no               O(p × fan-in) or O(∞)
Tournament               O(p log p)   tree        yes              O(p)
Dissemination            O(p log p)   none        yes              O(p log p)
New Tree-Based           O(p)         tree        yes              2p − 2
Results – Distributed Shared Memory

[Barrier performance graphs; the summary table repeats from the Summary slide]
Results – Broadcast-Based Cache-Coherent

[Barrier performance graphs; the summary table repeats from the Summary slide]
Results – Local vs. Remote Spinning

Barrier          Network latency (spinning locally)   Network latency (spinning remotely)
New Tree-Based   10% increase                         124% increase
Dissemination    18% increase                         117% increase
Barrier Decision Tree

Which multiprocessor?
    Distributed shared memory → Dissemination Barrier, or New Tree-Based Barrier (tree wakeup)
    Broadcast-based cache-coherent → Centralized Barrier, or New Tree-Based Barrier (central wakeup)
Architectural Recommendations

- No dance hall machines
- No need for complicated hardware synchronization
- Provide a full set of fetch_and_Φ primitives