
1 NUCA Locks: Efficient Synchronization for Non-Uniform Communication Architecture
Zoran Radovic and Erik Hagersten, {zoran.radovic, erik.hagersten}@it.uu.se
Uppsala University, Department of Information Technology, Uppsala Architecture Research Team [UART]

2 NUCA Locks, Supercomputing 2002, Uppsala Architecture Research Team (UART)
Synchronization Basics
• Locks are used to protect shared critical-section data
    A := 0
    BARRIER
    LOCK(L)  A := A + 1  UNLOCK(L)
    LOCK(L)  B := A + 5  UNLOCK(L)
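The two critical sections on this slide can be replayed in C. This is only a sketch: it assumes the two LOCK/UNLOCK pairs belong to different threads, and the names `section_one`, `section_two`, and `b_after` are made up for illustration. The lock guarantees that each update is atomic, but not which section runs first, so B may end up as 5 or 6.

```c
#include <assert.h>
#include <stdatomic.h>

static atomic_flag L = ATOMIC_FLAG_INIT;  /* the lock L from the slide */
static int A, B;                          /* shared data protected by L */

static void LOCK(atomic_flag *l)   { while (atomic_flag_test_and_set(l)) { /* busy-wait */ } }
static void UNLOCK(atomic_flag *l) { atomic_flag_clear(l); }

/* The two critical sections shown on the slide. */
void section_one(void) { LOCK(&L); A = A + 1; UNLOCK(&L); }
void section_two(void) { LOCK(&L); B = A + 5; UNLOCK(&L); }

/* Replays one of the two possible orders after "A := 0; BARRIER" and returns B. */
int b_after(int section_two_first) {
    A = 0; B = 0;
    if (section_two_first) { section_two(); section_one(); }
    else                   { section_one(); section_two(); }
    assert(A == 1);        /* the increment is never lost */
    return B;
}
```

Calling `b_after(0)` models the order in which the increment wins the lock first, and `b_after(1)` models the opposite order.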

3 Simple Spin Locks
• test-and-test&set (TATAS), '84
• TATAS with exponential backoff (TATAS_EXP), '90
• Many variations
[Figure: processors P1 … Pn with caches ($) and a shared memory holding the lock word, which toggles between FREE and BUSY while waiting processors busy-wait or back off]
    TATAS_LOCK(L) {
      while (tas(L)) {      // test&set failed: the lock was BUSY
        while (*L) ;        // spin on plain reads until it looks FREE
      }
    }
    TATAS_UNLOCK(L) {
      *L = 0;               // = FREE
    }
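A minimal C11 sketch of TATAS_EXP, assuming `<stdatomic.h>` is available. The backoff constants and the helper `tatas_try_lock` are illustrative choices, not values or names taken from the slides.

```c
#include <assert.h>
#include <stdatomic.h>

enum { FREE = 0, BUSY = 1 };
enum { BACKOFF_BASE = 16, BACKOFF_CAP = 4096 };  /* assumed values, for illustration only */

typedef atomic_int tatas_lock_t;

/* test&set: atomically write BUSY and return the previous value. */
static int tas(tatas_lock_t *l) { return atomic_exchange(l, BUSY); }

void tatas_exp_lock(tatas_lock_t *l) {
    unsigned b = BACKOFF_BASE;
    while (tas(l) == BUSY) {                 /* lost the race for the lock */
        do {
            for (volatile unsigned i = 0; i < b; i++) { /* delay */ }
            if (b < BACKOFF_CAP) b *= 2;     /* exponential backoff, capped */
        } while (atomic_load(l) == BUSY);    /* "test": read-only spin */
    }
}

void tatas_exp_unlock(tatas_lock_t *l) {
    assert(atomic_load(l) == BUSY);          /* only a held lock can be released */
    atomic_store(l, FREE);
}

/* Single non-blocking attempt; returns 1 if the lock was acquired. */
int tatas_try_lock(tatas_lock_t *l) { return tas(l) == FREE; }
```

The read-only inner loop keeps waiting processors spinning in their own caches, so the expensive atomic write is retried only once the lock has been observed FREE.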

4 Performance Under Contention
[Figure: CS cost versus amount of contention for spin locks w/ backoff]
IF (more contention) THEN less efficient CS …

5 Making it Scalable: Queues …
• First-come, first-served order
• Starvation avoidance
• Maximal fairness
• Reduced traffic
• Queue-based locks
  • HW: QOLB '89
  • SW: MCS '91
  • SW: CLH '93
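As an illustration of the software queue locks listed above, here is a compact sketch in the style of the MCS lock, written with C11 atomics. The type and function names are assumptions made for this example. Each waiter spins on a flag in its own queue node, which is where the first-come, first-served order and the reduced traffic come from.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct mcs_node {
    _Atomic(struct mcs_node *) next;   /* successor in the queue, if any */
    atomic_bool locked;                /* true while this waiter must spin */
} mcs_node_t;

typedef struct { _Atomic(mcs_node_t *) tail; } mcs_lock_t;

void mcs_init(mcs_lock_t *l) { atomic_init(&l->tail, NULL); }

void mcs_acquire(mcs_lock_t *l, mcs_node_t *me) {
    atomic_store(&me->next, NULL);
    atomic_store(&me->locked, true);
    mcs_node_t *pred = atomic_exchange(&l->tail, me);  /* join the queue */
    if (pred != NULL) {
        atomic_store(&pred->next, me);                 /* link behind predecessor */
        while (atomic_load(&me->locked)) { /* spin on our own flag only */ }
    }
}

void mcs_release(mcs_lock_t *l, mcs_node_t *me) {
    assert(atomic_load(&l->tail) != NULL);             /* the holder is always queued */
    mcs_node_t *succ = atomic_load(&me->next);
    if (succ == NULL) {
        mcs_node_t *expected = me;
        if (atomic_compare_exchange_strong(&l->tail, &expected, NULL))
            return;                                    /* nobody was waiting */
        /* A successor swapped the tail but has not linked itself yet. */
        while ((succ = atomic_load(&me->next)) == NULL) { }
    }
    atomic_store(&succ->locked, false);                /* hand the lock over in FIFO order */
}
```

Each thread passes its own `mcs_node_t` to both calls; the node must stay alive until the matching `mcs_release` returns.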

6 Queue Locks Under Contention
[Figure: CS cost versus amount of contention for spin locks w/ backoff and queue-based locks]
IF (more contention) THEN constant CS cost …

7 Non-Uniform Memory Architecture (NUMA)
• Many NUMA optimizations have been proposed
  • Page migration
  • Page replication
[Figure: two nodes, each with processors P1 … Pn, caches ($), and memory, connected by a switch; labels: 1 and 2 – 10]

8 Non-Uniform Communication Architecture (NUCA)
• NUCA examples (NUCA ratios):
  • 1992: Stanford DASH (~ 4.5)
  • 1996: Sequent NUMA-Q (~ 10)
  • 1999: Sun WildFire (~ 6)
  • 2000: Compaq DS-320 (~ 3.5)
  • Future: CMP, SMT (~ 10)
[Figure: two nodes, each with processors P1 … Pn, caches ($), and memory, connected by a switch; NUCA ratio labels: 1 and 2 – 10]
Our NUCA …

9 Our Goals
• Design a scalable spin lock that exploits NUCAs
• Creating node affinity
  • For lock handover
  • For CS data
• “Stable lock”
• Reducing traffic compared with the test&set locks

10 Outline
• Background & Motivation
• NUMA vs. NUCA
• The RH Lock
• Performance Results
• Application Study
• Conclusions

11 Key Ideas Behind RH Lock
• Minimizing global traffic at lock handover
  • Only one thread per node will try to acquire a remotely owned lock
• Maximizing node locality of NUCAs
  • Hand over the lock to a neighbor in the same node
  • Creates locality for the critical section (CS) data as well
  • Especially good for large CS and high contention
• RH lock in a nutshell:
  • Double TATAS_EXP: one node-local lock + one “global”

12 The RH Lock Algorithm
[Figure: Cabinet 1 (P1 … P16 with caches and memory) and Cabinet 2 (P17 … P32 with caches and memory); each cabinet holds its own copy of Lock1 and Lock2, whose values are FREE, L_FREE, REMOTE, or a thread ID; the animation follows P2 and P19 acquiring the lock and entering the CS]
    Acquire: SWAP(my_TID, Lock)
      If (FREE or L_FREE): You’ve got it!
      if “REMOTE”: Spin remotely, CAS(FREE, REMOTE) until FREE (w/ exp backoff)
      else: TATAS(my_TID, Lock) until FREE or L_FREE
    Release: CAS(my_TID, FREE) else L_FREE
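The acquire and release steps on this slide might be sketched in C11 as follows, for exactly two nodes. This is a simplified reading of the slide, not the authors' implementation: the type `rh_lock_t`, the numeric encoding of FREE, L_FREE, and REMOTE, the backoff bounds, and the decision to leave the local spin as soon as the word is no longer a thread ID are all assumptions. In a real system each copy of the lock word would be placed in its own node's memory.

```c
#include <assert.h>
#include <stdatomic.h>

/* Assumed encoding: three reserved values, then thread IDs. */
enum { FREE = 0, L_FREE = 1, REMOTE = 2, FIRST_TID = 3 };

typedef struct { atomic_uint copy[2]; } rh_lock_t;   /* one lock word per node */

/* Initially node 0 owns the lock; node 1 sees it as remotely owned. */
void rh_init(rh_lock_t *l) {
    atomic_init(&l->copy[0], FREE);
    atomic_init(&l->copy[1], REMOTE);
}

static void backoff(unsigned *b, unsigned cap) {
    for (volatile unsigned i = 0; i < *b; i++) { /* delay */ }
    if (*b < cap) *b *= 2;
}

void rh_acquire(rh_lock_t *l, int node, unsigned tid) {
    assert((node == 0 || node == 1) && tid >= FIRST_TID);
    atomic_uint *local = &l->copy[node], *remote = &l->copy[1 - node];
    unsigned old = atomic_exchange(local, tid);       /* SWAP(my_TID, Lock) */
    for (;;) {
        if (old == FREE || old == L_FREE) return;     /* you've got it */
        if (old == REMOTE) {
            /* We are this node's only remote spinner. */
            unsigned b = 256, expected = FREE;
            while (!atomic_compare_exchange_strong(remote, &expected, REMOTE)) {
                expected = FREE;
                backoff(&b, 65536);
            }
            return;
        }
        /* A neighbor holds or is fetching the lock: spin in our own node. */
        unsigned b = 16;
        while (atomic_load(local) >= FIRST_TID) backoff(&b, 1024);
        old = atomic_exchange(local, tid);
    }
}

void rh_release(rh_lock_t *l, int node, unsigned tid) {
    unsigned expected = tid;
    /* No neighbor announced itself: release globally (FREE). */
    if (!atomic_compare_exchange_strong(&l->copy[node], &expected, FREE))
        atomic_store(&l->copy[node], L_FREE);         /* otherwise hand over locally */
}
```

The sketch contains no starvation control, which matches the drawback the deck itself lists later: a node with steady local demand can keep handing the lock to its own neighbors.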

13 Our NUCA: Sun WildFire
[Figure: two-node Sun WildFire (WF), each node with processors P1 … Pn, caches ($), and memory, connected by a switch; NUCA ratio labels: 1 and 6; further label: 14]

14 NUCA-performance
[Figure: chart not preserved in the transcript; label: 14]

15 New Microbenchmark
    for (i = 0; i < iterations; i++) {
      LOCK(L);
      delay(critical_work);   // CS
      UNLOCK(L);
      static_delay();
      random_delay();
    }
• More realistic node handoffs for queue-based locks
• Constant number of processors
• Amount of Critical Section (CS) work can be increased, so we can control the “amount of contention”
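The benchmark loop could be made concrete along these lines. This is a sketch under stated assumptions: the `bench_t` type, the parameter names, the simple test&set stand-in lock, and the small linear congruential generator used for `random_delay` are placeholders, not details from the slides.

```c
#include <assert.h>
#include <stdatomic.h>

typedef struct {
    atomic_flag lock;        /* stand-in lock; replace with the lock under test */
    long shared_counter;     /* data touched inside the critical section */
} bench_t;

void bench_init(bench_t *b) { atomic_flag_clear(&b->lock); b->shared_counter = 0; }

/* Busy-wait for roughly n loop iterations. */
static void delay(unsigned n) { for (volatile unsigned i = 0; i < n; i++) { } }

/* Tiny pseudo-random generator (assumed; any per-thread generator works). */
static unsigned next_random(unsigned *seed) {
    *seed = *seed * 1103515245u + 12345u;
    return (*seed >> 16) & 0x7fffu;
}

/* One thread's share of the loop shown on the slide. */
void bench_worker(bench_t *b, int iterations, unsigned critical_work,
                  unsigned static_work, unsigned max_random_work, unsigned *seed) {
    assert(iterations >= 0);
    for (int i = 0; i < iterations; i++) {
        while (atomic_flag_test_and_set(&b->lock)) { }         /* LOCK(L) */
        b->shared_counter++;
        delay(critical_work);                                  /* CS */
        atomic_flag_clear(&b->lock);                           /* UNLOCK(L) */
        delay(static_work);                                    /* static_delay() */
        delay(next_random(seed) % (max_random_work + 1));      /* random_delay() */
    }
}
```

Raising `critical_work` relative to the two delays outside the lock makes threads overlap in the critical section more often, which is how the amount of contention is controlled while the number of processors stays constant.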

16 Performance Results
New microbenchmark, 2-node Sun WildFire, 28 CPUs
[Figure: chart not preserved in the transcript; labels: WF, 14]

17 Traffic Measurements
New microbenchmark; critical_work = 1500

18 Application Performance
[Figure: Raytrace speedup (WF)]

19 Application Performance
[Figure: Raytrace speedup (WF)]

20 RH Lock Under Contention
[Figure: CS cost versus amount of contention for spin locks w/ backoff, queue-based locks, and the RH lock]

21 Total Traffic: Raytrace

22 Application Performance
28-processor runs

23 Conclusions
• First-come, first-served is not desirable for NUCAs
• The RH lock exploits NUCAs by
  • creating locality through CS affinity (stable lock)
  • reducing traffic compared with the test&set locks
• The first lock that performs better under contention
• Global traffic is significantly reduced
• Applications with contended locks scale better with RH locks on NUCAs

24 Any Drawbacks?
• Proof-of-concept NUCA-aware lock for 2 nodes
• Hard to port to some architectures
  • Memory needs to be allocated/placed in different nodes
  • Lock storage is proportional to #NUCA nodes
• Sensitive to starvation
  • “Non-uniform nature” of the algorithm
  • No mechanism for lowering the risk of starvation

25 Can We Fix It?
• We propose a new set of NUCA-aware locks
  • Hierarchical Backoff Locks (HBO)
  • HPCA-9: Anaheim, California, February 2003
• Teaser …
  • Portable
  • Scalable to many NUCA nodes
  • Only cas atomic operations are used
  • Only node_id is needed
  • Lowers the risk of starvation
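The slides only tease HBO, so the following is a guess at the general shape rather than the published algorithm. It assumes the lock word stores the holder's node ID, uses cas as the only atomic read-modify-write, and backs off less when the holder is in the caller's own node. All names and backoff bounds are illustrative.

```c
#include <assert.h>
#include <stdatomic.h>

enum { HBO_FREE = 0 };   /* lock word: 0 when free, else holder's node_id + 1 */
enum { LOCAL_BASE = 16,  LOCAL_CAP = 1024,      /* assumed backoff bounds */
       REMOTE_BASE = 256, REMOTE_CAP = 65536 };

typedef atomic_int hbo_lock_t;

/* cas: try expected -> desired and return the value found in the lock word. */
static int cas(hbo_lock_t *l, int expected, int desired) {
    atomic_compare_exchange_strong(l, &expected, desired);
    return expected;     /* unchanged on success, observed value on failure */
}

void hbo_acquire(hbo_lock_t *l, int node_id) {
    assert(node_id >= 0);
    int tag = node_id + 1;
    unsigned local_b = LOCAL_BASE, remote_b = REMOTE_BASE;
    for (;;) {
        int seen = cas(l, HBO_FREE, tag);
        if (seen == HBO_FREE) return;                     /* acquired */
        /* Holder in our node: retry soon. Holder elsewhere: back off longer. */
        unsigned *b  = (seen == tag) ? &local_b : &remote_b;
        unsigned cap = (seen == tag) ? LOCAL_CAP : REMOTE_CAP;
        for (volatile unsigned i = 0; i < *b; i++) { /* delay */ }
        if (*b < cap) *b *= 2;
    }
}

void hbo_release(hbo_lock_t *l) { atomic_store(l, HBO_FREE); }
```

Because a waiter needs nothing beyond its own `node_id` and a single lock word, a design like this would not depend on per-node lock copies, which fits the "portable" and "scalable to many NUCA nodes" bullets.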

26 UART’s Home Page: http://www.it.uu.se/research/group/uart

