1
NUCA Locks
Efficient Synchronization for Non-Uniform Communication Architecture
Zoran Radovic and Erik Hagersten
{zoran.radovic, erik.hagersten}@it.uu.se
Uppsala University, Department of Information Technology
Uppsala Architecture Research Team [UART]
2
NUCA Locks, Supercomputing 2002, Uppsala Architecture Research Team (UART)
Synchronization Basics
Locks are used to protect the shared critical section data.
    A := 0
    BARRIER
    LOCK(L)  A := A + 1  UNLOCK(L)
    LOCK(L)  B := A + 5  UNLOCK(L)
3
Simple Spin Locks
- test-and-test&set (TATAS), '84
- TATAS with exponential backoff (TATAS_EXP), '90
- Many variations
[Figure: processors P1 … Pn, each with a cache ($), share one memory that holds the lock word; while P3 holds the lock (BUSY), the other processors busy-wait or back off until it is FREE again]
    TATAS_LOCK(L) {
        if (tas(L)) {                // lock was BUSY on the first attempt
            do {
                while (*L) ;         // spin on plain reads until the lock looks FREE
            } while (tas(L));        // then retry the atomic test&set
        }
    }
    TATAS_UNLOCK(L) {
        *L = 0;                      // = FREE
    }
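The TATAS_EXP variant named above can be sketched in C11 as follows. This is a minimal illustration, not the authors' code: the names `tatas_exp_lock`, `tatas_try_lock`, `spin_delay` and the constants `BACKOFF_BASE` and `BACKOFF_CAP` are assumptions introduced here, and the backoff values are arbitrary.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

enum { FREE = 0, BUSY = 1 };
enum { BACKOFF_BASE = 16, BACKOFF_CAP = 1024 };   /* illustrative values only */

typedef atomic_int tatas_lock_t;

/* Atomic test&set: store BUSY and return the previous value. */
static int tas(tatas_lock_t *l) {
    return atomic_exchange_explicit(l, BUSY, memory_order_acquire);
}

/* Busy-wait for roughly n loop iterations; volatile keeps the loop alive. */
static void spin_delay(unsigned n) {
    for (volatile unsigned i = 0; i < n; i++) { }
}

/* One test&set attempt without waiting; true if the lock was acquired. */
bool tatas_try_lock(tatas_lock_t *l) {
    assert(l != NULL);
    return tas(l) == FREE;
}

void tatas_exp_lock(tatas_lock_t *l) {
    assert(l != NULL);
    unsigned backoff = BACKOFF_BASE;
    while (tas(l) != FREE) {
        /* Lock was busy: wait, double the delay up to the cap, and only
           retry the atomic operation once a plain load sees FREE. */
        do {
            spin_delay(backoff);
            if (backoff < BACKOFF_CAP) backoff *= 2;
        } while (atomic_load_explicit(l, memory_order_relaxed) != FREE);
    }
}

void tatas_unlock(tatas_lock_t *l) {
    assert(l != NULL);
    atomic_store_explicit(l, FREE, memory_order_release);
}
```

The plain-load inner loop keeps waiters spinning in their own caches, and the growing delay spreads out the retries after a release.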
4
Performance Under Contention
[Figure: CS cost vs. amount of contention for spin locks w/ backoff]
IF (more contention) THEN less efficient CS …
5
Making it Scalable: Queues …
- First-come, first-served order
  - Starvation avoidance
  - Maximal fairness
- Reduced traffic
Queue-based locks
- HW: QOLB '89
- SW: MCS '91
- SW: CLH '93
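As an illustration of the software queue locks listed above, here is a compact MCS-style lock in C11. It is a sketch under assumptions: the type and function names (`mcs_lock_t`, `mcs_node_t`, `mcs_acquire`, `mcs_release`) are chosen here, and sequentially consistent atomics are used throughout for readability rather than tuned memory orders.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Each waiting thread supplies its own queue node and spins only on it. */
typedef struct mcs_node {
    _Atomic(struct mcs_node *) next;
    atomic_int locked;
} mcs_node_t;

typedef struct {
    _Atomic(mcs_node_t *) tail;   /* last node in the queue, NULL if free */
} mcs_lock_t;

void mcs_acquire(mcs_lock_t *l, mcs_node_t *me) {
    assert(l != NULL && me != NULL);
    atomic_store(&me->next, NULL);
    atomic_store(&me->locked, 1);
    /* Append ourselves to the queue; the old tail is our predecessor. */
    mcs_node_t *pred = atomic_exchange(&l->tail, me);
    if (pred == NULL) return;                 /* queue was empty: lock is ours */
    atomic_store(&pred->next, me);            /* link in behind the predecessor */
    while (atomic_load(&me->locked)) { }      /* spin on our own flag only */
}

void mcs_release(mcs_lock_t *l, mcs_node_t *me) {
    assert(l != NULL && me != NULL);
    mcs_node_t *succ = atomic_load(&me->next);
    if (succ == NULL) {
        /* No visible successor: try to mark the lock free. */
        mcs_node_t *expected = me;
        if (atomic_compare_exchange_strong(&l->tail, &expected, NULL)) return;
        /* A successor is enqueueing; wait until it has linked itself in. */
        while ((succ = atomic_load(&me->next)) == NULL) { }
    }
    atomic_store(&succ->locked, 0);           /* hand over in FIFO order */
}
```

Because every waiter spins on a flag in its own node and the lock is passed strictly to the next node in the queue, handover is first-come, first-served and generates little shared traffic.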
6
Queue Locks Under Contention
[Figure: CS cost vs. amount of contention for spin locks w/ backoff and queue-based locks]
IF (more contention) THEN constant CS cost …
7
Non-Uniform Memory Architecture (NUMA)
Many NUMA optimizations are proposed
- Page migration
- Page replication
[Figure: two nodes, each with processors P1 … Pn, caches ($) and memory, connected by a switch; labeled 1 and 2 – 10]
8
Non-Uniform Communication Architecture (NUCA)
NUCA examples (NUCA ratios):
- 1992: Stanford DASH (~ 4.5)
- 1996: Sequent NUMA-Q (~ 10)
- 1999: Sun WildFire (~ 6)
- 2000: Compaq DS-320 (~ 3.5)
- Future: CMP, SMT (~ 10)
[Figure: NUCA ratio; two nodes, each with processors P1 … Pn, caches ($) and memory, connected by a switch; labeled 1 and 2 – 10]
Our NUCA …
9
Our Goals
Design a scalable spin lock that exploits the NUCAs
- Creating node affinity
  - For lock handover
  - For CS data
- “Stable lock”
- Reducing the traffic compared with the test&set locks
10
Outline
- Background & Motivation
- NUMA vs. NUCA
- The RH Lock
- Performance Results
- Application Study
- Conclusions
11
Key Ideas Behind RH Lock
- Minimizing global traffic at lock handover
  - Only one thread per node will try to acquire a remotely owned lock
- Maximizing node locality of NUCAs
  - Hand over the lock to a neighbor in the same node
  - Creates locality for the critical section (CS) data as well
  - Especially good for large CS and high contention
RH lock in a nutshell:
- Double TATAS_EXP: one node-local lock + one “global”
12
The RH Lock Algorithm
[Figure: Cabinet 1 (P1 … P16) and Cabinet 2 (P17 … P32), each with caches ($) and its own memory holding a copy of Lock1 and Lock2; a copy holds FREE, L_FREE, REMOTE, or the ID of a thread such as P2 (2) or P19 (19)]
Acquire: SWAP(my_TID, Lock)
    If (FREE or L_FREE): You've got it!
    if “REMOTE”: Spin remotely, CAS(FREE, REMOTE) until FREE (w/ exp backoff)
    else: TATAS(my_TID, Lock) until FREE or L_FREE
Release: CAS(my_TID, FREE), else L_FREE
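The acquire and release rules on this slide can be read as the following C11 sketch for two nodes. It is an interpretation of the slide, not the authors' implementation: the names `rh_lock_t`, `rh_init`, `rh_acquire`, `rh_release`, and `backoff_delay`, the numeric encoding of FREE, L_FREE, and REMOTE, and the backoff constants are all assumptions made here. One deliberate deviation is noted in the comments: a local waiter stops spinning whenever the word no longer holds a thread ID, so that it also notices a word that has turned REMOTE.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Lock-word states; thread IDs are assumed to be non-negative. */
enum { FREE = -1, L_FREE = -2, REMOTE = -3 };

/* One lock word per node. On a real NUCA each word would be allocated
   in its own node's memory; here they simply sit side by side. */
typedef struct { atomic_int word[2]; } rh_lock_t;

static void backoff_delay(unsigned *b) {
    for (volatile unsigned i = 0; i < *b; i++) { }
    if (*b < 4096) *b *= 2;                    /* illustrative cap */
}

void rh_init(rh_lock_t *l) {
    assert(l != NULL);
    atomic_init(&l->word[0], FREE);            /* lock starts out in node 0 */
    atomic_init(&l->word[1], REMOTE);
}

void rh_acquire(rh_lock_t *l, int node, int tid) {
    assert(l != NULL && tid >= 0 && (node == 0 || node == 1));
    atomic_int *mine = &l->word[node];
    atomic_int *other = &l->word[1 - node];
    unsigned b = 16;
    for (;;) {
        int old = atomic_exchange(mine, tid);  /* SWAP(my_TID, Lock) */
        if (old == FREE || old == L_FREE) return;
        if (old == REMOTE) {
            /* We represent this node: fetch the lock from the other node. */
            int expected = FREE;
            while (!atomic_compare_exchange_strong(other, &expected, REMOTE)) {
                expected = FREE;
                backoff_delay(&b);
            }
            return;
        }
        /* A neighbor holds or is fetching the lock: spin locally while the
           word contains a thread ID (assumption: this also exits on REMOTE). */
        do { backoff_delay(&b); } while (atomic_load(mine) >= 0);
    }
}

void rh_release(rh_lock_t *l, int node, int tid) {
    assert(l != NULL && (node == 0 || node == 1));
    int expected = tid;
    /* Our ID is still in the word: nobody in this node waits, free globally. */
    if (atomic_compare_exchange_strong(&l->word[node], &expected, FREE)) return;
    /* A neighbor swapped in its ID: hand the lock over inside the node. */
    atomic_store(&l->word[node], L_FREE);
}
```

As written, the sketch has no fairness mechanism, so a busy node can keep handing the lock to its own neighbors for as long as they keep asking.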
13
Our NUCA: Sun WildFire
[Figure: NUCA ratio; two WildFire (WF) nodes, each with processors P1 … Pn, caches ($) and memory, connected by a switch; labeled 1, 6, and 14]
14
NUCA-performance 14
15
New Microbenchmark
    for (i = 0; i < iterations; i++) {
        LOCK(L);
        delay(critical_work);   // CS
        UNLOCK(L);
        static_delay();
        random_delay();
    }
- More realistic node handoffs for queue-based locks
- Constant number of processors
- Amount of critical section (CS) work can be increased, so we can control the “amount of contention”
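A per-thread body for this microbenchmark might look like the C11 sketch below. Everything beyond the loop structure on the slide is an assumption: the `bench_args_t` fields, the helper names, the simple test&set stand-in lock (to be replaced by the lock under test), the shared counter, and the small linear congruential generator used for the random delay.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

typedef struct {
    atomic_flag *lock;          /* stand-in lock; swap in the lock under test */
    long *shared_counter;       /* data touched inside the critical section */
    int iterations;
    unsigned critical_work;     /* delay-loop length inside the CS */
    unsigned static_delay;      /* fixed non-critical work */
    unsigned max_random_delay;  /* upper bound for random non-critical work */
    unsigned seed;              /* per-thread random state */
} bench_args_t;

static void delay(unsigned n) {
    for (volatile unsigned i = 0; i < n; i++) { }
}

/* Small linear congruential generator, private to each thread. */
static unsigned next_random(unsigned *state) {
    *state = *state * 1664525u + 1013904223u;
    return *state >> 16;
}

/* Signature matches a thread start routine so one copy can run per CPU. */
void *bench_worker(void *p) {
    bench_args_t *a = p;
    assert(a != NULL && a->lock != NULL && a->shared_counter != NULL);
    for (int i = 0; i < a->iterations; i++) {
        while (atomic_flag_test_and_set_explicit(a->lock, memory_order_acquire)) { }
        (*a->shared_counter)++;                 /* CS data */
        delay(a->critical_work);                /* CS work */
        atomic_flag_clear_explicit(a->lock, memory_order_release);
        delay(a->static_delay);
        delay(next_random(&a->seed) % (a->max_random_delay + 1));
    }
    return NULL;
}
```

Raising `critical_work` while keeping the thread count fixed increases the share of time spent inside the critical section, which is how the slide's “amount of contention” knob would be turned.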
16
Performance Results
New microbenchmark, 2-node Sun WildFire, 28 CPUs
17
Traffic Measurements
New microbenchmark; critical_work = 1500
18
Application Performance
Raytrace Speedup (WF)
19
Application Performance
Raytrace Speedup (WF)
20
RH Lock Under Contention
[Figure: CS cost vs. amount of contention for spin locks w/ backoff, queue-based locks, and the RH lock]
21
Total Traffic: Raytrace
22
Application Performance
28-processor runs
23
Conclusions
- First-come, first-served is not desirable for NUCAs
- The RH lock exploits NUCAs by
  - creating locality through CS affinity (stable lock)
  - reducing traffic compared with the test&set locks
- The first lock that performs better under contention
- Global traffic is significantly reduced
- Applications with contended locks scale better with RH locks on NUCAs
24
Any Drawbacks?
- Proof-of-concept NUCA-aware lock for 2 nodes
- Hard to port to some architectures
  - Memory needs to be allocated/placed in different nodes
- Lock storage is proportional to #NUCA nodes
- Sensitive to starvation
  - “Non-uniform nature” of the algorithm
  - No mechanism for lowering the risk of starvation
25
Can We Fix It?
We propose a new set of NUCA-aware locks
- Hierarchical Backoff Locks (HBO)
- HPCA-9: Anaheim, California, February 2003
Teaser …
- Portable
- Scalable to many NUCA nodes
- Only cas atomic operations are used
- Only node_id is needed
- Lowers the risk of starvation
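The slide only teases the HBO locks, so the following C11 sketch is one plausible reading of the bullets (a cas-only acquire that needs nothing but the caller's node_id), not the published algorithm. The names `hbo_lock_t`, `hbo_acquire`, `hbo_release`, the `HBO_FREE` encoding, and all backoff constants are assumptions made here; the starvation-lowering mechanism mentioned on the slide is not modeled.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

enum { HBO_FREE = -1 };   /* node IDs are assumed to be non-negative */
enum { BACKOFF_LOCAL = 16, BACKOFF_REMOTE = 256, BACKOFF_CAP = 8192 };  /* illustrative */

/* The lock word holds HBO_FREE or the node ID of the current holder. */
typedef atomic_int hbo_lock_t;

static void spin_delay(unsigned n) {
    for (volatile unsigned i = 0; i < n; i++) { }
}

void hbo_acquire(hbo_lock_t *l, int my_node) {
    assert(l != NULL && my_node >= 0);
    unsigned b = 0;
    for (;;) {
        int holder = HBO_FREE;
        /* cas(FREE -> my_node); on failure, holder receives the owner's node. */
        if (atomic_compare_exchange_strong(l, &holder, my_node)) return;
        /* Back off briefly if the holder is a neighbor and for longer if it
           sits in another node, so neighbors tend to win the next handover. */
        unsigned base = (holder == my_node) ? BACKOFF_LOCAL : BACKOFF_REMOTE;
        if (b < base) b = base;
        spin_delay(b);
        if (b < BACKOFF_CAP) b *= 2;
    }
}

void hbo_release(hbo_lock_t *l) {
    assert(l != NULL);
    atomic_store(l, HBO_FREE);
}
```

Compared with the two-word RH sketch, a single lock word with a node ID needs no per-node memory placement, which matches the portability and scalability bullets above.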
26
UART's Home Page: http://www.it.uu.se/research/group/uart