1
RH Lock: A Scalable Hierarchical Spin Lock
Zoran Radovic and Erik Hagersten, {zoranr, eh}@it.uu.se
Uppsala University, Information Technology, Department of Computer Systems
Uppsala Architecture Research Team [UART]
2nd Annual Workshop on Memory Performance Issues (WMPI 2002), May 25, 2002, Anchorage, Alaska
2
Synchronization History
Spin-locks:
- test_and_set (TAS), e.g., IBM System/360, ’64
- Rudolph and Segall, ISCA ’84: test_and_test_and_set (TATAS)
- TATAS with exponential backoff (TATAS_EXP), ’90 – ’91
[Figure: processors P1 … Pn, each with a cache ($), share one memory. The lock word in memory toggles between FREE and BUSY; P3 holds the lock while the other processors busy-wait or back off.]
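The TATAS_EXP idea listed above can be sketched in C11 roughly as follows. This is a minimal illustration, not code from the talk: the names `tatas_lock`, `spin_delay`, `BACKOFF_BASE`, and `BACKOFF_CAP`, and the backoff constants themselves, are assumptions chosen for the example.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Illustrative backoff bounds (assumed values, not from the slides). */
enum { BACKOFF_BASE = 16, BACKOFF_CAP = 4096 };

typedef struct { atomic_int state; } tatas_lock;   /* 0 = FREE, 1 = BUSY */

/* Burn roughly n loop iterations without touching the lock word. */
static void spin_delay(unsigned n) {
    for (volatile unsigned i = 0; i < n; i++) { }
}

/* One TATAS attempt: a plain read first ("test"), and only if the lock
 * looks free an atomic exchange ("test_and_set"). */
static bool tatas_try_acquire(tatas_lock *l) {
    if (atomic_load_explicit(&l->state, memory_order_relaxed) != 0)
        return false;
    return atomic_exchange_explicit(&l->state, 1, memory_order_acquire) == 0;
}

/* TATAS_EXP: after every failed attempt, wait, then double the wait. */
static void tatas_exp_acquire(tatas_lock *l) {
    unsigned delay = BACKOFF_BASE;
    while (!tatas_try_acquire(l)) {
        spin_delay(delay);
        if (delay < BACKOFF_CAP)
            delay *= 2;
    }
}

static void tatas_release(tatas_lock *l) {
    atomic_store_explicit(&l->state, 0, memory_order_release);
}
```

A lock would be declared as `tatas_lock l = { 0 };`. The read before the exchange keeps waiting processors spinning in their own caches, and the growing delay thins out the burst of exchanges that follows each release.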
3
Performance, 12 years ago …
Traditional microbenchmark:
    for (i = 0; i < iterations; i++) {
        ACQUIRE(lock);
        // Null Critical Section (CS)
        RELEASE(lock);
    }
IF (more contention) THEN less efficient CS …
Thanks: Michael L. Scott
4
Making it Scalable: Queues …
- Spin on your predecessor’s flag
- First-come first-served order
Queue-based locks:
- QOLB/QOSB, ’89
- MCS, ’91
- CLH, ’93
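To make "spin on your predecessor's flag" concrete, here is a rough C11 sketch in the style of a CLH queue lock. It is an illustration under assumptions, not the code evaluated in the talk: the type and function names (`clh_node`, `clh_lock`, `clh_thread`, and so on) are hypothetical, and `malloc` failures are not handled.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdlib.h>

typedef struct clh_node { atomic_bool locked; } clh_node;

/* The lock is just a pointer to the most recently enqueued node. */
typedef struct { _Atomic(clh_node *) tail; } clh_lock;

/* Per-thread state: the node this thread enqueues and its predecessor. */
typedef struct { clh_node *mine; clh_node *pred; } clh_thread;

static void clh_init(clh_lock *l) {
    clh_node *dummy = malloc(sizeof *dummy);   /* unlocked dummy node */
    atomic_init(&dummy->locked, false);
    atomic_init(&l->tail, dummy);
}

static void clh_thread_init(clh_thread *t) {
    t->mine = malloc(sizeof *t->mine);
    atomic_init(&t->mine->locked, false);
    t->pred = NULL;
}

static void clh_acquire(clh_lock *l, clh_thread *t) {
    /* Announce intent, then enqueue by swapping our node into the tail. */
    atomic_store_explicit(&t->mine->locked, true, memory_order_relaxed);
    t->pred = atomic_exchange_explicit(&l->tail, t->mine, memory_order_acq_rel);
    /* Spin only on the predecessor's flag: first-come first-served. */
    while (atomic_load_explicit(&t->pred->locked, memory_order_acquire)) { }
}

static void clh_release(clh_thread *t) {
    /* Clearing our flag hands the lock to whoever enqueued behind us. */
    atomic_store_explicit(&t->mine->locked, false, memory_order_release);
    /* Recycle the predecessor's node for this thread's next acquire. */
    t->mine = t->pred;
}
```

Each waiter spins on a different memory location, so a release disturbs only one waiter. The price is the strict queue order, which cannot take node locality into account.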
5
Performance, May 2002
Traditional microbenchmark, Sun Enterprise E6000 SMP
[Figure: results plot; surviving label: 16]
6
Synchronization Today
- Commercial applications use spin-locks (!)
  - usually TATAS & TATAS_EXP with timeout for
    - recovery from transaction deadlock
    - recovery from preemption of the lock holder
- POSIX threads: pthread_mutex_lock, pthread_mutex_unlock
- HPC: runtime systems, OpenMP, …
7
Non-Uniform Memory Architecture (NUMA)
NUMA optimizations:
- Page migration
- Page replication
[Figure: two nodes, each with processors P1 … Pn, their caches ($), and a memory, connected through a switch; marked 1 inside a node and 2 – 10 across the switch.]
8
Non-Uniform Communication Architecture (NUCA)
NUCA examples (NUCA ratios):
- 1992: Stanford DASH (~ 4.5)
- 1996: Sequent NUMA-Q (~ 10)
- 1999: Sun WildFire (~ 6)
- 2000: Compaq DS-320 (~ 3.5)
- Future: CMP, SMT (~ 10)
[Figure: two nodes, each with processors P1 … Pn, their caches ($), and a memory, connected through a switch; the NUCA ratio is marked 1 inside a node and 2 – 10 across the switch.]
Our NUCA …
9
Our NUCA: Sun WildFire
- Two E6000 connected through a hardware-coherent interface with a raw bandwidth of 800 MB/s in each direction
- 16 UltraSPARC II (250 MHz) CPUs per node
- 8 GB memory
- NUCA ratio 6
10
Performance on our NUCA
[Figure: results plot; surviving label: 16]
11
Our Goals
- Demonstrate that the first-come first-served nature of queue-based locks is unwanted for NUCAs
  - new microbenchmark: “more realistic” behavior, and
  - real application study
- Design a scalable spin lock that exploits the NUCAs
  - creating a controlled unfairness (stable lock), and
  - reducing the traffic compared with the test&set locks
12
Outline
- History & Background
- NUMA vs. NUCA
- Experimentation Environment
- The RH Lock
- Performance Results
- Application Performance
- Conclusions
13
Key Ideas Behind RH Lock
- Minimizing global traffic at lock-handover
  - Only one thread per node will try to acquire a “remote” lock
- Maximizing node locality of NUCAs
  - Hand over the lock to a neighbor in the same node
  - Creates locality for the critical section (CS) data as well
  - Especially good for large CS and high contention
- RH lock in a nutshell: double TATAS_EXP, one node-local lock + one “global”
14
The RH Lock Algorithm
[Figure: two cabinets. Cabinet 1 holds P1 … P16 and its memory; Cabinet 2 holds P17 … P32 and its memory. Each memory holds entries labeled Lock1 and Lock2. In the animation the entries take the values FREE, L_FREE, REMOTE, or a thread ID (e.g., 2 for P2, 19 for P19).]
Acquire:
    SWAP(my_TID, Lock)
    if (FREE or L_FREE): You’ve got it!
    if “REMOTE”: spin remotely, CAS(FREE, REMOTE) until FREE (w/ exp backoff)
    else: TATAS(my_TID, Lock) until FREE or L_FREE
Release:
    CAS(my_TID, FREE), else L_FREE
IF (more contention) THEN more efficient CS
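The acquire and release steps on this slide can be read roughly as the C11 sketch below. It is one possible interpretation for a two-node machine, not the authors' implementation: the `rh_lock` layout with one copy per node, the numeric encodings of FREE, L_FREE, and REMOTE, the `FIRST_TID` convention, the explicit `node` and `my_tid` parameters, and the backoff constants are all assumptions made for the example. Fairness control is left out.

```c
#include <stdatomic.h>

/* Assumed encodings: reserved values first, thread IDs from FIRST_TID up. */
enum { FREE = 0, L_FREE = 1, REMOTE = 2, FIRST_TID = 3 };
enum { BACKOFF_BASE = 16, BACKOFF_CAP = 4096 };   /* illustrative values */

/* One copy of the lock word per node; copy[n] is meant to be allocated in
 * node n's local memory (placement is not handled in this sketch). */
typedef struct { atomic_ulong copy[2]; } rh_lock;

static void rh_init(rh_lock *l) {
    atomic_init(&l->copy[0], FREE);     /* the lock starts out in node 0 */
    atomic_init(&l->copy[1], REMOTE);   /* node 1 sees it as remote */
}

static void backoff(unsigned *delay) {
    for (volatile unsigned i = 0; i < *delay; i++) { }
    if (*delay < BACKOFF_CAP)
        *delay *= 2;
}

static void rh_acquire(rh_lock *l, int node, unsigned long my_tid) {
    atomic_ulong *local = &l->copy[node];
    atomic_ulong *remote = &l->copy[1 - node];
    unsigned delay = BACKOFF_BASE;
    for (;;) {
        unsigned long prev = atomic_exchange(local, my_tid);   /* SWAP */
        if (prev == FREE || prev == L_FREE)
            return;                       /* got it */
        if (prev == REMOTE) {
            /* This thread is its node's only remote spinner:
             * CAS(FREE, REMOTE) on the other node's copy, with backoff. */
            unsigned long expected = FREE;
            while (!atomic_compare_exchange_strong(remote, &expected, REMOTE)) {
                expected = FREE;
                backoff(&delay);
            }
            return;
        }
        /* A neighbor holds or awaits the lock: spin on the local copy
         * until it stops holding a thread ID, then retry the swap. */
        while (atomic_load(local) >= FIRST_TID)
            backoff(&delay);
    }
}

static void rh_release(rh_lock *l, int node, unsigned long my_tid) {
    unsigned long expected = my_tid;
    /* Nobody swapped in behind us: make the lock globally FREE.
     * Otherwise a neighbor is waiting: keep the lock in the node. */
    if (!atomic_compare_exchange_strong(&l->copy[node], &expected, FREE))
        atomic_store(&l->copy[node], L_FREE);
}
```

In this reading, exactly one node's copy holds REMOTE at any time, so at most one thread per node ever spins across the interconnect, while an L_FREE release can only be picked up by a thread in the same node.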
15
Performance Results
Traditional microbenchmark, 2-node Sun WildFire
16
Controlling Unfairness …
[Figure: Cabinet 1 with processors P1 … Pn, their caches ($), and memory holding Lock1 and Lock2; the lock word takes the values FREE, L_FREE, or a thread ID (TID).]
    void rh_acquire_slowpath(rh_lock *L)
    {
        ...
        if ((random() % FAIR_FACTOR) == 0)
            be_fare = TRUE;
        else
            be_fare = FALSE;
        ...
    }

    void rh_release(rh_lock *L)
    {
        if (be_fare)
            *L = FREE;
        else if (cas(L, my_tid, FREE) != my_tid)
            *L = L_FREE;
    }
17
Node-handoffs
Traditional microbenchmark, 2-node Sun WildFire
18
New Microbenchmark
    for (i = 0; i < iterations; i++) {
        ACQUIRE(lock);
        // Critical Section (CS) work
        RELEASE(lock);
        // Non-CS work STATIC part +
        // Non-CS work RANDOM part
    }
- More realistic node-handoffs for queue-based locks
- Constant number of processors
- The amount of Critical Section (CS) work can be increased
  - we can control the “amount of contention”
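A runnable per-thread version of this loop might look like the sketch below. It only illustrates the structure: the placeholder `spin_lock`, the `busy_work` delay loop, the xorshift generator, and every parameter name are assumptions, and the work units are arbitrary loop iterations rather than the quantities used in the talk.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Placeholder test-and-set lock; any lock under test could be swapped in. */
typedef struct { atomic_flag held; } spin_lock;

static void ACQUIRE(spin_lock *l) {
    while (atomic_flag_test_and_set_explicit(&l->held, memory_order_acquire)) { }
}

static void RELEASE(spin_lock *l) {
    atomic_flag_clear_explicit(&l->held, memory_order_release);
}

/* Stand-in for real work: a delay loop of `units` iterations. */
static void busy_work(unsigned long units) {
    for (volatile unsigned long i = 0; i < units; i++) { }
}

/* Small xorshift generator so each thread has private random state. */
static uint32_t next_random(uint32_t *state) {
    uint32_t x = *state;
    x ^= x << 13;
    x ^= x >> 17;
    x ^= x << 5;
    return *state = x;
}

/* Body run by every thread. Raising cs_work relative to the non-CS work
 * raises contention while the number of processors stays constant. */
static void run_thread(spin_lock *lock, unsigned long *shared_counter,
                       unsigned long iterations, unsigned long cs_work,
                       unsigned long static_work, unsigned long random_max,
                       uint32_t seed) {
    uint32_t state = seed ? seed : 1;          /* xorshift must not start at 0 */
    for (unsigned long i = 0; i < iterations; i++) {
        ACQUIRE(lock);
        busy_work(cs_work);                    /* Critical Section (CS) work */
        (*shared_counter)++;                   /* shared data touched in the CS */
        RELEASE(lock);
        busy_work(static_work);                /* Non-CS work, STATIC part */
        if (random_max != 0)                   /* Non-CS work, RANDOM part */
            busy_work(next_random(&state) % random_max);
    }
}
```

With a lock declared as `spin_lock l = { ATOMIC_FLAG_INIT };`, each thread calls `run_thread` with its own seed. The random part keeps threads from falling into lockstep, which is one plausible reason the slide describes the resulting node-handoffs as more realistic.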
19
Performance Results
New microbenchmark, 2-node Sun WildFire, 28 CPUs
[Figure: results plot; surviving labels: WF, 14]
20
Application Performance (1): Methodology
- The SPLASH-2 programs (14 apps)
- We study only applications with more than 10,000 acquire/release operations
  - Barnes, Cholesky, FMM, Radiosity, Raytrace, Volrend, and Water-Nsq
- Synchronization algorithms: TATAS, TATAS_EXP, MCS, CLH, and RH
- 2-node Sun WildFire

Program         Lock Acquires
Barnes                 69,193
Cholesky               74,284
FFT                        32
FMM                    80,528
LU-c & LU-nc               32
Ocean-c                 6,304
Ocean-nc                6,656
Radiosity             295,627
Radix                      32
Raytrace              366,450
Volrend                38,456
Water-Nsq             112,415
Water-Sp                  510
21
Application Performance (2)
[Figure: Raytrace Speedup (WF). x-axis: Number of Processors, 0 to 28; y-axis: Speedup, 0 to 8; curves: TATAS, TATAS_EXP, MCS, CLH, RH.]
22
Single-Processor Results
Traditional microbenchmark, null CS

Lock         Time
TATAS        97 ns
TATAS_EXP    97 ns
MCS          202 ns
CLH          137 ns
RH           121 ns

    for (i = 0; i < iterations; i++) {
        ACQUIRE(lock);
        RELEASE(lock);
    }
23
Performance Results
Traditional microbenchmark, single-node E6000
- Bind all threads to only one of the E6000 nodes
- As expected: RH lock ≈ TATAS_EXP
24
Conclusions
- First-come first-served is not desirable for NUCAs
- The RH lock exploits NUCAs by
  - creating locality through controlled unfairness (stable lock)
  - reducing traffic compared with the test&set locks
- The only lock that performs better under contention
- A critical section (CS) guarded by the RH lock takes less than half the time to execute compared with the same CS guarded by any other lock
- Raytrace on 30 CPUs: 1.83 – 5.70 “better”
- Works best for NUCA with a few large “nodes”
25
UART’s Home Page: http://www.it.uu.se/research/group/uart