RH Lock: A Scalable Hierarchical Spin Lock

1 RH Lock: A Scalable Hierarchical Spin Lock
Zoran Radovic and Erik Hagersten, {zoranr, eh}@it.uu.se
Uppsala University, Information Technology, Department of Computer Systems, Uppsala Architecture Research Team (UART)
2nd Annual Workshop on Memory Performance Issues (WMPI 2002), May 25, 2002, Anchorage, Alaska

2 Synchronization History
- Spin-locks
  - test_and_set (TAS), e.g., IBM System/360, ’64
  - test_and_test_and_set (TATAS): Rudolph and Segall, ISCA ’84
  - TATAS with exponential backoff (TATAS_EXP), ’90 – ’91
[Figure: processors P1 … Pn, each with a cache ($), share one memory that holds the lock word; the lock alternates between FREE and BUSY, and while P3 holds it the other processors busy-wait or back off.]
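The step from TAS to TATAS to TATAS_EXP can be sketched in C11 atomics as follows. This is a minimal illustration, not code from the talk: the type `tatas_lock`, the function names, and the constants `BACKOFF_BASE` and `BACKOFF_CAP` are assumptions chosen for the example.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical tuning constants; the slides do not give values. */
enum { BACKOFF_BASE = 16, BACKOFF_CAP = 4096 };

/* false = FREE, true = BUSY */
typedef struct { atomic_bool busy; } tatas_lock;

void tatas_init(tatas_lock *l) { atomic_init(&l->busy, false); }

/* Burn roughly n loop iterations without touching shared memory. */
void spin_delay(unsigned n) {
    for (volatile unsigned i = 0; i < n; i++) { }
}

/* One test_and_set attempt: true if the lock was FREE and is now ours. */
bool tatas_try_acquire(tatas_lock *l) {
    return !atomic_exchange_explicit(&l->busy, true, memory_order_acquire);
}

void tatas_exp_acquire(tatas_lock *l) {
    unsigned delay = BACKOFF_BASE;
    for (;;) {
        /* "test": spin on a plain load, which can hit in the local cache. */
        while (atomic_load_explicit(&l->busy, memory_order_relaxed)) {
            spin_delay(delay);
            if (delay < BACKOFF_CAP) delay *= 2;   /* exponential backoff */
        }
        /* "test_and_set": only now issue the atomic read-modify-write. */
        if (tatas_try_acquire(l)) return;
    }
}

void tatas_release(tatas_lock *l) {
    assert(atomic_load_explicit(&l->busy, memory_order_relaxed)); /* must be held */
    atomic_store_explicit(&l->busy, false, memory_order_release);
}
```

Dropping the inner load loop gives plain TAS, and dropping the `spin_delay` call gives plain TATAS.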

3 Performance, 12 years ago …
Traditional microbenchmark:
    for (i = 0; i < iterations; i++) {
        ACQUIRE(lock);
        // Null Critical Section (CS)
        RELEASE(lock);
    }
Thanks: Michael L. Scott
IF (more contention) THEN less efficient CS …

4 Making it Scalable: Queues …
- Spin on your predecessor’s flag
- First-come first-served order
- Queue-based locks
  - QOLB/QOSB ’89
  - MCS ’91
  - CLH ’93
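"Spin on your predecessor's flag" is the core of a CLH-style queue lock, which can be sketched in C11 as below. This is a textbook-style illustration rather than the code measured in the talk; the per-thread handle `clh_thread` and all function names are assumptions made for the example.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdlib.h>

typedef struct clh_node { atomic_bool locked; } clh_node;

/* The lock is just a pointer to the most recently enqueued node. */
typedef struct { _Atomic(clh_node *) tail; } clh_lock;

/* Per-thread handle: the node we enqueue and the predecessor we spin on. */
typedef struct { clh_node *mine; clh_node *pred; } clh_thread;

void clh_init(clh_lock *l) {
    clh_node *dummy = malloc(sizeof *dummy);
    assert(dummy != NULL);
    atomic_init(&dummy->locked, false);   /* dummy predecessor: lock starts free */
    atomic_init(&l->tail, dummy);
}

void clh_thread_init(clh_thread *t) {
    t->mine = malloc(sizeof *t->mine);
    assert(t->mine != NULL);
    atomic_init(&t->mine->locked, false);
    t->pred = NULL;
}

void clh_acquire(clh_lock *l, clh_thread *t) {
    atomic_store_explicit(&t->mine->locked, true, memory_order_relaxed);
    /* Join the queue; the order of these swaps is the first-come first-served order. */
    t->pred = atomic_exchange_explicit(&l->tail, t->mine, memory_order_acq_rel);
    /* Spin on the predecessor's flag only. */
    while (atomic_load_explicit(&t->pred->locked, memory_order_acquire)) { }
}

void clh_release(clh_thread *t) {
    clh_node *pred = t->pred;
    atomic_store_explicit(&t->mine->locked, false, memory_order_release);
    t->mine = pred;   /* recycle the predecessor's node for our next acquire */
    t->pred = NULL;
}
```

Each waiter spins on a different location, so a release disturbs only one waiter, but the lock is always handed to the next thread in arrival order, wherever that thread runs.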

5 Performance, May 2002
Traditional microbenchmark, Sun Enterprise E6000 SMP
[Figure: performance plot; only the label "16" survives in the transcript.]

6 Synchronization Today
- Commercial applications use spin-locks (!)
  - usually TATAS & TATAS_EXP, with timeout for
    - recovery from transaction deadlock
    - recovery from preemption of the lock holder
- POSIX threads: pthread_mutex_lock, pthread_mutex_unlock
- HPC: runtime systems, OpenMP, …

7 Non-Uniform Memory Architecture (NUMA)
- NUMA optimizations
  - Page migration
  - Page replication
[Figure: two nodes, each with processors P1 … Pn, caches ($), and memory, connected through a switch; the figure carries the labels 1 (within a node) and 2 – 10 (across the switch).]

8 Non-Uniform Communication Architecture (NUCA)
- NUCA examples (NUCA ratios):
  - 1992: Stanford DASH (~ 4.5)
  - 1996: Sequent NUMA-Q (~ 10)
  - 1999: Sun WildFire (~ 6), our NUCA
  - 2000: Compaq DS-320 (~ 3.5)
  - Future: CMP, SMT (~ 10)
[Figure: the two-node diagram from the NUMA slide, with the labels 1 and 2 – 10 marked as the NUCA ratio.]

9 Our NUCA: Sun WildFire
- Two E6000 connected through a hardware-coherent interface with a raw bandwidth of 800 MB/s in each direction
- 16 UltraSPARC II (250 MHz) CPUs per node
- 8 GB memory
- NUCA ratio ~ 6

10 Performance on our NUCA
[Figure: performance plot; only the label "16" survives in the transcript.]

11 Our Goals
- Demonstrate that the first-come first-served nature of queue-based locks is unwanted for NUCAs
  - new microbenchmark: “more realistic” behavior, and
  - real application study
- Design a scalable spin lock that exploits the NUCAs
  - creating a controlled unfairness (stable lock), and
  - reducing the traffic compared with the test&set locks

12 Outline
- History & Background
- NUMA vs. NUCA
- Experimentation Environment
- The RH Lock
- Performance Results
- Application Performance
- Conclusions

13 Key Ideas Behind RH Lock
- Minimizing global traffic at lock-handover
  - Only one thread per node will try to acquire a “remote” lock
- Maximizing node locality of NUCAs
  - Hand over the lock to a neighbor in the same node
  - Creates locality for the critical section (CS) data as well
  - Especially good for large CS and high contention
- RH lock in a nutshell:
  - Double TATAS_EXP: one node-local lock + one “global”

14 The RH Lock Algorithm
[Figure: Cabinet 1 (P1 … P16) and Cabinet 2 (P17 … P32) each keep a copy of the lock words Lock1 and Lock2 in their own memory; a copy holds FREE, L_FREE, REMOTE, or a thread ID (for example that of P2 or P19).]
Acquire:
    SWAP(my_TID, Lock)
    if (FREE or L_FREE): you’ve got it!
    else: TATAS(my_TID, Lock) until FREE or L_FREE
    if “REMOTE”: spin remotely, CAS(FREE, REMOTE) until FREE (w/ exp backoff)
Release:
    CAS(my_TID, FREE), else L_FREE
IF (more contention) THEN more efficient CS
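The acquire and release steps on this slide can be sketched in C11 for a two-node machine as follows. This is a simplified reading of the slide, not the authors' implementation: the encoding of FREE, L_FREE, and REMOTE as small integers, the `FIRST_TID` bound, the backoff constants, and the explicit `node` and `my_tid` parameters are all assumptions for illustration, local spinning is shown without backoff, and the fairness mechanism is left out.

```c
#include <assert.h>
#include <stdatomic.h>

/* Reserved lock-word values (assumed encoding); thread IDs must be >= FIRST_TID. */
enum { FREE = 0, L_FREE = 1, REMOTE = 2, FIRST_TID = 3 };
enum { BACKOFF_BASE = 16, BACKOFF_CAP = 4096 };   /* hypothetical values */

/* One copy of the lock word per node (cabinet); two nodes as on the slide. */
typedef struct { atomic_ulong copy[2]; } rh_lock;

void rh_init(rh_lock *l) {
    atomic_init(&l->copy[0], FREE);     /* the lock starts out in node 0 */
    atomic_init(&l->copy[1], REMOTE);
}

void spin_delay(unsigned n) { for (volatile unsigned i = 0; i < n; i++) { } }

/* Node winner: pull the lock over from the other node with CAS(FREE, REMOTE). */
void rh_spin_remotely(rh_lock *l, int node) {
    unsigned delay = BACKOFF_BASE;
    for (;;) {
        unsigned long expected = FREE;
        if (atomic_compare_exchange_strong(&l->copy[1 - node], &expected, REMOTE))
            return;
        spin_delay(delay);
        if (delay < BACKOFF_CAP) delay *= 2;
    }
}

void rh_acquire(rh_lock *l, int node, unsigned long my_tid) {
    assert(my_tid >= FIRST_TID && (node == 0 || node == 1));
    atomic_ulong *local = &l->copy[node];
    for (;;) {
        unsigned long prev = atomic_exchange(local, my_tid);   /* SWAP(my_TID, Lock) */
        if (prev == FREE || prev == L_FREE) return;            /* you've got it */
        if (prev == REMOTE) { rh_spin_remotely(l, node); return; }
        /* A neighbor holds or awaits the lock: spin on the node-local copy
           for as long as it contains a thread ID, then retry the swap. */
        while (atomic_load(local) >= FIRST_TID) { }
    }
}

void rh_release(rh_lock *l, int node, unsigned long my_tid) {
    unsigned long expected = my_tid;
    /* Nobody in this node swapped in after us: free the lock for either node. */
    if (atomic_compare_exchange_strong(&l->copy[node], &expected, FREE)) return;
    /* A neighbor is waiting: hand the lock over within the node. */
    atomic_store(&l->copy[node], L_FREE);
}
```

In this sketch only the thread that swaps out REMOTE ever touches the other node's copy, which matches the "one thread per node" idea, and a release that finds a local waiter keeps the lock in the node by writing L_FREE.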

15 Performance Results
Traditional microbenchmark, 2-node Sun WildFire

16 Controlling Unfairness …
    void rh_acquire_slowpath(rh_lock *L) {
        ...
        /* About one slow-path acquire in FAIR_FACTOR will release fairly. */
        if ((random() % FAIR_FACTOR) == 0)
            be_fair = TRUE;
        else
            be_fair = FALSE;
        ...
    }

    void rh_release(rh_lock *L) {
        if (be_fair)
            *L = FREE;      /* give the other node a chance */
        else if (cas(L, my_tid, FREE) != my_tid)
            *L = L_FREE;    /* a neighbor is waiting: keep the lock in this node */
    }
[Figure: Cabinet 1 with processors P1 … Pn and the lock words Lock1 and Lock2 in memory, holding FREE, L_FREE, or a thread ID (TID).]

17 Node-handoffs
Traditional microbenchmark, 2-node Sun WildFire

18 New Microbenchmark
    for (i = 0; i < iterations; i++) {
        ACQUIRE(lock);
        // Critical Section (CS) work
        RELEASE(lock);
        // Non-CS work: STATIC part +
        // Non-CS work: RANDOM part
    }
- More realistic node-handoffs for queue-based locks
- Constant number of processors
- The amount of Critical Section (CS) work can be increased
  - we can control the “amount of contention”
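One thread's share of such a benchmark loop might look like the C11 sketch below. Everything concrete in it is an assumption for illustration: a plain test_and_set lock stands in for the lock under test, work is simulated with empty delay loops, and the names `run_new_microbenchmark`, `do_work`, and `next_random` as well as the xorshift generator are not from the talk.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Stand-in lock: a plain test_and_set spin lock. Any lock could be swapped in. */
atomic_flag bench_lock = ATOMIC_FLAG_INIT;

void bench_acquire(void) {
    while (atomic_flag_test_and_set_explicit(&bench_lock, memory_order_acquire)) { }
}

void bench_release(void) {
    atomic_flag_clear_explicit(&bench_lock, memory_order_release);
}

/* Simulated work: burn roughly n loop iterations. */
void do_work(uint32_t n) {
    for (volatile uint32_t i = 0; i < n; i++) { }
}

/* Per-thread xorshift32 generator; *state must be nonzero. */
uint32_t next_random(uint32_t *state) {
    uint32_t x = *state;
    x ^= x << 13;
    x ^= x >> 17;
    x ^= x << 5;
    return *state = x;
}

/* One thread's loop. shared_counter is only touched inside the CS. */
void run_new_microbenchmark(long iterations, uint32_t cs_work,
                            uint32_t static_work, uint32_t random_work,
                            uint32_t *rng_state, long *shared_counter) {
    assert(*rng_state != 0);
    for (long i = 0; i < iterations; i++) {
        bench_acquire();
        do_work(cs_work);                 /* critical section (CS) work */
        (*shared_counter)++;
        bench_release();
        do_work(static_work);             /* non-CS work: static part */
        if (random_work > 0)              /* non-CS work: random part */
            do_work(next_random(rng_state) % random_work);
    }
}
```

With the number of threads held constant, raising `cs_work` relative to the non-CS work increases the fraction of time threads spend waiting for the lock, which is the contention knob the slide describes.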

19 Performance Results
New microbenchmark, 2-node Sun WildFire, 28 CPUs
[Figure: performance plot; only the labels "WF" and "14" survive in the transcript.]

20 Application Performance (1): Methodology
- The SPLASH-2 programs
  - 14 apps
  - We study only applications with more than 10,000 acquire/release operations
  - Barnes, Cholesky, FMM, Radiosity, Raytrace, Volrend, and Water-Nsq
- Synchronization algorithms
  - TATAS, TATAS_EXP, MCS, CLH, and RH
- 2-node Sun WildFire

Program          Lock acquires
Barnes                  69,193
Cholesky                74,284
FFT                         32
FMM                     80,528
LU-c & LU-nc                32
Ocean-c                  6,304
Ocean-nc                 6,656
Radiosity              295,627
Radix                       32
Raytrace               366,450
Volrend                 38,456
Water-Nsq              112,415
Water-Sp                   510

21 Application Performance (2)
[Figure: Raytrace speedup, labelled "WF"; x-axis: number of processors (0 – 28), y-axis: speedup (0 – 8); one curve each for TATAS, TATAS_EXP, MCS, CLH, and RH.]

22 Single-Processor Results
Traditional microbenchmark, null CS:
    for (i = 0; i < iterations; i++) {
        ACQUIRE(lock);
        RELEASE(lock);
    }

Lock         Time
TATAS         97 ns
TATAS_EXP     97 ns
MCS          202 ns
CLH          137 ns
RH           121 ns
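A single-thread measurement of this kind can be approximated with the sketch below, which times the null-CS loop and divides by the iteration count. It is an assumption-laden stand-in: a plain test_and_set lock replaces the locks from the table, the function name `ns_per_iteration` is made up, and the standard `clock()` call is used for timing, so absolute numbers will not be comparable to the ones above.

```c
#include <assert.h>
#include <stdatomic.h>
#include <time.h>

/* Average cost, in nanoseconds, of one uncontended acquire/release pair. */
double ns_per_iteration(long iterations) {
    atomic_flag lock = ATOMIC_FLAG_INIT;   /* stand-in test_and_set lock */
    assert(iterations > 0);

    clock_t start = clock();
    for (long i = 0; i < iterations; i++) {
        while (atomic_flag_test_and_set_explicit(&lock, memory_order_acquire)) { }
        /* null critical section */
        atomic_flag_clear_explicit(&lock, memory_order_release);
    }
    clock_t end = clock();

    double elapsed_ns = (double)(end - start) / CLOCKS_PER_SEC * 1e9;
    return elapsed_ns / (double)iterations;
}
```

Because `clock()` has coarse resolution, a large iteration count (millions) is needed before the per-iteration figure means anything.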

23 Performance Results
Traditional microbenchmark, single-node E6000
- Bind all threads to only one of the E6000 nodes
As expected: RH lock ≈ TATAS_EXP

24 Conclusions
- First-come first-served is not desirable for NUCAs
- The RH lock exploits NUCAs by
  - creating locality through controlled unfairness (stable lock)
  - reducing traffic compared with the test&set locks
- The only lock that performs better under contention
- A critical section (CS) guarded by the RH lock takes less than half the time to execute compared with the same CS guarded by any other lock
- Raytrace on 30 CPUs: 1.83 – 5.70 “better”
- Works best for NUCA with a few large “nodes”

25 UART’s Home Page
http://www.it.uu.se/research/group/uart

