
RH Lock: A Scalable Hierarchical Spin Lock
Zoran Radovic and Erik Hagersten {zoranr,
Uppsala University, Department of Information Technology, Uppsala Architecture Research Team (UART)
2nd Annual Workshop on Memory Performance Issues (WMPI 2002), May 25, 2002, Anchorage, Alaska

RH Locks | WMPI 2002, Alaska | Uppsala Architecture Research Team (UART)

Synchronization History
- Spin locks:
  - test_and_set (TAS), e.g., IBM System/360, 1964
  - test_and_test_and_set (TATAS), Rudolph and Segall, ISCA '84
  - TATAS with exponential backoff (TATAS_EXP), 1990-1991
[Figure: processors P1..Pn, each with a cache, sharing a bus to memory; the lock word toggles between FREE and BUSY while waiters busy-wait or back off.]
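The TATAS_EXP scheme above can be sketched with C11 atomics. This is a minimal illustration, not the paper's code; the type names and backoff constants are assumptions:

```c
#include <stdatomic.h>
#include <sched.h>

// Test-and-test-and-set lock with exponential backoff (TATAS_EXP).
typedef struct { atomic_int held; } tatas_lock;

static void tatas_exp_acquire(tatas_lock *l) {
    unsigned backoff = 1;
    for (;;) {
        // Spin locally in the cache until the lock looks free (the "test").
        while (atomic_load_explicit(&l->held, memory_order_relaxed))
            ;
        // Then attempt the atomic exchange (the "test-and-set").
        if (!atomic_exchange_explicit(&l->held, 1, memory_order_acquire))
            return;
        // Failed: back off for an exponentially growing interval to
        // reduce coherence traffic on the lock's cache line.
        for (unsigned i = 0; i < backoff; i++)
            sched_yield();
        if (backoff < 1024)
            backoff <<= 1;
    }
}

static void tatas_exp_release(tatas_lock *l) {
    atomic_store_explicit(&l->held, 0, memory_order_release);
}
```

The inner "test" loop is what distinguishes TATAS from plain TAS: waiters spin on a cached copy instead of hammering the bus with atomic operations.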

Performance, 12 Years Ago
Traditional microbenchmark:

for (i = 0; i < iterations; i++) {
    ACQUIRE(lock);
    // Null Critical Section (CS)
    RELEASE(lock);
}

(Thanks: Michael L. Scott.)
If more contention, then less efficient critical sections.

Making It Scalable: Queues
- Spin on your predecessor's flag
- First-come, first-served order
- Queue-based locks:
  - QOLB/QOSB, 1989
  - MCS, 1991
  - CLH, 1993
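The queue-lock idea, spinning locally on your predecessor's flag, can be illustrated with a minimal CLH-style sketch in C11 atomics. This is a simplified illustration under assumed names, not the published MCS/CLH code:

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdlib.h>

// CLH-style queue lock: each thread enqueues a node with an atomic swap
// on the tail, then spins on its predecessor's flag. This gives
// first-come first-served order and purely local spinning.
typedef struct clh_node { atomic_bool locked; } clh_node;
typedef struct { _Atomic(clh_node *) tail; } clh_lock;

// Per-thread handle: the node we own and our predecessor's node.
typedef struct { clh_node *mine, *pred; } clh_handle;

static void clh_init(clh_lock *l) {
    // The queue starts with one unlocked dummy node as the tail.
    clh_node *dummy = malloc(sizeof *dummy);
    atomic_init(&dummy->locked, false);
    atomic_init(&l->tail, dummy);
}

static void clh_acquire(clh_lock *l, clh_handle *h) {
    atomic_store(&h->mine->locked, true);
    // Swap ourselves in as the new tail; the old tail is our predecessor.
    h->pred = atomic_exchange(&l->tail, h->mine);
    // Spin only on the predecessor's flag: no global traffic while waiting.
    while (atomic_load(&h->pred->locked))
        ;
}

static void clh_release(clh_handle *h) {
    atomic_store(&h->mine->locked, false);
    h->mine = h->pred;  // recycle the predecessor's node for our next acquire
}
```

The strict first-come first-served handover shown here is exactly the property the rest of the talk argues is undesirable on a NUCA: the next thread in the queue is often in the other node.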

Performance, May 2002
Traditional microbenchmark on a 16-processor Sun Enterprise E6000 SMP. [Graph not transcribed.]

Synchronization Today
- Commercial applications use spin locks (!)
  - usually TATAS and TATAS_EXP with a timeout, for:
    - recovery from transaction deadlock
    - recovery from preemption of the lock holder
- POSIX threads: pthread_mutex_lock, pthread_mutex_unlock
- HPC: runtime systems, OpenMP, ...

Non-Uniform Memory Architecture (NUMA)
NUMA optimizations:
- Page migration
- Page replication
[Figure: two bus-based nodes, each with processors P1..Pn and caches, connected by a switch; local memory access costs 1, remote access 2-10.]

Non-Uniform Communication Architecture (NUCA)
NUCA examples (NUCA ratios):
- 1992: Stanford DASH (~4.5)
- 1996: Sequent NUMA-Q (~10)
- 1999: Sun WildFire (~6)
- 2000: Compaq DS-320 (~3.5)
- Future: CMP, SMT (~10)
[Figure: the same two-node switch topology; the NUCA ratio compares remote (2-10) with local (1) communication cost.]

Our NUCA: Sun WildFire
- Two E6000s connected through a hardware-coherent interface with a raw bandwidth of 800 MB/s in each direction
- 16 UltraSPARC II (250 MHz) CPUs per node
- 8 GB memory
- NUCA ratio ~6

Performance on Our NUCA
Traditional microbenchmark, 16 processors. [Graph not transcribed.]

Our Goals
- Demonstrate that the first-come, first-served nature of queue-based locks is unwanted for NUCAs:
  - a new microbenchmark with "more realistic" behavior, and
  - a real application study
- Design a scalable spin lock that exploits NUCAs by:
  - creating controlled unfairness (a stable lock), and
  - reducing traffic compared with test&set locks

Outline
- History & Background
- NUMA vs. NUCA
- Experimentation Environment
- The RH Lock
- Performance Results
- Application Performance
- Conclusions

Key Ideas Behind the RH Lock
- Minimize global traffic at lock handover:
  - only one thread per node will try to acquire a "remote" lock
- Maximize the node locality of NUCAs:
  - hand the lock over to a neighbor in the same node
  - this creates locality for the critical-section (CS) data as well
  - especially good for large CSs and high contention
- The RH lock in a nutshell: a double TATAS_EXP, one node-local lock plus one "global" lock

The RH Lock Algorithm
One lock word per node (Lock1 in cabinet 1's memory, Lock2 in cabinet 2's memory); a lock word holds FREE, L_FREE, REMOTE, or a thread ID (TID).

Acquire: SWAP(my_TID, Lock)
- If the old value was FREE or L_FREE: you've got it, enter the CS.
- If the old value was REMOTE: spin remotely with CAS(FREE, REMOTE) until FREE, with exponential backoff.
- Else: TATAS(my_TID, Lock) until FREE or L_FREE.

Release: CAS(my_TID, FREE); if the CAS fails, store L_FREE.

If more contention, then more efficient critical sections.
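The local path of the slide's pseudocode can be sketched as a single lock word in C11 atomics. This is a much-simplified illustration under assumed constants: the real RH lock uses one lock word per node, a REMOTE spin path, and the fairness policy shown later, all omitted here:

```c
#include <stdatomic.h>

// Simplified single-word sketch of the RH lock's local acquire/release.
// FREE/L_FREE/REMOTE encodings are illustrative; thread IDs are >= 0.
enum { FREE = -1, L_FREE = -2, REMOTE = -3 };

typedef struct { atomic_int word; } rh_lock;

static void rh_acquire(rh_lock *l, int my_tid) {
    for (;;) {
        // Swap our TID in; the old value tells us the lock's state.
        int old = atomic_exchange_explicit(&l->word, my_tid,
                                           memory_order_acquire);
        if (old == FREE || old == L_FREE)
            return;  // the lock was free, globally or node-locally
        // Another thread holds it: spin until it looks free, then retry.
        int v;
        do {
            v = atomic_load_explicit(&l->word, memory_order_relaxed);
        } while (v != FREE && v != L_FREE);
    }
}

static void rh_release(rh_lock *l, int my_tid) {
    int expected = my_tid;
    // If nobody swapped a TID in while we held the lock, free it globally;
    // otherwise mark it L_FREE so a node-local waiter takes it next.
    if (!atomic_compare_exchange_strong_explicit(
            &l->word, &expected, FREE,
            memory_order_release, memory_order_relaxed))
        atomic_store_explicit(&l->word, L_FREE, memory_order_release);
}
```

The handover trick is in the release: a failed CAS means a waiter has already written its TID into the word, so publishing L_FREE hands the lock to that (node-local) waiter without any global traffic.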

Performance Results
Traditional microbenchmark, 2-node Sun WildFire. [Graph not transcribed.]

Controlling Unfairness

void rh_acquire_slowpath(rh_lock *L) {
    ...
    if ((random() % FAIR_FACTOR) == 0)
        be_fair = TRUE;
    else
        be_fair = FALSE;
    ...
}

void rh_release(rh_lock *L) {
    if (be_fair)
        *L = FREE;
    else if (cas(L, my_tid, FREE) != my_tid)
        *L = L_FREE;
}

Node Handoffs
Traditional microbenchmark, 2-node Sun WildFire. [Graph not transcribed.]

New Microbenchmark

for (i = 0; i < iterations; i++) {
    ACQUIRE(lock);
    // Critical Section (CS) work
    RELEASE(lock);
    // Non-CS work, STATIC part
    // Non-CS work, RANDOM part
}

- More realistic node handoffs for queue-based locks
- Constant number of processors
- The amount of critical-section (CS) work can be increased, so we can control the "amount of contention"
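A runnable sketch of this per-thread loop is below. A pthread mutex stands in for the lock under test, delay() simulates work, and ITERATIONS, NTHREADS, and the work amounts are illustrative parameters, not the paper's configuration:

```c
#include <pthread.h>

#define ITERATIONS 1000
#define NTHREADS 4

static pthread_mutex_t bench_lock = PTHREAD_MUTEX_INITIALIZER;
static long shared_counter;            // touched only inside the CS

static void delay(unsigned n) {        // simulated work
    for (volatile unsigned i = 0; i < n; i++)
        ;
}

static void *bench_thread(void *arg) {
    unsigned seed = 1 + (unsigned)(long)arg;
    for (int i = 0; i < ITERATIONS; i++) {
        pthread_mutex_lock(&bench_lock);
        shared_counter++;              // Critical Section (CS) work
        delay(50);
        pthread_mutex_unlock(&bench_lock);
        delay(100);                    // non-CS work, STATIC part
        seed = seed * 1103515245u + 12345u;   // simple LCG
        delay(seed % 100);             // non-CS work, RANDOM part
    }
    return NULL;
}

static long run_benchmark(void) {
    pthread_t t[NTHREADS];
    shared_counter = 0;
    for (long i = 0; i < NTHREADS; i++)
        pthread_create(&t[i], NULL, bench_thread, (void *)i);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(t[i], NULL);
    return shared_counter;
}
```

The random non-CS interval is what breaks the lock-step arrival pattern of the traditional microbenchmark, producing more realistic node handoffs for queue-based locks.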

Performance Results
New microbenchmark, 2-node Sun WildFire, 28 CPUs. [Graph not transcribed.]

Application Performance (1): Methodology
- The SPLASH-2 programs (14 apps)
- We study only applications with more than 10,000 acquire/release operations: Barnes, Cholesky, FMM, Radiosity, Raytrace, Volrend, and Water-Nsq
- Synchronization algorithms: TATAS, TATAS_EXP, MCS, CLH, and RH
- 2-node Sun WildFire

Program       Lock acquires
Barnes        69,193
Cholesky      74,284
FFT           32
FMM           80,528
LU-c & LU-nc  32
Ocean-c       6,304
Ocean-nc      6,656
Radiosity     295,627
Radix         32
Raytrace      366,450
Volrend       38,456
Water-Nsq     112,415
Water-Sp      510

Application Performance (2): Raytrace Speedup
[Graph: speedup vs. number of processors on WildFire for TATAS, TATAS_EXP, MCS, CLH, and RH.]

Single-Processor Results
Traditional microbenchmark, null critical section:

for (i = 0; i < iterations; i++) {
    ACQUIRE(lock);
    RELEASE(lock);
}

Lock        Latency
TATAS       97 ns
TATAS_EXP   97 ns
MCS         202 ns
CLH         137 ns
RH          121 ns

Performance Results
Traditional microbenchmark, single-node E6000.
- Bind all threads to only one of the E6000 nodes.
- As expected: RH lock performs about the same as TATAS_EXP.

Conclusions
- First-come, first-served order is not desirable for NUCAs.
- The RH lock exploits NUCAs by:
  - creating locality through controlled unfairness (a stable lock), and
  - reducing traffic compared with test&set locks.
- It is the only lock that performs better under contention: a critical section (CS) guarded by the RH lock takes less than half the time to execute compared with the same CS guarded by any other lock.
  - Raytrace on 30 CPUs: 1.83-5.70x "better"
- Works best for NUCAs with a few large "nodes".

UART's Home Page