NUCA Locks: Efficient Synchronization for Non-Uniform Communication Architectures
Zoran Radovic and Erik Hagersten {zoran.radovic,
Uppsala University, Department of Information Technology
Uppsala Architecture Research Team [UART]

NUCA Locks, Supercomputing 2002, Uppsala Architecture Research Team (UART)

Synchronization Basics
- Locks are used to protect the shared critical-section data:

    A := 0
    BARRIER

    LOCK(L)        LOCK(L)
    A := A + 1     B := A + 5
    UNLOCK(L)      UNLOCK(L)

Simple Spin Locks
- test-and-test&set (TATAS), '84
- TATAS with exponential backoff (TATAS_EXP), '90
- Many variations

[Figure: processors P1..Pn with caches spinning on a lock word in memory; the word toggles between FREE and BUSY while waiters busy-wait or back off.]

    TATAS_LOCK(L) {
      if (tas(L)) {
        do {
          if (*L) continue;
        } while (tas(L));
      }
    }

    TATAS_UNLOCK(L) {
      *L = 0;   // = FREE
    }

Performance Under Contention
[Figure: CS cost vs. amount of contention for spin locks w/ backoff.]
- IF (more contention) THEN less efficient CS

Making It Scalable: Queues
- First-come, first-served order
- Starvation avoidance
- Maximal fairness
- Reduced traffic
Queue-based locks:
- HW: QOLB '89
- SW: MCS '91
- SW: CLH '93

Queue Locks Under Contention
[Figure: CS cost vs. amount of contention for spin locks w/ backoff and for queue-based locks.]
- IF (more contention) THEN constant CS cost

Non-Uniform Memory Architecture (NUMA)
- Many NUMA optimizations have been proposed:
  - Page migration
  - Page replication
[Figure: two nodes of processors (P1..Pn with caches) and memory connected by a switch; local access cost 1, remote access cost 2-10.]

Non-Uniform Communication Architecture (NUCA)
- NUCA examples (NUCA ratios):
  - 1992: Stanford DASH (~4.5)
  - 1996: Sequent NUMA-Q (~10)
  - 1999: Sun WildFire (~6)
  - 2000: Compaq DS-320 (~3.5)
  - Future: CMP, SMT (~10)
[Figure: our NUCA, two nodes of processors and memory connected by a switch; the NUCA ratio is the remote-to-local communication cost, here 2-10.]

Our Goals
- Design a scalable spin lock that exploits NUCAs
- Create node affinity:
  - For lock handover
  - For CS data ("stable lock")
- Reduce the traffic compared with the test&set locks

Outline
- Background & Motivation
- NUMA vs. NUCA
- The RH Lock
- Performance Results
- Application Study
- Conclusions

Key Ideas Behind the RH Lock
- Minimize global traffic at lock handover:
  - Only one thread per node will try to acquire a remotely owned lock
- Maximize node locality of NUCAs:
  - Hand the lock over to a neighbor in the same node
  - Creates locality for the critical section (CS) data as well
  - Especially good for large CSs and high contention
- RH lock in a nutshell: double TATAS_EXP, one node-local lock + one "global"

The RH Lock Algorithm
[Figure: two cabinets (nodes) of 16 processors each; each cabinet's memory holds its own copy of the lock (Lock1 in Cabinet 1, Lock2 in Cabinet 2). A copy holds FREE, L_FREE (locally free), REMOTE, or the owner's TID.]

Acquire: SWAP(my_TID, Lock)
- If the old value was FREE or L_FREE: you've got it, enter the CS
- If it was REMOTE: spin remotely, CAS(FREE, REMOTE) until FREE (w/ exp backoff)
- Else: TATAS(my_TID, Lock) until FREE or L_FREE

Release: CAS(my_TID, FREE)
- If the CAS fails (another local thread has swapped in its TID), store L_FREE instead

Our NUCA: Sun WildFire
[Figure: two WildFire nodes of processors and memory connected by a switch, annotated with the NUCA ratio.]

NUCA Performance
[Figure: NUCA performance results; only the label "14" survives in the transcript.]

New Microbenchmark

    for (i = 0; i < iterations; i++) {
      LOCK(L);
      delay(critical_work);   // CS
      UNLOCK(L);
      static_delay();
      random_delay();
    }

- More realistic node handoffs for queue-based locks
- Constant number of processors
- Amount of Critical Section (CS) work can be increased, so we can control the "amount of contention"

Performance Results
- New microbenchmark, 2-node Sun WildFire, 28 CPUs
[Figure: microbenchmark performance results.]

Traffic Measurements
- New microbenchmark; critical_work = 1500
[Figure: traffic measurements.]

Application Performance
[Figure: Raytrace speedup on WildFire.]
RH Lock Under Contention
[Figure: CS cost vs. amount of contention for spin locks w/ backoff, queue-based locks, and the RH lock.]

Total Traffic: Raytrace
[Figure: total traffic for Raytrace.]

Application Performance
- 28-processor runs
[Figure: application performance results.]

NUCA Locks Supercomputing 2002Uppsala Architecture Research Team (UART)  First-come, first-served not desirable for NUCAs  The RH lock exploits NUCAs by  creating locality through CS affinity (stable lock)  reducing traffic compared with the test&set locks  The first lock that performs better under contention  Global traffic is significantly reduced  Applications with contented locks scale better with RH locks on NUCAs Conclusions

Any Drawbacks?
- Proof-of-concept NUCA-aware lock for 2 nodes
- Hard to port to some architectures:
  - Memory needs to be allocated/placed in different nodes
  - Lock storage is proportional to the number of NUCA nodes
- Sensitive to starvation:
  - "Non-uniform nature" of the algorithm
  - No mechanism for lowering the risk of starvation

Can We Fix It?
- We propose a new set of NUCA-aware locks:
  - Hierarchical Backoff Locks (HBO)
  - HPCA-9: Anaheim, California, February 2003
- Teaser:
  - Portable
  - Scalable to many NUCA nodes
  - Only cas atomic operations are used
  - Only node_id is needed
  - Lowers the risk of starvation

UART's Home Page