Hierarchical Back-Off (HBO) Locks for Non-Uniform Communication Architectures

Hierarchical Back-Off (HBO) Locks for Non-Uniform Communication Architectures
Zoran Radovic and Erik Hagersten
Uppsala University, Department of Information Technology, Uppsala Architecture Research Team [UART]
HPCA-9: Ninth International Symposium on High Performance Computer Architecture, Anaheim, California, February 8-12, 2003

Synchronization Basics
- Locks are used to protect shared critical-section data
- Common software-based solutions:
  - Simple spin locks: TATAS ('84), TATAS_EXP ('90)
  - Queue-based locks: MCS ('91), CLH ('93)
Example: A:=0; BARRIER; then two critical sections on the same lock: LOCK(L) A:=A+1 UNLOCK(L) and LOCK(L) B:=A+5 UNLOCK(L)
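As a point of reference for the simple spin locks named above, here is a minimal C11 sketch of a test-and-test&set lock with exponential backoff (TATAS_EXP). The type `tatas_lock_t`, the helper `spin_delay`, and the backoff bounds are assumptions made for illustration; they do not come from the slides.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical lock type: 0 = free, 1 = held. */
typedef struct { atomic_int held; } tatas_lock_t;

/* Assumed backoff bounds, counted in iterations of an empty delay loop. */
enum { BACKOFF_MIN = 4, BACKOFF_MAX = 1024 };

static void spin_delay(unsigned n) {
    for (volatile unsigned i = 0; i < n; i++) { /* busy wait */ }
}

/* Test-and-test&set: read first, and attempt the atomic exchange
 * only if the lock looked free. Returns true if the lock was taken. */
static bool tatas_try_lock(tatas_lock_t *l) {
    if (atomic_load_explicit(&l->held, memory_order_relaxed) != 0)
        return false;
    return atomic_exchange_explicit(&l->held, 1, memory_order_acquire) == 0;
}

/* Retry with a delay that doubles after every failed attempt. */
static void tatas_exp_lock(tatas_lock_t *l) {
    unsigned backoff = BACKOFF_MIN;
    while (!tatas_try_lock(l)) {
        spin_delay(backoff);
        if (backoff < BACKOFF_MAX)
            backoff *= 2;
    }
}

static void tatas_unlock(tatas_lock_t *l) {
    /* Precondition: the caller holds the lock. */
    assert(atomic_load_explicit(&l->held, memory_order_relaxed) == 1);
    atomic_store_explicit(&l->held, 0, memory_order_release);
}
```

A critical section such as `A:=A+1` would sit between `tatas_exp_lock(&L)` and `tatas_unlock(&L)`.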

Raytrace Speedup
[Figure: Raytrace speedup on Sun WildFire (WF).]

Vasaloppet: "Contention Problem in Sweden"
- Traditional cross-country ski race, 55 miles
[Figure: race photo annotated "... miles to go ..." and "CS".]

Spin Locks Under Contention
[Figure: critical section (CS) cost vs. amount of contention, for spin locks with backoff.]
IF (more contention) THEN less efficient CS: "The more important, the slower it runs..."

Queue-based Locks
[Figure: CS cost vs. amount of contention, for spin locks with backoff and queue-based locks.]
IF (more contention) THEN constant CS cost

This Talk
[Figure: CS cost vs. amount of contention, for queue-based locks, spin locks with backoff, and HBO locks.]
IF (more contention) THEN more efficient CS: "The more important, the faster it runs..."

Raytrace Speedup
[Figure: Raytrace speedup on Sun WildFire (WF), now including HBO locks.]

Outline
- Background & Motivation
- NUMA vs. NUCA Architectures
- Hierarchical Back-Off (HBO) Locks
  - HBO
  - HBO_GT
  - HBO_GT with starvation detection/avoidance
- Performance Results
- Conclusions

Non-Uniform Memory Architecture (NUMA)
- Many NUMA optimizations have been proposed:
  - Page migration: speeds up accesses to "private" data
  - Page replication: speeds up reads of "shared" data
- These do not help communication, e.g., synchronization
[Figure: two nodes, each with processors P1..Pn, caches ($), and memory, connected by a switch. Access time ratio: 1 vs. 2 to 10.]

A "new" property of NUMAs: Non-Uniform Communication Architecture (NUCA)
- NUCA examples (NUCA ratios):
  - 1992: Stanford DASH (~4.5)
  - 1996: Sequent NUMA-Q (~10)
  - 1999: Sun WildFire (~6)
  - 2000: Compaq DS-320 (~3.5)
  - Future: CMP, SMT (~10)
- NUCA optimizations are becoming important for future architectures!
[Figure: the same two-node diagram, with the NUCA ratio marked as 1 vs. 2 to 10.]

Our Goals
- Design scalable spin locks that exploit NUCAs
- Create communication affinity: keep the lock in the neighborhood [Mr. Rogers, 1968]
  - Speeds up lock handover
  - Lowers the access cost to critical section (CS) data
- Reduce remote "probing" traffic
- Portable and scalable to many NUCA nodes

The HBO Lock (the simplest HBO)
- What do we need?
  - node_id
  - Compare&swap (CAS) atomic operation: CAS(Lock_address, FREE, node_id)
- lock-acquire:
  - If the lock value is FREE: the node_id is CAS-ed into the lock location
  - Else, 2 cases (for 2 levels of non-uniformity):
    - The lock is "local": TATAS_EXP with small backoff
    - The lock is "remote": TATAS_EXP with large backoff
- Simple but fairly effective: creates communication affinity
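The lock-acquire steps on this slide can be sketched in C11 as follows. This is an illustrative reading of the slide, not the authors' code: the encoding `FREE = -1`, the names `hbo_probe`, `hbo_acquire`, `hbo_release`, and `spin_delay`, and all backoff constants are assumptions.

```c
#include <assert.h>
#include <stdatomic.h>

#define FREE (-1)              /* assumed encoding; node ids are 0, 1, 2, ... */

typedef atomic_int hbo_lock_t; /* holds FREE or the node_id of the holder */

/* Assumed backoff bounds: small for a local holder, large for a remote one. */
enum { LOCAL_MIN = 4, LOCAL_MAX = 64, REMOTE_MIN = 64, REMOTE_MAX = 4096 };

static void spin_delay(unsigned n) {
    for (volatile unsigned i = 0; i < n; i++) { /* busy wait */ }
}

/* Test, then CAS(lock, FREE, node_id).
 * Returns FREE if we took the lock, otherwise the holder's node id. */
static int hbo_probe(hbo_lock_t *l, int node_id) {
    int seen = atomic_load_explicit(l, memory_order_relaxed);
    if (seen != FREE)
        return seen;
    if (atomic_compare_exchange_strong_explicit(l, &seen, node_id,
            memory_order_acquire, memory_order_relaxed))
        return FREE;
    return seen; /* on CAS failure, seen holds the value found in the lock */
}

static void hbo_acquire(hbo_lock_t *l, int node_id) {
    assert(node_id != FREE);
    unsigned local_b = LOCAL_MIN, remote_b = REMOTE_MIN;
    for (;;) {
        int holder = hbo_probe(l, node_id);
        if (holder == FREE)
            return;                          /* lock acquired */
        if (holder == node_id) {             /* lock is "local" */
            spin_delay(local_b);
            if (local_b < LOCAL_MAX) local_b *= 2;
        } else {                             /* lock is "remote" */
            spin_delay(remote_b);
            if (remote_b < REMOTE_MAX) remote_b *= 2;
        }
    }
}

static void hbo_release(hbo_lock_t *l) {
    atomic_store_explicit(l, FREE, memory_order_release);
}
```

Because waiters in the holder's own node retry sooner than remote waiters, the next owner is more likely to be a neighbor, which is the communication affinity the slide refers to.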

The HBO_GT Lock (GT = Global Throttling)
[Animation in two frames, "a couple of nanoseconds" apart: Node 2 and Node 5, each with processors (P), caches ($), and memory. Locks Lock1, Lock2, and Lock3 hold either FREE or a node_id. Labels: local spinning; remote spinning (with exponential backoff); probing (with CAS); a per-node my_is_spinning variable holding addr(Lock1); "read a node-local flag".]
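The animation labels suggest that a thread reads a node-local flag before probing a lock that a neighbor is already spinning on remotely. The sketch below is one possible interpretation in C11, reconstructed only from those labels; the actual HBO_GT algorithm is defined in the paper and may differ. The array `is_spinning`, the constant `MAX_NODES`, the `gt_` names, and the backoff values are all assumptions.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

#define GT_FREE   (-1)        /* assumed encoding of the FREE state */
#define MAX_NODES 8           /* assumed upper bound on NUCA nodes */

typedef atomic_int gt_lock_t; /* holds GT_FREE or the holder's node id */

enum { LOCAL_MIN = 4, LOCAL_MAX = 64, REMOTE_MIN = 64, REMOTE_MAX = 4096 };

/* One flag per node: address of the lock that some thread in this
 * node is currently spinning on remotely, or NULL. */
static _Atomic(gt_lock_t *) is_spinning[MAX_NODES];

static void spin_delay(unsigned n) {
    for (volatile unsigned i = 0; i < n; i++) { /* busy wait */ }
}

/* CAS(lock, GT_FREE, node_id): returns GT_FREE on success, else the holder. */
static int gt_probe(gt_lock_t *l, int node_id) {
    int seen = GT_FREE;
    if (atomic_compare_exchange_strong_explicit(l, &seen, node_id,
            memory_order_acquire, memory_order_relaxed))
        return GT_FREE;
    return seen;
}

static void gt_acquire(gt_lock_t *l, int node_id) {
    assert(node_id >= 0 && node_id < MAX_NODES);
    unsigned local_b = LOCAL_MIN;
    for (;;) {
        /* Throttle: wait on the node-local flag while a neighbor
         * is already probing this lock across the switch. */
        while (atomic_load(&is_spinning[node_id]) == l)
            spin_delay(LOCAL_MIN);
        int holder = gt_probe(l, node_id);
        if (holder == GT_FREE)
            return;
        if (holder == node_id) {             /* local holder: small backoff */
            spin_delay(local_b);
            if (local_b < LOCAL_MAX) local_b *= 2;
            continue;
        }
        /* Remote holder: announce it, then spin with large backoff. */
        atomic_store(&is_spinning[node_id], l);
        unsigned remote_b = REMOTE_MIN;
        do {
            spin_delay(remote_b);
            if (remote_b < REMOTE_MAX) remote_b *= 2;
            holder = gt_probe(l, node_id);
        } while (holder != GT_FREE && holder != node_id);
        atomic_store(&is_spinning[node_id], (gt_lock_t *)NULL);
        if (holder == GT_FREE)
            return;
    }
}

static void gt_release(gt_lock_t *l) {
    atomic_store_explicit(l, GT_FREE, memory_order_release);
}
```

In this reading the throttle is best effort: two threads of a node may briefly both spin remotely, which costs some extra traffic but does not affect mutual exclusion.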

Our NUCA: Sun WildFire
[Figure: two WildFire (WF) nodes, each with processors P1..P14, caches ($), and memory, connected by a switch, with the NUCA ratio marked.]

Traditional Microbenchmark
For each thread:
    for (i = 0; i < iterations; i++) {
        LOCK(L);
        /* null/small critical section */
        UNLOCK(L);
    }

NUCA-performance
[Figure: traditional microbenchmark results on WildFire (WF).]

New Microbenchmark
    for (i = 0; i < iterations; i++) {
        LOCK(L);
        delay(critical_work);  // CS
        UNLOCK(L);
        static_delay();
        random_delay();
    }
- More realistic node handoffs for queue locks
- Constant number of processors
- Control the "amount of contention"
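To make the loop above concrete, here is a self-contained C11 sketch of the per-thread body. It is a hedged illustration: the `atomic_flag` stand-in for the lock under test, the counter `cs_entries`, the function name `bench_thread`, its parameters, and the simple linear congruential generator used for the random delay are all assumptions, not details from the slides.

```c
#include <assert.h>
#include <stdatomic.h>

static atomic_flag L = ATOMIC_FLAG_INIT; /* stand-in for the lock under test */
static long cs_entries;                  /* protected by L; counts CS entries */

#define LOCK(l)   while (atomic_flag_test_and_set_explicit(&(l), memory_order_acquire)) { }
#define UNLOCK(l) atomic_flag_clear_explicit(&(l), memory_order_release)

/* Busy-wait for n iterations of an empty loop. */
static void delay(unsigned n) {
    for (volatile unsigned i = 0; i < n; i++) { /* busy wait */ }
}

/* Per-thread body. Contention falls as static_work and random_max
 * grow relative to critical_work, and rises as they shrink. */
static void bench_thread(unsigned iterations, unsigned critical_work,
                         unsigned static_work, unsigned random_max,
                         unsigned seed) {
    assert(random_max > 0);
    for (unsigned i = 0; i < iterations; i++) {
        LOCK(L);
        cs_entries++;                        /* work inside the CS */
        delay(critical_work);
        UNLOCK(L);
        delay(static_work);                  /* static_delay() */
        seed = seed * 1103515245u + 12345u;  /* small LCG for random_delay() */
        delay((seed >> 16) % random_max);
    }
}
```

Running the same body on every processor with a fixed thread count, and varying only the delay parameters, is one way to sweep the "amount of contention" axis used in the later plots.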

Performance Results
[Figure: new microbenchmark, 2-node Sun WildFire (WF), 28 CPUs.]
Fairness?

Fairness Study
[Figure: new microbenchmark, 2-node Sun WildFire, 28 CPUs.]

Application Performance
[Figure: Raytrace speedup on WildFire (WF).]

HBO Locks Under Contention
[Figure: CS cost vs. amount of contention, for queue-based locks, spin locks with backoff, and HBO locks.]

Total Traffic: Raytrace
[Figure: total traffic for Raytrace, with bars annotated 1.11x and 1.45x.]

Application Performance
[Figure: 28-processor runs.]

Conclusions
- First-come, first-served is not desirable for NUCAs
- The HBO lock exploits NUCAs by
  - creating locality through CS affinity (stable lock)
  - reducing traffic compared with the test&set locks
- HBO performs better under contention
- Traffic is significantly reduced
- Applications with contended locks scale better with HBO locks on NUCAs
- Starvation detection/avoidance is covered in the paper

UART's Home Page
Supported by Sun Microsystems, Inc., and the Parallel and Scientific Computing Institute (PSCI)