The Performance of Spin Lock Alternatives for Shared-Memory Multiprocessors
Thomas E. Anderson
Presented by Daesung Park
Introduction
- In shared-memory multiprocessors, each processor can directly access memory.
- To keep a shared data structure consistent, we need a way to serialize the operations performed on it.
- Shared-memory multiprocessors provide some form of hardware support for mutual exclusion: atomic instructions.
Why is a lock needed?
- If the operations on the critical section are simple enough:
  - Encapsulate them in a single atomic instruction.
  - Mutual exclusion is guaranteed directly by hardware.
  - Each processor attempting to access the shared data waits its turn without returning control to software.
- If the operations are not simple:
  - A LOCK is needed.
  - If the lock is busy, waiting is done in software.
  - Two choices: block or spin.
The topics of this paper
- Are there efficient software algorithms for spin-waiting on a busy lock?
  - Five software alternatives are presented.
- Is more complex hardware support needed for good performance?
  - Hardware solutions for multistage interconnection network multiprocessors and single-bus multiprocessors are presented.
Multiprocessor Architectures
- How processors are connected to memory: multistage interconnection network or bus.
- Whether or not each processor has a coherent private cache: yes or no.
- What the coherence protocol is: invalidation-based or distributed-write.
For good performance
- Minimize the communication bandwidth consumed by waiting processors.
- Minimize the delay between when a lock is released and when it is reacquired.
- Minimize latency when there is no lock contention by keeping the algorithm simple.
The problem of spinning: spin on test-and-set
- The performance of spinning on test-and-set degrades as the number of spinning processors increases.
- The lock holder must contend with the spinning processors for access to the lock location, and for the other locations it needs for its normal work.
The problem of spinning: spin on test-and-set
(Figure: processors P1-P4 sharing a write-through, invalidation-based bus to memory.)
    Init    lock := CLEAR;
    Lock    while (TestAndSet(lock) = BUSY) ;
    Unlock  lock := CLEAR;
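As a concrete sketch (mine, not the paper's), the spin-on-test-and-set lock above might be written with C11 atomics as follows; the names tas_lock, tas_lock_init, etc. are invented for illustration:

    #include <stdatomic.h>

    /* A minimal spin-on-test-and-set lock using C11 atomics.
     * atomic_flag_test_and_set atomically sets the flag and returns
     * its previous value, i.e., the TestAndSet instruction above. */
    typedef struct {
        atomic_flag busy;   /* clear = free, set = BUSY */
    } tas_lock;

    static void tas_lock_init(tas_lock *l) {
        atomic_flag_clear(&l->busy);   /* lock := CLEAR */
    }

    static void tas_lock_acquire(tas_lock *l) {
        /* Every iteration is a test-and-set: each one generates bus
         * traffic even while the lock is held by another processor. */
        while (atomic_flag_test_and_set_explicit(&l->busy, memory_order_acquire))
            ;   /* spin */
    }

    static void tas_lock_release(tas_lock *l) {
        atomic_flag_clear_explicit(&l->busy, memory_order_release);
    }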
The problem of spinning: spin on read (test-and-test-and-set)
- Use the cache to reduce the cost of spinning: each processor spins on its cached copy of the lock.
- When the lock is released, each cache is updated or invalidated; a waiting processor sees the change and then performs a test-and-set.
- When the critical section is small, this performs as poorly as spin on test-and-set.
- The effect is most pronounced on systems with invalidation-based cache coherence, but it also occurs with distributed-write coherence.
The problem of spinning: spin on read
(Figure: processors P1-P4 sharing a write-through, invalidation-based bus to memory.)
    Lock    while (lock = BUSY or TestAndSet(lock) = BUSY) ;
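A corresponding C sketch of test-and-test-and-set (again mine, with invented names); the inner loop reads the cached copy and only attempts the test-and-set once the lock looks free:

    #include <stdatomic.h>
    #include <stdbool.h>

    #define BUSY  true
    #define CLEAR false

    typedef struct { atomic_bool lock; } ttas_lock;

    static void ttas_init(ttas_lock *l) {
        atomic_store(&l->lock, CLEAR);
    }

    static void ttas_acquire(ttas_lock *l) {
        for (;;) {
            /* Spin on a plain read first: while the lock is BUSY this
             * hits the local cache and generates no bus traffic. */
            while (atomic_load_explicit(&l->lock, memory_order_relaxed) == BUSY)
                ;
            /* Lock looked free; now attempt the test-and-set. */
            if (!atomic_exchange_explicit(&l->lock, BUSY, memory_order_acquire))
                return;   /* previous value was CLEAR: we hold the lock */
        }
    }

    static void ttas_release(ttas_lock *l) {
        atomic_store_explicit(&l->lock, CLEAR, memory_order_release);
    }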
Reasons for the poor performance of spin on read
- There is a separation between detecting that the lock has been released and attempting to acquire it with a test-and-set instruction.
- As a result, more than one test-and-set can occur for a single release.
- A test-and-set invalidates the other caches even if it does not change the lock's value.
- Invalidation-based cache coherence requires O(P) bus or network cycles to broadcast each invalidation.
The problem of spinning: measurement results
(Figures: measured spin-waiting performance; omitted.)
Software solutions: delay alternatives
- Insert a delay into the spinning loop.
- Where to insert the delay:
  - after the lock has been released, or
  - after every separate access to the lock.
- The length of the delay can be static or dynamic.
- Lock latency is not affected, because a processor first tries to get the lock and only delays after failing.
Delay alternatives: delay after a spinning processor notices the lock has been released
- Reduces the number of test-and-sets when spinning on read.
- Each processor can be statically assigned a separate slot, i.e., an amount of time to delay.
- The spinning processor with the smallest delay gets the lock; the others can resume spinning without issuing a test-and-set.
- When there are few spinning processors, using fewer slots is better.
- When there are many spinning processors, using fewer slots results in many simultaneous test-and-set attempts.
Delay alternatives: backoff
- Vary the spinning behavior based on the number of waiting processors: the number of collisions (failed test-and-sets) estimates the number of waiters.
- Initially assume there are no other waiting processors.
- Try a test-and-set; a failure is a collision.
- On each collision, double the delay, up to some limit (see the sketch below).
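A minimal sketch of this exponential backoff (not the paper's code); cpu_delay is a hypothetical busy-wait helper, and the limit MAX_DELAY is an arbitrary choice:

    #include <stdatomic.h>

    /* Hypothetical busy-wait helper: spins for roughly n iterations. */
    static void cpu_delay(unsigned n) {
        for (volatile unsigned i = 0; i < n; i++)
            ;
    }

    #define MAX_DELAY 1024u

    typedef struct { atomic_flag busy; } backoff_lock;

    static void backoff_acquire(backoff_lock *l) {
        unsigned delay = 1;   /* assume no other waiters at first */
        /* Each failed test-and-set is a collision: double the delay,
         * up to a fixed limit. */
        while (atomic_flag_test_and_set_explicit(&l->busy, memory_order_acquire)) {
            cpu_delay(delay);
            if (delay < MAX_DELAY)
                delay *= 2;
        }
    }

    static void backoff_release(backoff_lock *l) {
        atomic_flag_clear_explicit(&l->busy, memory_order_release);
    }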
Delay alternatives: delay between each memory reference
- Can be used on architectures without caches or with invalidation-based caches.
- Reduces the bandwidth consumed by spinning processors.
- The mean delay can be set statically or dynamically.
- More frequent polling improves performance when there are few spinning processors.
- A sketch appears below.
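A sketch of this variant under the same assumptions as above (my names, hypothetical spin_wait helper): every poll of the lock is separated by a delay, which bounds how often each waiter touches the bus or network:

    #include <stdatomic.h>

    /* Hypothetical busy-wait helper, as in the backoff sketch. */
    static void spin_wait(unsigned n) {
        for (volatile unsigned i = 0; i < n; i++)
            ;
    }

    typedef struct { atomic_flag busy; } delay_lock;

    static void delay_acquire(delay_lock *l, unsigned poll_delay) {
        /* Delay between successive references to the lock location. */
        while (atomic_flag_test_and_set_explicit(&l->busy, memory_order_acquire))
            spin_wait(poll_delay);
    }

    static void delay_release(delay_lock *l) {
        atomic_flag_clear_explicit(&l->busy, memory_order_release);
    }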
Software solutions: queuing in shared memory
- Each processor inserts itself into a queue, then spins on a separate memory location (its flag).
- When a processor finishes with the critical section, it sets the flag of the next processor in the queue.
- Only one cache read miss occurs per lock handoff.
- Maintaining the queue is expensive, so this is much worse for small critical sections.
Queuing
    Init    flags[0] := HAS_LOCK;
            flags[1..P-1] := MUST_WAIT;
            queueLast := 0;
    Lock    myPlace := ReadAndIncrement(queueLast);
            while (flags[myPlace mod P] = MUST_WAIT) ;
            ... critical section ...
    Unlock  flags[myPlace mod P] := MUST_WAIT;
            flags[(myPlace + 1) mod P] := HAS_LOCK;
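The pseudocode above translates roughly into the following C sketch (mine; P and CACHE_LINE are assumptions fixed at compile time). Padding each flag to its own cache line mirrors the "separate cache block" requirement discussed on the next slides:

    #include <stdatomic.h>

    #define P          16          /* number of processors, fixed here */
    #define CACHE_LINE 64

    enum { MUST_WAIT = 0, HAS_LOCK = 1 };

    typedef struct {
        /* Pad each flag to its own cache line so waiters do not
         * interfere with each other: one miss per handoff. */
        struct { atomic_int flag; char pad[CACHE_LINE - sizeof(atomic_int)]; }
            flags[P];
        atomic_uint queue_last;
    } queue_lock;

    static void queue_lock_init(queue_lock *q) {
        atomic_store(&q->flags[0].flag, HAS_LOCK);
        for (int i = 1; i < P; i++)
            atomic_store(&q->flags[i].flag, MUST_WAIT);
        atomic_store(&q->queue_last, 0);
    }

    /* Returns my_place; the caller passes it back to release. */
    static unsigned queue_lock_acquire(queue_lock *q) {
        /* ReadAndIncrement: take a ticket. */
        unsigned my_place = atomic_fetch_add(&q->queue_last, 1);
        while (atomic_load_explicit(&q->flags[my_place % P].flag,
                                    memory_order_acquire) == MUST_WAIT)
            ;   /* spin on my own flag only */
        return my_place;
    }

    static void queue_lock_release(queue_lock *q, unsigned my_place) {
        atomic_store(&q->flags[my_place % P].flag, MUST_WAIT);
        atomic_store_explicit(&q->flags[(my_place + 1) % P].flag, HAS_LOCK,
                              memory_order_release);
    }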
Queuing: implementations across architectures
- Distributed-write cache coherence:
  - All processors share a counter.
  - To release the lock, a processor writes its sequence number into the shared counter.
  - Every cache is updated, directly notifying the next processor in line.
- Invalidation-based cache coherence:
  - Each processor must wait on a flag in a separate cache block.
  - On release, one cache is invalidated and one read miss occurs.
- Multistage network without coherence:
  - Each processor must wait on a flag in a separate memory module.
  - Processors have to poll to learn when it is their turn.
Queuing: implementations (continued)
- Bus without coherence:
  - Processors must poll to find out if it is their turn; this can swamp the bus.
  - A delay can be inserted between polls, based on the processor's position in the queue and the execution time of the critical section.
- Without an atomic read-and-increment instruction:
  - A lock is needed to protect the counter; one of the delay alternatives above may help with contention for it.
  - Problem: increased lock latency (increment the counter, clear one flag location, set another).
  - If there is no contention, this latency is a pure loss of performance.
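Where the hardware lacks an atomic read-and-increment, it can be emulated with a small test-and-set lock around the counter, roughly as sketched below (my names; this is exactly the extra latency noted above, paid even when there is no contention):

    #include <stdatomic.h>

    /* Emulating ReadAndIncrement with a test-and-set lock that
     * guards the queue counter. */
    typedef struct {
        atomic_flag guard;
        unsigned    counter;
    } soft_counter;

    static unsigned read_and_increment(soft_counter *c) {
        while (atomic_flag_test_and_set_explicit(&c->guard, memory_order_acquire))
            ;   /* one of the delay alternatives could be inserted here */
        unsigned old = c->counter++;
        atomic_flag_clear_explicit(&c->guard, memory_order_release);
        return old;
    }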
Measurement results of software alternatives
(Figures: measured performance of the software alternatives; omitted.)
Hardware solutions: multistage interconnection network multiprocessors
- Combining networks (for spin on test-and-set):
  - Only one of the concurrent test-and-set requests is forwarded to memory; all other requests are answered with the value already set.
  - Lock latency may increase.
- Hardware queuing at the memory module:
  - Eliminates polling across the network when there is no cache coherence.
  - The processor issues 'enter' and 'exit' instructions to the memory module.
  - Lock latency is likely to be better than with software queuing.
- Caches to hold queue links:
  - Store the name of the next processor in the queue directly in each processor's cache.
Hardware solutions: single-bus multiprocessors
- Read broadcast:
  - Eliminates duplicate read-miss requests.
  - When a read appears on the bus for data that is invalid in some processor's cache, that cache takes the data and marks its copy valid.
  - Thus one processor's read miss can revalidate the invalid copies in the other caches.
- Special handling of test-and-set requests in the cache:
  - A processor can spin on test-and-set, acquiring the lock quickly when it is free without consuming bus bandwidth while it is busy.
  - If the test-and-set would fail, it is not propagated to the bus.
Conclusion
- Simple methods of spin-waiting degrade performance as the number of spinning processors increases.
- Software queuing and backoff perform well even for large numbers of spinning processors.
- Backoff performs better when there is no contention; queuing performs best when there is contention.
- Special hardware support can improve performance as well.