Algorithms for Scalable Synchronization on Shared-Memory Multiprocessors
John M. Mellor-Crummey and Michael L. Scott
Joseph Garvey & Joshua San Miguel

Dance Hall Machines?

Atomic Instructions

- Various instructions known as fetch_and_Φ instructions: test_and_set, fetch_and_store, fetch_and_add, compare_and_swap
- Some can be used to simulate others, but often with overhead
- Some lock types require a particular primitive to be implemented at all, or to be implemented efficiently
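The point that one primitive can simulate another, at a cost, can be sketched with C11 atomics. This is a minimal illustration, not code from the slides: the function names are hypothetical, and it assumes a C11 compiler that provides `<stdatomic.h>`. The retry loops are where the overhead comes from.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* fetch_and_store built from compare_and_swap: retry until our swap lands. */
static int fetch_and_store_via_cas(atomic_int *addr, int new_value) {
    int old = atomic_load(addr);
    /* On failure the call refreshes `old` with the value currently in *addr. */
    while (!atomic_compare_exchange_weak(addr, &old, new_value)) {
        /* another processor changed *addr in between; try again */
    }
    return old;
}

/* fetch_and_add built the same way; `old + delta` is recomputed on each retry. */
static int fetch_and_add_via_cas(atomic_int *addr, int delta) {
    int old = atomic_load(addr);
    while (!atomic_compare_exchange_weak(addr, &old, old + delta)) {
    }
    return old;
}

/* test_and_set is fetch_and_store of the constant 1; returns the old state. */
static bool test_and_set_via_cas(atomic_int *addr) {
    return fetch_and_store_via_cas(addr, 1) != 0;
}
```

Under contention each failed compare-and-swap costs another round trip to the shared location, which is why a native fetch_and_Φ instruction is usually preferable when the hardware has one.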

Test_and_set: Basic

type lock = (unlocked, locked)

procedure acquire_lock (lock *L)
    while test_and_set (L) == locked ;

procedure release_lock (lock *L)
    *L = unlocked
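The basic test_and_set lock maps almost directly onto C11's `atomic_flag`. The following is a minimal sketch under that assumption; the type and function names (`tas_lock`, `tas_acquire`, `tas_release`) are hypothetical and not from the slides.

```c
#include <stdatomic.h>

/* Hypothetical wrapper type: the flag is set while the lock is held. */
typedef struct {
    atomic_flag held;
} tas_lock;

static void tas_acquire(tas_lock *l) {
    /* atomic_flag_test_and_set returns the previous value: true means
       someone else holds the lock, so keep trying. Every iteration is an
       atomic read-modify-write on the shared flag. */
    while (atomic_flag_test_and_set(&l->held)) {
    }
}

static void tas_release(tas_lock *l) {
    atomic_flag_clear(&l->held);
}
```

A lock declared as `tas_lock l = { ATOMIC_FLAG_INIT };` starts out free. Because every waiter hammers the same location with atomic operations, this version generates the most interconnect traffic of the locks discussed here.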

[Figure: three processors (P), each with a cache ($), connected to shared Memory]

Test_and_set: test_and_test_and_set

type lock = (unlocked, locked)

procedure acquire_lock (lock *L)
    while 1
        if *L == unlocked
            if test_and_set (L) == unlocked
                return

procedure release_lock (lock *L)
    *L = unlocked
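A test_and_test_and_set lock can be sketched in C11 as follows. This is an illustration only: the names are hypothetical, an `atomic_bool` stands in for the lock word, and the default sequentially consistent ordering is used for simplicity.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical lock type: true while the lock is held. */
typedef struct {
    atomic_bool locked;
} ttas_lock;

static void ttas_acquire(ttas_lock *l) {
    for (;;) {
        /* Read-only spin: on a cache-coherent machine this is served from
           the local cache for as long as the lock stays held. */
        while (atomic_load(&l->locked)) {
        }
        /* The lock looked free, so pay for one atomic exchange. */
        if (!atomic_exchange(&l->locked, true)) {
            return;
        }
    }
}

static void ttas_release(ttas_lock *l) {
    atomic_store(&l->locked, false);
}
```

The plain read keeps waiters off the interconnect while the lock is held, but a release still triggers a burst of exchanges from every waiter at once.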

[Figure: three processors (P), each with a cache ($), connected to shared Memory]

Test_and_set: test_and_set with backoff

type lock = (unlocked, locked)

procedure acquire_lock (lock *L)
    delay = 1
    while test_and_set (L) == locked
        pause (delay)
        delay = delay * 2

procedure release_lock (lock *L)
    *L = unlocked
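Exponential backoff can be sketched in C11 as below. The slide's pseudocode doubles the delay without bound; the cap `MAX_DELAY` here is an added assumption, as are the function names and the busy-loop implementation of `pause`.

```c
#include <stdatomic.h>
#include <stdbool.h>

#define MAX_DELAY 1024u   /* assumed cap; the slide's pseudocode has none */

/* Double the delay, saturating at MAX_DELAY. */
static unsigned next_delay(unsigned delay) {
    return (delay >= MAX_DELAY / 2) ? MAX_DELAY : delay * 2;
}

/* Crude pause: burn `delay` loop iterations without touching the lock. */
static void pause_for(unsigned delay) {
    for (volatile unsigned i = 0; i < delay; i++) {
    }
}

static void backoff_acquire(atomic_bool *locked) {
    unsigned delay = 1;
    /* atomic_exchange returns the old value: true means the lock was held. */
    while (atomic_exchange(locked, true)) {
        pause_for(delay);
        delay = next_delay(delay);
    }
}

static void backoff_release(atomic_bool *locked) {
    atomic_store(locked, false);
}
```

Each failed attempt makes the waiter stay away from the lock for longer, which spreads retries out in time and reduces contention on the lock word.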

[Figure: three processors (P), each with a cache ($), connected to shared Memory]

Ticket Lock

type lock = record
    next_ticket = 0
    now_serving = 0

procedure acquire_lock (lock *L)
    my_ticket = fetch_and_increment (L->next_ticket)
    while 1
        if L->now_serving == my_ticket
            return

procedure release_lock (lock *L)
    L->now_serving = L->now_serving + 1
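The ticket lock translates to C11 with `atomic_fetch_add` playing the role of fetch_and_increment. A minimal sketch, with hypothetical names; `ticket_acquire` returns the ticket only so that the behaviour is easy to observe.

```c
#include <stdatomic.h>

typedef struct {
    atomic_uint next_ticket;   /* next ticket to hand out */
    atomic_uint now_serving;   /* ticket currently allowed in */
} ticket_lock;

/* Take a ticket, then wait until it is being served. */
static unsigned ticket_acquire(ticket_lock *l) {
    unsigned my_ticket = atomic_fetch_add(&l->next_ticket, 1);
    while (atomic_load(&l->now_serving) != my_ticket) {
        /* read-only spin; a waiter issues exactly one atomic operation */
    }
    return my_ticket;
}

/* Only the holder writes now_serving, so a plain load plus store is enough. */
static void ticket_release(ticket_lock *l) {
    unsigned current = atomic_load(&l->now_serving);
    atomic_store(&l->now_serving, current + 1);
}
```

Tickets are served in the order they were taken, so the lock is FIFO-fair. Unsigned wraparound is harmless because only equality is tested.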

[Figure: Memory holds next_ticket and now_serving; three processors (P) with caches ($); my_ticket shown at a processor]

Array-Based Queuing Locks

type lock = record
    slots = array [0 .. numprocs - 1] of (has_lock, must_wait)
    next_slot = 0

procedure acquire_lock (lock *L)
    my_place = fetch_and_increment (L->next_slot)
    // Various modulo work to handle overflow
    while L->slots[my_place] == must_wait ;
    L->slots[my_place] = must_wait

procedure release_lock (lock *L)
    L->slots[my_place + 1] = has_lock
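An array-based queue lock can be sketched in C11 as follows. Several details are assumptions added for the sketch rather than taken from the slide: the names, the fixed `NUMPROCS` (a power of two, so the modulo stays consistent when the counter wraps), slot 0 starting with the lock, and `my_place` being passed explicitly from acquire to release.

```c
#include <stdatomic.h>
#include <stdbool.h>

#define NUMPROCS 4u   /* assumed maximum number of contenders, power of two */

typedef struct {
    atomic_bool has_lock[NUMPROCS];  /* true: the waiter on this slot may enter */
    atomic_uint next_slot;           /* next queue position to hand out */
} array_lock;

/* Slot 0 starts with the lock; every other slot must wait. */
static void array_lock_init(array_lock *l) {
    for (unsigned i = 0; i < NUMPROCS; i++) {
        atomic_init(&l->has_lock[i], i == 0);
    }
    atomic_init(&l->next_slot, 0);
}

/* Claim a slot and spin on it alone; returns the slot for use in release. */
static unsigned array_acquire(array_lock *l) {
    unsigned my_place = atomic_fetch_add(&l->next_slot, 1) % NUMPROCS;
    while (!atomic_load(&l->has_lock[my_place])) {
    }
    atomic_store(&l->has_lock[my_place], false);  /* reset slot for reuse */
    return my_place;
}

/* Hand the lock to whoever waits on the following slot. */
static void array_release(array_lock *l, unsigned my_place) {
    atomic_store(&l->has_lock[(my_place + 1) % NUMPROCS], true);
}
```

Each waiter spins on its own array element. In a real implementation the slots would normally be padded so that each occupies its own cache line; the sketch omits that for brevity.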

[Figure: Memory holds next_slot and slots; three processors (P) with caches ($); my_place shown at a processor]

MCS Locks

type qnode = record
    qnode *next
    bool locked
type lock = qnode*

procedure acquire_lock (lock *L, qnode *I)
    I->next = Null
    qnode *predecessor = fetch_and_store (L, I)
    if predecessor != Null
        I->locked = true
        predecessor->next = I
        while I->locked ;

procedure release_lock (lock *L, qnode *I)
    if I->next == Null
        if compare_and_swap (L, I, Null)
            return
        while I->next == Null ;
    I->next->locked = false

MCS Locks
[Figure: lock L and queue node states; labels shown: 1-R, 2-B, 3-B, 2-R, 3-R, 3-E, 4-B, 5-B, 4-R]
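The MCS pseudocode can be transcribed into C11 roughly as follows. This is a sketch under stated assumptions: the names `mcs_node`, `mcs_lock`, `mcs_acquire` and `mcs_release` are hypothetical, `atomic_exchange` stands in for fetch_and_store, `atomic_compare_exchange_strong` stands in for compare_and_swap, and the default sequentially consistent ordering is used throughout.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct mcs_node {
    _Atomic(struct mcs_node *) next;  /* successor in the queue, or NULL */
    atomic_bool locked;               /* true while this waiter must spin */
} mcs_node;

typedef _Atomic(mcs_node *) mcs_lock; /* tail of the queue; NULL when free */

static void mcs_acquire(mcs_lock *L, mcs_node *I) {
    atomic_store(&I->next, NULL);
    mcs_node *predecessor = atomic_exchange(L, I);   /* fetch_and_store */
    if (predecessor != NULL) {
        /* Queue was non-empty: mark ourselves waiting before linking in,
           then spin on our own node only. */
        atomic_store(&I->locked, true);
        atomic_store(&predecessor->next, I);
        while (atomic_load(&I->locked)) {
        }
    }
}

static void mcs_release(mcs_lock *L, mcs_node *I) {
    mcs_node *successor = atomic_load(&I->next);
    if (successor == NULL) {
        mcs_node *expected = I;
        /* No visible successor: if we are still the tail, the lock is free. */
        if (atomic_compare_exchange_strong(L, &expected, NULL)) {
            return;
        }
        /* Someone swapped in behind us; wait for them to link themselves. */
        while ((successor = atomic_load(&I->next)) == NULL) {
        }
    }
    atomic_store(&successor->locked, false);
}
```

Each thread passes its own `mcs_node` to both calls, and the node must stay valid until `mcs_release` returns. Because every waiter spins on a flag in its own node, the spinning stays local on both cache-coherent and distributed-memory machines, provided the node is allocated in memory local to its processor.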

MCS Locks
[Figure: Memory holds lock and qnodes; three processors (P) with caches ($), each showing lock and qnode fields next and locked]

Results: Scalability – Distributed Memory Architecture

Results: Scalability – Cache Coherent Architecture

Results: Single Processor Lock/Release Time

Times are in μs             Test_and_set   Ticket   Anderson (Queue)   MCS
Butterfly (Distributed)
Symmetry (Cache coherent)   7.0            NA

- Butterfly’s atomic instructions are very expensive
- Butterfly can’t handle 24-bit pointers

Results: Network Congestion

Increase in network latency, measured from the lock node and from an idle node:

Busy-wait Lock                   Lock Node   Idle Node
test_and_set                     1420%       96%
test_and_set w/ linear backoff   882%        67%
test_and_set w/ exp. backoff     32%         4%
ticket                           992%        97%
ticket w/ prop backoff           53%         8%
Anderson                         75%         67%
MCS                              4%          2%

Which lock should I use?

Atomic instructions >> normal instructions && single-processor latency is very important → don't use MCS
If processes might be preempted → test_and_set with exponential backoff

fetch_and_store supported?
    Yes → MCS
    No → fetch_and_increment supported?
        Yes → Ticket
        No → test_and_set w/ exp backoff

Centralized Barrier
[Figure: processors P0, P1, P2, P3]
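The slide shows the centralized barrier only as a diagram. One common formulation is the sense-reversing counter barrier, sketched below in C11. The names and structure are assumptions for illustration, not a transcription of the slide.

```c
#include <stdatomic.h>
#include <stdbool.h>

typedef struct {
    atomic_uint count;   /* processors still to arrive in this episode */
    atomic_bool sense;   /* global flag, flipped by the last arriver */
    unsigned nprocs;     /* number of participants */
} central_barrier;

static void central_barrier_init(central_barrier *b, unsigned nprocs) {
    atomic_init(&b->count, nprocs);
    atomic_init(&b->sense, false);
    b->nprocs = nprocs;
}

/* local_sense is private to each thread and must start out false. */
static void central_barrier_wait(central_barrier *b, bool *local_sense) {
    *local_sense = !*local_sense;
    if (atomic_fetch_sub(&b->count, 1) == 1) {
        /* Last to arrive: reset the counter, then release everyone at once
           by flipping the shared flag (the broadcast wakeup). */
        atomic_store(&b->count, b->nprocs);
        atomic_store(&b->sense, *local_sense);
    } else {
        /* Everyone else spins on the single shared flag. */
        while (atomic_load(&b->sense) != *local_sense) {
        }
    }
}
```

All waiters spin on one shared location. That is cheap on a machine with broadcast-based cache coherence, but it means remote spinning on a machine without coherent caches.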

Software Combining Tree Barrier
[Figure: tree over processors P0, P1, P2, P3]

Tournament Barrier
[Figure: processors P0, P1, P2, P3; node labels W, C, L]

Dissemination Barrier
[Figure: processors P0, P1, P2, P3]
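The communication pattern of the dissemination barrier can be written down as a few small functions. In the underlying paper's description, processor i signals processor (i + 2^k) mod p in round k, for ceil(log2 p) rounds. The C sketch below computes only that schedule, and the function names are hypothetical.

```c
/* Number of rounds: the smallest r with 2^r >= p, i.e. ceil(log2(p)). */
static unsigned dissemination_rounds(unsigned p) {
    unsigned rounds = 0;
    for (unsigned reach = 1; reach < p; reach *= 2) {
        rounds++;
    }
    return rounds;
}

/* Processor that processor i signals in round k (rounds count from 0). */
static unsigned dissemination_partner(unsigned i, unsigned k, unsigned p) {
    return (i + (1u << k)) % p;
}

/* Signals sent per barrier episode: every processor sends one per round. */
static unsigned dissemination_signals(unsigned p) {
    return p * dissemination_rounds(p);
}
```

With p = 4 there are 2 rounds and 8 signals in total. There is no separate wakeup phase: after the last round every processor has heard, directly or indirectly, from every other.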

New Tree-Based Barrier
[Figure: processors P0, P1, P2, P3]

Summary

Barrier                   Space        Wakeup      Local Spinning   Network Txns
Centralized               O(1)         broadcast   no               O(p) or O(∞)
Software Combining Tree   O(p)         tree        no               O(p × fan-in) or O(∞)
Tournament                O(p log p)   tree        yes              O(p)
Dissemination             O(p log p)   none        yes              O(p log p)
New Tree-Based            O(p)         tree        yes              2p - 2

Results – Distributed Shared Memory

Results – Broadcast-Based Cache-Coherent

Results – Local vs. Remote Spinning

Barrier          Network Latency (local)   Network Latency (remote)
New Tree-Based   10% increase              124% increase
Dissemination    18% increase              117% increase

Barrier Decision Tree

Multiprocessor?
    Distributed Shared Memory
        Dissemination Barrier
        New Tree-Based Barrier (tree wakeup)
    Broadcast-Based Cache-Coherent
        Centralized Barrier
        New Tree-Based Barrier (central wakeup)

Architectural Recommendations

- No dance hall
- No need for complicated hardware synch
- Need a full set of fetch_and_Φ