EECE 550, Section 5.5: Synchronization. Issue: How can synchronization operations be implemented in bus-based cache-coherent multiprocessors?


EECE 550, Slide 1: Synchronization
Issue: How can synchronization operations be implemented in bus-based cache-coherent multiprocessors?
Components of a synchronization event:
–Acquire method
–Waiting algorithm
  Busy waiting (the processor cannot do other work)
  Blocking (higher overhead; process state must be saved)
–Release method

EECE 550, Slide 2: Implementing Mutual Exclusion (Lock-Unlock)
Hardware solution
–Use a set of dedicated LOCK bus lines
–Expensive and nonscalable
[Figure: processors P1 ... Pp each connected to bus lines LOCK1-LOCK4]

EECE 550, Slide 3: Software Solution
–Requires hardware support for an atomic test-and-set operation
–Example:

  lock:    ld   reg, location      /* read the lock variable */
           cmp  reg, #0            /* is it free? */
           bnz  lock               /* held: try again */
           st   location, #1       /* claim it */
           ret
  unlock:  st   location, #0
           ret

Does this work? (No: two processors can both load 0 before either stores 1, so both enter the critical section. The test and the set must be one atomic operation.)

EECE 550, Slide 4: Simple Software Test-and-Set Lock

  lock:    t&s  reg, location      /* atomically: reg <- M[location]; M[location] <- 1 */
           bnz  reg, lock          /* lock was held: retry */
           ret
  unlock:  st   location, #0
           ret

Other possible atomic instructions:
–swap reg, location
–fetch&op location
  fetch&inc location
  fetch&add reg, location
–compare&swap reg1, reg2, location
  /* if (reg1 == M[location]) then M[location] <- reg2 */
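The t&s lock above can be sketched in portable C11, where `atomic_flag_test_and_set` plays the role of the atomic t&s instruction. This is an illustrative sketch; the thread count, iteration count, and function names are not from the text:

```c
#include <pthread.h>
#include <stdatomic.h>

/* Test-and-set spinlock: atomic_flag_test_and_set atomically reads the
 * old value and sets the flag, like the t&s instruction on the slide. */
static atomic_flag lock_flag = ATOMIC_FLAG_INIT;
static long counter = 0;                 /* shared data protected by the lock */

static void ts_lock(void)   { while (atomic_flag_test_and_set(&lock_flag)) ; }
static void ts_unlock(void) { atomic_flag_clear(&lock_flag); }

static void *worker(void *arg) {
    (void)arg;
    for (int i = 0; i < 10000; i++) {
        ts_lock();
        counter++;                       /* critical section */
        ts_unlock();
    }
    return NULL;
}

/* Spawn n threads (n <= 16) and return the final counter value. */
long run_ts_demo(int n) {
    pthread_t t[16];
    counter = 0;
    for (int i = 0; i < n; i++) pthread_create(&t[i], NULL, worker, NULL);
    for (int i = 0; i < n; i++) pthread_join(t[i], NULL);
    return counter;
}
```

With a correct lock the shared counter ends at exactly threads times iterations; with the broken non-atomic version from the previous slide, increments could be lost.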

EECE 550, Slide 5: Performance of t&s Locks
–Figure 5.29, based on the following code:
    lock(L); critical_section(c); unlock(L);   /* c = time spent in the critical section */
–Exponential backoff (as in CSMA networks): if a lock attempt is unsuccessful, wait k * f^i time units before the next attempt
–The constants k and f are chosen based on experiments
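The backoff rule k * f^i can be made concrete with a small helper. The cap and the constant values used below are placeholders; as the slide notes, k and f are chosen experimentally:

```c
/* Exponential backoff for the attempt-th failed lock attempt: wait
 * k * f^attempt time units, capped at 'max' so waits stay bounded. */
long backoff_delay(long k, long f, int attempt, long max) {
    long d = k;
    for (int i = 0; i < attempt; i++) {
        d *= f;
        if (d >= max) return max;   /* cap reached */
    }
    return d;
}
```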

EECE 550, Slide 6: Test-and-Test-and-Set Lock
–Could be the basis for a better solution
–Operation:

  lock:    test reg, location      /* spin with ordinary (cacheable) reads */
           bnz  reg, lock
           t&s  reg, location      /* lock looks free: try the atomic t&s */
           bnz  reg, lock          /* lost the race: spin again */
           ret
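In C11 the same test-and-test-and-set structure might look as follows; `atomic_exchange` stands in for the t&s instruction, and the plain `atomic_load` is the cheap "test" that spins in the local cache (names here are illustrative):

```c
#include <stdatomic.h>

static atomic_int ttas_word = 0;   /* 0 = free, 1 = held */

/* Test-and-test-and-set: spin on an ordinary read, which hits in the
 * local cache, and issue the bus-invalidating exchange (the t&s) only
 * when the lock looks free. */
void ttas_lock(void) {
    for (;;) {
        while (atomic_load(&ttas_word) != 0) ;      /* "test": read only */
        if (atomic_exchange(&ttas_word, 1) == 0)    /* "t&s": atomic swap */
            return;
    }
}

void ttas_unlock(void) { atomic_store(&ttas_word, 0); }

int ttas_state(void) { return atomic_load(&ttas_word); }
```

The point of the leading read is that waiting processors spin on their own cached copies and generate no bus traffic until the lock is released.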

EECE 550, Slide 7: Performance Goals for Locks
–Low latency
–Low traffic
–Scalability
–Low storage cost
–Fairness (starvation should be avoided)
Evaluation of locks: how well do the swap lock, the t&s lock, and the test-and-t&s lock meet these goals?

EECE 550, Slide 8: (LL, SC) Primitives
–LL (load-locked): loads the synchronization variable into a register
–SC (store-conditional): tries to store the register value into the synchronization variable in memory, iff no other processor has written to that location (or cache block) since the LL

EECE 550, Slide 9: Lock-Unlock Using LL-SC

  lock:    LL   reg1, location     /* load-locked */
           bnz  reg1, lock         /* if locked, try again */
           SC   location, reg2     /* reg2 holds the "locked" value */
           beqz lock               /* if SC failed, start again */
           ret
  unlock:  st   location, #0
           ret
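C has no LL/SC primitive, but the acquire loop above can be approximated with compare-and-swap, which LL/SC pairs are commonly used to implement. This is a sketch under that substitution, not the slide's exact semantics; a failed CAS corresponds to a failed SC:

```c
#include <stdatomic.h>

static atomic_int lk = 0;   /* 0 = free, 1 = held */

/* The slide's LL/SC acquire loop, approximated with compare-and-swap:
 * the load plays the role of LL, the CAS the role of SC, and a CAS
 * failure corresponds to an SC failure (someone wrote in between). */
void llsc_style_lock(void) {
    for (;;) {
        int v = atomic_load(&lk);                /* "LL": read the variable */
        if (v != 0) continue;                    /* locked: try again */
        int expected = 0;
        if (atomic_compare_exchange_weak(&lk, &expected, 1))
            return;                              /* "SC" succeeded */
        /* "SC" failed: start again */
    }
}

void llsc_style_unlock(void) { atomic_store(&lk, 0); }

int llsc_state(void) { return atomic_load(&lk); }
```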

EECE 550, Slide 10: Comments on LL-SC
–Only certain (undoable) instructions are permitted between the LL and the SC
–Many different types of fetch&op instructions can be implemented with LL-SC
–SC does not generate invalidations upon a failure
–Only one processor's LL-SC pair can succeed on a location at any given time instant

EECE 550, Slide 11: Ticket Lock

  lock:    LL    reg1, ticket       /* read my ticket number */
           add   reg2, reg1, #1
           SC    ticket, reg2       /* atomically advance the dispenser */
           beqz  lock               /* SC failed: start again */
  lock1:   load  reg3, LED          /* LED = global now-serving word */
           cmp   reg1, reg3         /* my turn yet? */
           bnz   lock1
           ret
  unlock:  load  reg1, LED
           inc   reg1
           store LED, reg1          /* serve the next ticket */
           ret

[Data: the ticket counter and the LED (now-serving) word]
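A C11 rendering of the ticket lock: `atomic_fetch_add` replaces the LL/add/SC sequence on the ticket dispenser, and `now_serving` plays the role of the slide's LED word. Thread and iteration counts are illustrative:

```c
#include <pthread.h>
#include <stdatomic.h>

/* Ticket lock: fetch_add on 'ticket' takes a ticket atomically, and
 * 'now_serving' is the global now-serving (LED) word. */
static atomic_long ticket = 0;
static atomic_long now_serving = 0;
static long ticket_counter = 0;          /* shared data protected by the lock */

void ticket_lock(void) {
    long my = atomic_fetch_add(&ticket, 1);      /* take a ticket */
    while (atomic_load(&now_serving) != my) ;    /* busy-wait for my turn */
}

void ticket_unlock(void) {
    atomic_fetch_add(&now_serving, 1);           /* serve the next ticket */
}

static void *ticket_worker(void *arg) {
    (void)arg;
    for (int i = 0; i < 1000; i++) {
        ticket_lock();
        ticket_counter++;                        /* critical section */
        ticket_unlock();
    }
    return NULL;
}

/* Spawn n threads (n <= 16) and return the final counter value. */
long run_ticket_demo(int n) {
    pthread_t t[16];
    ticket_counter = 0;
    for (int i = 0; i < n; i++) pthread_create(&t[i], NULL, ticket_worker, NULL);
    for (int i = 0; i < n; i++) pthread_join(t[i], NULL);
    return ticket_counter;
}
```

Because tickets are granted in order, this lock is FIFO-fair, unlike the t&s variants.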

EECE 550, Slide 12: Array-Based Lock

  lock:    LL    reg1, ticket           /* my slot index */
           add   reg2, reg1, #1 (mod p) /* next slot, with wraparound */
           SC    ticket, reg2
           beqz  lock                   /* SC failed: start again */
           store ptr, reg2              /* remember the next slot for release */
  lock1:   load  reg3, LED[reg1]
           cmp   reg3, #1               /* spin until my slot is set */
           bnz   lock1
           store LED[reg1], #0          /* re-arm my slot */
           ret
  unlock:  load  reg1, ptr
           store LED[reg1], #1          /* pass the lock to the next slot */
           ret

[Data: the ticket counter and the LED[0..p-1] array]
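The array-based lock can likewise be sketched with C11 atomics; `slot[]` corresponds to the LED array and `atomic_fetch_add` to the fetch&increment (mod p). The single-threaded hand-off in the test just exercises the slot mechanics; in a real implementation each slot would be padded to its own cache block:

```c
#include <stdatomic.h>

#define P 8   /* number of slots; must be >= number of contending processors */

/* Array-based queue lock: each acquirer gets its own slot (the slide's
 * LED array) and spins on it alone; release wakes only the next slot. */
static atomic_int slot[P];        /* 1 = the holder of this slot may enter */
static atomic_long next_slot = 0;

void array_lock_init(void) {
    atomic_store(&slot[0], 1);    /* the first arrival proceeds immediately */
    for (int i = 1; i < P; i++) atomic_store(&slot[i], 0);
    atomic_store(&next_slot, 0);
}

/* Returns my slot index; the caller passes it to array_unlock. */
long array_lock(void) {
    long my = atomic_fetch_add(&next_slot, 1) % P;  /* fetch&increment (mod p) */
    while (atomic_load(&slot[my]) == 0) ;           /* spin on my own location */
    atomic_store(&slot[my], 0);                     /* re-arm my slot for reuse */
    return my;
}

void array_unlock(long my) {
    atomic_store(&slot[(my + 1) % P], 1);           /* wake only the next slot */
}
```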

EECE 550, Slide 13: Comments on LL-SC
–LL-SC does not generate bus traffic if LL fails
–LL-SC does not generate invalidations if SC fails
–LL-SC does generate read-miss bus traffic even if SC fails
–O(p) traffic per lock acquisition
–LL-SC is not a fair lock

EECE 550, Slide 14: Comments on Ticket Lock
–Operates like the ticket system at a bank
–Every process wanting to acquire the lock takes a ticket number and then busy-waits on a global now-serving number
–To release the lock, a process increments the now-serving number
–The ticket lock is fair, generates low bus traffic, and uses a constant, small amount of storage
–Main problem: when now-serving changes, all processors' cached copies are invalidated, and they all incur a read miss

EECE 550, Slide 15: Comments on Array-Based Lock
–Uses fetch&increment to obtain a unique location on which to busy-wait (not a value)
–The lock data structure contains an array of p locations (each in a separate cache block)
–Acquire: use fetch&increment to obtain the next available location in the lock array (with wraparound)
–Release: write "unlocked" to the next array location
–It is fair, uses O(p) space, and is more scalable than the ticket lock since only one processor read-misses on a release

EECE 550, Slide 16: Comparison
–Comparative performance: Fig. (LL-SC with exponential backoff is best)
–NOTE: "... if a process holding a lock stops or slows down while it is in its critical section, all other processes may have to wait." [pp. ]
–Try to avoid locks
–Try to use LL-SC-type operations instead of actual locks

EECE 550, Slide 17: Barriers
–Hardware barrier: use a special bus line and wired-OR
–Software barrier: use locks, shared counters, and flags (e.g., refer to p. 354 of the text)

EECE 550, Slide 18: Centralized Barrier

  BARRIER(bar_name, p) {
    LOCK(bar_name.lock);
    if (bar_name.counter == 0)
      bar_name.flag = 0;                /* first arrival resets the flag */
    mycount = ++bar_name.counter;       /* mycount is private; last arrival sees p */
    UNLOCK(bar_name.lock);
    if (mycount == p) {                 /* last to arrive */
      bar_name.counter = 0;
      bar_name.flag = 1;                /* release the waiting processes */
    } else
      while (bar_name.flag == 0) {}     /* busy-wait for release */
  }

Problem with this code?

EECE 550, Slide 19: Centralized Barrier with Sense Reversal
The centralized barrier has a potential problem with flag re-initialization: a fast process can enter the next barrier instance and reset the flag before a slow process has seen it set, so the slow process waits forever. Sense reversal fixes this by alternating the flag value that means "released" on each barrier instance:

  BARRIER(bar_name, p) {
    local_sense = !(local_sense);       /* private per process */
    LOCK(bar_name.lock);
    mycount = ++bar_name.counter;       /* last arrival sees p */
    if (mycount == p) {
      UNLOCK(bar_name.lock);
      bar_name.counter = 0;
      bar_name.flag = local_sense;      /* release this instance */
    } else {
      UNLOCK(bar_name.lock);
      while (bar_name.flag != local_sense) {}
    }
  }
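A runnable pthreads rendering of the sense-reversing barrier; this is a sketch assuming a fixed thread count, and the per-phase arrival counters are added purely for illustration:

```c
#include <pthread.h>
#include <stdatomic.h>

#define NTHREADS 4

/* Sense-reversing centralized barrier, following the slide's pseudocode.
 * local_sense is private to each thread; counter and flag are shared. */
typedef struct {
    pthread_mutex_t lock;
    int counter;
    atomic_int flag;
} barrier_t;

static barrier_t bar = { .lock = PTHREAD_MUTEX_INITIALIZER };

static void barrier_wait(barrier_t *b, int *local_sense) {
    *local_sense = !*local_sense;              /* flip my private sense */
    pthread_mutex_lock(&b->lock);
    int mycount = ++b->counter;
    pthread_mutex_unlock(&b->lock);
    if (mycount == NTHREADS) {                 /* last to arrive */
        b->counter = 0;                        /* safe: nobody reads it yet */
        atomic_store(&b->flag, *local_sense);  /* release this instance */
    } else {
        while (atomic_load(&b->flag) != *local_sense) ;  /* busy-wait */
    }
}

/* Demo: count arrivals per phase to check the barrier holds everyone. */
static atomic_long arrivals[2];

static void *bworker(void *arg) {
    (void)arg;
    int local_sense = 0;
    for (int phase = 0; phase < 2; phase++) {
        atomic_fetch_add(&arrivals[phase], 1);
        barrier_wait(&bar, &local_sense);
    }
    return NULL;
}

/* Returns arrivals packed as phase0 * 100 + phase1. */
long run_barrier_demo(void) {
    pthread_t t[NTHREADS];
    for (int i = 0; i < NTHREADS; i++) pthread_create(&t[i], NULL, bworker, NULL);
    for (int i = 0; i < NTHREADS; i++) pthread_join(t[i], NULL);
    return atomic_load(&arrivals[0]) * 100 + atomic_load(&arrivals[1]);
}
```

Because the released value alternates between instances, a process re-entering the barrier can never unblock stragglers from the previous instance by mistake.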

EECE 550, Slide 20: Improving Barrier Performance
–Use a software combining tree (with a bus, this has no significant benefit)
–Use a special bus primitive to reduce the number of bus transactions for read misses in a centralized barrier: a processor monitors the bus and aborts its own read miss if it sees the response to another processor's read miss to the same location

EECE 550, Slide 21: Implications for Software
Use details of the H/W design to write better, more efficient S/W
–Keep the machine fixed and examine how to improve parallel programs
Programmer's Bag of Tricks
–Assign tasks to reduce spatial interleaving of access patterns
–Structure data to reduce spatial interleaving of access patterns (e.g., 4D arrays instead of 2D arrays for the equation-solver kernel)

EECE 550, Slide 22: Implications for Software (continued)
–Beware of conflict misses (Figure 5.34)
  Sizing the dimensions of allocated arrays to powers of 2 is bad
  This is a problem with direct-mapped caches
–Use per-processor heaps (a heap is a reservoir of memory space for a process)
–Copy data to increase spatial locality
–Pad arrays (refer to Figure 5.36); try to avoid false sharing within a cache block
–Determine how to organize arrays of records: which data will be used together? (refer to Figure 5.36)
–Align arrays to cache-block boundaries (an array should begin at a cache-block boundary)
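Padding and alignment to avoid false sharing can be expressed directly in C11; the 64-byte cache-block size below is an assumption, not from the text:

```c
#include <stdalign.h>
#include <stddef.h>

#define CACHE_BLOCK 64   /* assumed cache-block size in bytes */

/* A per-processor counter padded and aligned to a full cache block, so
 * that two processors' counters can never falsely share one block. */
typedef struct {
    alignas(CACHE_BLOCK) long count;
    char pad[CACHE_BLOCK - sizeof(long)];
} padded_counter_t;

size_t padded_size(void) { return sizeof(padded_counter_t); }
```

An array of `padded_counter_t` then places each element at the start of its own cache block, so per-processor updates never invalidate a neighbor's copy.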