Synchronization CSE 661 – Parallel and Vector Architectures Dr. Muhamed Mudawar Computer Engineering Department King Fahd University of Petroleum and Minerals.

Slides:



Advertisements
Similar presentations
EECE : Synchronization Issue: How can synchronization operations be implemented in bus-based cache-coherent multiprocessors Components of a synchronization.
Advertisements

1 Lecture 20: Synchronization & Consistency Topics: synchronization, consistency models (Sections )
1 Synchronization A parallel computer is a collection of processing elements that cooperate and communicate to solve large problems fast. Types of Synchronization.
Synchronization. How to synchronize processes? – Need to protect access to shared data to avoid problems like race conditions – Typical example: Updating.
Multiprocessors—Synchronization. Synchronization Why Synchronize? Need to know when it is safe for different processes to use shared data Issues for Synchronization:
Concurrency: Mutual Exclusion and Synchronization Chapter 5.
CS492B Analysis of Concurrent Programs Lock Basics Jaehyuk Huh Computer Science, KAIST.
Ch. 7 Process Synchronization (1/2) I Background F Producer - Consumer process :  Compiler, Assembler, Loader, · · · · · · F Bounded buffer.
Mutual Exclusion By Shiran Mizrahi. Critical Section class Counter { private int value = 1; //counter starts at one public Counter(int c) { //constructor.
Silberschatz, Galvin and Gagne ©2009 Operating System Concepts – 8 th Edition, Chapter 6: Process Synchronization.
Process Synchronization. Module 6: Process Synchronization Background The Critical-Section Problem Peterson’s Solution Synchronization Hardware Semaphores.
Mutual Exclusion.
Synchronization without Contention John M. Mellor-Crummey and Michael L. Scott+ ECE 259 / CPS 221 Advanced Computer Architecture II Presenter : Tae Jun.
Parallel Processing (CS526) Spring 2012(Week 6).  A parallel algorithm is a group of partitioned tasks that work with each other to solve a large problem.
Multiple Processor Systems
Synchron. CSE 471 Aut 011 Some Recent Medium-scale NUMA Multiprocessors (research machines) DASH (Stanford) multiprocessor. –“Cluster” = 4 processors on.
CS6290 Synchronization. Synchronization Shared counter/sum update example –Use a mutex variable for mutual exclusion –Only one processor can own the mutex.
The Performance of Spin Lock Alternatives for Shared-Memory Microprocessors Thomas E. Anderson Presented by David Woodard.
5.6 Semaphores Semaphores –Software construct that can be used to enforce mutual exclusion –Contains a protected variable Can be accessed only via wait.
1 Lecture 21: Synchronization Topics: lock implementations (Sections )
CS510 Advanced OS Seminar Class 10 A Methodology for Implementing Highly Concurrent Data Objects by Maurice Herlihy.
Synchron. CSE 4711 The Need for Synchronization Multiprogramming –“logical” concurrency: processes appear to run concurrently although there is only one.
Synchronization Todd C. Mowry CS 740 November 1, 2000 Topics Locks Barriers Hardware primitives.
1 Lecture 20: Protocols and Synchronization Topics: distributed shared-memory multiprocessors, synchronization (Sections )
Computer Architecture II 1 Computer architecture II Lecture 9.
Memory Consistency Models Some material borrowed from Sarita Adve’s (UIUC) tutorial on memory consistency models.
Synchronization (Barriers) Parallel Processing (CS453)
Includes slides from course CS162 at UC Berkeley, by prof Anthony D. Joseph and Ion Stoica and from course CS194, by prof. Katherine Yelick Shared Memory.
Concurrency, Mutual Exclusion and Synchronization.
The Performance of Spin Lock Alternatives for Shared-Memory Multiprocessors THOMAS E. ANDERSON Presented by Daesung Park.
Fundamentals of Parallel Computer Architecture - Chapter 91 Chapter 9 Hardware Support for Synchronization Yan Solihin Copyright.
MULTIVIE W Slide 1 (of 23) The Performance of Spin Lock Alternatives for Shared-Memory Multiprocessors Paper: Thomas E. Anderson Presentation: Emerson.
Operating Systems ECE344 Ashvin Goel ECE University of Toronto Mutual Exclusion.
Caltech CS184 Spring DeHon 1 CS184b: Computer Architecture (Abstractions and Optimizations) Day 12: May 3, 2003 Shared Memory.
Concurrency: Mutual Exclusion and Synchronization Chapter 5.
Jeremy Denham April 7,  Motivation  Background / Previous work  Experimentation  Results  Questions.
CALTECH cs184c Spring DeHon CS184c: Computer Architecture [Parallel and Multithreaded] Day 10: May 8, 2001 Synchronization.
CS399 New Beginnings Jonathan Walpole. 2 Concurrent Programming & Synchronization Primitives.
Caltech CS184 Spring DeHon 1 CS184b: Computer Architecture (Abstractions and Optimizations) Day 22: May 20, 2005 Synchronization.
1 Lecture 19: Scalable Protocols & Synch Topics: coherence protocols for distributed shared-memory multiprocessors and synchronization (Sections )
1 Global and high-contention operations: Barriers, reductions, and highly- contended locks Katie Coons April 6, 2006.
1 Synchronization “A parallel computer is a collection of processing elements that cooperate and communicate to solve large problems fast.” Types of Synchronization.
Synchronization Todd C. Mowry CS 740 November 24, 1998 Topics Locks Barriers.
Ch4. Multiprocessors & Thread-Level Parallelism 4. Syn (Synchronization & Memory Consistency) ECE562 Advanced Computer Architecture Prof. Honggang Wang.
Multiprocessors – Locks
Symmetric Multiprocessors: Synchronization and Sequential Consistency
Outline Introduction Centralized shared-memory architectures (Sec. 5.2) Distributed shared-memory and directory-based coherence (Sec. 5.4) Synchronization:
CS703 – Advanced Operating Systems
Background on the need for Synchronization
Prof John D. Kubiatowicz
Lecture 19: Coherence and Synchronization
Lecture 5: Synchronization
Reactive Synchronization Algorithms for Multiprocessors
Global and high-contention operations: Barriers, reductions, and highly-contended locks Katie Coons April 6, 2006.
Symmetric Multiprocessors: Synchronization and Sequential Consistency
Designing Parallel Algorithms (Synchronization)
Lecture 5: Snooping Protocol Design Issues
Symmetric Multiprocessors: Synchronization and Sequential Consistency
Shared Memory Systems Miodrag Bolic.
Lecture 21: Synchronization and Consistency
Lecture: Coherence and Synchronization
Hardware-Software Trade-offs in Synchronization
Hardware-Software Trade-offs in Synchronization and Data Layout
Lecture 4: Synchronization
CSE 153 Design of Operating Systems Winter 19
Lecture: Coherence, Synchronization
Programming with Shared Memory Specifying parallelism
Lecture 21: Synchronization & Consistency
Lecture: Coherence and Synchronization
Lecture 18: Coherence and Synchronization
Presentation transcript:

Synchronization CSE 661 – Parallel and Vector Architectures Dr. Muhamed Mudawar Computer Engineering Department King Fahd University of Petroleum and Minerals

SynchronizationCSE 661 – Parallel and Vector Architectures – © Dr. Muhamed Mudawar – KFUPM Slide 2 Types of Synchronization  “A parallel computer is a collection of processing elements that cooperate and communicate to solve large problems fast”  Types of Synchronization  Mutual Exclusion  Lock – Unlock  Event synchronization  Point-to-point  Group  Global (barriers)

SynchronizationCSE 661 – Parallel and Vector Architectures – © Dr. Muhamed Mudawar – KFUPM Slide 3 History and Perspectives  Much debate over hardware primitives over the years  High-level language advocates want hardware locks/barriers  But it goes against the “RISC” philosophy and has other problems  Speed versus flexibility  But how much hardware support?  Atomic read-modify-write instructions  Separate lock lines on the bus  Specialized lock registers (Cray XMP)  Lock locations in memory  Hardware full/empty bits (Tera)

SynchronizationCSE 661 – Parallel and Vector Architectures – © Dr. Muhamed Mudawar – KFUPM Slide 4 Role of User and System  User wants to use high-level synchronization operations  Locks, barriers,... etc.  Doesn’t care about implementation  System designer  How much hardware support in implementation?  Speed versus cost and flexibility  Waiting algorithm difficult to implement in hardware  Popular trend:  System provides simple atomic read-modify-write primitives  Software libraries implement lock, barrier algorithms using these  But some propose and implement full-hardware synchronization

SynchronizationCSE 661 – Parallel and Vector Architectures – © Dr. Muhamed Mudawar – KFUPM Slide 5 Atomic Read-Modify-Write Operations  Most modern methods use a form of atomic read-modify-write  IBM 370  Atomic compare&swap for multiprogramming  Three operands: location, register to compare with, register to swap with  Intel x86  Instructions can be prefixed with a lock modifier (atomic execution)  SPARC  Atomic register-memory swap and compare&swap instructions  MIPS and IBM Power: pair of instructions  Load-Locked (LL) and Store-Conditional (SC)  Later used by PowerPC and DEC Alpha too  Allows a variety of higher-level RMW operations to be constructed

SynchronizationCSE 661 – Parallel and Vector Architectures – © Dr. Muhamed Mudawar – KFUPM Slide 6 Components of a Synchronization Event  Acquire method  Acquire right to the synchronization  Enter critical section or proceed past event synchronization  Waiting algorithm  Wait for synchronization to become available when it isn’t  Lock is not free, or event has not yet occurred  Release method  Enable other processors to acquire right to the synchronization  Implementation of unlock  Notifying waiting process at a point-to-point event that event has occurred  Last process arriving at a barrier to release the waiting processes  Waiting algorithm is independent of type of synchronization

SynchronizationCSE 661 – Parallel and Vector Architectures – © Dr. Muhamed Mudawar – KFUPM Slide 7 Waiting Algorithms  Blocking  Waiting process or thread is de-scheduled  Overhead of context-switch, suspending/resuming a process/thread  Processor becomes available to other processes or threads  Busy-waiting  Waiting process repeatedly test a location until it changes value  Releasing process sets the location  Consumes processor resources, cycles, and cache bandwidth  Busy-waiting better when  Scheduling overhead is larger than the expected wait time  Processor resources are not needed for other processes  Hybrid methods: busy-wait for a while, then block

SynchronizationCSE 661 – Parallel and Vector Architectures – © Dr. Muhamed Mudawar – KFUPM Slide 8 Mutual Exclusion: Hardware Locks  Separate lock lines on the bus  Holder of a lock asserts the line  Priority mechanism for multiple requestors  Inflexible, so not popular for general purpose use  Few locks can be in use at a time (one per lock line)  Hardwired waiting algorithm  Lock registers (Cray XMP)  Set of registers shared among processors  Primarily used to provide atomicity for higher-level software locks

SynchronizationCSE 661 – Parallel and Vector Architectures – © Dr. Muhamed Mudawar – KFUPM Slide 9 Simple Software Lock: First Attempt lock:ldregister, location /* load register ← location */ bnzregister, lock /* if register not 0, try again */ stlocation, #1 /* store 1 to mark it locked */ ret /* return control to caller */ unlock:st location, #0 /* store 0 to unlock location */ ret /* return control to caller */  Problem: lock needs atomicity in its own implementation  Load (test) and Store (set) of lock variable by a process not atomic  Solution: atomic read-modify-write or exchange instructions  Atomically test value of location and set it to another value

SynchronizationCSE 661 – Parallel and Vector Architectures – © Dr. Muhamed Mudawar – KFUPM Slide 10 Atomic Exchange Instructions  Test&Set register, location  Value in location read into specified register  Constant 1 stored into location  Swap register, location  Contents of memory location and register are exchanged  Fetch&Op register, location  Value in location is read into specified register  Apply operation Op in Fetch&Op and update memory location  Examples: Fetch&Inc, Fetch&Dec, Fetch&Add, etc.  Atomicity of instruction is ensured  The read and write components cannot be split  Atomic instructions can be used to build locks

SynchronizationCSE 661 – Parallel and Vector Architectures – © Dr. Muhamed Mudawar – KFUPM Slide 11 Simple Test&Set Lock lock:t&sregister, location /* test-and-set location */ bnzregister, lock /* if not 0, try again */ ret /* return control to caller */ unlock:st location, #0 /* store 0 into location */ ret /* return control to caller */  Test&Set succeeds if a zero is loaded into register  Otherwise, busy wait until lock is released by another process  Performance: test&set generates bus traffic  Happens when multiple processors compete to acquire the same lock  Test&Set causes bus transaction when cache block is shared or invalid  Bus read-exclusive transaction that invalidates others (invalidation protocol)  Bus-update transaction (update protocol)

SynchronizationCSE 661 – Parallel and Vector Architectures – © Dr. Muhamed Mudawar – KFUPM Slide 12 Test&Set Lock With Backoff backoff:compute backoff delay busy wait loop lock:t&sregister, location /* test-and-set location */ bnzregister, backoff /* if not 0, try again */ ret /* return control to caller */ unlock:st location, #0 /* store 0 into location */ ret /* return control to caller */  Basic Idea of Backoff  Insert delay after an unsuccessful attempt to acquire lock  Reduce frequency of issuing test&sets while waiting  Linear backoff: unsuccessful i th time delay = k × i (constant k)  Exponential backoff: i th time delay = k i (works quite well empirically)

SynchronizationCSE 661 – Parallel and Vector Architectures – © Dr. Muhamed Mudawar – KFUPM Slide 13 Test&Set Lock Performance  Same total number of lock calls, independent of # of processors p  Measure time per lock transfer; time in critical section not counted  Irregular nature is due to timing dependence of the bus contention effects lock(l); critical_section(c); unlock(l); On SGI Challenge

SynchronizationCSE 661 – Parallel and Vector Architectures – © Dr. Muhamed Mudawar – KFUPM Slide 14 Performance Criteria for Locks  Low Latency  If a lock is free and no other processor is trying to acquire it  Low Traffic  If many processors are trying to acquire a lock  Processors should do so with as little bus traffic as possible  Contention can slow down lock acquisition and unrelated bus transactions  Scalability  Latency and bus traffic should not increase with number of processors  Low Storage Cost  Lock storage should be very small, independent of p  Fairness  Ideally, processors should acquire locks in same order as their requests

SynchronizationCSE 661 – Parallel and Vector Architectures – © Dr. Muhamed Mudawar – KFUPM Slide 15 Test-and-Test&Set Lock lock:ldregister, location /* load first from location*/ bnzregister, lock /* if not 0, try again */ t&sregister, location /* test-and-set */ bnzregister, lock /* if not 0, try again */ ret /* return to caller */ unlock:st location, #0 /* write 0 to location */ ret /* return to caller */  Improved busy-waiting with load rather than test&set  Keep testing with ordinary load in cache  No bus transactions  Cached lock variable will be invalidated when release occurs  When value changes to 0, try to obtain lock with test&set  Only one attempt will succeed; others will fail and start testing again

SynchronizationCSE 661 – Parallel and Vector Architectures – © Dr. Muhamed Mudawar – KFUPM Slide 16 Challenges  Test-and-Test&Set Lock  Slightly higher latency than simple Test&Set, but less traffic  But still all processors rush out on release to …  Read miss on load and then test&set  Traffic is O(p 2 ) for p processors, each to acquire the lock once  Each one of the p releases causes an invalidation that results in  O(p) cache misses on load  O(p) processes trying to perform the test&set resulting in O(p) transactions  Synchronization may have different needs at different times  Lock accessed with low or high contention  Different requirements: low latency, high throughput, fairness  Rich area of software-hardware interactions  Luckily, better hardware primitives as well as algorithms exist

SynchronizationCSE 661 – Parallel and Vector Architectures – © Dr. Muhamed Mudawar – KFUPM Slide 17 Improved Hardware Primitives: LL-SC  Load-Locked or Load-Linked (LL)  Load variable into a register and keep track of address  Address is saved in a lock address register and a lock flag is set  LL may be followed by instructions that modify the register value  Store-Conditional (SC)  Tries to store back the same memory location read by load-locked  Provided no one has written to the variable since this processor’s LL  Checks the lock flag and address register for an intervening write  If the flag has been reset by another write, then SC fails  If SC succeeds, instructions between LL-SC happened atomically  If it fails, doesn’t write nor generate invalidations (need to retry LL)  Success or failure indicated by a return value

SynchronizationCSE 661 – Parallel and Vector Architectures – © Dr. Muhamed Mudawar – KFUPM Slide 18 Load-Locked & Store-Conditional - contd  Store-Conditional fails when  Detects intervening write to same address (lock flag is reset)  A context switch happened before SC (lock flag is reset)  Cache block was replaced (lock flag is reset)  However, failure of SC does not generate a bus transaction  LL-SC are not lock-unlock respectively  Completion of Load-Locked does not mean obtaining exclusive access  Successful LL-SC pair does not even guarantee exclusive access  Only guarantee no conflicting write to lock variable between them  Failure of SC will have to retry the LL-SC code sequence

SynchronizationCSE 661 – Parallel and Vector Architectures – © Dr. Muhamed Mudawar – KFUPM Slide 19 Simple Lock with LL-SC lock: llreg, location /* load locked reg ← location */ bnzreg, lock /* try again if not zero */ movereg, #1 /* reg ← 1 */ sclocation, reg /* store cond location ← 1 */ beqzreg, lock /* try again if sc failed*/ ret unlock: st location, #0 /* write 0 to location */ ret  Test with LL, but failed SC attempts don’t generate invalidations  Better than Test-and-Test&Set, but still not a fair lock  Traffic is reduced from O(p 2 ) to O(p) per lock acquisition for p processors  Each one of the p releases causes an invalidation that results in  O(p) bus read transactions to shared location  Separate read misses to same block can be combined if all caches observe bus read  Only one bus transaction for successful SC that causes invalidations  Failed SC do not generate bus transactions (advantage over test&set)

SynchronizationCSE 661 – Parallel and Vector Architectures – © Dr. Muhamed Mudawar – KFUPM Slide 20 Implementing Atomic Operations  LL-SC pair can be used to implement atomic r-m-w operations  Test&Set, Swap, Fetch&Op, and other custom operations  Better than to build a lock-unlock around simple shared variable update  Example on Implementing Fetch&Inc with LL-SC pair: Fetch&Inc: llreg, location /* load locked reg ← location */ addreg, reg, #1 /* increment register */ sclocation, reg /* store cond location ← reg */ beqzreg, Fetch&Inc /* retry if sc failed*/ ret  But keep it small so SC is likely to succeed  Don’t include store instructions between LL & SC  Cannot be undone

SynchronizationCSE 661 – Parallel and Vector Architectures – © Dr. Muhamed Mudawar – KFUPM Slide 21 Advanced Locking Algorithms  Problem with Simple LL-SC lock  Doesn’t reduce traffic to minimum and not a fair lock  Read misses by all waiting processes after lock release and SC by winner  Can use backoff with LL-SC to reduce bursty bus read traffic  Advanced algorithms (useful for test&set and LL-SC)  Only one process to attempt to acquire lock upon release  Rather than all processes rushing to do test&set and issue invalidations  Only one process to have read miss when a lock is released  Rather than all processes rushing to read the lock variable (useful for LL-SC)  Ticket lock achieves first  Array-based lock achieves both, but with a cost in memory space  Both are fair (FIFO) locks as well

SynchronizationCSE 661 – Parallel and Vector Architectures – © Dr. Muhamed Mudawar – KFUPM Slide 22 Ticket Lock  Works like waiting line at bank  Two shared counters per lock:  next_ticket and now_serving  Acquire: fetch&inc my_private_ticket ← next_ticket  Atomic operation when acquiring lock only, can be implemented with LL-SC  Waiting: busy wait until my_private_ticket is equal to now_serving  No atomic operation for checking since each process has a unique ticket  Release: increment now_serving  Fair FIFO order  O(p) read misses at release, since all spin on now-serving  Like simple LL-SC lock, but key difference is fairness  Backoff can reduce number of read misses on release  Backoff proportional to my_private_ticket – now_serving may work well

SynchronizationCSE 661 – Parallel and Vector Architectures – © Dr. Muhamed Mudawar – KFUPM Slide 23 Array-Based Lock  Waiting processes poll on different locations in an array of size p  Requires a shared array lock[ ] and a shared next_address per lock  Distant lock[ ] elements in different cache blocks to avoid false sharing  Acquire:  fetch&add my_address ← next_address, block_size  Wait until lock[my_address] == 1  Release:  lock[my_address] = 0;  lock[my_address + block_size] = 1; // to wakeup next  Performance:  O(1) traffic per acquire with coherent caches  FIFO ordering as in ticket lock, but O(p) space per lock

SynchronizationCSE 661 – Parallel and Vector Architectures – © Dr. Muhamed Mudawar – KFUPM Slide 24 Lock Performance on SGI Challenge  All locks are implemented using LL-SC  Delays are inserted as processor cycles, converted to μsecs  Fixed number of lock calls are executed, independent of p  Delays c and d are subtracted out of total time  So that only time for lock acquisition, release, and contention is measured  Three scenarios: (1) c = 0, d = 0 (2) c = 3.64 μs, d = 0, and (3) c = 3.64 μs, d = 1.29 μs Loop: lock(l); critical_section(c); unlock(l); delay(d);

SynchronizationCSE 661 – Parallel and Vector Architectures – © Dr. Muhamed Mudawar – KFUPM Slide 25 Lock Performance on SGI Challenge Loop: lock(l); critical_section(c); unlock(l); delay(d); l Array-based 6 LL-SC n LL-SC, exponential u Ticket s Ticket, proportional

SynchronizationCSE 661 – Parallel and Vector Architectures – © Dr. Muhamed Mudawar – KFUPM Slide 26 Lock Performance – cont’d  Simple LL-SC lock seems to perform better  LL-SC does best at small p due to unfairness  Especially when same processor does SC followed directly by LL  Same processor acquires the lock before another processor gets chance  With delay after release, performance of simple LL-SC deteriorates  LL-SC with exponential backoff performs best, but not necessarily fair  Ticket lock with proportional backoff scales well  Backoff duration proportional to: myTicketNumber – nowServing  Array-based lock also scales well  Only correct processor issues a read  Time per lock acquire is constant and does not increase with p  Real benchmarks required for better comparisons

SynchronizationCSE 661 – Parallel and Vector Architectures – © Dr. Muhamed Mudawar – KFUPM Slide 27 Point-to-Point Event Synchronization  Software Methods:  Busy-waiting: use ordinary variables as flags  Blocking: use semaphores and system support  Software Algorithms:  Flags as control variables  Value to control synchronization: when initial value (say 0) changes P1P2 a = f(x);// set awhile (a == 0) ;// do nothing – busy wait b = g(a);// use a P1P2 a = f(x);// set awhile (flag == 0) ;// do nothing – busy wait flag = 1;b = g(a);// use a

SynchronizationCSE 661 – Parallel and Vector Architectures – © Dr. Muhamed Mudawar – KFUPM Slide 28 Hardware Support  Full/Empty bit with each word in memory  Fine-grained word-level producer-consumer synchronization  Set to “full” when word is written (special store) with produced data  Producer does so when bit is “empty” and then set it to “full”  Clear to “empty” when word is consumed (special load)  Consumer reads word if “full” and then clears it to “empty”  Hardware preserves atomicity of bit manipulation with read or write  Concerns about flexibility  Single-producer-multiple-consumer synchronization  Multiple writes before consumer reads  Language/compiler support for the use of the special load/store

SynchronizationCSE 661 – Parallel and Vector Architectures – © Dr. Muhamed Mudawar – KFUPM Slide 29 Global Barrier Event Synchronization  Hardware barriers  Wired-AND line separate from address/data bus  Set input high when arrive, wait for output to be high to leave  In practice, multiple wires to allow reuse  Useful when barriers are global and very frequent  Difficult to support arbitrary subset of processors  even harder with multiple processes per processor  Difficult to dynamically change number and identity of participants  e.g. latter due to process migration  Not common today on bus-based machines  Software algorithms implemented using locks, flags, counters  Let’s look at software algorithms with simple hardware primitives

SynchronizationCSE 661 – Parallel and Vector Architectures – © Dr. Muhamed Mudawar – KFUPM Slide 30 struct bar_type { int counter = 0; lock_type lock; int flag = 0; } BARRIER (bar, p) { LOCK(bar.lock); if (bar.counter == 0) bar.flag = 0; // reset flag if first to reach mycount = bar.counter++; // mycount is private UNLOCK(bar.lock); if (mycount == p) { // last to arrive bar.counter = 0; // reset for next barrier bar.flag = 1; // release waiters } else while (bar.flag == 0) {}; // busy wait for release } A Simple Centralized Software Barrier

SynchronizationCSE 661 – Parallel and Vector Architectures – © Dr. Muhamed Mudawar – KFUPM Slide 31 Simple Centralized Barrier – cont’d  Barrier counter counts number of processes that have arrived  Lock barrier and increment counter when a process arrives  First process to arrive resets flag on which to busy-wait  Last process resets counter and sets flag to release p–1 waiting processes  However, there is one problem with the flag some computation … BARRIER(bar,p) some more computation … BARRIER(bar,p)  First process to exit first barrier enters second barrier  Resets bar.flag to 0 when other processes did not exit first barrier  Some processes might get stuck and will never be able to leave first barrier

SynchronizationCSE 661 – Parallel and Vector Architectures – © Dr. Muhamed Mudawar – KFUPM Slide 32 Centralized Barrier with Sense Reversal  Sense reversal: toggles flag value (0 and 1) in consecutive times  Waiting condition changes in consecutive instances of barrier BARRIER (bar, p) { local_sense = !(bar.flag); // private sense variable LOCK(bar.lock); bar.counter++; if (bar.counter == p) { // last process to arrive UNLOCK(bar.lock); bar.counter = 0; bar.flag = local_sense; // release waiters } else { UNLOCK(bar.lock); while (bar.flag != local_sense); // busy wait }

SynchronizationCSE 661 – Parallel and Vector Architectures – © Dr. Muhamed Mudawar – KFUPM Slide 33 Improving Barrier Algorithms Centralized Contention Tree structured Little contention  Centralized Barrier: all p processors contend for same lock-flag  Software Combining Tree  Only 2 processors access the same location in a binary tree  Valuable in distributed network: communicate along parallel paths  Ordinary reads/writes instead of locks at each node

SynchronizationCSE 661 – Parallel and Vector Architectures – © Dr. Muhamed Mudawar – KFUPM Slide 34 Barrier Performance on SGI Challenge  Centralized barrier involving p processors does quite well  Requires O(p) bus transactions  Combining tree has similar total number of bus transactions  Can perform much better in a distributed network with parallel paths Number of processors T ime (  s) l l l l l l l l u u u u u u u u l Centralized u Combining tree

SynchronizationCSE 661 – Parallel and Vector Architectures – © Dr. Muhamed Mudawar – KFUPM Slide 35 Synchronization Summary  Rich interaction of hardware-software tradeoffs  Hardware support still subject of debate  Flexibility of LL and SC made them increasingly popular  Must evaluate hardware primitives/software algorithms together  Primitives determine which algorithms perform well  Evaluation methodology is challenging  Use of delays in micro-benchmarks  Should use both micro-benchmarks and real workloads  Simple software algorithms do well on a bus  Algorithms that ensure constant-time access do exist, but more complex  More sophisticated techniques for distributed machines