10/20/2006 ELEG652-06F
Topic 5: Synchronization and Costs for Shared Memory
"... You will be assimilated. Resistance is futile." Star Trek

Synchronization
The orchestration of two or more threads (or processes) so that they complete a task correctly and avoid any data races
Data Race (Race Condition)
–"An anomaly of concurrent accesses by two or more threads to a shared memory location, where at least one of the accesses is a write"
Requires atomicity and / or serializability

Atomicity
Atomic: from the Greek "atomos", meaning indivisible
An "all or nothing" scheme
An instruction (or a group of them) appears to execute in a single step
–All side effects of the instruction(s) in the block are seen in their totality or not at all
Side effects: writes and (causal) reads to the variables inside the atomic block

Atomicity
Word-aligned loads and stores are atomic on almost all architectures
Unaligned and larger-than-word accesses are usually not atomic
What happens when a non-atomic operation goes wrong:
–The final result may be a garbled combination of values
–Complete operations may be lost in the process
Strong versus weak atomicity

Synchronization Applied to Shared Variables
Synchronization may or may not enforce ordering
High-level synchronization types:
–Semaphores
–Mutexes
–Barriers
–Critical sections
–Monitors
–Condition variables

Semaphores
Intelligent counters of resources
–Zero means not available
An abstract data type with two operations:
–P (probeer te verlagen, "try to decrease"): waits (busy-waits or sleeps) if the resource is not available
–V (verhoog, "increase"): frees the resource
Binary vs. blocking vs. counting semaphores:
–Binary: initial value allows threads to obtain it
–Blocking: initial value blocks the threads
–Counting: initial value is not zero
Note: P and V are atomic operations!

Mutex
Mutual exclusion lock
A binary semaphore ensuring that one thread (and only one) accesses the resource at a time
–P: lock the mutex
–V: unlock the mutex
It does not enforce ordering
Fine-grained vs. coarse-grained locking

Barriers
A high-level programming construct
Ensures that all participating threads wait at a program point for all other (participating) threads to arrive before they can continue
Types of barriers:
–Tree barriers (software assisted)
–Centralized barriers
–Tournament barriers
–Fine-grained barriers
–Butterfly-style barriers
–Consistency barriers (e.g., #pragma omp flush)

Critical Sections
A piece of code that is executed by one and only one thread at any point in time
If T1 finds the critical section in use, it waits until the CS is free for it to use
Special case:
–Conditional critical sections: threads wait on a given signal to resume execution
–Better implemented with lock-free techniques (e.g., transactional memory)

Monitors and Condition Variables
A monitor consists of:
–A set of procedures that operate on shared variables
–A set of shared variables
–An invariant
–A lock to protect against access by other threads
Condition variables:
–Used to maintain the invariant in a monitor (but can be used in other schemes)
–A signal placeholder for other threads' activities

Much More...
However, all of these are abstractions
Major elements:
–A synchronization element that ensures atomicity: locks!
–A synchronization element that ensures ordering: barriers!
Implementations and types:
–Common types of atomic primitives
–Read-modify-write cycles
Synchronization overhead may break a system:
–Unnecessary consistency actions
–Communication cost between threads
Why do distributed-memory machines have "implicit" synchronization?

Topic 5a: Locks

Implementation
Atomic primitives:
–Fetch-and-Φ operations (read-modify-write cycles)
  Test-and-set
  Fetch-and-store: exchange a register and a memory location
  Fetch-and-add
–Compare-and-swap: conditionally exchange the value of a memory location

Implementation
Used by programmers to implement more complex synchronization constructs
Waiting behavior:
–Scheduler-based: the process / thread is de-scheduled and will be rescheduled at a future time
–Busy-wait: the process / thread polls the resource until it is available
–Dependent on the hardware / OS / scheduler behavior

Types of (Software) Locks: The Spin Lock Family
The simple test-and-set lock:
–Polls a shared Boolean variable: a binary semaphore
–Uses fetch-and-Φ operations to operate on the binary semaphore
–Expensive!
  Wastes bandwidth
  Generates extra bus transactions
–The test-and-test-and-set approach: just poll with ordinary reads while the lock is in use

Types of (Software) Locks: The Spin Lock Family
Delay-based locks:
–Spin locks in which a delay is introduced between tests of the lock
–Constant delay
–Exponential back-off: best results
–The test-and-test-and-set scheme is not needed

Types of (Software) Locks: The Spin Lock Family
Pseudocode (exponential back-off):

enum lock_state { UNLOCKED = 0, LOCKED = 1 };

void acquire_lock(lock_t *L) {
    int delay = 1;
    while (test_and_set(L, LOCKED) == LOCKED) {  /* old value LOCKED: held */
        sleep(delay);
        delay *= 2;                              /* exponential back-off */
    }
}

void release_lock(lock_t *L) {
    *L = UNLOCKED;
}

Types of (Software) Locks: The Ticket Lock
Reduces the number of fetch-and-Φ operations:
–Only one per lock acquisition
A strongly fair lock:
–No starvation
–FIFO service
Implementation: two counters
–A request counter and a release counter

Types of (Software) Locks: The Ticket Lock
Animation trace for threads T1..T5, shown as (Request counter, Release counter):
–(0, 0): T1 acquires the lock
–(1, 0): T2 requests the lock
–(2, 0): T3 requests the lock
–(3, 1): T1 releases the lock; T2 gets the lock; T4 requests the lock
–(4, 1): T5 requests the lock
–(5, 1): T1 requests the lock
–(5, 2): T2 releases the lock; T3 acquires the lock

Types of (Software) Locks: The Ticket Lock
Reduces the number of fetch-and-Φ operations:
–Only read operations on the release counter
However, a lot of memory and network bandwidth is still wasted
Back-off techniques are also used:
–Exponential back-off: a bad idea here
–Constant delay: minimum time of holding a lock
–Proportional back-off: depends on how many threads are waiting for the lock

Types of (Software) Locks: The Ticket Lock
Pseudocode (with proportional back-off):

unsigned int next_ticket = 0;
unsigned int now_serving = 0;

void acquire_lock(void) {
    unsigned int my_ticket = fetch_and_increment(&next_ticket);
    while (now_serving != my_ticket)
        sleep(my_ticket - now_serving);  /* proportional back-off */
}

void release_lock(void) {
    now_serving = now_serving + 1;
}

Types of (Software) Locks: The Array-Based Queue Lock
Motivation: contention on the release counter
–Cache coherence and memory traffic: invalidation of the counter variable and requests to a single memory bank
Two elements:
–An array and a tail pointer that indexes it
–The array is as big as the number of processors
–Fetch-and-store: obtain the address of an array element
–Fetch-and-increment: advance the tail pointer
FIFO ordering

Types of (Software) Locks: The Array-Based Queue Lock
Animation trace for threads T1..T5 (an array of Enter/Wait flags plus a tail pointer):
–Initially the tail pointer points to the beginning of the array; all array elements except the first are marked Wait
–T1 gets the lock
–T2 requests the lock
–T3 requests the lock
–T1 releases; T2 gets the lock
–T4 requests the lock
–T1 requests the lock
–T2 releases; T3 gets the lock

Types of (Software) Locks: The Queue Locks
They use too much memory:
–Linear space (relative to the number of processors) per lock
Array version:
–Easy to implement
Linked-list version (QNODE):
–Requires careful cache management

Types of (Software) Locks: The MCS Lock
Characteristics:
–FIFO ordering
–Spins on locally accessible flag variables only
–Small amount of space per lock
–Works equally well on machines with and without coherent caches
Similar to the QNODE implementation of queue locks:
–QNODEs are assigned to local memory
–Threads spin on local memory

MCS: How Does It Work?
Each processor enqueues its own private lock variable into a queue and spins on it
–Key: spin locally
  CC model: spin in the local cache
  DSM model: spin in local private memory
–No contention
On lock release, the releaser unlocks the next lock in the queue
–Bus/network contention only on the actual unlock
–No starvation (the order of lock acquisitions is defined by the list)

MCS Lock
Requires atomic instructions:
–compare-and-swap
–fetch-and-store
If there is no compare-and-swap:
–An alternative release algorithm exists, at the cost of extra complexity, loss of strict FIFO ordering, and a theoretical possibility of starvation
Details: Mellor-Crummey and Scott's 1991 paper

MCS: Example
(Figure: a queue of qnodes, each with a flag and a next pointer, plus a tail pointer; initially empty, then Proc 1 gets the lock and Proc 2 tries.)
CPU 1 holds the "real" lock
CPU 2, CPU 3 and CPU 4 spin on their flags
When CPU 1 releases, it releases the lock and changes the flag variable of the next node in the list

Implementation: Modern Alternatives
Fetch-and-Φ operations:
–They are restrictive
–Not all architectures support all of them
Problem: one general-purpose atomic operation is hard to implement!
Solution: provide two primitives from which atomic operations can be built
Load-Linked and Store-Conditional
–Remember the PowerPC lwarx and stwcx instructions

An Example: Swap

try:  mov  R3, R4
      ld   R2, 0(R1)
      st   R3, 0(R1)
      mov  R4, R2

Exchanges the contents of register R4 with the memory location pointed to by R1
Not atomic!

An Example: Atomic Swap

try:  mov  R3, R4
      ll   R2, 0(R1)
      sc   R3, 0(R1)
      beqz R3, try
      mov  R4, R2

Swap (fetch-and-store) using ll and sc
If another processor writes to the location pointed to by R1 before the sc completes, the reservation (usually kept in a register) is lost. The sc then fails and the code loops back to try again.

Another Example: Fetch-and-Increment and Spin Lock

Fetch-and-increment using ll-sc:
try:    ll   R2, 0(R1)
        addi R2, R2, #1
        sc   R2, 0(R1)
        beqz R2, try

Spin lock using ll-sc:
        li   R2, #1
lockit: exch R2, 0(R1)
        bnez R2, lockit

The exch instruction is equivalent to the atomic swap instruction block presented earlier
Assume that the lock is not cacheable
Note: 0 means unlocked; 1 means locked

Performance Penalty: Example
Suppose there are 10 processors on a bus, each trying to lock a variable simultaneously. Assume that each bus transaction (read miss or write miss) is 100 clock cycles long. You can ignore the time of the actual read or write of a lock held in the cache, as well as the time the lock is held (they won't matter much!). Determine the performance penalty.

Answer
It takes over 12,000 cycles in total for all processors to pass through the lock!
Note the contention for the lock and the serialization of the bus transactions.
See the example on p. 596, Hennessy/Patterson, 3rd Ed.

Performance Penalty
Assume the same setup as before (100 cycles per bus transaction, 10 processors), but consider a queue lock that only updates on a miss.
Patterson and Hennessy, p. 603

Performance Penalty
Answer:
–First acquisition: n + 1 bus transactions
–Subsequent accesses: 2(n - 1)
–Total: 3n - 1
–29 bus transactions, or 2,900 clock cycles

Implementing Locks Using Coherence

Spin lock with a cacheable lock variable (exch is the atomic swap from earlier):

lockit: ld   R2, 0(R1)
        bnez R2, lockit
        li   R2, #1
        exch R2, 0(R1)
        bnez R2, lockit

The same lock using ll-sc:

lockit: ll   R2, 0(R1)
        bnez R2, lockit
        li   R2, #1
        sc   R2, 0(R1)
        beqz R2, lockit

Coherence traffic as P1 and P2 contend while P0 holds the lock:

Step | P0            | P1                                       | P2                                       | Lock state     | Bus activity
1    | Has lock      | Spins, testing if lock = 0               | Spins, testing if lock = 0               | Shared         | None
2    | Sets lock = 0 | (invalidate received)                    | (invalidate received)                    | Exclusive (P0) | Write-invalidate from P0
3    |               | Cache miss                               | Cache miss                               | Shared         | Write-back from P0
4    |               | Waits                                    | Reads lock = 0                           | Shared         | Cache miss for P2 satisfied
5    |               | Reads lock = 0                           | Swap: cache miss                         | Shared         | Cache miss for P1 satisfied
6    |               | Swap: cache miss                         | Swap completes: returns 0, sets lock = 1 | Exclusive (P2) | Invalidate from P2
7    |               | Swap completes: returns 1, sets lock = 1 | Enters critical section                  | Exclusive (P1) | Write-back
8    |               | Spins, testing if lock = 0               |                                          |                | None

Some Graphs
(Figures: increase in network latency on a Butterfly with sixty processors, and performance of spin locks on a Butterfly; the x-axis is the number of processors and the y-axis is time in microseconds.)
Extracted from "Algorithms for Scalable Synchronization on Shared-Memory Multiprocessors." John M. Mellor-Crummey and Michael L. Scott. January 1991.

Topic 5b: Barriers

The Barrier Construct
The idea of software barriers:
–A program point at which all participating threads wait for each other to arrive before continuing
Difficulty:
–Overhead of synchronizing the threads
–Network and memory bandwidth issues
Implementation:
–Centralized: simple to implement with locks
–Tree-based: better bandwidth behavior

Centralized Barriers
A barrier in which all threads / processors wait for each other "serially"
Typical implementation: two spin variables
–One on which the threads wait for all others to arrive
–One that keeps a tally of the arrived threads
A thread arrives at the barrier and increments the counter by one (atomically)
It then checks whether it is the last one:
–If it is not, it waits
–If it is, it unblocks (wakes) the rest of the threads

Centralized Barrier: Pseudocode

int count = 0;
bool sense = true;

void central_barrier(void) {
    lock(L);
    if (count == 0)
        sense = false;
    count++;
    unlock(L);
    if (count == PROCESSORS) {
        sense = true;
        count = 0;
    } else
        spin(sense == true);
}

It may deadlock or malfunction

Centralized Barrier: Why It Fails
(Threads T1, T2, T3; Barrier 1, then Work, then Barrier 2)
–T1 arrives at the barrier, increments count, and spins
–T2 arrives at the barrier, increments count, and spins
–T3 arrives at the barrier, increments count, and changes sense
–T3 is delayed; T1 does its work
–T1 reaches the next barrier, increments count, and is then delayed
–T3 resumes and resets count
–T2 and T3 arrive at the barrier and spin forever

Centralized Barrier: Pseudocode (Sense-Reversing Barrier)

int count = 0;
bool sense = true;

void central_barrier(void) {
    static bool local_sense = true;  /* must be thread-private in practice */
    local_sense = !local_sense;
    lock(L);
    count++;
    if (count == PROCESSORS) {
        count = 0;
        sense = local_sense;
    }
    unlock(L);
    spin(sense == local_sense);
}

Threads now wait correctly, because the spin target distinguishes the previous barrier episode (old local_sense) from the current one (local_sense)

Centralized Barrier: Performance
Suppose there are 10 processors on a bus, each trying to execute a barrier simultaneously. Assume that each bus transaction is 100 clock cycles, as before. You can ignore the time of the actual read or write of a lock held in the cache, as well as the time to execute other non-synchronization operations in the barrier implementation. Determine the number of bus transactions required for all 10 processors to reach the barrier, be released from the barrier, and exit the barrier. Assume that the bus is totally fair, so that every pending request is serviced before a new request, and that the processors are equally fast. Don't worry about counting the processors out of the barrier. How long will the entire process take?
Patterson and Hennessy, p. 598

Centralized Barrier: Performance
Steps through the barrier, assuming an ll-sc lock is used; bus transactions for the ith processor:
–LL the lock: i times
–SC the lock: i times
–Load count: 1 time
–LL the lock again: i - 1 times
–Store count: 1 time
–Store lock: 1 time
–Load sense: 2 times
–Total transactions for the ith processor: 3i + 4
–Total over all n processors: (3n^2 + 11n)/2 - 1
–204 bus transactions, or 20,400 clock cycles

Tree-Type Barriers
The software combining tree barrier:
–A single shared variable becomes a tree of variables
–Each parent node combines the results of its children
–A group of processors per leaf
–The last processor to arrive updates the leaf and then moves up
–A two-pass scheme:
  Down to up: update the counts
  Up to down: update the sense flags and resume
–Objective: reduces memory contention
–Disadvantage: spins on memory locations whose positions cannot be statically determined

Tree-Type Barriers
Butterfly barrier:
–Based on the butterfly network scheme for broadcast and reduction
–Pairwise synchronizations: at step k, processor i signals processor i XOR 2^k
–If the number of processors is not a power of two, existing processors stand in for the missing ones
–Maximum synchronization operations: 2 * ceil(log2 P)

Tree-Type Barriers
Dissemination barrier:
–Similar to the butterfly barrier, but with fewer synchronization operations: ceil(log2 P)
–At step k, processor i signals processor (i + 2^k) mod P
–Advantage: the flags on which each processor spins are statically assigned (better locality)

Tree-Type Barriers
Tournament barriers:
–A tree-style barrier
–A round of the tournament corresponds to a level of the tree
–Winners are statically decided, so no fetch-and-Φ operations are needed
–Processor i sets a flag awaited by processor j; processor i then drops out of the tournament and j continues
–The final winner wakes all the others
–Types:
  CREW (concurrent read, exclusive write): a global variable is used to signal back
  EREW (exclusive read, exclusive write): each processor spins on its own separate flag

Bibliography
Patterson, David; Hennessy, John. Computer Architecture: A Quantitative Approach, 3rd Edition. "Chapter 6: Multiprocessors and Thread-Level Parallelism."
Mellor-Crummey, John; Scott, Michael. "Algorithms for Scalable Synchronization on Shared-Memory Multiprocessors." January 1991.