10/20/2006 ELEG652-06F

Topic 5: Synchronization and Costs for Shared Memory

“.... You will be assimilated. Resistance is futile.“ (Star Trek)
Synchronization

The orchestration of two or more threads (or processes) to complete a task correctly and avoid data races.

Data Race (Race Condition):
–“An anomaly of concurrent accesses by two or more threads to a shared memory location in which at least one of the accesses is a write”

Achieved through atomicity and/or serializability.
Atomicity

Atomic: from the Greek “atomos,” meaning indivisible. An “all or nothing” scheme: an instruction (or a group of instructions) appears to execute in a single step.
–All side effects of the instruction(s) in the block are seen in their totality or not at all

Side effects: writes and (causal) reads to the variables inside the atomic block.
Atomicity

Word-aligned loads and stores are atomic on almost all architectures; unaligned and larger-than-word accesses usually are not.

What happens when non-atomic operations go wrong:
–The final result may be a garbled combination of values
–Complete operations might be lost in the process

Strong versus weak atomicity.
Synchronization Applied to Shared Variables

Synchronization may or may not enforce ordering.

High-level synchronization types:
–Semaphores
–Mutexes
–Barriers
–Critical Sections
–Monitors
–Condition Variables
Semaphores

Intelligent counters of resources; zero means not available.

An abstract data type with two operations:
–P (probeer te verlagen, “try to decrease”): waits (busy-waits or sleeps) if the resource is not available
–V (verhoog, “increase”): frees the resource

Binary vs. blocking vs. counting semaphores:
–Binary: the initial value allows threads to obtain it
–Blocking: the initial value blocks threads
–Counting: the initial value is the number of available resources

Note: P and V are atomic operations!
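The P and V operations above can be sketched with a mutex and a condition variable. This is a minimal sketch, not a production implementation; the `my_sem_*` names are our own (POSIX provides `sem_t` with a similar interface).

```c
#include <assert.h>
#include <pthread.h>

/* Counting-semaphore sketch: count of available resources, 0 = none. */
typedef struct {
    int count;
    pthread_mutex_t m;
    pthread_cond_t  cv;
} my_sem_t;

void my_sem_init(my_sem_t *s, int initial) {
    s->count = initial;
    pthread_mutex_init(&s->m, NULL);
    pthread_cond_init(&s->cv, NULL);
}

/* P ("probeer te verlagen"): sleep until a resource is available. */
void my_sem_P(my_sem_t *s) {
    pthread_mutex_lock(&s->m);
    while (s->count == 0)
        pthread_cond_wait(&s->cv, &s->m);
    s->count--;
    pthread_mutex_unlock(&s->m);
}

/* V ("verhoog"): release one resource and wake a waiter. */
void my_sem_V(my_sem_t *s) {
    pthread_mutex_lock(&s->m);
    s->count++;
    pthread_cond_signal(&s->cv);
    pthread_mutex_unlock(&s->m);
}
```

The mutex/condvar pair makes P and V atomic with respect to each other, which is exactly the property the slide insists on.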
Mutex

Mutual exclusion lock: a binary semaphore ensuring that one thread (and only one) accesses the resource at a time.
–P: lock the mutex
–V: unlock the mutex

It does not enforce ordering.

Fine-grained vs. coarse-grained locking.
Barriers

A high-level programming construct: all participating threads wait at a program point for all other participating threads to arrive before any of them may continue.

Types of barriers:
–Tree barriers (software assisted)
–Centralized barriers
–Tournament barriers
–Fine-grained barriers
–Butterfly-style barriers
–Consistency barriers (e.g. #pragma omp flush)
Critical Sections

A piece of code that is executed by one and only one thread at any point in time. If a thread finds the critical section in use, it waits until the section is free.

Special case:
–Conditional critical sections: threads wait on a given signal to resume execution
–Better implemented with lock-free techniques (e.g. transactional memory)
Monitors and Condition Variables

A monitor consists of:
–A set of procedures that operate on shared variables
–A set of shared variables
–An invariant
–A lock to protect against access by other threads

Condition variables:
–Associated with a monitor's invariant (but usable in other schemes)
–Serve as a signal placeholder for other threads' activities
Much More...

All of these are abstractions. Major elements:
–A synchronization element that ensures atomicity: locks!
–A synchronization element that ensures ordering: barriers!

Implementations and types:
–Common types of atomic primitives
–Read-modify-write cycles

Synchronization overhead may break a system:
–Unnecessary consistency actions
–Communication cost between threads

Why do distributed-memory machines have “implicit” synchronization?
Topic 5a: Locks
Implementation

Atomic primitives: fetch-and-Φ operations (read-modify-write cycles)
–Test-and-set
–Fetch-and-store: exchange a register and a memory location
–Fetch-and-add
–Compare-and-swap: conditionally exchange the value of a memory location
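The fetch-and-Φ family can be built from compare-and-swap alone. As a sketch (C11 already provides `atomic_fetch_add`; this shows the retry loop underneath, which is the read-modify-write cycle the slide names):

```c
#include <assert.h>
#include <stdatomic.h>

/* fetch-and-add built out of compare-and-swap: read, compute, and
   swap only if nobody changed the location in between; retry otherwise. */
int fetch_and_add_via_cas(atomic_int *addr, int inc) {
    int old = atomic_load(addr);
    /* On failure, the CAS reloads `old` with the current value,
       so the next attempt uses a fresh read. */
    while (!atomic_compare_exchange_weak(addr, &old, old + inc))
        ;
    return old;  /* fetch-and-Φ returns the previous value */
}
```

`atomic_compare_exchange_weak` may fail spuriously, which is harmless here since the loop simply retries.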
Implementation

Used by programmers to implement more complex synchronization constructs.

Waiting behavior:
–Scheduler based: the process/thread is de-scheduled and rescheduled at a future time
–Busy wait: the process/thread polls the resource until it is available
–Dependent on the hardware / OS / scheduler behavior
Types of (Software) Locks: The Spin Lock Family

The simple test-and-set lock:
–Polls a shared Boolean variable (a binary semaphore)
–Uses fetch-and-Φ operations to operate on the binary semaphore
–Expensive: wastes bandwidth and generates extra bus transactions

The test-and-test-and-set approach: while the lock is in use, just poll it with plain reads; attempt the atomic operation only when the lock appears free.
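A minimal sketch of test-and-test-and-set with C11 atomics (the `ttas_*` names are ours). The inner read-only loop spins in the local cache; the expensive atomic exchange is attempted only when the lock looks free:

```c
#include <assert.h>
#include <stdatomic.h>

typedef atomic_int ttas_lock_t;   /* 0 = unlocked, 1 = locked */

void ttas_acquire(ttas_lock_t *L) {
    for (;;) {
        /* "test": read-only poll, generates no bus write traffic */
        while (atomic_load_explicit(L, memory_order_relaxed) != 0)
            ;
        /* "test-and-set": one atomic exchange; reading back 0 means
           we transitioned the lock from free to held */
        if (atomic_exchange(L, 1) == 0)
            return;
    }
}

void ttas_release(ttas_lock_t *L) {
    atomic_store(L, 0);
}
```

Compared with plain test-and-set, the bus sees an atomic operation only once per contender per release, not once per poll.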
Types of (Software) Locks: The Spin Lock Family

Delay-based locks: spin locks in which a delay is introduced between tests of the lock.
–Constant delay
–Exponential back-off: best results
–The test-and-test-and-set scheme is not needed
Types of (Software) Locks: The Spin Lock Family

Pseudocode (exponential back-off):

enum lock_state { UNLOCKED = 0, LOCKED = 1 };

void acquire_lock(lock_t *L)
{
    int delay = 1;
    while (test_and_set(L, LOCKED) == LOCKED) {
        sleep(delay);
        delay *= 2;
    }
}

void release_lock(lock_t *L)
{
    *L = UNLOCKED;
}
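A runnable version of the pseudocode above, sketched with C11 atomics (the nanosecond base delay and the 1 ms cap are our own choices, not from the slides):

```c
#include <assert.h>
#include <stdatomic.h>
#include <time.h>

typedef atomic_int eb_lock_t;  /* 0 = unlocked, 1 = locked */

void eb_acquire(eb_lock_t *L) {
    long delay_ns = 100;
    /* Exchange returning 1 means the lock was already held. */
    while (atomic_exchange(L, 1) != 0) {
        struct timespec ts = { 0, delay_ns };
        nanosleep(&ts, NULL);                  /* back off */
        if (delay_ns < 1000000) delay_ns *= 2; /* double, capped at 1 ms */
    }
}

void eb_release(eb_lock_t *L) {
    atomic_store(L, 0);
}
```

Capping the delay is a common practical refinement: unbounded doubling would make a thread that loses a few races wait arbitrarily long after the lock frees up.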
Types of (Software) Locks: The Ticket Lock

Reduces the number of fetch-and-Φ operations: only one per lock acquisition.

Strongly fair lock: no starvation; FIFO service.

Implementation: two counters, a request counter and a release counter.
Types of (Software) Locks: The Ticket Lock

Walkthrough with threads T1..T5. Each line shows the (Request, Release) counter values and the event on that slide:

(0, 0)  T1 acquires the lock
(1, 0)  T2 requests the lock
(2, 0)  T3 requests the lock
(3, 1)  T1 releases the lock; T2 gets the lock; T4 requests the lock
(4, 1)  T5 requests the lock
(5, 1)  T1 requests the lock again
(5, 2)  T2 releases the lock; T3 acquires the lock
Types of (Software) Locks: The Ticket Lock

Reduces the number of fetch-and-Φ operations: waiters perform only read operations on the release counter. However, a lot of memory and network bandwidth is still wasted.

Back-off techniques are also used:
–Exponential back-off: a bad idea here
–Constant delay: the minimum time a lock is held
–Proportional back-off: delay depends on how many threads are waiting for the lock
Types of (Software) Locks: The Ticket Lock

Pseudocode:

unsigned int next_ticket = 0;
unsigned int now_serving = 0;

void acquire_lock()
{
    unsigned int my_ticket = fetch_and_increment(&next_ticket);
    while (1) {
        sleep(my_ticket - now_serving);   /* proportional back-off */
        if (now_serving == my_ticket)
            return;
    }
}

void release_lock()
{
    now_serving = now_serving + 1;
}
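A runnable sketch of the ticket lock in C11 atomics (the `ticket_*` names are ours; the proportional sleep is omitted in favor of a plain spin to keep the sketch short):

```c
#include <assert.h>
#include <stdatomic.h>

typedef struct {
    atomic_uint next_ticket;   /* request counter */
    atomic_uint now_serving;   /* release counter */
} ticket_lock_t;

void ticket_acquire(ticket_lock_t *L) {
    /* The only fetch-and-Φ operation: one per acquisition. */
    unsigned my_ticket = atomic_fetch_add(&L->next_ticket, 1);
    /* Waiters issue only reads on the release counter (FIFO order). */
    while (atomic_load(&L->now_serving) != my_ticket)
        ;  /* could sleep proportionally to (my_ticket - now_serving) */
}

void ticket_release(ticket_lock_t *L) {
    atomic_fetch_add(&L->now_serving, 1);
}
```

Because tickets are handed out in order, the lock is strongly fair: a thread can be overtaken at most by threads that drew earlier tickets.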
Types of (Software) Locks: The Array-Based Queue Lock

Motivation: contention on the release counter causes cache-coherence and memory traffic (invalidation of the counter variable, with every request hitting a single memory bank).

Two elements: an array and a tail pointer that indexes it.
–The array is as big as the number of processors
–Fetch-and-store: obtain the address of an array element
–Fetch-and-increment: advance the tail pointer

FIFO ordering.
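A sketch of the array-based queue lock with C11 atomics (the `array_*` names and the fixed `NPROCS` bound are ours). Each contender claims a unique slot with fetch-and-increment and spins only on its own flag:

```c
#include <assert.h>
#include <stdatomic.h>

#define NPROCS 8  /* assumed maximum number of contenders */

typedef struct {
    atomic_int  flag[NPROCS];  /* 1 = may enter, 0 = must wait */
    atomic_uint tail;
} array_lock_t;

void array_lock_init(array_lock_t *L) {
    for (int i = 0; i < NPROCS; i++)
        atomic_store(&L->flag[i], 0);
    atomic_store(&L->flag[0], 1);  /* first slot may enter */
    atomic_store(&L->tail, 0);
}

/* Returns the slot index; the caller passes it back to release. */
unsigned array_acquire(array_lock_t *L) {
    unsigned slot = atomic_fetch_add(&L->tail, 1) % NPROCS;
    while (atomic_load(&L->flag[slot]) == 0)
        ;  /* spin on this thread's own array element only */
    return slot;
}

void array_release(array_lock_t *L, unsigned slot) {
    atomic_store(&L->flag[slot], 0);
    atomic_store(&L->flag[(slot + 1) % NPROCS], 1);  /* FIFO hand-off */
}
```

Spreading the spin targets across the array removes the single contended release counter, at the cost of linear space per lock, which is exactly the drawback the deck raises next.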
Types of (Software) Locks: The Array-Based Queue Lock

Walkthrough with threads T1..T5 (one array slot per processor; each slot is marked Enter or Wait, and the tail pointer advances with each request):

–Initially the tail pointer points to the beginning of the array; all array elements except the first are marked Wait
–T1 gets the lock (first slot, marked Enter)
–T2 requests: takes the next slot, marked Wait
–T3 requests: takes the next slot, marked Wait
–T1 releases: its slot becomes Wait and the next slot becomes Enter, so T2 gets the lock
–T4 requests, then T1 requests again: each takes the next slot, marked Wait
–T2 releases: T3 gets the lock
Types of (Software) Locks: The Queue Locks

Drawback: too much memory, linear space (relative to the number of processors) per lock.
–Array: easy to implement
–Linked list (QNODE): better cache management
Types of (Software) Locks: The MCS Lock

Characteristics:
–FIFO ordering
–Spins on locally accessible flag variables only
–Small amount of space per lock
–Works equally well on machines with and without coherent caches

Similar to the QNODE implementation of queue locks:
–QNODEs are allocated in local memory
–Each thread spins on local memory
MCS: How Does It Work?

Each processor enqueues its own private lock variable into a queue and spins on it.
–Key: spin locally (CC model: spin in the local cache; DSM model: spin in local private memory)
–No contention while waiting

On lock release, the releaser unlocks the next lock in the queue.
–Bus/network contention occurs only on the actual unlock
–No starvation (the order of lock acquisitions is defined by the list)
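The enqueue-and-spin-locally scheme can be sketched in C11 atomics, following the shape of Mellor-Crummey and Scott's 1991 algorithm (the `mcs_*` names are ours; each thread supplies its own qnode):

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct mcs_node {
    _Atomic(struct mcs_node *) next;
    atomic_bool locked;
} mcs_node_t;

typedef _Atomic(mcs_node_t *) mcs_lock_t;  /* tail of the waiter queue */

void mcs_acquire(mcs_lock_t *L, mcs_node_t *me) {
    atomic_store(&me->next, NULL);
    atomic_store(&me->locked, true);
    /* fetch-and-store: atomically append ourselves at the tail */
    mcs_node_t *pred = atomic_exchange(L, me);
    if (pred != NULL) {                     /* queue was non-empty */
        atomic_store(&pred->next, me);
        while (atomic_load(&me->locked))
            ;  /* spin on our own, locally allocated flag */
    }
}

void mcs_release(mcs_lock_t *L, mcs_node_t *me) {
    mcs_node_t *succ = atomic_load(&me->next);
    if (succ == NULL) {
        /* No visible successor: try to swing the tail back to empty. */
        mcs_node_t *expected = me;
        if (atomic_compare_exchange_strong(L, &expected, NULL))
            return;                         /* nobody was waiting */
        /* A successor is mid-enqueue; wait for its next pointer. */
        while ((succ = atomic_load(&me->next)) == NULL)
            ;
    }
    atomic_store(&succ->locked, false);     /* hand the lock over */
}
```

Note how both atomic primitives the deck lists appear: fetch-and-store to enqueue, compare-and-swap to dequeue when the queue looks empty.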
MCS Lock

Requires atomic instructions: compare-and-swap and fetch-and-store.

If there is no compare-and-swap, an alternative release algorithm exists, at the cost of:
–Extra complexity
–Loss of strict FIFO ordering
–A theoretical possibility of starvation

Details: Mellor-Crummey and Scott's 1991 paper.
MCS: Example

(Diagram: a queue of qnodes, each with a Flag and a Next pointer, plus the Tail pointer; states shown for Init, Proc 1 gets, Proc 2 tries.)

CPU 1 holds the “real” lock; CPU 2, CPU 3 and CPU 4 spin on their own flags. When CPU 1 releases, it changes the flag variable of the next qnode in the list, handing over the lock.
Implementation: Modern Alternatives

Fetch-and-Φ operations are restrictive, and not all architectures support all of them. Problem: a single general atomic operation is hard to build in hardware. Solution: provide two primitives from which atomic operations can be constructed.

Load-linked and store-conditional
–Compare the PowerPC lwarx and stwcx instructions
An Example: Swap

try:  mov R3, R4
      ld  R2, 0(R1)
      st  R3, 0(R1)
      mov R4, R2

Exchanges the contents of register R4 with the memory location pointed to by R1. Not atomic!
An Example: Atomic Swap

try:  mov  R3, R4
      ll   R2, 0(R1)
      sc   R3, 0(R1)
      beqz R3, try
      mov  R4, R2

Swap (fetch-and-store) using ll and sc. If another processor writes to the location pointed to by R1 before the sc completes, the reservation (usually kept in a register) is lost. The sc then fails and the code loops back to try again.
Another Example: Fetch-and-Increment and Spin Lock

Fetch-and-increment using ll-sc:

try:  ll   R2, 0(R1)
      addi R2, R2, #1
      sc   R2, 0(R1)
      beqz R2, try

Spin lock using ll-sc:

        li   R2, #1
lockit: exch R2, 0(R1)
        bnez R2, lockit

The exch instruction is equivalent to the atomic swap block presented earlier. Assume the lock is not cacheable. Note: 0 = unlocked, 1 = locked.
Performance Penalty: Example

Suppose there are 10 processors on a bus, each trying to lock a variable simultaneously. Assume that each bus transaction (read miss or write miss) takes 100 clock cycles. Ignore the time of the actual read or write of a lock held in the cache, as well as the time the lock is held (they won't matter much). Determine the performance penalty.
Answer

It takes over 12,000 cycles in total for all processors to pass through the lock! Note the contention for the lock and the serialization of the bus transactions. See the example on p. 596, Hennessy/Patterson, 3rd ed.
Performance Penalty

Assume the same setup as before (100 cycles per bus transaction, 10 processors), but consider a queue lock that only updates on a miss. (Patterson and Hennessy, p. 603)
Performance Penalty

Answer:
–First acquisition: n + 1 bus transactions
–Subsequent acquisitions: 2(n - 1)
–Total: 3n - 1
–29 bus transactions, or 2,900 clock cycles
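The totals above can be sanity-checked with a couple of lines (the per-term breakdown is the one stated on the slide):

```c
#include <assert.h>

/* Queue-lock bus-transaction count for n contending processors:
   n + 1 for the first acquisition plus 2(n - 1) for the rest,
   which simplifies to 3n - 1. */
int queue_lock_bus_transactions(int n) {
    int first      = n + 1;        /* initial misses + first grant */
    int subsequent = 2 * (n - 1);  /* two transactions per later hand-off */
    return first + subsequent;     /* = 3n - 1 */
}
```

For n = 10 this gives 29 transactions, and at 100 cycles each the 2,900-cycle figure follows, versus over 12,000 cycles for the simple spin lock.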
Implementing Locks Using Coherence

Spin lock using exchange:

lockit: ld   R2, 0(R1)
        bnez R2, lockit
        li   R2, #1
        exch R2, 0(R1)
        bnez R2, lockit

Spin lock using ll-sc:

lockit: ll   R2, 0(R1)
        bnez R2, lockit
        li   R2, #1
        sc   R2, 0(R1)
        beqz R2, lockit

Step | P0           | P1                                     | P2                                     | State | Bus activity
1    | Has lock     | Spins, testing if lock = 0             | Spins, testing if lock = 0             | S     | None
2    | Set lock = 0 | Invalidate received                    | Invalidate received                    | E     | Write invalidate from P0
3    |              | Cache miss                             | Cache miss                             | S     | Write-back from P0
4    |              | Waits                                  | Lock = 0                               | S     | Cache miss (P2) satisfied
5    |              | Lock = 0                               | Swap: cache miss                       | S     | Cache miss (P1) satisfied
6    |              | Swap: cache miss                       | Swap completes: returns 0, sets L = 1  | E(P2) | Invalidate from P2
7    |              | Swap completes: returns 1, sets L = 1  | Enters critical section                | E(P1) | Write-back
8    |              | Spins, testing if L = 0                |                                        |       | None
Some Graphs

(Figures: increase in network latency on a Butterfly; performance of spin locks on a sixty-processor Butterfly. The x-axis represents processors and the y-axis time in microseconds.)

Extracted from “Algorithms for Scalable Synchronization on Shared-Memory Multiprocessors,” John M. Mellor-Crummey and Michael L. Scott, January 1991.
Topic 5b: Barriers
The Barrier Construct

The idea for software barriers: a program point at which all participating threads wait for each other before continuing.

Difficulties:
–Overhead of synchronizing the threads
–Network and memory bandwidth issues

Implementations:
–Centralized: simple to implement with locks
–Tree based: better with bandwidth
Centralized Barriers

A barrier in which all threads/processors wait for each other “serially.”

Typical implementation: a spin lock protecting two shared variables, one keeping a tally of the arrived threads and one on which threads wait.

A thread arrives at the barrier and increments the counter by one (atomically), then checks whether it is the last one:
–If it is not, it waits
–If it is, it unblocks (awakens) the rest of the threads
Centralized Barrier: Pseudocode

int count = 0;
bool sense = true;

void central_barrier()
{
    lock(L);
    if (count == 0) sense = false;
    count++;
    unlock(L);
    if (count == PROCESSORS) {
        sense = true;
        count = 0;
    } else
        spin(sense == true);
}

It may deadlock or malfunction.
Centralized Barrier: A Malfunction Scenario

Threads T1, T2, T3; Barrier 1, then Work, then Barrier 2.

–T1 arrives at the barrier, increments count and spins
–T2 arrives at the barrier, increments count and spins
–T3 arrives at the barrier, increments count and changes sense
–T3 is delayed while T1 does its work
–T1 reaches the next barrier and increments count, then is delayed
–T3 resumes and resets the count
–T2 and T3 arrive at the barrier and spin forever
Centralized Barrier: Pseudocode, Sense-Reversing Barrier

int count = 0;
bool sense = true;

void central_barrier()
{
    static bool local_sense = true;   /* one copy per thread */
    local_sense = !local_sense;
    lock(L);
    count++;
    if (count == PROCESSORS) {
        count = 0;
        sense = local_sense;
    }
    unlock(L);
    spin(sense == local_sense);
}

This version waits correctly because the spin target distinguishes the previous barrier episode (old local_sense) from the current one (new local_sense).
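A runnable sketch of the sense-reversing barrier (names and the `NTHREADS` constant are ours; a single `atomic_fetch_add` stands in for the lock/increment/unlock sequence of the pseudocode, and `_Thread_local` makes the per-thread sense explicit):

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>

#define NTHREADS 4

static atomic_int  count;
static atomic_bool sense = true;
static _Thread_local bool local_sense = true;

void central_barrier(void) {
    local_sense = !local_sense;            /* flip per-episode sense */
    /* fetch_add returns the old count; the last arrival sees N-1 */
    if (atomic_fetch_add(&count, 1) == NTHREADS - 1) {
        atomic_store(&count, 0);           /* reset for reuse */
        atomic_store(&sense, local_sense); /* release everyone */
    } else {
        while (atomic_load(&sense) != local_sense)
            ;  /* spin until this episode's sense flips */
    }
}

static atomic_int arrived;  /* demo: counts arrivals before the barrier */

void *barrier_demo(void *arg) {
    (void)arg;
    atomic_fetch_add(&arrived, 1);
    central_barrier();
    /* After the barrier, every thread must observe all arrivals. */
    return (void *)(long)atomic_load(&arrived);
}
```

Resetting `count` before flipping `sense` is what lets the same barrier be reused for the next episode without the malfunction shown two slides back.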
Centralized Barrier: Performance

Suppose there are 10 processors on a bus, each trying to execute a barrier simultaneously. Assume that each bus transaction takes 100 clock cycles, as before. Ignore the time of the actual read or write of a lock held in the cache, as well as the time to execute other non-synchronization operations in the barrier implementation. Assume the bus is totally fair, so that every pending request is serviced before a new request, and that the processors are equally fast. Do not worry about counting the processors out of the barrier. Determine the number of bus transactions required for all 10 processors to reach the barrier, be released from it and exit it. How long will the entire process take? (Patterson and Hennessy, p. 598)
Centralized Barrier

Steps through the barrier for the i-th processor (assuming an ll-sc lock is used):
–LL the lock: i times
–SC the lock: i times
–Load count: 1 time
–LL the lock again: i - 1 times
–Store count: 1 time
–Store lock: 1 time
–Load sense: 2 times
–Total transactions for the i-th processor: 3i + 4
–Total: (3n^2 + 11n)/2 - 1
–204 bus transactions, or 20,400 clock cycles
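The closed form above can be checked by summing the per-processor term directly (the minus one matches the slide's total; summing 3i + 4 alone gives (3n^2 + 11n)/2):

```c
#include <assert.h>

/* Centralized-barrier bus-transaction count: the i-th processor
   performs 3i + 4 transactions; sum over i = 1..n, minus one,
   matches the slide's total of (3n^2 + 11n)/2 - 1. */
int barrier_bus_transactions(int n) {
    int total = 0;
    for (int i = 1; i <= n; i++)
        total += 3 * i + 4;
    return total - 1;
}
```

For n = 10 the loop yields 204 transactions, i.e. 20,400 cycles at 100 cycles each, an order of magnitude worse than the queue lock's 2,900 cycles for the same processor count.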
Tree-Type Barriers

The software combining tree barrier:
–A single shared variable becomes a tree of accesses
–Each parent node combines the results of its children
–A group of processors per leaf; the last processor updates the leaf and then moves up
–A two-pass scheme: down to up, update the count; up to down, update the sense and resume
–Objective: reduces memory contention
–Disadvantage: spins on memory locations whose positions cannot be statically determined
Tree-Type Barriers

Butterfly barrier:
–Based on the butterfly network scheme for broadcasting and reduction
–Pairwise synchronizations: at step k, processor i signals processor i xor 2^k
–If the number of processors is not a power of two, an existing processor stands in for the missing partner
–Maximum synchronizations: 2 floor(log2 P)

(Figure: processors 1-7 synchronizing over rounds R0-R3.)
Tree-Type Barriers

Dissemination barrier:
–Similar to the butterfly, but with fewer synchronization operations: ceil(log2 P) rounds
–At step k, processor i signals processor (i + 2^k) mod P
–Advantage: the flags each processor spins on are statically assigned (better locality)
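The signalling schedule above is easy to write down explicitly (the function names are ours; this computes the partner and round count, not the barrier itself):

```c
#include <assert.h>

/* Dissemination-barrier schedule: in round k, processor i signals
   processor (i + 2^k) mod P. */
int dissemination_partner(int i, int k, int P) {
    return (i + (1 << k)) % P;
}

/* Number of rounds needed: ceil(log2 P). */
int dissemination_rounds(int P) {
    int r = 0;
    while ((1 << r) < P)
        r++;
    return r;
}
```

Because the partner of (i, k) is fixed by the formula, every flag a processor spins on is known at initialization time, which is the locality advantage the slide notes.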
Tree-Type Barriers

Tournament barriers:
–A tree-style barrier; a round of the tournament corresponds to a level of the tree
–Winners are statically decided, so no fetch-and-Φ operations are needed
–Processor i sets a flag awaited by processor j; processor i then drops from the tournament and j continues
–The final processor wakes all the others
–Types: CREW (concurrent read, exclusive write), with a global variable to signal back; EREW (exclusive read, exclusive write), with separate flags on which each processor spins
Bibliography

Patterson and Hennessy. “Chapter 6: Multiprocessors and Thread-Level Parallelism.”

Mellor-Crummey, John; Scott, Michael. “Algorithms for Scalable Synchronization on Shared-Memory Multiprocessors.” January 1991.