CS5102 High Performance Computer Systems: Synchronization
Prof. Chung-Ta King, Department of Computer Science, National Tsing Hua University, Taiwan
(Slides are from the textbook, Prof. O. Mutlu, Prof. Hsien-Hsin Lee, Prof. K. Asanovic, http://compas.cs.stonybrook.edu/courses/cse502-s14/)

Outline
- Introduction
- Centralized shared-memory architectures (Sec. 5.2)
- Distributed shared-memory and directory-based coherence (Sec. 5.4)
- Synchronization: the basics (Sec. 5.5)
- Models of memory consistency (Sec. 5.6)

Introduction
Coherence protocols guarantee that a reading processor sees the most current update to shared data
- They make caches in SMPs as invisible as in single-core systems
- They typically deal with one cache block at a time
However, coherence protocols do not:
- Specify the interaction of accesses to multiple blocks
- Ensure that processors access shared data in a particular order
- Ensure that at most k processors access a shared HW (e.g., I/O device) or SW (e.g., linked list) resource at a time
- Force processors to start executing particular sections of code together
The operations of the processors therefore need to be synchronized

Why Synchronization?
Need to know when it is safe for different threads/processes/processors to use shared data
- Producer-consumer: a consumer must wait until the producer has produced data (access ordering)
- Mutual exclusion: only one processor uses a hardware or software resource (specified as memory) at a given time
- Barrier: all processors start executing together
(Figure: a producer feeding a consumer, and processors P1 and P2 contending for a shared resource)

Access Ordering
Pass data between threads/processes/processors
Use semaphores: P(S) and V(S)
What happens if data and mutex can be cached in the private caches of the processors?

  Processor 1        Processor 2
  ...                ...
  data = 1;          P(mutex);
  V(mutex);          if (data == 1) ...

Synchronize among Many Processors
Synchronize operations (e.g., computation phases) among many threads/processes/processors
Use a barrier: barrier()

  Processor 1        Processor 2
  ...                ...
  data[1] = a;       data[2] = b;
  barrier(B1);       barrier(B1);
  sum += data[1];    sum += data[2];

Mutual Exclusion
At most k=1 threads/processes/processors can access the shared resource at a time
Use locks or semaphores: lock()/unlock() or P()/V()

  Processor 1         Processor 2
  ...                 ...
  P(mutex);           P(mutex);
  sum = sum + a;      sum = sum + b;
  V(mutex);           V(mutex);

(The guarded update is the Critical Section, CS)

Critical Sections
A sequence of code that only one thread/processor can execute at a time
- Provides mutual exclusion: a thread/processor has exclusive access to the code, and thus to the data that the code accesses -> indirect protection
- Guarantees that only one thread/processor can update the shared data at a time
To execute a critical section, a thread/processor:
1. Acquires the lock/semaphore that guards it
2. Executes its code to operate on the protected data
3. Releases the lock/semaphore
The effect is to synchronize/order the accesses of the threads/processors to the shared data

Synchronization at Different Layers
(Figure: a layered stack)
- Programs: shared programs
- Library/API: locks, semaphores, monitors
- Hardware: disable interrupts, test&set, compare&swap

Synchronization Hardware
Many systems provide hardware support for critical-section code
Uniprocessors: could disable interrupts
- Currently running code would execute without preemptive context switches due to interrupts
- The system's clock still needs to be updated by interrupts
Consequently, a naive implementation of locks:

  LockAcquire { disable interrupts; }
  LockRelease { enable interrupts; }

Problem with this approach: the critical section cannot be too long, otherwise the system cannot respond to interrupts

Synchronization Hardware
Special atomic machine instructions
- Atomic = non-interruptible (why atomic?)
- Atomic instructions allow us either to test and modify the contents of a word, or to swap the contents of two words
Basic building blocks:
- Atomic exchange: swaps a register with a memory location
- Test-and-set: sets a memory location under a condition
- Fetch-and-increment: reads the original value from memory and increments it in memory
- Load linked/store conditional: if the contents of the memory location specified by the load linked are changed before the store conditional to the same address, the store conditional fails

Atomic Exchange
Swap a value in a register and a value in memory in one atomic operation
- Set the register to 1
- Swap the register value and the lock value in memory -> one load and one store in one atomic operation
- The new register value tells us whether we got the lock
Spin lock with atomic exchange:

          DADDUI R2,R0,#1    ;R2 <- 1
  lockit: EXCH   R2,0(R1)    ;atomic exchange
          BNEZ   R2,lockit   ;already locked?

Also called an atomic read-modify-write of a location in memory
What happens if the lock value 0(R1) can be cached?

Caching Lock Value
If the read-modify-write is not atomic and the lock value can be cached -> race condition
How do we make the read-modify-write atomic?
(Figure: P0 and P1 each cache Lock=0, set R2=1, and race to EXCH R2,Lock; P3's cache and memory also hold Lock=0)

Implementing Atomic Exchange by Spin Lock
Option 1: keep the lock variable (0(R1)) in memory and non-cacheable
- How to guarantee atomicity for the exchange (a load and a store)?
- Very heavy bus traffic for the exchange
Option 2: allow the lock variable to be cached
- Does it really help the bus traffic? Every exchange will attempt a write
- Many invalidates, and the lock value keeps changing ownership -> heavy bus traffic again
Want to spin on the cached copy and at the same time avoid bus transactions
- Likely to get cache hits while spinning

Implementing Atomic Exchange by Spin Lock
Option 3: processors spin on a local copy of the lock variable
- When the lock value is changed (unlocked) by the lock owner, then try an atomic exchange (e.g., holding onto the bus until the exchange is done) to compete for the lock

  lockit: LD     R2,0(R1)    ;load lock
          BNEZ   R2,lockit   ;not available - spin
          DADDUI R2,R0,#1    ;R2 <- 1
          EXCH   R2,0(R1)    ;swap
          BNEZ   R2,lockit   ;spin if not 0

Coherence Traffic for Atomic Exchange
Each processor keeps reading the lock value
- This read hits the local cached copy and does not generate coherence traffic; every processor thus spins on its locally cached copy
When the lock owner releases the lock by writing a 0:
- This write invalidates all other cached copies through the cache coherence mechanism
- Each spinning processor now takes a read miss, acquires a new copy, sees the 0, and attempts an atomic exchange
- The first processor to acquire the block in the Modified state gets the lock; the others go back to spinning (what are the states of the cached copies?)

Race for Atomic Exchange

  Step  P0               P1             P2                        Bus/Directory activity
  1     has lock (??)    spins (Sh)     spins (Sh)                None
  2     lock <- 0 (Mod)  (Inv)          (Inv)                     P0 invalidates lock
  3     lock = 0 (Mod)   miss (Inv)     miss (Sh)                 P2 gets bus; P0 writes back
  4     lock = 0 (Sh)    waits (Inv)    lock = 0 (Sh)             P2 cache filled; P1 gets bus
  5     lock = 0 (Sh)    lock = 0 (Sh)  exch (Sh)                 P1 cache filled
  6     (Inv)            exch (Inv)     R2 = 0 (Mod)              P2 invalidates lock; lock <- 1
  7     (Inv)            R2 = 1 (Mod)   enters CS with lock = 1 (Inv)  P1 invalidates lock; P2 writes back lock = 1
  8     (Inv)            spins (Mod)    (Inv)                     None

Bus transactions due to exch can be interleaved!

Load-Linked and Store-Conditional
Problem with atomic read-modify-write: two memory operations in one
Alternative: make a pair of instructions appear atomic
- Avoids the need for an uninterruptible memory read and write
Load-linked (LL) and store-conditional (SC):
- LL R1,x: loads the value at memory address x into register R1, and saves the address x into a link register
- SC x,R1,R2: stores R1 into address x only if it is the first store after the local LL R1,x (and, for coherent caches, there is no bus traffic in between) -> "effectively" atomic
- Success is reported by returning a value (e.g., R2=1); otherwise the store fails and R2=0 is returned
- If LL and SC access memory without caching -> SC succeeds if there has been no other store since the last local LL
- If LL and SC work through the cache -> SC succeeds if there has been no bus traffic since the last local LL

Load-Linked and Store-Conditional
An illustrative example:

  do {
    int x = LL(p);
    int y = x + 1;
  } while (!SC(p, y));

Corresponding code:

  lockit: LL     R1,p
          DADDUI R1,R1,#1
          SC     p,R1,R2
          BEQZ   R2,lockit   ;retry if SC fails

Consider LL and SC with caching -> there must be no bus traffic in between (Prof. Andrew Lenharth, UT Austin)

Load-Linked and Store-Conditional
Example: lock and unlock with LL & SC

  lockit: LL     R2,0(R1)    ;load linked
          BNEZ   R2,lockit   ;not free => spin
          DADDUI R2,R0,#1    ;locked value
          SC     0(R1),R2,R3 ;store conditional
          BEQZ   R3,lockit   ;branch if SC fails
          <critical section>
          ST     0(R1),#0    ;release lock

Example: fetch and increment with LL & SC:

  try:    LL     R2,0(R1)    ;load linked
          DADDUI R2,R2,#1    ;increment
          SC     0(R1),R2,R3 ;store conditional
          BEQZ   R3,try      ;branch if SC fails

Summary
Synchronization is needed for access ordering (producer-consumer), mutual exclusion, and barriers
Hardware support for synchronization:
- Disabling interrupts
- Atomic exchange instruction with a spin lock (and spinning on the cached lock value)
- Load linked/store conditional instructions