Ch4. Multiprocessors & Thread-Level Parallelism, Part 4: Synchronization & Memory Consistency. ECE562 Advanced Computer Architecture, Prof. Honggang Wang.


Ch4. Multiprocessors & Thread-Level Parallelism, Part 4: Synchronization & Memory Consistency. ECE562 Advanced Computer Architecture. Prof. Honggang Wang, ECE Department, University of Massachusetts Dartmouth, 285 Old Westport Rd., North Dartmouth, MA. Slides based on the PowerPoint presentations created by David Patterson as part of the Instructor Resources for the textbook by Hennessy & Patterson. Updated by Honggang Wang.

2 Outline
– Synchronization
– Relaxed Consistency Models

3 Synchronization
Why synchronize? We need to know when it is safe for different processes to use shared data.
Issues for synchronization:
– An uninterruptible instruction to fetch and update memory (an atomic operation)
– User-level synchronization operations built from this primitive
– For large-scale MPs, synchronization can be a bottleneck; we need techniques to reduce the contention and latency of synchronization
[Figure: processors P1..Pn sharing a memory M and a shared resource]
Mutual exclusion: ensure that only one process uses a resource at a given time.

4 Lock/Unlock for Synchronization
The simplest protection mechanism is a lock. Before a process accesses a set of shared data items, it acquires the lock. Once the lock is successfully acquired, the process becomes the owner of that lock and the lock is held; the owner can then access the protected items. When it is done, the owner releases the lock.
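A minimal C sketch of this acquire/access/release pattern, using POSIX pthreads for illustration (the mutex stands in for the lock described above; the shared counter and function name are hypothetical examples):

    #include <pthread.h>

    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
    static long shared_counter = 0;          /* protected data item */

    void update_shared(void) {
        pthread_mutex_lock(&lock);           /* acquire: become the owner */
        shared_counter++;                    /* access the protected item */
        pthread_mutex_unlock(&lock);         /* release the lock */
    }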

5 Basic Hardware Primitives
The key to using an exchange (or swap) primitive to implement synchronization is that the operation is atomic (indivisible).
– This requires a set of hardware primitives with the ability to atomically read and modify a memory location.
Building a simple lock:
– The value 0 indicates that the lock is free; 1 indicates that the lock is unavailable.
– A processor tries to set the lock by exchanging a 1, held in a register, with the memory address corresponding to the lock.
– The value returned from the exchange instruction is 1 if some other processor had already claimed access, and 0 otherwise.
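A sketch of this exchange-based lock, written with C11 atomics rather than a specific hardware instruction (atomic_exchange plays the role of the atomic swap; the function names are illustrative):

    #include <stdatomic.h>

    static atomic_int lock = 0;              /* 0 = free, 1 = held */

    void acquire(void) {
        /* atomically swap in 1; a returned 0 means the lock was free and is now ours */
        while (atomic_exchange(&lock, 1) != 0)
            ;                                /* another processor holds it: retry */
    }

    void release(void) {
        atomic_store(&lock, 0);              /* store 0 to free the lock */
    }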

6 Other Atomic Primitives to Implement Synchronization
Test-and-set:
– Tests a value and sets it if the value passes the test.
Fetch-and-increment:
– Returns the value of a memory location and atomically increments it.
– By using the value 0 to indicate that the synchronization variable is unclaimed, we can use fetch-and-increment just as we used exchange.
Both primitives read and update a memory value in such a way that we can tell whether or not the two operations executed atomically.
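A hedged C11 sketch of both primitives (atomic_flag_test_and_set and atomic_fetch_add are the standard-library analogues of the hardware operations named above; the variable names are illustrative):

    #include <stdatomic.h>
    #include <stdbool.h>

    static atomic_flag ts_lock = ATOMIC_FLAG_INIT;   /* test-and-set lock */
    static atomic_int  counter = 0;                  /* fetch-and-increment target */

    bool try_acquire_ts(void) {
        /* returns false if the flag was already set (lock held by someone else) */
        return !atomic_flag_test_and_set(&ts_lock);
    }

    int fetch_and_inc(void) {
        /* returns the old value and atomically increments the counter */
        return atomic_fetch_add(&counter, 1);
    }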

7 Uninterruptible Instructions to Fetch and Update Memory
Atomic exchange: interchange a value in a register with a value in memory.
– 0 means the synchronization variable is free; 1 means it is locked and unavailable.
– Set the register to 1 and swap; the new value in the register determines success in getting the lock: 0 if you succeeded in setting the lock (you were first), 1 if another processor had already claimed access.
– The key is that the exchange operation is indivisible.
Test-and-set: tests a value and sets it if the value passes the test.
Fetch-and-increment: returns the value of a memory location and atomically increments it.
– 0 means the synchronization variable is free.

8 Implementing a Single Atomic Memory Operation
A single atomic memory operation requires both a memory read and a write in one uninterruptible instruction.
An alternative is to use a pair of instructions:
– The second instruction returns a value from which it can be deduced whether the pair was executed as if the instructions were atomic.
– The pair of instructions is then effectively atomic: no other processor can change the value between the two instructions.

9 Uninterruptible Instructions to Fetch and Update Memory
It is hard to have a read and a write in one instruction, so use two instead: load linked (or load locked) + store conditional.
– Load linked returns the initial value.
– Store conditional returns 1 if it succeeds (no other store to the same memory location since the preceding load) and 0 otherwise.
Example: atomic swap with LL & SC (exchange the contents of R4 and the memory location specified by R1):
    try:  mov   R3,R4       ; move exchange value
          ll    R2,0(R1)    ; load linked
          sc    R3,0(R1)    ; store conditional
          beqz  R3,try      ; branch if store fails (R3 = 0)
          mov   R4,R2       ; put loaded value in R4
Example: fetch-and-increment with LL & SC:
    try:  ll    R2,0(R1)    ; load linked
          addi  R2,R2,#1    ; increment (OK if reg-reg)
          sc    R2,0(R1)    ; store conditional
          beqz  R2,try      ; branch if store fails (R2 = 0)
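Where LL/SC is not exposed at the language level, the same retry idea is commonly expressed with a compare-and-swap loop; a hedged C11 sketch of the fetch-and-increment above:

    #include <stdatomic.h>

    int fetch_and_increment(atomic_int *p) {
        int old = atomic_load(p);
        /* retry until no other processor changed *p between the load and
           the store, mirroring the ll/sc retry loop above */
        while (!atomic_compare_exchange_weak(p, &old, old + 1))
            ;                                /* on failure, old is reloaded */
        return old;                          /* value before the increment */
    }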

10 Implementing Locks Using Coherence
Once we have an atomic operation, we can use the coherence mechanisms of a multiprocessor to implement spin locks: locks that a processor continuously tries to acquire, spinning around a loop until it succeeds.
– A processor could continually try to acquire the lock using an atomic operation, say exchange, and test whether the exchange returned the lock as free.
– To release the lock, the processor simply stores the value 0 to the lock.

11 User-Level Synchronization Using These Primitives
Spin locks: the processor continuously tries to acquire the lock, spinning around a loop until it gets it:
            li    R2,#1
    lockit: exch  R2,0(R1)   ; atomic exchange
            bnez  R2,lockit  ; already locked?
What about an MP with cache coherence?
– We want to spin on a cached copy to avoid full memory latency, and we are likely to get cache hits for such variables.
– Problem: exchange includes a write, which invalidates all other copies; this generates considerable bus traffic.
– Solution: start by simply repeatedly reading the variable; when it changes, then try the exchange ("test and test&set"):
    try:    li    R2,#1
    lockit: lw    R3,0(R1)   ; load variable
            bnez  R3,lockit  ; not 0 means not free, so spin
            exch  R2,0(R1)   ; atomic exchange
            bnez  R2,try     ; already locked?
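A hedged C11 sketch of the test and test&set idea, spinning with plain reads (which hit in the local cache) before attempting the write-generating exchange; names are illustrative:

    #include <stdatomic.h>

    static atomic_int lock = 0;              /* 0 = free, 1 = held */

    void acquire_ttas(void) {
        for (;;) {
            /* test: spin on reads only, so no invalidations are generated */
            while (atomic_load(&lock) != 0)
                ;
            /* test&set: only attempt the exchange once the lock looked free */
            if (atomic_exchange(&lock, 1) == 0)
                return;                      /* we got the lock */
        }
    }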

12 Implementing Locks Using Coherence
Cache the locks, using the coherence mechanism to keep the lock value coherent.
– The process of "spinning" (trying to test and acquire the lock in a tight loop) can then be done on a local cached copy rather than requiring a global memory access on each attempt to acquire the lock.
– A second advantage comes from the observation that there is often locality in lock accesses: the processor that used a lock last will often use it again in the near future. In such cases, the lock value may reside in that processor's cache, greatly reducing the time to acquire the lock.

13 Implementing Locks Using Coherence [figure slide]

14 Outline
– Synchronization
– Relaxed Consistency Models

15 Another MP Issue: Memory Consistency Models
What is consistency? When must a processor see a new value? For example:
    P1: A = 0;                P2: B = 0;
        .....                     .....
        A = 1;                    B = 1;
    L1: if (B == 0) ...       L2: if (A == 0) ...
Is it impossible for both if statements L1 and L2 to be true?
– What if the write invalidate is delayed and the processor continues?
Memory consistency models: what are the rules for such cases?
Sequential consistency: the result of any execution is the same as if the accesses of each processor were kept in order and the accesses among different processors were interleaved; this forces the assignments above to complete before the ifs.
– SC: delay all memory accesses until all invalidates are done.
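A hedged C11 sketch of this example, with the two "if (x == 0)" tests replaced by loads into r1 and r2 so the outcome can be printed; seq_cst ordering models sequential consistency, under which r1 == 0 && r2 == 0 is impossible (with memory_order_relaxed instead, both could be 0):

    #include <stdatomic.h>
    #include <stdio.h>
    #include <pthread.h>

    static atomic_int A = 0, B = 0;
    static int r1, r2;

    static void *p1(void *arg) {             /* P1: write A, then read B */
        (void)arg;
        atomic_store_explicit(&A, 1, memory_order_seq_cst);
        r1 = atomic_load_explicit(&B, memory_order_seq_cst);
        return NULL;
    }

    static void *p2(void *arg) {             /* P2: write B, then read A */
        (void)arg;
        atomic_store_explicit(&B, 1, memory_order_seq_cst);
        r2 = atomic_load_explicit(&A, memory_order_seq_cst);
        return NULL;
    }

    int main(void) {
        pthread_t t1, t2;
        pthread_create(&t1, NULL, p1, NULL);
        pthread_create(&t2, NULL, p2, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        printf("r1=%d r2=%d\n", r1, r2);     /* never 0 0 under seq_cst */
        return 0;
    }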

16 Memory Consistency Model
Goal: schemes that allow faster execution than sequential consistency.
This is not an issue for most programs, because they are synchronized.
– A program is synchronized if all accesses to shared data are ordered by synchronization operations:
    write (x)
    ...
    release (s) {unlock}
    ...
    acquire (s) {lock}
    ...
    read (x)
Several relaxed models for memory consistency exist, since most programs are synchronized; they are characterized by their attitude toward RAR, WAR, RAW, and WAW orderings to different addresses.
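A hedged C11 sketch of a synchronized access: the writer publishes x before a release store to a flag, and the reader waits on the flag with an acquire load before reading x (the flag plays the role of s above; names and the value 42 are illustrative):

    #include <stdatomic.h>

    static int x;                            /* ordinary shared data */
    static atomic_int s = 0;                 /* synchronization flag */

    void writer(void) {
        x = 42;                                              /* write(x) */
        atomic_store_explicit(&s, 1, memory_order_release);  /* release(s) */
    }

    int reader(void) {
        while (atomic_load_explicit(&s, memory_order_acquire) == 0)
            ;                                                /* acquire(s) */
        return x;                                            /* read(x): sees 42 */
    }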

17 Relaxed Consistency Models: The Basics
Key idea: allow reads and writes to complete out of order, but use synchronization operations to enforce ordering, so that a synchronized program behaves as if the processor were sequentially consistent.
– By relaxing orderings, we may obtain performance advantages.
– The model also specifies the range of legal compiler optimizations on shared data.
– Unless synchronization points are clearly defined and programs are synchronized, the compiler cannot interchange a read and a write of two shared data items, because doing so might affect the semantics of the program.
Three major sets of relaxed orderings:
1. Relaxing the W→R ordering (the requirement that all writes complete before the next read). Because this retains ordering among writes, many programs that operate under sequential consistency also operate under this model without additional synchronization. Called processor consistency.
2. Relaxing the W→W ordering (the requirement that all writes complete before the next write).
3. Relaxing the R→W and R→R orderings: a variety of models, depending on the ordering restrictions and on how synchronization operations enforce ordering.
There are many complexities in relaxed consistency models: defining precisely what it means for a write to complete, and deciding when processors can see values that they have written.
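As a hedged illustration of "relaxed by default, ordered at synchronization points", the store-load example from slide 15 can be written with relaxed C11 atomics plus explicit fences at the points where ordering must be enforced; removing the fences would permit the r1 == r2 == 0 outcome:

    #include <stdatomic.h>

    atomic_int A, B;
    int r1, r2;

    void p1_relaxed(void) {
        atomic_store_explicit(&A, 1, memory_order_relaxed);
        atomic_thread_fence(memory_order_seq_cst);   /* enforce the W->R ordering here */
        r1 = atomic_load_explicit(&B, memory_order_relaxed);
    }

    void p2_relaxed(void) {
        atomic_store_explicit(&B, 1, memory_order_relaxed);
        atomic_thread_fence(memory_order_seq_cst);   /* enforce the W->R ordering here */
        r2 = atomic_load_explicit(&A, memory_order_relaxed);
    }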