Computer architecture II

Presentation transcript:

Computer Architecture II, Lecture 8

Today:
- Cache coherence
  - Write-through (last class)
  - Write-back
    - Invalidation-based: MESI
    - Update-based: Dragon
- Consistency models
  - Program order
  - Difference between coherence and consistency
  - Sequential consistency
  - Relaxing sequential consistency

Invalidation-based write-through: state transition diagram
- One transition diagram per cache block.
- Block states: V (valid): the block contains a correct copy of main memory; I (invalid): the block is not in the cache.
- Notation a/b: a is the observed event, b is the action taken on that event.
- Events: the processor can read or write the block (PrRd, PrWr); the bus can carry read or write transactions for the block (BusRd, BusWr).
- Transitions (processor-initiated and snooper-initiated):
  V: PrRd/- and PrWr/BusWr (every write goes through to the bus); an observed BusWr/- moves the block to I.
  I: PrRd/BusRd moves the block to V; PrWr/BusWr leaves it in I (the write goes straight to memory).

Invalidation-based write-through: state transition diagram (continued)
- A write invalidates all other caches' copies (no local change of state in the writer); there can be multiple simultaneous readers of a block, but a write invalidates them all.
- Implementation: hardware state bits are associated only with blocks that are in the cache; all other blocks can be seen as being in the invalid (not-present) state in that cache.

Problems with write-through
- High bandwidth requirements: every write from every processor goes to the shared bus and to memory. This makes write-through unpopular for SMPs.
- Write-back caches absorb most writes as cache hits; write hits don't go on the bus.
- But then how do we ensure write propagation and serialization? We need more sophisticated protocols: a large design space.

Write-back snoopy protocols
- No need to change the processor, main memory, or cache; extend the cache controller (beyond just the V and I states) and exploit the bus (which provides serialization).
- The dirty state now also indicates exclusive ownership:
  - Exclusive: the only cache with a valid copy (main memory's copy may be stale).
  - Owner: responsible for supplying the block upon a request for it.
- Two types of protocols: invalidation-based and update-based.

Basic MSI write-back invalidation protocol
- States: Invalid (I); Shared (S): one or more copies; Dirty or Modified (M): exactly one copy.
- Processor events: PrRd (read), PrWr (write).
- Bus transactions:
  - BusRd (read): asks for a copy with no intent to modify.
  - BusRdX (read exclusive): asks for a copy with intent to modify.
  - BusWB (write back): updates memory.
- Actions: update state, perform a bus transaction, flush the value onto the bus.

MSI state transition diagram
- Transitions (replacements and write-backs are not shown):
  M: PrRd/-, PrWr/-; an observed BusRd/Flush moves to S; BusRdX/Flush moves to I.
  S: PrRd/-; PrWr/BusRdX moves to M; BusRd/-; an observed BusRdX/- moves to I.
  I: PrRd/BusRd moves to S; PrWr/BusRdX moves to M.
- Rd/Wr in M and Rd in S cause no bus transaction.
- But reading and then writing a block in I costs two bus transactions (BusRd, then BusRdX), and a Wr in S causes a bus transaction on which data is sent (BusRdX). That data transfer can be spared, because the cache already has the latest data: an upgrade (BusUpgr) can be used instead of BusRdX.

Example: write-back (MSI) protocol
Initially, memory holds u = 5 and the caches of P1, P2, and P3 are empty.
1. P1 reads u: BusRd; memory supplies u = 5; P1's block enters S.
2. P3 reads u: BusRd; P3's block enters S with u = 5.
3. P3 writes u = 7: BusRdX invalidates P1's copy; P3's block enters M with u = 7.
4. P2 reads u: BusRd; P3 flushes u = 7 onto the bus (supplying the block and updating memory) and moves to S; P2's block enters S with u = 7.
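The protocol is small enough to execute. The C sketch below (illustrative only, not lecture code; all names are invented) encodes the MSI transitions from the previous slide and replays this four-step trace:

    #include <stdio.h>

    typedef enum { ST_I, ST_S, ST_M } msi_t;
    typedef enum { BUS_NONE, BUS_RD, BUS_RDX } bus_t;

    static const char *name[] = { "I", "S", "M" };

    /* Processor-side transition for one cache; returns the bus transaction. */
    static bus_t cpu_access(msi_t *st, int is_write) {
        if (*st == ST_M) return BUS_NONE;               /* read or write hit */
        if (*st == ST_S && !is_write) return BUS_NONE;  /* read hit */
        if (is_write) { *st = ST_M; return BUS_RDX; }   /* invalidate others */
        *st = ST_S; return BUS_RD;                      /* read miss */
    }

    /* Snooper-side transition: another cache's transaction is seen on the bus.
       Returns 1 if this cache must flush its dirty copy. */
    static int snoop(msi_t *st, bus_t tx) {
        int flush = (*st == ST_M && tx != BUS_NONE);    /* owner supplies data */
        if (tx == BUS_RDX) *st = ST_I;                  /* invalidate on RdX */
        else if (tx == BUS_RD && *st == ST_M) *st = ST_S;
        return flush;
    }

    /* One access by processor p (0-based); the other two caches snoop it. */
    static void mem_access(msi_t c[3], int p, int is_write, const char *what) {
        bus_t tx = cpu_access(&c[p], is_write);
        for (int q = 0; q < 3; q++)
            if (q != p && snoop(&c[q], tx))
                printf("  P%d flushes u onto the bus\n", q + 1);
        printf("%-12s -> P1:%s P2:%s P3:%s\n", what,
               name[c[0]], name[c[1]], name[c[2]]);
    }

    int main(void) {
        msi_t c[3] = { ST_I, ST_I, ST_I };  /* caches of P1, P2, P3 */
        mem_access(c, 0, 0, "P1 reads u");
        mem_access(c, 2, 0, "P3 reads u");
        mem_access(c, 2, 1, "P3 writes u");
        mem_access(c, 1, 0, "P2 reads u");
        return 0;
    }

Running it reproduces the state sequence of the slide: (S,-,-), (S,-,S), (I,-,M) after the invalidation, and finally (I,S,S) with P3 flushing its dirty copy.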

MESI (4-state) invalidation protocol
- Problem with the MSI protocol: reading and then modifying data takes two bus transactions, even if there is no sharing, even in a sequential program: BusRd (I -> S) followed by BusRdX or BusUpgr (S -> M).
- Solution: add an exclusive state. A non-modified block that resides only in the local cache can then be written locally without a bus transaction.
- States:
  - invalid (I)
  - exclusive or exclusive-clean (E): only this cache has a copy, and it is not modified
  - shared (S): two or more caches may have copies
  - modified (M, dirty)

MESI state transition diagram
- Goal: no bus transaction on a write to a block in E.
- When does I go to E, and when to S? I -> E on a PrRd if no other processor has a copy; I -> S otherwise.
- This requires an additional shared (S) signal on the bus: on a BusRd, any cache that holds the block asserts S (makes it 1).
- Transitions:
  M: PrRd/-, PrWr/-; an observed BusRd/Flush moves to S; BusRdX/Flush moves to I.
  E: PrRd/-; PrWr/- moves to M without a bus transaction; an observed BusRd/Flush moves to S; BusRdX/Flush moves to I.
  S: PrRd/-; PrWr/BusRdX moves to M; an observed BusRdX/Flush moves to I.
  I: PrRd/BusRd(S) moves to S if the shared signal is asserted, to E if it is not; PrWr/BusRdX moves to M.
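A tiny C sketch (invented names, not lecture code) of the I -> E versus I -> S decision: it is just a test of the shared line sampled while the BusRd is on the bus.

    #include <stdio.h>

    typedef enum { MESI_I, MESI_S, MESI_E, MESI_M } mesi_t;

    /* On a PrRd miss, issue BusRd and sample the bus's shared (S) line:
       every other cache holding the block asserts it while snooping. */
    static mesi_t mesi_read_miss(int shared_line_asserted) {
        return shared_line_asserted ? MESI_S : MESI_E;
    }

    int main(void) {
        /* First reader: nobody asserts S, so the block enters E; a later
           write in E then moves silently to M with no bus transaction. */
        printf("%d\n", mesi_read_miss(0) == MESI_E);  /* prints 1 */
        printf("%d\n", mesi_read_miss(1) == MESI_S);  /* prints 1 */
        return 0;
    }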

Dragon write-back update protocol
- Four states:
  - Exclusive-clean or exclusive (E): this cache and memory have the block.
  - Shared clean (Sc): this cache, others, and possibly memory have it, but this cache is not the owner.
  - Shared modified (Sm): this cache and others have it, memory does not, and this cache is the owner. Sm and Sc can coexist in different caches, with only one Sm.
  - Modified or dirty (M): this cache has it and nowhere else does.
- No invalid state: if a block is in the cache, it cannot be invalid; if it is not present, it can be viewed as being in a not-present or invalid state.
- New processor events: PrRdMiss and PrWrMiss, introduced to specify the actions when the block is not present in the cache.
- New bus transaction: BusUpd, which broadcasts the single word written on the bus and updates the caches that hold a copy.

Dragon state transition diagram
- E: PrRd/-; PrWr/- moves to M; an observed BusRd/- moves to Sc.
- Sc: PrRd/-; PrWr/BusUpd(S) moves to Sm if the shared signal is asserted, to M otherwise; BusRd/-; an observed BusUpd/Update refreshes the copy.
- Sm: PrRd/-; PrWr/BusUpd(S) stays in Sm if the shared signal is asserted, moves to M otherwise; an observed BusRd/Flush (the owner supplies the block); an observed BusUpd/Update moves to Sc.
- M: PrRd/-, PrWr/-; an observed BusRd/Flush moves to Sm.
- Not present: PrRdMiss/BusRd(S) enters Sc if the shared signal is asserted, E otherwise; PrWrMiss/(BusRd(S); BusUpd) enters Sm if asserted, PrWrMiss/BusRd(S) enters M otherwise.
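As with MSI above, the write path can be sketched in C. A minimal illustration (invented names, not lecture code) of how the shared signal decides between Sm and M on a processor write:

    #include <stdio.h>

    typedef enum { DR_E, DR_SC, DR_SM, DR_M } dragon_t;

    /* Processor write to a block already in the cache. In the shared states
       the cache broadcasts the word with BusUpd and samples the shared (S)
       line to learn whether any other copy still exists. */
    static dragon_t dragon_write(dragon_t st, int shared_line_asserted) {
        switch (st) {
        case DR_E:                  /* exclusive-clean: write locally */
        case DR_M:                  /* already dirty and exclusive */
            return DR_M;
        case DR_SC:
        case DR_SM:                 /* BusUpd(S) goes out here */
            return shared_line_asserted ? DR_SM : DR_M;
        }
        return st;
    }

    int main(void) {
        printf("%d\n", dragon_write(DR_SC, 1) == DR_SM);  /* other copies remain */
        printf("%d\n", dragon_write(DR_SC, 0) == DR_M);   /* we became exclusive */
        return 0;
    }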

Invalidate versus update
- The basic question of program behavior: is a block written by one processor read by others before it is rewritten?
- Invalidation: if yes, the readers will take a miss; if no, multiple writes happen without additional traffic.
- Update: if yes, the readers will not miss if they previously had a copy, and a single bus transaction updates all copies; if no, there are multiple useless updates, even to dead copies.
- Invalidate or update may be better depending on the application. Invalidation protocols are much more popular; some systems provide both, or even hybrids.

Today: consistency models
- Program order
- Difference between coherence and consistency
- Sequential consistency
- Relaxing sequential consistency

Program order (an example)

    P1: (1a) A = 1; (1b) B = 2;
    P2: (2a) print B; (2b) print A;

- Program order is the order in which instructions appear in the source code. It may be changed by a compiler; we will assume the order the programmer sees (what you see in the example above, not how the assembly code looks).
- Sequential program order: P1: 1a->1b; P2: 2a->2b.
- Parallel program order: an arbitrary interleaving of the sequential orders of P1 and P2, for example:
  1a->1b->2a->2b
  1a->2a->1b->2b
  2a->1a->1b->2b
  2a->2b->1a->1b
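There are six such interleavings in total (the four above plus 1a->2a->2b->1b and 2a->1a->2b->1b). The short C program below, purely illustrative and not lecture code, replays each one and reports what P2 prints:

    #include <stdio.h>
    #include <string.h>

    /* Replay one interleaving of P1 = {1a: A=1; 1b: B=2} and
       P2 = {2a: print B; 2b: print A}; report what P2 prints. */
    static void replay(const char *order) {
        int A = 0, B = 0, printedB = 0, printedA = 0;
        for (const char *op = order; *op; op += 2) {
            if      (strncmp(op, "1a", 2) == 0) A = 1;
            else if (strncmp(op, "1b", 2) == 0) B = 2;
            else if (strncmp(op, "2a", 2) == 0) printedB = B;
            else                                printedA = A;  /* "2b" */
        }
        printf("%s: P2 prints B=%d, A=%d\n", order, printedB, printedA);
    }

    int main(void) {
        const char *orders[] = {
            "1a1b2a2b", "1a2a1b2b", "1a2a2b1b",
            "2a1a1b2b", "2a1a2b1b", "2a2b1a1b",
        };
        for (int i = 0; i < 6; i++)
            replay(orders[i]);
        return 0;
    }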

Program order
Initially A = 0, B = 0.

    P1: (1a) A = 1; (1b) B = 2;
    P2: (2a) print B; (2b) print A;

- What are the possible intuitive printings of the program?
- A compiler, or out-of-order execution on a superscalar processor, may reorder 1a and 1b of P1 as long as this does not affect the result of the program on P1. This would produce non-intuitive results.
- Now assume the compiler/superscalar processor does not reorder. P1 will "see" the results of the writes A = 1 and B = 2 in program order. But when will P2 see the results of the writes A = 1 and B = 2? When will P2 see the result of the write A = 1?
- Terminology: we say a processor "sees" the result of another processor's write, or that a write operation completes with respect to a processor.
- Coherence => writes to one location become visible to all processors in the same order. But here we have two locations!

Setup for memory consistency
- Coherence => writes to one location become visible to all processors in the same order.
- But nothing is said about when a write becomes visible to another processor. Event synchronization can be used to ensure that:

    /* Assume the initial value of A is 0 */
    P1: A = 1; barrier
    P2: barrier; print A;

- And what is the order in which consecutive writes to different locations are seen by other processors?
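A runnable POSIX-threads sketch of this barrier idiom (illustrative only; compile with -lpthread):

    #include <pthread.h>
    #include <stdio.h>

    int A = 0;
    pthread_barrier_t bar;

    static void *p1(void *arg) {
        (void)arg;
        A = 1;
        pthread_barrier_wait(&bar);   /* A = 1 is visible past this point */
        return NULL;
    }

    static void *p2(void *arg) {
        (void)arg;
        pthread_barrier_wait(&bar);   /* wait until P1 has reached the barrier */
        printf("A = %d\n", A);        /* prints 1 */
        return NULL;
    }

    int main(void) {
        pthread_t t1, t2;
        pthread_barrier_init(&bar, NULL, 2);
        pthread_create(&t1, NULL, p1, NULL);
        pthread_create(&t2, NULL, p2, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        pthread_barrier_destroy(&bar);
        return 0;
    }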

Second example

    /* Assume the initial values of A and flag are 0 */
    P1: (1a) A = 1;               (1b) flag = 1;
    P2: (2a) while (flag == 0);   /* spin idly */   (2b) print A;

- The intuition (that 2b prints 1) is not guaranteed by coherence.
- Coherence refers to one location: it returns the last value written to A, or to flag; it says nothing about the order in which the modifications of A and flag are seen by P2.
- Intuitively, we expect memory to respect the order between accesses to different locations issued by a given process (1b seen after 1a).
- Conclusion: coherence is not enough! It pertains only to a single location.
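For contrast, here is a C11 sketch (not lecture code) of the same program where flag is declared atomic; its default, sequentially consistent ordering supplies exactly the cross-location guarantee that coherence alone does not:

    #include <stdatomic.h>
    #include <pthread.h>
    #include <stdio.h>

    int A = 0;               /* ordinary shared data */
    atomic_int flag = 0;     /* synchronization variable */

    static void *p1(void *arg) {
        (void)arg;
        A = 1;                      /* 1a */
        atomic_store(&flag, 1);     /* 1b: ordered after 1a */
        return NULL;
    }

    static void *p2(void *arg) {
        (void)arg;
        while (atomic_load(&flag) == 0)
            ;                       /* 2a: spin idly */
        printf("A = %d\n", A);      /* 2b: guaranteed to print 1 */
        return NULL;
    }

    int main(void) {
        pthread_t t1, t2;
        pthread_create(&t1, NULL, p1, NULL);
        pthread_create(&t2, NULL, p2, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        return 0;
    }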

Back to the first example

    /* Assume the initial values of A and B are 0 */
    P1: (1a) A = 1; (1b) B = 2;
    P2: (2a) print B; (2b) print A;

- What is the intuition? If 2a prints 2, will 2b print 1?
- We need an ordering model with clear semantics across different locations as well, so that programmers can reason about which results are possible. This is the memory consistency model.

Memory consistency model
- Specifies constraints on the order in which memory operations (from any process) can appear to execute with respect to one another:
  - Which orders are preserved?
  - Given a load, which values may it return?
- Without it, we can't say much about the execution of a shared-address-space (SAS) program.
- Implications for both programmer and system designer:
  - The programmer uses it to reason about correctness and possible results.
  - The system designer uses it to constrain how much accesses can be reordered by the compiler or the hardware.
- It is a contract between the programmer and the system.

Sequential consistency
- A total order achieved by interleaving the accesses from different processes: program order is maintained, and memory operations, from all processes, appear to [issue, execute, complete] atomically with respect to one another, as if there were no caches and a single memory.
- "A multiprocessor is sequentially consistent if the result of any execution is the same as if the operations of all the processors were executed in some sequential order, and the operations of each individual processor appear in this sequence in the order specified by its program." [Lamport, 1979]

SC example

    /* Assume the initial values of A and B are 0 */
    P1: (1a) A = 1; (1b) B = 2;
    P2: (2a) print B; (2b) print A;

- What matters is the order in which operations appear to execute, not the chronological order of events.
- Possible outcomes for the printed (A, B): (0,0), (1,0), (1,2).
- What about (0,2)? Program order gives 1a->1b and 2a->2b. A = 0 implies 2b->1a, which implies 2a->1b; B = 2 implies 1b->2a, which leads to a contradiction. So (0,2) is impossible.
- What about the execution order 1b->1a->2b->2a? It appears just like 1a->1b->2a->2b, so it is fine. The execution order 1b->2a->2b->1a is not fine: it would produce (0,2).

Back to the first example

    P1: (1a) A = 1; (1b) B = 2;
    P2: (2a) print B; (2b) print A;

- Sequential program order: P1: 1a->1b; P2: 2a->2b.
- Parallel program order, an arbitrary interleaving of the sequential orders of P1 and P2, gives the intuitive results:
  1a->1b->2a->2b
  1a->2a->1b->2b
  1a->2a->2b->1b
  2a->1a->1b->2b
  2a->1a->2b->1b
  2a->2b->1a->1b
- But the execution order 1a->1b->2b->2a is also acceptable for SC: it yields the same result as the interleaving 1a->1b->2a->2b.

Implementing SC
Two kinds of requirements:
- Program order: memory operations issued by a process must appear to execute (become visible to others and to itself) in program order.
- Atomicity: in the overall hypothetical total order, one memory operation should appear to complete with respect to all processes before the next one is issued. This guarantees that the total order is consistent across processes.

Summary of sequential consistency
- Maintain order between the shared accesses in each thread: all four orderings are enforced (read->read, read->write, write->read, write->write).
- Reads and writes wait for previous reads and writes to complete.

Do we really need SC?
- SC has strong requirements: it may prevent compiler optimizations (code reorganization) and architectural optimizations (out-of-order execution in superscalar processors).
- Many programs execute correctly even without "strong" ordering, because explicit synchronization operations order the key accesses:

    initial: A = 0, B = 0
    P1: A := 1; B := 3.1415; barrier
    P2: barrier; ... = A; ... = B;

Does SC eliminate synchronization?
No, it is still needed for:
- Critical sections (e.g., inserting an element into a doubly-linked list)
- Barriers (e.g., enforcing an order on variable accesses)
- Events (e.g., waiting for a condition to become true)
SC only ensures interleaving semantics of individual memory operations.

Is SC hardware enough?
- No: the compiler can violate ordering constraints through register allocation (which eliminates memory accesses), common subexpression elimination, instruction reordering, and software pipelining.
- Unfortunately, programming languages and compilers are largely oblivious to memory consistency models.
- Example: register allocation, with B kept in register r1 on P1 and A in register r2 on P2 (initially A = 0, B = 0):

    Original:                 After register allocation:
    P1: A = 1; u = B;         P1: r1 = 0; A = 1; u = r1; B = r1;
    P2: B = 1; v = A;         P2: r2 = 0; B = 1; v = r2; A = r2;

- (u, v) = (0,0) is disallowed under SC, but may occur after the transformation: each processor now reads only its own stale register.

What orderings are essential?

    initial: A = 0, B = 0
    P1: A := 1; B := 3.1415; unlock(L)
    P2: lock(L); ... = A; ... = B;

- The stores to A and B must complete before the unlock.
- The loads of A and B must be performed after the lock.
- Conclusion: we may relax the sequential consistency semantics.
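A runnable sketch of the hand-off above (since a POSIX mutex may not be unlocked by a thread that did not lock it, the lock L is played here by a semaphore initialized to 0; all names are invented):

    #include <semaphore.h>
    #include <pthread.h>
    #include <stdio.h>

    int A = 0;
    double B = 0.0;
    sem_t L;   /* plays the role of lock L, initially "locked" */

    static void *p1(void *arg) {
        (void)arg;
        A = 1;
        B = 3.1415;
        sem_post(&L);   /* "unlock(L)": stores to A and B complete before this */
        return NULL;
    }

    static void *p2(void *arg) {
        (void)arg;
        sem_wait(&L);   /* "lock(L)": loads of A and B are performed after this */
        printf("A = %d, B = %f\n", A, B);
        return NULL;
    }

    int main(void) {
        pthread_t t1, t2;
        sem_init(&L, 0, 0);              /* 0 => P2 blocks until P1 posts */
        pthread_create(&t2, NULL, p2, NULL);
        pthread_create(&t1, NULL, p1, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        sem_destroy(&L);
        return 0;
    }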

Hardware-centric models
- Processor Consistency (Goodman, 1989)
- Total Store Ordering (Sindhu, 1990)
- Partial Store Ordering (Sindhu, 1990)
- Causal Memory (Hutto, 1990)
- Weak Ordering (Dubois, 1986)

Relaxing the write-to-read order (PC, TSO)
- Why? Hardware can hide the latency of a write: the write miss sits in the write buffer while later reads hit, and reads may even bypass the write.
- The write to flag does not become visible until the write to A is visible.
- PC additionally allows non-atomic writes (a write need not complete with respect to all other processors at once).
- Examples: Sequent Balance, Encore Multimax, VAX 8800, SparcCenter, SGI Challenge, Pentium Pro.

    initial: A = 0, flag = 0, y = 0
    P1: (a) A = 1;                (b) flag = 1;
    P2: (c) while (flag == 0) {}  (d) y = A;

Comparing with SC
[The slide shows four small example programs, (a)-(d), each starting from A = B = 0; the table itself did not survive in the transcript.]
- Different results: (a) and (b) behave the same under SC, TSO, and PC.
- (c): PC allows A = 0 because there is no write atomicity: A = 1 may complete with respect to P2 but not yet with respect to P3.
- (d): TSO and PC allow A = B = 0 (a read may execute before an earlier write).
- Mechanism for ensuring SC semantics: MEMBAR (Sun SPARC V9). A subsequent read waits until all previous writes complete.

Comparing with SC (continued)
- Mechanism for ensuring SC semantics: MEMBAR (Sun SPARC V9); a subsequent read waits until all previous writes complete.

    /* initially A, B = 0 */
    P1: A = 1; membar; print B;
    P2: B = 1; membar; print A;
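In C11 terms this MEMBAR corresponds to a full sequentially consistent fence between each write and the following read; a sketch (not lecture code):

    #include <stdatomic.h>
    #include <pthread.h>
    #include <stdio.h>

    atomic_int A = 0, B = 0;   /* initially A, B = 0 */

    static void *p1(void *arg) {
        (void)arg;
        atomic_store_explicit(&A, 1, memory_order_relaxed);
        atomic_thread_fence(memory_order_seq_cst);          /* "membar" */
        printf("B = %d\n", atomic_load_explicit(&B, memory_order_relaxed));
        return NULL;
    }

    static void *p2(void *arg) {
        (void)arg;
        atomic_store_explicit(&B, 1, memory_order_relaxed);
        atomic_thread_fence(memory_order_seq_cst);          /* "membar" */
        printf("A = %d\n", atomic_load_explicit(&A, memory_order_relaxed));
        return NULL;
    }

    int main(void) {
        pthread_t t1, t2;
        pthread_create(&t1, NULL, p1, NULL);
        pthread_create(&t2, NULL, p2, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        return 0;
    }

With the fences, printing A = 0 and B = 0 together is impossible, as under SC; removing them re-admits that outcome on TSO/PC-like machines.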

Relaxing the write-to-read and write-to-write orders (PSO)
- Why? Multiple outstanding write cache misses can be bypassed and overlapped => good performance.
- But now even the flag example breaks: the write to flag may become visible before the write to A.
- Use MEMBAR: a subsequent write waits until all previous writes have completed.
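The write-to-write MEMBAR corresponds, in C11 terms, to a release fence between the two stores. A sketch (invented names, not lecture code) of repairing the flag example under PSO-like reordering:

    #include <stdatomic.h>
    #include <pthread.h>
    #include <stdio.h>

    int A = 0;
    atomic_int flag = 0;

    /* Writer side. Without the fence, a PSO-like machine may let the
       store to flag overtake the store to A. */
    static void *writer(void *arg) {
        (void)arg;
        A = 1;                                                  /* (a) */
        atomic_thread_fence(memory_order_release);              /* write-write membar */
        atomic_store_explicit(&flag, 1, memory_order_relaxed);  /* (b) */
        return NULL;
    }

    /* Reader side pairs with the release fence via an acquire load. */
    static void *reader(void *arg) {
        (void)arg;
        while (atomic_load_explicit(&flag, memory_order_acquire) == 0)
            ;                                                   /* (c) spin */
        printf("y = %d\n", A);                                  /* (d): prints 1 */
        return NULL;
    }

    int main(void) {
        pthread_t tw, tr;
        pthread_create(&tr, NULL, reader, NULL);
        pthread_create(&tw, NULL, writer, NULL);
        pthread_join(tw, NULL);
        pthread_join(tr, NULL);
        return 0;
    }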

Relaxing all orders
- Retain control and data dependences within each thread.
- Why? Allow multiple overlapping read operations, which may be bypassed by writes; hide read latency (for read misses).
- Two important models: weak ordering and release consistency.

Weak ordering
- Synchronization operations wait for all previous memory operations to complete.
- Between synchronization operations, memory operations may complete in arbitrary order.

Release consistency
- Differentiates between synchronization operations:
  - acquire: a read operation performed to gain access to a set of operations or variables, e.g., Lock(TaskQ).
  - release: a write operation that grants access to other processors, e.g., UnLock(TaskQ).
- An acquire must complete with respect to all processors before the accesses that follow it: Lock(TaskQ) completes before newTask->next = Head; ..., UnLock(TaskQ).
- A release must wait until the accesses before it complete: UnLock(TaskQ) waits for Lock(TaskQ), ..., Head = newTask->next;.

Release consistency (intuition)
- The programmer inserts acquire/release operations around code that shares variables.
- An acquire has to complete before the instructions that follow it, because the other processors must know that a critical section has been entered. The acquire and the code before it can be reordered.
- The code before a release has to complete, because the critical-section modifications must become visible to the others. The release and the code after it can be reordered.
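These two rules are precisely the semantics of C11 acquire/release operations. A minimal spinlock sketch (taskq_lock and the function names are invented) showing where each ordering applies:

    #include <stdatomic.h>

    atomic_int taskq_lock = 0;   /* 0 = free, 1 = held */

    /* Acquire: a read-modify-write that gains access. It must complete
       before the critical-section accesses that follow it; code before the
       acquire may still be reordered past it. */
    static void lock_acquire(atomic_int *l) {
        while (atomic_exchange_explicit(l, 1, memory_order_acquire) == 1)
            ;  /* spin while held */
    }

    /* Release: a write that grants access to other processors. All accesses
       before it must complete first; code after the release may be reordered
       before it. */
    static void lock_release(atomic_int *l) {
        atomic_store_explicit(l, 0, memory_order_release);
    }

    int main(void) {
        lock_acquire(&taskq_lock);   /* Lock(TaskQ) */
        /* critical section, e.g. newTask->next = Head; Head = newTask; */
        lock_release(&taskq_lock);   /* UnLock(TaskQ) */
        return 0;
    }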

Preserved orderings
- Weak ordering: block 1 (reads/writes) -> synch -> block 2 (reads/writes) -> synch -> block 3 (reads/writes). Each synchronization operation waits for all accesses before it to complete, and all accesses after it wait for the synchronization operation.
- Release consistency: block 1 -> acquire -> block 2 -> release -> block 3. The acquire must complete before block 2; blocks 1 and 2 must complete before the release; but block 1 may overlap the acquire, and block 3 may overlap the release.
- A block contains the instructions of one processor that may be reordered among themselves.
- Both models give intuitive results, with good performance, if data races are eliminated through synchronization.