Lecture 8 Outline Memory consistency

Lecture 8 Outline: Memory consistency; Sequential consistency; Invalidation vs. update coherence protocols; MSI protocol; State diagrams; Simulation
Lecture 8 ECE/CSC 506 - Summer 2006 - E. F. Gehringer, based on slides by Yan Solihin

Switch Gear to Memory Consistency
Warning: “Studying memory consistency may induce headaches, confusion, apathy, and nausea. Do not try to study it over a short period of time. It is best to study memory consistency over a period of several months by revisiting the problem often, understanding all examples from the lectures and homework exercises. In addition, it is advised to focus only on the Sequential Consistency model in Chap. 5, and leave the rest for Chap. 9.”

Let’s Switch Gear to Memory Consistency
Coherence: writes to a single location are visible to all processors in the same order.
Consistency: writes to multiple locations are visible to all processors in the same order.
Recall Peterson’s algorithm (turn= …; interested[process]=…).
When “multiple” = “any”, we have sequential consistency (SC).

/* Assume the initial values of A and flag are 0 */
P1:                  P2:
A = 1;               while (flag == 0); /* spin idly */
flag = 1;            print A;

Sequential consistency (SC) corresponds to our intuition: P2 should print 1. Memory consistency models in general, however, are not intuitive to understand!
Coherence doesn’t help; it pertains only to a single location.
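A minimal Python sketch of the flag example, assuming one shared memory with no caches. It enumerates every interleaving that keeps each processor's program order (that is, every SC execution), keeps only the executions in which P2's final read of flag returns 1 (P2 left its loop), and collects the value P2 prints. The helper names (interleavings, run, printed_values) are hypothetical. Swapping P1's two stores, as a compiler or write buffer might, makes 0 printable.

```python
from itertools import combinations


def interleavings(p, q):
    """Yield every merge of p and q that preserves each sequence's own order."""
    n = len(p) + len(q)
    for slots in combinations(range(n), len(p)):   # positions taken by p's ops
        merged, i, j = [], 0, 0
        for k in range(n):
            if k in slots:
                merged.append(p[i])
                i += 1
            else:
                merged.append(q[j])
                j += 1
        yield merged


def run(order):
    """Execute (kind, var, value) ops on one shared memory; return the values read."""
    mem = {"A": 0, "flag": 0}
    reads = {}
    for kind, var, val in order:
        if kind == "W":
            mem[var] = val
        else:
            reads[var] = mem[var]
    return reads


def printed_values(p1_ops):
    """Values P2 can print, over all SC interleavings in which P2 leaves its loop."""
    # P2's last (successful) read of flag, followed by print A.
    p2_ops = [("R", "flag", None), ("R", "A", None)]
    printed = set()
    for order in interleavings(p1_ops, p2_ops):
        reads = run(order)
        if reads["flag"] == 1:
            printed.add(reads["A"])
    return printed


program_order = [("W", "A", 1), ("W", "flag", 1)]
stores_swapped = [("W", "flag", 1), ("W", "A", 1)]  # e.g., reordered by a compiler

print(printed_values(program_order))   # {1}
print(printed_values(stores_swapped))  # {0, 1}
```

Earlier failed spin reads of flag do not change memory, so modeling only the final, successful read is enough for this example.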

Another Example of Ordering
/* Assume the initial values of A and B are 0 */
P1:              P2:
(1a) A = 1;      (2a) print B;
(1b) B = 2;      (2b) print A;

What do you think the results should be? You may think:
1a, 1b, 2a, 2b -> {A=1, B=2}
1a, 2a, 2b, 1b -> {A=1, B=0}
2a, 2b, 1a, 1b -> {A=0, B=0}
This is the programmers’ intuition: sequential consistency.
Is {A=0, B=2} possible? Yes, if P2 sees the order 1b, 2a, 2b, 1a, e.g., because of an evil compiler or an evil interconnection network.
Whatever our intuition is, we need an ordering model with clear semantics across different locations (not just cache coherence for a single location!) so that programmers can reason about which results are possible.
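The intuitive outcomes can be checked mechanically. This sketch, assuming one shared memory with no caches, generates all six program-order-preserving interleavings of 1a, 1b and 2a, 2b and records what P2 prints; the function names are illustrative only.

```python
def interleavings(p, q):
    """All merges of p and q that keep each processor's program order."""
    if not p:
        return [list(q)]
    if not q:
        return [list(p)]
    return ([[p[0]] + rest for rest in interleavings(p[1:], q)] +
            [[q[0]] + rest for rest in interleavings(p, q[1:])])


# Operation labels from the slide: (kind, variable, value written)
OPS = {
    "1a": ("W", "A", 1),
    "1b": ("W", "B", 2),
    "2a": ("R", "B", None),   # print B
    "2b": ("R", "A", None),   # print A
}


def outcome(order):
    """Run labelled ops in order on one shared memory; return printed (A, B)."""
    mem = {"A": 0, "B": 0}
    printed = {}
    for label in order:
        kind, var, val = OPS[label]
        if kind == "W":
            mem[var] = val
        else:
            printed[var] = mem[var]
    return printed["A"], printed["B"]


for order in interleavings(["1a", "1b"], ["2a", "2b"]):
    print(", ".join(order), "->", outcome(order))

sc_outcomes = {outcome(o) for o in interleavings(["1a", "1b"], ["2a", "2b"])}
print(sorted(sc_outcomes))  # [(0, 0), (1, 0), (1, 2)]
```

No SC interleaving produces (A, B) = (0, 2); that outcome needs P2 to observe 1b before 1a.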

A Memory-Consistency Model …
Is a contract between the programmer and the system.
Is necessary to reason about the correctness of shared-memory programs.
Specifies constraints on the order in which memory operations (from any process) can appear to execute with respect to one another: what orders are preserved? Given a load, it constrains the possible values the load can return.
Implications for programmers: it restricts the algorithms that can be used; e.g., Peterson’s algorithm and home-brew synchronization will be incorrect on machines that do not guarantee SC.
Implications for compiler writers and architecture designers: it determines how much accesses can be reordered.

Lecture 8 Outline: Memory consistency; Sequential consistency; Invalidation vs. update coherence protocols; MSI protocol; State diagrams; Simulation

Sequential Consistency
“A multiprocessor is sequentially consistent if the result of any execution is the same as if the operations of all the processors were executed in some sequential order, and the operations of each individual processor appear in this sequence in the order specified by its program.” [Lamport, 1979]
(That is, as if there were no caches and a single memory.)
A total order is achieved by interleaving accesses from different processes.
SC maintains program order, and memory operations from all processes appear to [issue, execute, complete] atomically with respect to one another.
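Lamport's definition can be read as a checking procedure: an execution is SC if some single total order of all operations preserves each processor's program order and makes every read return the latest preceding write. The brute-force sketch below (hypothetical names; feasible only for tiny programs) searches for such an order, using observed read values from the A/B example.

```python
from itertools import permutations


def is_sc_witness(programs, trace, init=None):
    """True if `trace` is a legal SC total order for `programs`.

    programs: {proc: [(kind, var, value), ...]} in program order; for a read,
              value is what the read actually returned.
    trace:    [(proc, index), ...] one proposed total order of all operations.
    """
    expected = sorted((p, i) for p, ops in programs.items() for i in range(len(ops)))
    if sorted(trace) != expected:
        return False                      # each operation must appear exactly once
    mem = dict(init or {})
    next_index = {p: 0 for p in programs}
    for proc, i in trace:
        if i != next_index[proc]:
            return False                  # breaks this processor's program order
        next_index[proc] += 1
        kind, var, value = programs[proc][i]
        if kind == "W":
            mem[var] = value
        elif mem.get(var, 0) != value:    # unwritten locations start at 0
            return False                  # read did not see the latest write
    return True


def has_sc_witness(programs, init=None):
    """Brute force: does any total order of all operations satisfy SC?"""
    ops = [(p, i) for p, seq in programs.items() for i in range(len(seq))]
    return any(is_sc_witness(programs, list(t), init) for t in permutations(ops))


# A/B example: P2 printed B=2 and then A=1 (allowed) or A=0 (forbidden).
allowed = {"P1": [("W", "A", 1), ("W", "B", 2)],
           "P2": [("R", "B", 2), ("R", "A", 1)]}
forbidden = {"P1": [("W", "A", 1), ("W", "B", 2)],
             "P2": [("R", "B", 2), ("R", "A", 0)]}

print(has_sc_witness(allowed))    # True
print(has_sc_witness(forbidden))  # False
```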

What Really Is Program Order?
Intuitively, it is the order in which operations appear in the source code. We therefore assume program order as seen by the programmer: the compiler is prohibited from reordering memory accesses to shared variables. Note that this is one reason parallel programs are less efficient than serial programs (besides the restriction on reordering floating-point operations, and other overheads).

What Reordering Is Safe in SC?
What matters is the order in which operations appear to execute, not the order in which they actually execute.
/* Assume the initial values of A and B are 0 */
P1:              P2:
(1a) A = 1;      (2a) print B;
(1b) B = 2;      (2b) print A;
Possible outcomes for (A, B): (0,0), (1,0), (1,2); impossible under SC: (0,2).
We know 1a -> 1b and 2a -> 2b by program order.
A = 0 implies 2b -> 1a, which implies 2a -> 1b.
B = 2 implies 1b -> 2a, which leads to a contradiction.
BUT the actual execution 1b -> 1a -> 2b -> 2a is SC, despite not following program order: as visible from its results, it looks just like 1a -> 1b -> 2a -> 2b.
The actual execution 1b -> 2a -> 2b -> 1a is not SC.
Thus, some reordering is possible, but it is difficult to reason that it still ensures SC.
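The difference between the order in which operations execute and the order in which they appear to execute can be sketched as follows, assuming one shared memory: run an actual (possibly reordered) order and accept it if its visible result matches the result of some program-order interleaving. The names are hypothetical, and matching one execution's result is only the per-execution view used on the slide.

```python
from itertools import permutations

OPS = {"1a": ("W", "A", 1), "1b": ("W", "B", 2),
       "2a": ("R", "B", None), "2b": ("R", "A", None)}
PROGRAM_ORDER = [["1a", "1b"], ["2a", "2b"]]


def result(order):
    """Execute ops in the given actual order on one memory; return printed (A, B)."""
    mem, printed = {"A": 0, "B": 0}, {}
    for label in order:
        kind, var, val = OPS[label]
        if kind == "W":
            mem[var] = val
        else:
            printed[var] = mem[var]
    return printed["A"], printed["B"]


def follows_program_order(order):
    """True if each processor's ops appear in `order` in program order."""
    return all([op for op in order if op in seq] == seq for seq in PROGRAM_ORDER)


# Results of every SC execution (all program-order-preserving interleavings).
SC_RESULTS = {result(o) for o in permutations(OPS) if follows_program_order(o)}


def appears_sc(actual_order):
    """Acceptable under SC if the visible result is one some SC execution produces."""
    return result(actual_order) in SC_RESULTS


print(follows_program_order(["1b", "1a", "2b", "2a"]))  # False
print(appears_sc(["1b", "1a", "2b", "2a"]))             # True
print(appears_sc(["1b", "2a", "2b", "1a"]))             # False
```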

Conditions for SC
Two kinds of requirements:
Program order: memory operations issued by a process must appear to become visible (to others and to itself) in program order.
Global order: atomicity; one memory operation should appear to complete with respect to all processes before the next one is issued. Global order means the operation order is consistent as seen by all processes.
Tricky part: how to make writes atomic? Detect write completion. Read completion is easy: a read completes when the data returns.
Who should enforce SC? The compiler should not change program order; the hardware should ensure program order and atomicity.

Write Atomicity
Write atomicity ensures that the order of writes is the same for all processes; in effect, it extends write serialization to writes from multiple processes.
Transitivity implies that A should print as 1 under SC.
The problem arises if P2 leaves its loop and writes B, and P3 sees the new B but the old A (from its cache, say).
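The transcript omits this slide's figure; the sketch below assumes the usual three-processor program for this point (P1: A = 1; P2: while (A == 0); B = 1; P3: while (B == 0); print A). Each processor reads only its own copy of memory, and a write reaches other copies only when explicitly delivered. Delivering P1's write to everyone at once (atomic) makes P3 print 1; delivering it to P2 first and P3 later reproduces the "new B, old A" problem. The System class and its methods are hypothetical.

```python
class System:
    """Toy machine: each processor reads only its own copy of memory."""

    def __init__(self, procs, init):
        self.view = {p: dict(init) for p in procs}

    def write(self, proc, var, val):
        self.view[proc][var] = val        # the writer sees its own write at once
        return (var, val)

    def deliver(self, write, targets):
        var, val = write
        for t in targets:                 # the write becomes visible to these processors
            self.view[t][var] = val

    def read(self, proc, var):
        return self.view[proc][var]


def run(atomic_writes):
    """One schedule of P1: A=1 | P2: while (A==0); B=1 | P3: while (B==0); print A."""
    s = System(["P1", "P2", "P3"], {"A": 0, "B": 0})
    write_a = s.write("P1", "A", 1)
    # Atomic: the write reaches all other processors together. Otherwise P2 gets it first.
    s.deliver(write_a, ["P2", "P3"] if atomic_writes else ["P2"])
    assert s.read("P2", "A") == 1         # P2 exits its spin loop
    write_b = s.write("P2", "B", 1)
    s.deliver(write_b, ["P1", "P3"])
    assert s.read("P3", "B") == 1         # P3 exits its spin loop
    printed = s.read("P3", "A")           # P3 prints A
    if not atomic_writes:
        s.deliver(write_a, ["P3"])        # the write to A reaches P3 only now
    return printed


print(run(atomic_writes=True))   # 1
print(run(atomic_writes=False))  # 0
```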

Pitfalls for SC
Compiler optimizations:
Loop transformations reorder loads and stores.
Register allocation actually eliminates some loads/stores. Declaring the variables “volatile” disallows this (volatile constrains only the compiler, not the hardware).
Hardware may violate SC for better performance:
Write buffers
Out-of-order execution
Interconnection-network delay
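To illustrate the write-buffer pitfall, here is a sketch of the classic store-buffering pattern (P1: A = 1; r1 = B and P2: B = 1; r2 = A), which is not on the slide. Under SC, r1 = r2 = 0 is impossible; on a hypothetical machine with per-processor FIFO store buffers, where a load may bypass stores still waiting in the buffer, one schedule produces exactly that. Both functions are illustrative names.

```python
def sc_outcomes():
    """(r1, r2) over all SC interleavings of P1: A=1; r1=B  and  P2: B=1; r2=A."""
    p1 = [("W", "A", None), ("R", "B", "r1")]
    p2 = [("W", "B", None), ("R", "A", "r2")]
    results = set()

    def explore(i, j, mem, regs):
        if i == len(p1) and j == len(p2):
            results.add((regs["r1"], regs["r2"]))
            return
        for prog, k, di, dj in ((p1, i, 1, 0), (p2, j, 0, 1)):
            if k < len(prog):
                kind, var, reg = prog[k]
                mem2, regs2 = dict(mem), dict(regs)
                if kind == "W":
                    mem2[var] = 1
                else:
                    regs2[reg] = mem2[var]
                explore(i + di, j + dj, mem2, regs2)

    explore(0, 0, {"A": 0, "B": 0}, {})
    return results


def store_buffer_schedule():
    """One schedule on a hypothetical machine with per-processor FIFO store buffers."""
    mem = {"A": 0, "B": 0}
    buffers = {"P1": [], "P2": []}

    def store(p, var, val):
        buffers[p].append((var, val))        # the store waits in p's buffer

    def load(p, var):
        for v, val in reversed(buffers[p]):  # p sees its own buffered stores first...
            if v == var:
                return val
        return mem[var]                      # ...otherwise it bypasses them to memory

    def drain(p):
        while buffers[p]:
            var, val = buffers[p].pop(0)     # buffered stores reach memory in FIFO order
            mem[var] = val

    store("P1", "A", 1)
    store("P2", "B", 1)
    r1 = load("P1", "B")                     # both loads run before either buffer drains
    r2 = load("P2", "A")
    drain("P1")
    drain("P2")
    return r1, r2


print(sorted(sc_outcomes()))    # [(0, 1), (1, 0), (1, 1)]
print(store_buffer_schedule())  # (0, 0)
```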

Is the Write-Through Example SC?
Assume no write buffers and no load-store bypassing.
Yes, because of the atomic bus:
Any writes and read misses (to all locations) are serialized by the bus into bus order.
If a read obtains the value of write W, W is guaranteed to have completed, since it caused a bus transaction.
When write W is performed with respect to any processor, all previous writes in bus order have completed.

Lecture 8 Outline: Memory consistency; Sequential consistency; Invalidation vs. update coherence protocols; MSI protocol; State diagrams; Simulation

Using Write-Back Caches
Dirty state:
Uniprocessor: the line has been modified.
Multiprocessor: the line has been modified, plus exclusive ownership.
Exclusive: “I’m the only one that has it, other than possibly the main memory.”
Owner: “I’m responsible for supplying the block upon a request for it.”
Two variations: invalidation-based protocols and update-based protocols.

Invalidation-Based Protocols
Idea: on my write, invalidate everybody else, so I get the exclusive state.
Exclusive: I can modify the block without notifying anyone else (i.e., without a bus transaction).
I must first get the block in exclusive state before writing into it. Even if it is already in a valid state, a bus transaction is needed to invalidate the other copies (Read Exclusive = RdX).
Read and Read-Exclusive bus transactions drive the coherence actions.
On writeback (replacement): replace silently if the block is not in D (or M = Modified); otherwise flush (write it back).

Update-Based Protocols
Idea: on my write, update everybody else (with the new word). New bus transaction: Update.
Advantages:
Other processors don’t miss on their next access. This saves a refetch: in invalidation protocols, they would miss and need a bus transaction.
Saves bandwidth: a single bus transaction updates several caches.
Disadvantages:
Multiple writes by the same processor cause multiple update transactions. In invalidation, the first write gets exclusive ownership, and the rest are local.
Detailed tradeoffs are more complex.

Invalidate versus Update
Key question: is a block written by one processor read by others before it is rewritten?
Invalidation: Yes -> readers will take a miss. No -> multiple writes proceed without additional traffic, and copies that won’t be used again are cleared out.
Update: Yes -> readers will not miss if they had a copy previously; a single bus transaction updates all copies. No -> multiple useless updates, even to dead copies.
Invalidation protocols are much more popular (more later). Some systems provide both, or even a hybrid.
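The key question above can be explored by counting bus transactions for one block. This is a deliberately simplified sketch, not a model of any specific machine: the invalidation side follows MSI (a read miss issues BusRd; a write not in M issues BusRdX), while the update side assumes a miss fetches the block and every write broadcasts an update whenever another cache holds a copy. Function names are hypothetical.

```python
def invalidation_traffic(trace):
    """Bus transactions for one block under a simplified MSI invalidation protocol."""
    state = {}                                # proc -> "S" or "M"; absent means I
    bus = 0
    for proc, op in trace:
        mine = state.get(proc, "I")
        if op == "R" and mine == "I":
            bus += 1                          # BusRd; an M owner flushes and drops to S
            state = {p: "S" for p in state}
            state[proc] = "S"
        elif op == "W" and mine != "M":
            bus += 1                          # BusRdX (or upgrade) invalidates all others
            state = {proc: "M"}
    return bus


def update_traffic(trace):
    """Bus transactions for one block under a simplified write-update protocol."""
    holders = set()
    bus = 0
    for proc, op in trace:
        if proc not in holders:
            bus += 1                          # miss: fetch the block
            holders.add(proc)
        if op == "W" and len(holders) > 1:
            bus += 1                          # broadcast the new word to the other copies
    return bus


# Block written by P0 and read by P1 before each rewrite.
producer_consumer = [("P0", "W"), ("P1", "R")] * 4
# P1 reads once; P0 then writes repeatedly and P1 never reads again (a dead copy).
repeated_writes = [("P1", "R")] + [("P0", "W")] * 4

print(invalidation_traffic(producer_consumer), update_traffic(producer_consumer))  # 8 5
print(invalidation_traffic(repeated_writes), update_traffic(repeated_writes))      # 2 6
```

The first pattern corresponds to the "Yes" case, where update avoids reader misses; the second corresponds to "No", where update keeps refreshing a copy nobody reads.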

Lecture 8 Outline: Memory consistency; Sequential consistency; Invalidation vs. update coherence protocols; MSI protocol; State diagrams; Simulation

Basic MSI Writeback Invalidation Protocol
States:
Invalid (I)
Shared (S): one or more copies, and the memory copy is up to date
Dirty or Modified (M): only one copy
Processor events: PrRd (read), PrWr (write)
Bus transactions:
BusRd: asks for a copy with no intent to modify
BusRdX: asks for a copy with intent to modify (instead of BusWr)
Flush: updates memory
Actions: update state, perform bus transaction, flush value onto bus

State-Transition Diagrams
On the following slides, we will display the state-transition diagrams for processor-initiated transactions and for bus-initiated transactions.
We will see transitions of the following forms:
Invalidation: Any -> I
Intervention: {Exclusive, Modified} -> Shared

MSI: Processor-Initiated Transactions (edge label: processor event/bus transaction issued)
I: PrRd/BusRd -> S; PrWr/BusRdX -> M
S: PrRd/- (stay in S); PrWr/BusRdX -> M
M: PrRd/- and PrWr/- (stay in M)

MSI: Bus-Initiated Transactions (edge label: snooped transaction/action taken)
M: BusRd/Flush -> S; BusRdX/Flush -> I
S: BusRd/- (stay in S); BusRdX/- -> I
I: BusRd/- and BusRdX/- (stay in I)
A block in S never flushes. Thus, valid data must be supplied by memory.
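The two diagrams can be written as lookup tables. This sketch encodes them as Python dicts (names hypothetical; BusUpgr is left out, as on these two slides), then explores every global state that three caches can reach from all-Invalid, ignoring replacements, and checks the single-writer property: an M copy never coexists with another valid copy.

```python
from collections import deque

# Processor-initiated: (state, event) -> (next state, bus transaction or None)
PROC = {
    ("I", "PrRd"): ("S", "BusRd"),
    ("I", "PrWr"): ("M", "BusRdX"),
    ("S", "PrRd"): ("S", None),
    ("S", "PrWr"): ("M", "BusRdX"),
    ("M", "PrRd"): ("M", None),
    ("M", "PrWr"): ("M", None),
}

# Bus-initiated (snooped): (state, transaction) -> (next state, action or None)
SNOOP = {
    ("I", "BusRd"): ("I", None),
    ("I", "BusRdX"): ("I", None),
    ("S", "BusRd"): ("S", None),
    ("S", "BusRdX"): ("I", None),
    ("M", "BusRd"): ("S", "Flush"),
    ("M", "BusRdX"): ("I", "Flush"),
}


def step(states, p, event):
    """Cache p handles a processor event; all other caches snoop any bus transaction."""
    nxt, txn = PROC[(states[p], event)]
    new = list(states)
    new[p] = nxt
    if txn is not None:
        for q in range(len(states)):
            if q != p:
                new[q] = SNOOP[(states[q], txn)][0]   # only the next state is tracked here
    return tuple(new)


def reachable(n=3):
    """Breadth-first search over global states, starting with every cache Invalid."""
    start = ("I",) * n
    seen, frontier = {start}, deque([start])
    while frontier:
        s = frontier.popleft()
        for p in range(n):
            for event in ("PrRd", "PrWr"):
                t = step(s, p, event)
                if t not in seen:
                    seen.add(t)
                    frontier.append(t)
    return seen


def single_writer(states):
    """An M copy never coexists with any other valid (S or M) copy."""
    m = states.count("M")
    return m == 0 or (m == 1 and states.count("S") == 0)


states = reachable()
print(len(states), all(single_writer(s) for s in states))  # 11 True
```

The 11 states are the 8 combinations of S and I plus the 3 states with a single M copy.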

MSI State Transition Diagram [figure]

MSI Visualization
Setup: processors P1, P2, and P3, each with a cache and a snooper, share a bus with the memory controller. Main memory initially holds X=1.
1. P1 reads X (rd &X): P1 issues BusRd; memory supplies X=1; P1's copy becomes S.
2. P1 writes X=2 (wr &X): P1 issues BusRdX; P1's copy goes S -> M with X=2; memory still holds X=1.
3. P3 reads X: P3 issues BusRd; P1 snoops it, flushes X=2, and goes M -> S; the memory read is cancelled and memory is updated to X=2; P3 gets X=2 in S.
4. P3 writes X=3: P3 issues BusUpgr; P1's copy goes S -> I; P3's copy goes S -> M with X=3; memory still holds X=2.
5. P1 reads X: P1 issues BusRd; P3 flushes X=3 and goes M -> S, updating memory as well (X=3); P1's copy goes I -> S with X=3.
6. P3 reads X: hit in S; no bus transaction.
7. P2 reads X: P2 issues BusRd; memory supplies X=3; now P1, P2, and P3 all hold X=3 in S.

Example: Rd/Wr to a single line
Proc Action   State P1   State P2   State P3   Bus Action   Data From
R1            S          -          -          BusRd        Mem
W1            M          -          -          BusRdX*      Mem
R3            S          -          S          BusRd        P1 cache
W3            I          -          M          BusRdX*      Mem
R1            S          -          S          BusRd        P3 cache
R3            S          -          S          -            Local cache
R2            S          S          S          BusRd        Mem
*or BusUpgr (data from own cache)
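The table can be regenerated by a small simulator. This sketch assumes one line, three caches starting Invalid, BusRdX for every write that is not already in M (the BusUpgr alternative appears only as a comment), and "-" for a cache that has never requested the line; replay is a hypothetical name.

```python
def replay(requests, n=3):
    """Replay requests such as "R1" or "W3" on one line; return one table row each."""
    state = ["I"] * n
    touched = set()                               # caches that have requested the line
    rows = []
    for req in requests:
        op, p = req[0], int(req[1:]) - 1
        touched.add(p)
        owner = next((q for q in range(n) if q != p and state[q] == "M"), None)
        supplier = "Mem" if owner is None else f"P{owner + 1} cache"
        if (op == "R" and state[p] != "I") or (op == "W" and state[p] == "M"):
            bus, source = "-", "Local cache"      # hit: no bus transaction
        elif op == "R":
            bus, source = "BusRd", supplier
            if owner is not None:
                state[owner] = "S"                # owner flushes (memory updated too)
            state[p] = "S"
        else:
            # From S, a BusUpgr could be used instead (data from own cache).
            bus, source = "BusRdX", supplier
            state = ["I"] * n                     # every other copy is invalidated
            state[p] = "M"
        shown = tuple(state[q] if q in touched else "-" for q in range(n))
        rows.append((req, *shown, bus, source))
    return rows


for row in replay(["R1", "W1", "R3", "W3", "R1", "R3", "R2"]):
    print("  ".join(f"{cell:<11}" for cell in row))
```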

Notes on MSI Protocol
For M -> I, BusRdX/Flush: why flush? Because BusRdX is a read with the intention to write, as opposed to a plain write. Thus, the requester may read the block before its write is performed. In addition, the writes could be to different words in the line, so the requester needs the rest of the block.
Write to a shared block: the writer already has the latest data, so it can use an upgrade (BusUpgr) instead of BusRdX.
Replacement changes the state of two blocks: the outgoing one and the incoming one.
A flush has to update both the requesting cache and main memory.
Note: the coherence granularity is a whole line (here, line u). What happens when all the reads go to word 0 of line u, but the write by P3 goes to word 1 of line u? The second R1 takes a false-sharing miss.
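A rough way to see that false-sharing miss: track, for each processor, which words other processors wrote after its copy was invalidated, and on a miss call it true sharing if the accessed word is in that set, false sharing otherwise. This is a simplified classification assumed for illustration, not a standard algorithm from the slides. The trace assumes W1 writes word 0 and W3 writes word 1, with every read going to word 0; classify is a hypothetical name.

```python
def classify(trace, n=3):
    """Label each (proc, op, word) access to a single line under invalidation."""
    valid = [False] * n                   # does processor p hold a valid copy?
    ever = [False] * n                    # has p ever held the line?
    written = [set() for _ in range(n)]   # words others wrote since p's copy went stale
    labels = []
    for proc, op, word in trace:
        p = proc - 1
        if valid[p]:
            labels.append("hit")          # a write to an S copy still needs an upgrade
        elif not ever[p]:
            labels.append("cold miss")
        elif word in written[p]:
            labels.append("true-sharing miss")
        else:
            labels.append("false-sharing miss")
        valid[p] = ever[p] = True         # a miss brings the whole line in
        written[p].clear()
        if op == "W":
            for q in range(n):
                if q != p:
                    valid[q] = False      # invalidate every other copy of the line
                    written[q].add(word)
    return labels


# R1, W1, R3, W3, R1, R3, R2: all reads and W1 use word 0; W3 writes word 1.
trace = [(1, "R", 0), (1, "W", 0), (3, "R", 0), (3, "W", 1),
         (1, "R", 0), (3, "R", 0), (2, "R", 0)]
print(classify(trace)[4])  # false-sharing miss
```

If W3 wrote word 0 instead, the same miss would be classified as true sharing.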

MSI: Coherence and SC
Coherence:
Write propagation: through invalidation, and a flush on subsequent BusRds.
Write serialization: writes (BusRdX) that go to the bus appear in bus order (and are handled by snoopers in bus order!). Writes that do not go to the bus? That only happens when the line state is M, i.e., when I am the only processor holding the line. Local writes are visible only to me, so they are serialized.
To enforce SC:
Program order: enforced by following the bus transaction order. Every write that others can observe is ordered on the bus, and local writes (within one processor) can follow program order.
Write completion: occurs when the write appears on the bus.
Write atomicity: a read returns the latest value of a write; at that time, the value is visible to all others (through a bus transaction, or as a local write).