COSC6385 Advanced Computer Architecture

Slides:



Advertisements
Similar presentations
Chapter 5 Part I: Shared Memory Multiprocessors
Advertisements

Cache Coherence. Memory Consistency in SMPs Suppose CPU-1 updates A to 200. write-back: memory and cache-2 have stale values write-through: cache-2 has.
Extra Cache Coherence Examples In the following examples there are a couple questions. You can answer these for practice by ing Colin at
CSE 502: Computer Architecture
Lecture 7. Multiprocessor and Memory Coherence
Multi-core systems System Architecture COMP25212 Daniel Goodman Advanced Processor Technologies Group.
CS252 Graduate Computer Architecture Lecture 25 Memory Consistency Models and Snoopy Bus Protocols Prof John D. Kubiatowicz
CS 7810 Lecture 19 Coherence Decoupling: Making Use of Incoherence J.Huh, J. Chang, D. Burger, G. Sohi Proceedings of ASPLOS-XI October 2004.
Computer Architecture II 1 Computer architecture II Lecture 8.
CIS629 Coherence 1 Cache Coherence: Snooping Protocol, Directory Protocol Some of these slides courtesty of David Patterson and David Culler.
Computer architecture II
CPE 731 Advanced Computer Architecture Snooping Cache Multiprocessors Dr. Gheith Abandah Adapted from the slides of Prof. David Patterson, University of.
Shared Memory Consistency Models: A Tutorial By Sarita V Adve and Kourosh Gharachorloo Presenter: Meenaktchi Venkatachalam.
CS 258 Parallel Computer Architecture Lecture 12 Shared Memory Multiprocessors II March 1, 2002 Prof John D. Kubiatowicz
Snooping Cache and Shared-Memory Multiprocessors
1 Shared-memory Architectures Adapted from a lecture by Ian Watson, University of Machester.
CS492B Analysis of Concurrent Programs Coherence Jaehyuk Huh Computer Science, KAIST Part of slides are based on CS:App from CMU.
Presented By:- Prerna Puri M.Tech(C.S.E.) Cache Coherence Protocols MSI & MESI.
Spring EE 437 Lillevik 437s06-l21 University of Portland School of Engineering Advanced Computer Architecture Lecture 21 MSP shared cached MSI protocol.
Ch4. Multiprocessors & Thread-Level Parallelism 2. SMP (Symmetric shared-memory Multiprocessors) ECE468/562 Advanced Computer Architecture Prof. Honggang.
Cache Coherence CSE 661 – Parallel and Vector Architectures
Shared Memory Consistency Models. SMP systems support shared memory abstraction: all processors see the whole memory and can perform memory operations.
Cache Coherence Protocols A. Jantsch / Z. Lu / I. Sander.
1 Lecture 3: Coherence Protocols Topics: consistency models, coherence protocol examples.
ECE 4100/6100 Advanced Computer Architecture Lecture 13 Multiprocessor and Memory Coherence Prof. Hsien-Hsin Sean Lee School of Electrical and Computer.
Cache Coherence CS433 Spring 2001 Laxmikant Kale.
CSC/ECE 506: Architecture of Parallel Computers Bus-Based Coherent Multiprocessors 1 Lecture 12 (Chapter 8) Lecture 12 (Chapter 8)
Outline Introduction (Sec. 5.1)
Cache Coherence in Shared Memory Multiprocessors
The University of Adelaide, School of Computer Science
The University of Adelaide, School of Computer Science
CS 704 Advanced Computer Architecture
Lecture 18: Coherence and Synchronization
Cache Coherence for Shared Memory Multiprocessors
The University of Adelaide, School of Computer Science
Prof. Gennady Pekhimenko University of Toronto Fall 2017
CMSC 611: Advanced Computer Architecture
Example Cache Coherence Problem
Flynn’s Taxonomy Flynn classified by data and control streams in 1966
The University of Adelaide, School of Computer Science
Lecture 2: Snooping-Based Coherence
Chip-Multiprocessor.
Cache Coherence in Bus-Based Shared Memory Multiprocessors
Cache Coherence Protocols 15th April, 2006
Shared Memory Consistency Models: A Tutorial
Lecture 8: Directory-Based Cache Coherence
Lecture 4: Update Protocol
Bus-Based Coherent Multiprocessors
Lecture 7: Directory-Based Cache Coherence
Lecture 25: Multiprocessors
Lecture 10: Consistency Models
High Performance Computing
Lecture 25: Multiprocessors
The University of Adelaide, School of Computer Science
Lecture 17 Multiprocessors and Thread-Level Parallelism
Lecture 24: Virtual Memory, Multiprocessors
Lecture 23: Virtual Memory, Multiprocessors
Lecture 24: Multiprocessors
Lecture 8 Outline Memory consistency
Coherent caches Adapted from a lecture by Ian Watson, University of Machester.
Prof John D. Kubiatowicz
Lecture 17 Multiprocessors and Thread-Level Parallelism
Prof John D. Kubiatowicz
CPE 631 Lecture 20: Multiprocessors
Lecture 19: Coherence and Synchronization
Lecture 18: Coherence and Synchronization
The University of Adelaide, School of Computer Science
CSE 486/586 Distributed Systems Cache Coherence
Lecture 17 Multiprocessors and Thread-Level Parallelism
Lecture 11: Consistency Models
Presentation transcript:

COSC6385 Advanced Computer Architecture Lecture 16. Cache Coherence Instructor: Weidong Shi (Larry), PhD Computer Science Department University of Houston

Topics Today Cache Coherence

Memory Hierarchy in a Multiprocessor Shared cache P $ Bus-based shared memory Memory P P P Cache Memory P $ Memory Fully-connected shared memory (Dancehall) Interconnection Network P $ Memory Interconnection Network Distributed shared memory

Cache Coherency Closest cache level is private Multiple copies of cache line can be present across different processor nodes Local updates Lead to incoherent state Problem exhibits in both write-through and writeback caches Bus-based  globally visible Point-to-point interconnect  visible only to communicated processor nodes

Example (Writeback Cache) Rd? Rd? Cache Cache Cache X= -100 X= -100 X= -100 X= 505 Memory X= -100

Example (Write-through Cache) Rd? Cache Cache Cache X= -100 X= 505 X= -100 X= 505 X= 505 Memory X= -100

Memory Consistency Issue What do you expect for the following codes? Initial values A=0 B=0 P1 P2 A=1; Flag = 1; while (Flag==0) {}; print A; Is it possible P2 prints A=0? P1 P2 A=1; B=1; print B; print A; Is it possible P2 prints A=0, B=1?

Memory Consistency Model Programmers anticipate certain memory ordering and program behavior Become very complex When Running shared-memory programs A processor supports out-of-order execution A memory consistency model specifies the legal ordering of memory events when several processors access the shared memory locations

Implicit definition of coherence Defining Coherence An MP is coherent if the results of any execution of a program can be reconstructed by a hypothetical serial order Implicit definition of coherence Write propagation Writes are visible to other processes Write serialization All writes to the same location are seen in the same order by all processes (to “all” locations called write atomicity) E.g., w1 followed by w2 seen by a read from P1, will be seen in the same order by all reads by other processors Pi

SC Example P1 P2 P3 P0 P1 P2 P3 P0 A=1 A=2 T=A Y=A U=A Z=A A=1 A=2 T=A Sequentially Consistent Violating Sequential Consistency! (but possible in processor consistency model)

Maintain Program Ordering (SC) Flag1 = Flag2 = 0 Dekker’s algorithm Only one processor is allowed to enter the CS P1 P2 Flag1 = 1 if (Flag2 == 0) enter Critical Section Flag2 = 1 if (Flag1 == 0) enter Critical Section Caveat: implementation fine with uni-processor,but violate the ordering of the above P0 P1 Flag2=0 Flag1=0 Flag1=1 Write Buffer Flag2=1 Write Buffer INCORRECT!! BOTH ARE IN CRITICAL SECTION! Flag1: 0 Flag2: 0

Atomic and Instantaneous Update (SC) Update (of A) must take place atomically to all processors A read cannot return the value of another processor’s write until the write is made visible by “all” processors A = B = 0 P1 P2 P3 A = 1 if (A==1) B =1 if (B==1) R1=A

Atomic and Instantaneous Update (SC) Update (of A) must take place atomically to all processors A read cannot return the value of another processor’s write until the write is made visible by “all” processors A = B = 0 P1 P2 P3 A = 1 if (A==1) B =1 if (B==1) R1=A Caveat when an update is not atomic to all … P0 P1 P2 P3 P4 A=1 A=1 A=1 A=1 B=1 B=1 R1=0?

Atomic and Instantaneous Update (SC) Caches also make things complicated P3 caches A and B A=1 will not show up in P3 until P3 reads it in R1=A A = B = 0 P1 P2 P3 A = 1 if (A==1) B =1 if (B==1) R1=A

Another Example P0 P1 P2 P3 A=0 B=0 A=1 B=2 A=1 B=2 A=1 B=2 A=1 B=2 T1 See A’s update before B’s See B’s update before A’s

Bus Snooping based on Write-Through Cache All the writes will be shown as a transaction on the shared bus to memory Two protocols Update-based Protocol Invalidation-based Protocol

Bus Snooping (Update-based Protocol on Write-Through cache) X= 505 X= -100 X= 505 Memory Bus transaction X= -100 X= 505 Bus snoop Each processor’s cache controller constantly snoops on the bus Update local copies upon snoop hit

Bus Snooping (Invalidation-based Protocol on Write-Through cache) Load X Cache Cache Cache X= -100 X= 505 X= 505 Memory Bus transaction X= -100 X= 505 Bus snoop Each processor’s cache controller constantly snoops on the bus Invalidate local copies upon snoop hit

A Simple Snoopy Coherence Protocol for a WT, No Write-Allocate Cache PrWr / BusWr PrRd / --- Valid PrRd / BusRd BusWr / --- Invalid Observed / Transaction PrWr / BusWr Processor-initiated Transaction Bus-snooper-initiated Transaction

How about Writeback Cache? WB cache to reduce bandwidth requirement The majority of local writes are hidden behind the processor nodes How to snoop? Write Ordering

Cache Coherence Protocols for WB caches A cache has an exclusive copy of a line if It is the only cache having a valid copy Memory may or may not have it Modified (dirty) cache line The cache having the line is the owner of the line, because it must supply the block

Cache Coherence Protocol (Update-based Protocol on Writeback cache) Store X Cache Cache Cache X= 505 X= -100 X= 505 X= -100 X= 505 X= -100 update Memory Bus transaction Update data for all processor nodes who share the same data For a processor node keeps updating the memory location, a lot of traffic will be incurred

Cache Coherence Protocol (Update-based Protocol on Writeback cache) Store X Load X Cache Cache Cache X= 505 X= 333 X= 333 X= 505 X= 333 X= 505 Hit ! update update Memory Bus transaction Update data for all processor nodes who share the same data For a processor node keeps updating the memory location, a lot of traffic will be incurred

Cache Coherence Protocol (Invalidation-based Protocol on Writeback cache) Store X Cache Cache Cache X= -100 X= -100 X= 505 X= -100 invalidate Memory Bus transaction Invalidate the data copies for the sharing processor nodes Reduced traffic when a processor node keeps updating the same memory location

Cache Coherence Protocol (Invalidation-based Protocol on Writeback cache) Load X Cache Cache Cache X= 505 X= 505 Miss ! Snoop hit Memory Bus transaction Bus snoop Invalidate the data copies for the sharing processor nodes Reduced traffic when a processor node keeps updating the same memory location

Cache Coherence Protocol (Invalidation-based Protocol on Writeback cache) Store X P P P Store X Store X Cache Cache Cache X= 505 X= 987 X= 333 X= 444 X= 505 Memory Bus transaction Bus snoop Invalidate the data copies for the sharing processor nodes Reduced traffic when a processor node keeps updating the same memory location

MSI Writeback Invalidation Protocol Modified Dirty Only this cache has a valid copy Shared Memory is consistent One or more caches have a valid copy Invalid Writeback protocol: A cache line can be written multiple times before the memory is updated.

MSI Writeback Invalidation Protocol Two types of request from the processor PrRd PrWr Three types of bus transactions post by cache controller BusRd PrRd misses the cache Memory or another cache supplies the line BusRd eXclusive (Read-to-own) PrWr is issued to a line which is not in the Modified state BusWB Writeback due to replacement Processor does not directly involve in initiating this operation

MSI Writeback Invalidation Protocol (Processor Request) PrWr / BusRdX PrWr / --- PrRd / --- Modified Shared PrRd / --- PrWr / BusRdX PrRd / BusRd Invalid Processor-initiated

MSI Writeback Invalidation Protocol (Bus Transaction) BusRd / Flush BusRd / --- Modified Shared BusRdX / Flush BusRdX / --- Flush data on the bus Both memory and requestor will grab the copy The requestor get data by Cache-to-cache transfer; or Memory Invalid Bus-snooper-initiated

Bus-snooper-initiated MSI Writeback Invalidation Protocol (Bus transaction) Another possible implementation Another possible, valid implementation Anticipate no more reads from this processor A performance concern Save “invalidation” trip if the requesting cache writes the shared line later BusRd / Flush BusRd / --- Modified Shared BusRdX / Flush BusRdX / --- BusRd / Flush Invalid Bus-snooper-initiated

MSI Writeback Invalidation Protocol PrWr / BusRdX PrWr / --- PrRd / --- BusRd / Flush BusRd / --- Modified Shared PrRd / --- BusRdX / Flush BusRdX / --- PrWr / BusRdX Invalid PrRd / BusRd Processor-initiated Bus-snooper-initiated

MSI Example P1 P2 P3 Bus BusRd MEMORY Processor Action State in P1 Cache Cache Cache X=10 S Bus BusRd MEMORY X=10 Processor Action State in P1 State in P2 State in P3 Bus Transaction Data Supplier P1 reads X S --- BusRd Memory

MSI Example P1 P2 P3 BusRd Bus MEMORY Processor Action State in P1 Cache Cache Cache X=10 S X=10 S BusRd Bus MEMORY X=10 Processor Action State in P1 State in P2 State in P3 Bus Transaction Data Supplier P1 reads X S --- BusRd Memory S --- BusRd Memory P3 reads X

Does not come from memory if having “BusUpgrade” MSI Example P1 P2 P3 Cache Cache Cache --- I X=10 S X=10 X=-25 S M BusRdX Bus MEMORY X=10 Processor Action State in P1 State in P2 State in P3 Bus Transaction Data Supplier P1 reads X S --- BusRd Memory S --- BusRd Memory P3 reads X I --- M BusRdX Memory P3 writes X Does not come from memory if having “BusUpgrade”

MSI Example P1 P2 P3 BusRd Bus MEMORY Processor Action State in P1 Cache Cache Cache --- X=-25 S I X=-25 S M BusRd Bus MEMORY X=-25 X=10 Processor Action State in P1 State in P2 State in P3 Bus Transaction Data Supplier P1 reads X S --- BusRd Memory S --- BusRd Memory P3 reads X Memory P3 writes X I --- M BusRdX P1 reads X S --- BusRd P3 Cache

MSI Example P1 P2 P3 BusRd Bus MEMORY Processor Action State in P1 Cache Cache Cache X=-25 S X=-25 S X=-25 S M BusRd Bus MEMORY X=-25 X=10 Processor Action State in P1 State in P2 State in P3 Bus Transaction Data Supplier P1 reads X S --- BusRd Memory S --- BusRd Memory P3 reads X P3 writes X I --- M BusRdX Memory P1 reads X S --- BusRd P3 Cache P2 reads X S BusRd Memory