
ECE 4100/6100 Advanced Computer Architecture Lecture 13 Multiprocessor and Memory Coherence Prof. Hsien-Hsin Sean Lee School of Electrical and Computer Engineering Georgia Institute of Technology

2 Memory Hierarchy in a Multiprocessor Four common organizations (P = processor, $ = cache): –Shared cache: all processors share a single cache in front of memory –Bus-based shared memory: each processor has a private cache; caches and memory sit on a shared bus –Fully-connected shared memory (dancehall): private caches reach the memory modules through an interconnection network –Distributed shared memory: each node pairs a processor and cache with local memory; nodes communicate over an interconnection network

3 Cache Coherency Closest cache level is private Multiple copies of a cache line can be present across different processor nodes Local updates –Lead to an incoherent state –The problem appears in both write-through and writeback caches Bus-based interconnect: writes are globally visible Point-to-point interconnect: writes are visible only to the communicating processor nodes

4 Example (Writeback Cache) Figure: three processors share memory holding X= -100. One processor caches X= -100; another writes X= 505 into its own cache (writeback, so memory still holds -100). A third processor's read of X ("Rd?") can then observe the stale value -100 from memory.

5 Example (Write-through Cache) Figure: the same setup with write-through caches. The write of X= 505 updates memory immediately, but the first processor's cached copy still holds X= -100, so its subsequent read ("Rd?") returns stale data.

6 Defining Coherence An MP is coherent if the results of any execution of a program can be reconstructed by some hypothetical serial order Implicit definition of coherence: Write propagation –Writes are visible to other processors Write serialization –All writes to the same location are seen in the same order by all processors (extending this to all locations is called write atomicity) –E.g., if w1 followed by w2 is seen in that order by a read from P 1, then all reads by other processors P i will see them in the same order

7 Sounds Easy? Figure: four processors P0–P3, with A=0 and B=0 initially. At time T1, P0 writes A=1 and P1 writes B=2. Because the updates propagate at different speeds, by T3 some processors see A's update before B's while others see B's update before A's: coherence alone does not impose a global order on writes to different locations.

8 Bus Snooping Based on Write-Through Caches All writes appear as transactions on the shared bus to memory Two protocols –Update-based protocol –Invalidation-based protocol

9 Bus Snooping (Update-based Protocol on Write-Through Cache) Each processor's cache controller constantly snoops on the bus Local copies are updated upon a snoop hit Figure: a write of X= 505 appears as a bus transaction; the snooping cache updates its copy of X from -100 to 505, and memory is updated as well.

10 Bus Snooping (Invalidation-based Protocol on Write-Through Cache) Each processor's cache controller constantly snoops on the bus Local copies are invalidated upon a snoop hit Figure: the write of X= 505 invalidates the other cache's copy; that processor's next Load X misses and fetches the new value 505.

11 A Simple Snoopy Coherence Protocol for a WT, No-Write-Allocate Cache Two states per line: Valid and Invalid (notation: Observed event / Bus transaction) Processor-initiated transitions: –Invalid, PrRd / BusRd → Valid –Valid, PrRd / --- (hit, no bus transaction) –Valid, PrWr / BusWr (write-through) –Invalid, PrWr / BusWr (no-write-allocate: stays Invalid) Bus-snooper-initiated transition: –Valid, BusWr / --- → Invalid
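The two-state protocol above can be sketched as a small state machine. This is a minimal illustration (the function names are mine, not from the lecture): processor events return the next state plus any bus transaction, and snooped bus writes invalidate the local copy.

```python
# Sketch of the 2-state snoopy protocol for a write-through,
# no-write-allocate cache.  PrRd/PrWr are processor events;
# BusRd/BusWr are bus transactions; BusWr is also snooped.

INVALID, VALID = "I", "V"

def processor_event(state, op):
    """Return (next_state, bus_transaction) for a processor request."""
    if op == "PrRd":
        # A read miss fetches the line (BusRd); a hit stays Valid silently.
        return (VALID, "BusRd") if state == INVALID else (VALID, None)
    if op == "PrWr":
        # Write-through: every write goes on the bus.  No-write-allocate:
        # a write miss does not bring the line into the cache.
        return (state, "BusWr")
    raise ValueError(op)

def snoop_event(state, op):
    """Return the next state for a snooped bus transaction."""
    if op == "BusWr":
        return INVALID   # another cache wrote this line: invalidate our copy
    return state         # a BusRd from another cache needs no action
```

For example, `processor_event(INVALID, "PrRd")` yields `(VALID, "BusRd")`, matching the Invalid-to-Valid arc in the diagram.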

12 How About a Writeback Cache? A WB cache reduces the bandwidth requirement The majority of local writes are hidden behind the processor nodes How do we snoop? How do we preserve write ordering?

13 Cache Coherence Protocols for WB Caches A cache has an exclusive copy of a line if –It is the only cache holding a valid copy –Memory may or may not have an up-to-date copy Modified (dirty) cache line –The cache holding the line is the owner of the line, because it must supply the block on a request

14 Cache Coherence Protocol (Update-based Protocol on Writeback Cache) Data is updated in all processor nodes that share it If a processor node keeps updating the same memory location, a lot of traffic is incurred Figure: Store X= 505 in one cache triggers an update bus transaction that refreshes the sharing cache's copy.

15 Cache Coherence Protocol (Update-based Protocol on Writeback Cache, continued) Figure: the other processor's Load X now hits (X= 505); a later Store X= 333 again broadcasts an update to keep the sharer's copy current.

16 Cache Coherence Protocol (Invalidation-based Protocol on Writeback Cache) Data copies in the sharing processor nodes are invalidated Traffic is reduced when a processor node keeps updating the same memory location Figure: Store X= 505 sends an invalidate bus transaction; the sharing cache drops its copy of X= -100.

17 Cache Coherence Protocol (Invalidation-based Protocol on Writeback Cache, continued) Figure: the invalidated processor's Load X misses in its own cache; the bus snoop hits in the owning cache, which supplies X= 505.

18 Cache Coherence Protocol (Invalidation-based Protocol on Writeback Cache, continued) Figure: subsequent stores (X= 333, X= 987, X= 444) hit locally in the now-exclusive line and generate no further bus transactions.

19 MSI Writeback Invalidation Protocol Modified –Dirty –Only this cache has a valid copy Shared –Memory is consistent –One or more caches have a valid copy Invalid Writeback protocol: a cache line can be written multiple times before memory is updated.

20 MSI Writeback Invalidation Protocol Two types of requests from the processor –PrRd –PrWr Three types of bus transactions posted by the cache controller –BusRd: PrRd misses the cache; memory or another cache supplies the line –BusRd eXclusive (BusRdX, read-to-own): PrWr is issued to a line that is not in the Modified state –BusWB: writeback due to replacement; the processor is not directly involved in initiating this transaction

21 MSI Writeback Invalidation Protocol (Processor Request) Processor-initiated transitions: –Invalid, PrRd / BusRd → Shared –Shared, PrRd / --- (hit) –Shared, PrWr / BusRdX → Modified –Invalid, PrWr / BusRdX → Modified –Modified, PrRd / --- and PrWr / --- (hits, no bus transaction)

22 MSI Writeback Invalidation Protocol (Bus Transaction) On a snooped request to a Modified line, flush the data on the bus –Both memory and the requestor grab the copy –The requestor gets the data by cache-to-cache transfer, or from memory Bus-snooper-initiated transitions: –Modified, BusRd / Flush → Shared –Modified, BusRdX / Flush → Invalid –Shared, BusRd / --- –Shared, BusRdX / --- → Invalid

23 MSI Writeback Invalidation Protocol (Bus Transaction) Another possible, valid implementation: on a snooped BusRd, a Modified line flushes and goes directly to Invalid instead of Shared –Anticipates no more reads from this processor –A performance trade-off: it saves the "invalidation" trip if the requesting cache writes the shared line later

24 MSI Writeback Invalidation Protocol (Combined) Processor-initiated: –Invalid, PrRd / BusRd → Shared; Shared, PrRd / ---; Modified, PrRd / --- –Invalid or Shared, PrWr / BusRdX → Modified; Modified, PrWr / --- Bus-snooper-initiated: –Modified, BusRd / Flush → Shared; Modified, BusRdX / Flush → Invalid –Shared, BusRd / ---; Shared, BusRdX / --- → Invalid

25 MSI Example Three processors P1–P3 with private caches on a shared bus; X is initially 10 in memory.

Processor Action     | State in P1 | State in P2 | State in P3 | Bus Transaction | Data Supplier
P1 reads X           | S           | ---         | ---         | BusRd           | Memory
P3 reads X           | S           | ---         | S           | BusRd           | Memory
P3 writes X (X= -25) | I           | ---         | M           | BusRdX          | Memory
P1 reads X           | S           | ---         | S           | BusRd           | P3 Cache
P2 reads X           | S           | S           | S           | BusRd           | Memory

P3's write invalidates P1's copy. P1's next read is supplied by P3's cache, which flushes X= -25 (updating memory) and drops to Shared. P2's read is then satisfied from the now-consistent memory.
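The example above can be replayed with a small bus simulator. This is an illustrative sketch (class and function names are mine, not the lecture's): each cache holds one MSI state for X, misses broadcast a bus transaction, and the snoop logic determines the data supplier.

```python
# Minimal MSI bus simulator for a single line, following the
# transitions in the slides above.  States: M, S, I.

M, S, I = "M", "S", "I"

class MSICache:
    def __init__(self):
        self.state = I

def bus_op(caches, requester, tx):
    """Apply a bus transaction to all snooping caches; return the supplier."""
    supplier = "Memory"
    for i, c in enumerate(caches):
        if i == requester:
            continue
        if tx == "BusRd":
            if c.state == M:
                c.state = S              # flush dirty data, drop to Shared
                supplier = f"P{i+1} Cache"
        elif tx == "BusRdX":
            if c.state == M:
                supplier = f"P{i+1} Cache"
            c.state = I                  # all other copies are invalidated
    return supplier

def read(caches, p):
    c = caches[p]
    if c.state == I:                     # read miss -> BusRd
        supplier = bus_op(caches, p, "BusRd")
        c.state = S
        return "BusRd", supplier
    return None, None                    # hit, no bus traffic

def write(caches, p):
    c = caches[p]
    if c.state != M:                     # S or I -> BusRdX (read-to-own)
        supplier = bus_op(caches, p, "BusRdX")
        c.state = M
        return "BusRdX", supplier
    return None, None                    # write hit in M is silent
```

Replaying "P1 reads X, P3 reads X, P3 writes X, P1 reads X, P2 reads X" with three `MSICache` objects reproduces the bus-transaction and data-supplier columns of the table.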

30 MESI Writeback Invalidation Protocol Reduces two types of unnecessary bus transactions –A BusRdX that snoops and converts the block from S to M when yours is the sole copy of the block –A BusRd that installs the line in the S state when there are no sharers (which leads to the overhead above) Introduces the Exclusive state –One can write to an Exclusive copy without generating a BusRdX Illinois Protocol: proposed by Papamarcos and Patel in 1984 Employed in Intel, PowerPC, and MIPS processors

31 MESI Writeback Invalidation Protocol: Processor Requests (Illinois Protocol) S: shared signal, asserted by snoopers on a BusRd when they hold the line Processor-initiated transitions: –Invalid, PrRd / BusRd → Exclusive if the shared signal is not asserted (not-S) –Invalid, PrRd / BusRd → Shared if the shared signal is asserted (S) –Exclusive, PrWr / --- → Modified (silent upgrade, no bus transaction) –Exclusive, PrRd / ---; Shared, PrRd / ---; Modified, PrRd, PrWr / --- –Shared or Invalid, PrWr / BusRdX → Modified

32 MESI Writeback Invalidation Protocol: Bus Transactions (Illinois Protocol) Bus-snooper-initiated transitions: –Modified: BusRd / Flush → Shared; BusRdX / Flush → Invalid –Exclusive: BusRd / Flush (or ---) → Shared; BusRdX / --- → Invalid (the line is clean, so memory can supply it) –Shared: BusRd / Flush*; BusRdX / Flush* → Invalid Flush*: flush for the data supplier; no action for the other sharers Whenever possible, the Illinois protocol performs $-to-$ transfer rather than having memory supply the data –Use a selection algorithm if there are multiple suppliers –(Alternative: add an O state, or force an update of memory) Most MESI implementations simply write back to memory

33 MESI Writeback Invalidation Protocol (Illinois Protocol): combined state diagram, merging the processor-initiated transitions of slide 31 with the bus-snooper-initiated transitions of slide 32.
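The MESI additions over MSI are easiest to see on the processor side. The sketch below (my own naming, not the lecture's code) shows the two key behaviors: a read miss samples the shared signal to choose between E and S, and a write in E upgrades to M silently, with no BusRdX.

```python
# Sketch of the MESI processor-side transitions (Illinois protocol).
# shared_signal models the S wire sampled during a BusRd.

def mesi_processor(state, op, shared_signal=False):
    """Return (next_state, bus_transaction) for one processor event."""
    if op == "PrRd":
        if state == "I":                     # miss: E if no sharers, else S
            return ("S" if shared_signal else "E"), "BusRd"
        return state, None                   # hit in M/E/S, no bus traffic
    if op == "PrWr":
        if state in ("M", "E"):              # E -> M is the silent upgrade
            return "M", None
        return "M", "BusRdX"                 # S or I must invalidate sharers
    raise ValueError(op)
```

Compare `mesi_processor("E", "PrWr")`, which returns `("M", None)`, with the MSI protocol, where the same write from S would have cost a BusRdX.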

34 MOESI Protocol Adds one additional state, the Owner state –Similar to the Shared state –The processor in the O state is responsible for supplying the data (the copy in memory may be stale) Employed by –Sun UltraSparc –AMD Opteron In the dual-core Opteron, cache-to-cache transfer is done through a System Request Interface (SRI) running at full CPU speed Figure: two CPU cores with private L2s connect to the System Request Interface, which feeds a crossbar linking HyperTransport and the memory controller.

35 Cache State Transitions: Based on CPU Requests States: Invalid, Shared, Exclusive (H&P's three-state protocol; "Exclusive" here is the modified/owned state) –Invalid, CPU read miss: place read miss on bus → Shared –Shared, CPU read hit: no action; CPU read miss: place read miss on bus –Shared, CPU write: place write miss on bus → Exclusive –Shared, CPU write hit: place invalidate on bus → Exclusive –Exclusive, CPU read hit / CPU write hit: no action –Exclusive, CPU read miss: write back block, place read miss on bus → Shared –Exclusive, CPU write miss: write back block, place write miss on bus Each cache line carries state bits alongside the tag and data. From S. Yalamanchili; from H&P, Section 4.2

36 Cache State Transitions: Based on Bus Requests –Shared, write miss for this block (or invalidate for this block) → Invalid –Exclusive, read miss for this block: write back block, abort memory access → Shared –Exclusive, write miss for this block: write back block, abort memory access → Invalid From S. Yalamanchili; from H&P, Section 4.2

37 Implication on Multi-Level Caches How do you guarantee coherence in a multi-level cache hierarchy? –Snoop all cache levels? –Intel's 8870 chipset has a "snoop filter" for quad-core Maintain the inclusion property –Ensure that any data present in the inner level is also present in the outer level –Only snoop the outermost level (e.g., L2) –L2 needs to know when L1 has write hits Use a write-through L1 cache Or use a write-back L1 but maintain an additional "modified-but-stale" bit in L2

38 Inclusion Property Not so easy to maintain... –Replacement: different levels observe different access activity, e.g., L2 may replace a line that is frequently accessed in L1 –Split L1 caches: imagine all caches are direct-mapped; I-side and D-side lines can conflict in L2 –Different cache line sizes between levels

39 Inclusion Property Use specific cache configurations –E.g., a direct-mapped L1 + a bigger direct-mapped or set-associative L2 with the same cache line size Explicitly propagate L2 actions to L1 –An L2 replacement flushes the corresponding L1 line –An observed BusRdX bus transaction invalidates the corresponding L1 line –To avoid excess traffic, L2 maintains an inclusion bit per line for filtering (indicating whether the line is in L1 or not)
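The two propagation rules above can be sketched as a toy two-level hierarchy. This is an illustration under my own naming (not a real cache model): an inclusion bit per L2 line filters which snoops and evictions must be forwarded to L1.

```python
# Toy sketch of inclusion maintenance: L2 back-invalidates L1 on its
# own replacements and forwards snooped BusRdX only when the per-line
# inclusion bit says the line is also in L1.

class InclusiveL2Line:
    def __init__(self, tag):
        self.tag = tag
        self.in_l1 = False               # the "inclusion bit"

class TwoLevel:
    def __init__(self):
        self.l1 = set()                  # tags present in L1
        self.l2 = {}                     # tag -> InclusiveL2Line

    def l1_fill(self, tag):
        self.l1.add(tag)
        self.l2[tag].in_l1 = True        # record that L1 holds a copy

    def l2_replace(self, tag):
        # Inclusion: evicting a line from L2 must also evict it from L1.
        line = self.l2.pop(tag)
        if line.in_l1:
            self.l1.discard(tag)         # back-invalidate L1

    def snoop_busrdx(self, tag):
        # Only the outermost level snoops; forward inward when needed.
        line = self.l2.get(tag)
        if line is None:
            return "no-action"
        forwarded = line.in_l1
        if forwarded:
            self.l1.discard(tag)
        del self.l2[tag]
        return "invalidated-L1" if forwarded else "invalidated-L2-only"
```

The inclusion bit is what keeps L1 off the snoop path: a BusRdX for a line that never made it into L1 is absorbed entirely at L2.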

40 Directory-based Coherence Protocol Snooping-based protocol –N transactions for an N-node MP –All caches need to watch every memory request from each processor –Not a scalable solution for maintaining coherence in large shared-memory systems Directory protocol –Directory-based bookkeeping of who has what –HW overhead to keep the directory (~ #lines × #processors bits) Figure: each memory block's directory entry holds a modified bit plus presence bits, one for each node.

41 Directory-based Coherence Protocol Figure: directory entries for cache blocks C(k), C(k+1), …, C(k+j) –1 presence bit for each processor, for each cache block in memory –1 modified bit for each cache block in memory

42 Directory-based Coherence Protocol (Limited Directory) Encoded presence pointers (log2 N bits each); in this example each cache line can reside in at most 2 processors –The encoding must distinguish a NULL pointer from a valid one 1 modified bit for each cache block in memory Figure: 16 processors (P0–P15); each directory entry holds two 4-bit pointers plus the modified bit.
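The storage trade-off between the full-map entries of slide 41 and the limited directory above can be made concrete. The sketch below (class names and the overflow policy are my own illustration) computes the per-block overhead of each scheme and models the two entry formats.

```python
# Sketch of the two directory-entry encodings: full-map presence bits
# vs a limited directory with k pointers of ceil(log2 N) bits each.
import math

def fullmap_bits_per_block(n_procs):
    return n_procs + 1                       # presence vector + modified bit

def limited_bits_per_block(n_procs, k_pointers):
    return k_pointers * math.ceil(math.log2(n_procs)) + 1

class FullMapEntry:
    def __init__(self, n_procs):
        self.presence = [False] * n_procs
        self.modified = False

    def add_sharer(self, p):
        self.presence[p] = True

    def sharers(self):
        return [p for p, bit in enumerate(self.presence) if bit]

class LimitedEntry:
    """Tracks at most k sharers exactly.  On overflow, a real design
    would broadcast or evict a sharer; this sketch just raises."""
    def __init__(self, k):
        self.k = k
        self.pointers = []                   # encoded processor ids
        self.modified = False

    def add_sharer(self, p):
        if p not in self.pointers:
            if len(self.pointers) == self.k:
                raise OverflowError("sharing set exceeds pointer count")
            self.pointers.append(p)
```

For the 16-processor example above, a full-map entry costs 17 bits per block while a 2-pointer limited entry costs 2 × 4 + 1 = 9 bits, at the price of handling sharing sets larger than 2.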

43 Distributed Directory Coherence Protocol A centralized directory is less scalable (contention) Distributed shared memory (DSM) for a large MP system The interconnection network is no longer a shared bus Maintain cache coherence (CC-NUMA) Each address has a "home" node Figure: each node contains a processor, cache, a slice of memory, and the directory for that slice.

44 Some Additional Concepts Local node: generates a memory reference Home node: the physical memory location of a memory reference Remote node: has a copy of the block The directory entry indicates the state of cached blocks and the members of the sharing set Assumption: network messages are received in the order sent From S. Yalamanchili

45 Distributed Directory Coherence Protocol Stanford DASH (4 CPUs in each cluster, 16 clusters total) –Invalidation-based cache coherence –The directory keeps one of 3 states for each cache block at its home node: Uncached Shared (unmodified) Dirty Figure: each cluster has a snoop bus connecting its processors, caches, memory, and directory; clusters communicate over the interconnection network.

46 DASH Memory Hierarchy Processor level Local cluster level Home cluster level (where the address resides) –If the block is dirty, it must be fetched from the remote node that owns it Remote cluster level

47 Directory Coherence Protocol: Read Miss Figure: a node misses on a read of Z and goes to Z's home node. Z is clean (shared), so home memory supplies the data and sets the requester's presence bit (the presence vector 0011 gains the new sharer).

48 Directory Coherence Protocol: Read Miss Figure: the read miss goes to the home node, but Z is dirty at another node. The home responds with the owner's identity; the requester sends a data request to the owner, which supplies Z. Z then becomes clean, shared by 3 nodes (presence bits updated, modified bit cleared).

49 Directory Coherence Protocol: Write Miss Figure: a write miss on Z goes to the home node, which responds with the list of sharers. The requester (P0) sends invalidations to every sharer and collects the ACKs; once all copies are invalidated, the directory marks P0 as the exclusive owner and the write of Z proceeds in P0.
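The three miss cases on slides 47–49 can be sketched as a home-node handler. This is an illustrative model under my own message names (not DASH's actual protocol messages): the directory tracks presence bits and a modified bit per block, and each miss returns what the home would send.

```python
# Sketch of a home node's directory handling read and write misses.
# Clean read: home memory supplies data and adds the sharer.
# Dirty read: home forwards to the owner; the block becomes clean/shared.
# Write miss: home names the sharers to invalidate, then grants ownership.

class DirEntry:
    def __init__(self, n):
        self.presence = [False] * n
        self.modified = False

class HomeNode:
    def __init__(self, n_procs):
        self.n = n_procs
        self.dir = {}                    # block -> DirEntry

    def entry(self, block):
        return self.dir.setdefault(block, DirEntry(self.n))

    def read_miss(self, block, requester):
        e = self.entry(block)
        if e.modified:
            owner = e.presence.index(True)
            # Forward to the owner; the owner flushes and the block
            # becomes clean, shared by owner and requester.
            e.modified = False
            e.presence[requester] = True
            return {"supplier": owner, "forwarded": True}
        e.presence[requester] = True
        return {"supplier": "home-memory", "forwarded": False}

    def write_miss(self, block, requester):
        e = self.entry(block)
        invalidees = [p for p, bit in enumerate(e.presence)
                      if bit and p != requester]
        # Invalidations go out and ACKs come back (modeled here as the
        # returned list); then the requester becomes the exclusive owner.
        e.presence = [False] * self.n
        e.presence[requester] = True
        e.modified = True
        return {"invalidate": invalidees}
```

Running the slide sequence: two clean reads build up the sharing set, a write miss returns exactly those sharers to invalidate, and a later read miss is forwarded to the new owner.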

50 Directory Protocol: Some General Features The sharing set is the set of processors with a copy of a memory block –Implementations: bit vectors with fully mapped entries, or linked lists When using a linked directory –Update messages are propagated along the list –The requester is added to the head of the list Invalidations reduce the size of the sharing set, while updates increase it –Updates reduce new requests for the line –Invalidations increase network traffic From S. Yalamanchili

51 The Local Cache State Machine States per cache line (alongside tag and data): Invalid, Shared, Modified CPU-initiated transitions: –Invalid, CPU read: send read-miss msg to home → Shared –Shared, CPU read hit: no action; CPU read miss: send read-miss msg –Shared, CPU write: send write-miss msg (invalidate msg on a write hit) → Modified –Modified, CPU read hit / write hit: no action –Modified, CPU read miss: write data back, send read-miss msg → Shared –Modified, CPU write miss: write data back, send write-miss msg Directory-initiated transitions: –Shared, Invalidate → Invalid –Modified, Fetch: write data back → Shared –Modified, Fetch/Invalidate: write data back → Invalid From S. Yalamanchili; from H&P, Section 4.2

52 The Directory State Machine Note that the state of a memory block refers to the state of its copies in remote caches –Uncached, read miss: data value reply; Sharers = {P} → Shared –Uncached, write miss: data value reply; Sharers = {P} → Exclusive –Shared, read miss: data value reply; Sharers = Sharers + {P} –Shared, write miss: invalidate all sharers; data value reply; Sharers = {P} → Exclusive –Exclusive, read miss: fetch from owner; data value reply; Sharers = Sharers + {P} → Shared –Exclusive, write miss: fetch/invalidate owner; data value reply; Sharers = {P} –Exclusive, data write back: Sharers = {} → Uncached From S. Yalamanchili; from H&P, Section 4.2