
Slide 1: ECE 4100/6100 Advanced Computer Architecture, Lecture 13: Multiprocessor and Memory Coherence. Prof. Hsien-Hsin Sean Lee, School of Electrical and Computer Engineering, Georgia Institute of Technology

Slide 2: Memory Hierarchy in a Multiprocessor
[Diagram: four organizations of processors (P), caches ($), and memory]
–Shared cache
–Bus-based shared memory
–Fully-connected shared memory (dancehall)
–Distributed shared memory (each node has its own memory, connected through an interconnection network)

Slide 3: Cache Coherency
The closest cache level is private
Multiple copies of a cache line can be present across different processor nodes
Local updates
–Lead to an incoherent state
–The problem occurs in both write-through and writeback caches
Bus-based interconnect: writes are globally visible
Point-to-point interconnect: writes are visible only to the communicating processor nodes

Slide 4: Example (Writeback Cache)
[Diagram: three processors with private caches above a shared memory holding X=-100; one processor writes X=505 into its writeback cache; the other processors' reads (Rd?) still see the stale X=-100]

Slide 5: Example (Write-through Cache)
[Diagram: one processor writes X=505, which is written through to memory; another cache still holds the stale copy X=-100, so its read (Rd?) returns stale data]

Slide 6: Defining Coherence
An MP is coherent if the results of any execution of a program can be reconstructed by a hypothetical serial order
Implicit in this definition of coherence:
Write propagation
–Writes are visible to other processes
Write serialization
–All writes to the same location are seen in the same order by all processes (when this holds for writes to all locations, it is called write atomicity)
–E.g., if w1 followed by w2 is seen by a read from P1, the two writes will be seen in that same order by all reads from the other processors Pi

Slide 7: Sounds Easy?
[Diagram: four processors P0–P3; at T1 both A=0 and B=0; P0 writes A=1 and P1 writes B=2; by T3, one reader sees A's update before B's while another sees B's update before A's, breaking serialization of writes across locations]
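The interleaving on this slide can be demonstrated (nondeterministically, and only on sufficiently weak hardware) with a small program. The following C++ sketch is not from the lecture and all names in it are illustrative: two writer threads play P0 and P1, two reader threads play P2 and P3, and relaxed atomics model a machine that provides no write atomicity.

```cpp
// Sketch of the slide-7 scenario: two writers update A and B; two readers
// may observe the two updates in opposite orders under relaxed atomics.
#include <atomic>
#include <cstdio>
#include <thread>

std::atomic<int> A{0}, B{0};

int main() {
    int r1a = 0, r1b = 0, r2b = 0, r2a = 0;
    std::thread w1([] { A.store(1, std::memory_order_relaxed); });  // "P0"
    std::thread w2([] { B.store(2, std::memory_order_relaxed); });  // "P1"
    std::thread p2([&] {                     // "P2": may see A's update first
        r1a = A.load(std::memory_order_relaxed);
        r1b = B.load(std::memory_order_relaxed);
    });
    std::thread p3([&] {                     // "P3": may see B's update first
        r2b = B.load(std::memory_order_relaxed);
        r2a = A.load(std::memory_order_relaxed);
    });
    w1.join(); w2.join(); p2.join(); p3.join();
    // r1a==1,r1b==0 together with r2b==2,r2a==0 means the two readers
    // disagreed on the order of the two writes.
    std::printf("P2 saw A=%d B=%d; P3 saw B=%d A=%d\n", r1a, r1b, r2b, r2a);
    return 0;
}
```

On non-multi-copy-atomic machines (e.g. POWER) the disagreeing outcome is observable in practice; x86's TSO model forbids it.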

Slide 8: Bus Snooping Based on a Write-Through Cache
Every write appears as a transaction on the shared bus to memory
Two protocols
–Update-based protocol
–Invalidation-based protocol

Slide 9: Bus Snooping (Update-based Protocol on Write-Through Cache)
Each processor's cache controller constantly snoops on the bus
Local copies are updated upon a snoop hit
[Diagram: one processor writes X=505; the bus transaction updates both memory and the other cache's copy of X from -100 to 505]

Slide 10: Bus Snooping (Invalidation-based Protocol on Write-Through Cache)
Each processor's cache controller constantly snoops on the bus
Local copies are invalidated upon a snoop hit
[Diagram: one processor writes X=505, invalidating the other cache's copy; that processor's next Load X misses and fetches X=505]

Slide 11: A Simple Snoopy Coherence Protocol for a WT, No-Write-Allocate Cache
[State diagram with two states, Invalid and Valid; notation is observed event / bus transaction issued; solid arcs are processor-initiated, dashed arcs are bus-snooper-initiated]
–Invalid, PrRd / BusRd → Valid
–Valid, PrRd / --- (read hit, no bus traffic)
–Valid or Invalid, PrWr / BusWr (write-through; no allocation on a write miss)
–Valid, BusWr / --- → Invalid (snooper observes another cache's write)
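As a concrete rendering of this two-state diagram, here is a minimal C++ sketch of the Valid/Invalid controller for a single cache line. It is not from the slides; the enum and function names are my own.

```cpp
// VI protocol for a write-through, no-write-allocate cache (slide 11).
#include <cstdio>

enum class State { Invalid, Valid };
enum class Event { PrRd, PrWr, BusWr };      // processor and snooped events
enum class BusTx { None, BusRd, BusWr };     // transaction we put on the bus

// Returns the bus transaction to issue and updates the line state in place.
BusTx step(State& s, Event e) {
    switch (e) {
    case Event::PrRd:
        if (s == State::Invalid) { s = State::Valid; return BusTx::BusRd; }
        return BusTx::None;                  // read hit: no bus traffic
    case Event::PrWr:
        return BusTx::BusWr;                 // write-through; no allocation,
                                             // so the state never changes
    case Event::BusWr:
        s = State::Invalid;                  // another cache wrote: invalidate
        return BusTx::None;
    }
    return BusTx::None;
}

int main() {
    State line = State::Invalid;
    step(line, Event::PrRd);                 // miss -> BusRd, now Valid
    step(line, Event::BusWr);                // remote write -> Invalid
    std::printf("state=%s\n", line == State::Valid ? "Valid" : "Invalid");
}
```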

Slide 12: How about a Writeback Cache?
A WB cache reduces the bandwidth requirement
The majority of local writes are hidden behind the processor nodes
Two questions: how to snoop, and how to preserve write ordering?

Slide 13: Cache Coherence Protocols for WB Caches
A cache has an exclusive copy of a line if
–It is the only cache holding a valid copy
–Memory may or may not have it
Modified (dirty) cache line
–The cache holding the line is the owner of the line, because it must supply the block

Slide 14: Cache Coherence Protocol (Update-based Protocol on Writeback Cache)
Update the data for all processor nodes that share the same data
If a processor node keeps updating the same memory location, a lot of traffic is incurred
[Diagram: Store X=505 in one cache triggers an update bus transaction that refreshes the sharer's copy of X]

Slide 15: Cache Coherence Protocol (Update-based Protocol on Writeback Cache), continued
[Diagram: the sharer's Load X now hits (X=505); a further Store X=333 triggers another update bus transaction]

Slide 16: Cache Coherence Protocol (Invalidation-based Protocol on Writeback Cache)
Invalidate the copies held by the sharing processor nodes
Traffic is reduced when a processor node keeps updating the same memory location
[Diagram: Store X=505 in one cache triggers an invalidate bus transaction that invalidates the sharer's copy of X]

Slide 17: Cache Coherence Protocol (Invalidation-based Protocol on Writeback Cache), continued
[Diagram: the invalidated sharer's Load X now misses; the bus snoop hits in the owning cache, which supplies X=505]

Slide 18: Cache Coherence Protocol (Invalidation-based Protocol on Writeback Cache), continued
[Diagram: subsequent stores (X=333, 987, 444) by the owning processor hit locally in the Modified line and generate no bus transactions]

Slide 19: MSI Writeback Invalidation Protocol
Modified
–Dirty
–Only this cache has a valid copy
Shared
–Memory is consistent
–One or more caches have a valid copy
Invalid
Writeback protocol: a cache line can be written multiple times before memory is updated.

Slide 20: MSI Writeback Invalidation Protocol
Two types of requests from the processor
–PrRd
–PrWr
Three types of bus transactions posted by the cache controller
–BusRd: PrRd misses the cache; memory or another cache supplies the line
–BusRdX (read-to-own): PrWr is issued to a line that is not in the Modified state
–BusWB: writeback due to replacement; the processor is not directly involved in initiating this transaction

Slide 21: MSI Writeback Invalidation Protocol (Processor Requests)
[State diagram, processor-initiated transitions]
–I, PrRd / BusRd → S
–I, PrWr / BusRdX → M
–S, PrRd / --- (stay S)
–S, PrWr / BusRdX → M
–M, PrRd / --- and M, PrWr / --- (stay M)

Slide 22: MSI Writeback Invalidation Protocol (Bus Transactions)
Flush puts the data on the bus; both memory and the requestor grab the copy
The requestor gets the data by
–cache-to-cache transfer, or
–memory
[State diagram, bus-snooper-initiated transitions]
–M, BusRd / Flush → S
–M, BusRdX / Flush → I
–S, BusRd / --- (stay S)
–S, BusRdX / --- → I

Slide 23: MSI Writeback Invalidation Protocol (Bus Transactions)
Another possible, valid implementation: on a snooped BusRd, go from M directly to I (M, BusRd / Flush → I)
–Anticipates no more reads from this processor
–A performance trade-off: it saves the later "invalidation" trip if the requesting cache writes the shared line, at the cost of re-fetching if this processor reads again

Slide 24: MSI Writeback Invalidation Protocol
[Combined state diagram: the processor-initiated transitions of slide 21 together with the bus-snooper-initiated transitions of slide 22; a sketch follows]
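Combining the processor arcs of slide 21 with the snooper arcs of slide 22 gives a single per-line transition function. A minimal C++ sketch follows; identifiers are mine, and it implements the slide-22 behavior on snoops rather than the slide-23 variant.

```cpp
// Per-line MSI controller (slides 21-24).
#include <cstdio>

enum class State { M, S, I };
enum class Event { PrRd, PrWr, BusRd, BusRdX };
enum class Action { None, BusRd, BusRdX, Flush };

Action step(State& s, Event e) {
    switch (s) {
    case State::I:
        if (e == Event::PrRd)   { s = State::S; return Action::BusRd;  }
        if (e == Event::PrWr)   { s = State::M; return Action::BusRdX; }
        return Action::None;                 // snoops on Invalid: ignore
    case State::S:
        if (e == Event::PrWr)   { s = State::M; return Action::BusRdX; }
        if (e == Event::BusRdX) { s = State::I; return Action::None;   }
        return Action::None;                 // PrRd hit and BusRd: no action
    case State::M:
        if (e == Event::BusRd)  { s = State::S; return Action::Flush;  }
        if (e == Event::BusRdX) { s = State::I; return Action::Flush;  }
        return Action::None;                 // PrRd/PrWr hit in M
    }
    return Action::None;
}

int main() {
    State line = State::I;
    step(line, Event::PrRd);                 // I -> S via BusRd
    step(line, Event::PrWr);                 // S -> M via BusRdX
    step(line, Event::BusRd);                // M -> S, Flush data on the bus
    std::printf("final state: %s\n", line == State::S ? "S" : "other");
}
```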

Slides 25–29: MSI Example
Initially memory holds X=10 and the line is Invalid in all three caches (P1, P2, P3). Each step adds one row:

Processor Action    | State in P1 | State in P2 | State in P3 | Bus Transaction | Data Supplier
P1 reads X          | S           | ---         | ---         | BusRd           | Memory
P3 reads X          | S           | ---         | S           | BusRd           | Memory
P3 writes X (X=-25) | I           | ---         | M           | BusRdX          | ---
P1 reads X          | S           | ---         | S           | BusRd           | P3's cache (Flush; memory updated to X=-25)
P2 reads X          | S           | S           | S           | BusRd           | Memory

Slide 30: MESI Writeback Invalidation Protocol
Goal: eliminate two types of unnecessary bus transactions
–A BusRdX that snoops and upgrades the block from S to M when you are already the sole owner of the block
–A BusRd that fetches the line in the S state when there are no sharers (which leads to the overhead above)
Introduce the Exclusive state
–A processor can write to an Exclusive copy without generating a BusRdX
Illinois protocol: proposed by Papamarcos and Patel in 1984
Employed in Intel, PowerPC, and MIPS processors

Slide 31: MESI Writeback Invalidation Protocol, Processor Requests (Illinois Protocol)
S is the shared signal, asserted by snoopers that hold the line.
[State diagram, processor-initiated transitions]
–I, PrRd / BusRd, shared signal not asserted (not-S) → E
–I, PrRd / BusRd, shared signal asserted (S) → S
–I, PrWr / BusRdX → M
–E, PrRd / --- (stay E); E, PrWr / --- → M (silent upgrade, no bus transaction)
–S, PrRd / --- (stay S); S, PrWr / BusRdX → M
–M, PrRd, PrWr / --- (stay M)

Slide 32: MESI Writeback Invalidation Protocol, Bus Transactions (Illinois Protocol)
[State diagram, bus-snooper-initiated transitions]
–M, BusRd / Flush → S
–M, BusRdX / Flush → I
–E, BusRd / Flush (or ---) → S
–E, BusRdX / Flush → I
–S, BusRd / Flush* → S
–S, BusRdX / Flush* → I
Flush*: Flush for the data supplier; no action for the other sharers
Whenever possible, the Illinois protocol performs $-to-$ transfer rather than having memory supply the data
Use a selection algorithm if there are multiple potential suppliers (alternatives: add an O state, or force an update of memory)
Most MESI implementations simply write back to memory

Slide 33: MESI Writeback Invalidation Protocol (Illinois Protocol)
[Combined state diagram: the processor-initiated transitions of slide 31 together with the bus-snooper-initiated transitions of slide 32; a sketch follows]
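The MESI machine extends the MSI sketch above with the E state and the shared signal sampled on BusRd. The following minimal C++ sketch makes the same assumptions as the earlier one; the Flush* subtlety (only the selected supplier flushes from S) is simplified to no action here.

```cpp
// Per-line Illinois MESI controller (slides 31-32). `shared_line` models
// the shared signal sampled when the line is fetched with BusRd.
#include <cstdio>

enum class State { M, E, S, I };
enum class Event { PrRd, PrWr, BusRd, BusRdX };
enum class Action { None, BusRd, BusRdX, Flush };

Action step(State& s, Event e, bool shared_line) {
    switch (s) {
    case State::I:
        if (e == Event::PrRd)   { s = shared_line ? State::S : State::E;
                                  return Action::BusRd; }
        if (e == Event::PrWr)   { s = State::M; return Action::BusRdX; }
        return Action::None;
    case State::E:
        if (e == Event::PrWr)   { s = State::M; return Action::None; }  // silent upgrade
        if (e == Event::BusRd)  { s = State::S; return Action::Flush; } // supply data
        if (e == Event::BusRdX) { s = State::I; return Action::Flush; }
        return Action::None;                 // PrRd hit in E
    case State::S:
        if (e == Event::PrWr)   { s = State::M; return Action::BusRdX; }
        if (e == Event::BusRdX) { s = State::I; return Action::None; }  // Flush* simplified
        return Action::None;                 // PrRd hit, BusRd as a non-supplier
    case State::M:
        if (e == Event::BusRd)  { s = State::S; return Action::Flush; }
        if (e == Event::BusRdX) { s = State::I; return Action::Flush; }
        return Action::None;                 // PrRd/PrWr hit in M
    }
    return Action::None;
}

int main() {
    State line = State::I;
    step(line, Event::PrRd, /*shared_line=*/false);  // I -> E (no sharers)
    step(line, Event::PrWr, false);                  // E -> M, no bus traffic
    std::printf("final state: %s\n", line == State::M ? "M" : "other");
}
```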

Slide 34: MOESI Protocol
Adds one more state: the Owned state
–Similar to the Shared state, but the processor in the O state is responsible for supplying the data (the copy in memory may be stale)
Employed by
–Sun UltraSPARC
–AMD Opteron: in the dual-core Opteron, cache-to-cache transfer is done through a System Request Interface (SRI) running at full CPU speed
[Diagram: dual-core Opteron with per-core L2s attached to the System Request Interface, crossbar, HyperTransport links, and memory controller]

Slide 35: Cache State Transitions Based on CPU Requests
[Diagram: cache array; each line holds state bits (invalid, shared, or exclusive), a tag, and data, with a mux selecting the addressed word]
State diagram (H&P naming: "Exclusive" here plays the role of Modified in MSI):
–Invalid, CPU read: place read miss on bus → Shared
–Invalid, CPU write: place write miss on bus → Exclusive
–Shared, CPU read hit: no action
–Shared, CPU read miss: place read miss on bus (replacement; stay Shared)
–Shared, CPU write hit: place invalidate on bus → Exclusive
–Shared, CPU write miss: place write miss on bus → Exclusive
–Exclusive, CPU read hit or write hit: no action
–Exclusive, CPU read miss: write back block, place read miss on bus → Shared
–Exclusive, CPU write miss: write back cache block, place write miss on bus (stay Exclusive)
From S. Yalamanchili; from H&P, Section 4.2

Slide 36: Cache State Transitions Based on Bus Requests
[Diagram: block A migrating between P1's and P2's caches through memory and the network]
–Shared, write miss for this block → Invalid
–Shared, invalidate for this block → Invalid
–Exclusive, read miss for this block: write back block, abort memory access → Shared
–Exclusive, write miss for this block: write back block, abort memory access → Invalid
From S. Yalamanchili; from H&P, Section 4.2

Slide 37: Implication on Multi-Level Caches
How do you guarantee coherence in a multi-level cache hierarchy?
–Snoop all cache levels? Intel's 8870 chipset has a "snoop filter" for quad-core
–Maintain the inclusion property
Ensure that any data present in the inner level is also present in the outer level
Then only the outermost level (e.g. L2) needs to snoop
L2 needs to know when L1 has write hits
–Use a write-through L1 cache, or
–Use writeback and maintain an extra "modified-but-stale" bit in L2

Slide 38: Inclusion Property
Not so easy to maintain …
–Replacement: different levels observe different access activities, e.g. L2 may replace a line that is frequently accessed in L1
–Split L1 caches: imagine all caches are direct-mapped
–Different cache line sizes between levels

Slide 39: Inclusion Property
Use specific cache configurations
–E.g., a direct-mapped L1 plus a bigger direct-mapped or set-associative L2 with the same cache line size
Explicitly propagate L2 actions to L1
–An L2 replacement flushes the corresponding L1 line
–An observed BusRdX bus transaction invalidates the corresponding L1 line
–To avoid excess traffic, L2 maintains an inclusion bit per line to filter these probes (indicating whether the line is in L1 or not), as sketched below
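A minimal sketch of the inclusion-bit filtering described in the last bullet; the struct and field names are hypothetical, not from any real design.

```cpp
// L2 line with an inclusion bit: inward probes go to L1 only when set.
#include <cstdint>
#include <cstdio>

struct L2Line {
    std::uint64_t tag = 0;
    bool valid = false;
    bool dirty = false;   // modified in L2
    bool in_l1 = false;   // inclusion bit: set on L1 fill, cleared on L1 evict
};

// Returns true if a snooped BusRdX or an L2 replacement of this line must
// be propagated to L1 (flush/invalidate the matching L1 line).
bool must_probe_l1(const L2Line& line) { return line.valid && line.in_l1; }

int main() {
    L2Line line{0xBEEF, true, false, true};
    if (must_probe_l1(line))              // observed BusRdX or L2 eviction
        std::printf("forward invalidation of tag %#llx to L1\n",
                    (unsigned long long)line.tag);
}
```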

Slide 40: Directory-based Coherence Protocol
Snooping-based protocols
–Every transaction must be seen by all N nodes of an N-node MP
–All caches need to watch every memory request from every processor
–Not a scalable solution for maintaining coherence in large shared-memory systems
Directory protocol
–The directory tracks who has what
–HW overhead to keep the directory (roughly #lines × #processors bits)
[Diagram: processors with caches over an interconnection network; memory holds a directory entry per block with a modified bit and presence bits, one per node]

Slide 41: Directory-based Coherence Protocol
[Diagram: memory-resident directory; each cache block in memory has a directory entry, e.g.:
C(k):   M=1, presence=1100000
C(k+1): M=0, presence=0000101
C(k+j): M=0, presence=0010100]
1 presence bit per processor for each cache block in memory
1 modified bit for each cache block in memory
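A minimal C++ sketch of such a full-map directory entry; the node count and the names are assumptions for illustration.

```cpp
// Full-map directory entry: one presence bit per node plus a modified bit.
#include <bitset>
#include <cstdio>

constexpr int kNodes = 8;                 // assumed node count

struct DirEntry {
    bool modified = false;                // 1 modified bit per block
    std::bitset<kNodes> presence;         // 1 presence bit per node
};

int main() {
    DirEntry e;
    e.presence.set(1);                    // node 1 reads the block
    e.presence.set(4);                    // node 4 reads the block
    // Node 4 writes: invalidate the other sharers, set modified, sole owner.
    e.presence.reset(); e.presence.set(4); e.modified = true;
    std::printf("modified=%d sharers=%s\n", e.modified,
                e.presence.to_string().c_str());
}
```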

Slide 42: Directory-based Coherence Protocol (Limited Directory)
Encoded presence pointers (log2 N bits each); in this example each cache line can reside in at most 2 processors
1 modified bit for each cache block in memory
An extra encoding marks whether a pointer slot is NULL (empty) or not
[Diagram: 16 processors P0–P15; a directory entry holds the modified bit and two encoded node pointers]
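A minimal C++ sketch of a limited directory entry along these lines, assuming N = 16 and at most two tracked sharers; the overflow case (a third sharer) is where a real design must broadcast or evict a pointer.

```cpp
// Limited directory entry: two encoded node pointers instead of N bits.
#include <array>
#include <cstdio>
#include <optional>

constexpr int kNodes = 16;                      // N = 16 -> 4-bit pointers

struct LimitedDirEntry {
    bool modified = false;
    std::array<std::optional<int>, 2> sharer;   // empty slot models NULL

    // Returns false on overflow: the caller must broadcast or evict.
    bool add_sharer(int node) {
        if (node < 0 || node >= kNodes) return false;
        for (auto& s : sharer)
            if (!s) { s = node; return true; }
        return false;
    }
};

int main() {
    LimitedDirEntry e;
    e.add_sharer(13);
    e.add_sharer(1);
    std::printf("overflow on third sharer: %s\n",
                e.add_sharer(7) ? "no" : "yes");
}
```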

Slide 43: Distributed Directory Coherence Protocol
A centralized directory is less scalable (contention)
Distributed shared memory (DSM) for a large MP system
–The interconnection network is no longer a shared bus
–Cache coherence is still maintained (CC-NUMA)
–Each address has a "home" node
[Diagram: nodes, each with processor, cache, directory, and memory, connected by an interconnection network]
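A minimal sketch of the "home" mapping, assuming simple block interleaving of physical addresses across nodes; the block size and node count are made-up parameters.

```cpp
// Block-interleaved home-node mapping for CC-NUMA (slide 43).
#include <cstdint>
#include <cstdio>

constexpr std::uint64_t kBlockBytes = 64;   // assumed cache line size
constexpr std::uint64_t kNodes = 16;        // assumed node count

// Consecutive blocks round-robin across the nodes; the home node owns the
// block's memory and its directory entry.
std::uint64_t home_node(std::uint64_t paddr) {
    return (paddr / kBlockBytes) % kNodes;
}

int main() {
    std::printf("home of 0x12340 is node %llu\n",
                (unsigned long long)home_node(0x12340));
}
```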

Slide 44: Some Additional Concepts
[Diagram: three nodes, each with processor + cache, directory, and memory, connected by a network]
–Local node: the node that generates the memory reference
–Home node: the node holding the physical memory location of the reference
–Remote node: a node that has a copy of the block
–The network delivers messages in the order sent
–A directory entry indicates the state of a cached block and the members of its sharing set
From S. Yalamanchili

Slide 45: Distributed Directory Coherence Protocol
Stanford DASH (4 CPUs in each cluster, 16 clusters total)
–Invalidation-based cache coherence
–The directory keeps one of three states for each cache block at its home node: Uncached, Shared (unmodified), or Dirty
[Diagram: two clusters, each with processors and caches on a snoop bus plus memory and a directory, connected by an interconnection network]

Slide 46: DASH Memory Hierarchy
A miss is serviced at one of four levels:
–Processor level
–Local cluster level
–Home cluster level (where the address is at home); if the block is dirty, it must be fetched from the remote node that owns it
–Remote cluster level
[Same cluster diagram as slide 45]

Slide 47: Directory Coherence Protocol: Read Miss
[Diagram walkthrough: P misses on a read of Z; the request goes to Z's home node; data Z is shared (clean), so the home memory supplies the data and sets the requester's presence bit in the directory entry]

Slide 48: Directory Coherence Protocol: Read Miss (Dirty Block)
[Diagram walkthrough: P misses on a read of Z; the request goes to Z's home node; data Z is dirty, so the home responds with the owner's identity; the requester sends a data request to the owner, which supplies Z; afterwards data Z is clean and shared by 3 nodes]

Slide 49: Directory Coherence Protocol: Write Miss
[Diagram walkthrough: P0 misses on a write of Z; the request goes to Z's home node; the home responds with the list of sharers; the sharers are invalidated and reply with ACKs; once the ACKs arrive, the write to Z can proceed in P0]

Slide 50: Directory Protocol: Some General Features
The sharing set is the set of processors holding a copy of a memory block
–Implementation: bit vectors with fully mapped entries, or linked lists
When using a linked directory (a sketch follows)
–Update messages are propagated along the list
–The requester is added to the head of the list
Invalidations reduce the size of the sharing set, while updates increase it
–Updates reduce new requests for the line
–Invalidations increase network traffic
From S. Yalamanchili
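A minimal C++ sketch of such a linked directory (not modeled on any specific system; the names are mine): the directory stores only the head, each cache contributes a next pointer, and a new requester is spliced in at the head as the slide describes.

```cpp
// Linked-list sharing set: directory holds the head, caches hold `next`.
#include <cstdio>
#include <vector>

constexpr int kNone = -1;

struct LinkedDir {
    int head = kNone;                     // directory stores only the head
    std::vector<int> next;                // per-cache next pointer
    explicit LinkedDir(int nodes) : next(nodes, kNone) {}

    void add_requester(int p) {           // new requester becomes the head
        next[p] = head;
        head = p;
    }

    void invalidate_all() {               // walk the list, emptying the set
        for (int p = head; p != kNone; ) {
            int nxt = next[p];
            std::printf("invalidate at P%d\n", p);
            next[p] = kNone;
            p = nxt;
        }
        head = kNone;
    }
};

int main() {
    LinkedDir dir(8);
    dir.add_requester(3);
    dir.add_requester(5);                 // list: P5 -> P3
    dir.invalidate_all();                 // invalidates P5, then P3
}
```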

Slide 51: The Local Cache State Machine
[Diagram: node with processor + cache, directory, memory, and network; the cache array holds per-line state bits, tag, and data]
States: Invalid, Shared, Modified. Transitions (messages go to the home directory):
–Invalid, CPU read: send read miss message → Shared
–Invalid, CPU write: send write miss message → Modified
–Shared, CPU read hit: no action
–Shared, CPU read miss: send read miss message (replacement; stay Shared)
–Shared, CPU write hit: send invalidate message → Modified
–Shared, Invalidate (from home): → Invalid
–Modified, CPU read hit or write hit: no action
–Modified, CPU read miss: data write-back, send read miss message → Shared
–Modified, CPU write miss: data write-back, send write miss message (stay Modified)
–Modified, Fetch (from home): data write-back → Shared
–Modified, Fetch/Invalidate (from home): data write-back → Invalid
From S. Yalamanchili; from H&P, Section 4.2

Slide 52: The Directory State Machine
Note that the state of a memory block refers to the state of the copies in remote caches.
–Uncached, read miss: data value reply; Sharers = {P} → Shared
–Uncached, write miss: data value reply; Sharers = {P} → Exclusive
–Shared, read miss: data value reply; Sharers = Sharers + {P}
–Shared, write miss: invalidate the sharers; Sharers = {P}; data value reply → Exclusive
–Exclusive, read miss: fetch from owner; data value reply; Sharers = Sharers + {P} → Shared
–Exclusive, write miss: fetch/invalidate at owner; data value reply; Sharers = {P} (stay Exclusive)
–Exclusive, data write-back: Sharers = {} → Uncached
A sketch of these arcs follows.
From S. Yalamanchili; from H&P, Section 4.2
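A minimal C++ sketch of these directory arcs, with message sends reduced to printfs; the identifiers are mine and the network plumbing is elided.

```cpp
// Directory state machine for one memory block (slide 52, after H&P 4.2).
#include <cstdio>
#include <set>

enum class DirState { Uncached, Shared, Exclusive };

struct DirEntry {
    DirState state = DirState::Uncached;
    std::set<int> sharers;                       // the sharing set
};

void read_miss(DirEntry& d, int p) {
    if (d.state == DirState::Exclusive)          // fetch block back from owner
        std::printf("fetch block from owner P%d\n", *d.sharers.begin());
    d.sharers.insert(p);                         // Sharers = Sharers + {P}
    d.state = DirState::Shared;
    std::printf("data value reply to P%d\n", p);
}

void write_miss(DirEntry& d, int p) {
    if (d.state == DirState::Exclusive)
        std::printf("fetch/invalidate at owner P%d\n", *d.sharers.begin());
    else
        for (int s : d.sharers)
            if (s != p) std::printf("invalidate at P%d\n", s);
    d.sharers = {p};                             // Sharers = {P}
    d.state = DirState::Exclusive;
    std::printf("data value reply to P%d\n", p);
}

void data_write_back(DirEntry& d) {              // owner replaces the block
    d.sharers.clear();                           // Sharers = {}
    d.state = DirState::Uncached;
}

int main() {
    DirEntry d;
    read_miss(d, 0);                             // Uncached -> Shared {P0}
    read_miss(d, 2);                             // Shared {P0, P2}
    write_miss(d, 2);                            // invalidate P0 -> Exclusive {P2}
    data_write_back(d);                          // Exclusive -> Uncached
}
```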

