Cache Coherence Protocols
A. Jantsch / Z. Lu / I. Sander

December 15, 2015, SoC Architecture

Formal Definition of Coherence

The results of a program are the values returned by its read operations. A memory system is coherent if the results of any execution of a program are such that it is possible to construct a hypothetical serial order of all operations that is consistent with the results of the execution and in which:
1. operations issued by any particular process occur in the order issued by that process, and
2. the value returned by a read is the value written by the last write to that location in the serial order.

Formal Definition of Coherence (cont.)

Two necessary features:
- Write propagation: a written value must become visible to other processors.
- Write serialization: writes to a location are seen in the same order by all processors; if I see w1 before w2, you should not see w2 before w1. There is no need for an analogous read serialization, since reads are not visible to other processors.

Example: Coherent Memory System

Task A: x:=0; y:=0; Print(x+y);
Task B: x:=1; y:=x+2;

Possible serial orders and the values they print:
- x:=1; y:=x+2; x:=0; y:=0; Print(x+y); prints 0
- x:=0; y:=0; x:=1; y:=x+2; Print(x+y); prints 4
- x:=0; x:=1; y:=x+2; y:=0; Print(x+y); prints 1
- x:=1; x:=0; y:=0; y:=x+2; Print(x+y); prints 2
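The claim that only 0, 4, 1 and 2 can be printed can be checked by brute force. A small sketch (the helper names are made up) that enumerates every interleaving of the two tasks that preserves each task's program order and simulates it:

```python
# Each operation is (target, function-of-state); "print" records the output.
TASK_A = [("x", lambda s: 0), ("y", lambda s: 0),
          ("print", lambda s: s["x"] + s["y"])]
TASK_B = [("x", lambda s: 1), ("y", lambda s: s["x"] + 2)]

def interleavings(a, b):
    """Yield all merges of a and b that keep each list's internal order."""
    if not a:
        yield list(b); return
    if not b:
        yield list(a); return
    for rest in interleavings(a[1:], b):
        yield [a[0]] + rest
    for rest in interleavings(a, b[1:]):
        yield [b[0]] + rest

def run(ops):
    state, out = {"x": 0, "y": 0}, None
    for var, f in ops:
        if var == "print":
            out = f(state)
        else:
            state[var] = f(state)
    return out

results = {run(ops) for ops in interleavings(TASK_A, TASK_B)}
print(sorted(results))  # -> [0, 1, 2, 4]
```

Any other printed value, such as the 3 in the next example, cannot come from a serial order and therefore indicates an incoherent memory system.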

Example: Incoherent Memory System

Task A: x:=0; y:=0; Print(x+y);
Task B: x:=1; y:=x+2;

Observed execution: x:=0; y:=0; x:=1; y:=x+2 (so y = 3); Print(x+y) prints 3, because the print sees the new value y = 3 but still reads a stale x = 0; the write x:=1 has not propagated to Task A. No serial order of these operations can print 3, so the memory system is incoherent.

Snooping-based Cache Coherence

Cache Coherence Using a Bus

- Built on bus transactions and a state transition diagram in each cache.
- Uniprocessor bus transactions already provide two useful properties: the bus serializes transactions, and transactions are visible to all.

Cache Coherence Using a Bus (cont.)

Uniprocessor cache states:
- Effectively, every cache block is a finite state machine.
- Write-through, write no-allocate caches have two states: valid and invalid.
- Write-back, write-allocate caches have one more state: modified ("dirty").

Multiprocessors extend the cache states and bus transactions to implement coherence.

Snooping-based Coherence: Basic Idea

- Transactions on the bus are visible to all processors.
- Processors (or their cache controllers) can snoop (monitor) the bus and take action on relevant events, e.g. change the state of a cached block.

Snooping-based Coherence: Implementing a Protocol

- The cache controller now receives inputs from both sides: requests from its processor, and bus requests/responses from its snooper.
- In either case it takes zero or more actions: update the block's state, respond with data, generate new bus transactions.
- The protocol is a distributed algorithm of cooperating state machines, each defined by a set of states, a state transition diagram, and actions.
- The granularity of coherence is typically the cache block, like that of allocation in the cache and of transfers to/from the cache.

Cache Coherence with Write-Through Caches

Key extensions to the uniprocessor case: snooping, and invalidating or updating other caches.
- No new states or bus transactions are needed in this case.
- Invalidation-based versus update-based protocols.
- Write propagation: even in the invalidation case, later reads will see the new value; the invalidation causes a miss on a later access, and memory is up to date via the write-through.

[Figure: processors P1 ... Pn, each with a cache, on a shared bus to main memory; bus snooping drives the V/I cache-state transitions of the coherence protocol.]

State Transition Diagram (write-through, write no-allocate)

States: Valid (V) = block is in the cache, Invalid (I) = block is not in the cache. Notation: observed event / generated bus transaction.

Processor-initiated transitions:
- I, PrRd/BusRd -> V
- I, PrWr/BusWr -> I (write no-allocate)
- V, PrRd/- -> V
- V, PrWr/BusWr -> V

Bus-snooper-initiated transitions:
- V, BusWr/- -> I

The protocol is executed by each cache controller connected to a processor; the controller receives inputs from both the processor and the bus.
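The transitions above can be sketched as a small simulation. This is a minimal sketch with made-up class names (Bus, WriteThroughCache); data values and memory are elided, only the per-block state machine is modeled:

```python
V, I = "Valid", "Invalid"

class Bus:
    """Broadcast medium: every BusWr is seen by all other snoopers."""
    def __init__(self):
        self.caches = []

    def bus_wr(self, sender, addr):
        for c in self.caches:
            if c is not sender:
                c.snoop_bus_wr(addr)       # BusWr/- : invalidate

class WriteThroughCache:
    """Write-through, write no-allocate: states V and I per block."""
    def __init__(self, bus):
        self.state = {}                    # addr -> V or I (absent == I)
        self.bus = bus
        bus.caches.append(self)

    def pr_rd(self, addr):
        if self.state.get(addr, I) == I:   # I --PrRd/BusRd--> V
            self.state[addr] = V
        # V --PrRd/- --> V: a hit causes no bus traffic

    def pr_wr(self, addr):
        # PrWr/BusWr in both states: the write always goes on the bus
        # and through to memory; no-allocate, so I stays I on a miss.
        self.bus.bus_wr(self, addr)

    def snoop_bus_wr(self, addr):
        self.state[addr] = I

bus = Bus()
c0, c1 = WriteThroughCache(bus), WriteThroughCache(bus)
c0.pr_rd(0x40); c1.pr_rd(0x40)   # both caches hold the block Valid
c0.pr_wr(0x40)                   # c0's BusWr invalidates c1's copy
print(c0.state[0x40], c1.state[0x40])  # -> Valid Invalid
```

This illustrates write propagation in the invalidation style: c1's next read of 0x40 misses and fetches the up-to-date value from memory.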

Ordering

- All writes appear on the bus.
- Read misses appear on the bus and will see the last write in bus order.
- Read hits do not appear on the bus, but the value read was placed in the cache either by this processor's most recent write or by its most recent read miss. Both of those transactions appeared on the bus, so read hits also see values as being produced in a consistent bus order.

Problem with Write-Through

- High bandwidth requirements: every write from every processor goes to the shared bus and to memory.
- This makes write-through especially unpopular for symmetric multiprocessors.
- Write-back caches absorb most writes as cache hits, and write hits do not go on the bus.
- But then how do we ensure write propagation and serialization? We need more sophisticated protocols, with a large design space.

Basic MSI Protocol (for write-back, write-allocate caches)

States:
- Invalid (I)
- Shared (S): memory and one or more caches have a valid copy
- Dirty or Modified (M): only one cache has a modified (dirty) copy

Processor events: PrRd (read), PrWr (write)

Bus transactions:
- BusRd: asks for a copy with no intent to modify
- BusRdX: asks for an exclusive copy with intent to modify
- BusWB: updates memory on a write-back

Actions: update the state, perform a bus transaction, flush the value onto the bus.

MSI State Transition Diagram

Processor-initiated transitions:
- I, PrRd/BusRd -> S
- I, PrWr/BusRdX -> M
- S, PrRd/- -> S
- S, PrWr/BusRdX -> M
- M, PrRd/- -> M
- M, PrWr/- -> M

Bus-snooper-initiated transitions:
- S, BusRd/- -> S
- S, BusRdX/- -> I
- M, BusRd/Flush -> S
- M, BusRdX/Flush -> I
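The diagram is just a lookup table, so it can be exercised directly. A sketch with a hypothetical MSICache class, modeling only the state machines (data transfer, memory, and the Flush payload are elided):

```python
# MSI transition table: (state, event) -> (next_state, bus_action)
# PrRd/PrWr come from the processor, BusRd/BusRdX from the snooper.
MSI = {
    ("I", "PrRd"):   ("S", "BusRd"),
    ("I", "PrWr"):   ("M", "BusRdX"),
    ("S", "PrRd"):   ("S", None),
    ("S", "PrWr"):   ("M", "BusRdX"),
    ("S", "BusRd"):  ("S", None),
    ("S", "BusRdX"): ("I", None),
    ("M", "PrRd"):   ("M", None),
    ("M", "PrWr"):   ("M", None),
    ("M", "BusRd"):  ("S", "Flush"),
    ("M", "BusRdX"): ("I", "Flush"),
}

class MSICache:
    def __init__(self, bus):
        self.state = "I"
        self.bus = bus
        bus.append(self)

    def processor(self, event):            # PrRd or PrWr
        self.state, action = MSI[(self.state, event)]
        if action in ("BusRd", "BusRdX"):  # snooped by the other caches
            for c in self.bus:
                if c is not self:
                    c.snoop(action)

    def snoop(self, event):                # BusRd or BusRdX on the bus
        if (self.state, event) in MSI:     # I ignores bus traffic
            self.state, _ = MSI[(self.state, event)]

bus = []
c0, c1 = MSICache(bus), MSICache(bus)
c0.processor("PrWr")   # c0: I -> M via BusRdX
c1.processor("PrRd")   # c1's BusRd forces c0: M -> S (with Flush); c1: I -> S
print(c0.state, c1.state)  # -> S S
```

Note how write serialization falls out of the bus: the BusRdX and BusRd transactions are totally ordered, and every cache applies them in that order.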

Modern Bus Standards and Cache Coherence Protocols

- Neither the AMBA nor the Avalon bus standard includes a cache coherence protocol! The designer has to be aware of the problems related to cache coherence.
- Cache coherence protocols for SoCs are coming, however: e.g. the ARM11 MPCore platform supports data cache coherence.

ARM11 MPCore Cache

- Write-back, write-allocate
- MESI protocol:
  - Modified: exclusive and modified
  - Exclusive: exclusive but not modified
  - Shared
  - Invalid
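MESI splits MSI's "valid but not modified" situation into Exclusive and Shared. A sketch of the key payoff, assuming standard Illinois-style MESI behavior rather than the exact ARM11 MPCore implementation (function names are made up): a read miss with no other sharers installs the block Exclusive, so a later write upgrades to Modified silently, with no bus transaction.

```python
def read_miss(other_caches_have_copy):
    # MESI: install S if some other cache holds the block, E otherwise.
    return "S" if other_caches_have_copy else "E"

def write_hit(state):
    # Returns (next_state, bus_transaction needed for the write).
    if state == "E":
        return "M", None          # silent upgrade: we are the only holder
    if state == "S":
        return "M", "BusRdX"      # must invalidate the other sharers
    return "M", None              # already Modified

print(read_miss(False), write_hit("E"))  # -> E ('M', None)
```

In plain MSI the same read miss would install S, and the later write would cost a BusRdX even though no other cache held the block; avoiding that traffic is the reason for the extra E state.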

Directory Based Cache Coherence

Networks on Chip

In a Network-on-Chip, cache coherence cannot be implemented by bus snooping!

[Figure: processor/cache/memory nodes, each attached through a network interface (NI) to switches connected by channels.]

Distributed Memory

- Distributed-memory architectures that do not have a bus as their only communication channel cannot use snooping protocols to ensure cache coherence.
- Instead, a directory-based approach can be used to guarantee cache coherence.

[Figure: processors P1 ... Pm, each with a cache and a local memory, connected by an interconnection network.]

Directory-Based Cache Coherence: Concepts

- The state of the caches is maintained in a directory.
- A cache miss results in communication between the node where the cache miss occurs and the directory; the information in the affected caches is then updated.
- Each node monitors the state of its cache with e.g. an MSI protocol.

Multiprocessor with Directories

Every block of main memory (the size of a cache block) has a directory entry that keeps track of its cached copies and their state.

[Figure: nodes consisting of processor (P), cache (C), communication assist (CA), and memory with directory, connected by an interconnection network.]

Tasks of the Protocol

When a cache miss occurs, the following tasks have to be performed:
1. finding out the state of the copies in other caches,
2. locating these copies, if needed (e.g. for invalidation),
3. communicating with the other copies (e.g. obtaining data).

Some Definitions

- Home node: the node whose main memory holds the block.
- Dirty node: a node that has a copy of the block in modified (dirty) state.
- Owner node: the node that has a valid copy of the block and thus must supply the data when needed (either the home node or the dirty node).
- Exclusive node: a node that has a copy of the block in exclusive state (either dirty or clean).
- Local node (requesting node): the node whose processor issues a request for the cache block.
- Locally allocated blocks: blocks whose home is local to the issuing processor.
- Remotely allocated blocks: blocks whose home is not local to the issuing processor.

Read Miss to a Block in Modified State

Participants: the requestor, the directory (home) node for the block, and the node with the dirty copy (each with processor P, cache C, communication assist CA, and memory/directory).

1. Read request from the requestor to the directory.
2. Response with the owner's identity.
3. Read request from the requestor to the owner.
4a. Data reply to the requestor.
4b. Revision message (data reply) to the directory.

Write Miss to a Block with Two Sharers

Participants: the requestor, the directory (home) node for the block, and two nodes with shared copies.

1. ReadEx request from the requestor to the directory.
2. Response with the sharers' identities.
3a/3b. Invalidation requests from the requestor to each sharer.
4a/4b. Invalidation acknowledgements from each sharer to the requestor.
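The two message flows can be written down as traces. A sketch (function and node names are made up; messages 4a/4b may in reality overlap, the numbering only follows the slides):

```python
def read_miss_to_dirty(requestor, home, owner):
    """Messages for a read miss when a third node holds the block dirty."""
    return [
        ("1",  requestor, home,  "read request to directory"),
        ("2",  home, requestor,  "response with owner identity"),
        ("3",  requestor, owner, "read request to owner"),
        ("4a", owner, requestor, "data reply"),
        ("4b", owner, home,      "revision message (data reply)"),
    ]

def write_miss_with_sharers(requestor, home, sharers):
    """Messages for a write miss when other nodes hold shared copies."""
    msgs = [
        ("1", requestor, home, "ReadEx request to directory"),
        ("2", home, requestor, "response with sharers' identities"),
    ]
    for k, s in enumerate(sharers):          # at most two sharers here
        msgs.append(("3" + "ab"[k], requestor, s, "invalidation request"))
        msgs.append(("4" + "ab"[k], s, requestor, "invalidation ack"))
    return msgs

trace = read_miss_to_dirty("P0", "P1", "P2")
print(len(trace), len(write_miss_with_sharers("P0", "P1", ["P2", "P3"])))
```

Counting messages like this makes the latency cost visible: the dirty-read case needs three network hops on the critical path before data arrives, versus one for a clean read served by the home node.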

Organization of the Directory

- A natural organization is to maintain the directory information for a block together with the block in main memory.
- Each block's entry can be represented as a bit vector of p presence bits plus one or more state bits. In the simplest case there is a single state bit, the dirty bit, which indicates whether some node holds a modified (dirty) copy of the block.

Example of Directory Information

- An entry for a memory block consists of the presence bits and a status bit (the dirty bit).
- If the dirty bit == ON, only one presence bit can be set.

[Figure: a directory entry stored next to each memory block, holding the presence bits and the dirty bit; node with processor, cache, and communication assist.]
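The entry format can be sketched directly (class and method names are made up); the point is the invariant stated above, dirty == ON implies exactly one presence bit set:

```python
class DirectoryEntry:
    """Per-block directory state: one presence bit per node + a dirty bit."""
    def __init__(self, num_nodes):
        self.presence = [False] * num_nodes
        self.dirty = False

    def sharers(self):
        return [i for i, bit in enumerate(self.presence) if bit]

    def invariant_holds(self):
        # If dirty is ON, only the owner's presence bit may be set.
        return not self.dirty or len(self.sharers()) == 1

entry = DirectoryEntry(num_nodes=4)
entry.presence[2] = True       # node 2 fetches the block with intent to write
entry.dirty = True
print(entry.sharers(), entry.invariant_holds())  # -> [2] True
```

Setting a second presence bit while dirty is ON would make invariant_holds() return False, which is exactly the state the protocol's invalidations exist to prevent.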

Read Miss of Processor i

If the dirty bit == OFF:
- The assist obtains the block from main memory, supplies it to the requestor, and sets the presence bit: p[i] <- ON.

If the dirty bit == ON:
- The assist responds to the requestor with the identity of the owner node.
- The requestor then sends a request network transaction to the owner node.
- The owner changes its state to shared and supplies the block to both the requesting node and main memory.
- The memory sets dirty <- OFF and p[i] <- ON.

Write Miss of Processor i

If the dirty bit == OFF:
- Main memory has a clean copy of the data.
- The home node sends the presence vector to the requesting node i together with the data.
- The home node clears its directory entry, leaving only p[i] <- ON and dirty <- ON.
- The assist at the requestor sends invalidation requests to the nodes whose presence bit was ON and waits for the acknowledgements.
- The requestor places the block in its cache in dirty state (dirty <- ON).

Write Miss of Processor i (cont.)

If the dirty bit == ON:
- Main memory does not have a clean copy of the data.
- The home node requests the cache block from the dirty node, which sets its cache state to invalid.
- The block is then supplied to the requesting node, which places it in its cache in dirty state.
- The home node clears its directory entry, leaving only p[i] <- ON and dirty <- ON.
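The read-miss and write-miss rules above can be combined into one sketch of the home node's directory handling (hypothetical names; data movement and the per-node caches are elided, only the directory entry is tracked):

```python
class Directory:
    """Directory state for one block: presence bits p[] plus a dirty bit."""
    def __init__(self, num_nodes):
        self.p = [False] * num_nodes
        self.dirty = False

    def read_miss(self, i):
        if self.dirty:
            # Owner downgrades to shared and writes the block back, so
            # memory becomes clean; the owner's presence bit stays set.
            self.dirty = False
        self.p[i] = True                     # p[i] <- ON

    def write_miss(self, i):
        # Whether memory was clean or a dirty node supplied the block,
        # the entry ends up with only p[i] and dirty set; every other
        # copy must be invalidated (and acknowledged).
        invalidated = [j for j, on in enumerate(self.p) if on and j != i]
        self.p = [False] * len(self.p)
        self.p[i] = True
        self.dirty = True
        return invalidated

d = Directory(num_nodes=4)
d.read_miss(0)
d.read_miss(2)                 # two clean sharers: nodes 0 and 2
victims = d.write_miss(1)      # node 1 writes: both sharers invalidated
print(victims, d.p, d.dirty)   # -> [0, 2] [False, True, False, False] True
```

After the write miss the entry again satisfies the invariant from the earlier slide: dirty is ON and exactly one presence bit is set.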

Size of the Directory: One Entry per Memory Block

SD = (ST / SB) x (N + 1) bits

where SD = size of the directory, ST = total memory, N = number of nodes, SB = block size, CB = blocks per cache, SC = cache size per node.

Example: ST = 4 GB, N = 64 nodes, CB = 128 K, SB = 64 byte, SC = 8 MB
SD = 520 MB, i.e. about 13% of total memory and 102% of the total cache size.

Size of the Directory: One Entry per Cache Block

SD = N x CB x (N + 1) bits

(same symbols as above).

Example: ST = 4 GB, N = 64 nodes, CB = 128 K, SB = 64 byte, SC = 8 MB
SD = 65 MB, i.e. about 1.5% of total memory and 12.6% of the total cache size.
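Both example numbers can be reproduced with a few lines of arithmetic (using binary units, GB = 2^30 bytes and so on, as the slides do):

```python
GB, MB, K = 2**30, 2**20, 2**10

ST = 4 * GB        # total memory
N  = 64            # number of nodes
CB = 128 * K       # blocks per cache
SB = 64            # block size in bytes
SC = 8 * MB        # cache size per node

bits_per_entry = N + 1                     # N presence bits + 1 dirty bit

# One directory entry per memory block:
sd_mem = (ST // SB) * bits_per_entry // 8      # bytes
# One directory entry per cache block (cache-based directory):
sd_cache = N * CB * bits_per_entry // 8        # bytes

print(sd_mem // MB, sd_cache // MB)            # -> 520 65
print(round(100 * sd_mem / ST),                # % of total memory
      round(100 * sd_mem / (N * SC)))          # % of total cache size
```

The comparison shows why per-cache-block directories scale better here: the directory only needs entries for blocks that can actually be cached (64 x 8 MB = 512 MB of cache), not for all 4 GB of memory.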

Discussion

- Directory-based protocols make it possible to provide cache coherence for distributed shared-memory systems that are not based on buses.
- Since the protocol requires communication between nodes with shared copies, there is a potential for congestion.
- Since communication is not instantaneous and varies from node to node, there is a risk that different nodes have different views of the memory at some instants. These race conditions have to be understood and taken care of!