Cache Coherence Constructive Computer Architecture Arvind

Slides:

Advertisements

Similar presentations

Cache Coherence. Memory Consistency in SMPs Suppose CPU-1 updates A to 200. write-back: memory and cache-2 have stale values write-through: cache-2 has.

Advertisements

Multi-core systems System Architecture COMP25212 Daniel Goodman Advanced Processor Technologies Group.

4/16/2013 CS152, Spring 2013 CS 152 Computer Architecture and Engineering Lecture 19: Directory-Based Cache Protocols Krste Asanovic Electrical Engineering.

1 Lecture 4: Directory Protocols Topics: directory-based cache coherence implementations.

CSE 490/590, Spring 2011 CSE 490/590 Computer Architecture Directory-Based Caches II Steve Ko Computer Sciences and Engineering University at Buffalo.

CSE 490/590, Spring 2011 CSE 490/590 Computer Architecture Directory-Based Caches I Steve Ko Computer Sciences and Engineering University at Buffalo.

1 Lecture 22: Fault Tolerance Papers: Token Coherence: Decoupling Performance and Correctness, ISCA’03, Wisconsin A Low Overhead Fault Tolerant Coherence.

CS252/Patterson Lec /23/01 CS213 Parallel Processing Architecture Lecture 7: Multiprocessor Cache Coherency Problem.

1 Lecture 2: Snooping and Directory Protocols Topics: Snooping wrap-up and directory implementations.

CS252/Patterson Lec /28/01 CS 213 Lecture 10: Multiprocessor 3: Directory Organization.

April 18, 2011CS152, Spring 2011 CS 152 Computer Architecture and Engineering Lecture 19: Directory-Based Cache Protocols Krste Asanovic Electrical Engineering.

Realistic Memories and Caches Li-Shiuan Peh Computer Science & Artificial Intelligence Lab. Massachusetts Institute of Technology March 21, 2012L13-1

Constructive Computer Architecture Cache Coherence Arvind Computer Science & Artificial Intelligence Lab. Massachusetts Institute of Technology November.

1 Lecture 24: Fault Tolerance Papers: Token Coherence: Decoupling Performance and Correctness, ISCA’03, Wisconsin A Low Overhead Fault Tolerant Coherence.

Realistic Memories and Caches – Part II Li-Shiuan Peh Computer Science & Artificial Intelligence Lab. Massachusetts Institute of Technology April 2, 2012L14-1.

Constructive Computer Architecture Tutorial 8 Final Project Part 2: Coherence Sizhuo Zhang TA Nov 25, 2015T08-1http://csg.csail.mit.edu/6.175.

Constructive Computer Architecture Cache Coherence Arvind Computer Science & Artificial Intelligence Lab. Massachusetts Institute of Technology Includes.

Non-blocking Caches Arvind (with Asif Khan) Computer Science & Artificial Intelligence Lab Massachusetts Institute of Technology May 14, 2012L25-1

Realistic Memories and Caches – Part III Li-Shiuan Peh Computer Science & Artificial Intelligence Lab. Massachusetts Institute of Technology April 4, 2012L15-1.

Constructive Computer Architecture Store Buffers and Non-blocking Caches Arvind Computer Science & Artificial Intelligence Lab. Massachusetts Institute.

1 Lecture 8: Snooping and Directory Protocols Topics: 4/5-state snooping protocols, split-transaction implementation details, directory implementations.

Lecture 8: Snooping and Directory Protocols

Symmetric Multiprocessors: Synchronization and Sequential Consistency

Cache Coherence Constructive Computer Architecture Arvind

6.175 Final Project Part 0: Understanding Non-Blocking Caches and Cache Coherency Answers.

Caches-2 Constructive Computer Architecture Arvind

COSC6385 Advanced Computer Architecture

COMP 740: Computer Architecture and Implementation

Memory Hierarchy Ideal memory is fast, large, and inexpensive

Realistic Memories and Caches

Constructive Computer Architecture Tutorial 6 Cache & Exception

תרגול מס' 5: MESI Protocol

Realistic Memories and Caches

CAM Content Addressable Memory

Multiprocessor Cache Coherency

Caches-2 Constructive Computer Architecture Arvind

Cache Coherence Constructive Computer Architecture Arvind

CMSC 611: Advanced Computer Architecture

Krste Asanovic Electrical Engineering and Computer Sciences

Example Cache Coherence Problem

Symmetric Multiprocessors: Synchronization and Sequential Consistency

CS5102 High Performance Computer Systems Distributed Shared Memory

Lecture 2: Snooping-Based Coherence

Constructive Computer Architecture Tutorial 7 Final Project Overview

Realistic Memories and Caches

Cache Coherence Constructive Computer Architecture Arvind

Lecture 5: Snooping Protocol Design Issues

Cache Coherence Constructive Computer Architecture Arvind

Lecture 8: Directory-Based Cache Coherence

CC protocol for blocking caches

Lecture 7: Directory-Based Cache Coherence

Caches-2 Constructive Computer Architecture Arvind

Caches and store buffers

Lecture 25: Multiprocessors

Lecture 4: Synchronization

Lecture 8: Directory-Based Examples

Lecture 25: Multiprocessors

Lecture 17 Multiprocessors and Thread-Level Parallelism

Lecture 24: Virtual Memory, Multiprocessors

CS 152 Computer Architecture and Engineering CS252 Graduate Computer Architecture Lecture 18 Cache Coherence Krste Asanovic Electrical Engineering and.

Lecture 23: Virtual Memory, Multiprocessors

Lecture 24: Multiprocessors

Lecture: Coherence Topics: wrap-up of snooping-based coherence,

Coherent caches Adapted from a lecture by Ian Watson, University of Machester.

Lecture 17 Multiprocessors and Thread-Level Parallelism

Lecture 18: Coherence and Synchronization

CSE 486/586 Distributed Systems Cache Coherence

Caches-2 Constructive Computer Architecture Arvind

Lecture 10: Directory-Based Examples II

Lecture 17 Multiprocessors and Thread-Level Parallelism

Presentation transcript:

Cache Coherence Constructive Computer Architecture Arvind Computer Science & Artificial Intelligence Lab. Massachusetts Institute of Technology November 15, 2017 http://www.csg.csail.mit.edu/6.175

SC and caches Caches present a similar problem as store buffers – stores in one cache will not be visible to other caches automatically Cache problem is solved differently – caches are kept coherent P Cache Memory How to build coherent caches is the topic of this lecture November 13, 2017 http://www.csg.csail.mit.edu/6.175

Cache-coherence problem Processor-Memory Interconnect P1 P2 cache-2 memory 200 200 Suppose P1 updates A to 200. write-back: memory and P2 have stale values write-through: P2 has a stale value Do these stale values matter for programming? Yes, if we want to implement SC or, in fact, any reasonable memory model November 15, 2017 http://www.csg.csail.mit.edu/6.175

Shared Memory Systems Interconnect P P P P L1 L1 L1 L1 P P L1 L1 L2 L2 Interconnect M Modern systems often have hierarchical caches Each cache has exactly one parent but can have zero or more children Logically only a parent and its children can communicate directly Inclusion property is maintained between a parent and its children, i.e., a  Li  a  Li+1 Because usually Li+1 >> Li November 15, 2017 http://www.csg.csail.mit.edu/6.175

Cache-Coherent Memory req res Monolithic Memory ... A monolithic memory processes one request at a time; it can be viewed as processing requests instantaneously A memory with hierarchy of caches is said to be coherent, if functionally it behaves like the monolithic memory November 15, 2017 http://www.csg.csail.mit.edu/6.175

Maintaining Coherence In a coherent memory all loads and stores can be placed in a global order multiple copies of an address in various caches can cause this property to be violated This property can be ensured if: Only one cache at a time has the write permission for an address No cache can have a stale copy of the data after a write to the address has been performed cache coherence protocols are used to maintain coherence November 15, 2017 http://www.csg.csail.mit.edu/6.175

Cache Coherence Protocols Write request: the address is invalidated in all other caches before the write is performed Read request: if a dirty copy is found in some other cache then that value is written back to the memory and supplied to the reader. Alternatively the dirty value can be forwarded directly to the reader Such protocols are called Invalidation-based November 15, 2017 http://www.csg.csail.mit.edu/6.175

State and actions needed to maintain Cache Coherence invalidate I flush Each line in each cache maintains MSI state: I - cache doesn’t contain the address S- cache has the address but so may other caches; hence it can only be read M- only this cache has the address; hence it can be read and written load store S M store write-back Action on a read miss (i.e., Cache state is I): If some other cache has the address in state M then write back the dirty data to Memory and set its state to S Read the value from Memory and set the state to S Action on a write miss (i.e., Cache state is I or S): Invalidate the address in other caches; in case some cache has the address in state M then write back the dirty data Read the value from Memory if necessary and set the state to M How do we know the state of other caches? November 15, 2017 http://www.csg.csail.mit.edu/6.175

Protocols are distributed! Fundamental assumption A processor or cache can only examine or set its own state The state of other caches is inferred or set by sending request and response messages Each parent cache maintains information about each of its child cache in a directory Directory information is conservative, e.g., if the directory say that the child cache c has a cache-line in state S, then cache c may have the address in either S or I state but not in M state Sometimes the state of a cache line is transient because it has requested a change. Directory also contains information about outstanding messages November 15, 2017 http://www.csg.csail.mit.edu/6.175

Directory State Encoding Two-level (L1, M) system P L1 P L1 P L1 P P [S,a] [S,a] Interconnect directory for a M a <[S, no],I,[S, no],I> Addr Tag Data Block V M/S dir -L1 has no directory -M has no need for MSI <[(M|S|I), (No | Yes)]> Child’s state Waiting for downgrade response for each child November 15, 2017 http://www.csg.csail.mit.edu/6.175

State ordering to develop protocols The states M, S, I can be thought of as an order M>S>I Upgrade: A cache miss causes transition from a lower state to a higher state Downgrade: A write-back or invalidation causes a transition from a higher state to a lower state invalidate I flush load store M S store write-back November 15, 2017 http://www.csg.csail.mit.edu/6.175

Message passing an abstract view interconnect PP P c2m m2c L1 p2m m2p m in out Each cache has 2 pairs of queues (c2m, m2c) to communicate with the memory (p2m, m2p) to communicate with the processor Message format: <cmd, srcdst, a, s, data> FIFO message passing between each (srcdst) pair except a Req cannot block a Resp Messages in one srcdst path cannot block messages in another srcdst path Req/Resp address state November 15, 2017 http://www.csg.csail.mit.edu/6.175

Consequences of distributed protocol In the blocking-cache protocol we presented in L15, a cache could go to sleep after it issued a request for a missing line A cache may receive an invalidation request at any time from other caches (via its parent); such requests cannot be ignored otherwise the system will deadlock none of the requests may be able to complete A difficult part of the protocol design is to determine which request can arrive in a given state November 15, 2017 http://www.csg.csail.mit.edu/6.175

Processing misses: Requests and Responses Cache 1,5,8 3,5,7 Cache 1,5,8 3,5,7 3 5 1 2 4 1 Up-req send (cache) 2 Up-req proc, Up resp send (memory) 3 Up-resp proc (cache) 4 Dn-req send (memory) 5 Dn-req proc, Dn resp send (cache) 6 Dn-resp proc (memory) 7 Dn-req proc, drop (cache) 8 Voluntary Dn-resp (cache) 6 Memory 2,6 2,4 November 15, 2017 http://www.csg.csail.mit.edu/6.175

CC protocol for blocking caches Extension to the cache rules for Blocking L1 design discussed in lecture L15 Code is somewhat simplified by assuming that cache-line size = one word syntax is full of errors November 15, 2017 http://www.csg.csail.mit.edu/6.175

Req method hit processing method Action req(MemReq r) if(mshr == Ready); let a = r.addr; let hit = contains(state, a); if(hit) begin let slot = getSlot(state, a); let x = dataArray[slot]; if(r.op == Ld) hitQ.enq(x); else // it is store if (isStateM(state[slot]) dataArray[slot] <= r.data; else begin missReq <= r; mshr <= SendFillReq; missSlot <= slot; end end else begin missReq <= r; mshr <= StartMiss; end // (1) endmethod PP P c2m m2c L1 p2m m2p November 15, 2017 http://www.csg.csail.mit.edu/6.175

Start-miss and Send-fill rules Rdy -> StrtMiss -> SndFillReq -> WaitFillResp -> Resp -> Rdy rule startMiss(mshr == StartMiss); let slot = findVictimSlot(state, missReq.addr); if(!isStateI(state[slot])) begin // write-back (Evacuate) let a = getAddr(state[slot]); let d = (isStateM(state[slot])? dataArray[slot]: -); state[slot] <= (I, _); c2m.enq(<Resp, c->m, a, I, d>); end mshr <= SendFillReq; missSlot <= slot; endrule rule sendFillReq (mshr == SendFillReq); let upg = (missReq.op == Ld)? S : M; c2m.enq(<Req, c->m, missReq.addr, upg, - >); mshr <= WaitFillResp; endrule // (1) November 15, 2017 http://www.csg.csail.mit.edu/6.175

Wait-fill rule and Proc Resp rule Rdy -> StrtMiss -> SndFillReq -> WaitFillResp -> Resp -> Rdy rule waitFillResp ((mshr == WaitFillResp) &&& (m2c.first matches <Resp, m->c, .a, .cs, .d>)); let slot = missSlot; dataArray[slot] <= (missReq.op == Ld)? d : missReq.data; state[slot] <= (cs, a); m2c.deq; mshr <= Resp; endrule // (3) Is there a problem with waitFill? What if the hitQ is blocked? Should we not at least write it in the cache? rule sendProc(mshr == Resp); if(missReq.op == Ld) begin c2p.enq(dataArray[slot]); end mshr <= Ready; endrule November 15, 2017 http://www.csg.csail.mit.edu/6.175

Parent Responds rule parentResp (c2m.first matches <Req,.c->m,.a,.y,.*>); let slot = getSlot(state, a); // in a 2-level // system a has to be present in the memory let statea = state[slot]; if((i≠c, isCompatible(statea.dir[i],y)) && (statea.waitc[c]=No)) begin let d =(statea.dir[c]=I)? dataArray[slot]: -); m2c.enq(<Resp, m->c, a, y, d>); state[slot].dir[c]:=y; c2m.deq; end endrule IsCompatible(M, M) = False IsCompatible(M, S) = False IsCompatible(S, M) = False All other cases = True November 15, 2017 http://www.csg.csail.mit.edu/6.175

Parent (Downgrade) Requests rule dwn (c2m.first matches <Req,c->m,.a,.y,.*>); let slot = getSlot(state, a); let statea = state[slot]; if (findChild2Dwn(statea) matches (Valid .i)) begin state[slot].waitc[i] <= Yes; m2c.enq(<Req, m->i, a, (y==M?I:S), ? >); end; Endrule // (4) This rule will execute as long some child cache is not compatible with the incoming request November 15, 2017 http://www.csg.csail.mit.edu/6.175

Parent receives Response rule dwnRsp (c2m.first matches <Resp, c->m, .a, .y, .data>); c2m.deq; let slot = getSlot(state, a); let statea = state[slot]; if(statea.dir[c]=M) dataArray[slot]<=data; state[slot].dir[c]<=y; state[slot].waitc[c]<=No; endrule // (6) November 15, 2017 http://www.csg.csail.mit.edu/6.175

Child Responds let slot = getSlot(state,a); rule dng ((mshr != Resp) &&& m2c.first matches <Req,m->c,.a,.y,.*>); let slot = getSlot(state,a); if(getCacheState(state[slot])>y) begin let d = (isStateM(state[slot])? dataArray[slot]: -); c2m.enq(<Resp, c->m, a, y, d>); state[slot] <= (y,a); end // the address has already been downgraded m2c.deq; endrule // (5) and (7) November 15, 2017 http://www.csg.csail.mit.edu/6.175

Child Voluntarily downgrades rule startMiss(mshr == Ready); let slot = findVictimSlot(state); if(!isStateI(state[slot])) begin // write-back (Evacuate) let a = getAddr(state[slot]); let d = (isStateM(state[slot])? dataArray[slot]: -); state[slot] <= (I, _); c2m.enq(<Resp, c->m, a, I, d>); end endrule // (8) Rules 1 to 8 are complete - cover all possibilities and cannot deadlock or violate cache invariants November 15, 2017 http://www.csg.csail.mit.edu/6.175

Invariants for a CC-protocol design Directory state is always a conservative estimate of a child’s state E.g., if directory thinks that a child cache is in S state then the cache has to be in either I or S state For every request there is a corresponding response, though sometimes it is generated even before the request is processed Communication system has to ensure that responses cannot be blocked by requests a request cannot overtake a response for the same address At every merger point for requests, we will assume fair arbitration to avoid starvation November 15, 2017 http://www.csg.csail.mit.edu/6.175