Cache Coherence Constructive Computer Architecture Arvind

Presentation transcript:

Cache Coherence
Constructive Computer Architecture
Arvind
Computer Science & Artificial Intelligence Lab, Massachusetts Institute of Technology
November 19, 2014
http://www.csg.csail.mit.edu/6.175

Plan
- The invalidation protocol
- Non-blocking L1

Processor Hit Rules

Load-hit rule:
  p2m.msg=(Load a) & c.tag[cs(a)]=tag(a) & c.state[cs(a)]>I
    ⇒ p2m.deq; m2p.enq(c.data[cs(a)]);

Store-hit rule:
  p2m.msg=(Store a v) & c.tag[cs(a)]=tag(a) & c.state[cs(a)]=M
    ⇒ p2m.deq; m2p.enq(Ack); c.data[cs(a)]:=v;

[Diagram: processors attached to L1 caches via p2m/m2p queues; L1 caches attached to memory via c2m/m2c queues]
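The two hit rules can be sketched as guarded actions on a toy direct-mapped cache. This is an illustrative Python sketch, not the lecture's BSV code: the `Cache` class, `NSETS`, and the integer encoding I=0 < S=1 < M=2 are assumptions; the message queues are elided, so each rule simply returns whether it fired.

```python
I, S, M = 0, 1, 2   # assumed state encoding, ordered I < S < M
NSETS = 4           # toy direct-mapped cache with 4 slots

class Cache:
    def __init__(self):
        self.tag   = [None] * NSETS
        self.state = [I] * NSETS
        self.data  = [0] * NSETS

def cs(a):          # cache slot address for a
    return a % NSETS

def tag(a):         # tag bits of a
    return a // NSETS

def load_hit(c, a):
    """Load-hit rule: enabled only if the tag matches and state > I."""
    if c.tag[cs(a)] == tag(a) and c.state[cs(a)] > I:
        return c.data[cs(a)]        # value returned to the processor
    return None                     # rule not enabled: this is a miss

def store_hit(c, a, v):
    """Store-hit rule: enabled only if the tag matches and state = M."""
    if c.tag[cs(a)] == tag(a) and c.state[cs(a)] == M:
        c.data[cs(a)] = v           # write the word and Ack the processor
        return True
    return False                    # rule not enabled: this is a miss
```

Note that a store hit requires the M state, while a load hit is satisfied by any state above I; everything else falls through to the miss-handling rules on the following slides.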

Processing misses: Requests and Responses

1. Up-req send (cache)
2. Up-req proc, Up-resp send (memory)
3. Up-resp proc (cache)
4. Dn-req send (memory)
5. Dn-req proc, Dn-resp send (cache)
6. Dn-resp proc (memory)
7. Dn-req proc, drop (cache)
8. Voluntary Dn-resp (cache)

[Diagram: each cache performs actions 1, 3, 5, 7, 8; memory performs actions 2, 4, 6]

Invariants for a CC-protocol design

- Directory state is always a conservative estimate of a child's state. E.g., if the directory thinks that a child cache is in S state, then the cache has to be in either I or S state.
- For every request there is a corresponding response, though sometimes it is generated even before the request is processed.
- The communication system has to ensure that responses cannot be blocked by requests: a request cannot overtake a response for the same address.
- At every merger point for requests, we assume fair arbitration to avoid starvation.
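The first invariant, and the compatibility test used by the parent later in the protocol, can be stated as two small predicates. This is a minimal sketch under stated assumptions: the names `is_compatible` and `directory_is_conservative` are illustrative, and states are encoded as integers with I=0 < S=1 < M=2.

```python
I, S, M = 0, 1, 2   # assumed encoding of the state lattice I < S < M

def is_compatible(sibling_state, y):
    """A sibling's state is compatible with granting state y iff
    M never coexists with S or M for the same address:
    M is compatible only with I; S is compatible with S and I."""
    return (not (sibling_state == M and y != I) and
            not (y == M and sibling_state != I))

def directory_is_conservative(dir_state, cache_state):
    """Directory invariant: the child's true state never exceeds the
    directory's recorded estimate (e.g. directory says S, child may
    actually be I or S, but never M)."""
    return cache_state <= dir_state
```

The integer ordering makes "conservative estimate" a simple ≤ comparison, which is why downgrade/upgrade requests can be phrased as moving up or down this lattice.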

Child Requests

1. Child to Parent: Upgrade-to-y Request

(cache miss) c.tag[cs(a)]!=tag(a) & c.state[cs(a)]=I & c.waitp[cs(a)]=No
  ⇒ c.tag[cs(a)]:=tag(a); c.waitp[cs(a)]:=(Yes y); c2m.enq(<Req, c->m, a, y, ->);

(upgrade) c.tag[cs(a)]=tag(a) & c.state[cs(a)]<y & c.waitp[cs(a)]=No
  ⇒ c.waitp[cs(a)]:=(Yes y); c2m.enq(<Req, c->m, a, y, ->);

A request is never sent unless the cache has a slot and the slot contains the tag. cs(a) is the cache slot address. These rules are mutually exclusive and can be combined. This rule would normally be triggered by a cache miss.

Parent Responds

2. Parent to Child: Upgrade-to-y response

(∀j, m.waitc[j][a]=No) & c2m.msg=<Req, c->m, a, y, -> & (∀i≠c, IsCompatible(m.child[i][a], y))
  ⇒ m2c.enq(<Resp, m->c, a, y, (if (m.child[c][a]=I) then m.data[a] else -)>);
    m.child[c][a]:=y; c2m.deq;

Child receives Response

3. Child receiving upgrade-to-y response

m2c.msg=<Resp, m->c, a, y, data>
  ⇒ m2c.deq;
    if (c.state[cs(a)]=I) c.data[cs(a)]:=data;
    c.state[cs(a)]:=y; c.waitp[cs(a)]:=No;  // the child must be waiting for state y

Parent Requests

4. Parent to Child: Downgrade-to-y Request

c2m.msg=<Req, c->m, a, y, -> & (m.child[i][a]>y) & (m.waitc[i][a]=No)
  ⇒ m.waitc[i][a]:=(Yes y); m2c.enq(<Req, m->c, a, y, ->);

Child Responds

5. Child to Parent: Downgrade-to-y response

(m2c.msg=<Req, m->c, a, y, ->) & c.state[cs(a)]>y & c.tag[cs(a)]=tag(a)
  ⇒ c2m.enq(<Resp, c->m, a, y, (if (c.state[cs(a)]=M) then c.data[cs(a)] else -)>);
    c.state[cs(a)]:=y; m2c.deq;

Parent receives Response

6. Parent receiving downgrade-to-y response

c2m.msg=<Resp, c->m, a, y, data>
  ⇒ c2m.deq;
    if (m.child[c][a]=M) m.data[a]:=data;
    m.child[c][a]:=y;
    if (m.waitc[c][a]=(Yes x) & x≥y) m.waitc[c][a]:=No;

Child receives served Request

7. Child receiving downgrade-to-y request (already served)

(m2c.msg=<Req, m->c, a, y, ->) & ((c.tag[cs(a)]=tag(a) & c.state[cs(a)]≤y) || c.tag[cs(a)]!=tag(a))
  ⇒ m2c.deq;

Child Voluntarily downgrades

8. Child to Parent: Downgrade-to-y response (voluntary)

(c.waitp[cs(a)]=No) & (c.state[cs(a)]>y)
  ⇒ c2m.enq(<Resp, c->m, a, y, (if (c.state[cs(a)]=M) then c.data[cs(a)] else -)>);
    c.state[cs(a)]:=y;

Rules 1 to 8 are complete: they cover all possibilities and cannot deadlock or violate the cache invariants.
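As an illustrative sanity check, rules 1-6 can be exercised for a single address and two children in plain Python. This is a toy sketch under stated assumptions, not the lecture's BSV code: class names, queue layouts, and the integer state encoding (I=0 < S=1 < M=2) are all invented here; data movement, rule 7, and voluntary downgrades (rule 8) are omitted. Requests and responses travel on separate child-to-parent channels, reflecting the invariant that responses must never be blocked behind requests.

```python
from collections import deque

I, S, M = 0, 1, 2   # assumed state lattice: I < S < M

def is_compatible(sibling_state, y):
    """M is compatible only with I; S is compatible with S and I."""
    return (not (sibling_state == M and y != I) and
            not (y == M and sibling_state != I))

class Child:
    def __init__(self, cid):
        self.cid, self.state, self.waitp = cid, I, None

class Parent:
    def __init__(self, n):
        self.child = [I] * n      # directory's conservative estimate per child
        self.waitc = [None] * n   # pending downgrade request per child

c2m_req, c2m_resp = deque(), deque()   # separate channels: responses
m2c = [deque(), deque()]               # can never sit behind requests

def rule1_child_request(c, y):
    """Rule 1: child asks the parent for an upgrade to y."""
    if c.waitp is None and c.state < y:
        c.waitp = y
        c2m_req.append((c.cid, y))

def rule2_parent_respond(p):
    """Rule 2: parent grants the upgrade once no downgrade is pending
    and all sibling states are compatible with y."""
    if not c2m_req:
        return False
    c, y = c2m_req[0]
    ok = (all(w is None for w in p.waitc) and
          all(i == c or is_compatible(p.child[i], y)
              for i in range(len(p.child))))
    if not ok:
        return False
    c2m_req.popleft()
    p.child[c] = y
    m2c[c].append(('Resp', y))
    return True

def rule3_child_recv_resp(c):
    """Rule 3: child installs the granted state."""
    if m2c[c.cid] and m2c[c.cid][0][0] == 'Resp':
        _, y = m2c[c.cid].popleft()
        c.state, c.waitp = y, None

def rule4_parent_downgrade_req(p, i, y):
    """Rule 4: parent asks child i to downgrade to y."""
    if p.child[i] > y and p.waitc[i] is None:
        p.waitc[i] = y
        m2c[i].append(('Req', y))

def rule5_child_downgrade_resp(c):
    """Rule 5: child downgrades and responds."""
    if m2c[c.cid] and m2c[c.cid][0][0] == 'Req':
        _, y = m2c[c.cid].popleft()
        if c.state > y:
            c.state = y
            c2m_resp.append((c.cid, y))

def rule6_parent_recv_resp(p):
    """Rule 6: parent records the downgrade and clears waitc."""
    if c2m_resp:
        c, y = c2m_resp.popleft()
        p.child[c] = y
        if p.waitc[c] is not None and p.waitc[c] >= y:
            p.waitc[c] = None
```

A typical run: child 1 requests M while child 0 holds S, so rule 2 is blocked until the parent downgrades child 0 to I via rules 4-6, after which the upgrade is granted.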

Non-blocking Cache (single processor)

- Behavior is described by 2 concurrent FSMs that process input requests and memory responses, respectively.
- A St req goes into the StQ and waits until its data can be written into the cache.
- The Ld Buff holds load reqs waiting for data.
- An extra bit (W) in the cache line indicates whether it is waiting for data from the parent.

[Diagram: cache lines with V, D, W, Tag, Data fields; St Q and Ld Buff beside the cache; req/resp and hitQ on the processor side; mReqQ, mRespQ, and wbQ on the memory side]

Incoming req (single processor)

St request:
  Cache state V (tag match) and StQ empty?
    yes: write in cache; set D  (hit)
    no:  put in StQ;
         if cache not W: if (evacuate) send wbResp; unset V; send memReq; set W, set Tag

Ld request:
  In StQ?
    yes: bypass (forward the youngest matching store value)
    no:  cache state V (tag match)?
      yes: hit
      no:  put in LdBuf;
           if cache not W: if (evacuate) send wbResp; unset V; send memReq; set W, set Tag

A request may also end up in the StQ or LdBuf because of a conflict miss while the cache is already waiting (W).
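The decision flow above can be sketched as two small Python functions. This is a hedged sketch, not the lecture's implementation: it assumes a one-line cache represented as a dict with V/D/W/tag/data fields, `stq` as a list of (addr, value) pairs, and `ldbuf` as a list of addresses; the function names are invented, and the actual write-back and memory request are reduced to comments.

```python
def start_refill(cache, a):
    """Victim handling: a valid dirty line would be written back via
    wbResp here; then unset V and mark the line waiting (W) for a."""
    cache.update(V=False, W=True, tag=a)      # memReq for a goes out here

def handle_store(cache, stq, a, v):
    if cache['V'] and cache['tag'] == a and not stq:   # hit with empty StQ
        cache['data'], cache['D'] = v, True
        return 'hit'
    stq.append((a, v))                        # otherwise buffer the store
    if not cache['W']:                        # no refill in flight: start one
        start_refill(cache, a)
    return 'queued'

def handle_load(cache, stq, ldbuf, a):
    for sa, sv in reversed(stq):              # bypass from the youngest store
        if sa == a:
            return ('bypass', sv)
    if cache['V'] and cache['tag'] == a:
        return ('hit', cache['data'])
    ldbuf.append(a)                           # load waits for data
    if not cache['W']:
        start_refill(cache, a)
    return ('miss', None)
```

Note the asymmetry: a store hit additionally requires the StQ to be empty (to preserve store ordering), while a load first checks the StQ so it can bypass a buffered store to the same address.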

Mem Resp (for line cl, single processor)

1. Update cache line (set V, unset D, and unset W)
2. Process all matching LdBuf entries and send responses
3. L: if cachestate(oldest StQ entry address) = V then
         update the cache word with the StQ entry; set D;
         remove the oldest entry; loop back to L
       else if there is a LdBuf entry for cl  // process conflict misses
         then if (evacuate) wbResp; unset V;
              memReq for the address in LdBuf; set W, set Tag
       else if cachestate(oldest StQ entry address) ≠ W
         memReq for this store entry; set W, set Tag
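Steps 1-3 above can be sketched as one function over the same toy one-line cache (a dict with V/D/W/tag/data fields; `stq` and `ldbuf` as lists). All names are illustrative assumptions; the conflict-miss branch that would start a new refill is reduced to a comment to keep the sketch short.

```python
def mem_resp(cache, stq, ldbuf, cl, line_data):
    # 1. install the refilled line: set V, unset D, unset W
    cache.update(V=True, D=False, W=False, tag=cl, data=line_data)
    # 2. satisfy every LdBuf entry waiting on this line and respond
    responses = [(a, cache['data']) for a in ldbuf if a == cl]
    ldbuf[:] = [a for a in ldbuf if a != cl]
    # 3. loop L: drain the StQ while the oldest store hits the valid line
    while stq and cache['V'] and stq[0][0] == cache['tag']:
        _, v = stq.pop(0)
        cache['data'], cache['D'] = v, True
    # any remaining StQ/LdBuf entries are conflict misses: a new refill
    # (with an optional write-back of the victim) would be started here
    return responses
```

Draining the StQ oldest-first after the loads is what keeps the single-processor memory ordering intact even though the cache itself is non-blocking.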

Non-blocking Cache (multi-processor)

- Behavior is described by 2 concurrent FSMs that process input requests and memory responses, respectively.
- A St req goes into the StQ and waits until its data can be written into the cache.
- The Ld Buff holds load reqs waiting for data.
- Extra state bits in the cache line (WM, WS) indicate whether it is waiting for data from the parent, and whether the line was requested in M or S.
- mRespQ also carries invalidation messages from the parent.

[Diagram: cache lines with state (M/S/I/WM/WS), D, Tag, Data fields; St Q and Ld Buff beside the cache; req/resp and hitQ on the processor side; mReqQ, mRespQ, and wbQ on the memory side]

Incoming req (multi-processor)

St request:
  Cache state M (tag match) and StQ empty?
    yes: write in cache; set D  (hit)
    no:  put in StQ;
         if cache not W: if (evacuate) send wbResp; set state to I; send memReq; set W(M), set Tag

Ld request:
  In StQ?
    yes: bypass
    no:  cache state M or S (tag match)?
      yes: hit
      no:  put in LdBuf;
           if cache not W: if (evacuate) send wbResp; set state to I; send memReq; set W(S), set Tag

A request may also end up in the StQ or LdBuf because of a conflict miss while the cache is already waiting (W). This flow chart replaces the processor Hit and Miss rules.
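The multi-processor flow differs from the single-processor one only in the hit conditions and in which state the refill asks for. The sketch below makes that explicit; as before, the one-line cache dict, the function names, and the encoding I=0 < S=1 < M=2 (with `W` holding None, or the requested S/M while a refill is in flight) are illustrative assumptions, not the lecture's code.

```python
I, S, M = 0, 1, 2    # assumed encoding; cache['W'] is None, or S/M (WS/WM)

def refill(cache, a, want):
    """Evacuating a valid dirty victim would send wbResp here; the line
    then drops to I and a memReq asking for `want` (S or M) is issued."""
    cache.update(state=I, W=want, tag=a)

def handle_req(cache, stq, ldbuf, kind, a, v=None):
    present = cache['tag'] == a and cache['W'] is None
    if kind == 'st':
        if present and cache['state'] == M and not stq:
            cache['data'], cache['D'] = v, True   # store hit needs M
            return 'hit'
        stq.append((a, v))
        if cache['W'] is None:
            refill(cache, a, want=M)              # set W(M)
        return 'queued'
    # load request
    for sa, sv in reversed(stq):                  # bypass from the StQ
        if sa == a:
            return ('bypass', sv)
    if present and cache['state'] >= S:           # load hit needs M or S
        return ('hit', cache['data'])
    ldbuf.append(a)
    if cache['W'] is None:
        refill(cache, a, want=S)                  # set W(S)
    return ('miss', None)
```

So a store to a line held in S is still a miss: the line is evacuated to I and re-requested in M, exactly the case a single-processor cache never sees.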

Mem Resp (for line cl, multi-processor)

1. Update cache line (set state to M or S based on W; unset D, unset W)
2. Process all matching LdBuf entries and send responses
3. L: if cachestate(oldest StQ entry address) = M then
         update the cache word with the StQ entry; set D;
         remove the oldest entry; loop back to L
       else if there is a LdBuf entry for cl  // process conflict misses
         then if (evacuate) wbResp; set state to I;
              memReq for the address in LdBuf; set W(S), set Tag
       else if cachestate(oldest StQ entry address) ≠ W
         memReq for this store entry; set W(M), set Tag

This flow chart replaces rule 3 (for L1).

Next: Network and buffer issues to avoid deadlocks