Constructive Computer Architecture: Cache Coherence

Presentation transcript:

Constructive Computer Architecture: Cache Coherence. Arvind, Computer Science & Artificial Intelligence Lab., Massachusetts Institute of Technology. November 18, 2013.

Contributors to the course material
Arvind, Rishiyur S. Nikhil, Joel Emer, Muralidaran Vijayaraghavan.
Staff and students in the Spring 2013, 6.S195 (Fall 2012), and 6.S078 (Spring 2012) offerings: Asif Khan, Richard Ruhler, Sang Woo Jun, Abhinav Agarwal, Myron King, Kermin Fleming, Ming Liu, Li-Shiuan Peh.
External: Prof. Amey Karkare & students at IIT Kanpur; Prof. Jihong Kim & students at Seoul National University; Prof. Derek Chiou, University of Texas at Austin; Prof. Yoav Etsion & students at Technion.

Memory Consistency in SMPs
Suppose CPU-1 updates A to 200.
- write-back cache: memory and cache-2 have stale values
- write-through cache: cache-2 has a stale value
[Figure: CPU-1 and CPU-2, each with a private cache initially holding A=100, connected over a CPU-Memory bus to a memory that also holds A=100.]
Do these stale values matter? What is the view of shared memory for programming?

Maintaining Store Atomicity
Store atomicity requires all processors to see writes occur in the same order; multiple copies of a location in various caches can cause this to be violated. To meet the ordering requirement it is sufficient for hardware to ensure:
- Only one processor at a time has write permission for a location
- No processor can load a stale copy of the data after a write to the location
⇒ cache coherence protocols

A System with Multiple Caches
[Figure: processors P with private L1 caches, groups of L1s sharing an L2, and the L2s connected through an interconnect to main memory M.]
Modern systems often have hierarchical caches. Each cache has exactly one parent but can have zero or more children. Logically, only a parent and its children can communicate directly. The inclusion property is maintained between a parent and its children, i.e., a ∈ Li ⇒ a ∈ Li+1, because usually |Li+1| >> |Li|.

Cache Coherence Protocols
- Write request: the address is invalidated in all other caches before the write is performed, or the address is updated in all other caches after the write is performed.
- Read request: if a dirty copy is found in some cache, that is the value that must be used, e.g., by doing a write-back and reading the memory, or by forwarding that dirty value directly to the reader.
We will focus on invalidation protocols as opposed to update protocols.

State needed to maintain Cache Coherence
Use MSI encoding in caches, where:
- I means this cache does not contain the location
- S means this cache has the location, but so may other caches; hence it can only be read
- M means only this cache has the location; hence it can be read and written
The states M, S, I can be thought of as an order: M > S > I. A transition from a lower state to a higher state is called an Upgrade; a transition from a higher state to a lower state is called a Downgrade.

Sibling invariant and compatibility
Sibling invariant: a cache is in state M ⇒ its siblings are in state I. That is, the sibling states are "compatible". The states x, y of two siblings are compatible iff IsCompatible(x, y) is True, where:
- IsCompatible(M, M) = False
- IsCompatible(M, S) = False
- IsCompatible(S, M) = False
- All other cases = True
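For illustration, here is a minimal Python sketch (not the course's Bluespec code; the State enum and is_compatible names are assumptions) of the M > S > I ordering and the compatibility table above:

from enum import IntEnum

# MSI states ordered so that integer comparison matches M > S > I.
class State(IntEnum):
    I = 0
    S = 1
    M = 2

def is_compatible(x: State, y: State) -> bool:
    # Compatible unless one sibling is in M while the other is in M or S,
    # i.e., M may only coexist with I.
    return not (x == State.M and y != State.I) and not (y == State.M and x != State.I)

# Sanity checks matching the table on the slide.
assert not is_compatible(State.M, State.M)
assert not is_compatible(State.M, State.S)
assert not is_compatible(State.S, State.M)
assert is_compatible(State.M, State.I) and is_compatible(State.S, State.S)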

Cache State Transitions
[Figure: the MSI state diagram, with transitions labeled store, load, write-back, invalidate, flush, and store optimizations.]
This state diagram is helpful as long as one remembers that each transition involves the cooperation of other caches and the main memory.
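One plausible reading of the diagram as a Python table (the assignment of events to edges is an assumption here; only the set of edge labels comes from the slide):

# Hypothetical encoding of the MSI diagram: (current state, event) -> next state.
TRANSITIONS = {
    # upgrades, driven by processor loads and stores
    ("I", "load"): "S",
    ("I", "store"): "M",
    ("S", "store"): "M",
    # downgrades
    ("M", "write-back"): "S",
    ("S", "invalidate"): "I",
    ("M", "flush"): "I",
}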

Cache Actions
On a read miss (i.e., cache state is I):
- In case some other cache has the location in state M, write back the dirty data to memory
- Read the value from memory and set the state to S
On a write miss (i.e., cache state is I or S):
- Invalidate the location in all other caches, and in case some cache has the location in state M, write back the dirty data
- Read the value from memory if necessary and set the state to M
Misses cause cache upgrade actions, which in turn may cause further downgrades or upgrades in other caches.

MSI protocol: some issues
- It is possible to have multiple requests for the same location from different processors; hence there is a need to arbitrate requests. In bus-based systems, the bus controller performs this function. In directory-based systems, upgrade requests are passed to the parent, who acts as an arbitrator.
- On a cache miss there is a need to find out the state of other caches. In a bus-based system, a system-wide broadcast of the request determines the state of other caches by "snooping". In directory-based systems, a directory keeps track of the state of each child cache.

Directory State Encoding: two-level (L1, M) system
For each location in a cache, the directory keeps two types of info:
- c.state[a] (sibling info): do c's siblings have a copy of location a? M (means no), S (means maybe)
- c.child[ck][a] (children info): the state of c's child ck for location a; at most one child can be in state M
Since L1 has no children, only sibling information is kept, and since main (home) memory has no siblings, only children cache information is kept. All addresses in the home memory are in state M.
[Figure: processors P with L1 caches holding address a, connected by an interconnect to the home memory.]

Directory state encoding (cont.)
New states are needed to deal with waiting for responses:
- c.waitp[a]: denotes whether cache c is waiting for a response from its parent. Nothing means not waiting; Valid (M|S|I) means waiting for a response to transition to the M, S, or I state, respectively.
- c.waitc[ck][a]: denotes whether cache c is waiting for a response from its child ck: Nothing | Valid (M|S|I).
Cache state in L1: state, waitp, and data for each location. Directory state in home memory: the children's states (child), waitc, and data for each location.
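A minimal Python sketch of this bookkeeping (field names mirror the slide's notation; the class layout itself is an assumption):

from dataclasses import dataclass, field
from typing import Dict, Optional

# None models Nothing; a state string ("M", "S", "I") models Valid s.
@dataclass
class L1Line:
    state: str = "I"                 # c.state[a]: sibling info
    waitp: Optional[str] = None      # c.waitp[a]: waiting for the parent?
    data: Optional[int] = None       # c.data[a]

@dataclass
class DirEntry:
    child: Dict[str, str] = field(default_factory=dict)            # m.child[ck][a]
    waitc: Dict[str, Optional[str]] = field(default_factory=dict)  # m.waitc[ck][a]
    data: Optional[int] = None                                     # m.data[a]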

A Directory-based Protocol: an abstract view
[Figure: processors P attached to L1 caches, which communicate over an interconnect with the memory m; each node has in and out paths.]
Each cache has 2 pairs of queues: (c2m, m2c) to communicate with the memory, and (p2m, m2p) to communicate with the processor. Message format: <Req/Resp, src→dst, address, state, data>. FIFO message passing between each (src, dst) pair, except that a Resp cannot be blocked by a Req.
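A sketch of the message format and queues in Python (deque stands in for the hardware FIFOs; the Msg layout follows the format above):

from collections import deque, namedtuple

# <Req/Resp, src->dst, address, state, data>; data is None ('-') except on
# responses that carry a cache line.
Msg = namedtuple("Msg", "kind src dst addr state data")

# Queue pairs for one cache: toward memory and toward the processor.
c2m, m2c = deque(), deque()
p2m, m2p = deque(), deque()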

Processor Hit Rules
Load-hit rule: p2m.msg=(Load a) & (c.state[a]>I) → p2m.deq; m2p.enq(c.data[a]);
Store-hit rule: p2m.msg=(Store a v) & (c.state[a]=M) → p2m.deq; m2p.enq(Ack); c.data[a]:=v;
The miss rules are taken care of by the general cache rules to be presented.
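The same two rules as a small Python sketch (an illustration over a toy cache dictionary, not the course code): the guard is checked, and only then does the action fire.

from collections import deque

cache = {0x100: {"state": "S", "data": 42}}     # toy cache: one line in state S
p2m, m2p = deque([("Load", 0x100)]), deque()

def load_hit_rule():
    # p2m.msg=(Load a) & (c.state[a] > I)  ->  p2m.deq; m2p.enq(c.data[a])
    if p2m and p2m[0][0] == "Load" and cache[p2m[0][1]]["state"] != "I":
        _, a = p2m.popleft()
        m2p.append(cache[a]["data"])

def store_hit_rule():
    # p2m.msg=(Store a v) & (c.state[a]=M)  ->  p2m.deq; m2p.enq(Ack); c.data[a]:=v
    if p2m and p2m[0][0] == "Store" and cache[p2m[0][1]]["state"] == "M":
        _, a, v = p2m.popleft()
        m2p.append("Ack")
        cache[a]["data"] = v

load_hit_rule()
assert m2p.popleft() == 42      # the load hit returns the cached value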

Processing a Load or a Store miss
1. Child to Parent: Upgrade-to-y request
2. Parent to Child: process Upgrade-to-y request
3. Parent to other child caches: Downgrade-to-x request
4. Child to Parent: Downgrade-to-x response
5. Parent waits for all Downgrade-to-x responses
6. Parent to Child: Upgrade-to-y response
7. Child receives Upgrade-to-y response
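As an illustration only (the message bodies follow the format reconstructed above; the concrete cache names c1 and c2 are hypothetical), the trace for a store miss in one cache while a sibling holds the line in M might look like:

# (kind, src -> dst, address, requested/granted state, data)
trace = [
    ("Req",  "c1->m", "a", "M", None),    # 1. c1 asks its parent to upgrade a to M
    ("Req",  "m->c2", "a", "I", None),    # 3. parent asks sibling c2 to downgrade to I
    ("Resp", "c2->m", "a", "I", "dirty"), # 4. c2 responds, writing back its dirty copy
    ("Resp", "m->c1", "a", "M", "dirty"), # 6. parent grants M to c1, forwarding the data
]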

Processing a Load miss
L1 to Parent: Upgrade-to-S request
(c.state[a]=I) & (c.waitp[a]=Nothing) → c.waitp[a]:=Valid S; c2m.enq(<Req, c→m, a, S, ->);
Parent to L1: Upgrade-to-S response
(∀j, m.waitc[j][a]=Nothing) & c2m.msg=<Req, c→m, a, S, -> & (∀i≠c, IsCompatible(m.child[i][a], S)) → m2c.enq(<Resp, m→c, a, S, m.data[a]>); m.child[c][a]:=S; c2m.deq;
L1 receiving upgrade-to-S response
m2c.msg=<Resp, m→c, a, S, data> → m2c.deq; c.data[a]:=data; c.state[a]:=S; c.waitp[a]:=Nothing;
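A Python sketch of the first and last of these rules (illustrative only; the dictionaries play the role of c.state, c.waitp, and c.data, and Msg follows the reconstructed format):

from collections import deque, namedtuple

Msg = namedtuple("Msg", "kind src dst addr state data")

state, waitp, data = {"a": "I"}, {"a": None}, {"a": None}
c2m, m2c = deque(), deque()

def upgrade_to_S_request(a="a"):
    # (c.state[a]=I) & (c.waitp[a]=Nothing) -> c.waitp[a]:=Valid S; c2m.enq(<Req, c->m, a, S, ->)
    if state[a] == "I" and waitp[a] is None:
        waitp[a] = "S"
        c2m.append(Msg("Req", "c", "m", a, "S", None))

def receive_upgrade_to_S_response():
    # m2c.msg=<Resp, m->c, a, S, data> -> m2c.deq; install data; state:=S; waitp:=Nothing
    if m2c and m2c[0].kind == "Resp":
        msg = m2c.popleft()
        data[msg.addr], state[msg.addr], waitp[msg.addr] = msg.data, msg.state, None

upgrade_to_S_request()
m2c.append(Msg("Resp", "m", "c", "a", "S", 7))   # pretend the parent has replied
receive_upgrade_to_S_response()
assert state["a"] == "S" and data["a"] == 7 and waitp["a"] is None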

Processing a Load miss (cont.)
What if (∀i≠c, IsCompatible(m.child[i][a], S)) is false? Downgrade the other child caches.
Parent to L1: Upgrade-to-S response
(∀j, m.waitc[j][a]=Nothing) & c2m.msg=<Req, c→m, a, S, -> & (∀i≠c, IsCompatible(m.child[i][a], S)) → m2c.enq(<Resp, m→c, a, S, m.data[a]>); m.child[c][a]:=S; c2m.deq;
Parent to Child: Downgrade-to-S request
c2m.msg=<Req, c→m, a, S, -> & (m.child[i][a]>S) & (m.waitc[i][a]=Nothing) → m.waitc[i][a]:=Valid S; m2c.enq(<Req, m→i, a, S, ->);

Complete set of cache actions
req = {1, 4, 7}; resp = {2, 3, 5, 6, 8}. A protocol specifies cache actions corresponding to each of these 8 different messages.
[Figure: rules 1, 5, 8 generate messages at the cache and rules 3, 7 consume messages at the cache; rules 2, 4 generate messages at the memory and rule 6 consumes messages at the memory.]

Child Requests
1. Child to Parent: Upgrade-to-y request
(c.state[a]<y) & (c.waitp[a]=Nothing) → c.waitp[a]:=Valid y; c2m.enq(<Req, c→m, a, y, ->);

Parent Responds
2. Parent to Child: Upgrade-to-y response
(∀j, m.waitc[j][a]=Nothing) & c2m.msg=<Req, c→m, a, y, -> & (m.child[c][a]<y) & (∀i≠c, IsCompatible(m.child[i][a], y)) → m2c.enq(<Resp, m→c, a, y, (if (m.child[c][a]=I) then m.data[a] else -)>); m.child[c][a]:=y; c2m.deq;

Child receives Response
3. Child receiving upgrade-to-y response
m2c.msg=<Resp, m→c, a, y, data> → m2c.deq; if (c.state[a]=I) c.data[a]:=data; c.state[a]:=y; if (c.waitp[a]=(Valid x) & x≤y) c.waitp[a]:=Nothing;

Parent Requests
4. Parent to Child: Downgrade-to-y request
(m.child[i][a]>y) & (m.waitc[i][a]=Nothing) → m.waitc[i][a]:=Valid y; m2c.enq(<Req, m→i, a, y, ->);

Child Responds
5. Child to Parent: Downgrade-to-y response
(m2c.msg=<Req, m→c, a, y, ->) & (c.state[a]>y) → c2m.enq(<Resp, c→m, a, y, (if (c.state[a]=M) then c.data[a] else -)>); c.state[a]:=y; m2c.deq;

Parent receives Response
6. Parent receiving downgrade-to-y response
c2m.msg=<Resp, c→m, a, y, data> → c2m.deq; if (m.child[c][a]=M) m.data[a]:=data; m.child[c][a]:=y; if (m.waitc[c][a]=(Valid x) & x≥y) m.waitc[c][a]:=Nothing;

Child receives served Request
7. Child receiving downgrade-to-y request
(m2c.msg=<Req, m→c, a, y, ->) & (c.state[a]≤y) → m2c.deq;

Child Voluntarily downgrades
8. Child to Parent: Downgrade-to-y response (voluntary)
(c.waitp[a]=Nothing) & (c.state[a]>y) → c2m.enq(<Resp, c→m, a, y, (if (c.state[a]=M) then c.data[a] else -)>); c.state[a]:=y;
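To show the downgrade side, here is a hedged Python sketch of rules 7 and 8 over the same toy structures (an illustration, not the course code; dirty data travels with the response only when leaving M):

from collections import deque, namedtuple

Msg = namedtuple("Msg", "kind src dst addr state data")
ORDER = {"I": 0, "S": 1, "M": 2}                 # M > S > I

state, waitp, data = {"a": "M"}, {"a": None}, {"a": 99}
c2m, m2c = deque(), deque()

def voluntary_downgrade(a="a", y="S"):
    # Rule 8: (c.waitp[a]=Nothing) & (c.state[a]>y) ->
    #         c2m.enq(<Resp, c->m, a, y, data if leaving M else ->); c.state[a]:=y
    if waitp[a] is None and ORDER[state[a]] > ORDER[y]:
        payload = data[a] if state[a] == "M" else None
        c2m.append(Msg("Resp", "c", "m", a, y, payload))
        state[a] = y

def drop_served_downgrade_request():
    # Rule 7: (m2c.msg=<Req, m->c, a, y, ->) & (c.state[a] <= y) -> m2c.deq
    if m2c and m2c[0].kind == "Req" and ORDER[state[m2c[0].addr]] <= ORDER[m2c[0].state]:
        m2c.popleft()

voluntary_downgrade()                            # M -> S, dirty value 99 is written back
assert state["a"] == "S" and c2m[0].data == 99
m2c.append(Msg("Req", "m", "c", "a", "S", None)) # a late downgrade-to-S request arrives
drop_served_downgrade_request()                  # already in S, so it is simply dropped
assert not m2c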

Some properties
Rules 1 to 8 are complete: they cover all possibilities and cannot deadlock or violate cache invariants. Our protocol maintains two important invariants:
- The directory state is always a conservative estimate of a child's state
- Every request eventually gets a corresponding response (assuming responses cannot be blocked by requests and a request cannot overtake a response for the same address)
Starvation, i.e., a load or store request being ignored indefinitely, has to be prevented; fair arbitration at the memory between requests from various caches ensures starvation freedom.

FIFO property of queues
If the FIFO property is not enforced, then the protocol can either deadlock or update with wrong data. A deadlock scenario:
1. Child 1 requests upgrade (from I) to M (msg1)
2. Parent responds to Child 1 with upgrade from I to M (msg2)
3. Child 2 requests upgrade (from I) to M (msg3)
4. Parent requests Child 1 to downgrade (from M) to I (msg4)
5. msg4 overtakes msg2
6. Child 1, still in I, sees the request to downgrade to I and drops it
7. Parent never gets a response from Child 1 for the downgrade to I

H and L Priority Messages
At the memory, unprocessed request messages cannot block reply messages. Hence all messages are classified as H or L priority; all messages carrying replies are classified as high priority. This is accomplished by having separate paths for H and L priority:
- In theory: separate networks
- In practice: separate queues, with shared physical wires for both networks
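A toy Python sketch of the separate-queues idea (my own framing, not the course code): responses go into the high-priority queue and are always drained before pending requests.

from collections import deque

h_queue, l_queue = deque(), deque()    # H: responses, L: requests (shared physical wires)

def send(msg):
    (h_queue if msg["kind"] == "Resp" else l_queue).append(msg)

def deliver():
    # Responses are never blocked behind requests.
    if h_queue:
        return h_queue.popleft()
    return l_queue.popleft() if l_queue else None

send({"kind": "Req", "addr": "a"})
send({"kind": "Resp", "addr": "a"})
assert deliver()["kind"] == "Resp"     # the reply goes first, even though it was sent later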