Spring 2003CSE P5481 Cache Coherency Cache coherent processors reading processor must get the most current value most current value is the last write Cache.

Slides:



Advertisements
Similar presentations
L.N. Bhuyan Adapted from Patterson’s slides
Advertisements

CMSC 611: Advanced Computer Architecture
1 Lecture 6: Directory Protocols Topics: directory-based cache coherence implementations (wrap-up of SGI Origin and Sequent NUMA case study)
4/16/2013 CS152, Spring 2013 CS 152 Computer Architecture and Engineering Lecture 19: Directory-Based Cache Protocols Krste Asanovic Electrical Engineering.
1 Lecture 4: Directory Protocols Topics: directory-based cache coherence implementations.
Cache Optimization Summary
CSE 490/590, Spring 2011 CSE 490/590 Computer Architecture Directory-Based Caches II Steve Ko Computer Sciences and Engineering University at Buffalo.
Technical University of Lodz Department of Microelectronics and Computer Science Elements of high performance microprocessor architecture Shared-memory.
The University of Adelaide, School of Computer Science
Cache Coherence. CSE 4711 Cache Coherence Recall the memory wall –In multiprocessors the wall might even be higher! –Contention on shared-bus –Time to.
CIS629 Coherence 1 Cache Coherence: Snooping Protocol, Directory Protocol Some of these slides courtesty of David Patterson and David Culler.
CS252/Patterson Lec /23/01 CS213 Parallel Processing Architecture Lecture 7: Multiprocessor Cache Coherency Problem.
EECC756 - Shaaban #1 lec # 13 Spring Scalable Cache Coherent Systems Scalable distributed shared memory machines Assumptions: –Processor-Cache-Memory.
BusMultis.1 Review: Where are We Now? Processor Control Datapath Memory Input Output Input Output Memory Processor Control Datapath  Multiprocessor –
1 Lecture 20: Coherence protocols Topics: snooping and directory-based coherence protocols (Sections )
1 Lecture 18: Coherence Protocols Topics: coherence protocols for symmetric and distributed shared-memory multiprocessors (Sections )
CS252/Patterson Lec /28/01 CS 213 Lecture 9: Multiprocessor: Directory Protocol.
NUMA coherence CSE 471 Aut 011 Cache Coherence in NUMA Machines Snooping is not possible on media other than bus/ring Broadcast / multicast is not that.
1 Lecture 3: Directory-Based Coherence Basic operations, memory-based and cache-based directories.
CPE 731 Advanced Computer Architecture Snooping Cache Multiprocessors Dr. Gheith Abandah Adapted from the slides of Prof. David Patterson, University of.
CS252/Patterson Lec /28/01 CS 213 Lecture 10: Multiprocessor 3: Directory Organization.
Snooping Cache and Shared-Memory Multiprocessors
April 18, 2011CS152, Spring 2011 CS 152 Computer Architecture and Engineering Lecture 19: Directory-Based Cache Protocols Krste Asanovic Electrical Engineering.
1 Shared-memory Architectures Adapted from a lecture by Ian Watson, University of Machester.
Multiprocessor Cache Coherency
1 Cache coherence CEG 4131 Computer Architecture III Slides developed by Dr. Hesham El-Rewini Copyright Hesham El-Rewini.
1 Lecture 22 Multiprocessor Performance Adapted from UCB CS252 S01, Copyright 2001 USB.
Shared Address Space Computing: Hardware Issues Alistair Rendell See Chapter 2 of Lin and Synder, Chapter 2 of Grama, Gupta, Karypis and Kumar, and also.
CS492B Analysis of Concurrent Programs Coherence Jaehyuk Huh Computer Science, KAIST Part of slides are based on CS:App from CMU.
EECS 252 Graduate Computer Architecture Lec 13 – Snooping Cache and Directory Based Multiprocessors David Patterson Electrical Engineering and Computer.
Cache Control and Cache Coherence Protocols How to Manage State of Cache How to Keep Processors Reading the Correct Information.
Cache Coherence Protocols 1 Cache Coherence Protocols in Shared Memory Multiprocessors Mehmet Şenvar.
Cache Coherence Protocols A. Jantsch / Z. Lu / I. Sander.
Computer Science and Engineering Parallel and Distributed Processing CSE 8380 April 5, 2005 Session 22.
Computer Science and Engineering Copyright by Hesham El-Rewini Advanced Computer Architecture CSE 8383 March 20, 2008 Session 9.
1 Lecture 19: Scalable Protocols & Synch Topics: coherence protocols for distributed shared-memory multiprocessors and synchronization (Sections )
Computer Science and Engineering Copyright by Hesham El-Rewini Advanced Computer Architecture CSE 8383 February Session 13.
ECE 4100/6100 Advanced Computer Architecture Lecture 13 Multiprocessor and Memory Coherence Prof. Hsien-Hsin Sean Lee School of Electrical and Computer.
CALTECH cs184c Spring DeHon CS184c: Computer Architecture [Parallel and Multithreaded] Day 9: May 3, 2001 Distributed Shared Memory.
The University of Adelaide, School of Computer Science
1 Lecture: Coherence Topics: snooping-based coherence, directory-based coherence protocols (Sections )
Computer Science and Engineering Parallel and Distributed Processing CSE 8380 April 7, 2005 Session 23.
CMSC 611: Advanced Computer Architecture Shared Memory Most slides adapted from David Patterson. Some from Mohomed Younis.
The University of Adelaide, School of Computer Science
Multi Processing prepared and instructed by Shmuel Wimer Eng. Faculty, Bar-Ilan University June 2016Multi Processing1.
תרגול מס' 5: MESI Protocol
Architecture and Design of AlphaServer GS320
The University of Adelaide, School of Computer Science
CS 704 Advanced Computer Architecture
Lecture 18: Coherence and Synchronization
Multiprocessor Cache Coherency
CMSC 611: Advanced Computer Architecture
Example Cache Coherence Problem
CS5102 High Performance Computer Systems Distributed Shared Memory
Cache Coherence (controllers snoop on bus transactions)
Chip-Multiprocessor.
Multiprocessors CS258 S99.
11 – Snooping Cache and Directory Based Multiprocessors
Lecture 25: Multiprocessors
Lecture 9: Directory-Based Examples
Slides developed by Dr. Hesham El-Rewini Copyright Hesham El-Rewini
Lecture 25: Multiprocessors
The University of Adelaide, School of Computer Science
Cache coherence CEG 4131 Computer Architecture III
Lecture 24: Multiprocessors
Coherent caches Adapted from a lecture by Ian Watson, University of Machester.
Lecture 17 Multiprocessors and Thread-Level Parallelism
CPE 631 Lecture 20: Multiprocessors
Lecture 19: Coherence and Synchronization
The University of Adelaide, School of Computer Science
Presentation transcript:

Spring 2003CSE P5481 Cache Coherency Cache coherent processors reading processor must get the most current value most current value is the last write Cache coherency problem updates from 1 processor not known to others Mechanisms for maintaining cache coherency coherency state associated with a block of data bus/interconnect operations on shared data change the state for the processor that initiates an operation for other processors that have the data of the operation resident in their caches

Spring 2003CSE P5482 Cache Coherency Protocols Write-invalidate (Sequent Symmetry, SGI Power/Challenge, SPARCCenter 2000) processor obtains exclusive access for writes (becomes the “owner”) by invalidating data in other processors’ caches coherency miss (invalidation miss) cache-to-cache transfers good for: multiple writes to same word or block by one processor migratory sharing from processor to processor a problem is false sharing processors read & write to different words in a shared cache block cache coherency is maintained on a cache block basis block ownership bounces between processor caches

Spring 2003CSE P5483 Cache Coherency Protocols Write-update (SPARCCenter 2000) broadcast each write to actively shared data each processor with a copy snarfs the data good for inter-processor contention Competitive switches between them We will focus on write-invalidate.

Spring 2003CSE P5484 Cache Coherency Protocol Implementations Snooping used with low-end, bus-based MPs few processors centralized memory Directory-based used with higher-end MPs more processors distributed memory

Spring 2003CSE P5485 Snooping Implementation A distributed coherency protocol coherency state associated with each cache block each snoop maintains coherency for its own cache How the bus is used broadcast medium entire coherency operation is atomic wrt other processors keep-the-bus protocol: master holds the bus until the entire operation has completed split-transaction buses: request & response are different phases state value that indicates that an operation is in progress do not initiate another operation for a cache block that has one in progress

Spring 2003CSE P5486 Snooping Implementation Snoop implementation: separate tags & state for snoop lookups processor & snoop communicate for a state or tag change snoop on the highest level cache another reason it is a physically-accessed cache property of inclusion: all blocks in L1 are in L2 therefore only have to snoop on L2 may need to update L1 state if change L2 state

Spring 2003CSE P5487 A Low-end MP

Spring 2003CSE P5488 An Example Snooping Protocol Invalidation-based Each cache block is in one of three states shared: clean in all caches & up-to-date in memory block can be read by any processor exclusive: dirty in exactly one cache only that processor can write to it invalid: block contains no valid data

Spring 2003CSE P5489 State Transitions for a Given Cache Block State transitions caused by: events incurred by the processor associated with the cache read miss, write miss, write on shared block events incurred by snooping on the bus as a result of other processor actions, e.g., read miss by Q makes P’s block change from exclusive to shared write miss by Q makes P’s block change from exclusive to invalid

Spring 2003CSE P54810 State Machine (CPU side) Invalid Shared (read/only) Exclusive (read/write) CPU read miss CPU Write CPU read hit Place read op on bus Place write op on bus CPU read miss Write back block CPU write Place write op on bus CPU read miss Place read op on bus CPU write miss Write back cache block Place write op on bus CPU read hit Place read op on bus CPU write hit

Spring 2003CSE P54811 State Machine (Bus side) Invalid Shared (read/only) Exclusive (read/write) Write miss for this block Write back the block Read miss for this block Write back the block Write miss for this block

Spring 2003CSE P54812 Directory Implementation Distributed memory each processor (or cluster of processors) has its own memory processor-memory pairs are connected via a multi-path interconnection network snooping with broadcasting is wasteful point-to-point communication instead a processor has fast access to its local memory & slower access to “remote” memory located at other processors NUMA (non-uniform memory access) machines How cache coherency is handled no caches (Tera (Cray) MTA) disallow caching of shared data (Cray 3TD) hardware directories that record cache block state

Spring 2003CSE P54813 A High-end MP Proc Interconnection network $ Proc$ $ $ $ $ Mem Dir Mem Dir Mem Dir Mem Dir Mem Dir Mem Dir

Spring 2003CSE P54814 Directory Implementation Coherency state is associated with memory blocks that are the size of cache blocks cache state shared: at least 1 processor has the data cached & memory is up- to-date block can be read by any processor exclusive: 1 processor (the owner) has the data cached & memory is stale only that processor can write to it invalid: no processor has the data cached & memory is up-to-date which processors have read/write access bit vector in which 1 means the processor has the data optimization: space for 4 processors & trap for more write bit

Spring 2003CSE P54815 Directory Implementation Directories have different meanings (& therefore uses) to different processors home node: where the memory location of an address resides (and cached data may be there too) (static) local node: where the request initiated (relative) remote node: alternate location for the data if this processor has requested it (dynamic) In satisfying a memory request: messages sent between the different nodes in point-to-point communication messages get explicit replies Some simplifying assumptions for using the protocol processor blocks until the access is complete messages processed in the order received

Spring 2003CSE P54816 Read Miss for an Uncached Block P2 Mem Interconnection network $P3$ P4$P1$ 1: read miss 2: data value reply Mem Dir Mem Dir

Spring 2003CSE P54817 Read Miss for an Exclusive, Remote Block P2 Mem Interconnection network $P3$ P4$P1$ 1: read miss 4: data value reply 2: fetch Mem Dir Mem Dir Mem Dir 3: data write-back

Spring 2003CSE P54818 Write Miss for an Exclusive, Remote Block P2 Mem Interconnection network $P3$ P4$P1$ 1: write miss 4: data value reply 3: data write-back 2: fetch & invalidate Mem Dir Mem Dir Mem Dir

Spring 2003CSE P54819 Directory Protocol Messages Message typeSourceDestinationMsg Content Read missLocal cacheHome directoryP, A –Processor P reads data at address A; make P a read sharer and arrange to send data back Write miss Local cache Home directory P, A –Processor P writes data at address A; make P the exclusive owner and arrange to send data back InvalidateHome directory Remote cachesA –Invalidate a shared copy at address A. Fetch Home directory Remote cache A –Fetch the block at address A and send it to its home directory Fetch/Invalidate Home directory Remote cache A –Fetch the block at address A and send it to its home directory; invalidate the block in the cache Data value reply Home directory Local cache Data –Return a data value from the home memory (read or write miss response) Data write-backRemote cache Home directory A, Data –Write-back a data value for address A (invalidate response)

Spring 2003CSE P54820 CPU FSM for a Cache Block States identical to the snooping protocol Transactions very similar invalidate & data fetch requests replace broadcasted write misses read & write misses sent to home directory

Spring 2003CSE P54821 CPU FSM for a Cache Block Fetch/Invalidate Send data write-back Invalidate Invalid Shared (read/only) Exclusive (read/write) CPU read CPU read hit Send read miss CPU write Send write miss CPU write Send invalidate CPU write hit CPU read miss CPU write miss Data write-back message Send write miss message CPU read hit Fetch Send data write-back Read miss Send data write-back Send read miss

Spring 2003CSE P54822 Directory FSM for a Memory Block Same states and structure as for the cache block FSM Tracks all copies of a memory block Two state changes: update coherency state alter the number of sharers in the sharing set

23 Directory FSM for a Memory Block Data write-back Sharers = {} Uncached Shared (read only) Exclusive (read/write) Read miss Send data value reply Sharers = {P}, W = 0 Write miss Send invalidate to all sharers Send data value reply Sharers = {P}, W = P Write miss Send data value reply Sharers = {P}, W = P Read miss Send data fetch to owner Send data write-back Send data value reply Sharers += {P} Read miss Send data value reply Sharers += {P}, W = 0 Write miss Send fetch/invalidate to owner Send data write-back Send data value reply Sharers = {P}, W = P

Spring 2003CSE P54824 False Sharing Processes share cache blocks, not data Impact aggravated by: block size: why? cache size: why? large miss penalties: why? Reduced by: coherency protocols (state per subblock) let cache blocks become incoherent as long as there is only false sharing make them coherent if any processor true shares compiler optimizations (group & transpose, cache block padding) cache-conscious programming wrt initial data structure layout

Spring 2003CSE P54825 Examples (1999) Alpha 4-way MP bus-based, snooping MESI coherency protocol shared/private, clean/dirty state, invalid cache-to-cache transfers competitive snooping: begin with write-update; snoop decides to switch WU for locks; WI for data UltraSPARC 4-way MP bus-based, snooping MESI: cache-to-cache transfers; no snarfing dual-ported tags P6 clusters of 4-way MPs bus-based, snooping MESI: cache-to-cache transfers; snarfing