COMP 740: Computer Architecture and Implementation

Montek Singh
Sep 21, 2016
Topic: Cache Coherence

Outline
Cache Coherence
Reading: Ch. 5.2, 5.4

Cache Coherence
Common problem with multiple copies of mutable information (in both hardware and software)
“If a datum is copied and the copy is to match the original at all times, then all changes to the original must cause the copy to be immediately updated or invalidated.” (Richard L. Sites, co-architect of DEC Alpha)
[Figure: timelines of Copy 1 and Copy 2 in four cases: copy becomes stale; copies diverge (hard to recover from); write update; write invalidate]

Memory-I/O model
Most modern processors use DMA (direct memory access)
DMA controller = “sidekick” that directly reads/writes memory to perform I/O
  e.g.: CPU tells the DMA controller to read N bytes/packets from the USB interface, place them at address A in main memory, and tell the CPU via an interrupt when it is all done
Potential problem: DMA directly reads/writes main memory, whereas the CPU reads/writes the cache
  potential for the cache copy to get “out of sync” with main memory
[Figure: processor (control, datapath), cache, memory, input, output]

Example of Cache Coherence: I/O
I/O in uniprocessor with primary unified cache
MM (main memory) copy and cache copy of a memory block are not always coherent
  WT (write-through) cache: MM copy stale while the write update to MM is in transit
  WB (write-back) cache: MM copy stale while the cache copy is Dirty
Inconsistency of no concern if no one reads/writes the MM copy
If I/O is directed to main memory, need to maintain coherence
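
The two I/O hazards above can be sketched with a toy model. This is only an illustration under stated assumptions: one-word blocks, a Python dict standing in for main memory, and a hypothetical `WriteBackCache` class whose `flush` and `invalidate` methods stand in for whatever mechanism a real system uses to keep DMA coherent.

```python
class WriteBackCache:
    """Toy write-back cache with one-word blocks (illustrative only)."""

    def __init__(self, memory):
        self.memory = memory      # dict: address -> value (main memory)
        self.lines = {}           # cached copies: address -> value
        self.dirty = set()        # addresses whose cached copy is Dirty

    def read(self, addr):
        if addr not in self.lines:            # miss: fetch from main memory
            self.lines[addr] = self.memory[addr]
        return self.lines[addr]

    def write(self, addr, value):
        self.lines[addr] = value              # write-back: MM is not updated yet
        self.dirty.add(addr)

    def flush(self, addr):
        if addr in self.dirty:                # copy the Dirty value back to MM
            self.memory[addr] = self.lines[addr]
            self.dirty.discard(addr)

    def invalidate(self, addr):
        self.lines.pop(addr, None)            # drop the cached copy
        self.dirty.discard(addr)


memory = {0x10: 1, 0x20: 5}
cache = WriteBackCache(memory)

# Output hazard: the CPU writes, then a DMA device reads main memory directly.
cache.write(0x10, 2)
dma_out_stale = memory[0x10]      # DMA sees the old value
cache.flush(0x10)
dma_out_fresh = memory[0x10]      # after a flush, DMA sees the new value

# Input hazard: the CPU has a cached copy, then DMA writes main memory directly.
cache.read(0x20)
memory[0x20] = 9                  # DMA input lands in main memory
cpu_stale = cache.read(0x20)      # cache hit returns the old value
cache.invalidate(0x20)
cpu_fresh = cache.read(0x20)      # miss refetches the new value
```

In this sketch the output case is repaired by flushing before the device reads, and the input case by invalidating before the CPU reads; which of these a given system does in hardware or software is not specified by the slides.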

The University of Adelaide, School of Computer Science
Types
Symmetric multiprocessors (SMP)
  Small number of cores
  Share single memory with uniform memory latency
Distributed shared memory (DSM)
  Memory distributed among processors
  Non-uniform memory access/latency (NUMA)
  Processors connected via direct (switched) and non-direct (multi-hop) interconnection networks
25 April 2018; Chapter 2 - Instructions: Language of the Computer

Example: Multiprocessor Caches
Processors may see different values of the same memory location through their caches.

Coherence Protocols: 2 strategies
Key challenge: one processor can modify a memory location while other processors are left with stale data in their private caches
What do we do? Either:
  communicate the new value to all other processors → write update
Or:
  tell other processors to throw away their stale data → write invalidate
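
The difference between the two strategies can be sketched as follows. This is a minimal model, not a real protocol: caches are plain dicts, memory is written through on every write to keep the sketch short, and the function names and the `policy` argument (`"none"`, `"update"`, `"invalidate"`) are made up for illustration.

```python
def write(caches, memory, writer, addr, value, policy):
    """Processor `writer` writes `value`; the other caches react per `policy`."""
    caches[writer][addr] = value
    memory[addr] = value                      # write-through, for simplicity
    for i, cache in enumerate(caches):
        if i == writer or addr not in cache:
            continue
        if policy == "update":
            cache[addr] = value               # push the new value to the copy
        elif policy == "invalidate":
            del cache[addr]                   # throw the stale copy away
        # policy == "none": the stale copy is left in place


def read(caches, memory, reader, addr):
    """Return (value, hit) for a read by processor `reader`."""
    cache = caches[reader]
    hit = addr in cache
    if not hit:
        cache[addr] = memory[addr]            # miss: fetch from memory
    return cache[addr], hit


def run(policy):
    memory = {"X": 0}
    caches = [{}, {}]
    read(caches, memory, 0, "X")              # both processors cache X = 0
    read(caches, memory, 1, "X")
    write(caches, memory, 0, "X", 1, policy)  # P0 writes X = 1
    return read(caches, memory, 1, "X")       # what does P1 see?


no_protocol = run("none")           # (0, True): stale value, served as a hit
write_update = run("update")        # (1, True): new value, still a hit
write_invalidate = run("invalidate")  # (1, False): new value, after a miss
```

Both strategies return the new value to P1. Update keeps the copy usable at the cost of broadcasting data on every write, while invalidate sends only a notification and pays a miss on the next read.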

Write Invalidate Example
Write invalidate
  On write, invalidate all other copies
  Use the bus itself to serialize: a write cannot complete until bus access is obtained

Coherence vs. Consistency
Closely related, but different; people often use them interchangeably (incorrectly)
Coherence defines what values can be returned by a read
  all reads by any processor must return the most recently written value
  writes to the same location by any two processors are seen in the same order by all processors
Consistency determines when a written value will be returned by a read
  if a processor writes location A followed by location B, any processor that sees the new value of B must also see the new value of A

Coherence
A memory system is coherent if:
A read by processor P to location X after a write by P to X, with no intervening write of X by another processor, always returns the value written by P
  → P reads back exactly what it just wrote
A read by a processor to location X after a write by another processor to X returns the written value if the read and write are sufficiently separated in time and there are no other intervening writes to X
  → P reads back the last write of X soon/eventually
Two writes to the same location by any two processors are seen in the same order by all processors. For example, if the values 1 and then 2 are written to X, no processor will see 2 before 1 (though not every write has to be seen)
  → Writes to the same location are serialized
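
The third condition (write serialization) can be checked mechanically. The sketch below assumes that every write to X stores a distinct value and that the global write order is known; the function name `respects_write_order` is hypothetical.

```python
from itertools import groupby


def respects_write_order(observed, write_order):
    """True if the values one processor read from X are consistent with
    `write_order`: values may be skipped or read repeatedly, but never
    seen out of order. Assumes each write stores a distinct value."""
    distinct = [value for value, _ in groupby(observed)]  # collapse repeated reads
    remaining = iter(write_order)
    # `value in remaining` advances the iterator, so this is a subsequence test.
    return all(value in remaining for value in distinct)


writes_to_x = [1, 2]                                  # 1 is written, then 2
ok_both = respects_write_order([1, 2], writes_to_x)       # sees 1 then 2
ok_skipped = respects_write_order([2], writes_to_x)       # never saw 1: allowed
ok_repeat = respects_write_order([1, 1, 2], writes_to_x)  # repeated reads: allowed
bad_order = respects_write_order([2, 1], writes_to_x)     # 2 before 1: violation
```

A system is serialized for X in this sense only if the check passes for every processor's observations against the same write order.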

Assumptions
  A write does not complete (and allow the next write to occur) until all processors have seen the effect of that write
  A processor does not change the order of any write w.r.t. any other memory access
Implications
  if P writes location A and then location B, any processor that sees the new value of B must also see the new value of A
  a processor can reorder reads, but writes must finish in program order
    i.e., reads can be reordered, but only within the boundaries of the write operations immediately before and after
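
The A-then-B implication can be checked by brute force on a tiny example. The sketch below assumes a single shared memory with atomic operations and enumerates every interleaving of a two-write program on P0 with a two-read program on P1; the helper names are made up for illustration. Outcomes are recorded as pairs `(value read from B, value read from A)`.

```python
from itertools import combinations

P0 = [("write", "A", 1), ("write", "B", 1)]   # writes A, then B
P1 = [("read", "B"), ("read", "A")]           # reads B, then A


def interleavings(a, b):
    """Yield every merge of `a` and `b` that keeps each list's own order."""
    n = len(a) + len(b)
    for positions in combinations(range(n), len(a)):
        ia, ib = iter(a), iter(b)
        yield [next(ia) if i in positions else next(ib) for i in range(n)]


def outcomes(p0, p1):
    """Set of (B read, A read) pairs over all interleavings."""
    seen = set()
    for schedule in interleavings(p0, p1):
        memory = {"A": 0, "B": 0}
        reads = {}
        for op in schedule:
            if op[0] == "write":
                memory[op[1]] = op[2]
            else:
                reads[op[1]] = memory[op[1]]
        seen.add((reads["B"], reads["A"]))
    return seen


in_order = outcomes(P0, P1)
# Hypothetical machine that lets the write to B finish before the write to A:
reordered = outcomes(list(reversed(P0)), P1)
```

With writes in program order, the outcome `(1, 0)` (new B, old A) never appears; once the two writes are swapped it becomes possible, which is the situation the assumptions above rule out.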

Types of Coherence Protocols
Directory based
  a centralized data structure (“directory”) holds the sharing status of every block in the cache system
  distributed directories are also possible but much more complex
  directory becomes a single serialization point
Snooping
  every cache that has a copy of the data block tracks the sharing status of the block
  all caches typically connected to a shared bus
    connected to each other: for sharing messages, and for copying blocks
    connected to memory: for copying blocks
  each cache “listens” to what other caches are doing to infer updates to the status of the block
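
To make the idea of a centralized directory concrete, here is a minimal sketch of the bookkeeping it might do. The `Directory` class, its fields, and its return values are assumptions made for illustration; real directory protocols (Ch. 5.4) track more states and exchange explicit messages.

```python
class Directory:
    """Toy centralized directory: tracks which caches hold each block."""

    def __init__(self):
        self.sharers = {}   # block -> set of cache ids holding a copy
        self.owner = {}     # block -> cache id holding the only modified copy

    def read(self, block, cache_id):
        """Record a read miss; return the owner that must supply the data,
        or None if memory is up to date."""
        if self.owner.get(block) == cache_id:
            return None                          # requester already owns the block
        supplier = self.owner.pop(block, None)   # owner, if any, becomes a sharer
        self.sharers.setdefault(block, set()).add(cache_id)
        return supplier

    def write(self, block, cache_id):
        """Record a write; return the set of caches that must invalidate."""
        victims = self.sharers.get(block, set()) - {cache_id}
        self.sharers[block] = {cache_id}
        self.owner[block] = cache_id
        return victims


directory = Directory()
first = directory.read("B", 0)       # None: memory supplies the block
second = directory.read("B", 1)      # None: memory supplies the block
victims = directory.write("B", 0)    # {1}: cache 1 must invalidate its copy
supplier = directory.read("B", 1)    # 0: cache 0 holds the modified copy
```

Because every request goes through this one structure, requests for the same block are naturally ordered, which is the "single serialization point" property noted above.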

“Snoopy” Protocols
We will discuss a simple three-state protocol (MSI)
  Section 5.2
Several extensions possible
  MESI, MESIF, MOESI, etc.
  IEEE standards
  Used by many machines, including Intel i7, AMD Opteron
Snooping: individual caches monitor memory bus activity and take some actions based on this activity
  introduces a fourth category of miss to the 3C model: coherence misses
First, we need some notation to discuss the protocols

Three-State Write-Invalidate Protocol
MSI Protocol: modification of WB cache
  3 states: Modified, Shared, Invalid
Assumptions
  Single bus and main memory (MM)
  Two or more CPUs, each with WB cache
  Every cache block in one of three states: Invalid, Clean, Dirty (also called Invalid, Shared, Modified)
  MM copies of blocks have no state
  At any moment, a single cache owns the bus (is bus master)
  Bus master issues bus commands; all others obey
All misses (reads or writes) are serviced by
  MM, if all cache copies are Clean (Shared)
  otherwise, the only Dirty (Modified) cache copy (which is then no longer Dirty); the MM copy is written instead of being read

Understanding the MSI Protocol
Only two global states:
  EITHER: most up-to-date copy is the MM copy, and all cache copies are Clean (Shared)
  OR: most up-to-date copy is a single unique cache copy in state Dirty (Modified)
[Figure: contents of MM, C1, and C2 in three cases]
Bus owner Clean, another Clean copy exists
  Can read without notifying other caches
Bus owner Dirty, no other cache copies
  Can read or write without notifying other caches
Bus owner Clean, no other cache copies
  Can read without notifying other caches
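
The "only two global states" claim amounts to an invariant over the per-cache states of one block. A minimal sketch of that check, using single-letter state names and a hypothetical `global_state` helper:

```python
def global_state(states):
    """Classify the per-cache states ("M", "S", or "I") of one block.

    Returns which copy is most up to date, or raises ValueError if the
    combination is one the MSI protocol should never produce."""
    modified = states.count("M")
    shared = states.count("S")
    if modified == 0:
        return "memory copy is up to date"       # any number of Clean copies
    if modified == 1 and shared == 0:
        return "single Dirty cache copy is up to date"
    raise ValueError(f"illegal MSI combination: {states}")


clean_case = global_state(["S", "S", "I"])
dirty_case = global_state(["M", "I", "I"])
try:
    global_state(["M", "S", "I"])                # Dirty and Clean copies together
    illegal_detected = False
except ValueError:
    illegal_detected = True
```

A check like this is a handy assertion to run after every step of a protocol simulation.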

MSI Coherence Protocol

Tabular form (Part 1) NOTE: Clean=Shared, Dirty=Modified

Tabular form (Part 2) NOTE: Clean=Shared, Dirty=Modified

MSI State graph: two parts

MSI State graph: combined
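
The MSI transitions can be exercised with a small simulation. The sketch below is an assumption-laden reading of the protocol described in the earlier slides, not a transcription of the state graph: blocks are one word, there are only two bus commands (`read_miss` and `write_miss`), a Dirty copy is written back to memory whenever another cache misses on it, and the class and method names are invented for illustration.

```python
INVALID, SHARED, MODIFIED = "I", "S", "M"


class Cache:
    def __init__(self, ident, bus):
        self.ident, self.bus = ident, bus
        self.state = {}    # addr -> state; a missing entry means Invalid
        self.data = {}     # addr -> cached value

    def read(self, addr):
        if self.state.get(addr, INVALID) == INVALID:
            self.bus.broadcast("read_miss", self.ident, addr)
            self.data[addr] = self.bus.memory[addr]
            self.state[addr] = SHARED
        return self.data[addr]                  # read hits stay off the bus

    def write(self, addr, value):
        if self.state.get(addr, INVALID) != MODIFIED:
            # Write miss, or write hit on a Clean block: tell everyone else.
            self.bus.broadcast("write_miss", self.ident, addr)
            self.state[addr] = MODIFIED
        self.data[addr] = value                 # one-word block: overwrite it

    def snoop(self, kind, addr):
        state = self.state.get(addr, INVALID)
        if state == INVALID:
            return
        if state == MODIFIED:
            self.bus.memory[addr] = self.data[addr]   # supply data via write-back
        self.state[addr] = SHARED if kind == "read_miss" else INVALID


class Bus:
    def __init__(self, n_caches, memory):
        self.memory = memory
        self.log = []      # every bus transaction, in order
        self.caches = [Cache(i, self) for i in range(n_caches)]

    def broadcast(self, kind, source, addr):
        self.log.append((kind, source, addr))
        for cache in self.caches:
            if cache.ident != source:
                cache.snoop(kind, addr)


bus = Bus(2, {"X": 0})
c0, c1 = bus.caches
c0.read("X")         # read miss: C0 becomes Shared
c1.read("X")         # read miss: C1 becomes Shared
c0.write("X", 1)     # write hit on Clean block: bus write miss, C1 invalidated
c0.write("X", 2)     # write hit on Dirty block: invisible on bus
v = c1.read("X")     # read miss: C0 writes back, both end up Shared
```

Inspecting `bus.log` after the run shows which accesses were visible on the bus: the three misses and the first write to a Clean block, but not the second write.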

Comparison with Single WB Cache
Similarities
  Read hit invisible on bus
  All misses visible on bus
Differences
  In single WB cache, all misses are serviced by MM; in three-state protocol, misses are serviced either by MM or by the unique cache block holding the only Dirty copy
  In single WB cache, write hit is invisible on bus; in three-state protocol, write hit of a Clean block invalidates all other Clean blocks by a Bus Write Miss (necessary action)

Extensions to Basic Coherence Protocol
MESI: adds Exclusive state
  indicates cache block is clean and in only a single cache
  benefit: can be written without issuing any invalidates
  helps with repeated writes
MESIF: adds Forward state
  indicates which sharing cache should respond to a request for reading a missed block
  Intel i7 uses it
MOESI: adds Owned state
  indicates cache block is “owned” by that cache and out-of-date in memory
  benefit: avoids writing to memory
  AMD Opteron uses it

MESI vs. MSI
Similarities
  Read hit invisible on bus
  All misses handled the same way
Differences
  Big improvement in handling write hits
  Write hit in Exclusive state invisible on bus
  Only a write hit in Shared state is visible on bus
States
  Exclusive state: can be read or written
  Shared state: can be read only
  Modified state: can be read and written
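
The write-hit difference can be quantified by counting bus transactions for a single block under each protocol. The sketch below is a simplified model with assumed rules: a read miss enters Exclusive under MESI only when no other cache holds the block, and each miss or Shared-state write costs exactly one bus transaction. The function name and trace format are made up for illustration.

```python
def bus_transactions(trace, n_caches, protocol):
    """Count bus transactions for one block.

    trace: list of (cache_id, op) with op "r" or "w"
    protocol: "MSI" or "MESI"
    """
    state = ["I"] * n_caches
    count = 0
    for who, op in trace:
        others = [i for i in range(n_caches) if i != who]
        if op == "r":
            if state[who] == "I":
                count += 1                              # read miss on the bus
                alone = all(state[i] == "I" for i in others)
                for i in others:
                    if state[i] in ("M", "E"):
                        state[i] = "S"                  # other holder downgrades
                state[who] = "E" if protocol == "MESI" and alone else "S"
        else:
            if state[who] in ("I", "S"):
                count += 1                              # write miss / invalidation
                for i in others:
                    state[i] = "I"
            # A write in E (MESI only) or M needs no bus transaction.
            state[who] = "M"
    return count


private = [(0, "r"), (0, "w")]              # one processor reads, then writes
shared = [(0, "r"), (1, "r"), (0, "w")]     # block is shared before the write

msi_private = bus_transactions(private, 2, "MSI")
mesi_private = bus_transactions(private, 2, "MESI")
msi_shared = bus_transactions(shared, 2, "MSI")
mesi_shared = bus_transactions(shared, 2, "MESI")
```

For the private read-then-write pattern MESI saves the invalidation that MSI must send; once the block is actually shared, the two protocols behave the same in this model.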

Impact on Performance
Performance impact due to invalidation:
  Processor can lose a cache block through invalidation by another processor
  Average memory access time goes up, since writes to shared blocks take more time (other copies have to be invalidated)

Performance
Coherence influences cache miss rate
Coherence misses
  True sharing misses
    Write to shared block (transmission of invalidation)
    Read an invalidated block
  False sharing misses
    Read an unmodified word in an invalidated block
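
The true/false sharing distinction depends only on whether the word being read is the word the other processor wrote. A minimal sketch, assuming a 64-byte block and word-aligned byte addresses; the block size and the function names are assumptions, not values from the slides.

```python
BLOCK_BYTES = 64   # assumed cache block size


def same_block(addr_a, addr_b):
    return addr_a // BLOCK_BYTES == addr_b // BLOCK_BYTES


def classify_miss(written_addr, read_addr):
    """Classify the miss a processor takes when reading `read_addr` after
    another processor wrote `written_addr` (both word-aligned)."""
    if not same_block(written_addr, read_addr):
        return "no invalidation"        # the write touched a different block
    if written_addr == read_addr:
        return "true sharing"           # the word itself was modified
    return "false sharing"              # unmodified word in an invalidated block


same_word = classify_miss(0, 0)         # both processors use the same word
neighbors = classify_miss(0, 8)         # two counters packed into one block
far_apart = classify_miss(0, 64)        # counters in separate blocks
```

The `neighbors` case is the classic reason per-thread counters are often padded out to a full block each: the data is not logically shared, yet every write invalidates the other processor's copy.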

Performance Study: Commercial Workload

Directory-Based Protocols
Self Study: Ch. 5.4
