COMP 740: Computer Architecture and Implementation


COMP 740: Computer Architecture and Implementation Montek Singh Sep 21, 2016 Topic: Cache Coherence

Outline: Cache Coherence. Reading: Ch. 5.2, 5.4

Cache Coherence
A common problem with multiple copies of mutable information (in both hardware and software). "If a datum is copied and the copy is to match the original at all times, then all changes to the original must cause the copy to be immediately updated or invalidated." (Richard L. Sites, co-architect of DEC Alpha)

Without a policy, a copy becomes stale and the copies diverge, which is hard to recover from:

Time:    1  2  3  4
Copy 1:  A  A  A  C
Copy 2:  -  A  B  B   (copies diverge; hard to recover from)

Write update:
Copy 1:  A  A  A  B
Copy 2:  -  A  B  B

Write invalidate:
Copy 1:  A  A  A  -
Copy 2:  -  A  B  B
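The two propagation strategies above can be sketched in a few lines. This is a hypothetical toy model (caches as plain dicts, `None` marking an invalidated copy), not any real protocol:

```python
# Toy model of the two propagation strategies.
# Hypothetical sketch: caches are plain dicts; None marks an invalid copy.

def write(caches, writer, value, policy):
    """Processor `writer` writes `value`; propagate per `policy`."""
    caches[writer] = value
    for p in caches:
        if p == writer:
            continue
        if policy == "update":
            caches[p] = value          # push the new value to every copy
        elif policy == "invalidate":
            if caches[p] is not None:
                caches[p] = None       # throw away stale copies

caches = {"P0": "A", "P1": "A"}
write(caches, "P0", "B", "update")
print(caches)                          # {'P0': 'B', 'P1': 'B'}

caches = {"P0": "A", "P1": "A"}
write(caches, "P0", "B", "invalidate")
print(caches)                          # {'P0': 'B', 'P1': None}
```

Either way, the two copies can never silently diverge as in the first table.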

Memory-I/O model
Most modern processors use DMA. The DMA controller is a "sidekick" that directly reads/writes memory to perform I/O. E.g., the CPU tells the DMA controller to read N bytes/packets from the USB interface and place them at address A in main memory, then notify the CPU via an interrupt when it is all done.
Potential problem: DMA directly reads/writes main memory, whereas the CPU reads/writes the cache, so there is potential for the cache copy to get "out of sync" with main memory.
[Diagram: Processor (Control, Datapath, Cache) connected to Memory, Input, and Output]
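The hazard can be made concrete with a toy write-back cache next to a DMA engine that bypasses it. All names here are illustrative, not a real driver API:

```python
# Hypothetical sketch of the DMA hazard: the CPU writes through a
# write-back cache, so main memory is not updated until write-back,
# and a DMA read of memory observes the stale value.

memory = {0xA: "old"}
cache = {}                       # write-back cache: addr -> (value, dirty)

def cpu_write(addr, value):
    cache[addr] = (value, True)  # dirty in cache, memory untouched

def dma_read(addr):
    return memory[addr]          # DMA bypasses the cache entirely

def writeback(addr):
    value, dirty = cache.pop(addr)
    if dirty:
        memory[addr] = value

cpu_write(0xA, "new")
print(dma_read(0xA))             # 'old'  <- stale: DMA missed the dirty copy
writeback(0xA)
print(dma_read(0xA))             # 'new'
```

This is exactly the WB-cache case on the next slide: the MM copy is stale for as long as the cache copy is Dirty.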

Example of Cache Coherence: I/O
I/O in a uniprocessor with a primary unified cache: the MM copy and cache copy of a memory block are not always coherent.
WT cache: the MM copy is stale while a write update to MM is in transit.
WB cache: the MM copy is stale while the cache copy is Dirty.
The inconsistency is of no concern if no one reads/writes the MM copy; if I/O is directed to main memory, we need to maintain coherence.

Types (slides: The University of Adelaide, School of Computer Science)
Symmetric multiprocessors (SMP): small number of cores; share a single memory with uniform memory latency.
Distributed shared memory (DSM): memory distributed among processors; non-uniform memory access/latency (NUMA); processors connected via direct (switched) and non-direct (multi-hop) interconnection networks.

Example: Multiprocessor Caches
Processors may see different values for the same location through their caches.

Coherence Protocols: 2 strategies
Key challenge: one processor can modify a memory location while other processors are left with stale data in their private caches. What do we do? Either communicate the new value to all other processors (write update), or tell the other processors to throw away their stale data (write invalidate).

Write Invalidate Example
Write invalidate: on a write, invalidate all other copies. Use the bus itself to serialize writes: a write cannot complete until bus access is obtained.

Coherence vs. Consistency
Closely related, but people often use them interchangeably (incorrectly).
Coherence defines what values can be returned by a read: all reads by any processor must return the most recently written value, and writes to the same location by any two processors are seen in the same order by all processors.
Consistency determines when a written value will be returned by a read: if a processor writes location A followed by location B, any processor that sees the new value of B must also see the new value of A.

Coherence
A memory system is coherent if:
A read by processor P to location X after a write by P to X, with no intervening write of X by another processor, always returns the value written by P. (P reads back exactly what it just wrote.)
A read by a processor to location X after a write by another processor to X returns the written value if the read and write are sufficiently separated in time and there are no other intervening writes to X. (P reads back the last write of X soon/eventually.)
Two writes to the same location by any two processors are seen in the same order by all processors. For example, if the values 1 and then 2 are written to X, no processor will see 2 before 1 (though not all writes have to be seen). (Writes to the same location are serialized.)
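The third condition (write serialization) can be checked mechanically on observed traces. A small hypothetical checker, where each processor reports the order in which it first saw each written value:

```python
# Sketch of the write-serialization condition: any two processors must
# agree on the relative order of the writes they both observed.
# `views` is a hypothetical trace format, one sequence per processor.

def consistent_orders(views):
    """Return True if all per-processor views of writes to one
    location are consistent with a single serial order."""
    for a in views:
        for b in views:
            common_in_a = [v for v in a if v in b]
            common_in_b = [v for v in b if v in a]
            # commonly-seen writes must appear in the same relative order
            if common_in_a != common_in_b:
                return False
    return True

# P2 missed the write of 1 entirely: allowed ("not all writes have to be seen")
print(consistent_orders([[1, 2], [1, 2], [2]]))   # True
# One processor saw 2 before 1, another saw 1 before 2: forbidden
print(consistent_orders([[1, 2], [2, 1]]))        # False
```

Note the first example: skipping a write is legal; only reordering writes that were seen is not.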

Assumptions
A write does not complete (and allow the next write to occur) until all processors have seen the effect of that write. A processor does not change the order of any write w.r.t. any other memory access.
Implications: if P writes location A and then location B, any processor that sees the new value of B must also see the new value of A. A processor can reorder reads, but writes must finish in program order; i.e., all reads can be reordered, but only within the boundaries of the write operations immediately before and after.

Types of Coherence Protocols
Directory based: a centralized data structure (the "directory") holds the sharing status of every block in the cache system. Distributed directories are also possible but much more complex; the directory becomes a single serialization point.
Snooping: every cache that has a copy of the data block tracks the sharing status of the block. All caches are typically connected to a shared bus: to each other for sharing messages and copying blocks, and to memory for copying blocks. Each cache "listens" to what the other caches are doing to infer updates to the status of the block.
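The key operational difference shows up on a write. A toy contrast (all names hypothetical; real protocols also track per-block states):

```python
# Toy contrast of the two protocol types on a write to a shared block.

sharers = {"B0": {"P0", "P1", "P2"}}    # directory: block -> set of sharers

def directory_write(block, writer):
    """Directory knows exactly who shares the block: send targeted
    invalidations, then record the writer as the sole owner."""
    invalidated = sharers[block] - {writer}
    sharers[block] = {writer}
    return sorted(invalidated)

def snoop_write(block, writer, all_caches):
    """No central state: broadcast on the bus; every other cache snoops
    the write and invalidates its own copy if it happens to have one."""
    return sorted(p for p in all_caches if p != writer)

print(directory_write("B0", "P0"))                        # ['P1', 'P2']
print(snoop_write("B0", "P0", ["P0", "P1", "P2", "P3"]))  # ['P1', 'P2', 'P3']
```

The directory sends messages only to actual sharers (P3 is left alone), at the cost of the centralized lookup; snooping broadcasts to everyone, which is why it pairs naturally with a shared bus.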

“Snoopy” Protocols
We will discuss a simple protocol: the three-state snooping protocol (MSI), Section 5.2. Several extensions are possible (MESI, MESIF, MOESI, etc., IEEE standards), used by many machines, including the Intel i7 and AMD Opteron.
Snooping: individual caches monitor memory bus activity and take actions based on this activity. This introduces a fourth category of miss to the 3C model: coherence misses.
First, we need some notation to discuss the protocols.

Three-State Write-Invalidate Protocol
MSI protocol: a modification of the WB cache, with 3 states: Modified, Shared, Invalid.
Assumptions:
Single bus and main memory (MM).
Two or more CPUs, each with a WB cache.
Every cache block is in one of three states: Invalid, Clean, Dirty (also called Invalid, Shared, Modified). MM copies of blocks have no state.
At any moment, a single cache owns the bus (is bus master); the bus master issues bus commands and all others obey.
All misses (reads or writes) are serviced by MM if all cache copies are Clean (Shared); otherwise by the only Dirty (Modified) cache copy, which is then no longer Dirty, and the MM copy is written instead of being read.
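Under the assumptions above, the whole protocol for one block fits in a short simulation. This is a minimal sketch (class and method names are mine), tracking only the per-cache state, with memory write-back folded into the state changes:

```python
# Minimal single-block MSI simulation: one bus, write-back caches.
# On a read miss, the single Modified copy (if any) services the miss,
# writes back, and drops to Shared; otherwise MM services it.

I, S, M = "Invalid", "Shared", "Modified"

class MSISystem:
    def __init__(self, n):
        self.state = [I] * n                    # state of the block per cache

    def read(self, p):
        if self.state[p] == I:                  # read miss goes on the bus
            for q, s in enumerate(self.state):
                if s == M:
                    self.state[q] = S           # Modified copy writes back
            self.state[p] = S
        # read hit in S or M: invisible on the bus

    def write(self, p):
        if self.state[p] != M:                  # write miss or upgrade
            for q in range(len(self.state)):
                if q != p:
                    self.state[q] = I           # invalidate all other copies
            self.state[p] = M
        # write hit in M: invisible on the bus

msi = MSISystem(3)
msi.read(0); msi.read(1)
print(msi.state)    # ['Shared', 'Shared', 'Invalid']
msi.write(2)
print(msi.state)    # ['Invalid', 'Invalid', 'Modified']
msi.read(0)
print(msi.state)    # ['Shared', 'Invalid', 'Shared']
```

Note the invariant from the next slide holds at every step: either all valid copies are Shared, or there is exactly one copy and it is Modified.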

Understanding the MSI Protocol
Only two global states:
EITHER the most up-to-date copy is the MM copy, and all cache copies are Clean (Shared);
OR the most up-to-date copy is a single unique cache copy in state Dirty (Modified).
Bus owner Clean, another Clean copy exists: can read without notifying other caches.
Bus owner Clean, no other cache copies: can read without notifying other caches.
Bus owner Dirty, no other cache copies: can read or write without notifying other caches.

MSI Coherence Protocol

Tabular form (Part 1) NOTE: Clean=Shared, Dirty=Modified

Tabular form (Part 2) NOTE: Clean=Shared, Dirty=Modified

MSI State graph: two parts

MSI State graph: combined

Comparison with Single WB Cache
Similarities: a read hit is invisible on the bus; all misses are visible on the bus.
Differences: in a single WB cache, all misses are serviced by MM; in the three-state protocol, misses are serviced either by MM or by the unique cache block holding the only Dirty copy. In a single WB cache, a write hit is invisible on the bus; in the three-state protocol, a write hit on a Clean block invalidates all other Clean blocks via a Bus Write Miss (a necessary action).

Extensions to Basic Coherence Protocol
MESI: adds an Exclusive state, indicating the cache block is clean and present in only a single cache. Benefit: the block can be written without issuing any invalidates; helps with repeated writes.
MESIF: adds a Forward state, indicating which sharing cache should respond to a request for a missed block. The Intel i7 uses it.
MOESI: adds an Owned state, indicating the cache block is "owned" by that cache and out-of-date in memory. Benefit: avoids writing to memory. The AMD Opteron uses it.
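The MESI benefit is easy to see by extending the MSI sketch with an Exclusive state and a hypothetical counter for bus invalidations:

```python
# Sketch of the MESI refinement: a block read when no one else holds it
# enters Exclusive, so a later write hit upgrades silently, with no bus
# invalidation. Toy single-block model; the counter is illustrative.

I, S, E, M = "Invalid", "Shared", "Exclusive", "Modified"

class MESISystem:
    def __init__(self, n):
        self.state = [I] * n
        self.bus_invalidations = 0

    def read(self, p):
        if self.state[p] == I:
            others = [q for q, s in enumerate(self.state) if s != I and q != p]
            for q in others:
                self.state[q] = S               # M/E copies drop to Shared
            self.state[p] = S if others else E  # alone -> Exclusive

    def write(self, p):
        if self.state[p] in (E, M):             # silent upgrade: no bus traffic
            self.state[p] = M
            return
        self.bus_invalidations += 1             # from S or I: must invalidate
        for q in range(len(self.state)):
            self.state[q] = I if q != p else M

mesi = MESISystem(2)
mesi.read(0)             # only copy -> Exclusive
mesi.write(0)            # E -> M with no invalidation on the bus
print(mesi.state, mesi.bus_invalidations)   # ['Modified', 'Invalid'] 0
mesi.read(1)             # the M copy drops to Shared
mesi.write(1)            # S -> M now costs a bus invalidation
print(mesi.state, mesi.bus_invalidations)   # ['Invalid', 'Modified'] 1
```

In plain MSI the first write would have gone on the bus too; Exclusive removes that traffic for private data, which is the common case for unshared blocks.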

MESI vs. MSI
Similarities: a read hit is invisible on the bus; all misses are handled the same way.
Differences: big improvement in handling write hits. A write hit in the Exclusive state is invisible on the bus; only a write hit in the Shared state is visible on the bus.
Exclusive state: can be read or written. Shared state: can be read only. Modified state: can be read and written.

Impact on Performance
Performance impact due to invalidation: a processor can lose a cache block through invalidation by another processor. Average memory access time goes up, since writes to shared blocks take more time (the other copies have to be invalidated).

Performance
Coherence influences the cache miss rate via coherence misses:
True sharing misses: a write to a shared block (transmission of the invalidation), or a read of an invalidated block.
False sharing misses: a read of an unmodified word in an invalidated block.
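The true/false distinction hinges on invalidation happening at block granularity while sharing happens at word granularity. A toy model of one two-word block (all names hypothetical):

```python
# Toy illustration of true vs. false sharing: words 0 and 1 happen to
# live in the same cache block, so a write to either word invalidates
# the other processor's whole block.

class SharedBlock:
    def __init__(self):
        self.valid = {"P0": True, "P1": True}   # both start with the block
        self.true_misses = 0
        self.false_misses = 0

    def write(self, p, word):
        other = "P1" if p == "P0" else "P0"
        self.valid[other] = False               # whole block invalidated
        self.valid[p] = True
        self.last_written_word = word

    def read(self, p, word):
        if not self.valid[p]:                   # coherence miss: refetch block
            if word == self.last_written_word:
                self.true_misses += 1           # wanted the modified word
            else:
                self.false_misses += 1          # only a neighboring word changed
            self.valid[p] = True

blk = SharedBlock()
blk.write("P0", 0)
blk.read("P1", 0)        # true sharing: P1 reads the word P0 wrote
blk.write("P0", 0)
blk.read("P1", 1)        # false sharing: P1 wanted the untouched word 1
print(blk.true_misses, blk.false_misses)   # 1 1
```

The second miss transfers real data that P1 never needed; padding the two words into separate blocks would eliminate it without changing program semantics.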

Performance Study: Commercial Workload

Directory-Based Protocols Self Study: Ch. 5.4