Processor support devices Part 2: Caches and the MESI protocol


Processor support devices Part 2: Caches and the MESI protocol. dr.ir. A.C. Verschueren, Eindhoven University of Technology, Section of Digital Information Systems.

The memory speed 'gap'
High-performance processors are much too fast for the main memory they are connected to.
A processor running at 1000 MHz would like a memory read/write cycle time of 1 nanosecond.
Large memories built from (relatively) cheap RAMs have cycle times on the order of 100 nanoseconds.
That is 100 times slower, and this speed gap continues to grow.

Wide words and memory banking
The gap can be closed IF the processor tolerates a long delay between the start and end of a cycle.
1) Wide memory words: 4 words are read in parallel (words 0..3, then 4..7) and then used one after the other. Drawback: lots of pins.
2) Multiple memory 'banks': 4 accesses run in parallel, one per bank. Drawback: complex timing.
[Figure: read/use timing diagrams for both schemes, words 0 to 7.]
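The effect of banking on a sequential read can be sketched with a very small timing model. This is only an illustration under stated assumptions: word i is placed in bank i modulo the number of banks (low-order interleaving), every bank is busy for one full memory cycle per access, and all addresses are known in advance. The function name and its parameters are made up for this sketch.

```python
def sequential_read_time(n_words, banks, cycle_ns):
    """Time (ns) to read n_words consecutive words from interleaved banks.

    Assumed model: word i lives in bank i % banks, a bank is busy for
    cycle_ns per access, and all addresses are known in advance.
    """
    bank_free_at = [0] * banks          # time at which each bank is idle again
    finished = 0
    for word in range(n_words):
        bank = word % banks             # low-order interleaving (assumption)
        start = bank_free_at[bank]      # wait until this bank is free
        bank_free_at[bank] = start + cycle_ns
        finished = max(finished, bank_free_at[bank])
    return finished


single = sequential_read_time(8, banks=1, cycle_ns=100)
banked = sequential_read_time(8, banks=4, cycle_ns=100)
print(single, banked)   # total time in ns without and with 4 banks
```

With four banks the eight words arrive in two memory cycles instead of eight, but only because the addresses could be issued ahead of time.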

The big IF in closing the gap
Long memory access delays can be tolerated IF addresses are known in advance.
This is true for sequential instruction reads, but NOT true for most of the other read operations.
Memory reading MUST become quicker!
The timing of write operations is of no interest: hand the data and address to memory, then forget about it.

Small-scale virtual memory: the cache
('Cache' is French for 'secret hiding place'.)
A 'cache' is a small but very fast memory which contains the 'most active' memory words.
IF a requested memory word is in the cache
THEN supply the word from the cache {very fast}
ELSE supply the word from main memory {rather slow} and place it in the cache for later references (throwing out unused words when needed)
An ideal cache knows which words will be used soon.
A good cache reaches 95% THEN and only 5% ELSE.
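The IF/THEN/ELSE rule above translates almost directly into code. The sketch below is an illustration, not the behaviour of any particular cache: the class name `TinyCache`, the least-recently-used eviction choice, and the 1 ns cache time are assumptions (the slides only give the 1 ns wish and the 100 ns memory cycle).

```python
from collections import OrderedDict


class TinyCache:
    """Word cache in front of a slow 'main memory' (a plain dict here)."""

    def __init__(self, memory, capacity):
        self.memory = memory
        self.capacity = capacity
        self.words = OrderedDict()      # address -> data, oldest use first
        self.hits = 0
        self.misses = 0

    def read(self, address):
        if address in self.words:               # IF the word is in the cache
            self.hits += 1                      # THEN supply it from the cache
            self.words.move_to_end(address)
            return self.words[address]
        self.misses += 1                        # ELSE read main memory,
        data = self.memory[address]
        if len(self.words) >= self.capacity:    # throw out an unused word
            self.words.popitem(last=False)      # (LRU choice is an assumption)
        self.words[address] = data              # and keep it for later
        return data


def average_access_ns(hit_rate, cache_ns, memory_ns):
    """Average read time for a given fraction of THEN (hit) cases."""
    return hit_rate * cache_ns + (1.0 - hit_rate) * memory_ns


memory = {addr: addr * 10 for addr in range(64)}
cache = TinyCache(memory, capacity=4)
for _ in range(5):                  # a small loop touching the same 4 words
    for addr in (0, 1, 2, 3):
        cache.read(addr)
print(cache.hits, cache.misses)
print(average_access_ns(0.95, 1.0, 100.0))
```

Under these assumed timings a 95% hit rate gives an average read time of roughly 6 ns, so even the 5% ELSE cases dominate the average.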

Keeping the cache hidden
The cache must keep a copy of memory words.
Memory-mapped I/O ports are problematic: they can spontaneously change their value, so they have to be made 'non-cacheable' at all times.
Shared memory is problematic too: either make it non-cacheable (from all sides) or, better, inform all attached caches of changes (write actions).

Cache writing policies
'Write-through': written data is copied into memory.
  Option: write to the cache only if the word is already present. This reduces the amount of data in the cache, but a read after a non-cached write requires a true memory read.
'Posted write': writes are buffered until the bus is free.
  This gives priority to reads and allows high-speed write bursts, at the cost of more hardware and a delay between the CPU write and the memory write.
'Late write': write to memory only to make free space in the cache.
  This drastically reduces the number of memory write cycles, but needs complex cache control, especially with shared memory!
(Slide annotation: Pentium.)
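The difference in memory traffic between write-through and late write can be made concrete by counting memory write cycles for a sequence of written addresses. The model below is a deliberate simplification and every name in it is hypothetical: it tracks only written (dirty) words, ignores reads, evicts the least recently written word, and flushes whatever is still dirty at the end.

```python
def memory_writes(policy, writes, capacity):
    """Count main-memory write cycles for a sequence of written addresses."""
    if policy == "write-through":
        return len(writes)               # every write also goes to memory
    if policy != "late-write":
        raise ValueError(policy)
    dirty = []                           # modified words, least recent first
    count = 0
    for address in writes:
        if address in dirty:
            dirty.remove(address)        # rewritten in cache, no memory cycle
        elif len(dirty) >= capacity:
            dirty.pop(0)                 # evict: now the late write happens
            count += 1
        dirty.append(address)
    return count + len(dirty)            # final flush of remaining dirty words


loop_counter = [7] * 100                 # the same word written 100 times
print(memory_writes("write-through", loop_counter, capacity=2))
print(memory_writes("late-write", loop_counter, capacity=2))
```

A word that is rewritten many times costs one memory cycle per write under write-through, but only a single cycle under late write, when it finally leaves the cache.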

An example of a cache
[Figure: an 80386 CPU with its CPU bus (data, address, control), a bus switch to the system bus and main memory, the cache memory, and the 82385 cache controller with its administration memory.]
To reduce the amount of administration memory, a single cache 'line' administrates a block of 8 words.

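Intel 82385 'direct mapped' cache mode
The 32-bit address is split into four fields: 'tag' (17 bits), line (10 bits), word (3 bits) and byte (2 bits).
The line field selects one of 1024 lines. Each line stores a 17-bit tag, a 'line valid' bit and 8 data words of 32 bits (word #0 to word #7), each with its own 'word valid' bit.
The word field selects the word within the line.
A 'hit' is signalled when the stored tag equals the tag in the address and the selected line and word are valid.
Also known as '1-way set associative'; prone to 'tag clashing'!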
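The address split and the 'tag clashing' problem can be illustrated with a few lines of code. This is a sketch using the field widths from the slide (17/10/3/2 bits), not a model of the real 82385: the per-word valid bits are left out, a miss is assumed to fill the whole line, and the names `split_address` and `DirectMappedCache` are invented here.

```python
def split_address(address, line_bits=10):
    """Split a 32-bit address into (tag, line, word, byte) fields."""
    byte = address & 0x3                            # 2 bits: byte within word
    word = (address >> 2) & 0x7                     # 3 bits: word within line
    line = (address >> 5) & ((1 << line_bits) - 1)  # 10 bits: line select
    tag = address >> (5 + line_bits)                # remaining 17 upper bits
    return tag, line, word, byte


class DirectMappedCache:
    """1024 lines, one tag per line; word valid bits omitted (assumption)."""

    def __init__(self):
        self.tags = [None] * 1024       # None means 'line valid' is off
        self.hits = 0
        self.misses = 0

    def access(self, address):
        tag, line, _word, _byte = split_address(address)
        if self.tags[line] == tag:
            self.hits += 1
        else:
            self.misses += 1
            self.tags[line] = tag       # the old occupant is thrown out


a = 0x12344
b = a + (1 << 15)                       # same line field, different tag
print(split_address(a), split_address(b))

cache = DirectMappedCache()
for _ in range(4):                      # alternate between the two addresses
    cache.access(a)
    cache.access(b)
print(cache.hits, cache.misses)
```

Two addresses that differ only in their tag compete for the same line, so alternating between them never produces a hit.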

Intel 82385 '2-way set associative' mode
The 32-bit address is now split into 'tag' (18 bits), line (9 bits), word (3 bits) and byte (2 bits).
The cache is organized as two sets of 512 lines. Each line has an 18-bit tag, a 'line valid' bit and 8 data words of 32 bits (word #0 to word #7) with 'word valid' bits.
The line field selects the same line in both sets, and the hit logic combines the two 'hit' signals.
'Least Recently Used' (LRU) bits indicate which set in each line has been used last (the other is the replacement target).
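A minimal sketch of the two-set organization with one LRU bit per line follows. It uses the field widths from the slide (18-bit tag, 9-bit line) but is otherwise an assumption-laden illustration: word valid bits are omitted, a miss fills the whole line, and the class name `TwoWayCache` is hypothetical.

```python
class TwoWayCache:
    """Two sets of 512 lines with one LRU bit per line."""

    LINE_BITS = 9                        # 512 lines per set

    def __init__(self):
        lines = 1 << self.LINE_BITS
        self.tags = [[None, None] for _ in range(lines)]  # one tag per set
        self.last_used = [0] * lines     # LRU bit: set used most recently
        self.hits = 0
        self.misses = 0

    def access(self, address):
        """Return True on a hit, False on a miss (which loads the line)."""
        line = (address >> 5) & ((1 << self.LINE_BITS) - 1)
        tag = address >> (5 + self.LINE_BITS)        # 18-bit tag
        for way in (0, 1):                           # check both sets
            if self.tags[line][way] == tag:
                self.hits += 1
                self.last_used[line] = way
                return True
        victim = 1 - self.last_used[line]            # replace the other set
        self.tags[line][victim] = tag
        self.last_used[line] = victim
        self.misses += 1
        return False


a = 0x12344
b = a + (1 << 15)                        # same line field as a, other tag
cache = TwoWayCache()
for _ in range(4):
    cache.access(a)
    cache.access(b)
print(cache.hits, cache.misses)
```

The same alternating pattern that defeats a direct-mapped cache now misses only on the first access to each address, because both lines fit side by side. A third address mapping to the same line would push out whichever of the two was used least recently.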

The MESI protocol
Late write and shared memory combine badly.
The 'MESI' protocol solves this with four states for each of the cache words (or lines):
Modified: cached data differs from main memory and is only located in this cache.
Exclusive: cached data is the same as main memory and is only located in this cache.
Shared: cached data is the same as main memory and is also located in one or more other caches.
Invalid: cache word/line is not loaded with memory data.

State changes in the MESI protocol
State changes are induced by processor read/write actions and by the actions of other cache controllers.
Caches keep track of other read/write actions by 'bus snooping': monitoring the address and control buses when they are driven by someone else.
During a memory access, the other cache controllers indicate whether one of them contains the accessed location.
This is needed to decide between the Shared and Exclusive states!

Intel 82496 CPU accesses (Pentium)
A read hit reads the cache and does not change state.
A read miss reads memory; other controllers check whether they also contain the address read.
The handling of a write hit depends on the state:
  If Shared, the write is done in main memory too.
  If Exclusive or Modified, the write is only done in the cache.
A write miss writes to memory, but not to the cache. (Normal MESI: write cache too.)
Other caches may change their state!
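The CPU-side rules can be written down as a small transition function for one cache line. This is a sketch, not the 82496 specification. In particular, the slides do not say which state follows a write hit on a Shared line; the code assumes Exclusive, reasoning that the other caches invalidate their copies when they snoop the write. The names `State` and `cpu_access` are invented for the example.

```python
from enum import Enum


class State(Enum):
    MODIFIED = "M"
    EXCLUSIVE = "E"
    SHARED = "S"
    INVALID = "I"


def cpu_access(state, op, cached_elsewhere=False):
    """Return (next_state, memory_access) for one CPU read or write.

    op is "read" or "write"; cached_elsewhere is the snoop answer of the
    other controllers and only matters on a read miss.
    """
    hit = state is not State.INVALID
    if op == "read":
        if hit:
            return state, False                   # read hit: no state change
        nxt = State.SHARED if cached_elsewhere else State.EXCLUSIVE
        return nxt, True                          # read miss: read memory
    if op != "write":
        raise ValueError(op)
    if not hit:
        return State.INVALID, True                # write miss: memory only
    if state is State.SHARED:
        return State.EXCLUSIVE, True              # ASSUMED next state
    return State.MODIFIED, False                  # E or M: cache only


state = State.INVALID
for op in ("read", "write", "write"):
    state, used_memory = cpu_access(state, op)
    print(op, state.name, used_memory)
```

Starting from Invalid with no other cache holding the line, the read loads it as Exclusive, and the following writes stay inside the cache while the line is Modified.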

Intel 82496 state diagram
[Figure: state diagram with the four states Invalid, Shared, Exclusive and Modified. Transition labels: read hit; write miss; read miss & somewhere else; read miss, only here; write hit (write to memory); write hit (setup for late write); read/write hit; snoop read; snoop read (*); snoop write; any snoop.]
(*): This controller copies local data to memory immediately.
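The snoop side of the protocol can be sketched in the same style. The flattened diagram no longer shows which arc each label belongs to, so the arcs below follow the usual MESI reading of the state definitions and should be taken as assumptions: a snooped read leaves any valid line Shared, a snooped write makes it Invalid, and only a snooped read of a Modified line triggers the immediate copy to memory marked (*). The names `State` and `snoop` are hypothetical.

```python
from enum import Enum


class State(Enum):
    MODIFIED = "M"
    EXCLUSIVE = "E"
    SHARED = "S"
    INVALID = "I"


def snoop(state, bus_op):
    """Return (next_state, copy_back) when another controller's access
    to this line is seen on the bus. bus_op is "read" or "write"."""
    if bus_op not in ("read", "write"):
        raise ValueError(bus_op)
    if state is State.INVALID:
        return State.INVALID, False          # 'any snoop': nothing to do
    if bus_op == "read":
        # Someone else now holds the line too. A Modified line is first
        # copied to memory, the (*) case in the diagram.
        return State.SHARED, state is State.MODIFIED
    # Snooped write: our copy is out of date. No copy-back is modelled
    # here because the slides mark (*) only on the snoop read (assumption).
    return State.INVALID, False


for start in State:
    print(start.name, snoop(start, "read"), snoop(start, "write"))
```

Together with the CPU-side rules this is what lets a controller choose between Shared and Exclusive on a read miss: any cache that answers the snoop keeps the line as Shared.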

Final remarks on caches (1)
High-performance processors rely on caches.
Main memory must be accessed in a single clock cycle.
At 1 GHz, the cache must be on the CPU chip.
But a large & fast cache takes a lot of chip space!
[Figure: the CPU chip holds the CPU and a small & fast on-chip first-level cache; next comes a large(r) & slow(er) off-chip second-level cache; then the huge & very slow main memory.]

Final remarks on caches (2)
The off-chip cache becomes as slow as main memory was some time ago...
The second-level cache is therefore placed on the CPU chip too. Examples: PowerPC, Crusoe (both > 256 kilobytes!).
The external cache becomes a third-level cache.
Data transfers between on-chip caches can move a complete cache line in parallel: a huge speedup.