ECE/CS 552: Cache Design © Prof. Mikko Lipasti Lecture notes based in part on slides created by Mark Hill, David Wood, Guri Sohi, John Shen and Jim Smith

Memory Hierarchy
- Temporal locality: keep recently referenced items at higher levels, so future references are satisfied quickly
- Spatial locality: bring neighbors of recently referenced items to higher levels, so future references are satisfied quickly
[Figure: hierarchy from CPU through I & D L1 caches, shared L2 cache, and main memory, down to disk]

Four Burning Questions
- Placement: Where can a block of memory go?
- Identification: How do I find a block of memory?
- Replacement: How do I make space for new blocks?
- Write Policy: How do I propagate changes?
Consider these for caches (usually SRAM); main memory and disks are covered later.

Placement

Memory Type  | Placement              | Comments
Registers    | Anywhere; Int, FP, SPR | Compiler/programmer manages
Cache (SRAM) | Fixed in H/W           | Direct-mapped, set-associative, fully-associative
DRAM         | Anywhere               | O/S manages
Disk         | HUH?                   |

Placement: Direct-Mapped
- Address range exceeds cache capacity, so map the address to the finite capacity
- Called a hash; usually just masks high-order bits
- Direct-mapped: a block can exist in only one location
- Hash collisions cause problems
[Figure: 32-bit address split into index and offset; the index selects one block of the SRAM cache, which drives Data Out]

Placement: Fully-Associative
- A block can exist anywhere: no more hash collisions
- Identification: how do I know I have the right block?
  - Called a tag check: must store address tags and compare against the address
  - Expensive! One tag and comparator per block
[Figure: 32-bit address split into tag and offset; the tag is compared (?=) against every stored tag to produce Hit and Data Out]

Placement: Set-Associative
- A block can be in a locations, where a is the associativity
- Hash collisions: up to a colliding blocks are still OK
- Identification: still perform a tag check, but only a tags are compared in parallel
[Figure: 32-bit address split into tag, index, and offset; the index selects a set of a tags and a data blocks, with a comparators (?=) producing Data Out]

Placement and Identification

Portion | Length                    | Purpose
Offset  | o = log2(block size)      | Select word within block
Index   | i = log2(number of sets)  | Select set of blocks
Tag     | t = 32 - o - i            | ID block within set

Consider <BS=block size, S=sets, B=blocks>:
- <64,64,64>: o=6, i=6, t=20: direct-mapped (S=B)
- <64,16,64>: o=6, i=4, t=22: 4-way set-associative (S=B/4)
- <64,1,64>: o=6, i=0, t=26: fully associative (S=1)
Total size = BS x B = BS x S x (B/S)
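
To make the bit arithmetic concrete, here is a minimal C sketch (mine, not from the lecture; ilog2 and decompose are illustrative names) that splits a 32-bit address for each of the three example geometries:

    #include <stdio.h>

    /* Integer log2 for power-of-two values. */
    static unsigned ilog2(unsigned x) {
        unsigned n = 0;
        while (x >>= 1) n++;
        return n;
    }

    /* Split a 32-bit address into tag/index/offset for a cache with
       block_size bytes per block and num_sets sets (powers of two). */
    static void decompose(unsigned addr, unsigned block_size, unsigned num_sets) {
        unsigned o = ilog2(block_size);   /* offset bits */
        unsigned i = ilog2(num_sets);     /* index bits  */
        unsigned t = 32 - o - i;          /* tag bits    */
        unsigned offset = addr & (block_size - 1);
        unsigned index  = (addr >> o) & (num_sets - 1);
        unsigned tag    = addr >> (o + i);
        printf("o=%u i=%u t=%u | tag=0x%x index=%u offset=%u\n",
               o, i, t, tag, index, offset);
    }

    int main(void) {
        unsigned addr = 0x12345678;
        decompose(addr, 64, 64);   /* <64,64,64>: direct-mapped     */
        decompose(addr, 64, 16);   /* <64,16,64>: 4-way set-assoc.  */
        decompose(addr, 64, 1);    /* <64,1,64>:  fully associative */
        return 0;
    }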

Replacement
- Cache has finite size: what do we do when it is full?
- Analogy: desktop full? Move books to the bookshelf to make room
- Same idea: move blocks to the next level of cache

Replacement
- How do we choose a victim?
- Verbs: victimize, evict, replace, cast out
- Several policies are possible:
  - FIFO (first-in-first-out)
  - LRU (least recently used)
  - NMRU (not most recently used)
  - Pseudo-random (yes, really!)
- Pick the victim within the set, where a = associativity
  - If a <= 2, LRU is cheap and easy (1 bit)
  - If a > 2, it gets harder
  - Pseudo-random works pretty well for caches
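
To see why a <= 2 makes LRU a single bit, here is a minimal sketch (my own naming, not from the slides): each set records only which way was touched most recently, and the victim is simply the other way.

    #include <stdint.h>
    #include <stdlib.h>

    /* Per-set LRU state for a 2-way set-associative cache:
       one bit per set, recording the most recently used way. */
    typedef struct {
        uint8_t mru_way;   /* 0 or 1 */
    } set_lru_t;

    /* Update on every access that hits (or fills) 'way'. */
    static void lru_touch(set_lru_t *s, int way) {
        s->mru_way = (uint8_t)way;
    }

    /* Victim is the way that is NOT most recently used. */
    static int lru_victim(const set_lru_t *s) {
        return s->mru_way ^ 1;
    }

    /* For a > 2, a common compromise is pseudo-random replacement: */
    static int random_victim(int assoc) {
        return rand() % assoc;
    }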

Write Policy
- Memory hierarchy: 2 or more copies of the same block exist (caches, main memory, and/or disk)
- What to do on a write?
- Eventually, all copies must be changed: the write must propagate to all levels

Write Policy
- Easiest policy: write-through
- Every write propagates directly through the hierarchy: write in L1, L2, memory, disk (?!?)
- Why is this a bad idea?
  - Very high bandwidth requirement
  - Remember, large memories are slow
- Popular in real systems only to the L2: every write updates L1 and L2; beyond L2, use a write-back policy

Write Policy
- Most widely used: write-back
- Maintain the state of each line in the cache:
  - Invalid: not present in the cache
  - Clean: present, but not written (unmodified)
  - Dirty: present and written (modified)
- Store the state in the tag array, next to the address tag
- Mark the dirty bit on a write
- On eviction, check the dirty bit; if set, write the dirty line back to the next level
- Called a writeback or castout
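
A minimal sketch of the write-back bookkeeping described above; this is my own illustration, and write_back_to_next_level stands in for whatever the next level's write interface actually is:

    #include <stdint.h>

    typedef enum { INVALID, CLEAN, DIRTY } line_state_t;

    typedef struct {
        uint32_t     tag;
        line_state_t state;    /* stored in the tag array, next to the tag */
        uint8_t      data[64];
    } cache_line_t;

    /* Hypothetical next-level interface, assumed for illustration. */
    void write_back_to_next_level(uint32_t tag, const uint8_t *data);

    /* A store hit just marks the line dirty; no traffic to lower levels. */
    void on_store_hit(cache_line_t *line) {
        line->state = DIRTY;
    }

    /* On eviction, only dirty lines generate a writeback (castout). */
    void evict(cache_line_t *line) {
        if (line->state == DIRTY)
            write_back_to_next_level(line->tag, line->data);
        line->state = INVALID;
    }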

Write Policy
- Complications of the write-back policy: stale copies lower in the hierarchy
  - Must always check the higher level for dirty copies before accessing a copy in a lower level
- Not a big problem in uniprocessors; in multiprocessors, this is the cache coherence problem
- I/O devices that use DMA (direct memory access) can cause problems even in uniprocessors
  - Called coherent I/O: must check caches for dirty copies before reading main memory

Cache Example
- 32B cache: <BS=4, S=4, B=8>; o=2, i=2, t=2; 2-way set-associative
- Initially empty; only the tag array is shown (Tag0, Tag1, and an LRU bit per set)
- Trace execution of:

Reference  | Binary | Set/Way   | Hit/Miss
Load 0x2A  | 101010 | 2/0       | Miss
Load 0x2B  | 101011 | 2/0       | Hit
Load 0x3C  | 111100 | 3/0       | Miss
Load 0x20  | 100000 | 0/0       | Miss
Load 0x33  | 110011 | 0/1       | Miss
Load 0x11  | 010001 | 0/0 (LRU) | Miss/Evict
Store 0x29 | 101001 | 2/0       | Hit/Dirty

Final tag array state (d = dirty bit set):

Set | Tag0   | Tag1 | LRU
0   | 01     | 11   | 1
1   | -      | -    | -
2   | 10 (d) | -    | 1
3   | 11     | -    | 1
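
The whole trace can be checked mechanically. Below is a small C sketch of my own (not course code) modeling just the tag array of this <BS=4, S=4, B=8> cache with a 1-bit LRU per set; running it reproduces the Set/Way and Hit/Miss columns above, row for row:

    #include <stdio.h>
    #include <stdbool.h>

    #define SETS 4
    #define WAYS 2
    /* <BS=4, S=4, B=8>: o=2 offset bits, i=2 index bits, t=2 tag bits */

    typedef struct {
        unsigned tag;
        bool     valid, dirty;
    } line_t;

    static line_t cache[SETS][WAYS];
    static int    lru[SETS];   /* way to victimize next; 0 initially */

    static void access_cache(unsigned addr, bool is_store) {
        unsigned set = (addr >> 2) & (SETS - 1);
        unsigned tag = addr >> 4;
        for (int w = 0; w < WAYS; w++) {
            if (cache[set][w].valid && cache[set][w].tag == tag) {
                if (is_store) cache[set][w].dirty = true;
                lru[set] = w ^ 1;   /* the other way becomes LRU */
                printf("0x%02X -> set %u way %d %s\n", addr, set, w,
                       is_store ? "Hit/Dirty" : "Hit");
                return;
            }
        }
        int w = lru[set];           /* victim way */
        bool evict = cache[set][w].valid;
        cache[set][w].tag   = tag;
        cache[set][w].valid = true;
        cache[set][w].dirty = is_store;
        lru[set] = w ^ 1;
        printf("0x%02X -> set %u way %d %s\n", addr, set, w,
               evict ? "Miss/Evict" : "Miss");
    }

    int main(void) {
        access_cache(0x2A, false);  access_cache(0x2B, false);
        access_cache(0x3C, false);  access_cache(0x20, false);
        access_cache(0x33, false);  access_cache(0x11, false);
        access_cache(0x29, true);   /* the store */
        return 0;
    }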

Caches and Pipelining
- Instruction cache: no writes, so simpler
- Interface to pipeline: fetch address (from PC), supply instruction (to IR)
- What happens on a miss?
  - Stall pipeline; inject NOPs
  - Initiate cache fill from memory
  - Supply requested instruction; end stall condition

I-Caches and Pipelining
[Figure: PC indexes the tag and data arrays; a tag compare (?=) produces Hit/Miss; on a miss the Fill FSM accesses memory, and a mux injects "NOP" into the IR]
Fill FSM:
- Fetch from memory, critical word first
- Save in fill buffer
- Write data array
- Write tag array
- Miss condition ends
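
Rendered as software, the fill FSM is a small state machine. This is my own C sketch of the steps listed above; the mem_* and *_array_write interfaces are hypothetical stand-ins for the real hardware ports:

    #include <stdbool.h>
    #include <stdint.h>

    typedef enum { FSM_IDLE, FSM_FETCH, FSM_WRITE_DATA, FSM_WRITE_TAG } fill_state_t;

    /* Hypothetical hardware interfaces, assumed for illustration. */
    bool     mem_word_ready(void);
    uint32_t mem_read_word(void);          /* critical word arrives first */
    void     data_array_write(const uint32_t *buf);
    void     tag_array_write(uint32_t tag);

    #define WORDS_PER_BLOCK 8

    static fill_state_t state = FSM_IDLE;
    static uint32_t     fill_buffer[WORDS_PER_BLOCK];
    static int          words_received;

    /* Called once per cycle; returns true when the miss condition ends. */
    bool fill_fsm_step(bool miss, uint32_t miss_tag) {
        switch (state) {
        case FSM_IDLE:
            if (miss) { words_received = 0; state = FSM_FETCH; }
            return false;
        case FSM_FETCH:                     /* save arriving words in fill buffer */
            if (mem_word_ready())
                fill_buffer[words_received++] = mem_read_word();
            if (words_received == WORDS_PER_BLOCK) state = FSM_WRITE_DATA;
            return false;
        case FSM_WRITE_DATA:
            data_array_write(fill_buffer);
            state = FSM_WRITE_TAG;
            return false;
        case FSM_WRITE_TAG:
            tag_array_write(miss_tag);
            state = FSM_IDLE;
            return true;                    /* miss condition ends */
        }
        return false;
    }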

D-Caches and Pipelining
- Pipelining loads from cache: Hit/Miss signal from the cache
- Stall the pipeline or inject NOPs? Hard to do in current real designs, since wires are too slow for global stall signals
- Instead, treat a miss more like a branch misprediction:
  - Cancel/flush the pipeline
  - Restart when the cache fill logic is done

D-Caches and Pipelining
- Stores are more difficult
- MEM stage: perform the tag check, and only enable the write on a hit
  - On a miss, must not write (data corruption)
- Problem: the tag check and data array access must happen sequentially
  - This hurts cycle time or forces an extra pipeline stage
  - An extra pipeline stage delays loads as well: IPC hit!

Pipelining Writes (1)
- Simplest option: write-through, no-write-allocate, invalidate on write miss
  - The write proceeds in parallel with the tag check
  - A write miss invalidates the block that was corrupted
- Performance implications:
  - Write misses don't fetch the block; read-following-write is a common pattern, so the read will now incur a miss
  - Only works for direct-mapped caches (we don't know which block to write until the tag check completes)
- Not a great solution, but OK for the 552 project

Pipelining Writes (2)
- Solution: store buffer (SB)
- Store #1 performs only a tag check in the MEM stage
  - If hit, <value, address, cache way> is placed in the SB
  - If miss, stall and start the fill
- When store #2 reaches the MEM stage, store #1 performs the actual data cache write:

  St#1: MEM  WB   W$
  St#2:      MEM  WB  ...

- In the meantime, must handle RAW hazards through the SB:
  - With a pending write to address A in the SB, newer loads must check the SB for a conflict
  - Either stall/flush the SB, or forward directly from the SB
- Any load miss must also flush the SB first; otherwise the SB's D$ write may go to the wrong line
- Can expand to more than one entry to overlap store misses
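
Here is a single-entry software sketch (my own; real hardware does these checks with parallel comparators) of the store-buffer interactions described above: load lookup with forwarding, and the flush required before servicing a load miss.

    #include <stdint.h>
    #include <stdbool.h>

    /* Single-entry store buffer: <value, address, cache way> from a
       store that has passed its tag check but not yet written the D$. */
    typedef struct {
        bool     valid;
        uint32_t addr;
        uint32_t value;
        int      way;
    } store_buffer_t;

    static store_buffer_t sb;

    /* A newer load must check the SB for a RAW conflict.
       Returns true and forwards the value if the addresses match. */
    bool load_check_sb(uint32_t addr, uint32_t *value_out) {
        if (sb.valid && sb.addr == addr) {
            *value_out = sb.value;   /* forward directly from the SB */
            return true;
        }
        return false;                /* fall through to the D$ access */
    }

    /* Hypothetical D$ write port, assumed for illustration. */
    void dcache_write(int way, uint32_t addr, uint32_t value);

    /* Drain the SB before servicing any load miss, so the pending
       write cannot land on a line the fill is about to replace. */
    void flush_sb(void) {
        if (sb.valid) {
            dcache_write(sb.way, sb.addr, sb.value);
            sb.valid = false;
        }
    }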

Summary
- Cache design: placement, identification, replacement, write policy
- Pipeline integration of caches