ECE/CS 552: Cache Memory
Instructor: Mikko H. Lipasti, Fall 2010, University of Wisconsin-Madison
Lecture notes based on a set created by Mark Hill; updated by Mikko Lipasti

Big Picture
Memory
–Just an "ocean of bits"
–Many technologies are available
Key issues
–Technology (how bits are stored)
–Placement (where bits are stored)
–Identification (finding the right bits)
–Replacement (finding space for new bits)
–Write policy (propagating changes to bits)
Must answer these regardless of memory type

Types of Memory
Type            Size          Speed     Cost/bit
Register        < 1KB         < 1ns     $$$$
On-chip SRAM    8KB – 6MB     < 10ns    $$$
Off-chip SRAM   1Mb – 16Mb    < 20ns    $$
DRAM            64MB – 1TB    < 100ns   $
Disk            40GB – 1PB    < 20ms    ~0

Technology - Registers

Technology – SRAM
–"Word" lines select a row
–"Bit" lines carry data in/out

Technology – Off-chip SRAM
Commodity parts
–1Mb – 16Mb
554 Lab SRAM
–AS7C4096: 512K x 8b (512KB, i.e. 4Mb)
–10-20ns access time
Note: the sense amp is an analog circuit
–High-gain amplifier

Technology – DRAM
Logically similar to SRAM
Commodity DRAM chips
–E.g. 1Gb
–Standardized address/data/control interfaces
Very dense
–1T per cell (bit)
–Data stored in a capacitor – decays over time
  Must rewrite on read, and refresh periodically
Density improving vastly over time; latency barely improving

Memory Timing – Read
Latch-based SRAM or asynchronous DRAM (FPM/EDO)
–Multiple chips/banks share the address bus and a tristate data bus
–Enables are decoded from the address to select a bank
  E.g. bbbbbbb0 is bank 0, bbbbbbb1 is bank 1
Timing constraints: straightforward
–t_EN: setup time from Addr stable to EN active (often zero)
–t_D: delay from EN to valid data (10ns typical for SRAM)
–t_O: delay from EN disable to data tristate off (nonzero)

Memory Timing – Write
WR & EN together trigger a write of Data to Addr
Timing constraints: not so easy
–t_S: setup time from Data & Addr stable to the WR pulse
–t_P: minimum write pulse duration
–t_H: hold time for data/addr beyond the end of the write pulse
Challenge: WR pulse must start late, end early
–> t_S after Addr/Data, > t_H before end of cycle
–Requires multicycle control or a glitch-free clock divider

Technology – Disk
Covered in more detail later (input/output)
Bits stored as magnetic charge
Still mechanical!
–Disk rotates
–Head seeks to a track, then waits for the sector to rotate under it
–Solid-state replacements in the works: MRAM, etc.
Glacially slow compared to DRAM (10-20ms)
Density improvements astounding (100%/year)

Memory Hierarchy
Registers → On-chip SRAM → Off-chip SRAM → DRAM → Disk
–Capacity grows toward the bottom of the hierarchy; speed and cost per bit grow toward the top

Why Memory Hierarchy?
Need lots of bandwidth
Need lots of storage
–64MB (minimum) to multiple TB
Must be cheap per bit
–(TB x anything) is a lot of money!
These requirements seem incompatible

Fast and small memories
–Enable quick access (fast cycle time)
–Enable lots of bandwidth (1+ load/store/I-fetch per cycle)
Slower, larger memories
–Capture a larger share of memory
–Still relatively fast
Slow, huge memories
–Hold rarely-needed state
–Needed for correctness
All together: provide the appearance of a large, fast memory at the cost of cheap, slow memory

Why Does a Hierarchy Work?
Locality of reference
–Temporal locality: reference the same memory location repeatedly
–Spatial locality: reference near neighbors around the same time
Empirically observed
–Significant!
–Even small local storage (8KB) often satisfies >90% of references to a multi-MB data set
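As an illustration (my own sketch, not from the slides), a simple array-sum loop exhibits both kinds of locality: the running total is reused every iteration (temporal), and consecutive array elements share cache blocks (spatial), so with 64B blocks only about 1 in 16 of the 4-byte accesses has to go to the next level.

```c
#include <stdio.h>

/* Illustrative only: sums a 1 MB array.
 * - sum and i are reused every iteration        -> temporal locality
 * - a[i], a[i+1], ... are adjacent in memory    -> spatial locality */
#define N (1 << 18)              /* 256K ints = 1 MB */
static int a[N];

int main(void) {
    long sum = 0;
    for (int i = 0; i < N; i++)
        sum += a[i];             /* sequential accesses reuse each cached block */
    printf("sum = %ld\n", sum);
    return 0;
}
```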

Why Locality?
Analogy:
–Library (disk)
–Bookshelf (main memory)
–Stack of books on desk (off-chip cache)
–Opened book on desk (on-chip cache)
Likelihood of:
–Referring to the same book or chapter again? Probability decays over time; the book moves to the bottom of the stack, then to the bookshelf, then back to the library
–Referring to chapter n+1 if looking at chapter n?

Memory Hierarchy
CPU → I & D L1 caches → Shared L2 cache → Main memory → Disk
Temporal locality
–Keep recently referenced items at higher levels
–Future references satisfied quickly
Spatial locality
–Bring neighbors of recently referenced items to higher levels
–Future references satisfied quickly

Four Burning Questions
These are:
–Placement: where can a block of memory go?
–Identification: how do I find a block of memory?
–Replacement: how do I make space for new blocks?
–Write policy: how do I propagate changes?
Consider these for caches
–Usually SRAM
–Will consider main memory and disks later

Placement
Memory type     Placement                  Comments
Registers       Anywhere; Int, FP, SPR     Compiler/programmer manages
Cache (SRAM)    Fixed in H/W (HUH?)        Direct-mapped, set-associative, fully-associative
DRAM            Anywhere                   O/S manages
Disk            Anywhere                   O/S manages

Placement
Address range exceeds cache capacity
–Map the address to the finite capacity
–Called a hash; usually just masks high-order bits
Direct-mapped
–Block can only exist in one location
–Hash collisions cause problems
(Figure: 32-bit address split into tag, index, and offset; the index selects one block in the SRAM cache, the offset selects a word within the block)
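A minimal sketch (mine, not from the slides) of how a direct-mapped cache decomposes a 32-bit address, assuming 64B blocks and 64 sets (o = 6, i = 6, t = 20):

```c
#include <stdint.h>
#include <stdio.h>

/* Assumed geometry for illustration: 64 B blocks, 64 sets (4 KB direct-mapped). */
#define OFFSET_BITS 6
#define INDEX_BITS  6

static uint32_t offset_of(uint32_t addr) { return addr & ((1u << OFFSET_BITS) - 1); }
static uint32_t index_of(uint32_t addr)  { return (addr >> OFFSET_BITS) & ((1u << INDEX_BITS) - 1); }
static uint32_t tag_of(uint32_t addr)    { return addr >> (OFFSET_BITS + INDEX_BITS); }

int main(void) {
    uint32_t addr = 0x12345678;
    printf("tag=0x%x index=%u offset=%u\n", tag_of(addr), index_of(addr), offset_of(addr));
    return 0;
}
```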

Placement
Fully-associative
–Block can exist anywhere
–No more hash collisions
Identification
–How do I know I have the right block? This is called a tag check
  Must store address tags
  Compare against the address
–Expensive! One tag & comparator per block
(Figure: 32-bit address split into tag and offset; all stored tags are compared in parallel to produce Hit)

Placement
Set-associative
–Block can be in any of a locations (the a ways of its set)
–Hash collisions: up to a blocks per set are still OK
Identification
–Still perform a tag check
–However, only a comparisons in parallel
(Figure: 32-bit address split into tag, index, and offset; the index selects a set of a tags and a data blocks)
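A rough sketch (assumed structure; 4 ways and 16 sets chosen arbitrarily) of set-associative identification: given the set index and tag extracted from the address, compare against the tag stored in each way of that set.

```c
#include <stdbool.h>
#include <stdint.h>

#define WAYS 4
#define SETS 16

struct line { bool valid; uint32_t tag; };
static struct line cache[SETS][WAYS];

/* Returns true on a hit and reports which way matched. */
bool lookup(uint32_t set, uint32_t tag, int *way_out) {
    for (int w = 0; w < WAYS; w++) {          /* hardware performs these compares in parallel */
        if (cache[set][w].valid && cache[set][w].tag == tag) {
            *way_out = w;
            return true;
        }
    }
    return false;                             /* miss: no way in the set holds this tag */
}
```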

Placement and Identification
32-bit address split into Tag | Index | Offset:

Portion   Length                       Purpose
Offset    o = log2(block size)         Select word within block
Index     i = log2(number of sets)     Select set of blocks
Tag       t = 32 - o - i               ID block within set

Consider a 4KB cache with 64B blocks (sizes implied by the bit widths below):
–o=6, i=6, t=20: direct-mapped (S = B)
–o=6, i=4, t=22: 4-way set-associative (S = B / 4)
–o=6, i=0, t=26: fully associative (S = 1)
Total size = BS x B = BS x S x (B/S), where BS = block size, B = number of blocks, S = number of sets
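A small helper (a sketch, assuming 32-bit addresses) that derives the offset, index, and tag widths from cache size, block size, and associativity; it reproduces the three configurations above.

```c
#include <stdio.h>

static int log2i(unsigned x) { int n = 0; while (x > 1) { x >>= 1; n++; } return n; }

/* Derives the address-split widths for a cache with the given geometry. */
static void split_bits(unsigned cache_bytes, unsigned block_bytes, unsigned assoc) {
    unsigned blocks = cache_bytes / block_bytes;
    unsigned sets   = blocks / assoc;
    int o = log2i(block_bytes);
    int i = log2i(sets);
    int t = 32 - o - i;
    printf("o=%d, i=%d, t=%d\n", o, i, t);
}

int main(void) {
    split_bits(4096, 64, 1);    /* direct-mapped:     o=6, i=6, t=20 */
    split_bits(4096, 64, 4);    /* 4-way set-assoc:   o=6, i=4, t=22 */
    split_bits(4096, 64, 64);   /* fully associative: o=6, i=0, t=26 */
    return 0;
}
```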

Replacement
Cache has finite size
–What do we do when it is full?
Analogy: desktop full?
–Move books to the bookshelf to make room
Same idea:
–Move blocks to the next level of the hierarchy

Replacement
How do we choose a victim?
–Verbs: victimize, evict, replace, cast out
Several policies are possible
–FIFO (first-in-first-out)
–LRU (least recently used)
–NMRU (not most recently used)
–Pseudo-random (yes, really!)
Pick the victim within a set, where a = associativity
–If a <= 2, LRU is cheap and easy (1 bit)
–If a > 2, it gets harder
–Pseudo-random works pretty well for caches
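A sketch (hypothetical helper names) of victim selection: for a 2-way set, one LRU bit per set is enough; for higher associativity, a pseudo-random pick is a common cheap fallback.

```c
#include <stdint.h>
#include <stdlib.h>

#define SETS 16

/* One LRU bit per 2-way set: it names the way that was NOT used most recently. */
static uint8_t lru_bit[SETS];

int  victim_2way_lru(int set)      { return lru_bit[set]; }
void touch_2way(int set, int way)  { lru_bit[set] = way ^ 1; }  /* the other way is now LRU */

/* For a > 2, pseudo-random selection works surprisingly well in practice. */
int victim_random(int assoc)       { return rand() % assoc; }
```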

Write Policy
Memory hierarchy
–2 or more copies of the same block may exist (in caches, main memory, and/or disk)
What to do on a write?
–Eventually, all copies must be changed
–The write must propagate to all levels

Write Policy
Easiest policy: write-through
–Every write propagates directly through the hierarchy: write in L1, L2, memory, disk (?!?)
Why is this a bad idea?
–Very high bandwidth requirement
–Remember, large memories are slow
Popular in real systems only up to the L2
–Every write updates L1 and L2
–Beyond L2, use a write-back policy

Write Policy
Most widely used: write-back
Maintain the state of each line in the cache
–Invalid – not present in the cache
–Clean – present, but not written (unmodified)
–Dirty – present and written (modified)
Store the state in the tag array, next to the address tag
–Mark the dirty bit on a write
On eviction, check the dirty bit
–If set, write the dirty line back to the next level
–Called a writeback or castout
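A minimal sketch (assumed structures; `write_next_level` is a hypothetical hook) of write-back bookkeeping: a write hit just marks the line dirty, and eviction writes the line back only if the dirty bit is set.

```c
#include <stdbool.h>
#include <stdint.h>

enum state { INVALID, CLEAN, DIRTY };

struct line {
    enum state st;
    uint32_t   tag;
    uint8_t    data[64];
};

/* Hypothetical hook: propagate a block to the next level of the hierarchy. */
void write_next_level(uint32_t tag, const uint8_t *data);

void write_hit(struct line *l) { l->st = DIRTY; }    /* mark dirty; don't propagate yet */

void evict(struct line *l) {
    if (l->st == DIRTY)                              /* castout: only dirty lines go back */
        write_next_level(l->tag, l->data);
    l->st = INVALID;
}
```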

Write Policy
Complications of the write-back policy
–Stale copies exist lower in the hierarchy
–Must always check the higher level for dirty copies before accessing a copy in a lower level
Not a big problem in uniprocessors
–In multiprocessors: the cache coherence problem
I/O devices that use DMA (direct memory access) can cause problems even in uniprocessors
–Called coherent I/O
–Must check caches for dirty copies before reading main memory

Cache Example
32B cache: o=2, i=2, t=2; 2-way set-associative
–Initially empty
–Only the tag array (two tags plus an LRU bit per set) is shown on the slides
–The trace below was built up one reference at a time
Trace execution of:

Reference     Binary    Set/Way       Hit/Miss
Load 0x2A     101010    2/0           Miss
Load 0x2B     101011    2/0           Hit
Load 0x3C     111100    3/0           Miss
Load 0x…                …/0           Miss
Load 0x…                …/1           Miss
Load 0x…                …/0 (lru)     Miss/Evict
Store 0x…               …/0           Hit/Dirty

The final store hits and marks its line dirty in the tag array.
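To make the example concrete, here is a tiny simulator (my own sketch) of the same 32B, 2-way cache (o=2, i=2, t=2) with one LRU bit per set; it replays the first few references from the trace above.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define SETS 4
#define WAYS 2

struct line { bool valid, dirty; uint32_t tag; };
static struct line cache[SETS][WAYS];
static int lru[SETS];                      /* least-recently-used way in each set */

static void access(uint32_t addr, bool is_store) {
    uint32_t set = (addr >> 2) & 0x3;      /* i = 2 index bits, after o = 2 offset bits */
    uint32_t tag = addr >> 4;              /* remaining high-order bits */
    for (int w = 0; w < WAYS; w++) {
        if (cache[set][w].valid && cache[set][w].tag == tag) {
            cache[set][w].dirty |= is_store;
            lru[set] = w ^ 1;
            printf("0x%02X set %u way %d: hit%s\n", addr, set, w, is_store ? "/dirty" : "");
            return;
        }
    }
    int victim = cache[set][0].valid ? (cache[set][1].valid ? lru[set] : 1) : 0;
    printf("0x%02X set %u way %d: miss%s\n", addr, set, victim,
           cache[set][victim].valid ? "/evict" : "");
    cache[set][victim] = (struct line){ true, is_store, tag };
    lru[set] = victim ^ 1;
}

int main(void) {
    access(0x2A, false);   /* miss, fills set 2 way 0 */
    access(0x2B, false);   /* hit, same block */
    access(0x3C, false);   /* miss, fills set 3 way 0 */
    return 0;
}
```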

Caches and Pipelining
Instruction cache
–No writes, so simpler
Interface to the pipeline:
–Fetch address (from PC)
–Supply instruction (to IR)
What happens on a miss?
–Stall the pipeline; inject NOPs
–Initiate a cache fill from memory
–Supply the requested instruction, end the stall condition

I-Caches and Pipelining
(Figure: PC indexes the tag and data arrays; a tag compare produces Hit/Miss; on a miss, the Fill FSM fetches from memory while a NOP is injected into the IR)
Fill FSM:
1. Fetch from memory (critical word first; save in the fill buffer)
2. Write the data array
3. Write the tag array
4. Miss condition ends
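A sketch (state names are mine) of the fill FSM as a simple next-state function; the real control logic is hardware, but the sequencing is the same.

```c
enum fill_state { IDLE, FETCH, WRITE_DATA, WRITE_TAG };

/* One step of the fill FSM; returns the next state.
 * mem_done: memory has delivered the block (critical word first) into the fill buffer. */
enum fill_state fill_fsm_step(enum fill_state s, int miss, int mem_done) {
    switch (s) {
    case IDLE:       return miss ? FETCH : IDLE;        /* start a fill only on a miss */
    case FETCH:      return mem_done ? WRITE_DATA : FETCH;
    case WRITE_DATA: return WRITE_TAG;                  /* copy fill buffer into the data array */
    case WRITE_TAG:  return IDLE;                       /* tag write ends the miss condition */
    }
    return IDLE;
}
```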

D-Caches and Pipelining
Pipelining loads from the cache
–Hit/miss signal comes from the cache
–Stall the pipeline or inject NOPs?
  Hard to do in current real designs, since wires are too slow for global stall signals
–Instead, treat a miss more like a branch misprediction
  Cancel/flush the pipeline
  Restart when the cache fill logic is done

D-Caches and Pipelining
Stores are more difficult
–MEM stage: perform the tag check
  Only enable the write on a hit
  On a miss, must not write (data corruption)
–Problem: must do the tag check and the data array access sequentially
  This hurts cycle time or forces an extra pipeline stage
  An extra pipeline stage delays loads as well: IPC hit!

Solution: Pipelining Writes
Store #1 performs only the tag check in the MEM stage
–If it hits, it is placed in the store buffer (SB); if it misses, stall and start the fill
When store #2 reaches the MEM stage
–Store #1 performs its D$ write
In the meantime, must handle RAW hazards through the store buffer
–A pending write in the SB to address A
–Newer loads must check the SB for a conflict
–Stall/flush the SB, or forward directly from the SB
Any load miss must also flush the SB first
–Otherwise the SB → D$ write may go to the wrong line
Can expand to >1 entry to overlap store misses
(Figure: store #1 does its tag check in MEM and writes the D$ while store #2 occupies MEM)
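A rough sketch (single-entry store buffer; `dcache_read`/`dcache_write` are hypothetical hooks) of the load-side check: a newer load must either forward from the SB or drain it before it can safely read the cache.

```c
#include <stdbool.h>
#include <stdint.h>

struct store_buffer {
    bool     valid;
    uint32_t addr;       /* address of the pending (already tag-checked) store */
    uint32_t data;
};

static struct store_buffer sb;

/* Hypothetical hooks into the data cache. */
uint32_t dcache_read(uint32_t addr);
void     dcache_write(uint32_t addr, uint32_t data);

void drain_sb(void) {                      /* run when the D$ write port is free,
                                              and forced before servicing any load miss */
    if (sb.valid) {
        dcache_write(sb.addr, sb.data);
        sb.valid = false;
    }
}

uint32_t load(uint32_t addr) {
    if (sb.valid && sb.addr == addr)       /* RAW hazard: forward directly from the SB */
        return sb.data;
    return dcache_read(addr);              /* on a miss, drain_sb() must run before the fill */
}
```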

Summary
Memory technology
Memory hierarchy
–Temporal and spatial locality
Caches
–Placement
–Identification
–Replacement
–Write policy
Pipeline integration of caches