15-213 Caches
Oct. 22, 1998
Topics: Memory Hierarchy, Locality of Reference, Cache Design (Direct Mapped, Associative)
class18.ppt

Computer System
[Block diagram: the processor and its cache connect to memory over the memory-I/O bus; I/O controllers attach disks, the display, and the network, and signal the processor via interrupts.]

Levels in Memory Hierarchy
CPU registers <-> cache <-> main memory <-> disk, with caching between registers/cache/memory and virtual memory between memory and disk. Block sizes moving down the hierarchy: 4 B, 8 B, 4 KB.

                Register    Cache          Memory       Disk Memory
  size:         200 B       32 KB - 4 MB   128 MB       20 GB
  speed:        3 ns        6 ns           100 ns       10 ms
  $/Mbyte:                  $100/MB        $1.50/MB     $0.06/MB
  block size:   4 B         8 B            4 KB

Moving down the hierarchy: larger, slower, cheaper.

Alpha 21164 Chip Photo
[Die photo from Microprocessor Report, 9/12/94. Caches visible on chip: L1 data, L1 instruction, unified L2, TLB, branch history table.]

Alpha 21164 Chip Caches
[Annotated die photo. Caches: L1 data, L1 instruction, unified L2 (left and right halves), L2 tags, L3 control, TLB, branch history.]

Locality of Reference
Principle of Locality
  Programs tend to reuse data and instructions near those they have used recently.
  Temporal locality: recently referenced items are likely to be referenced in the near future.
  Spatial locality: items with nearby addresses tend to be referenced close together in time.

    sum = 0;
    for (i = 0; i < n; i++)
        sum += a[i];
    *v = sum;

Locality in Example
  Data: array elements are referenced in succession (spatial).
  Instructions: instructions are referenced in sequence (spatial); the loop is cycled through repeatedly (temporal).
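As a further, hypothetical illustration of spatial locality (not from the original slides), the following sketch contrasts two traversal orders of a 2-D array. It assumes C's row-major layout: the row-wise loop touches consecutive addresses, the column-wise loop strides across rows.

    /* Illustrative sketch: two ways to sum a 2-D array.
       Row-wise traversal has good spatial locality; column-wise does not. */
    #define ROWS 1024
    #define COLS 1024

    double sum_rowwise(double a[ROWS][COLS]) {
        double sum = 0.0;
        for (int i = 0; i < ROWS; i++)
            for (int j = 0; j < COLS; j++)
                sum += a[i][j];        /* consecutive addresses */
        return sum;
    }

    double sum_colwise(double a[ROWS][COLS]) {
        double sum = 0.0;
        for (int j = 0; j < COLS; j++)
            for (int i = 0; i < ROWS; i++)
                sum += a[i][j];        /* stride of COLS * sizeof(double) bytes */
        return sum;
    }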

Caching: The Basic Idea
Main Memory: stores words A-Z in the example.
Cache: stores a subset of the words (4 in the example); organized in blocks of multiple words, to exploit spatial locality.
Access: a word must be in the cache for the processor to access it.
[Diagram: Processor <-> small, fast cache (holding A, B, G, H) <-> big, slow memory (holding A through Z).]

Basic Idea (Cont.)
Example (cache holds 2 blocks, each with 2 words; initially contains A B and G H):
  Read C: "cache miss" - load block C+D into the cache.
  Read D: "cache hit" - word already in the cache.
  Read Z: "cache miss" - load block Y+Z into the cache, evicting the oldest entry.
Maintaining the Cache
  Every time the processor performs a load or store, bring the block containing the word into the cache; this may require evicting an existing block.
  Subsequent loads or stores to any word in that block are performed within the cache.

Accessing Data in Memory Hierarchy
Between any two levels, memory is divided into blocks. Data moves between levels on demand, in block-sized chunks.
Invisible to the application programmer: hardware is responsible for cache operation.
Upper-level blocks are a subset of lower-level blocks.
[Diagram: accessing word w in block a is a hit, since block a is already at the high level; accessing word v in block b is a miss, so block b must be brought up from the low level.]

Design Issues for Caches
Key Questions
  Where should a block be placed in the cache? (block placement)
  How is a block found in the cache? (block identification)
  Which block should be replaced on a miss? (block replacement)
  What happens on a write? (write strategy)
Constraints
  The design must be very simple: it is realized in hardware, and all decision making happens on a nanosecond time scale.
  We want to optimize performance for "typical" programs, so we do extensive benchmarking and simulations; there are many subtle engineering trade-offs.

Direct-Mapped Caches
Simplest Design: a given memory block has a unique cache location.
Parameters
  Block size B = 2^b: number of bytes in each block, typically 2x-8x the word size.
  Number of sets S = 2^s: number of blocks the cache can hold.
  Total cache size = B*S = 2^(b+s) bytes.
Physical Address
  The address used to reference main memory; n bits reference N = 2^n total bytes.
  Partition into fields:
    Offset: the lower b bits indicate which byte within the block.
    Set index: the next s bits indicate how to locate the block within the cache.
    Tag: the remaining t = n - s - b bits identify this block when it is in the cache.
  [n-bit physical address layout: | tag (t bits) | set index (s bits) | offset (b bits) |]
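As a quick illustration of this field partitioning, here is a minimal C sketch (not from the slides; the parameter values b = 4 and s = 10 and the example address are made up) that extracts the offset, set index, and tag from a 32-bit physical address:

    #include <stdint.h>
    #include <stdio.h>

    /* Hypothetical parameters: 16-byte blocks (b = 4), 1024 sets (s = 10). */
    enum { B_BITS = 4, S_BITS = 10 };

    int main(void) {
        uint32_t addr   = 0x12345678u;                             /* example address */
        uint32_t offset = addr & ((1u << B_BITS) - 1);             /* low b bits      */
        uint32_t set    = (addr >> B_BITS) & ((1u << S_BITS) - 1); /* next s bits     */
        uint32_t tag    = addr >> (B_BITS + S_BITS);               /* remaining bits  */
        printf("offset=%u set=%u tag=0x%x\n", offset, set, tag);
        return 0;
    }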

Indexing into Direct-Mapped Cache
Use the set index bits of the physical address (| tag | set index | offset |) to select one cache set.
[Diagram: sets 0 through S-1, each holding a valid bit, a tag, and bytes 0 ... B-1 of one block.]

Direct-Mapped Cache Tag Matching
Identifying the Block
  The stored tag must match the high-order (tag) bits of the address.
  The valid bit must be 1.
The lower (offset) bits of the address select the byte or word within the cache block.
[Diagram: the selected set's valid bit is checked (= 1?) and its tag is compared (=?) against the tag field of the physical address.]

Direct Mapped Cache Simulation
N = 16 byte addresses (4 bits), B = 2 bytes/block, S = 4 sets, E = 1 entry/set.
Address fields: t = 1 tag bit, s = 2 set index bits, b = 1 offset bit.
Address trace (reads): 0 [0000], 1 [0001], 13 [1101], 8 [1000], 0 [0000]

(1) 0 [0000]: miss - set 0 loads tag 0, data m[1] m[0]; the following read of 1 [0001] hits in the same block.
(2) 13 [1101]: miss - set 2 loads tag 1, data m[13] m[12].
(3) 8 [1000]: miss - set 0 is overwritten with tag 1, data m[9] m[8].
(4) 0 [0000]: miss - set 0 is overwritten again with tag 0, data m[1] m[0]; set 2 still holds m[13] m[12].
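The trace above can be replayed with a minimal C sketch (not part of the original slides; the structure names are invented for illustration) that models only the valid bit and tag of each set:

    #include <stdio.h>

    /* Direct-mapped cache: S = 4 sets, E = 1 entry/set, B = 2 bytes/block.
       4-bit addresses split as 1 tag bit, 2 set index bits, 1 offset bit. */
    #define S 4

    struct line { int valid; unsigned tag; };

    int main(void) {
        struct line cache[S] = {0};
        unsigned trace[] = {0, 1, 13, 8, 0};
        int n = sizeof trace / sizeof trace[0];

        for (int i = 0; i < n; i++) {
            unsigned addr = trace[i];
            unsigned set  = (addr >> 1) & 0x3;   /* bits [2:1] */
            unsigned tag  = addr >> 3;           /* bit  [3]   */
            if (cache[set].valid && cache[set].tag == tag) {
                printf("addr %2u: hit  (set %u)\n", addr, set);
            } else {
                printf("addr %2u: miss (set %u, load tag %u)\n", addr, set, tag);
                cache[set].valid = 1;
                cache[set].tag   = tag;
            }
        }
        return 0;
    }

Running it reproduces the slide's result: misses on 0, 13, 8, and the final 0, with a hit on 1.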

Why Use Middle Bits as Index?
[Diagram: a 4-block cache (indices 00-11) alongside 16 memory blocks (0000-1111), showing which cache block each memory block maps to under high-order versus middle-order bit indexing.]
High-Order Bit Indexing: adjacent memory blocks would map to the same cache block - poor use of spatial locality.
Middle-Order Bit Indexing: consecutive memory blocks map to different cache blocks, so a cache-sized contiguous region of the address space can be held in the cache at one time.
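The mapping in the diagram can be reproduced with a small C sketch (not from the slides) for the same 4-block cache and 16 memory blocks; it prints the cache index each memory block gets under the two indexing schemes:

    #include <stdio.h>

    /* 4-block cache, 16 memory blocks, matching the slide's example.
       High-order indexing uses the top two bits of the block number;
       middle-order (standard) indexing uses the two bits just above the offset,
       i.e. the low two bits of the block number. */
    int main(void) {
        for (unsigned block = 0; block < 16; block++) {
            unsigned high_idx = (block >> 2) & 0x3;
            unsigned mid_idx  = block & 0x3;
            printf("memory block %2u -> high-order index %u, middle-order index %u\n",
                   block, high_idx, mid_idx);
        }
        return 0;
    }

Consecutive blocks 0-3 all land in cache block 0 under high-order indexing, but in blocks 0, 1, 2, 3 under middle-order indexing.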

Direct Mapped Cache Implementation (DECStation 3100)
[Diagram: the 32-bit address is split into a 16-bit tag (bits 31-16), a 14-bit set index (bits 15-2), and a 2-bit byte offset (bits 1-0). The cache has 16,384 sets, each holding a valid bit, a 16-bit tag, and 32 bits of data; the tag comparator produces the hit signal and the data is forwarded on a hit.]

Properties of Direct Mapped Caches
Strengths
  Minimal control hardware overhead.
  Simple design.
  (Relatively) easy to make fast.
Weakness
  Vulnerable to thrashing: two heavily used blocks have the same cache index, and each is repeatedly evicted to make room for the other.

Vector Product Example

    float dot_prod(float x[1024], float y[1024])
    {
        float sum = 0.0;
        int i;
        for (i = 0; i < 1024; i++)
            sum += x[i] * y[i];
        return sum;
    }

Machine: DECStation 5000, MIPS processor with a 64 KB direct-mapped cache, 16 B block size.
Performance: good case 24 cycles/element; bad case 66 cycles/element.

Thrashing Example
[Diagram: arrays x[0..1023] and y[0..1023] laid out in memory, each divided into cache blocks of four floats.]
The loop accesses one element from each array per iteration.

Thrashing Example: Good Case
Access Sequence
  Read x[0]: miss - x[0], x[1], x[2], x[3] loaded.
  Read y[0]: miss - y[0], y[1], y[2], y[3] loaded.
  Read x[1]: hit.
  Read y[1]: hit.
  ... 2 misses / 8 reads.
Analysis
  x[i] and y[i] map to different cache blocks.
  Miss rate = 25%: two memory accesses per iteration, with two misses on every 4th iteration.
Timing
  10 cycle loop time, 28 cycles per cache miss.
  Average time / iteration = 10 + 0.25 * 2 * 28 = 24 cycles.

Thrashing Example: Bad Case
Access Pattern
  Read x[0]: miss - x[0], x[1], x[2], x[3] loaded.
  Read y[0]: miss - y[0], y[1], y[2], y[3] loaded, evicting the x block.
  Read x[1]: miss.
  Read y[1]: miss.
  ... 8 misses / 8 reads.
Analysis
  x[i] and y[i] map to the same cache blocks.
  Miss rate = 100%: two memory accesses per iteration, with two misses on every iteration.
Timing
  10 cycle loop time, 28 cycles per cache miss.
  Average time / iteration = 10 + 1.0 * 2 * 28 = 66 cycles.

Set Associative Cache
Mapping of Memory Blocks
  Each set can hold E blocks, typically between 2 and 8.
  A given memory block can map to any block in its set.
Eviction Policy
  Decides which block gets kicked out when a new block is brought in.
  Commonly "Least Recently Used" (LRU): the least-recently accessed (read or written) block gets evicted.
[Diagram: set i containing blocks 0 ... E-1, each with a valid bit, tag, and bytes 0 ... B-1, plus LRU state for the set.]
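To make the lookup and LRU eviction concrete, here is a minimal C sketch (not from the slides; the structure and field names are invented for illustration) of one E-way set that tracks recency with an age counter per way:

    #include <stdint.h>

    #define E 2   /* blocks per set (2-way in this sketch) */

    struct way { int valid; uint32_t tag; unsigned age; };  /* age 0 = most recent */
    struct set { struct way w[E]; };

    /* Look up 'tag' in one set; returns the way index used and sets *hit. */
    static int access_set(struct set *s, uint32_t tag, int *hit) {
        /* 1. Search for a matching, valid way (hit). */
        for (int i = 0; i < E; i++) {
            if (s->w[i].valid && s->w[i].tag == tag) {
                for (int j = 0; j < E; j++) s->w[j].age++;
                s->w[i].age = 0;               /* mark most recently used */
                *hit = 1;
                return i;
            }
        }
        /* 2. Miss: prefer an invalid way, otherwise evict the oldest (LRU). */
        int victim = 0;
        for (int i = 0; i < E; i++) {
            if (!s->w[i].valid) { victim = i; break; }
            if (s->w[i].age > s->w[victim].age) victim = i;
        }
        for (int j = 0; j < E; j++) s->w[j].age++;
        s->w[victim].valid = 1;
        s->w[victim].tag   = tag;
        s->w[victim].age   = 0;
        *hit = 0;
        return victim;
    }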

Indexing into 2-Way Associative Cache
Use the middle s bits of the physical address (| tag | set index | offset |) to select from among S = 2^s sets.
[Diagram: sets 0 through S-1, each holding two blocks, each block with a valid bit, tag, and bytes 0 ... B-1.]

2-Way Associative Cache Tag Matching
Identifying the Block
  One of the two tags in the selected set must match the high-order (tag) bits of the address.
  The valid bit must be 1 for that block.
The lower (offset) bits of the address select the byte or word within the cache block.
[Diagram: the selected set's two ways are compared in parallel against the tag field of the physical address, with their valid bits checked (= 1?).]

2-Way Set Associative Simulation
N = 16 addresses, B = 2 bytes/line, S = 2 sets, E = 2 entries/set.
Address fields: t = 2 tag bits, s = 1 set index bit, b = 1 offset bit.
Address trace (reads): 0 [0000], 1 [0001], 13 [1101], 8 [1000], 0 [0000]

  0 [0000]: miss - set 0, one way loads tag 00, data m[1] m[0]; the following read of 1 [0001] hits.
  13 [1101]: miss - set 0, the other way loads tag 11, data m[13] m[12].
  8 [1000]: miss (LRU replacement) - tag 00 is evicted; set 0 now holds tag 10 (m[9] m[8]) and tag 11 (m[13] m[12]).
  0 [0000]: miss (LRU replacement) - tag 11 is evicted; set 0 now holds tag 10 (m[9] m[8]) and tag 00 (m[1] m[0]).

Two-Way Set Associative Cache Implementation
  The set index selects a set from the cache.
  The two tags in the set are compared in parallel.
  Data is selected based on the tag comparison result.
This is called a 2-way set associative cache because there are two cache entries for each cache index; essentially, it is two direct-mapped caches working in parallel. The set index selects a set from the cache, and the two tags in that set are compared in parallel with the upper bits of the memory address. If neither tag matches the incoming address tag, we have a cache miss; otherwise we have a cache hit and select the data from the way where the tag match occurs. What are its disadvantages?
[Diagram: two banks of valid/tag/data arrays indexed by the set index; two tag comparators drive select signals Sel0 and Sel1 into a mux and an OR gate that produce the selected cache block and the hit signal.]

Fully Associative Cache
Mapping of Memory Blocks
  The cache consists of a single set holding E blocks.
  A given memory block can map to any block in the set.
  Only practical for small caches.
[Diagram: the entire cache is one set, blocks 0 ... E-1, each with a valid bit, tag, and bytes 0 ... B-1, plus LRU state.]

Fully Associative Cache Tag Matching
Identifying the Block
  Must check all of the tags for a match.
  The valid bit must be 1 for that block.
The lower (offset) bits of the address select the byte or word within the cache block.
[Diagram: the physical address has only tag and offset fields (no set index); the tag is compared in parallel against every block's tag, with valid bits checked (= 1?).]

Fully Associative Cache Simulation
N = 16 addresses, B = 2 bytes/line, S = 1 set, E = 4 entries/set.
Address fields: t = 3 tag bits, s = 0 set index bits, b = 1 offset bit.
Address trace (reads): 0 [0000], 1 [0001], 13 [1101], 8 [1000], 0 [0000]

(1) 0 [0000]: miss - tag 000, data m[1] m[0] loaded; the following read of 1 [0001] hits.
(2) 13 [1101]: miss - tag 110, data m[13] m[12] loaded.
(3) 8 [1000]: miss - tag 100, data m[9] m[8] loaded.
The final read of 0 [0000] hits: only three of the four entries are in use, so the block with tag 000 is still resident.

Write Policy
What happens when the processor writes to the cache? Should memory be updated as well?
Write Through
  A store by the processor updates both the cache and memory.
  Memory is always consistent with the cache; a block never needs to be written back from cache to memory.
  Viable since there are roughly 2x more loads than stores.
[Diagram: stores go from the processor through the cache to memory; loads are serviced from the cache.]

Write Strategies (Cont.)
Write Back
  A store by the processor only updates the cache block.
  The modified block is written to memory only when it is evicted.
  Requires a "dirty bit" for each block: set when the block is modified in the cache, indicating that the copy in memory is stale.
  Memory is not always consistent with the cache.
[Diagram: stores and loads are serviced by the cache; dirty blocks are written back to memory on eviction.]
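A minimal C sketch (not from the slides; the names, block size, and write-allocate choice are assumptions for illustration) of how a write-back cache line with a dirty bit might handle a store; set selection is omitted, so the function operates on a single, already-selected line:

    #include <stdint.h>

    struct cache_line {
        int      valid;
        int      dirty;        /* block modified since it was brought in? */
        uint32_t tag;
        uint8_t  data[16];     /* 16-byte block in this sketch */
    };

    /* Stubs standing in for real memory transfers; a real cache controller
       would issue bus transactions here. */
    static void mem_write_block(uint32_t addr, const uint8_t *d) { (void)addr; (void)d; }
    static void mem_read_block(uint32_t addr, uint8_t *d)        { (void)addr; (void)d; }

    /* Write-back, write-allocate store of one byte. */
    void store_byte(struct cache_line *line, uint32_t addr, uint8_t value) {
        uint32_t tag    = addr >> 4;      /* everything above the 4 offset bits */
        uint32_t offset = addr & 0xF;

        if (!line->valid || line->tag != tag) {          /* miss */
            if (line->valid && line->dirty)              /* evicted block is stale in memory */
                mem_write_block(line->tag << 4, line->data);
            mem_read_block(addr & ~0xFu, line->data);    /* write-allocate: fetch the block */
            line->valid = 1;
            line->tag   = tag;
            line->dirty = 0;
        }
        line->data[offset] = value;                      /* update only the cache */
        line->dirty = 1;                                 /* memory copy is now stale */
    }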

Multi-Level Caches
Can have separate Icache and Dcache, or a unified Icache/Dcache.
[Diagram: processor with registers and split L1 Icache/Dcache, backed by L2 caches, main memory (accessed via the TLB), and disk.]

                regs    L1 (I/D)   L2          Memory        Disk
  size:         200 B   8 KB       1 MB SRAM   128 MB DRAM   10 GB
  speed:        5 ns    5 ns       6 ns        70 ns         10 ms
  $/Mbyte:                         $200/MB     $1.50/MB      $0.06/MB
  block size:   4 B     16 B       32 B        4 KB

Moving down the hierarchy: larger, slower, cheaper; larger block size, higher associativity, more likely to write back.

Alpha 21164 Hierarchy
On the processor chip:
  Regs
  L1 Data: 1 cycle latency, 8 KB, direct mapped, write-through, dual ported, 32 B blocks
  L1 Instruction: 8 KB, direct mapped, 32 B blocks
  L2 Unified: 8 cycle latency, 96 KB, 3-way associative, write-back, write-allocate, 32 B/64 B blocks
Off chip:
  L3 Unified: 1 MB - 64 MB, direct mapped, write-back, write-allocate, 32 B or 64 B blocks
  Main Memory: up to 2 GB
Improving memory performance was a main design goal; earlier Alpha CPUs were starved for data.

Bandwidth Matching
Challenge
  The CPU works with short cycle times; DRAM has (relatively) long cycle times.
  How can we provide enough bandwidth between processor and memory?
Effect of Caching
  Caching greatly reduces the amount of traffic to main memory.
  But sometimes large amounts of data must still be moved from memory into the cache.
Trends
  The need for high bandwidth is much greater for multimedia applications (repeated operations on image data).
  Recent generation machines (e.g., Pentium II) greatly improve on their predecessors.
[Diagram: CPU to cache is a short-latency path; cache to memory M over the bus is a long-latency path.]

High Bandwidth Memory Systems
Solution 1: high-bandwidth DRAM. Examples: page mode DRAM, RAMbus.
Solution 2: a wide path between memory and cache. Example: Alpha AXP 21064, with a 256-bit wide bus, L2 cache, and memory.
[Diagram: two CPU-cache-bus-memory configurations; the second adds a mux for the wide memory path.]

Cache Performance Metrics
Miss Rate
  Fraction of memory references not found in the cache (misses / references).
  Typical numbers: 5-10% for L1, 1-2% for L2.
Hit Time
  Time to deliver a block in the cache to the processor (includes the time to determine whether the block is in the cache).
  Typical numbers: 1 clock cycle for L1, 3-8 clock cycles for L2.
Miss Penalty
  Additional time required because of a miss; typically 10-30 cycles for main memory.
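These metrics combine into an average memory access time (AMAT). The slides do not give this formula explicitly; the sketch below uses the standard two-level form, AMAT = L1 hit time + L1 miss rate * (L2 hit time + L2 miss rate * miss penalty), treating the L2 miss rate as local to L2, with sample numbers chosen from the typical ranges above:

    #include <stdio.h>

    int main(void) {
        double l1_hit  = 1.0;    /* cycles, typical L1 hit time above        */
        double l1_miss = 0.08;   /* 8%, within the 5-10% range above         */
        double l2_hit  = 5.0;    /* cycles, within the 3-8 cycle range above */
        double l2_miss = 0.02;   /* 2%, within the 1-2% range above          */
        double penalty = 20.0;   /* cycles to main memory, within 10-30      */

        double amat = l1_hit + l1_miss * (l2_hit + l2_miss * penalty);
        printf("AMAT = %.2f cycles\n", amat);   /* 1 + 0.08*(5 + 0.02*20) = 1.43 */
        return 0;
    }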