Published by Garey Lynch. Modified over 6 years ago.
1
CS61CL Machine Structures Lec 11 – Introduction to Cache Design
David Culler Electrical Engineering and Computer Sciences University of California, Berkeley CS252 S05
2
CS61CL Road Map
Software: foo.c (HLL Program) -> Compiler -> foo.s (Asm Lang. Pgm) -> Assembler -> foo.o -> Linker -> foo.exe (Machine Lang. pgm)
Instruction Set Architecture
Hardware: Machine Organization (I/O system, Instr. Set Proc., Datapath & Control) -> Digital Design -> Circuit Design -> Layout & fab -> Semiconductor Materials
10/14/09 CS61CL F09
3
Turning “Moore stuff” into performance
4
Performance Trends MIPS R3000 11/24/2018 cs61cl f09 lec 5
5
Recall: Performance
Speedup(E) = Performance(with E) / Performance(without E)
Performance is in units of things per sec: bigger is better
If we are primarily concerned with response time:
performance(x) = 1 / execution_time(x)
"X is n times faster than Y" means
n = Performance(X) / Performance(Y) = Execution_time(Y) / Execution_time(X)
6
Review: Pipelined Execution
[Figure: pipelined datapath showing PC, imem, IR, operands A and B, Dmem, and the pipeline registers IR_ex, IR_mem, IR_wb]
Speedup with N stages is ≤ N
Limited by dependences (aka hazards)
Structural hazard: two operations want to use the same resource at the same time
Data hazard: cannot use a value before it is produced
Control hazard: attempt to branch before the condition is determined
11/4/09 UCB CS61CL F09 Lec 10
7
The Problem: Memory Gap
[Figure: performance (log scale, 1 to 1000) vs. year, 1980 to 2000, for CPU and DRAM]
µProc: 60%/yr.
DRAM: 7%/yr.
Processor-Memory Performance Gap: grows 50% / year
Annotations: cache off-chip; 1989 first Intel CPU (80486) with cache on chip; 1995 first Intel CPU (Pentium Pro) with two levels of cache on chip
8
Recall: Where do Objects live and work?
[Figure: Processor (registers; load, operate, store) reading a word from Memory, addresses 000..0 through FFF..F; a read is either a read-hit or a read-miss]
9
Size of memory at each level
Storage Hierarchy
[Figure: levels ordered from Higher (nearest the processor) to Lower: Level 1, Level 2, Level 3, ..., Level n, e.g. Registers, Cache, Memory, Disk. Going down, distance from the processor increases, speed decreases, and the size of memory at each level grows.]
As we move to deeper levels the latency goes up and price per bit goes down.
10
Why Caches Work
Physics: large memories are slow, fast memories are small
Statistics: programs exhibit locality
Temporal locality: recently accessed locations are likely to be accessed again soon
Spatial locality: if a location is accessed, others nearby are likely to be accessed too
Use statistics to cheat the laws of physics
Illusion of a large fast memory: on average, access to a large memory can be fast
Keep recently accessed blocks in a small fast memory
Ave Mem Access Time = Hit Time + Pmiss * MissPenalty
11
Manual vs Automatic Management of the Storage Hierarchy
In everyday life? what books in backpack? desk? library? amazon? music collection? Registers? Files? Cache? 10/14/09 CS61CL F09
12
Cache: Transparent Memory Acceleration
Processor performs reads and writes on memory locations
inst. fetch, load, store
memory abstraction is unchanged!
Cache has a copy of a small portion of the memory
hit: present in cache => respond quickly
miss: absent in cache => obtain it from memory and respond
Unit of transfer: Block
several words of memory brought into a cache line
Where can it be placed?
How can we tell if it is there?
What happens to memory on write hit?
What happens to cache on write miss?
13
Direct-Mapped Cache
Each memory address is associated with one possible block within the cache
=> we only need to look in a single location in the cache to see whether the data is there
Block is the unit of transfer between cache and memory
14
Direct-Mapped Cache (B=1, S=4)
4 Byte Direct Mapped Cache
Block size = 1 byte, 4 blocks
[Figure: memory addresses 0 through F beside a 4-line cache with Cache Index 0 to 3]
Cache Line 0 can be occupied by data from memory location 0, 4, 8, ...: any memory location that is a multiple of 4
15
Direct-Mapped Cache (B=2, S=4)
8 Byte Direct Mapped Cache
Block size = 2 bytes
[Figure: memory addresses 0 through 1E grouped in 2-byte blocks, beside a 4-line cache with Cache Index 0 to 3]
Let’s look at the simplest cache one can build: a direct-mapped cache that has only 4 bytes. In this cache, location 0 can be occupied by data from memory location 0, 4, 8, C, and so on, while location 1 can be occupied by data from memory location 1, 5, 9, etc. So in general, the cache location that a memory location maps to is uniquely determined by the 2 least significant bits of the address (the Cache Index). For example, any memory location whose two least significant address bits are 0s can go to cache location zero. With so many memory locations to choose from, which one should we place in the cache? The one we have read or written most recently, because by the principle of temporal locality, the one we just touched is most likely to be the one we will need again soon. Of all the possible memory locations that can be placed in cache location 0, how can we tell which one is in the cache?
How is the block located?
How is the byte in block selected?
e.g., Mem address 11101?
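The closing question (where does address 11101 go when B=2 and S=4?) can be worked through with a small sketch. The function name `split_address` and its parameters are illustrative and not from the slides; it assumes the block size and the number of lines are both powers of two.

```python
def split_address(addr, block_bytes, num_lines):
    """Split a byte address into (tag, index, offset) for a direct-mapped cache.

    Assumes block_bytes and num_lines are powers of two.
    """
    offset_bits = block_bytes.bit_length() - 1   # log2(block_bytes)
    index_bits = num_lines.bit_length() - 1      # log2(num_lines)
    offset = addr & (block_bytes - 1)            # low bits pick the byte in the block
    index = (addr >> offset_bits) & (num_lines - 1)   # next bits pick the cache line
    tag = addr >> (offset_bits + index_bits)     # everything left over is the tag
    return tag, index, offset


# B = 2 bytes per block, S = 4 lines, address 11101 in binary
tag, index, offset = split_address(0b11101, block_bytes=2, num_lines=4)
print(tag, index, offset)   # 3 2 1, i.e. 11 | 10 | 1
```

Reading the address from the right: one bit selects the byte within the 2-byte block, the next two bits select one of the 4 lines, and the remaining bits are what must be remembered to tell blocks apart.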
16
How do you tell if the right block is in the line?
Like luggage at the airport …
17
Tag-Check (B=2, S=4, N=1)
(addresses shown)
[Figure: memory addresses 0 through 1E beside a 4-line cache whose columns are Cache Index, Tag, and Data]
What should go in the tag?
entire address?
don’t need the bits we used in getting there
18
Mapping Memory Address to Cache
ttttttttttttttttt iiiiiiiiii oooo
tag: to check if have correct block
index: to select block*
byte offset: within block
* Direct map => 1 block per “set”
More generally, index to select set
19
Direct-Mapped Cache Example (1/3)
Suppose we have 8 KB of data in a direct-mapped cache with 16-byte blocks
Determine the size of the tag, index and offset fields if we’re using a 32-bit architecture
Offset
need to specify correct byte within a block
block contains 16 bytes = 2^4 bytes
need 4 bits to specify correct byte
20
Direct-Mapped Cache Example (2/3)
Index: (~index into an “array of blocks”)
need to specify correct block in cache
cache contains 8 KB = 2^13 bytes
block contains 16 B = 2^4 bytes
# blocks/cache = (bytes/cache) / (bytes/block) = (2^13 bytes/cache) / (2^4 bytes/block) = 2^9 blocks/cache
need 9 bits to specify this many blocks
21
Direct-Mapped Cache Example (3/3)
Tag: use remaining bits as tag
tag length = addr length - offset - index = 32 - 4 - 9 bits = 19 bits
so tag is leftmost 19 bits of memory address
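The three-step calculation above can be checked with a short sketch. The helper name `field_widths` is hypothetical, and it assumes the cache size and block size are powers of two.

```python
def field_widths(cache_bytes, block_bytes, addr_bits=32):
    """Return (tag_bits, index_bits, offset_bits) for a direct-mapped cache.

    Assumes cache_bytes and block_bytes are powers of two.
    """
    offset_bits = block_bytes.bit_length() - 1        # log2(bytes per block)
    num_blocks = cache_bytes // block_bytes           # blocks in the cache
    index_bits = num_blocks.bit_length() - 1          # log2(number of blocks)
    tag_bits = addr_bits - index_bits - offset_bits   # whatever remains
    return tag_bits, index_bits, offset_bits


print(field_widths(8 * 1024, 16))    # (19, 9, 4): the 8 KB example
print(field_widths(16 * 1024, 16))   # (18, 10, 4): a 16 KB cache with 16 B blocks
```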
22
Administration
Midterms to be returned in Tu/W lab
HW 8 (the last) out today, due ???
Proj 4 out today, due ?? with a partner
Pick the due dates and plan RRRRR Week
23
16 KB Direct Mapped Cache, 16B blocks
Valid bit: determines whether anything is stored in that row (when the computer is first turned on, all entries are invalid)
[Figure: cache table with columns Valid, Tag, 0x0-3, 0x4-7, 0x8-b, 0xc-f and rows Index 0 through 1023]
24
1. Load Byte 0x
[Figure: address split into Tag field, Index field, Offset; empty cache table]
25
So we read block 1
26
No valid data
Tag field: 000000000000000000, Index field: 0000000001, Offset: 0100
27
So load that data into cache, setting tag, valid
[Figure: row 1 now holds Valid = 1, Tag = 0, data a, b, c, d in words 0x0-3, 0x4-7, 0x8-b, 0xc-f]
28
Read from cache at offset, return word b
29
2. Read Byte 0x C
30
Index is Valid
Tag field: 000000000000000000, Index field: 0000000001, Offset: 1100
31
Index valid, Tag Matches
32
Index Valid, Tag Matches, return d
33
3. Load Byte 0x
34
So read block 3
35
No valid data
Tag field: 000000000000000000, Index field: 0000000011, Offset: 0100
36
Load that cache block, return word f
[Figure: row 3 now holds Valid = 1, Tag = 0, data e, f, g, h]
37
4. Load Byte 0x
38
So read Cache Block 1, Data is Valid
39
Cache Block 1 Tag does not match (0 != 2)
40
Miss, so replace block 1 with new data & tag
[Figure: row 1 now holds Valid = 1, Tag = 2, data i, j, k, l; row 3 still holds e, f, g, h]
41
And return byte: j
Tag field: 000000000000000010, Index field: 0000000001, Offset: 0100
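The four reads in this walkthrough can be replayed with a minimal simulator. This is only a sketch: it tracks valid bits and tags but not the data words, the class and helper names are made up for illustration, and the addresses are rebuilt from the binary Tag / Index / Offset fields printed on the slides (18, 10, and 4 bits for a 16 KB cache with 16 B blocks).

```python
class DirectMappedCache:
    """Valid/tag bookkeeping only; the data words themselves are not modelled."""

    def __init__(self, cache_bytes=16 * 1024, block_bytes=16):
        self.offset_bits = block_bytes.bit_length() - 1
        self.num_lines = cache_bytes // block_bytes
        self.index_bits = self.num_lines.bit_length() - 1
        self.valid = [False] * self.num_lines   # all entries invalid at power-on
        self.tags = [0] * self.num_lines

    def read(self, addr):
        index = (addr >> self.offset_bits) & (self.num_lines - 1)
        tag = addr >> (self.offset_bits + self.index_bits)
        if self.valid[index] and self.tags[index] == tag:
            return "hit"
        outcome = "miss, replace" if self.valid[index] else "miss"
        self.valid[index] = True                # load the block, set tag and valid
        self.tags[index] = tag
        return outcome


def make_addr(tag, index, offset):
    """Reassemble an address from the 18/10/4-bit fields used in the walkthrough."""
    return (tag << 14) | (index << 4) | offset


reads = [
    make_addr(0b0, 0b0000000001, 0b0100),    # read 1
    make_addr(0b0, 0b0000000001, 0b1100),    # read 2
    make_addr(0b0, 0b0000000011, 0b0100),    # read 3
    make_addr(0b10, 0b0000000001, 0b0100),   # read 4
]
cache = DirectMappedCache()
outcomes = [cache.read(a) for a in reads]
print(outcomes)   # ['miss', 'hit', 'miss', 'miss, replace']
```

The last read maps to the same line as the first two but carries a different tag, which is why it evicts the block loaded by read 1.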
42
What to do on a write hit?
Write-through
update the word in cache block and corresponding word in memory
Write-back
update word in cache block
allow memory word to be “stale”
add ‘dirty’ bit to each block indicating that memory needs to be updated when block is replaced
OS flushes cache before I/O…
Performance trade-offs?
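One way to get a feel for the performance trade-off is to count how many writes actually reach memory under each policy. The sketch below is illustrative only and makes several assumptions that are not on the slide: a direct-mapped cache with one-word blocks, a store miss allocates the line, and any dirty lines are flushed once at the end. The function name `memory_writes` is hypothetical.

```python
def memory_writes(store_addrs, num_lines, policy):
    """Count word writes that reach memory for a stream of stores.

    Assumptions (not from the slides): direct-mapped cache, one-word blocks,
    a store miss allocates the line, and dirty lines are flushed at the end.
    """
    lines = [None] * num_lines              # each entry: [tag, dirty] or None
    writes = 0
    for addr in store_addrs:
        index, tag = addr % num_lines, addr // num_lines
        line = lines[index]
        if policy == "write-through":
            writes += 1                     # every store also updates memory
            lines[index] = [tag, False]
        else:                               # write-back
            if line is not None and line[0] != tag and line[1]:
                writes += 1                 # dirty block is written back on replacement
            lines[index] = [tag, True]      # memory copy is now stale
    # final flush of whatever is still dirty (only write-back leaves dirty lines)
    writes += sum(1 for line in lines if line is not None and line[1])
    return writes


stores = [0, 0, 0, 4, 0]   # addresses 0 and 4 share line 0 when num_lines = 4
print(memory_writes(stores, 4, "write-through"))   # 5
print(memory_writes(stores, 4, "write-back"))      # 3
```

Repeated stores to the same block are where write-back saves memory traffic; the price is the dirty bit and the extra work at replacement time.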
43
Types of Cache Misses (Three C’s)
1st C: Compulsory Misses
occur when a program is first started
cache does not contain any of that program’s data yet, so misses are bound to occur
reduced with increasing block size
44
Types of Cache Misses (Three C’s)
1st C: Compulsory Misses
2nd C: Conflict Misses
miss that occurs because two distinct memory addresses map to the same cache line
when both are needed, they keep overwriting each other
Dealing with Conflict Misses
Solution 1: Make the cache size bigger
More lines, fewer conflicts
Conflicts far apart in address space remain
Solution 2: Multiple distinct blocks in the same cache Index
45
Fully Associative Cache (B=32)
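Any block anywhere
Memory address fields:
Offset: byte within block
Index: none
Tag: all the rest
Compare all tags in parallel
[Figure: address bits 31..5 are the Cache Tag (27 bits long) and bits 4..0 the Byte Offset; each entry holds Valid, Cache Tag, and Cache Data bytes B 0 through B 31, with one comparator (=) per tag]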
46
Types of Cache Misses (Three C’s)
1st C: Compulsory Misses
2nd C: Conflict Misses
3rd C: Capacity Misses
miss that occurs because the cache has a limited size
miss that would not occur if we increase the size of the cache
47
N-Way Set Associative Cache
Basic Idea
direct-map to a set
associative lookup of the N blocks within it
Memory address fields:
Tag: same as before
Offset: same as before
Index: points us to the correct “row” (called a set in this case)
Given memory address:
Find the correct set using the Index value.
Compare the Tag with all Tag values in that set.
If a match occurs, hit! Otherwise, a miss.
Finally, use the offset field as usual to find the desired data within the block.
48
Associative Cache Example
2-way set associative cache.
[Figure: memory addresses 0 through F beside a 2-way set associative cache; each Index selects a set of two blocks]
49
4-Way Set Associative Cache Circuit
tag index
50
Block Replacement Policy
Direct-Mapped Cache
index completely specifies which position a block can go in on a miss
N-Way Set Assoc
index specifies a set, but the block can occupy any position within the set on a miss
Fully Associative
block can be written into any position
Question: if we have the choice, where should we write an incoming block?
If there are any locations with valid bit off (empty), then usually write the new block into the first one.
If all possible locations already have a valid block, we must pick a replacement policy: the rule by which we determine which block gets “cached out” on a miss.
51
Block Replacement Policy: LRU
LRU (Least Recently Used)
Idea: cache out the block which has been accessed (read or write) least recently
Pro: temporal locality
recent past use implies likely future use: in fact, this is a very effective policy
Con: with 2-way set assoc, easy to keep track (one LRU bit); with 4-way or greater, requires complicated hardware and much time to keep track of this
52
Block Replacement Example
We have a 2-way set associative cache with a four word total capacity and one word blocks. We perform the following word accesses (ignore bytes for this problem): 0, 2, 0, 1, 4, 0, 2, 3, 5, 4 How many hits and how many misses will there be for the LRU block replacement policy?
53
Block Replacement: LRU
Addresses 0, 2, 0, 1, 4, 0, ...
[Figure: contents of set 0 and set 1, with the lru block marked, after each access]
0: miss, bring into set 0 (loc 0)
2: miss, bring into set 0 (loc 1)
0: hit
1: miss, bring into set 1 (loc 0)
4: miss, bring into set 0 (loc 1, replace 2)
0: hit
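The slide traces only the first six accesses; the full ten-access sequence from the problem statement can be replayed with a small LRU simulation. This is a sketch under the problem's stated setup (2 sets, 2 ways, one-word blocks, so the set index is the word address modulo the number of sets); the function name `simulate_lru` is illustrative.

```python
from collections import OrderedDict


def simulate_lru(addresses, num_sets=2, ways=2):
    """Count (hits, misses) for an N-way set associative cache with LRU replacement.

    One-word blocks are assumed, so the set index is the word address mod num_sets.
    """
    # One OrderedDict per set; the first key is always the least recently used block.
    sets = [OrderedDict() for _ in range(num_sets)]
    hits = misses = 0
    for addr in addresses:
        s = sets[addr % num_sets]
        if addr in s:
            hits += 1
            s.move_to_end(addr)        # now the most recently used block
        else:
            misses += 1
            if len(s) == ways:
                s.popitem(last=False)  # evict the least recently used block
            s[addr] = True
    return hits, misses


print(simulate_lru([0, 2, 0, 1, 4, 0, 2, 3, 5, 4]))   # (2, 8)
```

Only the two re-references to address 0 hit; every other access either touches a block for the first time or finds that its block was already evicted from its set.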
54
Big Idea
How to choose between associativity, block size, replacement & write policy?
Design against a performance model
Minimize: Average Memory Access Time = Hit Time + Miss Penalty x Miss Rate
influenced by technology & program behavior
Create the illusion of a memory that is large, cheap, and fast - on average
How can we improve miss penalty?
55
Improving Miss Penalty
When caches first became popular, Miss Penalty ~ 10 processor clock cycles
Today: 2400 MHz processor (0.4 ns per clock cycle) and 80 ns to go to DRAM
=> 200 processor clock cycles!
Solution: another cache between memory and the processor cache: Second Level (L2) Cache
[Figure: Proc -> $ -> $2 -> DRAM (MEM)]
56
An actual CPU – Early PowerPC
Cache
32 KB Instructions and 32 KB Data L1 caches
External L2 Cache interface with integrated controller and cache tags, supports up to 1 MByte external L2 cache
Dual Memory Management Units (MMU) with Translation Lookaside Buffers (TLB)
Pipelining
Superscalar (3 inst/cycle)
6 execution units (2 integer and 1 double precision IEEE floating point)
57
An Actual CPU – Pentium M
32KB I$ 32KB D$
58
And in Conclusion…
We would like to have the capacity of disk at the speed of the processor: unfortunately this is not feasible.
So we create a memory hierarchy:
each successively higher level contains “most used” data from the next lower level
exploits temporal & spatial locality
do the common case fast, worry less about the exceptions (design principle of MIPS)
Locality of reference is a Big Idea
59
And in Conclusion…
Mechanism for transparent movement of data among levels of a storage hierarchy
set of address/value bindings
address => index to set of candidates
compare desired address with tag
service hit or miss
load new block and binding on miss
[Figure: address split into tag, index, offset; cache table with columns Valid, Tag, 0x0-3, 0x4-7, 0x8-b, 0xc-f, one row holding a, b, c, d]
60
And in Conclusion…
We’ve discussed memory caching in detail. Caching in general shows up over and over in computer systems
Filesystem cache, Web page cache, Game databases / tablebases, Software memoization, Others?
Big idea: if something is expensive but we want to do it repeatedly, do it once and cache the result.
Cache design choices:
Size of cache: speed v. capacity
Block size (i.e., cache aspect ratio)
Write Policy (write through v. write back)
Associativity choice of N (direct-mapped v. set v. fully associative)
Block replacement policy
2nd level cache? 3rd level cache?
Use performance model to pick between choices, depending on programs, technology, budget, ...
61
Bonus slides These are extra slides that used to be included in lecture notes, but have been moved to this, the “bonus” area to serve as a supplement. The slides will appear in the order they would have in the normal presentation Bonus
62
TIO The great cache mnemonic
AREA (cache size, B) = HEIGHT (# of blocks) * WIDTH (size of one block, B/block)
2^(H+W) = 2^H * 2^W
[Figure: address fields Tag, Index, Offset beside a cache drawn as a rectangle of HEIGHT (# of blocks) by WIDTH (size of one block, B/block), with AREA = cache size in B]
63
Accessing data in a direct mapped cache
Memory Ex.: 16KB of data, direct-mapped, 4 word blocks Can you work out height, width, area? Read 4 addresses 0x 0x C 0x 0x Memory vals here: Address (hex) Value of Word C a b c d ... C e f g h C i j k l
64
Accessing data in a direct mapped cache
4 Addresses: 0x , 0x C, 0x , 0x 4 Addresses divided (for convenience) into Tag, Index, Byte Offset fields Tag Index Offset
65
Do an example yourself. What happens?
Choose from: Cache: Hit, Miss, Miss w. replace
Values returned: a, b, c, d, e, ..., k, l
Read address 0x ?
Read address 0x c ?
[Figure: cache state with columns Valid, Tag, 0x0-3, 0x4-7, 0x8-b, 0xc-f; index 1 holds Valid = 1, Tag = 2, data i, j, k, l; index 3 holds Valid = 1, Tag = 0, data e, f, g, h]
66
Answers
0x : a hit. Index = 3, Tag matches, Offset = 0, value = e
0x c: a miss. Index = 1, Tag mismatch, so replace from memory, Offset = 0xc, value = d
Since these are reads, the values must equal the memory values whether or not they are cached: 0x = e, 0x c = d
67
Block Size Tradeoff (1/3)
Benefits of Larger Block Size
Spatial Locality: if we access a given word, we’re likely to access other nearby words soon
Very applicable with Stored-Program Concept: if we execute a given instruction, it’s likely that we’ll execute the next few as well
Works nicely in sequential array accesses too
As I said earlier, block size is a tradeoff. In general, a larger block size will reduce the miss rate because it takes advantage of spatial locality. But remember, miss rate is NOT the only cache performance metric. You also have to worry about miss penalty. As you increase the block size, your miss penalty will go up because as the block gets larger, it will take you longer to fill up the block. Even if you look at miss rate by itself, which you should NOT, bigger block size does not always win. As you increase the block size, keeping the cache size constant, your miss rate will drop off rapidly at the beginning due to spatial locality. However, once you pass a certain point, your miss rate actually goes up. As a result of these two curves, the Average Access Time (point to equation), which is really a more important performance metric than the miss rate, will go down initially because the miss rate is dropping much faster than the miss penalty is increasing. But eventually, as you keep on increasing the block size, the average access time can go up rapidly because not only is the miss penalty increasing, the miss rate is increasing as well. Let me show you why your miss rate may go up as you increase the block size by another extreme example.
68
Block Size Tradeoff (2/3)
Drawbacks of Larger Block Size
Larger block size means larger miss penalty
on a miss, takes longer time to load a new block from next level
If block size is too big relative to cache size, then there are too few blocks
Result: miss rate goes up
In general, minimize Average Memory Access Time (AMAT) = Hit Time + Miss Penalty x Miss Rate
69
Block Size Tradeoff (3/3)
Hit Time: time to find and retrieve data from current level cache
Miss Penalty: average time to retrieve data on a current level miss (includes the possibility of misses on successive levels of memory hierarchy)
Hit Rate: % of requests that are found in current level cache
Miss Rate: 1 - Hit Rate
70
Extreme Example: One Big Block
Cache Size = 4 bytes, Block Size = 4 bytes
Only ONE entry (row) in the cache!
[Figure: a single cache row with Valid Bit, Tag, and Cache Data bytes B 0, B 1, B 2, B 3]
If an item is accessed, it is likely to be accessed again soon
But it is unlikely to be accessed again immediately!
The next access will likely be a miss again
Continually loading data into the cache but discarding it (forcing it out) before it is used again
Nightmare for cache designer: Ping Pong Effect
71
Block Size Tradeoff Conclusions
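[Figure: three plots against Block Size]
Miss Penalty vs. Block Size
Miss Rate vs. Block Size: exploits spatial locality at first; with fewer blocks, compromises temporal locality
Average Access Time vs. Block Size: rises at large block sizes from increased miss penalty & miss rate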
72
Analyzing Multi-level cache hierarchy
[Figure: Proc -> $ -> $2 -> DRAM, annotated with L1 hit time, L1 Miss Rate, L1 Miss Penalty and L2 hit time, L2 Miss Rate, L2 Miss Penalty]
Avg Mem Access Time = L1 Hit Time + L1 Miss Rate * L1 Miss Penalty
L1 Miss Penalty = L2 Hit Time + L2 Miss Rate * L2 Miss Penalty
Avg Mem Access Time = L1 Hit Time + L1 Miss Rate * (L2 Hit Time + L2 Miss Rate * L2 Miss Penalty)
73
Example
Assume
Hit Time = 1 cycle
Miss rate = 5%
Miss penalty = 20 cycles
Calculate AMAT…
Avg mem access time = 1 + 0.05 x 20 = 1 + 1 cycles = 2 cycles
74
Ways to reduce miss rate
Larger cache
limited by cost and technology
hit time of first level cache < cycle time (bigger caches are slower)
More places in the cache to put each block of memory: associativity
fully-associative: any block, any line
N-way set associative: N places for each block
direct map: N=1
75
Typical Scale
L1
size: tens of KB
hit time: complete in one clock cycle
miss rates: 1-5%
L2
size: hundreds of KB
hit time: few clock cycles
miss rates: 10-20%
L2 miss rate is the fraction of L1 misses that also miss in L2
why so high?
76
Example: with L2 cache
Assume
L1 Hit Time = 1 cycle
L1 Miss rate = 5%
L2 Hit Time = 5 cycles
L2 Miss rate = 15% (% L1 misses that miss)
L2 Miss Penalty = 200 cycles
L1 miss penalty = 5 + 0.15 * 200 = 35
Avg mem access time = 1 + 0.05 x 35 = 2.75 cycles
77
Example: without L2 cache
Assume
L1 Hit Time = 1 cycle
L1 Miss rate = 5%
L1 Miss Penalty = 200 cycles
Avg mem access time = 1 + 0.05 x 200 = 11 cycles
4x faster with L2 cache! (2.75 vs. 11)
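The two examples, with and without an L2 cache, use the same AMAT formula applied once or twice. A minimal sketch, with `amat` as an illustrative helper name and the cycle counts and miss rates taken from the examples:

```python
def amat(hit_time, miss_rate, miss_penalty):
    """Average memory access time = hit time + miss rate * miss penalty."""
    return hit_time + miss_rate * miss_penalty


# With L2: the L1 miss penalty is itself the AMAT of the L2 cache.
l1_miss_penalty = amat(hit_time=5, miss_rate=0.15, miss_penalty=200)
with_l2 = amat(hit_time=1, miss_rate=0.05, miss_penalty=l1_miss_penalty)

# Without L2: every L1 miss pays the full 200-cycle trip to DRAM.
without_l2 = amat(hit_time=1, miss_rate=0.05, miss_penalty=200)

# Roughly 35, 2.75, 11 cycles and a 4x ratio (floating-point rounding may show).
print(l1_miss_penalty, with_l2, without_l2, without_l2 / with_l2)
```

Nesting the call mirrors the multi-level formula: each level's miss penalty is the average access time of the level below it.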