1 Memory Hierarchy ( Ⅲ )
2 Outline The memory hierarchy Cache memories Suggested Reading: 6.3, 6.4
3 Storage technologies and trends Locality The memory hierarchy Cache memories
4 Memory Hierarchy Fundamental properties of storage technology and computer software –Different storage technologies have widely different access times –Faster technologies cost more per byte than slower ones and have less capacity –The gap between CPU and main memory speed is widening –Well-written programs tend to exhibit good locality
5 An example memory hierarchy
Smaller, faster, and costlier (per byte) storage devices at the top; larger, slower, and cheaper (per byte) storage devices at the bottom:
–L0: CPU registers hold words retrieved from cache memory
–L1: on-chip L1 cache (SRAM) holds cache lines retrieved from the L2 cache
–L2: off-chip L2 cache (SRAM) holds cache lines retrieved from main memory
–L3: main memory (DRAM) holds disk blocks retrieved from local disks
–L4: local secondary storage (local disks) holds files retrieved from disks on remote network servers
–L5: remote secondary storage (distributed file systems, Web servers)
6 Caches in the memory hierarchy
Fundamental idea of a memory hierarchy:
–For each k, the faster, smaller device at level k serves as a cache for the larger, slower device at level k+1
Why do memory hierarchies work?
–Because of locality, programs tend to access the data at level k more often than they access the data at level k+1
–Thus, the storage at level k+1 can be slower, and thus larger and cheaper per bit
Big Idea: The memory hierarchy creates a large pool of storage that costs as much as the cheap storage near the bottom, but that serves data to programs at the rate of the fast storage near the top.
7 General Cache Concepts
–The smaller, faster, more expensive memory (the cache) holds a subset of the blocks
–The larger, slower, cheaper memory is viewed as partitioned into "blocks" (e.g. blocks 0-15)
–Data is copied between the two levels in block-sized transfer units
8 General Cache Concepts: Hit
–Data in block b is needed (e.g. Request: 14)
–Block b is in the cache: Hit!
9 General Cache Concepts: Miss
–Data in block b is needed (e.g. Request: 12)
–Block b is not in the cache: Miss!
–Block b is fetched from memory
–Block b is stored in the cache
Placement policy: determines where b goes
Replacement policy: determines which block gets evicted (the victim)
10 Types of Cache Misses
Cold (compulsory) miss
–Cold misses occur because the cache is empty
Capacity miss
–Occurs when the set of active cache blocks (the working set) is larger than the cache
11 Types of Cache Misses
Conflict miss
–Most caches limit blocks at level k+1 to a small subset (sometimes a singleton) of the block positions at level k
e.g. block i at level k+1 must be placed in block (i mod 4) at level k
–Conflict misses occur when the level k cache is large enough, but multiple data objects all map to the same level k block
e.g. referencing blocks 0, 8, 0, 8, 0, 8, ... would miss every time
12 Cache Memory History
–At the very beginning, 3 levels: registers, main memory, disk storage
–10 years later, 4 levels: registers, SRAM cache, main DRAM memory, disk storage
–Modern processors, 5~6 levels: registers, SRAM L1, L2 (, L3) caches, main DRAM memory, disk storage
–Cache memories: small, fast SRAM-based memories managed automatically by hardware; can be on-chip, on-die, or off-chip
13 Examples of Caching in the Hierarchy

Cache Type            What is Cached?       Where is it Cached?   Latency (cycles)  Managed By
Registers             4-8 byte words        CPU core              0                 Compiler
TLB                   Address translations  On-chip TLB           0                 Hardware
L1 cache              64-byte blocks        On-chip L1            1                 Hardware
L2 cache              64-byte blocks        On/off-chip L2        10                Hardware
Virtual memory        4-KB pages            Main memory           100               Hardware + OS
Buffer cache          Parts of files        Main memory           100               OS
Network buffer cache  Parts of files        Local disk            10,000,000        AFS/NFS client
Browser cache         Web pages             Local disk            10,000,000        Web browser
Web cache             Web pages             Remote server disks   1,000,000,000     Web proxy server
Disk cache            Disk sectors          Disk controller       100,000           Disk firmware
14 Storage technologies and trends Locality The memory hierarchy Cache memories
15 Cache Memory
A cache memory sits on the CPU chip between the register file and the bus interface, connected through the system bus and memory bus (via the I/O bridge) to main memory.
–The CPU looks first for data in L1, then in L2, then in main memory
–Caches hold frequently accessed blocks of main memory
16 Inserting an L1 cache between the CPU and main memory
–The tiny, very fast CPU register file has room for four 4-byte words
–The small fast L1 cache has room for two 8-word blocks (line 0 and line 1)
–The big slow main memory has room for many 8-word blocks (e.g. block 10: a b c d ..., block 21: p q r s ..., block 30: w x y z ...)
–The transfer unit between the cache and main memory is an 8-word block (32 bytes)
–The transfer unit between the CPU register file and the cache is a 4-byte word
17 Generic Cache Memory Organization
–A cache is an array of S = 2^s sets
–Each set contains E lines
–Each line holds a block of B = 2^b bytes of data, plus 1 valid bit and t tag bits
–Line layout: valid | tag | byte 0, byte 1, ..., byte B-1
18 Cache Memory Fundamental parameters

Parameter     Description
S = 2^s       Number of sets
E             Number of lines per set
B = 2^b       Block size (bytes)
m = log2(M)   Number of physical (main memory) address bits
19 Cache Memory Derived quantities

Parameter        Description
M = 2^m          Maximum number of unique memory addresses
s = log2(S)      Number of set index bits
b = log2(B)      Number of block offset bits
t = m - (s+b)    Number of tag bits
C = B x E x S    Cache size (bytes), not including overhead such as the valid and tag bits
20 Memory Accessing
For a memory-access instruction such as
–movl A, %eax
the cache is accessed with address A directly:
–If cache hit: get the value from the cache
–Otherwise: handle the cache miss, then get the value
21 Addressing caches
A physical address A of m bits (bit m-1 down to bit 0) is split into 3 parts:
–t tag bits | s set index bits | b block offset bits
22 Direct-mapped cache
–Simplest kind of cache
–Characterized by exactly one line per set (E = 1)
–Each set: valid | tag | cache block
23 Accessing Direct-Mapped Caches
Three steps
–Set selection
–Line matching
–Word extraction
24 Set selection
–Use the s set index bits of the address to determine the set of interest
–e.g. set index 00001 selects set 1
25 Line matching
Find a valid line in the selected set (i) with a matching tag:
–(1) The valid bit must be set
–(2) The tag bits in the cache line must match the tag bits in the address (e.g. tag 0110)
26 Word Extraction
–The block offset selects the starting byte of the desired word within the block
–e.g. block offset 100 selects byte 4, the start of word w1 in the block w0 w1 w2 w3
27 Simple Memory System Cache
Cache
–16 lines
–4-byte line size
–Direct mapped
12-bit address: bits 11-6 tag, bits 5-2 index, bits 1-0 offset
28 Simple Memory System Cache

Idx  Tag  Valid  B0  B1  B2  B3
0    19   1      99  11  23  11
1    15   0      -   -   -   -
2    1B   1      00  02  04  08
3    36   0      -   -   -   -
4    32   1      43  6D  8F  09
5    0D   1      36  72  F0  1D
6    31   0      -   -   -   -
7    16   1      11  C2  DF  03
29 Simple Memory System Cache

Idx  Tag  Valid  B0  B1  B2  B3
8    24   1      3A  00  51  89
9    2D   0      -   -   -   -
A    2D   1      93  15  DA  3B
B    0B   0      -   -   -   -
C    12   0      -   -   -   -
D    16   1      04  96  34  15
E    13   1      83  77  1B  D3
F    14   0      -   -   -   -
30 Address Translation Example
Address: 0x354 = 0b 0011 0101 0100
–Offset (bits 1-0): 0x0
–Index (bits 5-2): 0x05
–Tag (bits 11-6): 0x0D
–Hit? Yes (set 5 is valid with tag 0D)
–Byte returned: 0x36
31 Line Replacement on Misses
–Check the cache line of the set indicated by the set index bits
–If the cache line is valid, it must be evicted
–Retrieve the requested block from the next level
–The current line is replaced by the newly fetched line
32 Check the cache line
–In the selected set (i), check the valid bit
–If the valid bit is set, evict the line
33 Get the Address of the Starting Byte
–Consider the m-bit memory address A: xxx ... xxx
–Clear the last (block offset) bits to get the block's starting address A': xxx ... x 000 ... 000
34 Read the Block from the Memory
–Put A' on the system bus (A' is A with the block offset bits cleared to 000)
35 Read the Block from the Memory
–Main memory reads A' from the memory bus, retrieves the data x at A', and places it on the bus
36 Read the Block from the Memory
–The CPU reads x from the bus and copies it into the cache line
37 Read the Block from the Memory
–Increase A' by 4 and copy the next word (y at A'+4, then z, w, ...) into the cache line
–Repeat several times (4 or 8 transfers) until the whole block is filled
38 Cache line, set and block
Block
–A fixed-size packet of information that moves back and forth between a cache and main memory (or a lower-level cache)
Line
–A container in a cache that stores a block, the valid bit, the tag bits, and other information
Set
–A collection of one or more lines
39 Direct-mapped cache simulation
Example: M=16 byte addresses, B=2 bytes/block, S=4 sets, E=1 line/set
Address format: t=1 tag bit, s=2 set index bits, b=1 offset bit
Address trace (reads): 0 [0000], 1 [0001], 13 [1101], 8 [1000], 0 [0000]
Initially all four sets are invalid (v=0)
40 Direct-mapped cache simulation
(1) Read 0 [0000]: miss. Set 0 now holds v=1, tag=0, data m[0] m[1]
41 Direct-mapped cache simulation
(2) Read 1 [0001]: hit. Set 0 (tag 0) already holds m[0] m[1]
42 Direct-mapped cache simulation
(3) Read 13 [1101]: miss. Set 2 now holds v=1, tag=1, data m[12] m[13]
43 Direct-mapped cache simulation
(4) Read 8 [1000]: miss. Set 0 is replaced: v=1, tag=1, data m[8] m[9]
44 Direct-mapped cache simulation
(5) Read 0 [0000]: miss. Set 0 is replaced again: v=1, tag=0, data m[0] m[1]
Addresses 0 and 8 keep evicting each other from set 0: Thrashing!
46 Conflict Misses in Direct-Mapped Caches

float dotprod(float x[8], float y[8])
{
    float sum = 0.0;
    int i;

    for (i = 0; i < 8; i++)
        sum += x[i] * y[i];
    return sum;
}
47 Conflict Misses in Direct-Mapped Caches
Assumptions for x and y
–x is loaded into the 32 bytes of contiguous memory starting at address 0
–y starts immediately after x, at address 32
Assumptions for the cache
–A block is 16 bytes, big enough to hold four floats
–The cache consists of two sets, for a total cache size of 32 bytes
48 Conflict Misses in Direct-Mapped Caches
Thrashing
–Reading x[0] loads x[0] ~ x[3] into the cache
–Reading y[0] then overwrites that cache line with y[0] ~ y[3]
49 Conflict Misses in Direct-Mapped Caches
Padding can avoid thrashing
–Declare x[12] instead of x[8], so y no longer maps to the same sets as x
50 Direct-mapped cache simulation

Address (decimal)  Tag bits (t=1)  Index bits (s=2)  Offset bits (b=1)  Set number (decimal)
0                  0               00                0                  0
1                  0               00                1                  0
2                  0               01                0                  1
3                  0               01                1                  1
4                  0               10                0                  2
5                  0               10                1                  2
6                  0               11                0                  3
7                  0               11                1                  3
8                  1               00                0                  0
9                  1               00                1                  0
10                 1               01                0                  1
11                 1               01                1                  1
12                 1               10                0                  2
13                 1               10                1                  2
14                 1               11                0                  3
15                 1               11                1                  3
51 Direct-mapped cache simulation (high-order bits as index)

Address (decimal)  Index bits (s=2)  Tag bits (t=1)  Offset bits (b=1)  Set number (decimal)
0                  00                0               0                  0
1                  00                0               1                  0
2                  00                1               0                  0
3                  00                1               1                  0
4                  01                0               0                  1
5                  01                0               1                  1
6                  01                1               0                  1
7                  01                1               1                  1
8                  10                0               0                  2
9                  10                0               1                  2
10                 10                1               0                  2
11                 10                1               1                  2
12                 11                0               0                  3
13                 11                0               1                  3
14                 11                1               0                  3
15                 11                1               1                  3
52 Why use middle bits as index?
4-line cache, 4-bit block numbers:
–High-Order Bit Indexing: blocks 0000-0011 map to set 00, 0100-0111 to set 01, 1000-1011 to set 10, 1100-1111 to set 11
–Middle-Order Bit Indexing: consecutive blocks 0000, 0001, 0010, 0011 map to sets 00, 01, 10, 11 respectively, and the pattern repeats
53 Why use middle bits as index?
High-Order Bit Indexing
–Adjacent memory lines would map to the same cache entry
–Poor use of spatial locality
Middle-Order Bit Indexing
–Consecutive memory lines map to different cache lines
–The cache can hold a C-byte region of the address space at one time
54 Set associative caches
–Characterized by more than one line per set (here E=2 lines per set)
–Each set: two (valid, tag, cache block) lines
55 Accessing set associative caches
Set selection
–identical to direct-mapped cache: the set index bits select the set (e.g. index 00001 selects set 1)
56 Accessing set associative caches
Line matching and word selection
–must compare the tag in each valid line in the selected set
(1) The valid bit must be set
(2) The tag bits in one of the cache lines must match the tag bits in the address
(3) If (1) and (2), then cache hit, and the block offset selects the starting byte
57 Associative Cache
Cache
–16 lines
–4-byte line size
–2-way set associative (8 sets)
12-bit address: bits 11-5 tag, bits 4-2 index, bits 1-0 offset
58 Simple Memory System Cache

Idx  Tag  Valid  B0  B1  B2  B3
0    19   1      99  11  23  11
0    15   0      -   -   -   -
1    1B   1      00  02  04  08
1    36   0      -   -   -   -
2    32   1      43  6D  8F  09
2    0D   1      36  72  F0  1D
3    31   0      -   -   -   -
3    16   1      11  C2  DF  03
59 Simple Memory System Cache

Idx  Tag  Valid  B0  B1  B2  B3
4    24   1      3A  00  51  89
4    2D   0      -   -   -   -
5    2D   1      93  15  DA  3B
5    0B   0      -   -   -   -
6    12   0      -   -   -   -
6    16   1      04  96  34  15
7    13   1      83  77  1B  D3
7    14   0      -   -   -   -
60 Address Translation Example
Address: 0x354 = 0b 0011 0101 0100
–Offset (bits 1-0): 0x0
–Index (bits 4-2): 0x05
–Tag (bits 11-5): 0x1A
–Hit? No (neither line in set 5 holds tag 0x1A)
61 Line Replacement on Misses
If all the cache lines of the set are valid, which line is selected to be evicted?
–LFU (least frequently used): replace the line that has been referenced the fewest times over some past time window
–LRU (least recently used): replace the line that was last accessed the furthest in the past
All of these policies require additional time and hardware
62 Set Associative Cache Simulation
Example: M=16 byte addresses, B=2 bytes/block, S=2 sets, E=2 lines/set
Address format: t=2 tag bits, s=1 set index bit, b=1 offset bit
Address trace (reads): 0 [0000], 1 [0001], 13 [1101], 8 [1000], 0 [0000]
Initially both lines of both sets are invalid (v=0)
63 Set Associative Cache Simulation
(1) Read 0 [0000]: miss. Set 0, line 0 now holds v=1, tag=00, data m[0] m[1]
64 Set Associative Cache Simulation
(2) Read 1 [0001]: hit. Set 0, line 0 (tag 00) already holds m[0] m[1]
65 Set Associative Cache Simulation
(3) Read 13 [1101]: miss. Set 0, line 1 now holds v=1, tag=11, data m[12] m[13]
66 Set Associative Cache Simulation
(4) Read 8 [1000]: miss. Set 0 is full; the LRU line (tag 00) is replaced: line 0 now holds v=1, tag=10, data m[8] m[9]
67 Set Associative Cache Simulation
(5) Read 0 [0000]: miss. The LRU line (tag 11) is replaced: line 1 now holds v=1, tag=00, data m[0] m[1]
68 Fully associative caches
–Characterized by all of the lines being in one and only one set (E = C/B lines in the set)
–No set index bits in the address: t tag bits | b block offset bits
69 Accessing fully associative caches
Word selection
–must compare the tag in each valid line
(1) The valid bit must be set
(2) The tag bits in one of the cache lines must match the tag bits in the address
(3) If (1) and (2), then cache hit, and the block offset selects the starting byte
70 Issues with Writes
Write hits
–Write-through: the cache updates its copy and immediately writes the corresponding cache block to memory
–Write-back: defers the memory update as long as possible, writing the updated block to memory only when it is evicted from the cache; maintains a dirty bit for each cache line
71 Issues with Writes
Write misses
–Write-allocate: loads the corresponding memory block into the cache, then updates the cache block
–No-write-allocate: bypasses the cache and writes the word directly to memory
Typical combinations
–Write-through + no-write-allocate
–Write-back + write-allocate (the modern implementation)
72 Multi-level caches (Core i7)
–L1 d-cache and i-cache: 32 KB, 8-way, access: 4 cycles
–L2 unified cache: 256 KB, 8-way, access: 11 cycles
–L3 unified cache: 8 MB, 16-way, access: 30~40 cycles
–Block size: 64 bytes for all caches
73 Cache performance metrics
Miss Rate
–fraction of memory references not found in cache (misses / references)
–Typical numbers: 3-10% for L1; can be quite small (<1%) for L2, depending on size
Hit Rate
–fraction of memory references found in cache (1 - miss rate)
74 Cache performance metrics
Hit Time
–time to deliver a line in the cache to the processor (includes time to determine whether the line is in the cache)
–Typical numbers: 1-2 clock cycles for L1 (4 cycles in Core i7); 5-10 clock cycles for L2 (11 cycles in Core i7)
Miss Penalty
–additional time required because of a miss
–Typically 50-200 cycles for main memory (trend: increasing!)
75 What does Hit Rate Mean?
Consider
–Hit Time: 2 cycles
–Miss Penalty: 200 cycles
Average access time (hit time + miss rate x miss penalty):
–Hit rate 99%: 2 + 0.01 x 200 = 4 cycles
–Hit rate 97%: 2 + 0.03 x 200 = 8 cycles
A 2% drop in hit rate doubles the average access time; this is why "miss rate" is used instead of "hit rate"
76 Cache performance metrics
Cache size
–Hit rate vs. hit time
Block size
–Spatial locality vs. temporal locality
Associativity
–Thrashing, cost, speed, miss penalty
Write strategy
–Simplicity, read misses, fewer transfers