Published by Meagan McKenzie. Modified over 9 years ago.
1
1 ECE3055 Computer Architecture and Operating Systems
Lecture 8: Memory Subsystem
Prof. Hsien-Hsin Sean Lee
School of Electrical and Computer Engineering, Georgia Institute of Technology
2
2 Performance matters
Consider a basic 5-stage pipeline design:
- CPU runs at a 1 ns cycle time (1 GHz)
- Main memory runs at a 100 ns access time
How is performance?
- Fetch A (100 cycles)
- Decode A (1 cycle) + Fetch B (100 cycles)
- Ex A (1 cycle) + Dec B (1 cycle) + Fetch C (100 cycles)
- Effectively 1 instruction per 100 cycles, i.e. a 10 MHz CPU
Latency is killing the system, and the problem is only getting worse:
- CPU speeds grow much faster than DRAM speeds
- DRAM bandwidth is improving well, but access latency is not
3
3 Typical Solution
With the large gap between CPU and DRAM speeds, we must have a solution:
- Cache (“slave memories”)
- Ultra-fast, small local memory
If the cache runs at a 1 ns access time:
- Fetch A (1 cycle)
- Decode A (1 cycle) + Fetch B (1 cycle)
- Ex A (1 cycle) + Dec B (1 cycle) + Fetch C (1 cycle)
- Effectively 1 instruction per cycle, i.e. a 1 GHz CPU
Special terms: hit, miss, rates (hit rate / miss rate)
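The arithmetic behind the two pipeline scenarios (100 ns fetch vs. 1 ns fetch) can be sketched in a few lines of Python. This is only a rough model, assuming the pipeline advances at the pace of its slowest stage and that every fetch hits the cache in the second case; the function name `effective_clock_hz` is made up for illustration.

```python
def effective_clock_hz(cycle_time_ns, fetch_latency_ns):
    """Instructions completed per second when instruction fetch is the slowest stage."""
    # A pipeline cannot complete instructions faster than its slowest stage.
    cycles_per_instr = max(1.0, fetch_latency_ns / cycle_time_ns)
    clock_hz = 1e9 / cycle_time_ns
    return clock_hz / cycles_per_instr

# Numbers from the slides: 1 ns cycle time, 100 ns DRAM vs. 1 ns cache.
no_cache = effective_clock_hz(1.0, 100.0)
with_cache = effective_clock_hz(1.0, 1.0)
print(f"{no_cache / 1e6:.0f} MHz vs {with_cache / 1e6:.0f} MHz")
```

With the slide's numbers this prints `10 MHz vs 1000 MHz`, matching the "10MHz CPU" and "1GHz CPU" figures.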
4
4 Memories: Two basic types*
SRAM: value is stored on a pair of inverting gates
- very fast, but takes up more space than DRAM (4 to 6 transistors)
DRAM: value is stored as a charge on a capacitor (must be refreshed)
- very small, but slower than SRAM (by a factor of 5 to 10)
* ignoring new technologies ‘on the horizon’
[Figure: SRAM cell (word line, pass transistor, bit line, bit line bar) and DRAM cell (word line, pass transistor, capacitor, bit line)]
5
5 Memory Photos
- Intel Paxville (dual core), 90nm: 8-way 2MB L2 for each core
- Intel Itanium 2, 0.13µm: 24-way 6MB L3
- 240-pin DDR2 DRAM
6
6 Exploiting Memory Hierarchy
Users want large and fast memories! For example:
- SRAM access times are 700 ps to 1 ns (1-3 cycles)
- DRAM access times are 60-100 ns (100-250 cycles)
- Disk access times are ~1 million ns (~3M cycles)
Try to give it to them anyway:
- build a memory hierarchy
7
7 Model of Memory Hierarchy
[Figure: Reg File, L1 Data cache and L1 Inst cache, L2 Cache, Main Memory, Disk; the caches are labeled SRAM and main memory is labeled DRAM]
8
8 P4: Prescott w/ 2MB L2 (90nm)
- Prescott runs very fast (3.4+ GHz)
- 2MB L2 unified cache
- 12K* trace cache (think “I$”)
- 16KB data cache
Questions about the die photo:
- Where is the cache?
- What about the similar blocks?
- Why the visual differences?
- Why is it square?
- What’s with the colors?
Check this out: www.chip-architect.com
[Die photo labels: TC, L1D, L2]
9
9 Interfacing Processors and Peripherals
I/O design is affected by many factors (expandability, resilience)
Performance:
- access latency
- throughput
- connection between devices and the system
- the memory hierarchy
- the operating system
A variety of different users (e.g., banks, supercomputers, engineers)
10
10 I/O Devices
Very diverse devices:
- behavior (i.e., input vs. output)
- partner (who is at the other end?)
- data rate
11
11 I/O Example: Disk Drives
To access data:
- seek: position the head over the proper track (3.5-10 ms avg.)
- rotational latency: wait for the desired sector (on average half a rotation, i.e. 0.5 / RPM minutes)
- transfer: grab the data (one or more sectors) at 2 to 15 MB/s, not considering disk buffer hits (100-320 MB/s)
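To make the three components of a disk access concrete, here is a small Python sketch that adds them up. The drive parameters in the example call (6 ms seek, 7200 RPM, a 4 KB read at 10 MB/s) are illustrative assumptions picked from within the ranges on the slide, not figures for any particular drive, and `disk_access_ms` is a hypothetical helper name.

```python
def disk_access_ms(seek_ms, rpm, bytes_to_read, transfer_mb_per_s):
    """Average access time = seek + rotational latency + transfer, in milliseconds."""
    # On average the desired sector is half a rotation away: 0.5 / RPM minutes.
    rotational_ms = 0.5 / rpm * 60_000.0
    # Transfer time for the requested bytes at the sustained media rate.
    transfer_ms = bytes_to_read / (transfer_mb_per_s * 1e6) * 1000.0
    return seek_ms + rotational_ms + transfer_ms

# Assumed example drive: 6 ms average seek, 7200 RPM, 4 KB read at 10 MB/s.
example_ms = disk_access_ms(seek_ms=6.0, rpm=7200, bytes_to_read=4096,
                            transfer_mb_per_s=10.0)
print(f"{example_ms:.2f} ms")
```

For small reads, seek and rotational latency dominate; the transfer itself is well under a millisecond in this example.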
12
12 Locality
A principle that makes having a memory hierarchy a good idea:
- Temporal locality: if an item is referenced, it will tend to be referenced again soon
- Spatial locality: if an item is referenced, nearby items will tend to be referenced soon
Why does code have locality? What about data locality?
Our initial focus: two levels (upper, lower)
- block: minimum unit of data
- hit: data requested is in the upper level
- miss: data requested is not in the upper level
13
13 Cache
Two issues:
- How do we know if a data item is in the cache?
- If it is, how do we find it?
Our first example:
- block size is one word of data
- "direct mapped"
For each item of data at the lower level, there is exactly one location in the cache where it might be,
so lots of items at the lower level share locations in the upper level
14
14 Direct Mapped Cache
Mapping: the cache location (index) is the block address modulo the number of blocks in the cache
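The modulo mapping can be written out as a short Python sketch. It assumes a byte-addressable machine, and the defaults (8 blocks of 4 bytes) and sample addresses (#24, #28, #60, #188) are borrowed from the lecture's worked example; the helper name `split_address` is invented here.

```python
def split_address(addr, num_blocks=8, block_bytes=4):
    """Split a byte address into (tag, index, byte_offset) for a direct-mapped cache."""
    byte_offset = addr % block_bytes        # position inside the block
    block_addr = addr // block_bytes        # drop the byte-offset bits
    index = block_addr % num_blocks         # "address modulo the number of blocks"
    tag = block_addr // num_blocks          # whatever is left identifies the block
    return tag, index, byte_offset

for addr in (24, 28, 60, 188):
    tag, index, offset = split_address(addr)
    print(f"#{addr}: tag={tag:03b} index={index} offset={offset}")
```

Addresses #28, #60 and #188 all land on index 7 and differ only in their tags, which is why a tag comparison is needed to tell them apart.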
15
15 Direct Mapped Cache
For MIPS:
What kind of locality are we taking advantage of?
16
16 Example: DM$, 8-Entry, 4B (slides 16-27: first access, lw $0, #24)
Setup:
- Direct-mapped cache with 8 entries (index 0-7); each entry has Valid, Tag and Data (one 4-byte block); all Valid bits start at 0
- Main memory (system) holds 4-byte items [a] at #24, [b] at #28, [c] at #60 and [d] at #188
Access sequence:
  lw $0, #24 [a]
  lw $1, #28 [b]
  sw $2, #60 [c]
  sw $3, #188 [d]
Step 1: lw $0, #24
- #24 is 0001 1000
- 4-byte block, so drop the low 2 bits (00) for the byte offset! This only matters for byte-addressable systems
- The next log2(8) = 3 bits (block address mod 8) are the index: 110 = 6
- The remaining upper bits are the tag: 000
- Entry 6 is not valid, so this is a miss: copy [a] into entry 6, store tag 000, set Valid = 1
Cache afterwards: entry 6 = (Valid 1, Tag 000, [a] copy); all other entries invalid
28
28 Example: DM$, 8-Entry, 4B (slides 28-31: second access, lw $1, #28)
Step 2: lw $1, #28
- #28 is 0001 1100: byte offset 00, index 111 = 7, tag 000
- Entry 7 is not valid, so this is a miss: copy [b] into entry 7, store tag 000, set Valid = 1
Cache afterwards: entry 6 = (Valid 1, Tag 000, [a] copy), entry 7 = (Valid 1, Tag 000, [b] copy)
32
32 Example: DM$, 8-Entry, 4B (slides 32-40: third access, sw $2, #60)
Step 3: sw $2, #60
- #60 is 0011 1100: byte offset 00, index 111 = 7, tag 001
- Entry 7 is valid! How do we tell it holds the wrong address?
- The tags don’t match (stored 000 vs. requested 001): it’s not what we want to access!
- Miss: [b] is removed from entry 7, then [c] is copied in with tag 001 and Valid = 1
Cache afterwards: entry 6 = (Valid 1, Tag 000, [a] copy), entry 7 = (Valid 1, Tag 001, [c] copy)
Q: What if the machine is only word-addressable?
- #60 is 0011 1100
- There is no byte offset, so the tag is 2 bits larger; otherwise it works the same way (note that the index and data change!)
41
41 Example: DM$, 8-Entry, 4B (slides 41-51: writing back, sw $3, #188)
Q: What about writing back to memory?
- After sw $2, #60 the cache holds [c] NEW (entry 7, tag 001) while main memory still holds [c] OLD
- Do we update memory now? Or later? Assume later...
Step 4: sw $3, #188
- #188 is 1011 1100: byte offset 00, index 111 = 7, tag 101
- Entry 7 is valid but holds tag 001, so [c] has to leave the cache. Now what? How do we know to write it back?
- Need extra state! The “dirty” bit! Each entry gets a Dirty bit; entry 7 has Dirty = 1 because the store modified it
- Write [c] NEW back to main memory, then clear Valid and Dirty for entry 7
- Copy [d] into entry 7 with tag 101 and set Valid = 1
- Perform the store: entry 7 now holds [d] NEW with Dirty = 1, while main memory still holds [d] OLD
Cache afterwards: entry 6 = (Valid 1, Dirty 0, Tag 000, [a] copy), entry 7 = (Valid 1, Dirty 1, Tag 101, [d] NEW)
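The four-access walk-through can be replayed with a small simulator. The sketch below is an illustration, not the lecture's own code: it assumes a write-back, write-allocate policy (the slides only say to "assume later" for memory updates), tracks state bits rather than data, and uses made-up names (`Line`, `DirectMappedCache`, `writebacks`).

```python
from dataclasses import dataclass

@dataclass
class Line:
    valid: bool = False
    dirty: bool = False
    tag: int = 0

class DirectMappedCache:
    def __init__(self, num_entries=8, block_bytes=4):
        self.num_entries = num_entries
        self.block_bytes = block_bytes
        self.lines = [Line() for _ in range(num_entries)]
        self.writebacks = []  # byte addresses of blocks written back to memory

    def access(self, addr, is_store):
        block = addr // self.block_bytes      # drop the byte offset
        index = block % self.num_entries      # which entry to look in
        tag = block // self.num_entries       # identifies the block in that entry
        line = self.lines[index]
        hit = line.valid and line.tag == tag
        if not hit:
            # Evict the old block; a dirty block must be written back first.
            if line.valid and line.dirty:
                victim_block = line.tag * self.num_entries + index
                self.writebacks.append(victim_block * self.block_bytes)
            # Write-allocate (assumed): bring the new block into the cache.
            line.valid, line.dirty, line.tag = True, False, tag
        if is_store:
            line.dirty = True                 # cache copy is now newer than memory
        return "hit" if hit else "miss"

cache = DirectMappedCache()
trace = [(24, False), (28, False), (60, True), (188, True)]  # lw, lw, sw, sw
results = [cache.access(addr, is_store) for addr, is_store in trace]
print(results, cache.writebacks)
```

All four accesses miss, and the only write-back is the block at #60, triggered when the store to #188 evicts it from entry 7.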
52
52 DM$: Thoughts
Trade-offs:
- Write-back or write-through?
- Write-allocate or no-write-allocate?
How does the tag change with the number of entries?
How does the minimum machine word size impact the tag?
What kind of locality are we taking advantage of?
53
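53 Direct Mapped Cache
Taking advantage of spatial locality:
[Figure: direct-mapped cache with 4K entries, each holding V, a 16-bit Tag and 128 bits of Data. Address (showing bit positions): Tag = bits 31-16 (16 bits), Index = bits 15-4 (12 bits), Block offset = bits 3-2 (2 bits), Byte offset = bits 1-0. A mux selects one of the four 32-bit words; outputs are Hit and Data (32 bits).]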
54
54 Hits vs. Misses
Read hits:
- this is what we want!
Read misses:
- stall the CPU, fetch the block from memory, deliver it to the cache, restart
Write hits:
- can replace data in cache and memory (write-through)
- write the data only into the cache (write back to memory later)
Write misses:
- read the entire block into the cache, then write the word… ?
55
55 Hardware Issues
Make reading multiple words easier by using banks of memory
It can get a lot more complicated...
56
56 Performance
Increasing the block size tends to decrease miss rate:
[Figure: miss rate (0% to 40%) vs. block size (4, 16, 64, 256 bytes), one curve per cache size: 1 KB, 8 KB, 16 KB, 64 KB, 256 KB]
Use split caches because there is more spatial locality in code.
57
57 Performance
Simplified model:
  execution time = (execution cycles + stall cycles) × cycle time
  stall cycles = # of instructions × miss ratio × miss penalty
Two ways of improving performance:
- decreasing the miss ratio
- decreasing the miss penalty
What happens if we increase block size?
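The simplified model translates directly into code. The sketch below just restates the two formulas; the workload numbers in the example call (one million instructions at one execution cycle each, 2% miss ratio, 100-cycle penalty, 1 ns cycle time) are assumptions chosen for illustration.

```python
def stall_cycles(instructions, miss_ratio, miss_penalty_cycles):
    """stall cycles = # of instructions x miss ratio x miss penalty"""
    return instructions * miss_ratio * miss_penalty_cycles

def execution_time_s(execution_cycles, instructions, miss_ratio,
                     miss_penalty_cycles, cycle_time_s):
    """execution time = (execution cycles + stall cycles) x cycle time"""
    stalls = stall_cycles(instructions, miss_ratio, miss_penalty_cycles)
    return (execution_cycles + stalls) * cycle_time_s

# Assumed workload: 1M instructions, 1 cycle each, 2% misses, 100-cycle penalty.
t = execution_time_s(execution_cycles=1_000_000, instructions=1_000_000,
                     miss_ratio=0.02, miss_penalty_cycles=100,
                     cycle_time_s=1e-9)
print(f"{t * 1e3:.1f} ms")
```

In this example the stall cycles (2 million) outweigh the execution cycles (1 million), which is why lowering either the miss ratio or the miss penalty pays off so directly.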
58
58 Decreasing miss ratio with associativity
Compared to direct mapped, give a series of references that:
- results in a lower miss ratio using a 2-way set associative cache
- results in a higher miss ratio using a 2-way set associative cache
(assuming a “least recently used” replacement strategy)
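One way to explore this question is to count misses for the same reference series under both organizations. The sketch below is a minimal model, assuming a 4-block cache, references given as block addresses, and LRU replacement within each set; the two series are one possible answer chosen for illustration, not the only one.

```python
def count_misses(block_refs, num_blocks, ways):
    """Count misses for a set-associative cache with LRU replacement.

    ways=1 gives a direct-mapped cache with the same total capacity.
    """
    num_sets = num_blocks // ways
    sets = [[] for _ in range(num_sets)]   # each set lists blocks, MRU first
    misses = 0
    for block in block_refs:
        lines = sets[block % num_sets]
        if block in lines:
            lines.remove(block)            # hit: will be re-inserted as MRU
        else:
            misses += 1
            if len(lines) == ways:
                lines.pop()                # set full: evict the LRU block
        lines.insert(0, block)
    return misses

better_with_2way = [0, 4, 0, 4]            # blocks 0 and 4 collide when direct mapped
worse_with_2way = [0, 2, 4, 0, 2, 4]       # three blocks cycle through one 2-way set

for refs in (better_with_2way, worse_with_2way):
    print(refs, "DM:", count_misses(refs, 4, 1), "2-way:", count_misses(refs, 4, 2))
```

The first series misses 4 times direct mapped but only 2 times with 2-way associativity; the second misses 5 times direct mapped and 6 times 2-way, because LRU keeps evicting the block that is needed next.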
59
59 An implementation
60
60 Set-Associative Cache
Multiple cache blocks (lines) can be allocated into the same set
When the set is full, some block must be evicted from the cache
- Need to consider locality
Replacement policy:
- Last-In First-Out (LIFO), like a stack
- Random
- First-In First-Out (FIFO)
- Least Recently Used (LRU)
61
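61 Least Recently Used (LRU)
Order of the blocks in one set, from most recently used (MRU) to least recently used (LRU):
             MRU  MRU-1  LRU+1  LRU
  Initial     A     B      C     D
  Access C    C     A      B     D
  Access D    D     C      A     B
  Access E    E     D      C     A
  Access C    C     E      D     A
  Access G    G     C      E     D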
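The LRU ordering on this slide can be reproduced with a short list-based sketch. It models a single 4-entry set kept in MRU-first order; `lru_access` is a hypothetical helper, and a real cache would track recency with a few bits per line rather than a Python list.

```python
def lru_access(order, item, capacity=4):
    """Return (new MRU-first order, evicted item or None) after accessing item."""
    order = list(order)
    evicted = None
    if item in order:
        order.remove(item)        # hit: pull the item out of its current position
    elif len(order) >= capacity:
        evicted = order.pop()     # miss in a full set: drop the LRU entry
    order.insert(0, item)         # the accessed item becomes the MRU entry
    return order, evicted

state = list("ABCD")              # MRU ... LRU, as on the slide
history = []
for ref in "CDECG":
    state, evicted = lru_access(state, ref)
    history.append(("".join(state), evicted))
    print(f"Access {ref}: {''.join(state)} evicted={evicted}")
```

The accesses to E and G are the two misses: E evicts B and G evicts A, each being the least recently used block at that moment.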
62
62 Performance
63
63 Decreasing miss penalty with multilevel caches
Add a second level cache:
- often the primary cache is on the same chip as the processor
- use SRAMs to add another cache above primary memory (DRAM)
- the miss penalty goes down if the data is in the 2nd level cache
Example:
- CPI of 1.0 on a 5GHz machine when there are no cache misses
- The same machine with a 1st level cache (L1) with a 2% miss rate per instruction and a 100ns DRAM access: what is the CPI?
- Adding a 2nd level cache with a 5ns access time (including L1 access time) decreases the miss rate per instruction to 0.5%: what is the speedup over the machine with only L1?
Using multilevel caches:
- try to optimize the hit time on the 1st level cache
- try to optimize the miss rate on the 2nd level cache
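The example can be worked through numerically. The sketch below is one reading of the problem, with two stated assumptions: every L1 miss pays the 5 ns L2 access, and the 0.5% figure is the fraction of instructions that still go all the way to DRAM. The constant names are invented for illustration.

```python
CLOCK_GHZ = 5.0               # 5 GHz, i.e. 0.2 ns per cycle
BASE_CPI = 1.0                # CPI with no cache misses
DRAM_NS = 100.0               # main memory access time
L2_NS = 5.0                   # L2 access time (including L1 access time)
L1_MISS_RATE = 0.02           # misses per instruction with only L1
L2_GLOBAL_MISS_RATE = 0.005   # assumed: instructions that still reach DRAM

def ns_to_cycles(ns):
    # cycles = time in ns x clock frequency in GHz
    return ns * CLOCK_GHZ

# Only L1: every miss pays the full DRAM penalty (500 cycles).
cpi_l1_only = BASE_CPI + L1_MISS_RATE * ns_to_cycles(DRAM_NS)

# With L2: every L1 miss pays the L2 access (25 cycles); the remaining
# 0.5% of instructions additionally pay the DRAM penalty.
cpi_with_l2 = (BASE_CPI
               + L1_MISS_RATE * ns_to_cycles(L2_NS)
               + L2_GLOBAL_MISS_RATE * ns_to_cycles(DRAM_NS))

speedup = cpi_l1_only / cpi_with_l2
print(f"CPI L1 only: {cpi_l1_only:.2f}, with L2: {cpi_with_l2:.2f}, speedup: {speedup:.2f}")
```

Under these assumptions the CPI drops from 11 to 4, a speedup of 2.75; a different interpretation of the 0.5% miss rate would change the second figure.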
64
64 Virtual Memory
Main memory can act as a cache for the secondary storage (disk)
Advantages:
- illusion of having more physical memory
- program relocation
- protection
65
65 Pages: virtual memory blocks
Page faults: the data is not in memory, retrieve it from disk
- huge miss penalty, thus pages should be fairly large (e.g., 4KB)
- reducing page faults is important (LRU is worth the price)
- can handle the faults in software instead of hardware
- using write-through is too expensive, so we use write-back
[Figure: address translation. Virtual address: virtual page number = bits 31-12, page offset = bits 11-0. Translation maps the virtual page number to a physical page number. Physical address: physical page number = bits 29-12, page offset = bits 11-0.]
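The translation in the figure (virtual page number replaced, page offset kept) can be sketched as follows. This assumes 4 KB pages (12 offset bits) as on the slide; the dictionary-based page table, its entry layout and the `PageFault` exception are simplifications invented for this example, not how an operating system actually stores page tables.

```python
PAGE_OFFSET_BITS = 12                 # 4 KB pages
PAGE_SIZE = 1 << PAGE_OFFSET_BITS

class PageFault(Exception):
    """Raised when the page is not in physical memory (must be fetched from disk)."""

def translate(vaddr, page_table):
    vpn = vaddr >> PAGE_OFFSET_BITS           # virtual page number (upper bits)
    offset = vaddr & (PAGE_SIZE - 1)          # page offset (lower 12 bits)
    entry = page_table.get(vpn)
    if entry is None or not entry["valid"]:
        raise PageFault(f"virtual page {vpn:#x} is not resident")
    # The offset passes through unchanged; only the page number is replaced.
    return (entry["ppn"] << PAGE_OFFSET_BITS) | offset

# Hypothetical page table: one resident page, one page that lives on disk.
page_table = {
    0x12345: {"valid": True, "ppn": 0x00042},
    0x00001: {"valid": False, "ppn": 0},
}
print(hex(translate(0x12345ABC, page_table)))
```

Here virtual address `0x12345ABC` maps to physical address `0x42abc`, while any access to virtual page `0x1` raises `PageFault`, the software-handled case the slide mentions.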
66
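66 Page Tables
[Figure: the page table is indexed by virtual page number; each entry holds a Valid bit and a physical page or disk address. Entries with Valid = 1 point into physical memory; entries with Valid = 0 point into disk storage. Valid bits shown: 1 1 1 1 0 1 1 0 1 1 0 1]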
67
67 Page Tables
68
68 Making Address Translation Fast
A cache for address translations: the translation lookaside buffer (TLB)
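As a rough illustration of a TLB acting as a cache for translations, here is a toy model in Python. It assumes a fully associative TLB with LRU replacement and a flat dictionary standing in for the page table; real TLBs differ in organization and replacement, and the `TLB` class and its fields are hypothetical.

```python
from collections import OrderedDict

class TLB:
    def __init__(self, capacity=4):
        self.capacity = capacity
        self.entries = OrderedDict()   # vpn -> ppn, least recently used first
        self.hits = 0
        self.misses = 0

    def lookup(self, vpn, page_table):
        if vpn in self.entries:
            self.hits += 1
            self.entries.move_to_end(vpn)      # mark as most recently used
            return self.entries[vpn]
        self.misses += 1
        ppn = page_table[vpn]                  # slow path: consult the page table
        if len(self.entries) >= self.capacity:
            self.entries.popitem(last=False)   # evict the least recently used entry
        self.entries[vpn] = ppn
        return ppn

# Hypothetical page table mapping virtual page v to physical page v + 100.
page_table = {v: v + 100 for v in range(8)}
tlb = TLB(capacity=2)
ppns = [tlb.lookup(vpn, page_table) for vpn in (1, 2, 1, 3, 2)]
print(ppns, "hits:", tlb.hits, "misses:", tlb.misses)
```

Only the repeated access to page 1 hits; with just two entries, page 2 is evicted before it is referenced again, so the same locality arguments made for data caches apply to translations.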
69
69 TLBs and caches
70
70 Modern Systems Very complicated memory systems:
71
71 Some Issues
Processor speeds continue to increase very fast, much faster than either DRAM or disk access times
Design challenge: dealing with this growing disparity
Trends:
- synchronous SRAMs (provide a burst of data)
- redesign DRAM chips to provide higher bandwidth or processing
- restructure code to increase locality
- use prefetching (make cache visible to ISA)