1
Lecture 11 Cache and Virtual Memory
Peng Liu
2
Associative Cache Example
3
Tag & Index with Set-Associative Caches
Assume a 2^n-byte cache with 2^m-byte blocks that is 2^a-way set-associative. Which bits of the address are the tag and which are the index?
- The m least significant bits are the byte select within the block.
- Basic idea:
  - The cache contains 2^n / 2^m = 2^(n-m) blocks.
  - Each cache way contains 2^(n-m) / 2^a = 2^(n-m-a) blocks.
  - Cache index: the (n-m-a) bits after the byte select; the same index is used with all cache ways.
- Observation: for a fixed cache size, the length of the tags increases with the associativity, so associative caches incur more tag overhead.
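The bit slicing above can be made concrete with a small sketch. The parameters below (a 32 KB cache, 64 B blocks, 4-way set-associative) are illustrative only, not tied to any slide example:

```c
#include <stdint.h>
#include <stdio.h>

/* Illustrative parameters: 32 KB cache (n = 15), 64-byte blocks (m = 6),
 * 4-way set-associative (a = 2).  Index width = n - m - a = 7 bits. */
#define N_BITS 15  /* log2(cache size in bytes) */
#define M_BITS 6   /* log2(block size in bytes) */
#define A_BITS 2   /* log2(associativity)       */

#define INDEX_BITS (N_BITS - M_BITS - A_BITS)

static uint32_t block_offset(uint32_t addr) { return addr & ((1u << M_BITS) - 1); }
static uint32_t set_index(uint32_t addr)    { return (addr >> M_BITS) & ((1u << INDEX_BITS) - 1); }
static uint32_t addr_tag(uint32_t addr)     { return addr >> (M_BITS + INDEX_BITS); }

int main(void) {
    uint32_t addr = 0x12345678;
    printf("offset=%u index=%u tag=0x%x\n",
           block_offset(addr), set_index(addr), addr_tag(addr));
    return 0;
}
```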
4
Placement Policy
Where can memory block 12 be placed in an 8-block cache?
- Fully associative: anywhere
- (2-way) set-associative: anywhere in set 0 (12 mod 4)
- Direct mapped: only into block 4 (12 mod 8)
The simplest scheme extracts bits from the block number to determine the set. More sophisticated schemes hash the block number: why could that be good or bad?
5
Direct-Mapped Cache
[Figure: the address is split into Tag (t bits), Index (k bits), and Offset (b bits). The index selects one of 2^k lines; the stored tag is compared with the address tag, and a match (with the valid bit set) signals a HIT. The offset selects the data word or byte within the block.]
6
2-Way Set-Associative Cache
[Figure: the index selects one set; the tags of both ways are compared in parallel against the address tag, the two hit signals are combined to produce HIT, and the matching way supplies the data word or byte selected by the block offset.]
Compare the latency to the direct-mapped case?
7
Fully Associative Cache
[Figure: there is no index; the address tag is compared against the tags of all cache lines in parallel, any match signals a HIT, and the block offset selects the data word or byte from the matching line.]
8
Replacement Methods
Which line do you replace on a miss?
- Direct mapped: easy, there is only one choice; replace the line at the index you need.
- N-way set-associative: need to choose which way to replace.
  - Random: choose one at random.
  - Least Recently Used (LRU): replace the way used least recently. Exact LRU is often difficult to track, so approximations are used; the victims are often merely "not recently used".
9
Replacement Policy
In an associative cache, which block from a set should be evicted when the set becomes full?
- Random
- Least Recently Used (LRU)
  - LRU cache state must be updated on every access
  - A true implementation is only feasible for small sets (2-way); a pseudo-LRU binary tree is often used for 4- to 8-way caches (see the sketch below)
- First In, First Out (FIFO), a.k.a. Round-Robin
  - Used in highly associative caches
- Not Least Recently Used (NLRU)
  - FIFO with an exception for the most recently used block or blocks
  - Used in Alpha TLBs
Replacement policy is a second-order effect. Why? Replacement only happens on misses.
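The tree pseudo-LRU mentioned above can be sketched for a 4-way set with three bits per set. The bit layout and function names here are illustrative, not any particular processor's scheme:

```c
#include <stdint.h>

/* Tree pseudo-LRU for a 4-way set: b0 picks which half holds the victim,
 * b1/b2 pick the victim within each half (0 = lower-numbered way). */
typedef struct { uint8_t b0, b1, b2; } Plru4;

/* On an access, flip the tree bits to point *away* from the way just used. */
void plru_touch(Plru4 *t, int way) {
    t->b0 = (way < 2) ? 1 : 0;               /* victim search goes to other half */
    if (way < 2) t->b1 = (way == 0) ? 1 : 0; /* within left half                  */
    else         t->b2 = (way == 2) ? 1 : 0; /* within right half                 */
}

/* Follow the tree bits to pick an approximately least-recently-used victim. */
int plru_victim(const Plru4 *t) {
    if (t->b0 == 0) return (t->b1 == 0) ? 0 : 1;
    else            return (t->b2 == 0) ? 2 : 3;
}
```

For example, after touching ways 0, 1, 2, 3 in that order, plru_victim() returns way 0, which is indeed the least recently used way.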
10
Block Size and Spatial Locality
A block (a.k.a. line) is the unit of transfer between the cache and memory.
[Figure: a 4-word block example; the CPU address is split into a block address (32-b bits) and a block offset (b bits), where 2^b is the block (line) size in bytes.]
- Larger block sizes have distinct hardware advantages: less tag overhead, and they exploit fast burst transfers from DRAM and over wide busses.
- What are the disadvantages of increasing block size?
  - Larger blocks reduce compulsory misses (the first miss to a block).
  - Larger blocks may increase conflict misses, since the number of blocks is smaller (fewer blocks means more conflicts).
  - Larger blocks can waste bandwidth.
11
CPU-Cache Interaction (5-stage pipeline)
[Figure: 5-stage pipeline (fetch, decode/register fetch, ALU, memory, writeback) with a primary instruction cache and a primary data cache. The entire CPU stalls on a data cache miss while the cache is refilled with data from lower levels of the memory hierarchy.]
12
Improving Cache Performance
Average memory access time = Hit time + Miss rate x Miss penalty
To improve performance:
- reduce the hit time
- reduce the miss rate
- reduce the miss penalty
What is the simplest design strategy? Design the largest primary cache that does not slow down the clock or add pipeline stages: the biggest cache that keeps hit time to 1-2 cycles (roughly 8-32KB in modern technology). [Design issues are more complex with out-of-order superscalar processors.]
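A quick illustration of the AMAT formula, with made-up numbers (1-cycle hit time, 5% miss rate, 20-cycle miss penalty):

```c
#include <stdio.h>

/* AMAT = hit time + miss rate * miss penalty.
 * The numbers below are illustrative, not from the slides. */
int main(void) {
    double hit_time     = 1.0;   /* cycles                                 */
    double miss_rate    = 0.05;  /* 5% of accesses miss                    */
    double miss_penalty = 20.0;  /* cycles to fetch from the next level    */

    double amat = hit_time + miss_rate * miss_penalty;
    printf("AMAT = %.2f cycles\n", amat);   /* 1 + 0.05*20 = 2.00 */
    return 0;
}
```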
13
Serial-versus-Parallel Cache and Memory Access
Let a be the HIT RATIO (fraction of references that hit in the cache) and 1 - a the MISS RATIO (remaining references).
- Serial search (access memory only after a cache miss): average access time = t_cache + (1 - a) * t_mem
- Parallel search (access cache and memory simultaneously): average access time = a * t_cache + (1 - a) * t_mem
The savings are usually small, since t_mem >> t_cache and the hit ratio a is high; parallel search requires high bandwidth on the memory path, and the complexity of handling parallel paths can slow t_cache.
14
Causes for Cache Misses
- Compulsory: first reference to a block (a.k.a. cold-start misses); misses that would occur even with an infinite cache.
- Capacity: the cache is too small to hold all the data needed by the program; misses that would occur even under a perfect replacement policy.
- Conflict: misses that occur because of collisions due to the block-placement strategy; misses that would not occur with full associativity.
15
Effect of Cache Parameters on Performance
- Larger cache size: reduces capacity and conflict misses, but hit time will increase.
- Higher associativity: reduces conflict misses, but may increase hit time.
- Larger block size: spatial locality reduces compulsory misses and capacity (reload) misses, but fewer blocks may increase the conflict miss rate, and larger blocks may increase the miss penalty (delivering the requested word first helps).
16
Multilevel Caches
A memory cannot be both large and fast, so cache sizes increase at each level: CPU, L1$, L2$, DRAM.
- Local miss rate = misses in this cache / accesses to this cache
- Global miss rate = misses in this cache / CPU memory accesses
- Misses per instruction (MPI) = misses in this cache / number of instructions
MPI makes it easier to compute overall performance.
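A small example of the local-vs-global distinction, using made-up counts:

```c
#include <stdio.h>

/* Local vs. global miss rates for an L2 cache; the counts are illustrative. */
int main(void) {
    double cpu_accesses = 1000.0;  /* memory accesses issued by the CPU     */
    double l1_misses    = 40.0;    /* these accesses continue on to L2      */
    double l2_misses    = 10.0;

    double l1_miss_rate        = l1_misses / cpu_accesses;  /* 4%  */
    double l2_local_miss_rate  = l2_misses / l1_misses;     /* 25% */
    double l2_global_miss_rate = l2_misses / cpu_accesses;  /* 1%  */

    printf("L1 miss rate:        %.1f%%\n", 100 * l1_miss_rate);
    printf("L2 local miss rate:  %.1f%%\n", 100 * l2_local_miss_rate);
    printf("L2 global miss rate: %.1f%%\n", 100 * l2_global_miss_rate);
    return 0;
}
```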
17
Multilevel Caches
- Primary (L1) caches are attached to the CPU: small but fast, focusing on hit time rather than hit rate.
- The level-2 cache services misses from the primary cache: larger and slower, but still faster than main memory; usually unified for instructions and data; focusing on hit rate rather than hit time.
- Main memory services L2 cache misses.
- Some high-end systems include an L3 cache.
18
A Typical Memory Hierarchy
[Figure: CPU with a multiported register file, split L1 instruction and data caches (on-chip SRAM), a large unified L2 cache (on-chip SRAM), and multiple interleaved memory banks (off-chip DRAM).]
The implementation close to the CPU looks like a Harvard machine.
19
Itanium-2 On-Chip Caches (Intel/HP, 2002)
- Level 1: 16KB, 4-way set-associative, 64B lines, quad-ported (2 load + 2 store), single-cycle latency
- Level 2: 256KB, 4-way set-associative, 128B lines, quad-ported (4 load or 4 store), five-cycle latency
- Level 3: 3MB, 12-way set-associative, 128B lines, single 32B port, twelve-cycle latency
If two is good, then three must be better.
20
What About Writes?
Where do we put the data we want to write? In the cache? In main memory? In both?
- Caches have different policies for this question.
- Most systems store the data in the cache (why?).
- Some also store the data in memory as well (why?).
Interesting observation: the processor does not need to "wait" until the store completes.
21
Cache Write Policies: Major Options
- Write-through (write data go to both the cache and memory)
  - Main memory is updated on each cache write.
  - Replacing a cache entry is simple (just overwrite it with the new block).
  - Memory writes cause significant delay if the pipeline must stall.
- Write-back (write data go only to the cache)
  - Only the cache entry is updated on each cache write, so main memory and the cache become inconsistent.
  - A "dirty" bit per cache entry indicates whether the data in the entry must still be committed to memory.
  - Replacing a cache entry requires writing its data back to memory before replacement if it is dirty.
22
Write Policy Trade-offs
- Write-through
  - Misses are simpler and cheaper (no write-back to memory).
  - Easier to implement, but requires buffering to be practical.
  - Uses a lot of bandwidth to the next level of memory.
- Write-back
  - Writes are fast on a hit.
  - Multiple writes within a block require only one writeback later.
  - Efficient block transfer on write-back to memory at eviction.
23
Write Policy Choices
- Cache hit:
  - write-through: write both cache and memory; generally higher traffic, but simplifies cache coherence.
  - write-back: write the cache only (memory is written only when the entry is evicted); a dirty bit per block can further reduce the traffic.
- Cache miss:
  - no-write-allocate: only write to main memory.
  - write-allocate (a.k.a. fetch-on-write): fetch the block into the cache.
- Common combinations (sketched in code below):
  - write-through and no-write-allocate
  - write-back with write-allocate
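A rough sketch of the two common combinations. The helpers (cache_lookup, cache_fill, memory_write_word, line_write_word) are hypothetical placeholders, not a real API:

```c
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    bool     valid;
    bool     dirty;
    uint32_t tag;
    uint8_t  data[64];
} CacheLine;

extern CacheLine *cache_lookup(uint32_t addr);   /* NULL on miss              */
extern CacheLine *cache_fill(uint32_t addr);     /* evict victim, refill line */
extern void       memory_write_word(uint32_t addr, uint32_t w);
extern void       line_write_word(CacheLine *l, uint32_t addr, uint32_t w);

/* Write-through + no-write-allocate */
void store_write_through(uint32_t addr, uint32_t w) {
    CacheLine *line = cache_lookup(addr);
    if (line)                        /* hit: update the cached copy ...     */
        line_write_word(line, addr, w);
    memory_write_word(addr, w);      /* ... and always update memory        */
}

/* Write-back + write-allocate */
void store_write_back(uint32_t addr, uint32_t w) {
    CacheLine *line = cache_lookup(addr);
    if (!line)                       /* miss: bring the block into the cache */
        line = cache_fill(addr);     /* (may write back a dirty victim)      */
    line_write_word(line, addr, w);  /* update only the cache ...            */
    line->dirty = true;              /* ... and mark it for later writeback  */
}
```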
24
Write Buffer to Reduce Read Miss Penalty
[Figure: CPU with register file, a data cache, a write buffer, and a unified L2 cache. The write buffer holds evicted dirty lines for a write-back cache, or all writes for a write-through cache.]
- The processor is not stalled on writes, and read misses can go ahead of writes to main memory.
- Problem: the write buffer may hold the updated value of a location needed by a read miss.
  - Simple scheme: on a read miss, wait for the write buffer to go empty.
  - Faster scheme: check write buffer addresses against the read miss address; if there is no match, allow the read miss to go ahead of the writes, else return the value in the write buffer.
- Designers of the MIPS M/1000 estimated that waiting for a four-word buffer to empty increased the read miss penalty by a factor of 1.5.
25
Write Buffers for Write-Through Caches
[Figure: Processor, Cache and Write Buffer, Lower Level Memory. The write buffer holds data awaiting write-through to lower-level memory.]
- Q. Why a write buffer? A. So the CPU doesn't stall.
- Q. Why a buffer, why not just one register? A. Bursts of writes are common.
- Q. Are Read After Write (RAW) hazards an issue for the write buffer? A. Yes! Drain the buffer before the next read, or check the write buffer addresses.
26
Avoiding the Stalls for Write-Through
- Use a write buffer between the cache and memory.
  - The processor writes data into the cache and the write buffer.
  - The memory controller slowly "drains" the buffer to memory.
- The write buffer is a first-in, first-out (FIFO) queue that typically holds a small number of writes.
- It can absorb small bursts as long as the long-term rate of writes into the buffer does not exceed the maximum rate of writing to DRAM.
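A minimal sketch of such a FIFO write buffer, including the address check used by the faster read-miss scheme from the earlier slide. Sizes and names are made up:

```c
#include <stdbool.h>
#include <stdint.h>

#define WB_ENTRIES 4

typedef struct { uint32_t addr, data; } WbEntry;

static WbEntry wb[WB_ENTRIES];
static int wb_head = 0, wb_count = 0;

bool wb_full(void) { return wb_count == WB_ENTRIES; }

/* Processor side: enqueue a write; the caller stalls only if wb_full(). */
void wb_push(uint32_t addr, uint32_t data) {
    int tail = (wb_head + wb_count) % WB_ENTRIES;
    wb[tail] = (WbEntry){ addr, data };
    wb_count++;
}

/* Memory-controller side: drain one entry per free memory cycle. */
bool wb_drain_one(WbEntry *out) {
    if (wb_count == 0) return false;
    *out = wb[wb_head];
    wb_head = (wb_head + 1) % WB_ENTRIES;
    wb_count--;
    return true;
}

/* RAW check: does a read miss hit a buffered write?  Scan newest-first so
 * the most recent value for the address is returned. */
bool wb_match(uint32_t addr, uint32_t *data) {
    for (int i = wb_count - 1; i >= 0; i--) {
        int idx = (wb_head + i) % WB_ENTRIES;
        if (wb[idx].addr == addr) { *data = wb[idx].data; return true; }
    }
    return false;
}
```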
27
Cache Write Policy: Allocation Options
What happens on a cache write that misses? It is actually two subquestions:
- Do you allocate space in the cache for the address? (write-allocate vs. no-write-allocate)
  - Actions: select a cache entry, evict its old contents, update the tags.
- Do you fetch the rest of the block contents from memory? (fetch-on-miss vs. no-fetch-on-miss)
  - Of interest only if you write-allocate; remember that a store updates at most one word of a wider block.
  - For no-fetch-on-miss, you must remember which words are valid: use fine-grain valid bits in each cache line.
28
Typical Choices
- Write-back caches: write-allocate, fetch-on-miss.
- Write-through caches: either write-allocate with no-fetch-on-miss, or no-write-allocate (write-around).
- Modern hardware supports multiple policies, selected by the OS at some coarse granularity.
Which program patterns match each policy?
29
Write Miss Actions
[Table: for each combination of write policy (write-through vs. write-back) and allocation policy (write-allocate with fetch-on-miss or no-fetch-on-miss; no-write-allocate as write-around or write-invalidate), and for a hit, which of these steps are performed: 1. pick replacement, 2. invalidate tag, 3. fetch block, 4. write cache / write partial cache, 5. write memory.]
30
Be Careful, Even with Write Hits
- Reading from a cache: read tags and data in parallel; if it hits, return the data, else go to the lower level.
- Writing a cache can take more time: first read the tag to determine hit or miss, then overwrite the data on a hit. Otherwise, you may overwrite dirty data or write into the wrong cache way.
- Can you ever access the tag and write the data in parallel?
31
Splitting Caches
Most processors have separate caches for instructions and data, often denoted I$ and D$.
- Advantages: an extra access port, the ability to customize each cache to its specific access patterns, low hit time.
- Disadvantages: capacity utilization, miss rate.
32
Write Policy: Write-Through vs Write-Back
- Write-through: data written to the cache block is also written to lower-level memory.
- Write-back: data is written only to the cache; the lower level is updated when the block falls out of the cache.
Comparison (write-through / write-back):
- Debugging: easy / hard
- Do read misses produce writes? no / yes
- Do repeated writes make it to the lower level? yes / no
Additional option: let writes to an un-cached address allocate a new cache line ("write-allocate").
33
Cache Design: Datapath + Control
Most design errors come from incorrect specification of the state machine behavior! Common bugs: stalls, block replacement, the write buffer.
[Figure: cache datapath (tags, data blocks, address and data paths to the CPU and to lower-level memory) governed by a control state machine.]
34
Cache Controller
Example cache characteristics:
- Direct-mapped, write-back, write-allocate
- Block size: 4 words (16 bytes)
- Cache size: 16KB (1024 blocks)
- 32-bit byte addresses
- Valid bit and dirty bit per block
- Blocking cache: the CPU waits until the access is complete
35
Signals between the Processor and the Cache
36
Finite State Machines
Use an FSM to sequence the control steps:
- A set of states, with a transition on each clock edge
- State values are binary encoded
- The current state is stored in a register
- Next state = fn(current state, current inputs)
- Control output signals = fo(current state)
37
Cache Controller FSM
- Idle state: waiting for a valid read or write request from the processor.
- Compare Tag state: test whether the access is a hit or a miss. On a hit, set Cache Ready after the read or write and return to Idle. On a miss, update the cache tag; if the victim block is dirty go to Write-Back, else go to Allocate.
- Write-Back state: write the 128-bit block to memory; when memory signals ready, go to Allocate.
- Allocate state: fetch the new block from memory; when it arrives, return to Compare Tag.
A state-machine sketch follows below.
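A compact sketch of the next-state logic for these four states. The input signal names (cpu_req, hit, dirty, mem_ready) are assumptions, not the controller's actual interface:

```c
#include <stdbool.h>

/* Next-state function for the four-state cache controller described above. */
typedef enum { IDLE, COMPARE_TAG, WRITE_BACK, ALLOCATE } CacheState;

CacheState next_state(CacheState s, bool cpu_req, bool hit,
                      bool dirty, bool mem_ready) {
    switch (s) {
    case IDLE:        return cpu_req ? COMPARE_TAG : IDLE;
    case COMPARE_TAG: if (hit) return IDLE;                  /* access complete      */
                      return dirty ? WRITE_BACK : ALLOCATE;  /* miss: evict / refill */
    case WRITE_BACK:  return mem_ready ? ALLOCATE : WRITE_BACK;
    case ALLOCATE:    return mem_ready ? COMPARE_TAG : ALLOCATE;
    }
    return IDLE;
}
```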
38
Main Memory Supporting Caches
- Use DRAMs for main memory: fixed width (e.g., 1 word), connected by a fixed-width clocked bus whose clock is typically slower than the CPU clock.
- Example cache block read: 1 bus cycle for the address transfer, 15 bus cycles per DRAM access, 1 bus cycle per data transfer.
- For a 4-word block and a 1-word-wide DRAM:
  - Miss penalty = 1 + 4x15 + 4x1 = 65 bus cycles
  - Bandwidth = 16 bytes / 65 cycles = ~0.25 bytes/cycle
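The same arithmetic, spelled out so the parameters can be varied (values come from the slide):

```c
#include <stdio.h>

/* Miss penalty for the 4-word block, 1-word-wide DRAM example above. */
int main(void) {
    int addr_cycles = 1, dram_cycles = 15, xfer_cycles = 1, words = 4;

    int miss_penalty = addr_cycles + words * dram_cycles + words * xfer_cycles; /* 65 */
    double bandwidth = (words * 4.0) / miss_penalty;                            /* ~0.25 B/cycle */

    printf("miss penalty = %d bus cycles, bandwidth = %.2f bytes/cycle\n",
           miss_penalty, bandwidth);
    return 0;
}
```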
39
Increasing Memory Bandwidth
40
DRAM Technology
41
Why DRAM over SRAM? Density!
- SRAM cell: large; 6 transistors (nFET and pFET); 3 interface wires; needs Vdd and Gnd.
- DRAM cell: small; 1 transistor + 1 capacitor (nFET only); 2 interface wires; no Vdd.
- Density advantage: 3X to 10X, depending on the metric.
42
DRAM: Reading, Writing, Refresh
- Writing DRAM: drive the data on the bit line and select the row.
- The capacitor holds its state for about 60 ms, so the cell must then be refreshed.
- Reading DRAM: select the row, sense the bit line (about 1 million electrons), then write the value back (reads are destructive).
- Refresh: a dummy read.
43
Synchronous DRAM (SDRAM) Interface
- A clocked bus protocol (e.g., 100 MHz).
- The cache controller puts commands on the bus; data comes out several cycles later (CAS = Column Address Strobe).
- Note: this example is a best case! For a random access, DRAM takes many more than 2 cycles.
44
Advanced DRAM Organization
- Bits in a DRAM are organized as a rectangular array; a DRAM access reads an entire row.
- Burst mode: supply successive words from a row with reduced latency.
- Double data rate (DDR) DRAM: transfer on both rising and falling clock edges.
- Quad data rate (QDR) DRAM: separate DDR inputs and outputs.
- DIMMs: small boards with multiple DRAM chips connected in parallel; they function as a higher-capacity, wider-interface DRAM chip and are easier to manipulate and replace.
45
Measuring Performance
46
Measuring Performance
- The memory system is important for performance: cache access time often determines the overall system clock cycle time, since it is often the slowest pipeline stage.
- Memory stalls are a large contributor to CPI: stalls due to instructions and data, reads and writes, including both cache miss stalls and write buffer stalls.
- Memory system and performance:
  - CPU Time = (CPU Cycles + Memory Stall Cycles) * Cycle Time
  - Memory Stall Cycles = Read Stall Cycles + Write Stall Cycles
  - CPI = CPI_pipe + AvgMemStallCycles
  - CPI_pipe = 1 + HazardStallCycles
47
Memory Performance
- Read stalls are fairly easy to understand:
  Read Stall Cycles = Reads/Program * Read Miss Rate * Read Miss Penalty
- Write stalls depend on the write policy:
  - Write-through: Write Stall Cycles = (Writes/Program * Write Miss Rate * Write Miss Penalty) + Write Buffer Stalls
  - Write-back: Write Stall Cycles = Writes/Program * Write Miss Rate * Write Miss Penalty
- The "write miss penalty" can be complex: it can be partially hidden if the processor can continue executing, and it can include extra time to write back a value being evicted.
48
Worst-Case Simplicity
- Assume that write and read misses cause the same delay.
- In a single-level cache system, MissPenalty is the latency of DRAM.
- In a multi-level cache system, MissPenalty is the latency of the L2 cache, etc.; calculate it from MissRate_L2, MissPenalty_L2, and so on.
- Watch out: global vs. local miss rate for L2.
49
Simple Cache Performance Example
Consider the following:
- Miss rate for instruction accesses is 5%
- Miss rate for data accesses is 8%
- Data references per instruction: 0.4
- CPI with a perfect cache is 2
- Read and write miss penalty is 20 cycles (including possible write buffer stalls)
What is the performance of this machine relative to one without misses?
Always start by considering execution times (IC * CPI * CCT); but IC and CCT are the same here, so focus on CPI.
50
Performance Solution
- Find the CPI for the base system without misses:
  CPI_no_misses = CPI_perfect = 2
- Find the CPI for the system with misses:
  Misses/inst = I-cache misses + D-cache misses = 0.05 + (0.08 * 0.4) = 0.082
  Memory Stall Cycles = Misses/inst * Miss Penalty = 0.082 * 20 = 1.64 cycles/inst
  CPI_with_misses = CPI_perfect + Memory Stall Cycles = 2 + 1.64 = 3.64
- Compare the performance.
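A quick check of this arithmetic, which also completes the comparison (3.64 / 2 = 1.82, so the machine with misses is about 1.82x slower):

```c
#include <stdio.h>

/* Reproduces the example above; all values come from the slide. */
int main(void) {
    double cpi_perfect     = 2.0;
    double misses_per_inst = 0.05 + 0.08 * 0.4;        /* I$ + D$ = 0.082 */
    double stall_cycles    = misses_per_inst * 20.0;   /* 1.64            */
    double cpi_real        = cpi_perfect + stall_cycles;

    printf("CPI with misses = %.2f, slowdown = %.2fx\n",
           cpi_real, cpi_real / cpi_perfect);           /* 3.64, 1.82x */
    return 0;
}
```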
51
Another Cache Problem
Given the following data:
- Base CPI of 1.5
- 1 instruction reference per instruction fetch
- 0.27 loads/instruction, 0.13 stores/instruction
- A 64KB set-associative cache with a 4-word block size has a miss rate of 1.7%
- Memory access time = 4 cycles + number of words transferred
Suppose the cache uses a write-through, write-around write strategy without a write buffer. How much faster would the machine be with a perfect write buffer?
CPU time = Instruction Count * (CPI_base + CPI_memory) * Clock Cycle Time, so performance is proportional to CPI = CPI_base + CPI_memory.
52
No Write Buffer
[Figure: CPU, Cache, Lower Level Memory, with no write buffer.]
CPI_memory = reads/inst * miss rate * read miss penalty + writes/inst * write penalty
- read miss penalty = 4 cycles + 4 words * 1 cycle/word = 8 cycles
- write penalty = 4 cycles + 1 word * 1 cycle/word = 5 cycles
CPI_memory = (1 inst fetch + 0.27 ld)/inst * 0.017 * 8 cycles + (0.13 st)/inst * 5 cycles
           = 0.17 cycles/inst + 0.65 cycles/inst = 0.82 cycles/inst
CPI_overall = 1.5 cycles/inst + 0.82 cycles/inst = 2.32 cycles/inst
53
Perfect Write Buffer
[Figure: CPU, Cache with a write buffer (WBuff), Lower Level Memory.]
CPI_memory = reads/inst * miss rate * 8-cycle read miss penalty + writes/inst * (1 - miss rate) * 1-cycle hit penalty
A hit penalty is required because on write hits we must:
- Access the cache tags during the MEM cycle to determine a hit
- Stall the processor for a cycle to update the hit cache block
CPI_memory = 0.17 cycles/inst + (0.13 st)/inst * (1 - 0.017) * 1 cycle
           = 0.17 cycles/inst + 0.13 cycles/inst = 0.30 cycles/inst
CPI_overall = 1.5 cycles/inst + 0.30 cycles/inst = 1.80 cycles/inst
54
Perfect Write Buffer + Cache Write Buffer
[Figure: CPU, Cache with a one-entry cache write buffer (CWB) and a write buffer (WBuff), Lower Level Memory.]
CPI_memory = reads/inst * miss rate * 8-cycle read miss penalty
Avoid the hit penalty on writes by:
- Adding a one-entry write buffer to the cache itself
- Writing the last store hit into the data array during the next store's MEM cycle
- Hazard: on loads, the CWB must be checked along with the cache!
CPI_memory = 0.17 cycles/inst
CPI_overall = 1.5 cycles/inst + 0.17 cycles/inst = 1.67 cycles/inst
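The arithmetic of the three write-buffer scenarios, collected in one place (all values from the preceding slides):

```c
#include <stdio.h>

/* Reproduces the three CPI_memory computations from the slides above. */
int main(void) {
    double reads_per_inst  = 1.0 + 0.27;  /* instruction fetch + loads  */
    double writes_per_inst = 0.13;
    double miss_rate       = 0.017;
    double read_miss_pen   = 8.0;         /* 4 + 4 words * 1 cycle/word */
    double write_pen       = 5.0;         /* 4 + 1 word  * 1 cycle/word */

    double no_buffer = reads_per_inst * miss_rate * read_miss_pen
                     + writes_per_inst * write_pen;                    /* ~0.82 */
    double perfect   = reads_per_inst * miss_rate * read_miss_pen
                     + writes_per_inst * (1 - miss_rate) * 1.0;        /* ~0.30 */
    double cache_wb  = reads_per_inst * miss_rate * read_miss_pen;     /* ~0.17 */

    printf("CPI_memory:  no buffer %.2f, perfect buffer %.2f, +cache write buffer %.2f\n",
           no_buffer, perfect, cache_wb);
    printf("CPI_overall: %.2f / %.2f / %.2f\n",
           1.5 + no_buffer, 1.5 + perfect, 1.5 + cache_wb);
    return 0;
}
```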
55
Virtual Memory
56
Motivation #1: Large Address Space for Each Executing Program
- Each program thinks it has a ~2^32-byte address space of its own (though it may not use it all).
- Available main memory may be much smaller.
57
Motivation #2: Memory Management for Multiple Programs
- At any point in time, a computer may be running multiple programs, e.g., Firefox + Thunderbird.
- Questions:
  - How do we share memory between multiple programs?
  - How do we avoid address conflicts?
  - How do we protect programs? (isolation and selective sharing)
58
Virtual Memory in a Nutshell
- Use the hard disk (or Flash) as a large store for the data of all programs; main memory (DRAM) is a cache for the disk, managed jointly by hardware and the operating system (OS).
- Each running program has its own virtual address space (the address space shown in the previous figure), protected from other programs.
- Frequently used portions of the virtual address space are copied to DRAM; DRAM holds the physical address space.
- Hardware + OS translate the virtual addresses (VA) used by the program into the physical addresses (PA) used by the hardware.
- Translation enables relocation (between DRAM and disk) and protection.
59
Reminder: Memory Hierarchy Everything is a Cache for Something Else
[Table from the figure; the level names are implied by the values:]
- Registers: 1 cycle, ~500B, managed by software/compiler
- L1 cache: 1-3 cycles, ~64KB, managed by hardware
- L2 cache: 5-10 cycles, 1-10MB, managed by hardware
- Main memory: ~100 cycles, ~10GB, managed by software/OS
- Disk: millions of cycles, ~100GB, managed by software/OS
60
DRAM vs. SRAM as a “Cache”
- The DRAM vs. disk gap is more extreme than the SRAM vs. DRAM gap.
  - Access latencies: DRAM is ~10X slower than SRAM; disk is ~100,000X slower than DRAM.
- Importance of exploiting spatial locality: the first byte is ~100,000X slower than successive bytes on disk, vs. a ~4X improvement for page-mode over regular accesses to DRAM.
61
Impact of These Properties on Design
Bottom line: design decisions for virtual memory are driven by the enormous cost of misses (disk accesses). Considering DRAM as a "cache" for the disk:
- Line size? Large, since the disk is better at transferring large blocks and this minimizes the miss rate.
- Associativity? High, to minimize the miss rate.
- Write-through or write-back? Write-back, since we can't afford to perform small writes to disk.
62
Terminology for Virtual Memory
- Virtual memory uses DRAM as a cache for the disk.
- New terms:
  - A VM block is called a "page": the unit of data moving between disk and DRAM, typically larger than a cache block (e.g., 4KB or 16KB). Virtual and physical address spaces are divided into virtual and physical pages (contiguous chunks of, e.g., 4KB).
  - A VM miss is called a "page fault" (more on this later).
63
Locating an Object in a “Cache”
SRAM cache (L1, L2, etc.):
- The tag is stored with the cache line and maps from the cache block to a memory address.
- There is no tag for blocks not in the cache; if a block is not in the cache, it is in main memory.
- Hardware retrieves and manages the tag information and can quickly match against multiple tags.
64
Locating an Object in a “Cache” (cont.)
DRAM "cache" (virtual memory):
- Each allocated page of virtual memory has an entry in the page table.
- The page table maps from virtual pages to physical pages, with one entry per page in the virtual address space.
- A page table entry exists even if the page is not in memory; it then specifies the disk address.
- The OS retrieves and manages the page table information.
65
A System with Physical Memory Only
- Examples: most Cray machines, early PCs, nearly all embedded systems, etc.
- Addresses generated by the CPU point directly to bytes in physical memory.
66
A System with Virtual Memory
- Examples: workstations, servers, modern PCs, etc.
- Address translation: hardware converts virtual addresses to physical addresses via an OS-managed lookup table (the page table).
67
Page Faults (Similar to “Cache Misses”)
- What if an object is on disk rather than in memory? The page table entry indicates that the virtual address is not in memory.
- An OS exception handler is invoked to move the data from disk into memory.
- The OS has full control over placement, giving full associativity to minimize future misses.
[Figure: page table before and after the fault.]
68
Does VM Satisfy Original Motivations?
- Multiple active programs can share the physical address space.
- Address conflicts are resolved: all programs think their code is at 0x400000.
- Data from different programs can be protected, and programs can share data or code when desired.
69
Answer: Yes, Using Separate Address Spaces Per Program
- Each program has its own virtual address space and its own page table.
- The same virtual address (e.g., 0x400000) in different programs can map to different locations or to the same location, as desired.
- The OS controls how virtual pages are assigned to physical memory.
70
Translation: High-level View
- Pages are fixed size.
- A physical page is sometimes called a frame.
71
Translation: Process
72
Translation Process Explained
- Valid page:
  - Check the access rights (R, W, X) against the access type.
  - Generate the physical address if the access is allowed.
  - Generate a protection fault (exception) if the access is illegal.
- Invalid page:
  - The page is not currently mapped and a page fault is generated.
  - Faults are handled by the operating system:
    - Sometimes due to a program error (e.g., accessing out of the bounds of an array), in which case the program is terminated.
    - Sometimes due to "caching": the desired data or code is on disk, so space is allocated in DRAM, the page is copied from disk, the page table is updated, and the access is restarted. Replacement may be needed.
A translation sketch follows below.
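A sketch of this translation process for a single-level page table with 4KB pages. The PTE layout and the fault-handler names are assumptions for illustration:

```c
#include <stdbool.h>
#include <stdint.h>

#define PAGE_SHIFT  12
#define OFFSET_MASK ((1u << PAGE_SHIFT) - 1)

typedef struct {
    bool     valid;                     /* page currently mapped in DRAM */
    bool     readable, writable;
    uint32_t frame;                     /* physical page (frame) number  */
} PTE;

extern PTE  page_table[];               /* indexed by virtual page number */
extern void raise_page_fault(uint32_t vpn);
extern void raise_protection_fault(uint32_t vpn);

uint32_t translate(uint32_t va, bool is_write) {
    uint32_t vpn = va >> PAGE_SHIFT;
    PTE pte = page_table[vpn];

    if (!pte.valid) {
        raise_page_fault(vpn);          /* OS brings the page in and restarts
                                           the access; assumed not to return */
    }
    if (is_write ? !pte.writable : !pte.readable) {
        raise_protection_fault(vpn);    /* illegal access: exception */
    }
    return (pte.frame << PAGE_SHIFT) | (va & OFFSET_MASK);
}
```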
73
VM: Replacement and Writes
- To reduce the page fault rate, the OS uses least-recently-used (LRU) replacement:
  - A reference bit (a.k.a. use bit) in the PTE is set to 1 on each access to the page and periodically cleared to 0 by the OS.
  - A page with reference bit = 0 has not been used recently.
- Disk writes take millions of cycles and are done a block at a time, not to individual locations:
  - Write-through is impractical, so use write-back.
  - A dirty bit in the PTE is set when the page is written.
74
VM: Issues with Unaligned Accesses
- A memory access might be aligned or unaligned.
- What happens if an unaligned access straddles a page boundary? What if one page is present and the other is not? What if neither is present?
- The MIPS architecture disallows unaligned memory accesses; this is an interesting legacy problem on 80x86, which does support unaligned accesses.
75
Fast Translation Using a TLB
- Address translation would appear to require extra memory references: one to access the PTE, then the actual memory access.
- But access to page tables has good locality, so a fast hardware cache of PTEs is kept within the processor: the Translation Look-aside Buffer (TLB).
- A typical TLB holds on the order of tens to hundreds of PTEs, hits in about a cycle, takes tens of cycles on a miss, and has a 0.01%-1% miss rate.
- Misses can be handled by hardware or by software.
76
Fast Translation Using a TLB
77
TLB Entries
The TLB is a cache for page table entries (PTEs).
- The data of a TLB entry (== a PTE):
  - Physical page number (frame number)
  - Access rights (R/W bits)
  - Any other PTE information (dirty bit, LRU info, etc.)
- The tag of a TLB entry:
  - The virtual page number (the portion of it not used to index the TLB)
  - Valid bit
  - LRU bits, if the TLB is associative and uses LRU replacement
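A sketch of a fully associative TLB lookup along these lines; the entry layout and sizes are illustrative:

```c
#include <stdbool.h>
#include <stdint.h>

#define TLB_ENTRIES 64
#define PAGE_SHIFT  12

typedef struct {
    bool     valid;
    uint32_t vpn;        /* tag:  virtual page number  */
    uint32_t ppn;        /* data: physical page number */
    bool     dirty, writable;
} TlbEntry;

static TlbEntry tlb[TLB_ENTRIES];

/* Returns true on a TLB hit and fills *pa; on a miss the caller falls back
 * to a page-table walk (in hardware or in a software handler). */
bool tlb_translate(uint32_t va, uint32_t *pa) {
    uint32_t vpn = va >> PAGE_SHIFT;
    for (int i = 0; i < TLB_ENTRIES; i++) {
        if (tlb[i].valid && tlb[i].vpn == vpn) {
            *pa = (tlb[i].ppn << PAGE_SHIFT) | (va & ((1u << PAGE_SHIFT) - 1));
            return true;
        }
    }
    return false;   /* TLB miss */
}
```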
78
TLB Case Study: MIPS R2000/R3000
- Consider the MIPS R2000/R3000 processors.
- Addresses are 32 bits with 4KB pages (12-bit offset).
- The TLB has 64 entries and is fully associative; each entry is 64 bits wide.
79
TLB Misses
- If the page is in memory:
  - Load the PTE from memory and retry.
  - This can be handled in hardware, though it gets complex for more complicated page table structures.
  - Or in software: raise a special exception with an optimized handler. This is what MIPS does, using a special vectored interrupt.
- If the page is not in memory (page fault):
  - The OS handles fetching the page and updating the page table, then restarts the faulting instruction.
80
TLB & Memory Hierarchies
- Once the address is translated, it is used to access the memory hierarchy: a hierarchy of caches (L1, L2, etc.).
81
TLB and Cache Interaction
- Basic process: use the TLB to translate the VA into a PA, then use the PA to access the caches and DRAM.
- Question: can you ever access the TLB and the cache in parallel?
82
TLB Caveats
- What happens to the TLB when switching between programs?
  - The OS must flush the entries in the TLB, causing a large number of TLB misses after every switch.
  - Alternatively, use PIDs (process IDs) in each TLB entry: entries from multiple programs can co-exist and are replaced gradually.
- Limited reach:
  - A 64-entry TLB with 8KB pages maps only 0.5MB, smaller than many L2 caches, so the TLB miss rate can exceed the L2 miss rate!
  - Potential solutions: multilevel TLBs (just like multilevel caches), or larger pages.
83
Page Size Tradeoff
- Larger pages
  - Advantages: smaller page tables; fewer page faults and more efficient transfer for larger applications; improved TLB coverage.
  - Disadvantages: higher internal fragmentation.
- Smaller pages
  - Advantages: improved start-up time for small processes with fewer pages; low internal fragmentation (important for small programs).
  - Disadvantages: high overhead in large page tables.
- General trend toward larger pages: 1978: 512B, 1984: 4KB, 1990: 16KB, 2000: 64KB.
84
Multiple Page Sizes
- Many machines support multiple page sizes:
  - SPARC: 8KB, 64KB, 1MB, 4MB
  - MIPS R4000: 4KB - 16MB
- The page size used depends on the application: the OS kernel uses large pages, while user applications use smaller pages.
- Issues: software complexity and TLB complexity.
85
Virtual Memory Summary
- Use the hard disk (or Flash) as a large store for the data of all programs; main memory (DRAM) is a cache for the disk, managed jointly by hardware and the operating system (OS).
- Each running program has its own virtual address space, protected from other programs.
- Frequently used portions of the virtual address space are copied to DRAM; DRAM holds the physical address space.
- Hardware + OS translate virtual addresses (VA) used by the program into physical addresses (PA) used by the hardware.
- Translation enables relocation and protection.
86
Acknowledgements
These slides contain material from courses: UCB CS152 and Stanford EE108B.