COSC3330 Computer Architecture Lecture 20. Virtual Memory


1 COSC3330 Computer Architecture Lecture 20. Virtual Memory
Instructor: Weidong Shi (Larry), PhD, Computer Science Department, University of Houston

2 Topics
Virtual Memory

3 Reducing Cache Miss Penalty (#2)
Use multiple levels of caches:
Primary cache (Level-1 or L1) attached to the CPU: small, but fast.
Level-2 (L2) cache services misses from the primary cache: larger and slower, but still faster than main memory.
Level-3 (L3) cache services misses from the L2 cache: largest and slowest, but still faster than main memory.
Main memory services L3 cache misses.
Advances in semiconductor technology allow enough room on the die for L1, L2, and L3 caches. L2 and L3 caches are typically unified, meaning that they hold both instructions and data.

4 Core i7 Example
[Diagram: each Core i7 CPU core contains a register file, a 32KB L1 I-cache, a 32KB L1 D-cache, and a private 256KB L2 cache; all cores share an 8MB L3 cache.]

5 Multilevel Cache Example
Given: CPU base CPI = 1, clock frequency = 4GHz, miss rate = 2% per instruction, main memory access time = 100ns.
With just L1 cache:
L1 access time = 0.25ns (1 cycle)
L1 miss penalty = 100ns / 0.25ns = 400 cycles
CPI = 1 + 2% × 400 = 9 (= 98% × 1 + 2% × (1 + 400))
Now add L2 cache:
L2 access time = 5ns
Global miss rate to main memory = 0.5%
L1 miss with L2 hit: L1 miss penalty = 5ns / 0.25ns = 20 cycles
L1 miss with L2 miss: extra penalty = 400 cycles
CPI = 1 + 2% × 20 + 0.5% × 400 = 3.4 (= 1 + 2% × (75% × 20 + 25% × (20 + 400)))
Performance ratio = 9/3.4 = 2.6
Adding the L2 cache in this example achieves a 2.6× speedup.
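The arithmetic is easy to check mechanically; here is a small C program reproducing the slide's numbers (the helper names are mine, not from the lecture):

```c
#include <stdio.h>

/* CPI = base + sum over miss events of (event rate x penalty in cycles) */
static double cpi_l1_only(double base, double miss_rate, int mem_penalty) {
    return base + miss_rate * mem_penalty;
}

static double cpi_with_l2(double base, double l1_miss_rate, int l2_penalty,
                          double global_miss_rate, int mem_penalty) {
    return base + l1_miss_rate * l2_penalty + global_miss_rate * mem_penalty;
}

int main(void) {
    /* 4GHz clock => 0.25ns cycle; 100ns memory = 400 cycles; 5ns L2 = 20 cycles */
    double no_l2   = cpi_l1_only(1.0, 0.02, 400);              /* 9.0 */
    double with_l2 = cpi_with_l2(1.0, 0.02, 20, 0.005, 400);   /* 3.4 */
    printf("CPI L1 only: %.1f, with L2: %.1f, speedup: %.1fx\n",
           no_l2, with_l2, no_l2 / with_l2);
    return 0;
}
```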

6 Multilevel Cache Design Considerations
Design considerations for L1 and L2 caches are very different.
L1 should focus on minimizing hit time in support of a shorter clock cycle: smaller in size and may use smaller block sizes.
L2 (L3) should focus on reducing miss rate to reduce the penalty of long main memory access times: larger in size, may use larger block sizes, and higher levels of associativity.

7 Multilevel Cache Design Considerations
The miss penalty of the L1 cache is significantly reduced by the presence of L2 and L3 caches, so L1 can be smaller (i.e., faster) but have a higher miss rate.
For the L2 (L3) cache, hit time is less important than miss rate, since the L2 (L3) hit time determines L1’s miss penalty.

8 Views of Memory
Real machines have limited amounts of memory: 640KB? A few GB? (This laptop = 4GB.)
The programmer doesn’t want to be bothered. Do you think, “oh, this computer only has 128MB so I’ll write my code this way…”? What happens if you run on a different machine?

9 Programmer’s View
Example: 32-bit memory, i.e., a 4GB virtual address space.
When programming, you don’t care about how much real memory there is; even if you use a lot, memory can always be paged to disk.
[Diagram: a 4GB address space divided between the kernel and the program’s text, data, heap, and stack regions; all of these are virtual addresses.]

10 Programmer’s View: Really “Program’s View”
Each program/process gets its own 4GB space.
[Diagram: three processes, each with its own kernel/text/data/heap/stack layout.]

11 CPU’s View
At some point, the CPU is going to have to load from / store to memory… all it knows is the real, a.k.a. physical, memory… which unfortunately is often < 4GB, and is pretty much never 4GB per process.

12 Absolute Addresses
EDSAC, early 50s: only one program ran at a time, with unrestricted access to the entire machine (RAM + I/O devices), so addresses in a program depended upon where the program was to be loaded in memory. But it was more convenient for programmers to write location-independent subroutines.
How could location independence be achieved? The linker and/or loader modify the addresses of subroutines and callers when building a program memory image.

13 Dynamic Address Translation
Motivation: in the early machines, I/O operations were slow and each word transferred involved the CPU. Throughput is higher if the CPU and I/O of 2 or more programs are overlapped. How? Multiprogramming.
Location-independent programs: ease of programming and storage management => need for a base register.
Protection: independent programs should not affect each other inadvertently => need for a bound register.
[Diagram: prog1 and prog2 loaded into different regions of physical memory.]

14 Simple Base and Bound Translation
[Diagram: for a load from effective address X, the base register is added to the effective (program) address to form the physical address of the current segment in main memory; the effective address is also compared against the bound register (segment length) to detect a bounds violation.]
Base and bound registers are visible/accessible only when the processor is running in supervisor mode.
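The hardware does one compare and one add per access; here is a minimal simulation sketch in C (the register values are made up for illustration):

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* Simulated relocation registers, loaded by the OS in supervisor mode. */
static uint32_t base_reg  = 0x40000;  /* where the segment starts in physical memory */
static uint32_t bound_reg = 0x10000;  /* segment length */

/* Translate a program (effective) address to a physical address. */
static uint32_t translate(uint32_t effective_addr) {
    if (effective_addr >= bound_reg) {           /* bounds violation? */
        fprintf(stderr, "bounds violation at 0x%X\n", effective_addr);
        exit(1);                                 /* real hardware would trap to the OS */
    }
    return base_reg + effective_addr;            /* relocation */
}

int main(void) {
    printf("0x1234 -> 0x%X\n", translate(0x1234));   /* 0x41234 */
    translate(0x20000);                              /* traps: beyond the bound */
    return 0;
}
```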

15 Memory Fragmentation
[Diagram: three snapshots of memory. Initially users 1, 2, and 3 occupy 16K, 24K, and 32K partitions above the OS space; users 4 & 5 arrive and fill the remaining free holes; then users 2 & 5 leave, leaving free blocks scattered between the remaining users.]
As users come and go, the storage is “fragmented”. Therefore, at some stage programs have to be moved around to compact the storage.

16 Pages
Memory is divided into pages, which are nothing more than fixed-size, aligned regions of memory. Typical size: 4KB/page (but not always).
[Diagram: memory divided into Page 0 (addresses 0-4095), Page 1, Page 2, Page 3, …]

17 Page Table
Map from virtual addresses to physical locations.
[Diagram: virtual pages 0K-28K are mapped through the page table to physical locations; the page table implements this V-to-P mapping.]
“Physical location” may include the hard disk.

18 Page Tables
[Diagram: virtual pages 0K-28K mapped into physical memory frames 0K-12K.]

19 Page Fault Handler
When the referenced page is not in DRAM:
The missing page is located (or created).
It is brought in from disk and the page table is updated; another job may be run on the CPU while the first job waits for the requested page to be read from disk.
If no free pages are left, a page is swapped out (pseudo-LRU replacement policy).
Since it takes a long time to transfer a page (msecs), page faults are handled completely in software by the OS.
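A toy sketch of those steps in C (the data structures and the trivial victim picker are invented for illustration; a real OS handler also schedules other work during the disk transfer):

```c
#include <stdbool.h>
#include <stdio.h>

#define NPAGES  8    /* virtual pages */
#define NFRAMES 4    /* physical frames */

typedef struct { bool valid; int frame; } pte_t;
static pte_t page_table[NPAGES];
static int frame_owner[NFRAMES] = { -1, -1, -1, -1 };  /* -1 = free */

static int pick_victim(void) { return 0; }  /* stand-in for pseudo-LRU */

/* Invoked when the referenced page is not in DRAM. */
static void page_fault(int vpn) {
    int f = -1;
    for (int i = 0; i < NFRAMES; i++)              /* look for a free frame */
        if (frame_owner[i] < 0) { f = i; break; }
    if (f < 0) {                                   /* none free: evict a victim */
        f = pick_victim();
        page_table[frame_owner[f]].valid = false;  /* victim swapped out to disk */
    }
    /* the disk read would happen here; another job could run on the CPU meanwhile */
    frame_owner[f] = vpn;
    page_table[vpn].valid = true;                  /* update the page table */
    page_table[vpn].frame = f;
}

int main(void) {
    for (int vpn = 0; vpn < 6; vpn++) page_fault(vpn);  /* 6 faults, 4 frames */
    printf("VPN 5 -> frame %d\n", page_table[5].frame);
    return 0;
}
```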

20 Where Should Page Tables Reside?
Space required by the page tables (PT) is proportional to the address space, number of users, ... => the space requirement is large.
Idea: keep the PTs in main memory. But then it takes one memory reference to retrieve the page base address and another to access the data word => doubles the number of memory references!

21 Need for Translation
[Diagram: virtual address 0xFC51908B is split into a virtual page number (0xFC519) and a page offset (0x08B); the page table maps VPN 0xFC519 to physical page 0x00152, so the physical address into main memory is 0x0015208B.]
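The split is simple bit manipulation; a minimal C sketch for 4KB pages, with a stand-in lookup that hard-codes just the slide's one mapping:

```c
#include <stdint.h>
#include <stdio.h>

#define PAGE_SHIFT 12                      /* 4KB pages => 12 offset bits */
#define PAGE_MASK  ((1u << PAGE_SHIFT) - 1)

/* Stand-in for a real page table lookup: maps VPN 0xFC519 -> PPN 0x00152. */
static uint32_t lookup_ppn(uint32_t vpn) {
    return (vpn == 0xFC519) ? 0x00152 : 0;
}

int main(void) {
    uint32_t va  = 0xFC51908B;
    uint32_t vpn = va >> PAGE_SHIFT;       /* 0xFC519 */
    uint32_t off = va & PAGE_MASK;         /* 0x08B   */
    uint32_t pa  = (lookup_ppn(vpn) << PAGE_SHIFT) | off;  /* 0x0015208B */
    printf("VA 0x%08X -> VPN 0x%05X, offset 0x%03X, PA 0x%08X\n",
           va, vpn, off, pa);
    return 0;
}
```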

22 Simple Page Table
Flat organization: one entry per page.
Each entry contains the physical page number (PPN), or indicates that the page is on disk or invalid; also meta-data (e.g., permissions, dirtiness, etc.).
Total size? One entry per page, so with 32-bit addresses, 4KB pages, and 4-byte entries: 2^20 entries × 4B = 4MB per process.
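As a sketch of what a flat table's entries could look like in C (the field names and widths here are illustrative, not from the lecture):

```c
#include <stdint.h>
#include <stdio.h>

/* One page table entry: PPN plus meta-data bits, packed into 32 bits. */
typedef struct {
    uint32_t ppn      : 20;  /* physical page number        */
    uint32_t valid    : 1;   /* 0 => on disk or nonexistent */
    uint32_t dirty    : 1;   /* written since loaded?       */
    uint32_t writable : 1;   /* permission bit              */
} pte_t;

/* Flat organization: one entry per virtual page.
   32-bit VA, 4KB pages => 2^20 entries; at 4B each, 4MB per process. */
static pte_t page_table[1u << 20];

int main(void) {
    printf("entry = %zu B, table = %zu MB\n",
           sizeof(pte_t), sizeof(page_table) >> 20);
    return 0;
}
```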

23 Multi-Level Page Tables
[Diagram: the virtual page number is split into a Level 1 index and a Level 2 index; these select entries in the two table levels, yielding the physical page number, which is combined with the page offset.]

24 Multi-Level Page Tables
Virtual address: bits 31-22 = p1 (10-bit L1 index), bits 21-12 = p2 (10-bit L2 index), bits 11-0 = page offset.
[Diagram: a processor register holds the root of the current page table; p1 indexes the Level 1 page table to find a Level 2 page table, and p2 indexes that to reach the data page. A PTE may point to a page in primary memory, a page in secondary memory, or mark a nonexistent page.]
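A software model of that 10/10/12 walk, continuing the 0xFC51908B example (a sketch: real hardware walks tables held in physical memory and traps on a miss rather than returning an error code):

```c
#include <stdint.h>
#include <stddef.h>
#include <stdio.h>

#define L1_BITS    10
#define L2_BITS    10
#define PAGE_SHIFT 12

typedef struct { uint32_t ppn; int valid; } pte_t;
typedef struct { pte_t entries[1 << L2_BITS]; } l2_table_t;
typedef struct { l2_table_t *tables[1 << L1_BITS]; } l1_table_t;

/* Walk both levels; returns 0 (page fault) if the L2 table or PTE is absent. */
static int walk(const l1_table_t *root, uint32_t va, uint32_t *pa) {
    uint32_t p1 = va >> (PAGE_SHIFT + L2_BITS);                /* bits 31..22 */
    uint32_t p2 = (va >> PAGE_SHIFT) & ((1u << L2_BITS) - 1);  /* bits 21..12 */
    const l2_table_t *l2 = root->tables[p1];
    if (l2 == NULL || !l2->entries[p2].valid) return 0;        /* page fault  */
    *pa = (l2->entries[p2].ppn << PAGE_SHIFT) | (va & ((1u << PAGE_SHIFT) - 1));
    return 1;
}

int main(void) {
    static l1_table_t root;
    static l2_table_t l2;
    root.tables[0x3F1] = &l2;                       /* p1 of VA 0xFC51908B */
    l2.entries[0x119] = (pte_t){ .ppn = 0x00152, .valid = 1 };
    uint32_t pa;
    if (walk(&root, 0xFC51908B, &pa))
        printf("PA = 0x%08X\n", pa);                /* 0x0015208B */
    return 0;
}
```

Note that only the Level 2 tables for regions of the address space actually in use need to be allocated, which is why the two-level scheme beats the flat 4MB table for sparse address spaces.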

25 Choosing a Page Size
Page table overhead is inversely proportional to page size.
A large page size permits more efficient transfer to/from disk: one big transfer vs. many small transfers (think NFS…).
A small page leads to less fragmentation: a big page is likely to have more bytes unused.
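A quick back-of-the-envelope check of that trade-off, assuming a flat page table with 4-byte entries over a 32-bit address space (the parameters are illustrative):

```c
#include <stdint.h>
#include <stdio.h>

/* Flat page table size = (address-space size / page size) x bytes per entry. */
int main(void) {
    uint64_t va_space    = 1ull << 32;   /* 32-bit address space */
    uint64_t entry_bytes = 4;
    for (uint64_t page = 1u << 12; page <= 1u << 16; page <<= 2) {
        uint64_t entries = va_space / page;
        printf("%2llu KB pages -> %7llu entries, %4llu KB of page table\n",
               (unsigned long long)(page >> 10),
               (unsigned long long)entries,
               (unsigned long long)(entries * entry_bytes >> 10));
    }
    return 0;   /* 4KB pages: 4MB of table; 64KB pages: only 256KB */
}
```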

26 CPU Memory Access
The program deals with virtual addresses: “Load R1 = 0[R2]”.
On a memory instruction:
Compute the virtual address (0[R2])
Compute the virtual page number
Compute the physical address of the VPN’s page table entry
Load the mapping (could be more loads, depending on the page table organization)
Compute the physical address
Do the actual load from memory
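To make the cost concrete, here is a toy model in C of the sequence above, using a flat page table kept in simulated “memory” so the extra access is explicit (all addresses and names are invented for illustration):

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define PAGE_SHIFT 12
#define PAGE_MASK  ((1u << PAGE_SHIFT) - 1)

static uint8_t  memory[1 << 20];     /* 1MB of simulated physical memory */
static uint32_t pt_base = 0x8000;    /* byte address of the flat page table */

static uint32_t mem_read32(uint32_t pa) {
    uint32_t v; memcpy(&v, &memory[pa], 4); return v;
}

/* "Load R1 = 0[R2]": note that ONE program load costs TWO memory reads. */
static uint32_t load(uint32_t r2) {
    uint32_t va  = r2 + 0;                          /* virtual address        */
    uint32_t vpn = va >> PAGE_SHIFT;                /* virtual page number    */
    uint32_t ppn = mem_read32(pt_base + 4 * vpn);   /* memory access #1: PTE  */
    uint32_t pa  = (ppn << PAGE_SHIFT) | (va & PAGE_MASK);
    return mem_read32(pa);                          /* memory access #2: data */
}

int main(void) {
    uint32_t ppn = 3, data = 0xDEADBEEF;            /* map VPN 1 -> PPN 3 */
    memcpy(&memory[pt_base + 4 * 1], &ppn, 4);
    memcpy(&memory[(ppn << PAGE_SHIFT) | 0x040], &data, 4);
    printf("loaded 0x%08X\n", load(0x00001040));    /* 0xDEADBEEF */
    return 0;
}
```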

27 Impact on Performance?
Every time you load/store, the CPU must perform two (or more) accesses! Even worse, every fetch requires translation of the PC!
Observation: once a virtual page is mapped into a physical page, it’ll likely stay put for quite some time.

28 Idea: Caching!
Not caching of data, but caching of translations.
[Diagram: virtual pages 0K-28K map to physical frames, and a small table of recently used VPN-to-PPN translations is kept to one side; an X marks a page with no current translation.]

29 Translation Cache: TLB
TLB = Translation Look-aside Buffer.
[Diagram: the virtual address goes to the TLB; the resulting physical address indexes the cache tags and data, producing a hit or miss.]
If the TLB hits, there is no need to do a page table lookup from memory. Note: the data cache is accessed by physical addresses now.
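Conceptually, a fully-associative TLB is a small array matched by VPN; a minimal C sketch (the 32-entry size anticipates the example on a later slide; the structure is illustrative):

```c
#include <stdint.h>
#include <stdbool.h>
#include <stdio.h>

#define TLB_ENTRIES 32
#define PAGE_SHIFT  12

typedef struct { bool valid; uint32_t vpn, ppn; } tlb_entry_t;
static tlb_entry_t tlb[TLB_ENTRIES];

/* Fully associative: hardware compares the VPN against every entry at once;
   software can only scan. Returns true on a TLB hit. */
static bool tlb_lookup(uint32_t va, uint32_t *pa) {
    uint32_t vpn = va >> PAGE_SHIFT;
    for (int i = 0; i < TLB_ENTRIES; i++) {
        if (tlb[i].valid && tlb[i].vpn == vpn) {
            *pa = (tlb[i].ppn << PAGE_SHIFT) | (va & ((1u << PAGE_SHIFT) - 1));
            return true;   /* hit: no page table walk needed */
        }
    }
    return false;          /* miss: walk the page table, then refill the TLB */
}

int main(void) {
    tlb[0] = (tlb_entry_t){ .valid = true, .vpn = 0xFC519, .ppn = 0x00152 };
    uint32_t pa;
    if (tlb_lookup(0xFC51908B, &pa))
        printf("TLB hit: PA = 0x%08X\n", pa);      /* 0x0015208B */
    return 0;
}
```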

30 PAPT Cache
The previous slide showed a Physically-Addressed, Physically-Tagged cache, sometimes called PIPT (I = Indexed).
Con: TLB lookup and cache access are serialized, and caches already take > 1 cycle.
Pro: cache contents remain valid as long as the page table is not modified.

31 Virtually Addressed Cache
(VIVT: virtually indexed, virtually tagged)
[Diagram: the virtual address indexes the cache tags and data directly; on a cache miss, the TLB translates the virtual address to the physical address sent to the L2 cache.]
Pro: latency, no need to check the TLB on a hit.
Con: the cache must be flushed on a process change.

32 Virtually Indexed, Physically Tagged
[Diagram: the cache is indexed with virtual-address bits while the TLB translates in parallel; the physical tag from the TLB is compared against the cache tag to determine a hit.]
Pro: latency, the TLB lookup is parallelized with the cache access.
Pro: no need to flush the cache on a process swap.
Con: limit on cache indexing (can only use address bits that are not part of the VPN/PPN, i.e., page-offset bits); a big page size can help here. For example, with 4KB pages there are 12 untranslated bits; with 64B lines that leaves 6 index bits, so a direct-mapped VIPT cache is limited to 4KB unless associativity is increased.

33 TLB Design
Often fully-associative. For latency, this means few entries; however, each entry covers a whole page. Ex: a 32-entry TLB with 4KB pages: how big a working set can be touched while avoiding TLB misses? 32 × 4KB = 128KB.
If there are many misses:
Increase TLB size (latency problems)
Increase page size (fragmentation problems)

34 Process Changes
With physically-tagged caches, you don’t need to flush the cache on a context switch. But the TLB is no longer valid!
Solution: add a process ID (PID) to each translation; then the TLB only needs to be flushed when PIDs are recycled.
[Diagram: TLB entries tagged with PIDs, e.g., PID 0’s VPN 8 maps to PPN 28 while PID 1’s VPN 8 maps to PPN 44; both can reside in the TLB at once.]
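Extending the earlier TLB sketch, the PID simply becomes part of the match; an illustrative C fragment (the example values mirror the diagram, not any real machine):

```c
#include <stdint.h>
#include <stdbool.h>
#include <stdio.h>

typedef struct { bool valid; uint8_t pid; uint32_t vpn, ppn; } tlb_entry_t;

/* An entry matches only if BOTH the PID and the VPN match, so two processes'
   identical VPNs can coexist in the TLB and no flush is needed on a context
   switch (only when PIDs are recycled). */
static bool tlb_match(const tlb_entry_t *e, uint8_t cur_pid, uint32_t vpn) {
    return e->valid && e->pid == cur_pid && e->vpn == vpn;
}

int main(void) {
    tlb_entry_t e = { .valid = true, .pid = 0, .vpn = 8, .ppn = 28 };
    printf("PID 0, VPN 8: %d; PID 1, VPN 8: %d\n",
           tlb_match(&e, 0, 8), tlb_match(&e, 1, 8));   /* 1; 0 */
    return 0;
}
```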

35 Hit or Miss or Page Fault
Worked example: given sample TLB contents (valid bit, VPN, PPN) and a page table (valid bit, PPN or “in disk”), classify each reference in a sequence of VPNs as a TLB hit, a TLB miss (translation found in the page table), or a page fault (page on disk).
[Table: the sample TLB and page table contents, with each VPN reference in the test sequence labeled Miss, Hit, or Page Fault.]

