Presentation on theme: "The University of Adelaide, School of Computer Science" — Presentation transcript:

1 The University of Adelaide, School of Computer Science
Computer Architecture: A Quantitative Approach, Fifth Edition. Chapter 2: Memory Hierarchy Design. Copyright © 2012, Elsevier Inc. All rights reserved.

2 Introduction
Programmers want unlimited amounts of memory with low latency, but fast memory technology is more expensive per bit than slower memory. Solution: organize the memory system into a hierarchy. The entire addressable memory space is available in the largest, slowest memory; incrementally smaller and faster memories, each containing a subset of the memory below it, proceed in steps up toward the processor. Temporal and spatial locality ensure that nearly all references can be found in the smaller memories, giving the processor the illusion of a large, fast memory.

3 Memory Hierarchy
[Figure: the levels of the memory hierarchy]

4 Memory Performance Gap
[Figure: the processor-memory performance gap over time]

5 Memory Hierarchy Design
Memory hierarchy design becomes more crucial with recent multi-core processors, because aggregate peak bandwidth grows with the number of cores. An Intel Core i7 can generate two data references per core per clock; with four cores and a 3.2 GHz clock, that is 25.6 billion 64-bit data references/second plus 12.8 billion 128-bit instruction references/second, or 409.6 GB/s. DRAM bandwidth is only about 6% of this (25 GB/s). Meeting the demand requires multi-port, pipelined caches, two levels of cache per core, and a shared third-level cache on chip. A rough check of the arithmetic is sketched below.

6 Performance and Power
High-end microprocessors have more than 10 MB of on-chip cache, which consumes a large share of the area and power budget.

7 Memory Hierarchy Basics
When a word is not found in the cache, a miss occurs: the word is fetched from a lower level in the hierarchy, requiring a higher-latency reference; the lower level may be another cache or main memory. The other words in the block are fetched as well, taking advantage of spatial locality. The block is placed anywhere within its set in the cache, the set being determined by (block address) MOD (number of sets).

8 Memory Hierarchy Basics
Hit: the data appears in some block in the upper level (example: block X). Hit rate: the fraction of memory accesses found in the upper level. Hit time: the time to access the upper level, which consists of the RAM access time plus the time to determine hit/miss. Miss: the data must be retrieved from a block in the lower level (block Y). Miss rate = 1 - (hit rate). Miss penalty: the time to replace a block in the upper level plus the time to deliver the block to the processor. Hit time << miss penalty (on the order of 500 instructions on the 21264!).
Speaker notes: a hit means the data the processor wants is found in the upper level (block X); the fraction of accesses that hit is the hit rate, and the hit time includes both accessing that level and determining whether the access is a hit or a miss. If the data cannot be found in the upper level, we have a miss and must retrieve the data (block Y) from the lower level; by definition the miss rate is 1 minus the hit rate. The miss penalty has two parts: the time to replace a block in the upper level and the time to deliver the new block to the processor. The hit time must be much smaller than the miss penalty; otherwise there would be no reason to build a memory hierarchy.

9 Memory Hierarchy Basics
n blocks per set => n-way set associative; a direct-mapped cache has one block per set (one-way), and a fully associative cache has one set. A block is placed anywhere within its set, the set being determined by (block address) MOD (number of sets); a sketch of this index calculation follows. Writing to the cache uses one of two strategies: write-through, which immediately updates the lower levels of the hierarchy, and write-back, which updates the lower levels only when an updated block is replaced. Both strategies use a write buffer to make writes asynchronous.
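As a concrete illustration of the (block address) MOD (number of sets) rule, here is a minimal C sketch of an n-way set-associative lookup. The geometry (64-byte blocks, 4096 sets, 8 ways) is chosen for illustration and is not taken from the slides.

    #include <stdint.h>

    #define BLOCK_BITS 6        /* 64-byte blocks              */
    #define NUM_SETS   4096     /* illustrative cache geometry */
    #define NUM_WAYS   8

    struct line { int valid; uint64_t tag; };

    /* Set index = (block address) MOD (number of sets); the block may then
       live in any of the NUM_WAYS lines of that set. */
    int lookup(struct line cache[NUM_SETS][NUM_WAYS], uint64_t addr)
    {
        uint64_t block_addr = addr >> BLOCK_BITS;
        uint64_t set = block_addr % NUM_SETS;
        uint64_t tag = block_addr / NUM_SETS;
        for (int way = 0; way < NUM_WAYS; way++)
            if (cache[set][way].valid && cache[set][way].tag == tag)
                return 1;       /* hit  */
        return 0;               /* miss */
    }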

10 Q4: What happens on a write?
Write-through: data written to the cache block is also written to lower-level memory; easy to debug; read misses never produce writes; repeated writes all reach the lower level.
Write-back: data is written only to the cache, and the lower level is updated when the block falls out of the cache; harder to debug; read misses can produce writes (when a dirty block is evicted); repeated writes do not all reach the lower level.
Additional option (on a write miss): allow writes to an un-cached address to allocate a new cache line ("write-allocate"). A sketch of the two store paths appears below.
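A minimal C sketch contrasting the two policies on a store that hits in the cache. The helper write_lower_level is a hypothetical stand-in for whatever the next level of the hierarchy is; the point is only when that level gets updated.

    #include <stdbool.h>
    #include <stdint.h>

    struct cache_block { bool valid, dirty; uint64_t tag; uint8_t data[64]; };

    /* Hypothetical stand-in for the next level of the hierarchy. */
    static void write_lower_level(uint64_t addr, const uint8_t *src, int len)
    {
        (void)addr; (void)src; (void)len;
    }

    void store_write_through(struct cache_block *b, uint64_t addr,
                             const uint8_t *src, int len, int off)
    {
        for (int i = 0; i < len; i++) b->data[off + i] = src[i];
        write_lower_level(addr, src, len);   /* every store also goes below */
    }

    void store_write_back(struct cache_block *b, const uint8_t *src, int len, int off)
    {
        for (int i = 0; i < len; i++) b->data[off + i] = src[i];
        b->dirty = true;   /* lower level is updated only when this block is evicted */
    }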

11 Write Buffers for Write-Through Caches
[Figure: Processor -> Cache -> Write Buffer -> Lower-Level Memory] The write buffer holds data awaiting write-through to lower-level memory. Q. Why a write buffer? A. So the CPU doesn't stall. Q. Why a buffer, why not just one register? A. Bursts of writes are common. Q. Are read-after-write (RAW) hazards an issue for the write buffer? A. Yes! Either drain the buffer before the next read, or check the write buffer before sending the read to the lower level (a sketch follows).
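A small C sketch of the RAW check described above: before a read miss is sent to the lower level, the write buffer is scanned for a pending store to the same block. The buffer size and layout are illustrative.

    #include <stdbool.h>
    #include <stdint.h>

    #define WB_ENTRIES 4

    struct wb_entry { bool valid; uint64_t block_addr; };

    /* Returns true if the read must wait for (or be forwarded from) the buffer. */
    bool write_buffer_conflict(const struct wb_entry wb[WB_ENTRIES], uint64_t block_addr)
    {
        for (int i = 0; i < WB_ENTRIES; i++)
            if (wb[i].valid && wb[i].block_addr == block_addr)
                return true;    /* RAW hazard: drain or forward before reading */
        return false;           /* safe to send the read to the lower level    */
    }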

12 Memory Hierarchy Basics
Hit rate: the fraction of accesses found in that level; it is usually so high that we talk about the miss rate instead. Miss-rate fallacy: miss rate can be as misleading a measure of memory performance as MIPS is of CPU performance; what matters is average memory access time. Average memory access time = Hit time + Miss rate x Miss penalty (in ns or clocks). Miss penalty: the time to replace a block from the lower level, including the time to install it for the CPU; it has an access-time component (a function of the latency to the lower level) and a transfer-time component (a function of the bandwidth between the upper and lower levels and of the block size). A small helper for this formula is sketched below.
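The formula is simple enough to encode directly; the numbers used below (1-cycle hit, 2% miss rate, 100-cycle penalty) are illustrative and not taken from the slides.

    #include <stdio.h>

    /* Average memory access time = hit time + miss rate * miss penalty */
    static double amat(double hit_time, double miss_rate, double miss_penalty)
    {
        return hit_time + miss_rate * miss_penalty;
    }

    int main(void)
    {
        printf("AMAT = %.2f cycles\n", amat(1.0, 0.02, 100.0));   /* 3.00 cycles */
        return 0;
    }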

13 Memory Hierarchy Basics
Miss rate: the fraction of cache accesses that result in a miss. Causes of misses: compulsory (the first reference to a block), capacity (blocks discarded because the cache cannot hold them all and later retrieved), and conflict (the program makes repeated references to multiple addresses from different blocks that map to the same location in the cache).

14 Memory Hierarchy Basics
Note that speculative and multithreaded processors may execute other instructions during a miss, which reduces the performance impact of misses.

15 Improve Cache Performance
To improve cache and memory access times, start from Average Memory Access Time = Hit Time + Miss Rate * Miss Penalty and try to reduce each term (ideally several at once). Improve performance by: 1. reducing the miss rate, 2. reducing the miss penalty, or 3. reducing the time to hit in the cache.

16 Memory Hierarchy Basics
Six basic cache optimizations: (1) Larger block size: reduces compulsory misses, but increases capacity and conflict misses and increases miss penalty. (2) Larger total cache capacity: reduces miss rate, but increases hit time and power consumption. (3) Higher associativity: reduces conflict misses, but increases hit time and power consumption. (4) More levels of cache: reduces overall memory access time. (5) Giving priority to read misses over writes: reduces miss penalty (e.g., the read completes before earlier writes still sitting in the write buffer). (6) Avoiding address translation during cache indexing: reduces hit time.

17 Ten Advanced Optimizations
1. Small and simple first-level caches. The critical timing path is addressing the tag memory, then comparing tags, then selecting the correct set; direct-mapped caches can overlap the tag compare with transmission of the data, and lower associativity reduces power because fewer cache lines are accessed.

18 L1 Size and Associativity
[Figure: access time vs. cache size and associativity]

19 L1 Size and Associativity
[Figure: energy per read vs. cache size and associativity]

20 Way Prediction
To improve hit time, predict the way so the multiplexer can be pre-set; a mis-prediction gives a longer hit time. Prediction accuracy is above 90% for two-way and above 80% for four-way caches, and the I-cache predicts better than the D-cache. Way prediction was first used on the MIPS R10000 in the mid-90s and is used on the ARM Cortex-A8. The idea can be extended to predict the block as well ("way selection"), which increases the mis-prediction penalty.

21 Pipelining Cache
Pipeline cache access to improve bandwidth. Examples: Pentium, 1 cycle; Pentium Pro through Pentium III, 2 cycles; Pentium 4 through Core i7, 4 cycles. Pipelining increases the branch mis-prediction penalty but makes it easier to increase associativity.

22 Nonblocking Caches
Allow hits before previous misses complete: "hit under miss" and "hit under multiple miss." The L2 cache must support this. In general, processors can hide an L1 miss penalty but not an L2 miss penalty.

23 Multibanked Caches
Organize the cache as independent banks to support simultaneous accesses: the ARM Cortex-A8 supports 1-4 banks for L2, and the Intel i7 uses 4 banks for L1 and 8 banks for L2. Banks are interleaved by block address, as in the one-liner below.
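Sequential interleaving amounts to one line of address arithmetic; the block size and bank count below are illustrative.

    #include <stdint.h>

    #define BLOCK_BITS 6    /* 64-byte blocks                                    */
    #define NUM_BANKS  4    /* e.g., four banks, as for the i7 L1 per the slide  */

    /* Consecutive block addresses map to consecutive banks, so a streaming
       access pattern touches all banks in turn. */
    static inline unsigned bank_of(uint64_t addr)
    {
        return (unsigned)((addr >> BLOCK_BITS) % NUM_BANKS);
    }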

24 Critical Word First, Early Restart
Critical word first: request the missed word from memory first and send it to the processor as soon as it arrives. Early restart: request the words in normal order, but send the missed word to the processor as soon as it arrives. The effectiveness of these strategies depends on the block size and on the likelihood of another access to the portion of the block that has not yet been fetched.

25 Merging Write Buffer
When storing to a block that is already pending in the write buffer, update the existing write-buffer entry instead of allocating a new one. This reduces stalls due to a full write buffer. Do not apply this to I/O addresses. [Figure: write-buffer contents without and with merging]

26 Compiler Optimizations
Techniques for the instruction cache: reorder procedures in memory so as to reduce conflict misses, using profiling to look at conflicts. Techniques for the data cache: merging arrays (improve spatial locality by using a single array of compound elements instead of two arrays); loop interchange (swap nested loops to access memory in sequential order); loop fusion (combine two independent loops that have the same looping structure and overlapping variables); and blocking (instead of accessing entire rows or columns, subdivide matrices into blocks, which requires more memory accesses but improves the locality of those accesses).

27 Merging Arrays Example
/* Before: 2 sequential arrays */
int val[SIZE];
int key[SIZE];

/* After: 1 array of structures */
struct merge {
  int val;
  int key;
};
struct merge merged_array[SIZE];

Reduces conflicts between val and key and improves spatial locality.

28 Loop Interchange Example
/* Before */
for (k = 0; k < 100; k = k+1)
  for (j = 0; j < 100; j = j+1)
    for (i = 0; i < 5000; i = i+1)
      x[i][j] = 2 * x[i][j];

/* After: the i and j loops swapped so the inner loop walks along a row */
for (k = 0; k < 100; k = k+1)
  for (i = 0; i < 5000; i = i+1)
    for (j = 0; j < 100; j = j+1)
      x[i][j] = 2 * x[i][j];

Sequential accesses instead of striding through memory every 100 words; improved spatial locality.

29 Loop Fusion Example
/* Before: two separate loop nests */
for (i = 0; i < N; i = i+1)
  for (j = 0; j < N; j = j+1)
    a[i][j] = 1/b[i][j] * c[i][j];
for (i = 0; i < N; i = i+1)
  for (j = 0; j < N; j = j+1)
    d[i][j] = a[i][j] + c[i][j];

/* After: one fused loop nest */
for (i = 0; i < N; i = i+1)
  for (j = 0; j < N; j = j+1) {
    a[i][j] = 1/b[i][j] * c[i][j];
    d[i][j] = a[i][j] + c[i][j];
  }

Two misses per access to a and c before fusion versus one miss per access after; the second use of a[i][j] and c[i][j] now hits in the cache.

30 Blocking: improve temporal and spatial locality
When multiple arrays are accessed both row-wise and column-wise (orthogonal accesses that cannot be helped by the earlier methods), concentrate on submatrices, or blocks. In the matrix-multiply loop nest sketched below, all N*N elements of Y and Z are accessed N times and each element of X is accessed once, so there are N^3 operations and 2N^3 + N^2 memory reads. Capacity misses are a function of N and the cache size in this case.
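For reference, a sketch of the kind of loop nest being analyzed (X = X + Y*Z on N x N row-major arrays). It mirrors the textbook example in spirit, but the exact code is a reconstruction, not a quotation of the slide.

    void matmul(int N, double x[N][N], double y[N][N], double z[N][N])
    {
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++) {
                double r = 0.0;
                for (int k = 0; k < N; k++)
                    r += y[i][k] * z[k][j];   /* z is walked column-wise */
                x[i][j] += r;
            }
    }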

31 Blocking: improve temporal and spatial locality (cont'd)
To ensure that the elements being accessed can fit in the cache, the original code is changed to compute on a submatrix of size B*B, where B is called the blocking factor. The total number of memory words accessed becomes 2N^3/B + N^2. Blocking exploits a combination of spatial (Y) and temporal (Z) locality; a blocked version of the loop nest is sketched below.
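A blocked version of the same loop nest, again a reconstruction rather than the slide's own code; B is the blocking factor discussed above, chosen so a B x B submatrix of Z plus a strip of Y fit in the cache.

    static int min(int a, int b) { return a < b ? a : b; }

    void matmul_blocked(int N, int B, double x[N][N], double y[N][N], double z[N][N])
    {
        for (int jj = 0; jj < N; jj += B)
            for (int kk = 0; kk < N; kk += B)
                for (int i = 0; i < N; i++)
                    for (int j = jj; j < min(jj + B, N); j++) {
                        double r = 0.0;
                        for (int k = kk; k < min(kk + B, N); k++)
                            r += y[i][k] * z[k][j];   /* reuses the cached block of z */
                        x[i][j] += r;
                    }
    }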

32 Reducing Conflict Misses by Blocking
[Figure: effect of the blocking factor on conflict misses, compared against a fully associative cache]

33 Hardware Prefetching
Fetch two blocks on a miss: the requested block and the next sequential block. [Figure: performance of Pentium 4 hardware prefetching]

34 Compiler Prefetching
Insert prefetch instructions before the data is needed. Prefetches should be non-faulting: a prefetch does not cause exceptions. A register prefetch loads the data into a register; a cache prefetch loads it only into the cache. Combine prefetching with loop unrolling and software pipelining.

35 Reducing Cache Miss Penalty: Compiler-Controlled Prefetching
The compiler inserts prefetch instructions. An example:
for (i = 0; i < 3; i = i+1)
  for (j = 0; j < 100; j = j+1)
    a[i][j] = b[j][0] * b[j+1][0];
Assume 16-byte blocks, an 8 KB direct-mapped write-back cache, and 8-byte elements. What kind of locality, if any, exists for a and b? For a, 3 rows of 100 elements are visited; there is spatial locality, so even-indexed elements miss and odd-indexed elements hit, giving 3*100/2 = 150 misses. For b, 101 rows and 3 columns are visited; there is no spatial locality, but there is temporal locality: the same element is used in the ith and (i+1)st iterations, and the same elements are reused in every iteration of the outer i loop. This gives 100 misses for b[j+1][0] when i = 0 plus 1 miss for j = 0, a total of 101 misses. Assuming a large miss penalty (100 cycles), data must be prefetched at least 7 iterations ahead. Splitting the loop into two gives the code on the next slide.

36 Reducing Cache Miss Penalty: Compiler-Controlled Prefetching (cont'd)
Assuming each iteration of the pre-split loop takes 7 cycles and there are no conflict or capacity misses, the original code consumes 7*300 iteration cycles + 251*100 cache-miss cycles = 27,200 cycles. With prefetch instructions inserted:
for (j = 0; j < 100; j = j+1) {
  prefetch(b[j+7][0]);    /* b element needed 7 iterations later */
  prefetch(a[0][j+7]);    /* a element needed 7 iterations later */
  a[0][j] = b[j][0] * b[j+1][0];
}
for (i = 1; i < 3; i = i+1)
  for (j = 0; j < 100; j = j+1) {
    prefetch(a[i][j+7]);
    a[i][j] = b[j][0] * b[j+1][0];
  }

37 An Example (continued)
The first loop consumes 9 cycles per iteration (due to the two prefetch instructions) and iterates 100 times, for a total of 900 cycles. The second loop consumes 8 cycles per iteration (due to the single prefetch instruction) and iterates 200 times, for a total of 1,600 cycles. During the first 7 iterations of the first loop, array a incurs 4 cache misses and array b incurs 7 cache misses, for (4+7)*100 = 1,100 cache-miss cycles. During the first 7 iterations of the second loop, for i = 1 and i = 2, array a incurs 4 cache misses each, for (4+4)*100 = 800 cache-miss cycles; array b incurs no cache misses in the second split. Total cycles consumed: 900 + 1,600 + 1,100 + 800 = 4,400. Prefetching improves performance by 27,200/4,400, about 6.2x.

38 Summary
[Figure: summary table of the advanced cache optimizations and their effects]

39 Example
In an L2 cache, a cache hit takes 0.8 ns and a cache miss takes 5.1 ns on average; the hit ratio is 95% and the miss ratio is 5%. Assuming a cycle time of 0.5 ns, compute the average memory access time. A cache hit takes 0.8/0.5 = 1.6, rounded up to 2 cycles, and a cache miss takes 5.1/0.5 = 10.2, rounded up to 11 cycles. Average memory access cycles = 0.95*2 + 0.05*11 = 2.45 cycles, so the average memory access time = 2.45*0.5 = 1.225 ns.

40 Computer Memory Hierarchy
[Figure]

41 Memory Technology
Performance metrics: latency is the main concern of caches, while bandwidth is the concern of multiprocessors and I/O; bandwidth can be improved externally (e.g., multi-bank memory) or internally (e.g., SDRAM, DDR). Memory latency has two measures: access time (AT), the time between a read request and the arrival of the desired word, and cycle time (CT), the minimum time between unrelated requests to memory. DRAM is used for main memory; because it is dynamic, it must be written back after a read and periodically refreshed, so AT < CT. SRAM is used for caches; it is static, with no refresh and no read-followed-by-write, so AT ≈ CT.

42 Memory Technology
SRAM requires only low power to retain its bits but takes 6 transistors per bit. DRAM must be re-written after being read and must be periodically refreshed (every ~8 ms; an entire row can be refreshed simultaneously), but needs only one transistor per bit. DRAM address lines are multiplexed: the upper half of the address is the row access strobe (RAS), the lower half the column access strobe (CAS).

43 An SRAM Example
[Figure]

44 A DRAM Example
[Figure]

45 Memory Technology
Amdahl suggested that memory capacity should grow linearly with processor speed; unfortunately, memory capacity and speed have not kept pace with processors. Some optimizations: multiple accesses to the same row; synchronous DRAM (a clock added to the DRAM interface enables pipelining); burst mode with critical word first; wider interfaces; double data rate (DDR); and multiple banks on each DRAM device.

46 Memory Optimizations
[Figure]

47 Memory Optimizations
[Figure]

48 DIMM: Dual Inline Memory Module
[Figure]

49 Memory Optimizations
DDR generations: DDR2 lowered the supply voltage (2.5 V to 1.8 V) and raised clock rates (266 MHz, 333 MHz, 400 MHz); DDR3 runs at 1.5 V and up to 800 MHz; DDR4 runs at 1-1.2 V and up to 1600 MHz. GDDR5 is graphics memory based on DDR3.

50 Memory Optimizations
Graphics memory achieves 2-5x the bandwidth per DRAM of DDR3 by using wider interfaces (32 vs. 16 bits) and higher clock rates, which is possible because the parts are soldered to the board rather than mounted in socketed DIMM modules. Reducing power in SDRAMs: lower voltage, and a low-power mode in which the chip ignores the clock but continues to refresh.

51 Memory Power Consumption
[Figure]

52 Flash Memory
A type of EEPROM: it must be erased (in blocks) before being overwritten, is non-volatile, and has a limited number of write cycles. Flash is cheaper than SDRAM but more expensive than disk, and slower than SRAM but faster than disk.

53 Solid State Drive (nowadays)

54 Comparison: SSD vs. HDD
Random access time: SSD 0.1 ms; HDD 5-10 ms.
Bandwidth: SSD ... MB/s; HDD 100 MB/s sequential.
Price/GB: SSD $0.9-$2; HDD $0.1.
Size: SSD up to 2 TB (250 GB common); HDD 4 TB.
Power consumption: SSD 5 watts; HDD up to 20 watts.
Read/write symmetry: SSD no; HDD yes.
Noise: SSD no; HDD yes (spin, rotate).

55 NAND Flash Memory
The main storage component of solid-state drives (SSDs); also used in USB drives, cell phones, touch pads, and similar devices.

56 NAND Flash (cont'd)
Advantages of NAND flash: fast random reads (about 25 us), energy efficiency, and high reliability (no moving parts) compared to hard disks. It is widely deployed in high-end laptops (MacBook Air, ThinkPad X series, touch pads) and increasingly deployed in enterprise environments, either as a secondary cache or as main storage.

57 Flash Memory (cont'd)
Disadvantages of SSDs center on garbage collection (GC), which stems from flash's out-of-place update behavior: an update invalidates the old version of a page and writes the new version to a new place, and GC must periodically erase victim blocks after copying their still-valid pages to free blocks, which adds I/Os (an erase is slow: roughly 10x a write and 100x a read). Blocks have a limited number of erase cycles: about 100,000 for single-level cell (SLC) flash, 5,000-10,000 for multi-level cell (MLC), and as low as 3,000, so an SSD may wear out quickly in an enterprise environment. Performance is also very unpredictable because the time-consuming GC process is triggered unpredictably. A rough sketch of the GC step follows.
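A rough, purely illustrative C sketch of the GC step just described: pick the block with the fewest valid pages, copy those pages to a free block, then erase the victim. The data structures and policy are hypothetical, not an actual SSD firmware interface.

    #define NBLOCKS          64
    #define PAGES_PER_BLOCK  128

    struct block {
        int valid[PAGES_PER_BLOCK];   /* 1 = page still holds live data */
        int valid_count;
        int erase_count;              /* wear: each block survives a limited number of erases */
    };

    /* Greedy victim choice: the block with the fewest valid pages needs the
       least copying before it can be erased. */
    static int pick_victim(struct block blk[NBLOCKS])
    {
        int victim = 0;
        for (int b = 1; b < NBLOCKS; b++)
            if (blk[b].valid_count < blk[victim].valid_count)
                victim = b;
        return victim;
    }

    /* Relocate live pages (the extra I/Os mentioned above), then erase. */
    static void collect(struct block blk[NBLOCKS], int victim, struct block *free_blk)
    {
        for (int p = 0; p < PAGES_PER_BLOCK; p++)
            if (blk[victim].valid[p]) {
                free_blk->valid[p] = 1;       /* copy valid page to the free block */
                free_blk->valid_count++;
                blk[victim].valid[p] = 0;
            }
        blk[victim].valid_count = 0;          /* block erased, all pages free again */
        blk[victim].erase_count++;
    }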

58 Hybrid Main Memory System
DRAM + flash memory: a small DRAM acts as a cache that buffers writes and caches reads by exploiting access locality, while a large flash memory stores cold data. Advantages: performance similar to DRAM, low power consumption, and low cost.

59 Memory Dependability
Memory is susceptible to cosmic rays. Soft errors are dynamic errors, detected and fixed by error-correcting codes (ECC). Hard errors are permanent; spare rows are used to replace defective rows. Chipkill is a RAID-like error recovery technique.

60 Chipkill
[Figure]

61 Chipkill
A Redundant Array of Inexpensive DRAM (RAID) processor chip is placed directly on the memory DIMM. On each memory access the RAID chip calculates an ECC checksum over the contents of the entire set of chips and stores the result in extra memory space on the protected DIMM. When a memory chip on the DIMM fails, the RAID result can be used to reconstruct the lost data.

62 Virtual Memory
Protection via virtual memory keeps processes in their own memory spaces. The role of the architecture: provide user mode and supervisor mode, protect certain aspects of CPU state, provide mechanisms for switching between user and supervisor mode, provide mechanisms to limit memory accesses, and provide a TLB to translate addresses.

63 Virtual Machines
Virtual machines support isolation and security, and allow a computer to be shared among many unrelated users; they are enabled by the raw speed of modern processors, which makes the overhead more acceptable. They also allow different ISAs and operating systems to be presented to user programs ("system virtual machines"). SVM software is called a "virtual machine monitor" or "hypervisor," and the individual virtual machines that run under the monitor are called "guest VMs."

64 Virtual Machine Monitors (VMMs)
A virtual machine monitor (VMM), or hypervisor, is the software that supports VMs. The VMM determines how to map virtual resources to physical resources; a physical resource may be time-shared, partitioned, or emulated in software. A VMM is much smaller than a traditional OS: the isolation portion of a VMM is roughly 10,000 lines of code.

65 Virtual Machine Monitors (VMMs)
[Figure]

66 Virtual Machine Monitors (VMMs)
[Figure]

67 Impact of VMs on Virtual Memory
Each guest OS maintains its own set of page tables. The VMM adds a level of memory between physical and virtual memory called "real memory," and maintains a shadow page table that maps guest virtual addresses to physical addresses. This requires the VMM to detect the guest's changes to its own page table, which occurs naturally if accessing the page table pointer is a privileged operation.

68 VMM Overhead? It depends on the workload
User-level, processor-bound programs (e.g., SPEC) have essentially zero virtualization overhead, since they run at native speed and the OS is rarely invoked. I/O-intensive workloads are OS-intensive: they execute many system calls and privileged instructions, which can result in high virtualization overhead. For system VMs, the goal of the architecture and the VMM is to run almost all instructions directly on the native hardware. If an I/O-intensive workload is also I/O-bound, processor utilization is low because the workload is waiting for I/O, so the cost of processor virtualization can be hidden and the overall virtualization overhead is low.

69 Requirements of a Virtual Machine Monitor
A VM monitor presents a software interface to guest software, isolates the state of guests from each other, and protects itself from guest software (including guest OSes). Guest software should behave on a VM exactly as if it were running on the native hardware, except for performance-related behavior or limitations of fixed resources shared by multiple VMs. Guest software should not be able to change the allocation of real system resources directly; hence the VMM must control essentially everything (access to privileged state, address translation, I/O, exceptions and interrupts, and so on), even though the guest VM and OS currently running are temporarily using them.

70 Requirements of a Virtual Machine Monitor
The VMM must run at a higher privilege level than the guest VMs, which generally run in user mode, and execution of privileged instructions is handled by the VMM. For example, on a timer interrupt the VMM suspends the currently running guest VM, saves its state, handles the interrupt, determines which guest VM to run next, and then loads that VM's state; guest VMs that rely on a timer interrupt are given a virtual timer and an emulated timer interrupt by the VMM. The requirements of system virtual machines are roughly the same as those of paged virtual memory: at least two processor modes (system and user), and a privileged subset of instructions that is available only in system mode, traps if executed in user mode, and is the only way to control all system resources.

71 ISA Support for Virtual Machines
If VMs are planned for during the design of the ISA, it is easy to reduce both the number of instructions that must be executed by the VMM and the time needed to emulate them. An ISA is virtualizable if VMs can execute directly on the real machine while the VMM retains ultimate control of the CPU ("direct execution"). Since VMs have been considered for desktop and PC server applications only recently, most ISAs were created ignoring virtualization, including the 80x86 and most RISC architectures. The VMM must ensure that the guest system interacts only with virtual resources, so a conventional guest OS runs as a user-mode program on top of the VMM. If the guest OS accesses or modifies information related to hardware resources via a privileged instruction (e.g., reading or writing the page table pointer), it will trap to the VMM; if not, the VMM must intercept the instruction and supply a virtual version of the sensitive information, as the guest OS expects.

72 Impact of VMs on Virtual Memory
How is virtual memory virtualized when each guest OS in every VM manages its own set of page tables? The VMM separates real and physical memory, making real memory a separate, intermediate level between virtual memory and physical memory (some use the terms virtual memory, physical memory, and machine memory to name the three levels). The guest OS maps virtual memory to real memory via its page tables, and the VMM's page tables map real memory to physical memory. To avoid paying an extra level of indirection on every memory access, the VMM maintains a shadow page table that maps directly from the guest virtual address space to the physical address space of the hardware; the VMM must therefore trap any attempt by the guest OS to change its page table or to access the page table pointer. A sketch of the composition cached by the shadow table follows.
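A minimal sketch of the composition that the shadow page table caches, assuming flat single-level tables purely for illustration: guest_pt maps guest-virtual pages to real (guest-physical) pages, vmm_pt maps real pages to host-physical pages, and shadow_fill stores the composed mapping so the hardware walks only one table per access. All names and structures here are hypothetical.

    #include <stdint.h>

    typedef struct { int valid; uint64_t pfn; } pte_t;

    /* Compose the guest and VMM translations for one guest-virtual page. */
    int shadow_fill(const pte_t *guest_pt, const pte_t *vmm_pt,
                    pte_t *shadow_pt, uint64_t guest_vpn)
    {
        pte_t g = guest_pt[guest_vpn];      /* guest virtual -> real          */
        if (!g.valid) return -1;            /* guest page fault               */
        pte_t r = vmm_pt[g.pfn];            /* real -> host physical          */
        if (!r.valid) return -1;            /* VMM must allocate or handle it */
        shadow_pt[guest_vpn] = (pte_t){ 1, r.pfn };   /* cache the composition */
        return 0;
    }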

73 ISA Support for VMs & Virtual Memory
The IBM 370 architecture added an additional level of indirection that is managed by the VMM; the guest OS keeps its page tables as before, so shadow page tables are unnecessary (AMD's Pacifica proposed the same improvement for the 80x86). To virtualize a software-managed TLB, the VMM manages the real TLB and keeps a copy of the contents of each guest VM's TLB; any instruction that accesses the TLB must trap. TLBs with process-ID tags can hold a mix of entries from different VMs and the VMM, avoiding a TLB flush on a VM switch.

74 Impact of I/O on Virtual Memory
I/O is the most difficult part of virtualization because of the growing number of I/O devices attached to the computer, the growing diversity of I/O device types, the need to share a real device among multiple VMs, and the need to support the many device drivers required, especially if different guest OSes run on the same VM system. The approach is to give each VM generic versions of each type of I/O device driver and let the VMM handle real I/O. The method for mapping a virtual to a physical I/O device depends on the device type: disks are partitioned by the VMM to create virtual disks for guest VMs, while network interfaces are shared between VMs in short time slices, with the VMM tracking messages for virtual network addresses so that guest VMs receive only their own messages.

75 The Limits of Physical Addressing
With "physical addresses" of memory locations, all programs share one address space, the physical address space; machine-language programs must be aware of the machine organization, and there is no way to prevent a program from accessing any machine resource. [Figure: CPU connected directly to memory over address lines A0-A31 and data lines D0-D31]

76 Solution: Add a Layer of Indirection
User programs run in a standardized virtual address space, and address translation hardware, managed by the operating system (OS), maps each "virtual address" to a "physical address" in memory. The hardware supports "modern" OS features: protection, translation, and sharing. [Figure: CPU issuing virtual addresses that pass through address translation before reaching memory]

77 Three Advantages of Virtual Memory
Translation: a program can be given a consistent view of memory even though physical memory is scrambled, which makes multithreading reasonable (now used a lot!); only the most important part of a program (its "working set") must be in physical memory; and contiguous structures (like stacks) use only as much physical memory as necessary yet can still grow later. Protection: different threads (or processes) are protected from each other; different pages can be given special behavior (read-only, invisible to user programs, etc.); and kernel data is protected from user programs, which is very important for defense against malicious programs. Sharing: the same physical page can be mapped into multiple address spaces ("shared memory").

78 Page tables encode virtual address spaces
A virtual address space is divided into blocks of memory called pages; a machine usually supports pages of a few sizes (MIPS R4000). A page table, indexed by the virtual address, holds one entry per page, and a valid page table entry encodes the physical memory "frame" address for that page. The OS manages the page table for each ASID (address space ID).

79 An Example of Page Table
[Figure: virtual memory pages mapped to physical memory]

80 Dividing the address space by a page size
[Figure: virtual and physical memory divided into pages; page size 4 KB]

81 Virtual Page & Physical Page
[Figure: virtual pages V.P. 0-5 and physical pages P.P. 0-3; page size 4 KB]

82 Addressing
[Figure: a virtual address splits into a virtual page number plus page offset, and the physical address is a physical page number plus the same page offset; page size 4 KB]

83 Addressing
[Figure: the page table entry supplies the physical page number used to form the physical address]

84 Addressing
A page table entry holds V, P, R, and D bits together with the physical page number. Valid/present bit (V): if set, the page being pointed to is resident in memory; otherwise it is on disk or not yet allocated.

85 Addressing
Protection bits (P): restrict access, e.g., read-only, read/write, or system-only access.

86 Addressing
Reference bit (R): needed by replacement policies; if set, the page has been referenced.

87 Page Table Entry
Dirty bit (D): if set, at least one word in the page has been modified.

88 Page Table Entry
[Figure: the page table as an array of entries, each holding V, P, R, D and a physical page number]

89 Page Table Lookup
[Figure, stepped through on slides 89-93: the virtual page number indexes the page table, the selected entry supplies the physical page number, and that number is concatenated with the page offset to form the physical address]

94 Details of Page Table
The page table lives in physical memory and is located via the page table base register; the virtual page number indexes it, and each entry holds a valid bit, access rights, and the physical frame address ("PTE" = page table entry). Virtual memory in effect treats main memory as a cache for disk, so the four fundamental questions apply: placement, identification, replacement, and write policy. A minimal lookup sketch follows.
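A minimal C sketch of the lookup the preceding slides walk through, assuming 4 KB pages and a flat single-level table; real page tables are usually hierarchical, so this is illustration only.

    #include <stdint.h>

    #define PAGE_SHIFT 12    /* assume 4 KB pages */

    /* One page table entry with the fields from the preceding slides:
       valid, protection, reference, dirty, and the physical page number. */
    struct pte { unsigned v:1, prot:2, ref:1, dirty:1; uint64_t ppn; };

    /* Translate one virtual address; returns 0 on a page fault (V = 0). */
    int translate(const struct pte *page_table, uint64_t vaddr, uint64_t *paddr)
    {
        uint64_t vpn    = vaddr >> PAGE_SHIFT;              /* index into the table */
        uint64_t offset = vaddr & ((1ull << PAGE_SHIFT) - 1);
        struct pte e = page_table[vpn];                     /* one memory access    */
        if (!e.v) return 0;                                 /* page fault           */
        *paddr = (e.ppn << PAGE_SHIFT) | offset;            /* frame plus offset    */
        return 1;
    }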

95 4 Fundamental Questions
Placement: operating systems allow blocks to be placed anywhere in main memory. Identification: page table or inverted page table. Replacement: almost all operating systems try to use LRU. Write policy: always write back.

96 Latency, Principle of Locality, Caching
Since the page table is located in main memory, an address translation costs one memory access; as a result, a load/store to main memory needs two memory accesses in total. Given how expensive memory accesses are, the overhead of the page table lookup should be optimized. How? By the principle of locality: cache the translations.

97 MIPS Address Translation: How does it work?
A Translation Look-Aside Buffer (TLB) sits between the CPU's "virtual addresses" and the memory's "physical addresses." It is a small, fully associative cache of mappings from virtual to physical addresses; the table of mappings it caches is the page table. The TLB also contains protection bits for the virtual address. The fast common case: the virtual address is in the TLB and the process has permission to read/write it.

98 The TLB caches page table entries
Physical and virtual pages must be the same size! The TLB holds recently used page table entries and supplies the physical frame for a virtual page, with the page offset passed through unchanged. Pages with V = 0 either reside on disk or have not yet been allocated; the OS handles a V = 0 reference as a "page fault." MIPS handles TLB misses in software (with random replacement); other machines use hardware. A rough lookup sketch follows.
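A rough C sketch of a fully associative TLB lookup, using the 16 KB page size from the example on the next slide; the entry count and structure are illustrative. On a miss the in-memory page table would be walked and the entry refilled, which is exactly the extra memory access the TLB exists to avoid.

    #include <stdbool.h>
    #include <stdint.h>

    #define PAGE_SHIFT  14    /* 16 KB pages       */
    #define TLB_ENTRIES 64    /* illustrative size */

    struct tlb_entry { bool valid; uint64_t vpn; uint64_t pfn; };
    struct tlb { struct tlb_entry e[TLB_ENTRIES]; };

    /* Fully associative lookup: compare the VPN against every entry. */
    bool tlb_translate(const struct tlb *t, uint64_t vaddr, uint64_t *paddr)
    {
        uint64_t vpn = vaddr >> PAGE_SHIFT;
        uint64_t off = vaddr & ((1ull << PAGE_SHIFT) - 1);
        for (int i = 0; i < TLB_ENTRIES; i++)
            if (t->e[i].valid && t->e[i].vpn == vpn) {
                *paddr = (t->e[i].pfn << PAGE_SHIFT) | off;   /* TLB hit */
                return true;
            }
        return false;   /* TLB miss: walk the page table, refill, retry */
    }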

99 Can TLB and cache access be overlapped?
Index the cache with the untranslated page-offset bits while the TLB translates the virtual page number, then compare the cache tags against the translated physical page number to detect a hit. This works, but... Q. What is the downside? A. Inflexibility: the size of the cache is limited by the page size.

100 Example: TLB and cache parameters
VA: 64 bits; PA: 40 bits; page size: 16 KB; TLB: 2-way set associative, 256 entries; cache block: 64 B; L1: direct-mapped, 16 KB; L2: 4-way set associative, 4 MB. A possible breakdown of the address fields is worked out below.
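One way to work out the address fields from these parameters (a sketch, assuming byte addressing and physically tagged caches):
Page offset = log2(16 KB) = 14 bits, so the virtual page number is 64 - 14 = 50 bits and the physical page number is 40 - 14 = 26 bits.
TLB: 256 entries / 2 ways = 128 sets, giving a 7-bit TLB index and a 50 - 7 = 43-bit TLB tag.
L1 (direct-mapped, 16 KB, 64 B blocks): 6-bit block offset, 16 KB / 64 B = 256 blocks so an 8-bit index, and a 40 - 14 = 26-bit tag.
L2 (4-way, 4 MB, 64 B blocks): 6-bit block offset, 4 MB / 64 B / 4 = 16,384 sets so a 14-bit index, and a 40 - 20 = 20-bit tag.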

