IT 344: Operating Systems Winter 2010 Module 12 Page Table Management, TLBs, and Other Pragmatics Chia-Chi Teng CTB 265.

Slides:



Advertisements
Similar presentations
Paging Andrew Whitaker CSE451.
Advertisements

Paging 1 CS502 Spring 2006 Paging CS-502 Operating Systems.
OS Fall’02 Virtual Memory Operating Systems Fall 2002.
Lecture 34: Chapter 5 Today’s topic –Virtual Memories 1.
CS 153 Design of Operating Systems Spring 2015
CS 333 Introduction to Operating Systems Class 11 – Virtual Memory (1)
Memory Management (II)
Paging and Virtual Memory. Memory management: Review  Fixed partitioning, dynamic partitioning  Problems Internal/external fragmentation A process can.
Translation Buffers (TLB’s)
1  Caches load multiple bytes per block to take advantage of spatial locality  If cache block size = 2 n bytes, conceptually split memory into 2 n -byte.
©UCB CS 162 Ch 7: Virtual Memory LECTURE 13 Instructor: L.N. Bhuyan
1  The second question was how to determine whether or not the data we’re interested in is already stored in the cache.  If we want to read memory address.
CSE451 Introduction to Operating Systems Spring 2007 Module 12 Memory Management Hardware Support Gary Kimura & Mark Zbikowski April 27, 2007.
Computer Architecture Lecture 28 Fasih ur Rehman.
CS 153 Design of Operating Systems Spring 2015 Lecture 17: Paging.
Lecture 19: Virtual Memory
CSE 451: Operating Systems Autumn 2013 Module 13 Page Table Management, TLBs, and Other Pragmatics Ed Lazowska Allen Center.
Operating Systems ECE344 Ding Yuan Paging Lecture 8: Paging.
July 30, 2001Systems Architecture II1 Systems Architecture II (CS ) Lecture 8: Exploiting Memory Hierarchy: Virtual Memory * Jeremy R. Johnson Monday.
Operating Systems COMP 4850/CISG 5550 Page Tables TLBs Inverted Page Tables Dr. James Money.
IT 344: Operating Systems Winter 2008 Module 12 Page Table Management, TLBs, and Other Pragmatics Chia-Chi Teng CTB 265.
Virtual Memory. Virtual Memory: Topics Why virtual memory? Virtual to physical address translation Page Table Translation Lookaside Buffer (TLB)
Review (1/2) °Caches are NOT mandatory: Processor performs arithmetic Memory stores data Caches simply make data transfers go faster °Each level of memory.
Virtual Memory 1 1.
Operating Systems ECE344 Ashvin Goel ECE University of Toronto Virtual Memory Hardware.
Review °Apply Principle of Locality Recursively °Manage memory to disk? Treat as cache Included protection as bonus, now critical Use Page Table of mappings.
Virtual Memory Review Goal: give illusion of a large memory Allow many processes to share single memory Strategy Break physical memory up into blocks (pages)
Constructive Computer Architecture Virtual Memory: From Address Translation to Demand Paging Arvind Computer Science & Artificial Intelligence Lab. Massachusetts.
LECTURE 12 Virtual Memory. VIRTUAL MEMORY Just as a cache can provide fast, easy access to recently-used code and data, main memory acts as a “cache”
Virtual Memory 1 Computer Organization II © McQuain Virtual Memory Use main memory as a “cache” for secondary (disk) storage – Managed jointly.
CS203 – Advanced Computer Architecture Virtual Memory.
CSE 451: Operating Systems Spring 2010 Module 12 Page Table Management, TLBs, and Other Pragmatics John Zahorjan Allen Center.
CS161 – Design and Architecture of Computer
Translation Lookaside Buffer
Memory Hierarchy Ideal memory is fast, large, and inexpensive
ECE232: Hardware Organization and Design
Memory COMPUTER ARCHITECTURE
CS161 – Design and Architecture of Computer
Virtual Memory - Part II
CS703 - Advanced Operating Systems
Virtual Memory Use main memory as a “cache” for secondary (disk) storage Managed jointly by CPU hardware and the operating system (OS) Programs share main.
© 2012 Gribble, Lazowska, Levy, Zahorjan
CSE 120 Principles of Operating
CSE 451: Operating Systems Winter 2011 Page Table Management, TLBs, and Other Pragmatics Mark Zbikowski Gary Kimura 1.
Virtual Memory Hardware
Translation Lookaside Buffer
Morgan Kaufmann Publishers Memory Hierarchy: Virtual Memory
How can we find data in the cache?
CSE 451: Operating Systems Autumn 2005 Memory Management
Translation Buffers (TLB’s)
CSE 451: Operating Systems Autumn 2003 Lecture 10 Paging & TLBs
© 2004 Ed Lazowska & Hank Levy
CSE451 Virtual Memory Paging Autumn 2002
CSE 451: Operating Systems Autumn 2003 Lecture 9 Memory Management
CSE 451: Operating Systems Autumn 2004 Page Tables, TLBs, and Other Pragmatics Hank Levy 1.
Translation Buffers (TLB’s)
CSE 451: Operating Systems Autumn 2003 Lecture 10 Paging & TLBs
CSE 451: Operating Systems Autumn 2003 Lecture 9 Memory Management
IT 344: Operating Systems Module 12 Page Table Management, TLBs, and Other Pragmatics Chia-Chi Teng CTB
CSE 451: Operating Systems Lecture 10 Paging & TLBs
CSE 451: Operating Systems Winter 2005 Page Tables, TLBs, and Other Pragmatics Steve Gribble 1.
Translation Buffers (TLBs)
Virtual Memory Use main memory as a “cache” for secondary (disk) storage Managed jointly by CPU hardware and the operating system (OS) Programs share main.
CSE 451: Operating Systems Winter 2012 Page Table Management, TLBs, and Other Pragmatics Mark Zbikowski Gary Kimura 1.
Review What are the advantages/disadvantages of pages versus segments?
CSE 451: Operating Systems Winter 2006 Module 12 Page Table Management, TLBs, and Other Pragmatics Ed Lazowska Allen Center.
CSE 451: Operating Systems Autumn 2010 Module 12 Page Table Management, TLBs, and Other Pragmatics Ed Lazowska Allen Center.
CSE 451: Operating Systems Winter 2009 Module 11a Page Table Management, TLBs, and Other Pragmatics Mark Zbikowski Gary Kimura 1.
Paging Andrew Whitaker CSE451.
Virtual Memory 1 1.
Presentation transcript:

IT 344: Operating Systems Winter 2010 Module 12 Page Table Management, TLBs, and Other Pragmatics Chia-Chi Teng CTB 265

11/12/20152 Address translation and page faults (refresher!) page frame 0 page frame 1 page frame 2 page frame Y … page frame 3 physical memory offset physical address page frame # page table offset virtual address virtual page # What mechanism causes a page fault to occur? Recall how address translation works

11/12/20153 How does OS handle a page fault? Interrupt causes system to be entered System saves state of running process, then vectors to page fault handler routine –find or create (through eviction) a page frame into which to load the needed page (1) if I/O is required, run some other process while it’s going on –find the needed page on disk and bring it into the page frame (2) run some other process while the I/O is going on –fix up the page table entry mark it as “valid,” set “referenced” and “modified” bits to false, set protection bits appropriately, point to correct page frame –put the process on the ready queue

11/12/20154 (1) Find or create (through eviction) a page frame into which to load the needed page –run page replacement algorithm free page frame assigned but unmodified (“clean”) page frame assigned and modified (“dirty”) page frame –assigned but “clean” find PTE (may be a different process!) mark as invalid (disk address must be available for subsequent reload) –assigned and “dirty” find PTE (may be a different process!) mark as invalid write it out

11/12/20155 (2) Find the needed page on disk and bring it into the page frame –processor makes process ID and faulting virtual address available to page fault handler –process ID gets you to the base of the page table –VPN portion of VA gets you to the PTE –data structure analogous to page table (an array with an entry for each page in the address space) contains disk address of page –at this point, it’s just a simple matter of I/O must be positive that the target page frame remains available! –or what?

11/12/20156 “Issues” Memory reference overhead of address translation –2 references per address lookup (page table, then memory) –solution: use a hardware cache to absorb page table lookups translation lookaside buffer (TLB) Memory required to hold page tables can be huge –need one PTE per page in the virtual address space –32 bit AS with 4KB pages = 2 20 PTEs = 1,048,576 PTEs –4 bytes/PTE = 4MB per page table OS’s typically have separate page tables per process 25 processes = 100MB of page tables –48 bit AS, same assumptions, 64GB per page table! –solution: page the page tables! (ow, my brain hurts …)

11/12/20157 Paging the page tables 1 Simplest notion: –put user page tables in a pageable segment of the system’s address space –wire down the system’s page table(s) in physical memory –allow the system segment containing the user page tables to be paged a reference to a non-resident portion of a user page table is a page fault in the system address space the system’s page table is wired down –“no smoke and mirrors” As a practical matter, this simple notion doesn’t cut the mustard today –although it is exactly what VAX/VMS did! But it’s a useful model for what’s actually done

11/12/20158 Paging the page tables 2 How can we reduce the physical memory requirements of page tables? –observation: only need to map the portion of the address space that is actually being used (often a tiny fraction of the total address space) a process may not use its full 32/48/64-bit address space a process may have unused “holes” in its address space a process may not reference some parts of its address space for extended periods –all problems in CS can be solved with a level of indirection! two-level (three-level, four-level) page tables

11/12/20159 Two-level page tables With two-level PT’s, virtual addresses have 3 parts: –master page number, secondary page number, offset –master PT maps master PN to secondary PT –secondary PT maps secondary PN to page frame number –offset and PFN yield physical address

11/12/ Two level page tables page frame 0 page frame 1 page frame 2 page frame Y … page frame 3 physical memory offset physical address page frame # master page table secondary page# virtual address master page #offset secondary page table secondary page table page frame number

11/12/ Example: –32-bit address space, 4KB pages, 4 bytes/PTE how many bits in offset? –need 12 bits for 4KB (2 12 =4K), so offset is 12 bits want master PT to fit in one page –4KB/4 bytes = 1024 PTEs –thus master page # is 10 bits (2 10 =1K) –and there are 1024 secondary page tables and 10 bits are left ( ) for indexing each secondary page table –hence, each secondary page table has 1024 PTEs and fits in one page

11/12/ Generalizing Early architectures used 1-level page tables VAX, P-II used 2-level page tables SPARC uses 3-level page tables uses 4-level page tables Key thing is that the outer level must be wired down (pinned in physical memory) in order to break the recursion – no smoke and mirrors

11/12/ Alternatives Hashed page table (great for sparse address spaces) –VPN is used as a hash –collisions are resolved because the elements in the linked list at the hash index include the VPN as well as the PFN Inverted page table (really reduces space!) –one entry per page frame –includes process id, VPN –hell to search! (but IBM PC/RT actually did this!)

11/12/ Making it all efficient Original page table scheme doubled the cost of memory lookups –one lookup into page table, a second to fetch the data Two-level page tables triple the cost!! –two lookups into page table, a third to fetch the data How can we make this more efficient? –goal: make fetching from a virtual address about as efficient as fetching from a physical address –solution: use a hardware cache inside the CPU cache the virtual-to-physical translations in the hardware called a translation lookaside buffer (TLB) TLB is managed by the memory management unit (MMU)

11/12/ TLBs Translation lookaside buffer –translates virtual page #s into PTEs (page frame numbers) (not physical addrs) –can be done in single machine cycle TLB is implemented in hardware –is a fully associative cache (all entries searched in parallel) –cache tags are virtual page numbers –cache values are PTEs (page frame numbers) –with PTE + offset, MMU can directly calculate the PA TLBs exploit locality –processes only use a handful of pages at a time entries in TLB is typical (64-192KB) can hold the “hot set” or “working set” of a process –hit rates in the TLB are therefore really important

TLB 11/12/201516

Cache 17 CPU Virtual address TLB Physical address hit L2 cache Main memory (page table) miss hit miss TLB is a special cache just for the page table Usually fully associative.

18 Review: cache hit When the CPU tries to read from memory, the address will be sent to a cache controller. –The lowest k bits of the address will index a block in the cache. –If the block is valid and the tag matches the upper (m - k) bits of the m-bit address, then that data will be sent to the CPU. Here is a diagram of a 32-bit memory address and a byte cache IndexTagData Valid Address (32 bits) = To CPU Hit 1022 Index Tag

19 Once you know the block address, you can map it to the cache as before: find the remainder when the block address is divided by the number of cache blocks. In our example, memory block 6 belongs in cache block 2, since 6 mod 4 = 2. This corresponds to placing data from memory byte addresses 12 and 13 into cache block 2. Review: cache mapping Byte Address Index Block Address

20 Disadvantage of direct mapping The direct-mapped cache is easy: indices and offsets can be computed with bit operators or simple arithmetic, because each memory address belongs in exactly one block. But, what happens if a program uses addresses 2, 6, 2, 6, 2, …? Index Memory Address

21 Disadvantage of direct mapping The direct-mapped cache is easy: indices and offsets can be computed with bit operators or simple arithmetic, because each memory address belongs in exactly one block. However, this isn’t really flexible. If a program uses addresses 2, 6, 2, 6, 2,..., then each access will result in a cache miss and a load into cache block 2. This cache has four blocks, but direct mapping might not let us use all of them. This can result in more misses than we might like Index Memory Address

22 A fully associative cache A fully associative cache permits data to be stored in any cache block, instead of forcing each memory address into one particular block. –When data is fetched from memory, it can be placed in any unused block of the cache. –This way we’ll never have a conflict between two or more memory addresses which map to a single cache block. In the previous example, we might put memory address 2 in cache block 2, and address 6 in block 3. Then subsequent repeated accesses to 2 and 6 would all be hits instead of misses. If all the blocks are already in use, it’s usually best to replace the least recently used one, assuming that if it hasn’t used it in a while, it won’t be needed again anytime soon.

23 The price of full associativity However, a fully associative cache is expensive to implement. –Because there is no index field in the address anymore, the entire address must be used as the tag, increasing the total cache size. –Data could be anywhere in the cache, so we must check the tag of every cache block. That’s a lot of comparators!... IndexTag (32 bits)DataValid Address (32 bits) = Hit 32 Tag = =

24 Set associativity An intermediate possibility is a set-associative cache. –The cache is divided into groups of blocks, called sets. –Each memory address maps to exactly one set in the cache, but data may be placed in any block within that set. If each set has 2 x blocks, the cache is an 2 x -way associative cache. Here are several possible organizations of an eight-block cache Set way associativity 8 sets, 1 block each 2-way associativity 4 sets, 2 blocks each 4-way associativity 2 sets, 4 blocks each

25 Review: cache mapping IndexTagDataValid Address (32 bits) = Hit 1020 Tag 2 bits Mux Data Block Index Tag Block Offset 2

26 Review: cache associativity By now you may have noticed the 1-way set associative cache is the same as a direct-mapped cache. Similarly, if a cache has 2 k blocks, a 2 k -way set associative cache would be the same as a fully-associative cache Set way 8 sets, 1 block each 2-way 4 sets, 2 blocks each 4-way 2 sets, 4 blocks each 0 Set 8-way 1 set, 8 blocks direct mappedfully associative

TLB Virtual to Physical translations are cached in a Translation Lookaside Buffer (TLB). 27

Comparing the 2 levels of hierarchy Cache (L2)Virtual Memory Block or LinePage MissPage Fault Block Size: 32-64BPage Size: 4K-8KB Placement:Fully Associative Direct Mapped, N-way Set Associative Replacement:Least Recently Used LRU or Random(LRU) Write Thru or BackWrite Back

My Intel Mobile Core 2 Due T /12/201529

Administrivia Midterm grade Quiz4 next Monday on Virtual Memory Project 1 write up due EOB Tomorrow 11/12/201530

11/12/ Managing TLBs Address translations are mostly handled by the TLB –>99% of translations, but there are TLB misses occasionally –in case of a miss, translation is placed into the TLB Hardware (memory management unit (MMU)) –knows where page tables are in memory OS maintains them, HW access them directly –tables have to be in HW-defined format –this is how x86 works Software loaded TLB (OS) –TLB miss faults to OS, OS finds right PTE and loads TLB –must be fast (but, cycles typically) CPU ISA has instructions for TLB manipulation OS gets to pick the page table format

11/12/ Managing TLBs (2) OS must ensure TLB and page tables are consistent –when OS changes protection bits in a PTE, it needs to invalidate the PTE if it is in the TLB What happens on a process context switch? –remember, each process typically has its own page tables –need to invalidate all the entries in TLB! (flush TLB) this is a big part of why process context switches are costly –can you think of a hardware fix to this? When the TLB misses, and a new PTE is loaded, a cached PTE must be evicted –choosing a victim PTE is called the “TLB replacement policy” –implemented in hardware, usually simple (e.g., LRU)

11/12/ Cool Paging Tricks Exploit level of indirection between VA and PA –shared memory regions of two separate processes’ address spaces map to the same physical frames –read/write: access to share data –execute: shared libraries! will have separate PTEs per process, so can give different processes different access privileges must the shared region map to the same VA in each process? –copy-on-write (COW), e.g., on fork( ) instead of copying all pages, created shared mappings of parent pages in child address space –make shared mappings read-only in child space –when child does a write, a protection fault occurs, OS takes over and can then copy the page and resume client

11/12/ Memory-mapped files –instead of using open, read, write, close “map” a file into a region of the virtual address space –e.g., into region with base ‘X’ accessing virtual address ‘X+N’ refers to offset ‘N’ in file initially, all pages in mapped region marked as invalid –OS reads a page from file whenever invalid page accessed –OS writes a page to file when evicted from physical memory only necessary if page is dirty

11/12/ Summary We know how address translation works in the “vanilla” case (single-level page table, no fault, no TLB) –hardware splits the virtual address into the virtual page number and the offset; uses the VPN to index the page table; concatenates the offset to the page frame number (which is in the PTE) to obtain the physical address We know how the OS handles a page fault –find or create (through eviction) a page frame into which to load the needed page –find the needed page on disk and bring it into the page frame –fix up the page table entry –put the process on the ready queue

11/12/ We’re aware of two “gotchas” that complicate things in practice –the memory reference overhead of address translation the need to reference the page table doubles the memory traffic solution: use a hardware cache (TLB = translation lookaside buffer) to absorb page table lookups –the memory required to hold page tables can be huge solution: use multi-level page tables; can page the lower levels, or at least omit them if the address space is sparse –this makes the TLB even more important, because without it, a single user-level memory reference can cause two or three or four page table memory references … and we can’t even afford one!

11/12/ TLB details –Implemented in hardware fully associative cache (all entries searched in parallel) cache tags are virtual page numbers cache values are page table entries (page frame numbers) with PTE + offset, MMU can directly calculate the physical address –Can be small because of locality entries can yield a 99% hit ratio –Searched before the hardware walks the page table(s) hit: address translation does not require an extra memory reference (or two or three or four) – “free” miss: the hardware walks the page table(s) to translate the address; this translation is put into the TLB, evicting some other translation; typically managed LRU by the hardware

11/12/ –On context switch TLB must be purged/flushed (using a special hardware instruction) unless entries are tagged with a process ID –otherwise, the new process will use the old process’s TLB entries and reference its page frames! Cool tricks –shared memory –copy-on-write –memory-mapped files