
1 Chapter 9: Main Memory

2 Chapter 9: Memory Management
Background
Swapping
Contiguous Memory Allocation
Segmentation
Paging
Structure of the Page Table
Example: The Intel 32 and 64-bit Architectures
Example: ARM Architecture

3 Objectives
To provide a detailed description of various ways of organizing memory hardware
To discuss various memory-management techniques, including paging and segmentation
To provide a detailed description of the Intel Pentium, which supports both pure segmentation and segmentation with paging

4 How is the translation accomplished?
What, exactly, happens inside the MMU?
One possibility: hardware tree traversal
  For each virtual address, the hardware takes the page-table base pointer and traverses the page table itself
  Generates a "page fault" if it encounters an invalid PTE; the fault handler decides what to do (more on this next lecture)
  Pros: relatively fast (but still many memory accesses!)
  Cons: inflexible, complex hardware
Another possibility: software
  Each traversal is done in software
  Pros: very flexible
  Cons: every translation must invoke a fault!
In fact, we need a way to cache translations in either case!

5 Caching Concept
Cache: a repository for copies that can be accessed more quickly than the original
Make the frequent case fast and the infrequent case less dominant
Caching underlies many of the techniques used today to make computers fast
Can cache: memory locations, address translations, pages, file blocks, file names, network routes, etc.
Only good if the frequent case is frequent enough and the infrequent case is not too expensive
Important measure: Average Access Time = (Hit Rate x Hit Time) + (Miss Rate x Miss Time), as in the sketch below
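A minimal sketch of the average-access-time formula in C; the hit rate, hit time, and miss time are arbitrary assumptions for illustration, not measurements:

    #include <stdio.h>

    /* Average Access Time = (Hit Rate x Hit Time) + (Miss Rate x Miss Time)
     * The figures below are illustrative assumptions, not measurements.   */
    int main(void) {
        double hit_rate  = 0.99;    /* assumed: 99% of accesses hit the cache  */
        double hit_time  = 1.0;     /* assumed: 1 ns to service a hit          */
        double miss_time = 101.0;   /* assumed: hit time + 100 ns miss penalty */
        double miss_rate = 1.0 - hit_rate;

        double amat = hit_rate * hit_time + miss_rate * miss_time;
        printf("Average access time: %.2f ns\n", amat);   /* prints 2.00 ns */
        return 0;
    }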

6 Why Bother with Caching?
Processor-DRAM memory gap (latency):
[Figure: performance vs. time, 1980-2000. CPU performance grows about 60%/yr (2x every 1.5 years, "Moore's Law", really Joy's Law), while DRAM latency improves only about 9%/yr (2x every 10 years, "Less' Law?"); the processor-memory performance gap grows about 50% per year.]
Note that x86 didn't have an on-chip cache until 1989.

7 Hashed Page Tables Common in address spaces > 32 bits
The virtual page number is hashed into a page table
This page table contains a chain of elements hashing to the same location
Each element contains (1) the virtual page number, (2) the value of the mapped page frame, and (3) a pointer to the next element
Virtual page numbers are compared along this chain, searching for a match
If a match is found, the corresponding physical frame is extracted (see the lookup sketch below)
Variation for 64-bit addresses: clustered page tables
  Similar to hashed, but each entry refers to several pages (such as 16) rather than 1
  Especially useful for sparse address spaces (where memory references are non-contiguous and scattered)
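A hedged sketch of the chained lookup just described; the bucket count, the hash function, and the entry layout are illustrative assumptions, not the book's data structures:

    #include <stdint.h>
    #include <stdio.h>
    #include <stddef.h>

    #define HASH_BUCKETS 4096                  /* assumed table size              */

    struct hpt_entry {                         /* one element in a hash chain     */
        uint64_t vpn;                          /* (1) virtual page number         */
        uint64_t frame;                        /* (2) mapped physical page frame  */
        struct hpt_entry *next;                /* (3) pointer to the next element */
    };

    static struct hpt_entry *hash_table[HASH_BUCKETS];

    /* Return the physical frame for vpn, or -1 if no mapping exists. */
    static int64_t hpt_lookup(uint64_t vpn) {
        size_t bucket = vpn % HASH_BUCKETS;            /* hash the VPN            */
        for (struct hpt_entry *e = hash_table[bucket]; e; e = e->next)
            if (e->vpn == vpn)                         /* compare along the chain */
                return (int64_t)e->frame;              /* match: extract frame    */
        return -1;                                     /* no mapping: would fault */
    }

    int main(void) {
        static struct hpt_entry e = { 0x123456, 77, NULL };
        hash_table[e.vpn % HASH_BUCKETS] = &e;         /* install one mapping     */
        printf("vpn 0x123456 -> frame %lld\n", (long long)hpt_lookup(0x123456));
        return 0;
    }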

8 Hashed Page Table

9 Hashed Page Table
Pro:
  O(1) lookup to do the translation
  Independent of the size of the address space: requires page-table space proportional to how many pages are actually being used, not to the size of the address space – with 64-bit address spaces, this is a big win!
Con:
  Overhead of managing hash chains, etc.

10 Inverted Page Table
Rather than each process having a page table and keeping track of all possible logical pages, track all physical pages
One entry for each real (physical) page of memory
Each entry consists of the virtual address of the page stored in that real memory location, with information about the process that owns that page
An address-space identifier (ASID) stored in each entry maps the logical page of a particular process to the corresponding physical page frame
Decreases the memory needed to store each page table, but increases the time needed to search the table when a page reference occurs (see the lookup sketch below)
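A sketch of the table-wide search implied above, where the index of the matching entry is the physical frame number; the entry layout and table size are assumptions for illustration:

    #include <stdint.h>

    #define NFRAMES 1024                 /* assumed number of physical frames  */

    struct ipt_entry {                   /* one entry per physical frame       */
        uint32_t asid;                   /* owning process / address-space id  */
        uint64_t vpn;                    /* virtual page stored in this frame  */
        int      valid;
    };

    static struct ipt_entry ipt[NFRAMES];

    /* Search the whole table: the frame number IS the index of the match. */
    static int64_t ipt_lookup(uint32_t asid, uint64_t vpn) {
        for (int frame = 0; frame < NFRAMES; frame++)
            if (ipt[frame].valid && ipt[frame].asid == asid && ipt[frame].vpn == vpn)
                return frame;            /* physical frame found               */
        return -1;                       /* no mapping: page fault             */
    }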

11 Inverted Page Table Architecture

12 Inverted Page Table
Pro: decreases the memory needed to store each page table
Con: increases the time needed to search the table
  Use a hash table to limit the search to one — or at most a few — page-table entries
  One virtual memory reference then requires at least two real memory reads: one for the hash-table entry and one for the page table
  Associative registers (TLBs) can be used to improve performance
But how to implement shared memory? Only one mapping of a virtual address to the shared physical address is possible

13 Paging + segmentation: best of both?
Simple memory allocation, easy to share memory, and efficient for sparse address spaces
[Figure: the virtual address is split into (virt seg #, virt page #, offset). The segment table entry holds a page-table base and size; if the virtual page # exceeds the size, an error is raised; otherwise the indicated page table is indexed to get a phys frame #, and the physical address is (phys frame #, offset).]

14 Paging + segmentation Questions:
What must be saved/restored on a context switch?
How do we share memory? Can share an entire segment, or a single page.
Example: 24-bit virtual addresses = 4 bits of segment #, 8 bits of virtual page #, and 12 bits of offset (see the field-splitting sketch below)
[Figure: segment table giving a page-table base and page-table size for each segment, plus portions of the page tables for the segments in physical memory.]
What do the following addresses translate to? 0x002070? 0x14c684? (and two more example addresses)
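A small sketch that splits a 24-bit virtual address into the 4/8/12-bit fields from the example; it only extracts the fields and does not perform the actual translations, since those depend on the segment-table contents:

    #include <stdint.h>
    #include <stdio.h>

    /* 24-bit virtual address: 4-bit segment #, 8-bit virtual page #, 12-bit offset */
    static void split(uint32_t va) {
        uint32_t offset = va & 0xFFF;           /* low 12 bits  */
        uint32_t vpn    = (va >> 12) & 0xFF;    /* next 8 bits  */
        uint32_t seg    = (va >> 20) & 0xF;     /* top 4 bits   */
        printf("va 0x%06x -> seg 0x%x, vpn 0x%02x, offset 0x%03x\n",
               va, seg, vpn, offset);
    }

    int main(void) {
        split(0x002070);    /* one of the example addresses from the slide */
        split(0x14c684);
        return 0;
    }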

15 Multilevel translation
What must be saved/restored on a context switch?
  Contents of the top-level segment registers (for this example)
  Pointer to the top-level table (page table)
Pros:
  Only need to allocate as many page-table entries as we need; in other words, sparse address spaces are easy
  Easy memory allocation
  Share at the segment or page level (needs additional reference counting)
Cons:
  Pointer per page (typically 4 KB - 16 KB pages today)
  Page tables need to be contiguous
  Two (or more, if > 2 levels) lookups per memory reference

16 Another Major Reason to Deal with Caching
Cannot afford to translate on every access
  At least two DRAM accesses per actual DRAM access
  Or: perhaps I/O, if the page table is partially on disk!
Even worse: what if we are using caching to make memory access faster than DRAM access?
Solution? Cache translations!
  Translation cache: TLB ("Translation Look-aside Buffer")

17 Why Does Caching Help? Locality!
[Figure: probability of reference across the address space (0 to 2^n - 1), with references clustered around recently used regions.]
Temporal locality (locality in time): keep recently accessed data items closer to the processor
Spatial locality (locality in space): move contiguous blocks to the upper levels
How does the memory hierarchy work? It is rather simple, at least in principle. To take advantage of temporal locality, the memory hierarchy keeps the most recently accessed data items closer to the processor, because chances are the processor will access them again soon. To take advantage of spatial locality, not only do we move the item that has just been accessed to the upper level, we also move the data items that are adjacent to it.
[Figure: upper and lower levels of the memory hierarchy, with block X held in the upper level near the processor and block Y transferred to/from the lower-level memory.]

18 Memory Hierarchy of a Modern Computer System
Take advantage of the principle of locality to:
  Present as much memory as the cheapest technology provides
  Provide access at the speed offered by the fastest technology
[Figure: hierarchy from processor registers through the on-chip cache and second-level cache (SRAM), main memory (DRAM), secondary storage (disk), to tertiary storage (tape). Approximate speeds: 1s of ns (registers), 10s-100s of ns (caches), 100s of ns (DRAM), 10,000,000s of ns = 10s of ms (disk), 10,000,000,000s of ns = 10s of sec (tape). Approximate sizes: Ks-Ms of bytes (caches), Ms (DRAM), Gs (disk), Ts (tape).]
The design goal is to present the user with as much memory as is available in the cheapest technology (the disk), while, by taking advantage of the principle of locality, providing an average access speed very close to that offered by the fastest technology. (We will go over this slide in detail in the next lecture on caches.)

19 A Summary on Sources of Cache Misses
Compulsory (cold start): first reference to a block
  "Cold" fact of life: not a whole lot you can do about it
  Note: when running "billions" of instructions, compulsory misses are insignificant
Capacity: the cache cannot contain all the blocks accessed by the program
  Solution: increase the cache size
Conflict (collision): multiple memory locations mapped to the same cache location
  Solutions: increase the cache size, or increase associativity
Two others:
  Coherence (invalidation): another process (e.g., I/O) updates memory
  Policy: due to a non-optimal replacement policy (a form of capacity miss)
Capacity misses occur because the cache is simply not large enough to contain all the blocks the program accesses; the solution is to increase the cache size. Compulsory misses are the ones we cannot avoid; they are caused when we first start the program. Conflict misses are caused by multiple memory locations being mapped to the same cache location; the two remedies are, once again, to increase the cache size, or to increase the associativity (for example, using a 2-way set-associative cache instead of a direct-mapped cache). Keep in mind that the cache miss rate is only one part of the equation: you also have to worry about cache access time and miss penalty, so do NOT optimize miss rate alone. Finally, invalidation misses are caused by another process, such as I/O, updating main memory, which forces the cache to be flushed to avoid inconsistency between memory and cache.

20 How is a Block found in a Cache?
[Address fields: Block Address = Tag | Index, followed by the Offset; the Index is the set select and the Offset is the data select.]
Index used to look up candidates in the cache: the index identifies the set
Tag used to identify the actual copy: if no candidates match, declare a cache miss
Block is the minimum quantum of caching
Data select field used to select data within the block
Many caching applications don't have a data select field

21 Review: Direct Mapped Cache
Direct-mapped 2^N byte cache:
  The uppermost (32 - N) bits are always the cache tag
  The lowest M bits are the byte select (block size = 2^M)
Example: 1 KB direct-mapped cache with 32 B blocks
  The index chooses a potential block, the tag is checked to verify the block, and the byte select chooses the byte within the block
Let's use a specific example with realistic numbers: assume we have a 1 KB direct-mapped cache with a block size of 32 bytes. In other words, each block associated with a cache tag holds 32 bytes. With a block size of 32 bytes, the 5 least significant bits of the address are used as the byte select within the cache block. Since the cache size is 1 KB, the upper 32 minus 10 bits, i.e. 22 bits of the address, are stored as the cache tag. The rest of the address bits in the middle, bits 5 through 9, are used as the cache index to select the proper cache entry (e.g. tag = 0x50, index = 0x01, byte select = 0x00; see the sketch below).
[Figure: address split into Cache Tag (bits 10-31), Cache Index (bits 5-9), and Byte Select (bits 0-4); cache array with a valid bit, cache tag, and 32 bytes of cache data per entry (Byte 0 ... Byte 31, Byte 32 ... Byte 63, ..., Byte 992 ... Byte 1023).]
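A small sketch extracting the tag, index, and byte-select fields for the 1-KB, 32-byte-block example; the sample address is chosen (as an assumption) so the fields come out to the example values tag 0x50, index 0x01, byte 0x00:

    #include <stdint.h>
    #include <stdio.h>

    /* 1 KB direct-mapped cache, 32-byte blocks:
     * byte select = low 5 bits, index = next 5 bits, tag = upper 22 bits */
    int main(void) {
        uint32_t addr  = 0x00014020;             /* illustrative address  */
        uint32_t bsel  = addr & 0x1F;            /* bits 0-4              */
        uint32_t index = (addr >> 5) & 0x1F;     /* bits 5-9              */
        uint32_t tag   = addr >> 10;             /* bits 10-31            */
        printf("tag=0x%x index=0x%02x byte=0x%02x\n", tag, index, bsel);
        return 0;
    }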

22 Review: Set Associative Cache
N-way set associative: N entries per cache index
  N direct-mapped caches operate in parallel
Example: two-way set-associative cache
  The cache index selects a "set" from the cache
  The two tags in the set are compared to the input in parallel
  Data is selected based on the tag comparison result

23 Set Associative Cache Example
[Figure: two-way set-associative cache. The address is split into Cache Tag (bits 9-31), Cache Index (bits 5-8), and Byte Select (bits 0-4); two banks of (valid bit, cache tag, cache data) entries are read in parallel, their tags are compared, and a multiplexer driven by the compare results (Sel0/Sel1, OR'ed into Hit) selects the cache block.]
This is called a 2-way set-associative cache because there are two cache entries for each cache index. Essentially, you have two direct-mapped caches working in parallel. This is how it works: the cache index selects a set from the cache. The two tags in the set are compared in parallel with the upper bits of the memory address. If neither tag matches the incoming address tag, we have a cache miss. Otherwise, we have a cache hit and we select the data on the side where the tag match occurs. This is simple enough. What are its disadvantages?

24 Review: Fully Associative Cache
Fully associative: any memory block can go in any cache entry
  The address does not include a cache index
  Compare the cache tags of all cache entries in parallel
Example: block size = 32 B
  We need N 27-bit comparators
  Still have a byte select to choose from within the block
While the direct-mapped cache is at the simple end of the cache design spectrum, the fully associative cache is at the most complex end. It is the N-way set-associative cache carried to the extreme, where N is the number of cache entries. In other words, we don't even bother to use any address bits as a cache index; we just store all the upper bits of the address (except the byte select) as the cache tag and have one comparator for every entry. The address is sent to all entries at once and compared in parallel, and only the one that matches is sent to the output. This is called an associative lookup. Needless to say, it is very hardware intensive, so fully associative caches are usually limited to 64 or fewer entries. Since we are not doing any mapping with a cache index, we will never push an item out of the cache because multiple memory locations map to the same cache location; therefore, by definition, the conflict miss rate is zero for a fully associative cache. This, however, does not mean the overall miss rate will be zero. Assume we have 64 entries: the first 64 items we access can fit, but when we try to bring in the 65th item, we will need to throw one of them out to make room. This brings us to the third type of cache miss: the capacity miss.

25 Fully Associative Cache example
[Figure: address split into Cache Tag (27 bits, bits 5-31) and Byte Select (bits 0-4, e.g. 0x01); every entry holds a valid bit, a cache tag, and 32 bytes of cache data (Byte 0 ... Byte 31, Byte 32 ... Byte 63), and all stored tags are compared against the incoming tag in parallel.]

26 Where does a Block Get Placed in a Cache?
Example: block 12 from a 32-block address space placed in an 8-block cache
  Direct mapped: block 12 can go only into cache block 4 (12 mod 8)
  Set associative (2-way, 4 sets): block 12 can go anywhere in set 0 (12 mod 4)
  Fully associative: block 12 can go anywhere

27 Review: Which block should be replaced on a miss?
Easy for direct mapped: only one possibility
Set associative or fully associative: Random, or LRU (Least Recently Used); see the sketch below
Miss rates by cache size and associativity (LRU vs. Random):
  Size    | 2-way LRU | 2-way Random | 4-way LRU | 4-way Random | 8-way LRU | 8-way Random
  16 KB   | 5.2%      | 5.7%         | -         | 5.3%         | 4.4%      | 5.0%
  64 KB   | 1.9%      | 2.0%         | -         | 1.7%         | 1.4%      | 1.5%
  256 KB  | 1.15%     | 1.17%        | -         | 1.13%        | 1.12%     | 1.12%
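A minimal sketch of LRU victim selection within one set, using per-way access timestamps; the structure and way count are illustrative assumptions:

    #include <stdint.h>

    #define WAYS 4                        /* assumed 4-way set-associative    */

    struct way {
        int      valid;
        uint64_t tag;
        uint64_t last_used;               /* timestamp of most recent access  */
    };

    /* Pick the victim way in a set: an empty way if any, otherwise the
     * least recently used (oldest timestamp). */
    static int lru_victim(struct way set[WAYS]) {
        int victim = 0;
        for (int w = 0; w < WAYS; w++) {
            if (!set[w].valid)
                return w;                 /* prefer an empty way             */
            if (set[w].last_used < set[victim].last_used)
                victim = w;
        }
        return victim;
    }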

28 Review: What happens on a write?
Write through: the information is written to both the block in the cache and the block in lower-level memory
Write back: the information is written only to the block in the cache; the modified cache block is written to main memory only when it is replaced
  Question: is the block clean or dirty?
Pros and cons of each? (see the sketch below)
  WT: PRO: read misses cannot result in writes; CON: the processor is held up on writes unless writes are buffered
  WB: PRO: repeated writes are not sent to DRAM, and the processor is not held up on writes; CON: more complex, and a read miss may require a write back of dirty data
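A hedged sketch contrasting the two policies on a cache write hit; the line layout and the simulated lower-level memory are assumptions for illustration:

    #include <stdint.h>

    static uint8_t dram[1 << 16];                         /* simulated memory */
    static void memory_write(uint64_t addr, uint8_t b) { dram[addr & 0xFFFF] = b; }

    struct line { int valid, dirty; uint64_t tag; uint8_t data[32]; };

    /* Write-through: update the cache line AND lower-level memory.
     * The processor may be held up unless the write is buffered.           */
    static void write_through(struct line *l, uint64_t addr, unsigned off, uint8_t b) {
        l->data[off] = b;
        memory_write(addr, b);
    }

    /* Write-back: update only the cache line and mark it dirty; memory is
     * updated later, when the dirty line is evicted.                        */
    static void write_back(struct line *l, unsigned off, uint8_t b) {
        l->data[off] = b;
        l->dirty = 1;
    }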

29 Caching Applied to Address Translation
[Figure: the CPU issues a virtual address to the TLB; on a hit ("cached?" = yes) the physical address goes straight to physical memory, otherwise the MMU translates via the page tables and the result is saved in the TLB; the data read or write then proceeds untranslated.]
The question is one of page locality: does it exist?
  Instruction accesses spend a lot of time on the same page (since accesses are sequential)
  Stack accesses have definite locality of reference
  Data accesses have less page locality, but still some
Can we have a TLB hierarchy? Sure: multiple levels at different sizes/speeds

30 What Actually Happens on a TLB Miss?
Hardware-traversed page tables:
  On a TLB miss, hardware in the MMU looks at the current page table to fill the TLB (may walk multiple levels)
  If the PTE is valid, the hardware fills the TLB and the processor never knows
  If the PTE is marked invalid, it causes a page fault, after which the kernel decides what to do
Software-traversed page tables (like MIPS):
  On a TLB miss, the processor receives a TLB fault
  The kernel traverses the page table to find the PTE (see the handler sketch below)
  If the PTE is valid, it fills the TLB and returns from the fault
  If the PTE is marked invalid, it internally calls the page-fault handler
Most chipsets provide hardware traversal
Modern operating systems tend to have more TLB faults since they use translation for many things
  Examples: shared segments, user-level portions of an operating system
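A toy sketch of the software-traversed path; the one-entry "TLB", the PTE encoding, and the tiny page table are invented for illustration and are not MIPS specifics:

    #include <stdint.h>
    #include <stdio.h>

    #define PTE_VALID 0x1ULL
    #define NPAGES    16                         /* tiny, illustrative page table */

    static uint64_t page_table[NPAGES];          /* PTE = (frame << 1) | valid    */
    static uint64_t tlb_vpn, tlb_pte;            /* a one-entry "TLB" for the demo */

    static void tlb_fill(uint64_t vpn, uint64_t pte) { tlb_vpn = vpn; tlb_pte = pte; }
    static void page_fault(uint64_t vpn) { printf("page fault on vpn %llu\n", (unsigned long long)vpn); }

    /* Invoked on a TLB miss: the kernel walks the page table in software,
     * fills the TLB if the PTE is valid, otherwise raises a page fault.     */
    static void tlb_miss_handler(uint64_t vpn) {
        if (vpn < NPAGES && (page_table[vpn] & PTE_VALID))
            tlb_fill(vpn, page_table[vpn]);
        else
            page_fault(vpn);
    }

    int main(void) {
        page_table[3] = (42ULL << 1) | PTE_VALID;    /* map vpn 3 -> frame 42   */
        tlb_miss_handler(3);                         /* valid PTE: fills the TLB */
        tlb_miss_handler(7);                         /* invalid PTE: page fault  */
        printf("tlb: vpn %llu -> frame %llu\n",
               (unsigned long long)tlb_vpn, (unsigned long long)(tlb_pte >> 1));
        return 0;
    }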

31 What happens on a Context Switch?
Need to do something, since TLBs map virtual addresses to physical addresses
  The address space just changed, so TLB entries are no longer valid!
Options?
  Invalidate the TLB: simple, but might be expensive. What if we switch frequently between processes?
  Include a ProcessID in the TLB: an architectural solution, needs hardware
What if the translation tables change? For example, to move a page from memory to disk or vice versa
  Must invalidate the TLB entry! Otherwise, we might think the page is still in memory!

32 What TLB organization makes sense?
[Figure: CPU -> TLB -> Cache -> Memory.]
Needs to be really fast
  Critical path of memory access
  In the simplest view: before the cache, so it adds to access time (reducing cache speed)
  Seems to argue for direct mapped or low associativity
However, it needs to have very few conflicts!
  With a TLB, the miss time is extremely high!
  This argues that the cost of a conflict (miss time) is much higher than the slightly increased cost of access (hit time)
Thrashing: continuous conflicts between accesses
  What if we use the low-order bits of the page number as the index into the TLB?
    The first pages of code, data, and stack may map to the same entry
    Need 3-way associativity at least?
  What if we use the high-order bits as the index?
    The TLB is mostly unused for small programs

33 TLB organization: include protection
How big does a TLB actually have to be?
  Usually small: typically on the order of 64-512 entries
  Not very big, so it can support higher associativity
TLBs are usually organized as a fully associative cache
  Lookup is by virtual address; returns the physical address plus other info (see the lookup sketch below)
What happens when fully associative is too slow?
  Put a small (4-16 entry) direct-mapped cache in front, called a "TLB slice"
Example for MIPS R3000:
  Virtual Address | Physical Address | Dirty | Ref | Valid | Access | ASID
  0xFA00          | 0x0003           | Y     | N   | Y     | R/W    | 34
  0x0040          | 0x0010           | N     | Y   | Y     | R      | 0
  0x0041          | 0x0011           | N     | Y   | Y     | R      | 0
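A sketch of a fully associative lookup keyed by (virtual page, ASID), loosely mirroring the entry fields above; the size and field widths are assumptions for illustration:

    #include <stdint.h>

    #define TLB_ENTRIES 64                     /* assumed TLB size               */

    struct tlb_entry {
        uint32_t vpn, pfn;
        uint8_t  valid, dirty, ref;
        uint8_t  access;                       /* e.g. read-only vs. read/write  */
        uint8_t  asid;
    };

    static struct tlb_entry tlb[TLB_ENTRIES];

    /* Fully associative: compare every entry's (vpn, asid) with the lookup key. */
    static int tlb_lookup(uint32_t vpn, uint8_t asid, uint32_t *pfn_out) {
        for (int i = 0; i < TLB_ENTRIES; i++)
            if (tlb[i].valid && tlb[i].vpn == vpn && tlb[i].asid == asid) {
                *pfn_out = tlb[i].pfn;
                return 1;                      /* hit                            */
            }
        return 0;                              /* miss: walk the page table      */
    }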

34 Oracle SPARC Solaris
Consider a modern, 64-bit operating system example with tightly integrated hardware
  Goals are efficiency and low overhead
Based on hashing, but more complex
Two hash tables: one for the kernel and one for all user processes
  Each maps memory addresses from virtual to physical memory
  Each entry represents a contiguous area of mapped virtual memory, which is more efficient than having a separate hash-table entry for each page
  Each entry has a base address and a span (indicating the number of pages the entry represents)

35 Oracle SPARC Solaris (Cont.)
The TLB holds translation table entries (TTEs) for fast hardware lookups
  A cache of TTEs resides in a translation storage buffer (TSB), which includes an entry per recently accessed page
A virtual address reference causes a TLB search
  On a miss, the hardware walks the in-memory TSB looking for the TTE corresponding to the address
    If a match is found, the CPU copies the TSB entry into the TLB and translation completes
    If no match is found, the kernel is interrupted to search the hash table
      The kernel then creates a TTE from the appropriate hash table and stores it in the TSB
      The interrupt handler returns control to the MMU, which completes the address translation

36 Example: The Intel 32 and 64-bit Architectures
Dominant industry chips
Pentium CPUs are 32-bit and use the IA-32 architecture
Current Intel CPUs are 64-bit and use the x86-64 (Intel 64) architecture; the name IA-64 properly refers to the separate Itanium architecture
Many variations in the chips; we cover the main ideas here

37 Example: The Intel IA-32 Architecture
Memory management in IA-32 systems is divided into two components—segmentation and paging The CPU generates logical addresses, which are given to the segmentation unit. The segmentation unit produces a linear address for each logical address. The linear address is then given to the paging unit, which in turn generates the physical address in main memory. The segmentation and paging units form the equivalent of the memory-management unit (MMU).

38 IA-32 Segmentation Each segment can be 4 GB
Up to 16 K segments per process Divided into two partitions First partition of up to 8 K segments are private to process (kept in local descriptor table (LDT)) Second partition of up to 8K segments shared among all processes (kept in global descriptor table (GDT)) Each entry in the LDT and GDT consists of an 8-byte segment descriptor with detailed information about a particular segment, including the base location and limit of that segment.

39 IA-32 Segmentation The logical address is a pair (selector, offset), where the selector is a 16-bit number: s designates the segment number, g indicates whether the segment is in the GDT or LDT, and p deals with protection. The offset is a 32-bit number specifying the location of the byte within the segment in question. The machine has six segment registers, allowing six segments to be addressed at any one time by a process. It also has six 8-byte microprogram registers to hold the corresponding descriptors from either the LDT or the GDT. This cache lets the Pentium avoid having to read the descriptor from memory for every memory reference.
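A small sketch decoding the selector fields described above (s, g, p); the sample selector value is arbitrary:

    #include <stdint.h>
    #include <stdio.h>

    /* 16-bit selector: s = 13-bit segment number, g = 1 bit (GDT or LDT),
     * p = 2 protection bits, laid out from high to low as described above. */
    int main(void) {
        uint16_t selector = 0x0023;              /* illustrative value       */
        unsigned s = selector >> 3;              /* segment number           */
        unsigned g = (selector >> 2) & 0x1;      /* 0 = GDT, 1 = LDT         */
        unsigned p = selector & 0x3;             /* protection bits          */
        printf("s=%u g=%u p=%u\n", s, g, p);     /* 0x0023 -> s=4, g=0, p=3  */
        return 0;
    }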

40 Intel IA-32 Segmentation
The linear address on the IA-32 is 32 bits long The segment register points to the appropriate entry in the LDT or GDT. The base and limit information about the segment in question is used to generate a linear address. First, the limit is used to check for address validity. If the address is not valid, a memory fault is generated, resulting in a trap to the operating system. If it is valid, then the value of the offset is added to the value of the base, resulting in a 32-bit linear address.

41 Logical to Physical Address Translation in IA-32
The IA-32 architecture allows a page size of either 4 KB or 4 MB. For 4-KB pages, IA-32 uses a two-level paging scheme in which the division of the 32-bit linear address is as follows: The 10 high-order bits reference an entry in the outermost page table, called the page directory. The CR3 register points to the page directory for the current process. The page directory entry points to an inner page table that is indexed by the contents of the innermost 10 bits in the linear address. Finally, the low-order bits 0–11 refer to the offset in the 4-KB page pointed to in the page table.
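A sketch of the 10/10/12 split just described, plus the 4-MB case used when the Page_Size flag is set (next slide); the sample address is arbitrary:

    #include <stdint.h>
    #include <stdio.h>

    /* 32-bit linear address, 4-KB pages: 10-bit page-directory index,
     * 10-bit page-table index, 12-bit offset.                              */
    int main(void) {
        uint32_t linear = 0x12345678;                /* illustrative address */
        uint32_t dir    = linear >> 22;              /* bits 22-31           */
        uint32_t table  = (linear >> 12) & 0x3FF;    /* bits 12-21           */
        uint32_t offset = linear & 0xFFF;            /* bits 0-11            */
        printf("dir=0x%03x table=0x%03x offset=0x%03x\n", dir, table, offset);

        /* With the Page_Size flag set (4-MB pages), only the directory index
         * is used and the low 22 bits become the offset.                    */
        uint32_t big_offset = linear & 0x3FFFFF;
        printf("4-MB page: dir=0x%03x offset=0x%06x\n", dir, big_offset);
        return 0;
    }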

42 Intel IA-32 Paging Architecture
Each page-directory entry contains a Page_Size flag, which, if set, indicates that the size of the page frame is 4 MB rather than the standard 4 KB. In that case, the page directory points directly to the 4-MB page frame, bypassing the inner page table, and the 22 low-order bits in the linear address refer to the offset within the 4-MB page frame.

43 Intel IA-32 Paging Architecture
To improve the efficiency of physical memory use, IA-32 page tables can be swapped to disk. In this case, an invalid bit is used in the page directory entry to indicate whether the table to which the entry is pointing is in memory or on disk. If the table is on disk, the operating system can use the other 31 bits to specify the disk location of the table. The table can then be brought into memory on demand.

44 Intel IA-32 Page Address Extensions
The 4-GB limit of 32-bit addresses led Intel to create the page address extension (PAE), allowing 32-bit processors to access a physical address space larger than 4 GB
Paging went to a 3-level scheme, with the top two bits referring to a page directory pointer table
Page-directory and page-table entries increased from 32 bits to 64 bits in size, which allowed the base addresses of page tables and page frames to extend from 20 to 24 bits
The net effect is an increase of the physical address space to 36 bits, or 64 GB of physical memory; operating-system support is required to use PAE

45 Intel x86-64
Current generation Intel x86 architecture
A 64-bit address space yields an astonishing 2^64 bytes of addressable memory, more than 16 quintillion bytes (16 exabytes)
In practice only 48-bit addressing is implemented
  Page sizes of 4 KB, 2 MB, 1 GB
  Four levels of paging hierarchy (see the sketch below)
Can also use PAE, so virtual addresses are 48 bits and physical addresses are 52 bits
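A sketch of the conventional x86-64 four-level split for 4-KB pages (9 index bits per level plus a 12-bit offset); the level names follow standard x86-64 usage and the sample address is arbitrary:

    #include <stdint.h>
    #include <stdio.h>

    /* 48-bit virtual address, 4-KB pages: four levels of 9-bit indices
     * (PML4, PDPT, PD, PT) plus a 12-bit page offset.                       */
    int main(void) {
        uint64_t va     = 0x00007f3a12345678ULL;     /* illustrative address */
        unsigned pml4   = (va >> 39) & 0x1FF;
        unsigned pdpt   = (va >> 30) & 0x1FF;
        unsigned pd     = (va >> 21) & 0x1FF;
        unsigned pt     = (va >> 12) & 0x1FF;
        unsigned offset =  va        & 0xFFF;
        printf("pml4=%u pdpt=%u pd=%u pt=%u offset=0x%03x\n",
               pml4, pdpt, pd, pt, offset);
        return 0;
    }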

46 64-bit ARMv8 Architecture
The ARMv8 has three different translation granules: 4 KB, 16 KB, and 64 KB.
Each translation granule provides different page sizes, as well as larger sections of contiguous memory, known as regions.
For 4-KB and 16-KB granules, up to four levels of paging may be used; 64-KB granules use up to three levels of paging.
The ARMv8 address structure for the 4-KB translation granule uses up to four levels of paging (note that only 48 bits are currently used).
  Translation Granule Size | Page Size | Region Size
  4 KB                     | 4 KB      | 2 MB, 1 GB
  16 KB                    | 16 KB     | 32 MB
  64 KB                    | 64 KB     | 512 MB

47 64-bit ARMv8 Architecture
The four-level hierarchical paging structure for the 4-KB translation granule: TTBR register is the translation table base register and points to the level 0 table for the current thread If all four levels are used, the offset (bits 0–11) refers to the offset within a 4-KB page. Table entries for level 1 and level 2 may refer either to another table or to a 1-GB region (level-1 table) or 2-MB region (level-2 table).

48 64-bit ARMv8 Architecture
The ARM architecture supports two levels of TLBs
  The inner level has two micro TLBs (one for data, one for instructions); the micro TLBs support ASIDs as well
  At the outer level is a single main TLB
Address translation begins at the micro-TLB level; on a miss, the main TLB is checked, and if both TLBs miss, a page-table walk must be performed in hardware.

49 End of Chapter 9

