
1 ECE 454 Computer Systems Programming
Memory Performance (Part III: Virtual Memory and Prefetching)
Ding Yuan, ECE Dept., University of Toronto

2 Contents
Virtual Memory (review, hopefully)
Prefetching
2013-10-14

3 IA32 Linux Memory Layout
Stack: runtime stack (8 MB limit by default)
Data: statically allocated data, e.g., arrays and strings declared in code
Heap: dynamically allocated storage, obtained via malloc(), calloc(), or C++ new
Text: executable machine instructions; read-only
A short program after this slide illustrates these regions.
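A minimal sketch of the layout above: print the address of one object from each region. The exact addresses vary from run to run (address-space randomization), but on a typical IA32 Linux layout text sits lowest, then data, then heap, with the stack near the top.

#include <stdio.h>
#include <stdlib.h>

int global_arr[4];              /* data segment (statically allocated) */

int main(void) {
    int local;                  /* stack                               */
    int *p = malloc(16);        /* heap                                */

    printf("text:  %p\n", (void *)main);       /* code, lowest        */
    printf("data:  %p\n", (void *)global_arr);
    printf("heap:  %p\n", (void *)p);
    printf("stack: %p\n", (void *)&local);     /* highest             */

    free(p);
    return 0;
}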

4 Virtual Memory
Programs refer to virtual memory addresses
Conceptually a very large array of bytes (4 GB for IA32, 16 exabytes for 64-bit)
Each byte has its own address
The system provides an address space private to each "process"
Allocation: the compiler and run-time system decide where different program objects are stored
All allocation happens within a single virtual address space
But why virtual memory? Why not physical memory?

5 A System Using Physical Addressing
[Figure: the CPU issues a physical address (PA) directly to main memory (words 0 .. M-1) and receives a data word]
Used in "simple" embedded microcontrollers
Problems for larger, multi-process systems

6 Problem 1: How Does Everything Fit?
64-bit addresses: 16 exabytes of virtual address space
Physical main memory: a few gigabytes
And there are many processes...

7 Problem 2: Memory Management
[Figure: processes 1 through n, each with its own stack, heap, .text, and .data, all competing for space in physical main memory]
What goes where?

8 Problem 3: Portability
[Figure: the same program A (stack, heap, .text, .data) must load on Machine 1 and Machine 2, whose physical memories differ]
What goes where on each machine?

9 Problem 4: How to Protect? Problem 5: How to Share?
[Figure: processes i and j placed in physical main memory; protection must keep them apart, while sharing requires letting them overlap in a controlled way]

10 Solution: Add a Level of Indirection
[Figure: each process's virtual memory (process 1 through process n) is mapped onto shared physical memory]
Each process gets its own private memory space
This solves all of the previous problems

11 A System Using Virtual Addressing
[Figure: the CPU sends a virtual address (VA) to the MMU on the CPU chip; the MMU translates it to a physical address (PA) for main memory (words 0 .. M-1), which returns the data word]
MMU = Memory Management Unit
The MMU keeps the mapping of VAs to PAs in a "page table"

12 VM Turns Main Memory into a Cache
Driven by an enormous miss penalty: disk is about 10,000x slower than DRAM
DRAM-cache design:
Large page (block) size: typically 4 KB
Fully associative: any virtual page can be placed in any physical page; this requires a "large" mapping function, unlike CPU caches
Highly sophisticated, expensive replacement algorithms: too complicated and open-ended to implement in hardware
Write-back rather than write-through

13 MMU Needs a Big Table of Translations
[Figure: the same CPU/MMU/main-memory picture as before, now with the page table added]
The MMU keeps the mapping of VAs to PAs in a "page table"

14 Reduce Hardware: Page Table in Memory
[Figure: CPU, MMU, and cache/memory, with the page table stored in memory; steps 1-5 below]
1) Processor sends the virtual address (VA) to the MMU
2-3) MMU requests the page table entry (PTE) from the page table in memory
4) MMU sends the physical address (PA) to cache/memory
5) Cache/memory sends the data word to the processor
If the page is not mapped in physical memory, this is called a "page fault"
A toy C model of this translation follows.
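A minimal sketch of what the MMU computes in steps 1-4: split the VA into a virtual page number (VPN) and an offset, look the VPN up in the page table (one memory access), and splice the resulting physical page number back onto the offset. The sizes and names here are illustrative, not any real hardware.

#include <stdint.h>
#include <stdio.h>

#define PAGE_SHIFT 12                    /* 4 KB pages => 12 offset bits */
#define NUM_PAGES  16                    /* tiny address space for the demo */

static uint32_t page_table[NUM_PAGES];   /* PTE: VPN -> physical page number */

static uint32_t translate(uint32_t va) {
    uint32_t vpn    = va >> PAGE_SHIFT;             /* virtual page number */
    uint32_t offset = va & ((1u << PAGE_SHIFT) - 1);
    uint32_t ppn    = page_table[vpn];              /* one memory access for the PTE */
    return (ppn << PAGE_SHIFT) | offset;            /* physical address    */
}

int main(void) {
    page_table[3] = 7;                   /* map virtual page 3 to physical page 7 */
    printf("VA 0x3ABC -> PA 0x%X\n", translate(0x3ABC));   /* prints 0x7ABC */
    return 0;
}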

15 Page Fault
[Figure: CPU, MMU, page table in cache/memory, disk, and the page fault handler; steps 1-7 below]
1) Processor sends the virtual address to the MMU
2-3) MMU fetches the PTE from the page table in memory
4) The PTE is absent, so the MMU triggers a page fault exception
5) Handler identifies a victim page (and, if dirty, pages it out to disk)
6) Handler pages in the new page and updates the PTE in memory
7) Handler returns to the original process, restarting the faulting instruction

16 Speeding up Translation with a TLB
Page table entries (PTEs) are cached in L1 like any other memory word, but:
PTEs may be evicted by other data references
Even a PTE hit still costs a cycle of delay
Solution: Translation Lookaside Buffer (TLB)
A small hardware cache inside the MMU
Caches PTEs for a small number of pages (e.g., 256 entries)
A toy TLB sketch follows.
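Extending the page-walk sketch from slide 14, a toy direct-mapped TLB in front of the table walk: on a hit the PTE comes from the small hardware structure with no memory access; on a miss the walk runs and the entry is cached. The size and layout are illustrative only.

#define TLB_ENTRIES 4

struct tlb_entry { uint32_t vpn; uint32_t ppn; int valid; };
static struct tlb_entry tlb[TLB_ENTRIES];

static uint32_t translate_tlb(uint32_t va) {
    uint32_t vpn    = va >> PAGE_SHIFT;
    uint32_t offset = va & ((1u << PAGE_SHIFT) - 1);
    struct tlb_entry *e = &tlb[vpn % TLB_ENTRIES];

    if (!e->valid || e->vpn != vpn) {    /* TLB miss: do the page walk ... */
        e->vpn   = vpn;
        e->ppn   = page_table[vpn];      /* ... and cache the PTE          */
        e->valid = 1;
    }                                    /* TLB hit: no memory access      */
    return (e->ppn << PAGE_SHIFT) | offset;
}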

17 TLB Hit
[Figure: the CPU sends the VA to the MMU; the MMU finds the PTE in the TLB, then sends the PA to cache/memory, which returns the data]
A TLB hit saves you from accessing memory for the page table

18 TLB Miss
[Figure: the TLB lookup fails, so the MMU must request the PTE from the page table in memory before it can send the PA]
A TLB miss incurs an additional memory access (for the PTE)

19 How to Program for Virtual Memory
At any point in time, a program tends to access a set of active virtual pages called its working set
Programs with better temporal locality have smaller working sets
If working set size > main memory size:
Thrashing: a performance meltdown where pages are swapped (copied) in and out continuously
If # working-set pages > # TLB entries:
The program will suffer TLB misses; not as bad as page thrashing, but still worth avoiding (see the sketch below)
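A minimal sketch of how access order changes the working set of pages. With 4 KB pages, one 4096-double row is 32 KB, i.e., 8 pages; row-major order keeps a handful of pages live, while column-major order jumps 32 KB per access and can take a TLB miss on nearly every access.

#define N 4096   /* one row of doubles = 32 KB = 8 pages of 4 KB */

/* Row-major traversal: consecutive accesses stay on the same page,
 * so only a few pages (and TLB entries) are live at a time. */
double sum_rowmajor(double (*a)[N]) {
    double s = 0.0;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            s += a[i][j];
    return s;
}

/* Column-major traversal: successive accesses are 32 KB apart, so
 * nearly every access touches a different page; the page working set
 * far exceeds the TLB, costing roughly one TLB miss per access. */
double sum_colmajor(double (*a)[N]) {
    double s = 0.0;
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            s += a[i][j];
    return s;
}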

20 More on TLBs
Assume a 256-entry TLB and 4 KB pages
256 * 4 KB = 1 MB: the TLB can only hit on 1 MB of data
This is called the "TLB reach": the amount of memory the TLB can cover
A typical L2 cache is 6 MB, so you can't have TLB hits for all of L2
Hence, consider TLB reach before L2 size when choosing tile sizes?
Real CPUs also have second-level TLBs
This is getting complicated to reason about!
You will likely have to experiment to find the best tile size (HW2); a worked sizing example follows
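A worked version of the slide's arithmetic: compute the TLB reach and the largest square tile of doubles that fits within it. The 256-entry/4 KB figures are from the slide; the choice of double as the element type is my assumption (compile with -lm for sqrt).

#include <stdio.h>
#include <math.h>

int main(void) {
    const long tlb_entries = 256;              /* from the slide       */
    const long page_size   = 4096;             /* 4 KB pages           */
    const long elem_size   = sizeof(double);   /* assumed element type */

    long reach = tlb_entries * page_size;      /* 256 * 4 KB = 1 MB    */

    /* Largest square tile of doubles whose footprint fits in reach:
     * side * side * 8 bytes <= 1 MB  =>  side <= ~362 */
    long side = (long)sqrt((double)reach / elem_size);

    printf("TLB reach: %ld bytes\n", reach);
    printf("Max square tile of doubles: %ld x %ld\n", side, side);
    return 0;
}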

21 Prefetching

22 Prefetching
ORIGINAL CODE:                      CODE WITH PREFETCHING:
inst1                               inst1
inst2                               prefetch X
inst3                               inst2
inst4                               inst3
load X (misses cache)               inst4
... cache miss latency ...          load X (hits cache)   <- miss latency overlapped with inst2-inst4
inst5 (must wait for load value)    inst5 (load value is ready)
inst6                               inst6
Basic idea:
Predicts which data will be needed soon (might be wrong)
Initiates an early request for that data (like a load into the cache)
If effective, can be used to tolerate latency to memory (a strided-loop sketch follows)
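A strided-loop version of the timing diagram above: request a[i + DIST] while working on a[i], so the miss latency overlaps useful work. __builtin_prefetch is the GCC/Clang builtin; DIST = 16 is a guess that would need tuning to the actual miss latency and loop body (and this simple pattern is one a hardware prefetcher would likely catch anyway).

#define DIST 16   /* prefetch distance, in elements; tune experimentally */

double sum_with_prefetch(const double *a, long n) {
    double s = 0.0;
    for (long i = 0; i < n; i++) {
        if (i + DIST < n)
            __builtin_prefetch(&a[i + DIST]);  /* hint only; may be dropped */
        s += a[i];
    }
    return s;
}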

23 Prefetching is Difficult
Prefetching is effective only if all of these are true:
There is spare memory bandwidth to begin with; otherwise prefetches could make things worse
Prefetches are accurate: only useful if you prefetch data you will soon use
Prefetches are timely: prefetching the right data too late hides no latency
Prefetched data doesn't displace other in-use data; e.g., it is bad if a prefetch replaces a cache block that is about to be used
The latency hidden by prefetches outweighs their cost; the cost of many useless prefetches can be significant
Ineffective prefetching can hurt performance!

24 Hardware Prefetching
A simple hardware prefetcher:
When one block is accessed, prefetch the adjacent block; i.e., behaves as if blocks were twice as big
A more complex hardware prefetcher:
Can recognize a "stream": addresses separated by a constant "stride"
E.g. 1: 0x1, 0x2, 0x3, 0x4, 0x5, ... (stride = 0x1)
E.g. 2: 0x100, 0x300, 0x500, 0x700, 0x900, ... (stride = 0x200)
Prefetches predicted future addresses, e.g., current_address + stride*4 (see the sketch below)
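A toy software model of the stream detection just described: after seeing the same stride twice in a row, predict current_address + stride*4. The structure, threshold, and lookahead of 4 are illustrative, not any real prefetcher design.

#include <stdint.h>

struct stream { uint64_t last_addr; int64_t stride; int confirmed; };

/* Feed each observed address in; returns a prefetch target, or 0 if
 * no stream has been detected yet. */
static uint64_t observe(struct stream *s, uint64_t addr) {
    int64_t d = (int64_t)(addr - s->last_addr);
    s->confirmed = (d != 0 && d == s->stride);   /* same stride twice?  */
    s->stride    = d;
    s->last_addr = addr;
    return s->confirmed ? addr + 4 * (uint64_t)s->stride : 0;
}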

25 Core 2 Hardware Prefetching
[Figure: memory hierarchy, not drawn to scale: CPU registers; 32 KB L1 I-cache and D-cache; 6 MB unified L2; ~4 GB main memory; ~500 GB (?) disk; L1/L2 use 64 B blocks. Prefetchers operate L2->L1 for instructions and data, and memory->L2 for data]
Includes next-block prefetching and multiple streaming prefetchers
They will only prefetch within a page boundary (details are kept vague/secret)

26 Software Prefetching
Hardware provides special prefetch instructions, e.g., Intel's prefetchnta instruction
The compiler or programmer can insert them into the code
Software can prefetch patterns that hardware wouldn't recognize (non-strided)

void process_list(list_t *head) {
    list_t *p = head;
    while (p) {
        process(p);
        p = p->next;
    }
}

void process_list_PF(list_t *head) {
    list_t *p = head;
    list_t *q;
    while (p) {
        q = p->next;
        prefetch(q);      /* request the next node before processing this one */
        process(p);
        p = q;
    }
}

This assumes process() runs long enough to hide the prefetch latency
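The slide leaves prefetch() abstract; one plausible definition uses the GCC/Clang builtin (this mapping is my assumption, not the deck's). On x86 it can compile to an instruction such as prefetchnta, and it is only a hint: prefetching the NULL at the end of the list is safe because prefetch hints do not fault.

static inline void prefetch(const void *p) {
    __builtin_prefetch(p, 0, 0);   /* read access, low temporal locality */
}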

27 Summary: Optimizing for a Modern Memory Hierarchy

28 Memory Optimization: Summary
Caches:
Conflict misses: less of a concern due to high associativity (8-way L1, 16-way L2)
Cache capacity: the main concern; keep the working set within on-chip cache capacity, focusing on L1 or L2 depending on the required working-set size
Virtual memory:
Page misses: keep the "big-picture" working set within main memory capacity
TLB misses: may want to keep the working set's page count below the TLB entry count
Prefetching:
Try to arrange data structures and access patterns to favour sequential/strided access
Try compiler-inserted or manually inserted prefetch instructions

