Lecture 17: Case Studies
Topics: case studies for virtual memory and cache hierarchies (Sections 5.10-5.17)

Alpha Paged Virtual Memory
Each process has a virtual memory space with three segments: seg0 (reserved for user text and data), kseg (reserved for the kernel), and seg1 (reserved for page tables)
The Alpha uses separate instruction and data TLBs
The TLB entries can be used to map pages of different sizes

Example Look-Up
(Figure: a virtual page in virtual memory is translated through the TLB and the page table's PTEs to a physical page in physical memory.)
Virtual page abc → physical page xyz
If each PTE is 8 bytes, the PTE for abc is at virtual address abc × 8 = lmn
Virtual address lmn → physical address pqr
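
A minimal sketch of the arithmetic above, in C. The page-table base address PT_BASE is a hypothetical constant standing in for wherever seg1 places the table; abc and lmn in the slide are symbolic.

    /* PTE address computation: with 8-byte PTEs, the PTE for virtual
       page vpn lives at (table base) + vpn * 8. */
    #include <stdint.h>
    #include <stdio.h>

    #define PTE_SIZE 8                  /* bytes per page table entry   */
    #define PT_BASE  0x200000000ULL     /* hypothetical page-table base */

    uint64_t pte_vaddr(uint64_t vpn) {
        return PT_BASE + vpn * PTE_SIZE;    /* the slide's abc * 8 = lmn */
    }

    int main(void) {
        uint64_t abc = 0x1234;              /* example virtual page number */
        printf("PTE for virtual page 0x%llx is at virtual address 0x%llx\n",
               (unsigned long long)abc, (unsigned long long)pte_vaddr(abc));
        return 0;
    }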

Alpha Address Mapping
(Figure: the 64-bit virtual address is split into 21 unused bits, three 10-bit level indices, and a 13-bit page offset. The page table base register plus the Level-1 index selects a PTE in the L1 page table; that PTE plus the Level-2 index selects a PTE in the L2 page table; that PTE plus the Level-3 index selects the final PTE in the L3 page table. The final PTE's 32-bit physical page number, concatenated with the 13-bit page offset, forms the 45-bit physical address.)
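
A sketch in C of the bit-field split implied by the figure (the example address is arbitrary):

    /* Decompose a 64-bit Alpha virtual address per the figure:
       21 unused bits | 10-bit L1 | 10-bit L2 | 10-bit L3 | 13-bit offset */
    #include <stdint.h>
    #include <stdio.h>

    int main(void) {
        uint64_t va  = 0x000001ABCDEF1234ULL;  /* example virtual address   */
        uint64_t off =  va        & 0x1FFF;    /* bits  0-12: page offset   */
        uint64_t l3  = (va >> 13) & 0x3FF;     /* bits 13-22: level-3 index */
        uint64_t l2  = (va >> 23) & 0x3FF;     /* bits 23-32: level-2 index */
        uint64_t l1  = (va >> 33) & 0x3FF;     /* bits 33-42: level-1 index */
        printf("L1=%llu L2=%llu L3=%llu offset=%llu\n",
               (unsigned long long)l1, (unsigned long long)l2,
               (unsigned long long)l3, (unsigned long long)off);
        return 0;
    }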

Alpha Address Mapping
Each PTE is 8 bytes – if the page size is 8KB, a page can contain 1024 PTEs – 10 bits to index into each level
If the page size doubles, we need 47 bits of virtual address (a 14-bit offset and 11 index bits per level)
Since a PTE only stores 32 bits of physical page number, physical memory can be addressed by at most 32 + offset bits (45 bits with 8KB pages)
The first two levels of the page table are stored in physical memory; the third is in virtual memory
Why the three-level structure? Even a flat structure would need PTEs for the PTEs, and those would have to be stored in physical memory – more levels of indirection make it easier to dynamically allocate pages
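
The bit arithmetic above can be checked mechanically; a short sketch using only numbers from the slide (8-byte PTEs, 8KB and 16KB pages):

    /* With an s-byte page and 8-byte PTEs, one page holds s/8 PTEs, so
       each level is indexed by log2(s/8) bits; the usable virtual
       address is 3 levels * index bits + offset bits. */
    #include <stdio.h>

    int main(void) {
        for (int page_bits = 13; page_bits <= 14; page_bits++) { /* 8KB, 16KB */
            int index_bits = page_bits - 3;     /* log2(page size / 8B PTE) */
            int va_bits = 3 * index_bits + page_bits;
            printf("page=%2dKB: %d index bits per level, %d-bit virtual address\n",
                   1 << (page_bits - 10), index_bits, va_bits);
        }
        return 0;   /* prints 43 bits for 8KB pages, 47 bits for 16KB */
    }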

Bandwidth
Out-of-order superscalar processors can issue 4+ instructions per cycle → 2+ loads/stores per cycle → caches must provide low latency and high bandwidth
With effective caches, memory bandwidth requirements are usually low; unfortunately, bandwidth is easier to improve than latency
RDRAM improved memory bandwidth by a factor of eight, but improved performance by less than 2% for most applications and by 15% for some graphics apps
Bandwidth can help if you prefetch aggressively

Cache Bandwidth
(Figure: a multi-ported cell gives a single L1 D cache two ports at a 3-cycle access time; an interleaved cache splits the L1 D into two 1-ported banks, one for odd words and one for even words, keeping a 2-cycle access time.)
An interleaved cache has similar area to a 1-ported cache
More complexity in routing addresses/data
Slight penalty when both words conflict for the same bank
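
A sketch of how word interleaving steers accesses, assuming 8-byte words and two banks (parameters the slide does not pin down):

    /* Word-interleaved banking: even-numbered words map to bank 0, odd
       words to bank 1; two accesses proceed in parallel unless they
       pick the same bank. */
    #include <stdint.h>
    #include <stdio.h>

    static int bank_of(uint64_t addr) {
        return (addr >> 3) & 1;   /* parity of the 8-byte word index */
    }

    int main(void) {
        uint64_t a = 0x1000, b = 0x1008;     /* two adjacent 8-byte words */
        if (bank_of(a) != bank_of(b))
            printf("different banks: both accesses proceed this cycle\n");
        else
            printf("bank conflict: one access pays a slight penalty\n");
        return 0;
    }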

Prefetching
High memory latency and cache misses are unavoidable
Prefetching is one of the most effective ways to hide memory latency
Some programs are hard to prefetch for – unpredictable branches, irregular traversal of arrays, hash tables, pointer-based data structures
Aggressive prefetching can pollute the cache and can compete for memory bandwidth
Prefetch designs exist for: (i) array accesses, (ii) pointers

Stride Prefetching
Constant strides are relatively easy to detect
Keep track of the last address fetched by a PC – compare with the current address to confirm a constant stride
Every access triggers a fetch of the next word – in fact, the prefetcher tries to stay far enough ahead to entirely hide memory latency
Prefetched words are stored in a buffer to reduce cache pollution
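
A minimal sketch of a PC-indexed stride prefetcher; the table size and the rule of confirming after one repeated stride are illustrative assumptions, not a specific machine's design:

    #include <stdint.h>
    #include <stdio.h>

    #define TABLE_SIZE 64

    /* One entry per (hashed) load PC: last address seen and last stride. */
    struct entry { uint64_t last_addr; int64_t stride; int confirmed; };
    static struct entry table[TABLE_SIZE];

    /* Train on an access; returns an address to prefetch, or 0 if the
       stride has not repeated yet. */
    uint64_t train(uint64_t pc, uint64_t addr) {
        struct entry *e = &table[(pc >> 2) % TABLE_SIZE];
        int64_t stride = (int64_t)(addr - e->last_addr);
        e->confirmed = (stride == e->stride);  /* same stride twice in a row */
        e->stride = stride;
        e->last_addr = addr;
        return e->confirmed ? addr + (uint64_t)stride : 0;
    }

    int main(void) {
        /* A load at PC 0x400 walking an array with an 8-byte stride. */
        for (uint64_t a = 0x10000; a < 0x10040; a += 8) {
            uint64_t p = train(0x400, a);
            if (p) printf("access 0x%llx -> prefetch 0x%llx\n",
                          (unsigned long long)a, (unsigned long long)p);
        }
        return 0;
    }

A real prefetcher would run several strides ahead to fully hide latency, as the slide notes; this sketch only issues the next address.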

Cache Power Consumption
Instruction caches can save decode time and power by storing instructions in decoded form (trace caches)
Memory accesses are power hungry – caches can also help reduce power consumption

Alpha 21264 Instruction Hierarchy
When powered on, initialization code is read from an external PROM and executed in privileged architecture library (PAL) mode with no virtual memory
The I-cache is virtually indexed and virtually tagged – this avoids an I-TLB look-up on every access – correctness is not compromised because instructions are not modified
Each I-cache block stores 11 bits to predict the index of the next set to be accessed and 1 bit to predict the way – line and way prediction
An I-cache miss looks up a prefetch buffer and a 128-entry fully-associative TLB before accessing L2
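
A sketch of the per-block prediction state described above; the struct and field names are invented for illustration:

    #include <stdio.h>

    /* Line and way prediction: each I-cache block carries a guess for
       where the next fetch will hit, read out before the real next PC
       is known; a wrong guess costs a re-fetch, not correctness. */
    struct icache_block {
        unsigned next_index : 11;  /* predicted index of the next set */
        unsigned next_way   : 1;   /* predicted way within that set   */
    };

    int main(void) {
        struct icache_block blk = { .next_index = 0x2A3, .next_way = 1 };
        printf("predicted next fetch: set 0x%x, way %u\n",
               (unsigned)blk.next_index, (unsigned)blk.next_way);
        return 0;
    }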

21264 Cache Hierarchy
The L2 cache is off-chip and direct-mapped (the 21364 moves the L2 on to the chip)
Every L2 fetch also fetches the next four physical blocks (without exceeding the page boundary)
L2 is write-back
The processor has a 128-bit data path to L2 and a 64-bit data path to memory
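
A sketch of the "next four physical blocks, without crossing the page" rule; the 64-byte blocks and 8KB pages here are assumptions, not figures from the slide:

    #include <stdint.h>
    #include <stdio.h>

    #define BLOCK 64
    #define PAGE  8192

    int main(void) {
        uint64_t miss = 0x12341F80;   /* a block address near the page end */
        uint64_t page_end = (miss / PAGE + 1) * PAGE;
        for (int i = 1; i <= 4; i++) {
            uint64_t pf = miss + (uint64_t)i * BLOCK;
            if (pf >= page_end) break;         /* stop at the page boundary */
            printf("prefetch block at 0x%llx\n", (unsigned long long)pf);
        }
        return 0;
    }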

21264 Data Cache
The L1 data cache is write-back, virtually indexed, physically tagged, and backed by a victim buffer
On a miss, the processor checks other L1 cache locations for a synonym in parallel with the L2 look-up (recall the two alternative techniques to deal with the synonym problem)
There is no prefetching for data cache misses
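
A sketch of the synonym search, under an assumed geometry (not taken from the slide): 8KB pages (13 offset bits) and a 32KB virtually indexed way (15 index+offset bits), so two index bits come from the virtual page number and a physical line can sit under four different indices:

    #include <stdint.h>
    #include <stdio.h>

    #define PAGE_BITS 13   /* address bits shared by virtual and physical */
    #define WAY_BITS  15   /* index + offset bits of one 32KB way         */

    int main(void) {
        uint64_t va   = 0x2A6F0;                         /* example address   */
        uint64_t base = va & ((1ULL << PAGE_BITS) - 1);  /* physical bits     */
        uint64_t self = va & ((1ULL << WAY_BITS)  - 1);  /* index of the miss */
        for (uint64_t v = 0; v < (1ULL << (WAY_BITS - PAGE_BITS)); v++) {
            uint64_t cand = (v << PAGE_BITS) | base;
            if (cand != self)       /* the other indices a synonym may use */
                printf("check L1 index 0x%llx for a synonym\n",
                       (unsigned long long)cand);
        }
        return 0;
    }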

21264 Performance
21164: 8KB L1s and a 96KB L2; 21264: 64KB L1s and an off-chip 1MB L2
The 21264 is out-of-order and can tolerate L1 misses → the speedup is a function of 21164 L2 misses that are captured by the 21264’s L2
Commercial database/server applications stress the memory system much more than SPEC/desktop applications

Sun Fire 6800 Server
Intended for commercial applications → aggressive memory hierarchy design:
8 MB off-chip L2
Wide buses going to L2 and memory for bandwidth
On-chip memory controller to reduce latency
On-chip L2 tags to save latency on a miss
ECC and parity bits for all external traffic to provide high reliability
Large store buffers (write caches) between L1 and L2
A data prefetch engine that detects strides
Instruction prefetch that stays one block ahead of decode
Two parallel TLBs: 128-entry 4-way and 16-entry fully-associative
