Lecture 17: Case Studies Topics: case studies for virtual memory and cache hierarchies (Sections 5.10-5.17)


1 Lecture 17: Case Studies
Topics: case studies for virtual memory and cache hierarchies (Sections 5.10-5.17)

2 Alpha Paged Virtual Memory
Each process has the following virtual memory space:
seg0: reserved for user text, data
kseg: reserved for kernel
seg1: reserved for page tables
The Alpha uses separate instruction and data TLBs
The TLB entries can be used to map pages of different sizes

3 Example Look-Up
[Figure: TLB, PTEs, virtual memory, physical memory]
Virtual page abc → physical page xyz
If each PTE is 8 bytes, location of PTE for abc is at virtual address abc/8 = lmn
Virtual addr lmn → physical addr pqr

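4 Alpha Address Mapping
[Figure: the virtual address is divided into unused bits (21 bits), Level 1 (10 bits), Level 2 (10 bits), Level 3 (10 bits), and page offset (13 bits). The page table base register plus the Level 1 field selects a PTE in the L1 page table; that PTE plus the Level 2 field selects a PTE in the L2 page table; that PTE plus the Level 3 field selects a PTE in the L3 page table. The final PTE supplies a 32-bit physical page number, which is combined with the page offset to form the 45-bit physical address.]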
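The field layout in the figure can be sketched in a few lines of Python. This is only an illustration of the bit slicing: the function names `split_virtual_address` and `physical_address` are hypothetical, and the page table walk itself (reading the three PTEs) is not modeled.

```python
PAGE_OFFSET_BITS = 13  # 8KB pages, as on the slide
LEVEL_BITS = 10        # 10 index bits per page table level, as on the slide


def split_virtual_address(va):
    """Return (level1, level2, level3, offset) for a 64-bit virtual address.

    The upper 21 bits are unused in this layout and are ignored here.
    """
    level_mask = (1 << LEVEL_BITS) - 1
    offset = va & ((1 << PAGE_OFFSET_BITS) - 1)
    level3 = (va >> PAGE_OFFSET_BITS) & level_mask
    level2 = (va >> (PAGE_OFFSET_BITS + LEVEL_BITS)) & level_mask
    level1 = (va >> (PAGE_OFFSET_BITS + 2 * LEVEL_BITS)) & level_mask
    return level1, level2, level3, offset


def physical_address(physical_page_number, offset):
    """Combine a 32-bit physical page number with the 13-bit page offset."""
    return (physical_page_number << PAGE_OFFSET_BITS) | offset
```

With the largest 32-bit page number and the largest offset, `physical_address` yields a value that fits in exactly 45 bits, matching the figure.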

5 Alpha Address Mapping
Each PTE is 8 bytes; if the page size is 8KB, a page can contain 1024 PTEs, so 10 bits index into each level
If the page size doubles, we need 47 bits of virtual address (11 bits for each of the three levels plus a 14-bit page offset)
Since a PTE only stores 32 bits of physical page number, the physical memory can be addressed by at most 32 + offset bits
First two levels are in physical memory; third is in virtual
Why the three-level structure? Even a flat structure would need PTEs for the PTEs that would have to be stored in physical memory; more levels of indirection make it easier to dynamically allocate pages
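The bit counts on this slide follow from the page size and the 8-byte PTE. A minimal sketch of that arithmetic, assuming power-of-two page sizes; the function names are made up for illustration:

```python
PTE_BYTES = 8  # PTE size stated on the slide


def virtual_address_bits(page_bytes, levels=3):
    """Virtual address bits covered by a multi-level table.

    Each level is one page full of PTEs, so it is indexed with
    log2(page_bytes / PTE_BYTES) bits. Assumes page_bytes is a power of two.
    """
    offset_bits = page_bytes.bit_length() - 1
    index_bits = (page_bytes // PTE_BYTES).bit_length() - 1
    return levels * index_bits + offset_bits


def max_physical_address_bits(page_bytes, ppn_bits=32):
    """Physical address bits: physical page number bits plus page offset bits."""
    return ppn_bits + page_bytes.bit_length() - 1
```

For example, `virtual_address_bits(8 * 1024)` gives 43 (3 x 10 + 13) and `virtual_address_bits(16 * 1024)` gives 47 (3 x 11 + 14), while `max_physical_address_bits(8 * 1024)` gives 45.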

6 Bandwidth
Out-of-order superscalar processors can issue 4+ instrs per cycle → 2+ loads/stores per cycle → caches must provide low latency and high bandwidth
Effective caches → memory bandwidth requirements are usually low; unfortunately, memory bandwidth is easier to improve than memory latency
RDRAM improved memory bandwidth by a factor of eight, but improved performance by less than 2% for most applications and by 15% for some graphics apps
Bandwidth can help if you prefetch aggressively

7 Cache Bandwidth
Multi-ported cell
[Figure: an L1 D cache with 2 ports and an L1 D cache with 1 port; the access times shown are 2 cycles and 3 cycles]
Interleaved cache
[Figure: two 1-ported L1 D banks, one holding odd words and one holding even words]
Similar area to a 1-ported cache
More complexity in routing addresses/data
Slight penalty when both words conflict for the same bank
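A rough sketch of how an interleaved cache picks a bank and when two accesses conflict. The 8-byte word size and the one-extra-cycle conflict cost are assumptions made for illustration, not figures from the slide.

```python
WORD_BYTES = 8  # assumed word size
NUM_BANKS = 2   # even-word bank and odd-word bank, as on the slide


def bank_of(addr):
    """Bank 0 holds even words, bank 1 holds odd words."""
    return (addr // WORD_BYTES) % NUM_BANKS


def cycles_for_pair(addr_a, addr_b):
    """Cycles needed to start two accesses issued together.

    Accesses to different banks proceed in parallel; accesses to the same
    bank are serialized (assumed cost: one extra cycle).
    """
    return 2 if bank_of(addr_a) == bank_of(addr_b) else 1
```

Two loads to adjacent words land in different banks and proceed together, while two loads that are two words apart collide in the same bank.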

8 Prefetching
High memory latency and cache misses are unavoidable
Prefetching is one of the most effective ways to hide memory latency
Some programs are hard to prefetch for: unpredictable branches, irregular traversal of arrays, hash tables, pointer-based data structures
Aggressive prefetching can pollute the cache and can compete for memory bandwidth
Prefetch design for: (i) array accesses, (ii) pointers

9 Stride Prefetching
Constant strides are relatively easy to detect
Keep track of last address fetched by a PC; compare with current address to confirm constant stride
Every access triggers a fetch of the next word; in fact, the prefetcher tries to stay ahead enough to entirely hide memory latency
Prefetched words are stored in a buffer to reduce cache pollution
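The detection scheme described above might be modeled as follows. This is a simplified software sketch, not a description of any particular hardware: the class name, the unbounded per-PC table, and the `degree` parameter (how many strides to run ahead) are all assumptions.

```python
class StridePrefetcher:
    """Toy per-PC stride detector with a separate prefetch buffer."""

    def __init__(self, degree=1):
        self.table = {}       # pc -> (last address, last observed stride)
        self.degree = degree  # how many strides ahead to prefetch (assumed knob)
        self.buffer = set()   # prefetched addresses, kept out of the cache

    def access(self, pc, addr):
        """Record an access by the instruction at `pc`; return new prefetches."""
        prefetches = []
        if pc in self.table:
            last_addr, last_stride = self.table[pc]
            stride = addr - last_addr
            # A stride is confirmed when it repeats on consecutive accesses.
            if stride != 0 and stride == last_stride:
                prefetches = [addr + stride * i
                              for i in range(1, self.degree + 1)]
                self.buffer.update(prefetches)
            self.table[pc] = (addr, stride)
        else:
            self.table[pc] = (addr, 0)
        return prefetches
```

A load that touches addresses 100, 108, 116 triggers prefetches only on the third access, once the stride of 8 has been seen twice. A larger `degree` lets the prefetcher run further ahead, at the cost of more bandwidth.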

10 Cache Power Consumption
Instruction caches can save on decode time and power by storing instructions in decoded form (trace caches)
Memory accesses are power hungry; caches can also help reduce power consumption

11 Alpha 21264 Instruction Hierarchy
When powered on, initialization code is read from an external PROM and executed in privileged architecture library (PAL) mode with no virtual memory
The I-cache is virtually indexed and virtually tagged; this avoids I-TLB look-up for every access; correctness is not compromised because instructions are not modified
Each I-cache block saves 11 bits to predict the index of the next set that is going to be accessed and 1 bit to predict the way (line and way prediction)
An I-cache miss looks up a prefetch buffer and a 128-entry fully-associative TLB before accessing L2

12 21264 Cache Hierarchy
The L2 cache is off-chip and direct-mapped (the 21364 moves L2 on to chip)
Every L2 fetch also fetches the next four physical blocks (without exceeding the page boundary)
L2 is write-back
The processor has a 128-bit data path to L2 and 64-bit data path to memory
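The next-four-blocks rule with the page-boundary limit can be illustrated with a short sketch. The 64-byte block size is an assumption; the 8KB page size comes from the earlier address-mapping slides. The function name is hypothetical.

```python
BLOCK_BYTES = 64   # assumed block size
PAGE_BYTES = 8192  # 8KB pages


def blocks_to_fetch(phys_addr, extra=4):
    """Block addresses fetched for a miss at phys_addr.

    Returns the demand block followed by up to `extra` sequential blocks,
    dropping any block that would fall in the next physical page.
    """
    first = phys_addr - phys_addr % BLOCK_BYTES
    page_end = (phys_addr // PAGE_BYTES + 1) * PAGE_BYTES
    candidates = [first + i * BLOCK_BYTES for i in range(extra + 1)]
    return [block for block in candidates if block < page_end]
```

A miss in the middle of a page returns five block addresses, while a miss in the second-to-last block of a page returns only two, because the remaining candidates lie in the next page. Stopping at the page boundary matters because the next physical page need not belong to the same virtual address range.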

13 21264 Data Cache
The L1 data cache is write-back, virtually indexed, physically tagged, and backed up by a victim buffer
On a miss, the processor checks other L1 cache locations for a synonym in parallel with L2 look-up (recall two alternative techniques to deal with the synonym problem)
No prefetching for data cache misses
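To see why more than one L1 location may have to be checked for a synonym, the sketch below lists the sets in which a given physical block could reside when some index bits come from the virtual page number. The 64KB capacity and 8KB page size come from these slides; the 2-way associativity and 64-byte block size used as defaults are assumptions, and the function name is hypothetical.

```python
def synonym_candidate_sets(vaddr, cache_bytes=64 * 1024, ways=2,
                           block_bytes=64, page_bytes=8192):
    """Sets where a block could live in a virtually indexed cache.

    Index bits inside the page offset are identical for every synonym;
    index bits above the page offset can differ, so each combination of
    those bits is a candidate set.
    """
    way_bytes = cache_bytes // ways
    num_sets = way_bytes // block_bytes
    sets_per_page = page_bytes // block_bytes
    aliases = max(1, way_bytes // page_bytes)
    # Part of the set index that is fixed by the page offset.
    low = (vaddr // block_bytes) % min(sets_per_page, num_sets)
    return [low + k * sets_per_page for k in range(aliases)]
```

Under these assumed parameters each way spans four pages, so a block has four candidate sets: the one the virtual address indexes and three others that a synonym could have used. When a way is no larger than a page, the list has a single entry and no synonym check is needed.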

14 21264 Performance
21164: 8KB L1s and 96KB L2; 21264: 64KB L1 and off-chip 1MB L2
The 21264 is out of order and can tolerate L1 misses → speedup is a function of L2 misses that are captured by 21264’s L2
Commercial database/server applications stress the memory system much more than SPEC/desktop applications

15 Sun Fire 6800 Server
Intended for commercial applications → aggressive memory hierarchy design
8 MB off-chip L2
Wide buses going to L2 and memory for bandwidth
On-chip memory controller to reduce latency
On-chip L2 tags to save latency on a miss
ECC and parity bits for all external traffic to provide high reliability
Large store buffers (write caches) between L1 and L2
Data prefetch engine that detects strides
Instr prefetch that stays one block ahead of decode
Two parallel TLBs: 128-entry 4-way and 16-entry fully-associative
