CoLT: Coalesced Large-Reach TLBs
December 2012
Binh Pham §, Viswanathan Vaidyanathan §, Aamer Jaleel ǂ, Abhishek Bhattacharjee §
§ Rutgers University    ǂ VSSAD, Intel Corporation

Address Translation Primer

[Figure: per-core LSQ, L1 TLB, L2 TLB, and L1 cache in front of a shared last-level cache; one LLC line holds 8 PTEs.]

On a TLB miss:
- x86: 1-4 memory references
- ARM: 1-2 memory references
- Sparc: 1-2 memory references
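To ground those miss costs, here is a minimal C sketch of a 4-level radix page walk in the style of x86-64 (the constants, names, and load callback are our assumptions, not the slides'); each level costs one memory reference, which is why an x86 TLB miss can take up to four.

```c
#include <stdint.h>

#define LEVELS        4
#define IDX_BITS      9                      /* 512 entries per table */
#define PAGE_SHIFT    12                     /* 4 KB base pages */
#define PTE_PRESENT   0x1ULL
#define PTE_ADDR_MASK 0x000FFFFFFFFFF000ULL

/* Walk the radix page table; each iteration is one memory reference,
 * so a last-level TLB miss costs up to LEVELS loads before the data
 * access itself. */
uint64_t page_walk(uint64_t root, uint64_t vaddr,
                   uint64_t (*load)(uint64_t paddr))
{
    uint64_t table = root;
    for (int level = LEVELS - 1; level >= 0; level--) {
        unsigned idx = (unsigned)((vaddr >> (PAGE_SHIFT + level * IDX_BITS))
                                  & ((1u << IDX_BITS) - 1));
        uint64_t pte = load(table + idx * sizeof(uint64_t));
        if (!(pte & PTE_PRESENT))
            return 0;                        /* would fault in real hardware */
        table = pte & PTE_ADDR_MASK;
    }
    return table | (vaddr & ((1ULL << PAGE_SHIFT) - 1));
}
```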

Address Translation Performance Impact

Address translation performance overhead: 10-15%
- Clark & Emer [Trans. on Comp. Sys. 1985]
- Talluri & Hill [ASPLOS 1994]
- Barr, Cox & Rixner [ISCA 2011]

Emerging software trends:
- Virtualization: 2D page walks, 89% overheads [Bhargava et al., ASPLOS 2008]

Emerging hardware trends:
- LLC-to-TLB capacity ratios increasing
- Manycore/hyperthreading increases TLB and LLC PTE stress

Contiguity & CoLT

[Figure: page-table mappings at three contiguity levels, each paired with the structure that exploits it: a large-page TLB, a standard TLB, and a coalesced TLB.]

- High contiguity: large pages
- Low contiguity: TLB organization, prefetching
- Intermediate contiguity: CoLT, with low HW/SW overhead; eliminates 40-50% of TLB misses for a 14% performance gain
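The coalesced-TLB entry on this slide can be sketched in C as follows; this is a simplified model of our own (field names and widths are assumptions, not the paper's hardware): one entry records a base VPN, a base PPN, and a run length, and a hit derives the physical page from the offset within the run.

```c
#include <stdbool.h>
#include <stdint.h>

/* One coalesced entry maps a run of contiguous translations:
 * VPN vpn_base..vpn_base+len-1  ->  PPN ppn_base..ppn_base+len-1. */
typedef struct {
    uint64_t vpn_base;   /* first virtual page of the run  */
    uint64_t ppn_base;   /* first physical page of the run */
    uint8_t  len;        /* number of coalesced mappings   */
    bool     valid;
} colt_entry;

/* Hit test: the offset inside the run gives the physical page directly. */
static bool colt_hit(const colt_entry *e, uint64_t vpn, uint64_t *ppn)
{
    if (!e->valid || vpn < e->vpn_base || vpn >= e->vpn_base + e->len)
        return false;
    *ppn = e->ppn_base + (vpn - e->vpn_base);
    return true;
}
```

One entry thus covers up to len translations, which is how CoLT stretches TLB reach without superpages.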

Intermediate Contiguity: Past Work and Our Goals

Past work:
- TLB sub-blocking: Talluri & Hill [ASPLOS 1994]
- SpecTLB: Barr, Cox & Rixner [ISCA 2011]
- Overheads from either HW or SW
- Alignment and special placement requirements

CoLT goals:
- Low-overhead HW
- No change to SW
- No alignment requirements

Outline

Intermediate contiguity:
- Why does it exist?
- How much exists in real systems?
- How do you exploit it in hardware?
- How much can it improve performance?

Conclusion

Why does Intermediate Contiguity Exist?

- Buddy allocator
- Memory compaction

[Figure: buddy-allocator free lists (List 0 to List 3) over physical memory PFNs 0-7, and memory compaction migrating movable pages so free pages form contiguous runs.]
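As a toy illustration of the buddy-allocator effect (our own example, not the kernel's actual allocator), splitting one free order-2 block into base pages hands out physically consecutive PFNs:

```c
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    uint64_t block_pfn = 4;   /* hypothetical free order-2 block: PFNs 4-7 */
    unsigned order = 2;       /* 2^2 = 4 base pages */

    /* Successive single-page allocations carved from the split block
     * come back with consecutive physical frame numbers. */
    for (uint64_t i = 0; i < (1ULL << order); i++)
        printf("alloc -> PFN %llu\n", (unsigned long long)(block_pfn + i));

    /* If the OS maps these to consecutive VPNs, the result is a
     * four-page run that CoLT can coalesce into one TLB entry. */
    return 0;
}
```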

Real System Experiments

Study real-system contiguity by varying:
- Superpages on or off
- Memory compaction daemon invocations
- System load

Real system configuration:
- CPU: Intel Core i7, 64-entry L1 TLBs, 512-entry L2 TLB
- Memory: 3 GB
- OS: Fedora 15, Linux kernel

How Much Intermediate Contiguity Exists?

[Figure: measured page-table contiguity under each system configuration.]

- Exploitable contiguity exists across all system configurations
- Not enough contiguity for superpages

How do you Exploit it in Hardware?

[Figure: animated walkthrough of a reference stream to virtual pages 0-5 against a standard TLB and a coalesced TLB; on each miss, coalescing logic scans the LLC cache line holding the PTEs for VPNs 0 to 7.]

Page table: VPN 0 → PPN 10, VPN 1 → PPN 11, VPN 2 → PPN 20, VPN 3 → PPN 21, VPN 4 → PPN 22, VPN 5 → PPN 23

- Standard TLB: every distinct virtual page in the stream misses
- Coalesced TLB: each fill installs a run of contiguous translations, (0,10 : 1,11), (2,20 : 3,21), (4,22 : 5,23), so later references within a run hit
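A hedged sketch of the fill-time coalescing this walkthrough implies (the function and structure names are our guesses at the logic, not the paper's design): on a miss, the logic scans the LLC line holding eight PTEs and grows a run of consecutive PPNs around the missing translation.

```c
#include <stdint.h>

#define PTES_PER_LINE 8   /* one LLC line holds 8 PTEs, as in the primer */

typedef struct { uint64_t vpn_base, ppn_base; uint8_t len; } colt_fill;

/* Given the PPNs from the fetched PTE line, the base VPN of that line,
 * and the index of the missing translation within it, return the widest
 * run of consecutive PPNs containing the miss. */
colt_fill coalesce_line(const uint64_t ppn[PTES_PER_LINE],
                        uint64_t line_vpn_base, unsigned miss_idx)
{
    unsigned lo = miss_idx, hi = miss_idx;
    /* Grow the run left and right while PPNs stay consecutive. */
    while (lo > 0 && ppn[lo - 1] + 1 == ppn[lo]) lo--;
    while (hi + 1 < PTES_PER_LINE && ppn[hi] + 1 == ppn[hi + 1]) hi++;
    return (colt_fill){ line_vpn_base + lo, ppn[lo],
                        (uint8_t)(hi - lo + 1) };
}
```

Because the scan happens only on fills and reads a line the walk already fetched, it adds no page walks and no TLB ports, matching the next slide's complexity claims.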

CoLT for Set Associative TLBs

Hardware complexity:
- Modest lookup complexity
- No additional ports
- Coalesce on fill to reduce overhead
- No additional page walks

[Figure: lookup for virtual page 5 against coalesced entries (tag, valid, attribute, base physical); combinational logic adds the offset within the run to the base physical page, and fills pull the PTEs for VPNs 0 to 7 from the LLC through the coalescing logic.]

CoLT Set Associative Miss Rates

[Figure: L2 TLB miss eliminations as the set index is shifted.]

- Left-shifting the index by 2 bits is the best compromise between coalescing opportunity and conflict misses
- Roughly 50% of misses eliminated on average
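A small C sketch of the indexing trick (the set count and names are hypothetical): skipping the low two VPN bits when forming the set index lands four consecutive VPNs in the same set, so one coalesced entry can cover all of them; shifting further helps coalescing but costs conflict misses.

```c
#include <stdint.h>
#include <stdio.h>

#define SET_BITS 7   /* hypothetical 128-set L2 TLB */
#define SHIFT    2   /* skip the low 2 VPN bits, per the slide */

/* Consecutive VPNs that differ only in the low SHIFT bits map to the
 * same set, keeping a coalesced run inside one set. */
static unsigned tlb_set_index(uint64_t vpn)
{
    return (unsigned)((vpn >> SHIFT) & ((1u << SET_BITS) - 1));
}

int main(void)
{
    for (uint64_t vpn = 0; vpn < 8; vpn++)   /* VPNs 0-3 share set 0, 4-7 share set 1 */
        printf("VPN %llu -> set %u\n",
               (unsigned long long)vpn, tlb_set_index(vpn));
    return 0;
}
```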

Different CoLT Implementations

- Set-associative TLB (CoLT-SA): low hardware cost, but caps coalescing opportunity
- Fully-associative TLB (CoLT-FA): no indexing scheme, so high coalescing opportunity; more complex hardware (we use half the baseline TLB size)
- Hybrid scheme (CoLT-All): CoLT-SA for limited coalescing, CoLT-FA for high coalescing

How Much Can it Improve Performance?

[Figure: performance of the CoLT configurations relative to a perfect TLB.]

- CoLT gets us half-way to a perfect TLB's performance

Conclusions

- Buddy allocation, memory compaction, large pages, and system load create intermediate contiguity
- CoLT uses modest hardware to eliminate 40-50% of TLB misses
- Average performance improvement of 14%
- CoLT suggests: re-examining highly-associative TLBs? CoLT in virtualization, where the question becomes how the hypervisor allocates physical memory?

Thank you!

Impact of Increasing Associativity

[Backup figure: effect of increasing TLB associativity.]

Comparison with Sub-blocking

Sub-blocking:
- Uses one TLB entry to keep information about multiple mappings
- Complete sub-blocking: no OS modification, big TLB
- Partial sub-blocking: OS modification, small TLB, special placement required, e.g. for pages x and y to share an entry:
  VPN(x)/N = VPN(y)/N; PPN(x)/N = PPN(y)/N; VPN(x)%N = PPN(x)%N; VPN(y)%N = PPN(y)%N

CoLT:
- No alignment or special placement required
- Low-overhead hardware, no software overhead
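The placement constraint above reads naturally as a predicate; this C rendering is ours, directly transcribing the slide's equations: two mappings can share a partial sub-block entry only when every test passes, whereas CoLT imposes no such test.

```c
#include <stdbool.h>
#include <stdint.h>

/* Partial sub-blocking placement check for pages x and y with
 * sub-block factor N: both pages must fall in the same aligned group
 * of N virtually and physically, and each page's offset within the
 * group must match between its VPN and PPN. */
static bool subblock_compatible(uint64_t vpn_x, uint64_t ppn_x,
                                uint64_t vpn_y, uint64_t ppn_y,
                                uint64_t N)
{
    return vpn_x / N == vpn_y / N &&   /* same virtual group  */
           ppn_x / N == ppn_y / N &&   /* same physical group */
           vpn_x % N == ppn_x % N &&   /* x aligned in group  */
           vpn_y % N == ppn_y % N;     /* y aligned in group  */
}
```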