Chang Hyun Park, Taekyung Heo, and Jaehyuk Huh


Efficient Synonym Filtering and Scalable Delayed Translation for Hybrid Virtual Caching
Chang Hyun Park, Taekyung Heo, and Jaehyuk Huh

Physical Caching
- Latency constraints limit TLB scalability: the TLB size is restricted, and each TLB entry has limited coverage
- Missed opportunities [1]: a memory access can miss in the TLB yet hit in the cache, so the TLB miss delays a cache hit
- Access path: Core -> TLB -> L1 $ -> Last-Level $ (physical address)
[1] Zhang et al., ICS 2010

Virtual Caching
- Delay translation (virtual caching): access the cache first, translate only on a miss; cache hits need no translation
- Problem: synonyms. Synonyms are rare [2], so optimize for the common case
- TLB accesses are reduced significantly, which loosens the TLB access-latency restriction, enables more sophisticated translation, and reduces power consumption
- Access path: Core -> L1 $ -> Last-Level $ -> TLB -> physical address (synonyms handled separately)
[2] Basu et al., ISCA 2012
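The access flow above can be sketched in a few lines; `MEMORY`, `TLB`, and `access` are illustrative names for this toy model, not interfaces from the paper:

```python
# Toy model of virtual caching: look up the cache with the virtual
# address first, and pay for a TLB translation only on a cache miss.
MEMORY = {0x1000: "A", 0x2000: "B"}   # physical address -> data
TLB = {0x10: 0x1000, 0x20: 0x2000}    # virtual page -> physical page

def access(vcache, vaddr):
    """Return (data, translated?) for a virtual-address cache access."""
    if vaddr in vcache:
        return vcache[vaddr], False    # cache hit: no translation needed
    paddr = TLB[vaddr]                 # cache miss: translate now
    data = MEMORY[paddr]
    vcache[vaddr] = data               # fill the virtual cache
    return data, True
```

The first access to a page translates and fills the cache; repeated accesses hit without touching the TLB, which is the common case the design optimizes for.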

Hybrid Virtual Caching
- Physical caching: Core -> TLB -> L1 $ -> Last-Level $ (translate before every cache access)
- Virtual caching: Core -> L1 $ -> Last-Level $ -> TLB (translate after a miss; synonyms are the problem)
- Hybrid virtual caching: Core -> L1 $ -> Last-Level $ -> delayed TLB, combining virtual caching for most pages with physical caching for synonyms and scalable delayed translation

Contributions
- Propose hybrid virtual-physical caching
  - The cache is populated by both virtual and physical blocks
  - Virtual caching for the common case, physical caching for synonyms
  - Synonyms are not confined to a fixed address range and can use the entire cache
- Propose scalable yet flexible delayed translation
  - Improve TLB entry scalability by employing segments [2][3]
  - Provide many segments for flexible memory management
  - Propose an efficient search mechanism to look up segments
[2] Basu et al., ISCA 2013 [3] Karakostas, Gandhi et al., ISCA 2015

Hybrid Virtual Caching
- A combined virtual and physical cache: each page is consistently determined to be physical or virtual, and a cache tag can hold either kind of address
- Challenge: the address type must be chosen before the cache access
- Synonym filter: a Bloom filter that detects synonyms; hardware-managed, populated by the OS
- Synonyms are always detected and translated to physical addresses; non-synonyms use virtual addresses and the delayed TLB
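A minimal Bloom-filter sketch of the synonym filter described above; the class name, bit-array size, and hash construction are assumptions for illustration, not the hardware design. The key property matches the slide: an inserted (synonym) page is always detected, while a false positive merely sends a non-synonym through physical translation unnecessarily.

```python
import hashlib

class SynonymFilter:
    """Bloom-filter sketch (illustrative, not the paper's hardware):
    no false negatives, occasional false positives."""
    def __init__(self, bits=1024, hashes=3):
        self.bits = bits
        self.hashes = hashes
        self.array = [False] * bits

    def _positions(self, page):
        # Derive `hashes` bit positions from the page number.
        for i in range(self.hashes):
            h = hashlib.sha256(f"{page}:{i}".encode()).digest()
            yield int.from_bytes(h[:4], "big") % self.bits

    def insert(self, page):
        # Called by the OS when a synonym mapping is created.
        for p in self._positions(page):
            self.array[p] = True

    def maybe_synonym(self, page):
        # Checked before every cache access to pick the address type.
        return all(self.array[p] for p in self._positions(page))
```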

Hybrid Virtual Caching Efficiency
- Pin-based simulation
- Baseline TLB: L1 TLB 64 entries, L2 TLB 1024 entries
- Hybrid virtual caching: 2 x 1 Kb synonym filters, synonym TLB 64 entries, delayed TLB 1024 entries
- Workloads: Apache, Ferret, Firefox, Postgres, SPECjbb

Hybrid Virtual Caching Efficiency (cont.)
- Synonym filter: 83.7-99.9% of TLB accesses bypassed; the majority of accesses go to the virtual cache
- Delayed translation: cache hits remove TLB accesses and reduce TLB misses, for up to a 99.9% TLB access reduction and up to a 69.7% TLB miss reduction

Limitation of the Delayed TLB
- TLB entries are limited in scalability: each entry maps a fixed granularity, so increasing the TLB size does not reduce misses as expected
- Since the TLB size is restricted, the coverage of each TLB entry must be improved

Segments: Scalable Translation
- Direct Segment [2] improves TLB entry coverage: a segment is represented by three values (base, limit, offset) and translates a contiguous memory region of any size
- The OS benefits from more available segments: memory sharing among processes fragments memory, so the OS can offer multiple smaller segments
- The number of segments [3] is limited by latency: segment lookup sits between the core and the L1 cache, and a fully associative lookup of all segments is required
[2] Basu et al., ISCA 2013 [3] Karakostas, Gandhi et al., ISCA 2015
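The (base, limit, offset) translation can be shown in a few lines; the segment values below are made up for illustration:

```python
def translate(segment, vaddr):
    """Direct Segment-style translation: a virtual address inside
    [base, limit) maps to vaddr + offset; anything else falls back
    to conventional paging (returned as None here)."""
    base, limit, offset = segment
    if base <= vaddr < limit:
        return vaddr + offset
    return None

# One segment mapping 256 KB of contiguous memory (toy values).
seg = (0x10000, 0x50000, 0x90000)
assert translate(seg, 0x10010) == 0xA0010   # inside the segment
assert translate(seg, 0x60000) is None      # outside: use paging
```

Because one segment covers an arbitrarily large contiguous region, a single entry replaces what would otherwise be many fixed-granularity TLB entries.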

Scalable Delayed Translation
- Exploit the reduced frequency of delayed translation: prior work was limited to tens of segments; provide thousands of segments for OS flexibility
- Efficient search for the owning segment is required: an OS-managed tree locates each segment in a hardware table, and a hardware walker traverses the tree to obtain the location (an index) used to access the segment in the table

Scalable Delayed Translation (cont.)
- Segment Table: register values (base, limit, offset, etc.) for many segments; searching all Segment Table entries associatively is infeasible
- Index Tree: a B-tree that maps a virtual address (key) to an index into the Segment Table (value)
- On an LLC miss for a non-synonym address, the Index Tree yields the segment index used to translate the memory access
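The Index Tree lookup can be sketched with a sorted array and binary search standing in for the OS-managed B-tree; the table contents are invented for illustration:

```python
import bisect

# Segment Table: index -> (base, limit, offset); toy, non-overlapping,
# sorted by base so a binary search can find the owning segment.
SEGMENT_TABLE = [
    (0x00000, 0x10000, 0x040000),
    (0x10000, 0x30000, 0x080000),
    (0x30000, 0x70000, 0x100000),
]
BASES = [base for base, _, _ in SEGMENT_TABLE]

def lookup_index(vaddr):
    """Return the Segment Table index owning vaddr, or None.
    The binary search stands in for the hardware B-tree walk."""
    i = bisect.bisect_right(BASES, vaddr) - 1
    if i >= 0 and vaddr < SEGMENT_TABLE[i][1]:
        return i
    return None
```

The point of the indirection is the same as on the slide: instead of comparing against every segment associatively, the lookup narrows to a single Segment Table entry.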

Scalable Delayed Translation (cont.)
- Index Cache: caches Index Tree nodes on-chip
- Hardware Walker: traverses the Index Tree to produce a Segment Table index

Address Translation Procedure
- Segment Cache: caches many segment translations
- On an LLC miss for a non-synonym address, a Segment Cache hit translates immediately; on a miss, the hardware walker traverses the Index Tree (through the Index Cache) to obtain the Segment Table index
- Reduces latency and power consumption
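The procedure above, as a sketch: check the Segment Cache first and fall back to the walker on a miss. `walk` is a stand-in for the hardware walker, not an interface from the paper:

```python
def delayed_translate(seg_cache, vpage, walk):
    """Segment Cache hit: return the cached Segment Table index.
    Miss: invoke the walker (Index Tree traversal) and cache the result."""
    if vpage in seg_cache:
        return seg_cache[vpage]        # fast path: no tree walk
    index = walk(vpage)                # HW walker traverses the Index Tree
    seg_cache[vpage] = index           # fill the Segment Cache
    return index
```

A second translation of the same page skips the tree walk entirely, which is where the latency and power savings come from.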

Evaluation
- Full-system out-of-order simulation on MARSSx86 + DRAMSim2, hosting Linux with 4 GB DDR3 RAM
- Three-level cache hierarchy (based on Intel CPUs)
- Baseline TLB configuration (based on Intel Haswell): L1 TLB 1 cycle, 64 entries, 4-way; L2 TLB 7 cycles, 1024 entries, 8-way
- Delayed TLB configurations range from 1K to 16K entries
- Many-segment translation configuration: Segment Table 2K entries, Index Cache 32 KB, Segment Cache 128 entries
- Benchmarks: SPEC CPU, NPB, BioBench, GUPS

Results
- Cache hits reduce TLB accesses and misses, improving performance

Results (cont.)
- Scalable delayed translation improves performance by 10.7% on average and reduces power consumption by 60% on average
- Increased translation scalability significantly reduces TLB misses
- The delayed TLB alone is not scalable for some workloads, though it offers some scalability for others

Conclusion
- Hybrid virtual caching allows delaying address translation: the majority of memory accesses use virtual caching, while synonyms use physical caching
- The synonym filter consistently and quickly identifies accesses to synonym pages, reducing up to 99.9% of TLB accesses and 69.7% of TLB misses
- Scalable delayed translation exploits the reduced translation frequency, providing many segments and efficient segment searching
- Average 10.7% performance improvement and 60% power savings

Thank You