Address Translation for Manycore Systems

Presentation transcript:

Address Translation for Manycore Systems
Scott Beamer, Henry Cook
CS258 Final Presentation, May 14th, 2008

ParLab Background
- Parallel (manycore) is coming. How can we use this opportunity to accomplish high-level computing goals?
  - Productive, efficient, correct
- Context: mobile consumer device
  - Low power
  - Single socket
  - Bursty workloads
  - Quality of service and response time are important

Problem Statement
- Modern processors want address translation (from virtual memory to physical memory). How does this scale to parallel systems?
- When a PTE that may be cached in many places is modified, the caches (TLBs) need to be kept consistent
- Differences from the cache coherence problem
  - Invalidations are much less frequent
  - Translation can be performed anywhere
    - Removes it from the critical path
- In ParLab, we are using partitions
  - Spatially dividing tiled cores to work on a single app
  - Shared L2 cache provided within a partition

Coherence Method: Shootdown
- Use a conventional TLB per core
- On a PTE modification, broadcast:
  - Interrupt all other processors
  - Force them to flush the relevant entries from their TLBs
  - The modification cannot be completed until all processors comply and respond
- Can work with any TLB/cache configuration, but synchronization costs are high
- In a modern SMP OS, a software handler is responsible for shootdown
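The shootdown sequence above can be sketched as a small, purely sequential Python model. This is only an illustration under stated assumptions: the names `Core`, `handle_shootdown`, and `modify_pte` are hypothetical, each TLB is modeled as a plain dictionary, and a real OS would deliver inter-processor interrupts and wait for asynchronous acknowledgements instead of making direct method calls.

```python
from dataclasses import dataclass, field


@dataclass
class Core:
    """One core with a private TLB (hypothetical model: VPN -> PFN dict)."""
    core_id: int
    tlb: dict = field(default_factory=dict)

    def handle_shootdown(self, vpn: int) -> int:
        # Stand-in for the interrupt handler: drop the entry if it is
        # cached, then acknowledge by returning this core's id.
        self.tlb.pop(vpn, None)
        return self.core_id


def modify_pte(page_table: dict, cores: list, initiator: Core,
               vpn: int, new_pfn: int) -> set:
    """Update a PTE and shoot down the old translation on all other cores."""
    page_table[vpn] = new_pfn
    initiator.tlb.pop(vpn, None)  # the initiator flushes its own entry too

    # "Broadcast": interrupt every other core and collect acknowledgements.
    acks = set()
    for core in cores:
        if core is not initiator:
            acks.add(core.handle_shootdown(vpn))

    # The modification is complete only once every other core has responded.
    expected = {c.core_id for c in cores if c is not initiator}
    if acks != expected:
        raise RuntimeError("shootdown incomplete: missing acknowledgements")
    return acks
```

In this model the initiator's cost grows with the number of cores it must interrupt and wait for, which is the synchronization cost the slide refers to.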

Coherence Method: Validation
- Allows cached translations to go stale and fixes them at the memory controller
- Every TLB entry stores a timestamp for its translation
- On a PTE modification, update a generation count associated with the page
- On a memory access:
  - The translation timestamp is checked at the memory controller
  - Outdated translations are fixed, and the TLB holding the outdated translation is updated
- Only provides a gain with virtual caches
  - A virtual cache could save energy because fewer TLB lookups are needed
  - On a context switch, the virtual cache must be flushed
  - Other overhead as well
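A minimal sketch of the validation idea follows, assuming that the "timestamp" stored in a TLB entry is simply a copy of the page's generation count at fill time. The slides do not spell out the encoding, so this is one possible reading. The class `MemoryController` and its methods are hypothetical names, and the TLB is modeled as a dictionary mapping a VPN to a `(pfn, generation)` pair.

```python
class MemoryController:
    """Hypothetical model: holds the page table and per-page generation counts."""

    def __init__(self, page_table: dict):
        self.page_table = dict(page_table)                # VPN -> PFN
        self.generation = {vpn: 0 for vpn in page_table}  # VPN -> count

    def modify_pte(self, vpn: int, new_pfn: int) -> None:
        # No broadcast: just bump the generation count for the page.
        self.page_table[vpn] = new_pfn
        self.generation[vpn] = self.generation.get(vpn, 0) + 1

    def access(self, tlb: dict, vpn: int):
        """Check the TLB entry on a memory access.

        Returns (pfn, was_stale). A missing entry is filled and is not
        counted as stale.
        """
        current = (self.page_table[vpn], self.generation[vpn])
        entry = tlb.get(vpn)
        was_stale = entry is not None and entry != current
        if entry != current:
            tlb[vpn] = current  # fix or fill the requesting TLB
        return current[0], was_stale
```

Here a PTE write touches no remote TLB; the stale entry is only detected and repaired the next time that TLB's translation reaches the memory controller.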

Better Schemes
- Shared
  - Let several cores share a TLB
  - Could benefit from constructive interference
  - L2 is already shared, so the TLB could be shared at that level
  - L1 would have to be virtual
- Hierarchical
  - Add a second- or third-level TLB to reduce the reload penalty
- Hybrid
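To make the lookup path concrete, here is a hedged sketch of a two-level lookup in which each core has a small private TLB backed by a larger second-level TLB shared among cores. Combining the two ideas this way is an assumption made for illustration; the slides do not define the hybrid scheme. `LruTlb` and `translate` are hypothetical names, LRU replacement is an arbitrary choice, and the page table is a plain dictionary standing in for a page-table walk.

```python
from collections import OrderedDict


class LruTlb:
    """Fixed-capacity TLB with LRU replacement (hypothetical model)."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.entries = OrderedDict()  # VPN -> PFN, oldest first

    def lookup(self, vpn: int):
        if vpn in self.entries:
            self.entries.move_to_end(vpn)  # mark as most recently used
            return self.entries[vpn]
        return None

    def insert(self, vpn: int, pfn: int) -> None:
        self.entries[vpn] = pfn
        self.entries.move_to_end(vpn)
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # evict least recently used


def translate(vpn: int, l1: LruTlb, shared_l2: LruTlb, page_table: dict):
    """Return (pfn, level), where level is 'L1', 'L2', or 'walk'."""
    pfn = l1.lookup(vpn)
    if pfn is not None:
        return pfn, "L1"
    pfn = shared_l2.lookup(vpn)
    if pfn is not None:
        l1.insert(vpn, pfn)  # refill the private TLB from the shared level
        return pfn, "L2"
    pfn = page_table[vpn]    # stand-in for a full page-table walk
    shared_l2.insert(vpn, pfn)
    l1.insert(vpn, pfn)
    return pfn, "walk"
```

In this model, a walk performed on behalf of one core fills the shared level, so a second core touching the same page hits at the second level instead of walking again. That is one way the constructive interference mentioned above could show up.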

Methodology
- Virtutech Simics system simulator
  - ISA functional simulator enhanced with memory hierarchy and TLB timing modules
  - Can measure latencies from memory access events and count coherence messages
  - 4, 8, 16, 32, 64, and 128 SPARC processor systems
  - Running unmodified Solaris 10
  - Measure behavior over 1B cycles
- PARSEC
  - Princeton Application Repository for Shared-Memory Computers

Applications

Results – Basic
- Blackscholes, 128 entry

Results – Application

Results – TLB Size

Results – Invalidation Rate
- A 1000x rate leads to a 1000x increase in communication, but it is still not visible
- For 5%, need roughly one PTE write per 1000 cycles
- 64 entry

Results – Traffic Comparison

Future Work
- Investigate the “32 problem”
- Further explore the design space
  - Complete the validation scheme
  - Experiment across sharing levels
  - Experiment across levels of hierarchy
- More applications
  - Several other PARSEC apps recently working
  - Multiple kernels at the same time to show time multiplexing

Conclusion
- TLB size is the most important observed factor so far
- Application has some effect
- Invalidation rate and type have less effect
- TLB coherence network traffic is insignificant
- Shootdown is not bad as a first pass