Implementing Virtual Memory in a Vector Processor with Software Restart Markers
Mark Hampton & Krste Asanovic
Computer Architecture Group, MIT CSAIL

Vector processors offer many benefits
One instruction triggers multiple operations, e.g. addv v3, v1, v2
[Figure: elementwise vector add reading elements of v1 and v2 and writing v3]
Dependence checking performed by compiler
Reduced overhead in instruction fetch and decode
Regular access patterns
But difficulty supporting virtual memory has been a key reason why traditional vector processors are not more widely used

Demand-paged virtual memory is a requirement in general-purpose processors
 Protection between processes is supported
 Shared memory is allowed
 Large address spaces are enabled
 Code portability is enhanced
 Multiple processes can be active without having to be fully memory-resident
A memory instruction uses a virtual address (load 0x802b10a4), which is then translated into a physical address (load 0x000c56e0), as sketched below
Requires OS and hardware support
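As a rough illustration of the translation step above, the C sketch below splits a virtual address into a virtual page number and an offset and replaces only the page number. The 4 KB page size and the page_table_lookup function are assumptions for illustration, not details from the talk.

#include <stdint.h>
#include <stdio.h>

#define PAGE_SHIFT 12                       /* assume 4 KB pages */
#define PAGE_MASK  ((1u << PAGE_SHIFT) - 1)

/* Hypothetical lookup: maps a virtual page number to a physical one. */
static uint32_t page_table_lookup(uint32_t vpn) {
    return vpn ^ 0x7feedu;                  /* stand-in for a real page table walk */
}

static uint32_t translate(uint32_t vaddr) {
    uint32_t vpn    = vaddr >> PAGE_SHIFT;  /* virtual page number */
    uint32_t offset = vaddr & PAGE_MASK;    /* offset is unchanged by translation */
    uint32_t ppn    = page_table_lookup(vpn);
    return (ppn << PAGE_SHIFT) | offset;
}

int main(void) {
    uint32_t va = 0x802b10a4u;
    printf("load 0x%08x -> load 0x%08x\n", va, translate(va));
    return 0;
}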

Demand paging allows multiple interactive processes to run simultaneously
The hard disk enables the illusion of a single large memory system
[Figure: single-threaded CPU, physical memory shared by processes P1–P5, and a hard disk as backing store]
CPU executes one process at a time
Processes share physical memory… …and use the larger hard disk as “virtual” memory
 If a needed page is not in physical memory, trigger a page fault
 A page fault is a very long-latency operation, and we don’t want the CPU to be idle, so perform a context switch to bring in another process
 A context switch requires the ability to save and restore the CPU state needed to restart a process

Parallel functional units complicate the saving and restoring of state
[Figure: fetch and decode unit feeding an issue unit and functional units FU0…FUn, with instructions i−3 through i+5 in flight when a page fault is detected]
 Could save all pipeline state, but this adds significant complexity
 Precise exceptions only require architectural state to be saved, by enforcing restrictions on commit

Precise exceptions preserve the illusion of sequential execution
[Figure: fetch and decode unit, reorder buffer (ROB) holding instruction i−4 (oldest) through instruction i+6 (newest), functional units, and architectural state; a page fault is detected on instruction i]
Fetch and decode in order
Execute and write back results out of order (detect exceptions)
Commit results in order (handle exceptions; sketched below)
Key advantage is that restarting after an exception is simple
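To make the commit rule concrete, here is a minimal C sketch of in-order commit from a reorder buffer; the structure, sizes, and field names are illustrative assumptions, not the internals of any particular design.

#include <stdbool.h>
#include <stddef.h>

#define ROB_SIZE 64

typedef struct {
    bool done;     /* result has been written back (possibly out of order) */
    bool faulted;  /* e.g. a page fault was detected during execution */
    int  dest;     /* architectural register to update on commit */
    long value;    /* buffered result awaiting in-order commit */
} RobEntry;

static RobEntry rob[ROB_SIZE];
static size_t head, tail;      /* head = oldest in-flight instruction */
static long arch_regs[32];     /* architectural state */

/* Commit strictly in program order. On a fault, stop at the faulting
 * instruction: everything older has committed, nothing younger has
 * touched architectural state, so the exception is precise. */
static void commit(void) {
    while (head != tail && rob[head].done) {
        if (rob[head].faulted)
            return;            /* raise a precise exception; restart here */
        arch_regs[rob[head].dest] = rob[head].value;
        head = (head + 1) % ROB_SIZE;
    }
}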

Most precise exception designs support a relatively small number of in-flight operations
[Figure: the same ROB pipeline as on the previous slide]
 Each in-flight operation needs a temporary buffer to hold its result before commit
 Problem with vector processors is that a single instruction can produce hundreds of results!

Vector processors also have a large amount of architectural state to preserve
[Figure: architectural state for a scalar processor, i.e. scalar registers r0 through r31]

Vector processors also have a large amount of architectural state to preserve
[Figure: architectural state for a vector processor, i.e. scalar registers r0 through r31 plus vector registers v0 through v31, each holding elements [0] through [vlmax−1]]
This hurts performance and complicates the OS interface

Our work addresses the problems with virtual memory in vector processors
 Problem: All of the vector instruction results have to be buffered for in-order commit
   Solution: We don’t buffer results; instead we use idempotent regions to allow out-of-order commit
 Problem: The vector register file significantly increases the amount of state to save
   Solution: We don’t save vector registers; instead we recreate that state after an exception

The problem with parallel execution is knowing where to restart after an exception
Copying one array to another can be done in parallel:
[Figure: elements of array A (… 9 0 4 … 7 2 3 6 … 5 8 1 …) being copied to array B]
But suppose something goes wrong
[Figure: B left partially written (… 9 0 4 … ? ? ? ? … 5 8 …) after a fault]
Can’t simply restart from the faulting operation, because all of the previous operations may not have completed

What if we didn’t worry about which instructions were uncompleted?
[Figure: A is intact, so re-running the copy overwrites the partially written B with the correct values]
 In this example, A and B do not overlap in memory → the original input data still exists
 Could copy everything again and still get the same result (a C sketch follows below)
Only works if the processor knows it’s safe to re-execute the code, i.e. the code must be idempotent
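A C sketch of the idempotence argument on this slide: because the source array is never written, re-running the entire copy after a simulated partial failure still yields the correct result. It assumes non-overlapping arrays, as the slide does.

#include <assert.h>
#include <string.h>

int main(void) {
    int A[8] = {9, 0, 4, 7, 2, 3, 6, 1};    /* input: never written */
    int B[8] = {0};

    /* First attempt "faults" partway through: only part of B is filled. */
    memcpy(B, A, 3 * sizeof(int));

    /* Re-execute the entire region from the start. Since A is intact,
     * the second run overwrites any partial results with correct ones. */
    memcpy(B, A, sizeof(A));

    assert(memcmp(A, B, sizeof(A)) == 0);   /* same result as a fault-free run */
    return 0;
}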

Software restart markers delimit regions of idempotent code
[Figure: an instruction stream under the precise exception model vs. one in which software marks the restart points]
 Instructions from a single region can be committed out-of-order—no buffering required
 An exception causes execution to resume from the head of the region
 Only a single register is needed to hold the address of the head of the region
 If regions are large enough, the CPU can still exploit ample parallelism
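As an analogy for the restart mechanism, the sketch below uses setjmp/longjmp: setjmp plays the role of the marker that records the head of the region, and the simulated fault handler jumps back to it, re-executing the whole (idempotent) region. This is purely illustrative; the real mechanism is a hardware register plus OS support, and all names here are invented.

#include <setjmp.h>
#include <stdio.h>

static jmp_buf region_head;   /* plays the role of the restart-PC register */
static int faults_left = 1;   /* simulate one page fault, then success */

static void maybe_fault(void) {
    if (faults_left-- > 0) {
        puts("fault: resuming from head of region");
        longjmp(region_head, 1);          /* restart the whole region */
    }
}

int main(void) {
    setjmp(region_head);                  /* "begin restart region" marker */
    /* ... idempotent work here; it may execute more than once ... */
    maybe_fault();
    puts("region completed");             /* "end restart region" marker */
    return 0;
}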

Software restart markers also create a new classification of state
Example restart region:
  lv   v0, t0
  sv   t1, v0
  addu t2, t1, 512
  addv v0, v1, v2
  sv   t2, v0
  addu t1, t2, 512
  lv   v0, t2
 “Temporary” state only exists within a single restart region, e.g. v0
 After an exception, temporary state will be recreated and thus does not have to be saved
 Software restart markers allow vector registers to be mapped to temporary state

Vector registers don’t need to be preserved across exceptions
[Figure: scalar registers r0 through r31 remain architectural state; vector registers v0 through v31, with elements [0] through [vlmax−1], become temporary state]
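If the vector registers are temporary state, the OS save area shrinks accordingly. A hypothetical sketch of the saved context under this scheme; the struct and field names are invented, not from the talk.

#include <stdint.h>

/* Hypothetical context-switch save area. With software restart markers,
 * vector registers are temporary state: they are not saved, because
 * re-executing the interrupted restart region recreates them. */
typedef struct {
    uint64_t scalar_regs[32];   /* architectural scalar state */
    uint64_t pc;
    uint64_t restart_pc;        /* head of the current restart region */
    /* no vector register file: 32 registers x vlmax elements omitted */
} SavedContext;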

Creating restart regions can be done by making sure input values are preserved
Vectorized memcpy() loop:
# void* memcpy(void *out, const void *in, size_t n);
loop:
  lv    v0, a1      # Load from input
  sv    a0, v0      # Store to output
  addiu a1, 512     # Increment pointers
  addiu a0, 512
  subu  a2, 512     # Decrement counter
  bnez  a2, loop    # Is loop done?
 Want to place the entire loop within a single restart region, but the argument registers are overwritten in each iteration
 Solution: Make copies of the argument registers

Creating restart regions can be done by making sure input values are preserved
# void* memcpy(void *out, const void *in, size_t n);
begin restart region
  move  t0, a0      # Copy argument registers
  move  t1, a1
  move  t2, a2
loop:
  lv    v0, t1      # Load from input
  sv    t0, v0      # Store to output
  addiu t1, 512     # Increment pointers
  addiu t0, 512
  subu  t2, 512     # Decrement counter
  bnez  t2, loop    # Is loop done?
done:
end restart region
This works for all functions with separate input and output arrays (a C-level sketch follows)
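The same transformation at the C level, as a hedged sketch: the region begins by copying the arguments into locals, and only the locals are updated in the loop, so a re-execution from the head of the region always starts from the original argument values.

#include <stddef.h>

/* Sketch of the transformed memcpy. If an exception restarts the region
 * at its head, the local copies are remade from the untouched arguments,
 * so re-executing the loop is safe. */
void *memcpy_region(void *out, const void *in, size_t n) {
    /* begin restart region */
    char       *dst = out;    /* t0: copy of a0 */
    const char *src = in;     /* t1: copy of a1 */
    size_t      cnt = n;      /* t2: copy of a2 */
    while (cnt--)
        *dst++ = *src++;
    /* end restart region */
    return out;
}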

But what if an input array is overwritten?
Vectorized loop for the multiply_2() function:
# void* multiply_2(void *in, size_t n);
loop:
  lv      v0, a0      # Load from input
  mulvs.d v0, v0, f0  # Multiply vector by 2
  sv      a0, v0      # Store result
  addiu   a0, 512     # Increment pointer
  subu    a1, 512     # Decrement counter
  bnez    a1, loop    # Is loop done?
Can’t simply copy the array to a backup register
Option #1: Copy the input values to a temporary buffer

But what if an input array is overwritten?
Option #1: Copy the input array to a temporary buffer
# void* multiply_2(void *in, size_t n);
# Allocate temporary buffer of size n pointed to by t2
memcpy(t2, a0, a1)     # Copy input values to temp buffer
begin restart region
  move t0, a0          # Get original inputs
  move t1, a1
  memcpy(a0, t2, a1)   # Restore the input array from the temp buffer
loop:
  lv      v0, t0       # Load from input
  mulvs.d v0, v0, f0   # Multiply vector by 2
  sv      t0, v0       # Store result
  addiu   t0, 512      # Increment pointer
  subu    t1, 512      # Decrement counter
  bnez    t1, loop     # Is loop done?
end restart region
Disadvantages: space and performance overhead; strip mining
Usually still faster than scalar code (a C sketch of this option follows)
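The same Option #1 at the C level, as a hedged sketch; malloc stands in for whatever buffer allocation the compiler would actually emit, and the function shape is assumed from the slide’s prototype.

#include <stdlib.h>
#include <string.h>

/* Back up the input array before the region and restore it at the head
 * of the region, so a re-execution always starts from the original
 * input values, making the region idempotent. */
void multiply_2(double *in, size_t n) {
    double *tmp = malloc(n * sizeof *in);
    if (!tmp)
        return;                          /* would fall back to a scalar path */
    memcpy(tmp, in, n * sizeof *in);     /* copy input values to temp buffer */

    /* begin restart region */
    memcpy(in, tmp, n * sizeof *in);     /* restore original inputs */
    for (size_t i = 0; i < n; i++)
        in[i] *= 2.0;
    /* end restart region */

    free(tmp);
}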

But what if an input array is overwritten?
Option #2: Use the scalar version when the vector overhead is too large (a C rendering follows)
# void* multiply_2(void *in, size_t n);
  sltiu t0, a1, 64            # Is n less than threshold?
  bnez  t0, scalar_version    # If so, use scalar version
  # Vector version of function with restart markers here.
  j     done
scalar_version:
  # Scalar code without restart markers here.
done:
  # return from function
Goal of our approach is to implement virtual memory cheaply while being able to handle the majority of vectorized code
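A C rendering of the same dispatch, as a sketch; the threshold value and helper names are illustrative, and the vector body is a placeholder rather than real vectorized code.

#include <stddef.h>

#define VECTOR_THRESHOLD 64   /* illustrative cutoff, as on the slide */

static void multiply_2_scalar(double *in, size_t n) {
    for (size_t i = 0; i < n; i++)
        in[i] *= 2.0;         /* no restart markers needed */
}

/* Placeholder for the vectorized body with restart markers and the
 * input-preserving copy from Option #1. */
static void multiply_2_vector(double *in, size_t n) {
    multiply_2_scalar(in, n);
}

void multiply_2(double *in, size_t n) {
    if (n < VECTOR_THRESHOLD)
        multiply_2_scalar(in, n);   /* buffer-copy overhead not worth it */
    else
        multiply_2_vector(in, n);
}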

The compiler implementation takes advantage of existing techniques
 We can create restart regions for scalar code with Trimaran, which uses region-based compilation [Hank95]
 Vectorizing compilers employ transformations to remove dependences, facilitating the creation of restart regions
 We are currently working on a complete vectorizer
   SUIF frontend provides dependence analysis
   Trimaran backend is used to generate vector assembly code with software restart markers
   gcc creates the final executables
 This is a work in progress, so all evaluation is done using hand-vectorized assembly code

We evaluate the performance overhead of creating idempotent regions in actual code
 Scale vector-thread processor [Krashinsky04] is the target system
   Provides high performance for embedded programs
   Only vector capabilities are used in this work
 Microarchitectural simulator used for the vector unit
   Single-cycle magic memory emphasizes the overhead of restart markers
 A variety of EEMBC benchmarks serve as the workload
   gcc used to compile code
 Results shown for the default 4-lane Scale configuration

The performance overhead due to creating restart regions is small
 For most benchmarks, the performance reduction is negligible
 fft is an example of a fast-running benchmark with small restart regions
 An input array is preserved in viterbi to make the function idempotent

But what about the overhead of re-executing instructions after a page fault?
 Restarting after a page fault is not a significant concern
   Disk access latency is so high that it will dominate the re-execution overhead
   Page faults are relatively infrequent
 However, to test our approach sufficiently, we examine TLB misses
   The TLB holds virtual-to-physical address translations (entries 0 through n−1, each mapping a virtual page # to a physical page #)
   If a translation is missing, we need to walk the page table to perform a TLB refill
   A TLB refill can be handled either in hardware or software (a sketch of the lookup-and-refill flow follows)
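A minimal C sketch of the lookup-then-refill flow just described; the direct-mapped organization, entry format, and identity page-table walk are all simplifying assumptions.

#include <stdbool.h>
#include <stdint.h>

#define PAGE_SHIFT  12
#define TLB_ENTRIES 128   /* matches the vector-unit TLB size used later */

typedef struct {
    bool     valid;
    uint32_t vpn;   /* virtual page number */
    uint32_t ppn;   /* physical page number */
} TlbEntry;

static TlbEntry tlb[TLB_ENTRIES];

/* Stand-in for a real page-table walk; in a software-refilled design this
 * work happens in a trap handler, in a hardware-refilled one in an FSM. */
static uint32_t walk_page_table(uint32_t vpn) {
    return vpn;   /* identity mapping, purely for illustration */
}

static uint32_t translate(uint32_t vaddr) {
    uint32_t vpn = vaddr >> PAGE_SHIFT;
    TlbEntry *e  = &tlb[vpn % TLB_ENTRIES];   /* direct-mapped for simplicity */
    if (!e->valid || e->vpn != vpn) {         /* TLB miss */
        e->ppn   = walk_page_table(vpn);      /* refill */
        e->vpn   = vpn;
        e->valid = true;
    }
    return (e->ppn << PAGE_SHIFT) | (vaddr & ((1u << PAGE_SHIFT) - 1));
}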

The method of refilling the TLB can have a significant effect on the system
 Software-refilled TLBs cause an exception when a TLB miss occurs
   Typical designs flush the pipeline when handling the miss
   If the miss handler code isn’t in the cache, performance is further hurt
   For vector processors, the TLB normally has to be as large as the maximum vector length to avoid livelock
   Advantage of this scheme is that it gives the OS flexibility to choose the page table structure
 Hardware-refilled TLBs (found in most processors) use a finite state machine to walk the page table
   Disadvantage is that the page table structure is fixed
   Doesn’t cause an exception, so the performance hit is small (the previous overhead results approximate a system with a hardware-refilled TLB)
   No livelock issues
Although hardware refill is good for vector processors, we use software refill to provide a worst-case scenario

Performance optimizations can reduce the re-execution cost with a software-refilled TLB
 Prefetching loads a byte from each page in the dataset before beginning the region (see the sketch below)
   Gets the TLB misses out of the way early
   Disadvantage is the extra compiler effort required
 Counted loop optimization restarts after an exception from the earliest uncompleted loop iteration
   Limits the amount of repeated work
   Compiler algorithm in the paper
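A sketch of the prefetching optimization in C: touch one byte per page of the dataset before entering the restart region, so the TLB misses (and any resulting restarts) happen outside the region. The page size is an assumption matching the evaluation setup.

#include <stddef.h>

#define PAGE_SIZE 4096

/* Load one byte from each page spanned by [base, base + bytes). */
static void prefetch_pages(const void *base, size_t bytes) {
    const volatile char *p = base;
    for (size_t off = 0; off < bytes; off += PAGE_SIZE)
        (void)p[off];           /* triggers any TLB miss for this page */
    if (bytes)
        (void)p[bytes - 1];     /* covers one extra page when base is unaligned */
}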

We evaluate the performance overhead of our worst-case scenario
 Same simulation and compilation infrastructure is used
 Virtual memory configuration uses the standard MIPS setup with software refill
   Default 64-entry MIPS TLB for the control processor
   128-entry TLB for the vector unit
   Fixed 4 KB page size—smallest possible for MIPS
   All page tables modeled, but no page faults
 Two additional overhead components are introduced
   Cost of handling a TLB miss (usually negligible)
   Cost of re-executing instructions after a TLB miss

The performance overhead of using a software-refilled TLB is small with optimizations
 The original design does not perform well with large datasets
 Prefetching incurs the smallest degradation
 Counted loop optimization has small overhead, but still leads to some re-executed work

Related Work
 IBM System/370 [Buchholz86] only allowed one in-flight vector instruction at a time, hurting performance
 DEC Vector VAX [DEC91] saved internal pipeline state, causing performance and energy problems
 CODE [Kozyrakis03] uses register renaming to support virtual memory, while our scheme can be used in processors with no renaming capabilities
 Sentinel scheduling [Mahlke92, August95] uses idempotent code and recovery blocks, but for the purpose of recovering from misspeculations in a VLIW architecture
 Checkpoint repair [Hwu87] is more flexible than our software “checkpointing” scheme, but incurs more hardware overhead

Concluding Remarks
 Traditional vector architectures have not found widespread acceptance, in large part because of the difficulty of supporting virtual memory
 Software restart markers enable virtual memory to be implemented cheaply
   They allow instructions to be committed out of order
   They reduce the amount of state to save in the event of a context switch
 Our approach reduces hardware overhead while incurring only a small performance degradation
   Average overhead with a hardware-refilled TLB: less than 1%
   Average overhead with a software-refilled TLB: less than 3%