1 Runahead Execution A review of “Improving Data Cache Performance by Pre-executing Instructions Under a Cache Miss” Ming Lu, Oct 31, 2006.

Presentation transcript:

1 Runahead Execution A review of “Improving Data Cache Performance by Pre-executing Instructions Under a Cache Miss” Ming Lu, Oct 31, 2006

2 Outline Why How Conclusions Problems

3 Why? The Memory Latency Bottleneck (Computer Architecture: A Quantitative Approach, Third Edition, Hennessy and Patterson)

4 Solutions
Cache: “A safe place for hiding or storing things” (Webster’s New World Dictionary of the American Language, 1976)
Reduce average memory latency by caching data in a small, fast RAM
Data pre-fetching
Parallelism
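The latency reduction a cache buys can be quantified with the standard average memory access time (AMAT) formula: hit time + miss rate × miss penalty. A minimal sketch, with hypothetical cycle counts (none of these numbers come from the slides):

```python
def amat(hit_time, miss_rate, miss_penalty):
    """Average memory access time in cycles: hit_time + miss_rate * miss_penalty."""
    return hit_time + miss_rate * miss_penalty

# Hypothetical numbers: 1-cycle L1 hit, 5% miss rate, 200-cycle memory penalty.
no_cache = 200                      # every access pays the full memory latency
with_cache = amat(1, 0.05, 200)     # 1 + 0.05 * 200 = 11 cycles on average
```

Even with a modest hit rate, the average latency drops by more than an order of magnitude, which is why caches are the first line of defense against the memory bottleneck.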

5 A New Problem Arises Cache misses are the main cause of processor stalls in modern superscalars; an L2 miss in particular can take hundreds of cycles to complete.

6 Runahead: A Solution for Cache Misses
Runahead history:
Author           Year  Achievement
Dundas, Mudge    1997  In-order scalar runahead
Mutlu, Patt      2003  Out-of-order superscalar runahead
Akkary, Rajwar   2003  Checkpoint processing
Ceze, Torrellas  2006  Checkpointing and value prediction

7 How? Initiated on an instruction or data cache miss; restart at the initiating instruction once the miss is serviced. (Adapted from Dundas)

8 Hardware Support Required for Runahead
We need to be able to compute load/store addresses, branch conditions, and jump targets
Must be able to speculatively update registers during runahead
Register set contents must be checkpointed:
  Shadow each RF RAM cell; these cells form the BRF
  Copy RF to BRF when entering runahead
  Copy BRF to RF when resuming normal operation
Pre-processed stores cannot modify the contents of memory
Fetch logic must save the PC of the runahead-initiating instruction
(RF: Register File; BRF: Backup Register File. Adapted from Dundas)

9 Entering and Exiting Runahead
Entering runahead:
  Save the contents of the RF in the BRF
  Save the PC of the runahead-initiating instruction
  If runahead is initiated on an instruction cache miss, restart instruction fetch at the first instruction in the next sequential line
Exiting runahead:
  Set all of the RF and L1 data cache runahead-valid bits to the VALID state
  Restore the RF from the BRF
  Restart instruction fetch at the PC of the instruction that initiated runahead
(Adapted from Dundas)
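The checkpoint/restore protocol above can be sketched in a few lines. This is a minimal illustrative model, not the paper's hardware: the RF and BRF are plain lists and the class and method names are invented for the sketch.

```python
class RunaheadCore:
    """Toy model of the RF/BRF checkpoint protocol from the slide."""

    def __init__(self, num_regs=4):
        self.rf = [0] * num_regs    # architectural register file (RF)
        self.brf = [0] * num_regs   # backup register file (BRF)
        self.pc = 0
        self.saved_pc = None
        self.in_runahead = False

    def enter_runahead(self, miss_pc):
        self.brf = list(self.rf)    # copy RF -> BRF (checkpoint)
        self.saved_pc = miss_pc     # remember the initiating instruction's PC
        self.in_runahead = True

    def exit_runahead(self):
        self.rf = list(self.brf)    # restore RF from BRF
        self.pc = self.saved_pc     # refetch from the initiating instruction
        self.in_runahead = False

# During runahead the core may clobber rf freely; exit undoes all of it.
core = RunaheadCore()
core.rf = [1, 2, 3, 4]
core.enter_runahead(miss_pc=100)
core.rf[0] = 99                     # speculative update during runahead
core.exit_runahead()
```

After `exit_runahead()`, the register file is back to its pre-runahead contents and fetch resumes at the missing instruction, which is exactly why all runahead results can be safely discarded.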

10 Instructions
Register-to-register:
  Mark their destination register INV if any of their source registers are INV
  Can replace an INV value in their destination register if all sources are valid
Load:
  Mark their destination register INV if: the base register used to form the effective address is marked INV, or a cache miss occurs, or the target word in the L1 data cache is marked INV due to a preceding store
  Can replace an INV value in their destination register if none of the above apply
(Adapted from Dundas)
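These INV-propagation rules boil down to simple boolean logic over per-register valid bits (the IRV of the later example slide). A minimal sketch, assuming a list of per-register INV flags; the function names are illustrative, not from the paper:

```python
def exec_alu(inv, dst, srcs, compute):
    """Register-to-register op: dst becomes INV iff any source is INV;
    the actual computation only runs when all sources are valid."""
    if any(inv[s] for s in srcs):
        inv[dst] = True
    else:
        inv[dst] = False
        compute()

def exec_load(inv, dst, base, cache_hit, word_inv):
    """Load: dst is INV if the base register is INV, the access misses,
    or the target L1 word was marked INV by a preceding runahead store."""
    inv[dst] = inv[base] or not cache_hit or word_inv

inv = [False, False, True, False]        # r2 is poisoned (INV)
exec_alu(inv, 0, [1, 3], lambda: None)   # valid sources -> r0 valid
exec_alu(inv, 0, [1, 2], lambda: None)   # r2 is INV -> r0 becomes INV
exec_load(inv, 3, 1, cache_hit=True, word_inv=False)   # clean hit -> r3 valid
```

Note how an INV bit flows transitively: once r2 is poisoned, anything computed from it is poisoned too, so dependent instructions never stall during runahead.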

11 Instructions (cont.)
Store:
  Pre-processed stores do not modify the contents of memory
  Stores mark their destination L1 data cache word INV if: the base register used to form the effective address is not INV, and a cache miss does not occur
  Values are only INV with respect to subsequent loads during the same runahead episode
Conditional branch:
  Branches are resolved normally if their operands are valid
  If a branch condition is marked INV, the outcome is determined via branch prediction
  If an indirect branch target register is marked INV, the pipeline stalls until normal operation resumes
(Adapted from Dundas)
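The store and branch rules can be sketched the same way as the register rules. This is an illustrative model only; `word_inv` stands in for the per-word INV bits in the L1 data cache, and the function names are invented:

```python
def exec_store(word_inv, inv, base, addr, cache_hit):
    """Runahead store: never writes memory. If it can form a valid address
    and the line is present, it marks the target L1 word INV so later loads
    in this runahead episode don't read a stale value."""
    if not inv[base] and cache_hit:
        word_inv[addr] = True

def resolve_branch(cond_inv, actual_taken, predicted_taken):
    """Conditional branch: use the real outcome when the condition register
    is valid; otherwise fall back on the branch predictor."""
    return actual_taken if not cond_inv else predicted_taken

word_inv = {}                  # per-word INV bits, keyed by address
inv = [False]                  # base register r0 is valid
exec_store(word_inv, inv, base=0, addr=0x40, cache_hit=True)   # poisons 0x40
exec_store(word_inv, inv, base=0, addr=0x80, cache_hit=False)  # miss: no mark
```

A store that misses cannot mark anything, so the word it would have written stays "valid" in the cache; that is exactly the case the load rule on the previous slide guards against with its own miss check.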

12 Instructions (cont.) Jump register indirect: assume that the return stack contains the address of the next instruction. (Adapted from Dundas)

13 Two Runahead Branch Policies When a pre-executed conditional branch or jump depends on an invalid register:
Conservative: halt runahead until the miss data is ready.
Aggressive: keep going, assuming that branch prediction or the subroutine return stack is accurate enough to resolve the branch or jump.
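The two policies differ only in what they do when the branch condition is INV. A minimal decision-function sketch (the policy names follow the slide; the return convention is invented for illustration):

```python
def branch_policy(policy, operand_inv, predict):
    """Decide what a runahead branch does.
    Returns (action, direction): action is 'resolve', 'halt', or 'predict';
    direction is the predicted outcome, or None when not applicable."""
    if not operand_inv:
        return ('resolve', None)       # operands valid: resolve normally
    if policy == 'conservative':
        return ('halt', None)          # stop runahead until the miss returns
    return ('predict', predict())      # aggressive: trust the predictor

taken_predictor = lambda: True         # hypothetical always-taken predictor
```

The trade-off: the conservative policy never wanders down a wrong path, but it also forfeits any prefetches beyond the unresolved branch; the aggressive policy keeps generating prefetches at the risk of fetching down a mispredicted path.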

14 An Example IRV: Invalid Register Vector (0: Invalid, 1: Valid)

15 Benefit
Early execution of memory operations that are potential cache misses; re-execution of these instructions will most likely be cache hits
Allows further instructions to be executed, although these instructions are executed again after exiting runahead mode
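The prefetch benefit is easy to see with a toy cache model: the runahead pass misses and installs the lines, so the replay pass after exiting runahead hits. This is a deliberately simplified sketch (fully associative set of 64-byte lines, no eviction), not the paper's memory system:

```python
class ToyCache:
    """Fully associative toy cache with 64-byte lines and no eviction."""

    def __init__(self):
        self.lines = set()

    def access(self, addr):
        line = addr // 64
        hit = line in self.lines
        self.lines.add(line)       # fill the line on a miss (no-op on a hit)
        return hit

cache = ToyCache()
addrs = [0x100, 0x180, 0x200]                       # three distinct lines
runahead_hits = [cache.access(a) for a in addrs]    # first pass: all miss
replay_hits = [cache.access(a) for a in addrs]      # replay: all hit
```

The runahead misses overlap with the original stalled miss instead of being serialized after it, which is where the latency-hiding comes from.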

16 Conclusions
Pre-process instructions while cache misses are serviced
Don’t stall for instructions that are dependent upon invalid or missing data
Loads and stores that miss in the cache can become data prefetches
Instruction cache misses become instruction prefetches
Conditional branch outcomes are saved for use during normal operation
All pre-processed instruction results are discarded; we are only interested in generating prefetches and branch outcomes
Runahead is a form of very aggressive, yet inexpensive, speculation
(Adapted from Dundas)

17 Problems
Increases the number of executed instructions
Pre-executed instructions consume energy
Short runahead episodes may not be worth their overhead

18 References
[1] J. Dundas and T. Mudge. Improving data cache performance by pre-executing instructions under a cache miss. In ICS-11, 1997.
[2] J. D. Dundas. Improving Processor Performance by Dynamically Pre-Processing the Instruction Stream. PhD thesis, Univ. of Michigan.
[3] O. Mutlu, J. Stark, C. Wilkerson, and Y. N. Patt. Runahead execution: An alternative to very large instruction windows for out-of-order processors. In HPCA-9, pages 129–140, 2003.
[4] H. Akkary, R. Rajwar, and S. T. Srinivasan. Checkpoint processing and recovery: Towards scalable large instruction window processors. In MICRO-36, pages 423–434, 2003.
[5] L. Ceze, K. Strauss, J. Tuck, J. Renau, and J. Torrellas. CAVA: Hiding L2 misses with checkpoint-assisted value prediction. In Computer Architecture Letters, 2006.

19 Thank You & Questions?