Runahead Execution: An Alternative to Very Large Instruction Windows for Out-of-order Processors — Onur Mutlu, The University of Texas at Austin; Jared Stark, Microprocessor Research, Intel Labs; Chris Wilkerson, Desktop Platforms Group, Intel Corp.; Yale N. Patt, The University of Texas at Austin

Presentation transcript:

Runahead Execution: An Alternative to Very Large Instruction Windows for Out-of-order Processors Onur Mutlu, The University of Texas at Austin Jared Stark, Microprocessor Research, Intel Labs Chris Wilkerson, Desktop Platforms Group, Intel Corp. Yale N. Patt, The University of Texas at Austin Presented by: Mark Teper

Outline
- The Problem
- Related Work
- The Idea: Runahead Execution
- Details
- Results
- Issues

Brief Overview
- Instruction window: the set of in-order instructions that have not yet been committed
- Scheduling window: the set of unexecuted instructions waiting to be selected for execution
- What can go wrong?
[Figure: program flow feeding the instruction window and scheduling windows, which issue to the execution units]

The Problem
[Figure: program flow through the instruction window; legend: unexecuted, executing, long-running, and committed instructions]

Filling the Instruction Window
[Chart: filling the instruction window yields better IPC]

Related Work
- Caches: alter the size and structure of caches; attempt to reduce unnecessary memory reads
- Prefetching: attempt to fetch data into a nearby cache before it is needed; hardware & software techniques
- Other techniques: waiting instruction buffer (WIB), long-latency block retirement
[Figure: memory hierarchy — CPU → L1 cache (1 cycle) → L2 cache (10 cycles) → memory (1000 cycles)]

Runahead Execution
- Continue executing instructions during long stalls
- Disregard the results once the missing data is available
[Figure: program flow through the instruction window; legend: unexecuted, executing, long-running, and committed instructions]

Benefits
- Acts as a high-accuracy prefetcher: software prefetchers have less information, and hardware prefetchers can't analyze code as well
- Trains (biases) branch predictors ahead of time
- Makes use of cycles that are otherwise wasted

Entering Runahead
- Processors can enter runahead mode at any point; the paper uses L2 cache misses as the trigger
- The architecture needs to be able to checkpoint and restore register state, including the branch-history register and the return address stack
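The checkpoint/restore requirement can be illustrated with a toy model (a sketch, not the hardware: the `cpu` dictionary and field names are invented here; the checkpointed fields follow the slide):

```python
# Minimal sketch of checkpoint on runahead entry and restore on exit.
# Only the state the slide names is modeled: registers, branch-history
# register (bhr), return address stack (ras), plus the restart PC.

def enter_runahead(cpu):
    cpu["checkpoint"] = {
        "regs": dict(cpu["regs"]),   # architectural register state
        "bhr": cpu["bhr"],           # branch-history register
        "ras": list(cpu["ras"]),     # return address stack
        "pc": cpu["pc"],             # instruction that triggered runahead
    }

def exit_runahead(cpu):
    ck = cpu.pop("checkpoint")       # discard all runahead-mode updates
    cpu["regs"], cpu["bhr"], cpu["ras"], cpu["pc"] = (
        ck["regs"], ck["bhr"], ck["ras"], ck["pc"])

cpu = {"regs": {"r1": 5}, "bhr": 0b1011, "ras": [0x400], "pc": 0x100}
enter_runahead(cpu)
cpu["regs"]["r1"] = 99               # speculative update during runahead
cpu["pc"] = 0x140
exit_runahead(cpu)
print(cpu["regs"]["r1"], hex(cpu["pc"]))  # 5 0x100
```

Restoring the branch-history register and return address stack matters because runahead-mode branches would otherwise pollute predictor state used after the restart.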

Handling the Avoided Read
- The load that triggered runahead returns immediately; its result is marked INV (invalid)
- The processor continues fetching and executing instructions, and INV propagates to dependents

  ld   r1, [r2]     ; misses — r1 marked INV
  add  r3, r2, r2   ; sources valid — r3 valid
  add  r3, r1, r2   ; r1 is INV — r3 becomes INV
  mov  r1, 0        ; constant write — r1 valid again
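The INV-propagation rule in the example above can be sketched as a simple dataflow pass (a toy model; the instruction tuples and register names are illustrative, not the paper's hardware):

```python
# Toy model of INV (invalid) bit propagation during runahead:
# a destination is INV iff any source is INV; a constant write clears it.

def run(instrs):
    inv = {}  # register name -> True if its value is bogus (INV)
    for op, dst, *srcs in instrs:
        if op == "load_miss":    # the L2-missing load: mark destination INV
            inv[dst] = True
        elif op == "movi":       # writing a constant revalidates the register
            inv[dst] = False
        else:                    # ALU op: INV if any source register is INV
            inv[dst] = any(inv.get(s, False) for s in srcs)
    return inv

# The slide's example sequence:
state = run([
    ("load_miss", "r1", "r2"),       # ld  r1, [r2] -> r1 INV
    ("add",       "r3", "r2", "r2"), # valid sources -> r3 valid
    ("add",       "r3", "r1", "r2"), # r1 is INV    -> r3 INV
    ("movi",      "r1"),             # mov r1, 0    -> r1 valid again
])
print(state)  # {'r1': False, 'r3': True}
```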

Executing Instructions in Runahead
- Instructions are fetched and executed as normal
- Instructions are pseudo-retired out of the instruction window in program order
- If an instruction's source registers are INV, it can be retired without executing
- No data is ever observable outside the CPU

Branches during Runahead
Divergence points: an incorrect branch prediction for a branch whose condition is INV.
- Predict the branch
- Does the branch depend on INV data?
  - Yes: assume the predictor is correct and continue execution
  - No: evaluate the branch. Was the predictor correct?
    - Yes: continue execution
    - No: flush the instruction queue
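The decision procedure above can be sketched as a small function (illustrative only; the function and parameter names are invented, and the fetch queue stands in for the wrong-path instructions that would be flushed):

```python
# Toy version of the runahead branch-handling policy.

def resolve_branch(depends_on_inv, predicted_taken, actual_taken, fetch_queue):
    """Return (direction_used, fetch_queue_after_resolution)."""
    if depends_on_inv:
        # Condition is bogus: the predictor is the best information available.
        return predicted_taken, fetch_queue
    if predicted_taken == actual_taken:
        return actual_taken, fetch_queue   # correct prediction: keep going
    return actual_taken, []                # mispredict: flush wrong-path work

# INV-dependent branch: follow the predictor, even if it later proves wrong.
print(resolve_branch(True, True, False, ["i1", "i2"]))   # (True, ['i1', 'i2'])
# Resolvable mispredict: the instruction queue is flushed.
print(resolve_branch(False, True, False, ["i1", "i2"]))  # (False, [])
```

Note the asymmetry: during runahead an INV-dependent branch can never be verified, so a wrong prediction there silently diverges execution from the real path.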

Exiting Runahead
- Occurs when the stalling memory access finally returns
- The checkpointed architectural state is restored
- All instructions in the machine are flushed
- The processor starts fetching again at the instruction that caused runahead
- The paper presents an optimization in which fetch restarts slightly before the stalled access returns

Biasing Branch Predictors
Runahead can cause branch predictors to be trained twice on the same branch. Several alternatives:
(1) Always train the branch predictors
(2) Never train them during runahead
(3) Keep a list of branches predicted during runahead
(4) Use a separate branch predictor for runahead

Runahead Cache
- Runahead execution disregards stores: they can't produce externally observable results
- However, store data is needed for store-to-load communication within runahead
- Solution: the runahead cache

  Loop: …
        store r1, [r2]
        add   r1, r3, r1
        store r1, [r4]
        load  r1, [r2]
        bne   r1, r5, Loop

Stores and Loads in Runahead
Loads:
1. If the address is INV, the data is automatically INV
2. Next look in:
   1. the store buffer
   2. the runahead cache
3. Finally go to memory:
   1. if the line is in the cache, treat the value as valid
   2. if not, treat the result as INV — don't stall
Stores:
1. Use the store buffer as usual
2. On commit:
   1. if the address is INV, ignore the store
   2. otherwise write the data into the runahead cache
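The lookup order above can be sketched as follows (a toy model: the dictionaries standing in for the store buffer, runahead cache, and data cache are invented for illustration):

```python
# Sketch of runahead load resolution and store commit.

def runahead_load(addr, addr_is_inv, store_buffer, runahead_cache, data_cache):
    """Return (value, is_inv) for a load executed during runahead."""
    if addr_is_inv:                      # 1. INV address -> data is INV
        return None, True
    if addr in store_buffer:             # 2a. forward from the store buffer
        return store_buffer[addr], False
    if addr in runahead_cache:           # 2b. then the runahead cache
        return runahead_cache[addr], False
    if addr in data_cache:               # 3a. cache hit: value is valid
        return data_cache[addr], False
    return None, True                    # 3b. miss: mark INV, never stall

def runahead_store_commit(addr, addr_is_inv, value, runahead_cache):
    """On commit: INV-address stores are dropped; others feed the cache."""
    if not addr_is_inv:
        runahead_cache[addr] = value

# The store buffer takes priority over an older runahead-cache entry:
print(runahead_load(0x40, False, {0x40: 7}, {0x40: 3}, {}))  # (7, False)
# A miss everywhere yields INV instead of a stall:
print(runahead_load(0x80, False, {}, {}, {}))                # (None, True)
```

The key design choice the slide captures is in step 3b: a runahead-mode load never blocks, because blocking would defeat the purpose of running ahead.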

Runahead Cache Results
- Not passing data from stores to loads was found to result in poor performance
- A significant number of results become INV without it
[Chart: performance with and without the runahead cache; higher is better]

Details: Architecture

Results
[Chart: performance results; higher is better]

Results (2)
[Chart: additional performance results; higher is better]

Issues
- Some wrong assumptions about future machines: the future baseline corresponds poorly to modern architectures
- Not a lot of detail on the architectural requirements of this technique: increased hardware area and increased power requirements