15-740/18-740 Computer Architecture
Lecture 14: Runahead Execution
Prof. Onur Mutlu, Carnegie Mellon University
Fall 2011, 10/12/2011

Reviews Due Today
Chrysos and Emer, "Memory Dependence Prediction Using Store Sets," ISCA 1998.

Announcements
Milestone I: due this Friday (Oct 14). Format: 2 pages. Include results from your initial evaluations; we need to see good progress. Of course, the results need to make sense (i.e., you should be able to explain them).
Midterm I: postponed to October 24.
Milestone II: will be postponed. Stay tuned.

Course Checkpoint
Homeworks: look at the solutions. These are not for grades; they are for you to test your understanding of the concepts and prepare for exams. Provide succinct and clear answers.
Review sets: concise reviews.
Projects: the most important part of the course. Focus on this. If you are struggling, talk with the TAs.

Course Feedback
Fill out the forms and return them.

Last Lecture
Tag broadcast; the wakeup+select loop
Pentium Pro vs. Pentium 4 designs: buffer decoupling
Consolidated physical register files in the Pentium 4 and Alpha 21264
Centralized vs. distributed reservation stations
Which instruction to select?

Today
Load-related instruction scheduling
Runahead execution

Review: Centralized vs. Distributed Reservation Stations
Centralized (monolithic):
+ Reservation stations are not statically partitioned (can adapt to changes in instruction mix, e.g., 100% adds)
-- All entries need to have all fields even though some fields are not needed for some instructions (e.g., branches, stores)
-- Number of ports = issue width
-- More complex control and routing logic
Distributed:
+ Each RS can be specialized to the functional unit it serves (more area efficient)
+ Number of ports can be 1 (one RS per functional unit)
+ Simpler control and routing logic
-- Other RSs cannot be used if one functional unit's RS is full (static partitioning)

Review: Issues in Scheduling Logic (I)
What if multiple instructions become ready in the same cycle? Why does this happen? An instruction can have multiple dependents, and multiple instructions can complete in the same cycle.
Which one to schedule?
  Oldest
  Random
  Most dependents first
  Most critical first (longest-latency dependence chain)
The choice does not matter for performance unless the active window is very large. Butler and Patt, "An Investigation of the Performance of Various Dynamic Scheduling Techniques," MICRO 1992. A sketch of two such select policies follows.
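As an illustration, here is a minimal Python sketch of two select policies, assuming a reservation-station entry that records program order and a dependent count; all names here are hypothetical, not taken from any real design.

    from dataclasses import dataclass

    @dataclass
    class RSEntry:
        seq_num: int         # program order (smaller = older)
        ready: bool          # all source operands available?
        num_dependents: int  # waiting instructions that consume the result

    def select_oldest(entries, issue_width):
        """Among ready instructions, pick up to issue_width, oldest first."""
        ready = [e for e in entries if e.ready]
        return sorted(ready, key=lambda e: e.seq_num)[:issue_width]

    def select_most_dependents(entries, issue_width):
        """Alternative heuristic: wake the most dependents first."""
        ready = [e for e in entries if e.ready]
        return sorted(ready, key=lambda e: -e.num_dependents)[:issue_width]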

Issues in Scheduling Logic (II)
When to schedule the dependents of a multi-cycle (fixed-latency) instruction? One cycle before the completion of the instruction. Examples: IMUL; the 3-cycle ADD in the Pentium 4.
When to schedule the dependents of a variable-latency instruction? A load can hit or miss in the data cache.
  Option 1: Schedule dependents assuming the load will hit.
  Option 2: Schedule dependents assuming the load will miss.
When do we take an instruction out of a reservation station?
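A minimal sketch of the fixed-latency case, assuming the destination tag is broadcast one cycle before completion so dependents can execute back-to-back; the latencies below are illustrative, not real design values.

    # Illustrative execution latencies (cycles), not from a real design.
    EXEC_LATENCY = {"ADD": 1, "IMUL": 3}

    def wakeup_cycle(op, issue_cycle):
        """Cycle in which dependents of `op` are woken up: one cycle before
        the result completes, so they issue just as the value is bypassed."""
        return issue_cycle + EXEC_LATENCY[op] - 1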

Scheduling of Load Dependents
Assume the load will hit:
+ No delay for dependents (a load hit is the common case)
-- Need to squash and re-schedule dependents if the load actually misses
Assume the load will miss (i.e., schedule dependents when the load data is ready):
+ No need to re-schedule (simpler logic)
-- Significant delay for load dependents if the load hits
Predict load hit/miss:
+ No delay for dependents on an accurate prediction
-- Need to predict, and re-schedule on a misprediction
Yoaz et al., "Speculation Techniques for Improving Load Related Instruction Scheduling," ISCA 1999. A sketch of the three policies follows.
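A minimal Python sketch of the three policies, assuming a fixed load-hit latency known at schedule time; the constant and names are illustrative.

    LOAD_HIT_LATENCY = 3   # cycles from load issue to data ready on a hit

    def speculative_wakeup_cycle(policy, load_issue_cycle, predicted_hit=True):
        """Cycle at which dependents are speculatively woken up,
        or None if they simply wait for the data to arrive."""
        if policy == "assume_hit":
            return load_issue_cycle + LOAD_HIT_LATENCY
        if policy == "assume_miss":
            return None                      # simpler logic, but slow on hits
        if policy == "predict":
            return load_issue_cycle + LOAD_HIT_LATENCY if predicted_hit else None

    def needs_replay(policy, predicted_hit, actual_hit):
        """Speculatively issued dependents must be squashed and re-scheduled
        when the speculation turns out wrong."""
        if policy == "assume_hit":
            return not actual_hit
        if policy == "assume_miss":
            return False
        return predicted_hit != actual_hit   # policy == "predict"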

What to Do with Dependents on a Load Miss? (I)
A load miss can take hundreds of cycles. If there are many instructions dependent on the missing load, they can clog the scheduling window: independent instructions cannot be allocated reservation stations, and scheduling can stall.
How to avoid this? Idea: move miss-dependent instructions into a separate buffer. Example: the Pentium 4's "scheduling loops". Lebeck et al., "A Large, Fast Instruction Window for Tolerating Cache Misses," ISCA 2002. A sketch of the idea follows.
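A minimal sketch in the spirit of (but much simpler than) Lebeck et al.'s design; instructions here are dicts with hypothetical "dest"/"sources" fields, and the window is assumed to be kept in program order.

    scheduling_window = []   # small, fast, expensive structure
    waiting_buffer = []      # larger side buffer for miss-dependent work

    def on_load_miss(miss_dest_reg):
        """Drain everything transitively dependent on the missing load out
        of the scheduling window so independent work can keep flowing."""
        poisoned = {miss_dest_reg}
        kept = []
        for insn in scheduling_window:   # program order: deps propagate forward
            if poisoned.intersection(insn["sources"]):
                poisoned.add(insn["dest"])
                waiting_buffer.append(insn)
            else:
                kept.append(insn)
        scheduling_window[:] = kept

    def on_miss_data_return():
        """Re-insert the waiting instructions once the miss is serviced."""
        scheduling_window.extend(waiting_buffer)
        waiting_buffer.clear()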

What to Do with Dependents on a Load Miss? (II)
But the dependents still hold on to their physical registers, and the register file cannot be scaled indefinitely since it is on the critical path.
Possible solution: deallocate the physical registers of dependents. This makes re-allocation difficult; see Srinivasan et al., "Continual Flow Pipelines," ASPLOS 2004.

Questions
Why is OoO execution beneficial? What if all operations took a single cycle? Latency tolerance: OoO execution tolerates the latency of multi-cycle operations by executing independent operations concurrently.
What if an instruction takes 500 cycles? How large an instruction window do we need to continue decoding? How many cycles of latency can OoO tolerate?
What limits the latency-tolerance scalability of Tomasulo's algorithm? The active/instruction window size, which is determined by the register file, scheduling window, reorder buffer, store buffer, and load buffer.

Small Windows: Full-window Stalls
When a long-latency instruction is not complete, it blocks retirement. Incoming instructions fill the instruction window, and once the window is full, the processor cannot place new instructions into it. This is called a full-window stall. A full-window stall prevents the processor from making progress in the execution of the program.

Small Windows: Full-window Stalls
An 8-entry instruction window (oldest at the top):
  LOAD R1 ← mem[R5]     (L2 miss! Takes 100s of cycles)
  BEQ R1, R0, target
  ADD R2 ← R2, 8
  LOAD R3 ← mem[R2]
  MUL R4 ← R4, R3
  ADD R4 ← R4, R5
  STOR mem[R2] ← R4
  ADD R2 ← R2, 64
Instructions independent of the L2 miss can be executed out of program order, but they cannot be retired. Younger instructions (e.g., the next LOAD R3 ← mem[R2]) cannot be executed because there is no space in the instruction window. The processor stalls until the L2 miss is serviced. L2 cache misses are responsible for most full-window stalls.
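A back-of-the-envelope model of the stall, assuming the front end allocates roughly one instruction per cycle; the window size and miss latency below are illustrative.

    WINDOW_SIZE = 8
    MISS_LATENCY = 300   # cycles to service the L2 miss (illustrative)

    def frontend_stall_cycles():
        """After the miss reaches the window head, only WINDOW_SIZE - 1
        younger instructions can be allocated; once the window is full,
        the front end idles until the miss data returns."""
        fill_cycles = WINDOW_SIZE - 1          # ~1 allocation per cycle
        return max(0, MISS_LATENCY - fill_cycles)

    print(frontend_stall_cycles())   # -> 293: almost the whole miss latency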

Impact of L2 Cache Misses
(Chart omitted. Configuration: 512KB L2 cache, 500-cycle DRAM latency, aggressive stream-based prefetcher. Data averaged over 147 memory-intensive benchmarks on a high-end x86 processor model.)

The Problem
Out-of-order execution requires large instruction windows to tolerate today's main memory latencies; as main memory latency increases, the instruction window size must also increase to fully tolerate the memory latency.
Building a large instruction window is a challenging task if we would like to achieve:
  Low power/energy consumption (tag-matching logic, load/store buffers)
  Short cycle time (access, wakeup/select latencies)
  Low design and verification complexity

Efficient Scaling of Instruction Window Size
One of the major research issues in out-of-order execution: how to achieve the benefits of a large window with a small one (or in a simpler way)?
Runahead execution? Upon an L2 miss, checkpoint architectural state, speculatively execute only for prefetching, and re-execute when the data is ready.
Continual flow pipelines? Upon an L2 miss, deallocate everything belonging to an L2-miss-dependent instruction; re-allocate/re-rename and re-execute when the data is ready.
Dual-core execution? One core runs ahead and does not stall on L2 misses, feeding another core that commits instructions.

Runahead Execution (I)
A technique to obtain the memory-level parallelism benefits of a large instruction window.
When the oldest instruction is a long-latency cache miss: checkpoint architectural state and enter runahead mode.
In runahead mode: speculatively pre-execute instructions. The purpose of pre-execution is to generate prefetches; L2-miss-dependent instructions are marked INV and dropped.
Runahead mode ends when the original miss returns: the checkpoint is restored and normal execution resumes.
Mutlu et al., "Runahead Execution: An Alternative to Very Large Instruction Windows for Out-of-order Processors," HPCA 2003. A sketch of this control flow follows.
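A minimal Python sketch of the control flow just described; this is hypothetical pseudocode of the HPCA 2003 mechanism, not any real core's logic.

    class RunaheadCore:
        def __init__(self):
            self.arch_regs = {}        # architectural register state
            self.in_runahead = False
            self.checkpoint = None

        def on_l2_miss_at_window_head(self):
            """Oldest instruction misses in L2: checkpoint and run ahead."""
            if not self.in_runahead:
                self.checkpoint = dict(self.arch_regs)
                self.in_runahead = True
            # Fetch/execute continues speculatively; results are used only
            # to generate prefetches and train predictors.

        def on_original_miss_return(self):
            """The runahead-causing miss is serviced: restore and resume."""
            if self.in_runahead:
                self.arch_regs = self.checkpoint
                self.in_runahead = False
            # Normal execution re-executes starting at the missing load.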

Runahead Example
Perfect caches: Compute, Load 1 hit, Compute, Load 2 hit. No stalls.
Small window: Compute, Load 1 miss, stall for Miss 1, Compute, Load 2 miss, stall for Miss 2. The two misses are serviced one after the other.
Runahead: Compute, Load 1 miss, runahead mode (during which Load 2 also misses, so Miss 2 overlaps with Miss 1), then Compute with both Load 1 and Load 2 hitting. The overlap of the two misses yields the saved cycles.
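A back-of-the-envelope cycle count for this example, assuming two independent misses of latency M separated by C compute cycles and ignoring runahead overhead; the numbers are illustrative.

    M = 300   # miss latency in cycles (illustrative)
    C = 50    # compute cycles between the two loads (illustrative)

    small_window = C + M + C + M   # misses serviced serially
    runahead     = C + M + C       # Miss 2 fully overlapped under Miss 1
    saved_cycles = small_window - runahead
    print(saved_cycles)            # -> 300: roughly one whole miss latency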

Benefits of Runahead Execution
Instead of stalling during an L2 cache miss:
  Pre-executed loads and stores independent of the L2-miss instructions generate very accurate data prefetches, for both regular and irregular access patterns.
  Instructions on the predicted program path are prefetched into the instruction/trace cache and L2.
  Hardware prefetcher and branch predictor tables are trained using future access information.

Runahead Execution Mechanism
Entry into runahead mode: checkpoint the architectural register state.
Instruction processing in runahead mode.
Exit from runahead mode: restore the architectural register state from the checkpoint.

Instruction Processing in Runahead Mode
Runahead mode processing is the same as normal instruction processing, EXCEPT:
  It is purely speculative: architectural (software-visible) register/memory state is NOT updated in runahead mode.
  L2-miss-dependent instructions are identified and treated specially: they are quickly removed from the instruction window, and their results are not trusted.

L2-Miss Dependent Instructions
Two types of results are produced: INV and VALID. INV = dependent on an L2 miss.
INV results are marked using INV bits in the register file and store buffer. INV values are not used for prefetching or branch resolution. A sketch of INV propagation follows.
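A minimal sketch of INV-bit propagation through the register file during runahead mode; the instruction encoding and ALU stub are hypothetical.

    inv_bit = {}   # register name -> True if its value depends on an L2 miss

    def alu(insn):
        return 0   # stand-in for the real functional unit

    def execute_in_runahead(insn):
        """Any instruction sourcing an INV value produces an INV result;
        only VALID results are used for prefetching/branch resolution."""
        if insn["op"] == "load" and insn["l2_miss"]:
            inv_bit[insn["dest"]] = True      # mark the miss's result INV
            return None                       # instruction is dropped
        if any(inv_bit.get(src, False) for src in insn["sources"]):
            inv_bit[insn["dest"]] = True      # INV propagates to dependents
            return None                       # result is not trusted
        inv_bit[insn["dest"]] = False
        return alu(insn)                      # VALID result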

Removal of Instructions from Window
The oldest instruction is examined for pseudo-retirement:
  An INV instruction is removed from the window immediately.
  A VALID instruction is removed when it completes execution.
Pseudo-retired instructions free their allocated resources, which allows the processing of later instructions. Pseudo-retired stores communicate their data to dependent loads.
Exceptions are ignored. I/O operations that modify the state of I/O devices are not performed. Uncacheable blocks are not fetched. A sketch of the pseudo-retirement check follows.
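A minimal sketch of the pseudo-retirement check at the window head; entries are hypothetical dicts with "inv" and "completed" flags.

    from collections import deque

    def pseudo_retire(window: deque):
        """INV instructions leave the window immediately; VALID ones wait
        until they complete. Freed entries admit younger instructions."""
        while window:
            head = window[0]
            if head["inv"] or head["completed"]:
                window.popleft()   # free the entry's resources
            else:
                break              # VALID but still executing: wait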

Store/Load Handling in Runahead Mode
A pseudo-retired store writes its data and INV status to a dedicated memory called the runahead cache. Purpose: data communication through memory in runahead mode. A dependent load reads its data from the runahead cache.
This communication does not need to be always correct, so the runahead cache can be very small. A sketch follows.
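A minimal sketch of the runahead cache as a tiny bounded store, modeled here with LRU eviction; the capacity and names are illustrative.

    from collections import OrderedDict

    RA_CACHE_ENTRIES = 64       # the runahead cache can be very small
    ra_cache = OrderedDict()    # address -> (data, inv_bit)

    def runahead_store(addr, data, inv):
        """A pseudo-retired store writes its data and INV status."""
        ra_cache[addr] = (data, inv)
        ra_cache.move_to_end(addr)
        if len(ra_cache) > RA_CACHE_ENTRIES:
            # Evicting is fine: the data only feeds speculative prefetching,
            # so occasionally losing a forwarded value is harmless.
            ra_cache.popitem(last=False)

    def runahead_load(addr):
        """A dependent load reads through the runahead cache if present;
        None means: fall back to the normal memory hierarchy."""
        return ra_cache.get(addr)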

Branch Handling in Runahead Mode
INV branches cannot be resolved; a mispredicted INV branch causes the processor to stay on the wrong program path until the end of runahead execution.
VALID branches are resolved and initiate recovery if mispredicted. A sketch follows.
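A minimal sketch of this resolution policy; the return values are illustrative labels.

    def resolve_branch_in_runahead(inv, predicted_taken, actual_taken):
        """VALID branches check the prediction and recover on a mispredict;
        INV branches cannot be checked at all."""
        if inv:
            return "unresolved"   # stay on the (possibly wrong) predicted path
        if predicted_taken != actual_taken:
            return "recover"      # flush runahead wrong-path work
        return "correct"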

Runahead Execution (III)
Advantages:
+ Very accurate prefetches for data/instructions (all cache levels)
+ Follows the program path
+ Simple to implement; most of the hardware is already built in
+ Uses the same thread context as the main thread; no waste of context
+ No need to construct a pre-execution thread
Disadvantages/Limitations:
-- Extra executed instructions
-- Limited by branch prediction accuracy
-- Cannot prefetch dependent cache misses. Solution?
-- Effectiveness limited by the available memory-level parallelism (MLP)
-- Prefetch distance limited by memory latency
Implemented in the IBM POWER6 and Sun "Rock".