Runahead Execution A Power-Efficient Alternative to Large Instruction Windows for Tolerating Long Main Memory Latencies Onur Mutlu EE 382N Guest Lecture.

1 Runahead Execution: A Power-Efficient Alternative to Large Instruction Windows for Tolerating Long Main Memory Latencies
Onur Mutlu
EE 382N Guest Lecture

2 Outline
- Motivation
- Runahead Execution
- Evaluation
- Limitations of the Baseline Runahead Mechanism
- Overcoming the Limitations
  - Efficient Runahead Execution
  - Address-Value Delta (AVD) Prediction
- Summary

3 Motivation
- Memory latency is very long in today’s processors (hundreds of processor clock cycles)
  - CDC 6600: 10 cycles [Thornton, 1970]
  - Alpha 21264: 120+ cycles [Wilkes, 2001]
  - Intel Pentium 4: 300+ cycles [Sprangle & Carmean, 2002]
- And it continues to increase (in terms of processor cycles)
  - DRAM latency is not decreasing as fast as processor cycle time
- Existing techniques for tolerating memory latency do not work well enough.

4 Existing Latency Tolerance Techniques
- Caching [Wilkes, 1965]
  - Widely used, simple, effective, but not enough, inefficient, passive
  - Not all applications/phases exhibit temporal or spatial locality
  - Cache miss penalty is very high
- Prefetching [IBM 360/91, 1967]
  - Works well for regular memory access patterns
  - Prefetching irregular access patterns is difficult, inaccurate, and hardware-intensive
- Multithreading [CDC 6600, 1964]
  - Does not improve the performance of a single thread
- Out-of-order execution [Tomasulo, 1967]
  - Tolerates cache misses that cannot be prefetched
  - Requires extensive hardware resources for tolerating long latencies

5 Out-of-order Execution
- Instructions are executed out of sequential program order to tolerate latency.
  - Execution of an older long-latency instruction is overlapped with the execution of younger independent instructions.
- Instructions are retired in program order to support precise exceptions/interrupts.
- Not-yet-retired instructions and their results are buffered in hardware structures, called the instruction window:
  - Scheduling window (reservation stations)
  - Register files
  - Load and store buffers
  - Reorder buffer

6 Small Windows: Full-window Stalls
- When a long-latency instruction is not complete, it blocks retirement.
- Incoming instructions fill the instruction window if the window is not large enough.
- The processor cannot place new instructions into the window if the window is already full. This is called a full-window stall.
- A full-window stall prevents the processor from making progress in the execution of the program.

7 Small Windows: Full-window Stalls
8-entry instruction window (figure; recoverable entries, oldest first):
    LOAD R1 ← mem[A]      L2 Miss! Takes 100s of cycles.
    BEQ R1, R0, target
- Instructions independent of the L2 miss are executed out of program order, but cannot be retired.
- Younger instructions cannot be executed because there is no space in the instruction window.
- The processor stalls until the L2 miss is serviced.
- L2 cache misses are responsible for most full-window stalls.
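
The stall described above can be illustrated with a toy cycle count. This is a minimal sketch under simplifying assumptions that are not from the slides: the L2-miss load is already the oldest instruction, nothing retires while the miss is outstanding, and the front end inserts a fixed number of instructions per cycle. The function name and its parameters are hypothetical.

```python
def full_window_stall_cycles(window_size, fetch_width, miss_latency):
    """Count cycles with a full window while the oldest load's L2 miss is outstanding.

    Toy model (assumption): the miss blocks retirement for `miss_latency`
    cycles, so window entries are only ever added, never freed.
    """
    occupancy = 1  # the L2-miss load itself sits at the head of the window
    stalled = 0
    for _ in range(miss_latency):
        free = window_size - occupancy
        if free == 0:
            stalled += 1  # full-window stall: no new instruction can enter
        else:
            occupancy += min(fetch_width, free)
    return stalled


if __name__ == "__main__":
    # 8-entry window as on the slide; the 300-cycle miss and 1 insert/cycle are assumed.
    print(full_window_stall_cycles(window_size=8, fetch_width=1, miss_latency=300))
```

In this model a small window fills within a few cycles, so almost the entire miss latency is spent stalled; only a window with roughly `fetch_width * miss_latency` entries avoids the stall.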

8 Impact of L2 Cache Misses
(Chart not reproduced.)
- 500-cycle DRAM latency, 4.25 GB/s DRAM bandwidth, aggressive stream-based prefetcher
- Data averaged over 147 memory-intensive benchmarks on a high-end x86 processor model

9 The Problem
- Out-of-order execution requires large instruction windows to tolerate today’s main memory latencies.
  - Even in the presence of caches and prefetchers
- As main memory latency (in terms of processor cycles) increases, instruction window size should also increase to fully tolerate the memory latency.
- Building a large instruction window is not an easy task if we would like to achieve
  - Low power/energy consumption
  - Short cycle time
  - Low design and verification complexity

11 Overview of Runahead Execution [HPCA’03]
- A technique to obtain the memory-level parallelism benefits of a large instruction window (without having to build it!)
- When the oldest instruction is an L2 miss:
  - Checkpoint architectural state and enter runahead mode
- In runahead mode:
  - Instructions are speculatively pre-executed
  - The purpose of pre-execution is to discover other L2 misses
  - L2-miss dependent instructions are marked INV and dropped
- Runahead mode ends when the original L2 miss returns
  - Checkpoint is restored and normal execution resumes

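12 Runahead Example
(Timeline figure; recoverable event order for each case:)
- Perfect Caches: Compute, Load 1 Hit, Compute, Load 2 Hit, Compute
- Small Window: Compute, Load 1 Miss, Stall (Miss 1), Compute, Load 2 Miss, Stall (Miss 2), Compute
- Runahead: Compute, Load 1 Miss, Runahead (Load 2 Miss is issued while Miss 1 is outstanding, so Miss 1 and Miss 2 overlap), Load 1 Hit, Compute, Load 2 Hit, Compute. The overlap is labeled Saved Cycles.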
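
The three timelines can be compared with simple arithmetic. The sketch below is an illustration under assumptions that are not stated on the slide: the two loads are independent, runahead mode pre-executes at the same rate as normal mode, entering and leaving runahead mode costs nothing, and both misses take the same number of cycles. The function names and the example numbers are hypothetical.

```python
def perfect_cache_cycles(compute):
    """Both loads hit: only the three compute segments take time."""
    return sum(compute)


def small_window_cycles(compute, miss_latency):
    """The processor stalls for the full latency of each miss, one after the other."""
    c1, c2, c3 = compute
    return c1 + miss_latency + c2 + miss_latency + c3


def runahead_cycles(compute, miss_latency):
    """Load 2's miss is started during the runahead period triggered by Load 1."""
    c1, c2, c3 = compute
    miss1_done = c1 + miss_latency          # runahead ends, normal mode resumes
    reach_load2 = miss1_done + c2           # normal mode re-executes up to Load 2
    if c2 < miss_latency:
        # Runahead reaches Load 2 after c2 cycles and issues Miss 2 early.
        miss2_done = c1 + c2 + miss_latency
    else:
        # Runahead period is too short to reach Load 2: no overlap.
        miss2_done = reach_load2 + miss_latency
    return max(reach_load2, miss2_done) + c3


if __name__ == "__main__":
    compute, latency = (50, 50, 50), 500   # assumed example values
    print(perfect_cache_cycles(compute))
    print(small_window_cycles(compute, latency))
    print(runahead_cycles(compute, latency))
```

With these assumed numbers the runahead case saves exactly one miss latency relative to the small window, which corresponds to the "Saved Cycles" region of the figure.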

13 Benefits of Runahead Execution
Instead of stalling during an L2 cache miss:
- Pre-executed loads and stores independent of L2-miss instructions generate very accurate data prefetches:
  - From main memory to L2
  - From L2 to L1
  - For both regular and irregular access patterns
- Instructions on the predicted program path are prefetched into the instruction/trace cache and L2.
- Hardware prefetcher and branch predictor tables are trained using future access information.

14 Runahead Execution Mechanism
- Entry into runahead mode
- Instruction processing in runahead mode
  - INV bits and pseudo-retirement
  - Handling of store/load instructions (runahead cache)
  - Handling of branch instructions
- Exit from runahead mode
- Modifications to pipeline

15 Entry into Runahead Mode
When an L2-miss load instruction is the oldest in the instruction window:
- Processor checkpoints
  - architectural register state (correctness)
  - branch history register (performance)
  - return address stack (performance)
- Processor records the address of the L2-miss load.
- Processor enters runahead mode.
- L2-miss load marks its destination register as INV (invalid) and is removed from the instruction window.

16 Processing in Runahead Mode
Runahead mode processing is the same as normal instruction processing, EXCEPT:
- It is purely speculative (no architectural state updates).
- L2-miss dependent instructions are identified and treated specially.
  - They are quickly removed from the instruction window.
  - Their values are not trusted.

17 Processing in Runahead Mode
- Two types of results are produced: INV (invalid) and VALID
- INV = dependent on an L2 miss
  - The first INV result is produced by the L2-miss load that caused entry into runahead mode.
- An instruction produces an INV result
  - if it sources an INV result
  - if it misses in the L2 cache (a prefetch request is generated)
- INV results are marked using INV bits in the register file, store buffer, and runahead cache.
  - INV bits prevent introduction of bogus data into the pipeline.
  - Bogus values are not used for prefetching/branch resolution.
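
The two INV rules above can be sketched as a pass over a short instruction sequence. This is an illustrative model only: the `Uop` record, the register names, and the example program are hypothetical, and only register INV bits are modeled (the store buffer and runahead cache bits are left out).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Uop:
    name: str
    dest: str
    srcs: tuple
    l2_miss: bool = False  # True for a load that would miss in the L2 cache


def run_inv_tracking(uops, inv_regs):
    """Return (registers left INV, names of loads that generated prefetches)."""
    inv = set(inv_regs)
    prefetches = []
    for u in uops:
        if any(src in inv for src in u.srcs):
            # Rule 1: sourcing an INV result makes the result INV.
            # A load with an INV address cannot access memory at all.
            inv.add(u.dest)
        elif u.l2_miss:
            # Rule 2: a VALID-address load that misses in L2 becomes INV,
            # and its miss goes to memory as a prefetch request.
            prefetches.append(u.name)
            inv.add(u.dest)
        else:
            # A VALID result overwrites the register and clears its INV bit.
            inv.discard(u.dest)
    return inv, prefetches


# R1 is INV on entry: it is the destination of the runahead-causing load.
PROGRAM = [
    Uop("add", dest="R2", srcs=("R1", "R3")),
    Uop("load2", dest="R4", srcs=("R5",), l2_miss=True),
    Uop("load3", dest="R6", srcs=("R2",), l2_miss=True),
    Uop("mov", dest="R1", srcs=("R7",)),
]

if __name__ == "__main__":
    inv, prefetches = run_inv_tracking(PROGRAM, inv_regs={"R1"})
    print(sorted(inv), prefetches)
```

In the example, `load2` is independent of the original miss and produces a prefetch, while `load3` depends on an INV register and produces none.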

18 Pseudo-retirement in Runahead Mode
- An instruction is examined for pseudo-retirement when it becomes the oldest in the instruction window.
  - An INV instruction is removed from the window immediately.
  - A VALID instruction is removed when it completes execution and updates only the microarchitectural state.
- Architectural (software-visible) register/memory state is NOT updated in runahead mode.
- Pseudo-retired instructions free their allocated resources.
  - This allows the processing of later instructions.
- Pseudo-retired stores communicate their data and INV status to dependent loads.
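
A minimal sketch of the pseudo-retirement check at the head of the window follows. The window is modeled as a queue of dictionaries with hypothetical `name`, `inv`, and `done` fields, and the per-cycle retirement limit is an assumed parameter rather than a figure from the slides.

```python
from collections import deque


def pseudo_retire(window, max_per_cycle=3):
    """Remove instructions from the head of the window for one cycle.

    An INV head leaves immediately; a VALID head leaves only once it has
    completed execution. Nothing is written to architectural state.
    """
    retired = []
    while window and len(retired) < max_per_cycle:
        head = window[0]
        if head["inv"] or head["done"]:
            retired.append(window.popleft()["name"])
        else:
            break  # oldest instruction is VALID but still executing
    return retired


if __name__ == "__main__":
    window = deque([
        {"name": "ld1", "inv": True, "done": False},   # the L2-miss load
        {"name": "add", "inv": False, "done": True},
        {"name": "mul", "inv": False, "done": False},
        {"name": "sub", "inv": True, "done": False},
    ])
    print(pseudo_retire(window))  # stops at the unfinished VALID instruction
```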

19 Store/Load Handling in Runahead Mode
- A pseudo-retired store writes its data and INV status to a dedicated scratchpad memory, called the runahead cache.
- Purpose: data communication through memory in runahead mode.
- A runahead load accesses the store buffer, runahead cache, and L1 data cache in parallel.
- If a runahead load is dependent on a pseudo-retired runahead store, it reads its data and INV status from the runahead cache.
- The runahead cache does not need to be always correct
  - Size of runahead cache is very small (512 bytes).
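
The store-to-load path through the runahead cache might be modeled as below. This is a sketch, not the hardware design: the entry count, the oldest-first eviction, and the priority order used to pick among the three structures are assumptions, and the parallel lookup is modeled as a sequential check. All class and function names are hypothetical.

```python
from collections import OrderedDict


class RunaheadCache:
    """Tiny scratchpad holding (data, INV) pairs written by pseudo-retired stores."""

    def __init__(self, capacity_entries=64):  # assumed: 512 bytes / 8-byte entries
        self.capacity = capacity_entries
        self.lines = OrderedDict()  # address -> (data, inv)

    def write(self, addr, data, inv):
        self.lines.pop(addr, None)
        self.lines[addr] = (data, inv)
        if len(self.lines) > self.capacity:
            # Losing an entry is tolerable: runahead results need not be correct.
            self.lines.popitem(last=False)

    def read(self, addr):
        return self.lines.get(addr)


def runahead_load(addr, store_buffer, runahead_cache, l1_cache):
    """Return (data, inv) for a runahead load, or None if no structure has it."""
    if addr in store_buffer:            # store not yet pseudo-retired
        return store_buffer[addr]
    hit = runahead_cache.read(addr)     # pseudo-retired runahead store
    if hit is not None:
        return hit
    if addr in l1_cache:                # ordinary cached data is VALID
        return (l1_cache[addr], False)
    return None                         # would go on to L2 (not modeled)


if __name__ == "__main__":
    rc = RunaheadCache()
    rc.write(0x100, 7, inv=False)
    rc.write(0x108, 0, inv=True)
    print(runahead_load(0x100, {}, rc, {0x100: 1}))
    print(runahead_load(0x108, {}, rc, {}))
```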

20 Branch Handling in Runahead Mode
- Branch instructions are predicted in runahead mode (as in normal mode execution).
- Runahead branches use the same predictor as normal branches.
- VALID branches are resolved and initiate recovery if mispredicted.
- VALID branches update the branch predictor tables.
  - Early training of the predictor reduces mispredictions in normal mode.
- INV branches cannot be resolved.
  - A mispredicted INV branch causes the processor to stay on the wrong program path until the end of runahead execution.

21 Exit from Runahead Mode
When the runahead-causing L2 miss is serviced:
- All instructions in the machine are flushed.
- INV bits are reset. Runahead cache is flushed.
- Processor restores the architectural state as it was before the runahead-causing instruction was fetched.
  - Architecturally, NOTHING happened.
  - But, hopefully, useful prefetch requests were generated (caches warmed up).
- Processor starts fetch beginning with the runahead-causing instruction (Load 1 is re-fetched and re-executed).
- Mode is switched to normal mode.
  - Instructions executed in runahead mode are re-executed in normal mode.
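
The checkpoint-and-restore behavior at entry and exit can be sketched as a small state machine. This is a simplified illustration: only the register state is checkpointed here (the branch history register and return address stack are omitted), the `regs` dictionary stands in for the working register state that runahead instructions are allowed to overwrite, and the class and method names are hypothetical.

```python
class RunaheadCore:
    def __init__(self, regs):
        self.regs = dict(regs)      # working register state
        self.mode = "normal"
        self.checkpoint = None
        self.restart_pc = None
        self.inv = set()            # registers whose INV bit is set
        self.runahead_cache = {}

    def enter_runahead(self, miss_pc, miss_dest):
        """Called when an L2-miss load becomes the oldest instruction."""
        self.checkpoint = dict(self.regs)  # snapshot for correctness
        self.restart_pc = miss_pc          # where normal mode resumes fetching
        self.inv = {miss_dest}             # first INV result
        self.mode = "runahead"

    def exit_runahead(self):
        """Called when the runahead-causing miss returns; yields the restart PC."""
        self.regs = self.checkpoint        # architecturally, nothing happened
        self.checkpoint = None
        self.inv.clear()
        self.runahead_cache.clear()
        self.mode = "normal"
        return self.restart_pc


if __name__ == "__main__":
    core = RunaheadCore({"R1": 1, "R2": 2})
    core.enter_runahead(miss_pc=0x40, miss_dest="R1")
    core.regs["R2"] = 99                   # speculative runahead update
    print(hex(core.exit_runahead()), core.regs)
```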

22 Modifications to Pipeline
- < 0.05% area overhead
(Pipeline block diagram not reproduced. Its labels include INV bits, CHECKPOINTED STATE, and RUNAHEAD CACHE, shown alongside the trace cache fetch unit, instruction decoder, uop queue, renamer, frontend and backend RATs, Int/FP/Mem queues and schedulers, register file, execution units, reorder buffer, store buffer, L1 data cache, L2 cache, prefetcher, L2 access queue, and front side bus access queue.)

24 Baseline Processor
- 3-wide fetch, 29-stage pipeline x86 processor
- 128-entry instruction window
  - 48-entry load buffer, 32-entry store buffer
- 512 KB, 8-way, 16-cycle unified L2 cache
- Approximately 500-cycle L2 miss latency
  - Bandwidth, contention, conflicts modeled in detail
- Aggressive streaming data prefetcher (16 streams)
- Next-two-lines instruction prefetcher

25 Evaluated Benchmarks
- 147 Intel x86 benchmarks simulated for 30 million instructions
- Benchmark suites
  - SPEC CPU 95 (S95): 10 benchmarks, mostly FP
  - SPEC FP 2000 (FP00): 11 benchmarks
  - SPEC INT 2000 (INT00): 6 benchmarks
  - Internet (WEB): 18 benchmarks: SpecJbb, Webmark2001
  - Multimedia (MM): 9 benchmarks: mpeg, speech rec., quake
  - Productivity (PROD): 17 benchmarks: Sysmark2k, winstone
  - Server (SERV): 2 benchmarks: TPC-C, timesten
  - Workstation (WS): 7 benchmarks: CAD, nastran, verilog

26 Performance of Runahead Execution

27 Runahead vs. Large Windows

28 Runahead vs. Large Windows (Alpha)

30 Limitations of the Baseline Runahead Mechanism
- Energy inefficiency
  - Many instructions are speculatively executed, sometimes without providing any performance benefit
  - On average 27% extra instructions for 22% IPC improvement
  - Efficient Runahead Execution [ISCA’05, IEEE Micro Top Picks’06]
- Ineffectiveness for pointer-intensive applications
  - Runahead cannot parallelize dependent L2 cache misses
  - Address-Value Delta (AVD) Prediction [MICRO’05]
- Irresolvable branch mispredictions in runahead mode
  - Cannot recover from a mispredicted L2-miss dependent branch
  - Wrong Path Events [MICRO’04]

32 Causes of Energy Inefficiency
- Short runahead periods
  - Periods that last for 10s of cycles instead of 100s
- Overlapping runahead periods
  - Two periods that execute the same dynamic instructions (e.g., due to dependent L2 cache misses)
- Useless runahead periods
  - Periods that do not result in the discovery of L2 cache misses

33 Efficient Runahead Execution
- Key idea: Predict if a runahead period is going to be short, overlapping, or useless. If so, do not enter runahead mode.
- Simple hardware and software (compile-time profiling) techniques are effective predictors.
  - Details in our papers.
- Extra executed instructions decrease from 27% to 6%
- Performance improvement remains the same (22%)

35 The Problem: Dependent Cache Misses
- Runahead execution cannot parallelize dependent misses
- This limitation results in
  - wasted opportunity to improve performance
  - wasted energy (useless pre-execution)
- Runahead performance would improve by 25% if this limitation were ideally overcome
(Timeline figure: Load 2 is dependent on Load 1. In runahead mode Load 2 is INV because runahead cannot compute its address, so Miss 2 begins only after Load 1 hits in normal mode.)

36 The Goal of AVD Prediction
- Enable the parallelization of dependent L2 cache misses in runahead mode with a low-cost mechanism
- How:
  - Predict the values of L2-miss address (pointer) loads
- Address load: loads an address into its destination register, which is later used to calculate the address of another load (as opposed to a data load)

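37 Parallelizing Dependent Cache Misses
(Timeline figure; recoverable content for each case:)
- Without value prediction: Load 1 misses; in runahead mode Load 2 is INV because runahead cannot compute its address; Load 2 misses again in normal mode after Load 1 hits.
- With Load 1's value predicted: runahead can compute Load 2's address, so Miss 2 overlaps with Miss 1 and Load 2 hits in normal mode. The figure labels the gains as saved cycles and saved speculative instructions.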

38 AVD Prediction
- The address-value delta (AVD) of a load instruction is defined as:
    AVD = Effective Address of Load - Data Value of Load
- For some address loads, AVD is stable
- An AVD predictor keeps track of the AVDs of address loads
- When a load is an L2 miss in runahead mode, the AVD predictor is consulted
- If the predictor returns a stable (confident) AVD for that load, the value of the load is predicted:
    Predicted Value = Effective Address - Predicted AVD
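
The two formulas above can be checked with a small worked example. The addresses below are made-up values chosen for illustration, and the function names are hypothetical.

```python
def compute_avd(effective_address, data_value):
    """AVD = Effective Address of Load - Data Value of Load."""
    return effective_address - data_value


def predict_value(effective_address, predicted_avd):
    """Predicted Value = Effective Address - Predicted AVD."""
    return effective_address - predicted_avd


if __name__ == "__main__":
    # A load at address 0x1000 returned the pointer 0x1040 (assumed values).
    avd = compute_avd(0x1000, 0x1040)
    # If the same load later misses at address 0x2000, its value is predicted as:
    print(avd, hex(predict_value(0x2000, avd)))
```

Because the predicted value is derived from the load's current effective address, the prediction follows the load as it moves through memory, which is what distinguishes it from simply replaying the last value.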

39 Identifying Address Loads in Hardware
- If the AVD is too large, the value being loaded is likely NOT an address
- Only predict loads that have satisfied: -MaxAVD < AVD < +MaxAVD
- This identification mechanism eliminates almost all data loads from consideration
  - Enables the AVD predictor to be small

40 An Implementable AVD Predictor
- Set-associative prediction table
- Prediction table entry consists of
  - Tag (PC of the load)
  - Last AVD seen for the load
  - Confidence counter for the recorded AVD
- Updated when an address load is retired in normal mode
- Accessed when a load misses in L2 cache in runahead mode
- Recovery-free: No need to recover the state of the processor or the predictor on misprediction
  - Runahead mode is purely speculative
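
A software model of such a predictor might look like the sketch below. It is an approximation under stated assumptions: the table is modeled as fully associative with oldest-first replacement instead of set-associative, and the MaxAVD bound, confidence threshold, and counter limit are made-up values. Dropping an entry when the AVD falls outside the bound is also an assumed policy. The class and method names are hypothetical.

```python
from collections import OrderedDict


class AVDPredictor:
    def __init__(self, entries=16, max_avd=64 * 1024, threshold=2, max_conf=3):
        self.entries = entries        # 16 entries, as in the summary figures
        self.max_avd = max_avd        # assumed MaxAVD bound
        self.threshold = threshold    # assumed confidence needed to predict
        self.max_conf = max_conf      # assumed saturation value of the counter
        self.table = OrderedDict()    # load PC (tag) -> [last_avd, confidence]

    def update(self, pc, effective_address, data_value):
        """Train on a load retired in normal mode."""
        avd = effective_address - data_value
        if not (-self.max_avd < avd < self.max_avd):
            self.table.pop(pc, None)  # value is probably not an address
            return
        entry = self.table.get(pc)
        if entry is None:
            if len(self.table) >= self.entries:
                self.table.popitem(last=False)  # evict the oldest entry
            self.table[pc] = [avd, 0]
        elif entry[0] == avd:
            entry[1] = min(entry[1] + 1, self.max_conf)
        else:
            entry[0], entry[1] = avd, 0         # new AVD: start over

    def predict(self, pc, effective_address):
        """Consulted for an L2-miss load in runahead mode; None means no prediction."""
        entry = self.table.get(pc)
        if entry is None or entry[1] < self.threshold:
            return None
        return effective_address - entry[0]


if __name__ == "__main__":
    p = AVDPredictor()
    for i in range(4):  # a load walking nodes allocated 32 bytes apart
        p.update(pc=0x400, effective_address=0x1000 + 32 * i,
                 data_value=0x1000 + 32 * (i + 1))
    print(hex(p.predict(pc=0x400, effective_address=0x5000)))
```

Since a wrong prediction in runahead mode only produces a useless prefetch, the model needs no rollback path, which matches the recovery-free property listed above.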

41 Why Do Stable AVDs Occur?
- Regularity in the way data structures are
  - allocated in memory AND
  - traversed
- Two types of loads can have stable AVDs
  - Traversal address loads: produce addresses consumed by address loads
  - Leaf address loads: produce addresses consumed by data loads

42 Traversal Address Loads
- Regularly-allocated linked list: A, A+k, A+2k, A+3k, A+4k, A+5k, ...
- A traversal address load loads the pointer to the next node: node = node->next
- AVD = Effective Addr - Data Value

    Effective Addr   Data Value   AVD
    A                A+k          -k
    A+k              A+2k         -k
    A+2k             A+3k         -k
    A+3k             A+4k         -k
    A+4k             A+5k         -k

- The data value strides; the AVD is stable.

43 Properties of Traversal-based AVDs
- Stable AVDs can be captured with a stride value predictor
- Stable AVDs disappear with the re-organization of the data structure (e.g., sorting)
  - Figure: sorting relinks the nodes A, A+k, A+2k, A+3k in a different order, so the distance between consecutive nodes is NOT constant
- Stability of AVDs is dependent on the behavior of the memory allocator
  - Allocation of contiguous, fixed-size chunks is useful
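
The effect of allocation order and of sorting on traversal AVDs can be reproduced with a few lines. The sketch assumes the `next` pointer sits at offset 0 of each node, so the effective address of the traversal load equals the node address; the base address, node size, and the post-sorting link order are made-up example values.

```python
def traversal_avds(link_order):
    """AVD of each `node = node->next` load while walking nodes in link order."""
    return [addr - nxt for addr, nxt in zip(link_order, link_order[1:])]


A, k = 0x1000, 0x20  # assumed base address and node size

# Nodes linked in the order they were allocated: every AVD equals -k.
allocation_order = [A + i * k for i in range(6)]

# A hypothetical order after sorting the list by key: the AVDs vary.
sorted_order = [A + 3 * k, A + k, A, A + 2 * k]

if __name__ == "__main__":
    print(traversal_avds(allocation_order))
    print(traversal_avds(sorted_order))
```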

44 Leaf Address Loads
- Sorted dictionary in parser:
  - Nodes point to strings (words)
  - String and node are allocated consecutively: the string at A and its node at A+k (likewise B and B+k, C and C+k, ..., G and G+k)
- The dictionary is looked up for an input word.
- A leaf address load loads the pointer to the string of each node:

    lookup (node, input) {
      // ...
      ptr_str = node->string;
      m = check_match(ptr_str, input);
      if (m >= 0) lookup(node->right, input);
      if (m ...) lookup(node->left, input);
    }

- AVD = Effective Addr - Data Value

    Effective Addr   Data Value   AVD
    A+k              A            k
    C+k              C            k
    F+k              F            k

- The AVD is stable, but the data value has no stride.

45 Properties of Leaf-based AVDs
- Stable AVDs cannot be captured with a stride value predictor
- Stable AVDs do not disappear with the re-organization of the data structure (e.g., sorting)
  - Figure: sorting reorders the nodes A+k, B+k, C+k, but each node still points to its own string (A, B, C), so the distance between node and string is still constant
- Stability of AVDs is dependent on the behavior of the memory allocator
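
The leaf-load case can be illustrated the same way. The sketch assumes the `string` pointer is at offset 0 of each node and that each node is allocated k bytes after its string; the string addresses are arbitrary made-up values, and the function name is hypothetical.

```python
def leaf_avds(visit_order, string_ptr):
    """AVD of each `ptr_str = node->string` load for the nodes visited."""
    return [node - string_ptr[node] for node in visit_order]


k = 0x10                                      # assumed string-to-node distance
strings = [0x1000, 0x2000, 0x2800, 0x4000]    # arbitrary string addresses
string_ptr = {s + k: s for s in strings}      # node at s+k points to string at s
nodes = list(string_ptr)

if __name__ == "__main__":
    # The AVD is k no matter in which order the nodes are visited.
    print(leaf_avds(nodes, string_ptr))
    print(leaf_avds(list(reversed(nodes)), string_ptr))
    # The loaded values themselves do not advance by a constant stride.
    print([b - a for a, b in zip(strings, strings[1:])])
```

Reordering the visit sequence changes which values are loaded but not the node-to-string distance, which is why a stride value predictor fails here while the AVD stays constant.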

46 Performance of AVD Prediction
(Chart not reproduced; annotated value: 12.1%)

47 Effect on Executed Instructions
(Chart not reproduced; annotated value: 13.3%)

48 AVD Prediction vs. Stride Value Prediction
- Performance:
  - Both can capture traversal address loads with stable AVDs (e.g., treeadd)
  - Stride VP cannot capture leaf address loads with stable AVDs (e.g., health, mst, parser)
  - AVD predictor cannot capture data loads with striding data values
    - Predicting these can be useful for the correct resolution of mispredicted L2-miss dependent branches (e.g., parser)
- Complexity:
  - AVD predictor requires far fewer entries (only address loads)
  - AVD prediction logic is simpler (no stride maintenance)

49 AVD vs. Stride VP Performance
(Chart not reproduced; annotated values, in the order extracted: 12.1%, 2.5%, 13.4%, 12.6%, 4.5%, 16%; predictor sizes shown: 16 entries and 4096 entries)

51 Summary
- Cache misses are a major performance limiter in current processors.
- Tolerating L2 cache miss latency by traditional out-of-order execution is too costly in terms of hardware complexity and energy consumption.
- Efficient runahead execution provides the latency tolerance benefit of a large instruction window
  - by parallelizing independent cache misses in the shadow of an L2 miss
  - Modest increases in hardware cost or energy consumption
  - 22% performance increase on a 128-entry window with 6% increase in executed instructions
  - <0.05% increase in processor area
- Address-Value Delta (AVD) prediction enables the parallelization of dependent cache misses
  - by taking advantage of the regularity in the memory allocation patterns
  - A 16-entry AVD predictor improves the performance of runahead execution by 12% on pointer-intensive applications

52 For more information…
- Onur Mutlu, Jared Stark, Chris Wilkerson, and Yale N. Patt, "Runahead Execution: An Alternative to Very Large Instruction Windows for Out-of-order Processors," Proceedings of the 9th International Symposium on High-Performance Computer Architecture (HPCA), pages 129-140, Anaheim, CA, February 2003.
  http://www.ece.utexas.edu/~onur/mutlu_hpca03.pdf
- Onur Mutlu, Jared Stark, Chris Wilkerson, and Yale N. Patt, "Runahead Execution: An Effective Alternative to Large Instruction Windows," IEEE Micro, Special Issue: Micro's Top Picks from Microarchitecture Conferences (MICRO TOP PICKS), Vol. 23, No. 6, pages 20-25, November/December 2003.
  http://www.ece.utexas.edu/~onur/mutlu_ieee_micro03.pdf

53 For even more information…
- Onur Mutlu, Hyesoon Kim, and Yale N. Patt, "Techniques for Efficient Processing in Runahead Execution Engines," Proceedings of the 32nd International Symposium on Computer Architecture (ISCA), pages 370-381, Madison, WI, June 2005.
  http://www.ece.utexas.edu/~onur/mutlu_isca05.pdf
  http://www.ece.utexas.edu/~onur/mutlu_ieee_micro06.pdf
- Onur Mutlu, Hyesoon Kim, and Yale N. Patt, "Address-Value Delta (AVD) Prediction: Increasing the Effectiveness of Runahead Execution by Exploiting Regular Memory Allocation Patterns," Proceedings of the 38th International Symposium on Microarchitecture (MICRO), pages 233-244, Barcelona, Spain, November 2005.
  http://www.ece.utexas.edu/~onur/mutlu_micro05.pdf

54 More Questions?

55 In-order vs. Out-of-order Execution (Alpha)

56 Sensitivity to L2 Cache Size

57 Effect of a Better Front-end

