15-740/18-740 Computer Architecture
Lecture 15: Efficient Runahead Execution
Prof. Onur Mutlu
Carnegie Mellon University
Fall 2011, 10/14/2011

Review Set
Due next Wednesday:
Wilkes, "Slave Memories and Dynamic Storage Allocation," IEEE Trans. on Electronic Computers, 1965.
Jouppi, "Improving Direct-Mapped Cache Performance by the Addition of a Small Fully-Associative Cache and Prefetch Buffers," ISCA 1990.
Recommended:
Hennessy and Patterson, Appendix C.2 and C.3
Liptay, "Structural Aspects of the System/360 Model 85, Part II: The Cache," IBM Systems Journal, 1968.
Qureshi et al., "A Case for MLP-Aware Cache Replacement," ISCA 2006.

Announcements
Milestone I
Due this Friday (Oct 14)
Format: 2 pages
Include results from your initial evaluations. We need to see good progress.
Of course, the results need to make sense (i.e., you should be able to explain them).
Midterm I
October 24
Milestone II
Will be postponed. Stay tuned.

Course Feedback
I have read them.
Fill out the form and return it, if you have not done so.

Last Lecture
More issues in load-related instruction scheduling
Better utilizing the instruction window
Runahead execution

Today
More on runahead execution

Efficient Scaling of Instruction Window Size
One of the major research issues in out-of-order execution: how to achieve the benefits of a large window with a small one (or in a simpler way)?
Runahead execution? Upon an L2 miss, checkpoint architectural state, speculatively execute only for prefetching, re-execute when the data is ready.
Continual flow pipelines? Upon an L2 miss, deallocate everything belonging to an L2-miss dependent instruction, then reallocate/re-rename and re-execute when the data is ready.
Dual-core execution? One core runs ahead and does not stall on L2 misses; it feeds another core that commits instructions.
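
To make the runahead flow concrete, here is a minimal sketch of the control logic in C. All names are my own, not from the slides or any real design; the helpers are left as prototypes a simulator would supply.

    /* Minimal sketch of runahead control logic (all names hypothetical). */
    typedef enum { NORMAL, RUNAHEAD } cpu_mode_t;

    /* Helpers assumed to exist elsewhere in a simulator. */
    void checkpoint_architectural_state(void);
    void restore_architectural_state(void);
    void flush_pipeline(void);
    void mark_dest_register_invalid(int dest_reg);

    static cpu_mode_t mode = NORMAL;

    /* Called when an L2 miss blocks retirement at the window head. */
    void on_blocking_l2_miss(int dest_reg) {
        if (mode == NORMAL) {
            checkpoint_architectural_state();     /* PC, RAT, registers */
            mark_dest_register_invalid(dest_reg); /* miss result is INV */
            mode = RUNAHEAD;  /* keep fetching/executing, commit nothing */
        }
    }

    /* Called when the data of the runahead-causing miss returns. */
    void on_miss_data_return(void) {
        if (mode == RUNAHEAD) {
            flush_pipeline();                /* discard runahead results */
            restore_architectural_state();   /* restart from checkpoint; */
            mode = NORMAL;                   /* caches/predictors stay warm */
        }
    }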

Runahead Example
[Timeline figure]
Perfect caches: Compute, Load 1 Hit, Compute, Load 2 Hit, Compute
Small window: Compute, Load 1 Miss, Stall (Miss 1), Compute, Load 2 Miss, Stall (Miss 2), Compute
Runahead: Compute, Load 1 Miss, Runahead (Miss 1 and Miss 2 overlap), Compute, Load 1 Hit, Load 2 Hit (Saved Cycles)

Benefits of Runahead Execution
Instead of stalling during an L2 cache miss:
Pre-executed loads and stores independent of L2-miss instructions generate very accurate data prefetches, for both regular and irregular access patterns.
Instructions on the predicted program path are prefetched into the instruction/trace cache and L2.
Hardware prefetcher and branch predictor tables are trained using future access information.

Runahead Execution Pros and Cons
Advantages:
+ Very accurate prefetches for data/instructions (all cache levels)
+ Follows the program path
+ Simple to implement; most of the hardware is already built in
+ Versus other pre-execution based prefetching mechanisms:
+ Uses the same thread context as the main thread, no waste of context
+ No need to construct a pre-execution thread
Disadvantages/Limitations:
-- Extra executed instructions
-- Limited by branch prediction accuracy
-- Cannot prefetch dependent cache misses. Solution?
-- Effectiveness limited by available "memory-level parallelism" (MLP)
-- Prefetch distance limited by memory latency
Implemented in IBM POWER6, Sun "Rock"

Memory Level Parallelism (MLP)
Idea: Find and service multiple cache misses in parallel
Why generate multiple misses? Enables latency tolerance: overlaps the latencies of different misses
How to generate multiple misses? Out-of-order execution, multithreading, runahead, prefetching
[Timeline figure: misses A and B overlap in time (parallel misses); miss C stands alone (isolated miss)]
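
As an illustration of the access patterns behind this idea (my example, not from the slides), independent loads can miss in parallel while a pointer chase serializes its misses:

    #include <stddef.h>

    /* Independent accesses: all three loads can miss in parallel,
     * so their memory latencies overlap (high MLP). */
    long sum_three(const long *a, const long *b, const long *c, size_t i) {
        return a[i] + b[i] + c[i];
    }

    /* Dependent accesses: each load needs the previous load's result,
     * so the misses are serialized (isolated misses, low MLP). */
    struct node { struct node *next; long val; };

    long chase(const struct node *n) {
        return n->next->next->val;   /* miss, then miss, then miss */
    }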

Memory Latency Tolerance Techniques
Caching [initially by Wilkes, 1965]
Widely used, simple, effective, but inefficient and passive
Not all applications/phases exhibit temporal or spatial locality
Prefetching [initially in IBM 360/91, 1967]
Works well for regular memory access patterns
Prefetching irregular access patterns is difficult, inaccurate, and hardware-intensive
Multithreading [initially in CDC 6600, 1964]
Works well if there are multiple threads
Improving single-thread performance using multithreading hardware is an ongoing research effort
Out-of-order execution [initially by Tomasulo, 1967]
Tolerates cache misses that cannot be prefetched
Requires extensive hardware resources for tolerating long latencies

Performance of Runahead Execution

Runahead Execution vs. Large Windows

Runahead vs. A Real Large Window
When is one beneficial, when is the other?
Pros and cons of each

Runahead on In-order vs. Out-of-order

Runahead vs. Large Windows (Alpha)

In-order vs. Out-of-order Execution (Alpha)

Sun ROCK Cores
Load miss in L1 cache starts parallelization using 2 HW threads
Ahead thread
Checkpoints state and executes speculatively
Instructions independent of load miss are speculatively executed
Load miss(es) and dependent instructions are deferred to behind thread
Behind thread
Executes deferred instructions and re-defers them if necessary
Memory-Level Parallelism (MLP)
Run ahead on load miss and generate additional load misses
Instruction-Level Parallelism (ILP)
Ahead and behind threads execute independent instructions from different points in program in parallel

ROCK Pipeline

Effect of Runahead in Sun ROCK
[Figure from Chaudhry talk, Aug 2008]

Limitations of the Baseline Runahead Mechanism
Energy inefficiency
A large number of instructions are speculatively executed
Efficient Runahead Execution [ISCA'05, IEEE Micro Top Picks'06]
Ineffectiveness for pointer-intensive applications
Runahead cannot parallelize dependent L2 cache misses
Address-Value Delta (AVD) Prediction [MICRO'05, IEEE TC'06]
Irresolvable branch mispredictions in runahead mode
Cannot recover from a mispredicted L2-miss dependent branch
Wrong Path Events [MICRO'04]

The Efficiency Problem [ISCA'05]
A runahead processor pre-executes some instructions speculatively
Each pre-executed instruction consumes energy
Runahead execution significantly increases the number of executed instructions, sometimes without providing performance improvement

The Efficiency Problem
[Figure: baseline runahead increases IPC by 22% but executed instructions by 27%; memory latency 500 cycles, 1 MB L2 cache]

Causes of Inefficiency
Short runahead periods
Overlapping runahead periods
Useless runahead periods
Mutlu et al., "Efficient Runahead Execution: Power-Efficient Memory Latency Tolerance," ISCA 2005, IEEE Micro Top Picks 2006.

Short Runahead Periods
The processor can initiate runahead mode due to an already in-flight L2 miss generated by the prefetcher, the wrong path, or a previous runahead period
Short periods
are less likely to generate useful L2 misses
have high overhead due to the flush penalty at runahead exit
[Timeline figure: runahead is entered late in Miss 1, so the period is short; Load 2's miss is generated just before Load 1 Hit ends the period, and Load 2 misses again in normal mode]

Eliminating Short Periods
Mechanism to eliminate short periods:
Record the number of cycles C an L2 miss has been in flight
If C is greater than a threshold T for an L2 miss, disable entry into runahead mode due to that miss
T can be determined statically (at design time) or dynamically
T = 400 for a minimum main memory latency of 500 cycles works well
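
A minimal sketch of this check in C (the structure and names are mine; the slide specifies only the C > T test and T = 400):

    /* Sketch of the short-period filter (hypothetical names). A miss that
     * has been in flight longer than T cycles will return soon, so a
     * runahead period started for it would be short. */
    #define RUNAHEAD_THRESHOLD_T 400   /* for 500-cycle min memory latency */

    typedef struct {
        unsigned cycles_in_flight;  /* C: incremented each cycle while the
                                       miss is outstanding (e.g., in an MSHR) */
    } l2_miss_t;

    /* Do not enter runahead for a miss that is about to return. */
    int may_enter_runahead(const l2_miss_t *miss) {
        return miss->cycles_in_flight <= RUNAHEAD_THRESHOLD_T;
    }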

Overlapping Runahead Periods
Two runahead periods that execute the same instructions
The second period is inefficient
[Timeline figure: the period triggered by Miss 1 (in which Load 2 is INV) and the period triggered by Miss 2 pseudo-retire the same instructions; the shared portions are marked OVERLAP]

Eliminating Overlapping Periods
Overlapping periods are not necessarily useless
The availability of a new data value can result in the generation of useful L2 misses
But this does not happen often enough
Mechanism to eliminate overlapping periods:
Keep track of the number of pseudo-retired instructions R during a runahead period
Keep track of the number of fetched instructions N since the exit from the last runahead period
If N < R, do not enter runahead mode
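
As a sketch (names are mine), the N < R test amounts to two counters and a comparison:

    /* Sketch of the overlap filter (hypothetical names). R counts
     * instructions pseudo-retired during the last runahead period; N counts
     * instructions fetched since that period ended. If N < R, a new period
     * would mostly re-execute the same instructions. */
    static unsigned pseudo_retired_R;  /* updated while in runahead mode   */
    static unsigned fetched_since_N;   /* reset and counted after the exit */

    int overlap_filter_allows_runahead(void) {
        return fetched_since_N >= pseudo_retired_R;
    }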

Useless Runahead Periods
Periods that do not result in prefetches for normal mode
They exist due to the lack of memory-level parallelism
Mechanism to eliminate useless periods:
Predict if a period will generate useful L2 misses
Estimate a period to be useful if it generated an L2 miss that cannot be captured by the instruction window
Useless-period predictors are trained based on this estimation
[Timeline figure: Compute, Load 1 Miss, a runahead period that generates no new misses, Load 1 Hit]
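
One plausible reading of this training rule as C, offered only as a sketch: the slide does not specify the predictor organization, so the table shape, indexing, and thresholds below are all assumptions.

    /* Sketch of a useless-period predictor (organization assumed): a table
     * of saturating confidence counters indexed by the PC of the
     * runahead-causing load. */
    #define TABLE_SIZE 256

    static unsigned char useful_conf[TABLE_SIZE];   /* 2-bit counters */

    static unsigned idx(unsigned long pc) { return pc % TABLE_SIZE; }

    /* Training: a period was useful if it generated an L2 miss that the
     * instruction window could not have captured on its own. */
    void train_on_runahead_exit(unsigned long pc, int generated_far_miss) {
        unsigned i = idx(pc);
        if (generated_far_miss) {
            if (useful_conf[i] < 3) useful_conf[i]++;
        } else {
            if (useful_conf[i] > 0) useful_conf[i]--;
        }
    }

    int predict_period_useful(unsigned long pc) {
        return useful_conf[idx(pc)] >= 2;
    }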

Overall Impact on Executed Instructions
[Figure: the efficiency techniques reduce the increase in executed instructions from 26.5% to 6.2%]

Overall Impact on IPC
[Figure: the IPC improvement of runahead is essentially preserved, 22.6% vs. 22.1%]

Limitations of the Baseline Runahead Mechanism
Energy inefficiency
A large number of instructions are speculatively executed
Efficient Runahead Execution [ISCA'05, IEEE Micro Top Picks'06]
Ineffectiveness for pointer-intensive applications
Runahead cannot parallelize dependent L2 cache misses
Address-Value Delta (AVD) Prediction [MICRO'05]
Irresolvable branch mispredictions in runahead mode
Cannot recover from a mispredicted L2-miss dependent branch
Wrong Path Events [MICRO'04]

The Problem: Dependent Cache Misses
Runahead execution cannot parallelize dependent misses
wasted opportunity to improve performance
wasted energy (useless pre-execution)
Runahead performance would improve by 25% if this limitation were ideally overcome
[Timeline figure: Load 2 is dependent on Load 1, so in runahead mode Load 2 is INV (it cannot compute its address); Miss 2 is generated only after Miss 1 returns]

The Goal of AVD Prediction
Enable the parallelization of dependent L2 cache misses in runahead mode with a low-cost mechanism
How: Predict the values of L2-miss address (pointer) loads
Address load: loads an address into its destination register, which is later used to calculate the address of another load (as opposed to a data load)

Parallelizing Dependent Cache Misses
[Timeline figure: without value prediction, Load 2 is INV in runahead mode (Cannot Compute Its Address!), so Miss 2 is generated only in normal mode; with the value of Load 1 predicted, Load 2 Can Compute Its Address, Miss 1 and Miss 2 overlap, and the figure marks Saved Cycles and Saved Speculative Instructions]
As shown before, traditional runahead execution cannot parallelize the dependent misses of Load 1 and Load 2, because Load 2 cannot compute its address in runahead mode. However, if the value of Load 1 is predicted, Load 2 can compute its address and generate its miss in runahead mode.

AVD Prediction [MICRO'05]
The address-value delta (AVD) of a load instruction is defined as:
AVD = Effective Address of Load - Data Value of Load
For some address loads, the AVD is stable
An AVD predictor keeps track of the AVDs of address loads
When a load misses in L2 in runahead mode, the AVD predictor is consulted
If the predictor returns a stable (confident) AVD for that load, the value of the load is predicted:
Predicted Value = Effective Address - Predicted AVD
Note that the predictor does not need to be large.
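
A quick worked example in C (the addresses are made up for illustration):

    #include <stdint.h>

    /* If a load reads effective address 0x1008 and the loaded value is
     * 0x1000, then AVD = 0x1008 - 0x1000 = +8. Once that AVD is stable,
     * a later instance of the load at effective address 0x2008 is
     * predicted to load 0x2008 - 8 = 0x2000. */
    int64_t compute_avd(uint64_t eff_addr, uint64_t value) {
        return (int64_t)(eff_addr - value);
    }

    uint64_t predict_value(uint64_t eff_addr, int64_t predicted_avd) {
        /* Predicted Value = Effective Address - Predicted AVD */
        return eff_addr - (uint64_t)predicted_avd;
    }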

Why Do Stable AVDs Occur?
Regularity in the way data structures are allocated in memory AND traversed
Two types of loads can have stable AVDs:
Traversal address loads
Produce addresses consumed by address loads
Leaf address loads
Produce addresses consumed by data loads

Traversal Address Loads
Regularly-allocated linked list (nodes at A, A+k, A+2k, A+3k, ...); a traversal address load loads the pointer to the next node: node = node->next
AVD = Effective Addr - Data Value

    Effective Addr   Data Value   AVD
    A                A+k          -k
    A+k              A+2k         -k
    A+2k             A+3k         -k

Striding data value, stable AVD
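
A toy C program (my illustration, not from the slides) that produces exactly this pattern when fixed-size nodes are carved from one contiguous arena and linked in allocation order:

    #include <stdio.h>

    /* The traversal load n = n->next then has a constant AVD of
     * -sizeof(struct node). */
    struct node { struct node *next; long payload; };

    static struct node arena[4];

    int main(void) {
        for (int i = 0; i < 3; i++)
            arena[i].next = &arena[i + 1];
        arena[3].next = 0;

        for (struct node *n = arena; n->next; n = n->next) {
            /* effective address of the traversal load minus loaded value */
            long avd = (long)((char *)&n->next - (char *)n->next);
            printf("AVD = %ld\n", avd);   /* -16 on a typical LP64 system */
        }
        return 0;
    }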

Properties of Traversal-based AVDs
Stable AVDs can be captured with a stride value predictor
Stable AVDs disappear with the re-organization of the data structure (e.g., sorting)
Stability of AVDs is dependent on the behavior of the memory allocator
Allocation of contiguous, fixed-size chunks is useful
[Figure: nodes allocated at A, A+k, A+2k, A+3k; after sorting, the traversal order becomes e.g. A+3k, A+k, A, A+2k, so the distance between consecutive nodes is NOT constant]

Leaf Address Loads
Sorted dictionary in parser: nodes point to strings (words); string and node allocated consecutively
The dictionary is looked up for an input word. A leaf address load loads the pointer to the string of each node:

    lookup(node, input) {
        // ...
        ptr_str = node->string;
        m = check_match(ptr_str, input);
        // ...
    }

[Figure: binary tree of nodes at A+k, B+k, C+k, D+k, E+k, F+k, G+k, each pointing to its string at A, B, C, ...]
AVD = Effective Addr - Data Value

    Effective Addr   Data Value   AVD
    A+k              A            k
    C+k              C            k
    F+k              F            k

No stride, yet stable AVD
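
A toy illustration in C (mine, not the parser source) of node-and-string pairs allocated consecutively, giving a constant AVD even though the string addresses themselves follow no stride:

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    /* Each node sits directly after a fixed-size string buffer, so the
     * AVD of the leaf address load (address of n->str minus value of
     * n->str) is the constant STR_MAX for every node. */
    #define STR_MAX 32

    struct dnode { char *str; struct dnode *left, *right; };

    struct dnode *make_node(const char *word) {
        char *chunk = malloc(STR_MAX + sizeof(struct dnode));
        strncpy(chunk, word, STR_MAX - 1);
        chunk[STR_MAX - 1] = '\0';
        struct dnode *n = (struct dnode *)(chunk + STR_MAX);
        n->str = chunk;
        n->left = n->right = 0;
        return n;
    }

    int main(void) {
        const char *words[] = { "alpha", "beta", "gamma" };
        for (int i = 0; i < 3; i++) {
            struct dnode *n = make_node(words[i]);
            printf("%s: AVD = %ld\n", words[i],
                   (long)((char *)&n->str - n->str));  /* always 32 */
        }
        return 0;
    }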

Properties of Leaf-based AVDs
Stable AVDs cannot be captured with a stride value predictor
Stable AVDs do not disappear with the re-organization of the data structure (e.g., sorting)
Stability of AVDs is dependent on the behavior of the memory allocator
[Figure: after sorting, node-string pairs (A, A+k), (B, B+k), (C, C+k) move as units, so the distance between node and string is still constant]

Identifying Address Loads in Hardware
Insight: If the AVD is too large, the value that is loaded is likely not an address
Only keep track of loads that satisfy: -MaxAVD ≤ AVD ≤ +MaxAVD
This identification mechanism eliminates many loads from consideration
Enables the AVD predictor to be small
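
As a one-function sketch in C (the MaxAVD value below is an assumption; the slide does not give one):

    #include <stdint.h>
    #include <stdlib.h>

    /* Loads whose AVD magnitude exceeds MaxAVD are likely data loads
     * and are never allocated an AVD predictor entry. */
    #define MAX_AVD (64 * 1024)

    int is_candidate_address_load(uint64_t eff_addr, uint64_t value) {
        int64_t avd = (int64_t)(eff_addr - value);
        return llabs(avd) <= MAX_AVD;
    }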

An Implementable AVD Predictor
Set-associative prediction table
A prediction table entry consists of:
Tag (program counter of the load)
Last AVD seen for the load
Confidence counter for the recorded AVD
Updated when an address load is retired in normal mode
Accessed when a load misses in the L2 cache in runahead mode
Recovery-free: no need to recover the state of the processor or the predictor on a misprediction
Runahead mode is purely speculative
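
Putting the pieces together, a toy version of the predictor in C might look like this. It is direct-mapped for brevity where the slide describes a set-associative table, and the sizes, hashing, and confidence thresholds are all assumptions.

    #include <stdint.h>
    #include <stdlib.h>

    #define ENTRIES   16
    #define MAX_AVD   (64 * 1024)
    #define CONF_MAX  3
    #define CONF_PRED 2

    struct avd_entry {
        uint64_t tag;        /* PC of the load */
        int64_t  last_avd;   /* last AVD seen  */
        unsigned conf;       /* saturating confidence counter */
        int      valid;
    };

    static struct avd_entry table[ENTRIES];

    /* Called when an address load retires in normal mode. */
    void avd_update(uint64_t pc, uint64_t eff_addr, uint64_t value) {
        int64_t avd = (int64_t)(eff_addr - value);
        if (llabs(avd) > MAX_AVD) return;      /* not an address load */
        struct avd_entry *e = &table[pc % ENTRIES];
        if (!e->valid || e->tag != pc) {       /* allocate entry */
            e->tag = pc; e->last_avd = avd; e->conf = 0; e->valid = 1;
        } else if (e->last_avd == avd) {
            if (e->conf < CONF_MAX) e->conf++; /* AVD stable */
        } else {
            e->last_avd = avd; e->conf = 0;    /* AVD changed */
        }
    }

    /* Called when a load misses in L2 in runahead mode. Returns 1 and
     * writes the predicted value if the recorded AVD is confident. */
    int avd_predict(uint64_t pc, uint64_t eff_addr, uint64_t *pred_value) {
        struct avd_entry *e = &table[pc % ENTRIES];
        if (e->valid && e->tag == pc && e->conf >= CONF_PRED) {
            *pred_value = eff_addr - (uint64_t)e->last_avd;
            return 1;
        }
        return 0;
    }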

AVD Update Logic
[Figure]

AVD Prediction Logic
[Figure]

Baseline Processor
Execution-driven Alpha simulator
8-wide superscalar processor
128-entry instruction window, 20-stage pipeline
64 KB, 4-way, 2-cycle L1 data and instruction caches
1 MB, 32-way, 10-cycle unified L2 cache
500-cycle minimum main memory latency
32 DRAM banks; 32-byte-wide processor-memory bus (4:1 frequency ratio); 128 outstanding misses
Detailed memory model
Pointer-intensive benchmarks from Olden and SPEC INT00

Performance of AVD Prediction
[Figure: on the baseline runahead processor, a 16-entry AVD predictor reduces execution time by 14.3% and executed instructions by 15.5%]
In contrast, a 16-entry stride value predictor reduces execution time by only 2.7% and executed instructions by 3.5%; a 4K-entry stride VP reduces execution time by 4.7% and executed instructions by 6.1%.

AVD vs. Stride VP Performance
[Figure: execution time reduction of AVD prediction vs. stride value prediction at 16-entry and 4096-entry predictor sizes; the data labels read 2.7%, 4.7%, 5.1%, 5.5%, 6.5%, and 8.6%]