Dean Tullsen ACACES 2008
 Parallelism – Use multiple contexts to achieve better performance than possible on a single context.
 Traditional parallelism – We use extra threads/processors to offload computation. Threads divide up the execution stream.
 Non-traditional parallelism – Extra threads are used to speed up computation without necessarily off-loading any of the original computation.
 Primary advantage – nearly any code, no matter how inherently serial, can benefit from parallelization.
 Another advantage – threads can be added or subtracted without significant disruption.

[Figure: four SMT thread contexts, Thread 1–Thread 4]

[Figure: Thread 1–Thread 4, with helper threads running alongside the main thread]
 Speculative precomputation, dynamic speculative precomputation, many others.
 Most commonly – prefetching, possibly branch pre-calculation.

Background – Helper Threads
 Chappell, Stark, Kim, Reinhardt, Patt, “Simultaneous Subordinate Microthreading,” 1999
 Use microcoded threads to manipulate the microarchitecture to improve the performance of the main thread.
 Zilles 2001, Collins 2001, Luk 2001
 Use a regular SMT thread, with code distilled from the main thread, to support the main thread.

Dean Tullsen ACACES 2008  Speculative Precomputation [Collins, et al 2001 – Intel/UCSD]  Dynamic Speculative Precomputation  Event-Driven Simultaneous Optimization  Value Specialization  Inline Prefetching  Thread Prefetching

Dean Tullsen, Processor Architecture and Compilation Lab

 In SP, a p-slice is a thread derived from a trace of execution between a trigger instruction and the delinquent load.
 All instructions upon which the load’s address is not dependent are removed (often 90–95%).
 Live-in register values (typically 2–6) must be explicitly copied from the main thread to the helper thread.

[Figure: timeline – a helper thread spawned at the trigger instruction issues the prefetch early, hiding the memory latency of the delinquent load]

 Because SP uses actual program code, it can precompute addresses that fit no predictable pattern.
 Because SP runs in a separate thread, it can interfere with the main thread much less than software prefetching. When it isn’t working, it can be killed.
 Because it is decoupled from the main thread, the prefetcher is not constrained by the control flow of the main thread.
 All the applications in this study already had very aggressive software prefetching applied, when possible.

 On-chip memory for transfer of live-in values.
 Chaining triggers – for delinquent loads in loops, a speculative thread can trigger the next p-slice (think of this as a looping prefetcher which targets a load within a loop).
 Minimizes live-in copy overhead.
 Enables SP threads to get arbitrarily far ahead.
 Necessitates a mechanism to stop the chaining prefetcher.

 Chaining triggers are executed without impacting the main thread
 Target delinquent loads arbitrarily far ahead of the non-speculative thread
 Speculative threads make progress independent of the main thread
 Use basic triggers to initiate precomputation, but use chaining triggers to sustain it
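To make the chaining idea concrete, here is a minimal Python sketch (the function name, parameters, and bounded-runahead stop condition are invented for illustration; real chaining slices are hardware threads): once spawned, the slice keeps re-triggering itself to prefetch successive loop iterations without further involvement from the main thread.

```python
# Hypothetical sketch of a chaining p-slice (names invented): rather than the
# main thread spawning one helper per load, the slice re-triggers itself each
# iteration, so it can run arbitrarily far ahead of the main thread.

def chaining_pslice(base, stride, start, runahead_limit):
    issued = []
    i = start
    while i - start < runahead_limit:      # stop mechanism: bounded runahead
        issued.append(base + i * stride)   # stands in for a cache prefetch
        i += 1                             # chaining trigger: next slice instance
    return issued

print([hex(a) for a in chaining_pslice(0x8000, 8, 0, 5)])
# → ['0x8000', '0x8008', '0x8010', '0x8018', '0x8020']
```

The bounded runahead stands in for the stop mechanism the slide calls for: without one, a chained slice would prefetch forever.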


 Speculative precomputation uses otherwise idle hardware thread contexts
 Pre-computes future memory accesses
 Targets the worst behaving static loads in a program
 Chaining triggers enable speculative threads to spawn additional speculative threads
 Results in tremendous performance gains, even with conservative hardware assumptions

Dean Tullsen ACACES 2008  Speculative Precomputation  Dynamic Speculative Precomputation [Collins, et al – UCSD/Intel]  Event-Driven Simultaneous Optimization  Value Specialization  Inline Prefetching  Thread Prefetching

Dynamic Speculative Precomputation (DSP) – Motivation
 SP, as well as similar techniques proposed around the same time, requires
 profile support
 heavy user or compiler interaction
 It is thus susceptible to profile mismatch, requires recompilation for each machine architecture, and depends on user cooperation…
 (or, a bit more accurately, we just wanted to see if we could do it all in hardware)

 DSP relies on the hardware to
 identify delinquent loads
 create speculative threads
 optimize the threads when they aren’t working quite well enough
 eliminate the threads when they aren’t working at all
 destroy threads when they are no longer useful…

 Like hardware prefetching, works without software support or recompilation, regardless of the machine architecture.
 Like SP, works with minimal interference on the main thread.
 Like SP, works on highly irregular memory access patterns.

 Identify delinquent loads – Delinquent Load Identification Table
 Construct p-slices and apply optimizations – Retired Instruction Buffer
 Spawn and manage p-slices – Slice Information Table
 Implemented as back-end instruction analyzers

[Figure: baseline pipeline – PC, I-cache, register renaming, centralized instruction queue, re-order buffer, monolithic register file, execution units, data cache]

[Figure: the same pipeline with three back-end structures added – the Delinquent Load Identification Table (DLIT), the Retired Instruction Buffer (RIB), and the Slice Information Table (SIT)]

 Once a delinquent load is identified, the RIB buffers instructions until the delinquent load appears as the newest instruction in the buffer.
 Dependence analysis easily identifies the load’s antecedents, a trigger instruction, and the live-ins needed by the slice.
 Similar to register live-range analysis
 But much easier

 Constructs p-slices to prefetch delinquent loads
 Buffers information on an in-order run of committed instructions
 Comparable to a trace cache fill unit
 FIFO structure
 RIB normally idle

 Analyze instructions between two instances of the delinquent load
 Most recent to oldest
 Maintain a partial p-slice and a register live-in set
 Add to the p-slice instructions which produce a register in the live-in set
 Update the register live-in set
 When analysis terminates, the p-slice has been constructed and the live-in registers identified

struct DATATYPE { int val[10]; };
DATATYPE *data[100];

for (j = 0; j < 10; j++) {
  for (i = 0; i < 100; i++) {
    data[i]->val[j]++;
  }
}

loop:
I1  load  r1 = [r2]
I2  add   r3 = r3 + 1
I3  add   r6 = r3 - 100
I4  add   r2 = r2 + 8
I5  add   r1 = r4 + r1
I6  load  r5 = [r1]
I7  add   r5 = r5 + 1
I8  store [r1] = r5
I9  blt   r6, loop

RIB analysis of the example, walking from the most recent instruction back to the oldest. A ✓ marks instructions included in the p-slice; the live-in set is shown after each instruction is analyzed:

Instruction (newest → oldest)   Included   Live-in set
load  r5 = [r1]                    ✓        r1
add   r1 = r4 + r1                 ✓        r1, r4
add   r2 = r2 + 8                           r1, r4
add   r6 = r3 - 100                         r1, r4
add   r3 = r3 + 1                           r1, r4
load  r1 = [r2]                    ✓        r2, r4
blt   r6, loop                              r2, r4
store [r1] = r5                             r2, r4
add   r5 = r5 + 1                           r2, r4
load  r5 = [r1]                             r2, r4

Resulting p-slice: load r1 = [r2]; add r1 = r4 + r1; load r5 = [r1]
Live-in set: r2, r4. The delinquent load itself serves as the trigger.
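The walkthrough above can be condensed into a short Python sketch of the RIB’s backward dependence analysis (the function name and the (dests, sources) instruction encoding are invented for illustration; the real RIB does this in hardware over committed instructions):

```python
# Hypothetical sketch of the RIB's backward dependence analysis: walk from the
# most recent instruction to the oldest, keep only the producers of values the
# delinquent load needs, and collect the live-in registers.

def build_pslice(trace, target_index):
    """trace: list of (dests, sources) register tuples, oldest -> newest.
    target_index: position of the delinquent load (the newest instruction)."""
    target = trace[target_index]
    live_in = set(target[1])          # sources of the delinquent load
    pslice = [target_index]           # the load itself is always included
    for i in range(target_index - 1, -1, -1):
        dests, sources = trace[i]
        if live_in & set(dests):      # produces a value the slice needs
            pslice.append(i)
            live_in -= set(dests)     # satisfied by this instruction...
            live_in |= set(sources)   # ...which needs its own sources
    pslice.reverse()                  # back to oldest -> newest program order
    return pslice, live_in

# The window from the slide, between two instances of the delinquent load:
# I7..I9 of the previous iteration, then I1..I6 of the current one.
trace = [
    (["r5"], ["r5"]),        # I7 add   r5 = r5 + 1
    ([],     ["r1", "r5"]),  # I8 store [r1] = r5
    ([],     ["r6"]),        # I9 blt   r6, loop
    (["r1"], ["r2"]),        # I1 load  r1 = [r2]
    (["r3"], ["r3"]),        # I2 add   r3 = r3 + 1
    (["r6"], ["r3"]),        # I3 add   r6 = r3 - 100
    (["r2"], ["r2"]),        # I4 add   r2 = r2 + 8
    (["r1"], ["r4", "r1"]),  # I5 add   r1 = r4 + r1
    (["r5"], ["r1"]),        # I6 load  r5 = [r1]  (delinquent load)
]
pslice, live_ins = build_pslice(trace, len(trace) - 1)
print(pslice)            # → [3, 7, 8]  (I1, I5, I6)
print(sorted(live_ins))  # → ['r2', 'r4']
```

The result matches the slide: the slice keeps only I1, I5, and I6, with r2 and r4 as live-ins.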

Dean Tullsen ACACES 2008  If two occurrences of the load are in the buffer (the common case), we’ve identified a loop that can be exploited for better slices.  Can perform additional analysis passes and optimizations  Retain live-in set from previous pass  Increases construction latency but keeps RIB simple  Optimizations  Advanced trigger placement (if dependences allow, move trigger earlier in loop)  Induction unrolling (prefetch multiple iterations ahead)  Chaining (looping) slices – prefetch many loads with a single thread. Dean TullsenProcessor Architecture and Compilation Lab

Dean Tullsen ACACES 2008  Similar to SP chaining slice, but this time just a loop.  Out-of-order processor for this study.  Two new challenges  Must manage runahead distance  Kill threads when non-speculative thread leaves program section Dean TullsenProcessor Architecture and Compilation Lab


 Dynamic Speculative Precomputation aggressively targets delinquent loads
 Thread-based prefetching scheme
 Uses back-end (off the critical path) instruction analyzers
 P-slices constructed with no external software support
 Multi-pass RIB analysis enables aggressive p-slice optimizations

Dean Tullsen ACACES 2008  Speculative Precomputation  Dynamic Speculative Precomputation  Event-Driven Simultaneous Optimization  Value Specialization  Inline Prefetching  Thread Prefetching

With Weifeng Zhang and Brad Calder

Dean Tullsen ACACES 2008  Use “helper threads” to recompile/optimize the main thread.  Optimization is triggered by interesting events that are identified in hardware (event- driven).

[Figure: Thread 1–Thread 4 contexts]
 Execution and compilation take place in parallel!

Dean Tullsen ACACES 2008  A new model of optimization  Computation and optimization occur in parallel  Optimizations are triggered by the program’s runtime behavior  Advantages  Low overhead profiling of runtime behavior  Low overhead optimization by exploiting additional hardware context  Quick response to the program’s changing behavior  Aggressive optimizations

[Figure: on an event, the helper thread turns the main thread’s original code into base optimized code; later events trigger re-optimized code]
 Maintaining only one copy of the optimized code
 Recurrent optimization on already optimized code when the behavior changes
 Gradually enabling aggressive optimizations

Trident
 Hardware event-driven
 Hardware monitors the program’s behavior with no software overhead
 Optimization threads are triggered to respond to particular events
 Optimization events are handled as soon as possible, to quickly adapt to the program’s changing behavior
 Hardware multithreaded
 Concurrent, low-overhead helper threads
 Gradual re-optimization upon new events
[Figure: the main thread generates events that trigger optimization threads]

 Register a given thread to be monitored, and create helper thread contexts
 Monitor the main thread to generate events (into the event queue)
 A helper thread is triggered to perform the optimization
 Update the code cache and patch the main thread

Dean Tullsen ACACES 2008  Events  Occurrence of a particular type of runtime behavior  Generic events  Hot branch events  Trace invalidation  Optimization specific events  Hot value events  Delinquent Load events  Other events  ?

Dean Tullsen ACACES 2008  The Trident Framework is built around a fairly traditional dynamic optimization system  => hot trace formation, code cache  Trident captures hot traces in hardware (details omitted)  However, even with its basic optimizations, Trident has key advantages over previous systems  Hardware hot branch events identify hot traces  Zero-overhead monitoring  Low-overhead optimization in another thread  No context switches between these functions

Dean Tullsen ACACES 2008  Definitions  Hot trace ▪ A number of basic blocks frequently running together  Trace formation ▪ Streamlining these blocks for better execution locality  Code cache ▪ Memory buffer to store hot traces A G E C F K J H D B call return start I

Dean Tullsen ACACES 2008  Streamlining the instruction sequence  Redundant branch/load removal  Constant propagation  Instruction re-association  Code elimination  Architecture-aware optimizations  reduction of RAS (return address stack) mis-predictions (orders of magnitude)  I-cache conscious placement of traces within code cache.  Trace Invalidation

Dean Tullsen ACACES 2008  Value specialization  Make a special version of the code corresponding to likely live-in values  Advantages over hardware value prediction  Value predictions are made in the background and less frequently  No limits on how many predictions can be made  Allow more sophisticated prediction techniques  Propagate predicted values along the trace  Trigger other optimizations such as strength reduction

Dean Tullsen ACACES 2008  Value specialization  Make a special version of the code corresponding to likely live-in values  Advantages over software value specialization  Can adapt to semi-invariant runtime values (eg, values that change, but slowly)  Adapts to actual dynamic runtime values.  Detects optimizations that are no longer working.

Dean Tullsen ACACES 2008  Value specialization  Semi-invariant “constants”  Strided values (details omitted)  Dynamic verification Perform the original load into a scratch register Perform the original load into a scratch register Move predicted value into the load destination Move predicted value into the load destination Check the predicted value, branch to recovery if not equal Check the predicted value, branch to recovery if not equal Perform constant propagation and strength reduction Perform constant propagation and strength reduction Copy the scratch into load destination Copy the scratch into load destination Jump to next instruction after load in the original binary Jump to next instruction after load in the original binary compensation block LDQ0(R2)  R1 ADDR6, R4  R3 MULR1, R3  R2 …… LDQ0(R2)  R3 MOV0  R1 BNER1, R3, recovery ADDR6, R4  R3 MOV0  R2..… No dependency!

Overhead experiment – evaluate the helper threads’ impact on the main thread: exercise the full optimization flow, but do not use the optimized traces.
 ~0.6% degradation of the main thread’s IPC
 Helper threads account for ≤ 2% of total execution time (running concurrently with the main thread)


Dean Tullsen ACACES 2008  Speculative Precomputation  Dynamic Speculative Precomputation  Event-Driven Simultaneous Optimization  Value Specialization  Inline Prefetching  Thread Prefetching

Dean Tullsen ACACES 2008  Limitations of existing prefetching techniques  Compiler-based static prefetching ▪ Address / aliasing resolution ▪ Timeliness ▪ Hard to identify delinquent loads ▪ Variation due to data input or architecture  Hardware prefetching ▪ Cannot follow complicated load behaviors

Dean Tullsen ACACES 2008  Goal  Provide an efficient way to perform flexible software prefetching  Find prefetching opportunities in legacy code  Effective prefetching  Prefetching should be accurate ▪ Target the loads which actually miss in the cache ▪ Prefetch far ahead enough to cover miss latency ▪ Must have low overhead to compute prefetching addresses

Dean Tullsen ACACES 2008  Intrinsically difficult to get the prefetch distance right  Trident enables adaptive discovery of the optimal prefetch distances  Conventional systems often make decisions once because of high overhead Load 1 Load 3 Load 2 execution time original execution trace

Dean Tullsen ACACES 2008  Hot branch event  Delinquent load event

Dean Tullsen ACACES 2008  Determine how far ahead to prefetch a delinquent load  Prefetch Distance =  Prefetch (offset+stride*distance)(base)  Most prior prefetching systems keep the prefetch distance fixed after initial estimate  Trident reuses the first two steps, except that  Low overhead of monitoring + optimization allows us to adapt this distance as well as the stride average load miss latency average cycles per iteration

Dean Tullsen ACACES 2008  1. Object prefetching – identifies loads within the same object, and clusters them to minimize prefetch overhead.  2. Adaptive determination of prefetch distance

Dean Tullsen ACACES 2008  Heavy interaction between neighboring loads (especially other loads we are also prefetching) make static or even dynamic determination of the correct prefetch distance difficult.  Because of the low cost of optimization, Trident uses trial-and-error to discover the right distances.

Dean Tullsen ACACES 2008  All stride based prefetch instructions are inserted with initial distance of 1  These loads are continuously monitored in the DLT  The optimizer increases/decreases the distance until  The load is no longer delinquent  The load is matured  Stabilization is achieved quickly prefetch not hiding enough latency load is delinquent delinquent load event

Dean Tullsen ACACES 2008  In many cases, pointer chasing loads actually have strided patterns.  These patterns can be identified by Trident’s hardware monitors.  This gives Trident 2 advantages over software prefetchers  Low-overhead address computation  The ability to prefetch multiple iterations ahead.

 Baseline: hardware stride-based prefetching with stream buffers
 Self-repairing prefetching achieves a 23% speedup
 12% better than software prefetching without repairing

Dean Tullsen ACACES 2008  Speculative Precomputation  Dynamic Speculative Precomputation  Event-Driven Simultaneous Optimization  Value Specialization  Inline Prefetching  Thread Prefetching (speculative precomputation)

Dean Tullsen ACACES 2008  Speculative Precomputation ▪ Collins [MICRO’01]  Prior research on pre-computation  Static construction ▪ Not adaptive ▪ Not timely  Hardware construction ▪ Complicated ▪ Expensive  P-thread code is stolen from the main thread  Works best when able to create a looping thread. Main thread Helper thread Delinquent load trigger prefetching

Dean Tullsen ACACES 2008  Can potentially be more effective than inline prefetching.  However, more complex, with more things to get right/wrong  Trigger points  Termination points  Synchronization between helper and main thread  These vary not just with load latencies, but also control flow, etc.  Again, Trident’s ability to continuously adapt is key.

Dean Tullsen ACACES 2008  It is critical in any thread-based prefetching scheme that the prefetch thread stay ahead of the main thread.  Trident optimizations  jump-start p-thread multiple iterations ahead  Use dynamically detected strides to replace complex recurrences  Same-object prefetching  P-thread placement optimizations for I-cache performance  Low-overhead sw synchronization  Quick repair of off-track prefetching

Trident’s acceleration techniques achieve 7% better performance than existing pre-computation techniques.

Adaptive inlined prefetching + pre-computation achieves 10% better performance than previous aggressive inlined prefetching.

Dean Tullsen ACACES 2008  Event-driven multithreaded optimization  Hardware event-driven optimization means low overhead profiling. ▪ monitoring of code need never stop  Allowing compilation to take place in parallel with execution provides low overhead optimization ▪ Allows more aggressive optimizations ▪ Allows gradual improvement via recurrent optimization ▪ Allows self-adaptive (eg, search-based) optimization  What else can we do with this technology?

Non-traditional Parallelism
 Works on serial code (and parallel)
 Provides parallel speedup by allowing the main thread to run faster
 Is not limited by traditional theoretical limits to parallel speedup
 Adapts easily to changes in available parallelism
 Other types of non-traditional parallelism??