Effective ahead pipelining of instruction block address generation
André Seznec and Antony Fraboulet, IRISA/INRIA

Presentation transcript:

AS-AF Caps Team
Instruction fetch on wide issue superscalar processors
- Fetching 6-10 instructions in parallel on each cycle (not the real issue):
  - Fetch can be pipelined
  - I-cache can be banked
  - Instruction streams are "relatively long"
- Next block address generation is critical

Instruction block address generation
- One block per cycle
- Speculative: accuracy is critical
- Accuracy comes with hardware complexity:
  - Conditional branch predictor
  - Sequential block address computation
  - Return address stack read
  - Jump prediction
  - Branch target prediction/computation
  - Final address selection

Using a complex instruction address generator (IAG)
- PentiumPro, Alpha EV6, Alpha EV8:
  - Fast but (relatively) inaccurate IAG responding in a single cycle, backed by a complex multicycle IAG
  - Loss of a significant part of instruction bandwidth
  - Overfetching implemented on Alpha EV8
- Trend to deeper pipelines:
  - Smaller line predictor: less accurate
  - Deeper pipelined IAG: longer misfetch penalty
- 10% misfetches, 3 cycles penalty: 30% bandwidth loss
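The 30% figure on this slide can be checked with a short calculation. The sketch below assumes a very simple model that is not stated in the slides: one fetch block per cycle when nothing goes wrong, and a fixed number of lost cycles for each misfetch. The function names are hypothetical.

```python
def extra_cycles_per_block(misfetch_rate: float, penalty_cycles: float) -> float:
    """Average number of cycles lost per fetched block because of misfetches."""
    return misfetch_rate * penalty_cycles


def delivered_fraction(misfetch_rate: float, penalty_cycles: float) -> float:
    """Fraction of peak fetch bandwidth actually delivered.

    Assumption: each block costs 1 cycle plus the average misfetch overhead.
    """
    return 1.0 / (1.0 + extra_cycles_per_block(misfetch_rate, penalty_cycles))


# Numbers from the slide: 10% misfetches, 3-cycle penalty.
overhead = extra_cycles_per_block(0.10, 3)
fraction = delivered_fraction(0.10, 3)
print(f"extra cycles per block: {overhead:.2f}")
print(f"delivered fraction of peak bandwidth: {fraction:.3f}")
```

Under this model the slide's 30% corresponds to 0.3 extra cycles per fetched block; expressed as delivered throughput, that is roughly 77% of peak.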

Ahead pipelining the IAG
- Suggested with multiple block ahead prediction in Seznec, Jourdan, Sainrat and Michaud (ASPLOS 96):
  - Conventional IAG: use information available with block A to predict block B
  - Multiple block ahead prediction: use information available with A to predict block C (or D)
- This paper:
  - How to really do it

Fetch blocks for us
[Figure: fetch block starting at the 1st instruction]
- Rc: B inst/fetch block
- Rs: ends with the cache block
- Rl: ends with the second block

Fetch blocks (2)
[Figure labels: Cond, 1st inst]
- 0nt: no NT bypassing
- 1nt: 1 NT bypassing
- aNT: all NT bypassing

Fetch blocks (3)
[Figure labels: Cond, Uncond, 0nt, 0NT]
- 0NT+: no extra cond
- 0NT: no extra CTI

Technological hypothesis
- Alpha EV8 front-end + twice faster clock:
  - Cycle = 1 EV8 cycle phase
  - Cycle = time to cross an 8- to 10-entry multiplexer
  - Cycle = time to read a 2Kb table and route back the read data as an index
  - 2 cycles for reading a 16Kb table and routing back the read data as an index

Hierarchical IAG
- Complex IAG + line predictor
- Conventional complex IAG spans over four cycles:
  - 3 cycles for conditional branch prediction
  - 3 cycles for I-cache read and branch target computation
  - Jump prediction, return stack read
  - + 1 cycle for final address selection
- Line prediction:
  - A single 2Kb table + 1-bit direction table
  - Select between fallthrough and line predictor read

Hierarchical IAG (2)
[Figure: block diagram labeled LP, RAS, Pred Check, Cond., Jump Pred, Final Selection; inputs: branch target addresses + decode info]

Ahead pipelining the IAG
- Same functionalities required:
  - Final selection: uses the last cycle in the address generation
  - Conditional branch predictor
  - Jump predictor
  - Return address stack
  - Branch target prediction: recomputation impossible! Use of a branch target buffer
  - Decode information: MUST BE PREDICTED

Ahead pipelining the IAG (2)
- Initiate table reads N cycles ahead with information available:
  - N-block ahead address
  - N-block ahead branch history
- Problem: significant loss of accuracy
- Use of N-block ahead (address, history) + intermediate path!

Inflight path incorporation
- Pipelined table access + final logic:
  - Column selection
  - Wordline selection
  - Final logic
- Insert one bit of inflight information per intermediate block
- Not the same bit for each table in the predictor!
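One way to picture inflight path incorporation is a table whose access starts with old information and finishes with one late bit per intermediate block. The sketch below is only an illustration under assumptions that are not in the slides: the class name, the XOR hash of address and history, and the mapping of the early index to a row and of the inflight bits to the final entry choice are all hypothetical. It models a single table; a multi-table predictor would feed a different inflight bit to each table.

```python
class AheadPipelinedTable:
    """Toy model of a predictor table read several blocks ahead (hypothetical)."""

    def __init__(self, row_bits: int, inflight_bits: int):
        self.row_bits = row_bits
        self.inflight_bits = inflight_bits
        # Each row holds one entry per possible inflight path.
        self.rows = [[0] * (1 << inflight_bits) for _ in range(1 << row_bits)]

    def start_read(self, ahead_addr: int, ahead_hist: int) -> int:
        """Start the access using only N-block ahead address and history."""
        return (ahead_addr ^ ahead_hist) & ((1 << self.row_bits) - 1)

    def _entry(self, inflight_path) -> int:
        """Pack one bit per intermediate block into the final entry index."""
        assert len(inflight_path) == self.inflight_bits
        entry = 0
        for bit in inflight_path:
            entry = (entry << 1) | (bit & 1)
        return entry

    def finish_read(self, row: int, inflight_path) -> int:
        """Finish the access once the intermediate blocks are known."""
        return self.rows[row][self._entry(inflight_path)]

    def update(self, row: int, inflight_path, value: int) -> None:
        self.rows[row][self._entry(inflight_path)] = value


table = AheadPipelinedTable(row_bits=4, inflight_bits=2)
row = table.start_read(0x35, 0x0A)      # issued several blocks ahead
table.update(row, [1, 0], 3)            # train for one intermediate path
print(table.finish_read(row, [1, 0]))   # same path: trained entry
print(table.finish_read(row, [0, 1]))   # other path: untouched entry
```

The point of the split is that the slow part of the read needs nothing newer than the N-block ahead information, while the late bits only steer a small final selection.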

Ahead conditional branch predictor
- Global history branch predictor
- 512 Kbits
- 2bc-gskew
- 4 cycles prediction delay:
  - Indexed using 5 blocks ahead (address + history)
  - One bit per table per intermediate block
- Accuracy equivalent to conventional conditional branch predictors

Ahead Branch Target Prediction
- Use of a BTB:
  - Tradeoffs:
    - Size: 2Kb/16Kb, 1 vs 2 cycles
    - Tagless or tags (+1 cycle)
    - Associativity: +1 cycle?
- Difficulty:
  - The longer the pipelined read, the larger the number of possible paths, especially if not-taken branches are bypassed

Ahead Branch Target Prediction (2)
- 2Kb is too small:
  - 16Kb, 2 cycles access time
- Associativity is required, but an extra cycle for tag check is disastrous:
  - Tagless + way-prediction!
  - 2-way skewed-associativity: to incorporate different inflight information on the two ways

Ahead Branch Target Prediction (3)
- Bypassing not-taken branches:
  - N possible branches: N possible targets
  - N targets per entry: waste of space (many entries use a single target)
  - Read of N contiguous entries

Ahead jump predictor, return stack
- Jump predictor: history + address
  - 3 cycles ahead + inflight information
- Return address stack as usual:
  - Direct access to the previous cycle top
  - Access to the possible address to be pushed

Decode information !!
- Needs the decode information of the current block to select among:
  - Fallthrough
  - Branch targets
  - Jump target
  - Return address stack top
- Cannot be obtained from the I-cache, must be predicted!
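The role of decode information in the final selection stage can be sketched as a small multiplexer. This is a simplified illustration, not the selection logic of the paper: the `BlockExit` categories and the function signature are hypothetical, and a single branch target is used where the slide allows several.

```python
from enum import Enum


class BlockExit(Enum):
    """Hypothetical summary of how the current fetch block ends."""
    FALLTHROUGH = 0
    COND_BRANCH = 1
    JUMP = 2
    RETURN = 3


def select_next_address(decode: BlockExit, cond_taken: bool, fallthrough: int,
                        branch_target: int, jump_target: int, ras_top: int) -> int:
    """Pick the next block address among the candidates read in parallel."""
    if decode is BlockExit.RETURN:
        return ras_top
    if decode is BlockExit.JUMP:
        return jump_target
    if decode is BlockExit.COND_BRANCH and cond_taken:
        return branch_target
    # Not-taken conditional branch or no control-flow instruction at all.
    return fallthrough


candidates = dict(fallthrough=0x1040, branch_target=0x2000,
                  jump_target=0x3000, ras_top=0x4000)
print(hex(select_next_address(BlockExit.COND_BRANCH, True, **candidates)))
print(hex(select_next_address(BlockExit.COND_BRANCH, False, **candidates)))
print(hex(select_next_address(BlockExit.RETURN, False, **candidates)))
```

Every candidate address can be produced ahead of time, but the `decode` argument describes the block currently being fetched, which is why it has to be predicted instead of read from the I-cache.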

Ahead pipelined IAG and decode
[Figure: block diagram labeled Final Selection, Cond., Predicted Decode, RAS, Jump, BTB, Decode Info ??]

Decode information (2)
- 1st try (not in the paper): in the BTB
  - Entries without targets (more capacity misses)
  - Duplicated decode information if multiple targets
  - Decode mispredictions, but correct target prediction by jump predictor, RAS, fallthrough
- Not very convincing results

Predicting decode information
- Principle: decode information associated with a block is associated with the block address itself whenever possible
- Whenever possible:
  - BTB, jump predictor
  - But not for returns and fallthrough blocks

Predicting decode information (2)
- Return block decode prediction:
  - Return decode prediction table indexed with (one cycle ahead) top of the return stack
- Systematic decode misprediction if chaining a call and the associated return in two successive blocks. Sorry!

Fallthrough block decode prediction
- Just after a taken control-flow instruction, no time to read an extra table before the final selection stage:
  - Store decode information for block A+1 with address A in BTB, jump predictor
- Fall through after fall through:
  - 2-block ahead decode prediction table: A to read A+2 decode info
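The 2-block ahead decode prediction table can be sketched as a table that is read with address A and returns the decode information expected for block A+2. The sketch assumes that addresses are counted in fetch-block units, that the table is direct mapped and tagless, and that decode information is an opaque value; the class name and sizes are hypothetical.

```python
class TwoBlockAheadDecodeTable:
    """Toy 2-block ahead decode predictor for fallthrough sequences (hypothetical)."""

    def __init__(self, entries: int = 1024):
        self.entries = entries
        self.table = [None] * entries

    def _index(self, block_addr: int) -> int:
        # Assumption: direct-mapped, tagless, indexed by the block address.
        return block_addr % self.entries

    def predict(self, block_a: int):
        """Read with address A; returns the decode info predicted for A+2."""
        return self.table[self._index(block_a)]

    def train(self, block_addr: int, decode_info) -> None:
        """Once a block is decoded, store its info under the address two blocks back."""
        self.table[self._index(block_addr - 2)] = decode_info


predictor = TwoBlockAheadDecodeTable()
predictor.train(12, "ends with cond branch")   # block 12 has just been decoded
print(predictor.predict(10))                   # reading with A=10 predicts block 12
print(predictor.predict(11))                   # nothing learned yet for block 13
```

Starting the read two blocks early is what gives the table a full access time before the final selection stage needs the result.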

Hierarchical IAG vs ahead pipelined IAG
[Figure: timing diagrams labeled CB, BTB, JP, Ret dec, 2A dec, selection completed, IAG init., Decode pred check, LP selection, LP check]

Recovering after a misprediction
- Generation of address for the next block after recovery should have begun 4 cycles before recovery:
  - 4 cycles extra misprediction penalty!? Unacceptable
- Checkpoint/repair mechanism should provide information for a smooth restart

Recovering after a misprediction (2): predicting the next block
- There is no time to read any of the tables in the IAG, only time to cross the final selection stage.
- The checkpoint must provide everything that was at the entry of the final stage one cycle ahead.
- Less than 200 bits for all policies, except bypassing all not-taken branches

Recovering from misprediction (3): 3rd block and subsequent
3rd block
- BTB and jump predictor cannot provide targets in time:
  - All possible targets must come from checkpoint
- Conditional branch predictions are not available
4th and 5th block
- Conditional branch predictions are not available
(Approximately) 4 possible sets of entries for the final selection stage in the checkpoint repair, but 2 cycles access time.

Recovering from misprediction (4): one or two bubbles restart
- If full speed I-fetch resuming is too costly in the checkpoint mechanism, then:
  - Two bubbles restart: all possible targets recomputed/predicted; only conditional predictions for next and third block to be checkpointed
  - One bubble restart: only one set of exits from BTB, jump predictor + conditional branch predictions to be checkpointed

Performance evaluation
- SPEC 2000 int + float
- Traces of 100 million instructions after skipping the initialization
- Immediate update

Average fetch block size (integer applications)
- Instruction streams: 9.79 instructions
- 2 * 8 inst/cache line:
  - No bypassing of not-taken branches: 5.81 instructions
  - Bypassing one not-taken branch: 7.28 instructions
  - Bypassing all not-taken branches: 7.40 instructions
- Bypassing a single not-taken branch is sufficient
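The conclusion of this slide follows from comparing the three averages. The short calculation below uses only the numbers listed on the slide; the dictionary keys and the helper name are illustrative labels.

```python
# Average fetch block sizes for integer applications, taken from the slide.
avg_block_size = {
    "no bypassing": 5.81,
    "bypass one branch": 7.28,
    "bypass all branches": 7.40,
}


def relative_gain(new: float, old: float) -> float:
    """Relative increase of `new` over `old`."""
    return new / old - 1.0


gain_one = relative_gain(avg_block_size["bypass one branch"],
                         avg_block_size["no bypassing"])
gain_all = relative_gain(avg_block_size["bypass all branches"],
                         avg_block_size["bypass one branch"])
print(f"bypassing one branch vs none: +{gain_one:.1%}")
print(f"bypassing all branches vs one: +{gain_all:.1%}")
```

Bypassing one branch enlarges the average fetch block by about a quarter, while bypassing all of them adds less than 2% on top of that.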

Accuracy of IAGs (integer benchmarks)
[Figure: mispredictions per kilo-instruction (misp/KI)]
Very similar accuracies

Misfetches: IAG faults corrected at decode time
[Figure: misfetches/KI, integer applications]

Misfetches (2)
[Figure: misfetches/KI, floating-point applications]

(Maximum) Instruction fetch bandwidth
[Figure: integer applications]

Summary
- Ahead pipelining the IAG is a valid alternative to the use of a hierarchy of IAGs:
  - Accuracy in the same range
  - Significantly higher instruction bandwidth
- Main contributions:
  - Decode prediction
  - Checkpoint/repair analysis

Future works
- Sufficient to feed 4-to-6 way processors
- May be a little bit short for 8-way or more processors
- We plan to extend the study to:
  - Decoupled instruction fetch front end (Reinman et al.)
  - Multiple (non-contiguous) block address generation
  - Trace caches