A Scalable Front-End Architecture for Fast Instruction Delivery Paper by: Glenn Reinman, Todd Austin and Brad Calder Presenter: Alexander Choong
Conventional Pipeline Architecture High-performance processors can be broken down into two parts: Front-end: fetches and decodes instructions; Execution core: executes instructions
Front-End and Pipeline [Diagram: simple front-end, fetch stage feeding decode]
Front-End with Prediction [Diagram: simple front-end with a predict stage coupled to each fetch, feeding decode]
Front-End Issues I Flynn's bottleneck: IPC is bounded by the number of instructions fetched per cycle. Implication: as execution-core performance increases, the front-end must keep up to sustain overall performance
Front-End Issues II Two opposing forces: designing a faster front-end argues for increasing I-cache size; the interconnect scaling problem (wire performance does not scale with feature size) argues for decreasing I-cache size
Key Contributions I
Key Contributions: Fetch Target Queue Objective: avoid using a large cache with branch prediction. Purpose: decouple the I-cache from branch prediction. Result: improves throughput
Key Contributions: Fetch Target Buffer Objective: avoid large caches with branch prediction. Implementation: a multi-level buffer. Results: delivers performance 25% better than a single-level buffer; scales better with "future" feature sizes
Outline Scalable Front-End and Components Fetch Target Queue Fetch Target Buffer Experimental Methodology Results Analysis and Conclusion
Fetch Target Queue Decouples the I-cache from branch prediction: the branch predictor can generate predictions independent of when the I-cache uses them [Diagram: simple front-end, with fetch and predict coupled each cycle]
Fetch Target Queue Decouples the I-cache from branch prediction: the branch predictor can generate predictions independent of when the I-cache uses them [Diagram: front-end with FTQ, with predict running ahead of fetch]
Fetch Target Queue Fetch and predict can have different latencies, as long as they have the same throughput; this allows the I-cache to be pipelined
Fetch Blocks FTQ stores fetch blocks: a sequence of instructions starting at a branch target and ending at a strongly biased branch; instructions are fed directly into the pipeline
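The decoupling described above can be sketched as a producer/consumer queue (a toy Python sketch, not the paper's hardware; the FetchBlock fields, the toy prediction, and the queue depth are illustrative):

```python
from collections import deque
from dataclasses import dataclass

@dataclass
class FetchBlock:
    start_pc: int      # branch target that begins the block
    fall_through: int  # address just past the block's final branch

# The FTQ buffers predictions so the predictor can run ahead of the I-cache.
ftq = deque(maxlen=8)  # illustrative depth

def predict_stage(pc):
    """Produce a fetch-block prediction and enqueue it if there is room."""
    block = FetchBlock(start_pc=pc, fall_through=pc + 16)  # toy prediction
    if len(ftq) < ftq.maxlen:
        ftq.append(block)
        return True    # predictor made progress this cycle
    return False       # FTQ full: predictor stalls, fetch is unaffected

def fetch_stage():
    """Consume the oldest prediction; fetch idles only when the FTQ is empty."""
    return ftq.popleft() if ftq else None
```

The key property is that a stall on either side leaves the other side free to keep working until the queue fills or drains.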
Outline Scalable Front-End and Components Fetch Target Queue Fetch Target Buffer Experimental Methodology Results Analysis and Conclusion
Fetch Target Buffer: Outline Review: Branch Target Buffer Fetch Target Buffer Fetch Blocks Functionality
Review: Branch Target Buffer I Previous work (Perleberg and Smith [2]): makes fetch independent of predict [Diagram: simple front-end vs. front-end with branch target buffer]
Review: Branch Target Buffer II Characteristics: hash table; makes predictions; caches prediction information
Review: Branch Target Buffer III (indexed by PC)
Index/Tag | Prediction | Predicted branch target | Fall-through address | Instructions at branch
0x1718    | Taken      | 0x1834                  | 0x1788               | add, sub
0x1734    | Taken      | 0x2088                  | 0x1764               | neq, br
0x1154    | Not taken  | 0x1364                  | 0x1200               | ld, store
…         | …          | …                       | …                    | …
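The table rows behave like entries in a hash map keyed by the branch PC. A minimal Python sketch, using the example entries above (the sequential fallback of pc + 4 assumes 4-byte instructions):

```python
# Toy BTB: a hash table keyed by the branch PC.
# Entries mirror the example table above.
btb = {
    0x1718: {"taken": True,  "target": 0x1834, "fall_through": 0x1788},
    0x1734: {"taken": True,  "target": 0x2088, "fall_through": 0x1764},
    0x1154: {"taken": False, "target": 0x1364, "fall_through": 0x1200},
}

def next_fetch_pc(pc):
    """Predict the next fetch address from the BTB; on a miss, fetch sequentially."""
    entry = btb.get(pc)
    if entry is None:
        return pc + 4  # no prediction: assume 4-byte sequential fetch
    return entry["target"] if entry["taken"] else entry["fall_through"]
```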
FTB Optimizations over BTB Multi-level: solves the conundrum of needing a small, fast cache while still having enough space to successfully predict branches
FTB Optimizations over BTB Oversize bit: indicates whether a block is larger than a cache line; with a multi-ported cache, allows several smaller blocks to be loaded at the same time
FTB Optimizations over BTB Only stores a partial fall-through address: the fall-through address is close to the current PC, so only an offset needs to be stored
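The partial fall-through address works as a short offset from the current PC. A sketch of the idea, assuming 4-byte instructions and an illustrative 5-bit offset field:

```python
OFFSET_BITS = 5  # illustrative field width

def encode_fall_through(pc, fall_through, inst_size=4):
    """Store only the distance (in instructions) from PC to the fall-through."""
    offset = (fall_through - pc) // inst_size
    assert 0 <= offset < (1 << OFFSET_BITS), "block too large to encode"
    return offset

def decode_fall_through(pc, offset, inst_size=4):
    """Rebuild the full fall-through address from the stored offset."""
    return pc + offset * inst_size
```

Storing the offset instead of a full address shrinks each FTB entry, which is what makes a small, fast first-level table feasible.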
FTB Optimizations over BTB Doesn't store every block: fall-through blocks and blocks that are seldom taken are omitted
Fetch Target Buffer Entry fields: Next PC; Target: target of the branch; Type: conditional, subroutine call/return; Oversize: set if block size > cache line
Fetch Target Buffer
PC used as index into FTB
FTB lookup outcomes:
L1 hit
L1 hit, branch not taken
L1 hit, branch taken
L1 miss: fall through
L1 miss, L2 hit: after an N-cycle delay, the entry is found in L2
L1 and L2 miss: fall through; eventually mispredicts if the branch was taken
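The hit/miss cases above can be summarized in a toy model (Python dictionaries stand in for the two SRAM levels; the N-cycle L2 latency is returned as a penalty rather than modeled cycle by cycle, and the entry fields are illustrative):

```python
class TwoLevelFTB:
    """Toy two-level fetch target buffer: a small L1 backed by a larger L2."""

    def __init__(self, l2_latency=3):   # illustrative N-cycle L2 delay
        self.l1, self.l2 = {}, {}
        self.l2_latency = l2_latency

    def lookup(self, pc, next_seq_pc):
        """Return (predicted next PC, extra cycles of delay)."""
        if pc in self.l1:                      # L1 hit
            entry = self.l1[pc]
            target = entry["target"] if entry["taken"] else entry["fall_through"]
            return target, 0
        if pc in self.l2:                      # L1 miss, L2 hit
            self.l1[pc] = self.l2[pc]          # fill L1 from L2
            # fetch falls through while L2 is probed
            return next_seq_pc, self.l2_latency
        # L1 and L2 miss: fall through; a taken branch here will mispredict
        return next_seq_pc, 0
```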
Hybrid Branch Prediction Meta-predictor selects among: local history predictor; global history predictor; bimodal predictor
Branch Prediction [Diagram: meta-predictor choosing among the bimodal, local-history, and global predictors]
Committing Results When full, the speculative history queue (SHQ) commits its oldest value to the local or global history
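The hybrid scheme above can be sketched as a McFarling-style chooser (a toy Python sketch; the 2-bit meta counter and the two-way global/local choice are simplifications of the talk's three-predictor scheme, and SHQ-based speculative history management is omitted):

```python
class HybridPredictor:
    """Toy chooser: a 2-bit meta counter per PC picks between two components."""

    def __init__(self):
        self.meta = {}  # pc -> 2-bit counter; >= 2 favors the global predictor

    def predict(self, pc, global_pred, local_pred):
        counter = self.meta.get(pc, 1)
        return global_pred if counter >= 2 else local_pred

    def update(self, pc, global_pred, local_pred, outcome):
        """Train the chooser only when the component predictors disagree."""
        if global_pred == local_pred:
            return
        counter = self.meta.get(pc, 1)
        if global_pred == outcome:
            counter = min(3, counter + 1)
        else:
            counter = max(0, counter - 1)
        self.meta[pc] = counter
```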
Outline Scalable Front-End and Components Fetch Target Queue Fetch Target Buffer Experimental Methodology Results Analysis and Conclusion
Experimental Methodology I Baseline Architecture Processor: 8-instruction fetch with 16-instruction issue per cycle; 128-entry reorder buffer with 32-entry load/store buffer; 8-cycle minimum branch misprediction penalty. Cache: 64K 2-way instruction cache; 64K 4-way data cache (pipelined)
Experimental Methodology II Timing model: Cacti cache compiler, which models on-chip memory; modified for 0.35 um, um and 0.10 um processes. Test set: 6 SPEC95 benchmarks and 2 C++ programs
Outline Scalable Front-End and Components Fetch Target Queue Fetch Target Buffer Experimental Methodology Results Analysis and Conclusion
Comparing FTB to BTB FTB provides slightly better performance; tested for various cache sizes: 64, 256, 1K, 4K, and 8K entries
Comparing Multi-Level FTB to Single-Level FTB Two-level FTB performance: smaller average fetch block size than single-level (single-level average: 7.5); higher accuracy on average (two-level: 83.3%, single-level: 73.1%); higher performance (25% average speedup over single-level)
Fall-Through Bits Used Number of fall-through bits needed: 4-5, because fetch distances beyond 16 instructions do not improve performance
FTQ Occupancy Roughly indicates throughput; on average, the FTQ is empty 21.1% and full 10.7% of the time
Scalability The two-level FTB scales well with feature size [Plot: performance across feature sizes; higher slope is better]
Outline Scalable Front-End and Components Fetch Target Queue Fetch Target Buffer Experimental Methodology Results Analysis and Conclusion
Analysis 25% improvement in IPC over the best-performing single-level designs; the system scales well with feature size; on average, the FTQ is empty only 21.1% of the time; the FTB design requires at most 5 bits for the fall-through address
Conclusion The FTQ and FTB design decouples the I-cache from branch prediction, producing higher throughput, and uses a multi-level buffer, producing better scalability
References [1] A Scalable Front-End Architecture for Fast Instruction Delivery. Glenn Reinman, Todd Austin, and Brad Calder. ACM/IEEE 26th Annual International Symposium on Computer Architecture. May 1999. [2] Branch Target Buffer: Design and Optimization. Chris Perleberg and Alan Smith. Technical Report. December 1989.
Thank you Questions?