A Scalable Front-End Architecture for Fast Instruction Delivery Paper by: Glenn Reinman, Todd Austin and Brad Calder Presenter: Alexander Choong
Conventional Pipeline Architecture High-performance processors can be broken down into two parts: Front-end: fetches and decodes instructions; Execution core: executes instructions
Front-End and Pipeline [Diagram: simple front-end, fetch stage feeding decode]
Front-End with Prediction [Diagram: simple front-end with a predict stage coupled to each fetch, feeding decode]
Front-End Issues I Flynn's bottleneck: IPC is bounded by the number of instructions fetched per cycle. Implication: as execution-core performance increases, the front-end must keep up to sustain overall performance
Front-End Issues II Two opposing forces: designing a faster front-end argues for increasing I-cache size; the interconnect scaling problem (wire performance does not scale with feature size) argues for decreasing I-cache size
Key Contributions I
Key Contributions: Fetch Target Queue Objective: avoid using a large cache with branch prediction. Purpose: decouple the I-cache from branch prediction. Result: improves throughput
Key Contributions: Fetch Target Buffer Objective: avoid large caches with branch prediction. Implementation: a multi-level buffer. Results: delivers performance 25% better than a single-level buffer; scales better with "future" feature sizes
Outline Scalable Front-End and Components Fetch Target Queue Fetch Target Buffer Experimental Methodology Results Analysis and Conclusion
Fetch Target Queue Decouples the I-cache from branch prediction: the branch predictor can generate predictions independent of when the I-cache uses them [Diagram: simple front-end, with fetch and predict coupled each cycle]
Fetch Target Queue Decouples the I-cache from branch prediction: the branch predictor can generate predictions independent of when the I-cache uses them [Diagram: front-end with FTQ, with predict running ahead of fetch]
Fetch Target Queue Fetch and predict can have different latencies, as long as they have the same throughput; this allows the I-cache to be pipelined
Fetch Blocks FTQ stores fetch blocks: a sequence of instructions starting at a branch target and ending at a strongly biased branch; instructions are fed directly into the pipeline
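The decoupling described above can be sketched as a producer/consumer queue (a toy Python sketch, not the paper's hardware; the FetchBlock fields, the toy prediction, and the queue depth are illustrative):

```python
from collections import deque
from dataclasses import dataclass

@dataclass
class FetchBlock:
    start_pc: int      # branch target that begins the block
    fall_through: int  # address just past the block's final branch

# The FTQ buffers predictions so the predictor can run ahead of the I-cache.
ftq = deque(maxlen=8)  # illustrative depth

def predict_stage(pc):
    """Produce a fetch-block prediction and enqueue it if there is room."""
    block = FetchBlock(start_pc=pc, fall_through=pc + 16)  # toy prediction
    if len(ftq) < ftq.maxlen:
        ftq.append(block)
        return True    # predictor made progress this cycle
    return False       # FTQ full: predictor stalls, fetch is unaffected

def fetch_stage():
    """Consume the oldest prediction; fetch idles only when the FTQ is empty."""
    return ftq.popleft() if ftq else None
```

The key property is that a stall on either side leaves the other side free to keep working until the queue fills or drains.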
Outline Scalable Front-End and Components Fetch Target Queue Fetch Target Buffer Experimental Methodology Results Analysis and Conclusion
Fetch Target Buffer: Outline Review: Branch Target Buffer Fetch Target Buffer Fetch Blocks Functionality
Review: Branch Target Buffer I Previous work (Perleberg and Smith [2]): makes fetch independent of predict [Diagram: simple front-end vs. front-end with branch target buffer]
Review: Branch Target Buffer II Characteristics: hash table; makes predictions; caches prediction information
Review: Branch Target Buffer III (indexed by PC)
Index/Tag | Prediction | Predicted branch target | Fall-through address | Instructions at branch
0x1718    | Taken      | 0x1834                  | 0x1788               | add, sub
0x1734    | Taken      | 0x2088                  | 0x1764               | neq, br
0x1154    | Not taken  | 0x1364                  | 0x1200               | ld, store
…         | …          | …                       | …                    | …
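The table rows behave like entries in a hash map keyed by the branch PC. A minimal Python sketch, using the example entries above (the sequential fallback of pc + 4 assumes 4-byte instructions):

```python
# Toy BTB: a hash table keyed by the branch PC.
# Entries mirror the example table above.
btb = {
    0x1718: {"taken": True,  "target": 0x1834, "fall_through": 0x1788},
    0x1734: {"taken": True,  "target": 0x2088, "fall_through": 0x1764},
    0x1154: {"taken": False, "target": 0x1364, "fall_through": 0x1200},
}

def next_fetch_pc(pc):
    """Predict the next fetch address from the BTB; on a miss, fetch sequentially."""
    entry = btb.get(pc)
    if entry is None:
        return pc + 4  # no prediction: assume 4-byte sequential fetch
    return entry["target"] if entry["taken"] else entry["fall_through"]
```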
FTB Optimizations over BTB Multi-level: solves the conundrum of needing a small, fast cache while still having enough space to successfully predict branches
FTB Optimizations over BTB Oversize bit: indicates whether a block is larger than a cache line; with a multi-ported cache, allows several smaller blocks to be loaded at the same time
FTB Optimizations over BTB Only stores a partial fall-through address: the fall-through address is close to the current PC, so only an offset needs to be stored
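The partial fall-through address works as a short offset from the current PC. A sketch of the idea, assuming 4-byte instructions and an illustrative 5-bit offset field:

```python
OFFSET_BITS = 5  # illustrative field width

def encode_fall_through(pc, fall_through, inst_size=4):
    """Store only the distance (in instructions) from PC to the fall-through."""
    offset = (fall_through - pc) // inst_size
    assert 0 <= offset < (1 << OFFSET_BITS), "block too large to encode"
    return offset

def decode_fall_through(pc, offset, inst_size=4):
    """Rebuild the full fall-through address from the stored offset."""
    return pc + offset * inst_size
```

Storing the offset instead of a full address shrinks each FTB entry, which is what makes a small, fast first-level table feasible.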
FTB Optimizations over BTB Doesn't store every block: fall-through blocks and blocks that are seldom taken are omitted
Fetch Target Buffer Entry fields: Next PC; Target: target of the branch; Type: conditional, subroutine call/return; Oversize: set if block size > cache line
Fetch Target Buffer
PC used as index into FTB
FTB lookup outcomes:
L1 hit
L1 hit, branch not taken
L1 hit, branch taken
L1 miss: fall through
L1 miss, L2 hit: after an N-cycle delay, the entry is found in L2
L1 and L2 miss: fall through; eventually mispredicts if the branch was taken
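The hit/miss cases above can be summarized in a toy model (Python dictionaries stand in for the two SRAM levels; the N-cycle L2 latency is returned as a penalty rather than modeled cycle by cycle, and the entry fields are illustrative):

```python
class TwoLevelFTB:
    """Toy two-level fetch target buffer: a small L1 backed by a larger L2."""

    def __init__(self, l2_latency=3):   # illustrative N-cycle L2 delay
        self.l1, self.l2 = {}, {}
        self.l2_latency = l2_latency

    def lookup(self, pc, next_seq_pc):
        """Return (predicted next PC, extra cycles of delay)."""
        if pc in self.l1:                      # L1 hit
            entry = self.l1[pc]
            target = entry["target"] if entry["taken"] else entry["fall_through"]
            return target, 0
        if pc in self.l2:                      # L1 miss, L2 hit
            self.l1[pc] = self.l2[pc]          # fill L1 from L2
            # fetch falls through while L2 is probed
            return next_seq_pc, self.l2_latency
        # L1 and L2 miss: fall through; a taken branch here will mispredict
        return next_seq_pc, 0
```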
Hybrid Branch Prediction Meta-predictor selects among: local history predictor; global history predictor; bimodal predictor
Branch Prediction [Diagram: meta-predictor choosing among the bimodal, local-history, and global predictors]
Committing Results When full, the speculative history queue (SHQ) commits its oldest value to the local or global history
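The hybrid scheme above can be sketched as a McFarling-style chooser (a toy Python sketch; the 2-bit meta counter and the two-way global/local choice are simplifications of the talk's three-predictor scheme, and SHQ-based speculative history management is omitted):

```python
class HybridPredictor:
    """Toy chooser: a 2-bit meta counter per PC picks between two components."""

    def __init__(self):
        self.meta = {}  # pc -> 2-bit counter; >= 2 favors the global predictor

    def predict(self, pc, global_pred, local_pred):
        counter = self.meta.get(pc, 1)
        return global_pred if counter >= 2 else local_pred

    def update(self, pc, global_pred, local_pred, outcome):
        """Train the chooser only when the component predictors disagree."""
        if global_pred == local_pred:
            return
        counter = self.meta.get(pc, 1)
        if global_pred == outcome:
            counter = min(3, counter + 1)
        else:
            counter = max(0, counter - 1)
        self.meta[pc] = counter
```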
Outline Scalable Front-End and Components Fetch Target Queue Fetch Target Buffer Experimental Methodology Results Analysis and Conclusion
Experimental Methodology I Baseline Architecture Processor: 8-instruction fetch with 16-instruction issue per cycle; 128-entry reorder buffer with 32-entry load/store buffer; 8-cycle minimum branch misprediction penalty. Cache: 64K 2-way instruction cache; 64K 4-way data cache (pipelined)
Experimental Methodology II Timing model: Cacti cache compiler, which models on-chip memory; modified for 0.35 um, um and 0.10 um processes. Test set: 6 SPEC95 benchmarks and 2 C++ programs
Outline Scalable Front-End and Components Fetch Target Queue Fetch Target Buffer Experimental Methodology Results Analysis and Conclusion
Comparing FTB to BTB FTB provides slightly better performance; tested for various cache sizes: 64, 256, 1K, 4K, and 8K entries
Comparing Multi-Level FTB to Single-Level FTB Two-level FTB performance: smaller average fetch block size than single-level (single-level average: 7.5); higher accuracy on average (two-level: 83.3%, single-level: 73.1%); higher performance (25% average speedup over single-level)
Fall-Through Bits Used Number of fall-through bits needed: 4-5, because fetch distances beyond 16 instructions do not improve performance
FTQ Occupancy Roughly indicates throughput; on average, the FTQ is empty 21.1% and full 10.7% of the time
Scalability The two-level FTB scales well with feature size [Plot: performance across feature sizes; higher slope is better]
Outline Scalable Front-End and Components Fetch Target Queue Fetch Target Buffer Experimental Methodology Results Analysis and Conclusion
Analysis 25% improvement in IPC over the best-performing single-level designs; the system scales well with feature size; on average, the FTQ is empty only 21.1% of the time; the FTB design requires at most 5 bits for the fall-through address
Conclusion The FTQ and FTB design decouples the I-cache from branch prediction, producing higher throughput, and uses a multi-level buffer, producing better scalability
References [1] A Scalable Front-End Architecture for Fast Instruction Delivery. Glenn Reinman, Todd Austin, and Brad Calder. ACM/IEEE 26th Annual International Symposium on Computer Architecture. May 1999. [2] Branch Target Buffer: Design and Optimization. Chris Perleberg and Alan Smith. Technical Report. December 1989.
Thank you Questions?