1
A Scalable Front-End Architecture for Fast Instruction Delivery
Paper by: Glenn Reinman, Todd Austin, and Brad Calder
Presenter: Alexander Choong
2
Conventional Pipeline Architecture
High-performance processors can be broken down into two parts:
- Front-end: fetches and decodes instructions
- Execution core: executes instructions
3
Front-End and Pipeline
[Diagram: a simple front-end (Fetch → Decode → …) feeding the pipeline]
4
Front-End with Prediction
[Diagram: a front-end with fetch and predict paired in each stage, compared against the simple front-end (Fetch → Decode → …)]
5
Front-End Issues I
- Flynn's bottleneck: IPC is bounded by the number of instructions fetched per cycle
- Implication: as execution performance increases, the front-end must keep up to maintain overall performance
6
Front-End Issues II
Two opposing forces:
- Designing a faster front-end favors increasing the I-cache size
- The interconnect scaling problem (wire performance does not scale with feature size) favors decreasing the I-cache size
7
Key Contributions I
8
Key Contributions: Fetch Target Queue
- Objective: avoid using a large cache with branch prediction
- Purpose: decouple the I-cache from branch prediction
- Result: improves throughput
9
Key Contributions: Fetch Target Buffer
- Objective: avoid a large cache with branch prediction
- Implementation: a multi-level buffer
- Results: delivers performance 25% better than a single-level design, and scales better with "future" feature sizes
10
Outline
- Scalable Front-End and Components
- Fetch Target Queue
- Fetch Target Buffer
- Experimental Methodology
- Results
- Analysis and Conclusion
11
Fetch Target Queue
Decouples the I-cache from branch prediction: the branch predictor can generate predictions independent of when the I-cache uses them.
[Diagram: the simple front-end, with fetch and predict coupled in each stage]
12
Fetch Target Queue
Decouples the I-cache from branch prediction: the branch predictor can generate predictions independent of when the I-cache uses them.
[Diagram: the front-end with an FTQ, with predict running ahead of fetch]
13
Fetch Target Queue
Fetch and predict can have different latencies, which allows the I-cache to be pipelined, as long as the two stages sustain the same throughput.
14
Fetch Blocks
The FTQ stores fetch blocks: sequences of instructions starting at a branch target and ending at a strongly biased branch. Instructions are fed directly into the pipeline (see the sketch below).
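To make the decoupling concrete, here is a minimal sketch in Python (hypothetical, not from the paper): the predictor enqueues fetch blocks into the FTQ while the fetch stage dequeues them at its own rate, so the two stages only need matching throughput, not matching latency.

```python
from collections import deque
from dataclasses import dataclass

@dataclass
class FetchBlock:
    start_pc: int   # branch target where the block begins
    length: int     # number of instructions up to the ending branch
    next_pc: int    # predicted address of the following block

class FetchTargetQueue:
    """FIFO of predicted fetch blocks sitting between predictor and I-cache."""
    def __init__(self, capacity=8):
        self.q = deque()
        self.capacity = capacity

    def enqueue(self, block):
        # Called by the branch predictor; it stalls only when the FTQ is full.
        if len(self.q) < self.capacity:
            self.q.append(block)
            return True
        return False

    def dequeue(self):
        # Called by the (possibly pipelined) fetch stage; idles when empty.
        return self.q.popleft() if self.q else None

ftq = FetchTargetQueue()
ftq.enqueue(FetchBlock(start_pc=0x1788, length=4, next_pc=0x2088))
assert ftq.dequeue().start_pc == 0x1788
```

Because the queue buffers predictions, the predictor can run ahead during I-cache stalls, and fetch can drain buffered blocks while the predictor waits on a slower lookup.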
15
Outline
- Scalable Front-End and Components
- Fetch Target Queue
- Fetch Target Buffer
- Experimental Methodology
- Results
- Analysis and Conclusion
16
Fetch Target Buffer: Outline
- Review: Branch Target Buffer
- Fetch Target Buffer
- Fetch Blocks
- Functionality
17
Review: Branch Target Buffer I
Previous work (Perleberg and Smith [2]) makes fetch independent of predict.
[Diagram: the simple front-end versus a front-end with a Branch Target Buffer, with fetch and predict paired in each stage]
18
Review: Branch Target Buffer II
Characteristics:
- Hash table
- Makes predictions
- Caches prediction information
19
Review: Branch Target Buffer III
The PC is used as the index into the table:

Index/Tag   Prediction   Predicted branch target   Fall-through address   Instructions at branch
0x1718      Taken        0x1834                    0x1788                 add, sub
0x1734      Taken        0x2088                    0x1764                 neq, br
0x1154      Not taken    0x1364                    0x1200                 ld, store
...         ...          ...                       ...                    ...
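Functionally, the BTB behaves like a PC-indexed map. The following Python sketch (hypothetical, reusing the rows from the table above) shows one lookup:

```python
# Hypothetical BTB contents, using the rows from the table above.
btb = {
    0x1718: {"prediction": "taken",     "target": 0x1834, "fall_through": 0x1788},
    0x1734: {"prediction": "taken",     "target": 0x2088, "fall_through": 0x1764},
    0x1154: {"prediction": "not taken", "target": 0x1364, "fall_through": 0x1200},
}

def next_fetch_pc(pc):
    entry = btb.get(pc)      # the PC acts as the index/tag
    if entry is None:
        return pc + 4        # miss: fetch sequentially (4-byte instructions assumed)
    if entry["prediction"] == "taken":
        return entry["target"]
    return entry["fall_through"]

assert next_fetch_pc(0x1718) == 0x1834
```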
20
FTB Optimizations over BTB
A multi-level design solves a conundrum:
- the cache must be small (to be fast),
- yet it needs enough space to successfully predict branches.
21
FTB Optimizations over BTB
Oversize bit: indicates whether a block is larger than a cache line. With a multi-port cache, this allows several smaller blocks to be loaded at the same time.
22
FTB Optimizations over BTB
Stores only a partial fall-through address: since the fall-through address is close to the current PC, only an offset needs to be stored.
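A small sketch of the offset trick (hypothetical Python, assuming 4-byte instructions and the 4-5 offset bits reported later in the results):

```python
FALLTHRU_BITS = 5  # the results later show 4-5 bits suffice

def encode_fall_through(start_pc, fall_through):
    offset = (fall_through - start_pc) >> 2    # distance in 4-byte instructions
    assert 0 <= offset < (1 << FALLTHRU_BITS)  # must fit in the stored field
    return offset

def decode_fall_through(start_pc, offset):
    return start_pc + (offset << 2)            # reconstruct the full address

assert decode_fall_through(0x1718, encode_fall_through(0x1718, 0x1730)) == 0x1730
```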
23
FTB Optimizations over BTB
Doesn't store every block; omitted are:
- fall-through blocks
- blocks that are seldom taken
24
Fetch Target Buffer
Entry fields:
- Next PC
- Target: the branch target
- Type: conditional, subroutine call/return
- Oversize: set if block size > cache line
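Collecting the fields, one plausible shape for an FTB entry (hypothetical Python; the names are illustrative, not the paper's):

```python
from dataclasses import dataclass

@dataclass
class FTBEntry:
    tag: int              # high-order bits of the fetch PC
    target: int           # predicted branch target (the next PC when taken)
    fallthru_offset: int  # partial fall-through address (offset from block start)
    branch_type: str      # "conditional", "call", or "return"
    oversize: bool        # True if the fetch block exceeds one cache line
```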
25
Fetch Target Buffer
26
PC used as index into FTB
27
L1 hit.
28
L1 hit: branch predicted not taken.
30
L1 hit: branch predicted taken.
31
L1 miss: predict fall-through (fetch continues sequentially).
32
L1 miss, L2 hit: after an N-cycle delay, the prediction is delivered from the L2 FTB.
33
L1 and L2 miss: keep falling through; this path eventually mispredicts and is repaired.
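The whole walkthrough can be condensed into one sketch (hypothetical Python). For clarity it serializes the L2 probe; in the actual design the front-end keeps fetching down the fall-through path while the L2 FTB is probed:

```python
def ftb_next_pc(pc, l1, l2, predict_taken, l2_delay=2, max_block_bytes=64):
    """Return (next fetch PC, lookup latency in cycles) for one prediction.

    l1 and l2 are dicts mapping a PC to {"target": ..., "fall_through": ...};
    predict_taken(pc) stands in for the direction predictor.
    """
    entry = l1.get(pc)
    if entry is None:
        entry = l2.get(pc)
        if entry is None:
            # L1 and L2 miss: guess a maximal fall-through block; if the
            # guess is wrong it is later repaired as a misprediction.
            return pc + max_block_bytes, 1
        l1[pc] = entry  # fill L1 after the extra L2 delay
        taken = predict_taken(pc)
        return (entry["target"] if taken else entry["fall_through"], 1 + l2_delay)
    taken = predict_taken(pc)  # L1 hit: one-cycle prediction
    return (entry["target"] if taken else entry["fall_through"], 1)
```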
34
Hybrid Branch Prediction
A meta-predictor selects between:
- a local history predictor
- a global history predictor
- a bimodal predictor
35
Branch Prediction
[Diagram: meta-predictor choosing among the bimodal predictor, the local predictor (fed by local history), and the global predictor]
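A minimal sketch of meta-prediction (hypothetical Python): a table of saturating 2-bit counters picks which component to trust per branch. For brevity this version chooses between only a global (gshare-style) and a bimodal component; the design in the paper arbitrates over a local-history predictor as well.

```python
def counter_update(c, taken):
    """Saturating 2-bit counter: 0-1 predict not taken, 2-3 predict taken."""
    return min(c + 1, 3) if taken else max(c - 1, 0)

class HybridPredictor:
    def __init__(self, size=1024):
        self.meta = [1] * size         # >=2 selects the global component
        self.bimodal = [1] * size      # PC-indexed 2-bit counters
        self.global_pred = [1] * size  # history-indexed 2-bit counters
        self.ghr = 0                   # global history register

    def _indices(self, pc):
        return pc % len(self.meta), (pc ^ self.ghr) % len(self.global_pred)

    def predict(self, pc):
        i, g = self._indices(pc)
        chosen = self.global_pred[g] if self.meta[i] >= 2 else self.bimodal[i]
        return chosen >= 2

    def update(self, pc, taken):
        i, g = self._indices(pc)
        g_ok = (self.global_pred[g] >= 2) == taken
        b_ok = (self.bimodal[i] >= 2) == taken
        if g_ok != b_ok:  # train the chooser toward whichever component was right
            self.meta[i] = counter_update(self.meta[i], g_ok)
        self.global_pred[g] = counter_update(self.global_pred[g], taken)
        self.bimodal[i] = counter_update(self.bimodal[i], taken)
        self.ghr = ((self.ghr << 1) | int(taken)) & 0x3FF  # keep 10 history bits
```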
36
Branch Prediction
37
Committing Results
When full, the speculative history queue (SHQ) commits its oldest value to the local or global history (sketched below).
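A sketch of the idea (hypothetical Python; the interface is invented for illustration): outcomes are recorded speculatively, and the oldest entry is committed into the architectural history when the queue fills; a misprediction squashes the uncommitted entries.

```python
from collections import deque

class SpeculativeHistoryQueue:
    """Buffers speculative branch outcomes before they update predictor history."""
    def __init__(self, capacity=8):
        self.q = deque()
        self.capacity = capacity
        self.global_history = 0  # committed (architectural) history

    def record(self, pc, taken):
        if len(self.q) == self.capacity:
            self.commit_oldest()  # full: commit the oldest value
        self.q.append((pc, taken))

    def commit_oldest(self):
        _pc, taken = self.q.popleft()
        self.global_history = ((self.global_history << 1) | int(taken)) & 0x3FF

    def squash(self):
        self.q.clear()  # misprediction: discard uncommitted speculative history
```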
38
Outline
- Scalable Front-End and Components
- Fetch Target Queue
- Fetch Target Buffer
- Experimental Methodology
- Results
- Analysis and Conclusion
39
Experimental Methodology I
Baseline architecture:
Processor:
- 8-instruction fetch with 16-instruction issue per cycle
- 128-entry reorder buffer with a 32-entry load/store buffer
- 8-cycle minimum branch misprediction penalty
Cache:
- 64 KB 2-way set-associative instruction cache
- 64 KB 4-way set-associative data cache (pipelined)
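For reference, the same baseline expressed as a simulator-style configuration (hypothetical Python; the key names are illustrative):

```python
# Baseline machine parameters from this slide, as a configuration dict.
baseline = {
    "fetch_width": 8,         # instructions fetched per cycle
    "issue_width": 16,        # instructions issued per cycle
    "rob_entries": 128,       # reorder buffer entries
    "lsq_entries": 32,        # load/store buffer entries
    "min_branch_penalty": 8,  # cycles, minimum on a misprediction
    "icache": {"size_kb": 64, "assoc": 2},
    "dcache": {"size_kb": 64, "assoc": 4, "pipelined": True},
}
```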
40
Experimental Methodology II
Timing model:
- Cacti cache compiler, which models on-chip memory
- Modified for 0.35 µm, 0.18 µm, and 0.10 µm processes
Test set:
- 6 SPEC95 benchmarks
- 2 C++ programs
41
Outline
- Scalable Front-End and Components
- Fetch Target Queue
- Fetch Target Buffer
- Experimental Methodology
- Results
- Analysis and Conclusion
42
Comparing FTB to BTB
- The FTB provides slightly better performance
- Tested for various cache sizes: 64, 256, 1K, 4K, and 8K entries
[Chart: speedup across cache sizes; higher is better]
43
Comparing Multi-Level FTB to Single-Level FTB
Two-level FTB performance:
- Smaller average fetch size: 6.6 (two-level) vs. 7.5 (single-level)
- Higher accuracy on average: 83.3% (two-level) vs. 73.1% (single-level)
- Higher performance: 25% average speedup over single-level
44
Fall-Through Bits Used
- Number of fall-through bits needed: 4-5
- Because fetch distances beyond 16 instructions do not improve performance
[Chart: performance vs. number of fall-through bits; higher is better]
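One plausible accounting of the bit count (hypothetical Python): if useful fetch distances top out at 16 instructions, the stored offset needs only about log2 of that range.

```python
import math

max_fetch_distance = 16  # instructions; longer fetch distances showed no benefit
bits = math.ceil(math.log2(max_fetch_distance + 1))  # encode distances 0..16
print(bits)  # 5, consistent with the reported 4-5 fall-through bits
```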
45
FTQ Occupancy
- Roughly indicates throughput
- On average, the FTQ is empty 21.1% of the time and full 10.7% of the time
[Chart: FTQ occupancy distribution; higher occupancy is better]
46
Scalability
- Two-level FTBs scale well with feature size
[Chart: performance across feature sizes; a higher slope is better]
47
Outline
- Scalable Front-End and Components
- Fetch Target Queue
- Fetch Target Buffer
- Experimental Methodology
- Results
- Analysis and Conclusion
48
Analysis
- 25% improvement in IPC over the best-performing single-level designs
- The system scales well with feature size
- On average, the FTQ is empty only 21.1% of the time
- The FTB design requires at most 5 bits for the fall-through address
49
Conclusion
The FTQ and FTB design:
- decouples the I-cache from branch prediction, producing higher throughput
- uses a multi-level buffer, producing better scalability
50
References
[1] Glenn Reinman, Todd Austin, and Brad Calder. A Scalable Front-End Architecture for Fast Instruction Delivery. In Proceedings of the 26th Annual International Symposium on Computer Architecture (ISCA), May 1999.
[2] Chris Perleberg and Alan Smith. Branch Target Buffer Design and Optimization. Technical Report, December 1989.
51
Thank you Questions?