
1 A Scalable Front-End Architecture for Fast Instruction Delivery Paper by: Glenn Reinman, Todd Austin and Brad Calder Presenter: Alexander Choong

2 Conventional Pipeline Architecture High-performance processors can be broken down into two parts  Front-end: fetches and decodes instructions  Execution core: executes instructions

3 Front-End and Pipeline [Diagram: a simple front-end, Fetch followed by Decode, feeding the rest of the pipeline]

4 Front-End with Prediction [Diagram: the simple front-end with a Predict stage paired to each Fetch stage ahead of Decode]

5 Front-End Issues I Flynn’s bottleneck:  IPC is bounded by the number of instructions fetched per cycle Implies: as execution-core performance increases, the front-end must keep up to sustain overall performance

6 Front-End Issues II Two opposing forces  Designing a faster front-end pushes toward increasing the I-cache size  The interconnect scaling problem (wire performance does not scale with feature size) pushes toward decreasing the I-cache size

7 Key Contributions I

8 Key Contributions: Fetch Target Queue Objective  Avoid using a large cache with branch prediction Purpose  Decouple the I-cache from branch prediction Results  Improves throughput

9 Key Contributions: Fetch Target Buffer Objective  Avoid large caches with branch prediction Implementation  A multi-level buffer Results  Delivered performance is 25% better than a single-level buffer  Scales better with “future” feature sizes

10 Outline Scalable Front-End and Components  Fetch Target Queue  Fetch Target Buffer Experimental Methodology Results Analysis and Conclusion

11 Fetch Target Queue Decouples I-cache from branch prediction  Branch predictor can generate predictions independent of when the I-cache uses them [Diagram: simple front-end with lock-step Fetch/Predict stages]

12 Fetch Target Queue Decouples I-cache from branch prediction  Branch predictor can generate predictions independent of when the I-cache uses them [Diagram: front-end with an FTQ, where Predict runs ahead and Fetch consumes queued predictions]

13 Fetch Target Queue Fetch and predict can have different latencies  Allows the I-cache to be pipelined As long as the two stages have the same throughput

14 Fetch Blocks The FTQ stores fetch blocks A fetch block is a sequence of instructions  Starting at a branch target  Ending at a strongly biased branch Instructions are fed directly into the pipeline
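The decoupling is easy to see in software form. Below is a minimal sketch, assuming a bounded queue and two illustrative callbacks (predict_next_block and fetch_from_icache are invented names, not interfaces from the paper): the predictor enqueues fetch-block descriptors whenever the FTQ has room, and the I-cache drains them at its own pace, so neither stage stalls the other until the queue is full or empty.

from collections import deque, namedtuple

# Hypothetical fetch-block descriptor: start PC, fall-through PC, and
# the predicted target of the terminating branch (None if not taken).
FetchBlock = namedtuple("FetchBlock", ["start", "fall_through", "target"])

FTQ_DEPTH = 8  # illustrative queue depth

def run_front_end(predict_next_block, fetch_from_icache, n_cycles):
    """Toy model: predict and fetch advance each cycle, coupled only
    through the bounded FTQ."""
    ftq = deque()
    pc = 0
    for _ in range(n_cycles):
        # Predict stage: runs ahead whenever the FTQ has room.
        if len(ftq) < FTQ_DEPTH:
            block = predict_next_block(pc)  # returns a FetchBlock
            ftq.append(block)
            pc = block.target if block.target is not None else block.fall_through
        # Fetch stage: consumes the oldest block once the I-cache
        # finishes with it; returns True when the fetch completes.
        if ftq and fetch_from_icache(ftq[0]):
            ftq.popleft()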

15 Outline Scalable Front-End and Components  Fetch Target Queue  Fetch Target Buffer Experimental Methodology Results Analysis and Conclusion

16 Fetch Target Buffer: Outline Review: Branch Target Buffer Fetch Target Buffer Fetch Blocks Functionality

17 Review: Branch Target Buffer I Previous work (Perleberg and Smith [2]) Makes fetch independent of predict [Diagram: a simple front-end with lock-step Fetch/Predict stages vs. a front-end with a Branch Target Buffer]

18 Review: Branch Target Buffer II Characteristics  Hash table  Makes predictions  Caches prediction information

19 Review: Branch Target Buffer III The BTB is indexed by the PC:

Index/Tag   Branch Prediction   Predicted Branch Target   Fall-Through Address   Instructions at Branch
0x1718      Taken               0x1834                    0x1788                 add, sub
0x1734      Taken               0x2088                    0x1764                 neq, br
0x1154      Not taken           0x1364                    0x1200                 ld, store
…           …                   …                         …                      …
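As a concrete illustration, the table above behaves like an associative lookup keyed by the branch PC. A minimal sketch (the dict stands in for the hash table; a real BTB matches on a tag subset of the PC):

# Entries mirror the example rows above: (prediction, target, fall_through).
btb = {
    0x1718: ("taken", 0x1834, 0x1788),
    0x1734: ("taken", 0x2088, 0x1764),
    0x1154: ("not_taken", 0x1364, 0x1200),
}

def next_fetch_pc(pc, insn_bytes=4):
    """Return the next PC to fetch, given the current PC."""
    entry = btb.get(pc)
    if entry is None:
        return pc + insn_bytes           # BTB miss: fetch sequentially
    prediction, target, fall_through = entry
    return target if prediction == "taken" else fall_through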

20 FTB Optimizations over BTB Multi-level  Solves a conundrum: A small cache is needed for speed Enough space is needed to successfully predict branches

21 FTB Optimizations over BTB Oversize bit  Indicates whether a block is larger than a cache line  With a multi-ported cache, allows several smaller blocks to be loaded at the same time

22 FTB Optimizations over BTB Only stores a partial fall-through address  The fall-through address is close to the current PC  Only an offset needs to be stored
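A sketch of the offset trick (the 5-bit width and 4-byte instruction size are assumptions for illustration; the results later in the deck suggest 4-5 bits are enough):

OFFSET_BITS = 5   # illustrative width
INSN_BYTES = 4    # assumed fixed-width ISA

def encode_fall_through(block_start, fall_through):
    """Store the fetch distance in instructions, not a full address."""
    distance = (fall_through - block_start) // INSN_BYTES
    assert 0 <= distance < (1 << OFFSET_BITS), "block too long to encode"
    return distance

def decode_fall_through(block_start, distance):
    """Reconstruct the full fall-through address at lookup time."""
    return block_start + distance * INSN_BYTES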

23 FTB Optimizations over BTB Doesn’t store every block:  Fall-through blocks  Blocks that are seldom taken

24 Fetch Target Buffer Entry fields:  Next PC  Target: target of the branch  Type: conditional, subroutine call/return  Oversize: set if block size > cache line
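Putting the fields together, a hypothetical FTB entry might look like the following; the field names and widths are assumptions based on this slide, not the paper's exact layout:

from dataclasses import dataclass

@dataclass
class FTBEntry:
    tag: int            # identifies the fetch block's start PC
    target: int         # predicted target of the terminating branch
    fall_through: int   # partial fall-through: a small offset, not a full address
    branch_type: str    # "conditional", "call", or "return"
    oversize: bool      # True if the block spans more than one cache line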

25 Fetch Target Buffer

26 PC used as index into FTB

27 L1 Hit

28 L1 Hit: Branch Not Taken

30 L1 Hit: Branch Taken

31 L1 Miss: fall through to the next sequential block

32 L1 Miss, L2 Hit: after an N-cycle delay the entry arrives from the second-level FTB

33 L1 and L2 Miss: fall through; an incorrect fall-through guess eventually surfaces as a misprediction
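The eight slides above walk through the lookup flow. A rough software sketch, assuming dict-based levels, an FTBEntry like the one sketched earlier, and a separate direction predictor (all illustrative):

def ftb_next_pc(pc, l1_ftb, l2_ftb, predict_taken, sequential):
    """Toy two-level FTB lookup; sequential(pc) is the fall-through guess."""
    entry = l1_ftb.get(pc)
    if entry is not None:
        # L1 hit: direction from the branch predictor, addresses from the entry.
        return entry.target if predict_taken(pc) else entry.fall_through
    # L1 miss: fall through immediately while the slower L2 is probed.
    guess = sequential(pc)
    entry = l2_ftb.get(pc)      # in hardware this answer arrives N cycles later
    if entry is not None:
        # Promote into L1; fetch is redirected if the L2 entry
        # disagrees with the fall-through guess.
        l1_ftb[pc] = entry
    # On an L2 miss the fall-through guess stands; if it is wrong, it is
    # eventually corrected as an ordinary branch misprediction.
    return guess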

34 Hybrid branch prediction Meta-predictor selects between  Local history predictor  Global history predictor  Bimodal predictor

35 Branch Prediction [Diagram: meta-predictor selecting among the bimodal predictor, the local predictor (local history plus local prediction tables), and the global predictor]

36 Branch Prediction

37 Committing Results When full, the speculative history queue (SHQ) commits its oldest value to the local or global history
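A rough sketch of the SHQ commit rule (the queue depth and history widths are illustrative assumptions, not the paper's configuration): speculative outcomes queue up, and when the queue is full the oldest outcome is retired into the committed local or global history.

from collections import deque

SHQ_DEPTH = 8  # illustrative

class SpeculativeHistoryQueue:
    """Toy SHQ: holds speculative branch outcomes until they are
    committed into the architectural histories."""
    def __init__(self):
        self.queue = deque()
        self.global_history = 0   # committed global history register
        self.local_history = {}   # committed per-branch local histories

    def record(self, pc, taken):
        if len(self.queue) == SHQ_DEPTH:   # full: retire the oldest entry
            self.commit_oldest()
        self.queue.append((pc, taken))

    def commit_oldest(self):
        pc, taken = self.queue.popleft()
        self.global_history = ((self.global_history << 1) | int(taken)) & 0xFF
        hist = self.local_history.get(pc, 0)
        self.local_history[pc] = ((hist << 1) | int(taken)) & 0x3FF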

38 Outline Scalable Front-End and Components  Fetch Target Queue  Fetch Target Buffer Experimental Methodology Results Analysis and Conclusion

39 Experimental Methodology I Baseline Architecture  Processor 8-instruction fetch with 16-instruction issue per cycle 128-entry reorder buffer with 32-entry load/store buffer 8-cycle minimum branch misprediction penalty  Cache 64K 2-way instruction cache 64K 4-way data cache (pipelined)
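For reference, the same baseline written out as a hypothetical configuration table (the key names are invented for readability, not a real simulator's flags):

# Illustrative baseline configuration, mirroring the slide above.
BASELINE = {
    "fetch_width": 8,          # instructions fetched per cycle
    "issue_width": 16,         # instructions issued per cycle
    "rob_entries": 128,        # reorder buffer entries
    "lsq_entries": 32,         # load/store buffer entries
    "min_branch_penalty": 8,   # minimum misprediction penalty, in cycles
    "icache": {"size_kb": 64, "assoc": 2},
    "dcache": {"size_kb": 64, "assoc": 4, "pipelined": True},
}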

40 Experimental Methodology II Timing Model  CACTI cache compiler Models on-chip memory Modified for 0.35 um, 0.18 um, and 0.10 um processes Test set  6 SPEC95 benchmarks  2 C++ programs

41 Outline Scalable Front-End and Components  Fetch Target Queue  Fetch Target Buffer Experimental Methodology Results Analysis and Conclusion

42 Comparing FTB to BTB FTB provides slightly better performance Tested for various cache sizes: 64, 256, 1K, 4K, and 8K entries

43 Comparing Multi-level FTB to Single-Level FTB Two-level FTB performance  Smaller average fetch block size Two-level: 6.6 Single-level: 7.5  Higher accuracy on average Two-level: 83.3% Single-level: 73.1%  Higher performance 25% average speedup over single-level

44 Fall-through Bits Used Number of fall-through bits needed: 4-5  Because fetch distances beyond 16 instructions do not improve performance
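The arithmetic behind the 4-5 bit figure, assuming the fall-through offset is stored as a fetch distance in instructions: distances up to 16 fit in ceil(log2(16)) = 4 bits, and a fifth bit gives headroom. A two-line check:

import math

MAX_FETCH_DISTANCE = 16  # instructions; longer distances don't help (slide above)
print(math.ceil(math.log2(MAX_FETCH_DISTANCE)))  # -> 4 bits for the offset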

45 FTQ Occupancy Roughly indicates throughput On average, the FTQ is  Empty: 21.1% of the time  Full: 10.7% of the time

46 Scalability Two-level FTBs scale well with feature size  Higher slope is better

47 Outline Scalable Front-End and Components  Fetch Target Queue  Fetch Target Buffer Experimental Methodology Results Analysis and Conclusion

48 Analysis 25% improvement in IPC over the best-performing single-level designs System scales well with feature size On average, the FTQ is empty only 21.1% of the time The FTB design requires at most 5 bits for the fall-through address

49 Conclusion FTQ and FTB design  Decouples the I-cache from branch prediction Produces higher throughput  Uses a multi-level buffer Produces better scalability

50 References [1] Glenn Reinman, Todd Austin, and Brad Calder. A Scalable Front-End Architecture for Fast Instruction Delivery. 26th Annual International Symposium on Computer Architecture (ISCA), May 1999. [2] Chris Perleberg and Alan Smith. Branch Target Buffer Design and Optimization. Technical Report, December 1989.

51 Thank you Questions?

