CS 7810 Lecture 7 Trace Cache: A Low Latency Approach to High Bandwidth Instruction Fetching E. Rotenberg, S. Bennett, J.E. Smith Proceedings of MICRO-29.

Fetching Multiple Blocks
- Aggressive out-of-order processors will perform poorly if they fetch only a single basic block every cycle
- Solution:
  - Predict multiple branches and targets in a cycle
  - Fetch multiple cache lines in the cycle
  - Initiate the next set of fetches in the next cycle

Without the Trace Cache
- Stage 1 requires identification of branch predictions and target addresses
- Stage 2 requires multi-ported access of the I-cache
- Stage 3 requires shifting and alignment

Trace Cache
- Takes advantage of temporal locality and biased branches
- Does not require multiple I-cache accesses
[Figure: control-flow graph of basic blocks A through G; example traces starting at A, such as A-C-F (branch outcomes 1,0), A-B-E (0,1), and A-B-D (0,0), each stored with its start address and branch outcome bits]
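The core idea can be sketched as a toy model (all names and the exact key format are illustrative, not taken from the paper): a trace is identified by its starting address plus the taken/not-taken outcomes of the branches embedded in it, so one lookup can deliver several basic blocks at once.

```python
# Toy trace cache: a trace is keyed by its starting address plus the
# outcomes (taken/not-taken bits) of the branches inside it.
# Illustrative sketch only -- sizes, names, and layout are hypothetical.

class TraceCache:
    def __init__(self):
        self.traces = {}  # (start_pc, outcome_bits) -> list of instructions

    def fill(self, start_pc, outcome_bits, instructions):
        """Record a trace assembled at retirement."""
        self.traces[(start_pc, tuple(outcome_bits))] = list(instructions)

    def fetch(self, start_pc, predicted_bits):
        """A single access returns a whole multi-block trace on a hit,
        instead of requiring several I-cache accesses."""
        return self.traces.get((start_pc, tuple(predicted_bits)))

tc = TraceCache()
# Trace A -> B -> D: both branches not-taken (outcomes 0, 0)
tc.fill(0x400, [0, 0], ["A0", "A1", "B0", "D0"])
print(tc.fetch(0x400, [0, 0]))  # hit: the whole trace
print(tc.fetch(0x400, [0, 1]))  # miss: a different predicted path
```

A miss simply falls back to the conventional I-cache path; the predicted outcome bits act as part of the tag, which is why biased branches make hits likely.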

Base Case
- In each cycle, fetch up to three sequential basic blocks

Multiple Branch Predictor
[Figure: a single pattern history table (PHT) indexed by a k-bit global history; MUXes select further predictions from counters indexed by the newest k-1 history bits, yielding multiple branch predictions per cycle]
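A simplified sketch of the figure's idea, under stated assumptions: the first prediction is read with the full k-bit history, and the second prediction speculatively extends the history with the first prediction's bit, which is equivalent to a MUX choosing between two counters selected by the k-1 newest history bits. Table size and counter widths here are illustrative.

```python
# Hedged sketch of a two-predictions-per-cycle global predictor.
# One PHT of 2-bit saturating counters, indexed by a k-bit global
# history register.  Sizes and names are illustrative.

K = 4
PHT = [2] * (1 << K)   # 2-bit counters, initialized weakly taken

def predict_two(history):
    """Return (first, second) branch predictions in one cycle."""
    mask = (1 << K) - 1
    p1 = PHT[history & mask] >= 2          # first branch: full k-bit index
    # Second branch: extend the history with the first prediction
    # (k-1 newest bits + 1 predicted bit), i.e. a MUX between the two
    # adjacent counters selected by the k-1 newest history bits.
    idx2 = ((history << 1) | int(p1)) & mask
    p2 = PHT[idx2] >= 2
    return p1, p2
```

Because both counters feeding the second prediction come from the same PHT read, this needs only one table access per cycle rather than one per branch.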

Trace Cache Design
- The branch predictions can be used to index into the trace cache or for tag comparison (Fig. 4)
- Each line keeps track of the next fetch address (for both taken and not-taken outcomes)
- A line-fill buffer and merge logic assemble traces
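The line-fill path can be sketched as follows (a hypothetical model: class and method names are invented, but the 16-instruction / 3-branch limits match the configuration evaluated in the paper). Retired basic blocks are merged into a buffer until a limit is hit, at which point the finished trace is ready to be written into the trace cache.

```python
# Hedged sketch of the line-fill buffer that assembles traces at
# retirement.  Blocks are assumed small enough to fit in the line.

MAX_INSTRS, MAX_BRANCHES = 16, 3   # limits from the paper's configuration

class LineFillBuffer:
    def __init__(self):
        self.start_pc = None
        self.instrs, self.outcomes = [], []

    def add_block(self, pc, instructions, branch_outcome=None):
        """Merge one retired basic block; return a finished trace
        (start_pc, outcome_bits, instructions) once a limit is
        reached, else None."""
        if self.start_pc is None:
            self.start_pc = pc             # trace is tagged by its first PC
        self.instrs += instructions
        if branch_outcome is not None:
            self.outcomes.append(branch_outcome)
        if len(self.instrs) >= MAX_INSTRS or len(self.outcomes) == MAX_BRANCHES:
            return self._finish()
        return None

    def _finish(self):
        trace = (self.start_pc, tuple(self.outcomes), list(self.instrs))
        self.__init__()                    # reset for the next trace
        return trace
```

Building traces at retirement keeps the fill logic off the fetch critical path, which is one of the paper's central arguments for low latency.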

Trace Cache
[Figure: trace cache organization]

Design Alternatives
- Associativity (including path associativity)
- Partial matches: use all instructions up to the first branch whose predicted outcome disagrees with the stored trace
- Multiple line-fill buffers
- Trace selection to reduce conflicts
- Multi-cycle trace caches?
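The partial-match alternative can be sketched like this (a hypothetical layout: the trace is stored as per-block instruction groups, where block i+1 is reached via branch i). On a lookup whose predicted path diverges from the stored trace, the fetch still supplies the matching prefix instead of declaring a full miss.

```python
# Hedged sketch of partial matching: return the instructions of the
# stored trace up to the first branch whose stored outcome disagrees
# with the current prediction.  Names and layout are illustrative.

def partial_match(trace_outcomes, trace_blocks, predicted):
    """trace_blocks[i] holds the instructions of the i-th basic block;
    block i+1 is reached via branch i.  Return the usable prefix."""
    instrs = list(trace_blocks[0])         # first block always usable
    for i, (stored, pred) in enumerate(zip(trace_outcomes, predicted)):
        if stored != pred:
            break                          # paths diverge: stop here
        instrs += trace_blocks[i + 1]
    return instrs

blocks = [["a1", "a2"], ["b1"], ["c1"]]
print(partial_match([0, 1], blocks, [0, 0]))  # prefix only: a-block + b-block
```

The trade-off the slide hints at: partial matching raises effective hit rate but needs extra logic to truncate the trace and redirect fetch at the divergence point.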

Branch Address Cache
- A BTB extension that allows multiple target prediction
- The BTB maintains 14 addresses (a tree of basic blocks)
- Based on the branch predictions, three addresses are forwarded to the I-cache
- Adds pipeline stages
- Can still have I-cache bank contention

Collapsing Buffer
- Can detect taken branches within a single cache line
- Also suffers from merge logic and bank contention

Methodology
- Very aggressive out-of-order processor: large window (2048 instructions), unlimited resources, no artificial dependences, no cache misses
- Benchmarks: SPEC92-Int and the Instruction Benchmark Suite (IBS)
- Trace cache: 64 entries, 16 instructions and 3 branches per entry; 712 bytes of tags and 4KB worth of instructions
- I-cache is 128KB

Results
- Fetching three sequential basic blocks (SEQ.3) is not much more complex than fetching one; IPC improvement of ~15%
- The trace cache outperforms the BAC and CB; note that BAC and CB cannot handle all kinds of trace patterns and suffer from I-cache bank contention
- TC outperforms SEQ.3 by 12%
- BAC and CB do worse than SEQ.3 if they increase front-end latency

Ideal Fetch
- The trace cache is within 20% of ideal fetch
- The trace miss rate is fairly high: 18-76%
- Up to 60% of instructions do not come from the trace cache
- A larger trace cache comes within 10% of ideal fetch; note that the front-end is the bottleneck in this processor