CS 7810 Lecture 7
Trace Cache: A Low Latency Approach to High Bandwidth Instruction Fetching
E. Rotenberg, S. Bennett, J. E. Smith, Proceedings of MICRO-29, 1996
Fetching Multiple Blocks
- Aggressive out-of-order processors will perform poorly if they fetch only a single basic block every cycle
- Solution:
  - Predict multiple branches and targets in a cycle
  - Fetch multiple cache lines in the cycle
  - Initiate the next set of fetches in the next cycle
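The bandwidth problem above can be made concrete with a minimal sketch (not from the paper; the toy program and block size are assumptions): with basic blocks of roughly four instructions, a 16-wide fetch unit that stops at the first branch delivers only a quarter of its peak bandwidth per cycle.

```python
def fetch_one_block(program, pc, width=16):
    """Fetch up to `width` instructions, stopping after the first branch
    (the single-basic-block-per-cycle limitation)."""
    fetched = []
    for instr in program[pc:pc + width]:
        fetched.append(instr)
        if instr == "br":          # basic-block boundary: fetch stops here
            break
    return fetched

# Toy program: a branch every 4th instruction (typical block size ~4-5).
program = ["op", "op", "op", "br"] * 8

print(len(fetch_one_block(program, 0)))   # 4 of a possible 16 per cycle
```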
Without the Trace Cache
- Stage 1 requires identification of predictions and target addresses
- Stage 2 requires multi-ported access of the I-cache
- Stage 3 requires shifting and alignment
Trace Cache
- Takes advantage of temporal locality and biased branches
- Does not require multiple I-cache accesses
[Figure: control-flow graph of basic blocks A through G; each stored trace pairs a start block with its branch outcomes, e.g. A B D (outcomes 0 0), A B E (0 1), A C F (1 0)]
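The core idea can be sketched in a few lines of Python (class and field names are hypothetical, not from the paper): a trace is identified by its start address plus the predicted outcomes of the branches inside it, so one lookup can supply several non-contiguous basic blocks.

```python
class TraceCache:
    """Minimal sketch: a trace is keyed by (start PC, branch-outcome bits)."""

    def __init__(self, max_branches=3):
        self.lines = {}                 # (start_pc, outcomes) -> instr list
        self.max_branches = max_branches

    def fill(self, start_pc, outcomes, instructions):
        assert len(outcomes) <= self.max_branches
        self.lines[(start_pc, outcomes)] = instructions

    def lookup(self, start_pc, predicted_outcomes):
        # Hit only if the fetch address and all branch predictions match.
        return self.lines.get((start_pc, predicted_outcomes))

tc = TraceCache()
tc.fill("A", "00", ["A0", "A1", "B0", "D0"])   # trace A -> B -> D
print(tc.lookup("A", "00"))   # hit: the whole multi-block trace in one access
print(tc.lookup("A", "01"))   # miss: a different predicted path (A -> B -> E)
```

A miss falls back to the conventional I-cache while the line-fill buffer assembles the new trace.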
Base Case
- In each cycle, fetch up to three sequential basic blocks
Multiple Branch Predictor
[Figure: pattern history table (PHT) indexed by the k-bit global history; muxes driven by the earlier predictions select the second and third predictions from entries sharing the upper k-1 history bits]
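A minimal sketch of the idea behind the figure (table size, counter widths, and variable names are assumptions): when predicting three branches in one cycle, the outcomes of the first and second branches have not yet been shifted into the global history, so each later prediction speculatively shifts the earlier prediction in before indexing the PHT.

```python
K = 4
pht = [2] * (1 << K)          # 2-bit counters, initialized weakly taken
MASK = (1 << K) - 1

def predict_three(history):
    """Return taken/not-taken predictions for up to three branches."""
    p1 = pht[history] >= 2
    # For the 2nd branch, the newest history bit is the (still unresolved)
    # outcome of the 1st branch: speculatively shift the prediction in.
    h2 = ((history << 1) | p1) & MASK
    p2 = pht[h2] >= 2
    h3 = ((h2 << 1) | p2) & MASK
    p3 = pht[h3] >= 2
    return p1, p2, p3

print(predict_three(0b1010))
```

In hardware the speculative shift is the mux stage in the figure: the upper k-1 history bits select a pair of adjacent PHT entries, and the earlier prediction picks between them.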
Trace Cache Design
- The branch predictions can be used to index into the trace cache or for tag comparison (Fig. 4)
- Keep track of next addresses (taken and not-taken)
- The line-fill buffer and merge logic assemble traces
Trace Cache
Design Alternatives
- Associativity (including path associativity)
- Partial matches: use all instructions up to the first mispredicted branch
- Multiple line-fill buffers
- Trace selection to reduce conflicts
- Multi-cycle trace caches?
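The partial-match alternative can be sketched as follows (the trace layout and names are hypothetical): rather than requiring every stored branch outcome to match the current predictions, the fetch unit supplies instructions up to the first branch where the stored trace and the predictions diverge.

```python
def partial_match(stored_outcomes, predicted, blocks):
    """blocks[0] precedes any branch; blocks[i+1] follows the i-th branch.
    Supply instructions until the stored and predicted outcomes diverge."""
    out = list(blocks[0])
    for i, (s, p) in enumerate(zip(stored_outcomes, predicted)):
        if s != p:
            break                      # diverge: stop at first mismatch
        out += blocks[i + 1]
    return out

blocks = [["A0", "A1"], ["B0"], ["D0", "D1"]]
print(partial_match("00", "01", blocks))   # only blocks A and B supplied
```

The trade-off is extra logic to mark branch positions within the line so the trace can be truncated mid-way.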
Branch Address Cache
- The BTB maintains 14 addresses (a tree of basic-block targets: 2 + 4 + 8)
- Based on the branch predictions, three addresses are forwarded to the I-cache
- A BTB extension that allows multiple target prediction:
  - adds pipeline stages
  - can still have I-cache bank contention
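Address selection in the BAC can be sketched as a tree walk (the data layout below is an assumption, not the paper's exact structure): each level of the target tree doubles in size, giving 2 + 4 + 8 = 14 addresses for three branch levels, and the three predictions index one address per level.

```python
def select_addresses(tree, preds):
    """tree[level] holds 2**(level+1) target addresses; preds is a list of
    0/1 branch predictions. Walk the tree to pick one address per level."""
    addrs, idx = [], 0
    for level, p in enumerate(preds):
        idx = idx * 2 + p              # descend left (0) or right (1)
        addrs.append(tree[level][idx])
    return addrs

tree = [["nt0", "t0"],
        ["nt00", "nt01", "t10", "t11"],
        [f"a{i}" for i in range(8)]]
print(select_addresses(tree, [1, 0, 1]))   # three fetch addresses per cycle
```

The three selected addresses then access the I-cache banks in parallel, which is where the bank-contention problem arises.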
Collapsing Buffer
- Can detect taken branches within a single cache line
- Also suffers from merge logic and bank contention
Methodology
- Very aggressive out-of-order processor: large window (2048 instructions), unlimited resources, no artificial dependences, no cache misses
- SPEC92-Int and Instruction Benchmark Suite (IBS)
- Trace cache: 64 entries, 16 instructions and 3 branches per entry; 712 tag bytes and 4 KB worth of instructions; the I-cache is 128 KB
Results
- Fetching three sequential basic blocks (SEQ.3) is not much more complex than fetching one: IPC improvement of ~15%
- The trace cache outperforms the BAC and CB; note that the latter two cannot handle all kinds of trace patterns and suffer from I-cache bank contention
- The TC outperforms SEQ.3 by 12%
- The BAC and CB do worse than SEQ.3 if they increase front-end latency
Ideal Fetch
- The trace cache is within 20% of ideal fetch
- The trace miss rate is fairly high: 18-76%
- Up to 60% of instructions do not come from the trace cache
- A larger trace cache comes within 10% of ideal fetch; note that the front-end is the bottleneck in this processor