1 Fetch Directed Instruction Prefetching
Glenn Reinman and Brad Calder, Department of Computer Science and Engineering, University of California, San Diego
Todd Austin, Department of Electrical Engineering and Computer Science, University of Michigan

2 Introduction
Instruction supply is critical to processor performance
–Complicated by instruction cache misses
–Instruction cache miss solutions:
  Increasing the size or associativity of the instruction cache
  Instruction cache prefetching, which raises its own questions:
    –Which cache blocks to prefetch?
    –Timeliness of prefetch
    –Interference with demand misses
[Diagram: pipeline with Instruction Fetch, Issue Buffer, and Execution Core]

3 Prior Instruction Prefetching Work
Next line prefetching (NLP) (Smith)
–Each cache block is tagged with an NLP bit
–When a block is accessed during a fetch, the NLP bit determines whether the next sequential block is prefetched
–Prefetches go into a fully associative buffer
Streaming buffers (Jouppi), sketched in code below
–On a cache miss, sequential cache blocks, starting with the block that missed, are prefetched into a buffer
  The buffer can use fully associative lookup
  A uniqueness filter can avoid redundant prefetches
  Multiple streaming buffers can be used together
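To make the streaming-buffer mechanism concrete, here is a minimal Python sketch; it is not from the talk, and the class name, buffer depth, and block-size arithmetic are illustrative assumptions.

```python
# Minimal sketch of a Jouppi-style streaming buffer.
# Class name, depth, and BLOCK_SIZE are assumptions.
from collections import deque

BLOCK_SIZE = 32  # bytes per cache block (assumed)

class StreamingBuffer:
    def __init__(self, depth=4):
        self.entries = deque(maxlen=depth)  # probed with fully associative lookup
        self.next_block = None              # next sequential block to prefetch

    def allocate(self, miss_addr):
        """On an i-cache miss, start streaming at the block that missed."""
        self.entries.clear()
        self.next_block = miss_addr // BLOCK_SIZE

    def prefetch_one(self, already_prefetched):
        """Prefetch the next sequential block; the shared set acts as a
        uniqueness filter that avoids redundant prefetches."""
        if self.next_block is None:
            return None
        blk, self.next_block = self.next_block, self.next_block + 1
        if blk in already_prefetched:
            return None
        already_prefetched.add(blk)
        self.entries.append(blk)
        return blk

    def lookup(self, addr):
        """Fully associative probe of the buffer on each fetch."""
        return (addr // BLOCK_SIZE) in self.entries
```

Multiple such buffers can be instantiated side by side, with the shared already_prefetched set keeping them from duplicating each other's work.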

4 Our Prefetching Approach
Desirable characteristics
–Accuracy of prefetch: useful prefetches
–Timeliness of prefetch: maximize prefetch gain
Fetch Directed Prefetching
–Branch predictor runs ahead of the instruction cache
–Instruction cache prefetch is guided by the predicted instruction stream

5 Talk Overview
Fetch Target Queue (FTQ)
Fetch Directed Prefetching (FDP)
Filtering Techniques
Enhancements to Streaming Buffers
Bandwidth Considerations
Conclusions

6 Fetch Target Queue
A queue of instruction fetch addresses
Latency tolerance
–The branch predictor can continue in the face of an i-cache miss
–Instruction fetch can continue in the face of a branch predictor miss
When combined with a high bandwidth branch predictor
–Provides a stream of instruction addresses far in advance of the current PC (modeled in the sketch below)
[Diagram: Branch Predictor → FTQ → Instruction Fetch → Issue Buffer → Execution Core]
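The decoupling the slide describes can be modeled as a small bounded queue. This is an illustrative sketch; the queue depth and method names are assumed rather than taken from the talk.

```python
# Toy model of the Fetch Target Queue (FTQ) decoupling branch
# prediction from instruction fetch. Depth and names are assumed.
from collections import deque

class FetchTargetQueue:
    def __init__(self, depth=32):
        self.q = deque()
        self.depth = depth

    def enqueue(self, fetch_addr):
        """Predictor side: keeps producing addresses during an i-cache miss."""
        if len(self.q) >= self.depth:
            return False          # FTQ full: predictor stalls this cycle
        self.q.append(fetch_addr)
        return True

    def dequeue(self):
        """Fetch side: keeps consuming addresses while the predictor stalls."""
        return self.q.popleft() if self.q else None

    def peek_ahead(self):
        """Prefetcher side: the queued PCs are a stream of future fetch
        addresses far in advance of the current PC."""
        return list(self.q)
```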

7 Fetch Directed Prefetching
The stream of PCs contained in the FTQ guides prefetch (see the sketch below)
–The FTQ is searched in order for entries to prefetch, subject to filtration mechanisms applied at prefetch enqueue
–Prefetched cache blocks are stored in a fully associative queue, the PIQ (32 entries)
–The fully associative queue and the instruction cache are probed in parallel
[Diagram: Branch Predictor → FTQ → Instruction Fetch, with a prefetch enqueue path from the FTQ into the 32-entry PIQ]
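A rough sketch of the prefetch step, again with assumed names and data structures (plain dicts and sets stand in for the hardware queues):

```python
# Illustrative fetch-directed prefetch step; names and structures assumed.
PIQ_SIZE = 32  # fully associative prefetch queue size, as on the slide

def prefetch_step(ftq_entries, icache_blocks, piq, issue_prefetch):
    """Scan the FTQ in order and issue a prefetch for the oldest
    unprefetched candidate (one prefetch per call).

    ftq_entries   : list of dicts {"block": int, "prefetched": bool}
    icache_blocks : set of block numbers currently in the i-cache
    piq           : dict block -> data, the fully associative PIQ
    issue_prefetch: callable(block) -> data, models the L2 access
    """
    for e in ftq_entries:
        if e["prefetched"]:
            continue
        e["prefetched"] = True
        if e["block"] in icache_blocks or e["block"] in piq:
            continue                      # redundant prefetch, skip it
        if len(piq) < PIQ_SIZE:
            piq[e["block"]] = issue_prefetch(e["block"])
        return

def fetch(block, icache_blocks, piq):
    """Demand fetch probes the i-cache and the PIQ in parallel."""
    return block in icache_blocks or piq.pop(block, None) is not None
```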

8 Methodology
SimpleScalar Alpha 3.0 tool set (Burger, Austin)
–SPEC95 C benchmarks, fast forwarded past the initialization portion of each benchmark
–Can issue 8 instructions per cycle
–128 entry reorder buffer
–32 entry load/store buffer
–Variety of instruction cache sizes examined:
  16K 2-way and 4-way associative
  32K 2-way associative
  Both single and dual ported configurations
–Instruction cache size for this talk is 16K 2-way
–32K 4-way associative data cache
–Unified 1MB 4-way associative second level cache

9 Bandwidth Concerns
Prefetching can disrupt demand fetching
–Need to model bus utilization
Modified SimpleScalar's memory hierarchy
–Accurate modeling of bus usage
–Two configurations of the L2 cache bus to main memory:
  32 bytes/cycle
  8 bytes/cycle
–Single port on the L2 cache, shared by both the data and instruction caches
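As a very rough illustration of the kind of bus model the slide refers to, here is a toy single-port L2 bus; the priority rule (demand traffic ahead of prefetches) and all parameter names are assumptions, not details from the talk.

```python
# Toy model of a single-port L2 bus shared by the instruction and data
# caches. Transfer rates match the two configurations on the slide;
# the demand-over-prefetch priority rule is an assumption.
class L2Bus:
    def __init__(self, bytes_per_cycle=8, block_size=32):
        self.cycles_per_block = block_size // bytes_per_cycle
        self.busy_until = 0      # cycle when the current transfer finishes
        self.busy_cycles = 0     # total cycles the bus has been occupied

    def request(self, now, is_prefetch):
        """Returns the completion cycle of the transfer, or None if a
        prefetch arrives while the bus is busy (prefetches never queue)."""
        if now < self.busy_until:
            if is_prefetch:
                return None
            now = self.busy_until       # demand request waits its turn
        self.busy_until = now + self.cycles_per_block
        self.busy_cycles += self.cycles_per_block
        return self.busy_until

    def utilization(self, total_cycles):
        return self.busy_cycles / total_cycles
```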

10 Performance of Fetch Directed Prefetch
[Chart: speedup of fetch directed prefetching; annotations show 41% and 66% bus utilization, and 89.9%]

11 Reducing Wasted Prefetches
Goal: reduce bus utilization while retaining speedup
–How to identify useless or redundant prefetches?
Variety of filtration techniques:
–FTQ Position Filtering
–Cache Probe Filtering, which uses idle instruction cache ports to validate prefetches:
  Remove CPF
  Enqueue CPF
–Evict Filtering

12 Cache Probe Filtering
Use the instruction cache to validate FTQ entries for prefetch (the probe is sketched below)
–FTQ entries are initially unmarked
–If the cache block is already in the i-cache, invalidate the FTQ entry
–If the cache block is not in the i-cache, validate the FTQ entry
Validation can occur whenever a cache port is idle, for example:
–When the instruction window is full
–During an instruction cache miss, provided the instruction cache is lockup-free
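A minimal sketch of the probe, assuming the dict-based FTQ entries from the earlier sketch plus a "mark" field (None, "VALID", or "INVALID"); the entry layout is an assumption.

```python
# Sketch of cache probe filtering using idle i-cache tag ports.
# Entry layout ({"block": ..., "mark": ...}) is an assumption.
def cache_probe_filter(ftq_entries, icache_blocks, idle_ports):
    """Spend up to idle_ports spare tag lookups this cycle marking
    unmarked FTQ entries: INVALID if the block is already cached
    (a prefetch would be wasted), VALID otherwise."""
    for e in ftq_entries:
        if idle_ports == 0:
            break
        if e["mark"] is not None:    # already probed on an earlier cycle
            continue
        idle_ports -= 1              # consume one spare tag port
        e["mark"] = "INVALID" if e["block"] in icache_blocks else "VALID"
```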

13 Cache Probe Filtering Techniques
Enqueue CPF
–Only enqueue validated prefetches
–A conservative, low bandwidth approach
Remove CPF
–By default, prefetch all FTQ entries
–If idle cache ports are available for validation, do not prefetch entries that are found invalid
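The two policies then reduce to different prefetch predicates over a marked entry; a sketch, following the mark values assumed above:

```python
# The two CPF policies as prefetch predicates (illustrative).
def enqueue_cpf_allows(entry):
    """Enqueue CPF: conservative, prefetch only entries proven VALID."""
    return entry["mark"] == "VALID"

def remove_cpf_allows(entry):
    """Remove CPF: prefetch by default; suppress only entries an idle
    port has proven INVALID (already in the cache)."""
    return entry["mark"] != "INVALID"
```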

14 Performance of Filtering Techniques
[Chart: speedup with the 8 bytes/cycle bus; annotations show 55% and 30% bus utilization]

15 Eviction Prefetching Example
If the branch predictor holds more state than the instruction cache:
–Mark evicted cache blocks in the branch predictor
–Prefetch those blocks when they are next predicted
[Diagram: FTB entries (evict indices 0-2) with per-entry evict bits; a cache miss evicts a block, the evicted block's bit is set in the matching FTB entry, and the set bit triggers a prefetch on the next prediction of that entry]
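A sketch of the evict-bit bookkeeping, assuming the fetch target buffer (FTB) indexing from the slide's example; the class and method names are illustrative.

```python
# Sketch of eviction prefetching bookkeeping in the FTB.
# Names and the index mapping are illustrative assumptions.
class FTB:
    def __init__(self):
        self.evict_bit = {}   # FTB index -> block was evicted from the i-cache

    def on_icache_eviction(self, ftb_index):
        """An i-cache eviction sets the bit in the matching FTB entry,
        recording that the entry's block is no longer cached."""
        self.evict_bit[ftb_index] = True

    def on_prediction(self, ftb_index, block, issue_prefetch):
        """The next time this entry is predicted, a set bit triggers a
        prefetch of the evicted block and the bit is cleared."""
        if self.evict_bit.get(ftb_index):
            issue_prefetch(block)
            self.evict_bit[ftb_index] = False
```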

16 Performance of Filtering Techniques
[Chart: speedup with the 8 bytes/cycle bus; annotations show 20% and 31% bus utilization]

17 Enqueue CPF and Eviction Prefetching
An effective combination of two low bandwidth approaches
–Both attempt to prefetch only entries not in the instruction cache
–Enqueue CPF must wait on an idle cache port to prefetch
–Eviction Prefetching can prefetch as soon as the prediction is made
Combined:
–Eviction Prefetching gives basic coverage
–Enqueue CPF finds additional prefetches that Evict misses

18 Streaming Buffer Enhancements
All configurations used uniqueness filters and fully associative lookup
Base configurations:
–Single streaming buffer (SB1)
–Dual streaming buffers (SB2)
–Eight streaming buffers (SB8)
Cache Probe Filtering (CPF) enhancements (sketched below):
–Filter out streaming buffer prefetches already in the i-cache
–Stop filtering
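One way to picture the CPF enhancement, reusing the StreamingBuffer sketch from earlier in this transcript; the stop-on-hit semantics for the slide's "stop filtering" bullet is my reading, so treat it as an assumption.

```python
# CPF applied to a streaming buffer's next candidate (illustrative;
# reuses the StreamingBuffer sketch shown earlier in this transcript).
def filtered_stream_prefetch(sbuf, icache_blocks, already_prefetched,
                             port_idle, stop_on_hit=True):
    """If an idle i-cache port finds the stream's next block already
    cached, either stop the stream ("stop filtering", assumed meaning)
    or skip the block and keep streaming."""
    blk = sbuf.next_block
    if blk is None:
        return None
    if port_idle and blk in icache_blocks:
        if stop_on_hit:
            sbuf.next_block = None    # stop the stream
        else:
            sbuf.next_block += 1      # skip the cached block
        return None
    return sbuf.prefetch_one(already_prefetched)
```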

19 Streaming Buffer Results
[Chart: speedup with the 8 bytes/cycle bus; annotations show 58% and 36% bus utilization]

20 Selected Low Bandwidth Results
[Chart: results for the 8 bytes/cycle bus configuration]

21 Selected High Bandwidth Results
[Chart: results for the 32 bytes/cycle bus configuration]

22 Conclusion
Fetch Directed Prefetching
–Accurate, just-in-time prefetching
Cache Probe Filtering
–Reduces the bus bandwidth of fetch directed prefetching
–Also useful for streaming buffers
Evict Filter
–Provides accurate prefetching by identifying evicted cache blocks
Fully associative versus in-order prefetch buffer
–Comparison available in an upcoming tech report by the end of the year

23 Prefetching Tradeoffs
NLP
–Simple, low bandwidth approach
–No notion of prefetch usefulness
–Limited timeliness
Streaming Buffers
–Take advantage of the latency of a cache miss
–Can use low to moderate bandwidth with filtering
–No notion of prefetch usefulness
Fetch Directed Prefetching
–Prefetches based on the prediction stream
–Can use low to moderate bandwidth with filtering
–Most useful with accurate branch prediction

