Slide 1: Teaching Old Caches New Tricks: Predictor Virtualization
Andreas Moshovos, Univ. of Toronto
Ioana Burcea's thesis work; some parts joint with Stephen Somogyi (CMU) and Babak Falsafi (EPFL)
Slide 2: Prediction: The Way Forward
Prediction has proven useful in many forms: prefetching, branch target and direction prediction, cache replacement, cache hit prediction. Which to choose?
Application footprints grow, so predictors need to scale to remain effective. Ideally we want fast, accurate predictions; we cannot have both with conventional technology.
Slide 3: The Problem with Conventional Predictors
A predictor design trades off hardware cost, accuracy, and latency.
What we have: small, fast, not-so-accurate predictors. What we want: small, fast, accurate ones.
Predictor Virtualization approximates large, accurate, fast predictors.
Slide 4: Why Now?
Extra resources: CMPs with large on-chip caches. [Figure: a CMP with per-core I$/D$, a shared 10-100MB L2 cache, and physical memory.]
Slide 5: Predictor Virtualization (PV)
Use the on-chip L2 cache to store predictor metadata. Benefit: reduce the cost of dedicated predictors.
Slide 6: Predictor Virtualization (PV)
Use the on-chip L2 cache to store predictor metadata. Benefit: implement otherwise impractical predictors.
Slide 7: Research Overview
PV breaks the conventional predictor design trade-offs:
– Lowers the cost of adoption
– Facilitates implementation of otherwise impractical predictors
PV freeloads on existing resources, with adaptive demand.
Key design challenge: how to compensate for the longer latency to metadata.
PV in action: virtualized "Spatial Memory Streaming" and virtualized Branch Target Buffers.
Slide 8: Talk Roadmap
PV Architecture
PV in Action: virtualizing "Spatial Memory Streaming"; virtualizing Branch Target Buffers
Conclusions
Slide 9: PV Architecture
[Figure: the CPU's optimization engine holds a predictor table that services request/prediction pairs; PV virtualizes this table into the L2 cache and physical memory.]
Slide 10: PV Architecture
The dedicated predictor table is replaced by a small PVCache inside the optimization engine; a PVProxy services PVCache misses from the PVTable, which lives in the L2 cache and physical memory. This requires access to the L2 from the back-side of the L1, which is not as performance critical.
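A minimal sketch of this lookup path, with assumed types and an 11-way, 8-set PVCache (the geometry used later for SMS); the real PVProxy is hardware, so this only shows the control flow:

```cpp
// Sketch of the PV lookup path (names, field widths assumed for illustration).
#include <array>
#include <cstddef>
#include <cstdint>
#include <optional>

struct PVEntry { uint16_t tag = 0; uint32_t payload = 0; bool valid = false; };

class PVProxy {
    static constexpr std::size_t kSets = 8, kWays = 11;  // PVCache geometry
    std::array<std::array<PVEntry, kWays>, kSets> pvcache{};
public:
    std::optional<uint32_t> predict(uint64_t index) {
        auto& set = pvcache[index % kSets];
        for (auto& e : set)                      // common case: small, fast
            if (e.valid && e.tag == static_cast<uint16_t>(index / kSets))
                return e.payload;
        fetch_line_from_l2(index);               // 12-18 cycles (L2 hit) or
        return std::nullopt;                     // ~400 (memory); meanwhile,
    }                                            // behave as "no prediction"
private:
    void fetch_line_from_l2(uint64_t index) {
        (void)index;
        // In hardware: read the L2 line that packs this entry and its
        // neighbors from the PVTable's reserved address range, then
        // install the whole line in the PVCache.
    }
};

int main() { PVProxy p; (void)p.predict(42); }
```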
Slide 11: PV Challenge: Prediction Latency
Common case: the prediction hits in the PVCache. Infrequent case: an L2 access, with a 12-18 cycle latency. Rare case: a physical memory access, with a ~400 cycle latency.
Key: how to pack metadata into L2 cache blocks to amortize these costs.
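For intuition, a back-of-the-envelope expected prediction latency, with illustrative (not measured) hit rates and a 1-cycle PVCache:

\[
E[t] = h_{pv}\,t_{pv} + (1-h_{pv})\,h_{L2}\,t_{L2} + (1-h_{pv})(1-h_{L2})\,t_{mem}
\approx 0.95 \cdot 1 + 0.05 \cdot 0.99 \cdot 15 + 0.05 \cdot 0.01 \cdot 400 \approx 1.9 \text{ cycles}
\]

So as long as the PVCache hit rate is high and metadata is packed so that one L2 fetch serves many predictions, the average latency stays close to that of a dedicated table.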
Slide 12: To Virtualize or Not To Virtualize
Predictors should be redesigned with PV in mind. Overcoming the latency challenge:
– Metadata reuse
  Intrinsic: one entry is used for multiple predictions
  Temporal: one entry is reused in the near future
  Spatial: one miss is amortized by several subsequent hits
– Metadata access pattern predictability, which enables predictor metadata prefetching
This looks similar to designing caches, BUT metadata does not have to be correct all the time, and there is a time limit on its usefulness.
Slide 13: PV in Action
Data prefetching: virtualize "Spatial Memory Streaming" [ISCA06]. Performance within 1% of the original, with hardware cost reduced from ~60KB to < 1KB.
Branch prediction: virtualize branch target buffers. Increases the perceived BTB capacity; up to 12.75% IPC improvement with 8% hardware overhead.
Slide 14: Spatial Memory Streaming [ISCA06]
SMS records spatial patterns: bit vectors (e.g., 1100001010001…, 1101100000001…) that mark which blocks of a memory region are accessed. Patterns are stored in a Pattern History Table.
[ISCA 06] S. Somogyi, T. Wenisch, A. Ailamaki, B. Falsafi, and A. Moshovos, "Spatial Memory Streaming".
Slide 15: Spatial Memory Streaming (SMS)
A Detector (~1KB) observes the data access stream and extracts access patterns; a Predictor (~60KB) matches trigger accesses against stored patterns and issues prefetches. The ~60KB predictor table is what we virtualize.
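A minimal software sketch of the detector/predictor pair, assuming 2KB regions of 32 64-byte blocks and a (PC, region offset) trigger key; these parameters and the hashing are illustrative, not the paper's exact design:

```cpp
// Sketch of SMS: accumulate a per-region access bit vector, replay it on
// a matching trigger access (region sizes and trigger hash are assumed).
#include <cstdint>
#include <unordered_map>

struct Generation { uint64_t trigger; uint32_t pattern; };

struct SMS {
    std::unordered_map<uint64_t, Generation> active; // detector: live regions
    std::unordered_map<uint64_t, uint32_t> pht;      // predictor: trigger -> pattern

    void access(uint64_t pc, uint64_t addr) {
        uint64_t region = addr >> 11;                // 2KB spatial region
        uint32_t block  = (addr >> 6) & 31;          // block index in region
        auto it = active.find(region);
        if (it == active.end()) {                    // trigger access
            uint64_t trig = (pc << 5) | block;       // assumed trigger key
            if (auto p = pht.find(trig); p != pht.end())
                prefetch_pattern(region, p->second); // replay stored pattern
            it = active.emplace(region, Generation{trig, 0}).first;
        }
        it->second.pattern |= 1u << block;           // accumulate pattern
    }
    void end_generation(uint64_t region) {           // e.g., region evicted
        if (auto it = active.find(region); it != active.end()) {
            pht[it->second.trigger] = it->second.pattern;
            active.erase(it);
        }
    }
    void prefetch_pattern(uint64_t region, uint32_t bits) {
        for (uint32_t b = 0; b < 32; ++b)
            if (bits >> b & 1) { /* prefetch (region << 11) + (b << 6) */ }
    }
};

int main() { SMS s; s.access(0x400, 0x10000); s.end_generation(0x10000 >> 11); }
```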
Slide 16: Virtualizing SMS
The virtual table packs (tag, pattern) predictor entries spatially into L2 cache lines: 11 ways by 1K sets in the virtual table, cached by a PVCache of 11 ways by 8 sets; each L2 cache line holds 11 entries plus a few unused bits. Region-level prefetching is naturally tolerant of longer prediction latencies, so it suffices to simply pack predictor entries spatially.
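A sketch of how entries might pack into a 64-byte L2 line; the field widths are assumptions chosen so that 11 entries fit with a few bits left unused, matching the layout on the slide:

```cpp
// Packing arithmetic for virtualized SMS entries (field widths assumed).
#include <cstdint>

constexpr int kTagBits     = 14;   // assumed partial tag
constexpr int kPatternBits = 32;   // assumed spatial-pattern bit vector
constexpr int kEntryBits   = kTagBits + kPatternBits;           // 46 bits
constexpr int kLineBits    = 64 * 8;                            // 64B L2 line
constexpr int kWaysPerLine = kLineBits / kEntryBits;            // = 11 ways
constexpr int kUnusedBits  = kLineBits - kWaysPerLine * kEntryBits; // = 6

static_assert(kWaysPerLine == 11 && kUnusedBits == 6,
              "11 entries of 46 bits fit a 64B line with 6 bits to spare");

int main() { return kWaysPerLine; }
```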
Slide 17: Experimental Methodology
SimFlex: full-system, cycle-accurate simulator.
Baseline processor configuration: 4-core OoO CMP; 64KB 4-way set-associative L1D/L1I; 8MB 16-way set-associative unified L2.
Commercial workloads: web servers (Apache and Zeus), TPC-C (DB2 and Oracle), and several TPC-H queries. Workloads developed by the Impetus group at CMU (Anastasia Ailamaki and Babak Falsafi, PIs).
Slide 18: SMS Performance Potential
[Figure: percentage of L1 read misses covered vs. predictor storage; the conventional predictor degrades with limited storage.]
Slide 19: Virtualized SMS Hardware Cost
[Figure: speedup, higher is better.] The original prefetcher needs ~60KB; the virtualized prefetcher needs < 1KB.
Slide 20: Impact of Virtualization on L2 Requests
[Figure: percentage increase in L2 requests.]
Slide 21: Impact of Virtualization on Off-Chip Bandwidth
[Figure: off-chip bandwidth increase.]
Slide 22: PV in Action
Data prefetching: virtualize "Spatial Memory Streaming" [ISCA06]. Same performance, with hardware cost reduced from ~60KB to < 1KB.
Branch prediction: virtualize branch target buffers. Increases the perceived BTB capacity; up to 12.75% IPC improvement with 8% hardware overhead.
Slide 23: The Need for Larger BTBs
Commercial applications benefit from large BTBs. [Figure: branch MPKI vs. BTB entries, lower is better.]
Slide 24: Virtualizing BTBs: Phantom-BTB
A small, fast dedicated BTB (indexed by PC) is backed by a large, slow virtual table in the L2 cache, decoupled from the BTB. A virtual table entry is a temporal group.
Latency challenge: branch prediction is not tolerant of longer prediction latencies. Solution: predictor metadata prefetching.
Slide 25: Facilitating Metadata Prefetching
Intuition: programs mostly follow similar paths, so the path observed at detection time predicts the subsequent path. [Figure: detection path vs. subsequent path.]
Slide 26: Temporal Groups
Past misses are a good indicator of future misses; the dedicated predictor acts as a filter.
Slide 27: Fetch Trigger
A preceding miss triggers a temporal group fetch; the group is a rough, not precise, region around the miss.
Slides 28-29: Temporal Group Prefetching
[Figure-only slides illustrating the prefetch flow.]
Slide 30: Phantom-BTB Architecture
Two components sit beside the dedicated BTB. The Temporal Group Generator generates temporal groups and installs them in the L2 cache. The Prefetch Engine prefetches temporal groups.
Slide 31: Temporal Group Generation
BTB misses generate temporal groups; BTB hits do not generate any PBTB activity.
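A minimal sketch of temporal group generation, using the 6-entry groups from the evaluated configuration; the virtual table's indexing and install path are assumptions:

```cpp
// Sketch: consecutive BTB misses fill a temporal group; a full group is
// installed in the L2-resident virtual table, keyed by the PC of the
// miss that opened it (keying and storage details are assumptions).
#include <array>
#include <cstdint>
#include <unordered_map>

struct BranchInfo { uint64_t pc; uint64_t target; };
using TemporalGroup = std::array<BranchInfo, 6>;   // 6-entry groups (eval)

class TemporalGroupGenerator {
    std::unordered_map<uint64_t, TemporalGroup> virtual_table; // models L2
    TemporalGroup current{};
    uint64_t group_head = 0;     // PC of the miss that opened the group
    int fill = 0;
public:
    void on_btb_miss(BranchInfo b) {  // BTB hits cause no PBTB activity
        if (fill == 0) group_head = b.pc;
        current[fill++] = b;
        if (fill == static_cast<int>(current.size())) {
            virtual_table[group_head] = current;   // install in the L2
            fill = 0;
        }
    }
};

int main() { TemporalGroupGenerator g; g.on_btb_miss({0x400, 0x500}); }
```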
Slide 32: Branch Metadata Prefetching
BTB misses trigger metadata prefetches from the virtual table in the L2 cache into a small prefetch buffer. The BTB and the prefetch buffer are looked up in parallel, so prefetch buffer hits hide the virtual table's latency.
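A matching sketch of the lookup and prefetch path; the group keying and promotion policy are assumptions, since the slides specify only the parallel lookup and the miss-triggered prefetch:

```cpp
// Sketch of the PBTB lookup path: probe BTB and prefetch buffer in
// parallel; a miss in both prefetches the temporal group this PC heads.
#include <cstdint>
#include <optional>
#include <unordered_map>
#include <utility>
#include <vector>

using TemporalGroup = std::vector<std::pair<uint64_t, uint64_t>>; // (pc, target)

struct PhantomBTB {
    std::unordered_map<uint64_t, uint64_t> btb;          // dedicated BTB
    std::unordered_map<uint64_t, uint64_t> prefetch_buf; // 64 entries (eval)
    std::unordered_map<uint64_t, TemporalGroup> virtual_table; // L2-resident

    std::optional<uint64_t> lookup(uint64_t pc) {
        // In hardware the BTB and prefetch buffer are probed in parallel.
        if (auto it = btb.find(pc); it != btb.end()) return it->second;
        if (auto it = prefetch_buf.find(pc); it != prefetch_buf.end())
            return it->second;                    // prefetch buffer hit
        // Miss in both: prefetch this PC's temporal group, if any, so
        // upcoming branches on the same path hit in the prefetch buffer.
        if (auto g = virtual_table.find(pc); g != virtual_table.end())
            for (auto& [bpc, tgt] : g->second) prefetch_buf[bpc] = tgt;
        return std::nullopt;
    }
};

int main() {
    PhantomBTB p;
    p.virtual_table[0x400] = {{0x404, 0x500}, {0x508, 0x600}};
    p.lookup(0x400);                 // miss: prefetches the group at 0x400
    return p.lookup(0x404) ? 0 : 1;  // now hits in the prefetch buffer
}
```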
Slide 33: Phantom-BTB Advantages
A "pay-as-you-go" approach:
– Practical design
– Increases the perceived BTB capacity
– Dynamic allocation of resources: branch metadata is allocated on demand
– On-the-fly adaptation to application demands: metadata generation and retrieval happen only on BTB misses, i.e., only if the application sees misses
Metadata survives in the L2 as long as there is sufficient capacity and demand.
Slide 34: Experimental Methodology
Flexus: cycle-accurate, full-system simulator.
Uniprocessor, OoO: 1K-entry conventional BTB; 64KB 2-way ICache/DCache; 4MB 16-way L2 cache.
Phantom-BTB: 64-entry prefetch buffer; 6-entry temporal groups; 4K-entry virtual table.
Commercial workloads.
Slide 35: PBTB vs. Conventional BTBs
[Figure: speedup, higher is better.] Performance within 1% of a 4K-entry BTB with 3.6x less storage.
Slide 36: Phantom-BTB with Larger Dedicated BTBs
[Figure: speedup, higher is better.] PBTB remains effective with larger dedicated BTBs.
Slide 37: Increase in L2 MPKI
[Figure: L2 MPKI, lower is better.] Marginal increase in L2 misses.
Slide 38: Increase in L2 Accesses
[Figure: L2 accesses per kilo-instruction, lower is better.] PBTB follows the application's demand for BTB capacity.
Slide 39: Summary
Predictor metadata is stored in the memory hierarchy. Benefits: reduces dedicated predictor resources and emulates large predictor tables for increased predictor accuracy.
Why now? Large on-chip caches, CMPs, and the need for large predictors.
PV advantages: predictor adaptation and metadata sharing.
Moving forward: virtualize other predictors; expose the predictor interface to the software level.