Published by Thomasine Reynolds. Modified over 9 years ago.
1
Two Ways to Exploit Multi-Megabyte Caches
AENAO Research Group @ Toronto
Kaveh Aasaraai, Ioana Burcea, Myrto Papadopoulou, Elham Safi, Jason Zebchuk, Andreas Moshovos
{aasaraai, ioana, myrto, elham, zebchuk, moshovos}@eecg.toronto.edu
2
EPFL, Jan. 2008, Aenao Group/Toronto
Future Caches: Just Larger?
[Diagram labels: CPUs, each with I$ and D$; interconnect; Main Memory; 10s – 100s of MB]
1. “Big Picture” management
2. Store metadata
3
Conventional Block-Centric Cache
- “Small” blocks
  - Optimizes bandwidth and performance
- Large L2/L3 caches especially
[Diagram labels: Fine-Grain View of Memory; L2 Cache]
Big picture lost
4
“Big Picture” View
- Region: 2^n-sized, aligned area of memory
- Patterns and behavior exposed
  - Spatial locality
- Exploit for performance/area/power
[Diagram labels: Coarse-Grain View of Memory; L2 Cache]
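The region definition on this slide can be illustrated with a few lines of Python. This is only a sketch: the function names are hypothetical, and `n` is the log2 of the region size in bytes (for example, `n = 10` for the 1KB regions used later in the talk).

```python
def region_base(addr: int, n: int) -> int:
    """Base address of the 2**n-byte aligned region containing addr."""
    return addr & ~((1 << n) - 1)


def region_number(addr: int, n: int) -> int:
    """Region identifier: the address with the n low-order bits dropped."""
    return addr >> n


def same_region(a: int, b: int, n: int) -> bool:
    """True when both addresses fall inside the same aligned region."""
    return region_number(a, n) == region_number(b, n)
```

Because regions are aligned and a power of two in size, membership reduces to a shift and a compare, which is what makes a region-level view cheap to track in hardware.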
5
Exploiting Coarse-Grain Patterns
- Many existing coarse-grain optimizations
- Add new structures to track coarse-grain information
[Diagram: CPU and L2 Cache, surrounded by Stealth Prefetching; Run-time Adaptive Cache Hierarchy Management via Reference Analysis; Destination-Set Prediction; Spatial Memory Streaming; Coarse-Grain Coherence Tracking; RegionScout; Circuit-Switched Coherence]
Hard to justify for a commercial design
Coarse-Grain Framework
- Embed coarse-grain information in tag array
- Support many different optimizations with less area overhead
Adaptable optimization FRAMEWORK
6
RegionTracker Solution
Manage blocks, but also track and manage regions
[Diagram labels: L1; L2 Cache; Tag Array; Data Array; Region Tracker; Block Requests; Data Blocks; Region Probes; Region Responses]
7
RegionTracker Summary
- Replace conventional tag array:
  - 4-core CMP with 8MB shared L2 cache
  - Within 1% of original performance
  - Up to 20% less tag area
  - Average 33% less energy consumption
- Optimization framework:
  - Stealth Prefetching: same performance, 36% less area
  - RegionScout: 2x more snoops avoided, no area overhead
8
Road Map
- Introduction
- Goals
- Coarse-Grain Cache Designs
- RegionTracker: A Tag Array Replacement
- RegionTracker: An Optimization Framework
- Conclusion
9
Goals
1. Conventional tag array functionality
  - Identify data block location and state
  - Leave data array unchanged
2. Optimization framework functionality
  - Is Region X cached?
  - Which blocks of Region X are cached? Where?
  - Evict or migrate Region X
  - Easy to assign properties to each region
10
Coarse-Grain Cache Designs
Large Block Size
- Increased BW, decreased hit-rates
[Diagram labels: Region X; Tag Array; Data Array]
11
Sector Cache
- Decreased hit-rates
[Diagram labels: Region X; Tag Array; Data Array]
12
Sector Pool Cache
- High associativity (2 - 4 times)
[Diagram labels: Region X; Tag Array; Data Array]
13
Decoupled Sector Cache
- Region information not exposed
- Region replacement requires scanning multiple entries
[Diagram labels: Region X; Tag Array; Data Array; Status Table]
14
Design Requirements
- Small block size (64B)
- Miss-rate does not increase
- Lookup associativity does not increase
- No additional access latency
  - (i.e., no scanning, no multiple block evictions)
- Does not increase latency, area, or energy
- Allows banking and interleaving
- Fit in conventional tag array “envelope”
15
RegionTracker: A Tag Array Replacement
- 3 SRAM arrays, combined smaller than tag array
[Diagram labels: L1; Data Array; Region Vector Array; Block Status Table; Evicted Region Buffer]
16
Basic Structures
[Diagram: Region Vector Array (RVA) entry: Region Tag, then V (1 bit) and way (4 bits) for each of block 0 … block 15. Block Status Table (BST) entry: status (3 bits) and a 2-bit field]
- Address: specific RVA set and BST set
- RVA entry: multiple, consecutive BST sets
- BST entry: one of four RVA sets
Ex: 8MB, 16-way set-associative cache, 64-byte blocks, 1KB region
17
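Common Case: Hit
[Diagram: Address fields, with bit boundaries 49, 21, 10, 6, 0: Region Tag | RVA Index | Region Offset | Block Offset. BST view of the same address, with bit boundaries 19, 6, 0: BST Index | Block Offset. The Region Vector Array (RVA) entry holds the Region Tag plus V (1 bit) and way (4 bits) for block 0 … block 15. The way from the RVA entry and the BST Index go to the Block Status Table (BST; status 3 bits, plus a 2-bit field) and to the Data Array]
Ex: 8MB, 16-way set-associative cache, 64-byte blocks, 1KB region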
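The address split in the hit diagram can be sketched as below. The field positions are an interpretation of the bit boundaries on the slide (49, 21, 10, 6, 0 for the RVA view and 19, 6, 0 for the BST view), and all names are hypothetical. The interpretation is at least consistent with the stated example: 8MB / 64B / 16 ways gives 8192 BST sets, i.e. a 13-bit BST index.

```python
from typing import NamedTuple

BLOCK_OFFSET_BITS = 6    # 64-byte blocks: bits 5..0
REGION_OFFSET_BITS = 4   # 1KB region / 64B = 16 blocks: bits 9..6
RVA_INDEX_BITS = 11      # assumed bits 20..10
BST_INDEX_BITS = 13      # assumed bits 18..6 (8MB / 64B / 16 ways = 8192 sets)


class Fields(NamedTuple):
    region_tag: int
    rva_index: int
    region_offset: int
    block_offset: int
    bst_index: int


def split_address(addr: int) -> Fields:
    """Split a physical address into the RVA-view and BST-view fields."""
    block_offset = addr & ((1 << BLOCK_OFFSET_BITS) - 1)
    region_offset = (addr >> 6) & ((1 << REGION_OFFSET_BITS) - 1)
    rva_index = (addr >> 10) & ((1 << RVA_INDEX_BITS) - 1)
    region_tag = addr >> 21
    bst_index = (addr >> 6) & ((1 << BST_INDEX_BITS) - 1)
    return Fields(region_tag, rva_index, region_offset, block_offset, bst_index)
```

Under this reading the BST index contains the region offset plus the low 9 bits of the RVA index. The 2 RVA index bits it omits are why a BST entry can belong to one of four RVA sets, and the 4 region-offset bits are why one RVA entry covers 16 consecutive BST sets.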
18
Worst Case (Rare): Region Miss
[Diagram: same address fields as in the hit case (bit boundaries 49, 21, 10, 6, 0 and 19, 6, 0). The Region Vector Array (RVA) lookup finds no matching Region Tag ("No Match!"). Block Status Table (BST) entry: status (3 bits), Ptr (2 bits). Other labels: Data Array; Evicted Region Buffer (ERB); Ptr]
19
Methodology
- Flexus simulator from CMU SimFlex group
  - Based on Simics full-system simulator
- 4-core CMP modeled after Piranha
  - Private 32KB, 4-way set-associative L1 caches
  - Shared 8MB, 16-way set-associative L2 cache
  - 64-byte blocks
- Miss-rates: functional simulation of 2 billion instructions per core
- Performance and energy: timing simulation using SMARTS sampling methodology
- Area and power: full custom implementation on 130nm commercial technology
- 9 commercial workloads:
  - WEB: SpecWEB on Apache and Zeus
  - OLTP: TPC-C on DB2 and Oracle
  - DSS: 5 TPC-H queries on DB2
[Diagram: four processors (P), each with D$ and I$, an interconnect, and the L2]
20
Miss-Rates vs. Area
- Sector Cache: 512KB sectors; SPC and RT: 1KB regions
- Trade-offs comparable to conventional cache
[Plot: relative miss-rate vs. relative tag array area; labelled points: Sector Cache (0.25, 1.26), 14-way, 15-way, 52-way, 48-way]
21
Performance & Energy
- 12-way set-associative RegionTracker: 20% less area
- Error bars: 95% confidence interval
- Performance within 1%, with 33% tag energy reduction
[Plots: normalized execution time (Performance); reduction in tag energy (Energy)]
22
Road Map
- Introduction
- Goals
- Coarse-Grain Cache Designs
- RegionTracker: A Tag Array Replacement
- RegionTracker: An Optimization Framework
- Conclusion
23
RegionTracker: An Optimization Framework
[Diagram labels: L1; RVA; ERB; BST; Data Array]
- Stealth Prefetching: average 20% performance improvement; drop-in RegionTracker for 36% less area overhead
- RegionScout: in-depth analysis
24
Snoop Coherence: Common Case
[Diagram labels: CPUs; Main Memory; Read x (miss); Read x+1; Read x+2; Read x+n]
Many snoops are to non-shared regions
25
RegionScout
Eliminate broadcasts for non-shared regions
[Diagram labels: CPUs; Main Memory; Read x; Region Miss (at each remote node); Global Region Miss; Non-Shared Regions; Locally Cached Regions]
26
RegionTracker Implementation
- Minimal overhead to support RegionScout optimization
- Still uses less area than conventional tag array
- Non-Shared Regions: add 1 bit to each RVA entry
- Locally Cached Regions: already provided by RVA
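The filtering idea from the two RegionScout slides can be sketched as a toy model. This is a simplified illustration, not the authors' design: each node keeps a dictionary standing in for its RVA, mapping a region number to the single "non-shared" bit the slide mentions. Evictions, writes, and actual data transfer are ignored, and all names are hypothetical.

```python
REGION_BITS = 10  # 1KB regions, as in the talk's examples


class Node:
    def __init__(self):
        self.regions = {}    # region number -> non-shared bit (stand-in for RVA)
        self.broadcasts = 0  # snoop broadcasts this node has issued

    def snoop(self, region: int) -> bool:
        """Handle a remote request: report whether the region is cached here."""
        if region in self.regions:
            # Another node now uses this region, so it is no longer private.
            self.regions[region] = False
            return True
        return False

    def read_miss(self, addr: int, others: list) -> bool:
        """Service a block miss; return True if a broadcast was needed."""
        region = addr >> REGION_BITS
        if self.regions.get(region, False):
            return False  # known non-shared region: skip the broadcast
        self.broadcasts += 1
        responses = [node.snoop(region) for node in others]
        # A global region miss (nobody caches it) marks the region non-shared.
        self.regions[region] = not any(responses)
        return True
```

In this model the first miss to a region pays for a broadcast, and later misses to the same region are filtered until another node touches it.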
27
RegionTracker + RegionScout
- 4 processors, 512KB L2 caches
- 1KB regions
[Plot: reduction in snoop broadcasts; series label: BlockScout (4KB)]
Avoid 41% of snoop broadcasts, no area overhead compared to conventional tag array
28
Result Summary
- Replace conventional tag array:
  - 20% less tag area
  - 33% less tag energy
  - Within 1% of original performance
- Coarse-grain optimization framework:
  - 36% reduction in area overhead for Stealth Prefetching
  - Filter 41% of snoop broadcasts with no area overhead compared to conventional cache
29
Predictor Virtualization
Ioana Burcea
Joint work with Stephen Somogyi, Babak Falsafi
30
Predictor Virtualization
[Diagram labels: CPUs, each with L1-D and L1-I; Interconnect; L2; Main Memory; Optimization Engines: Predictors]
31
Motivating Trends
- Dedicating resources to predictors hard to justify:
  - Chip multiprocessors
    - Space dedicated to predictors × #processors
  - Larger predictor tables
    - Increased performance
- Memory hierarchies offer the opportunity
  - Increased capacity
  - How many apps really use the space?
Use conventional memory hierarchies to store predictor information
32
PV Architecture contd.
[Diagram labels: Optimization Engine; Predictor Table; request; prediction]
33
PV Architecture contd.
[Diagram labels: Optimization Engine; Predictor Virtualization; request; prediction]
34
PV Architecture contd.
[Diagram labels: Optimization Engine; request; prediction; index; PVProxy; PVStart; PVCache; MSHR; L2; Main Memory; PVTable]
On the backside of the L1
35
To Virtualize Or Not to Virtualize?
1. Re-use
2. Predictor info prefetching
[Diagram labels: CPU; I$; D$; interconnect; L2/L3; Main Memory; Common Case; Infrequent]
36
To Virtualize or Not?
- Challenge
  - Hit in the PVCache most of the time
- Will not work for all predictors out of the box
- Reuse is necessary
  - Intrinsic
    - Easy to virtualize
  - Non-intrinsic
    - Must be engineered
- More so if the predictor needs to be fast to start with
37
Will There Be Reuse?
- Intrinsic:
  - Multiple predictions per entry
  - We’ll see an example
- Can be engineered
  - Group temporally correlated entries together: cache block
[Diagram labels: CPU; I$; D$; interconnect; L2/L3; Main Memory]
38
Spatial Memory Streaming
- Footprint:
  - Blocks accessed per memory region
- Predict next time the footprint will be the same
  - Handle: PC + offset within region
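The footprint and handle ideas can be made concrete with a small model. This is a sketch under stated assumptions, not the SMS hardware: regions are taken to be 32 blocks of 64 bytes (the spatial group size given later in the talk), the end of a region's generation is signalled by an explicit call, and the class and method names are hypothetical.

```python
BLOCK_BITS = 6          # 64-byte blocks
BLOCKS_PER_REGION = 32  # assumed: 32 blocks per spatial group
REGION_BITS = BLOCK_BITS + 5


class SpatialPredictor:
    def __init__(self):
        self.patterns = {}  # handle (pc, trigger offset) -> footprint bit vector
        self.active = {}    # region number -> (handle, footprint so far)

    def access(self, pc: int, addr: int) -> list:
        """Record an access; on a trigger access, return addresses to prefetch."""
        region = addr >> REGION_BITS
        offset = (addr >> BLOCK_BITS) % BLOCKS_PER_REGION
        if region in self.active:
            handle, footprint = self.active[region]
            self.active[region] = (handle, footprint | (1 << offset))
            return []
        # First access to the region: it is the trigger and defines the handle.
        handle = (pc, offset)
        self.active[region] = (handle, 1 << offset)
        predicted = self.patterns.get(handle, 0)
        base = region << REGION_BITS
        return [base + (b << BLOCK_BITS)
                for b in range(BLOCKS_PER_REGION)
                if (predicted >> b) & 1 and b != offset]

    def end_generation(self, region: int) -> None:
        """Store the observed footprint under the handle that started it."""
        handle, footprint = self.active.pop(region)
        self.patterns[handle] = footprint
```

The footprint is a 32-bit vector with one bit per block, so a single table entry yields several predictions at once.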
39
Spatial Generations
40
Virtualizing SMS
[Diagram labels: Detector; Predictor; patterns; prefetches; trigger access; Virtualize]
41
Virtualizing SMS
[Diagram: Virtual Table (1K × 11); PVCache (8 × 11); cache-block layout: tag, pattern, tag, pattern, ..., unused, with bit offsets 0, 11, 43, 54, 85]
42
Packing Entries in One Cache Block
- Index: PC + offset within spatial group
  - PC → 16 bits
  - 32 blocks in a spatial group → 5-bit offset → 32-bit spatial pattern
- Pattern table: 1K sets
  - 10 bits to index the table → 11-bit tag
- Cache block: 64 bytes
  - 11 entries per cache block → pattern table: 1K sets, 11-way set-associative
[Diagram: 21-bit index; cache-block layout: tag, pattern, tag, pattern, ..., unused, with bit offsets 0, 11, 43, 54, 85]
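The packing arithmetic can be checked directly: an 11-bit tag plus a 32-bit pattern is 43 bits, and 11 such entries take 473 of the 512 bits in a 64-byte block, leaving 39 unused. The sketch below packs and unpacks one block. The bit ordering (entry 0 in the least significant bits, tag below pattern, little-endian bytes) is an assumption chosen to match the bit offsets 0, 11, 43, 54, 85 in the diagram, and the function names are hypothetical.

```python
TAG_BITS = 11
PATTERN_BITS = 32
BLOCK_BYTES = 64
ENTRY_BITS = TAG_BITS + PATTERN_BITS                    # 43
ENTRIES_PER_BLOCK = (BLOCK_BYTES * 8) // ENTRY_BITS     # 11
UNUSED_BITS = BLOCK_BYTES * 8 - ENTRIES_PER_BLOCK * ENTRY_BITS  # 39


def pack_block(entries: list) -> bytes:
    """Pack up to 11 (tag, pattern) pairs into one 64-byte cache block."""
    if len(entries) > ENTRIES_PER_BLOCK:
        raise ValueError("too many entries for one cache block")
    word = 0
    for i, (tag, pattern) in enumerate(entries):
        if not (0 <= tag < 1 << TAG_BITS and 0 <= pattern < 1 << PATTERN_BITS):
            raise ValueError("tag or pattern out of range")
        # Tag occupies the low 11 bits of the entry, pattern the next 32.
        word |= (tag | (pattern << TAG_BITS)) << (i * ENTRY_BITS)
    return word.to_bytes(BLOCK_BYTES, "little")


def unpack_block(block: bytes) -> list:
    """Recover all 11 (tag, pattern) slots from a packed cache block."""
    word = int.from_bytes(block, "little")
    entries = []
    for i in range(ENTRIES_PER_BLOCK):
        entry = (word >> (i * ENTRY_BITS)) & ((1 << ENTRY_BITS) - 1)
        entries.append((entry & ((1 << TAG_BITS) - 1), entry >> TAG_BITS))
    return entries
```

Since one memory block holds a whole set, a single L2 access brings in all 11 ways of a pattern-table set.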
43
Memory Address Calculation
[Diagram: PC (16 bits) and block offset (5 bits) form the index; 10 bits of the index, with 000000 appended, are added to the PV Start Address to give the Memory Address]
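The address calculation amounts to selecting a set and scaling by the 64-byte block size. A minimal sketch follows. The slide does not say which bits of the 21-bit index select the set or how PC and offset are ordered within the index, so the choices below (PC in the high bits, low 10 bits as set index, high 11 bits as tag) are assumptions, as are the function names.

```python
PC_BITS = 16
OFFSET_BITS = 5
SET_BITS = 10     # 1K sets in the virtualized pattern table
BLOCK_BITS = 6    # 64-byte cache blocks, hence the appended 000000


def pv_index(pc: int, offset: int) -> int:
    """21-bit predictor index: assumed layout is PC bits above the offset."""
    return ((pc & ((1 << PC_BITS) - 1)) << OFFSET_BITS) | (offset & 0x1F)


def pv_lookup(pv_start: int, pc: int, offset: int) -> tuple:
    """Return (memory address of the set's cache block, 11-bit tag)."""
    index = pv_index(pc, offset)
    set_index = index & ((1 << SET_BITS) - 1)  # assumed: low 10 bits
    tag = index >> SET_BITS                    # remaining 11 bits
    return pv_start + (set_index << BLOCK_BITS), tag
```

With 1K sets of one 64-byte block each, the table occupies 64KB of physical address space starting at the PV start address.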
44
Simulation Infrastructure
- SimFlex: CMU Impetus
  - Full-system simulator based on Simics
- Base processor configuration
  - 8-wide OoO
  - 256-entry ROB / 64-entry LSQ
  - L1D/L1I 64KB 4-way set-associative
  - UL2 8MB 16-way set-associative
- Commercial workloads
  - TPC-C: DB2 and Oracle
  - TPC-H: Query 1, Query 2, Query 16, Query 17
  - Web: Apache and Zeus
45
SMS – Performance Potential
[Plot: SMS performance potential]
46
Virtualized Spatial Memory Streaming
- Original prefetcher: cost 60KB
- Virtualized prefetcher: cost <1KB
Nearly identical performance
[Plot: performance of the original and virtualized prefetchers]
47
Impact of Virtualization on L2 Misses
48
Impact of Virtualization on L2 Requests
49
Coarse-Grain Tracking
Jason Zebchuk