Two Ways to Exploit Multi-Megabyte Caches
AENAO Research Toronto
Kaveh Aasaraai, Ioana Burcea, Myrto Papadopoulou, Elham Safi, Jason Zebchuk, Andreas Moshovos
{aasaraai, ioana, myrto, elham, zebchuk,
EPFL, Jan Aenao Group/Toronto
Future Caches: Just Larger?
[Figure: CPUs, each with I$ and D$, connected through an interconnect to main memory; cache capacity of 10s – 100s of MB]
1. “Big Picture” Management
2. Store Metadata
Conventional Block-Centric Cache
- “Small” Blocks
  - Optimizes Bandwidth and Performance
- Large L2/L3 caches especially
[Figure: Fine-Grain View of Memory in the L2 Cache]
Big Picture Lost
“Big Picture” View
- Region: 2^n-sized, aligned area of memory
- Patterns and behavior exposed
  - Spatial locality
- Exploit for performance/area/power
[Figure: Coarse-Grain View of Memory in the L2 Cache]
Exploiting Coarse-Grain Patterns
- Many existing coarse-grain optimizations
- Add new structures to track coarse-grain information
[Figure: CPU and L2 Cache with existing optimizations: Stealth Prefetching, Run-time Adaptive Cache Hierarchy Management via Reference Analysis, Destination-Set Prediction, Spatial Memory Streaming, Coarse-Grain Coherence Tracking, RegionScout, Circuit-Switched Coherence]
Hard to justify for a commercial design
Coarse-Grain Framework
- Embed coarse-grain information in tag array
- Support many different optimizations with less area overhead
Adaptable optimization FRAMEWORK
RegionTracker Solution
Manage blocks, but also track and manage regions
[Figure: L2 Cache in which the Region Tracker takes the place of the Tag Array next to the Data Array; the L1 sends Block Requests and receives Data Blocks; the Region Tracker also handles Region Probes and Region Responses]
RegionTracker Summary
- Replace conventional tag array:
  - 4-core CMP with 8MB shared L2 cache
  - Within 1% of original performance
  - Up to 20% less tag area
  - Average 33% less energy consumption
- Optimization Framework:
  - Stealth Prefetching: same performance, 36% less area
  - RegionScout: 2x more snoops avoided, no area overhead
Road Map
- Introduction
- Goals
- Coarse-Grain Cache Designs
- RegionTracker: A Tag Array Replacement
- RegionTracker: An Optimization Framework
- Conclusion
Goals
1. Conventional Tag Array Functionality
  - Identify data block location and state
  - Leave data array unchanged
2. Optimization Framework Functionality
  - Is Region X cached?
  - Which blocks of Region X are cached? Where?
  - Evict or migrate Region X
  - Easy to assign properties to each Region
Coarse-Grain Cache Designs
Large Block Size
- Increased BW, decreased hit-rates
[Figure: Region X in the Tag Array and Data Array]
Sector Cache
- Decreased hit-rates
[Figure: Region X in the Tag Array and Data Array]
Sector Pool Cache
- High Associativity (2 - 4 times)
[Figure: Region X in the Tag Array and Data Array]
Decoupled Sector Cache
- Region information not exposed
- Region replacement requires scanning multiple entries
[Figure: Region X in the Tag Array, Data Array, and Status Table]
Design Requirements
- Small block size (64B)
- Miss-rate does not increase
- Lookup associativity does not increase
- No additional access latency
  - (i.e., no scanning, no multiple block evictions)
- Does not increase latency, area, or energy
- Allows banking and interleaving
- Fit in conventional tag array “envelope”
RegionTracker: A Tag Array Replacement
- 3 SRAM arrays, combined smaller than tag array
  - Region Vector Array
  - Block Status Table
  - Evicted Region Buffer
[Figure: L1 and Data Array, with the three arrays in place of the tag array]
Basic Structures
[Figure: Region Vector Array (RVA) entry: Region Tag, then block 0 … block 15, each with V (1 bit) and way (4 bits). Block Status Table (BST) entry: status (3 bits) plus a 2-bit field]
- Address: specific RVA set and BST set
- RVA entry: multiple, consecutive BST sets
- BST entry: one of four RVA sets
Ex: 8MB, 16-way set-associative cache, 64-byte blocks, 1KB region
Common Case: Hit
Address: Region Tag | RVA Index | Region Offset | Block Offset
[Figure: the address selects an entry in the Region Vector Array (RVA: Region Tag, block 0 … block 15, each with V and way); the matching way plus the BST Index select the Block Status Table (BST) entry and the location in the Data Array; bit markers 19, 6, and 0 appear on the address]
Ex: 8MB, 16-way set-associative cache, 64-byte blocks, 1KB region
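
The address split for the example configuration can be sketched as follows. This is only an illustration under stated assumptions: the field order follows the slide (Region Tag, RVA Index, Region Offset, Block Offset), the BST is assumed to be indexed like a conventional tag array, and the RVA index width of 11 bits is an assumption chosen so that each BST set can map to one of four RVA sets. The names `Fields` and `split_address` are hypothetical.

```python
from typing import NamedTuple

# Parameters taken from the slide's example.
CACHE_BYTES = 8 * 1024 * 1024
WAYS = 16
BLOCK_BYTES = 64
REGION_BYTES = 1024

BLOCK_BITS = BLOCK_BYTES.bit_length() - 1            # 6
REGION_BITS = REGION_BYTES.bit_length() - 1          # 10
BLOCKS_PER_REGION = REGION_BYTES // BLOCK_BYTES      # 16
BST_SETS = CACHE_BYTES // (WAYS * BLOCK_BYTES)       # 8192 sets, 13 index bits

# Assumption (not stated on the slide): 11 RVA index bits, i.e. two more
# bits than the part of the BST index that lies above the region offset.
RVA_INDEX_BITS = 11


class Fields(NamedTuple):
    region_tag: int
    rva_index: int
    region_offset: int
    block_offset: int
    bst_index: int


def split_address(addr: int) -> Fields:
    """Split a physical address into the fields named on the slide."""
    block_offset = addr & (BLOCK_BYTES - 1)
    region_offset = (addr >> BLOCK_BITS) & (BLOCKS_PER_REGION - 1)
    rva_index = (addr >> REGION_BITS) & ((1 << RVA_INDEX_BITS) - 1)
    region_tag = addr >> (REGION_BITS + RVA_INDEX_BITS)
    # The BST index overlaps the region offset and the low RVA index bits.
    bst_index = (addr >> BLOCK_BITS) & (BST_SETS - 1)
    return Fields(region_tag, rva_index, region_offset, block_offset, bst_index)


fields = split_address(0x12345678)
```

Under these assumptions the BST index occupies address bits 6 through 18, and the two RVA index bits that the BST index does not cover are what a 2-bit pointer in each BST entry would have to supply.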
Worst Case (Rare): Region Miss
Address: Region Tag | RVA Index | Region Offset | Block Offset
[Figure: the Region Vector Array (RVA) lookup finds no match; the Block Status Table (BST) entry holds status (3 bits) and Ptr (2 bits); the Evicted Region Buffer (ERB) and the Data Array are also shown]
Methodology
- Flexus simulator from CMU SimFlex group
  - Based on Simics full-system simulator
- 4-core CMP modeled after Piranha
  - Private 32KB, 4-way set-associative L1 caches
  - Shared 8MB, 16-way set-associative L2 cache
  - 64-byte blocks
- Miss-rates: Functional simulation of 2 billion instructions per core
- Performance and Energy: Timing simulation using SMARTS sampling methodology
- Area and Power: Full custom implementation on 130nm commercial technology
- 9 commercial workloads:
  - WEB: SpecWEB on Apache and Zeus
  - OLTP: TPC-C on DB2 and Oracle
  - DSS: 5 TPC-H queries on DB2
[Figure: four processors (P), each with D$ and I$, connected through an Interconnect to the L2]
Miss-Rates vs. Area
- Sector Cache: 512KB sectors, SPC and RT: 1KB regions
- Trade-offs comparable to conventional cache
[Figure: Relative Miss-Rate versus Relative Tag Array Area; labeled points include Sector Cache (0.25, 1.26), 14-way, 15-way, 48-way, and 52-way]
Performance & Energy
- 12-way set-associative RegionTracker: 20% less area
- Error bars: 95% confidence interval
- Performance within 1%, with 33% tag energy reduction
[Figure: two panels, Performance (Normalized Execution Time) and Energy (Reduction in Tag Energy)]
Road Map
- Introduction
- Goals
- Coarse-Grain Cache Designs
- RegionTracker: A Tag Array Replacement
- RegionTracker: An Optimization Framework
- Conclusion
RegionTracker: An Optimization Framework
[Figure: L1, RVA, ERB, BST, and Data Array]
- Stealth Prefetching:
  - Average 20% performance improvement
  - Drop-in RegionTracker for 36% less area overhead
- RegionScout: In-depth analysis
Snoop Coherence: Common Case
[Figure: CPUs and Main Memory; one CPU misses on Read x, followed by Read x+1, Read x+2, …, Read x+n]
Many snoops are to non-shared regions
RegionScout
Eliminate broadcasts for non-shared regions
[Figure: CPUs and Main Memory; each node keeps Non-Shared Regions and Locally Cached Regions; Read x produces a Region Miss at each other node, which together form a Global Region Miss]
RegionTracker Implementation
- Minimal overhead to support RegionScout optimization
- Still uses less area than conventional tag array
- Non-Shared Regions: add 1 bit to each RVA entry
- Locally Cached Regions: already provided by RVA
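
A toy model of the broadcast filtering idea is sketched below. It is a deliberately simplified illustration, not the RegionScout or RegionTracker implementation: the class `Node`, the function `read_miss`, and the use of Python sets in place of the RVA and its extra bit are all assumptions, and evictions are not modeled.

```python
REGION_BYTES = 1024  # 1KB regions, as in the slides' examples


class Node:
    """Hypothetical per-core state for the sketch."""

    def __init__(self, name):
        self.name = name
        self.cached_regions = set()  # regions with at least one local block
        self.non_shared = set()      # regions known to be cached nowhere else

    def snoop(self, region):
        # Another node is about to cache this region, so it can no longer
        # be treated as non-shared here.
        self.non_shared.discard(region)
        return region in self.cached_regions


def read_miss(requester, others, addr):
    """Handle a read miss; return True if a snoop broadcast was needed."""
    region = addr // REGION_BYTES
    if region in requester.non_shared:
        requester.cached_regions.add(region)
        return False  # broadcast avoided
    # Every other node must see the snoop, so build the full list first.
    responses = [node.snoop(region) for node in others]
    if not any(responses):  # global region miss
        requester.non_shared.add(region)
    requester.cached_regions.add(region)
    return True


a, b = Node("a"), Node("b")
first = read_miss(a, [b], 0)      # broadcast, global region miss
second = read_miss(a, [b], 64)    # same region, broadcast avoided
third = read_miss(b, [a], 128)    # b now shares the region
fourth = read_miss(a, [b], 192)   # a must broadcast again
```

In this sketch, `cached_regions` plays the role of the information the RVA already provides, and membership in `non_shared` corresponds to the single extra bit per RVA entry.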
RegionTracker + RegionScout
- 4 processors, 512KB L2 Caches
- 1KB regions
[Figure: Reduction in Snoop Broadcasts; the legend includes BlockScout (4KB)]
Avoid 41% of Snoop Broadcasts, no area overhead compared to conventional tag array
Result Summary
- Replace Conventional Tag Array:
  - 20% less tag area
  - 33% less tag energy
  - Within 1% of original performance
- Coarse-Grain Optimization Framework:
  - 36% reduction in area overhead for Stealth Prefetching
  - Filter 41% of snoop broadcasts with no area overhead compared to conventional cache
Predictor Virtualization
Ioana Burcea
Joint work with Stephen Somogyi and Babak Falsafi
Predictor Virtualization
[Figure: CPUs, each with L1-D and L1-I, connected through an Interconnect to the L2 and Main Memory]
Optimization Engines: Predictors
Motivating Trends
- Dedicating resources to predictors hard to justify:
  - Chip multiprocessors
    - Space dedicated to predictors × number of processors
  - Larger predictor tables
    - Increased performance
- Memory hierarchies offer the opportunity
  - Increased capacity
  - How many apps really use the space?
Use conventional memory hierarchies to store predictor information
PV Architecture contd.
[Figure: the Optimization Engine sends a request to the Predictor Table and receives a prediction]
PV Architecture contd.
[Figure: the Optimization Engine sends a request to Predictor Virtualization and receives a prediction]
PV Architecture contd.
[Figure: the Optimization Engine exchanges requests and predictions with the PVProxy (PVStart + index, PVCache, MSHR); the PVTable is reached through the L2 and Main Memory]
On the backside of the L1
To Virtualize Or Not to Virtualize?
1. Re-Use
2. Predictor Info Prefetching
[Figure: CPU with I$ and D$, interconnect, L2/L3, and Main Memory; accesses near the CPU are marked Common Case, accesses to L2/L3 and Main Memory are marked Infrequent]
To Virtualize or Not?
- Challenge
  - Hit in the PVCache most of the time
- Will not work for all predictors out of the box
- Reuse is necessary
  - Intrinsic
    - Easy to virtualize
  - Non-intrinsic
    - Must be engineered
- More so if the predictor needs to be fast to start with
Will There Be Reuse?
- Intrinsic:
  - Multiple predictions per entry
  - We’ll see an example
- Can be engineered
  - Group temporally correlated entries together: Cache block
[Figure: CPU with I$ and D$, interconnect, L2/L3, and Main Memory]
Spatial Memory Streaming
- Footprint:
  - Blocks accessed per memory region
- Predict next time the footprint will be the same
  - Handle: PC + offset within region
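
The footprint idea can be sketched as a small software model. This is a minimal illustration under assumptions, not the SMS hardware: the class `FootprintPredictor` and its methods are hypothetical, the 64-byte block and 32-block region sizes are borrowed from other slides in this deck, and the caller is assumed to decide when a region's generation ends.

```python
BLOCK_BYTES = 64
BLOCKS_PER_REGION = 32


class FootprintPredictor:
    """Toy model: learn which blocks of a region get touched, keyed by
    the (PC, offset within region) of the first access."""

    def __init__(self):
        self.patterns = {}  # (pc, offset) -> footprint bit vector
        self.active = {}    # region -> [key, footprint recorded so far]

    def access(self, pc, addr):
        """Record an access; return block addresses to prefetch."""
        block = addr // BLOCK_BYTES
        region, offset = divmod(block, BLOCKS_PER_REGION)
        if region in self.active:
            self.active[region][1] |= 1 << offset
            return []
        # First access to the region: this is the trigger access.
        key = (pc, offset)
        self.active[region] = [key, 1 << offset]
        pattern = self.patterns.get(key, 0)
        base = region * BLOCKS_PER_REGION
        return [
            (base + i) * BLOCK_BYTES
            for i in range(BLOCKS_PER_REGION)
            if ((pattern >> i) & 1) and i != offset
        ]

    def end_generation(self, region):
        """Store the recorded footprint under the trigger's handle."""
        key, footprint = self.active.pop(region)
        self.patterns[key] = footprint


p = FootprintPredictor()
p.access(0x400, 0)      # trigger access in region 0, offset 0
p.access(0x404, 128)    # touches offset 2
p.access(0x408, 320)    # touches offset 5
p.end_generation(0)
prefetches = p.access(0x400, 2048)  # same PC and offset, region 1
```

After one generation in region 0, the same PC touching the same offset in region 1 yields prefetches for the blocks at offsets 2 and 5 of region 1.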
Spatial Generations
Virtualizing SMS
[Figure: Detector and Predictor; labels: patterns, trigger access, prefetches; the Predictor is marked Virtualize]
Virtualizing SMS
[Figure: Virtual Table labeled 1K by 11; PVCache labeled 8 by 11; cache block layout: tag, pattern, tag, pattern, …, unused]
Packing Entries in One Cache Block
- Index: PC + offset within spatial group
  - PC → 16 bits
  - 32 blocks in a spatial group → 5-bit offset → 32-bit spatial pattern
  - 21-bit index
- Pattern table: 1K sets
  - 10 bits to index the table → 11-bit tag
- Cache block: 64 bytes
  - 11 entries per cache block → Pattern table 1K sets, 11-way set-associative
[Figure: cache block layout: tag, pattern, tag, pattern, …, unused]
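
The arithmetic on this slide can be checked with a short sketch. The bit widths come from the slide; the exact bit order inside the block (entries packed from the least significant bit, tag above pattern, unused bits at the top) and the helper names `pack` and `unpack` are assumptions made for illustration.

```python
PC_BITS = 16
OFFSET_BITS = 5
PATTERN_BITS = 32
SETS = 1024
BLOCK_BYTES = 64

INDEX_BITS = PC_BITS + OFFSET_BITS                  # 21
SET_BITS = SETS.bit_length() - 1                    # 10
TAG_BITS = INDEX_BITS - SET_BITS                    # 11
ENTRY_BITS = TAG_BITS + PATTERN_BITS                # 43
ENTRIES_PER_BLOCK = BLOCK_BYTES * 8 // ENTRY_BITS   # 11
UNUSED_BITS = BLOCK_BYTES * 8 - ENTRIES_PER_BLOCK * ENTRY_BITS  # 39

# Bytes needed to hold every entry of the 1K-set, 11-way table.
TABLE_BYTES = SETS * ENTRIES_PER_BLOCK * ENTRY_BITS // 8


def pack(entries):
    """Pack up to 11 (tag, pattern) pairs into one 64-byte block."""
    assert len(entries) <= ENTRIES_PER_BLOCK
    word = 0
    for i, (tag, pattern) in enumerate(entries):
        assert tag < (1 << TAG_BITS) and pattern < (1 << PATTERN_BITS)
        word |= ((tag << PATTERN_BITS) | pattern) << (i * ENTRY_BITS)
    return word.to_bytes(BLOCK_BYTES, "little")


def unpack(block):
    """Recover the 11 (tag, pattern) pairs stored in a block."""
    word = int.from_bytes(block, "little")
    entry_mask = (1 << ENTRY_BITS) - 1
    pattern_mask = (1 << PATTERN_BITS) - 1
    entries = []
    for i in range(ENTRIES_PER_BLOCK):
        entry = (word >> (i * ENTRY_BITS)) & entry_mask
        entries.append((entry >> PATTERN_BITS, entry & pattern_mask))
    return entries


block = pack([(0x7FF, 0xFFFFFFFF), (0x001, 0x00000025)])
```

With these widths the full table comes to 60,544 bytes, which appears consistent with the roughly 60KB cost quoted for the original prefetcher later in the deck.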
Memory Address Calculation
[Figure: the PC and the 5-bit block offset form the index; 10 bits of it are combined with the PV Start Address to form the Memory Address]
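
One way the address calculation could work is sketched below. The slide does not say which 10 bits of the 21-bit index select the set, so this sketch assumes the low 10 bits and treats the remaining 11 bits as the tag; the function name `pv_address` and the example start address are hypothetical.

```python
PC_BITS = 16
OFFSET_BITS = 5
SET_BITS = 10
BLOCK_BYTES = 64


def pv_address(pv_start, pc, block_offset):
    """Return (memory address of the set's cache block, tag).

    Assumption: the set index is the low 10 bits of the 21-bit index
    formed from the 16-bit PC and the 5-bit block offset.
    """
    pc_part = pc & ((1 << PC_BITS) - 1)
    offset_part = block_offset & ((1 << OFFSET_BITS) - 1)
    index = (pc_part << OFFSET_BITS) | offset_part
    set_index = index & ((1 << SET_BITS) - 1)
    tag = index >> SET_BITS
    # Each set occupies one 64-byte cache block in the virtualized table.
    return pv_start + set_index * BLOCK_BYTES, tag


address, tag = pv_address(0x10000000, 0x1234, 3)
```

Because each set is exactly one cache block, a single memory request brings in all 11 ways of the set, and the tag comparison can then be done locally.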
Simulation Infrastructure
- SimFlex: CMU Impetus
  - Full-system simulator based on Simics
- Base processor configuration
  - 8-wide OoO
  - 256-entry ROB / 64-entry LSQ
  - L1D/L1I 64KB 4-way set-associative
  - UL2 8MB 16-way set-associative
- Commercial workloads
  - TPC-C: DB2 and Oracle
  - TPC-H: Query 1, Query 2, Query 16, Query 17
  - Web: Apache and Zeus
SMS – Performance Potential
Virtualized Spatial Memory Streaming
- Original Prefetcher: Cost: 60KB
- Virtualized Prefetcher: Cost: <1KB
Nearly Identical Performance
Impact of Virtualization on L2 Misses
Impact of Virtualization on L2 Requests
Coarse-Grain Tracking Jason Zebchuk