A Framework for Coarse-Grain Optimizations in the On-Chip Memory Hierarchy
Jason Zebchuk, Elham Safi, and Andreas Moshovos
AENAO Research Group
Department of Electrical and Computer Engineering
University of Toronto

Conventional Block-Centric Cache
- "Small" blocks
  - Optimizes bandwidth and performance
- Large L2/L3 caches especially
[Figure: fine-grain view of memory through the L2 cache; the big picture is lost]

"Big Picture" View
- Region: 2^n-sized, aligned area of memory
- Patterns and behavior exposed
  - Spatial locality
- Exploit for performance/area/power
[Figure: coarse-grain view of memory through the L2 cache]

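The region definition above can be sketched with a few bit operations. This is a minimal illustration, not code from the talk: the function names are hypothetical, and `n` is the log2 of the region size in bytes (for example, `n = 10` for a 1KB region).

```python
def region_base(addr: int, n: int) -> int:
    """Base address of the aligned 2**n-byte region that contains addr."""
    return addr & ~((1 << n) - 1)


def region_number(addr: int, n: int) -> int:
    """Index of the region that contains addr (address with the low n bits dropped)."""
    return addr >> n


def same_region(a: int, b: int, n: int) -> bool:
    """True if both addresses fall inside the same aligned 2**n-byte region."""
    return region_number(a, n) == region_number(b, n)
```

Because regions are aligned powers of two, membership tests reduce to a shift and a compare, which is what makes region-level tracking cheap in hardware.
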
Exploiting Coarse-Grain Patterns
- Many existing coarse-grain optimizations
- Add new structures to track coarse-grain information
- Hard to justify for a commercial design
[Figure: CPU and L2 cache with existing optimizations (Stealth Prefetching, Flexible Snooping, Destination-Set Prediction, Spatial Memory Streaming, Coarse-Grain Coherence Tracking, RegionScout, Circuit-Switched Coherence) and the Coarse-Grain Framework]

Exploiting Coarse-Grain Patterns
- Embed coarse-grain information in tag array
- Support many different optimizations with less area overhead
- Adaptable optimization FRAMEWORK
[Figure: CPU and L2 cache with the Coarse-Grain Framework]

RegionTracker Solution
- Manage blocks, but also track and manage regions
[Figure: L2 cache with tag array and data array; the L1 issues block requests and receives data blocks; RegionTracker additionally handles region probes and region responses]

RegionTracker Summary
- Replace conventional tag array:
  - 4-core CMP with 8MB shared L2 cache
  - Within 1% of original performance
  - Up to 20% less tag area
  - Average 33% less energy consumption
- Optimization framework:
  - Stealth Prefetching: same performance, 36% less area
  - RegionScout: 2x more snoops avoided, no area overhead

Road Map
- Introduction
- Goals
- Coarse-Grain Cache Designs
- RegionTracker: A Tag Array Replacement
- RegionTracker: An Optimization Framework
- Conclusion

Goals
1. Conventional tag array functionality
  - Identify data block location and state
  - Leave data array unchanged
2. Optimization framework functionality
  - Is Region X cached?
  - Which blocks of Region X are cached? Where?
  - Evict or migrate Region X

Coarse-Grain Cache Designs: Large Block Size
- Increased bandwidth, decreased hit-rates
[Figure: Region X in the tag array and data array]

Sector Cache
- Decreased hit-rates
[Figure: Region X in the tag array and data array]

Sector Pool Cache
- High associativity (2 to 4 times)
[Figure: Region X in the tag array and data array]

Decoupled Sector Cache
- Region information not exposed
- Region replacement requires scanning multiple entries
[Figure: Region X in the tag array, data array, and status table]

Design Requirements
- Small block size (64B)
- Miss-rate does not increase
- Lookup associativity does not increase
- No additional access latency
  - (i.e., no scanning, no multiple block evictions)
- Does not increase latency, area, or energy
- Allows banking and interleaving
- Fit in conventional tag array "envelope"

RegionTracker: A Tag Array Replacement
- 3 SRAM arrays, combined smaller than tag array
  - Region Vector Array
  - Block Status Table
  - Evicted Region Buffer
[Figure: L1 and data array, with the three arrays in place of the tag array]

Common Case: Hit
- Address fields: Region Tag | RVA Index | Region Offset | Block Offset
- Example: 8MB, 16-way set-associative cache, 64-byte blocks, 1KB region
[Figure: the address indexes the Region Vector Array (RVA); each RVA entry holds a region tag plus a valid bit (V) and way for block 0 through block 15; the selected way forms the Data Array + BST Index, which reads the status from the Block Status Table (BST) and the block from the data array]

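The address split for the example configuration above can be sketched as follows. The cache size, associativity, block size, and region size come from the slide; the number of RVA sets (`RVA_SETS`) is not given there and is a purely hypothetical value, and `data_index` assumes the data array is still set-indexed as in a conventional cache.

```python
from dataclasses import dataclass

CACHE_BYTES = 8 * 1024 * 1024   # 8MB L2 (from the slide)
WAYS = 16                        # 16-way set-associative (from the slide)
BLOCK_BYTES = 64                 # 64-byte blocks (from the slide)
REGION_BYTES = 1024              # 1KB regions (from the slide)
RVA_SETS = 2048                  # hypothetical: not stated on the slide

BLOCK_OFFSET_BITS = (BLOCK_BYTES - 1).bit_length()          # 6
BLOCKS_PER_REGION = REGION_BYTES // BLOCK_BYTES             # 16
REGION_OFFSET_BITS = (BLOCKS_PER_REGION - 1).bit_length()   # 4
RVA_INDEX_BITS = (RVA_SETS - 1).bit_length()                # 11 for 2048 sets
DATA_SETS = CACHE_BYTES // (BLOCK_BYTES * WAYS)             # 8192


@dataclass(frozen=True)
class Fields:
    region_tag: int
    rva_index: int
    region_offset: int   # which of the 16 blocks inside the region
    block_offset: int    # byte within the 64-byte block


def split(addr: int) -> Fields:
    """Split an address into Region Tag | RVA Index | Region Offset | Block Offset."""
    block_offset = addr & (BLOCK_BYTES - 1)
    region_offset = (addr >> BLOCK_OFFSET_BITS) & (BLOCKS_PER_REGION - 1)
    region_number = addr >> (BLOCK_OFFSET_BITS + REGION_OFFSET_BITS)
    return Fields(
        region_tag=region_number >> RVA_INDEX_BITS,
        rva_index=region_number & (RVA_SETS - 1),
        region_offset=region_offset,
        block_offset=block_offset,
    )


def data_index(addr: int, way: int) -> int:
    """Data array / BST index, assuming conventional set indexing plus the way from the RVA."""
    set_index = (addr >> BLOCK_OFFSET_BITS) % DATA_SETS
    return set_index * WAYS + way
```

On a hit, the RVA entry selected by `rva_index` and matched on `region_tag` supplies the valid bit and way for `region_offset`; under the stated assumption, that way and the set index together locate the block.
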
Worst Case (Rare): Region Miss
- Address fields: Region Tag | RVA Index | Region Offset | Block Offset
[Figure: same lookup path as the hit case, but the region tag finds no match in the Region Vector Array (RVA); Block Status Table (BST) entries hold a status and a pointer (Ptr), and the Evicted Region Buffer (ERB) takes part in handling the region miss]

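The slide does not spell out the region-miss mechanics, so the following is only a toy model of one plausible reading: when a new region needs an RVA entry and the set is full, the victim region is parked in the Evicted Region Buffer instead of having all its blocks evicted at once, which is consistent with the "no multiple block evictions" requirement. The class name, method names, capacities, and the LRU policy are all assumptions made for illustration.

```python
from collections import OrderedDict


class ToyRegionSet:
    """Toy model of one RVA set plus a small evicted-region buffer (ERB).

    Assumptions (not from the slides): LRU replacement in both structures,
    and a region found in the ERB is moved back into the RVA on re-access.
    """

    def __init__(self, rva_ways: int = 2, erb_entries: int = 2):
        self.rva = OrderedDict()   # region_tag -> set of cached block offsets
        self.erb = OrderedDict()   # evicted regions whose blocks are still cached
        self.rva_ways = rva_ways
        self.erb_entries = erb_entries
        self.evicted_blocks = 0    # blocks finally dropped when the ERB overflows

    def access(self, region_tag: int, block: int) -> str:
        """Return 'hit', 'block miss', or 'region miss' for this access."""
        if region_tag in self.rva:
            self.rva.move_to_end(region_tag)       # mark as most recently used
            blocks = self.rva[region_tag]
            if block in blocks:
                return "hit"
            blocks.add(block)
            return "block miss"

        # Region miss: reclaim the region from the ERB if it is still there.
        reclaimed = self.erb.pop(region_tag, set())
        if len(self.rva) >= self.rva_ways:
            victim_tag, victim_blocks = self.rva.popitem(last=False)
            self.erb[victim_tag] = victim_blocks   # park the victim, keep its blocks
            while len(self.erb) > self.erb_entries:
                _, dropped = self.erb.popitem(last=False)
                self.evicted_blocks += len(dropped)
        self.rva[region_tag] = reclaimed | {block}
        return "region miss"
```

In this model a region miss costs one RVA replacement, and blocks are only dropped when the buffer itself overflows; real hardware would drain the buffer gradually rather than in one step.
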
Methodology
- Flexus simulator from CMU SimFlex group
  - Based on Simics full-system simulator
- 4-core CMP modeled after Piranha
  - Private 32KB, 4-way set-associative L1 caches
  - Shared 8MB, 16-way set-associative L2 cache
  - 64-byte blocks
- Miss-rates: functional simulation of 2 billion instructions per core
- Performance and energy: timing simulation using SMARTS sampling methodology
- Area and power: full custom implementation on 130nm commercial technology
- 9 commercial workloads:
  - WEB: SpecWEB on Apache and Zeus
  - OLTP: TPC-C on DB2 and Oracle
  - DSS: 5 TPC-H queries on DB2
[Figure: four processors (P), each with private D$ and I$, connected through an interconnect to the shared L2]

Miss-Rates vs. Area
- Sector Cache: 512KB sectors; SPC and RT: 1KB regions
- Trade-offs comparable to conventional cache
[Figure: relative miss-rate vs. relative tag array area; labeled points include Sector Cache at (0.25, 1.26) and 14-way, 15-way, 52-way, and 48-way configurations]

Performance & Energy
- 12-way set-associative RegionTracker: 20% less area
- Error bars: 95% confidence interval
- Performance within 1%, with 33% tag energy reduction
[Figure: two panels, Performance (normalized execution time) and Energy (reduction in tag energy)]

Road Map
- Introduction
- Goals
- Coarse-Grain Cache Designs
- RegionTracker: A Tag Array Replacement
- RegionTracker: An Optimization Framework
- Conclusion

RegionTracker: An Optimization Framework
- Stealth Prefetching: average 20% performance improvement
  - Drop-in RegionTracker for 36% less area overhead
- RegionScout: in-depth analysis
[Figure: L1, RVA, ERB, BST, and data array]

Snoop Coherence: Common Case
- Many snoops are to non-shared regions
[Figure: CPUs and main memory; reads to x, x+1, ..., x+n are each snooped and miss in the other caches]

RegionScout
- Eliminate broadcasts for non-shared regions
[Figure: CPUs and main memory; Read x produces a region miss at every other CPU (a global region miss); each CPU tracks non-shared regions and locally cached regions]

RegionTracker Implementation
- Minimal overhead to support RegionScout optimization
- Still uses less area than conventional tag array
- Non-shared regions: add 1 bit to each RVA entry
- Locally cached regions: already provided by RVA

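The filtering idea above can be sketched in a few lines. This is a simplified model under stated assumptions, not the hardware protocol: each node's RVA is a dictionary, the per-entry non-shared bit is set after a broadcast that finds the region in no other node, and it is cleared when another node's request for that region is snooped. The names `Node`, `snoop`, and `read_miss` are hypothetical.

```python
class Node:
    """A core with a toy RVA: region -> cached blocks and a non-shared bit."""

    def __init__(self, name: str):
        self.name = name
        self.rva = {}  # region -> {"blocks": set[int], "non_shared": bool}

    def snoop(self, region: int) -> bool:
        """Answer a remote region probe; True if this node caches the region."""
        entry = self.rva.get(region)
        if entry is None:
            return False            # region miss: nothing cached locally
        entry["non_shared"] = False  # another node is about to cache this region
        return True


def read_miss(node: Node, others: list, region: int, block: int) -> bool:
    """Handle a block miss; return True if a snoop broadcast was needed."""
    entry = node.rva.get(region)
    if entry is not None and entry["non_shared"]:
        entry["blocks"].add(block)   # known non-shared: go straight to memory
        return False

    # Broadcast to every other node (list comprehension avoids short-circuiting).
    shared = any([other.snoop(region) for other in others])
    if entry is None:
        entry = node.rva.setdefault(region, {"blocks": set(), "non_shared": False})
    entry["blocks"].add(block)
    entry["non_shared"] = not shared  # global region miss sets the bit
    return True
```

For example, the first miss by a node to a region broadcasts and learns the region is non-shared; later misses to the same region skip the broadcast until another node touches it.
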
RegionTracker + RegionScout
- 4 processors, 512KB L2 caches
- 1KB regions
- Avoid 41% of snoop broadcasts, no area overhead compared to conventional tag array
- BlockScout (4KB): new optimization possible with RegionTracker
[Figure: reduction in snoop broadcasts]

Result Summary
- Replace conventional tag array:
  - 20% less tag area
  - 33% less tag energy
  - Within 1% of original performance
- Coarse-grain optimization framework:
  - 36% reduction in area overhead for Stealth Prefetching
  - Filter 41% of snoop broadcasts with no area overhead compared to conventional cache

Conclusion
- Exploiting coarse-grain patterns: Stealth Prefetching, Run-time Adaptive Cache Hierarchy Management via Reference Analysis, Destination-Set Prediction, Spatial Memory Streaming, Coarse-Grain Coherence Tracking, RegionScout, Circuit-Switched Coherence
- RegionTracker framework makes coarse-grain optimizations more attractive
[Figure: CPU and L2 cache surrounded by the optimizations listed above]