
Two Ways to Exploit Multi-Megabyte Caches
AENAO Research Group @ Toronto
Kaveh Aasaraai, Ioana Burcea, Myrto Papadopoulou, Elham Safi, Jason Zebchuk, Andreas Moshovos
{aasaraai, ioana, myrto, elham, zebchuk, moshovos}@eecg.toronto.edu

EPFL, Jan. 2008, Aenao Group/Toronto

Future Caches: Just Larger?
[Diagram labels: CPUs, each with I$ and D$; interconnect; Main Memory; 10s – 100s of MB]
1. "Big Picture" Management
2. Store Metadata

Conventional Block Centric Cache
- "Small" blocks
  - Optimizes bandwidth and performance
- Large L2/L3 caches especially
[Diagram: fine-grain view of memory through the L2 cache; big picture lost]

"Big Picture" View
- Region: 2^n-sized, aligned area of memory
- Patterns and behavior exposed
  - Spatial locality
- Exploit for performance/area/power
[Diagram: coarse-grain view of memory through the L2 cache]

Exploiting Coarse-Grain Patterns
- Many existing coarse-grain optimizations
- Add new structures to track coarse-grain information
[Diagram: CPU and L2 cache with existing optimizations: Stealth Prefetching; Run-time Adaptive Cache Hierarchy Management via Reference Analysis; Destination-Set Prediction; Spatial Memory Streaming; Coarse-Grain Coherence Tracking; RegionScout; Circuit-Switched Coherence]
Hard to justify for a commercial design
Coarse-Grain Framework
- Embed coarse-grain information in tag array
- Support many different optimizations with less area overhead
Adaptable optimization FRAMEWORK

RegionTracker Solution
Manage blocks, but also track and manage regions
[Diagram labels: L1, L2 Cache, Tag Array, Region Tracker, Data Array, Data Blocks, Block Requests, Region Probes, Region Responses]

RegionTracker Summary
- Replace conventional tag array:
  - 4-core CMP with 8MB shared L2 cache
  - Within 1% of original performance
  - Up to 20% less tag area
  - Average 33% less energy consumption
- Optimization Framework:
  - Stealth Prefetching: same performance, 36% less area
  - RegionScout: 2x more snoops avoided, no area overhead

Road Map
- Introduction
- Goals
- Coarse-Grain Cache Designs
- RegionTracker: A Tag Array Replacement
- RegionTracker: An Optimization Framework
- Conclusion

Goals
1. Conventional Tag Array Functionality
  - Identify data block location and state
  - Leave data array unchanged
2. Optimization Framework Functionality
  - Is Region X cached?
  - Which blocks of Region X are cached? Where?
  - Evict or migrate Region X
  - Easy to assign properties to each Region

Coarse-Grain Cache Designs
Large Block Size
- Increased BW, decreased hit-rates
[Diagram: Region X in the Tag Array and Data Array]

Sector Cache
- Decreased hit-rates
[Diagram: Region X in the Tag Array and Data Array]

Sector Pool Cache
- High associativity (2 - 4 times)
[Diagram: Region X in the Tag Array and Data Array]

Decoupled Sector Cache
- Region information not exposed
- Region replacement requires scanning multiple entries
[Diagram: Region X in the Tag Array, Data Array, and Status Table]

Design Requirements
- Small block size (64B)
- Miss-rate does not increase
- Lookup associativity does not increase
- No additional access latency
  - (i.e., no scanning, no multiple block evictions)
- Does not increase latency, area, or energy
- Allows banking and interleaving
- Fit in conventional tag array "envelope"

RegionTracker: A Tag Array Replacement
- 3 SRAM arrays, combined smaller than tag array
  - Region Vector Array
  - Block Status Table
  - Evicted Region Buffer
[Diagram labels: L1, Data Array]

Basic Structures
Ex: 8MB, 16-way set-associative cache, 64-byte blocks, 1KB region
Region Vector Array (RVA) entry: Region Tag, then for each of block 0 … block 15: way (4 bits), V (1 bit)
Block Status Table (BST) entry: status (3 bits), plus a 2-bit field
- Address: specific RVA set and BST set
- RVA entry: multiple, consecutive BST sets
- BST entry: one of four RVA sets
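The mapping between an address, its RVA set, and its BST set can be sketched as follows. This is a minimal illustration, not the authors' implementation: the field boundaries (block offset in bits 0 to 5, region offset in bits 6 to 9, RVA index in bits 10 to 20, BST index in bits 6 to 18) are an assumption derived from the example geometry, and `split_address` and `rva_sets_for_bst_set` are hypothetical helper names.

```python
# Geometry from the slide's example: 8MB, 16-way, 64-byte blocks, 1KB regions.
CACHE_BYTES = 8 * 1024 * 1024
WAYS = 16
BLOCK_BYTES = 64
REGION_BYTES = 1024

BLOCK_OFFSET_BITS = (BLOCK_BYTES - 1).bit_length()          # 6
BLOCKS_PER_REGION = REGION_BYTES // BLOCK_BYTES             # 16
REGION_OFFSET_BITS = (BLOCKS_PER_REGION - 1).bit_length()   # 4
BST_SETS = CACHE_BYTES // BLOCK_BYTES // WAYS               # 8192
BST_INDEX_BITS = (BST_SETS - 1).bit_length()                # 13
RVA_INDEX_BITS = 11  # assumption: RVA index occupies address bits 10..20


def split_address(addr):
    """Split a physical address into the fields used by the RVA and the BST."""
    block_offset = addr & (BLOCK_BYTES - 1)
    region_offset = (addr >> BLOCK_OFFSET_BITS) & (BLOCKS_PER_REGION - 1)
    rva_shift = BLOCK_OFFSET_BITS + REGION_OFFSET_BITS
    rva_index = (addr >> rva_shift) & ((1 << RVA_INDEX_BITS) - 1)
    region_tag = addr >> (rva_shift + RVA_INDEX_BITS)
    bst_index = (addr >> BLOCK_OFFSET_BITS) & (BST_SETS - 1)
    return {
        "region_tag": region_tag,
        "rva_index": rva_index,
        "region_offset": region_offset,
        "block_offset": block_offset,
        "bst_index": bst_index,
    }


def rva_sets_for_bst_set(bst_index):
    """RVA sets whose regions can own blocks in this BST set.

    The RVA index has two bits that the BST index lacks, so each BST set
    maps back to one of four RVA sets.
    """
    shared_bits = BST_INDEX_BITS - REGION_OFFSET_BITS        # 9
    low = bst_index >> REGION_OFFSET_BITS
    extra = RVA_INDEX_BITS - shared_bits                     # 2
    return [(hi << shared_bits) | low for hi in range(1 << extra)]


fields = split_address((0x5 << 21) | (0x3A7 << 10) | (9 << 6) | 0x2B)
```

Under these assumptions, the 16 blocks of one region differ only in the low four bits of the BST index, which is why one RVA entry covers 16 consecutive BST sets.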

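Common Case: Hit
Ex: 8MB, 16-way set-associative cache, 64-byte blocks, 1KB region
Address, RVA view (bit positions 49, 21, 10, 6, 0): Region Tag | RVA Index | Region Offset | Block Offset
Address, BST view (bit positions 19, 6, 0): BST Index | Block Offset
[Diagram: the Region Tag is matched in the Region Vector Array (RVA); the matching entry's way and V fields (4 bits and 1 bit) for the addressed block, together with the BST Index, go to the Data Array; the Block Status Table (BST) entry holds status (3 bits) plus a 2-bit field]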
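The hit path can be sketched in a few lines. This is an illustrative model only: the `RvaEntry` class, the `lookup` function, the dictionary used for the RVA, and the exact bit boundaries are assumptions made for the example, not details confirmed by the slides.

```python
from dataclasses import dataclass, field

BLOCKS_PER_REGION = 16  # 1KB region / 64-byte blocks


@dataclass
class RvaEntry:
    """One RVA entry: a region tag plus a (valid, way) pair per block."""
    region_tag: int
    valid: list = field(default_factory=lambda: [False] * BLOCKS_PER_REGION)
    way: list = field(default_factory=lambda: [0] * BLOCKS_PER_REGION)


def lookup(rva_sets, addr):
    """Return (outcome, bst_index, way) for an address.

    rva_sets maps an RVA set index to the list of entries in that set.
    Assumed field layout: block offset bits 0..5, region offset bits 6..9,
    RVA index bits 10..20, region tag above that, BST index bits 6..18.
    """
    region_offset = (addr >> 6) & 0xF
    rva_index = (addr >> 10) & 0x7FF
    region_tag = addr >> 21
    bst_index = (addr >> 6) & 0x1FFF
    for entry in rva_sets.get(rva_index, []):
        if entry.region_tag == region_tag:
            if entry.valid[region_offset]:
                # Common case: the way and the BST index locate the data block.
                return ("hit", bst_index, entry.way[region_offset])
            return ("block_miss", None, None)
    return ("region_miss", None, None)


# Usage: cache block 9 of region (tag 5, RVA set 0x3A7) in way 12.
entry = RvaEntry(region_tag=5)
entry.valid[9] = True
entry.way[9] = 12
rva = {0x3A7: [entry]}
base = (5 << 21) | (0x3A7 << 10)
result_hit = lookup(rva, base | (9 << 6))
result_block_miss = lookup(rva, base | (3 << 6))
result_region_miss = lookup(rva, (6 << 21) | (0x3A7 << 10))
```

The per-block valid bits in a matching entry also answer the framework questions "Is Region X cached?" and "Which blocks of Region X are cached?" without scanning other entries.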

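Worst Case (Rare): Region Miss
Address, RVA view (bit positions 49, 21, 10, 6, 0): Region Tag | RVA Index | Region Offset | Block Offset
Address, BST view (bit positions 19, 6, 0): BST Index | Block Offset
[Diagram: the Region Tag finds no match in the Region Vector Array (RVA) ("No Match!"); the Block Status Table (BST) entry holds status (3 bits) and Ptr (2 bits); the Evicted Region Buffer (ERB) is involved, with a Ptr; the Data Array is addressed with way + BST Index]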

Methodology
- Flexus simulator from CMU SimFlex group
  - Based on Simics full-system simulator
- 4-core CMP modeled after Piranha
  - Private 32KB, 4-way set-associative L1 caches
  - Shared 8MB, 16-way set-associative L2 cache
  - 64-byte blocks
- Miss-rates: functional simulation of 2 billion instructions per core
- Performance and Energy: timing simulation using SMARTS sampling methodology
- Area and Power: full custom implementation on 130nm commercial technology
- 9 commercial workloads:
  - WEB: SpecWEB on Apache and Zeus
  - OLTP: TPC-C on DB2 and Oracle
  - DSS: 5 TPC-H queries on DB2
[Diagram: four processors (P), each with D$ and I$, connected through an interconnect to the L2]

Miss-Rates vs. Area
- Sector Cache: 512KB sectors, SPC and RT: 1KB regions
- Trade-offs comparable to conventional cache
[Plot: Relative Miss-Rate vs. Relative Tag Array Area; annotations: Sector Cache (0.25, 1.26), 14-way, 15-way, 52-way, 48-way]

Performance & Energy
- 12-way set-associative RegionTracker: 20% less area
- Error bars: 95% confidence interval
- Performance within 1%, with 33% tag energy reduction
[Plots: Performance (Normalized Execution Time); Energy (Reduction in Tag Energy)]

Road Map
- Introduction
- Goals
- Coarse-Grain Cache Designs
- RegionTracker: A Tag Array Replacement
- RegionTracker: An Optimization Framework
- Conclusion

RegionTracker: An Optimization Framework
[Diagram labels: L1, RVA, ERB, BST, Data Array]
- Stealth Prefetching: average 20% performance improvement
  - Drop-in RegionTracker for 36% less area overhead
- RegionScout: in-depth analysis

Snoop Coherence: Common Case
[Diagram: a CPU misses on Read x, followed by Read x+1, Read x+2, ..., Read x+n; Main Memory]
Many snoops are to non-shared regions

RegionScout
Eliminate broadcasts for non-shared regions
[Diagram labels: Main Memory, CPU, Read x, Region Miss, Global Region Miss, Non-Shared Regions, Locally Cached Regions]

RegionTracker Implementation
- Minimal overhead to support RegionScout optimization
- Still uses less area than conventional tag array
Non-Shared Regions: add 1 bit to each RVA entry
Locally Cached Regions: already provided by RVA
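A toy model of how one non-shared bit per region entry could filter snoop broadcasts is sketched below. The protocol details are assumptions for illustration (when the bit is set and cleared is not spelled out on the slides), and `Node`, `on_remote_request`, and `handle_miss` are hypothetical names.

```python
class Node:
    """Per-node region state: one non-shared bit per locally cached region."""

    def __init__(self):
        # region id -> non-shared bit; a key's presence means the region
        # is locally cached (information the RVA already provides).
        self.regions = {}

    def on_remote_request(self, region):
        """Answer a remote probe: does this node cache any block of the region?"""
        if region in self.regions:
            # Assumption: another node now touches the region, so it can
            # no longer be treated as non-shared here.
            self.regions[region] = False
            return True
        return False


def handle_miss(requester, others, region):
    """Handle a block miss in `region`; return True if a broadcast was sent."""
    if requester.regions.get(region, False):
        return False  # region known to be non-shared: skip the broadcast
    # Probe every other node (no short-circuiting, all must see the request).
    responses = [node.on_remote_request(region) for node in others]
    # A global region miss marks the region non-shared for the requester.
    requester.regions[region] = not any(responses)
    return True


a, b = Node(), Node()
first = handle_miss(a, [b], region=7)    # a has no information yet
second = handle_miss(a, [b], region=7)   # a knows region 7 is non-shared
third = handle_miss(b, [a], region=7)    # b must probe; a caches region 7
fourth = handle_miss(a, [b], region=7)   # a's bit was cleared by b's probe
```

In this sketch the set of locally cached regions is just the dictionary's keys, mirroring the slide's point that the RVA already provides that information and only the extra bit is new.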

RegionTracker + RegionScout
- 4 processors, 512KB L2 caches
- 1KB regions
[Plot: Reduction in Snoop Broadcasts; comparison point: BlockScout (4KB)]
Avoid 41% of snoop broadcasts, no area overhead compared to conventional tag array

Result Summary
- Replace Conventional Tag Array:
  - 20% less tag area
  - 33% less tag energy
  - Within 1% of original performance
- Coarse-Grain Optimization Framework:
  - 36% reduction in area overhead for Stealth Prefetching
  - Filter 41% of snoop broadcasts with no area overhead compared to conventional cache

Predictor Virtualization
Ioana Burcea
Joint work with Stephen Somogyi and Babak Falsafi

Predictor Virtualization
[Diagram: CPUs, each with L1-D and L1-I caches, connected through an interconnect to the L2 and Main Memory; Optimization Engines: Predictors]

Motivating Trends
- Dedicating resources to predictors hard to justify:
  - Chip multiprocessors
    - Space dedicated to predictors × #processors
  - Larger predictor tables
    - Increased performance
- Memory hierarchies offer the opportunity
  - Increased capacity
  - How many apps really use the space?
Use conventional memory hierarchies to store predictor information

PV Architecture contd.
[Diagram labels: Optimization Engine, request, Predictor Table, prediction]

PV Architecture contd.
[Diagram labels: Optimization Engine, request, Predictor Virtualization, prediction]

PV Architecture contd.
[Diagram labels: Optimization Engine, request, prediction, PVProxy, PVStart + index, PVCache, MSHR, L2, Main Memory, PVTable]
On the backside of the L1

To Virtualize Or Not to Virtualize?
1. Re-Use
2. Predictor Info Prefetching
[Diagram labels: CPU with I$ and D$, interconnect, L2/L3, Main Memory; Common Case; Infrequent]

To Virtualize or Not?
- Challenge
  - Hit in the PVCache most of the time
- Will not work for all predictors out of the box
- Reuse is necessary
  - Intrinsic
    - Easy to virtualize
  - Non-intrinsic
    - Must be engineered
- More so if the predictor needs to be fast to start with

Will There Be Reuse?
- Intrinsic:
  - Multiple predictions per entry
  - We'll see an example
- Can be engineered
  - Group temporally correlated entries together: cache block
[Diagram labels: CPU with I$ and D$, interconnect, L2/L3, Main Memory]

Spatial Memory Streaming
- Footprint:
  - Blocks accessed per memory region
- Predict next time the footprint will be the same
  - Handle: PC + offset within region

Spatial Generations

Virtualizing SMS
[Diagram labels: Detector, Predictor, patterns, prefetches, trigger access, Virtualize]

Virtualizing SMS
[Diagram: Virtual Table (1K × 11) and PVCache (8 × 11); cache block layout: tag | pattern | tag | pattern | ... | unused, with bit positions 0, 11, 43, 54, 85]

Packing Entries in One Cache Block
- Index: PC + offset within spatial group
  - PC → 16 bits
  - 32 blocks in a spatial group → 5-bit offset → 32-bit spatial pattern
- Pattern table: 1K sets
  - 10 bits to index the table → 11-bit tag
- Cache block: 64 bytes
  - 11 entries per cache block → pattern table: 1K sets, 11-way set associative
[Diagram: 21-bit index; cache block layout: tag | pattern | tag | pattern | ... | unused, with bit positions 0, 11, 43, 54, 85]
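The packing arithmetic above can be checked with a short sketch. The bit counts come from the slide; the little-endian layout, with each tag stored below its pattern, and the helper names `pack_block` and `unpack_block` are assumptions made for the example.

```python
PC_BITS = 16
OFFSET_BITS = 5                               # 32 blocks per spatial group
PATTERN_BITS = 32                             # one bit per block in the group
INDEX_BITS = PC_BITS + OFFSET_BITS            # 21
SET_INDEX_BITS = 10                           # 1K sets
TAG_BITS = INDEX_BITS - SET_INDEX_BITS        # 11
ENTRY_BITS = TAG_BITS + PATTERN_BITS          # 43
BLOCK_BYTES = 64
BLOCK_BITS = BLOCK_BYTES * 8                  # 512
ENTRIES_PER_BLOCK = BLOCK_BITS // ENTRY_BITS  # 11
UNUSED_BITS = BLOCK_BITS - ENTRIES_PER_BLOCK * ENTRY_BITS  # 39


def pack_block(entries):
    """Pack up to 11 (tag, pattern) pairs into one 64-byte cache block."""
    if len(entries) > ENTRIES_PER_BLOCK:
        raise ValueError("too many entries for one cache block")
    value = 0
    for i, (tag, pattern) in enumerate(entries):
        if tag >> TAG_BITS or pattern >> PATTERN_BITS:
            raise ValueError("field does not fit")
        # Assumed layout: tag in the low 11 bits of the entry, pattern above it.
        value |= (tag | (pattern << TAG_BITS)) << (i * ENTRY_BITS)
    return value.to_bytes(BLOCK_BYTES, "little")


def unpack_block(block):
    """Recover all 11 (tag, pattern) slots from a 64-byte cache block."""
    value = int.from_bytes(block, "little")
    entries = []
    for i in range(ENTRIES_PER_BLOCK):
        entry = (value >> (i * ENTRY_BITS)) & ((1 << ENTRY_BITS) - 1)
        entries.append((entry & ((1 << TAG_BITS) - 1), entry >> TAG_BITS))
    return entries


sample = [(0x7FF, 0xFFFFFFFF), (0x123, 0x0000F00F)]
block = pack_block(sample)
roundtrip = unpack_block(block)
```

With 43-bit entries, 11 of them occupy 473 of the 512 bits in a block, so one cache block holds exactly one set of the 11-way pattern table and 39 bits stay unused.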

Memory Address Calculation
[Diagram: the index is formed from the PC (16 bits) and the block offset (5 bits); 10 bits of it, followed by 000000, are added (+) to the PV Start Address to produce the Memory Address]
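A minimal sketch of this address calculation follows. It assumes the 21-bit index is the 16-bit PC concatenated with the 5-bit offset and that the low 10 bits select the set; the slide does not state which bits are used, and `pv_address` is a hypothetical name.

```python
BLOCK_SHIFT = 6        # 64-byte cache blocks: six zero bits are appended
SET_INDEX_BITS = 10    # 1K sets in the virtualized table
OFFSET_BITS = 5
PC_BITS = 16


def pv_address(pv_start, pc, block_offset):
    """Return (memory_address, tag) for a predictor lookup.

    Assumptions: index = PC (16 bits) concatenated with the offset (5 bits),
    set index = low 10 bits of the index, tag = remaining 11 bits.
    """
    pc_field = pc & ((1 << PC_BITS) - 1)
    offset_field = block_offset & ((1 << OFFSET_BITS) - 1)
    index = (pc_field << OFFSET_BITS) | offset_field
    set_index = index & ((1 << SET_INDEX_BITS) - 1)
    tag = index >> SET_INDEX_BITS
    # Each set occupies one cache block, so shift the set index by 6 bits.
    return pv_start + (set_index << BLOCK_SHIFT), tag


address, tag = pv_address(pv_start=0x1000_0000, pc=0x1234, block_offset=7)
```

Because each set fits in one cache block, the whole table spans 1K × 64 bytes of memory starting at the PV start address, and the tag is compared against the 11 entries inside the fetched block.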

Simulation Infrastructure
- SimFlex: CMU Impetus
  - Full-system simulator based on Simics
- Base processor configuration
  - 8-wide OoO
  - 256-entry ROB / 64-entry LSQ
  - L1D/L1I 64KB 4-way set-associative
  - UL2 8MB 16-way set-associative
- Commercial workloads
  - TPC-C: DB2 and Oracle
  - TPC-H: Query 1, Query 2, Query 16, Query 17
  - Web: Apache and Zeus

SMS – Performance Potential
[Plot]

Virtualized Spatial Memory Streaming
- Original prefetcher: cost 60KB
- Virtualized prefetcher: cost <1KB
Nearly identical performance
[Plot]

Impact of Virtualization on L2 Misses

Impact of Virtualization on L2 Requests

Coarse-Grain Tracking Jason Zebchuk
