
1 Two Ways to Exploit Multi-Megabyte Caches
AENAO Research Group @ Toronto
Kaveh Aasaraai, Ioana Burcea, Myrto Papadopoulou, Elham Safi, Jason Zebchuk, Andreas Moshovos
{aasaraai, ioana, myrto, elham, zebchuk, moshovos}@eecg.toronto.edu

2 Future Caches: Just Larger? (EPFL, Jan. 2008; Aenao Group/Toronto)
[Figure: several CPUs, each with I$ and D$, an interconnect, and Main Memory; label: 10s – 100s of MB]
1. "Big Picture" Management
2. Store Metadata

3 Conventional Block Centric Cache
- "Small" blocks
  - Optimizes bandwidth and performance
- Large L2/L3 caches especially
[Figure: Fine-Grain View of Memory; L2 Cache]
Big Picture Lost

4 "Big Picture" View
- Region: 2^n-sized, aligned area of memory
- Patterns and behavior exposed
  - Spatial locality
- Exploit for performance/area/power
[Figure: Coarse-Grain View of Memory; L2 Cache]
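
The region definition on slide 4 can be made concrete with a little address arithmetic. The following is a minimal sketch, not taken from the slides: the function names are hypothetical, and the 64-byte block size is an assumption borrowed from the examples later in the talk.

```python
def region_base(addr: int, n: int) -> int:
    """Base address of the aligned 2**n-byte region that contains addr."""
    return addr & ~((1 << n) - 1)


def block_in_region(addr: int, n: int, block_bits: int = 6) -> int:
    """Index of addr's block inside its region (block_bits=6 assumes 64-byte blocks)."""
    return (addr & ((1 << n) - 1)) >> block_bits


# With 1KB regions (n = 10), every address in 0x12000..0x123FF shares one region.
base = region_base(0x12345, 10)
block = block_in_region(0x12345, 10)
```

Because regions are aligned powers of two, both values come from masking and shifting alone, which is what makes region-level bookkeeping cheap in hardware.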

5 Exploiting Coarse-Grain Patterns
- Many existing coarse-grain optimizations
- Add new structures to track coarse-grain information
[Figure: CPU and L2 Cache with the existing optimizations: Stealth Prefetching; Run-time Adaptive Cache Hierarchy Management via Reference Analysis; Destination-Set Prediction; Spatial Memory Streaming; Coarse-Grain Coherence Tracking; RegionScout; Circuit-Switched Coherence]
Hard to justify for a commercial design
Coarse-Grain Framework
- Embed coarse-grain information in tag array
- Support many different optimizations with less area overhead
Adaptable optimization FRAMEWORK

6 RegionTracker Solution
Manage blocks, but also track and manage regions
[Figure: L1 and L2 Cache; Tag Array, Data Array and Region Tracker; arrows labeled Block Requests, Data Blocks, Region Probes, Region Responses]

7 RegionTracker Summary
- Replace conventional tag array:
  - 4-core CMP with 8MB shared L2 cache
  - Within 1% of original performance
  - Up to 20% less tag area
  - Average 33% less energy consumption
- Optimization Framework:
  - Stealth Prefetching: same performance, 36% less area
  - RegionScout: 2x more snoops avoided, no area overhead

8 Road Map
- Introduction
- Goals
- Coarse-Grain Cache Designs
- RegionTracker: A Tag Array Replacement
- RegionTracker: An Optimization Framework
- Conclusion

9 Goals
1. Conventional Tag Array Functionality
  - Identify data block location and state
  - Leave data array unchanged
2. Optimization Framework Functionality
  - Is Region X cached?
  - Which blocks of Region X are cached? Where?
  - Evict or migrate Region X
  - Easy to assign properties to each Region

10 Coarse-Grain Cache Designs: Large Block Size
- Increased BW, decreased hit-rates
[Figure: Region X in the Tag Array and Data Array]

11 Sector Cache
- Decreased hit-rates
[Figure: Region X in the Tag Array and Data Array]

12 Sector Pool Cache
- High associativity (2 - 4 times)
[Figure: Region X in the Tag Array and Data Array]

13 Decoupled Sector Cache
- Region information not exposed
- Region replacement requires scanning multiple entries
[Figure: Region X in the Tag Array, Data Array and Status Table]

14 Design Requirements
- Small block size (64B)
- Miss-rate does not increase
- Lookup associativity does not increase
- No additional access latency
  - (i.e., no scanning, no multiple block evictions)
- Does not increase latency, area, or energy
- Allows banking and interleaving
- Fits in conventional tag array "envelope"

15 RegionTracker: A Tag Array Replacement
- 3 SRAM arrays, combined smaller than tag array
[Figure: L1 and Data Array with the three arrays: Region Vector Array, Block Status Table, Evicted Region Buffer]

16 Basic Structures
[Figure: Region Vector Array (RVA) entry: Region Tag, then block 0 … block 15, each with V and way fields (width labels 1 and 4); Block Status Table (BST) entry: status (width labels 3 and 2)]
- Address: specific RVA set and BST set
- RVA entry: multiple, consecutive BST sets
- BST entry: one of four RVA sets
Ex: 8MB, 16-way set-associative cache, 64-byte blocks, 1KB region

17 Common Case: Hit
Address: Region Tag | RVA Index | Region Offset | Block Offset (bit labels 49, 21, 10, 6, 0)
[Figure: the RVA Index selects a Region Vector Array (RVA) set; the matching entry holds Region Tag, then block 0 … block 15, each with V and way fields (width labels 1 and 4); BST Index | Block Offset (bit labels 19, 6, 0) selects the Block Status Table (BST) entry with status (width labels 3 and 2); way + BST Index go to the Data Array]
Ex: 8MB, 16-way set-associative cache, 64-byte blocks, 1KB region
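
The address split used on the hit path can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the field boundaries (Region Tag from bit 21 up, RVA Index in bits 20..10, Region Offset in bits 9..6, Block Offset in bits 5..0) are one reading of the bit labels on the slide, and the BST is assumed to have one set per data-array set of the example cache. All names are hypothetical.

```python
BLOCK_BITS = 6            # 64-byte blocks (from the slide's example)
REGION_BITS = 10          # 1KB regions (from the slide's example)
RVA_INDEX_BITS = 11       # assumption: RVA Index occupies bits 20..10
CACHE_BYTES = 8 * 1024 * 1024
WAYS = 16

# Assumption: the BST is indexed like a conventional tag array.
BST_SETS = CACHE_BYTES // (WAYS * (1 << BLOCK_BITS))
BST_INDEX_BITS = BST_SETS.bit_length() - 1


def split_address(addr: int) -> dict:
    """Split a physical address into the fields named on the slide."""
    return {
        "block_offset": addr & ((1 << BLOCK_BITS) - 1),
        "region_offset": (addr >> BLOCK_BITS) & ((1 << (REGION_BITS - BLOCK_BITS)) - 1),
        "rva_index": (addr >> REGION_BITS) & ((1 << RVA_INDEX_BITS) - 1),
        "region_tag": addr >> (REGION_BITS + RVA_INDEX_BITS),
        "bst_index": (addr >> BLOCK_BITS) & (BST_SETS - 1),
    }


def rva_sets_per_bst_set() -> int:
    """Number of RVA sets whose regions can land in one BST set."""
    rva_top = REGION_BITS + RVA_INDEX_BITS    # first bit above the RVA index
    bst_top = BLOCK_BITS + BST_INDEX_BITS     # first bit above the BST index
    return 1 << max(0, rva_top - bst_top)


def bst_sets_per_region() -> int:
    """Number of consecutive BST sets covered by one region."""
    return 1 << (REGION_BITS - BLOCK_BITS)
```

Under these assumptions the two helper functions return 4 and 16, which agrees with the slide's statements that a BST entry belongs to one of four RVA sets and that an RVA entry covers multiple, consecutive BST sets (block 0 through block 15).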

18 Worst Case (Rare): Region Miss
Address: Region Tag | RVA Index | Region Offset | Block Offset (bit labels 49, 21, 10, 6, 0)
[Figure: Region Vector Array (RVA) lookup with "No Match!" on the Region Tag; BST Index | Block Offset (bit labels 19, 6, 0); Block Status Table (BST) entry with status (3) and Ptr (2) fields; Evicted Region Buffer (ERB) with Ptr; Data Array + BST Index]

19 Methodology
- Flexus simulator from CMU SimFlex group
  - Based on Simics full-system simulator
- 4-core CMP modeled after Piranha
  - Private 32KB, 4-way set-associative L1 caches
  - Shared 8MB, 16-way set-associative L2 cache
  - 64-byte blocks
- Miss-rates: functional simulation of 2 billion instructions per core
- Performance and energy: timing simulation using SMARTS sampling methodology
- Area and power: full custom implementation on 130nm commercial technology
- 9 commercial workloads:
  - WEB: SpecWEB on Apache and Zeus
  - OLTP: TPC-C on DB2 and Oracle
  - DSS: 5 TPC-H queries on DB2
[Figure: four processors (P), each with D$ and I$, an Interconnect, and the L2]

20 Miss-Rates vs. Area
- Sector Cache: 512KB sectors, SPC and RT: 1KB regions
- Trade-offs comparable to conventional cache
[Plot: Relative Miss-Rate vs. Relative Tag Array Area, with a "better" direction marker; point labels: Sector Cache (0.25, 1.26), 14-way, 15-way, 52-way, 48-way]

21 Performance & Energy
- 12-way set-associative RegionTracker: 20% less area
- Error bars: 95% confidence interval
- Performance within 1%, with 33% tag energy reduction
[Plots: Performance (Normalized Execution Time) and Energy (Reduction in Tag Energy), each with a "better" direction marker]

22 Road Map
- Introduction
- Goals
- Coarse-Grain Cache Designs
- RegionTracker: A Tag Array Replacement
- RegionTracker: An Optimization Framework
- Conclusion

23 RegionTracker: An Optimization Framework
[Figure: L1, RVA, ERB, BST, Data Array]
- Stealth Prefetching: average 20% performance improvement
  - Drop-in RegionTracker for 36% less area overhead
- RegionScout: in-depth analysis

24 Snoop Coherence: Common Case
[Figure: a CPU issues Read x (miss), Read x+1, Read x+2, … Read x+n; Main Memory]
Many snoops are to non-shared regions

25 RegionScout
Eliminate broadcasts for non-shared regions
[Figure: a CPU issues Read x; labels: Region Miss, Global Region Miss, Non-Shared Regions, Locally Cached Regions, Main Memory]

26 RegionTracker Implementation
- Minimal overhead to support RegionScout optimization
- Still uses less area than conventional tag array
Non-Shared Regions: add 1 bit to each RVA entry
Locally Cached Regions: already provided by RVA
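
The filtering idea behind slides 24 to 26 can be sketched in a few lines. This is a simplified behavioral model under assumptions of my own, not the hardware design: the class and function names are hypothetical, regions are 1KB, evictions are not modeled, and a Python set stands in for the one non-shared bit per RVA entry.

```python
class Node:
    """One processor node with region-level bookkeeping (hypothetical model)."""

    def __init__(self, region_bits: int = 10):   # assumption: 1KB regions
        self.region_bits = region_bits
        self.cached = {}          # region -> set of cached block addresses
        self.non_shared = set()   # stands in for the 1-bit flag per RVA entry

    def region(self, addr: int) -> int:
        return addr >> self.region_bits

    def snoop(self, addr: int) -> bool:
        """Handle a remote request; report whether this node caches the region."""
        r = self.region(addr)
        # Another node is touching r, so it can no longer be treated as non-shared.
        self.non_shared.discard(r)
        return r in self.cached


def read_miss(requester: Node, others: list, addr: int) -> bool:
    """Service a read miss; return True if a snoop broadcast was needed."""
    r = requester.region(addr)
    broadcast = r not in requester.non_shared
    if broadcast:
        # Build a list first so that every node observes the snoop.
        region_hits = [node.snoop(addr) for node in others]
        if not any(region_hits):
            # Global region miss: remember that nobody else caches this region.
            requester.non_shared.add(r)
    requester.cached.setdefault(r, set()).add(addr)
    return broadcast
```

In this model the first miss to a region pays for a broadcast, and later misses to the same region skip it until some other node requests a block from that region.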

27 RegionTracker + RegionScout
- 4 processors, 512KB L2 caches
- 1KB regions
[Plot: Reduction in Snoop Broadcasts, with a "better" direction marker; series label: BlockScout (4KB)]
Avoid 41% of snoop broadcasts, no area overhead compared to conventional tag array

28 Result Summary
- Replace conventional tag array:
  - 20% less tag area
  - 33% less tag energy
  - Within 1% of original performance
- Coarse-grain optimization framework:
  - 36% reduction in area overhead for Stealth Prefetching
  - Filter 41% of snoop broadcasts with no area overhead compared to conventional cache

29 Predictor Virtualization
Ioana Burcea
Joint work with Stephen Somogyi and Babak Falsafi

30 Predictor Virtualization
[Figure: several CPUs, each with L1-D and L1-I, connected through an Interconnect to the L2 and Main Memory; label: Optimization Engines: Predictors]

31 Motivating Trends
- Dedicating resources to predictors hard to justify:
  - Chip multiprocessors
    - Space dedicated to predictors × #processors
  - Larger predictor tables
    - Increased performance
- Memory hierarchies offer the opportunity
  - Increased capacity
  - How many apps really use the space?
Use conventional memory hierarchies to store predictor information

32 PV Architecture contd.
[Figure: Optimization Engine and Predictor Table; arrows labeled request and prediction]

33 PV Architecture contd.
[Figure: Optimization Engine and Predictor Virtualization; arrows labeled request and prediction]

34 PV Architecture contd.
[Figure: Optimization Engine (request, prediction) and PVProxy containing index, PVStart, PVCache and MSHR; PVTable held in L2 and Main Memory]
On the backside of the L1

35 To Virtualize Or Not to Virtualize?
1. Re-use
2. Predictor info prefetching
[Figure: CPU with I$ and D$, interconnect, L2/L3, Main Memory; labels: Common Case, Infrequent]

36 To Virtualize or Not?
- Challenge
  - Hit in the PVCache most of the time
- Will not work for all predictors out of the box
- Reuse is necessary
  - Intrinsic
    - Easy to virtualize
  - Non-intrinsic
    - Must be engineered
- More so if the predictor needs to be fast to start with

37 Will There Be Reuse?
- Intrinsic:
  - Multiple predictions per entry
  - We'll see an example
- Can be engineered
  - Group temporally correlated entries together: cache block
[Figure: CPU with I$ and D$, interconnect, L2/L3, Main Memory]

38 Spatial Memory Streaming
- Footprint:
  - Blocks accessed per memory region
- Predict next time the footprint will be the same
  - Handle: PC + offset within region
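
A footprint predictor of this kind can be sketched as below. It is a simplified illustration, not the SMS design itself: the class and method names are hypothetical, the region size (32 blocks of 64 bytes) is an assumption, the pattern table is an unbounded dictionary, and the end of a generation is signalled by an explicit call instead of by cache evictions.

```python
class FootprintPredictorSketch:
    """Records per-region footprints and replays them on a matching trigger."""

    def __init__(self, region_bits: int = 11, block_bits: int = 6):
        # Assumption: 2KB regions of 64-byte blocks, i.e. 32 blocks per region.
        self.region_bits = region_bits
        self.block_bits = block_bits
        self.blocks = 1 << (region_bits - block_bits)
        self.active = {}     # region -> [handle, footprint bit vector]
        self.patterns = {}   # handle (pc, offset) -> last recorded footprint

    def access(self, pc: int, addr: int) -> list:
        """Record an access; return block addresses to prefetch (may be empty)."""
        region = addr >> self.region_bits
        offset = (addr >> self.block_bits) & (self.blocks - 1)
        prefetches = []
        if region not in self.active:
            # Trigger access: first touch of the region in this generation.
            handle = (pc, offset)
            self.active[region] = [handle, 0]
            predicted = self.patterns.get(handle, 0)
            prefetches = [
                (region << self.region_bits) | (i << self.block_bits)
                for i in range(self.blocks)
                if ((predicted >> i) & 1) and i != offset
            ]
        self.active[region][1] |= 1 << offset
        return prefetches

    def end_generation(self, addr: int) -> None:
        """Close the region's generation and store its footprint under its handle."""
        handle, footprint = self.active.pop(addr >> self.region_bits)
        self.patterns[handle] = footprint
```

The handle is the pair the slide names, the trigger PC plus the offset within the region, so a footprint learned in one region can be replayed in a different region touched by the same code.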

39 Spatial Generations

40 Virtualizing SMS
[Figure: Detector and Predictor; labels: trigger access, patterns, prefetches, Virtualize]

41 Virtualizing SMS
[Figure: Virtual Table (labels 1K, 11) and PVCache (labels 8, 11); cache-block layout tag | pattern | tag | pattern | … with bit labels 0, 11, 43, 54, 85; remainder unused]

42 Packing Entries in One Cache Block
- Index: PC + offset within spatial group
  - PC → 16 bits
  - 32 blocks in a spatial group → 5-bit offset → 32-bit spatial pattern
- Pattern table: 1K sets
  - 10 bits to index the table → 11-bit tag
- Cache block: 64 bytes
  - 11 entries per cache block → pattern table 1K sets, 11-way set-associative
[Figure: 21-bit index; cache-block layout tag | pattern | tag | pattern | … with bit labels 0, 11, 43, 54, 85; remainder unused]
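
The packing arithmetic on slide 42 works out as 11 entries × (11-bit tag + 32-bit pattern) = 473 bits, which fits in a 512-bit cache block and leaves 39 bits unused. A minimal sketch of such a layout follows; the function names are hypothetical, and the bit order (tag in the low bits of each entry, little-endian block) is an assumption, since the slide only shows field boundaries.

```python
TAG_BITS = 11
PATTERN_BITS = 32
ENTRY_BITS = TAG_BITS + PATTERN_BITS            # 43 bits per entry
BLOCK_BYTES = 64
ENTRIES_PER_BLOCK = (BLOCK_BYTES * 8) // ENTRY_BITS
UNUSED_BITS = BLOCK_BYTES * 8 - ENTRIES_PER_BLOCK * ENTRY_BITS


def pack_block(entries: list) -> bytes:
    """Pack up to 11 (tag, pattern) pairs into one 64-byte block."""
    if len(entries) > ENTRIES_PER_BLOCK:
        raise ValueError("too many entries for one cache block")
    word = 0
    for i, (tag, pattern) in enumerate(entries):
        if not (0 <= tag < (1 << TAG_BITS) and 0 <= pattern < (1 << PATTERN_BITS)):
            raise ValueError("tag or pattern out of range")
        # Assumed layout: tag at the entry's low bits, pattern above it.
        word |= (tag | (pattern << TAG_BITS)) << (i * ENTRY_BITS)
    return word.to_bytes(BLOCK_BYTES, "little")


def unpack_block(block: bytes) -> list:
    """Recover all 11 (tag, pattern) slots from a 64-byte block."""
    word = int.from_bytes(block, "little")
    entries = []
    for i in range(ENTRIES_PER_BLOCK):
        entry = (word >> (i * ENTRY_BITS)) & ((1 << ENTRY_BITS) - 1)
        entries.append((entry & ((1 << TAG_BITS) - 1), entry >> TAG_BITS))
    return entries
```

Packing a whole set into one block is what gives the virtualized table its 11-way associativity: a single cache-block fetch brings in every way of the set.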

43 Memory Address Calculation
[Figure: index formed from the PC (16 bits) and the Block offset (5 bits); 10 bits of it, followed by 000000, are added (+) to the PV Start Address to give the Memory Address]
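
The address calculation can be sketched as follows. The slides give the field widths (16-bit PC, 5-bit offset, 10 index bits, six zero bits for the 64-byte block) but not which bits are used, so the choices here are assumptions: the low 16 bits of the PC, the PC placed above the offset, and the low 10 bits of the 21-bit index selecting the set. Function names are hypothetical.

```python
PC_BITS = 16
OFFSET_BITS = 5
SET_BITS = 10        # 1K sets
BLOCK_BITS = 6       # 64-byte cache blocks, hence the six appended zero bits


def pv_index(pc: int, offset: int) -> int:
    """21-bit predictor index (assumed: low 16 PC bits placed above the 5-bit offset)."""
    return ((pc & ((1 << PC_BITS) - 1)) << OFFSET_BITS) | (offset & ((1 << OFFSET_BITS) - 1))


def split_index(index: int) -> tuple:
    """Split the index into (set, tag); assumes the low 10 bits select the set."""
    return index & ((1 << SET_BITS) - 1), index >> SET_BITS


def pv_block_address(pv_start: int, pc: int, offset: int) -> int:
    """Memory address of the cache block that holds the indexed set."""
    set_index, _tag = split_index(pv_index(pc, offset))
    return pv_start + (set_index << BLOCK_BITS)
```

With 1K sets of one 64-byte block each, the virtualized table occupies 64KB of the physical address space starting at the PV start address.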

44 Simulation Infrastructure
- SimFlex: CMU Impetus
  - Full-system simulator based on Simics
- Base processor configuration
  - 8-wide OoO
  - 256-entry ROB / 64-entry LSQ
  - L1D/L1I 64KB 4-way set-associative
  - UL2 8MB 16-way set-associative
- Commercial workloads
  - TPC-C: DB2 and Oracle
  - TPC-H: Query 1, Query 2, Query 16, Query 17
  - Web: Apache and Zeus

45 SMS – Performance Potential
[Plot with a "better" direction marker]

46 Virtualized Spatial Memory Streaming
- Original prefetcher cost: 60KB
- Virtualized prefetcher cost: <1KB
- Nearly identical performance
[Plot with a "better" direction marker]

47 Impact of Virtualization on L2 Misses

48 Impact of Virtualization on L2 Requests

49 Coarse-Grain Tracking
Jason Zebchuk

