Two Ways to Exploit Multi-Megabyte Caches
AENAO Research Group, Toronto
Kaveh Aasaraai, Ioana Burcea, Myrto Papadopoulou, Elham Safi, Jason Zebchuk, Andreas Moshovos

Two Ways to Exploit Multi-Megabyte Caches
AENAO Research Group, Toronto
Kaveh Aasaraai, Ioana Burcea, Myrto Papadopoulou, Elham Safi, Jason Zebchuk, Andreas Moshovos
{aasaraai, ioana, myrto, elham, zebchuk,

EPFL, Jan. Aenao Group/Toronto.

Future Caches: Just Larger?
[figure: CMP with per-core I$/D$ caches, an interconnect, and main memory; future caches hold 10s to 100s of MB]
1. "Big Picture" management
2. Store metadata

Conventional Block-Centric Cache
- "Small" blocks
  - Optimize bandwidth and performance
  - Especially important for large L2/L3 caches
- Fine-grain view of memory: the big picture is lost
[figure: L2 cache tracking individual blocks]

"Big Picture" View
- Region: a 2^n-sized, aligned area of memory
- Patterns and behavior exposed
  - Spatial locality
- Exploit for performance/area/power
[figure: coarse-grain view of memory in the L2 cache]
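Because a region is a power-of-two-sized, aligned area, region membership is just the high-order bits of the address. A minimal sketch, assuming the 1KB regions used later in the talk (the function names are illustrative, not from the paper):

```python
REGION_BITS = 10                  # 2^10 bytes = 1 KB regions

def region_of(addr):
    # All addresses inside one aligned 1 KB region share these high bits.
    return addr >> REGION_BITS

def same_region(a, b):
    return region_of(a) == region_of(b)
```

This is why a single region entry can summarize many blocks: one tag comparison answers "is any part of region X here?" for the whole aligned range.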

Exploiting Coarse-Grain Patterns
- Many existing coarse-grain optimizations: Stealth Prefetching, Run-time Adaptive Cache Hierarchy Management via Reference Analysis, Destination-Set Prediction, Spatial Memory Streaming, Coarse-Grain Coherence Tracking, RegionScout, Circuit-Switched Coherence
- Each adds new structures to track coarse-grain information: hard to justify for a commercial design
Coarse-Grain Framework
- Embed coarse-grain information in the tag array
- Support many different optimizations with less area overhead
- An adaptable optimization FRAMEWORK

RegionTracker Solution
- Manage blocks, but also track and manage regions
[figure: the RegionTracker replaces the L2 tag array; it handles block requests from the L1, locates data blocks in the data array, and answers region probes with region responses]

RegionTracker Summary
- Replace the conventional tag array:
  - 4-core CMP with 8MB shared L2 cache
  - Within 1% of original performance
  - Up to 20% less tag area
  - Average 33% less energy consumption
- Optimization framework:
  - Stealth Prefetching: same performance, 36% less area
  - RegionScout: 2x more snoops avoided, no area overhead

Road Map
- Introduction
- Goals
- Coarse-Grain Cache Designs
- RegionTracker: A Tag Array Replacement
- RegionTracker: An Optimization Framework
- Conclusion

Goals
1. Conventional tag array functionality
   - Identify data block location and state
   - Leave the data array unchanged
2. Optimization framework functionality
   - Is region X cached?
   - Which blocks of region X are cached? Where?
   - Evict or migrate region X
   - Easy to assign properties to each region

Coarse-Grain Cache Designs: Large Block Size
- Increased bandwidth, decreased hit-rates
[figure: region X occupies one large block in the tag and data arrays]

Sector Cache
- Decreased hit-rates
[figure: region X maps to one sector entry in the tag array covering several data array blocks]

Sector Pool Cache
- High associativity (2 to 4 times higher)
[figure: region X's blocks drawn from a shared pool in the data array]

Decoupled Sector Cache
- Region information not exposed
- Region replacement requires scanning multiple entries
[figure: tag array, data array, and a separate status table]

Design Requirements
- Small block size (64B)
- Miss-rate does not increase
- Lookup associativity does not increase
- No additional access latency (i.e., no scanning, no multiple block evictions)
- Does not increase latency, area, or energy
- Allows banking and interleaving
- Fits in the conventional tag array "envelope"

RegionTracker: A Tag Array Replacement
- Three SRAM arrays, combined smaller than the conventional tag array:
  - Region Vector Array
  - Block Status Table
  - Evicted Region Buffer
[figure: the three arrays sit between the L1 and the data array]

Basic Structures
- Example: 8MB, 16-way set-associative cache, 64-byte blocks, 1KB regions
- An address maps to a specific RVA set and BST set
- An RVA entry maps to multiple, consecutive BST sets
- A BST entry maps to one of four RVA sets
[figure: a Region Vector Array (RVA) entry holds a region tag plus a valid bit and way pointer for each of blocks 0-15; a Block Status Table (BST) entry holds per-block status]

Common Case: Hit
- Example: 8MB, 16-way set-associative cache, 64-byte blocks, 1KB regions
- Address fields: region tag | RVA index | region offset | block offset
[figure: the region tag matches an RVA entry; the selected block's way pointer, combined with the BST index and block offset, locates the block in the data array]
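The address split above can be sketched for the example configuration (64-byte blocks, 1KB regions, so 16 blocks per region). The RVA set count below is an assumption for illustration only; the slide does not state it:

```python
BLOCK_BITS = 6        # 64-byte block -> 6-bit block offset
REGION_OFF_BITS = 4   # 16 blocks per 1 KB region -> 4-bit region offset
RVA_IDX_BITS = 9      # assumed: 512 RVA sets (not given in the talk)

def split_address(addr):
    # Fields, from least to most significant: block offset, region
    # offset (which block within the region), RVA index, region tag.
    block_off = addr & ((1 << BLOCK_BITS) - 1)
    region_off = (addr >> BLOCK_BITS) & ((1 << REGION_OFF_BITS) - 1)
    rva_idx = (addr >> (BLOCK_BITS + REGION_OFF_BITS)) & ((1 << RVA_IDX_BITS) - 1)
    region_tag = addr >> (BLOCK_BITS + REGION_OFF_BITS + RVA_IDX_BITS)
    return region_tag, rva_idx, region_off, block_off
```

On a hit, the region tag selects the RVA entry and the region offset selects the per-block way pointer inside it, so no separate per-block tag comparison is needed.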

Worst Case (Rare): Region Miss
- Address fields: region tag | RVA index | region offset | block offset
[figure: no RVA entry matches the region tag; the victim region is placed in the Evicted Region Buffer (ERB), referenced from the BST by a pointer, while its blocks are evicted]

Methodology
- Flexus simulator from the CMU SimFlex group, based on the Simics full-system simulator
- 4-core CMP modeled after Piranha
  - Private 32KB, 4-way set-associative L1 caches
  - Shared 8MB, 16-way set-associative L2 cache
  - 64-byte blocks
- Miss-rates: functional simulation of 2 billion instructions per core
- Performance and energy: timing simulation using the SMARTS sampling methodology
- Area and power: full-custom implementation in a 130nm commercial technology
- 9 commercial workloads:
  - WEB: SpecWEB on Apache and Zeus
  - OLTP: TPC-C on DB2 and Oracle
  - DSS: 5 TPC-H queries on DB2
[figure: 4 cores with I$/D$ sharing the L2 over an interconnect]

Miss-Rates vs. Area
- Sector cache: 512KB sectors; SPC and RegionTracker: 1KB regions
- Trade-offs comparable to a conventional cache
[figure: relative miss-rate vs. relative tag array area (lower is better on both axes); the sector cache is an outlier at (0.25, 1.26); other design points span 14-way to 52-way configurations]

Performance & Energy
- 12-way set-associative RegionTracker: 20% less area
- Error bars: 95% confidence interval
- Performance within 1%, with 33% tag energy reduction
[figure: normalized execution time and reduction in tag energy per workload]

Road Map
- Introduction
- Goals
- Coarse-Grain Cache Designs
- RegionTracker: A Tag Array Replacement
- RegionTracker: An Optimization Framework
- Conclusion

RegionTracker: An Optimization Framework
[figure: RVA, ERB, and BST alongside the data array]
- Stealth Prefetching: average 20% performance improvement; a drop-in RegionTracker gives 36% less area overhead
- RegionScout: in-depth analysis follows

Snoop Coherence: Common Case
[figure: a CPU's read of x misses and is broadcast; reads of x+1, x+2, ..., x+n snoop all other CPUs as well]
- Many snoops are to non-shared regions

RegionScout
- Eliminate broadcasts for non-shared regions
[figure: each node tracks its locally cached regions; when a read of x observes a region miss at every other node (a global region miss), later accesses to that non-shared region skip the broadcast]
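The filtering idea can be sketched as follows. This is hypothetical simulation code, not the hardware design: the class layout and set-based tables are assumptions, standing in for the RVA-derived "locally cached regions" and "non-shared regions" structures on the slide:

```python
class Node:
    def __init__(self):
        self.cached_regions = set()   # locally cached regions (from the RVA)
        self.non_shared = set()       # regions known not to be cached elsewhere

    def caches_region(self, region):
        return region in self.cached_regions

def read(requester, nodes, region):
    """Return True if a snoop broadcast was needed for this access."""
    if region in requester.non_shared:
        return False                  # non-shared region: go straight to memory
    # Broadcast; if every other node reports a region miss, this was a
    # global region miss, so remember the region as non-shared.
    if not any(n.caches_region(region) for n in nodes if n is not requester):
        requester.non_shared.add(region)
    return True
```

The first access to a non-shared region still broadcasts; it is the subsequent reads of x+1, x+2, ... within the same region that are filtered, which is exactly the common case from the previous slide.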

RegionTracker Implementation
- Minimal overhead to support the RegionScout optimization:
  - Non-shared regions: add 1 bit to each RVA entry
  - Locally cached regions: already provided by the RVA
- Still uses less area than a conventional tag array

RegionTracker + RegionScout
- 4 processors, 512KB L2 caches, 1KB regions
- Avoid 41% of snoop broadcasts with no area overhead compared to a conventional tag array
[figure: reduction in snoop broadcasts per workload (higher is better), including a BlockScout (4KB) comparison point]

Result Summary
- Replace the conventional tag array:
  - 20% less tag area
  - 33% less tag energy
  - Within 1% of original performance
- Coarse-grain optimization framework:
  - 36% reduction in area overhead for Stealth Prefetching
  - Filter 41% of snoop broadcasts with no area overhead compared to a conventional cache

Predictor Virtualization
Ioana Burcea
Joint work with Stephen Somogyi and Babak Falsafi

Predictor Virtualization
[figure: CMP with per-core L1-I/L1-D caches, a shared L2 over the interconnect, and main memory; optimization engines (predictors) attach to every core]

Motivating Trends
- Dedicating resources to predictors is hard to justify:
  - Chip multiprocessors: predictor space is multiplied by the number of processors
  - Larger predictor tables: increased performance
- Memory hierarchies offer the opportunity:
  - Increased capacity
  - How many apps really use the space?
Use conventional memory hierarchies to store predictor information

PV Architecture
[figure: the optimization engine answers each request with a prediction, backed by a dedicated predictor table]

PV Architecture (contd.)
[figure: the dedicated predictor table is replaced by the Predictor Virtualization layer, which serves the same requests]

PV Architecture (contd.)
- Sits on the backside of the L1
[figure: a PVProxy (PVCache, MSHRs, and a PVStart register) services predictions, backed by a PVTable stored in the L2 and main memory]
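The proxy's common-case behavior can be sketched as below. This is hypothetical simulation code: the class layout, dict-based PVCache, and `read_entry` name are illustrative assumptions; only the PVStart/PVCache/PVTable names come from the slide:

```python
class PVProxy:
    def __init__(self, pv_start, memory):
        self.pv_start = pv_start   # PVStart: base address of the in-memory PVTable
        self.pvcache = {}          # tiny PVCache: block address -> block contents
        self.memory = memory       # stand-in for the L2 / main-memory path

    def read_entry(self, index, block_bytes=64):
        # Translate a predictor-table index into a memory address, then
        # serve it from the PVCache; on a miss, fetch the backing cache
        # block through the regular memory hierarchy.
        addr = self.pv_start + index * block_bytes
        if addr not in self.pvcache:
            self.pvcache[addr] = self.memory[addr]
        return self.pvcache[addr]
```

The key design point is that a miss costs an ordinary L2 access rather than dedicated SRAM, which is why the later slides focus on making PVCache hits the common case.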

To Virtualize or Not to Virtualize?
1. Re-use
2. Predictor info prefetching
[figure: the common case stays on-chip; going through the L2/L3 to main memory is infrequent]

To Virtualize or Not?
- Challenge: hit in the PVCache most of the time
- Will not work for all predictors out of the box
- Reuse is necessary
  - Intrinsic: easy to virtualize
  - Non-intrinsic: must be engineered
- More so if the predictor needs to be fast to start with

Will There Be Reuse?
- Intrinsic:
  - Multiple predictions per entry
  - We'll see an example
- Can be engineered:
  - Group temporally correlated entries together in a cache block

Spatial Memory Streaming
- Footprint: the blocks accessed within a memory region
- Predict that next time the footprint will be the same
  - Handle: PC + offset within region
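Footprint recording can be sketched as below, using the configuration from the later packing slide (32 blocks of 64 bytes per spatial group). The dict-based table and function names are illustrative assumptions, not the SMS hardware:

```python
BLOCK = 64
GROUP_BLOCKS = 32
GROUP = BLOCK * GROUP_BLOCKS      # 2 KB spatial group

def handle(pc, addr):
    # Prediction handle: trigger PC + offset of the block within the group.
    return (pc, (addr % GROUP) // BLOCK)

def record_generation(table, trigger_pc, accesses):
    # One spatial generation: set one bit per block touched, keyed by the
    # handle of the generation's first (trigger) access.
    bits = 0
    for a in accesses:
        bits |= 1 << ((a % GROUP) // BLOCK)
    table[handle(trigger_pc, accesses[0])] = bits
```

On the next trigger access with the same handle, the stored bit vector says which blocks to prefetch; one table entry thus yields many predictions, which is the "intrinsic reuse" that makes SMS a good virtualization candidate.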

EPFL, Jan Aenao Group/Toronto Spatial Generations

Virtualizing SMS
[figure: the detector records access patterns; the predictor, on a trigger access, issues prefetches from those patterns; it is the predictor's pattern table that gets virtualized]

Virtualizing SMS
[figure: the virtual pattern table (1K sets, 11-bit tags) maps onto cache blocks that each pack several tag + pattern pairs plus a few unused bits; a small PVCache holds recently used blocks]

Packing Entries in One Cache Block
- Index: PC + offset within spatial group (21 bits)
  - PC: 16 bits
  - 32 blocks in a spatial group: 5-bit offset, 32-bit spatial pattern
- Pattern table: 1K sets
  - 10 bits index the table, leaving an 11-bit tag
- Cache block: 64 bytes
  - 11 entries per cache block
- Result: pattern table of 1K sets, 11-way set-associative
[figure: a cache block laid out as tag/pattern pairs with unused bits at the end]
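The packing arithmetic above can be checked directly: an entry is an 11-bit tag plus a 32-bit spatial pattern, and a 64-byte cache block must hold a whole number of entries.

```python
ENTRY_BITS = 11 + 32               # tag + spatial pattern = 43 bits
BLOCK_BITS = 64 * 8                # 64-byte cache block = 512 bits

entries_per_block = BLOCK_BITS // ENTRY_BITS            # 11 entries fit
unused_bits = BLOCK_BITS - entries_per_block * ENTRY_BITS  # leftover bits
```

Eleven 43-bit entries consume 473 of the 512 bits, which matches the slide's "11 entries per cache block, 11-way set-associative" organization.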

Memory Address Calculation
[figure: the 16-bit PC and 5-bit block offset form the 21-bit index; its low 10 bits select a set, which is scaled by the block size and added to the PVStart address to form the memory address]
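A sketch of that calculation, using the field widths from the packing slide (16-bit PC, 5-bit offset, 1K sets, 64-byte blocks); the function name and exact bit packing are assumptions for illustration:

```python
def pv_address(pv_start, pc, block_offset):
    # Concatenate the 16-bit PC and 5-bit offset into the 21-bit index.
    index = ((pc & 0xFFFF) << 5) | (block_offset & 0x1F)
    set_index = index & 0x3FF      # low 10 bits select one of 1K sets
    tag = index >> 10              # remaining 11 bits are stored as the tag
    # Each set occupies one 64-byte cache block in the virtualized table.
    return pv_start + set_index * 64, tag
```

The returned address is an ordinary physical address, so the PVProxy can fetch the entry through the regular L2/memory path with no dedicated storage beyond the PVCache.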

Simulation Infrastructure
- SimFlex (CMU Impetus): full-system simulator based on Simics
- Base processor configuration:
  - 8-wide OoO, 256-entry ROB, 64-entry LSQ
  - L1D/L1I: 64KB, 4-way set-associative
  - UL2: 8MB, 16-way set-associative
- Commercial workloads:
  - TPC-C: DB2 and Oracle
  - TPC-H: Queries 1, 2, 16, and 17
  - Web: Apache and Zeus

SMS: Performance Potential
[figure: potential speedup of Spatial Memory Streaming per workload]

Virtualized Spatial Memory Streaming
- Original prefetcher cost: 60KB; virtualized prefetcher cost: under 1KB
- Nearly identical performance
[figure: speedup of the original vs. virtualized prefetcher per workload]

EPFL, Jan Aenao Group/Toronto Impact of Virtualization on L2 Misses

EPFL, Jan Aenao Group/Toronto Impact of Virtualization on L2 Requests

Coarse-Grain Tracking Jason Zebchuk