Exploiting Spatial Locality in Data Caches using Spatial Footprints
Sanjeev Kumar, Princeton University
Christopher Wilkerson, MRL, Intel



Spatial Locality in Caches
– Current approach: exploit spatial locality within a cache line
– Small cache lines: lower bandwidth, less pollution
– Big cache lines: exploit more locality, need fewer tags
– Current caches use 32-byte lines
[Figure: an access fetches a whole cache line from memory]

Spatial Locality in Caches (contd.)
– The spatial locality exhibited varies
  » across applications
  » within an application
– Using a fixed-size line to exploit locality
  » inefficient use of cache resources: less than half the data fetched is used, wasting bandwidth and polluting the cache
  » limited benefit from spatial locality

Outline
– Introduction
– Spatial Footprint Predictors
– Practical Considerations
– Future Work
– Related Work
– Conclusions

Spatial Footprint Predictor (SFP)
– Exploit more spatial locality
– Need to reduce pollution
– Fetch words selectively
– Requires accurately predicting the spatial footprint of a block
  » the footprint is kept as a bit vector of the words accessed (see the sketch below)
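The footprint itself is just a small bit vector, one bit per line in a sector. A minimal sketch in Python, using the sector parameters given later in the talk (16 lines per sector); the function names are illustrative, not from the paper:

```python
# Minimal sketch: a spatial footprint as a 16-bit vector, one bit per
# line in a sector (parameters from the talk; names are illustrative).
LINES_PER_SECTOR = 16

def mark_used(footprint: int, line_index: int) -> int:
    """Set the bit for a line that was accessed within the sector."""
    return footprint | (1 << line_index)

def used_lines(footprint: int) -> list:
    """Expand a footprint into the indices of the lines it marks."""
    return [i for i in range(LINES_PER_SECTOR) if footprint & (1 << i)]

fp = 0
for line in (0, 1, 5):        # accesses touch lines 0, 1, and 5
    fp = mark_used(fp, line)
assert used_lines(fp) == [0, 1, 5]
```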

Spatial Footprint Predictor (contd.)
– Record spatial footprints
– Use footprint history to make predictions
  » lookup table indexed by the nominating data address, the nominating instruction address, or a combination of the two
– Targets L1 data caches
[Figure: the nominating access (NA) that misses in the cache indexes the predictor]
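The lookup table can be sketched as a dictionary keyed by a hash of the nominating access. The hash widths and the XOR combination below are illustrative assumptions, not the indexing scheme from the paper:

```python
# Hedged sketch of the three candidate history-table keys on this slide.
TABLE_BITS = 10                                    # 1024-entry table (see later slides)

def key_from_data(addr: int) -> int:
    """Index by the nominating data address (sector-aligned)."""
    return (addr >> 7) & ((1 << TABLE_BITS) - 1)   # drop the 7 in-sector bits

def key_from_inst(pc: int) -> int:
    """Index by the nominating instruction address."""
    return pc & ((1 << TABLE_BITS) - 1)

def key_combined(addr: int, pc: int) -> int:
    """Index by a combination of the two (XOR is an assumption)."""
    return key_from_data(addr) ^ key_from_inst(pc)

history = {}                                       # key -> recorded footprint
history[key_combined(0x4001234, 0x80)] = 0b0000000000100011
```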

Simple Approach
– Use large cache lines, but fetch only specific words
  » leave holes for the words that were not fetched
– Might decrease bandwidth
– Increases the miss ratio
  » missed lines
  » under-utilization of the cache
[Figure: a cache line with holes for unfetched words]

Our Approach
– Regular cache with small lines: 8 bytes, i.e. one word
– Exploit spatial locality at sector granularity: 16 lines, i.e. 128 bytes (the address split is sketched below)
– Spatial Footprint Predictor: fetch 1-16 lines of a sector on a miss
[Figure: a sector groups 16 consecutive cache lines in memory]
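With these parameters an address decomposes into a 3-bit byte offset, a 4-bit line index within the sector, and the sector number. The field layout follows directly from the 8-byte-line, 16-line-sector sizes; everything else in the sketch is illustrative:

```python
# Address split implied by the slide's parameters:
# 8-byte lines (3 offset bits), 16 lines per sector (4 index bits).
LINE_BITS = 3          # log2(8)  -> byte offset within a line
SECTOR_LINE_BITS = 4   # log2(16) -> line index within a sector

def split_address(addr: int):
    byte_offset = addr & ((1 << LINE_BITS) - 1)
    line_in_sector = (addr >> LINE_BITS) & ((1 << SECTOR_LINE_BITS) - 1)
    sector_id = addr >> (LINE_BITS + SECTOR_LINE_BITS)
    return sector_id, line_in_sector, byte_offset

assert split_address(0x1234) == (0x24, 6, 4)
```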

When to Record/Predict Footprints?
– Sectors in memory are either active or inactive
– Active: record footprints
  » a sector becomes active on a cache miss in an inactive sector
– Inactive: use history to predict
  » a sector becomes inactive again on a cache miss to a line already marked used in its footprint
– These results assume infinite-size tables
(The cycle is sketched below.)
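The record/predict cycle can be approximated as a small state machine over sectors. This is a simplified sketch under stated assumptions: only misses drive the footprint, tables are unbounded dictionaries (matching the infinite-table results here), and history is keyed by sector for brevity rather than by the nominating access of the earlier slide:

```python
# Hedged sketch of the activate/record/deactivate cycle on this slide.
active = {}    # sector_id -> footprint being recorded
history = {}   # sector_id -> footprint from the last recording interval

def on_miss(sector_id: int, line: int):
    """Handle a cache miss; returns a predicted footprint or None."""
    fp = active.get(sector_id)
    if fp is None:
        # Miss in an inactive sector: activate it and start recording.
        active[sector_id] = 1 << line
        return history.get(sector_id)      # predict from history, if any
    if fp & (1 << line):
        # Miss on a line already marked used: the recording interval
        # ends; store the footprint and begin a new interval.
        history[sector_id] = fp
        active[sector_id] = 1 << line
        return history[sector_id]
    active[sector_id] = fp | (1 << line)   # still recording
    return None
```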

Recording Footprints
[Diagram: a nominating access (NA) allocates an entry in the Active Sector Table, which records the footprint as the cache is accessed; when the sector is done, its spatial footprint (SF) is stored in the Spatial Footprint History Table]

Predicting Footprints
[Diagram: on a nominating access, the Spatial Footprint History Table supplies the predicted spatial footprint (SF), and the marked lines are fetched from memory into the cache]
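End to end, a miss looks up the predicted footprint and fetches the union of the predicted lines and the demand-missed line. A self-contained sketch; the `fetch_line` callback, the example footprint, and keying by sector are all assumptions:

```python
# Hedged sketch of the prediction path on this slide.
LINES_PER_SECTOR = 16
history = {0x24: 0b0000000000100011}   # example: lines 0, 1, 5 of sector 0x24

def fetch_line(sector_id: int, line: int):
    print(f"fetch sector {sector_id:#x}, line {line}")

def predict_and_fetch(sector_id: int, missing_line: int):
    # Always include the line that actually missed, so the demand miss
    # is satisfied even when the prediction is wrong or absent.
    footprint = history.get(sector_id, 0) | (1 << missing_line)
    for i in range(LINES_PER_SECTOR):
        if footprint & (1 << i):
            fetch_line(sector_id, i)

predict_and_fetch(0x24, 5)   # fetches lines 0, 1, and 5
```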

The Default Footprint Predictor
– Used when the SFP has no prediction
  » no history yet
  » entry evicted from the Spatial Footprint History Table
– Picks a single line size for the application, based on the footprints observed (one possible criterion is sketched below)
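One plausible selection criterion: for each observed footprint, find the smallest aligned block that would have covered it, and pick the most common such size for the whole application. The rule below is an illustrative assumption, not necessarily the paper's:

```python
# Hedged sketch of a default predictor that picks one line size from
# the footprints observed; the selection rule is an assumption.
from collections import Counter

LINES_PER_SECTOR = 16
CANDIDATE_SIZES = (1, 2, 4, 8, 16)   # candidate line sizes, in 8-byte lines

def needed_size(fp: int) -> int:
    """Smallest aligned power-of-two chunk (in lines) covering the footprint."""
    for size in CANDIDATE_SIZES:
        for base in range(0, LINES_PER_SECTOR, size):
            mask = ((1 << size) - 1) << base
            if fp & mask == fp:
                return size
    return LINES_PER_SECTOR

def default_line_size(footprints) -> int:
    return Counter(needed_size(fp) for fp in footprints).most_common(1)[0][0]

observed = [0b11, 0b1100, 0b1]       # two footprints span two lines each
assert default_line_size(observed) == 2
```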

Experimental Setup
– Cache parameters
  » 16 KB L1, 4-way associative
  » 8 bytes per line, 16 lines per sector
– Cache simulator measures
  » miss ratios
  » fetch bandwidth
– 12 Intel MRL traces
  » gcc and go (SPEC)
  » transaction processing
  » web server
  » PC applications (word processors, spreadsheets)
– Results normalized to a 16 KB conventional cache with 32-byte lines

Experimental Evaluation
[Chart: normalized miss ratios]

GCC Comparison
[Chart: miss ratio and bandwidth for gcc across configurations]

GCC Comparison (contd.)
– Comparing the SFP cache to conventional caches with varying line sizes:
  » comparable to the best miss ratio (obtained with 64-byte lines)
  » close to the lowest bandwidth (obtained with 8-byte lines)
– Comparing to a bigger conventional cache:
  » comparable to a 32 KB cache

Outline
– Introduction
– Spatial Footprint Predictors
– Practical Considerations
– Future Work
– Related Work
– Conclusions

Decoupled Sectored Cache
– Seznec et al.: proposed to improve sectored L2 caches
– Decouples the tag array from the data array
  » dynamic mapping: no longer one-to-one
  » multiple lines from the same sector can share a tag
  » flexible: the data and tag arrays can have different sizes and associativities

Practical Considerations
– Reasonably sized Spatial Footprint History Table: 1024 entries (see the sketch below)
– Reduce tag storage by using the Decoupled Sectored Cache
  » same number of tags as a conventional cache with 32-byte lines
  » both the data and tag arrays are 4-way associative
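A 1024-entry table means predictions can be lost to eviction, which is one of the cases the default predictor covers. A minimal sketch of the bounded table; LRU replacement is an assumption, not the paper's organization:

```python
# Hedged sketch of a bounded, 1024-entry footprint history table.
from collections import OrderedDict

class FootprintHistoryTable:
    def __init__(self, entries: int = 1024):
        self.entries = entries
        self.table = OrderedDict()          # key -> footprint, in LRU order

    def store(self, key: int, footprint: int) -> None:
        if key in self.table:
            self.table.move_to_end(key)     # refresh before overwriting
        self.table[key] = footprint
        if len(self.table) > self.entries:
            self.table.popitem(last=False)  # evict the least recently used

    def predict(self, key: int):
        fp = self.table.get(key)
        if fp is not None:
            self.table.move_to_end(key)     # refresh on a successful lookup
        return fp                           # None -> use the default predictor

sfht = FootprintHistoryTable()
sfht.store(0x3AB, 0b0000000000001111)
assert sfht.predict(0x3AB) == 0b0000000000001111
```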

Experimental Evaluation
[Chart: normalized miss ratios for the practical configuration]

Cost
– Additional space: 9 KB
  » can be reduced by using partial tags or compressing footprints
– Time
  » most predictor actions are off the critical path
  » little impact on cache access latency

Future Work
– Improve miss ratios further
  » infinite tables: 30% improvement
  » practical implementation: 18% improvement
– Reduce the additional memory required
– A better coarse-grained predictor
– L2 caches

Related Work
– Przybylski et al., Smith et al.: statically choosing the best line size and fetch size
– Gonzalez et al.: dual data cache separating temporal and spatial locality; targets numeric codes
– Johnson et al.: dynamically pick a line size per block (1 KB)

Conclusions
– Spatial Footprint Predictors
  » decrease the miss ratio (18% on average)
  » reduce bandwidth usage
  » have little impact on cache access latency
– A fine-grain predictor can be used for data caches