Increasing Cache Efficiency by Eliminating Noise Prateek Pujara & Aneesh Aggarwal {prateek, State University of New York, Binghamton

INTRODUCTION Caches are essential for covering the processor-memory performance gap, so they should be utilized as efficiently as possible: fetch only the useful data. Cache Utilization: the percentage of useful words out of the total words fetched into the cache.
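Stated as a formula (a restatement of the definition above):

```latex
\text{Cache Utilization} = \frac{\text{words fetched into the cache that are actually referenced}}{\text{total words fetched into the cache}} \times 100\%
```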

Utilization vs Block Size. Larger cache blocks: increase the bandwidth requirement and reduce utilization. Smaller cache blocks: reduce the bandwidth requirement and increase utilization.

Percent Cache Utilization (16KB, 4-way set-associative cache, 32-byte block size)

Methods to improve utilization: rearrange data/code, dynamically adapt the cache line size, sub-blocking.

Benefits of utilization improvement: lower energy consumption (by avoiding energy wasted on useless words), improved performance (by better utilizing the available cache space), and reduced memory traffic (by not fetching useless words).

Our Goal Improve utilization: predict the to-be-referenced words, and avoid cache pollution by fetching only the predicted words.

Our Contributions Illustrate the high predictability of cache noise. Propose efficient cache noise predictors. Show the potential benefits of cache noise prediction based fetching in terms of cache utilization, cache power consumption, and bandwidth requirement. Illustrate the benefits of cache noise prediction for prefetching. Investigate cache noise prediction as an alternative to sub-blocking.

Cache Noise Prediction Programs repeat their memory reference patterns, so cache noise can be predicted from the history of words accessed in cache blocks.

Cache Noise Predictors 1) Phase Context Predictor (PCP): records the word-usage history of the most recently evicted cache block. 2) Memory Context Predictor (MCP): assumes that data from contiguous memory locations will be accessed in the same fashion. 3) Code Context Predictor (CCP): assumes that instructions in a particular portion of the code will access data in the same fashion.

Cache Noise Predictors For the code context predictor: use the higher-order bits of the PC as the context and store the context along with the cache block. Add two bit vectors to each cache block: one identifying the valid words present, and one recording the access pattern (see the sketch below).
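A minimal C sketch of the per-block state this slide implies; the names and field widths (e.g., WORDS_PER_BLOCK, an 8-bit mask for a 32-byte block of 4-byte words) are illustrative assumptions, not the paper's exact layout:

```c
#include <stdint.h>

#define WORDS_PER_BLOCK 8   /* 32-byte block, 4-byte words */

/* Per-block state: two bit vectors (one bit per word) plus the
 * code context stored alongside the cache block. */
typedef struct {
    uint8_t  valid_words;  /* which words were actually fetched */
    uint8_t  used_words;   /* which fetched words were referenced */
    uint32_t ctx;          /* higher-order bits of the missing PC */
} block_meta_t;
```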

Code Context Predictor (CCP): Example. Say the PC of an instruction is X (100110); the code context is formed from its higher-order bits. The predictor table in the example holds, per entry, a context, a valid bit, and the last word-usage history, with entries for contexts such as Y (101001) and Z (101110). On a miss due to a PC, the table is looked up with that PC's context and only the predicted words are brought into the cache (in one step of the example, only the 1st and 2nd words; in another, only the 1st word). When a cache block is evicted, the entry for the context that brought it in is updated with the block's actual usage (one evicted block had used only its 1st word; another had used its 2nd and 4th words).

Predictability of CCP PCP: 56%, MCP: 67%. Predictability = correct predictions / total misses. No-prediction cases: almost 0%.
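In formula form:

```latex
\text{Predictability} = \frac{\text{correct predictions}}{\text{total misses}}
```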

Improving the Predictability Miss Initiator Based History (MIBH): word-usage history kept per offset of the word that initiated the miss. ORing Previous Two Histories (OPTH): bitwise ORing of the past two histories. (See the sketch below.)
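A sketch of how these refinements might look, continuing the illustrative C above (names such as hist and predict_words are assumptions): the predictor keeps a separate history per miss-initiator word offset (MIBH) and predicts the bitwise OR of the two most recent histories (OPTH).

```c
/* One CCP entry: word-usage histories kept per miss-initiator word
 * offset (MIBH), with the two most recent histories retained (OPTH). */
typedef struct {
    uint32_t ctx;                       /* code context tag */
    int      valid;
    uint8_t  hist[WORDS_PER_BLOCK][2];  /* [MIWO][two most recent histories] */
} ccp_entry_t;

/* OPTH: predict the bitwise OR of the past two usage histories. */
static uint8_t predict_words(const ccp_entry_t *e, int miwo) {
    return (uint8_t)(e->hist[miwo][0] | e->hist[miwo][1]);
}
```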

Predictability of CCP With both MIBH and OPTH, the predictability of PCP and MCP rose to about 68% and 75%, respectively.

CCP Implementation (figure). Each predictor entry holds a context, a valid bit, and per-MIWO word-usage histories (MIWO: Miss Initiator Word Offset). On a lookup, the incoming tag is broadcast over the read/write port and compared against every stored context in parallel to select the matching entry's history. A behavioral sketch of the lookup and update paths follows.
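A behavioral C sketch of those two paths, continuing the illustrative types above; the 16-entry table size comes from the conclusion slide, while the sequential search (parallel comparators in hardware), round-robin allocation, and function names are assumptions:

```c
#define CCP_ENTRIES 16          /* small 16-entry table, per the conclusion */

static ccp_entry_t ccp[CCP_ENTRIES];

/* Miss path: the context is broadcast and compared against every entry
 * (in parallel in hardware); on a match the predicted word mask is read
 * out, otherwise the whole block is fetched. */
uint8_t ccp_lookup(uint32_t ctx, int miwo) {
    for (int i = 0; i < CCP_ENTRIES; i++)
        if (ccp[i].valid && ccp[i].ctx == ctx)
            return predict_words(&ccp[i], miwo);
    return 0xFF;                /* no prediction: fetch all words */
}

/* Eviction path: record the evicted block's usage history against the
 * context that fetched it, shifting out the older history (OPTH). */
void ccp_update(uint32_t ctx, int miwo, uint8_t used_words) {
    static int victim = 0;      /* illustrative round-robin allocation */
    for (int i = 0; i < CCP_ENTRIES; i++) {
        if (ccp[i].valid && ccp[i].ctx == ctx) {
            ccp[i].hist[miwo][1] = ccp[i].hist[miwo][0];
            ccp[i].hist[miwo][0] = used_words;
            return;
        }
    }
    ccp[victim].valid = 1;
    ccp[victim].ctx = ctx;
    ccp[victim].hist[miwo][0] = used_words;
    ccp[victim].hist[miwo][1] = 0;
    victim = (victim + 1) % CCP_ENTRIES;
}
```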

Experimental Setup Noise prediction applied to the L1 data cache. L1 D-cache: 16KB, 4-way set associative, 32-byte blocks. Unified L2 cache: 512KB, 8-way associative, 64-byte blocks. L1 I-cache: 16KB, direct mapped. ROB: instructions; LSB: 64 entries; Issue Queue: 96 Int / 64 FP.

Prediction Accuracies with 32/4, 16/8 & 16/4 CCP configurations (chart)

RESULTS

Percentage Dynamic Energy Savings

Prefetching Processors employ prefetching to improve the cache miss rate: on a miss, the next cache block is also fetched, exploiting spatial locality. The prefetched cache block is predicted to have the same word-usage pattern as the currently fetched block (see the sketch below).
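A sketch of how the predicted pattern might be reused for the prefetched block, continuing the C above; fetch_words is a hypothetical memory-side helper, not from the paper:

```c
/* Hypothetical helper: fetch only the words set in the mask. */
extern void fetch_words(uint32_t block_addr, uint8_t word_mask);

/* Miss handler with next-block prefetching: the prefetched block reuses
 * the word-usage pattern predicted for the demand-missed block. */
void handle_miss(uint32_t block_addr, uint32_t ctx, int miwo) {
    uint8_t mask = ccp_lookup(ctx, miwo);
    fetch_words(block_addr, mask);       /* demand fetch: predicted words */
    fetch_words(block_addr + 1, mask);   /* prefetch next block, same pattern */
}
```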

Prefetching Two predictor-update policies for prefetched blocks. Update: the prefetched cache block updates the predictor table when evicted; it is stored without any context information, and when it is accessed for the first time the context and offset information are recorded. No Update: the prefetched block does not update the predictor table when evicted.

Prediction Accuracy with Prefetching (chart: No Prefetching vs. No Update vs. Update). Energy consumption reduced by about 22%; utilization increased by about 70%; miss rate increased by only about 2%.

Sub-blocking Sub-blocking is used to reduce cache noise and the bandwidth requirement. Its limitation: increased miss rate. Can we use cache noise prediction as an alternative to sub-blocking?

Cache Noise Prediction vs Sub-blocking

Conclusion Cache noise is highly predictable, and we proposed efficient cache noise predictors. CCP achieves a 75% prediction rate, with 97% of predictions correct, using a small 16-entry table. Prediction has no impact on IPC and minimal impact (0.1%) on miss rate, and is very effective with prefetching. Compared to sub-blocking, cache noise prediction based fetching improves miss rate by 97% and utilization by 10%.

QUESTIONS ??? Prateek Pujara & Aneesh Aggarwal {prateek, State University of New York, Binghamton