Benefits of Early Cache Miss Determination — G. Memik, G. Reinman, W.H. Mangione-Smith. Proceedings of the International Symposium on High-Performance Computer Architecture (HPCA'03), pages 307–316.


1/20 Benefits of Early Cache Miss Determination  G. Memik, G. Reinman, W.H. Mangione-Smith  Proceedings of the International Symposium on High-Performance Computer Architecture (HPCA'03), pages 307–316, Feb. 2003

2/20 Abstract  As the performance gap between the processor and the memory subsystem increases, designers are forced to develop new latency hiding techniques. Arguably, the most common technique is to utilize multi-level caches. Each new generation of processors is equipped with higher levels of memory hierarchy, with increasing sizes at each level. In this paper, we propose 5 different techniques that reduce the data access times and power consumption in processors with multi-level caches. Using information about the blocks placed into and replaced from the caches, the techniques quickly determine whether an access at any cache level will be a miss. The accesses that are identified to miss are aborted. The structures used to recognize misses are much smaller than the cache structures. Consequently, the data access times and power consumption are reduced. Using the SimpleScalar simulator, we study the performance of these techniques for a processor with 5 cache levels. The best technique is able to abort 53.1% of the misses on average in SPEC2000 applications. Using these techniques, the execution time of the applications is reduced by up to 12.4% (5.4% on average), and the power consumption of the caches is reduced by as much as 11.6% (3.8% on average).

3/20 What’s the Problem  The fraction of data access time and cache power consumption caused by cache misses grows as the number of levels in a multi-level cache system increases  A great deal of the time and cache power is spent accessing caches that miss On average, in a processor with 5 levels of cache  The misses cause 25.5% of the data access time  The misses cause 18% of the cache power consumption  This motivates the exploration of techniques to minimize the effects of cache misses

4/20 Introduction  Motivating example  If the data will be supplied by the nth-level cache All the cache levels before n will be accessed, causing unnecessary delay and power consumption  The proposed technique of this paper  Identify misses and bypass the access to any cache that will miss Store partial information about the blocks in a cache to determine whether a cache access may hit or will definitely miss  If these misses are known in advance and not performed  The delay of data access is reduced, and the cache power that would be consumed by the misses is saved

5/20 Mostly No Machine (MNM) Overview  When an address is given to the MNM  A miss signal for each cache level (except L1) is generated  The miss signals are propagated with the access through the cache levels The ith miss bit dictates whether the access at level i should be performed or bypassed  Two possible locations where the MNM can be realized  (a) Parallel MNM  The L1 cache and the MNM are accessed in parallel - Advantage: no MNM delay - Disadvantage: the MNM consumes more power  (b) Serial MNM  The MNM is accessed only after the L1 cache misses - Advantage: the MNM consumes less power - Disadvantage: higher data access time (increased by the delay of the MNM)
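
To make the bypass mechanism concrete, here is a minimal C sketch of how a per-level miss-bit vector could steer the hierarchy walk. The function names, the stub behaviors, and the bit-vector encoding are hypothetical illustrations, not the paper's implementation.

```c
/* mnm_walk.c — minimal sketch of an MNM miss-bit vector steering a
 * multi-level cache walk. The stubs below are hypothetical. */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define NUM_LEVELS 5                 /* L1..L5, as in the paper's evaluation */

/* Hypothetical stub: pretend the data lives in L4 only. */
static bool cache_access(int level, uint32_t addr) {
    (void)addr;
    return level == 3;               /* levels are 0-indexed: 3 == L4 */
}

/* Hypothetical MNM stub: suppose it has identified the L2 and L3
 * misses for this address (bit i set => level i+1 certainly misses).
 * Bit 0 stays clear: the MNM never covers L1. */
static uint8_t mnm_lookup(uint32_t addr) {
    (void)addr;
    return 0x06;                     /* bits 1 and 2 => bypass L2 and L3 */
}

static int load(uint32_t addr) {
    uint8_t miss_bits = mnm_lookup(addr);   /* parallel MNM: queried with L1 */
    for (int lvl = 0; lvl < NUM_LEVELS; lvl++) {
        if (miss_bits & (1u << lvl)) {
            printf("L%d bypassed (MNM miss bit set)\n", lvl + 1);
            continue;                /* certain miss: skip the lookup,
                                        saving its time and power */
        }
        if (cache_access(lvl, addr)) {
            printf("L%d hit\n", lvl + 1);
            return lvl;              /* data supplied by this level */
        }
        printf("L%d miss (detected the slow way)\n", lvl + 1);
    }
    return NUM_LEVELS;               /* fell through to main memory */
}

int main(void) {
    load(0x2FC0);
    return 0;
}
```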

6/20 Modification of Cache to Incorporate the MNM  Modification of the cache structure  Extend each cache with logic To detect the miss signal and bypass the access if necessary  Each cache has to inform the MNM about The blocks that are replaced from the cache  This is needed for the bookkeeping performed by the MNM  With a serial MNM, to synchronize the access and the miss signal The request generated by L1 is sent to the MNM, which forwards the request to L2

7/20 Benefits of the MNM Technique  Definitions Cache_hit_time_i : time to access data at cache level i Cache_miss_time_i : time to detect a miss at cache level i  Average data access time, for an access served by cache level k Without MNM: t = \sum_{i=1}^{k-1} Cache_miss_time_i + Cache_hit_time_k With MNM: every level i < k whose miss the MNM identifies is bypassed and contributes 0 instead of Cache_miss_time_i  Abort the access to a cache when the MNM identifies a miss Preventing the time spent accessing caches that will miss => improves data access time
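
As a concrete illustration, with assumed latencies that are not taken from the paper: suppose a miss is detected in 1 cycle at L1 and 4 cycles at L2, and an L3 hit takes 10 cycles. For an access served by L3 where the MNM has identified the L2 miss:

```latex
% Assumed latencies, for illustration only
t_{\text{without MNM}} = \underbrace{1}_{L1\ \text{miss}}
                       + \underbrace{4}_{L2\ \text{miss}}
                       + \underbrace{10}_{L3\ \text{hit}} = 15 \text{ cycles},
\qquad
t_{\text{with MNM}} = 1 + \underbrace{0}_{L2\ \text{bypassed}} + 10 = 11 \text{ cycles}.
```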

8/20 Assumptions of the MNM Techniques  Portion of the address used by the MNM  Block addresses are stored instead of the exact bytes stored in a cache  The MNM does not assume the inclusion property of caches  EX: If cache level i contains a block b, block b is not necessarily contained in cache level i+1  The MNM checks for misses at cache level i+1 even if it cannot identify a miss at cache level i  EX: If the MNM identifies the miss at L3 but cannot identify one at L2, the L2 cache is accessed first

9/20 1. Replacements MNM (RMNM)  Replacements MNM  Stores addresses that have been replaced from a cache Therefore, an access to such an address will miss  Information about the replaced blocks is stored in an RMNM cache The RMNM cache has a block size of (n-i) bits  n : # of separate caches  i : # of level-1 caches (the L1 caches are not covered) ● Each bit in the block corresponds to one cache level other than L1 When the ith bit is set, the block has been replaced from the Li cache

10/20 1. Replacements MNM (RMNM)  Scenario for a 2-level cache  pl. : place block into cache  repl. : replace block from cache  Since there are only two levels of cache Each RMNM block contains a single bit indicating hit/miss for the L2 cache  Block 0x2FC0 is replaced from the L2 cache and placed into the RMNM cache  A later access finds block 0x2FC0 in the RMNM cache and thus identifies an L2 cache miss
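
Below is a minimal C sketch of the RMNM bookkeeping for a 5-level hierarchy, generalizing the scenario above. The direct-mapped organization, sizes, and function names are illustrative assumptions, not the paper's design.

```c
/* rmnm.c — minimal sketch of the Replacements MNM for levels L2..L5.
 * A tiny direct-mapped RMNM cache is assumed for illustration. */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define RMNM_SETS 256

typedef struct {
    bool     valid;
    uint32_t tag;                        /* block address */
    uint8_t  replaced;                   /* bit (i-2) set => replaced from Li */
} rmnm_entry;

static rmnm_entry rmnm[RMNM_SETS];

static rmnm_entry *lookup(uint32_t block_addr) {
    return &rmnm[block_addr % RMNM_SETS];
}

/* Called by level i (2..5) when it evicts block_addr. */
void rmnm_on_replace(int level, uint32_t block_addr) {
    rmnm_entry *e = lookup(block_addr);
    if (!e->valid || e->tag != block_addr) {   /* (re)allocate the entry */
        e->valid = true;
        e->tag = block_addr;
        e->replaced = 0;
    }
    e->replaced |= 1u << (level - 2);
}

/* Called by level i when it places block_addr: the block is present
 * again, so it can no longer be declared a certain miss. */
void rmnm_on_place(int level, uint32_t block_addr) {
    rmnm_entry *e = lookup(block_addr);
    if (e->valid && e->tag == block_addr)
        e->replaced &= ~(1u << (level - 2));
}

/* True => level i will certainly miss on block_addr. */
bool rmnm_certain_miss(int level, uint32_t block_addr) {
    rmnm_entry *e = lookup(block_addr);
    return e->valid && e->tag == block_addr &&
           (e->replaced & (1u << (level - 2)));
}

int main(void) {
    rmnm_on_replace(2, 0x2FC0);          /* 0x2FC0 evicted from L2 */
    printf("L2 miss certain? %d\n", rmnm_certain_miss(2, 0x2FC0)); /* 1 */
    rmnm_on_place(2, 0x2FC0);            /* brought back into L2 */
    printf("L2 miss certain? %d\n", rmnm_certain_miss(2, 0x2FC0)); /* 0 */
    return 0;
}
```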

11/20 2. Sum MNM (SMNM)  Sum MNM  Stores hash values of the block addresses in the cache When a block is placed into the cache, the block address is hashed and the resulting hash value is stored  The hash function gathers information about which bit positions of the address are high  When an access is checked, its hash value is compared with the hash values of the existing cache blocks ● If it matches any of them: the access is performed ● Else: a miss is identified and the cache access is bypassed

12/20 2. Sum MNM (SMNM)  The SMNM configuration is denoted sum_width x replication Sum_width : width of the bit field examined by each checker Replication : # of parallel checkers implemented  SMNM example: SMNM_10x2 2 parallel checkers, each checking a different portion of the block address (each bit field is 10 bits long)  If there are multiple checkers The first one examines the least significant bits The second one examines the bits starting from the 7th rightmost bit The third one examines the bits starting from the 13th rightmost bit
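
Below is a hedged C sketch of an SMNM_10x2 arrangement. The slides do not give the exact hash function, so this sketch assumes, purely for illustration, that each checker's hash is the population count of its 10-bit field, kept in counters so replacements can be undone; the paper's actual hash differs in detail.

```c
/* smnm.c — sketch of SMNM_10x2 (two parallel checkers, 10-bit fields).
 * The popcount hash is an assumption for illustration only. */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define SUM_WIDTH   10
#define REPLICATION 2
#define MAX_HASH    (SUM_WIDTH + 1)      /* popcount of 10 bits: 0..10 */

/* Field offsets per the slide: checker 0 starts at the least
 * significant bit, checker 1 at the 7th rightmost bit. */
static const int offset[REPLICATION] = { 0, 6 };

/* counters[c][h]: how many cached blocks hash to h under checker c. */
static unsigned counters[REPLICATION][MAX_HASH];

static unsigned hash_field(int c, uint32_t block_addr) {
    uint32_t field = (block_addr >> offset[c]) & ((1u << SUM_WIDTH) - 1);
    return (unsigned)__builtin_popcount(field);   /* assumed hash (GCC/Clang) */
}

void smnm_on_place(uint32_t block_addr) {
    for (int c = 0; c < REPLICATION; c++)
        counters[c][hash_field(c, block_addr)]++;
}

void smnm_on_replace(uint32_t block_addr) {
    for (int c = 0; c < REPLICATION; c++)
        counters[c][hash_field(c, block_addr)]--;
}

/* If any checker has no cached block with a matching hash, the block
 * cannot be in the cache: the miss is certain. */
bool smnm_certain_miss(uint32_t block_addr) {
    for (int c = 0; c < REPLICATION; c++)
        if (counters[c][hash_field(c, block_addr)] == 0)
            return true;
    return false;                        /* "maybe hit": perform the access */
}

int main(void) {
    smnm_on_place(0x2FC0);
    printf("0x2FC0 certain miss? %d\n", smnm_certain_miss(0x2FC0)); /* 0 */
    printf("0x1234 certain miss? %d\n", smnm_certain_miss(0x1234)); /* 1: checker 1 sees no match */
    return 0;
}
```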

13/20 3. Table MNM (TMNM)  Table MNM  Stores the least significant N bits of each block address in the cache  The values are stored in the TMNM table, an array of size 2^N Locations corresponding to the addresses stored in the cache are set to ‘0’; the remaining locations are set to ‘1’ The least significant N bits of the accessed address are used to index the TMNM table, and the value stored at that location is used as the miss signal  Example TMNM for N = 6  The cache in the example holds only 2 blocks  When a request reaches the MNM, the corresponding bit position is read If the location is high, the access will miss and can be bypassed

14/20 3. Table MNM (TMNM)  Several block addresses can map to the same position in the TMNM table  Therefore the TMNM table holds counters instead of single bits When a block is placed into the cache  The corresponding counter is incremented, unless it is saturated When a block is replaced from the cache  The corresponding counter is decremented, unless it is saturated (a saturated counter has lost the exact count)  The TMNM configuration is denoted TMNM_N x replication N : # of bits checked by each table (the least significant N bits of each cached block address) Replication : # of tables examining different portions of the address
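
Here is a minimal C sketch of one TMNM table with saturating counters, following the bookkeeping described above; the counter width and the TMNM_6x1 configuration are assumed for illustration.

```c
/* tmnm.c — sketch of TMNM_6x1 (one table, N = 6 low-order bits).
 * The 2-bit counter width is an assumed parameter. */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define N        6
#define TABLE_SZ (1u << N)
#define CTR_MAX  3                       /* assumed 2-bit saturating counters */

static uint8_t table[TABLE_SZ];          /* 0 => no cached block maps here */

static unsigned idx(uint32_t block_addr) {
    return block_addr & (TABLE_SZ - 1);  /* least significant N bits */
}

/* Block placed into the cache: bump the counter unless it has
 * saturated (a saturated counter has lost the exact count forever). */
void tmnm_on_place(uint32_t block_addr) {
    uint8_t *c = &table[idx(block_addr)];
    if (*c < CTR_MAX)
        (*c)++;
}

/* Block replaced: decrement unless saturated, for the same reason. */
void tmnm_on_replace(uint32_t block_addr) {
    uint8_t *c = &table[idx(block_addr)];
    if (*c > 0 && *c < CTR_MAX)
        (*c)--;
}

/* A counter of zero proves no cached block shares these N bits, so
 * the miss signal is asserted and the access can be bypassed. */
bool tmnm_certain_miss(uint32_t block_addr) {
    return table[idx(block_addr)] == 0;
}

int main(void) {
    tmnm_on_place(0x2FC0);
    printf("0x2FC0 miss? %d\n", tmnm_certain_miss(0x2FC0));  /* 0: maybe hit */
    printf("0x2FC1 miss? %d\n", tmnm_certain_miss(0x2FC1));  /* 1: bypass    */
    tmnm_on_replace(0x2FC0);
    printf("0x2FC0 miss? %d\n", tmnm_certain_miss(0x2FC0));  /* 1 again      */
    return 0;
}
```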

15/20 4. Common Address MNM (CMNM)  Common address MNM  Captures the common values of block addresses by examining the most significant bits of the address  The virtual tag finder has K registers  They store the most significant portions of the cached block addresses  During an access  The most significant (32-m) bits of the address are compared to the values in the virtual tag finder If they match any of the existing values The index of the matching register is attached to the remaining m bits of the examined address And used to address the CMNM table

16/20 4. Common Address MNM (CMNM)  When an address is checked, there are two ways to identify a miss  First, the (32-m) most significant bits of the address are presented to the virtual tag finder If they match none of the register values in the virtual tag finder  The access is marked as a miss  Second, if a register matches, its index is attached to the remaining m bits of the address to access the CMNM table If the corresponding position holds the value ‘1’  Again a miss is indicated  The CMNM configuration is denoted CMNM_k x m  k : # of registers in the virtual tag finder  m : # of least significant bits of the examined address used to index the table
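
A hedged C sketch of CMNM_4x8 follows. The slides describe the table entries as miss bits (value ‘1’ means miss); this sketch instead keeps counters with zero meaning miss, mirroring the TMNM bookkeeping so placements and replacements can be tracked, and it simplifies register allocation by assuming at most k distinct high-order patterns ever occur. All of that is illustrative, not the paper's exact design.

```c
/* cmnm.c — sketch of CMNM_4x8 (k = 4 virtual-tag registers, m = 8).
 * Counter-based table and trivial register allocation are assumed. */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define K 4
#define M 8
#define HIGH(a) ((a) >> M)               /* (32-m) most significant bits */
#define LOW(a)  ((a) & ((1u << M) - 1))  /* remaining m bits */

static uint32_t vtag[K];                 /* virtual tag finder registers */
static bool     vtag_valid[K];
static uint8_t  table[K << M];           /* counters, 0 => certain miss */

static int find_reg(uint32_t addr) {
    for (int i = 0; i < K; i++)
        if (vtag_valid[i] && vtag[i] == HIGH(addr))
            return i;
    return -1;
}

void cmnm_on_place(uint32_t block_addr) {
    int r = find_reg(block_addr);
    if (r < 0) {                         /* allocate a free register */
        for (int i = 0; i < K && r < 0; i++)
            if (!vtag_valid[i]) {
                vtag_valid[i] = true;
                vtag[i] = HIGH(block_addr);
                r = i;
            }
        if (r < 0)
            return;                      /* out of registers: a real design
                                            must evict a register to stay sound */
    }
    uint8_t *c = &table[((unsigned)r << M) | LOW(block_addr)];
    if (*c < 255) (*c)++;
}

void cmnm_on_replace(uint32_t block_addr) {
    int r = find_reg(block_addr);
    if (r < 0) return;
    uint8_t *c = &table[((unsigned)r << M) | LOW(block_addr)];
    if (*c > 0) (*c)--;
}

/* Two ways to prove a miss: no register matches the high bits, or the
 * indexed table entry shows no cached block with this low pattern. */
bool cmnm_certain_miss(uint32_t block_addr) {
    int r = find_reg(block_addr);
    if (r < 0)
        return true;                     /* high bits never seen in cache */
    return table[((unsigned)r << M) | LOW(block_addr)] == 0;
}

int main(void) {
    cmnm_on_place(0x2FC0);
    printf("%d\n", cmnm_certain_miss(0x2FC0));   /* 0: maybe hit */
    printf("%d\n", cmnm_certain_miss(0x2FC1));   /* 1: low bits differ */
    printf("%d\n", cmnm_certain_miss(0xDEAD00)); /* 1: unknown high bits */
    return 0;
}
```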

17/20 Discussion of the MNM Techniques  The MNM techniques  Never incorrectly indicate that bypassing should be used  But do not detect every opportunity for bypassing ● If the MNM indicates a miss Then the block certainly does not exist in the cache ● If the MNM output is “maybe hit” Then the access might still miss in the cache  The miss signal must be reliable  Because the cost of indicating that an access will miss when the data is actually in the cache would be high A redundant access to a higher level of the memory hierarchy would have to be performed  The cost of a hit misindication is comparatively low A redundant tag comparison at the cache

18/20 Improvement in Execution Time  To eliminate the delay of the MNM, we perform these simulations with the parallel MNM  The HMNM4 technique reduces the execution time by as much as 12.4% and by 5.4% on average HMNM is the hybrid MNM, which combines all the techniques to increase the number of misses identified  The perfect MNM reduces the execution time by as much as 25.0% and by 10.0% on average The perfect MNM identifies all the misses and hence bypasses every cache miss

19/20 Reduction in Cache Power Consumption  To achieve the maximum power reduction, we perform simulations with the serial MNM  The HMNM4 reduces the cache power consumption by as much as 11.6% and by 3.8% on average  The perfect MNM reduces the cache power consumption by as much as 37.6% and 10.2% on average

20/20 Conclusions  Proposed techniques to identify misses at the different cache levels  When an access is identified as a miss, it is bypassed directly to the next cache level Thereby reducing the delay and power consumption associated with misses  In total, 5 different techniques to recognize some of the cache misses were presented  For the hybrid MNM technique The execution time is reduced by 5.4% on average (ranging from 0.6% to 12.4%) The cache power consumption is reduced by 3.8% on average (ranging from 0.4% to 11.6%)