An Intelligent Cache System with Hardware Prefetching for High Performance. Jung-Hoon Lee; Seh-woong Jeong; Shin-Dug Kim; Weems, C.C. IEEE Transactions on Computers.


An Intelligent Cache System with Hardware Prefetching for High Performance. Jung-Hoon Lee, Seh-woong Jeong, Shin-Dug Kim, and C. C. Weems. IEEE Transactions on Computers, Volume 52, Issue 5, May 2003.

Slide 2: Abstract

In this paper, we present a high performance cache structure with a hardware prefetching mechanism that enhances exploitation of spatial and temporal locality. The proposed cache, which we call a Selective-Mode Intelligent (SMI) cache, consists of three parts: a direct-mapped cache with a small block size, a fully associative spatial buffer with a large block size, and a hardware prefetching unit. Temporal locality is exploited by selectively moving small blocks into the direct-mapped cache after monitoring their activity in the spatial buffer for a time period. Spatial locality is enhanced by intelligently prefetching a neighboring block when a spatial buffer hit occurs.

The overhead of this prefetching operation is shown to be negligible. We also show that the prefetch operation is highly accurate: over 90 percent of all prefetches generated are for blocks that are subsequently accessed. Our results show that the system enables the cache size to be reduced by a factor of four to eight relative to a conventional direct-mapped cache while maintaining similar performance. Also, the SMI cache can reduce the miss ratio by around 20 percent and the average memory access time by 10 percent, compared with a victim-buffer cache configuration.

Slide 3: What's the Problem

- Most cache systems tend to emphasize only one of spatial or temporal locality
  - The two place contradictory requirements on the hardware structure
- Existing hardware prefetching mechanisms often carry high overhead
  - The prefetch generation rate is frequently high, leading to:
    - Higher power consumption
    - Increased memory cycles per instruction (MCPI)

Slide 4: Related Work

- The stream cache (direct-mapped cache + additional small buffer)
  - Besides the miss data, several consecutive words are prefetched into the stream buffer
- The victim cache (direct-mapped cache + additional small buffer)
  - The victim buffer holds blocks that are discarded from the main cache, reducing conflict misses
- The selective victim cache (direct-mapped cache + additional small buffer)
  - Places incoming blocks in the main cache or the victim buffer based on their history of use
- The assist cache (direct-mapped cache + additional small buffer)
  - Blocks are first loaded into the assist buffer and promoted into the main cache only if they exhibit temporal locality
  - Temporal locality detection is provided statically by the compiler
- These designs use different associativities with the same block size

Slide 5: Related Work (cont.)

- The selective cache (temporal cache + spatial cache)
  - Uses a locality prediction table to decide whether the requested data has temporal or spatial locality
  - Data may reside in just one of the two subcaches, depending on the predicted type of locality for a given memory access
- The split temporal/spatial cache (STS)
  - At compile time, data accesses are classified as having either temporal or spatial locality, and tagged for one of these caches
- Both designs focus on how to detect the type of locality and how to handle a reference based on the predicted locality
- These designs use the same associativity with different block sizes

Slide 6: The Proposed SMI Cache System

- The SMI cache is constructed from three parts:
  - A direct-mapped cache with a small block size
    - Exploits temporal locality (increases the number of blocks in the cache)
  - A fully associative spatial buffer with a large block size
    - Exploits spatial locality
    - The large block size is a multiple of the small block size
  - A hardware prefetching unit
- The SMI cache exploits both types of locality by determining whether a block in the spatial buffer also has temporal locality:
  1) Load the large block containing the missed small block into the spatial buffer
  2) Monitor whether the missed block shows strong temporal locality while it is resident in the spatial buffer
  3) When the large block is replaced from the spatial buffer, move the small blocks that showed temporal locality into the direct-mapped cache
- Storing such blocks in the direct-mapped cache extends their lifetime, so temporal locality is enhanced

Slide 7: SMI Cache System Structure

[Block diagram; annotations from the figure:]
- Hit (H) bits distinguish referenced small blocks from unreferenced ones
- The prefetch controller determines whether to generate a prefetch operation
- The tag of the (i+1)th large block is generated; a prefetch proceeds only if that block is not in the spatial buffer
- Bank selection activates just one of the banks

Slide 8: Basic Operation of the SMI Cache

- The main cache and the spatial buffer are searched in parallel
  - Hit in the main cache (direct-mapped cache): handled as a hit in a conventional L1 cache
  - Miss in the main cache, but hit in the spatial buffer: the corresponding small block is fetched from the spatial buffer and its hit bit is set
  - Miss in both the main cache and the spatial buffer: a large block is fetched into the spatial buffer
- Temporal locality is enhanced by:
  - Increasing the number of blocks in the cache
  - Extending the lifetime of small blocks that exhibit temporal locality, by storing them in the direct-mapped cache
- Spatial locality is enhanced by:
  - Fetching a large block on a miss
  - Intelligently prefetching data that exhibit spatial locality
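The lookup and promotion-on-replacement behavior described above can be sketched as a toy model. Everything here (sizes, names, the two-entry buffer) is our own illustrative simplification, not the paper's implementation:

```python
SMALL, RATIO = 8, 4      # 8-byte small blocks; 4 small blocks per large block
MAIN_SETS = 16           # entries in the direct-mapped main cache (toy size)
SB_ENTRIES = 2           # entries in the fully associative spatial buffer

main = {}                # index -> tag (direct-mapped, small blocks)
spatial = []             # FIFO list of [large_block_tag, hit_bits]

def access(addr):
    """Classify a reference; on a global miss, fetch the enclosing large block."""
    small_blk = addr // SMALL
    index, tag = small_blk % MAIN_SETS, small_blk // MAIN_SETS
    if main.get(index) == tag:
        return "main-hit"
    large_tag = small_blk // RATIO
    for entry in spatial:
        if entry[0] == large_tag:
            entry[1][small_blk % RATIO] = True    # set the hit (H) bit
            return "spatial-hit"
    if len(spatial) == SB_ENTRIES:                # buffer full: FIFO replacement
        old_tag, bits = spatial.pop(0)
        for i, h in enumerate(bits):
            if h:                                 # promote referenced small blocks
                sb = old_tag * RATIO + i
                main[sb % MAIN_SETS] = sb // MAIN_SETS
    spatial.append([large_tag, [False] * RATIO])
    spatial[-1][1][small_blk % RATIO] = True      # missed block counts as referenced
    return "global-miss"
```

A small block touched while resident in the spatial buffer survives that entry's FIFO eviction by moving into the direct-mapped cache, which is the mechanism the slide describes.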

Slide 9: Operational Model in the Case of Cache Hits

- Hit in the main cache
  - Read hit: transmit the requested data to the CPU without delay
  - Write hit: perform the write and set the dirty bit for the block
- Hit in the spatial buffer
  - The corresponding small block is sent to the CPU and its hit bit (H) is set
  - The prefetch controller generates a prefetch signal when:
    - A large block is accessed
    - Its prefetch bit (P) is not set
    - Multiple hit bits (H) in the large block are set
  - Two operations are then performed by the prefetch controller:
    1) Search the tags of the spatial buffer (one cycle penalty)
       - If the (i+1)th large block is already in the spatial buffer, stop the prefetch and set the P bit
       - If the (i+1)th large block is not in the spatial buffer, the second operation is performed
    2) Prefetch the (i+1)th large block into the prefetch buffer and set the P bit of the ith large block
- Invariant: if the P bit is set, the consecutive large block must be present in either the spatial buffer or the prefetch buffer
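The generation condition and the two-step controller action above can be condensed into a small decision function. The argument layout and return strings are our own illustrative choices:

```python
def prefetch_step(tag, hit_bits, p_bit, spatial_tags, threshold=4):
    """Decide the prefetch action for large block `tag` on a spatial buffer hit.
    `threshold` is the number of H bits required (2 for Prefetch-2, 4 for Prefetch-4)."""
    if p_bit or sum(hit_bits) < threshold:
        return "no-prefetch"
    if tag + 1 in spatial_tags:    # one-cycle tag search of the spatial buffer
        return "set-P-only"        # block i+1 already resident: just set the P bit
    return "prefetch"              # fetch block i+1 into the prefetch buffer; set P
```

Lowering the threshold to 2 makes the condition fire more often, which matches the slide's later observation that Prefetch-2 trades accuracy for prefetch frequency.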

Slide 10: Operational Model in the Case of Cache Misses

- Miss in both the main cache and the spatial buffer
  - Bring the large block containing the missed small block into the spatial buffer; two cases are possible:
    - The spatial buffer is not full: simply allocate an entry
    - The spatial buffer is full: the oldest entry is replaced according to a FIFO policy, and the small blocks in that entry whose hit bits are set are moved into the main cache
- Example of the move operation between the two caches (8KB main cache with 8-byte blocks, 1KB spatial buffer with 32-byte blocks, 32-bit addresses):
  - In the main cache, the tag is 19 bits, the index is 10 bits, and the offset is 3 bits
  - In the spatial buffer, the tag is 27 bits and the offset is 5 bits
  - The four small blocks within a large block correspond to the two-bit offsets '00', '01', '10', and '11'
  - If, say, only the hit bit of the first small block is set, the two-bit offset '00' is appended to the spatial buffer's 27-bit tag
  - The resulting 29-bit value (a memory address without the byte offset) is split into the 19-bit tag and 10-bit index for the main cache
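The address arithmetic in the example can be checked directly. The function name is ours, but the field widths (27-bit spatial buffer tag, 2-bit small-block offset, 19-bit main cache tag, 10-bit index) come from the slide:

```python
def main_cache_fields(spatial_tag, j):
    """Rebuild the main cache (tag, index) for small block j (0..3) of the
    large block named by the spatial buffer's 27-bit tag."""
    small_blk = (spatial_tag << 2) | j   # 27-bit tag + 2-bit offset = 29 bits
    index = small_blk & 0x3FF            # low 10 bits  -> main cache index
    tag = small_blk >> 10                # high 19 bits -> main cache tag
    return tag, index
```

The 3-bit byte offset within the 8-byte small block is unaffected by the move, so it does not appear in the computation.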

Slide 11: Avoiding Data Incoherence Between the Subcaches

- Mechanism that avoids incoherence:
  - When a global miss occurs, the tags of the main cache are searched to detect whether any of the small blocks being fetched are already in the main cache
  - If a match is detected:
    1) The corresponding small blocks in the main cache are invalidated
    2) Any dirty small blocks among them are used to update the entry in the spatial buffer
- [Figure: a small block that was earlier copied into the main cache (while its large block was replaced from the spatial buffer) and modified there is invalidated; its modified copy overwrites the original, stale copy inside the large block being fetched, and the valid copy is thereafter referenced from the spatial buffer]
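A minimal sketch of that update rule, with hypothetical data structures (a dict of small blocks per incoming large block, and a dirty flag per main cache copy):

```python
def merge_incoming(large_block, main_cache):
    """large_block: {small_blk: data} being fetched on a global miss.
    main_cache: {small_blk: (data, dirty)}. Matching main cache copies are
    invalidated; dirty ones overwrite the stale data in the large block."""
    for sb in list(large_block):
        if sb in main_cache:
            data, dirty = main_cache.pop(sb)   # invalidate the main cache copy
            if dirty:
                large_block[sb] = data         # modified copy wins
    return large_block
```

After the merge, exactly one valid copy of each small block exists, in the spatial buffer, which is the invariant the slide relies on.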

Slide 12: Transfer From the Prefetch Buffer to the Spatial Buffer

- A large block in the prefetch buffer is transferred to the spatial buffer when a global miss occurs, while the cache controller is handling the miss
  - The transfer time can therefore be hidden
- However, the missed block may already be in the prefetch buffer
  - The tag of the prefetched block is compared with the generated miss address during the transfer
  - If they match:
    1) The requested data is transferred to the CPU and the spatial buffer
    2) The cache controller cancels the ongoing miss signal
- If a global miss occurs while a prefetch operation is in progress, miss handling is deferred until the ongoing prefetch completes

Slide 13: Observations on the SMI Cache

- Write-backs never occur from the spatial buffer
  - Any referenced small block is moved to the main cache, so write-backs occur only from the main cache
  - Write-backs involve only 8-byte small blocks, effectively reducing write traffic to memory
- The following three operations incur no additional delay, because they are performed while the cache controller is handling a miss:
  - Moving small blocks that exhibit temporal locality into the main cache
  - Searching the tags of the main cache (for cache coherence)
  - Transferring a large block from the prefetch buffer to the spatial buffer

Slide 14: Threshold for Prefetch Generation

- The threshold is the number of hit bits that must be set before a prefetch is generated
- Prefetch-2 (threshold of two) achieves a larger performance gain but greater overhead, due to its increased prefetch frequency
- Prefetch-4 (threshold of four) achieves the highest prefetch accuracy
- Note: prefetch generation does not imply that the prefetch actually occurs

Slide 15: Prefetch Overhead

- When a prefetch operation is performed, the tags of the spatial buffer must be searched, adding one cycle
  - Total of two cycles: the normal access cycle plus the search cycle
- Prefetch-2 tends to have greater search overhead
  - The overhead grows because the block to prefetch is more often already in the spatial buffer
- [Figure: three cases are distinguished:]
  - Case A: P bit not set, next block not in the spatial buffer (the prefetch actually occurs)
  - Case B: P bit not set, next block already in the spatial buffer
  - Case C: P bit set

Slide 16: Rate of Prefetch Operations That Actually Occur

- Prefetch-2 has a higher rate of prefetches that actually occur
  - This rate declines because, with its higher prefetch generation rate, the block to prefetch more often already exists in the spatial buffer
- Prefetch-4 has a higher rate of prefetched blocks that are actually referenced
  - This rate rises because, with its lower prefetch generation rate, fewer prefetches actually occur
- The prefetching accuracy of Prefetch-4 is over 90%

Slide 17: Comparison of a Conventional Cache With the SMI Cache

- Assume the SMI cache operates in the Prefetch-4 configuration
- The average miss ratio of an 8KB SMI cache in nonprefetching mode equals that of a 32KB direct-mapped cache
  - The cache size is reduced by a factor of four
- The average miss ratio of an 8KB SMI cache in prefetching mode equals that of a 64KB direct-mapped cache
  - The cache size is reduced by a factor of eight

Slide 18: Comparison of a Victim Cache With the SMI Cache

- In the victim cache, a victim buffer hit causes a content swap between the main cache and the victim buffer (one cycle penalty)
  - On a miss in both the main cache and the victim buffer, the swap penalty can be hidden
- The SMI cache (8KB main cache with 8-byte blocks, 1KB spatial buffer with 32-byte blocks) has better performance than an 8KB victim cache with a 1KB victim buffer (32-byte blocks)
- The SMI cache can further reduce write traffic to memory, due to its use of a smaller block size

Slide 19: Relation Between Cost and Performance

- The SMI cache shows about 60% and 80% area reductions compared with 32KB and 64KB direct-mapped caches, respectively
- The SMI cache size can be reduced by a factor of four to eight relative to a direct-mapped cache while maintaining similar performance
- The SMI cache reduces the miss ratio by around 20% and the average memory access time by around 10% versus the victim cache
- [Figure: the SMI cache is marked as the best configuration in performance]

Slide 20: Conclusions

- Proposed a simple, high performance, low cost cache system that exploits both types of locality effectively:
  - A direct-mapped cache with a small block size, for exploiting temporal locality
  - A fully associative spatial buffer with a large block size, for exploiting spatial locality
  - An intelligent hardware prefetching mechanism, for enhancing spatial locality
- The SMI cache overcomes the structural drawbacks of direct-mapped caches (e.g., conflict misses and thrashing)