Slide 1: Miss Stride Buffer
Department of Information and Computer Science, University of California, Irvine
Copyright 1998 UC, Irvine

Slide 2: Introduction
We present a new technique to eliminate conflict misses in the cache. A Miss History Buffer records the addresses of recent misses; from it we calculate the miss stride and predict which addresses will miss again, then prefetch those addresses into the cache. Experiments show the technique is very effective at eliminating conflict misses in some applications, and it incurs little increase in memory bandwidth.

Slide 3: Overview
Importance of cache performance
Techniques to reduce cache misses
Our approach: the Miss Stride Buffer
Experiments
Discussion

Slide 4: Importance of Cache Performance
Growing disparity between processor and memory speed
Cache misses:
– Compulsory
– Capacity
– Conflict
Increasing cache miss penalty on faster machines

Slide 5: Techniques to Reduce Cache Misses
All rely on some prediction about the pattern of misses:
– Victim cache
– Stream buffer
– Stride prefetch

Slide 6: Victim Cache
Mainly used to eliminate conflict misses
Prediction: the memory address of a replaced cache line is likely to be accessed again in the near future
Scenarios where the prediction is effective: false sharing, ugly address mappings
Architecture implementation: an on-chip buffer stores the contents of recently replaced cache lines
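
A minimal software model of the lookup path may make this concrete. It is a sketch only: the direct-mapped L1 and the sizes CACHE_LINES and VICTIM_ENTRIES are illustrative assumptions, not parameters from the slides.

#include <stdint.h>
#include <stdio.h>

#define CACHE_LINES    256   /* direct-mapped L1, illustrative size  */
#define VICTIM_ENTRIES 4     /* small fully associative victim cache */

static uint32_t l1_tag[CACHE_LINES];
static int      l1_valid[CACHE_LINES];
static uint32_t victim_tag[VICTIM_ENTRIES];
static int      victim_valid[VICTIM_ENTRIES];
static int      victim_next;          /* FIFO replacement pointer */

/* Access one cache line (addresses are line numbers, not bytes).
 * Returns 1 on a hit in L1 or the victim cache, 0 on a true miss. */
static int access_line(uint32_t line)
{
    uint32_t set = line % CACHE_LINES;

    if (l1_valid[set] && l1_tag[set] == line)
        return 1;                                   /* L1 hit */

    for (int i = 0; i < VICTIM_ENTRIES; i++) {
        if (victim_valid[i] && victim_tag[i] == line) {
            /* Victim hit: swap the line back into L1; the displaced
             * L1 line takes its place in the victim entry.          */
            uint32_t displaced = l1_tag[set];
            int had = l1_valid[set];
            l1_tag[set] = line;
            l1_valid[set] = 1;
            victim_tag[i] = displaced;
            victim_valid[i] = had;
            return 1;
        }
    }

    /* True miss: the replaced L1 line is saved in the victim cache. */
    if (l1_valid[set]) {
        victim_tag[victim_next] = l1_tag[set];
        victim_valid[victim_next] = 1;
        victim_next = (victim_next + 1) % VICTIM_ENTRIES;
    }
    l1_tag[set] = line;
    l1_valid[set] = 1;
    return 0;
}

int main(void)
{
    /* Two lines that map to the same set ping-pong; the victim cache
     * turns every access after the first two into a hit.            */
    uint32_t ping[] = { 7, 7 + CACHE_LINES, 7, 7 + CACHE_LINES };
    for (int i = 0; i < 4; i++)
        printf("line %u: %s\n", (unsigned)ping[i],
               access_line(ping[i]) ? "hit" : "miss");
    return 0;
}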

Slide 7: [figure; not captured in the transcript]

Slide 8: Drawbacks of the Victim Cache
Ugly mappings can be rectified by a cache-aware compiler
Because the victim cache is small, the probability that a memory address is reused within such a short period is very low
Experiments show the victim cache is not effective

Slide 9: Stream Buffer
Mainly used to eliminate compulsory/capacity misses
Prediction: if a memory address misses, the consecutive address is likely to miss in the near future
Scenario where the prediction is useful: stream access
Architecture implementation: when an address misses, prefetch the consecutive addresses into an on-chip buffer; when there is a hit in the stream buffer, prefetch the address consecutive to the hit address
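
A sketch of the fill-and-hit policy just described, assuming a single four-entry buffer (the names sb_allocate and sb_lookup are illustrative; a real design fills the entries asynchronously from memory):

#include <stdint.h>
#include <stdio.h>

#define SB_DEPTH 4            /* stream buffer entries (illustrative) */

static uint32_t sb[SB_DEPTH]; /* queued prefetched line addresses */
static int      sb_head, sb_valid;

/* On a miss that also misses the stream buffer, start a new stream:
 * queue prefetches for the next SB_DEPTH consecutive lines.         */
static void sb_allocate(uint32_t miss_line)
{
    for (int i = 0; i < SB_DEPTH; i++)
        sb[i] = miss_line + 1 + i;
    sb_head = 0;
    sb_valid = 1;
}

/* Check the head of the stream buffer on an L1 miss. On a hit the
 * line moves to L1 and one more consecutive line is prefetched.     */
static int sb_lookup(uint32_t line)
{
    if (!sb_valid || sb[sb_head] != line)
        return 0;                        /* not the next stream line */
    sb[sb_head] = line + SB_DEPTH;       /* prefetch a new tail line */
    sb_head = (sb_head + 1) % SB_DEPTH;
    return 1;
}

int main(void)
{
    sb_allocate(100);                    /* first miss: line 100 */
    for (uint32_t line = 101; line <= 107; line++)
        printf("line %u: %s\n", (unsigned)line,
               sb_lookup(line) ? "stream-buffer hit" : "miss");
    return 0;
}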

Slide 10: [figure; not captured in the transcript]

Slide 11: Stream Cache
A modification of the stream buffer
Uses a separate cache for stream data to prevent pollution of the L1 cache
When there is a hit in the stream buffer, the hit line is moved into the stream cache instead of the L1 cache

Slide 12: [figure; not captured in the transcript]

Slide 13: Stride Prefetch
Mainly used to eliminate compulsory/capacity misses
Prediction: if a memory address misses, the address offset from it by some fixed distance (the stride) is likely to miss in the near future
Scenario where the prediction is useful: strided access
Architecture implementation: when an address misses, prefetch the address offset by the stride from the missed address; when there is a hit in the buffer, also prefetch the address offset by the stride from the hit address
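
A minimal sketch of the prediction, assuming a single global last-address/last-stride pair (many real designs instead keep a table indexed by the missing load's PC):

#include <stdint.h>
#include <stdio.h>

/* Detect a constant stride between consecutive misses; once the same
 * stride is seen twice in a row, predict the next miss address.
 * Returns the address to prefetch, or 0 when there is no prediction. */
static uint32_t on_miss(uint32_t addr)
{
    static uint32_t last_addr;
    static int32_t  last_stride;
    int32_t  stride  = (int32_t)(addr - last_addr);
    uint32_t predict = 0;

    if (last_addr != 0 && stride != 0 && stride == last_stride)
        predict = addr + stride;    /* stride confirmed: run ahead */

    last_addr   = addr;
    last_stride = stride;
    return predict;
}

int main(void)
{
    /* A miss stream with a 1028-byte stride, e.g. walking a matrix
     * column as in the experiment later in the deck.               */
    for (uint32_t a = 0x1000; a < 0x1000 + 5 * 1028; a += 1028) {
        uint32_t p = on_miss(a);
        if (p)
            printf("miss at 0x%x -> prefetch 0x%x\n",
                   (unsigned)a, (unsigned)p);
    }
    return 0;
}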

Slide 14: Miss Stride Buffer
Mainly used to eliminate conflict misses
Prediction: if a memory address misses again after N other misses, it is likely to miss yet again after another N misses
Scenarios where the prediction is useful:
– multiple loop nests
– variables or array elements reused across iterations (a hypothetical example follows)
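
A hypothetical fragment with exactly this shape: each x[i] is reused on every iteration of the outer loop, but the sweep over big[] keeps evicting it, so the same address misses once per outer iteration with a nearly constant number of other misses in between. (The array names and sizes are invented for illustration.)

#include <stdio.h>

#define STEPS 4
#define M     8
#define BIG   (1 << 20)   /* large enough to sweep x[] out of the cache */

static double x[M];
static double big[BIG];

int main(void)
{
    double acc = 0.0;
    for (int t = 0; t < STEPS; t++) {
        for (int i = 0; i < M; i++)
            acc += x[i];     /* x[i] reused every outer iteration...  */
        for (int j = 0; j < BIG; j++)
            acc += big[j];   /* ...but this sweep keeps evicting it   */
    }
    printf("%f\n", acc);
    return 0;
}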

Slide 15: Advantages over the Victim Cache
Eliminates conflict misses that even a cache-aware compiler cannot eliminate:
– ugly mappings are comparatively few and can be rectified
– far more conflicts are effectively random: from a probability perspective, a given memory address will eventually conflict with some other address, but we cannot know at compile time which address it will conflict with
Tolerates a much longer period before the conflicting address is reused:
– the victim cache's small size cannot cover such long reuse intervals

Slide 16: Architecture Implementation
Miss history buffer (MHB):
– a FIFO buffer recording recently missed memory addresses
– predicts only when there is a hit in the buffer
– the miss stride is calculated from the relative positions of consecutive misses to the same address
– the size of the buffer determines the number of predictions
Prefetch buffer (on-chip):
– stores the contents of prefetched memory addresses
– its size determines how much variation in the miss stride can be tolerated

Slide 17: Architecture Implementation (continued)
Prefetch scheduler:
– selects the right time to prefetch
– avoids collisions
Prefetcher:
– fetches the contents of the predicted miss address into the on-chip prefetch buffer
A combined sketch of slides 16 and 17 follows.
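
The following software sketch models the MHB, the prediction, and a simple scheduling rule together. It is a sketch only: MHB_SIZE, PB_SIZE, and the LEAD margin (issue the prefetch a couple of misses before the predicted re-miss) are illustrative assumptions; the slides do not give concrete parameters.

#include <stdint.h>
#include <stdio.h>

#define MHB_SIZE 16  /* miss history FIFO depth                    */
#define PB_SIZE   8  /* pending predictions / prefetch buffer size */
#define LEAD      2  /* issue the prefetch this many misses early  */

static uint32_t mhb[MHB_SIZE];      /* recently missed line addresses */
static unsigned mhb_len, mhb_pos;

struct pending { uint32_t line; unsigned countdown; };
static struct pending pend[PB_SIZE];
static unsigned npend;

/* Search the FIFO for an earlier miss of the same line; the miss
 * stride is how many misses ago that occurrence was. Then record
 * the new miss. Returns 0 when there is no MHB hit.              */
static unsigned mhb_lookup_insert(uint32_t line)
{
    unsigned stride = 0;
    for (unsigned i = 0; i < mhb_len && stride == 0; i++) {
        unsigned idx = (mhb_pos + MHB_SIZE - 1 - i) % MHB_SIZE;
        if (mhb[idx] == line)
            stride = i + 1;
    }
    mhb[mhb_pos] = line;
    mhb_pos = (mhb_pos + 1) % MHB_SIZE;
    if (mhb_len < MHB_SIZE)
        mhb_len++;
    return stride;
}

/* Called on every cache miss: age the pending predictions (the
 * scheduler fires when one comes within LEAD misses of its predicted
 * re-miss), then consult the MHB for a new prediction.              */
static void on_cache_miss(uint32_t line)
{
    for (unsigned i = 0; i < npend; i++)
        if (pend[i].countdown && --pend[i].countdown == LEAD)
            printf("prefetch line %u into the prefetch buffer\n",
                   (unsigned)pend[i].line);

    unsigned stride = mhb_lookup_insert(line);
    if (stride && npend < PB_SIZE) {
        pend[npend].line = line;
        pend[npend].countdown = stride;  /* expect a re-miss then */
        npend++;
    }
}

int main(void)
{
    /* A periodic miss stream: lines 1..5 miss over and over, so each
     * line misses again after four intervening misses.              */
    for (int rep = 0; rep < 3; rep++)
        for (uint32_t line = 1; line <= 5; line++)
            on_cache_miss(line);
    return 0;
}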

Slides 18–19: [figures; not captured in the transcript]

Slide 21: Experiment
Application: matrix multiply (N = 257)

#define N 257

int main(void)
{
    /* static: the three arrays total about 774 KB, too large for the stack */
    static int a[N][N], b[N][N], c[N][N];
    int i, j, k, sum;

    for (i = 0; i < N; i++)
        for (j = 0; j < N; j++) {
            b[i][j] = 1;
            c[i][j] = 1;
        }

    for (i = 0; i < N; i++)
        for (j = 0; j < N; j++) {
            sum = 0;
            for (k = 0; k < N; k++)
                sum += b[i][k] + c[k][j];
            a[i][j] = sum;
        }
    return 0;
}

The inner loop walks a column of c with a 1028-byte stride while the row b[i][*] is reused across iterations of j, so the same lines are repeatedly evicted and re-missed after a near-constant number of intervening misses; presumably this periodic conflict pattern is why the benchmark was chosen.

Slides 22–26: [figures, presumably the experimental results; not captured in the transcript]

Slide 27: Discussion
Effectiveness depends on the hit ratio in the MHB
Combine with blocking to increase the MHB hit ratio (see the sketch below)
Use together with a victim cache:
– long-term vs. short-term memory address reuse
Use with other miss-elimination techniques:
– they decrease the number of misses seen by the MHB, which is equivalent to increasing the MHB's size
– more accurate predictions
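
For instance, a blocked (tiled) version of the matrix-multiply kernel from the experiment shortens reuse distances, so repeated misses to the same line fall within the MHB's limited window. This is an illustrative sketch; the tile size B is an assumption to be tuned to the cache.

#include <stdio.h>

#define N 257
#define B 32    /* tile size: illustrative, tune to the cache */

static int a[N][N], b[N][N], c[N][N];

int main(void)
{
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            b[i][j] = 1;
            c[i][j] = 1;
            a[i][j] = 0;
        }

    /* Same computation as the experiment kernel, but tiled over j and
     * k so that the touched lines of b and c are revisited soon.     */
    for (int jj = 0; jj < N; jj += B)
        for (int kk = 0; kk < N; kk += B)
            for (int i = 0; i < N; i++)
                for (int k = kk; k < kk + B && k < N; k++)
                    for (int j = jj; j < jj + B && j < N; j++)
                        a[i][j] += b[i][k] + c[k][j];

    printf("a[0][0] = %d\n", a[0][0]);   /* expect 2*N = 514 */
    return 0;
}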

Slide 28: Discussion (continued)
Reconfiguration:
– the miss stride prefetch buffer, victim cache, and stream buffer share one large buffer, dynamically partitioned
– a conflict counter recognizes the recent cache-miss pattern: conflict-dominated or not