Cache Performance (CSE 471, Autumn 01)

Cache Performance
- CPI contributed by cache: CPIc = miss rate * number of cycles to handle the miss
- Another important metric: average memory access time (AMAT) = cache hit time * hit rate + miss penalty * (1 - hit rate)
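A minimal sketch of these two formulas in C. The hit rate, miss penalty, and per-reference interpretation of the miss rate below are illustrative assumptions, not numbers from the slides:

```c
#include <stdio.h>

int main(void) {
    /* Illustrative numbers only; not taken from the slides. */
    double hit_time     = 1.0;   /* cycles for a cache hit              */
    double miss_penalty = 40.0;  /* cycles to handle a miss             */
    double hit_rate     = 0.95;  /* fraction of references that hit     */
    double miss_rate    = 0.05;  /* here taken per memory reference     */

    /* AMAT = hit time * hit rate + miss penalty * (1 - hit rate) */
    double amat = hit_time * hit_rate + miss_penalty * (1.0 - hit_rate);

    /* CPI contributed by the cache = miss rate * cycles to handle the miss
       (assuming one memory reference per instruction for simplicity). */
    double cpi_c = miss_rate * miss_penalty;

    printf("AMAT  = %.2f cycles\n", amat);
    printf("CPI_c = %.2f cycles per instruction\n", cpi_c);
    return 0;
}
```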

Improving Cache Performance
To improve cache performance:
- Decrease the miss rate without increasing the time to handle the miss (more precisely: without increasing average memory access time)
- Decrease the time to handle the miss without increasing the miss rate
A slew of techniques, hardware and/or software:
- Increase capacity, associativity, etc.
- Hardware assists (victim caches, write buffers, etc.)
- Tolerating memory latency: prefetching (hardware and software), lock-up free caches
- O.S. interaction: mapping of virtual pages to decrease cache conflicts
- Compiler interaction: code and data placement; tiling

Obvious Solutions to Decrease Miss Rate
- Increase cache capacity
  - Yes, but the larger the cache, the slower the access time
  - Limitations for first-level (L1) on-chip caches; solution: cache hierarchies (even on-chip)
  - Increasing L2 capacity can be detrimental on multiprocessor systems because of the increase in coherence misses
- Increase cache associativity
  - Yes, but "law of diminishing returns" (after 4-way for small caches; not sure of the limit for large caches)
  - More comparisons needed, i.e., more logic and therefore a longer time to check for hit/miss
  - Make the cache look more associative than it really is (see later)

What about Cache Block Size?
- For a given application, cache capacity and associativity, there is an optimal cache block size
- Long cache blocks
  - Good for spatial locality (code, vectors)
  - Reduce compulsory misses (implicit prefetching)
  - But take more time to bring in from the next level of the memory hierarchy (can be compensated by "critical word first" and subblocks)
  - Increase the possibility of fragmentation (only a fraction of the block is used, or reused)
  - Increase the possibility of false sharing in multiprocessor systems

What about Cache Block Size? (cont'd)
- In general, the larger the cache, the larger the best block size (e.g., 32 or 64 bytes for on-chip caches; 64, 128 or even 256 bytes for large off-chip caches)
- Longer block sizes in I-caches
  - Sequentiality of code
  - Matching with the IF unit

Example of (Naïve) Analysis
- Choice between 32B (miss rate m32) and 64B (miss rate m64) block sizes
- Time of a miss = send request + access time + transfer time
  - Send request: independent of block size (e.g., 4 cycles)
  - Access time: can be considered independent of block size thanks to memory interleaving (e.g., 28 cycles)
  - Transfer time: depends on bus width; for a bus width of, say, 64 bits, the transfer time for 64B (say 16 cycles) is twice that for 32B (8 cycles)
- In this example, 32B is better if (4 + 28 + 8) * m32 < (4 + 28 + 16) * m64

Example of (Naïve) Analysis (cont'd)
- Case 1, 16KB cache: m32 = 2.87, m64 = 2.64, and 2.87 * 40 < 2.64 * 48
- Case 2, 64KB cache: m32 = 1.35, m64 = 1.06, and 1.35 * 40 > 1.06 * 48
- So 32B is better for the 16KB cache and 64B is better for the 64KB cache
- (Of course the example was designed to support the "in general, the larger the cache, the larger the best block size" statement of two slides ago)
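A worked version of this comparison in C, plugging in the cycle counts and miss rates from the two slides above (the formula and the numbers are the slides'; only the small program around them is added):

```c
#include <stdio.h>

/* Naive 32B-vs-64B block comparison from the slides:
   miss cost ~ miss rate * (send request + access time + transfer time). */
static double miss_cost(double miss_rate, int transfer_cycles) {
    const int send = 4, access_cycles = 28;
    return miss_rate * (send + access_cycles + transfer_cycles);
}

int main(void) {
    /* Case 1: 16KB cache */
    printf("16KB: 32B -> %.1f, 64B -> %.1f (32B wins)\n",
           miss_cost(2.87, 8), miss_cost(2.64, 16));
    /* Case 2: 64KB cache */
    printf("64KB: 32B -> %.1f, 64B -> %.1f (64B wins)\n",
           miss_cost(1.35, 8), miss_cost(1.06, 16));
    return 0;
}
```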

Impact of Associativity
- "Old" conventional wisdom
  - Direct-mapped caches are faster; cache access is the bottleneck for on-chip L1, so make L1 caches direct-mapped
  - For on-board (L2) caches, direct-mapped caches are 10% faster
- "New" conventional wisdom
  - Can make 2-way set-associative caches fast enough for L1; allows larger caches to be addressed with only the page offset bits (see later)
  - Time-wise it does not seem to make much difference for L2/L3 caches, hence provide more associativity (but if caches are extremely large there might not be much benefit)

Reducing Cache Misses with More "Associativity": Victim Caches
- First example (in this course) of a "hardware assist"
- Victim cache: a small fully-associative buffer "behind" the L1 cache and "before" the L2 cache
  - Of course one can also exist "behind" L2 and "before" main memory
- Main goal: remove some of the conflict misses in L1 direct-mapped caches (or any cache with low associativity)

[Diagram: victim cache organization. 1. Hit in the cache (index + tag). 2. Miss in L1, hit in the victim cache: send data to the register and swap. 3. Miss in both: fill from the next level of the memory hierarchy; 3'. the evicted L1 block goes to the victim cache.]

Operation of a Victim Cache
1. Hit in L1: nothing else needed
2. Miss in L1 for the block at location b, hit in the victim cache at location v: swap the contents of b and v (takes an extra cycle)
3. Miss in L1, miss in the victim cache: load the missing item from the next level and put it in L1; put the entry replaced in L1 into the victim cache; if the victim cache is full, evict one of its entries
A victim buffer of 4 to 8 entries works well for a 32KB direct-mapped cache
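A toy C model of these three cases, tracking tags only. The cache and victim-cache sizes, the FIFO replacement in the victim cache, and the example addresses are all assumptions for illustration:

```c
#include <stdio.h>
#include <stdint.h>

/* Assumed toy geometry: 64-block direct-mapped L1, 32B blocks,
   4-entry fully-associative victim cache with FIFO replacement. */
#define L1_SETS    64
#define VC_ENTRIES 4
#define BLOCK_BITS 5

static int64_t l1[L1_SETS];      /* tag per set, -1 = invalid */
static int64_t vc[VC_ENTRIES];   /* tags in the victim cache  */
static int vc_next;              /* FIFO replacement pointer  */

static void init(void) {
    for (int i = 0; i < L1_SETS; i++) l1[i] = -1;
    for (int i = 0; i < VC_ENTRIES; i++) vc[i] = -1;
}

/* Returns 1 = L1 hit, 2 = victim-cache hit (swap), 3 = miss in both. */
static int cache_access(uint64_t addr) {
    uint64_t block = addr >> BLOCK_BITS;
    int      set   = (int)(block % L1_SETS);
    int64_t  tag   = (int64_t)(block / L1_SETS);

    if (l1[set] == tag) return 1;               /* case 1: hit in L1 */

    for (int i = 0; i < VC_ENTRIES; i++) {
        if (vc[i] == tag) {                     /* case 2: hit in VC */
            int64_t evicted = l1[set];          /* swap L1 and VC entries */
            l1[set] = tag;
            vc[i]   = evicted;
            return 2;
        }
    }
    /* case 3: miss everywhere - fill L1, push the replaced block into the VC */
    vc[vc_next] = l1[set];
    vc_next = (vc_next + 1) % VC_ENTRIES;
    l1[set] = tag;
    return 3;
}

int main(void) {
    init();
    /* Two addresses that map to the same set (a classic conflict pair):
       once both are resident, they ping-pong via the victim cache. */
    uint64_t a = 0x0000, b = 0x0800;
    const char *what[] = {"", "L1 hit", "VC hit (swap)", "miss"};
    for (int i = 0; i < 6; i++) {
        uint64_t addr = (i % 2) ? b : a;
        printf("access %#06llx -> %s\n",
               (unsigned long long)addr, what[cache_access(addr)]);
    }
    return 0;
}
```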

Bringing More Associativity: Column-Associative Caches
- Split (conceptually) a direct-mapped cache into two halves
- Probe the first half according to the index; on a hit, proceed normally
- On a miss, probe the second half; if it hits, send the data to the register and swap it with the entry in the first half (takes an extra cycle)
- On a miss in both halves, go to the next level, load into the second half, and swap

Skewed-Associative Caches
- Use different mappings for the two (or more) banks of the set-associative cache
- The first mapping is conventional; the second one "disperses" the addresses (XOR a few bits)
- The hit ratio of a 2-way skewed cache is as good as that of a 4-way conventional cache
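One possible pair of index functions for a 2-bank skewed-associative cache, sketched in C. The particular XOR of higher block-number bits, the 64-set size, and the example block numbers are assumptions; real designs use more carefully chosen skewing functions:

```c
#include <stdint.h>
#include <stdio.h>

#define SETS 64   /* assumed: 64 sets per bank */

/* Bank 0: conventional index. Bank 1: XOR in higher bits of the block
   number so blocks that conflict in bank 0 usually spread out in bank 1. */
static unsigned index_bank0(uint64_t block) {
    return (unsigned)(block % SETS);
}
static unsigned index_bank1(uint64_t block) {
    return (unsigned)((block ^ (block >> 6)) % SETS);
}

int main(void) {
    /* Three blocks that all collide in bank 0 but not in bank 1. */
    uint64_t blocks[] = {5, 5 + 64, 5 + 128};
    for (int i = 0; i < 3; i++)
        printf("block %3llu: bank0 set %2u, bank1 set %2u\n",
               (unsigned long long)blocks[i],
               index_bank0(blocks[i]), index_bank1(blocks[i]));
    return 0;
}
```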

Reducing Conflicts: Page Coloring
- Interaction of the O.S. with the hardware
- In caches where cache size > page size * associativity, bits of the physical address (besides the page offset) are needed for the index
- On a page fault, the O.S. selects a mapping such that it tries to minimize conflicts in the cache

Options for Page Coloring
- Option 1: assume the faulting process is using the whole cache
  - Attempt to map the page such that the cache will access data as if it were indexed by virtual addresses
- Option 2: do the same thing, but hash with bits of the PID (process identification number)
  - Reduces inter-process conflicts (e.g., prevents pages corresponding to the stacks of various processes from mapping to the same area in the cache)
- Implemented by keeping "bins" of free pages
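A C sketch of the bin-keeping idea under assumed parameters (a 256KB 2-way cache with 4KB pages gives 32 colors; the free-list structure and the simple PID hash are illustrative, not the slides' specification of either option):

```c
#include <stdio.h>
#include <stdlib.h>

/* Assumed geometry: colors = cache size / (page size * associativity) = 32. */
enum { CACHE_SIZE = 256 * 1024, PAGE_SIZE = 4096, ASSOC = 2 };
enum { NUM_COLORS = CACHE_SIZE / (PAGE_SIZE * ASSOC) };

/* The O.S. keeps one bin (free list) of physical page frames per color. */
typedef struct frame { unsigned pfn; struct frame *next; } frame_t;
static frame_t *bins[NUM_COLORS];

static unsigned color_of_pfn(unsigned pfn) { return pfn % NUM_COLORS; }

static void free_frame(unsigned pfn) {
    frame_t *f = malloc(sizeof *f);
    f->pfn = pfn;
    f->next = bins[color_of_pfn(pfn)];
    bins[color_of_pfn(pfn)] = f;
}

/* Option 1: pick a frame whose color matches the virtual page number, so the
   physical and virtual addresses index the cache identically.
   Option 2: also hash in the PID to spread different processes apart. */
static int alloc_colored(unsigned vpn, unsigned pid, int use_pid) {
    unsigned want = (vpn + (use_pid ? pid : 0)) % NUM_COLORS;
    for (unsigned i = 0; i < NUM_COLORS; i++) {   /* fall back to nearby bins */
        unsigned c = (want + i) % NUM_COLORS;
        if (bins[c]) {
            frame_t *f = bins[c];
            bins[c] = f->next;
            int pfn = (int)f->pfn;
            free(f);
            return pfn;
        }
    }
    return -1;   /* out of free frames */
}

int main(void) {
    for (unsigned pfn = 0; pfn < 256; pfn++) free_frame(pfn);
    printf("vpn 7 -> pfn %d (option 1)\n", alloc_colored(7, 42, 0));
    printf("vpn 7 -> pfn %d (option 2, pid 42)\n", alloc_colored(7, 42, 1));
    return 0;
}
```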

Tolerating/Hiding Memory Latency
- One particular technique: prefetching
- Goal: bring data into the cache just in time for its use
  - Not too early, otherwise cache pollution
  - Not too late, otherwise "hit-wait" cycles
- Under the constraints of (among others):
  - Imprecise knowledge of the instruction stream
  - Imprecise knowledge of the data stream
- Hardware/software prefetching
  - Works well for regular, strided data accesses
  - Difficult when there are pointer-based accesses

Why, What, When, Where
- Why? cf. the goals: hide memory latency and/or reduce cache misses
- What? Ideally a semantic object; practically a cache block, or a sequence of cache blocks
- When? Ideally, just in time; practically, depends on the prefetching technique
- Where? In the cache or in a prefetch buffer

Hardware Prefetching
- Nextline prefetching for instructions
  - Bring in the missing block and the next one (if not already there)
- OBL ("one block look-ahead") for data prefetching
  - Like nextline but with more variations, e.g., depending on whether the prefetch was successful the previous time
- Use of special assists:
  - Stream buffers, i.e., FIFO queues that fetch consecutive lines (good for instructions, not as good for data)
  - Stream buffers with hardware stride-detection mechanisms
  - Use of a reference prediction table, etc.
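A small C sketch of one of these assists: a reference prediction table indexed by the load's PC that detects a constant stride and issues a prefetch once the stride has repeated. The table size, confidence rule, and the example trace are illustrative assumptions:

```c
#include <stdio.h>
#include <stdint.h>

/* Per-load-PC entry: last address seen, last stride, and a confidence flag. */
typedef struct {
    uint64_t pc, last_addr;
    int64_t  stride;
    int      confident;   /* the stride has repeated at least once */
    int      valid;
} rpt_entry_t;

#define RPT_SIZE 16
static rpt_entry_t rpt[RPT_SIZE];

/* Called on every load; returns an address to prefetch, or 0 for none. */
static uint64_t rpt_access(uint64_t pc, uint64_t addr) {
    rpt_entry_t *e = &rpt[(pc >> 2) % RPT_SIZE];
    uint64_t prefetch = 0;

    if (e->valid && e->pc == pc) {
        int64_t stride = (int64_t)(addr - e->last_addr);
        if (stride == e->stride && stride != 0) {
            e->confident = 1;
            prefetch = addr + (uint64_t)stride;   /* predict the next access */
        } else {
            e->stride = stride;                   /* retrain on the new stride */
            e->confident = 0;
        }
    } else {
        e->pc = pc; e->stride = 0; e->confident = 0; e->valid = 1;
    }
    e->last_addr = addr;
    return e->confident ? prefetch : 0;
}

int main(void) {
    /* One load (pc 0x400) marching through an array with stride 8. */
    for (int i = 0; i < 6; i++) {
        uint64_t addr = 0x10000 + (uint64_t)i * 8;
        uint64_t pf = rpt_access(0x400, addr);
        if (pf)
            printf("load %#07llx -> prefetch %#llx\n",
                   (unsigned long long)addr, (unsigned long long)pf);
        else
            printf("load %#07llx -> no prefetch yet\n",
                   (unsigned long long)addr);
    }
    return 0;
}
```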

Software Prefetching
- Use of special instructions (cache hints: touch in PowerPC, load into register 31 on Alpha, prefetch instructions in recent micros)
- Non-binding prefetch (in contrast with proposals to prefetch into registers): if an exception occurs, the prefetch is ignored
- Must be inserted by software (compiler analysis)
- Advantage: no special hardware
- Drawback: more instructions executed
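As an example of what the compiler (or programmer) inserts, GCC and Clang expose a non-binding prefetch through the __builtin_prefetch builtin. The prefetch distance of 16 iterations below is an arbitrary illustration; the right distance depends on memory latency and on the work per iteration:

```c
#include <stdio.h>

/* Sum an array, prefetching a fixed distance ahead of the current index.
   __builtin_prefetch(addr, rw, locality) is a GCC/Clang builtin; the
   resulting prefetch does not fault, so it is non-binding. */
static double sum(const double *a, int n) {
    const int dist = 16;              /* assumed prefetch distance */
    double s = 0.0;
    for (int i = 0; i < n; i++) {
        if (i + dist < n)
            __builtin_prefetch(&a[i + dist], 0 /* read */, 1 /* low reuse */);
        s += a[i];
    }
    return s;
}

int main(void) {
    static double a[1024];
    for (int i = 0; i < 1024; i++) a[i] = 1.0;
    printf("sum = %.1f\n", sum(a, 1024));
    return 0;
}
```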

Techniques to Reduce the Cache Miss Penalty
- Give priority to reads -> write buffers
- Send the requested word first -> critical word first, or wrap-around, strategy
- Sectored (subblock) caches
- Lock-up free (non-blocking) caches
- Cache hierarchies

Write Policies
- Loads (reads) occur twice as often as stores (writes)
- The miss ratios of reads and writes are pretty much the same
- Although it is more important to optimize read performance, write performance should not be neglected
- Write misses can be delayed without impeding the progress of execution of subsequent instructions

A Sample of Write Mechanisms
- Fetch-on-write and write-allocate
  - Proceed as on a read miss followed by a write hit; the preferred method for write-back caches
- No-write-before-hit, no-fetch-on-write and no-write-allocate (called "write-around")
  - The cache is not modified on write misses
- No-fetch-on-write and write-allocate ("write-validate")
  - Write directly into the cache and invalidate all other parts of the line being written; requires a valid bit per writeable entity
  - Good for initializing a data structure
  - Arguably the best policy for eliminating write misses in write-through caches, but more expensive (dirty bits)

Write Mechanisms (cont'd)
- Write-before-hit, no-fetch-on-write and no-write-allocate (called "write-invalidate")
  - Possible only for direct-mapped write-through caches
  - The data is written into the cache fully in parallel with checking the tag; on a miss, the rest of the line is invalidated, as in the write-validate case
- A mix of these policies can be (and has been) implemented
  - Dictate the policy on a page-per-page basis (bits set in the page table entry)
  - Have the compiler generate instructions (hints) for dynamic changes in policy (e.g., write-validate on initialization of a data structure)
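A compact C sketch of how a one-byte write miss is steered under three of these policies. The single-line cache model, the per-byte valid bits, and the function names are illustrative only, not the slides' definitions:

```c
#include <stdio.h>
#include <string.h>
#include <stdint.h>

typedef enum { WRITE_AROUND, WRITE_VALIDATE, FETCH_ON_WRITE } policy_t;

enum { BLOCK = 32 };
typedef struct {
    uint8_t data[BLOCK];
    uint8_t valid[BLOCK];   /* per-byte valid bits (needed by write-validate) */
    int     present;
} line_t;

static void fetch_from_next_level(line_t *line) {
    memset(line->data, 0xAB, BLOCK);   /* stand-in for the real memory data */
    memset(line->valid, 1, BLOCK);
    line->present = 1;
}

/* Handle a one-byte write that missed in the cache under a given policy. */
static void write_miss(line_t *line, policy_t p, int offset, uint8_t value) {
    switch (p) {
    case WRITE_AROUND:      /* no-write-allocate: the cache is not modified */
        printf("write-around:   byte goes to the next level only\n");
        break;
    case WRITE_VALIDATE:    /* write-allocate, no fetch: validate only what we wrote */
        memset(line->valid, 0, BLOCK);
        line->data[offset] = value;
        line->valid[offset] = 1;
        line->present = 1;
        printf("write-validate: byte %d written, other bytes invalid\n", offset);
        break;
    case FETCH_ON_WRITE:    /* fetch the block, then proceed as a write hit */
        fetch_from_next_level(line);
        line->data[offset] = value;
        printf("fetch-on-write: whole block fetched, then byte %d written\n", offset);
        break;
    }
}

int main(void) {
    line_t line;
    for (policy_t p = WRITE_AROUND; p <= FETCH_ON_WRITE; p++) {
        memset(&line, 0, sizeof line);
        write_miss(&line, p, 3, 0x7F);
    }
    return 0;
}
```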

Write Buffers
- Reads are more important than writes
  - Writes go to memory on every store if the cache is write-through (WT), and on replacement of dirty lines if it is write-back (WB)
- Hence, buffer the writes in write buffers
  - Write buffers = FIFO queues to store data
  - Since writes have a tendency to come in bunches (e.g., on procedure calls, context switches, etc.), write buffers must be "deep"

Write Buffers (cont'd)
- Writes from the write buffer to the next level of the memory hierarchy can proceed in parallel with computation
- Now loads must check the contents of the write buffer; this is also more complex for cache coherence in multiprocessors
- Allow read misses to bypass the writes in the write buffer

Coalescing Write Buffers and Write Caches
- Writes to an address (block) already in the write buffer are combined
  - Note the tension between draining the coalescing buffer to memory at a high rate (more writes) and coalescing to the maximum (the buffer might become full)
- Extend write buffers to small fully-associative write caches with a WB strategy and a dirty bit per byte
  - Not implemented in any machine I know of
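A small C sketch of the coalescing idea: a FIFO of block-sized entries where a write to a block already in the buffer merges into that entry instead of taking a new slot. The entry count, block size, and byte masks are assumptions:

```c
#include <stdio.h>
#include <stdint.h>
#include <string.h>

enum { ENTRIES = 4, BLOCK = 32 };

typedef struct {
    int      valid;
    uint64_t block_addr;    /* which 32-byte block this entry holds      */
    uint8_t  data[BLOCK];
    uint32_t byte_mask;     /* which bytes of the block carry new data   */
} wb_entry_t;

static wb_entry_t buf[ENTRIES];
static int head, count;     /* FIFO state: oldest entry and occupancy    */

/* Insert a one-byte store; coalesce if the block is already buffered.
   Returns 0 on success, -1 if the buffer is full (the CPU would stall). */
static int wb_write(uint64_t addr, uint8_t value) {
    uint64_t block = addr / BLOCK;
    unsigned off   = (unsigned)(addr % BLOCK);

    for (int i = 0; i < count; i++) {            /* coalescing check */
        wb_entry_t *e = &buf[(head + i) % ENTRIES];
        if (e->valid && e->block_addr == block) {
            e->data[off] = value;
            e->byte_mask |= 1u << off;
            return 0;
        }
    }
    if (count == ENTRIES) return -1;             /* buffer full */
    wb_entry_t *e = &buf[(head + count) % ENTRIES];
    memset(e, 0, sizeof *e);
    e->valid = 1;
    e->block_addr = block;
    e->data[off] = value;
    e->byte_mask = 1u << off;
    count++;
    return 0;
}

int main(void) {
    /* Four writes to the same block coalesce into one entry. */
    for (uint64_t a = 0x1000; a < 0x1004; a++) wb_write(a, 0xFF);
    wb_write(0x2000, 0x01);                      /* a different block */
    printf("entries in use: %d (same-block writes were combined)\n", count);
    return 0;
}
```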

Critical Word First
- Send first, from the next level in the memory hierarchy, the word for which there was a miss
- Send that word directly to the CPU register (or to the IF buffer if it is an I-cache miss) as soon as it arrives
- Need a one-block buffer to hold the incoming block (and shift it) before storing it in the cache
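A tiny C illustration of the wrap-around fill order: the next level returns the word containing the miss first, then the remaining words of the block in circular order (a block of eight 8-byte words is assumed):

```c
#include <stdio.h>
#include <stdint.h>

enum { WORDS_PER_BLOCK = 8, WORD_BYTES = 8 };

/* Print the order in which the words of a block come back when the word
   containing miss_addr is delivered first (critical word first / wrap-around). */
static void fill_order(uint64_t miss_addr) {
    unsigned critical = (unsigned)((miss_addr / WORD_BYTES) % WORDS_PER_BLOCK);
    printf("miss at %#llx -> word order:", (unsigned long long)miss_addr);
    for (unsigned i = 0; i < WORDS_PER_BLOCK; i++)
        printf(" %u", (critical + i) % WORDS_PER_BLOCK);
    printf("\n");
}

int main(void) {
    fill_order(0x1000);   /* miss on word 0: order 0,1,...,7     */
    fill_order(0x1028);   /* miss on word 5: order 5,6,7,0,...,4 */
    return 0;
}
```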

Sectored (Subblock) Caches
- The first cache ever (IBM 360/85, late 60's) was a sector cache
- On a cache miss, fetch only a subblock; change the tag and invalidate all other subblocks
  - Saves memory bandwidth
- Reduces the number of tags, but requires good spatial locality in the application
- Requires status bits (valid, dirty) per subblock (e.g., the Alpha 21164 L2 keeps a dirty bit per 16B for a 64B block size)
- Might reduce false sharing in multiprocessors

[Diagram: a sector cache line, with status bits and the tag followed by the data for subblock 1 through subblock n.]
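A C sketch of the per-line metadata this implies, using the Alpha 21164 L2 parameters mentioned on the slide (64B block, 16B subblocks) for the sizes; the field names are illustrative:

```c
#include <stdint.h>
#include <stdio.h>

/* 64-byte block split into four 16-byte subblocks:
   one tag per block, status bits per subblock. */
enum { BLOCK_BYTES = 64, SUBBLOCK_BYTES = 16,
       SUBBLOCKS = BLOCK_BYTES / SUBBLOCK_BYTES };

typedef struct {
    uint64_t tag;
    uint8_t  valid[SUBBLOCKS];   /* one valid bit per subblock */
    uint8_t  dirty[SUBBLOCKS];   /* one dirty bit per subblock */
    uint8_t  data[SUBBLOCKS][SUBBLOCK_BYTES];
} sector_line_t;

int main(void) {
    sector_line_t line = {0};
    /* A miss that fetches only the second subblock: set the tag for the new
       block, mark that subblock valid, leave the others invalid. */
    line.tag = 0x1234;
    for (int i = 0; i < SUBBLOCKS; i++) line.valid[i] = (i == 1);
    printf("tag %#llx, valid subblocks:", (unsigned long long)line.tag);
    for (int i = 0; i < SUBBLOCKS; i++) printf(" %d", line.valid[i]);
    printf("\n");
    return 0;
}
```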

Lock-Up Free Caches
- Proposed in the early 1980's but implemented only recently, because quite complex
- Allow the cache to have several outstanding miss requests (hit under miss)
- A cache miss "happens" during the EX stage, i.e., with a longer (unpredictable) latency
- Important not to slow down operations that do not depend on the result of the load
- A single hit under miss (HP PA 1700) is relatively simple
- Several outstanding misses require the use of MSHRs (Miss Status Holding Registers)

MSHRs
- The outstanding misses do not necessarily come back in the order in which they were detected
  - For example, miss 1 can percolate from L1 to main memory while miss 2 is resolved at the L2 level
- Each MSHR must hold information about the particular miss it will handle, such as:
  - Information relative to its placement in the cache
  - Information relative to the "missing" item (word, byte) and where to forward it (CPU register)

Implementation of MSHRs
- Quite a variety of alternatives (MIPS R10000, Alpha 21164, Pentium Pro)
- One particular way of doing it:
  - Valid (busy) bit (the limited number of MSHRs is a potential structural hazard)
  - Address of the requested cache block
  - Index in the cache where the block will go
  - Comparator (to avoid allocating a new MSHR for a miss to a block that already has one)
  - If data is to be forwarded to the CPU at the same time as it is written into the cache, the addresses of the destination registers (one per possible word/byte)
  - Valid bits (for writes)
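One way to write down that list of fields as a C structure, with a merge-or-allocate routine built around the comparator. The field widths, the number of MSHRs, and the allocation logic are illustrative assumptions; as the slide notes, real implementations vary:

```c
#include <stdint.h>
#include <stdbool.h>
#include <stdio.h>

enum { NUM_MSHRS = 8, WORDS_PER_BLOCK = 8 };

typedef struct {
    bool     busy;                          /* valid bit: MSHR in use            */
    uint64_t block_addr;                    /* address of the requested block    */
    uint32_t cache_index;                   /* where the block will go in cache  */
    bool     forward[WORDS_PER_BLOCK];      /* forward this word to the CPU?     */
    uint8_t  dest_reg[WORDS_PER_BLOCK];     /* destination register numbers      */
    bool     write_valid[WORDS_PER_BLOCK];  /* valid bits for buffered writes    */
} mshr_t;

static mshr_t mshrs[NUM_MSHRS];

/* Allocate an MSHR for a miss, merging with an existing one for the same block
   (this is what the per-MSHR comparator is for). Returns the MSHR index,
   or -1 on a structural hazard (all MSHRs busy: the pipeline must stall). */
static int mshr_allocate(uint64_t block_addr, uint32_t index,
                         unsigned word, uint8_t reg) {
    int free_slot = -1;
    for (int i = 0; i < NUM_MSHRS; i++) {
        if (mshrs[i].busy && mshrs[i].block_addr == block_addr) {
            mshrs[i].forward[word] = true;        /* secondary miss: merge */
            mshrs[i].dest_reg[word] = reg;
            return i;
        }
        if (!mshrs[i].busy && free_slot < 0) free_slot = i;
    }
    if (free_slot < 0) return -1;
    mshr_t *m = &mshrs[free_slot];
    *m = (mshr_t){ .busy = true, .block_addr = block_addr, .cache_index = index };
    m->forward[word] = true;
    m->dest_reg[word] = reg;
    return free_slot;
}

int main(void) {
    int a = mshr_allocate(0x40, 3, 0, 5);   /* primary miss               */
    int b = mshr_allocate(0x40, 3, 2, 7);   /* secondary miss, same block */
    int c = mshr_allocate(0x80, 4, 1, 9);   /* miss to a different block  */
    printf("MSHRs used: %d %d %d\n", a, b, c);
    return 0;
}
```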

Cache Hierarchy
- Two, and even three, levels of caches are quite common now
- L2 (or L3, i.e., board-level) caches are very large, but since L1 filters many references, their "local" hit rate might appear low (maybe 50%; compulsory misses still happen)
- In general, L2 caches have longer cache blocks and larger associativity
- In general, L2 caches are write-back, write-allocate

Characteristics of the Cache Hierarchy
- Multi-level inclusion (MLI) property between the off-board cache (L2 or L3) and the on-chip cache(s) (L1, and maybe L2)
  - The L2 contents must be a superset of the L1 contents (or at least there must be room to store those contents if L1 is write-back)
  - If L1 and L2 are both on chip, they could be mutually exclusive (and inclusion would then be with L3)
  - MLI is very important for cache coherence in multiprocessor systems (it shields the on-chip caches from unnecessary interference)
- Prefetching at the L2 level is an interesting challenge (made easier if the L2 tags are kept on-chip)

“Virtual” Address Caches Will get back to this after we study TLB’s Virtually addressed, virtually tagged caches Main problem to solve is the Synonym problem (2 virtual addresses corresponding to the same physical address). Virtually addressed, physically tagged Advantage: can allow cache and TLB accesses concurrently Easy and usually done for small L1, i.e., capacity < (page * ass.) Can be done for larger caches if O.S. does a form of page coloring such that “index” is the same for synonyms Cache Perf. CSE 471 Autumn 01

Impact of Branch Prediction on Caches
- If we are on a predicted path and:
  - An I-cache miss occurs: what should we do, stall or fetch?
  - A D-cache miss occurs: what should we do, stall or fetch?
- If we fetch and we are on the right path, it's a win
- If we fetch and we are on the wrong path, it is not necessarily a loss
  - It could be a form of prefetching (if the branch was mispredicted, there is a good chance that that path will be taken later)
  - However, the channel between the cache and the next level of the hierarchy is occupied while something more pressing could be waiting for it