Lecture 7 Memory Hierarchy and Cache Design


Lecture 7 Memory Hierarchy and Cache Design. Pradondet Nilagupta (based on notes by Robert F. Hodson, CNU, and by Prof. Mike Schulte).

Why is memory important? Processor performance has increased at a much faster rate than memory performance, making main memory the bottleneck.
CPU-DRAM gap: in 1980, microprocessors had no cache; by 1995, they had two levels of cache, with 60% of the transistors on the Alpha 21164 devoted to caches.

General Principles of Memory Locality
Temporal locality: referenced memory is likely to be referenced again soon (e.g., code within a loop).
Spatial locality: memory close to referenced memory is likely to be referenced soon (e.g., data in a sequentially accessed array).
Locality + "smaller hardware is faster" = memory hierarchy.
Levels: each level is smaller, faster, and more expensive per byte than the level below it.
Inclusive: data found in an upper level is also found in the levels below.
Definitions:
Upper level: the level closer to the processor.
Block: the minimum unit of data that is either present or not present in the upper level.
Block address: the location of the block in memory.
Hit time: time to access the upper level, including the time to determine hit or miss.
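A minimal C sketch of both kinds of locality: the accumulator and loop index are reused on every iteration (temporal), while the array is walked through consecutive addresses (spatial).

#include <stdio.h>

#define N 1024

int main(void) {
    static int a[N];
    for (int i = 0; i < N; i++)
        a[i] = i;

    long sum = 0;
    for (int i = 0; i < N; i++)  /* sequential walk: spatial locality   */
        sum += a[i];             /* sum and i reused: temporal locality */

    /* Each miss fetches a whole block, so several of the following
     * sequential accesses hit in the cache "for free".                 */
    printf("sum = %ld\n", sum);
    return 0;
}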

Memory Hierarchy
Processor Registers → L1 Cache → L2 Cache → DRAM → Secondary Storage (Disks)

Differences in Memory Levels

Cache: A Definition
Webster's Dictionary: cache (kash), a safe place for hiding or storing things.
In CS (originally): the first level of the memory hierarchy encountered after the CPU.
In CS today (typically): a term applied to any level of buffering employed to reuse commonly accessed items.

Cache Measures
Hit rate: fraction of accesses found in that level. It is usually so high that we talk about the miss rate (1 - hit rate) instead.
Miss rate fallacy: miss rate is as misleading an indicator of average memory access time as MIPS is of CPU performance.
Average memory access time = Hit time + Miss rate x Miss penalty
Miss penalty: time to replace a block from the lower level, including the time to deliver the block to the CPU:
access time: time to reach the lower level = f(lower-level latency)
transfer time: time to transfer the block = f(bandwidth between levels, block size)

Block Size vs. Cache Measures
Increasing the block size generally increases the miss penalty and (up to a point) decreases the miss rate.
[Figure: Miss Penalty x Miss Rate = Avg. Memory Access Time; three plots against Block Size, with AMAT minimized at an intermediate block size.]

Implications for the CPU
Hit checking must be fast, since it happens on every memory access; hits are the common case.
Memory access time becomes unpredictable:
10s of clock cycles: wait.
1000s of clock cycles: interrupt, switch, and do something else (newer style: multithreaded execution).
How to handle a miss? (10s of cycles => hardware; 1000s of cycles => software)

Four Questions for Memory Hierarchy Designers
Q1: Where can a block be placed in the upper level? (Block placement)
Q2: How is a block found if it is in the upper level? (Block identification)
Q3: Which block should be replaced on a miss? (Block replacement)
Q4: What happens on a write? (Write strategy)

Q1: Where can a block be placed in the upper level?
Direct mapped: each block has only one place it can appear in the cache.
Fully associative: each block can be placed anywhere in the cache.
Set associative: each block can be placed in a restricted set of places in the cache. If there are n blocks in a set, the placement is called n-way set associative.
What is the associativity of a direct mapped cache?

Associativity Examples (Figure 5.2, pg. 376)
Fully associative: block 12 can go anywhere.
Direct mapped: block no. = (block address) mod (no. of blocks in cache); block 12 can go only into block 4 (12 mod 8).
Set associative: set no. = (block address) mod (no. of sets in cache); block 12 can go anywhere in set 0 (12 mod 4).

Q1: Where can a block be placed in the upper level?
[Figure: memory block 12 placed in an 8-block cache under direct mapped, fully associative, and set associative placement.]
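The placement arithmetic above, as a small C sketch (the 8-block cache and block 12 follow the slide's example; "set associative" here means 2 ways x 4 sets):

#include <stdio.h>

int main(void) {
    unsigned block_addr = 12;  /* memory block 12, as in the slide      */
    unsigned num_blocks = 8;   /* 8-block cache                         */
    unsigned assoc      = 2;   /* 2 ways x 4 sets fills the same cache  */

    /* Direct mapped: exactly one legal frame per block */
    printf("direct mapped: block %u -> frame %u\n",
           block_addr, block_addr % num_blocks);       /* 12 mod 8 = 4 */

    /* Set associative: one legal set, any way within it */
    unsigned num_sets = num_blocks / assoc;
    printf("set assoc:     block %u -> set %u (any of %u ways)\n",
           block_addr, block_addr % num_sets, assoc);  /* 12 mod 4 = 0 */

    /* Fully associative: any of the 8 frames, no computation needed */
    return 0;
}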

Q2: How Is a Block Found If It Is in the Upper Level?
The address can be divided into two main parts:
Block offset: selects the data within the block; offset size = log2(block size).
Block address: tag + index.
index: selects the set in the cache; index size = log2(#blocks / associativity).
tag: compared to the tag in the cache to determine a hit; tag size = address size - index size - offset size.

Q2: How Is a Block Found If It Is in the Upper Level?
There is a tag on each block, so there is no need to check the index or block offset in the comparison.
Increasing associativity shrinks the index and expands the tag: a fully associative cache has no index field at all, while a direct mapped cache has the largest index and the smallest tag.

Address organization for a set associative cache: the n bits of the CPU address divide into i bits of tag, j bits of index, and k bits of block offset, with n = i + j + k.
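A sketch of how the field widths fall out of the cache parameters. The values below are assumptions chosen to match the 8 KB, 2-way cache described near the end of this lecture; its 22/7/5-bit fields imply a 34-bit address and 32-byte blocks, although the slides never state those explicitly.

#include <stdio.h>

/* log2 for exact powers of two */
static unsigned lg2(unsigned x) {
    unsigned n = 0;
    while (x > 1) { x >>= 1; n++; }
    return n;
}

int main(void) {
    unsigned addr_bits  = 34;    /* n: assumed address width           */
    unsigned block_size = 32;    /* bytes per block (assumed)          */
    unsigned cache_size = 8192;  /* 8 KB                               */
    unsigned assoc      = 2;     /* 2-way set associative              */

    unsigned offset_bits = lg2(block_size);                      /* k */
    unsigned index_bits  = lg2(cache_size / block_size / assoc); /* j */
    unsigned tag_bits    = addr_bits - index_bits - offset_bits; /* i */

    /* Prints: tag=22 index=7 offset=5 (n = 34) */
    printf("tag=%u index=%u offset=%u (n = %u)\n",
           tag_bits, index_bits, offset_bits,
           tag_bits + index_bits + offset_bits);
    return 0;
}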

Cache organization
N-way set associative means there are N blocks, each with its own tag, per set. The cache contains sets SET 0 through SET 2^j - 1, and the j-bit index is used to select a set.

Cache organization
There must also be a mechanism to indicate invalid block data. This is commonly done by attaching a valid bit to the tag field. The tag is associatively compared to all tags in the selected set; if there is a match, we have a cache hit.

Cache organization
Each block is broken into 2^k elements, and the block offset is used to select the desired element. The tag and index identify the block; the block offset identifies the element within it.
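Putting the last three slides together, a sketch of a set-associative lookup in C. The sizes and structure are illustrative assumptions, not a real design: the index selects a set, the tag is compared against every valid way in that set, and on a hit the block offset selects the element.

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define NUM_SETS 128                 /* 2^j sets (assumed)            */
#define ASSOC    2                   /* N ways per set (assumed)      */
#define BLOCK    32                  /* 2^k bytes per block (assumed) */

struct line {
    bool     valid;                  /* guards against invalid data   */
    uint32_t tag;
    uint8_t  data[BLOCK];
};

static struct line cache[NUM_SETS][ASSOC];

/* Returns true on a hit and copies the addressed byte to *out. */
bool cache_lookup(uint32_t addr, uint8_t *out) {
    uint32_t offset = addr % BLOCK;              /* low k bits       */
    uint32_t index  = (addr / BLOCK) % NUM_SETS; /* next j bits      */
    uint32_t tag    = (addr / BLOCK) / NUM_SETS; /* remaining i bits */

    /* The tag is associatively compared against every way in the
     * selected set; the valid bit prevents matching stale data.    */
    for (int way = 0; way < ASSOC; way++) {
        struct line *l = &cache[index][way];
        if (l->valid && l->tag == tag) {
            *out = l->data[offset];  /* block offset selects element */
            return true;             /* cache hit                    */
        }
    }
    return false;                    /* cache miss                   */
}

int main(void) {
    uint8_t byte;
    /* Cold cache: all valid bits are zero, so every lookup misses. */
    printf("hit? %d\n", (int)cache_lookup(0x1234, &byte));
    return 0;
}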

Q3: Which Block Should Be Replaced on a Miss?
This is easy for direct mapped caches. For set associative or fully associative caches:
Random: easier to implement.
Least Recently Used (LRU): harder to implement; may be approximated.
Miss rates for caches with different sizes, associativities, and replacement algorithms:

                2-way             4-way             8-way
Size        LRU     Random    LRU     Random    LRU     Random
16 KB       5.18%   5.69%     4.67%   5.29%     4.39%   4.96%
64 KB       1.88%   2.01%     1.54%   1.66%     1.39%   1.53%
256 KB      1.15%   1.17%     1.13%   1.13%     1.12%   1.12%

For caches with low miss rates, random replacement is almost as good as LRU.
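A minimal sketch of the two victim-selection policies. The per-way timestamp shown is one straightforward way to get exact LRU; as the slide notes, real hardware usually only approximates it.

#include <stdio.h>
#include <stdlib.h>

#define ASSOC 4

/* Per-way bookkeeping for one set; last_used would be refreshed on
 * every hit. Keeping it exact is what makes true LRU expensive.    */
struct way_state { unsigned long last_used; };

/* Random: no per-access bookkeeping at all. */
int victim_random(void) {
    return rand() % ASSOC;
}

/* LRU: evict the way whose last use is oldest. */
int victim_lru(const struct way_state set[ASSOC]) {
    int v = 0;
    for (int w = 1; w < ASSOC; w++)
        if (set[w].last_used < set[v].last_used)
            v = w;
    return v;
}

int main(void) {
    struct way_state set[ASSOC] = {{40}, {10}, {30}, {20}};
    printf("LRU victim:    way %d\n", victim_lru(set));  /* way 1 */
    printf("random victim: way %d\n", victim_random());
    return 0;
}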

Q4: What Happens on a Write?
Write through: the information is written to both the block in the cache and the block in the lower-level memory.
Write back: the information is written only to the block in the cache. The modified cache block is written to main memory only when it is replaced. Is the block clean or dirty? (Add a dirty bit to each block.)
Pros and cons of each:
Write through: read misses cannot result in writes to memory, and it is easier to implement. Always combine with write buffers to avoid waiting on memory latency.
Write back: less memory traffic, and writes are performed at the speed of the cache.

Q4: What Happens on a Write?
Since data does not have to be brought into the cache on a write miss, there are two options:
Write allocate: the block is brought into the cache on a write miss. Used with write-back caches, in the hope that subsequent writes to the block hit in the cache.
No-write allocate: the block is modified in memory but not brought into the cache. Used with write-through caches: writes have to go to memory anyway, so why bring the block into the cache?
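To contrast the two usual policy pairings, a toy sketch with a one-line cache (the tiny memory, the single line, and all names are illustrative assumptions):

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Toy one-line "cache" over a 16-word memory, only to contrast the
 * write policies; none of this reflects a real design.             */
static uint32_t mem[16];
static struct { bool valid, dirty; uint32_t addr, val; } line;

static bool hit(uint32_t a) { return line.valid && line.addr == a; }

/* Write through + no-write allocate: memory is always updated, and
 * a write miss does not bring the block into the cache.            */
void store_wt(uint32_t a, uint32_t v) {
    if (hit(a))
        line.val = v;          /* keep the cached copy consistent    */
    mem[a] = v;                /* the write goes to memory anyway    */
}

/* Write back + write allocate: the write stays in the cache; the
 * dirty block reaches memory only when it is eventually replaced.  */
void store_wb(uint32_t a, uint32_t v) {
    if (!hit(a)) {
        if (line.valid && line.dirty)
            mem[line.addr] = line.val;  /* write back the victim     */
        line.valid = true;              /* allocate on the miss,     */
        line.dirty = false;             /* hoping later writes hit   */
        line.addr  = a;
        line.val   = mem[a];
    }
    line.val   = v;
    line.dirty = true;         /* the memory copy is now stale       */
}

int main(void) {
    store_wb(3, 42);
    printf("write back:    mem[3]=%u (stale), cached=%u\n", mem[3], line.val);
    store_wt(7, 99);
    printf("write through: mem[7]=%u (up to date)\n", mem[7]);
    return 0;
}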

Example: Alpha 21064 Data Cache
A cache read has four steps:
(1) The CPU address is divided into the tag, index, and block offset.
(2) The index selects the block.
(3) The address tag is compared with the tag in the cache, the valid bit is checked, and the data to be loaded is selected.
(4) If the valid bit is set, the data is loaded into the processor.
If there is a write, the data is also sent to the write buffer.

Writes in the Alpha 21064: no write merging vs. write merging in the write buffer.
[Figure: without merging, four sequential stores to addresses 100, 104, 108, and 112 each occupy their own write-buffer entry, filling the buffer; with merging, the four sequential stores are combined into a single buffer entry.]

Split vs. Unified Cache
Unified cache (mixed cache): data and instructions are stored together (von Neumann architecture).
Split cache: data and instructions are stored separately (Harvard architecture).
Why do instruction caches have a lower miss ratio?

Size      Instruction Cache   Data Cache   Unified Cache
1 KB      3.06%               24.61%       13.34%
2 KB      2.26%               20.57%        9.78%
4 KB      1.78%               15.94%        7.24%
8 KB      1.10%               10.19%        4.57%
16 KB     0.64%                6.47%        2.87%
32 KB     0.39%                4.82%        1.99%
64 KB     0.15%                3.77%        1.35%
128 KB    0.02%                2.88%        0.95%

Example: Split vs. Unified Cache
Which has the lower average memory access time?
Split cache: 16 KB instructions + 16 KB data.
Unified cache: 32 KB of instructions and data.
Assumptions:
Use the miss rates from the previous chart.
The miss penalty is 50 cycles.
The hit time is 1 cycle.
In the unified cache, a load or store hit takes an extra cycle, since there is only one port for instructions and data.

Example: Split vs. Unified Cache
Average memory access time = Hit time + Miss rate x Miss penalty
AMAT = %instr x (instr hit time + instr miss rate x instr miss penalty) + %data x (data hit time + data miss rate x data miss penalty)
For the split cache:
AMAT = 75% x (1 + 0.64% x 50) + 25% x (1 + 6.47% x 50) = 2.05 cycles
For the unified cache (note the extra cycle on data hits):
AMAT = 75% x (1 + 1.99% x 50) + 25% x (2 + 1.99% x 50) = 2.24 cycles
The unified cache has a longer AMAT, even though its miss rate is lower, because of the structural hazard: instruction and data accesses conflict for its single port.
What are the advantages of a split cache? Of a unified cache?
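The same arithmetic as a quick check in C; the miss rates come straight from the table two slides back, and the extra data-hit cycle models the unified cache's single shared port:

#include <stdio.h>

int main(void) {
    double penalty = 50.0;          /* miss penalty in cycles            */
    double f_i = 0.75, f_d = 0.25;  /* instruction / data access mix     */

    /* Split: 16 KB I-cache (0.64%) + 16 KB D-cache (6.47%), 1-cycle hit */
    double split = f_i * (1.0 + 0.0064 * penalty)
                 + f_d * (1.0 + 0.0647 * penalty);

    /* Unified 32 KB (1.99% misses); data hits take an extra cycle
     * because instruction and data accesses share a single port.        */
    double unified = f_i * (1.0 + 0.0199 * penalty)
                   + f_d * (2.0 + 0.0199 * penalty);

    printf("split   AMAT = %.3f cycles\n", split);    /* ~2.05 */
    printf("unified AMAT = %.3f cycles\n", unified);  /* ~2.24 */
    return 0;
}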

CPU Performance
CPU time = (CPU execution clock cycles + Memory stall clock cycles) x Clock cycle time
CPU execution clock cycles = number of cycles for the instructions assuming all hits (ignoring memory stalls).
Memory stall clock cycles = Reads x Read miss rate x Read miss penalty + Writes x Write miss rate x Write miss penalty
Combining reads and writes (an approximation of the above):
Memory stall clock cycles = Memory accesses x Miss rate x Miss penalty

CPU Performance
CPU time = IC x (CPI_execution + Memory accesses per instruction x Miss rate x Miss penalty) x Clock cycle time
(Cache) misses per instruction = Memory accesses per instruction x Miss rate
CPU time = IC x (CPI_execution + Misses per instruction x Miss penalty) x Clock cycle time

Performance Example
Compare the CPU time of a processor with and without a cache, given:
98% hit rate
50-cycle miss penalty
2 cycles/instruction (all hits)
1.33 memory references/instruction
With cache: CPU time = IC x [2 + 1.33 x (1 - 0.98) x 50] x clock cycle time = IC x 3.33 x clock cycle time
Without cache: CPU time = IC x [2 + 1.33 x 50] x clock cycle time = IC x 68.5 x clock cycle time
Speedup = 68.5 / 3.33 ≈ 20.6
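The same example run through the CPU-time formula from the preceding slide. IC and the clock cycle time cancel in the speedup ratio, so only the effective CPIs matter:

#include <stdio.h>

int main(void) {
    double cpi_exec  = 2.0;    /* cycles per instruction, all hits      */
    double refs      = 1.33;   /* memory references per instruction     */
    double miss_rate = 0.02;   /* 1 - 0.98 hit rate                     */
    double penalty   = 50.0;   /* miss penalty in cycles                */

    /* Effective CPI = CPI_execution + refs x miss rate x miss penalty  */
    double cpi_cache   = cpi_exec + refs * miss_rate * penalty; /* 3.33 */
    /* Without a cache, every reference pays the full penalty           */
    double cpi_nocache = cpi_exec + refs * penalty;             /* 68.5 */

    printf("CPI with cache    = %.2f\n", cpi_cache);
    printf("CPI without cache = %.2f\n", cpi_nocache);
    printf("speedup           = %.2f\n", cpi_nocache / cpi_cache); /* 20.57 */
    return 0;
}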

2-Way Set Associative Cache
Features of an 8 KB, 2-way set associative cache (implying 32-byte blocks and a 34-bit address):
5-bit block offset
7-bit index
22-bit tag
The set associative cache has extra hardware: two tag comparators and a mux to select the data. Compared to a direct mapped cache, the set associative cache will tend to have a smaller miss rate but a larger hit time.

Example: Two Different Cache Organizations
See the example on p. 387.

Improving Cache Performance
AMAT = Hit time + Miss rate x Miss penalty
Improve performance by:
1. reducing the miss rate (or increasing the hit rate),
2. reducing the miss penalty, or
3. reducing the time to hit in the cache.

Summary
The CPU-memory gap is the major obstacle to performance, for both hardware and software.
Take advantage of program behavior: locality.
Execution time of a program is still the only reliable performance measure.
The four questions of the memory hierarchy:
Where can a block be placed?
How is a block found?
Which block should be replaced on a miss?
What happens on a write?