EE108B Review Session #6 Daxia Ge Friday February 23rd, 2007
Today's Menu:
- Announcements
- Cache Review Intro
- Cache organizations
- Mechanics (Index, Tag, etc.)
- Design Choices
- Examples
- HW Hints
Review: The Memory Problem
We need: big, fast, cheap memory.
But:
- Big memories are slow, even when built from fast components
- Fast memories are expensive and small
We Are Lucky: Programs Have Locality!
Principle of Locality: programs access a relatively small portion of the address space at any given time, so we can predict which memory locations a program will reference in the near future by looking at what it has referenced recently.
Two types of locality:
- Temporal Locality - if an item has been referenced recently, it will tend to be referenced again soon
- Spatial Locality - if an item has been referenced recently, nearby items will tend to be referenced soon ("nearby" refers to memory addresses)
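As a concrete illustration (not from the slides), here is a small C loop in which both kinds of locality show up; the function name and signature are arbitrary:

```c
#include <stddef.h>

/* Summing an array: a common pattern that exhibits both kinds of locality. */
long sum_array(const int *a, size_t n) {
    long sum = 0;                 /* "sum" is reused every iteration: temporal locality */
    for (size_t i = 0; i < n; i++) {
        sum += a[i];              /* a[i], a[i+1], ... are adjacent in memory: spatial  */
    }                             /* locality, so a cache block fetched for a[i] also   */
    return sum;                   /* serves the next few accesses.                      */
}
```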
The Solution
Memory can be arranged as a hierarchy. The goal is to provide the illusion of lots of fast memory. But how do you manage this and make it work?
[Figure: memory hierarchy - the processor (control + datapath) at the top, with successive memory levels below it going from fastest/smallest/highest cost to slowest/biggest/lowest cost.]
Designing Caches
Organization:
- Direct Mapped
- Set Associative
- Fully Associative
Design Choices:
- Block size
- Replacement policy
- Write back / write through
- Write miss / fetch policy
- Others: consistency, etc.
Direct Mapped Cache
[Figure: a 4-word direct mapped cache in front of a 16-word memory. Each memory address maps to exactly one cache index (index = address mod 4), so addresses 0, 4, 8, C share index 0; 1, 5, 9, D share index 1; and so on.]
Set Associative Cache
[Figure: a 4-word, 2-way set associative cache in front of a 16-word memory. There are two sets of two ways each; every memory address maps to one set (index = address mod 2) but may occupy either way within that set.]
Fully Associative Cache
No cache index: a memory block can be stored in any cache entry.
[Figure: a 4-word fully associative cache in front of a 16-word memory; any address can map to any entry.]
- Complete freedom
- More complex replacement policy and hardware
- No memory partitioning
Direct Mapped Cache (block size = 2)
[Figure: a direct mapped cache with two-word blocks in front of a 16-word memory. Consecutive pairs of words move together as one block, and the cache index now selects a block rather than a single word.]
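A small sketch (mine, not from the slides) of the direct-mapped placement rule with a configurable block size; the word-addressed memory and the parameter values below are assumptions for illustration:

```c
#include <stdio.h>

/* Direct-mapped placement for a word-addressed memory:
 *   block number = address / words_per_block
 *   cache index  = block number % number_of_blocks
 *   tag          = block number / number_of_blocks
 */
int main(void) {
    const unsigned words_per_block = 2;   /* block size = 2 words, as in the figure      */
    const unsigned num_blocks      = 2;   /* assume a 4-word cache: 2 two-word blocks    */

    for (unsigned addr = 0x0; addr <= 0xF; addr++) {   /* 16-word memory */
        unsigned block = addr / words_per_block;
        unsigned index = block % num_blocks;
        unsigned tag   = block / num_blocks;
        printf("address 0x%X -> block %u, index %u, tag %u\n",
               addr, block, index, tag);
    }
    return 0;
}
```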
Quick Example
Direct mapped cache with 16 KB of data and 4-word blocks, 32-bit addresses. How big is the entire cache?
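One way to work this out (the arithmetic below is mine, following the field breakdown on the next slides):
- 4-word blocks = 16 bytes per block, so the byte offset is 4 bits
- 16 KB of data / 16 B per block = 1024 blocks, so the index is 10 bits
- Tag = 32 - 10 - 4 = 18 bits
- Each line stores 128 data bits + 18 tag bits + 1 valid bit = 147 bits
- Total = 1024 x 147 = 150,528 bits, i.e. about 18.4 KB of storage for 16 KB of data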
Cache Tag & Index : : : Assume a 32 bit memory address Assume we also have a 2n word direct mapped cache with 1 word blocks 31 n+2 2 1 Cache Tag : (=0x50) Cache Index (=3) Byte Offset Valid Bit Tag Data Word 0 Word 1 1 Word 2 2 0x50 Word 3 3 2 n Words : : : Word 2n -1 2 - 1 n
Cache Blocks
The previous example was a 4-word direct mapped cache in which each block was 1 word wide. That strategy takes advantage of temporal locality: if a word is referenced, it will tend to be referenced again soon. It does not take advantage of spatial locality. To exploit spatial locality, increase the block size.
[Figure: a cache line with a valid bit, a cache tag, and an 8-word data block (word 0 ... word 7).]
Cache Block Example
Assume a 2^n byte direct mapped cache with 2^m byte blocks (word size = 1 byte).
- Byte select: the lower m bits of the memory address
- Cache index: the next (n - m) bits
- Cache tag: the upper (32 - n) bits
[Figure: a 1 KB cache with 32 B blocks, so m = 5 and n = 10: 5 byte-select bits, 5 index bits, and a 22-bit tag; 2^5 = 32 cache lines, each with a valid bit, a tag, and bytes 0-31 of its block. Example address fields: tag = 0x50, index = 0x01, byte select = 0x1F.]
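A quick sketch (mine) of the same field breakdown in C, using the 1 KB / 32 B-block example; the address is reconstructed from the field values pictured on the slide:

```c
#include <stdio.h>
#include <stdint.h>

int main(void) {
    const unsigned m = 5;                    /* 32 B blocks -> 5 byte-select bits */
    const unsigned n = 10;                   /* 1 KB cache  -> 10 bits of index + offset */

    /* Address with tag 0x50, index 0x01, byte select 0x1F (the fields in the figure). */
    uint32_t addr = (0x50u << n) | (0x01u << m) | 0x1Fu;

    uint32_t byte_select = addr & ((1u << m) - 1);              /* lower m bits      */
    uint32_t index       = (addr >> m) & ((1u << (n - m)) - 1); /* next (n - m) bits */
    uint32_t tag         = addr >> n;                           /* upper (32 - n) bits */

    printf("addr = 0x%X: tag = 0x%X, index = 0x%X, byte select = 0x%X\n",
           addr, tag, index, byte_select);
    return 0;
}
```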
Block Sizes
Larger block sizes take advantage of spatial locality, but they also incur a larger miss penalty, since it takes longer to transfer the block into the cache. A very large block can also increase the miss rate (fewer blocks compromises temporal locality), and with it the average access time, so there is a tradeoff in selecting the block size.
Average Access Time = Hit Time + Miss Rate x Miss Penalty
[Figure: three curves versus block size - miss penalty grows with block size; miss rate first falls (exploiting spatial locality) and then rises (fewer blocks compromises temporal locality); average access time therefore has a minimum at an intermediate block size, beyond which the increased miss penalty and miss rate dominate.]
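For instance (the numbers here are made up, just to show the tradeoff): with a 1-cycle hit time, a small block might give a 7% miss rate and a 20-cycle miss penalty, so AMAT = 1 + 0.07 x 20 = 2.4 cycles; quadrupling the block size might cut the miss rate to 4% but raise the miss penalty to 50 cycles, giving AMAT = 1 + 0.04 x 50 = 3.0 cycles, which is worse despite the lower miss rate.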
Fully Associative Cache
The opposite extreme: it has no cache index. Any available entry can store a memory block, so there are no conflict misses, only capacity misses. The cache tags of all entries must be compared in parallel to find the desired one.
[Figure: with 32 B blocks the address splits into a 27-bit cache tag (bits 31-5) and a 5-bit byte select (e.g. 0x1E); every line feeds its tag to its own comparator, and each line holds a valid bit plus bytes 0-31 of its block.]
Replacement Policies
- Least Recently Used (LRU) - often "Not Recently Used" works pretty well and is easier to implement
- Random
- Round Robin
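As an illustration (not from the slides), true LRU for a 2-way set needs only a single bit per set; the structure below is a made-up software model:

```c
#include <stdint.h>
#include <stdbool.h>

/* One 2-way set: true LRU here is just a single bit that records
 * which way was used least recently. */
struct set2 {
    bool     valid[2];
    uint32_t tag[2];
    uint8_t  lru;          /* index of the least recently used way */
};

/* Call on every access that hits or fills way w: the other way becomes LRU. */
static void touch(struct set2 *s, unsigned w) {
    s->lru = (uint8_t)(1u - w);
}

/* Pick a victim way on a miss: prefer an invalid way, else evict the LRU way. */
static unsigned victim(const struct set2 *s) {
    if (!s->valid[0]) return 0;
    if (!s->valid[1]) return 1;
    return s->lru;
}
```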
Write Policy
Write-through:
- Misses are simpler and cheaper since the block does not need to be written back
- Consistency is easy
- Easier to implement, though most systems need an additional buffer, called a write buffer, to be practical
- Uses a lot of bandwidth to the next level of memory; potentially horrible performance
Write-back:
- Words can be written at the cache rate
- Multiple writes within a block require only one "writeback" later
- 2-cycle writes
Write Miss/Fetch Policies
On a write miss, do we load the block into the cache?
Yes! - "Write Allocate"
- Fetch-on-write (write-through or write-back caches): fetch the rest of the block
- No-fetch-on-write (write-through caches): mark the parts of the block that are not valid
No! - "No Write Allocate"
- No-write-allocate (write-through caches): write the data directly to memory, without keeping a copy in the cache
Other Topics
- Split caches (separate instruction and data caches)
- Multilevel caches - the goal is still to provide the illusion of lots of fast memory
[Figure: the same memory hierarchy as before - fastest/smallest/highest cost near the processor, slowest/biggest/lowest cost at the bottom.]
Example: What happens to the L1 cache?

Design change         | Hit Time | Miss Rate     | Miss Penalty
Larger L1 cache       | up       | down          | -
Higher associativity  | up       | down          | -
Larger blocks         | -        | down, then up | up
Multilevel caches     | -        | -             | down

(Entries follow the tradeoffs discussed above: larger blocks reduce the miss rate only until there are too few blocks, and a second cache level reduces the effective L1 miss penalty.)
Terminology
- Block - minimum unit of information transfer between levels of the hierarchy; block addressing varies by technology at each level, and blocks are moved one level at a time
- Hit - data appears in a block in the upper level
- Hit rate - percentage of accesses found in the upper level
- Hit time - time to access the upper level; Hit time = Access time + Time to determine hit/miss
- Miss - data was not in the upper level and had to be fetched from a lower level
- Miss rate - percentage of misses (1 - Hit rate)
- Miss penalty - overhead in getting data from a lower level; Miss penalty = Lower level access time + Replacement time + Time to deliver to the upper level. The miss penalty is usually much larger than the hit time.
AMAT = Hit Time + Miss Rate * Miss Penalty
We need to define an average access time, since some accesses will be fast and some slow. The formula can be applied recursively:
  AMAT_L1 = HitTime_L1 + MissRate_L1 * AMAT_L2
  AMAT_L2 = HitTime_L2 + MissRate_L2 * AMAT_MainMem
How do you compute CPI given AMAT? AMAT is in units of time (usually ns), so multiplying by the clock rate converts it to cycles per access:
  CPI_overall = CPI_base + (Frequency of L1 accesses per instruction) * AMAT_L1 * Clock Rate
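A small numeric sketch (mine) of the recursion and the AMAT-to-CPI conversion; all latencies, rates, and frequencies below are made-up example values:

```c
#include <stdio.h>

int main(void) {
    /* Example parameters - every value here is an assumption for illustration. */
    double hit_time_L1  = 1.0;    /* ns                                   */
    double miss_rate_L1 = 0.05;
    double hit_time_L2  = 5.0;    /* ns                                   */
    double miss_rate_L2 = 0.20;
    double mem_time     = 60.0;   /* ns, main memory access time          */
    double clock_rate   = 1.0;    /* GHz, i.e. cycles per ns              */
    double cpi_base     = 1.0;    /* CPI assuming every access hits in L1 */
    double accesses_per_instr = 1.3;  /* instruction fetch + some data references */

    /* Apply the AMAT formula recursively, innermost level first. */
    double amat_L2 = hit_time_L2 + miss_rate_L2 * mem_time;
    double amat_L1 = hit_time_L1 + miss_rate_L1 * amat_L2;

    /* AMAT is in ns; multiply by clock rate (cycles/ns) to get cycles per access. */
    double cpi = cpi_base + accesses_per_instr * amat_L1 * clock_rate;

    printf("AMAT_L2 = %.2f ns, AMAT_L1 = %.2f ns, CPI = %.2f\n",
           amat_L2, amat_L1, cpi);
    return 0;
}
```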