5.2 Eleven Advanced Optimizations of Cache Performance
1. Small & simple caches - Reduce hit time
  - Cost: indexing
  - Smaller is faster
    - L2 small enough to fit on the processor chip
  - Direct mapping is simple
    - Overlap the tag check with data transmission
  - CACTI - Simulates impact on hit time
    - E.g., Fig 5.4: access time vs. size & associativity
  - Suggested hit times
    - Direct mapped is 1.2 - 1.5 x faster than 2-way set associative
    - 2-way is 1.02 - 1.1 x faster than 4-way
    - 4-way is 1.0 - 1.08 x faster than fully associative
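To make "direct mapping is simple" concrete, here is a minimal sketch of a direct-mapped lookup: the index picks exactly one block, so a hit needs only one tag compare. The geometry (32-bit addresses, 64-byte blocks, 256 blocks) and all names (`DmCache`, `dm_access`, ...) are assumptions for illustration, not taken from the notes.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical geometry: 64-byte blocks (6 offset bits), 256 blocks (8 index bits). */
enum { DM_OFFSET_BITS = 6, DM_INDEX_BITS = 8, DM_BLOCKS = 1 << DM_INDEX_BITS };

typedef struct {
    bool     valid;
    uint32_t tag;
} DmLine;

typedef struct {
    DmLine lines[DM_BLOCKS];
} DmCache;

/* The index field selects the single candidate block. */
static uint32_t dm_index(uint32_t addr)
{
    return (addr >> DM_OFFSET_BITS) & (DM_BLOCKS - 1);
}

/* The remaining high-order bits form the tag. */
static uint32_t dm_tag(uint32_t addr)
{
    return addr >> (DM_OFFSET_BITS + DM_INDEX_BITS);
}

/* Returns true on a hit. On a miss, the block is installed, replacing
 * whatever occupied that index (there is no choice of victim). */
static bool dm_access(DmCache *c, uint32_t addr)
{
    assert(c != NULL);
    DmLine *line = &c->lines[dm_index(addr)];
    bool hit = line->valid && line->tag == dm_tag(addr);
    if (!hit) {
        line->valid = true;
        line->tag = dm_tag(addr);
    }
    return hit;
}
```

Two addresses that share an index but differ in tag evict each other on every access, which is the conflict-miss cost paid for the faster hit.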
2. Way prediction - Reduce hit time
  - Extra bits kept in cache to predict the way (block within the set) of the next cache access
  - Set multiplexor early to select the desired block
  - Single tag compare in that cycle, in parallel with reading cache data
  - Misprediction? Check the other blocks for matches in the next cycle
  - Saves pipeline stages
  - Prediction is correct for about 85% of accesses for 2-way
    ==> Good match for speculative processors
  - Used in the Pentium 4
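The way-prediction flow can be sketched as a small model of a 2-way set-associative cache with one prediction bit per set. This is an illustration under assumptions: the geometry, the names (`WpCache`, `wp_access`), and the policy of replacing the non-predicted way on a miss are all hypothetical choices, not a description of any real processor.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical geometry: 64-byte blocks, 128 sets, 2 ways. */
enum { WP_OFFSET_BITS = 6, WP_INDEX_BITS = 7, WP_SETS = 1 << WP_INDEX_BITS };

typedef enum { WP_FAST_HIT, WP_SLOW_HIT, WP_MISS } WpResult;

typedef struct {
    bool     valid[2];
    uint32_t tag[2];
    uint8_t  predicted_way;   /* the extra prediction bit kept per set */
} WpSet;

typedef struct {
    WpSet sets[WP_SETS];
} WpCache;

static WpResult wp_access(WpCache *c, uint32_t addr)
{
    assert(c != NULL);
    WpSet *s = &c->sets[(addr >> WP_OFFSET_BITS) & (WP_SETS - 1)];
    uint32_t tag = addr >> (WP_OFFSET_BITS + WP_INDEX_BITS);
    unsigned p = s->predicted_way;
    unsigned other = p ^ 1u;

    /* First cycle: a single tag compare against the predicted way. */
    if (s->valid[p] && s->tag[p] == tag)
        return WP_FAST_HIT;

    /* Next cycle: check the other way and retrain the predictor. */
    if (s->valid[other] && s->tag[other] == tag) {
        s->predicted_way = (uint8_t)other;
        return WP_SLOW_HIT;
    }

    /* Miss: fill the non-predicted way (assumed policy) and predict it next. */
    s->valid[other] = true;
    s->tag[other] = tag;
    s->predicted_way = (uint8_t)other;
    return WP_MISS;
}
```

Repeated accesses to the same block take the fast path; alternating between two blocks in one set still hits, but pays the extra cycle each time.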
3. Trace caches - Reduce hit time
  - ILP challenge: enough instructions to execute every cycle without dependencies
  - Trace cache - dynamic traces of executed instructions
    - Not static sequences of instructions from memory
    - Branch prediction folded into instruction cache
  - More complicated address mapping
  - Better use of long blocks
  - Disadvantage: conditional branches making different choices put the same instructions in separate traces
  - Pentium 4 uses a trace cache of decoded micro-instructions
4. Pipelined cache access - Increase cache bandwidth
  - Pipeline cache access
    - Effective latency of an L1 cache hit is multiple clock cycles
  - Fast clock and high bandwidth, but slow hits
  - Pentium 4: L1 cache hit takes 4 cycles
  - Increased number of pipeline stages
5. Nonblocking cache (hit under miss) - Increase cache bandwidth
  - With out-of-order completion, processor need not stall on a cache miss
    - Continue fetching instructions while waiting for cache data
  - If the cache does not block, it can supply data for hits while processing a miss
    - Reduces effective miss penalty
  - Overlap multiple misses?
    - Called hit under multiple misses or miss under miss
    - Requires memory to service multiple misses simultaneously
6. Multi-banked caches - Increase cache bandwidth
  - Independent banks supporting simultaneous access
    - Originally used in main memory
  - AMD Opteron has 2 banks of L2
  - Sun Niagara has 4 banks of L2
  - Best when accesses spread across banks
    - Spread block addresses sequentially across banks - sequential interleaving
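Sequential interleaving amounts to taking the block address modulo the number of banks. A minimal sketch follows; the function names and the 64-byte block size used in the usage note are assumptions for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Bank that holds the block containing addr: consecutive blocks
 * go to consecutive banks, wrapping around. */
static unsigned bank_of(uint64_t addr, unsigned block_bytes, unsigned num_banks)
{
    assert(block_bytes > 0 && num_banks > 0);
    return (unsigned)((addr / block_bytes) % num_banks);
}

/* Position of that block within its bank. */
static uint64_t index_in_bank(uint64_t addr, unsigned block_bytes, unsigned num_banks)
{
    assert(block_bytes > 0 && num_banks > 0);
    return (addr / block_bytes) / num_banks;
}
```

With 64-byte blocks and 4 banks, addresses 0, 64, 128, and 192 map to banks 0, 1, 2, and 3, and address 256 wraps back to bank 0, so a sequential sweep keeps all banks busy.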
7. Critical word first, Early restart - Reduce miss penalty
  - Processor needs 1 word of the block
    - Give it what it needs first
  - How is the block retrieved from memory?
    - Critical word first - Get the requested word first, return it, then continue with the memory transfer
    - Early restart - Fetch in normal order; when the requested word arrives, return it
  - Benefits only with large blocks. Why?
  - Disadvantage: Spatial locality. Why?
  - Miss penalty is hard to estimate
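The difference between the two schemes can be sketched by counting word transfers until the processor can resume. This is a simplified model under stated assumptions: one word arrives per transfer, the initial memory latency is ignored, and the function names are hypothetical.

```c
#include <assert.h>

/* Critical word first: delivery starts at the requested word and
 * wraps around the block. order[] receives the word indices in
 * the order they arrive. */
static void cwf_order(unsigned words_per_block, unsigned requested, unsigned order[])
{
    assert(requested < words_per_block);
    for (unsigned k = 0; k < words_per_block; k++)
        order[k] = (requested + k) % words_per_block;
}

/* Word transfers the processor waits for under each policy. */
static unsigned wait_full_block(unsigned words_per_block)
{
    return words_per_block;      /* no optimization: wait for the whole block */
}

static unsigned wait_early_restart(unsigned requested)
{
    return requested + 1;        /* normal order: words 0..requested must arrive */
}

static unsigned wait_critical_word_first(void)
{
    return 1;                    /* requested word is transferred first */
}
```

For an 8-word block with word 5 requested, the waits are 8, 6, and 1 transfers respectively. With a 2-word block the gap nearly vanishes, which hints at why the benefit depends on block size.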
8. Merge write buffers - Reduce miss penalty
  - Write-through relies on write buffers
    - All stores sent to lower level
  - Write-back uses a simple buffer when a block is replaced
  - Case: Write buffer is empty
    - Data & addresses written from cache block to buffer
    - Cache thinks write is done
  - Case: Write buffer contains modified blocks
    - Is this block already in the write buffer?
    - Write merging - Combine newly modified data with buffer contents
  - Case: Buffer full & no address match
    - Must wait for an empty buffer entry
  - Uses memory more efficiently - multi-word writes
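The three cases above can be sketched as a small write buffer model. The sizes (4 entries of 4 words), the use of word addresses, and all names (`WriteBuffer`, `wb_write`, ...) are assumptions for illustration; draining entries to memory is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

enum { WB_ENTRIES = 4, WB_WORDS = 4 };   /* hypothetical sizes */

typedef struct {
    bool     used;
    uint64_t block_addr;        /* word address divided by WB_WORDS */
    bool     valid[WB_WORDS];   /* which words of the entry hold data */
    uint32_t data[WB_WORDS];
} WbEntry;

typedef struct {
    WbEntry e[WB_ENTRIES];
} WriteBuffer;

typedef enum { WB_MERGED, WB_NEW_ENTRY, WB_FULL } WbResult;

static WbResult wb_write(WriteBuffer *wb, uint64_t word_addr, uint32_t value)
{
    assert(wb != NULL);
    uint64_t base = word_addr / WB_WORDS;
    unsigned slot = (unsigned)(word_addr % WB_WORDS);
    WbEntry *free_entry = NULL;

    for (int i = 0; i < WB_ENTRIES; i++) {
        WbEntry *en = &wb->e[i];
        if (en->used && en->block_addr == base) {
            /* Address match: merge the new word into the existing entry. */
            en->valid[slot] = true;
            en->data[slot] = value;
            return WB_MERGED;
        }
        if (!en->used && free_entry == NULL)
            free_entry = en;
    }

    if (free_entry == NULL)
        return WB_FULL;          /* no match, no room: the store must wait */

    memset(free_entry, 0, sizeof *free_entry);
    free_entry->used = true;
    free_entry->block_addr = base;
    free_entry->valid[slot] = true;
    free_entry->data[slot] = value;
    return WB_NEW_ENTRY;
}

/* Number of entries currently occupied. */
static int wb_entries_used(const WriteBuffer *wb)
{
    int n = 0;
    for (int i = 0; i < WB_ENTRIES; i++)
        n += wb->e[i].used ? 1 : 0;
    return n;
}
```

Stores to word addresses 100, 101, and 102 occupy one entry instead of three, so the eventual write to memory is a single multi-word transfer.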
9. Compiler optimizations - Reduce miss rate
  - Compiler research improvements
    - Instruction misses
    - Data misses
  - Optimizations include
    - Code & data rearrangement
    - Reorder procedures - might reduce conflict misses
    - Align basic blocks to beginning of a cache block - decreases chance of cache miss
    - Branch straightening - Change sense of branch test, swap basic blocks of branches
    - Data - Arrange to improve spatial & temporal locality
      - E.g., arrays by block
9. Compiler optimizations (continued) - Reduce miss rate
  - Loop interchange - Make code access data in the order it is stored, e.g.,

      /* Before, stride 100 */
      for (j = 0; j < 100; j++)
          for (i = 0; i < 500; i++)
              x[i][j] = 2 * x[i][j];

    vs.

      /* After, stride 1 */
      for (i = 0; i < 500; i++)
          for (j = 0; j < 100; j++)
              x[i][j] = 2 * x[i][j];

  - vs. blocking for Gaussian elimination?
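The notes mention blocking only in passing. As a hedged illustration of the idea, here is the commonly shown blocked matrix multiply (not Gaussian elimination), next to the naive version it must agree with. The sizes `MAT_N` and `MAT_BLK` and the function names are assumptions chosen to keep the example small; in practice the block size is picked so the working set fits in the cache.

```c
#include <assert.h>

enum { MAT_N = 8, MAT_BLK = 4 };   /* tiny hypothetical sizes */

static int min_int(int a, int b) { return a < b ? a : b; }

/* Naive x = y * z: walks whole rows of y and columns of z for each x[i][j]. */
static void matmul_naive(double x[MAT_N][MAT_N],
                         double y[MAT_N][MAT_N],
                         double z[MAT_N][MAT_N])
{
    for (int i = 0; i < MAT_N; i++)
        for (int j = 0; j < MAT_N; j++) {
            double r = 0.0;
            for (int k = 0; k < MAT_N; k++)
                r += y[i][k] * z[k][j];
            x[i][j] = r;
        }
}

/* Blocked x = y * z: works on MAT_BLK x MAT_BLK submatrices so the
 * pieces of y and z being reused stay in the cache between uses. */
static void matmul_blocked(double x[MAT_N][MAT_N],
                           double y[MAT_N][MAT_N],
                           double z[MAT_N][MAT_N])
{
    assert(MAT_BLK > 0);
    for (int i = 0; i < MAT_N; i++)
        for (int j = 0; j < MAT_N; j++)
            x[i][j] = 0.0;

    for (int jj = 0; jj < MAT_N; jj += MAT_BLK)
        for (int kk = 0; kk < MAT_N; kk += MAT_BLK)
            for (int i = 0; i < MAT_N; i++)
                for (int j = jj; j < min_int(jj + MAT_BLK, MAT_N); j++) {
                    double r = 0.0;
                    for (int k = kk; k < min_int(kk + MAT_BLK, MAT_N); k++)
                        r += y[i][k] * z[k][j];
                    x[i][j] += r;   /* accumulate the partial sum for this block */
                }
}
```

Loop interchange only reorders accesses; blocking also changes how much data is touched between reuses, which is why it helps when both row and column traversals are unavoidable.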
10. Hardware prefetching of instructions & data - Reduce miss penalty or miss rate
  - Prefetch instructions and data before the processor requests them
  - Fetching by block already tries to do this
  - On a miss, fetch the missed block and the next one
    - Block prediction?
  - Data access, similarly
    - Multiple streams? E.g., matrix * matrix
  - Pentium 4 can prefetch data into L2 from 8 streams from 8 different 4 KB pages
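The "fetch the missed block and the next one" policy can be sketched with a toy miss counter. The model makes strong simplifying assumptions: the cache never evicts, prefetched blocks always arrive in time, and block numbers stay below the hypothetical limit `PF_BLOCKS`.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum { PF_BLOCKS = 64 };   /* hypothetical number of distinct blocks tracked */

/* Counts misses over a sequence of block numbers. With prefetch_next
 * set, each miss on block b also brings in block b + 1. */
static unsigned count_misses(const unsigned *blocks, size_t n, bool prefetch_next)
{
    bool present[PF_BLOCKS] = { false };
    unsigned misses = 0;

    for (size_t i = 0; i < n; i++) {
        unsigned b = blocks[i];
        assert(b < PF_BLOCKS);
        if (!present[b]) {
            misses++;
            present[b] = true;
            if (prefetch_next && b + 1 < PF_BLOCKS)
                present[b + 1] = true;   /* next-block prefetch on a miss */
        }
    }
    return misses;
}
```

In this model a sequential sweep over 16 blocks misses 16 times without prefetching and 8 times with it, while a sweep that touches every other block gains nothing, which is one motivation for stream-aware prefetchers.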
11. Compiler-controlled prefetch - Reduce miss penalty or miss rate
  - Compiler inserts instructions to prefetch
    - To register
    - To cache
  - Faulting or nonfaulting?
    - Should a prefetch cause a page fault or memory protection fault?
  - Assume nonfaulting cache prefetch
    - Does not change contents of registers or memory
    - Does not cause memory fault
  - Goal: Overlap execution with prefetching
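A nonfaulting cache prefetch can be written by hand with the GCC/Clang builtin `__builtin_prefetch(addr, rw, locality)`, which is a reasonable stand-in for what a compiler would insert. In this sketch the function name and the prefetch distance of 16 elements are assumptions; a suitable distance depends on the machine's miss latency and the work per iteration.

```c
#include <assert.h>
#include <stddef.h>

enum { PREFETCH_DISTANCE = 16 };   /* hypothetical; tune per machine */

/* Sums an array while requesting data a fixed distance ahead, so the
 * memory access overlaps with the additions on earlier elements. */
static double sum_with_prefetch(const double *a, size_t n)
{
    assert(a != NULL || n == 0);
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) {
#if defined(__GNUC__) || defined(__clang__)
        /* rw = 0: prefetch for reading; locality = 3: keep in all cache levels.
         * The prefetch is only a hint: it changes no registers or memory. */
        if (i + PREFETCH_DISTANCE < n)
            __builtin_prefetch(&a[i + PREFETCH_DISTANCE], 0, 3);
#endif
        sum += a[i];
    }
    return sum;
}
```

The result is identical with or without the prefetch; only the timing can change. For a simple sequential sum like this one, a hardware prefetcher may already hide most of the latency, so the benefit is more likely on irregular access patterns.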