Exploiting Load Latency Tolerance for Relaxing Cache Design Constraints
Ramu Pyreddy, Gary Tyson
Advanced Computer Architecture Laboratory, University of Michigan
Motivation
- Increasing memory-processor frequency gap
- Large data caches to hide long latencies
- Larger caches mean longer access latencies [McFarland 98]
- Processor cycle time determines cache size
  - Intel Pentium III: 16K DL1 cache, 3-cycle access
  - Intel Pentium 4: 8K DL1 cache, 2-cycle access
- Need large AND fast caches!
Related Work
- Load Latency Tolerance [Srinivasan & Lebeck, MICRO 98]
  - All loads are NOT equal
  - Determining criticality is very complex: requires a sophisticated simulator with rollback
- Non-Critical Buffer [Fisk & Bahar, ICCD 99]
  - Determining criticality: performance degradation / dependency chains
  - Non-critical buffer: a victim cache for non-critical loads
  - Small performance improvements (up to 4%)
Related Work (contd.)
- Locality vs. Criticality [Srinivasan et al., ISCA 01]
  - Determining criticality: practical heuristics
  - Potential for improvement: 40%
  - Locality is better than criticality
- Non-Vital Loads [Rakvic et al., HPCA 02]
  - Determining criticality: run-time heuristics
  - Small and fast vital cache for vital loads
  - 17% performance improvement
Load Latency Tolerance
Criticality
- Criticality: the effect of load latency on performance
- Two thresholds: performance (IPC) and latency
- A very direct estimation of criticality
- Computation intensive!
- Static classification (per load instruction)
Determining Criticality: A Closer Look
- IPC threshold = 99.6%
- Latency threshold = 8 cycles
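A minimal C sketch of the resulting two-threshold test is shown below, assuming a hypothetical per-load profiling step that re-measures IPC with the load's latency stretched to the latency threshold; it illustrates the idea only and is not the authors' simulator code.

#include <stdbool.h>

#define IPC_THRESHOLD 0.996   /* IPC must stay within 99.6% of baseline   */
#define LAT_THRESHOLD 8       /* cycles the load's latency is stretched to */

/* baseline_ipc:     IPC of the unmodified run.
 * ipc_at_threshold: IPC measured when this static load's latency is raised
 *                   to LAT_THRESHOLD cycles (hypothetical profiling input). */
bool load_is_critical(double baseline_ipc, double ipc_at_threshold)
{
    /* Latency tolerant (non-critical) if the stretched run keeps IPC above
     * the IPC threshold; critical otherwise. */
    return ipc_at_threshold < IPC_THRESHOLD * baseline_ipc;
}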
Most Frequently Executed Loads
Criticality (contd.)
Number of static load instructions accounting for 80% of load references (SPECINT 2000):

Benchmark   # of Load Insns for 80% of Load Refs
BZIP2       130
CRAFTY      905
EON         550
GAP         100
GCC         4650!
GZIP        74
MCF         115
PARSER      305
TWOLF       185
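A count such as "74 static loads cover 80% of GZIP's load references" can be derived from a per-PC reference profile: sort static loads by dynamic reference count and accumulate until the coverage target is reached. The C sketch below assumes a hypothetical load_profile_t array and is not taken from the paper's tooling.

#include <stdlib.h>

/* Hypothetical per-static-load profile entry: program counter of the load
 * and how many dynamic references it issued. */
typedef struct { unsigned long pc; unsigned long refs; } load_profile_t;

static int by_refs_desc(const void *a, const void *b)
{
    unsigned long ra = ((const load_profile_t *)a)->refs;
    unsigned long rb = ((const load_profile_t *)b)->refs;
    return (ra < rb) - (ra > rb);          /* sort by reference count, descending */
}

/* How many static loads account for `fraction` (e.g. 0.80) of all references. */
size_t loads_covering(load_profile_t *prof, size_t n, double fraction)
{
    unsigned long total = 0, running = 0;
    for (size_t i = 0; i < n; i++)
        total += prof[i].refs;

    qsort(prof, n, sizeof prof[0], by_refs_desc);
    for (size_t i = 0; i < n; i++) {
        running += prof[i].refs;
        if ((double)running >= fraction * (double)total)
            return i + 1;                  /* e.g. 74 for GZIP at 80% */
    }
    return n;
}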
Critical Cache Configuration
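The figure on this slide is not reproduced here. As a rough illustration of the idea, the C sketch below routes loads classified as critical to a small, fast critical cache and everything else to the conventional DL1; the sizes, latencies, allocation policy, and fallback path are all assumptions, not the authors' exact design.

#include <stdbool.h>

/* Tiny direct-mapped tag arrays, used only to illustrate the routing of
 * critical vs. non-critical loads; sizes and latencies are assumed. */
enum { CRIT_LINES = 32, DL1_LINES = 512, LINE_SHIFT = 5 };   /* 32-byte lines */

static unsigned long crit_tags[CRIT_LINES];   /* ~1KB critical cache (assumed 1-cycle) */
static unsigned long dl1_tags[DL1_LINES];     /* ~16KB DL1 (assumed 3-cycle)           */

static bool probe(unsigned long *tags, int nlines, unsigned long addr)
{
    unsigned long line = addr >> LINE_SHIFT;
    unsigned long idx  = line % (unsigned long)nlines;
    if (tags[idx] == line)
        return true;                           /* hit               */
    tags[idx] = line;                          /* allocate on miss  */
    return false;
}

/* Returns an illustrative load-to-use latency in cycles for one reference. */
int service_load(unsigned long addr, bool is_critical)
{
    if (is_critical && probe(crit_tags, CRIT_LINES, addr))
        return 1;                              /* fast critical-cache hit                */
    if (probe(dl1_tags, DL1_LINES, addr))
        return 3;                              /* conventional DL1 hit                   */
    return 3 + 10;                             /* DL1 miss served by DL2 (assumed 10 cy) */
}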
Effectiveness?
- Load reference distribution
  - What percentage of loads are identified as critical?
  - Miss rate for critical load references
- Critical cache configuration compared with a faster conventional cache configuration
  - DL1/DL2 latencies: 3/10, 6/20, 9/30 cycles
- Critical cache configuration compared with a larger conventional cache configuration
  - DL1 sizes: 8KB, 16KB, 32KB, 64KB
Processor Configuration
Similar to Alpha, modeled with SimpleScalar 3.0 [Austin, Burger 97]

Fetch width              8 instructions per cycle
Fetch queue size         64
Branch predictor         2-level, 4K-entry level 2
Branch target buffer     2K entries, 8-way associative
Issue width              4 instructions per cycle
Decode width             4 instructions per cycle
RUU size                 128
Load/store queue size    32
Instruction cache        64KB, 2-way, 64-byte lines
L2 cache                 1MB, 2-way, 128-byte lines
Memory latency           64 cycles
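For convenience, the same parameters gathered into a single C record, as one might hand to a SimpleScalar-style out-of-order model; the struct and field names are illustrative only and are not SimpleScalar's actual configuration interface.

/* Machine parameters from the table above, as an illustrative C struct
 * (not SimpleScalar's real option structures). */
typedef struct {
    int fetch_width;                  /* 8 instructions per cycle            */
    int fetch_queue_size;             /* 64                                   */
    int bpred_l2_entries;             /* 2-level predictor, 4K-entry level 2  */
    int btb_entries, btb_assoc;       /* 2K entries, 8-way associative        */
    int issue_width;                  /* 4 instructions per cycle             */
    int decode_width;                 /* 4 instructions per cycle             */
    int ruu_size;                     /* 128                                  */
    int lsq_size;                     /* 32                                   */
    int il1_kb, il1_assoc, il1_line;  /* 64KB, 2-way, 64-byte lines           */
    int l2_kb,  l2_assoc,  l2_line;   /* 1MB, 2-way, 128-byte lines           */
    int mem_lat;                      /* 64 cycles                            */
} sim_config_t;

static const sim_config_t baseline_config = {
    .fetch_width = 8,  .fetch_queue_size = 64,
    .bpred_l2_entries = 4096,
    .btb_entries = 2048, .btb_assoc = 8,
    .issue_width = 4,  .decode_width = 4,
    .ruu_size = 128,   .lsq_size = 32,
    .il1_kb = 64,  .il1_assoc = 2, .il1_line = 64,
    .l2_kb = 1024, .l2_assoc = 2,  .l2_line = 128,
    .mem_lat = 64,
};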
Results
Per-benchmark table: number of critical load instructions, critical load references (% of total load references), and miss rate of critical loads with a 1KB critical cache, for BZIP, CRAFTY, EON, GAP, GZIP, MCF, PARSER, and TWOLF.
Results
Comparison with a faster conventional cache configuration
- IPCs normalized to the 16K, 1-cycle configuration
- 25-66% of the penalty due to a slower cache is eliminated
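One way to read the "25-66% of the penalty eliminated" figures is as the fraction of the IPC gap between the slower conventional configuration and the 1-cycle baseline that the critical cache configuration recovers. The C sketch below assumes that interpretation; it is not a formula stated on the slide.

/* Fraction of the slow-cache IPC penalty recovered by the critical cache
 * configuration, under the interpretation described above (an assumption). */
double penalty_eliminated_pct(double ipc_fast_baseline,  /* e.g. 16K, 1-cycle DL1 */
                              double ipc_slow_conv,      /* same size, slower DL1 */
                              double ipc_critical_cfg)   /* critical cache config */
{
    return 100.0 * (ipc_critical_cfg - ipc_slow_conv)
                 / (ipc_fast_baseline - ipc_slow_conv);
}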
Results
Comparison with a faster conventional cache configuration
- IPCs normalized to the 32K, 1-cycle configuration
- 25-70% of the penalty due to a slower cache is eliminated
Results
Comparison with a larger conventional cache configuration
- IPCs normalized to the 16K, 3-cycle configuration
Results
Comparison with a larger conventional cache configuration
- IPCs normalized to the 32K, 6-cycle configuration
- The critical cache configuration outperforms a larger conventional cache
Conclusions & Future Work
Conclusions
- Compares well with a faster conventional cache
- Outperforms a larger conventional cache in most cases
Future Work
- More heuristics to refine "criticality"
- Why are "critical loads" critical?
- Criticality of a memory address vs. criticality of a load instruction
- Criticality for low-power caches