On the Importance of Optimizing the Configuration of Stream Prefetchers
Ilya Ganusov, Martin Burtscher
Computer Systems Laboratory, Cornell University
MSP, June 12, 2005

Introduction
Memory wall
–Increasing gap between processor and memory speeds
–Focus on bandwidth at the expense of latency
Prefetch important data
–Do not wait until the processor requests data
–Proactively fetch data that is likely to be consumed in the near future
Stream Prefetching
Prefetching with outcome-based prediction
–Use the history of previous misses to guess data addresses that are likely to miss soon
Stream prefetching
–A special case of outcome-based prediction
–Proposed 15 years ago
–The only hardware prefetching scheme used in modern microprocessors
Contributions
Detailed sensitivity analysis of the main prefetcher parameters on SPECcpu2000 programs
–No such study exists in the literature
–Many research papers fail to specify prefetcher parameters in comparative studies
Case study
–Evaluate the performance of runahead execution on baselines with different stream prefetcher parameters
Outline
Introduction
Stream Prefetcher Operation
Evaluation Methodology
Experimental Results
Conclusion
How Stream Prefetchers Work
[Figure: a stream table of {valid, stream address, stride} entries. Each address in the global miss history is compared against the table; on a stream match, the address generation unit (AGU) issues a prefetch for addr + stride * lookahead]
Measured Parameters
[Figure: the same stream table, annotated with the three parameters under study: miss history length, number of supported streams, and prefetch distance]
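The operation shown in the two figures above can be sketched in software. The following is a minimal toy model, not the hardware studied in the talk; the class and method names are assumptions, and the stream-detection heuristic is deliberately simplified. It exposes the three measured parameters: miss history length, number of supported streams, and prefetch distance (the lookahead).

```python
from collections import deque

class StreamPrefetcher:
    """Toy sketch of a stride-based stream prefetcher (illustrative only;
    structure and names are assumptions, not the paper's implementation)."""

    def __init__(self, history_len=16, num_streams=8, distance=4):
        self.history = deque(maxlen=history_len)  # global miss history
        self.streams = []                         # [{'addr': ..., 'stride': ...}]
        self.num_streams = num_streams            # stream table capacity
        self.distance = distance                  # prefetch distance (lookahead)

    def miss(self, addr):
        """Called on each cache miss; returns a prefetch address or None."""
        # 1. Does the miss extend an existing stream in the table?
        for s in self.streams:
            if addr == s['addr'] + s['stride']:
                s['addr'] = addr
                # AGU: prefetch 'distance' strides ahead of the current miss
                return addr + s['stride'] * self.distance
        # 2. Otherwise, try to detect a new constant-stride stream
        #    from the recorded miss history.
        for prev in self.history:
            stride = addr - prev
            if stride != 0 and (prev - stride) in self.history:
                if len(self.streams) >= self.num_streams:
                    self.streams.pop(0)           # evict the oldest stream
                self.streams.append({'addr': addr, 'stride': stride})
                return addr + stride * self.distance
        self.history.append(addr)
        return None
```

With this sketch, three misses at a constant stride of 64 allocate a stream, after which each further miss triggers a prefetch four strides ahead; shrinking `history_len`, `num_streams`, or `distance` models the sensitivity experiments that follow.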
Evaluation Methodology
Benchmarks
–22 SPECcpu2000 programs, highly optimized
–All F77, C, and C++ programs
–Multiple reference inputs per program
–SimPoint interval of 500 million instructions
Simulated architecture
–SimpleScalar v4.0 cycle-accurate simulator
–Aggressive superscalar Alpha-like core
Simulated System
Execution Core
–Fetch/issue/commit: 4/4/4
–I-window/ROB/LSQ: 64/128/64
–LdSt/Int/FP units: 2/4/2
–Execution latencies: similar to Alpha
–Branch predictor: 16K-entry bimodal/gshare hybrid
Memory Subsystem
–Cache sizes: 64KB IL1, 64KB DL1, 1MB L2
–Cache associativity: 2-way L1, 4-way L2
–Cache latencies: 2 cyc L1, 20 cyc L2
–Main memory latency: 400 cycles
Outline
Introduction
Stream Prefetcher Operation
Evaluation Methodology
Experimental Results
Conclusion
Miss History Length
–7 programs are very sensitive
–A 16-entry history is enough
Number of Stream Table Entries
–Only 3 programs are sensitive
–More than 8 streams provides little benefit
L2 Cache Prefetch Distance
–11 programs are very sensitive
–FP speedup varies between 80% and 140%
Case Study: Runahead Execution
The performance of stream prefetching is highly dependent on parameter choice
Another proposal: runahead execution
–Pseudo-retire long-latency loads that stall the pipeline and continue executing
–Roll back to a checkpoint after the load returns from memory
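The two bullets above can be illustrated with a toy trace-driven sketch. This is an assumption-laden simplification, not the paper's simulator: the "checkpoint" is just a trace index, runahead here runs to the end of the trace rather than for the miss latency, and every miss seen during runahead is counted as a prefetch.

```python
def execute(trace, cache):
    """Toy sketch of runahead execution (illustrative assumptions only).
    trace: list of ('load', addr) or ('op', None) tuples.
    cache: set of addresses currently resident.
    On a load miss, checkpoint, keep executing speculatively so later
    misses become prefetches, then roll back and re-execute."""
    prefetched = set()
    i = 0
    while i < len(trace):
        kind, addr = trace[i]
        if kind == 'load' and addr not in cache:
            checkpoint = i                      # checkpoint architectural state
            # Runahead mode: execute past the stalled load. (A real core
            # would stop when the miss returns; we scan the rest for brevity.)
            for kind2, addr2 in trace[i + 1:]:
                if kind2 == 'load' and addr2 not in cache:
                    prefetched.add(addr2)       # runahead miss becomes a prefetch
                    cache.add(addr2)
            cache.add(addr)                     # original miss returns from memory
            i = checkpoint                      # roll back to the checkpoint
        i += 1                                  # re-execution now hits in cache
    return prefetched
```

The point of the case study is that how much such prefetches help depends on how well the baseline stream prefetcher is already tuned, which is what the next slide quantifies.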
Speedup over Stream Prefetching
–SPECfp speedup drops by more than 2x
Conclusion
Key observations
–The performance of the stream prefetcher is highly dependent on its configuration
–Varying the prefetch distance alone almost doubles the average performance benefit
–Choosing a non-optimal stream prefetcher as a baseline can distort results by a factor of two
Conclusion
–Parameter optimization is imperative when comparing stream prefetchers to other prefetching techniques