Prefetching Using a Global History Buffer
Kyle J. Nesbit and James E. Smith
Electrical and Computer Engineering, University of Wisconsin - Madison

February 2004

Outline
Motivation
Related Work
Global History Buffer Prefetching
Results
Conclusion

Motivation
D-cache misses to main memory are of increasing importance
  – Main memory is getting farther away (in clock cycles)
  – Many demanding, memory-intensive workloads
Computation is inexpensive compared to data accesses
  – Good opportunity to reevaluate prefetching data structures
  – Simple computation can supplement table information
We consider prefetches from main memory to the lowest-level cache (the L2 cache in this study)

Markov Prefetching
Markov prefetching forms address correlations
  – Joseph and Grunwald (ISCA ’97)
Uses global memory addresses as states in the Markov graph
Correlation Table approximates Markov graph
[Figure: the miss address stream A B C A B C B C ..., its Markov graph, and the Correlation Table with columns miss address, 1st predict., 2nd predict.]

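
A minimal sketch of how a correlation table could be built from the miss address stream on this slide. The function names, the table width of two predictions, and the newest-first replacement order are assumptions made for illustration; the slide does not specify a replacement policy.

```python
def build_correlation_table(misses, width=2):
    """Map each miss address to its most recent successors, newest first."""
    table = {}
    for prev, nxt in zip(misses, misses[1:]):
        successors = table.setdefault(prev, [])
        if nxt in successors:
            successors.remove(nxt)      # re-insert at the front (assumed policy)
        successors.insert(0, nxt)
        del successors[width:]          # keep at most `width` predictions
    return table


def predict(table, miss_address):
    """Return the prefetch candidates recorded for a miss address."""
    return list(table.get(miss_address, []))


stream = list("ABCABCBC")               # miss address stream from the slide
table = build_correlation_table(stream)
print(table)                            # {'A': ['B'], 'B': ['C'], 'C': ['B', 'A']}
print(predict(table, "C"))              # ['B', 'A']
```

Each distinct miss address needs its own table entry, which is why this table grows with the number of addresses in the working set.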
Correlation Prefetching
Distance Prefetching forms delta correlations
  – Kandiraju and Sivasubramaniam (ISCA ’02)
Delta-based prefetching leads to a much smaller table than “classical” Markov Prefetching
Delta-based prefetching can remove compulsory misses
[Figure: Markov Prefetching (miss address stream; table columns miss address, 1st predict., 2nd predict.) beside Distance Prefetching (global delta stream; table columns global delta, 1st predict., 2nd predict.)]

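
The delta correlation idea can be sketched as follows. This is an illustrative example only: `distance_prefetch`, the sample addresses, and the newest-first table policy are assumptions, not details taken from the slides.

```python
def distance_prefetch(misses, width=2):
    """Predict prefetch addresses by correlating deltas instead of addresses."""
    deltas = [b - a for a, b in zip(misses, misses[1:])]
    if not deltas:
        return []
    table = {}                          # delta -> deltas that followed it
    for prev, nxt in zip(deltas, deltas[1:]):
        successors = table.setdefault(prev, [])
        if nxt in successors:
            successors.remove(nxt)
        successors.insert(0, nxt)       # newest first (assumed policy)
        del successors[width:]
    # Apply each predicted delta to the current miss address.
    return [misses[-1] + d for d in table.get(deltas[-1], [])]


misses = [100, 101, 103, 104, 106, 107]  # hypothetical miss addresses
print(distance_prefetch(misses))          # [109]
```

The predicted address 109 never appears in the history, which illustrates how a delta-based scheme can cover a compulsory miss. The table here holds two entries (deltas 1 and 2), whereas an address-keyed table for the same stream would need one entry per address.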
Global History Buffer (GHB)
Holds miss address history in FIFO order
Linked lists within GHB connect related addresses
  – Same static load
  – Same global miss address
  – Same global delta
Linked list walk is short compared with L2 miss latency
[Figure: an Index Table, keyed here by load PC, pointing into the FIFO Global History Buffer of miss addresses]

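
A rough software model of the structure described above, under several assumptions: the class and method names are hypothetical, the index table is an unbounded dictionary rather than a fixed-size hardware table, and stale links are detected with sequence numbers.

```python
class GlobalHistoryBuffer:
    """FIFO of miss addresses with per-key linked lists (illustrative model)."""

    def __init__(self, size):
        self.size = size
        self.entries = [None] * size    # slot -> (miss_address, previous_seq)
        self.next_seq = 0               # sequence number of the next insert
        self.index_table = {}           # key -> seq of the newest entry

    def _live(self, seq):
        # An entry is live until the FIFO wraps around and overwrites it.
        return seq is not None and self.next_seq - self.size <= seq < self.next_seq

    def insert(self, key, miss_address):
        previous = self.index_table.get(key)
        self.entries[self.next_seq % self.size] = (miss_address, previous)
        self.index_table[key] = self.next_seq
        self.next_seq += 1

    def walk(self, key, limit=None):
        """Return the addresses linked under `key`, newest first."""
        history = []
        seq = self.index_table.get(key)
        while self._live(seq) and (limit is None or len(history) < limit):
            miss_address, seq = self.entries[seq % self.size]
            history.append(miss_address)
        return history


ghb = GlobalHistoryBuffer(size=4)
for pc, addr in [(0x10, 100), (0x20, 200), (0x10, 104), (0x20, 208), (0x10, 108)]:
    ghb.insert(pc, addr)                # keyed by load PC in this example
print(ghb.walk(0x10))                   # [108, 104]; address 100 was overwritten
```

Keying `insert` by load PC, by miss address, or by delta yields the three kinds of linked list named on the slide.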
GHB - Example
[Figure: a miss address stream, the Index Table (head pointer) keyed by global miss address, and the Global History Buffer (miss address, pointer); the key marks the current miss and the resulting prefetches]

GHB – Deltas
[Figure: miss address stream, global delta stream, and Markov graph, with the prefetches produced by the Width, Depth, and Hybrid methods; the key marks the current miss and the resulting prefetches]

GHB – Hybrid Delta
Width prefetching suffers from poor accuracy and short look-ahead
Depth prefetching has good look-ahead, but may miss prefetch opportunities when a number of “next” addresses have similar probability
The hybrid method combines depth and width

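
One way to picture the three methods is as a single routine with two knobs. The sketch below is an interpretation, not the authors' algorithm: it assumes that "width" means following several earlier occurrences of the current delta, that "depth" means following the chain of deltas after one occurrence, and that the hybrid does both. The function name and the sample addresses are hypothetical.

```python
def delta_prefetch(misses, width, depth):
    """Width/depth/hybrid prefetching over a global delta history (sketch)."""
    deltas = [b - a for a, b in zip(misses, misses[1:])]
    if not deltas:
        return []
    current, base = deltas[-1], misses[-1]
    prefetches, found = [], 0
    # Scan from newest to oldest for earlier occurrences of the current delta.
    for i in range(len(deltas) - 2, -1, -1):
        if deltas[i] != current:
            continue
        addr = base
        for d in deltas[i + 1 : i + 1 + depth]:   # follow the chain `depth` deep
            addr += d
            if addr not in prefetches:
                prefetches.append(addr)
        found += 1
        if found == width:                         # follow `width` occurrences
            break
    return prefetches


misses = [0, 1, 3, 4, 8, 9]                        # deltas: 1, 2, 1, 4, 1
print(delta_prefetch(misses, width=2, depth=1))    # width:  [13, 11]
print(delta_prefetch(misses, width=1, depth=2))    # depth:  [13, 14]
print(delta_prefetch(misses, width=2, depth=2))    # hybrid: [13, 14, 11, 12]
```

With `depth=1` the routine only looks one step ahead (short look-ahead), and with `width=1` it commits to a single successor chain, which matches the trade-off stated on the slide.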
GHB - Hybrid Example
[Figure: miss address stream and global delta stream, the Index Table (head pointer) keyed by global delta, and the Global History Buffer (miss address, pointer); the key marks the current miss and the resulting prefetches]

Simulation Methodology
Simulated SPEC CPU2000 benchmarks
Fast-forwarded 1 billion instructions, then simulated 1 billion instructions
Used peak binaries compiled with -O4 optimization
Results include all benchmarks that have at least a 5% IPC improvement with an ideal L2 cache

Parameter          Value
Issue Width        4 Instructions
Load Store Queue   64 Entries
RUU Size           128 Entries
Level 1 D-Cache    16 KB, 2-way
Level 1 I-Cache    16 KB, 2-way
Level 2 Cache      512 KB, 4-way
Memory Latency     140 Cycles

Simulation Methodology
Table walk takes one cycle per access
Index Table (IT) size reduces table conflicts
GHB size reflects the prefetch history working set
In general, GHB prefetching requires less history

Prefetching Method                 Table Configuration               Size
Conventional Distance Prefetching  512 Table Entries                 18 KB
GHB Distance Prefetching           512 IT Entries & 512 GHB Entries  8 KB

Results
Our results compare:
  – IPC improvement (harmonic mean) vs. prefetch degree
  – Increase in memory traffic per instruction (arithmetic mean) vs. prefetch degree
  – Prefetch accuracy: the percent of prefetches that are used by the program

Distance Prefetching (Performance)
[Figure: IPC Improvement vs. Prefetch Degree for Table (width), GHB (width), GHB (depth), and GHB (hybrid)]

Distance Prefetching (Performance)
[Figure: IPC Improvement per benchmark (ammp, art, wupwise, swim, lucas, mgrid, applu, galgel, apsi, mcf, twolf, vpr, parser, gap, bzip2, and hmean) for Table (width), GHB (width), GHB (depth), and GHB (hybrid); one off-scale bar is labeled ~300%]

Distance Prefetching (Memory Traffic)
[Figure: Increase in Memory Traffic vs. Prefetch Degree for Table (width), GHB (width), GHB (depth), and GHB (hybrid)]

Conclusions
More complete picture of history
  – Allows width, depth, and hybrid prefetching
  – Can also improve other prefetching methods (covered in depth in the paper)
Eliminates stale history in a natural way
  – FIFO discards old history to make room for new history
  – In a conventional table, old history can remain for a very long time and trigger inaccurate prefetches

Acknowledgements
This research was funded by:
  – An Intel Undergraduate Research scholarship
  – A University of Wisconsin Hilldale Undergraduate Research fellowship
  – The National Science Foundation under grants CCR and EIA

Backup Slides

Prefetching Metrics
Accuracy is the percent of prefetches that are actually used.
Coverage is the percent of memory references prefetched rather than demand fetched.
Timeliness indicates whether prefetched data arrives early enough to prevent the processor from stalling.

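
The first two metrics reduce to simple ratios. The sketch below assumes they are computed from three event counts (prefetches issued, prefetches later used, and demand fetches); the function names and the sample counts are made up for illustration.

```python
def prefetch_accuracy(used_prefetches, issued_prefetches):
    """Percent of issued prefetches that the program actually used."""
    if issued_prefetches == 0:
        return 0.0
    return 100.0 * used_prefetches / issued_prefetches


def prefetch_coverage(used_prefetches, demand_fetches):
    """Percent of fetched references served by a prefetch, not a demand fetch."""
    total = used_prefetches + demand_fetches
    if total == 0:
        return 0.0
    return 100.0 * used_prefetches / total


# Hypothetical counts: 40 prefetches issued, 30 used, 70 demand fetches remain.
print(prefetch_accuracy(30, 40))    # 75.0
print(prefetch_coverage(30, 70))    # 30.0
```

Timeliness is not captured here because it depends on when each prefetch arrives relative to its first use, which requires cycle-level timing rather than counts.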
GHB – Deltas
[Figure: miss address stream, global delta stream, and Markov graph; the key marks the current miss and the resulting prefetches]

Prefetch Taxonomy
To simplify the discussion and illustrate the relation between prefetching methods, we introduce a consistent naming convention.
Each name is an X/Y pair.
  – X is the key used for localizing the address stream.
  – Y is the method for detecting address patterns.

Prefetch Taxonomy
We study two localizing methods
  – No localization, or global (G)
  – Program Counter (PC)
And three pattern detection methods
  – Address Correlation (AC)
  – Delta Correlation (DC)
  – Constant Stride (CS)

Prefetch Taxonomy
Markov Prefetching - G/AC
Distance Prefetching - G/DC
Stride Prefetching - PC/CS

Stride Prefetching
Table tracks the local history of loads.
If a constant stride is detected in a load’s local history, then n + s, n + 2s, …, n + ds are prefetched.
  – n is the current target address
  – s is the detected stride
  – d is the prefetch degree, or aggressiveness, of the prefetching

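Stride Prefetching
[Figure: Reference Prediction Table indexed by the PC of the load, with fields Tag, Last Address, Stride, and State; a subtractor (sub) and an adder (add) combine the Target Address with the table fields to form the Prefetch Address]
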
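
A simplified model of a table-based stride prefetcher, generating n + s, n + 2s, …, n + ds once a stride repeats. The class name is hypothetical, the State field is reduced to a single "steady" flag, and the table is an unbounded dictionary; a real reference prediction table has tags, a fixed size, and a richer state machine.

```python
class StridePrefetcher:
    """Per-PC stride detection with prefetch degree d (illustrative model)."""

    def __init__(self, degree):
        self.degree = degree
        self.table = {}                 # pc -> (last_address, stride, steady)

    def access(self, pc, address):
        entry = self.table.get(pc)
        if entry is None:
            self.table[pc] = (address, 0, False)
            return []
        last_address, stride, _ = entry
        new_stride = address - last_address         # the "sub" step
        steady = new_stride == stride and new_stride != 0
        self.table[pc] = (address, new_stride, steady)
        if not steady:
            return []
        # The "add" step: n + s, n + 2s, ..., n + ds
        return [address + k * new_stride for k in range(1, self.degree + 1)]


prefetcher = StridePrefetcher(degree=3)
print(prefetcher.access(0x40, 100))     # []  first access, no history
print(prefetcher.access(0x40, 108))     # []  stride 8 seen once
print(prefetcher.access(0x40, 116))     # [124, 132, 140]
```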
GHB – Stride Prefetching
GHB-Stride uses the PC to access the index table.
The linked lists contain the local history of each load.
Compare the last two local strides. If they are the same, then prefetch n + s, n + 2s, …, n + ds.
[Figure: Index Table (head pointer) indexed by PC, pointing into the Global History Buffer (miss address, pointer); the two most recent strides of one load are compared (=?)]

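
The stride comparison can be sketched on a load's local history as it would come out of a linked-list walk, newest address first. The function name is hypothetical, and ignoring a stride of zero is an added assumption.

```python
def ghb_stride_prefetch(local_history, degree):
    """Prefetch n + s .. n + d*s if the last two local strides match.

    `local_history` holds one load's miss addresses, newest first.
    """
    if len(local_history) < 3:
        return []                       # two strides need three addresses
    n, previous, older = local_history[:3]
    last_stride = n - previous
    earlier_stride = previous - older
    if last_stride != earlier_stride or last_stride == 0:
        return []
    return [n + k * last_stride for k in range(1, degree + 1)]


print(ghb_stride_prefetch([116, 108, 100], degree=2))   # [124, 132]
print(ghb_stride_prefetch([116, 108, 104], degree=2))   # []
```

Unlike a reference prediction table, no stride or state is stored here; both strides are recomputed from the history on each miss.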
GHB – Local Delta Correlation
Form delta correlations within each load’s local history.
For example, consider the local miss address stream:
[Figure: the stream’s addresses and deltas]

Correlation   Prefetch Predictions
(1, 1)        62, 1, 1
(1, 62)       1, 1, 62
(62, 1)       1, 62, 1

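
The table above can be reproduced with a short search over a load's local history. In this sketch the function name is hypothetical, the key is assumed to be the last two deltas, and the sample addresses are invented so that their deltas repeat the 1, 1, 62 pattern shown in the table.

```python
def local_delta_correlation(addresses, degree):
    """Predict addresses by matching the last delta pair in local history.

    `addresses` holds one load's miss addresses, oldest first.
    """
    deltas = [b - a for a, b in zip(addresses, addresses[1:])]
    if len(deltas) < 3:
        return []
    key = (deltas[-2], deltas[-1])
    # Search backwards for the most recent earlier occurrence of the pair.
    for i in range(len(deltas) - 2, 0, -1):
        if (deltas[i - 1], deltas[i]) == key:
            addr, predictions = addresses[-1], []
            for d in deltas[i + 1 : i + 1 + degree]:
                addr += d               # replay the deltas that followed
                predictions.append(addr)
            return predictions
    return []


# Assumed addresses; their deltas are 1, 1, 62, 1, 1, 62, 1, 1.
history = [0, 1, 2, 64, 65, 66, 128, 129, 130]
print(local_delta_correlation(history, degree=3))       # [192, 193, 194]
```

Here the key (1, 1) matches an earlier pair that was followed by deltas 62, 1, 1, which is the first row of the table.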