Data Cache Prefetching using a Global History Buffer

Presented by: Chuck (Chengyan) Zhao
Mar 30, 2004

Written by:
- Kyle Nesbit
- James Smith
Department of Electrical and Computer Engineering
University of Wisconsin, Madison
1. Introduction
Cache hierarchy:
  CPU registers: very few, fastest
  L1 cache: usually 8 KB; larger than the register file, slower than registers
  L2 cache: usually 256/512 KB; larger than L1, slower than L1
  L3 cache (optional): usually 1/2 MB; larger than L2, slower than L2
  Main memory:
    – usually 256/512 MB or more
    – larger than L3, slowest
CPU-Memory Cache Hierarchy
Each level of the cache hierarchy:
  – latency is roughly 10 times that of the previous (faster) level
Problems with the cache hierarchy architecture:
  – limited capacity (size)
  – limited associativity
Solution to these problems: effective prefetching

2. Prefetching techniques
  1. Sequential prefetching
     1. What: prefetch the cache lines that immediately follow the current cache line (the line that missed)
     2. Algorithm:
        – early: prefetch after each cache miss
        – mature: issue prefetches only after a sequential access pattern has been established
     Degree of prefetching:
        – maximum number of cache lines prefetched in response to a single prefetch request
        – chosen large enough to completely hide the latency of a miss to main memory
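The sequential scheme and its degree parameter can be illustrated with a minimal sketch. The function name and the 64-byte line size are assumptions made for this example only; they do not come from the paper.

```python
def sequential_prefetch(miss_addr, degree, line_size=64):
    """Return the start addresses of the `degree` cache lines that
    immediately follow the line containing `miss_addr`.

    line_size=64 is an assumed value, not a parameter from the paper.
    """
    line = miss_addr // line_size  # index of the line that missed
    return [(line + i) * line_size for i in range(1, degree + 1)]


# A miss at 0x1000 with degree 4 requests the next four lines.
print([hex(a) for a in sequential_prefetch(0x1000, degree=4)])
```

A larger degree hides more latency, at the price of extra memory traffic when the access pattern turns out not to be sequential.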
  2. Table-based prefetching
     What:
        – records history information about data accesses
     Operation:
        – the table is accessed with a key (the Program Counter of the load instruction, or the miss address)
        – the stored history is used to predict what to prefetch
     Evaluation:
        – Pro: simple
        – Con: inefficient
           » fixed amount of history for each prefetching key
           » stale data: an entry can sit in the table for a very long time; by the time it is used, the memory access behavior has changed

3. Global History Buffer (GHB) prefetching
  1. Organization: Fig. 1(b)
  2. Features:
     FIFO table: cache misses enter at the bottom and move up toward the top
     Separate IT and GHB
     Fixed table size
     Circular table: on overflow, the oldest entries are overwritten
     – Benefits of GHB:
        reduces stale data
        more accurate reconstruction of history access patterns
        enables more effective prefetching algorithms

4. Table-based prefetching techniques
  1. Stride Prefetching: Fig. 2
     – the following addresses are fetched:
        » a + s
        » a + 2s
        » …
        » a + d·s
       where:
        a: target address
        s: detected stride
        d: degree of prefetching
       note: in this case, the stride s is constant
  2. Correlation Prefetching (Markov Prefetching): explain with figure
     1. Uses a history table to record cache misses
     2. The miss address indexes the correlation table
     3. Each entry:
        1. a list of addresses that have immediately followed the current miss address
        2. most recent miss first
     4. Markov graph:
        1. each node: a cache miss address
        2. each edge: the probability that the source miss is immediately followed by the target miss
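The two conventional table-based schemes above can be sketched in a few lines. This is an illustration under assumptions, not the hardware described in the paper: the names `stride_prefetch` and `MarkovTable` are hypothetical, and the limit of two successors per entry is an arbitrary choice for the example.

```python
from collections import defaultdict


def stride_prefetch(a, s, d):
    """Addresses a + s, a + 2s, ..., a + d*s for target address a,
    constant stride s and prefetch degree d."""
    return [a + k * s for k in range(1, d + 1)]


class MarkovTable:
    """Correlation table: miss address -> addresses that immediately
    followed it, most recent first (hypothetical sketch)."""

    def __init__(self, successors_per_entry=2):
        self.width = successors_per_entry  # fixed history per entry
        self.table = defaultdict(list)
        self.last_miss = None

    def record_miss(self, addr):
        if self.last_miss is not None:
            succ = self.table[self.last_miss]
            if addr in succ:
                succ.remove(addr)      # re-insert at the front below
            succ.insert(0, addr)       # most recent successor first
            del succ[self.width:]      # drop history beyond the fixed width
        self.last_miss = addr

    def predict(self, addr):
        """Prefetch candidates for a miss on `addr`."""
        return list(self.table.get(addr, []))


print(stride_prefetch(a=1000, s=8, d=4))

m = MarkovTable()
for miss in [100, 200, 300, 100, 400, 100]:
    m.record_miss(miss)
print(m.predict(100))
```

The fixed `width` makes the table's limitation visible: each key keeps only a small, fixed amount of history no matter how useful older successors might be.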
  3. Distance Prefetching: Fig. 4 + explain
     – generalized correlation prefetching
     – uses the distance (between two consecutive global miss addresses) to index the correlation table

Problems with table-based prefetching:
  – Table data becomes stale: neither used nor refreshed
  – Table entry conflicts: multiple access keys map to the same table entry
  – Fixed and small amount of history per entry: Fig. 3 keeps two pieces of history per entry

5. Global History Buffer (GHB) based prefetching
  1. Table structure: Fig. 1(b)
     – IT: Index Table
        – accessed by a key, as in traditional table-based prefetching
        – key: Program Counter, cache miss address, or a combination of the two
        – holds pointers into the GHB
GHB (cont)
     – GHB: n-entry FIFO circular table
        – holds the n most recent misses
        – each entry:
           » global miss address
           » pointer: chains GHB entries into an address list (the access history for the same key)

Notation used later:
  Prefetching method: X / Y
     – X (indexing):
        » PC: Program Counter based indexing
        » G: global address based indexing
     – Y (detection):
        » CS: Constant Stride
        » DC: Delta Correlation
        » AC: Address Correlation
     – Different combinations of X and Y yield different prefetching methods
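The IT + GHB organization can be sketched as a small data structure. This is a simplified software model under assumptions: the class and method names are hypothetical, a Python dict stands in for the fixed-size Index Table, and link pointers are stored as unbounded insertion counts rather than fixed-width hardware pointers.

```python
class GlobalHistoryBuffer:
    """Index Table plus n-entry circular FIFO of miss addresses (sketch)."""

    def __init__(self, size):
        self.size = size
        self.addr = [None] * size   # global miss address per entry
        self.link = [None] * size   # position of the previous entry with the same key
        self.head = 0               # number of misses inserted so far
        self.index_table = {}       # key -> position of the most recent entry

    def _valid(self, pos):
        # A position is still in the buffer if it has not been overwritten.
        return pos is not None and self.head - pos <= self.size

    def insert(self, key, miss_addr):
        pos = self.head
        slot = pos % self.size                      # circular: overwrite the oldest entry
        self.addr[slot] = miss_addr
        self.link[slot] = self.index_table.get(key) # chain to the previous miss of this key
        self.index_table[key] = pos
        self.head += 1

    def history(self, key):
        """Miss addresses recorded for `key`, most recent first."""
        out = []
        pos = self.index_table.get(key)
        while self._valid(pos):
            slot = pos % self.size
            out.append(self.addr[slot])
            pos = self.link[slot]
        return out


ghb = GlobalHistoryBuffer(size=4)
for key, miss in [("pc1", 10), ("pc2", 500), ("pc1", 20), ("pc1", 30)]:
    ghb.insert(key, miss)
print(ghb.history("pc1"))   # chain for pc1, newest miss first
```

With a PC key the chain is a per-load miss history (the PC/… methods); with the miss address as key it links repeated occurrences of the same address (the G/… methods).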
  2. GHB for Correlation Prefetching
     - Fig. 5
     - Explain: breadth first, shaded area
  3. GHB for Stride Prefetching
     - PC / CS
     - Use Fig. 5 again to explain (depth first)

6. Global History Buffer (GHB) error handling
  - How errors can occur:
     - a GHB entry is overwritten (the buffer is circular)
     - pointers to that entry become obsolete, because the information they referenced has been replaced
  - Solution:
     - Give each pointer extra bits beyond those needed to index the GHB; the low-order bits select the entry
     - Compare: if (head pointer - ref pointer) > table size, the pointer is stale (an error)
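The stale-pointer comparison can be sketched with fixed-width pointers. The function names and the chosen widths are assumptions for illustration; here the head pointer is taken to be the position the next miss will be written to, and the table size is assumed to be a power of two so that the modulo picks out the low-order bits.

```python
def slot_of(ptr, table_size):
    """Low-order bits of a wide pointer select the GHB entry
    (table_size is assumed to be a power of two)."""
    return ptr % table_size


def is_stale(head_ptr, ref_ptr, table_size, ptr_bits):
    """True if the entry that ref_ptr referred to has been overwritten.

    Both pointers are ptr_bits wide and wrap around, so the age is
    computed modulo 2**ptr_bits.
    """
    age = (head_ptr - ref_ptr) % (1 << ptr_bits)
    return age > table_size


# 8-entry buffer, 5-bit pointers (3 index bits + 2 extra bits).
print(is_stale(head_ptr=10, ref_ptr=3, table_size=8, ptr_bits=5))
print(is_stale(head_ptr=12, ref_ptr=3, table_size=8, ptr_bits=5))
```

The extra bits only extend the window in which staleness is detectable: a pointer older than 2**ptr_bits insertions would alias with a recent one, so the width is a trade-off between storage and how long an old pointer may linger.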
7. GHB evaluation
  1. GHB benefits:
     – FIFO:
        – first-in, first-out buffer
        – naturally gives table space to the most recent history
     – Separation of IT and GHB:
        – IT (Index Table):
           – holds the working set of prefetching lists (one pointer per key)
           – relatively small
        – GHB:
           – larger than the IT
           – sized to hold the miss address stream
     – Benefit of this design:
        – enables more sophisticated prefetching methods (shown later)
  2. GHB drawback:
     – multiple table accesses are needed to collect prefetching information (internal linked-list traversal)
7. GHB evaluation (cont)
  3. Types of GHB prefetching:
     Width prefetching:
        – prefetch only the immediately adjacent nodes
        – e.g. in Fig. 5
     Depth prefetching:
        – begin with the current miss
        – follow the sequence of most likely nodes along its path
        – prefetch at each node
        – e.g. in Fig. 5
     Hybrid:
        – mix of width and depth prefetching
  4. New prefetching technique: Global / Delta Correlation
     - What: prefetching with a non-constant step
     - Example: Table 1
     - Pattern: {0, 1, 1, 62, 1, 1, …}, accesses to the first 3 elements of a 2-dimensional array
     - Constant stride: assumes {1, 1, 1, 1, …} and so prefetches incorrect addresses
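One way to predict such a non-constant pattern is to correlate on pairs of deltas. The sketch below is an assumption-laden illustration, not the paper's hardware algorithm: the function name is hypothetical, the miss history is passed in as a plain list (oldest first), and the example address stream is chosen only to produce a repeating 1, 1, 62 delta pattern.

```python
def delta_correlation_prefetch(miss_addrs, degree):
    """Predict upcoming miss addresses from a history of miss addresses.

    Finds the most recent earlier occurrence of the latest delta pair and
    replays the deltas that followed it. May return fewer than `degree`
    addresses if too little history follows the match.
    """
    deltas = [b - a for a, b in zip(miss_addrs, miss_addrs[1:])]
    if len(deltas) < 3:
        return []
    key = (deltas[-2], deltas[-1])           # the current delta pair
    for i in range(len(deltas) - 3, -1, -1): # search backwards, newest first
        if (deltas[i], deltas[i + 1]) == key:
            predictions, addr = [], miss_addrs[-1]
            for d in deltas[i + 2:i + 2 + degree]:
                addr += d                    # replay the deltas after the match
                predictions.append(addr)
            return predictions
    return []                                # pair never seen before


history = [0, 1, 2, 64, 65, 66, 128, 129]    # deltas: 1, 1, 62, 1, 1, 62, 1
print(delta_correlation_prefetch(history, degree=3))
```

On this stream a constant stride of 1 would request 130, 131, 132, of which only the first is useful, while replaying the deltas after the matched pair follows the jump to the next row.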
Non-const address stream
  4. New prefetching technique (cont)
     Using GHB:
        – take the sequence of the load's miss addresses
        – detect variable stride steps
        – use delta pairs (Table 1) to predict

8. Simulation and testing
  1. Simulator and its configuration:
     1. Configuration: Table 4
     2. SimpleScalar; other details:
        – each access to the IT: 1 cycle
        – each access to the GHB: 1 cycle
        – degree of prefetching: 4
  2. Benchmarks under an ideal L2 cache: Table 2 + Table 3
  3. GHB training set
     – some benchmarks are used to decide the optimal table sizes for:
        – IT
        – GHB
     – table size results: Table 6
4. GHB Testing: Global / Delta Correlation
  5. GHB Testing: PC / Local Prefetching
     Compares GHB PC / CS and GHB PC / DC with table-based PC / CS
Conclusion
Global History Buffer based prefetching:
  - 2-level table hierarchy:
     - IT: Index Table
     - GHB: Global History Buffer
  - Performance improvements:
     - generally performs as well as or better than table-based prefetching on 14 of the 15 tested benchmarks
     - increases IPC
     - reduces memory traffic
  - Advantages:
     - reduces stale data
     - increases prediction accuracy
     - enables further prediction opportunities: variable-step striding
  - Disadvantage:
     - multiple table accesses are needed to build the history information
     - but the extra delay is relatively small and tolerable