
Data Cache Prefetching Using a Global History Buffer
Presented by: Chuck (Chengyan) Zhao, Mar 30, 2004
Written by: Kyle J. Nesbit and James E. Smith, Department of Electrical and Computer Engineering, University of Wisconsin-Madison

1. Introduction
The CPU-memory cache hierarchy:
– CPU registers: very few, fastest
– L1 cache: usually 8 KB; larger than the register file, slower than registers
– L2 cache: usually 256/512 KB; larger and slower than L1
– L3 cache (optional): usually 1/2 MB; larger and slower than L2
– Main memory: usually 256/512 MB or more; larger than L3, slowest

Each step down the cache hierarchy costs roughly 10x the latency of the level above it.
Problems with the cache hierarchy architecture:
– limited capacity (size)
– limited associativity
Solution to these problems: effective prefetching.

2. Prefetching techniques
1. Sequential prefetching
– What: prefetch the cache lines that immediately follow the line that missed
– Algorithm:
  – early schemes: prefetch after every cache miss
  – mature schemes: issue prefetches only after a sequential access pattern has been established
– Degree of prefetching: the maximum number of cache lines prefetched in response to a single prefetch request, chosen high enough to completely hide the latency of a miss to main memory (a minimal sketch follows below)
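To make the mechanism concrete, here is a minimal C sketch of degree-d sequential prefetching triggered on a demand miss. The issue_prefetch() hook, the 64-byte line size, and the trigger-on-every-miss policy are illustrative assumptions, not the paper's implementation.

/* Minimal sketch of sequential (next-line) prefetching. */
#define LINE_SIZE 64   /* assumed cache line size in bytes */
#define DEGREE     4   /* lines prefetched per trigger     */

void issue_prefetch(unsigned long addr);  /* assumed memory-system hook */

/* Called on a demand miss to miss_addr: fetch the next DEGREE lines. */
void sequential_prefetch(unsigned long miss_addr)
{
    unsigned long line = miss_addr & ~(unsigned long)(LINE_SIZE - 1);
    for (int d = 1; d <= DEGREE; d++)
        issue_prefetch(line + (unsigned long)d * LINE_SIZE);
}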

2. Table-based prefetching
– What: record history information about data accesses in a table
– Operation:
  – the table is accessed with a key (the program counter of the load instruction, or the miss address)
  – the recorded history is used to predict what to prefetch
– Evaluation:
  – Pro: simple
  – Con: inefficient
    – a fixed amount of history is kept per prefetching key
    – entries go stale: data can sit in an entry for a very long time, and by the time it is used, the program's memory access behavior has changed

3. Global History Buffer (GHB) prefetching
1. Organization: Fig. 1(b)
2. Features:
– FIFO table: cache misses enter at the bottom and age toward the top
– separate index table (IT) and GHB
– fixed table size
– circular table: when the buffer overflows, the oldest entries are overwritten

– Benefits of the GHB:
  – less stale data
  – more accurate reconstruction of historical access patterns
  – enables more effective prefetching algorithms

4. Table-based prefetching techniques
1. Stride prefetching: Fig. 2
– on a miss to address a with detected stride s, the addresses a + s, a + 2s, ..., a + d*s are fetched, where d is the degree of prefetching; note that the stride s is a constant here (a sketch follows below)
2. Correlation prefetching (Markov prefetching):
– a history table records cache misses
– the miss address indexes the correlation table
– each entry holds a list of the addresses that have immediately followed that miss address, most recent miss first
– Markov graph: each node is a cache miss address; each edge carries the probability that the source will be immediately followed by the target
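A hedged C sketch of the stride prefetcher just described: one table entry per load PC tracks the last address and stride, and the entry prefetches a + s ... a + d*s once the same stride has been seen twice. The table size, confirmation policy, and issue_prefetch() are assumptions for illustration.

/* Sketch of table-based stride prefetching (Fig. 2). */
#define TABLE_SIZE 256
#define DEGREE       4

struct stride_entry {
    unsigned long pc;         /* load PC that owns this entry  */
    unsigned long last_addr;  /* last address seen for this PC */
    long          stride;     /* last observed stride          */
    int           confirmed;  /* same stride seen twice in a row */
};

static struct stride_entry table[TABLE_SIZE];

void issue_prefetch(unsigned long addr);  /* assumed hook */

void stride_prefetch(unsigned long pc, unsigned long addr)
{
    struct stride_entry *e = &table[pc % TABLE_SIZE];
    long s = (long)(addr - e->last_addr);

    e->confirmed = (e->pc == pc && s == e->stride);

    if (e->confirmed && s != 0)      /* fetch a+s, a+2s, ..., a+d*s */
        for (int d = 1; d <= DEGREE; d++)
            issue_prefetch(addr + (unsigned long)d * (unsigned long)s);

    e->pc = pc;
    e->last_addr = addr;
    e->stride = s;
}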

3. Distance prefetching: Fig. 4
– a generalized form of correlation prefetching
– uses the distance (delta) between two consecutive global miss addresses to index the correlation table

Problems with table-based prefetching:
– table data becomes stale: it is neither used nor refreshed
– table entry conflicts: multiple access keys map to the same table entry
– a fixed, small amount of history per entry: Fig. 3 shows only two pieces of history per entry (see the sketch after this slide)

5. Global History Buffer (GHB) based prefetching
1. Table structure: Fig. 1(b)
– IT (index table):
  – accessed by a key, as in traditional table-based prefetching
  – the key is the program counter, the cache miss address, or a combination of the two
  – entries hold pointers into the GHB
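For contrast with the GHB, here is a hedged C sketch of the conventional correlation (Markov) table criticized above: each entry, indexed by miss address, keeps a fixed two-deep successor history, most recent first. The table size, conflict handling, and issue_prefetch() are illustrative assumptions.

/* Sketch of a fixed-history correlation (Markov) prefetch table. */
#define CORR_SIZE 1024

struct corr_entry {
    unsigned long miss_addr;  /* tag: the miss address              */
    unsigned long succ[2];    /* addresses that followed, MRU first */
};

static struct corr_entry corr[CORR_SIZE];
static unsigned long last_miss;  /* previous global miss address */

void issue_prefetch(unsigned long addr);  /* assumed hook */

void correlation_prefetch(unsigned long miss_addr)
{
    /* Record: miss_addr immediately followed last_miss. */
    struct corr_entry *e = &corr[last_miss % CORR_SIZE];
    if (e->miss_addr == last_miss) {
        e->succ[1] = e->succ[0];   /* age the older successor */
        e->succ[0] = miss_addr;
    } else {
        e->miss_addr = last_miss;  /* (re)allocate on a tag conflict */
        e->succ[0] = miss_addr;
        e->succ[1] = 0;
    }
    last_miss = miss_addr;

    /* Predict: prefetch the recorded successors of this miss. */
    struct corr_entry *p = &corr[miss_addr % CORR_SIZE];
    if (p->miss_addr == miss_addr) {
        if (p->succ[0]) issue_prefetch(p->succ[0]);
        if (p->succ[1]) issue_prefetch(p->succ[1]);
    }
}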

GHB (cont.)
– GHB: an n-entry circular FIFO holding the n most recent misses
– each entry contains:
  – the global miss address
  – a link pointer that chains GHB entries sharing the same index-table key into an address list

Notation used later: a prefetching method is written X / Y
– X, the index key:
  – PC: program-counter-based indexing
  – G: global miss address
– Y, the prediction mechanism:
  – CS: constant stride
  – DC: delta correlation
  – AC: address correlation
– different combinations of X and Y yield different prefetching methods (a sketch of the structure follows below)
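The two-level organization can be sketched in C as follows: an index table (IT) of list heads plus a circular FIFO of global misses chained by link pointers. Sizes, field names, and the use of unwrapped (monotonically increasing) indices as pointers are illustrative assumptions, not the paper's exact layout.

/* Sketch of the two-level GHB organization (Fig. 1(b)). */
#define IT_SIZE  256
#define GHB_SIZE 512

struct ghb_entry {
    unsigned long miss_addr;  /* global miss address                      */
    int prev;  /* unwrapped index of previous entry with the same IT key */
};

static struct ghb_entry ghb[GHB_SIZE];
static int it[IT_SIZE];  /* IT: head of each key's linked list */
static int head = 0;     /* unwrapped (monotonic) FIFO head    */

void ghb_init(void)
{
    for (int i = 0; i < IT_SIZE; i++)
        it[i] = -1;  /* -1 marks an empty list */
}

/* On a miss, push a new entry and link it into its key's list. */
void ghb_insert(unsigned long pc, unsigned long miss_addr)
{
    int key = (int)(pc % IT_SIZE);        /* PC-based indexing (PC/...)  */
    ghb[head % GHB_SIZE].miss_addr = miss_addr;
    ghb[head % GHB_SIZE].prev = it[key];  /* chain to prior miss for key */
    it[key] = head;                       /* IT now points at the newest */
    head++;
}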

2. GHB for correlation prefetching
– Fig. 5: traverse the linked list breadth-first (shaded area)
3. GHB for stride prefetching
– PC / CS
– Fig. 5 again, this time traversed depth-first

6. Global History Buffer (GHB) error handling
– How errors occur: when the circular GHB array wraps around and entries are overwritten, pointers into those entries become stale
– Solution: store pointers with extra bits beyond those needed to index the table, so overwrites can be detected
– Check: if (head pointer - referenced pointer) > table size, the pointer is stale and the traversal stops (see the sketch below)
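Continuing the previous sketch, the staleness check might look like this: because pointers are stored as unwrapped indices, any entry more than one table length behind the head must have been overwritten. This is a sketch of the rule stated above, not the paper's exact pointer encoding.

/* Stale-pointer check: reuses head, it[], ghb[] from the sketch above. */
int ghb_pointer_valid(int ptr)
{
    /* A slot is reused once the head advances a full table length past it. */
    return ptr >= 0 && (head - ptr) <= GHB_SIZE;
}

/* Walk the linked list for one key, stopping at stale pointers. */
void ghb_walk(int key)
{
    for (int p = it[key]; ghb_pointer_valid(p); p = ghb[p % GHB_SIZE].prev) {
        unsigned long addr = ghb[p % GHB_SIZE].miss_addr;
        /* ... feed addr to the prefetch algorithm (CS/DC/AC) ... */
        (void)addr;
    }
}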

7. GHB evaluation
1. GHB benefits:
– FIFO organization: a first-in, first-out buffer naturally devotes table space to the most recent history
– Separation of the IT and the GHB:
  – IT (index table): holds the working set of prefetch list heads; relatively small
  – GHB: larger than the IT; sized to hold the miss address stream
  – benefit of this split: it enables more sophisticated prefetching methods (shown later)
2. GHB drawback:
– collecting prefetch information requires multiple accesses (traversal of the internal linked lists)

7. GHB evaluation (cont.)
3. Types of GHB prefetching:
– Width prefetching: prefetch only the immediately adjacent nodes (e.g., in Fig. 5)
– Depth prefetching: begin with the current miss, follow the sequence of most likely nodes along its path, and prefetch at each node (e.g., in Fig. 5)
– Hybrid: a mix of width and depth prefetching
4. New prefetching technique: Global / Delta Correlation
– What: prefetching for non-constant strides
– Example: Table 1
– Pattern: deltas {0, 1, 1, 62, 1, 1, ...}, produced by accessing the first 3 elements of each row of a 2-dimensional array
– A constant-stride prefetcher locks onto the stride of 1 and prefetches down incorrect addresses {1, 1, 1, 1, ...}

(Figure: non-constant address stream)

4. New prefetching technique (cont.)
Using the GHB:
– walk the load's list to recover its sequence of miss addresses
– detect variable stride steps from the resulting delta stream
– use delta pairs (Table 1) to predict the next addresses (a sketch follows below)

8. Simulation and testing
1. Simulator and configuration:
– configuration: Table 4
– SimpleScalar
– other details:
  – each access to the IT: 1 cycle
  – each access to the GHB: 1 cycle
  – degree of prefetching: 4
2. Benchmarks under an ideal L2 cache: Tables 2 and 3
3. GHB training set:
– a subset of benchmarks is used to choose the optimal table sizes for the IT and the GHB
– table size results: Table 6
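A minimal C sketch of the delta-correlation prediction described above, operating on one miss-address history (as gathered by walking a GHB list): compute the delta stream, find the most recent earlier occurrence of the last delta pair, and replay the deltas that followed it. The history length, degree, and issue_prefetch() are illustrative assumptions.

/* Delta correlation (DC) over one load's miss-address history. */
#define HIST   16   /* most recent miss addresses, oldest first */
#define DEGREE  4

void issue_prefetch(unsigned long addr);  /* assumed hook */

void delta_correlation(const unsigned long hist[HIST], int n)
{
    if (n < 4) return;                 /* need at least 3 deltas */

    long delta[HIST];
    for (int i = 1; i < n; i++)        /* address stream -> delta stream */
        delta[i - 1] = (long)(hist[i] - hist[i - 1]);
    int nd = n - 1;

    long d1 = delta[nd - 2], d2 = delta[nd - 1];   /* last delta pair */

    /* Find the most recent earlier occurrence of the pair (d1, d2). */
    for (int i = nd - 3; i >= 1; i--) {
        if (delta[i - 1] == d1 && delta[i] == d2) {
            /* Replay the deltas that followed the matching pair. */
            unsigned long addr = hist[n - 1];
            for (int k = i + 1, issued = 0;
                 k < nd && issued < DEGREE; k++, issued++) {
                addr += (unsigned long)delta[k];
                issue_prefetch(addr);
            }
            return;
        }
    }
}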

4. GHB testing: Global / Delta Correlation (results figure)

5. GHB testing: PC / local prefetching, comparing GHB PC/CS and GHB PC/DC against table-based PC/CS (results figure)

Conclusion
Global History Buffer based prefetching:
– Two-level table hierarchy:
  – IT: index table
  – GHB: global history buffer
– Performance improvements:
  – performs as well as or better than conventional table-based prefetching on 14 of the 15 tested benchmarks
  – increases IPC
  – reduces memory traffic
– Advantages:
  – less stale data
  – higher prediction accuracy
  – reduced memory traffic
  – enables new prediction opportunities, such as variable-stride (delta correlation) prefetching
– Disadvantage:
  – building history information requires multiple table accesses, but the extra delay is relatively small and tolerable