CS 7810 Lecture 9: "Effective Hardware-Based Data Prefetching for High-Performance Processors", T-F. Chen and J-L. Baer, IEEE Transactions on Computers, 44(5), May 1995
Memory Hierarchy Bottlenecks
- Caching strategies: victim cache, replacement policies, temporal/spatial caches
- Prefetching: stream buffers, strided predictors, pointer-chasing
- Memory dependences: store barrier cache, store sets
- Latency tolerance: out-of-order execution, runahead
Data Access Patterns
- Scalar: simple variable references
- Zero stride: constant array index throughout a loop
- Constant stride: index is a linear function of the loop count
- Irregular: none of the above (non-linear index, pointer chasing, etc.)
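The stride taxonomy above can be checked mechanically from an address stream. A minimal Python sketch (function name and addresses are illustrative, not from the paper) classifies a stream by its successive differences:

```python
def classify(addrs):
    """Classify an address stream from its successive differences.
    Note: scalar and zero-stride references both produce stride 0;
    telling them apart needs the instruction, not just the stream."""
    strides = {b - a for a, b in zip(addrs, addrs[1:])}
    if strides == {0}:
        return "zero stride"
    if len(strides) == 1:
        return "constant stride"
    return "irregular"

# A[3] touched every iteration (8-byte elements): zero stride
print(classify([0x2000 + 8 * 3] * 4))                 # zero stride
# A[i] with i = 0..3: constant stride of 8
print(classify([0x3000 + 8 * i for i in range(4)]))   # constant stride
```

A hardware stride predictor does the same comparison incrementally, one access at a time, which is what the reference prediction table below implements.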
Prefetch Overheads
- A regular access can be delayed
- Increased contention on buses and L1/L2/memory ports
- Before initiating the prefetch, the L1 tags have to be examined
- Cache pollution and more misses
- Software prefetching increases instruction count
Software Prefetching
- Pros: reduces hardware overhead, can avoid the first miss (software pipelining), can handle complex address equations
- Cons: code bloat, can only handle addresses that are easy to compute, control flow is a problem, unpredictable latencies
Basic Reference Prediction
- For each PC, detect and store a stride and the last fetched address
- Every fetch initiates the next prefetch
- If the stride changes, remain in transient states until a regular stride is observed
- Prefetches are issued in every state except no-pred
Basic Reference Prediction (hardware)
- Each RPT entry holds: tag (PC), prev_addr, stride, state
- States: init, transient (trans), steady, no-pred
- Transitions: init -> steady on correct, -> transient on incorrect (update stride); transient -> steady on correct, -> no-pred on incorrect (update stride); steady -> steady on correct, -> init on incorrect; no-pred -> transient on correct, stays in no-pred on incorrect (update stride)
- The prefetch address is checked against the L1 tags and placed in the Outstanding Request List
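The entry format and state machine above can be sketched in a few lines of Python. This is a simplified model for illustration only: table sizing, eviction, the L1 tag check, and the Outstanding Request List are all omitted.

```python
class RPT:
    """Sketch of the Chen & Baer reference prediction table.
    Each PC maps to [prev_addr, stride, state]."""

    def __init__(self):
        self.table = {}

    def access(self, pc, addr):
        """Update the entry for pc; return a prefetch address or None."""
        if pc not in self.table:
            self.table[pc] = [addr, 0, "init"]
            return None
        entry = self.table[pc]
        prev, stride, state = entry
        correct = (addr == prev + stride)
        if state == "init":
            if correct:
                state = "steady"
            else:
                state, stride = "transient", addr - prev   # update stride
        elif state == "transient":
            if correct:
                state = "steady"
            else:
                state, stride = "no-pred", addr - prev     # update stride
        elif state == "steady":
            state = "steady" if correct else "init"        # stride kept
        else:  # no-pred
            if correct:
                state = "transient"
            else:
                stride = addr - prev                       # update stride
        entry[:] = [addr, stride, state]
        # prefetch one iteration ahead unless confidence is lost
        return addr + stride if state != "no-pred" else None
```

For a stream of A[i] accesses (stride 8), the second access trains the stride, the third reaches steady, and every access from the second onward issues a prefetch one iteration ahead.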
Shortcomings
- Basic technique only prefetches one iteration ahead -> lookahead predictor
- Mispredictions at the end of every inner loop -> correlated prediction
Lookahead Prediction
- Note: these are in-order processors
- Fetch stalls when instructions stall, but the PC can continue to increment
- The lookahead PC (LA-PC) accesses the branch predictor and BTB to make forward progress
- The LA-PC indexes into the RPT and can run many instructions ahead of the real PC
- Note the additional bpred and BTB ports required
Lookahead Prediction (hardware)
- [Figure: the decoupled LA-PC, driven by the branch predictor, looks up the RPT; the real PC updates it; prefetches go to the ORL]
- A new field in each RPT entry keeps track of how many steps ahead the LA-PC is
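A sketch of the idea, assuming the LA-PC runs d iterations ahead of the real PC: the prefetch is issued d strides ahead, and d is chosen so the prefetch completes just before the demand access. The function names and cycle counts below are illustrative, not from the paper.

```python
import math

def lookahead_distance(miss_latency_cycles, cycles_per_iteration):
    """How many iterations ahead the LA-PC should run so that a
    prefetch issued now returns just before the demand access
    (the hardware bounds this with a limit on the LA-PC's lead)."""
    return math.ceil(miss_latency_cycles / cycles_per_iteration)

def lookahead_prefetch(addr, stride, d):
    """Prefetch d strides ahead instead of one."""
    return addr + d * stride

# e.g. a 100-cycle miss latency and 12-cycle loop body:
d = lookahead_distance(100, 12)          # run 9 iterations ahead
print(lookahead_prefetch(0x1000, 64, d)) # prefetch 9 cache lines ahead
```

This matches the later observation that the lookahead distance should be a little more than the memory latency: any shorter and the demand access still stalls; much longer and the prefetched line risks eviction before use.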
Correlated Reference Prediction
for i = 1 to N
  for k = 0 to i
    W(i) = W(i) + B(i,k) * W(i-k)
Access pattern for B: (1,0) (1,1) (2,0) (2,1) (2,2) (3,0) (3,1) (3,2) (3,3) (4,0) (4,1) (4,2) (4,3) (4,4)
Access pattern for W: (1) (1) (2) (2) (2) (3) (3) (3) (3) (4) (4) (4) (4) (4)
Correlated Reference Prediction
- Inner-loop predictions work well, but the first inner-loop prediction always fails
- There is a correlation between the branch outcomes and the reference patterns
for i = 1 to N
  for k = 0 to i
    W(i) = W(i) + B(i,k) * W(i-k)
Access pattern for B with inner-loop branch history: (1,0) 1, (1,1) 101, (2,0) 1011, (2,1) ...
Implementation
- Each PC keeps track of multiple access patterns (prev_addr and stride)
- The branch history determines which pattern is relevant ("01" refers to an outer-loop access)
- Other branches in the loop can corrupt the history -- use the compiler to mark loop-termination branches?
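A simplified sketch of the correlated table: the entry is keyed on (PC, recent branch history), so the outer-loop access trains a separate stride from the inner-loop accesses. The per-pattern confidence state machine is omitted and all names are illustrative.

```python
class CorrelatedRPT:
    """Stride prediction keyed on (PC, branch history) so one PC can
    track multiple access patterns (e.g. inner- vs outer-loop strides)."""

    def __init__(self, hist_bits=2):
        self.table = {}                     # (pc, history) -> [prev_addr, stride]
        self.history = 0
        self.mask = (1 << hist_bits) - 1

    def branch(self, taken):
        """Shift the latest branch outcome into the history register."""
        self.history = ((self.history << 1) | int(taken)) & self.mask

    def access(self, pc, addr):
        """Train the entry for this (pc, history); return prefetch addr or None."""
        key = (pc, self.history)
        if key not in self.table:
            self.table[key] = [addr, 0]
            return None
        prev, stride = self.table[key]
        if addr - prev != stride:
            stride = addr - prev            # retrain this pattern's stride
        self.table[key] = [addr, stride]
        return addr + stride
```

With the triangular loop above, the inner-loop accesses to B (history "…11") and the first access after the outer-loop branch (history "…01") fall into different entries, so the inner-loop stride no longer mispredicts the outer-loop step.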
Benchmark Characteristics
Results
Results Summary
- Lookahead is the most cost-effective technique
- The RPT needs 512 entries (4KB capacity)
- The lookahead distance should be a little more than the memory latency
Modern Processors
- Pentium 4 I-cache: uses the bpred and BTB to stay ahead of current execution
- Pentium 4 L2 cache: attempts to stay 256 bytes ahead; monitors history for multiple streams; minimizes fetch of unwanted data (there was no h/w prefetch in the Pentium III)
- Alpha: 16-entry victim caches in L1D and L2D, and a 16-entry L1I stream buffer
- UltraSPARC III: 2KB prefetch cache
Next Week's Paper
"Runahead Execution: An Alternative to Very Large Instruction Windows for Out-of-Order Processors", O. Mutlu, J. Stark, C. Wilkerson, Y.N. Patt, Proceedings of HPCA-9, February 2003
- Useful execution while waiting for a cache miss -- perhaps prefetch the next miss?