Project Guidelines Prof. Eric Rotenberg.

ECE 721, Spring 2019 Prof. Eric Rotenberg.

Required “what works” table in final report
- See report format.

Why run with perfect branch prediction
- Branch mispredictions can be a major bottleneck that hides the speedup of your technique.
- Get your simulator working in perfect branch prediction mode.
- Run in perfect branch prediction mode to highlight the performance gains that are possible with a realistic implementation of your technique.

Why run with real branch prediction
- Much of microarchitecture design concerns not only the proposed technique itself, but also how to make it work correctly, and with the intended performance, in the presence of branch mispredictions.
- Recovery from mispredicted branches is half the battle for any microarchitecture technique.
- Test your skills as a microarchitect and simulator developer.

Performance debug
- Why am I getting negligible or no speedup? Why am I getting a (big) slowdown?
- Write and compile your own microbenchmarks
  - Create microbenchmarks which you know are ideal for the proposed technique.
  - Example: Create a large array and initialize it with strided values using a first “for” loop. Follow this with a second “for” loop that sums all elements of the array into a reduction variable “sum”. Print “sum” after the second loop (if there is no output, the compiler will remove all of the code as dead code). Compile with -O3 using the RISCV compiler and examine the assembly with RISCV objdump. A correctly implemented stride predictor will get very high accuracy on this microbenchmark.
- Eliminate all other performance bottlenecks that are orthogonal to the proposed technique and might otherwise hide its speedup potential. Also, stress the performance bottlenecks that your technique is specifically targeting.
  - Perfect branch prediction
    - To identify or eliminate the problem as having to do with the interplay between branch speculation/recovery and the proposed technique
    - To remove the branch misprediction bottleneck
    - Example: incorrect recovery of value predictor context; not using the latest value predictor context to avoid recovery in the presence of branch mispredictions
  - Oracle memory disambiguation
  - Make sure all structure sizes and pipeline stage widths are set appropriately to highlight the strengths of the proposed technique.
    - Example: value prediction on top of a 2048-entry Active List/PRF is probably of little use, because a huge window exposes more ILP even without value prediction. Value prediction on a 2-issue machine is probably of little use, because peak IPC is low in any case. Instead, stress the importance of value prediction for smaller windows and wider issue.
    - Example: CLEAR, CFP, and any other large-window microarchitecture need a large SQ/LQ in the LSU and a large “cti” queue in the branch prediction unit to buffer pseudo-retired stores, loads, and branches, respectively, until bulk-commit. Also a large “cti” queue in the branch prediction class.
  - In some cases it may make sense to study performance with perfect caches as well as real caches, for reasons similar to the branch misprediction bottleneck.
- Run with perfect versions of your technique (e.g., perfect value prediction, real value prediction + oracle confidence, etc.) to diagnose problems with the implementation
  - Example: diagnose that your value predictor is making correct predictions but that the speculation machinery is not actually breaking data dependencies
  - Example: “We only inject correct predictions (real vp + oracle conf) and yet we are seeing a slowdown – how can this be?”
- Look at key measurements from the simulator, such as cache misses and branch mispredictions
- Add key measurements to understand the performance of your technique
  - Example: for value prediction, break down 100% of eligible predictions as: correct+confident, incorrect+confident (misprediction), correct+not_confident (lost coverage), incorrect+not_confident
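
The four-way breakdown of eligible value predictions can be kept as a small set of counters. The following is only a sketch of one way to do it: `VPStats`, `VPOutcome`, `record()`, and the other names are hypothetical and are not part of the course simulator. It assumes `record()` is called once per eligible instruction, at the point where the actual value is known (for example, at retirement).

```cpp
#include <array>
#include <cstdint>
#include <cstdio>

// Hypothetical names: this is an illustrative sketch, not simulator code.
enum VPOutcome {
    CORRECT_CONFIDENT,        // prediction used and correct
    INCORRECT_CONFIDENT,      // prediction used and wrong (misprediction)
    CORRECT_NOT_CONFIDENT,    // correct but not used (lost coverage)
    INCORRECT_NOT_CONFIDENT,  // wrong and not used
    NUM_VP_OUTCOMES
};

struct VPStats {
    std::array<uint64_t, NUM_VP_OUTCOMES> count{};

    // Classify one eligible instruction into exactly one of the four bins.
    void record(bool correct, bool confident) {
        VPOutcome o;
        if (confident)
            o = correct ? CORRECT_CONFIDENT : INCORRECT_CONFIDENT;
        else
            o = correct ? CORRECT_NOT_CONFIDENT : INCORRECT_NOT_CONFIDENT;
        ++count[o];
    }

    // Total number of eligible instructions seen so far.
    uint64_t total() const {
        uint64_t t = 0;
        for (uint64_t c : count) t += c;
        return t;
    }

    // Share of one bin as a percentage; the four shares sum to 100%.
    double percent(VPOutcome o) const {
        uint64_t t = total();
        return t ? 100.0 * static_cast<double>(count[o]) / static_cast<double>(t) : 0.0;
    }

    // Print the breakdown, one bin per line.
    void print() const {
        static const char *label[NUM_VP_OUTCOMES] = {
            "correct+confident", "incorrect+confident (misprediction)",
            "correct+not_confident (lost coverage)", "incorrect+not_confident"};
        for (int i = 0; i < NUM_VP_OUTCOMES; i++)
            std::printf("%-40s %6.2f%%\n", label[i], percent(static_cast<VPOutcome>(i)));
    }
};
```

Because every eligible instruction lands in exactly one bin, a large correct+not_confident share points at the confidence mechanism, while a large incorrect+confident share points at the predictor or its context recovery.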

Gotchas
- Trace cache & trace processor
  - Miss handling
  - Within-trace branch misprediction handling
  - Real multiple branch prediction
  - Realize that a trace processor will have lower IPC than an equally-provisioned monolithic superscalar. Reasons: load balance; global bypass latency and arbitration; discrete window shifting; and trace cache misses and mispredictions, which delay instruction supply more in a trace processor because trace repair latency is exposed.
- VP
  - Recommendation: update all contexts and table(s) at retirement, and infer speculative context using table(s) + in-flight instruction queue and/or per-PC iteration counts
  - Examples: stride predictor, value-context-based predictor, VTAGE
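
For the stride predictor case, the VP recommendation above might be sketched as follows. This is a simplified illustration under stated assumptions, not the course simulator's implementation: the class and method names are hypothetical, the table is an unbounded map instead of a finite tagged structure, there is no confidence mechanism, and instances of a given PC are assumed to retire in program order. The table holds only retired state, and the speculative context is inferred from a per-PC count of in-flight instances.

```cpp
#include <cstdint>
#include <unordered_map>

// Hypothetical sketch: names and structure are illustrative only.
struct StrideEntry {
    int64_t last_value = 0;  // value of the most recently retired instance
    int64_t stride = 0;      // difference between the last two retired values
    unsigned inflight = 0;   // predicted but not yet retired instances of this PC
};

class StridePredictor {
    std::unordered_map<uint64_t, StrideEntry> table_;

public:
    // Called when an instance of `pc` enters the window. The speculative
    // context is inferred as retired value + stride * (in-flight instances,
    // including this one), so the table itself is never speculatively updated.
    int64_t predict(uint64_t pc) {
        StrideEntry &e = table_[pc];
        e.inflight++;
        return e.last_value + e.stride * static_cast<int64_t>(e.inflight);
    }

    // Called at retirement with the actual value: the only place where
    // last_value and stride change.
    void retire(uint64_t pc, int64_t value) {
        StrideEntry &e = table_[pc];
        e.stride = value - e.last_value;
        e.last_value = value;
        if (e.inflight > 0) e.inflight--;
    }

    // Called once for each squashed in-flight instance of `pc`. Recovery
    // only adjusts the count; no table contents need to be restored.
    void squash(uint64_t pc) {
        auto it = table_.find(pc);
        if (it != table_.end() && it->second.inflight > 0) it->second.inflight--;
    }
};
```

The design point this illustrates is that branch misprediction recovery reduces to decrementing a counter per squashed instance, because nothing speculative was ever written into the table.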

Resources
- Coming soon
  - More benchmark checkpoints
  - Benchmark executables and input files
  - Benchmark source code
- RISCV tools
    [hostname] % add riscv
    RISC-V Toolchain for gcc 4.9.2
    ------------------------------
    riscv64-unknown-elf-gcc
    riscv64-unknown-elf-g++
    riscv64-unknown-elf-objdump
    For more cmds, see /afs/eos/dist/riscv/bin
    …riscv64-unknown-elf-gcc -O3 -o microbench microbench.cc …
    …riscv64-unknown-elf-objdump -d microbench | less
- Characterization flow (example: top mispredicted branches)
  1. Profile branch PCs using the simulator.
  2. Sort branch PCs by number of mispredictions using unix “sort”.
  3. Look up the branch PCs in the assembly to find loops, functions, etc.
  4. Study the source code.