Project Guidelines
Prof. Eric Rotenberg
Required “what works” table in final report
- See the report format.
Why run with perfect branch prediction
- Branch mispredictions can be a major bottleneck that hides the speedup of your technique.
- Get your simulator working in perfect branch prediction mode.
- Run in perfect branch prediction mode to highlight the performance gains that are possible with a realistic implementation of your technique.
Why run with real branch prediction
- Much of microarchitecture design has to do not only with the proposed technique itself, but also with making it work correctly, and with the intended performance, in the presence of branch mispredictions.
- Recovery from mispredicted branches is half the battle for any microarchitecture technique.
- Test your skills as a microarchitect and simulator developer.
Performance debug
- Why am I getting negligible or no speedup? Why am I getting a (big) slowdown?
- Write and compile your own microbenchmarks.
  - Create microbenchmarks which you know are ideal for the proposed technique.
  - Example: create a large array and initialize it with strided values using a first "for" loop. Follow this with a second "for" loop that sums all elements of the array into a reduction variable "sum". Print "sum" after the second loop (if there is no output, the compiler will remove all of the code as dead code). Compile with -O3 using the RISC-V compiler and examine the assembly with RISC-V objdump. A correctly implemented stride predictor will achieve very high accuracy on this microbenchmark. (See the microbenchmark sketch after this list.)
- Eliminate all other performance bottlenecks that are orthogonal to the proposed technique and might otherwise hide its speedup potential. Also, stress the performance bottlenecks that your technique specifically targets.
  - Perfect branch prediction
    - To identify or eliminate the problem as having to do with the interplay between branch speculation/recovery and the proposed technique.
    - To remove the branch misprediction bottleneck.
    - Example: incorrect recovery of value predictor context; not using the latest value predictor context to avoid recovery in the presence of branch mispredictions.
  - Oracle memory disambiguation
  - Make sure all structure sizes and pipeline stage widths are set appropriately to highlight the strengths of the proposed technique.
    - Example: value prediction on top of a 2048-entry Active List/PRF is probably of little use, because the huge window exposes more ILP even without value prediction. Value prediction on a 2-issue machine is probably of little use, because peak IPC is low in any case. Instead, stress the importance of value prediction for smaller windows and wider issue.
    - Example: CLEAR, CFP, and any other large-window microarchitecture need a large SQ/LQ in the LSU and a large "cti" queue in the branch prediction unit (the branch prediction class in the simulator) to buffer pseudo-retired stores, loads, and branches, respectively, until bulk-commit.
  - In some cases it may make sense to study performance with perfect caches versus real caches, for issues similar to the branch misprediction bottleneck.
- Run with perfect versions of your technique (e.g., perfect value prediction, real value prediction + oracle confidence, etc.) to diagnose problems with the implementation.
  - Example: diagnose that your value predictor is making correct predictions but that the speculation machinery is not actually breaking data dependences.
  - Example: "We only inject correct predictions (real VP + oracle confidence), and yet we are seeing a slowdown. How can this be?"
- Look at key measurements from the simulator, such as cache misses and branch mispredictions.
- Add key measurements to understand the performance of your technique.
  - Example: for value prediction, break down 100% of eligible predictions as: correct+confident, incorrect+confident (misprediction), correct+not_confident (lost coverage), incorrect+not_confident. (See the bookkeeping sketch after this list.)
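A minimal sketch of the microbenchmark described above (microbench.cc). The array size and stride value are arbitrary choices for illustration, not prescribed by these guidelines:

    // microbench.cc -- stride microbenchmark sketch.
    // Build:   riscv64-unknown-elf-gcc -O3 -o microbench microbench.cc
    // Inspect: riscv64-unknown-elf-objdump -d microbench | less
    #include <cstdio>

    #define N 100000   // array size (arbitrary)
    #define STRIDE 7   // stride between consecutive values (arbitrary)

    // Global (externally visible) array so -O3 cannot eliminate the stores.
    long a[N];

    int main() {
        // First "for" loop: initialize the array with strided values.
        // Each value differs from the previous one by a fixed stride,
        // which a correctly implemented stride predictor should capture.
        for (int i = 0; i < N; i++) {
            a[i] = (long)STRIDE * i;
        }

        // Second "for" loop: sum all elements into a reduction variable "sum".
        long sum = 0;
        for (int i = 0; i < N; i++) {
            sum += a[i];
        }

        // Print "sum" so the compiler cannot remove the loops as dead code.
        printf("sum = %ld\n", sum);
        return 0;
    }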
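To make the last measurement concrete, here is a minimal bookkeeping sketch for the four-way breakdown of eligible predictions. The struct and member names are illustrative, not taken from any provided simulator code:

    // Four-way breakdown of eligible value predictions (names are illustrative).
    #include <cstdio>
    #include <cstdint>

    struct VPBreakdown {
        uint64_t correct_conf = 0, incorrect_conf = 0, correct_noconf = 0, incorrect_noconf = 0;

        // Call once per eligible prediction, at retirement, when the outcome is known.
        void record(bool correct, bool confident) {
            if (confident) (correct ? correct_conf : incorrect_conf)++;
            else           (correct ? correct_noconf : incorrect_noconf)++;
        }

        void print() const {
            uint64_t total = correct_conf + incorrect_conf + correct_noconf + incorrect_noconf;
            if (total == 0) return;
            printf("correct+confident       (useful)        : %5.1f%%\n", 100.0 * correct_conf / total);
            printf("incorrect+confident     (misprediction) : %5.1f%%\n", 100.0 * incorrect_conf / total);
            printf("correct+not_confident   (lost coverage) : %5.1f%%\n", 100.0 * correct_noconf / total);
            printf("incorrect+not_confident                 : %5.1f%%\n", 100.0 * incorrect_noconf / total);
        }
    };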
Gotchas
- Trace cache & trace processor
  - VP miss handling
  - Within-trace branch misprediction handling
  - Real multiple-branch prediction
  - Realize that a trace processor will have lower IPC than an equally-provisioned monolithic superscalar (load balance, global bypass latency and arbitration, discrete window shifting; trace cache misses and branch mispredictions delay instruction supply more in a trace processor because the trace repair latency is exposed).
- VP
  - Recommendation: update all contexts and table(s) at retirement, and infer the speculative context using the table(s) + an in-flight instruction queue and/or per-PC iteration counts. (A minimal sketch follows this list.)
  - Examples: stride predictor, value-context-based predictor, VTAGE
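A minimal sketch of the VP recommendation above, using a stride predictor: the table is written only at retirement, and the speculative context is inferred from the retired state plus a per-PC count of in-flight instances. The table organization, indexing by full PC, and all names are assumptions for illustration, not a prescribed implementation:

    // Stride predictor sketch: retirement-only table updates; speculative
    // context inferred via per-PC in-flight counts, so nothing in the table
    // needs to be repaired on a branch misprediction.
    #include <cstdint>
    #include <unordered_map>

    struct StrideEntry {
        uint64_t last_value = 0;  // last retired result for this PC
        int64_t  stride     = 0;  // stride between consecutive retired results
        uint32_t inflight   = 0;  // # of in-flight (not-yet-retired) instances
    };

    class StridePredictor {
    public:
        // At prediction time: the (inflight+1)-th instance beyond the last
        // retired one is predicted as last_value + (inflight+1)*stride.
        uint64_t predict(uint64_t pc) {
            StrideEntry &e = table_[pc];
            uint64_t pred = e.last_value + (int64_t)(e.inflight + 1) * e.stride;
            e.inflight++;
            return pred;
        }

        // On a squash (e.g., branch misprediction recovery): walk the in-flight
        // instruction queue to count squashed instances of each PC, then simply
        // decrement the counts; the table itself was never speculatively written.
        void squash(uint64_t pc, uint32_t squashed_instances) {
            StrideEntry &e = table_[pc];
            e.inflight = (squashed_instances >= e.inflight) ? 0 : (e.inflight - squashed_instances);
        }

        // At retirement: train with the architecturally correct value.
        void retire(uint64_t pc, uint64_t value) {
            StrideEntry &e = table_[pc];
            e.stride     = (int64_t)(value - e.last_value);
            e.last_value = value;
            if (e.inflight > 0) e.inflight--;
        }

    private:
        std::unordered_map<uint64_t, StrideEntry> table_;  // untagged, unbounded (illustration only)
    };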
Resources
- RISC-V tools
  [hostname] % add riscv
  RISC-V Toolchain for gcc 4.9.2
  ------------------------------
  riscv64-unknown-elf-gcc
  riscv64-unknown-elf-g++
  riscv64-unknown-elf-objdump
  For more cmds, see /afs/eos/dist/riscv/bin
  - Compile a microbenchmark: riscv64-unknown-elf-gcc -O3 -o microbench microbench.cc
  - Examine its assembly: riscv64-unknown-elf-objdump -d microbench | less
- Coming soon
  - More benchmark checkpoints
  - Benchmark executables and input files
  - Benchmark source code
- Characterization flow (example: top mispredicted branches)
  - Profile branch PCs using the simulator -> sort branch PCs by number of mispredictions using unix "sort" -> look up branch PCs in the assembly to find loops, functions, etc. -> study the source code