Decoupled Access-Execute: Pioneering Compilation for Energy Efficiency

Previous decades: time is money (performance). Current decade: power is money (energy).
This work improves energy efficiency without damaging performance.

Alexandra Jimborean, Kim-Anh Tran, Konstantinos Koukos, Per Ekemark, Vasileios Spiliopoulos, Magnus Själander, Trevor Carlson, Stefanos Kaxiras
Efficiency vs Performance

Compiler techniques to deliver high performance at low energy cost!

[Figure: hardware spectrum on performance vs. energy-efficiency axes, from in-order (InO) cores (most energy-efficient) through limited OoO to full out-of-order (OoO) cores (highest performance).]
Roadmap (axes: CPU abilities vs. compiler sophistication)

- Software Decoupled Access-Execute (DAE): reduce energy on large, power-hungry OoO cores
- Clairvoyance: increase performance (memory- and instruction-level parallelism) on limited OoO cores
- SWOOP: make in-order (IO) cores run like OoO cores
Roadmap: DAE published at CGO'14 (best presentation award), CC'16 (best paper award), and HiPEAC'16
Traditional HW technique: DVFS (Dynamic Voltage and Frequency Scaling) [DAE on OoO]

- Generate large memory-bound and compute-bound phases and scale frequency (DVFS) accordingly
- Optimize the software to match the hardware's DVFS capabilities
- Automatically tune the code at compile time!
Coupled Execution

[Figure: a coupled run alternates fine-grained memory-bound and compute-bound phases. Running the whole task at the optimal frequency f_opt loses performance in compute-bound phases; running at f_max wastes energy in memory-bound phases.]
Decoupled Execution (DAE)

- Decouple the memory part (Access) and the compute part (Execute) of each task
- Access prefetches the data into the L1 cache, running at low frequency (f_min)
- Execute computes on the data from the L1 cache, running at high frequency (f_max)

[Figure: ideal DVFS (f_min to f_max) for coupled execution vs. decoupled execution with Access at f_min and Execute at f_max.]
How do we implement DAE?

Access phase (prefetches the data):
- Remove arithmetic computation
- Keep (replicate) address calculation
- Turn loads into prefetches
- Remove stores to globals (no side effects)

Execute phase: the original (unmodified) task, scheduled to run on the same core right after Access.

DVFS Access and Execute independently: f_low to save energy, f_high to compute fast.
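As a minimal sketch of the transformation on a hypothetical kernel (function names `coupled`/`decoupled` are mine, not from the paper; `__builtin_prefetch` is the GCC/Clang intrinsic that "turn loads into prefetches" would lower to):

```c
#include <stddef.h>

/* Original coupled loop: loads and arithmetic interleaved,
   so the indirect load may miss in cache on every iteration. */
long coupled(const long *w, const int *idx, size_t n) {
    long sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += w[idx[i]] * 2;
    return sum;
}

/* DAE version of the same loop. */
long decoupled(const long *w, const int *idx, size_t n) {
    /* Access phase: address calculation is replicated, the load is
       turned into a prefetch, and the arithmetic is removed
       (no side effects). This phase can run at f_low. */
    for (size_t i = 0; i < n; i++)
        __builtin_prefetch(&w[idx[i]], 0 /* read */, 3 /* keep in L1 */);

    /* Execute phase: the original, unmodified task, now mostly
       hitting in L1. This phase can run at f_high. */
    long sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += w[idx[i]] * 2;
    return sum;
}
```

Both versions compute the same result; only the memory behavior (and hence the frequency each phase can tolerate) differs.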
Understanding DAE: modeling zero-latency, per-core DVFS

[Figure: energy (Joule) and time (sec) for coupled vs. decoupled execution, sweeping frequency from f_min to f_max; decoupled runs keep Access at f_min while Execute varies from f_min to f_max.]
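A toy model (my own simplification for illustration, not the paper's exact model) shows why decoupling wins. Assume dynamic power scales roughly as f^3 (voltage scales with frequency), compute-bound time scales as 1/f, and memory-bound time is latency-dominated and independent of f:

```c
/* Toy DVFS model. Frequencies are normalized so f_max = 1.0;
   all constants are illustrative, not measured. */

/* Dynamic power ~ C * V^2 * f, with V ~ f, gives power ~ f^3. */
static double power(double f) { return f * f * f; }

/* Coupled: the whole task (memory + compute) runs at one frequency.
   Memory time t_mem is fixed by DRAM latency; compute time scales 1/f. */
static double coupled_energy(double f, double t_mem, double t_comp) {
    return power(f) * (t_mem + t_comp / f);
}

/* Decoupled: Access runs at f_lo (memory-bound, time unchanged),
   Execute runs at f_hi (compute-bound). */
static double decoupled_energy(double f_lo, double f_hi,
                               double t_mem, double t_comp) {
    return power(f_lo) * t_mem + power(f_hi) * (t_comp / f_hi);
}
```

With t_mem = 4, t_comp = 2, running coupled at f_max = 1.0 costs energy 6.0, while decoupled with Access at f = 0.5 costs 0.125 * 4 + 2 = 2.5, at the same total execution time, which is exactly the "performance unaffected, energy reduced" behavior in the evaluation.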
Evaluation: slightly faster and 25% lower energy

- Performance is essentially unaffected (slightly faster)
- Energy is reduced by 25%

[Figure: energy (Joule) and time (sec), coupled vs. decoupled.]
Results on real hardware (Intel Sandy Bridge & Haswell)

- Benchmarks: SPEC CPU 2006 / Parboil
- 5% performance improvement, 22% EDP improvement

[Figure: normalized time and energy, split into Access and Execute phases, plus EDP; lower is better.]
Take-away message

- Goal: save energy, keep performance
- Tool: DVFS
- Problem: memory- and compute-bound phases are too fine-grained; it is impossible to adjust f at that rate
- Solution: Decoupled Access-Execute
- Contribution: automatic DAE (compiler)
- Results: 22%-25% EDP improvement compared to the original
- Open source: https://github.com/etascale/daedal
Roadmap: Clairvoyance published at CGO'17 (artifact evaluated)
Clairvoyance: the instruction window limits MLP

[Figure: the instruction window is the set of program instructions the processor can "see" at once.]
The instruction window limits performance

- Target: processors with a limited instruction window
- Clairvoyance reorders instructions to overlap long-latency loads, exposing memory-level parallelism (MLP)
Clairvoyance optimizations

- Look-ahead instruction scheduling
- Global instruction scheduling
- Advanced software pipelining
Clairvoyance transformations

- This is NOT about DVFS anymore
- Goal: hide cache misses; expose memory-level and instruction-level parallelism
- Improve the performance of more energy-efficient but limited OoO processors
Break dependent chains

ACC:
  t1 = load A
  t2 = load t1
  t3 = load B
  t4 = load t3
EXE:
  t5 = t2 + t4
Break dependent chains with multiple access phases

ACC_1:
  t1 = load A
  t3 = load B
ACC_2:
  t2 = load t1
  t4 = load t3
EXE:
  t5 = t2 + t4

Multiple access phases maximize the distance between loads and their uses.
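The slide's pseudocode corresponds to a double pointer chase in C. A straight-line sketch of the split (function names `chained`/`split` are mine; real Clairvoyance applies this across unrolled loop iterations):

```c
/* Original chain: each second load depends on the first,
   so the four loads serialize into two dependent pairs. */
long chained(long **A, long **B) {
    long *t1 = *A;   /* load A            */
    long  t2 = *t1;  /* load t1 (depends) */
    long *t3 = *B;   /* load B            */
    long  t4 = *t3;  /* load t3 (depends) */
    return t2 + t4;  /* t5 = t2 + t4      */
}

/* Clairvoyance-style split into two access phases and one execute phase. */
long split(long **A, long **B) {
    /* ACC_1: the independent loads issue back to back, exposing MLP. */
    long *t1 = *A;
    long *t3 = *B;
    /* ACC_2: the dependent loads, now far from both their producers
       and their uses, so misses can overlap. */
    long t2 = *t1;
    long t4 = *t3;
    /* EXE: arithmetic on already-loaded values. */
    return t2 + t4;
}
```

The reordering changes no results, only the distance between each load and its use.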
Results on real ARM hardware (ARMv8-64, APM X-Gene)

- Memory-intensive benchmarks: ~15% performance improvement
- All remaining benchmarks: performance not affected

[Figure: normalized execution time; lower is better.]
Take-away message (Clairvoyance)

- Limited OoO cores have lower tolerance to overhead, so the optimizations must be aggressive
- Fine-grained memory-bound phases increase memory-level parallelism
- Fine-grained compute-bound phases increase instruction-level parallelism
- Hiding long-latency misses in software yields 15% performance improvement
- Open source: https://github.com/ktran/clairvoyance
Roadmap: SWOOP published in CAL 2017
SWOOP

- Add just a bit of hardware to eliminate overheads
- Upon a cache miss, run more Access phases "out of order" (controlled by hardware)
- Virtual register contexts provide more registers and reduce register pressure
SWOOP vs. IO vs. OoO

[Figure: speedups (caution: higher is better) for A7-like (InO), A15-like (OoO), and A7-SWOOP cores with cutting-edge optimizations.]

- SWOOP is a good trade-off between InO (efficiency) and OoO (performance)
Conclusions (SWOOP)

- In-order cores: keep their energy efficiency, increase their performance
- In-order cores have no tolerance for instruction-count overhead
- Solution: compiler optimizations with minimal hardware support
Efficiency vs Performance

Compiler techniques deliver high performance at low energy cost!

[Figure: the three techniques on the hardware spectrum: DAE targets OoO cores, Clairvoyance targets limited OoO cores, SWOOP targets InO cores, spanning performance vs. energy efficiency.]
Decoupled Access-Execute: Pioneering Compilation for Energy Efficiency

Alexandra Jimborean
Alexandra.Jimborean@it.uu.se