Slide 1: Decoupled Access-Execute: Pioneering Compilation for Energy Efficiency
Previous decades: time is money, so the focus was performance. This decade: power is money, so the focus is energy. This work improves energy efficiency without hurting performance.
Alexandra Jimborean, Kim-Anh Tran, Konstantinos Koukos, Per Ekemark, Vasileios Spiliopoulos, Magnus Själander, Trevor Carlson, Stefanos Kaxiras
Slide 2: Efficiency vs. Performance
Compiler techniques to deliver high performance at low energy cost!
[Figure: hardware design space, performance vs. energy efficiency: OoO (fast), limited OoO, InO (efficient).]
Slide 3: Roadmap
- Software Decoupled Access-Execute (DAE): reduce energy on large, power-hungry OoO cores.
- Clairvoyance: increase performance via memory- and instruction-level parallelism on limited OoO cores.
- SWOOP: make in-order (IO) cores run like OoO cores.
[Figure: the three techniques plotted by CPU abilities vs. compiler sophistication.]
Slide 4: Roadmap (publications)
The same roadmap, annotated with the venues: CGO'14 (best presentation award), CC'16 (best paper award), HiPEAC'16.
Slide 5: Traditional HW techniques (DAE on OoO cores)
- DVFS: Dynamic Voltage and Frequency Scaling.
- Idea: generate large memory-bound and compute-bound phases and scale the frequency (DVFS) accordingly.
- Optimize the software to match the hardware's DVFS capabilities: automatically tune the code at compile time!
Slide 6: Coupled Execution
[Figure: coupled execution interleaves fine-grained memory-bound and compute-bound phases. Running at the maximum frequency f_max wastes energy during memory-bound phases; running at a single "optimal" frequency f_opt loses performance during compute-bound phases.]
Slide 7: Decoupled Execution (DAE)
- Decouple the memory part (Access) from the compute part (Execute).
- Access prefetches the data into the L1 cache; Execute computes using the data from the L1 cache.
- Run Access at low frequency (f_min) and Execute at high frequency (f_max).
[Figure: ideal per-phase DVFS alternating between f_min and f_max, compared with coupled execution at a single frequency.]
Slide 8: How do we implement DAE?
- Access phase (prefetches the data): remove arithmetic computation, keep (replicate) the address calculation, turn loads into prefetches, and remove stores to globals, so the phase has no side effects.
- Execute phase: the original, unmodified task, scheduled to run on the same core right after Access.
- DVFS Access and Execute independently: Access at f_low saves energy; Execute at f_high computes fast.
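A minimal sketch of this transformation in plain C, assuming a simple irregular-reduction loop (the loop, array names, and phase split are illustrative; the actual DAE pass works on compiler IR and picks the phase granularity itself):

```c
#include <stddef.h>

/* Original (coupled) task: one loop doing both memory access and compute. */
long coupled(const long *a, const long *idx, size_t n) {
    long sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += a[idx[i]] * 3;           /* irregular load + arithmetic */
    return sum;
}

/* Access phase: replicate only the address calculation, turn loads into
 * prefetches, drop all arithmetic and stores (no side effects).
 * This phase would run at low frequency. */
void access_phase(const long *a, const long *idx, size_t n) {
    for (size_t i = 0; i < n; i++)
        __builtin_prefetch(&a[idx[i]]); /* pull the data into L1 */
}

/* Execute phase: the original, unmodified task, run on the same core
 * right after Access, at high frequency, hitting in L1. */
long execute_phase(const long *a, const long *idx, size_t n) {
    long sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += a[idx[i]] * 3;
    return sum;
}

long decoupled(const long *a, const long *idx, size_t n) {
    access_phase(a, idx, n);            /* memory-bound, f_low  */
    return execute_phase(a, idx, n);    /* compute-bound, f_high */
}
```

Both versions compute the same result; the payoff is only in time and energy once each phase runs at its own DVFS operating point.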
Slide 9: Understanding DAE: modeling zero-latency, per-core DVFS
[Figure: modeled energy (J) and execution time (s) for coupled vs. decoupled execution across the frequency range f_min to f_max; the decoupled version runs Access at f_min and Execute at f_max.]
Slide 10: Understanding DAE: evaluation
Slightly faster and 25% lower energy: performance is essentially unaffected, while energy is reduced by 25%.
[Figure: energy (J) and time (s), coupled vs. decoupled.]
Slide 11: 5% performance improvement!
SPEC CPU 2006 / Parboil, measured on real hardware (Intel Sandy Bridge and Haswell): 22% EDP improvement and 5% performance improvement.
[Figure: time and energy broken down into Access and Execute phases per benchmark; lower is better.]
Slide 12: Take-away message (DAE on OoO cores)
- Goal: save energy, keep performance. Tool: DVFS.
- Problem: memory- and compute-bound phases are so fine-grained that adjusting f at that rate is impossible.
- Solution: Decoupled Access-Execute.
- Contribution: automatic DAE in the compiler.
- Results: 22-25% EDP improvement over the original code.
- Open source: https://github.com/etascale/daedal
Slide 13: Roadmap (continued)
Next on the roadmap: Clairvoyance, targeting limited OoO cores. Published at CGO'17 (artifact evaluated).
Slide 14: Clairvoyance: the instruction window limits MLP
The instruction window is the set of in-flight instructions that the processor can "see" and reorder.
Slide 15: Target: processors with a limited instruction window
The instruction window limits performance. By reordering instructions so that long-latency loads overlap, Clairvoyance exposes memory-level parallelism (MLP) in software, beyond what the hardware window can see.
Slide 16: Clairvoyance: look-ahead instruction scheduling
Optimizations: global instruction scheduling and advanced software pipelining.
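The software-pipelining idea can be sketched in plain C as a look-ahead prefetch: while computing iteration i, touch the data that iteration i + k will need (the distance `LOOKAHEAD`, the loop, and the names are illustrative; Clairvoyance performs this rescheduling at compile time on the IR):

```c
#include <stddef.h>

#define LOOKAHEAD 8  /* illustrative prefetch distance, in iterations */

/* Indirect-access sum with a software-pipelined prefetch: the load for
 * iteration i + LOOKAHEAD is issued while iteration i computes, so the
 * long-latency miss overlaps useful work instead of stalling it. */
long sum_pipelined(const long *a, const long *idx, size_t n) {
    long sum = 0;
    for (size_t i = 0; i < n; i++) {
        if (i + LOOKAHEAD < n)
            __builtin_prefetch(&a[idx[i + LOOKAHEAD]]); /* future iteration */
        sum += a[idx[i]];                               /* current iteration */
    }
    return sum;
}
```

The result is unchanged; only the schedule of memory operations relative to compute differs.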
Slide 17: Clairvoyance transformations
This is not about DVFS anymore. Goal: hide cache misses by exposing memory-level and instruction-level parallelism, improving the performance of more energy-efficient but limited OoO processors.
Slide 18: Break dependent chains
With a single access phase, each dependent load sits right next to its producer:
  ACCx:
    t1 = load A
    t2 = load t1
    t3 = load B
    t4 = load t3
  EXE:
    t5 = t2 + t4
Slide 19: Break dependent chains (continued)
  ACCx_1:
    t1 = load A
    t3 = load B
  ACCx_2:
    t2 = load t1
    t4 = load t3
  EXE:
    t5 = t2 + t4
Multiple access phases maximize the distance between loads and their uses.
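The same split written out in C for two pointer chases (the types and names are illustrative): grouping the independent first hops together lets their cache misses overlap, instead of each chain being resolved serially.

```c
/* Two dependent load chains: A -> *A -> value, B -> *B -> value. */
long combine_coupled(long **A, long **B) {
    long *t1 = *A;   /* load A  */
    long  t2 = *t1;  /* load t1 (issues right after its producer) */
    long *t3 = *B;   /* load B  */
    long  t4 = *t3;  /* load t3 (same problem) */
    return t2 + t4;
}

/* Clairvoyance-style ordering: independent loads grouped into one access
 * phase, dependent loads into the next, the use in the execute phase. */
long combine_split(long **A, long **B) {
    long *t1 = *A;   /* ACCx_1: both first hops issue back to back,  */
    long *t3 = *B;   /*         so their misses overlap (MLP)        */
    long  t2 = *t1;  /* ACCx_2: second hops, as far from their       */
    long  t4 = *t3;  /*         producers as the schedule allows     */
    return t2 + t4;  /* EXE                                          */
}
```

Both orderings are semantically identical; the split version simply gives the memory system two outstanding misses at once.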
Slide 20: Clairvoyance on memory-intensive benchmarks
~15% performance improvement, on real ARM hardware (ARMv8-64, APM X-Gene).
[Figure: normalized execution time per benchmark; lower is better.]
Slide 21: Clairvoyance on all the rest
On the remaining benchmarks, performance is not affected (same ARMv8-64 APM X-Gene hardware; lower is better).
Slide 22: Take-away message (Clairvoyance)
- Limited OoO cores have a lower tolerance to overhead, which calls for aggressive optimizations.
- Fine-grained memory-bound phases increase memory-level parallelism; fine-grained compute-bound phases increase instruction-level parallelism.
- Hiding long-latency misses in software yields 15% performance improvement.
Slide 23: Roadmap (continued)
Last stop on the roadmap: SWOOP, making in-order (IO) cores run like OoO cores. Published in CAL 2017.
Slide 24: SWOOP
Add just a bit of hardware to eliminate the remaining overheads:
- Upon a cache miss, run more Access phases "out of order" (controlled by hardware).
- Virtual register contexts provide more registers and reduce register pressure.
Slide 25: SWOOP vs. IO vs. OoO
[Figure: speedups for A7-like, A15-like, and A7-SWOOP cores with cutting-edge optimizations. Caution: these are speedups, so higher is better.]
SWOOP is a good trade-off between InO (efficient) and OoO (high-performance) designs.
Slide 26: Conclusions (SWOOP)
- In-order cores keep their energy efficiency while gaining performance.
- In-order cores have no tolerance for instruction-count overhead.
- Achieved with compiler optimizations and minimal hardware support.
Slide 27: Efficiency vs. Performance
Compiler techniques to deliver high performance at low energy cost!
[Figure: the design-space plot again, now annotated: DAE targets OoO, Clairvoyance targets limited OoO, SWOOP targets InO.]
Slide 28: Decoupled Access-Execute: Pioneering Compilation for Energy Efficiency (closing)
This work improves energy efficiency without hurting performance.
Alexandra Jimborean, it.uu.se