Slide 1: Decoupled Access-Execute: Pioneering Compilation for Energy Efficiency
Previous decades: time is money, so the focus was performance. This decade: power is money, so the focus is energy. This work improves energy efficiency without hurting performance.
Alexandra Jimborean, Kim-Anh Tran, Konstantinos Koukos, Per Ekemark, Vasileios Spiliopoulos, Magnus Själander, Trevor Carlson, Stefanos Kaxiras
Slide 2: Efficiency vs. Performance
Compiler techniques to deliver high performance at low energy cost!
[Figure: hardware design space, performance vs. energy efficiency: OoO (fast), limited OoO, InO (efficient).]
Slide 3: Roadmap
- Software Decoupled Access-Execute (DAE): reduce energy on large, power-hungry OoO cores.
- Clairvoyance: increase performance via memory- and instruction-level parallelism on limited OoO cores.
- SWOOP: make in-order (IO) cores run like OoO cores.
[Figure: the three techniques plotted by CPU abilities vs. compiler sophistication.]
Slide 4: Roadmap (publications)
The same roadmap, annotated with the venues: CGO'14 (best presentation award), CC'16 (best paper award), HiPEAC'16.
Slide 5: Traditional HW techniques (DAE on OoO cores)
- DVFS: Dynamic Voltage and Frequency Scaling.
- Idea: generate large memory-bound and compute-bound phases and scale the frequency (DVFS) accordingly.
- Optimize the software to match the hardware's DVFS capabilities: automatically tune the code at compile time!
Slide 6: Coupled Execution
[Figure: coupled execution interleaves fine-grained memory-bound and compute-bound phases. Running at the maximum frequency f_max wastes energy during memory-bound phases; running at a single "optimal" frequency f_opt loses performance during compute-bound phases.]
Slide 7: Decoupled Execution (DAE)
- Decouple the memory part (Access) from the compute part (Execute).
- Access prefetches the data into the L1 cache; Execute computes using the data from the L1 cache.
- Run Access at low frequency (f_min) and Execute at high frequency (f_max).
[Figure: ideal per-phase DVFS alternating between f_min and f_max, compared with coupled execution at a single frequency.]
Slide 8: How do we implement DAE?
- Access phase (prefetches the data): remove arithmetic computation, keep (replicate) the address calculation, turn loads into prefetches, and remove stores to globals, so the phase has no side effects.
- Execute phase: the original, unmodified task, scheduled to run on the same core right after Access.
- DVFS Access and Execute independently: Access at f_low saves energy; Execute at f_high computes fast.
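A minimal sketch of this transformation in plain C, assuming a simple irregular-reduction loop (the loop, array names, and phase split are illustrative; the actual DAE pass works on compiler IR and picks the phase granularity itself):

```c
#include <stddef.h>

/* Original (coupled) task: one loop doing both memory access and compute. */
long coupled(const long *a, const long *idx, size_t n) {
    long sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += a[idx[i]] * 3;           /* irregular load + arithmetic */
    return sum;
}

/* Access phase: replicate only the address calculation, turn loads into
 * prefetches, drop all arithmetic and stores (no side effects).
 * This phase would run at low frequency. */
void access_phase(const long *a, const long *idx, size_t n) {
    for (size_t i = 0; i < n; i++)
        __builtin_prefetch(&a[idx[i]]); /* pull the data into L1 */
}

/* Execute phase: the original, unmodified task, run on the same core
 * right after Access, at high frequency, hitting in L1. */
long execute_phase(const long *a, const long *idx, size_t n) {
    long sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += a[idx[i]] * 3;
    return sum;
}

long decoupled(const long *a, const long *idx, size_t n) {
    access_phase(a, idx, n);            /* memory-bound, f_low  */
    return execute_phase(a, idx, n);    /* compute-bound, f_high */
}
```

Both versions compute the same result; the payoff is only in time and energy once each phase runs at its own DVFS operating point.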
Slide 9: Understanding DAE: modeling zero-latency, per-core DVFS
[Figure: modeled energy (J) and execution time (s) for coupled vs. decoupled execution across the frequency range f_min to f_max; the decoupled version runs Access at f_min and Execute at f_max.]
Slide 10: Understanding DAE: evaluation
Slightly faster and 25% lower energy: performance is essentially unaffected, while energy is reduced by 25%.
[Figure: energy (J) and time (s), coupled vs. decoupled.]
Slide 11: 5% performance improvement!
SPEC CPU 2006 / Parboil, measured on real hardware (Intel Sandy Bridge and Haswell): 22% EDP improvement and 5% performance improvement.
[Figure: time and energy broken down into Access and Execute phases per benchmark; lower is better.]
Slide 12: Take-away message (DAE on OoO cores)
- Goal: save energy, keep performance. Tool: DVFS.
- Problem: memory- and compute-bound phases are so fine-grained that adjusting f at that rate is impossible.
- Solution: Decoupled Access-Execute.
- Contribution: automatic DAE in the compiler.
- Results: 22-25% EDP improvement over the original code.
- Open source: https://github.com/etascale/daedal
Slide 13: Roadmap (continued)
Next on the roadmap: Clairvoyance, targeting limited OoO cores. Published at CGO'17 (artifact evaluated).
Slide 14: Clairvoyance: the instruction window limits MLP
The instruction window is the set of in-flight instructions that the processor can "see" and reorder.
Slide 15: Target: processors with a limited instruction window
The instruction window limits performance. By reordering instructions so that long-latency loads overlap, Clairvoyance exposes memory-level parallelism (MLP) in software, beyond what the hardware window can see.
Slide 16: Clairvoyance: look-ahead instruction scheduling
Optimizations: global instruction scheduling and advanced software pipelining.
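The software-pipelining idea can be sketched in plain C as a look-ahead prefetch: while computing iteration i, touch the data that iteration i + k will need (the distance `LOOKAHEAD`, the loop, and the names are illustrative; Clairvoyance performs this rescheduling at compile time on the IR):

```c
#include <stddef.h>

#define LOOKAHEAD 8  /* illustrative prefetch distance, in iterations */

/* Indirect-access sum with a software-pipelined prefetch: the load for
 * iteration i + LOOKAHEAD is issued while iteration i computes, so the
 * long-latency miss overlaps useful work instead of stalling it. */
long sum_pipelined(const long *a, const long *idx, size_t n) {
    long sum = 0;
    for (size_t i = 0; i < n; i++) {
        if (i + LOOKAHEAD < n)
            __builtin_prefetch(&a[idx[i + LOOKAHEAD]]); /* future iteration */
        sum += a[idx[i]];                               /* current iteration */
    }
    return sum;
}
```

The result is unchanged; only the schedule of memory operations relative to compute differs.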
Slide 17: Clairvoyance transformations
This is not about DVFS anymore. Goal: hide cache misses by exposing memory-level and instruction-level parallelism, improving the performance of more energy-efficient but limited OoO processors.
Slide 18: Break dependent chains
With a single access phase, each dependent load sits right next to its producer:
  ACCx:
    t1 = load A
    t2 = load t1
    t3 = load B
    t4 = load t3
  EXE:
    t5 = t2 + t4
Slide 19: Break dependent chains (continued)
  ACCx_1:
    t1 = load A
    t3 = load B
  ACCx_2:
    t2 = load t1
    t4 = load t3
  EXE:
    t5 = t2 + t4
Multiple access phases maximize the distance between loads and their uses.
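The same split written out in C for two pointer chases (the types and names are illustrative): grouping the independent first hops together lets their cache misses overlap, instead of each chain being resolved serially.

```c
/* Two dependent load chains: A -> *A -> value, B -> *B -> value. */
long combine_coupled(long **A, long **B) {
    long *t1 = *A;   /* load A  */
    long  t2 = *t1;  /* load t1 (issues right after its producer) */
    long *t3 = *B;   /* load B  */
    long  t4 = *t3;  /* load t3 (same problem) */
    return t2 + t4;
}

/* Clairvoyance-style ordering: independent loads grouped into one access
 * phase, dependent loads into the next, the use in the execute phase. */
long combine_split(long **A, long **B) {
    long *t1 = *A;   /* ACCx_1: both first hops issue back to back,  */
    long *t3 = *B;   /*         so their misses overlap (MLP)        */
    long  t2 = *t1;  /* ACCx_2: second hops, as far from their       */
    long  t4 = *t3;  /*         producers as the schedule allows     */
    return t2 + t4;  /* EXE                                          */
}
```

Both orderings are semantically identical; the split version simply gives the memory system two outstanding misses at once.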
Slide 20: Clairvoyance on memory-intensive benchmarks
~15% performance improvement, on real ARM hardware (ARMv8-64, APM X-Gene).
[Figure: normalized execution time per benchmark; lower is better.]
Slide 21: Clairvoyance on all the rest
On the remaining benchmarks, performance is not affected (same ARMv8-64 APM X-Gene hardware; lower is better).
Slide 22: Take-away message (Clairvoyance)
- Limited OoO cores have a lower tolerance to overhead, which calls for aggressive optimizations.
- Fine-grained memory-bound phases increase memory-level parallelism; fine-grained compute-bound phases increase instruction-level parallelism.
- Hiding long-latency misses in software yields 15% performance improvement.
Slide 23: Roadmap (continued)
Last stop on the roadmap: SWOOP, making in-order (IO) cores run like OoO cores. Published in CAL 2017.
Slide 24: SWOOP
Add just a bit of hardware to eliminate the remaining overheads:
- Upon a cache miss, run more Access phases "out of order" (controlled by hardware).
- Virtual register contexts provide more registers and reduce register pressure.
Slide 25: SWOOP vs. IO vs. OoO
[Figure: speedups for A7-like, A15-like, and A7-SWOOP cores with cutting-edge optimizations. Caution: these are speedups, so higher is better.]
SWOOP is a good trade-off between InO (efficient) and OoO (high-performance) designs.
Slide 26: Conclusions (SWOOP)
- In-order cores keep their energy efficiency while gaining performance.
- In-order cores have no tolerance for instruction-count overhead.
- Achieved with compiler optimizations and minimal hardware support.
Slide 27: Efficiency vs. Performance
Compiler techniques to deliver high performance at low energy cost!
[Figure: the design-space plot again, now annotated: DAE targets OoO, Clairvoyance targets limited OoO, SWOOP targets InO.]
Slide 28: Decoupled Access-Execute: Pioneering Compilation for Energy Efficiency (closing)
This work improves energy efficiency without hurting performance.
Alexandra Jimborean, it.uu.se