Towards Power Efficiency on Task-Based, Decoupled Access-Execute Models
Konstantinos Koukos, David Black-Schaffer, Vasileios Spiliopoulos, Stefanos Kaxiras
Institute for Information Technology, Uppsala University
Informationsteknologi Institutionen för informationsteknologi | Decoupling for Energy Efficiency
- Motivation: the range of voltage scaling is shrinking
  - We can no longer rely on DVFS to provide a quadratic energy decrease for an at most linear performance decrease
- Goal: minimize performance degradation
- Ideally: change DVFS per instruction
  - Low frequency when waiting for memory
  - Max frequency when computing
  - Impractical: requires instantaneous DVFS on cache misses
- Our solution: split the program into Access (prefetch) and Execute (compute) phases: DAE (Decoupled Access-Execute)
DAE for Energy Efficiency
Our solution: split the program into Access (prefetch) and Execute (compute) phases: DAE (Decoupled Access-Execute)
- Access: prefetch data into the lower-level cache
  - Eliminates most LLC and TLB misses
  - Memory waiting happens in the access phase
  - Run at low f to maximize energy savings
- Execute: computation and stores
  - If the access phase is successful, there are not many stalls in execute
  - Run at high f to maximize performance
How do we do it
- Parallel workloads: use all cores to do useful work
- Task-based programming model. Why?
  - Schedule tasks independently
  - Control task size
  - Easy to convert to DAE!
- How? Split each task into an access phase and an execute phase
  - Access phase:
    - Remove stores
    - Remove arithmetic computation
    - Keep (replicate) address calculation
    - Turn loads into prefetches
  - Execute phase:
    - Original (unmodified) task
    - Scheduled to run on the same core right after Access
- DVFS Access and Execute independently: Access at low f, Execute at high f
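The task split described on this slide can be illustrated with a toy model. This is only a sketch under stated assumptions, not the authors' runtime: Python has no prefetch instruction, so `Core.prefetch` merely records the index it would prefetch, and `Core.set_frequency`, `F_MIN_GHZ`, and `F_MAX_GHZ` are hypothetical stand-ins for a per-core DVFS request and its operating points.

```python
from dataclasses import dataclass, field
from typing import List, Sequence, Set, Tuple

F_MIN_GHZ = 1.6  # hypothetical lowest operating point
F_MAX_GHZ = 3.4  # hypothetical highest operating point


@dataclass
class Core:
    """Toy core: tracks the current frequency and what was prefetched."""
    freq_ghz: float = F_MAX_GHZ
    log: List[Tuple[str, float]] = field(default_factory=list)
    prefetched: Set[int] = field(default_factory=set)

    def set_frequency(self, ghz: float) -> None:
        # Stand-in for a per-core DVFS request (hypothetical API).
        self.freq_ghz = ghz

    def prefetch(self, index: int) -> None:
        # Stand-in for a prefetch instruction: only remember the address.
        self.prefetched.add(index)


def execute_task(core: Core, data: Sequence[float],
                 indices: Sequence[int], out: List[float]) -> None:
    """Original, unmodified task: loads, arithmetic, and one store."""
    total = 0.0
    for i in indices:
        total += data[i] * 2.0  # load + arithmetic
    out.append(total)           # store


def access_task(core: Core, data: Sequence[float],
                indices: Sequence[int], out: List[float]) -> None:
    """Access version: same address calculation, loads turned into
    prefetches, arithmetic and stores removed."""
    for i in indices:
        core.prefetch(i)


def run_dae(core: Core, tasks) -> None:
    """Run each task as Access at low f, then Execute at high f,
    back to back on the same core."""
    for args in tasks:
        core.set_frequency(F_MIN_GHZ)
        access_task(core, *args)
        core.log.append(("access", core.freq_ghz))
        core.set_frequency(F_MAX_GHZ)
        execute_task(core, *args)
        core.log.append(("execute", core.freq_ghz))
```

Calling `run_dae(Core(), [(data, [0, 2], out)])` leaves the task result in `out` exactly as the coupled task would, while the core's log shows the access phase at the low frequency and the execute phase at the high one.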
Estimate Access/Execute Energy
- Limitations of current processors:
  - Do not yet support low-latency, per-core DVFS
  - Our task-based runtime supports per-task-phase DVFS, but doing it for real is impractical for small tasks (working-set size ~ L2)
- Modeling DAE for future processors:
  - Per-core DVFS (on-chip voltage regulator)
  - Reduce DVFS overhead 50×
- How do we do this?
[Diagram labels: Run-time Statistics (Time, IPC), Power Model, Estimate, Verify Accuracy]
Estimate per phase energy
- Power model: $P = f \times V^2 \times A \times C$
- $A \times C$: effective capacitance $C_{eff}$, measured as a function of IPC
- Profiling for each $f$ and $V$: $\mathit{IPC}_A$, $\mathit{IPC}_X$, $\mathit{Time}_A$, $\mathit{Time}_X$
- $E_A = f_{min} \times V_{min}^2 \times C_{eff}(\mathit{IPC}_A) \times \mathit{Time}_A$
- $E_X = f_{max} \times V_{max}^2 \times C_{eff}(\mathit{IPC}_X) \times \mathit{Time}_X$
- Now we can model per-core instantaneous DVFS
[Plot: Effective Capacitance vs. IPC]
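The per-phase energy formulas above can be sketched in a few lines. The slide says only that effective capacitance is measured as a function of IPC, so the linear fit in `c_eff` and its coefficients `c0` and `c1` are placeholders chosen for illustration, not measured values. The `Phase` type is likewise a hypothetical container for one profiled phase.

```python
from typing import NamedTuple


class Phase(NamedTuple):
    """Profiled statistics for one phase at one (f, V) operating point."""
    freq_hz: float  # clock frequency in Hz
    volt: float     # supply voltage in V
    ipc: float      # measured instructions per cycle
    time_s: float   # measured phase duration in seconds


def c_eff(ipc: float, c0: float = 1.0e-9, c1: float = 0.5e-9) -> float:
    """Effective capacitance (A x C) in farads as a function of IPC.
    ASSUMPTION: a linear fit with made-up coefficients."""
    return c0 + c1 * ipc


def phase_energy(p: Phase) -> float:
    """E = f x V^2 x C_eff(IPC) x Time, in joules."""
    return p.freq_hz * p.volt ** 2 * c_eff(p.ipc) * p.time_s


def dae_energy(access: Phase, execute: Phase) -> float:
    """Total energy of a decoupled task: E_A + E_X."""
    return phase_energy(access) + phase_energy(execute)
```

Because each phase carries its own frequency and voltage, summing the two estimates models a DVFS switch between phases with no transition cost, which matches the instantaneous per-core DVFS the slide sets out to model.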
Understanding DAE
[Figure: Time (sec) and Energy (Joule), Coupled vs. Decoupled. Labels: Coupled (f_max, f_min); Execute phase (f_max, f_min); Access phase (f_min)]
Understanding DAE
[Figure: Time (sec) and Energy (Joule), Coupled vs. Decoupled]
- Performance is unaffected
- Energy is reduced by 25%
Three Experiments
- Coupled at optimal EDP (f, V)
- Decoupled Naïve
  - Access at f_min
  - Execute at f_max
- Decoupled Optimal EDP
  - Access at the optimal f_opt for Access
  - Execute at the optimal f_opt for Execute
All results are normalized to Coupled at f_max
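One way to pick the operating points for the "optimal EDP" configurations is an exhaustive search over profiled points. The sketch below is an assumption about how such a search could look. The slides do not say whether the authors optimize each phase separately or both together, and this version searches all Access/Execute pairs for the lowest combined EDP. The `Point` type and every number in the usage note are hypothetical.

```python
from itertools import product
from typing import NamedTuple, Sequence, Tuple


class Point(NamedTuple):
    """Modeled time and energy of one phase (or coupled task) at one frequency."""
    freq_ghz: float
    time_s: float
    energy_j: float


def edp(energy_j: float, time_s: float) -> float:
    """Energy-delay product."""
    return energy_j * time_s


def best_coupled(points: Sequence[Point]) -> Point:
    """Coupled at optimal EDP: one frequency for the whole task."""
    return min(points, key=lambda p: edp(p.energy_j, p.time_s))


def best_decoupled(access: Sequence[Point],
                   execute: Sequence[Point]) -> Tuple[Point, Point]:
    """Decoupled optimal EDP: choose one point per phase so that the
    combined EDP (E_A + E_X) x (T_A + T_X) is minimal."""
    return min(
        product(access, execute),
        key=lambda pair: edp(pair[0].energy_j + pair[1].energy_j,
                             pair[0].time_s + pair[1].time_s),
    )
```

With made-up access points `Point(1.6, 1.0, 1.0)` and `Point(3.4, 0.9, 2.5)` and execute points `Point(1.6, 4.0, 3.0)` and `Point(3.4, 2.0, 4.0)`, the search selects Access at 1.6 GHz and Execute at 3.4 GHz. For these particular numbers that coincides with the Decoupled Naïve configuration.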
Evaluation: Coupled at Optimal EDP
[Figure: Normalized Time and Normalized EDP, with G.Mean]
- Good EDP, bad performance
- Overall slowdown: ≈ 12%
- EDP improvement: ≈ 22%
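As a rough cross-check, the two geometric means on this slide imply an energy figure. This assumes that a 12% slowdown means a normalized time of about 1.12 and a 22% EDP improvement means a normalized EDP of about 0.78. Since EDP is energy times delay and geometric means are multiplicative, the normalized energy follows by division:

```latex
\frac{E}{E_{f_{max}}}
  = \frac{EDP / EDP_{f_{max}}}{T / T_{f_{max}}}
  \approx \frac{0.78}{1.12}
  \approx 0.70
```

Under these assumptions, running coupled at the EDP-optimal point saves roughly 30% energy relative to coupled at f_max, paid for with the 12% slowdown.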
Evaluation: Decoupled Naïve
[Figure: Normalized Time and Normalized EDP, with G.Mean]
- No slowdown
- Better EDP: DAE can improve EDP over coupled execution
Evaluation: Decoupled Opt. EDP
[Figure: Normalized Time and Normalized EDP, with G.Mean]
- Even better EDP, with a performance decrease
- DAE Opt. EDP can further improve EDP at the cost of a slight performance impact
Conclusions
- Separating execute and access enables optimal DVFS
- Delivers f_max performance at optimal EDP
Questions
Thank You!