Towards Power Efficiency on Task-Based, Decoupled Access-Execute Models
Konstantinos Koukos, David Black-Schaffer, Vasileios Spiliopoulos, Stefanos Kaxiras


Towards Power Efficiency on Task-Based, Decoupled Access-Execute Models
Konstantinos Koukos, David Black-Schaffer, Vasileios Spiliopoulos, Stefanos Kaxiras
Department of Information Technology, Uppsala University

Decoupling for Energy Efficiency

Motivation: the range of voltage scaling is shrinking
- We can no longer rely on DVFS to provide a quadratic energy decrease for an at-most-linear performance decrease
Goal: minimize performance degradation
- Ideally, change DVFS per instruction:
  - low frequency when waiting for memory
  - max frequency when computing
- Impractical: requires instantaneous DVFS on cache misses
Our solution: split the program into Access (prefetch) and Execute (compute)
- DAE (Decoupled Access-Execute)

DAE for Energy Efficiency

Our solution: split the program into Access (prefetch) and Execute (compute), i.e., DAE (Decoupled Access-Execute)
Access:
- Prefetch data into the lower-level cache
- Eliminate most LLC and TLB misses
- Memory waiting happens in the access phase
- Run at low f to maximize energy savings
Execute:
- Computation and stores
- If access is successful, few stalls in execute
- Run at high f to maximize performance

How do we do it

Parallel workloads:
- Use all cores to do useful work
Task-based programming model. Why?
- Schedule tasks independently
- Control task size
- Easy to convert to DAE!
How? Split each task into an access phase and an execute phase
Access phase:
- Remove stores
- Remove arithmetic computation
- Keep (replicate) address calculation
- Turn loads into prefetches
Execute phase:
- The original (unmodified) task
- Scheduled to run on the same core right after Access
DVFS Access and Execute independently: low f for Access, high f for Execute
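The transformation above can be sketched in C. This is a minimal illustration on a hypothetical task (the loop body and all names are my own, not from the talk's benchmarks); the access phase keeps only the address calculation and turns loads into prefetches, while the execute phase is the untouched original task.

```c
#include <stddef.h>

/* Access phase: address calculation kept, loads turned into
 * prefetches, stores and arithmetic removed.
 * Intended to run at low frequency. */
static void task_access(const double *x, size_t n)
{
    for (size_t i = 0; i < n; i += 8)        /* ~one cache line of doubles */
        __builtin_prefetch(&x[i], 0, 3);     /* read, keep in cache */
}

/* Execute phase: the original, unmodified task body. Runs on the
 * same core right after the access phase, at high frequency. */
static void task_execute(double *y, const double *x, double a, size_t n)
{
    for (size_t i = 0; i < n; i++)
        y[i] = a * x[i];
}
```

The runtime would schedule `task_access` and then `task_execute` back-to-back on the same core, requesting the low and high DVFS setting around each call.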

Estimate Access/Execute Energy: Limitations

Current processors do not yet support low-latency, per-core DVFS
- Our task-based runtime supports per-task-phase DVFS, but it is impractical to do for real on small tasks (working-set size ~ L2)
Modeling DAE for future processors:
- Per-core DVFS (on-chip voltage regulators)
- DVFS overhead reduced 50×
How do we do this?
[Flow diagram: run-time statistics (time, IPC) → power model → energy estimate; verify estimate accuracy]

Estimate per-phase energy

Power = f × V² × A × C
A × C: the effective capacitance C_eff, measured as a function of IPC
Profiling gives, for each f and V: IPC_A, IPC_X, Time_A, Time_X
E_A = f_min × V_min² × C_eff(IPC_A) × Time_A
E_X = f_max × V_max² × C_eff(IPC_X) × Time_X
Now we can model per-core instantaneous DVFS
[Plot: effective capacitance C_eff vs. IPC]
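The per-phase energy model is straightforward to evaluate once C_eff(IPC) has been profiled. A minimal sketch, where the linear C_eff fit is a made-up stand-in for the measured curve (the coefficients are illustrative, not from the paper):

```c
/* Hypothetical linear fit of effective capacitance vs. IPC,
 * standing in for the profiled C_eff(IPC) curve. */
static double c_eff(double ipc)
{
    return 0.5e-9 + 0.3e-9 * ipc;    /* assumed fit, in farads */
}

/* E = f * V^2 * C_eff(IPC) * time, evaluated per phase. */
static double phase_energy(double f_hz, double volts, double ipc, double secs)
{
    return f_hz * volts * volts * c_eff(ipc) * secs;
}
```

E_A would then be `phase_energy(f_min, V_min, IPC_A, Time_A)` and E_X would be `phase_energy(f_max, V_max, IPC_X, Time_X)`, matching the two formulas above.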

Understanding DAE

[Charts: Time (sec) and Energy (Joule), Coupled vs. Decoupled, sweeping the execute phase (and the Coupled baseline) from f_max to f_min; the access phase runs at f_min]

Understanding DAE

[Charts: Time (sec) and Energy (Joule), Coupled vs. Decoupled]
- Performance is unaffected
- Energy is reduced by 25%

Three Experiments

1. Coupled at optimal-EDP f, V
2. Decoupled Naïve:
   - Access at f_min
   - Execute at f_max
3. Decoupled Optimal EDP:
   - Access at its optimal f (for Access)
   - Execute at its optimal f (for Execute)
All results are normalized to Coupled at f_max
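For reference, the metrics used in the evaluation reduce to two small formulas; the helper names below are my own, not from the talk:

```c
/* Energy-delay product: the figure of merit being optimized. */
static double edp(double energy_j, double time_s)
{
    return energy_j * time_s;
}

/* All results are reported relative to the Coupled-at-f_max baseline. */
static double normalize(double value, double baseline)
{
    return value / baseline;
}
```

Since normalized EDP is normalized energy times normalized time, a 12% slowdown (normalized time 1.12) combined with a 22% EDP improvement (normalized EDP 0.78) implies normalized energy of roughly 0.78 / 1.12 ≈ 0.70.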

Evaluation: Coupled at Optimal EDP

[Charts: normalized time and normalized EDP per benchmark, with geometric mean]
- Good EDP, but bad performance
- Overall slowdown: ≈ 12%; EDP improvement: ≈ 22%

Evaluation: Decoupled Naïve

[Charts: normalized time and normalized EDP per benchmark, with geometric mean]
- Better EDP with no slowdown
- DAE can improve EDP over coupled

Evaluation: Decoupled Opt. EDP

[Charts: normalized time and normalized EDP per benchmark, with geometric mean]
- Even better EDP, with a slight performance decrease
- DAE at optimal EDP can further improve EDP at a slight performance cost

Conclusions

- Separating execute and access enables optimal DVFS per phase
- Delivers f_max performance at optimal EDP

Questions

Thank you!