1 Estimating the Worst-Case Energy Consumption of Embedded Software Ramkumar Jayaseelan Tulika Mitra Xianfeng Li School of Computing National University.

Presentation transcript:

1 Estimating the Worst-Case Energy Consumption of Embedded Software Ramkumar Jayaseelan Tulika Mitra Xianfeng Li School of Computing National University of Singapore

2 Motivation: Conventional scheduling techniques give timing guarantees: processor cycles are the critical resource, and the WCET of each task is a required input. Battery life is equally important for mobile devices: scheduling techniques have to give energy guarantees, and the Worst-Case Energy Consumption (WCEC) of each task is a required input.

3 Remotely Deployed Systems: Available energy is unevenly distributed among nodes. Spatio-temporal scheduling benefits from WCEC. [Figure: sensor network and local station]

4 Energy-Based Guarantees: Scheduling critical and non-critical tasks in a battery-operated system. Non-critical tasks can be run only if the energy constraints for critical tasks are satisfied. Worst-case energy estimation is crucial.

5 Reward-Based Scheduling: Energy consumption ∝ Voltage. Delay ∝ (1 / Voltage). Reward-based scheduling attempts to satisfy constraints on both energy and timing. An energy guarantee is possible only if the worst-case energy consumption of the tasks is known.

6 Outline: Background; Relation between WCET and worst-case energy consumption; Estimation technique: simplified model; Instruction cache and speculation; Experimental results; Conclusion

7 Background: Power and energy are often used interchangeably. Power is the energy consumed per unit time. Energy consumed during program execution: E = P × t. This is an approximation, as P is itself a function of time.

8 In reality, when a program executes, energy is the area under the power curve: E = ∫P(t)dt. E = P × T is an approximation. [Figure: power versus time]
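
The gap between the two expressions on this slide can be illustrated numerically. The following is a minimal sketch, assuming a sampled power trace; the sample values and function names are hypothetical and are not taken from the presentation.

```python
def energy_trapezoid(times_s, power_w):
    """Approximate E = integral of P(t) dt with the trapezoidal rule (joules)."""
    if len(times_s) != len(power_w):
        raise ValueError("times and power samples must have the same length")
    total = 0.0
    for i in range(1, len(times_s)):
        dt = times_s[i] - times_s[i - 1]
        total += 0.5 * (power_w[i] + power_w[i - 1]) * dt
    return total


def energy_constant_power(power_w, duration_s):
    """E = P x T, which treats power as constant over the whole run."""
    return power_w * duration_s


# Hypothetical power trace: time in seconds, power in watts.
times = [0.0, 1.0, 2.0, 3.0, 4.0]
power = [1.0, 3.0, 2.0, 4.0, 2.0]

area = energy_trapezoid(times, power)                     # area under the curve
peak_based = energy_constant_power(max(power), times[-1])  # peak power x T
print(area, peak_based)
```

Using the peak power for P overestimates the area, while using an average power can underestimate the energy of a particular run; this is the sense in which E = P × T is only an approximation.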

9 WCEC versus WCET: Full input-space expansion for a 5-element insertion sort program

10 Cannot Estimate WCEC from WCET: Possible underestimation using WCEC = WCET × power. [Table: Benchmark | WCET × avg_power (µJ) | Observed (µJ); benchmarks: isort, fft, fdct, ludcmp, matsum, minver, bsearch, des, matmult, qsort, qurt]

11 WCEC versus WCET: The WCEC path need not be the same as the WCET path. WCEC cannot be directly estimated from the WCET value.

12 A Closer Look at Power: Dynamic power: power consumed due to switching of transistors. Leakage power: power consumed independent of switching activity. Dynamic power forms the bulk of power consumption in today’s processors.

13 Dynamic Power: P = (1/2) × A × V² × C × f, where V is the supply voltage, C is the capacitance of the circuit, f is the frequency, and A is the activity factor. V, C, and f are independent of program execution. Variation in P is due to the variation in A.
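
The formula on this slide can be evaluated directly. The sketch below is only an illustration; the parameter values are hypothetical and chosen for easy arithmetic.

```python
def dynamic_power(activity, voltage_v, capacitance_f, frequency_hz):
    """P = (1/2) x A x V^2 x C x f, in watts."""
    return 0.5 * activity * voltage_v ** 2 * capacitance_f * frequency_hz


# Hypothetical values: V = 2 V, C = 1 nF, f = 1 GHz.
V, C, F = 2.0, 1e-9, 1e9

p_half = dynamic_power(0.5, V, C, F)  # activity factor A = 0.5
p_full = dynamic_power(1.0, V, C, F)  # activity factor A = 1.0
print(p_half, p_full)
```

With V, C, and f held fixed, the power scales linearly with A, which is why the analysis concentrates on how A varies during execution.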

14 Variation in Activity Factor (A): Not all parts of the processor are used in every cycle; e.g., the data cache is used only for loads/stores. Clock gating disables unused components. The activity factor (A) varies during the execution of the program. Model the variation in A through static analysis.

15 Switch-off Energy: An inactive component cannot be fully switched off; a certain portion of the peak energy is consumed even in idle cycles. Switch-off energy is proportional to the number of idle cycles.

16 Clock Energy and Leakage Energy: Clock power: power consumed in the clock distribution network. Leakage power: power consumed due to leakage in transistors. Clock energy and leakage energy are directly proportional to the execution time.

17 Energy Components Summary: Dynamic energy: switching of transistors during execution; independent of execution time. Switch-off energy: energy consumed in unused components; depends on idle cycles. Clock and leakage energy: directly proportional to execution time.
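
This split into a time-independent part and a time-proportional part is what allows the WCEC path to differ from the WCET path. The sketch below uses two hypothetical program paths with made-up cycle counts and energies; it is an illustration, not data from the presentation.

```python
# Hypothetical per-cycle clock + leakage energy, in microjoules.
PER_CYCLE_UJ = 0.01

# Hypothetical paths: path_a is longer, path_b has more switching activity.
paths = {
    "path_a": {"cycles": 100, "dynamic_uj": 2.0},
    "path_b": {"cycles": 90, "dynamic_uj": 4.0},
}


def total_energy_uj(path):
    """Dynamic energy plus the energy that grows with execution time."""
    return path["dynamic_uj"] + PER_CYCLE_UJ * path["cycles"]


wcet_path = max(paths, key=lambda name: paths[name]["cycles"])
wcec_path = max(paths, key=lambda name: total_energy_uj(paths[name]))
print(wcet_path, wcec_path)
```

Here the slower path is not the most energy-hungry one, because the shorter path spends more dynamic energy than the longer path spends on the extra cycles.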

18 WCEC versus WCET: Full input-space expansion for a 5-element insertion sort program

19 Our Analysis: Overview: Operate on the control flow graph. Estimate the worst-case energy of basic blocks. Formulate the estimation for the whole program as an integer linear programming (ILP) problem.

20 ILP Formulation: Input: control flow graph of the program. Objective function: Worst-Case Energy = Σ_B (WCEC_B × count_B). Need to estimate the Worst-Case Energy Consumption (WCEC_B) for each basic block B.

21 Flow Constraints (inflow = basic block execution count = outflow): E_{0,1} = B0 = 1; E_{2,3} + E_{1,3} = B3 = 1; E_{0,1} + E_{2,1} = E_{1,2} + E_{1,3} = B1; E_{1,2} = E_{2,3} + E_{2,1} = B2. Bound on maximum loop iterations: E_{2,1} <= 100. [Figure: control flow graph with basic blocks B0, B1, B2, B3]
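
To make the formulation concrete, the flow constraints on this slide can be combined with the objective Σ (WCEC_B × count_B) and maximized. The sketch below solves this tiny instance by plain enumeration rather than with an ILP solver; the per-block energy values are hypothetical.

```python
# Hypothetical worst-case energy per basic block, in microjoules.
WCEC_UJ = {"B0": 5, "B1": 10, "B2": 20, "B3": 5}
LOOP_BOUND = 100  # E_{2,1} <= 100, as on the slide


def solve_by_enumeration(wcec, loop_bound):
    """Maximize sum(WCEC_B * count_B) subject to the slide's flow constraints."""
    best_energy, best_counts = None, None
    for e21 in range(loop_bound + 1):  # back edge B2 -> B1
        for e13 in (0, 1):             # exit edge B1 -> B3
            e01 = 1                    # E_{0,1} = B0 = 1
            e23 = 1 - e13              # E_{2,3} + E_{1,3} = B3 = 1
            b1 = e01 + e21             # inflow of B1
            e12 = b1 - e13             # outflow of B1: E_{1,2} + E_{1,3} = B1
            b2 = e12                   # inflow of B2
            if e12 < 0 or b2 != e23 + e21:  # outflow of B2 must match
                continue
            counts = {"B0": 1, "B1": b1, "B2": b2, "B3": 1}
            energy = sum(wcec[b] * counts[b] for b in counts)
            if best_energy is None or energy > best_energy:
                best_energy, best_counts = energy, counts
    return best_energy, best_counts


energy, counts = solve_by_enumeration(WCEC_UJ, LOOP_BOUND)
print(energy, counts)
```

Enumeration only works because this example has two free variables; a real program needs an ILP solver, as the presentation states later.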

22 Worst-Case Energy of a Basic Block: Processor model. Energy components: instruction-specific energy and pipeline-specific energy.

23 Processor Model [Figure: pipeline with stages IF, ID, ISSUE, EX, WB, CM; instruction buffer (IBUF), reorder buffer (ROB), and functional units ALU, MULT, FPU]

24 Pipelined Execution of Instructions: ADD R1,R2,R3; MUL R4,R5,R6; SUB R7,R8,R [Figure: cycle-by-cycle pipeline diagram in which ADD, MUL, and SUB each pass through IF, ID, IS, EX, WB, CM] It is difficult to statically predict the energy consumption in each cycle.

25 Pipelined Execution of Instructions: ADD R1,R2,R3; MUL R4,R5,R6; SUB R7,R8,R [Figure: cycle-by-cycle pipeline diagram of the same instructions with a stall] It is difficult to statically predict the energy consumption in each cycle.

26 Our Approach: Determine the maximum energy consumed on a component-by-component basis. Use static analysis to determine the maximum energy consumed by a component in a specified interval.

27 Execution of an Instruction [Figure: pipeline stages IF, ID, ISSUE, EX, WB, CM]

28 Instruction-Specific Energy: Energy consumed due to the sub-tasks associated with the execution of an instruction, e.g., register file access, ALU usage, etc. Depends on the type of the executed instruction. No correlation with execution time.

29 Pipeline-Specific Energy: During program execution, energy is consumed due to switch-off power (idle cycles), leakage power (every cycle), and clock network power (every cycle). This energy cannot be attributed to any instruction. Energy is consumed even in idle cycles.

30 Energy Components: Observation: the energy consumed can be separated into instruction-specific energy (energy associated with the execution of a particular instruction; independent of execution time) and pipeline-specific energy (energy consumed in other components such as the clock network, leakage, etc.; related to execution time).

31 Worst-Case Energy of a Basic Block: dynamic_BB is the instruction-specific energy for BB. switchoff_BB, leakage_BB, and clock_BB are the energy consumed in unused components, by leakage, and in the clock network during WCET_BB.
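
Read together with the energy components summary on slide 17, the four terms defined on this slide suggest an additive bound on the energy of a basic block. The expression below is a sketch under that assumption; the symbol on the left-hand side reuses the WCEC_B notation from the ILP formulation and is not shown in this transcript.

```latex
\mathrm{WCEC}_{BB} \;=\; \mathrm{dynamic}_{BB} \;+\; \mathrm{switchoff}_{BB} \;+\; \mathrm{leakage}_{BB} \;+\; \mathrm{clock}_{BB}
```

Only the first term is independent of execution time; the other three are evaluated over WCET_BB.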

32 Instruction-Specific Energy: Energy consumed due to the switching activity generated by the instructions in BB. It is the sum of the energy consumed by the individual instructions in BB.
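
The summation described on this slide can be sketched as a table lookup per instruction type. The energy table and the example block below are hypothetical; a real table would come from a power model of the target processor.

```python
# Hypothetical energy per instruction type, in nanojoules.
ENERGY_PER_INSTRUCTION_NJ = {"add": 1.0, "sub": 1.0, "mul": 3.0, "load": 2.5}


def dynamic_energy_nj(block, table):
    """Sum the instruction-specific energy of every instruction in a basic block."""
    return sum(table[opcode] for opcode in block)


# A basic block given as a list of opcodes, in program order.
example_block = ["add", "mul", "sub"]
print(dynamic_energy_nj(example_block, ENERGY_PER_INSTRUCTION_NJ))
```

The result depends only on which instructions the block contains, not on how long they take to execute.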

33 Switch-off Energy: Unused units consume 10% of peak energy. Switch-off energy for a specific component (C). Switch-off energy for basic block BB.
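
One way to read the 10% rule is sketched below: each idle cycle of a component costs a tenth of its peak per-cycle energy, and the block total is the sum over components. The function names, component names, and numbers are hypothetical, and how the analysis bounds the idle cycles of each component is not shown here.

```python
SWITCHOFF_FRACTION = 0.10  # unused units consume 10% of peak energy


def switchoff_component(peak_energy_per_cycle, idle_cycles):
    """Switch-off energy of one component over its idle cycles."""
    return SWITCHOFF_FRACTION * peak_energy_per_cycle * idle_cycles


def switchoff_block(peak_by_component, idle_cycles_by_component):
    """Switch-off energy of a basic block: sum over all components."""
    return sum(
        switchoff_component(peak_by_component[c], idle_cycles_by_component[c])
        for c in peak_by_component
    )


# Hypothetical peak energy per cycle (nJ) and idle-cycle bounds per component.
peaks = {"alu": 2.0, "dcache": 5.0}
idle = {"alu": 10, "dcache": 30}
print(switchoff_block(peaks, idle))
```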

34 Clock Energy and Leakage Energy Clock Energy Leakage Energy
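
Since both terms are proportional to execution time, each can be sketched as a power multiplied by the block's worst-case execution time. The function below assumes the WCET is given in cycles and converts it to seconds with the clock frequency; all names and values are hypothetical.

```python
def time_proportional_energy_j(power_w, wcet_cycles, frequency_hz):
    """Energy = power x time, with time = WCET in cycles / clock frequency."""
    return power_w * wcet_cycles / frequency_hz


# Hypothetical values: 1 GHz clock, basic block WCET of 1000 cycles.
FREQUENCY_HZ = 1e9
WCET_CYCLES = 1000

clock_energy = time_proportional_energy_j(0.5, WCET_CYCLES, FREQUENCY_HZ)    # 0.5 W clock power
leakage_energy = time_proportional_energy_j(0.1, WCET_CYCLES, FREQUENCY_HZ)  # 0.1 W leakage power
print(clock_energy, leakage_energy)
```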

35 Overlap among Basic Blocks [Figure: timeline showing blocks B1, B2, BB, and B3 over times t1 to t5, with WCET_BB marked]

36 Switch-off Energy: Unused units consume 10% of peak energy. Switch-off energy for a specific component (C). Switch-off energy for basic block BB.

37 Instruction Cache Modeling: Context-based ILP formulation used in WCET analysis [Li et al. RTSS 2004]. A basic block is divided into memory blocks. A context comprises a mapping of each of these memory blocks to hit/miss. Estimate the worst-case energy of each context, taking into account the main memory access energy.

38 Modeling Branch Misprediction [Figure: timeline showing blocks BB’, BB, and BX over times t1 to t3]

39 Objective Function: count(c,ω) is the number of times the basic block Bi is executed with path from Bj and the branch is predicted correctly. count(m,ω) is similarly defined for the case where the branch is mispredicted. energy(c,ω) and energy(m,ω) are defined in a similar manner. The ILP problem is solved to generate values for count using constraints similar to those of WCET analysis.

40 Results: Platform: SimpleScalar toolset. The WCET analysis tool [Li et al. RTSS 2004] was modified to estimate worst-case energy. Energy values for processor components are derived from the parameterized models in Wattch. The ILP problem is solved using CPLEX.

41 Results: Compare the estimated WCEC against the observed values for eleven benchmarks. Observed values are obtained using the Wattch power simulator. The actual inputs producing the WCEC are unknown; inputs that might produce the WCEC are selected manually.

42 Styles of Clock Gating: Simple: peak power is consumed even if there is only one access to a specific component. Ideal: power consumed is proportional to the number of ports accessed. Realistic: same as ideal, but unused components consume switch-off power.
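
The three styles can be compared with a small per-cycle model. This is a sketch under stated assumptions: a component has a fixed number of ports, a component with no accesses consumes nothing under the simple and ideal styles, and switch-off power is 10% of peak as on the switch-off energy slide. The function name and the numbers are hypothetical.

```python
SWITCHOFF_FRACTION = 0.10  # assumed switch-off share of peak energy


def cycle_energy(style, peak, ports, accesses):
    """Energy of one component in one cycle under a clock-gating style."""
    if not 0 <= accesses <= ports:
        raise ValueError("accesses must be between 0 and the number of ports")
    if style == "simple":
        # Full peak energy as soon as there is at least one access.
        return peak if accesses > 0 else 0.0
    scaled = peak * accesses / ports  # proportional to ports accessed
    if style == "ideal":
        return scaled
    if style == "realistic":
        # Same as ideal, but an unused component still draws switch-off energy.
        return scaled if accesses > 0 else SWITCHOFF_FRACTION * peak
    raise ValueError(f"unknown clock-gating style: {style}")


# Hypothetical 4-port component with a peak energy of 4.0 nJ per cycle.
for style in ("simple", "ideal", "realistic"):
    print(style, cycle_energy(style, 4.0, 4, 1), cycle_energy(style, 4.0, 4, 0))
```

With one access on a four-port component, the simple style charges four times as much as the ideal style, which is consistent with the simple style giving looser estimates.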

43 Results: Results for ideal clock gating are more accurate than for simple clock gating because of the distribution of accesses. [Table: Benchmark | Est (µJ), Obs (µJ), Ratio under ideal clock gating | Est (µJ), Obs (µJ), Ratio under simple clock gating; benchmarks: isort, fft, fdct, ludcmp, matsum, minver, bsearch, des, matmult, qsort, qurt]

44 Results: Results for ideal clock gating are more accurate than for realistic clock gating because of conservative WCET estimation. [Table: Benchmark | Est (µJ), Obs (µJ), Ratio under realistic clock gating | Est (µJ), Obs (µJ), Ratio under ideal clock gating; benchmarks: isort, fft, fdct, ludcmp, matsum, minver, bsearch, des, matmult, qsort, qurt]

45 Conclusion: A static worst-case energy estimation technique that takes into account pipelining, the instruction cache, and branch prediction. Future work: validation using commercial processors; exploring the possibility of providing thermal guarantees.

46 Execution of an Add Instruction (ADD): IF: I-cache access. ID: instruction decode + rename logic. ISSUE: wakeup + selection logic. EX: register file read + add unit access. WB: result bus. CM: ROB retire + register file update.

47 Instruction-Specific Energy: Each component is accessed once. Selection logic may be accessed multiple times. Instruction-specific energy is