10 Years of Research on Power Management (now called Green Computing)
Rami Melhem, Daniel Mosse, Bruce Childers

Outline
– Introduction
– Power management in real-time systems
– Power management in multi-core processors
– Performance-resilience-power tradeoff
– Management of memory power
– Phase Change Memory

Power management techniques
Two common techniques:
1) Throttling: turn off (or change the mode of) unused components. (Usage patterns must be predicted to avoid the time and energy overhead of on/off or mode switching.)
2) Frequency and voltage scaling: scale down a core's speed (frequency and voltage).
Designing power-efficient components is orthogonal to power management.

Frequency/voltage scaling
Gracefully reduces performance. Total power is P = C·f³ + P_ind, where the dynamic term C·f³ depends on frequency and the static term P_ind is independent of f.
When frequency is halved:
– Execution time is doubled
– The C·f³ power term is divided by 8
– Energy due to C·f³ is divided by 4
– Energy due to P_ind is doubled
(figure: power vs. time, comparing full speed followed by idle time against halved speed)
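The halving arithmetic above can be checked with a small sketch (the constants C and P_ind are illustrative, not measured values):

```python
C, P_IND = 1.0, 0.5      # illustrative dynamic coefficient and static power
WORK = 1.0               # cycles of work, so execution time = WORK / f

def dynamic_energy(f):
    return C * f**3 * (WORK / f)   # power C*f^3 over time WORK/f

def static_energy(f):
    return P_IND * (WORK / f)      # static power paid for the whole duration

f = 2.0
# Halving f: time doubles, dynamic energy drops 4x, static energy doubles.
assert (WORK / (f / 2)) == 2 * (WORK / f)
assert dynamic_energy(f / 2) == dynamic_energy(f) / 4
assert static_energy(f / 2) == 2 * static_energy(f)
```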

Different goals of power management
– Minimize total energy consumption (static energy decreases with speed; dynamic energy increases with speed)
– Minimize the energy-delay product (takes performance into consideration)
– Minimize the maximum temperature
– Maximize performance given a power budget
– Minimize energy given a deadline
– Minimize energy given reliability constraints
(figure: energy and energy×delay vs. speed f; per unit of work, dynamic energy grows as C·f² while static energy falls as P_ind/f, so the total has a minimum)
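For the energy-minimization goal, the per-unit-of-work curves in the figure (dynamic C·f², static P_ind/f) give a closed-form optimum. A sketch with illustrative constants:

```python
C, P_IND = 1.0, 2.0   # illustrative dynamic coefficient and static power

def energy_per_unit_work(f):
    # dynamic energy C*f^2 grows with f; static energy P_ind/f shrinks
    return C * f**2 + P_IND / f

# Closed form from dE/df = 2*C*f - P_ind/f^2 = 0:
f_star = (P_IND / (2 * C)) ** (1 / 3)

# A numeric scan agrees with the closed-form minimizer.
f_best = min((0.01 * i for i in range(1, 500)), key=energy_per_unit_work)
assert abs(f_best - f_star) < 0.01
```

Minimizing the energy-delay product instead multiplies by another 1/f, which shifts the optimum toward a higher speed.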

DVS in real-time systems
(figures: CPU speed vs. time between S_min and S_max, showing worst-case execution finishing at the deadline, static scaling at power management points, and dynamic scaling based on remaining time)
Utilize slack to slow down future tasks (proportional, greedy, aggressive, …).
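As an illustration of the slack-reclamation idea, here is a minimal sketch in the proportional spirit; the uniform-speed model and the numbers are illustrative, not the exact published algorithms:

```python
def proportional_speeds(wcets, deadline):
    """Speed (as a fraction of S_max) if every task runs at worst case,
    assuming WCETs are expressed as time at full speed."""
    return [sum(wcets) / deadline] * len(wcets)

def reclaim(wcets, actual_first, deadline):
    """After task 0 finishes early, spread the remaining worst-case work
    over the time left until the deadline (a new common speed)."""
    remaining_wcet = sum(wcets[1:])
    time_left = deadline - actual_first
    return remaining_wcet / time_left

# Two tasks, WCET 4 each, deadline 10: initial speed 0.8.
assert proportional_speeds([4, 4], 10) == [0.8, 0.8]
# Task 0 takes only 2 time units: the remaining task can run at speed 0.5.
assert reclaim([4, 4], 2, 10) == 0.5
```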

Implementation of power management points
– Can be implemented as periodic OS interrupts
– Difficulty: the OS does not know how much execution remains
– The compiler can insert code to provide hints to the OS
(figure: control-flow graph with branches and loops, annotated with min/average/max remaining execution)

Example of compiler/OS collaboration
– At a power management hint, the compiler records the WCET based on the longest remaining path (min/average/max)
– At a power management point, the OS uses its knowledge of the current load to set the speed

Compiler/OS collaboration
– Compiler (knows the task): static analysis of the application source code places PMHs (power management hints)
– OS/HW (knows the system): uses run-time information; interrupts execute PMPs (power management points)
(figure: timeline with compiler-inserted PMHs and interrupts executing PMPs)

DVS for multiple cores
Manage energy by determining:
– The speed for the serial section
– The number of cores used in the parallel section
– The speed in the parallel section
To derive a simple analytical model, assume Amdahl's law: p% of the computation can be perfectly parallelized.
(figures: execution on one vs. two cores, slowing down the cores, slowing down the parallel section, using more cores)
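A minimal sketch of such an analytical model, assuming Amdahl's law; the constants C (dynamic coefficient) and P_IND (per-core static power) are illustrative:

```python
C, P_IND = 1.0, 0.1   # assumed dynamic coefficient and per-core static power

def exec_time(p, n, s_serial, s_parallel):
    # serial fraction on one core, parallel fraction split over n cores
    return (1 - p) / s_serial + p / (n * s_parallel)

def energy(p, n, s_serial, s_parallel):
    dynamic = (1 - p) * C * s_serial**2 + p * C * s_parallel**2
    # static power: one core during the serial part, n cores in parallel
    static = P_IND * (1 - p) / s_serial + n * P_IND * p / (n * s_parallel)
    return dynamic + static

# With p = 0.8: 4 cores at 1/4 speed meet the same finish time as
# 1 core at full speed, but with lower energy.
t1 = exec_time(0.8, 1, 1.0, 1.0)
t4 = exec_time(0.8, 4, 1.0, 0.25)
assert abs(t1 - t4) < 1e-12
assert energy(0.8, 4, 1.0, 0.25) < energy(0.8, 1, 1.0, 1.0)
```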

Mapping streaming applications to CMPs
Streaming applications are prevalent: audio, video, real-time tasks, cognitive applications.
Constraints:
– Inter-arrival time (T)
– End-to-end delay (D)
Power-aware mapping to CMPs:
– Determine speeds
– Account for communication
– Exclude faulty cores

Mapping a linear task graph onto a linear pipeline
If the # of stages = # of cores: find the stage time t_stage that minimizes total energy subject to the timing constraints, where
– e_i: energy for executing stage i
– ε_i: energy for moving data from stage i-1 to stage i
– t_i: time for executing stage i
– τ_i: time for moving data from stage i-1 to stage i
(figure: stages mapped one-to-one onto a linear pipeline of cores)

Mapping a linear task graph onto a linear pipeline
If the # of stages > # of cores:
1) Group the stages so that the number of groups equals the number of cores
2) Use a dynamic programming approach to explore possible groupings
3) A faster solution may guarantee optimality within a specified error bound
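The dynamic-programming step can be sketched as follows. This simplified version minimizes the bottleneck group load rather than the full energy objective; it only shows the structure of exploring contiguous groupings:

```python
from functools import lru_cache

def best_grouping(loads, k):
    """Partition a linear chain of stage loads into k contiguous groups
    (one per core), minimizing the largest group load."""
    n = len(loads)
    prefix = [0] * (n + 1)
    for i, w in enumerate(loads):
        prefix[i + 1] = prefix[i] + w

    @lru_cache(maxsize=None)
    def dp(i, groups):
        # minimal bottleneck for stages i..n-1 using `groups` cores
        if groups == 1:
            return prefix[n] - prefix[i]
        best = float("inf")
        for j in range(i + 1, n - groups + 2):  # first group = stages i..j-1
            best = min(best, max(prefix[j] - prefix[i], dp(j, groups - 1)))
        return best

    return dp(0, k)

# 5 stages onto 3 cores: grouping (2,3) | (4) | (1,2) gives bottleneck 5.
assert best_grouping((2, 3, 4, 1, 2), 3) == 5
```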

Mapping a non-linear task graph onto a CMP
Timing constraints are conventionally satisfied through load-balanced mapping. Additional constraints:
– Minimize energy consumption
– Maximize performance for a given energy budget
– Avoid faulty cores
(figure: two instances of a task graph with nodes A-K mapped onto cores running at maximum, medium, or minimum speed)

Turn OFF some PEs
(figure: the same task-graph instances A-K mapped with some PEs turned OFF; the remaining cores run at maximum (f_max), medium, or minimum (f_min) speed/voltage)

DVS using machine learning
Characterize the execution state of a core by:
– Rate of instruction execution (IPC)
– # of memory accesses per instruction
– Average memory access time (depends on other threads)
During training, record for each state:
– The core frequency
– The energy consumption
and determine the optimal frequency for each state.
During execution, periodically:
– Estimate the current state (through run-time measurements)
– Assume that the future is a continuation of the present
– Set the frequency to the best recorded during training
(figure: CMP with per-core L1 caches, a shared L2 cache, and memory controllers)

Statistical learning applied to DVS in CMPs
(figure: a training phase feeds an automatic policy generator and learning engine; at runtime, the integrated DVS policy determines frequencies and voltages)
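A toy sketch of the runtime side of such a policy: look up the frequency recorded as best for the nearest training state. The features, table entries, and the unnormalized distance are all illustrative, not the published technique:

```python
def nearest_best_freq(table, state):
    """table maps (ipc, mem_accesses_per_inst, avg_mem_latency) tuples
    to the frequency (GHz) that minimized energy during training."""
    def dist(s):
        return sum((a - b) ** 2 for a, b in zip(s, state))
    return table[min(table, key=dist)]

training_table = {
    (2.0, 0.05, 10.0): 2.4,  # compute-bound state -> run fast
    (0.5, 0.30, 80.0): 1.2,  # memory-bound state -> slow down, save energy
}
# A runtime sample close to the memory-bound state gets the low frequency.
assert nearest_best_freq(training_table, (0.6, 0.28, 75.0)) == 1.2
```

A real policy would normalize the features (here the latency term dominates the distance) and re-estimate the state periodically.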

Energy-reliability tradeoff: using time redundancy (checkpointing and rollbacks)
If you have time slack:
1) Add checkpoints
2) Reserve recovery time
3) Reduce processing speed
For a given number of checkpoints, we can find the speed that minimizes energy consumption while guaranteeing recovery and timeliness.
(figure: execution at reduced speed with checkpoints and reserved recovery time before the deadline, compared with running at S_max)

Optimal number of checkpoints
More checkpoints = more overhead, but less recovery slack needed.
For a given slack (C/D) and checkpoint overhead (r/C), we can find the number of checkpoints that minimizes energy consumption while guaranteeing recovery and timeliness.
(figure: energy vs. # of checkpoints, where C is the computation time, D the deadline, and r the per-checkpoint overhead)
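A hedged sketch of the tradeoff: assuming n checkpoints of overhead r each, plus re-execution of one segment (C/n) after a fault, the total overhead n·r + C/n is minimized at n* = sqrt(C/r). This is a textbook simplification, not the paper's exact energy model:

```python
import math

def worst_case_time(C, r, n):
    # computation + n checkpoints + re-execution of one segment
    return C + n * r + C / n

def optimal_checkpoints(C, r):
    return min(range(1, 1000), key=lambda n: worst_case_time(C, r, n))

C, r = 100.0, 1.0
n_star = optimal_checkpoints(C, r)
# Matches the analytic optimum n* = sqrt(C/r).
assert n_star == round(math.sqrt(C / r)) == 10
```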

Faults are rare events. If a fault occurs, execution may continue at S_max after recovery.

Non-uniform checkpointing
Observation: if a fault occurs, execution may continue at S_max after recovery.
– Advantage: recovery in an early section can use slack created by executing later sections at S_max
– Disadvantage: increases energy consumption when a fault occurs (a rare event)
– Requires non-uniform checkpoints

Triple modular redundancy (TMR) vs. duplex
– TMR: vote and exclude the faulty result
– Duplex: compare and roll back if the results differ
The relative energy efficiency of TMR vs. duplex depends on static power, checkpoint overhead, and load.
(figure: the load vs. checkpoint-overhead plane, divided into a region where duplex is more energy efficient and a region where TMR is more energy efficient)

Add memory power to the mix
Example: DRAM and SRAM modules can be switched between different power states (modes), but not for free:
– Mode-transition power overhead
– Mode-transition time overhead
(figure: DRAM power states and transitions: Active (779.1 mW), Standby (275.0 mW), Power-down (150 mW), Self-refresh (20.87 mW), with transition times of 5 ns and 1000 ns)
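One way to reason about the time overhead is a break-even idle time: switching to a low-power mode saves energy only if the idle period is long enough to amortize the transition. A sketch using the active and self-refresh powers above; the transition energy is an assumed illustration:

```python
P_ACTIVE = 779.1e-3        # W, active state (from the state diagram)
P_SELF_REFRESH = 20.87e-3  # W, self-refresh state
E_TRANSITION = 1.0e-6      # J, assumed round-trip mode-transition energy

def break_even_idle(p_high, p_low, e_transition):
    """Idle duration beyond which entering the low-power mode wins."""
    return e_transition / (p_high - p_low)

t_be = break_even_idle(P_ACTIVE, P_SELF_REFRESH, E_TRANSITION)
# Saving ~0.758 W, a 1 uJ overhead amortizes in roughly 1.3 microseconds.
assert 1.2e-6 < t_be < 1.4e-6
```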

OS-assisted memory power management?
– Keep a histogram of bank access patterns and idle-time distributions
– Use machine learning techniques to select the optimal threshold for turning banks off
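A toy sketch of threshold selection from an idle-time histogram. The power numbers echo the standby/self-refresh states above; the penalty model is an illustrative assumption, not the paper's learning technique:

```python
P_ON, P_OFF = 0.275, 0.021   # W: standby vs. self-refresh (from above)
PENALTY = 1.0e-6             # J charged when a power-down is not amortized

def expected_energy(idle_periods, threshold):
    total = 0.0
    for t in idle_periods:
        if t <= threshold:
            total += P_ON * t                         # never powered down
        else:
            total += P_ON * threshold + P_OFF * (t - threshold)
            if (P_ON - P_OFF) * (t - threshold) < PENALTY:
                total += PENALTY                      # transition not amortized
    return total

# Mostly short idles plus some long ones: a middling threshold wins.
history = [2e-6] * 50 + [50e-6] * 10
best = min([1e-6, 5e-6, 20e-6], key=lambda th: expected_energy(history, th))
assert best == 5e-6
```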

Example of compiler-assisted memory power management?
Code transformations increase the memory idle time (the time between memory accesses) by clustering accesses:
Before: …. Load x …. Store x …. Load z …. Load y …. Store z …. Store y ….
After the compiler transformation: …. Load x, Load y …. Store y, Load z …. Store z ….

Example of compiler-assisted memory power management?
Memory allocation: algorithms use the access pattern to allocate memory to banks in a way that maximizes bank idle times.
Declare A[], B[], C[], D[]
Access pattern: …. Access A …. Access D …. Access B …. Access C …. Access B ….
Possible bank assignments: {A[], B[]} and {C[], D[]}, or {A[], D[]} and {C[], B[]}

Phase Change Memory (PCM)
A power-saving memory technology:
– Solid-state memory made of a germanium-antimony alloy
– Switching between states is thermally based (not electrically based)
– Samsung, Intel, Hitachi, and IBM have developed PCM prototypes (to replace Flash)

Properties of PCM
– Non-volatile, but faster than Flash
– Byte addressable, but denser and cheaper than DRAM
– No static power consumption and very low switching power
– Not susceptible to SEUs (single event upsets), and hence does not need error-detecting or error-correcting codes
  – Errors occur only during writes (not reads); a simple read-after-write detects them

So, where is the catch?
– Slower than DRAM: a factor of 2 for reads and 10 for writes
– Low endurance: a cell fails after about 10^7 writes (far fewer than DRAM)
– Asymmetric energy consumption: a write is more expensive than a read
– Asymmetry in bit writing: writing 0s is faster than writing 1s

Goal: use PCM as main memory
Advantages: cheaper + denser + lower power consumption.
– Traditional architecture: CPU → memory controller → DRAM
– Proposed architecture: CPU → MM → AEB + PCM
(AEB: acceleration/endurance buffer; MM: memory manager)

Dealing with asymmetric read/write
– Use coherence algorithms in which "writes" are not on the critical path
– Design algorithms with "read rather than write" in mind
– Take advantage of the fact that writing 0s is faster than writing 1s:
  – Pre-write a block with 1s as soon as the block becomes dirty in the cache
  – On write-back, only write the 0s
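The pre-write trick can be sketched with bit masks (8-bit blocks, illustrative only):

```python
BLOCK_MASK = 0xFF  # 8-bit "blocks" for illustration

def prewrite(_old_contents):
    # Background step once the cached copy turns dirty: set all bits to 1
    # (the slow write direction), off the critical path.
    return BLOCK_MASK

def write_back(prewritten, data):
    # Critical-path step: only clear bits, i.e. write 0s (the fast direction).
    assert prewritten == BLOCK_MASK
    return prewritten & data

# The final PCM contents equal the written data, but the visible latency
# is only that of the fast 0-writes.
assert write_back(prewrite(0b0101_0011), 0b1100_1010) == 0b1100_1010
```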

Dealing with low write endurance (write minimization)
– Block (or page) allocation algorithms should not be oblivious to the status of the block, for wear minimization
– Modify the cache replacement algorithm, e.g., LRR (least recently read) replacement
– Give lower priority to dirty pages in cache replacement
– Use coherence algorithms that minimize writes (write-through is not good)
– Read/compare/write: write a bit only if it differs from the current content
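The read/compare/write idea in a few lines (illustrative):

```python
def bits_to_flip(old, new):
    """Read/compare/write: return the bit positions that actually differ,
    so only those PCM cells are written."""
    diff = old ^ new
    return [i for i in range(diff.bit_length()) if (diff >> i) & 1]

# Only 2 of 8 bits differ, so only 2 cells are written instead of 8.
assert bits_to_flip(0b1010_1100, 0b1010_0110) == [1, 3]
# Rewriting identical contents costs no cell writes at all.
assert bits_to_flip(0b1111_0000, 0b1111_0000) == []
```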

Wear leveling
– Memory allocation decisions should consider the age of blocks (age = number of write cycles exerted)
– Periodically change the physical location of a page (write to a location different from the one read from)
– Consider memory a consumable resource that can be periodically replaced

Detailed architecture
(figure: block diagram with the CPU connected through a bus interface to the memory manager, containing a request controller, tag array, requests buffer, in-flight buffer (SRAM) with busy bitmap, and an FSM driving a DRAM controller/DMAC for the AEB page cache and a PCM controller/DMAC for the spare table and PCM pages area with spares, over separate control and data buses)

Conclusion
It is essential to manage the tradeoffs among:
– Time constraints (deadlines or rates)
– Energy constraints
– Reliability constraints
across hardware, compiler, and OS.