Temperature and Power Management

Smruti R. Sarangi

Outline
Dynamic Power Management: DVFS, clock gating, the big.LITTLE approach, fetch throttling
Leakage Power Management
Temperature Reduction

DVFS (Dynamic Voltage and Frequency Scaling)
DVFS is one of the most popular methods of reducing power in processors.
Every processor has a DVFS table of (voltage, frequency) pairs. It is possible to choose one among several discrete DVFS settings.
Internal operation: the processor gets cues from software (the user or the OS) regarding changing the DVFS settings. The processor might also decide on its own.
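To make the DVFS table concrete, here is a minimal Python sketch. The table entries are hypothetical, and the power comparison uses the standard proportionality of dynamic power to V²f (capacitance and activity factor cancel in the ratio). `pick_setting` and `relative_dynamic_power` are illustrative names, not a real API.

```python
# Hypothetical DVFS table: (voltage in volts, frequency in MHz) pairs.
# Real tables are processor-specific; these numbers are illustrative only.
DVFS_TABLE = [
    (0.80, 800),
    (0.90, 1200),
    (1.00, 1600),
    (1.10, 2000),
    (1.20, 2400),
]


def pick_setting(required_mhz):
    """Return the lowest (voltage, frequency) pair that meets required_mhz."""
    for volts, mhz in sorted(DVFS_TABLE, key=lambda s: s[1]):
        if mhz >= required_mhz:
            return volts, mhz
    return max(DVFS_TABLE, key=lambda s: s[1])  # cannot meet it: run flat out


def relative_dynamic_power(setting, reference):
    """Dynamic power ratio, using P_dyn proportional to V^2 * f."""
    (v, f), (v_ref, f_ref) = setting, reference
    return (v * v * f) / (v_ref * v_ref * f_ref)


low = pick_setting(1000)   # lowest setting that reaches 1000 MHz
top = DVFS_TABLE[-1]
# Roughly 0.28: the lower setting uses under a third of the top setting's dynamic power.
print(low, relative_dynamic_power(low, top))
```

Because voltage enters squared, a step down the table saves more than the frequency reduction alone would suggest.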

Chip’s Power Grid and Frequency Control System
[Figure: a 3.3 V power supply feeds a voltage regulator that delivers 0.8-1.2 V to the chip; a quartz clock drives the on-chip PLLs.]
The quartz clock generates a fixed 133 MHz signal.
PLL (phase-locked loop): it generates a clock signal that is synchronized with the quartz clock, at a frequency that is a multiple of 133 MHz. For example, we can use it to generate 133 MHz × 16 = 2.13 GHz.
The PLL takes tens of microseconds to lock to a new frequency. During that time there is no usable clock signal.
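The PLL arithmetic can be sketched as follows, assuming the PLL can only produce integer multiples of the 133 MHz reference; the helper names are hypothetical.

```python
REF_CLOCK_MHZ = 133  # fixed reference frequency from the quartz clock


def pll_multiplier(target_mhz, ref_mhz=REF_CLOCK_MHZ):
    """Integer multiplier that gets closest to target_mhz (at least 1)."""
    return max(1, round(target_mhz / ref_mhz))


def pll_output_mhz(multiplier, ref_mhz=REF_CLOCK_MHZ):
    """The PLL output is an integer multiple of the reference clock."""
    return multiplier * ref_mhz


m = pll_multiplier(2130)
print(m, pll_output_mhz(m))  # 16 2128  (about 2.13 GHz)
```

Any target between multiples is rounded to the nearest achievable frequency, which is why the DVFS table only offers discrete settings.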

Changing Voltage and Frequency
[Figure: timeline of the supply voltage switching between V0 and V1, annotated with voltage-conversion and PLL lock-time intervals.]
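A sketch of how one voltage/frequency change could be sequenced. It assumes (a common rule, not stated on the slide) that the supply voltage must always be high enough for the current frequency, so the voltage is raised before the PLL relocks to a higher frequency and lowered only after it relocks to a lower one. The latencies passed in are hypothetical.

```python
def transition_steps(cur, new):
    """Ordered steps to move from cur to new, each a (voltage, MHz) pair."""
    (v0, f0), (v1, f1) = cur, new
    steps = []
    if v1 > v0:
        steps.append(("set_voltage", v1))  # raise voltage before speeding up
    if f1 != f0:
        steps.append(("relock_pll", f1))   # no usable clock while the PLL locks
    if v1 < v0:
        steps.append(("set_voltage", v1))  # lower voltage only after slowing down
    return steps


def transition_latency_us(cur, new, voltage_us, lock_us):
    """Total latency, assuming the steps run back to back (hypothetical timings)."""
    return sum(voltage_us if kind == "set_voltage" else lock_us
               for kind, _ in transition_steps(cur, new))


print(transition_steps((0.9, 1200), (1.1, 2000)))  # [('set_voltage', 1.1), ('relock_pll', 2000)]
print(transition_steps((1.1, 2000), (0.9, 1200)))  # [('relock_pll', 1200), ('set_voltage', 0.9)]
```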

Hardware-Based DVFS
Estimate the amount of CPU activity. If it is low, reduce the frequency; if it is high, increase the frequency (if you need performance).
Estimating CPU activity: the average number of L2 misses per instruction, or the commit (retirement) rate.
We essentially need a model that correlates frequency and performance.
Option 1: get it by profiling. Run small phases of the program and record the IPC.
Option 2: the method of stall rates. It assumes that the number of stall cycles due to LLC misses is proportional to the frequency (memory latency is fixed in absolute time, so a faster clock spends more cycles waiting). Decrease the frequency until the LLC miss stalls fall below a certain threshold.
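Option 2 can be sketched as follows. The sketch assumes the hardware reports compute cycles and LLC-miss stall cycles measured at the current frequency f0, and interprets "below a certain threshold" as the fraction of cycles spent stalled; the frequency list and threshold are hypothetical.

```python
def stall_fraction(f_mhz, f0_mhz, compute_cycles, stall_cycles_at_f0):
    """Fraction of cycles stalled on LLC misses at frequency f_mhz.

    Model: stall cycles scale in proportion to frequency (memory latency is
    fixed in nanoseconds), while compute cycles do not change.
    """
    stalls = stall_cycles_at_f0 * f_mhz / f0_mhz
    return stalls / (compute_cycles + stalls)


def choose_frequency(freqs_mhz, f0_mhz, compute_cycles, stall_cycles_at_f0, threshold):
    """Step down from the highest frequency until the stall fraction drops below threshold."""
    ordered = sorted(freqs_mhz, reverse=True)
    for f in ordered:
        if stall_fraction(f, f0_mhz, compute_cycles, stall_cycles_at_f0) < threshold:
            return f
    return ordered[-1]  # never below threshold: settle for the lowest setting


# A memory-bound phase measured at 3000 MHz: half of all cycles are LLC stalls.
print(choose_frequency([3000, 2500, 2000, 1500, 1000], 3000, 1e9, 1e9, 0.35))  # 1500
```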

Software-Based DVFS
Video codecs: each frame needs to be processed within 33 ms. If we can process it in 20 ms, reduce the frequency until processing takes about 33 ms. This needs a model that relates processing time to frequency.
Regular programs: classify them as hard real-time, soft real-time, interactive, periodic, or batch.
Real-time tasks → set the DVFS settings based on performance requirements and deadlines.
Interactive tasks → take the user’s perception into account.
Periodic jobs → take the periodicity into account.
Batch jobs → take the user’s requirements into account.
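For the video-codec case, a minimal sketch of picking the lowest frequency that still meets the 33 ms frame deadline. It assumes processing time scales as 1/f, which holds only for a CPU-bound frame; the frequency list is hypothetical.

```python
FRAME_DEADLINE_MS = 33.0  # about 30 frames per second


def min_frequency_for_deadline(measured_ms, measured_mhz, freqs_mhz,
                               deadline_ms=FRAME_DEADLINE_MS):
    """Lowest available frequency that still finishes a frame by the deadline.

    Assumes processing time scales as 1/f (a purely CPU-bound frame).
    """
    for f in sorted(freqs_mhz):
        predicted_ms = measured_ms * measured_mhz / f
        if predicted_ms <= deadline_ms:
            return f
    return max(freqs_mhz)  # even the top setting misses the deadline


freqs = [800, 1000, 1200, 1400, 1600, 2000]
# A frame that takes 20 ms at 2000 MHz: 1200 MHz would need about 33.3 ms, so 1400 MHz is chosen.
print(min_frequency_for_deadline(20.0, 2000, freqs))  # 1400
```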

Linux Speed Governors
These governors belong to Linux’s cpufreq subsystem:
Performance → always run at the maximum possible frequency.
Powersave → always run at the minimum frequency.
Ondemand → tries to maintain a constant rate of CPU utilization; uses a set of thresholds for each DVFS setting.
Conservative → much more conservative than ondemand: it changes the frequency gradually instead of jumping.
Interactive → similar to ondemand, but does not use thresholds; it uses a formula that relates CPU utilization to frequency.
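On a Linux machine, the active governor is exposed through the cpufreq sysfs files `scaling_governor` and `scaling_available_governors` under `/sys/devices/system/cpu/cpuN/cpufreq/`. The sketch below reads and writes them; writing requires root, and which governors appear depends on the kernel and the frequency driver (for example, `interactive` comes from Android kernels rather than mainline Linux). The `root` parameter exists only so the functions can be pointed at a test directory.

```python
from pathlib import Path

CPUFREQ_ROOT = Path("/sys/devices/system/cpu")


def governor_info(cpu=0, root=CPUFREQ_ROOT):
    """Return (current governor, list of available governors) for one CPU."""
    base = Path(root) / f"cpu{cpu}" / "cpufreq"
    current = (base / "scaling_governor").read_text().strip()
    available = (base / "scaling_available_governors").read_text().split()
    return current, available


def set_governor(name, cpu=0, root=CPUFREQ_ROOT):
    """Switch the governor for one CPU (needs root on a real system)."""
    path = Path(root) / f"cpu{cpu}" / "cpufreq" / "scaling_governor"
    path.write_text(name + "\n")
```

For example, `governor_info()` on a typical desktop might return `("powersave", ["performance", "powersave"])`, depending on the driver in use.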

Clock Gating
Recall: dynamic power is consumed only during a transition.
[Figure: a 32-bit carry-lookahead adder built as a tree of G,P (generate/propagate) blocks, combining bits 1-32 pairwise (2-1, 4-3, ..., 32-31), then into 4-, 8-, and 16-bit groups, up to the full range 32-1.]
Assume bit #4 changes. Only the small part of the circuit shown in red (the blocks whose range includes bit 4) is affected. The rest of the elements do not dissipate any dynamic power.
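The switching argument can be checked with a small model of the G,P tree (only the lookahead tree, not the sum bits). The sketch assumes a power-of-two width and numbers bit #1 as index 0; it flips bit #4 and lists the tree nodes whose (G, P) output changes. All such nodes lie on the path from bit #4 to the root, since every other node sees unchanged inputs. The operand values are arbitrary.

```python
def gp_levels(a, b, width=32):
    """(G, P) values at every level of a binary lookahead tree.

    Level 0 holds one (G, P) pair per bit (index 0 = bit #1); each higher
    level combines adjacent pairs. width must be a power of two.
    """
    level = []
    for i in range(width):
        ai, bi = (a >> i) & 1, (b >> i) & 1
        level.append((ai & bi, ai ^ bi))  # generate, propagate
    levels = [level]
    while len(level) > 1:
        nxt = []
        for j in range(0, len(level), 2):
            g_lo, p_lo = level[j]
            g_hi, p_hi = level[j + 1]
            nxt.append((g_hi | (p_hi & g_lo), p_hi & p_lo))
        level = nxt
        levels.append(level)
    return levels


def switching_nodes(a, b, a_new, width=32):
    """(level, index) of every tree node whose output toggles when a becomes a_new."""
    before, after = gp_levels(a, b, width), gp_levels(a_new, b, width)
    return [(lvl, k)
            for lvl, (old, new) in enumerate(zip(before, after))
            for k, (x, y) in enumerate(zip(old, new)) if x != y]


a, b = 0x0F0F0F0F, 0x00FF00FF
toggled = switching_nodes(a, b, a ^ (1 << 3))  # flip bit #4
print(len(toggled), "of", 2 * 32 - 1, "tree nodes switch")
```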

Typical Structure of a Circuit
[Figure: logic placed between two pipeline registers, both driven by the clock.]
What if the clock signal stays at 0? The outputs of the registers do not change, so there are no state transitions in the logic. No switching current flows, and thus there is no dynamic power dissipation.

Circuit with Clock Gating
[Figure: the same circuit, with the clock to the pipeline registers gated by an enable signal S.]
If S = 0, the inputs to the logic circuit don’t change: the circuit is clock-gated.
If S = 1, the circuit operates normally.
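A tiny cycle-level sketch of the gated register: the register loads a new value only in cycles where S = 1, and each change at its output counts as one transition in the downstream logic (a rough proxy for dynamic energy). The input trace and gating pattern are made up.

```python
def count_transitions(inputs, enables, initial=0):
    """Number of times the logic's input changes.

    inputs[t]:  value arriving at the pipeline register in cycle t
    enables[t]: gating signal S in cycle t (1 = the clock edge reaches the register)
    """
    reg, transitions = initial, 0
    for value, s in zip(inputs, enables):
        if s and value != reg:  # the register only loads when its clock ticks
            reg = value
            transitions += 1
    return transitions


data = [1, 0, 1, 1, 0, 1, 0, 0]
print(count_transitions(data, [1] * len(data)))            # 6 (no gating)
print(count_transitions(data, [1, 1, 0, 0, 0, 0, 0, 0]))   # 2 (gated after two cycles)
```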

Clock Gating
Clock gating is present in almost all architectures.
Guess, predict, or deduce whether a unit will be idle. For example, an add instruction will not use the divider, so we can clock-gate the divider. Note that the divider will still dissipate leakage power.
In processors such as the Pentium 4, designers try to ensure that enabling clock gating causes absolutely no deviation in timing.
Sometimes we can clock-gate aggressively; instructions then have to wait until the unit is re-enabled.
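The divider example can be sketched as decode-time gating: in each cycle, any unit not needed by the issued instructions is clock-gated. The instruction-to-unit mapping and the trace are hypothetical, and the sketch models only the conservative case, where usage is known in advance and no instruction waits for a unit to wake up.

```python
# Hypothetical mapping from instruction type to the functional unit it needs.
UNIT_FOR = {"add": "alu", "sub": "alu", "mul": "multiplier", "div": "divider"}
UNITS = ("alu", "multiplier", "divider")


def gated_cycles(instr_per_cycle):
    """Cycles in which each unit can be clock-gated.

    instr_per_cycle[t] lists the instructions issued in cycle t. A unit is
    gated in a cycle if no issued instruction needs it (it still leaks).
    """
    counts = {u: 0 for u in UNITS}
    for issued in instr_per_cycle:
        busy = {UNIT_FOR[op] for op in issued}
        for u in UNITS:
            if u not in busy:
                counts[u] += 1
    return counts


trace = [["add"], ["add", "mul"], ["sub"], ["div"], []]
print(gated_cycles(trace))  # {'alu': 2, 'multiplier': 4, 'divider': 4}
```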

Other Architectural Techniques
ARM big.LITTLE architecture, or Samsung’s dual quad-core processor: have N big cores and M small cores. Depending on the nature of the task and its priority, choose:
a big core → if the task is important;
a little core → if it is not important and power needs to be saved.
Fetch throttling: dynamically adjust the fetch/issue/commit rate based on power constraints.
Idea 1: after fetching low-confidence branches, reduce the fetch rate (this decreases the number of potential wrong-path instructions).
Idea 2: reduce the fetch rate in the shadow of an L2 miss.
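Both ideas reduce to simple per-cycle or per-task decisions. In this sketch, `throttled_width` is a hypothetical tuning knob, and sending an unimportant task to a big core when power does not need to be saved is an assumption, since the slide only covers the two listed cases.

```python
def fetch_width(full_width, low_conf_branches_in_flight, l2_miss_pending,
                throttled_width=1):
    """Instructions to fetch this cycle.

    Idea 1: throttle while a low-confidence branch is unresolved.
    Idea 2: throttle in the shadow of an outstanding L2 miss.
    """
    if low_conf_branches_in_flight > 0 or l2_miss_pending:
        return min(throttled_width, full_width)
    return full_width


def pick_core(important, save_power):
    """big.LITTLE placement; the (not important, no need to save power) case is an assumption."""
    if important or not save_power:
        return "big"
    return "little"


print(fetch_width(4, 0, False), fetch_width(4, 1, False), fetch_width(4, 0, True))  # 4 1 1
print(pick_core(True, True), pick_core(False, True))  # big little
```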

Leakage Power Management

Power Gating
Brute-force method: just turn off the power.
Easier said than done: we need power switches at each connection to the power grid.
[Figure: a functional unit connected to the power grid through power controllers.]
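A related first-order consideration, not covered on the slide: switching a unit off and back on costs energy, so gating only pays off if the unit stays idle long enough for the saved leakage to exceed that cost. The break-even model below is an assumption, and the energy numbers are arbitrary units.

```python
def break_even_cycles(switch_energy, leakage_per_cycle):
    """Idle cycles after which gating saves more leakage than it costs."""
    return switch_energy / leakage_per_cycle


def should_power_gate(expected_idle_cycles, switch_energy, leakage_per_cycle):
    """Gate only if the expected idle period exceeds the break-even point."""
    return expected_idle_cycles > break_even_cycles(switch_energy, leakage_per_cycle)


# Arbitrary units: turning the unit off and on costs 100; it leaks 2 per cycle.
print(break_even_cycles(100, 2))                                      # 50.0
print(should_power_gate(60, 100, 2), should_power_gate(40, 100, 2))  # True False
```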

Multiple Transistor Sizes
Use transistors with shorter channels and transistors with longer channels:
Normal transistors: power → 1 unit, time → 1 unit.
Longer-channel transistors: power → 0.3 units, time → 1.1 units.
Use normal transistors on the critical path, and slower transistors off the critical path.
Gate sizing: $\text{Delay} \propto A + \frac{B}{W}$ and $\text{Power} \propto W$, for constants $A$ and $B$ and transistor width $W$.
Slower transistors have a smaller W/L ratio. Same idea: slower transistors off the critical path, faster transistors on the critical path.
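Using the slide’s per-transistor numbers, this sketch assigns long-channel transistors to every path that still meets the critical-path delay even when slowed down by 10%. It assumes disjoint paths of unit-delay transistors, which real netlists are not; the path lengths are made up.

```python
NORMAL = {"power": 1.0, "delay": 1.0}  # per-transistor units from the slide
LONG = {"power": 0.3, "delay": 1.1}    # longer-channel transistor


def assign_transistors(path_lengths):
    """Choose 'long' for every path that still meets the critical-path delay.

    path_lengths[i] is the number of (unit-delay) transistors on path i;
    paths are assumed not to share transistors.
    """
    critical_delay = max(path_lengths) * NORMAL["delay"]
    choices, total_power = [], 0.0
    for n in path_lengths:
        if n * LONG["delay"] <= critical_delay:
            choices.append("long")
            total_power += n * LONG["power"]
        else:
            choices.append("normal")
            total_power += n * NORMAL["power"]
    return choices, total_power


choices, power = assign_transistors([10, 8, 9, 10])
print(choices, round(power, 1))  # ['normal', 'long', 'long', 'normal'] 25.1 (vs 37.0 all-normal)
```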

Adaptive Body Biasing
$V_{th} = V_{th1} - K_1 V_{dd} - K_2 V_{bs}$
Forward body biasing → increase $V_{bs}$: reduces $V_{th}$, increasing both power and performance.
Reverse body biasing → decrease $V_{bs}$ (even making it negative): increases $V_{th}$, decreasing both power and performance.
Same idea: forward body biasing on the critical path, reverse body biasing off the critical path.
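The body-bias equation can be evaluated directly. The constants below are hypothetical, and the leakage estimate uses the usual exponential dependence of subthreshold current on the threshold voltage, I ∝ exp(-V_th / (n·kT/q)), with an assumed slope factor n = 1.5.

```python
import math

# Hypothetical device constants, chosen only for illustration.
VTH1, K1, K2 = 0.45, 0.05, 0.2    # volts, dimensionless, dimensionless
VDD = 1.0                          # volts
N_FACTOR, THERMAL_V = 1.5, 0.026   # subthreshold slope factor, kT/q at about 300 K


def threshold_voltage(vbs, vdd=VDD):
    """V_th = V_th1 - K1*Vdd - K2*Vbs, as on the slide."""
    return VTH1 - K1 * vdd - K2 * vbs


def relative_leakage(vbs):
    """Subthreshold leakage relative to zero body bias."""
    dv = threshold_voltage(vbs) - threshold_voltage(0.0)
    return math.exp(-dv / (N_FACTOR * THERMAL_V))


for label, vbs in (("forward", 0.2), ("zero", 0.0), ("reverse", -0.2)):
    print(label, round(threshold_voltage(vbs), 3), round(relative_leakage(vbs), 2))
```

Forward bias lowers V_th and multiplies leakage; reverse bias does the opposite, which is why it suits paths with timing slack.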

Drowsy Caches
[Figure: a row of SRAM cells at Vdd = 1 V allows reads/writes; a row in drowsy mode at Vdd = 0.3 V maintains its values, but accesses are not allowed.]
Drowsy mode → runs at 0.3 V and maintains the value, but access is not allowed.
It takes 1-2 cycles to enter/exit drowsy mode.
Treat a set of lines as one unit, and turn it on/off as one unit.
Once a set is turned on → keep it on for 1000-2000 cycles.
Take temporal and spatial locality into account.
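A minimal simulation of the drowsy policy: each group of lines is woken on demand, pays a wake-up penalty, and then stays on for a fixed window. The 1000-cycle window and 2-cycle penalty come from the ranges on the slide; the access trace is made up, and the sketch does not delay later accesses by the penalty.

```python
def drowsy_wakeups(accesses, awake_window=1000, wake_penalty=2):
    """Simulate drowsy line groups; return (wake-ups, extra stall cycles).

    accesses: iterable of (cycle, group) pairs, where a group is a set of
    cache lines switched on/off as one unit. A drowsy group must be woken
    (costing wake_penalty cycles) and then stays on for awake_window cycles.
    """
    on_until = {}  # group -> first cycle at which it falls back to drowsy
    wakeups = 0
    for cycle, group in sorted(accesses):
        if cycle >= on_until.get(group, 0):  # group is drowsy: wake it
            wakeups += 1
            on_until[group] = cycle + awake_window
    return wakeups, wakeups * wake_penalty


print(drowsy_wakeups([(0, 0), (10, 0), (1500, 0), (20, 1)]))  # (3, 6)
```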

Temperature Reduction

Dynamic Thermal Management
Place thermal sensors all over the chip.
Once a temperature hot spot forms → apply the traditional mechanisms: DVFS, power reduction, fetch throttling.
There are many new techniques for CMP (multicore) processors:
Stop-n-go → temporarily stop a core (let it cool down).
Heat-and-run thread assignment → don’t allow hot cores to be close to each other. If a thread’s activity increases, migrate it to a colder region of the chip.
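Heat-and-run migration can be sketched as: move each thread on a core above a temperature threshold to the coldest idle core that is below it. The threshold and temperatures are hypothetical, and the sketch ignores the spatial rule (keeping hot cores apart), which would need the chip’s floorplan.

```python
def heat_and_run(temps, threads, hot_threshold):
    """Migrate threads off hot cores to the coldest idle cores.

    temps:   dict core -> temperature (deg C) from the on-chip sensors
    threads: dict core -> thread id, or None if the core is idle
    """
    threads = dict(threads)  # do not modify the caller's mapping
    hot = sorted((c for c, t in threads.items()
                  if t is not None and temps[c] > hot_threshold),
                 key=lambda c: temps[c], reverse=True)
    for core in hot:  # hottest first
        idle = [c for c, t in threads.items()
                if t is None and temps[c] <= hot_threshold]
        if not idle:
            break  # nowhere cooler to go: fall back to DVFS or stop-n-go
        target = min(idle, key=lambda c: temps[c])
        threads[target], threads[core] = threads[core], None
    return threads


temps = {0: 90, 1: 60, 2: 85, 3: 55}
print(heat_and_run(temps, {0: "A", 1: None, 2: "B", 3: None}, 80))
# {0: None, 1: 'B', 2: None, 3: 'A'}
```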