Energy and Power Lecture notes S. Yalamanchili and S. Mukhopadhyay.

Some Useful Reading
http://en.wikipedia.org/wiki/CPU_power_dissipation
http://en.wikipedia.org/wiki/CMOS#Power:_switching_and_leakage
http://www.xbitlabs.com/articles/cpu/display/core-i5-2500t-2390t-i3-2100t-pentium-g620t.html
http://www.cpu-world.com/info/charts.html

Historical Scaling

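Technology Scaling
[Figure: MOSFET cross-section showing gate, source, drain, body, oxide thickness tox, and channel length L]
- A 30% scaling down in dimensions → doubles transistor density (each transistor occupies 0.7 × 0.7 ≈ 0.5 of its previous area).
- Power per transistor: Vdd scaling → lower power.
- Transistor delay = $C_{gate} V_{DD} / I_{SAT}$; Cgate and Vdd scaling → lower delay.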

Fundamental Trends
Source: Shekhar Borkar, Intel Corp.
High-volume manufacturing: 2004, 2006, 2008, 2010, 2012, 2014, 2016, 2018
Technology node (nm): 90, 65, 45, 32, 22, 16, 11, 8
Integration capacity (billions of transistors): 2, 4, …, 64, 128, 256
Delay = CV/I scaling: 0.7, ~0.7, >0.7 → delay scaling will slow down
Energy/logic op scaling: >0.35, >0.5 → energy scaling will slow down
Bulk planar CMOS: high probability → low probability
Alternate, 3G, etc.: low probability → high probability
Variability: medium → high → very high
ILD (K): ~3 → <3 → reduce slowly towards 2-2.5
RC delay: 1
Metal layers: 6-7 → 7-8 → 8-9 (0.5 to 1 layer per generation)

ITRS Roadmap for Logic Devices
From: “ExaScale Computing Study: Technology Challenges in Achieving Exascale Systems,” P. Kogge, et al., 2008

Where Does the Power Go in CMOS?
- Dynamic power: charging and discharging capacitance.
- Short-circuit power: a short-circuit path between the supply rails during switching; nominally 10%-20% of dynamic power, so it can be ignored in a first-order analysis.
- Leakage: leaky transistors.

Dynamic Power
Dynamic power is consumed in charging and discharging the capacitances in the CMOS circuit.
[Figure: CMOS inverter driving load capacitance CL; as the input toggles (period T, swing 0 to VDD), the supply current iDD charges the output capacitor, which is later discharged]
$P_{DYNAMIC} = C_L \times V_{DD} \times V_{DD} \times f = C_L V_{DD}^2 f$
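The formula can be evaluated directly. The Python sketch below uses hypothetical values (1 nF of switched capacitance, 2 GHz) purely for illustration and, like the slide, omits an activity factor. It shows the quadratic dependence on VDD: lowering VDD by 20% at the same frequency cuts dynamic power by 36%.

```python
def dynamic_power(c_load_farads: float, vdd_volts: float, freq_hz: float) -> float:
    """P_dynamic = C_L * V_DD^2 * f (the slide's formula, no activity factor)."""
    return c_load_farads * vdd_volts ** 2 * freq_hz


# Hypothetical values for illustration: 1 nF switched capacitance at 2 GHz.
p_nominal = dynamic_power(1e-9, 1.0, 2e9)  # 1.0 V supply
p_low_v = dynamic_power(1e-9, 0.8, 2e9)    # 0.8 V supply, same frequency
print(f"{p_nominal:.2f} W at 1.0 V, {p_low_v:.2f} W at 0.8 V, "
      f"ratio {p_low_v / p_nominal:.2f}")
```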

Static Power
Technology scaling has made transistors smaller and smaller; as a result, static power has become a substantial portion of total power.
[Figure: leakage paths in an inverter with input = 0 and output = VDD: gate leakage, junction leakage, and sub-threshold leakage]
$P_{STATIC} = V_{DD} \times I_{STATIC}$

Energy-Delay Interaction
[Figure: energy, delay, and energy-delay product (EDP) plotted against VDD]
Delay decreases as the supply voltage increases, but energy/power increases.
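To see why an EDP minimum exists, the sketch below assumes two textbook first-order models that the slide does not specify: energy per operation proportional to C·VDD², and the alpha-power-law delay VDD/(VDD - Vth)^α. The values of Vth, α, and C are hypothetical.

```python
# Assumed first-order models (not taken from the slides):
#   energy per operation ~ C * Vdd^2
#   delay ~ Vdd / (Vdd - Vth)^alpha   (alpha-power law)
VTH = 0.3    # threshold voltage in volts (hypothetical)
ALPHA = 1.3  # velocity-saturation exponent (hypothetical)
C = 1.0      # normalized switched capacitance


def energy(vdd: float) -> float:
    return C * vdd ** 2


def delay(vdd: float) -> float:
    return vdd / (vdd - VTH) ** ALPHA


def edp(vdd: float) -> float:
    return energy(vdd) * delay(vdd)


# Sweep Vdd from 0.40 V to 1.20 V in 10 mV steps and pick the EDP minimum.
sweep = [round(0.40 + 0.01 * i, 2) for i in range(81)]
best_v = min(sweep, key=edp)
for v in (0.40, best_v, 1.20):
    print(f"Vdd={v:.2f} V  energy={energy(v):.3f}  delay={delay(v):.3f}  EDP={edp(v):.3f}")
```

Under these assumptions the optimum is analytically VDD = 3Vth/(3 - α), about 0.53 V here; the sweep lands on the nearest 10 mV step. Below it delay explodes, above it energy dominates.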

Static Energy-Delay Interaction
[Figure: leakage and delay plotted against Vth, with a MOSFET cross-section (gate, source, drain, tox, L)]
Static energy increases exponentially as the threshold voltage decreases, while delay increases as the threshold voltage increases.
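The exponential dependence can be sketched with the standard sub-threshold model $I_{sub} \propto e^{-V_{th}/(n v_T)}$, where $v_T = kT/q$ is the thermal voltage. This is an assumption layered on the slide: the slope factor n = 1.5 and the Vth values are hypothetical, and gate and junction leakage are ignored.

```python
import math

K_B = 1.380649e-23     # Boltzmann constant, J/K
Q_E = 1.602176634e-19  # elementary charge, C


def thermal_voltage(temp_k: float = 300.0) -> float:
    """v_T = kT/q in volts."""
    return K_B * temp_k / Q_E


def relative_leakage(vth: float, vth_ref: float, n: float = 1.5,
                     temp_k: float = 300.0) -> float:
    """Sub-threshold leakage at vth relative to vth_ref, assuming I ~ exp(-Vth / (n * vT))."""
    return math.exp((vth_ref - vth) / (n * thermal_voltage(temp_k)))


# Lowering Vth from 0.35 V to 0.25 V (hypothetical values).
print(f"{relative_leakage(0.25, 0.35):.1f}x more sub-threshold leakage")
```

A 100 mV reduction in Vth raises sub-threshold leakage by roughly an order of magnitude in this model, which is why Vth cannot simply be lowered to recover the delay lost at low VDD.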

Power vs. Energy
[Figure: two power (watts) vs. time profiles, one reaching levels P0, P1, and P2 and the other constant at P0; same energy = area under the curve]
Power is the rate of expenditure of energy: one joule/sec = one watt.
Both profiles use the same amount of energy, at different rates (power).
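Energy is the integral of power over time. A minimal sketch, assuming piecewise-constant power profiles with hypothetical wattages, shows two profiles that consume the same energy with very different peak power:

```python
def energy_joules(profile: list[tuple[float, float]]) -> float:
    """Energy = area under a piecewise-constant power curve.

    profile: list of (power_watts, duration_seconds) segments.
    """
    return sum(power * duration for power, duration in profile)


def peak_power(profile: list[tuple[float, float]]) -> float:
    """Highest instantaneous power in the profile (watts)."""
    return max(power for power, _ in profile)


# Hypothetical profiles: a bursty run (P2, P1, P0 for 1 s each) vs. a steady run at P0.
bursty = [(30.0, 1.0), (20.0, 1.0), (10.0, 1.0)]
steady = [(10.0, 6.0)]

print(energy_joules(bursty), energy_joules(steady))
print(peak_power(bursty), peak_power(steady))
```

Both profiles draw 60 J, so they drain a battery equally, but the bursty one peaks at 3x the power, and peak power is what a thermal envelope constrains.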

Optimizing Power vs. Energy
- Maximize battery life → minimize energy.
- Thermal envelopes → minimize peak power.

The Problem
Historically, performance scaling was accompanied by power scaling. This is no longer true → power densities are increasing.

The End of Dennard Scaling
[Figure: MOSFET cross-section (gate, source, drain, tox, L)]
- Voltage is no longer scaling at the same rate as transistor dimensions.
- Slower scaling in power per transistor → increasing power densities.
From R. Dennard, et al., “Design of ion-implanted MOSFETs with very small physical dimensions,” IEEE Journal of Solid State Circuits, vol. SC-9, no. 5, pp. 256-268, Oct. 1974.
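The contrast can be made concrete with first-order constant-field scaling arithmetic. The sketch below is a simplified model, assuming that in both cases capacitance per transistor shrinks by 1/κ and frequency rises by κ; only the supply voltage differs. It ignores leakage entirely.

```python
def scaling_step(kappa: float, voltage_scales: bool) -> dict[str, float]:
    """Relative change per generation under simplified first-order scaling.

    kappa: dimension scaling factor (dimensions shrink by 1/kappa).
    voltage_scales: True for ideal Dennard scaling, False if Vdd stays fixed.
    """
    cap = 1 / kappa                          # capacitance per transistor
    vdd = 1 / kappa if voltage_scales else 1.0
    freq = kappa                             # assumed frequency gain in both cases
    power_per_transistor = cap * vdd ** 2 * freq
    area_per_transistor = 1 / kappa ** 2
    return {
        "power/transistor": power_per_transistor,
        "density": 1 / area_per_transistor,
        "power density": power_per_transistor / area_per_transistor,
    }


kappa = 1 / 0.7  # a 30% shrink in dimensions per generation
for label, scales in (("Dennard", True), ("fixed Vdd", False)):
    print(label, {k: round(v, 2) for k, v in scaling_step(kappa, scales).items()})
```

With voltage scaling, power per transistor falls by about 2x while density doubles, so power density stays flat; with Vdd fixed, power per transistor stays flat and power density roughly doubles every generation.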

Chip Power Densities
From: “ExaScale Computing Study: Technology Challenges in Achieving Exascale Systems,” P. Kogge, et al., 2008

What is the Problem?
[Figure: projected dark silicon, based on scaling using Pentium-class cores (Mukhopadhyay and Yalamanchili, 2009)]
While Moore’s Law continues, scaling phenomena have changed. Power densities are increasing with each generation.

The Power Wall
- Power per transistor scales linearly with frequency but quadratically with Vdd.
- Lower Vdd (and hence slower transistors) can be compensated for with increased pipelining to keep throughput constant.
- Power per transistor is not the same as power per area → power density is the problem!
- Multiple units can be run at lower frequencies to keep throughput constant while saving power.

The Advent of Dark Silicon?
[Figure: 64-core asymmetric chip multiprocessor layout (in-order and out-of-order cores) and failure probability distribution]
We cannot afford to turn on all devices at once. How do we manage the power and thermals?

What are my Options?
- Better technology: manufacturing; new devices → non-CMOS?
- Be more efficient: activity management (clock gating, power gating) and power management.
- Improved architecture: simpler pipelines, parallelism.

Activity Management
[Figure: clock gating (clk and cond signals gating the clock to a block of combinational logic) and power gating (a power-gate transistor between Vdd and Core 0 / Core 1)]
- Clock gating: turn off the clock to a block of logic. Eliminates unnecessary transitions/activity and saves clock distribution power.
- Power gating: turn off power to a block of logic, e.g., a core. No leakage.

Power Management
Software-controlled power management:
- Optimize power and/or energy.
- Orchestrated by the operating system or application libraries.
- Industry-standard interfaces for power management: Advanced Configuration and Power Interface (ACPI), https://www.acpica.org/, http://www.acpi.info/
Hardware power management:
- Optimized power/energy.
- Failsafe operation, e.g., protection against thermal emergencies.

Processor Power States
Performance states (P-states):
- Operate at different voltages/frequencies (recall the delay-voltage relationship).
- Lower voltage → lower leakage.
- Lower frequency → lower power (not the same as lower energy!).
- Lower frequency → longer execution time.
Idle states (C-states):
- Sleep states, which differ in how much state is saved.
Transitions between states are managed by software or hardware!
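The "not the same as energy" caveat is easiest to see numerically. The sketch below is a first-order model, assuming per-cycle dynamic energy C·VDD² and static power VDD·I_leak with a constant leakage current; the state names and all parameter values are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class PState:
    name: str
    vdd: float   # volts
    freq: float  # hertz


def run_task(p: PState, cycles: float, c_eff: float, i_leak: float) -> dict[str, float]:
    """First-order DVFS model (assumed): dynamic energy C*V^2 per cycle, static power V*I_leak."""
    time_s = cycles / p.freq
    e_dyn = cycles * c_eff * p.vdd ** 2
    e_static = p.vdd * i_leak * time_s
    energy = e_dyn + e_static
    return {"time_s": time_s, "energy_J": energy, "avg_power_W": energy / time_s}


# Hypothetical operating points: nominal, frequency-only scaling, and voltage+frequency.
states = [PState("P0", 1.0, 2e9), PState("freq only", 1.0, 1e9), PState("P1", 0.8, 1e9)]
for s in states:
    r = run_task(s, cycles=2e9, c_eff=1e-9, i_leak=0.5)
    print(s.name, {k: round(v, 3) for k, v in r.items()})
```

Under these assumptions, halving frequency alone lowers average power but increases total energy, because leakage accrues for twice as long. Scaling voltage along with frequency lowers both power and energy, at the cost of a longer execution time.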

Multiple Voltage-Frequency Domains
Intel Sandy Bridge processor:
- Cores and ring in one DVFS domain.
- Graphics unit in another DVFS domain.
- Cores and a portion of the cache can be gated off.
From E. Rotem et al., HotChips 2011

Power States From: http://www.intel.com/content/www/us/en/processors/core/2nd-gen-core-family-mobile-vol-1-datasheet.html

Power Gating
[Figure: Intel Sandy Bridge processor]
- Turn off components that are not being used; all state information is lost.
- Costs of powering down and costs of powering up.
- Smart shutdown: models to guide decisions.
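The cost tradeoff can be expressed as a break-even idle time: gating saves leakage power for the duration of the idle period but pays a fixed energy cost to power down and back up. The sketch assumes this simple model; the function names, the wake-latency check, and all numbers are hypothetical and do not describe Sandy Bridge's actual policy.

```python
def break_even_idle_time(e_down_j: float, e_up_j: float, p_leak_w: float) -> float:
    """Idle time (s) beyond which power gating saves energy (simple model, assumed)."""
    return (e_down_j + e_up_j) / p_leak_w


def should_power_gate(predicted_idle_s: float, e_down_j: float, e_up_j: float,
                      p_leak_w: float, wake_latency_s: float,
                      latency_budget_s: float) -> bool:
    """Hypothetical 'smart shutdown' rule: gate only if the predicted idle period
    exceeds the break-even time and the wake-up latency is tolerable."""
    return (predicted_idle_s > break_even_idle_time(e_down_j, e_up_j, p_leak_w)
            and wake_latency_s <= latency_budget_s)


# Hypothetical core: 0.5 W leakage, 20 uJ to power down (state save), 30 uJ to power up.
t_be = break_even_idle_time(20e-6, 30e-6, 0.5)
print(f"break-even idle time: {t_be * 1e6:.0f} us")
print(should_power_gate(50e-6, 20e-6, 30e-6, 0.5, 10e-6, 100e-6))  # idle shorter than break-even
print(should_power_gate(1e-3, 20e-6, 30e-6, 0.5, 10e-6, 100e-6))   # idle long enough to gate
```

The quality of the idle-time prediction is what separates a smart shutdown policy from a naive one: gating too eagerly on short idle periods costs more energy than it saves.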

Simplify Core Design
[Figures: AMD Bulldozer core; ARM A7 core (arm.com)]
- Support for out-of-order execution, schedulers, branch prediction, etc. consumes more energy per instruction.
- Many more simpler cores can fit on a die.

Parallelism and Power
[Figures: IBM Power5 and AMD Trinity die photos; sources: forwardthinking.pcmag.com, IBM]
How much of the chip area is devoted to compute?
Run many cores slower. Why does this reduce power?

Parallelism
Concurrency + lower frequency → greater energy efficiency.
Example (each core with its own cache): 4X #cores, 0.75X voltage, 0.5X frequency → 1X power, 2X performance.
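The example's numbers follow from applying the dynamic power formula C·V²·f per core. A sketch, assuming identical cores, perfectly parallel work, and ignoring leakage and cache power:

```python
def relative_power(cores: float, voltage: float, freq: float) -> float:
    """Dynamic power relative to one core at nominal V and f: cores * V^2 * f."""
    return cores * voltage ** 2 * freq


def relative_performance(cores: float, freq: float) -> float:
    """Throughput relative to one core, assuming perfectly parallel work (an idealization)."""
    return cores * freq


power = relative_power(4, 0.75, 0.5)   # 4 * 0.5625 * 0.5
perf = relative_performance(4, 0.5)    # 4 * 0.5
print(f"power {power:.3f}x, performance {perf:.1f}x, perf/W {perf / power:.2f}x")
```

The exact product is 1.125X power, which the example rounds to 1X; performance per watt still improves by about 1.8X. Any serial fraction in the workload (Amdahl's Law) reduces the 2X performance figure.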

Microarchitectural Level Models
How can we study power consumption without building circuits? Models.
Models are available at multiple levels of abstraction; we are interested in microarchitectural models.

Processor Microarchitecture
[Figure: pipeline stages Fetch, Decode, Execute/Writeback; components: instruction cache, instruction TLB, branch prediction, fetch queue, instruction decoder, instruction queue, register files, ALU, MUL, FPU, LD, ST, data TLB, L1 data cache, L2 data cache, on-chip network with NoC router, memory]

Energy/Power Calculation
How do we calculate energy or power dissipation for a given microarchitecture? Energy/power varies with:
- ISA: ARM vs. Intel x86
- Microarchitecture: in-order vs. out-of-order
- Application: memory-bound vs. compute-bound
- Technology: 90nm vs. 22nm
- Operating conditions: frequency, temperature

Architecture Activity (1)
Activity 1: instruction fetch → icache.read++; fbuffer.write++;
[Figure: processor microarchitecture block diagram]
- Collect activity counts for each architecture component (through simulation or measurement).
- The list of components differs between microarchitectures.
- The activity counts at each component differ between applications.

Architecture Activity (2)
Activity 2: instruction decode → fbuffer.read++; idecoder.logic++;
[Figure: processor microarchitecture block diagram]
- Count read/write accesses to caches, buffers, etc., and logic accesses to logic blocks such as the decoder, ALUs, etc.
- There is a tradeoff between differentiating more access types (accuracy) and simulation speed (complexity).
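A simulator collects these counts by replaying each instruction's accesses. The sketch below is a toy version, assuming a fixed access pattern per opcode; the icache, fbuffer, and idecoder names follow the slides, while the regfile, alu, and dcache entries and the three-opcode instruction set are hypothetical.

```python
from collections import Counter

# Front-end accesses for every instruction: fetch, then decode (as on the slides).
FRONT_END = [("icache", "read"), ("fbuffer", "write"),
             ("fbuffer", "read"), ("idecoder", "logic")]

# Hypothetical back-end access patterns per opcode.
BACK_END = {
    "add":   [("regfile", "read"), ("regfile", "read"), ("alu", "logic"), ("regfile", "write")],
    "load":  [("regfile", "read"), ("alu", "logic"), ("dcache", "read"), ("regfile", "write")],
    "store": [("regfile", "read"), ("regfile", "read"), ("alu", "logic"), ("dcache", "write")],
}


def collect_activity(trace: list[str]) -> Counter:
    """Count accesses per (component, access type) for an instruction trace."""
    counts: Counter = Counter()
    for opcode in trace:
        counts.update(FRONT_END)
        counts.update(BACK_END[opcode])
    return counts


activity = collect_activity(["load", "add", "add", "store"])
for (component, kind), n in sorted(activity.items()):
    print(f"{component}.{kind} = {n}")
```

Splitting an access type (for example, distinguishing ALU add from ALU compare) only requires more keys, which is exactly the accuracy-versus-complexity tradeoff noted above.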

Power and Architecture Activity
For example, at the nth clock cycle the collected data cache counters are read = 20 and write = 12, with per-read energy = 0.5nJ and per-write energy = 0.6nJ.
- Read energy = read × per-read energy = 10nJ
- Write energy = write × per-write energy = 7.2nJ
- Total activity energy = read energy + write energy = 17.2nJ
If n = 50 (the 50th clock cycle) and clock frequency = 2GHz:
- Total activity power = energy × clock_freq / n = 688mW
Note: n / clock_freq is the duration of n clock periods in seconds, so power is the time average of energy.
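The calculation generalizes to any set of counters. A minimal sketch reproducing the example's numbers, in the spirit of the simplified McPAT-style model named in the study guide; the dictionary keyed by (component, access type) is an illustrative choice, not McPAT's interface.

```python
# Per-access energies from the example (nanojoules).
PER_ACCESS_NJ = {("dcache", "read"): 0.5, ("dcache", "write"): 0.6}


def activity_energy_nj(counts: dict[tuple[str, str], int]) -> float:
    """Total dynamic energy = sum over (component, access) of count * per-access energy."""
    return sum(n * PER_ACCESS_NJ[key] for key, n in counts.items())


def activity_power_w(energy_nj: float, n_cycles: int, clock_hz: float) -> float:
    """Average power = energy / elapsed time, where elapsed time = n_cycles / clock_hz."""
    return energy_nj * 1e-9 * clock_hz / n_cycles


counts = {("dcache", "read"): 20, ("dcache", "write"): 12}
e = activity_energy_nj(counts)                      # 10 nJ + 7.2 nJ
p = activity_power_w(e, n_cycles=50, clock_hz=2e9)  # over 25 ns
print(f"energy = {e:.1f} nJ, power = {p * 1e3:.0f} mW")
```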

Things to consider (1)
How do we calculate per-read/write energies? Per-access energies can be estimated from circuit-level designs and analyses; there are various open-source tools for this.
[Figure: architecture specification + technology parameters → circuit-level estimation tool → estimation results: area, energy, timing, etc.]

Things to consider (2)
Is per-access energy always the same? Per-access energy in fact depends on:
- how many bits are switching, and
- how they are switching (0→1 or 1→0).
It is reasonable to assume a constant per-access energy over long observation windows (e.g., n = 1M clock cycles), because the number of switching bits averages out (e.g., 50% of bits switching). Most architecture simulators do not capture bit-level details due to simulation complexity.
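The averaging argument can be checked empirically. The sketch assumes access energy proportional to the number of toggled bits (treating 0→1 and 1→0 as equal cost, a simplification) and feeds a hypothetical 64-bit register with random data. Real workloads are not random, which is one reason the 50% figure is only an approximation.

```python
import random


def toggled_bits(prev: int, curr: int) -> int:
    """Number of bit positions that switch (0->1 or 1->0) between two words."""
    return bin(prev ^ curr).count("1")


def access_energy(prev: int, curr: int, e_per_toggle: float) -> float:
    """Data-dependent access energy, assuming equal cost for 0->1 and 1->0 (a simplification)."""
    return toggled_bits(prev, curr) * e_per_toggle


# Hypothetical 64-bit register written with random data over many accesses.
WIDTH = 64
rng = random.Random(0)
values = [rng.getrandbits(WIDTH) for _ in range(100_000)]
total_toggles = sum(toggled_bits(a, b) for a, b in zip(values, values[1:]))
avg_fraction = total_toggles / (WIDTH * (len(values) - 1))
print(f"average fraction of bits switching: {avg_fraction:.3f}")
```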

Things to consider (3)
If a register file has no read/write accesses but holds data, what is its energy dissipation?
Energy (or power) consists largely of dynamic and static dissipation.
- Dynamic (or switching) energy is dissipated by switching activity.
- Static (or leakage) energy is dissipated simply by keeping the electronic system turned on.
In this case, the register file has no dynamic energy dissipation but still consumes static energy.
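Splitting energy into its two components makes the answer explicit. A minimal sketch, assuming a constant leakage power while the block is powered; the per-access energies, leakage power, and interval are hypothetical values.

```python
def component_energy_j(reads: int, writes: int, e_read_j: float, e_write_j: float,
                       p_leak_w: float, elapsed_s: float) -> tuple[float, float]:
    """Return (dynamic, static) energy for one component over an interval.

    Dynamic energy comes only from accesses; static energy accrues whenever the
    component is powered, whether or not it is accessed.
    """
    dynamic = reads * e_read_j + writes * e_write_j
    static = p_leak_w * elapsed_s
    return dynamic, static


# Hypothetical register file that holds data but is never accessed for 1 ms.
dyn, stat = component_energy_j(reads=0, writes=0, e_read_j=5e-12, e_write_j=6e-12,
                               p_leak_w=0.05, elapsed_s=1e-3)
print(f"dynamic = {dyn} J, static = {stat:.1e} J")
```

Only power gating (which also discards the held data) would drive the static term to zero.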

Thermal Issues
- Heat can damage the chip; failsafe operation is needed.
- Thermal fields change the physical characteristics: leakage current, and therefore power, increases; delay increases; device degradation becomes worse.
- The cooling solution determines the permitted power dissipation.

Thermal Design Power (TDP)
This is the maximum power at which the part is designed to operate.
- Dictates the design of the cooling system; maximum temperature → Tjmax.
- Typically fixed by the worst-case workload.
- Parts typically operate below the TDP. Opportunities for turbo mode?
[Figure: AMD Trinity APU; http://ecs.vancouver.wsu.edu/thermofluids-research]

Trinity TDP Source: http://www.anandtech.com/show/6347/amd-a10-5800k-a8-5600k-review-trinity-on-the-desktop-part-2

Exploiting the Physics
- Most of the time the part is operating well below its thermal limit, leaving performance on the table.
- Temperature changes slowly, so frequency (and therefore power dissipation) can be boosted temporarily for short periods of time, e.g., seconds.

Boosting: Intel Sandy Bridge
- Exploit package physics: temperature changes on the order of milliseconds.
- Use the thermal headroom.
[Figure: power vs. time; a low-power period builds up thermal credits, enabling a turbo boost region above TDP, up to max power, lasting 10s of seconds]
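The "thermal credits" idea can be illustrated with a first-order lumped thermal RC model. This model is an assumption here, not the one Intel uses, and all parameter values are hypothetical, with the time constant chosen so that boosts last tens of seconds.

```python
import math

# Hypothetical lumped package model: C_TH * dT/dt = P - (T - T_AMB) / R_TH
T_AMB = 45.0                     # reference (heatsink/ambient) temperature, deg C
TJ_MAX = 100.0                   # maximum junction temperature, deg C
TDP = 35.0                       # watts; sustained power that settles exactly at TJ_MAX
R_TH = (TJ_MAX - T_AMB) / TDP    # thermal resistance, K/W
TAU = 20.0                       # thermal time constant R_TH * C_TH in seconds (assumed)
C_TH = TAU / R_TH                # thermal capacitance, J/K


def simulate_boost(p_boost: float, t_start: float, dt: float = 1e-3) -> float:
    """Seconds of boost at p_boost before reaching TJ_MAX (forward-Euler integration)."""
    if T_AMB + p_boost * R_TH <= TJ_MAX:
        return math.inf          # this power level never reaches the limit
    temp, elapsed = t_start, 0.0
    while temp < TJ_MAX:
        temp += dt * (p_boost - (temp - T_AMB) / R_TH) / C_TH
        elapsed += dt
    return elapsed


def boost_duration_exact(p_boost: float, t_start: float) -> float:
    """Closed-form solution of the same first-order model."""
    t_final = T_AMB + p_boost * R_TH  # temperature it would eventually settle at
    if t_final <= TJ_MAX:
        return math.inf
    return TAU * math.log((t_final - t_start) / (t_final - TJ_MAX))


# A cooler starting point (more thermal credit) buys a longer boost.
for start in (60.0, 80.0):
    print(f"start {start:.0f} C: boost {simulate_boost(50.0, start):.1f} s "
          f"(exact {boost_duration_exact(50.0, start):.1f} s)")
```

In this model the boost duration depends on how far the boost power overshoots TDP and on how cool the part is when boosting begins, which is one way to approach the study-guide question on what limits high-frequency operation.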

Conclusions
- Power/energy is the leading driver of modern architecture design.
- Power and energy management is key to scalability.
- Fielded systems need integrated power/energy, performance, and thermal management.
- What about energy/power-efficient algorithms?

Study Guide
- Explain the difference between energy dissipation and power dissipation.
- Distinguish between static power dissipation and dynamic power dissipation.
- Be able to apply the simplified McPAT power model to a simple datapath and instruction sequence.
- Explain dynamic voltage-frequency scaling (DVFS).
- What are power states? Why is this an advantage?
- What is the impact of DVFS on i) energy, ii) execution time, and iii) power?

Study Guide (cont.)
- How is thermal design power (TDP) calculated?
- When using boost algorithms, what determines the duration of the high-frequency operation?
- How does a power virus work?
- Describe how throttling works.
- Know the power dissipation in some modern processor-memory systems drawn from the embedded, server, and high-performance computing segments.