Low-power computer architecture

Dr. Avi Mendelson

Disclaimer
- No Intel proprietary information is disclosed.
- Every future estimate or projection is only speculation.
- Responsibility for all opinions and conclusions falls on the author only. That does not mean you cannot trust them…
© Dr. Avi Mendelson

Before we start
- The focus of my classes during the summer school is on understanding the power problem, current solutions, and research directions.
- Personal observation: focusing on low power resembles Alice through the looking-glass: we are looking at the same old problems, but from the other side of the looking glass, and the landscape appears much different...
- Out-of-the-box thinking is needed.

Schedule of the course
- First day: Introduction to low power.
- Second day: Circuit modeling and simulation.
- Third day: Circuit-level solutions.
- Fourth day: Architectures for low power.
- Fifth day: Thermal and system issues.

Agenda
- The power crisis
  - Power consumption
  - Power density and thermal limitations
- General solutions and directions

Moore’s law
“Doubling the number of transistors on a manufactured die every year” - Gordon Moore, Intel Corporation
[Figure: transistors per die (log scale, 10^2 to 10^9) vs. year ('70 to 2000) for memory (1K to 256M) and microprocessors (4004, 8080, 8086, 80286, i386™, i486™, Pentium®, Pentium® Pro, Pentium® II, Pentium® III, Pentium® 4). Source: Intel]

In the Last 25 Years Life was Easy(*)
- Doubling of transistor density every 30 months
- Increasing die sizes, allowed by:
  - Increasing wafer size
  - Process technology moving from “black art” to “manufacturing science”
- Result: doubling of transistors every 18 months
- 486 shrink was 0.7 (area); Pentium shrink was 0.5 (area)
- Implications (in the same technology):
  1. A new µArch takes ~2-3X the die area of the last µArch
  2. It provides X integer performance of the last µArch
(*) Source: Fred Pollack, Micro-32

Suddenly, the power monster appears in all different market segments

Processor Power Evolution
[Figure: max power in watts (log scale, 1 to 100) vs. process generation (1.5µ, 1µ, 0.8µ, 0.6µ, 0.35µ, 0.25µ, 0.18µ, 0.13µ) for i386, i486, Pentium® w/MMX tech., Pentium® Pro, Pentium® II, Pentium® III, Pentium® 4, and a future “?” point]
- Traditionally: a new generation always increases power
- Compactions: higher performance at lower power
- Used to be “One size fits all”: start with high power and shrink to Mobile

The power crisis – power consumption
Source: Cool Chips, Micro-32

Power challenges per segment
Segments: Handhelds, Mobile, Desktops, Servers
- Power-related system cost drivers: form factor, battery size, battery cost, thermal cost, delivery cost, form factor
- Price drivers: performance, battery life, noise, perf/kg, perf/$$, perf/inch^3
- Optimization point: max battery life; max perf/power to meet the application’s need; max thermal constraint

Power & Energy
Power
- Dynamic power: consumed by transistors during switching. P = a C V^2 f is the work done per time unit, in watts (a: activity, C: capacitance, V: voltage, f: frequency).
- Static power (leakage): consumed by all “inactive” transistors; it depends on temperature and voltage.
- Power-aware architectures aim to reduce peak power.
Energy
- Power consumed over a period of time.
- Energy-aware architectures aim to reduce average power consumption.
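
A minimal sketch of the dynamic power formula and the power/energy distinction above. The operating point (activity, capacitance, voltage, frequency) is a hypothetical set of values chosen only for illustration; none of these numbers come from the slides.

```python
def dynamic_power_w(activity, capacitance_f, voltage_v, frequency_hz):
    """Dynamic (switching) power in watts: P = a * C * V^2 * f."""
    return activity * capacitance_f * voltage_v ** 2 * frequency_hz


def energy_j(power_w, seconds):
    """Energy in joules: power consumed over a period of time."""
    return power_w * seconds


# Hypothetical operating point (assumed values, not measured data).
p = dynamic_power_w(activity=0.1, capacitance_f=100e-9,
                    voltage_v=1.5, frequency_hz=2e9)
e = energy_j(p, seconds=10.0)
print(f"dynamic power = {p:.1f} W, energy over 10 s = {e:.1f} J")
```

Because voltage enters squared, lowering V is the strongest lever in this formula; the voltage scaling slides later in the deck build on that.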

Power Evolution (Theoretical)
[Figure: leakage power and active power in watts (0 to 250) for the 0.25µ, 0.18µ, 0.13µ, and 0.1µ process generations]
- For a 15 mm/side die (225 mm^2)
- Assume a 2X frequency increase each generation
- Future process numbers are estimated

Why high power matters
Power limitations
- Higher power -> higher current
  - Cannot exceed platform power delivery constraints
- Higher power -> higher temperature
  - Cannot exceed the thermal constraints (e.g., Tj < 100°C)
  - Increases leakage
  - The heat must be controlled in order to avoid electromigration and other “chemical” reactions of the silicon
Energy
- Affects battery life
  - Consumer devices: the processor may consume most of the energy
  - Mobile computers (laptops): the system (display, disk, cooling, power supply, etc.) consumes most of the energy
- Affects the cost of electricity

Power Density
[Figure: power density in watts/cm^2 (log scale, 1 to 1000) vs. process generation (1.5µ down to 0.07µ) for i386, i486, Pentium®, Pentium® Pro, Pentium® II, Pentium® III, and Pentium® 4, with reference levels for a hot plate, a nuclear reactor, a rocket nozzle, and the Sun's surface]
* “New Microarchitecture Challenges in the Coming Generations of CMOS Process Technologies”, Fred Pollack, Intel Corp., Micro-32 conference keynote

Why do power and power density increase over time?

How do we keep up with Moore’s Law?
- Every 18 months on average, we introduce a new process.
- The new process shrinks the transistor dimensions by a factor of 0.7 (ideal shrink).
- As a result, the same die area holds more transistors, each of them running at a higher frequency.
- One may mistakenly think that this is the reason for the increase in power and power density.

Scaling theory (1 of 2)
- Lateral and vertical dimensions reduce by 30%
- Capacitance (area and fringing) reduces by 30%
- Die area reduces by 50%

Scaling theory (2 of 2)
- Capacitance per transistor reduces by 30%
- Capacitance per unit area increases by 43%
- Delay reduces by 30%, power reduces by 50%
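
The scaling numbers above can be reproduced from a single linear shrink factor. The sketch below is an illustration, not part of the original slides: it assumes that supply voltage scales by the same factor as the dimensions (constant-field scaling), which the slides do not state explicitly, and the function name and dictionary keys are made up for this example.

```python
def ideal_scaling(s=0.7):
    """Relative values after one ideal process shrink by linear factor s.

    Assumption (not stated on the slides): supply voltage also scales by s.
    All returned values are ratios of new process to old process.
    """
    cap_per_transistor = s                 # area and fringing capacitance
    die_area = s * s                       # same design on the new process
    cap_per_unit_area = cap_per_transistor / die_area
    delay = s
    frequency = 1.0 / delay
    voltage = s                            # assumed constant-field scaling
    # Dynamic power from P = a C V^2 f, with unchanged activity.
    power = cap_per_transistor * voltage ** 2 * frequency
    return {
        "cap_per_transistor": cap_per_transistor,
        "die_area": die_area,
        "cap_per_unit_area": cap_per_unit_area,
        "delay": delay,
        "power": power,
        "power_density": power / die_area,
    }


r = ideal_scaling(0.7)
for name, value in r.items():
    print(f"{name:20s} {value:.2f}X")
```

Under these assumptions power and die area shrink by the same factor, so power density stays flat in the ideal case.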

Ideal Scenarios...
Ideal “shrink” (same µarch):
- 1X #Xistors
- 0.5X size
- 1.5X frequency
- 0.5X power
- 1X IPC (instr./cycle)
- 1.5X performance
- 1X power density
Ideal new µarch (same die size):
- 2X #Xistors
- 1X size
- 1.5X frequency
- 1X power
- 2X IPC
- 3X performance
- 1X power density
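
A small consistency check of the two ideal scenarios above, assuming that performance is frequency times IPC and that power density is power divided by die size. The `Scenario` class is a hypothetical helper written for this illustration.

```python
from dataclasses import dataclass


@dataclass
class Scenario:
    """Ratios relative to the previous generation."""
    transistors: float
    size: float
    frequency: float
    power: float
    ipc: float

    @property
    def performance(self):
        # Assumed model: performance = frequency * instructions per cycle.
        return self.frequency * self.ipc

    @property
    def power_density(self):
        # Assumed model: power density = power / die size.
        return self.power / self.size


ideal_shrink = Scenario(transistors=1.0, size=0.5, frequency=1.5, power=0.5, ipc=1.0)
ideal_new_uarch = Scenario(transistors=2.0, size=1.0, frequency=1.5, power=1.0, ipc=2.0)

for name, sc in [("ideal shrink", ideal_shrink), ("ideal new uarch", ideal_new_uarch)]:
    print(f"{name}: performance {sc.performance}X, power density {sc.power_density}X")
```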

Process Technologies – Reality
But in reality:
- A new process is not ideal anymore
- New designs squeeze frequency to 2X per process
- New designs use more transistors (2X-3X to get 1.5X-1.7X perf)
So, with every new process and architecture generation:
- Power goes up about 2X
- Power density goes up 30%~80%
This is bad, and it will get worse in future process generations:
- Voltage (Vdd) will scale down less
- Leakage is going through the roof

Die increases in order to maintain performance boost
[Figure: processor dies by silicon process technology (1.5µ, 1.0µ, 0.8µ, 0.6µ, 0.35µ, 0.25µ, 0.18µ, 0.13µ): Intel386™ DX, Intel486™ DX, Pentium®, Pentium® Pro, Pentium® II, Pentium® III, and Pentium® 4 processors]
- P-III sizes: 0.25u 140 mm^2, 9.5M trans, 0KB L2; mm^2, 28M trans, 128KB L2; mm^2, 40M trans, 512KB L2
- P4P sizes: W 1.75V 2GHz 75 Watts 217 mm^2 42M trans, 256KB L2; N 1.5V 2GHz 52W, 2.2GHz 55W 146 mm^2 55M trans, 512KB L2

Put it all together: power and power density are a real threat to Moore’s law
- Complex algorithms lead to denser power:
  - Dense random logic
  - Timing pressure leads to faster/bigger/power-hungrier gates
- Designers place units that communicate with each other close together. This creates “regions” with high activity factors -> hot spots.
- Power is not distributed evenly over the chip. A failure can happen if a single point reaches the max power point.
- Many modern processors are power limited.

Some implications
- We can’t build microprocessors with ever-increasing power density and die sizes.
- The constraint is power, not manufacturability.
- The design of any future microprocessor should take power into consideration.
- We need to distinguish between different aspects of power:
  - Power delivery
  - Max power (Tj)
  - Power density: hot spots
  - Energy: static + dynamic
- Power- and energy-aware design should take care of each of these aspects.
- One size does not fit all anymore.

General solutions and directions
Assume that one size does not fit all. Different segments may need different solutions (although many of them share the same principle of operation).

Embedded systems vs. Laptops
Embedded systems
- Most of the power is consumed by the CPU.
- Usually not thermally limited.
- What we really care about is battery life and meeting the timing limitations.
- In real-time systems we can take advantage of known “deadlines”.
Laptops (mobile systems)
- We are thermally limited.
- We cannot use deadlines (most of the time).
- We need to optimize for max battery life and max performance in a given power envelope.

How to extend Battery life: Voltage Scaling
- Within a given voltage range, a higher voltage allows a higher frequency.
- Used for trading power against frequency, either:
  - Statically, at manufacturing time
  - Dynamically, at run time (e.g., Intel’s SpeedStep® Technology)
- The actual range depends on the specific design and process technology.
- Examples*:
  - Intel® XScale™ processors run from 0.75V (150MHz/50mW) to 1.65V (800MHz/900mW)
  - The Intel mobile Pentium® III processor sells from 1.1V (600MHz) to 1.7V (1GHz)
[Figure: XScale processor frequency and power vs. voltage]
* Source: Intel Corp.
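
To make the XScale example above concrete, the sketch below computes the energy spent per clock cycle (power divided by frequency) at the two quoted operating points. Only the voltage, frequency, and power values come from the slide; the helper name and the comparison against the V^2 ratio are additions for illustration.

```python
# Operating points quoted on the slide: (supply voltage V, frequency Hz, power W).
XSCALE_POINTS = [
    (0.75, 150e6, 50e-3),
    (1.65, 800e6, 900e-3),
]


def energy_per_cycle_j(power_w, frequency_hz):
    """Average energy per clock cycle in joules."""
    return power_w / frequency_hz


for volts, freq_hz, power_w in XSCALE_POINTS:
    nj = energy_per_cycle_j(power_w, freq_hz) * 1e9
    print(f"{volts:.2f} V, {freq_hz / 1e6:.0f} MHz: {nj:.3f} nJ/cycle")

low = energy_per_cycle_j(XSCALE_POINTS[0][2], XSCALE_POINTS[0][1])
high = energy_per_cycle_j(XSCALE_POINTS[1][2], XSCALE_POINTS[1][1])
v_squared_ratio = (XSCALE_POINTS[1][0] / XSCALE_POINTS[0][0]) ** 2
print(f"energy/cycle ratio = {high / low:.3f}, V^2 ratio = {v_squared_ratio:.2f}")
```

The measured ratio need not match the V^2 ratio exactly, since total power also includes components that the simple dynamic power model ignores.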

Voltage Scaling (cont.)
- Huge effect on dynamic power: a 20% frequency reduction allows a 20% voltage reduction, which gives
  - ~35% energy reduction: a C (0.8 V)^2 = 0.64 a C V^2
  - ~50% power reduction: a C (0.8 V)^2 (0.8 f) = 0.51 a C V^2 f
- Even more impressive if we recall that a 20% frequency hit causes only a 10%-15% performance hit*
- Voltage scaling can be used to trade performance for power:
  - Reduce the power consumption when performance needs can be relaxed
  - E.g., if deadlines are known and we have enough “dead time”, we can lower the voltage at the expense of a longer execution time.
- BUT it has technology limitations
* Depends mainly on the core-to-bus frequency ratio and cache sizes.
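
The arithmetic above can be sketched as follows, assuming that voltage can be scaled linearly with frequency (the slide's 20%/20% example) and that dynamic power follows a C V^2 f. The deadline helper is a hypothetical illustration of the “dead time” idea and assumes run time is inversely proportional to frequency, which is pessimistic given the note that a 20% frequency hit costs only 10%-15% performance.

```python
def dynamic_scaling(freq_scale):
    """Return (energy_scale, power_scale) for a given frequency scale.

    Assumption: voltage is scaled by the same factor as frequency.
    Energy per operation ~ a C V^2, power ~ a C V^2 f.
    """
    voltage_scale = freq_scale
    energy_scale = voltage_scale ** 2
    power_scale = voltage_scale ** 2 * freq_scale
    return energy_scale, power_scale


def slowest_freq_scale(run_time_s, deadline_s):
    """Lowest frequency scale that still meets the deadline.

    Hypothetical helper: assumes run time grows as 1 / frequency.
    """
    if run_time_s > deadline_s:
        raise ValueError("deadline cannot be met even at full frequency")
    return run_time_s / deadline_s


energy_scale, power_scale = dynamic_scaling(0.8)
print(f"energy x{energy_scale:.2f}, power x{power_scale:.3f}")

# A task that needs 8 ms at full speed with a 10 ms deadline (made-up numbers).
scale = slowest_freq_scale(run_time_s=8e-3, deadline_s=10e-3)
print(f"frequency can drop to {scale:.0%} of nominal")
```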

How to extend battery life: Energy Efficiency
- Energy per task is:
  - Proportional to the number of processed instructions per task
  - Proportional to the average energy consumed per instruction
- “Energy per (retired) instruction” = b*W, where
  - b: ratio of total to retired number of processed instructions
  - W: average energy spent in processing an instruction
- Both figures deteriorate with every new microarchitecture, since speculation increases and complexity grows.
- In that respect, high-performance modern microarchitectures are less energy-efficient.
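
A minimal sketch of the b*W model above. The two microarchitectures compared here are hypothetical, and the values of b and W are invented purely to show how both factors multiply into the energy per task.

```python
def energy_per_task_j(retired_instructions, b, w_joules):
    """Energy per task = retired instructions * b * W.

    b: ratio of total processed instructions to retired instructions (>= 1).
    w_joules: average energy spent processing one instruction.
    """
    if b < 1.0:
        raise ValueError("b cannot be below 1: every retired instruction is processed")
    return retired_instructions * b * w_joules


# Hypothetical numbers: a simple core vs. a more speculative, more complex one.
simple_core = energy_per_task_j(1e9, b=1.2, w_joules=1.0e-9)
aggressive_core = energy_per_task_j(1e9, b=2.0, w_joules=1.5e-9)
print(f"simple: {simple_core:.2f} J, aggressive: {aggressive_core:.2f} J")
```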

Improving Hot Spots: Clustering
- Build your system as a clustered architecture (e.g., Alpha).
- The system may be designed so that it would exceed the maximum allowed power if all clusters were active at once; most of the time, not all the clusters are active.
- “Smart scheduling” will spread the thermal hot spots among different clusters.
- In VLIW-based architectures, compilers can help.
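
One way to picture “smart scheduling” is a policy that always issues work to the coolest cluster. The sketch below is a toy model, not a description of any real processor: the thermal model (a fixed temperature rise per operation and a fixed cooling step) and all parameter values are assumptions made for illustration.

```python
def pick_coolest(temperatures_c):
    """Return the index of the coolest cluster (lowest index wins ties)."""
    return min(range(len(temperatures_c)), key=temperatures_c.__getitem__)


def simulate(num_clusters=4, steps=8, heat_per_op_c=3.0,
             cooling_c=1.0, ambient_c=50.0):
    """Toy thermal model: each step, every cluster cools, one cluster heats."""
    temps = [ambient_c] * num_clusters
    schedule = []
    for _ in range(steps):
        target = pick_coolest(temps)
        schedule.append(target)
        # All clusters cool toward ambient, then the active one heats up.
        temps = [max(ambient_c, t - cooling_c) for t in temps]
        temps[target] += heat_per_op_c
    return schedule, temps


schedule, temps = simulate()
print("schedule:", schedule)
print("peak temperature:", max(temps))
```

In this toy model the policy degenerates to round-robin and keeps the peak temperature bounded, whereas always using the same cluster would let its temperature keep climbing.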

Alpha hot spots
Source: CoolChips-99
[Figure: Alpha hot spots, annotated with Area 30%, Freq. 50%, Power 67%]

Power Complexity Metrics
Power ~ C V^2 f
Metrics: suppose we introduce a new feature that consumes x extra power and gains y performance:
- Power/Perf (~ Energy), assuming the same technology (same C) and the same voltage
  - For battery life and energy bills.
  - For a given power envelope, without voltage scaling.
- Power/Perf^2 (~ Energy*Delay)
  - Balances performance and power needs.
- Power/Perf^3 (~ Energy*Delay^2)
  - For a given power envelope, with voltage scaling, assuming that (1) we can trade frequency for voltage, and (2) we can lower the voltage as much as we wish.

E*D product (lower is better)
- E = energy / instruction = Power * sec / instruction = Watt / MIPS
- D = sec / instruction = 1 / MIPS
- E*D ~ Watt / MIPS^2
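
The three power/performance metrics can give different verdicts on the same feature. The sketch below evaluates a hypothetical feature that costs 10% extra power for a 5% performance gain; the numbers and the function name are assumptions for illustration. Ratios are relative to the baseline, so a value below 1 means the feature improves that metric.

```python
def power_perf_metrics(power_ratio, perf_ratio):
    """Relative change of each metric (lower is better, 1.0 = no change)."""
    return {
        "power/perf (~ energy)": power_ratio / perf_ratio,
        "power/perf^2 (~ energy*delay)": power_ratio / perf_ratio ** 2,
        "power/perf^3 (~ energy*delay^2)": power_ratio / perf_ratio ** 3,
    }


# Hypothetical feature: +10% power, +5% performance.
metrics = power_perf_metrics(power_ratio=1.10, perf_ratio=1.05)
for name, value in metrics.items():
    verdict = "better" if value < 1.0 else "worse"
    print(f"{name:32s} {value:.4f} ({verdict})")
```

With these made-up numbers the feature loses on plain energy but wins on the two delay-weighted metrics, which is why the choice of metric has to follow the segment's optimization point.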

Leakage control
- Leakage depends on technology, area, voltage, and temperature.
- High temperature -> high leakage -> high power -> higher temperature
- Leakage will be very significant in future microarchitectures.
- Large caches contribute to performance but may increase the power due to leakage.
- Larger caches: better performance, but higher leakage -> slower clock -> lower performance.
- Leakage makes the major difference between clock gating and deep sleep modes (where power is disconnected).
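
The temperature/leakage feedback loop above can be illustrated with a fixed-point iteration. Everything in this sketch is an assumed toy model, not data from the slides: die temperature is ambient plus thermal resistance times total power, and leakage is taken to double for every fixed temperature increase. All default parameter values are hypothetical.

```python
def steady_state_temperature(p_dynamic_w, t_ambient_c=40.0,
                             r_thermal_c_per_w=0.5, leak_ref_w=5.0,
                             t_ref_c=40.0, doubling_c=30.0,
                             t_limit_c=150.0, max_iters=1000, tol=1e-6):
    """Iterate temperature -> leakage -> power -> temperature to a fixed point.

    Returns (temperature in C, leakage in W). Raises RuntimeError if the loop
    runs away past t_limit_c or fails to converge.
    """
    t = t_ambient_c
    for _ in range(max_iters):
        # Assumed leakage model: doubles every `doubling_c` degrees.
        leak_w = leak_ref_w * 2.0 ** ((t - t_ref_c) / doubling_c)
        t_next = t_ambient_c + r_thermal_c_per_w * (p_dynamic_w + leak_w)
        if t_next > t_limit_c:
            raise RuntimeError("thermal runaway: temperature exceeds the limit")
        if abs(t_next - t) < tol:
            return t_next, leak_w
        t = t_next
    raise RuntimeError("did not converge")


temp_c, leak_w = steady_state_temperature(p_dynamic_w=50.0)
print(f"steady state: {temp_c:.2f} C with {leak_w:.2f} W of leakage")
```

With enough dynamic power the same loop no longer settles and the function reports runaway, which is the failure mode the feedback arrow on the slide warns about.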

Design for power: Out Of Order Execution
- OOO architecture was found to be very efficient in masking the effect of L1 cache misses.
- Aggressive OOO and wider machines require more registers and memory ports.
  - This consumes a lot of power.
- Can we slow down the access to the cache and let the OOO solve the performance problem?
- Can we simplify the OOO mechanisms, assuming that the memory subsystem limits the performance?
- How aggressive should we be with speculation (branch prediction, value prediction, etc.)?

Pentium Pro Power Breakdown
- Actual computation: less than 25%!
- What can be done:
  - Trace cache
  - Many low-level improvements

SMT
- A single-CPU µArch augmented to look like 2 or more CPUs to the software
- Adds ~10% logic to the CPU (Alpha experience)
- Average power increases by <10%
- Can increase the performance of two threads by % relative to running the same applications sequentially
- Looks like a good tradeoff between power and performance

MT - Implications on power
- The area and the power consumption of register files and memory elements within the processor increase significantly due to aggressive out-of-order execution and aggressive SMT (Alpha, CoolChips '99).
- Increases the power at the hot spots, so it does not fit thermally limited segments (where performance is needed).
- May better tolerate cache misses, so power-aware caches can be used.
- Hot spots may force us to use more aggressive clustering.
