1
Power Management
2
Introduction to Basics
3
Background Reading
Goal: Understand
- The sources of power dissipation in combinational and sequential circuits
- Power vs. energy
- Options for controlling power/energy dissipation
4
Moore's Law
Goal: Sustain Performance Scaling
- Performance scaled with the number of transistors
- Dennard scaling*: power scaled with feature size
(Figure from wikipedia.org)
*R. Dennard, et al., "Design of ion-implanted MOSFETs with very small physical dimensions," IEEE Journal of Solid-State Circuits, vol. SC-9, no. 5, pp. 256-268, Oct. 1974.
5
Where Does the Power Go in CMOS?
Dynamic power consumption
- Caused by switching transitions: the cost of switching state
Static power consumption
- Caused by leakage currents in the absence of any switching activity
Power consumption per transistor changes with each technology generation
- No longer reducing at the same rate: what happens to power density?
(Figure: CMOS inverter - Vin, Vout, Vdd, PMOS, NMOS; AMD Trinity APU die photo)
6
n-channel MOSFET
(Figure: MOSFET cross-section - gate, source, drain, body, oxide thickness tox, channel length L)
- Vgs < Vt: transistor off (Vt is the threshold voltage)
- Vgs > Vt: transistor on
Impact of threshold voltage
- Higher Vt: slower switching speed, lower leakage
- Lower Vt: faster switching speed, higher leakage
Actual physics is more complex, but this will do for now!
7
Charge as a State Variable
For computation we must be able to identify whether each variable (a, b, c, x, y) is in a '1' or a '0' state. We could have used any physical quantity to do that: voltage, current, electron spin, orientation of a magnetic field, etc.
All nodes have some capacitance associated with them. We choose voltage to distinguish between a '0' and a '1':
- Logic 1: capacitor is charged
- Logic 0: capacitor is discharged
8
Abstracting Energy Behavior
How can we abstract energy consumption for a digital device? Consider the energy cost of charge transfer.
- Each transistor is modeled as an on/off resistance
- Each node is modeled as an output capacitance
(Figure: CMOS inverter and its switch-level RC model)
9
Switch from one state to another
To perform computation, we need to switch from one state to another.
- Logic 0: discharge the node capacitor - connect it to GND through an ON NMOS
- Logic 1: charge the node capacitor - connect it to Vdd through an ON PMOS
The logic dictates whether a node capacitor will be charged or discharged.
10
Power vs. Energy
(Figure: two power-vs-time profiles with the same area under the curve, i.e., the same energy)
- Power is the rate of expenditure of energy: one joule/sec = one watt
- Energy is the area under the power-time curve
- Both profiles use the same amount of energy, delivered at different rates (power)
11
Dynamic Power vs. Dynamic Energy
Dynamic power: consider the rate at which switching (energy dissipation) takes place.
(Figure: input waveform to a CMOS inverter with period T; output capacitor CL charging and discharging; supply current iDD)
Activity factor α = fraction of total capacitance that switches each cycle
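For reference, the standard first-order relations behind this slide (a sketch in conventional notation; C_L is the switched load capacitance, α the activity factor, f the clock frequency, V_dd the supply voltage):

E_{\text{switch}} = \tfrac{1}{2}\, C_L V_{dd}^{2} \quad \text{(energy dissipated per output transition)}

P_{\text{dyn}} = \alpha\, C_L V_{dd}^{2}\, f \quad \text{(average dynamic power)}

The quadratic dependence on V_dd is why voltage scaling is the most effective dynamic-power lever.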
12
Energy-Delay Interaction
(Figure: energy and delay vs. VDD; their product, the Energy-Delay Product (EDP), has a minimum that is the target of optimization)
Delay decreases with increasing supply voltage, but energy/power increases.
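A commonly used first-order model of this trade-off (a sketch; the alpha-power exponent α_v is typically between 1 and 2):

t_d \propto \frac{C_L V_{dd}}{(V_{dd} - V_t)^{\alpha_v}}, \qquad E \propto C_L V_{dd}^{2}, \qquad \mathrm{EDP} = E \cdot t_d

Raising V_dd shrinks t_d but grows E roughly quadratically, so the EDP curve has a minimum in between; that minimum is the usual optimization target.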
13
Static Power
Technology scaling has caused transistors to become smaller and smaller. As a result, static power has become a substantial portion of total power.
Main leakage components (figure: transistor cross-section with gate, source, drain):
- Gate leakage
- Junction leakage
- Sub-threshold leakage
14
Static Energy-Delay Interaction
(Figure: leakage and delay vs. threshold voltage Vth; transistor cross-section with source, drain, gate, tox, L)
- Static energy increases exponentially as the threshold voltage decreases
- Delay increases with threshold voltage
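The exponential dependence comes from sub-threshold conduction. A standard first-order expression (a sketch; n is a process-dependent slope factor and V_T = kT/q is the thermal voltage):

I_{\text{sub}} \propto e^{(V_{gs} - V_t)/(n V_T)} \left(1 - e^{-V_{ds}/V_T}\right)

With V_gs = 0 for an "off" transistor, leakage grows exponentially as V_t is lowered, and it also grows with temperature through V_T.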
15
Higher Level Blocks
(Figure: CMOS gate schematics built from PMOS pull-up and NMOS pull-down networks, with inputs A, B, C)
16
Temperature Dependence
As temperature increases, static power increases.¹
(Figure: normalized leakage current vs. temperature; leakage also depends on the technology, supply voltage, and number of transistors)
¹ J. Butts and G. Sohi, "A Static Power Model for Architects," MICRO 2000.
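The cited model is commonly summarized in the following form (treat the exact formula as an assumption here, included only to show how the factors above combine):

P_{\text{static}} \approx V_{cc} \cdot N \cdot k_{\text{design}} \cdot \hat{I}_{\text{leak}}

where V_cc is the supply voltage, N the number of transistors, k_design a design-dependent factor, and Î_leak a technology-dependent normalized leakage current (itself a strong function of temperature and threshold voltage).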
17
The World Today
- Yesterday: scaling to minimize time (maximize frequency F)
- Maximum performance (minimum time) is too expensive in terms of power
- Today: trade/balance performance for power efficiency
18
Technology Factors Affecting Power
- Transistor size: affects capacitance (CL)
- Rise and fall times (delay): affect short-circuit power (not covered in this course)
- Threshold voltage: affects leakage power
- Temperature
- Switching activity: frequency (F) and number of switching transistors (the activity factor)
(Figure: CMOS inverter)
19
Low Power Design: Options?
Reduce Vdd
- Increases gate delay - note that this reduces the operating frequency of the processor!
- Compensate by reducing the threshold voltage? That increases leakage power.
Reduce frequency
- Computation takes longer to complete
- Consumes more energy (but less power) if the voltage is not scaled
20
Example: AMD Trinity A10-5800 APU (100 W TDP)

CPU P-state        Voltage (V)   Freq (MHz)
HW only (boost)
  Pb0              1.000         2400
  Pb1              0.875         1800
SW-visible
  P0               0.825         1600
  P1               0.812         1400
  P2               0.787         1300
  P3               0.762         1100
  P4               0.750          900
21
Optimizing Power vs. Energy
- Maximize battery life: minimize energy
- Respect thermal envelopes: minimize peak power
Example: should we reduce the clock frequency by 2 (reduces dynamic power) or reduce both voltage and frequency by 2 (reduces both static and dynamic power)? Let dynamic power and static power each be P, and the original execution time be T.
- Option 1 (halve frequency only): (P/2 + P) * 2T = 3PT
- Option 2 (halve voltage and frequency): (P/8 + P/2) * 2T = 1.25PT, but the computation takes twice as long
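A minimal sketch of the arithmetic behind these two options (assuming dynamic power scales as V^2 * f, static power scales linearly with V, and execution time scales as 1/f):

```python
# Energy comparison for two DVFS options, relative to a baseline with
# dynamic power P, static power P, and execution time T (all set to 1.0).
P_DYN, P_STAT, T = 1.0, 1.0, 1.0

def energy(v_scale, f_scale):
    """Total energy after scaling voltage and frequency.

    Assumes: dynamic power ~ V^2 * f, static power ~ V, time ~ 1/f.
    """
    dyn = P_DYN * v_scale**2 * f_scale
    stat = P_STAT * v_scale
    time = T / f_scale
    return (dyn + stat) * time

baseline = energy(1.0, 1.0)   # (P + P) * T      = 2.0  -> 2PT
option1  = energy(1.0, 0.5)   # (P/2 + P) * 2T   = 3.0  -> 3PT
option2  = energy(0.5, 0.5)   # (P/8 + P/2) * 2T = 1.25 -> 1.25PT

print(baseline, option1, option2)  # 2.0 3.0 1.25
```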
22
What About Wires?
- Lumped RC model: resistance per unit length, capacitance per unit length
- We will not directly address delay or energy expended in the interconnect in this class
- Simple architecture model: lump the wire energy/power with the source component
23
Power Management Basics
24
Parallelism and Power
(Die photos: IBM Power5 [source: IBM]; AMD Trinity [source: forwardthinking.pcmag.com])
- How much of the chip area is devoted to compute?
- Run many cores slower. Why does this reduce power?
25
The Power Wall
- Power per transistor scales with frequency and also with Vdd
- Lower Vdd can be compensated for with increased pipelining to keep throughput constant
- Power per transistor is not the same as power per unit area - power density is the problem!
- Multiple units can be run at lower frequencies to keep throughput constant, while saving power
26
What is the Problem?
(Figure: projected power density, based on scaling using Pentium-class cores; Mukhopadhyay and Yalamanchili, 2009)
- While Moore's Law continues, the scaling phenomena have changed
- Power densities are increasing with each generation - "dark silicon"
27
ITRS Roadmap for Logic Devices
From: "ExaScale Computing Study: Technology Challenges in Achieving Exascale Systems," P. Kogge, et al., 2008
28
What are my Options?
Better technology
- Manufacturing
- Better devices (FinFET)
- New, non-CMOS devices? This is the future
Be more efficient - activity management
- Clock gating - dynamic energy/power
- Power gating - static energy/power
- Power state management - both
Improved architecture
- Simpler pipelines
- Parallelism
29
Activity Management
Clock gating (figure: gated clock driving a block of combinational logic)
- Turn off the clock to a block of logic
- Eliminates unnecessary transitions/activity and clock-distribution power
Power gating (figure: power-gate transistor between Vdd and a core; Core 0, Core 1)
- Turn off power to a block of logic, e.g., a core
- Eliminates leakage in the gated block
30
Multiple Voltage Frequency Domains
Intel Sandy Bridge processor
- Cores and ring in one DVFS domain
- Graphics unit in another DVFS domain
- Cores and a portion of the cache can be gated off
From E. Rotem et al., Hot Chips 2011
31
Processor Power States
Performance states - P-states
- Operate at different voltage/frequency points (recall the delay-voltage relationship)
- Lower voltage -> lower leakage
- Lower frequency -> lower power (not the same as energy!)
- Lower frequency -> longer execution time
Idle states - C-states
- Sleep states; differ in how much state is saved
Transitions between states are managed by SW or HW!
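To make the distinction concrete, here is a minimal, hypothetical governor sketch (the state tables and thresholds are illustrative, not from any real driver): it picks a C-state when a core is idle and a P-state based on recent utilization.

```python
# Hypothetical P-state / C-state selection sketch (illustrative values only).
P_STATES = {  # name: (voltage V, frequency MHz)
    "P0": (0.825, 1600), "P2": (0.787, 1300), "P4": (0.750, 900),
}

def select_state(utilization, expected_idle_us):
    """Pick a power state for one core.

    utilization: fraction of the last interval the core was busy (0..1).
    expected_idle_us: predicted length of the upcoming idle period.
    """
    if utilization == 0.0:
        # Idle: pick a deeper C-state only if the idle period amortizes its exit latency.
        if expected_idle_us > 1000:
            return "C6"
        return "C3" if expected_idle_us > 100 else "C1"
    # Busy: scale frequency (and voltage) with demand.
    if utilization > 0.8:
        return "P0"
    return "P2" if utilization > 0.4 else "P4"

state = select_state(0.9, 0)
if state in P_STATES:
    volts, mhz = P_STATES[state]
    print(state, volts, mhz)       # P0 0.825 1600
print(select_state(0.0, 5000))     # C6
```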
32
Example of P-states: AMD Trinity A10-5800 APU (100 W TDP)
- Software-managed power states
- Changing power states is not free

CPU P-state        Voltage (V)   Freq (MHz)
HW only (boost)
  Pb0              1.000         2400
  Pb1              0.875         1800
SW-visible
  P0               0.825         1600
  P1               0.812         1400
  P2               0.787         1300
  P3               0.762         1100
  P4               0.750          900
33
Example of P-states (figure)
34
Management Knobs
Each core can be in any one of multiple states
- How do I decide what state to set for each core? Who decides - HW or SW?
- How do I decide when I can turn off a core?
- What am I saving - static energy or dynamic energy?
35
Power Management
Software-controlled power management
- Optimize power and/or energy
- Orchestrated by the operating system or application libraries
- Industry-standard interfaces, e.g., the Advanced Configuration and Power Interface (ACPI)
Hardware power management
- Optimize power/energy
- Failsafe operation, e.g., protection against thermal emergencies
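As a concrete illustration of OS-level control, on a typical Linux system the cpufreq subsystem exposes P-state-like knobs through sysfs (a sketch; exact paths and availability depend on the kernel and driver, and writing limits or governors requires root):

```python
# Query per-core frequency-scaling settings via Linux cpufreq sysfs.
from pathlib import Path

def cpufreq_info(cpu=0):
    """Read a few cpufreq attributes for one CPU, if present."""
    base = Path(f"/sys/devices/system/cpu/cpu{cpu}/cpufreq")
    info = {}
    for name in ("scaling_governor", "scaling_cur_freq",
                 "scaling_min_freq", "scaling_max_freq"):
        f = base / name
        if f.exists():
            info[name] = f.read_text().strip()
    return info

if __name__ == "__main__":
    # e.g. {'scaling_governor': 'schedutil', 'scaling_cur_freq': '1600000', ...}
    print(cpufreq_info(0))
```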
36
Boosting (Intel Sandy Bridge)
- Exploit package physics: temperature changes on the order of milliseconds
- Use the thermal headroom
(Figure: power vs. time - turbo-boost region above TDP sustained for 10s of seconds; low-power periods build up thermal credits)
37
Intel Sandy Bridge Processor
Power gating
- Turn off components that are not being used
- Lose all state information
- Costs of powering down; costs of powering up
- Smart shutdown: models to guide decisions
38
Parallelism: concurrency + lower frequency -> greater energy efficiency
Example (figure: one core + cache vs. four cores + caches): 4X #cores, 0.75x voltage, 0.5x frequency -> ~1X power, 2X performance
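A quick sketch of the arithmetic behind this example (assuming dynamic power ~ N * C * V^2 * f and throughput ~ N * f, with perfect parallel scaling and static power ignored):

```python
# Power/performance scaling for the many-slower-cores example.
def relative_power(n_cores, v_scale, f_scale):
    # Dynamic power ~ number of cores * V^2 * f (capacitance per core fixed).
    return n_cores * v_scale**2 * f_scale

def relative_perf(n_cores, f_scale):
    # Throughput ~ number of cores * frequency (perfect parallel scaling assumed).
    return n_cores * f_scale

print(relative_power(4, 0.75, 0.5))  # 1.125  (~1X power)
print(relative_perf(4, 0.5))         # 2.0    (2X performance)
```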
39
Simplify Core Design
(Die photos: AMD Bulldozer core vs. ARM A7 core, arm.com)
- Support for branch prediction, schedulers, etc. consumes more energy per instruction
- Many more simpler cores can fit on a die
40
Metrics
Power efficiency
- MIPS/watt, Ops/watt
Energy efficiency
- Joules/instruction, Joules/op
Composite
- Energy-delay product (EDP), Energy-delay^2 (ED^2)
Why are these useful?
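A minimal sketch of how these metrics are computed from measured quantities (the numbers in the example are illustrative only):

```python
# Compute common efficiency metrics from measured energy, time, and work done.
def metrics(energy_j, time_s, instructions):
    power_w = energy_j / time_s
    return {
        "MIPS/watt": (instructions / time_s / 1e6) / power_w,
        "Joules/instruction": energy_j / instructions,
        "EDP (J*s)": energy_j * time_s,
        "ED^2 (J*s^2)": energy_j * time_s**2,
    }

# Example: 10 J consumed over 2 s while executing 5e9 instructions.
print(metrics(10.0, 2.0, 5e9))
```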
41
Thermal Issues
42
Thermal Issues
- Heat can cause damage to the chip: need failsafe operation
- Thermal fields change the device's physical characteristics: leakage current (and therefore power) increases, delay increases, and device degradation worsens
- The cooling solution determines the permitted power dissipation
43
Thermal Design Power (TDP)
- The maximum power at which the part is designed to operate
- Dictates the design of the cooling system (max junction temperature Tjmax)
- Typically fixed by a worst-case workload
- Parts typically operate below the TDP: opportunities for turbo mode?
(Figure: AMD Trinity APU)
44
Heat Sink Limits on Performance
Thermal design power (TDP)
- Determines the cooling solution and package limits
- Performance depends on effective utilization of this thermal headroom
(Figure: workload temperature vs. time below the max die temperature, leaving thermal headroom; instructions/cycle and power over time, with boost power above TDP)
Convert thermal headroom to higher performance through boosting (HW boost states, SW-visible states)
45
Trinity TDP (figure)
46
Coordinated Energy Management in Heterogeneous Processors SC13
Indrani Paul (1,2), Vignesh Ravi (1), Srilatha Manne (1), Manish Arora (1,3), Sudhakar Yalamanchili (2)
1 Advanced Micro Devices, Inc.  2 Georgia Institute of Technology  3 University of California, San Diego
47
Goal: Optimize energy efficiency under power and performance constraints in a heterogeneous processor
Outline:
- Problem
- State-of-the-Art Power Management
- HPC Application Characteristics and Frequency Sensitivity
- Run-time Coordinated Energy Management
- Results
We examine the relationships between HPC and Exascale workloads and power efficiency in a tightly coupled, state-of-the-art heterogeneous processor consisting of a set of CPU and GPU cores. Since HPC applications are mostly uncompromising in performance, our goal is to optimize energy efficiency in this architecture model under fixed performance requirements. We first look at state-of-the-art power-management algorithms and their potential inefficiencies. We then examine time-varying HPC workload characteristics and demonstrate phase-behavior considerations in a CPU-GPU architecture that must be taken into account for effective coordinated execution. We then propose a model of frequency sensitivity in such architectures, i.e., the return in performance for additional power to the CPU and GPU. We then propose a dynamic coordinated energy-management approach called DynaCo to allocate power dynamically between the CPU and GPU based on their sensitivities. Lastly, we present a subset of the results from the paper.
48
State-of-the-art Heterogeneous processor
Accelerated processing unit (APU)
- Shared Northbridge access to overlapping CPU/GPU physical address spaces
- Graphics processing unit (GPU): 384 AMD Radeon™ cores
- Multi-threaded CPU cores
- Many resources are shared between the CPU and GPU, for example the memory hierarchy, power, and thermal capacity
Trinity contains two dual-core x86 modules or compute units (CUs) and Radeon™ GPU cores, along with miscellaneous other logic components such as a Northbridge and a Unified Video Decoder (UVD). Each CU is composed of two out-of-order cores that share the front end and floating-point units, and each CU is paired with a 2MB L2 cache shared between its cores. The GPU consists of 384 Radeon™ cores, each capable of one single-precision fused multiply-add (FMAC) operation per cycle, organized as six SIMD units, each containing sixteen processing units that are 4-way VLIW. The memory controller is shared between the CPU and the GPU.
49
OpenCL™ or other Software Stack
Programming model
(Figure: software stack - user application with host tasks and GPU tasks, OpenCL™ or other software stack, operating system, APU hardware with CPU and GPU; each OpenCL kernel executes over an N-dimensional range)
- Grid of threads, each operating over a data partition
- Coupled programming model: offload compute-intensive tasks to the GPU
The GPU follows a data-parallel model in which many threads operate on different data partitions. The CPU can act as a feeder to the GPU by offloading computation, or it can share the computation with the GPU. So we not only have sharing from the tightly coupled hardware, we also have sharing from the coupled programming model (host threads, command queues, kernel launches): performance itself is a shared entity. Today the CPU mostly feeds the GPU (performance coupling), but in the future the CPU and GPU will increasingly perform balanced, concurrent computation; the APU model reduces offload latency and makes such shared computation easier.
50
CPU-GPU Phase behavior in an Exascale Proxy Application (Lulesh)
- CPU-GPU coupled execution -> time-varying redistribution of compute intensity
- Energy-efficient operation -> coordinated distribution of power to CPU vs. GPU
- Coordinated power states -> sensitivity of performance to CPU and GPU power state (frequency)
- Need to characterize ROI: return (performance) on investment (power)
The relationship is not obvious; to set these devices properly we need to understand what is going on. This is the problem statement: how to decide when to allocate power to each component, and how much performance additional frequency will deliver (frequency sensitivity).
51
Challenge: CPU-GPU Coupling effects
(Figure: user application with host tasks and GPU tasks - direct performance coupling; indirect performance coupling through shared resources; coordinated energy management under a performance constraint)
- Interactions are non-trivial
- Power efficiency and performance are related: performance can be traded off for power efficiency, but the coupling interactions make this harder. Without understanding them, targeting a power-efficiency goal can hurt performance.
- Coordinated energy management must be aware of these coupling effects so the same performance can be delivered at higher power efficiency
- HPC applications have uncompromising performance requirements! Need more efficient energy management
52
State of the Art Power Management
53
State-of-the-art: Bi-directional application power management (BAPM)
The chip is divided into BAPM-controlled thermal entities (TEs): GPU TE, CU0 TE, CU1 TE
Power management algorithm:
1. Calculate a digital estimate of power consumption
2. Convert power to temperature using an RC-network model of heat transfer
3. Assign new power budgets to TEs based on temperature headroom
4. TEs locally control (boost) their own DVFS states to maximize performance
A thermal entity is defined by what BAPM manages and controls, not by what generates heat. This allows boosting and exceeding TDP for short periods of time; DVFS states are allocated based on thermal headroom.
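A minimal, hypothetical sketch of this style of control loop (the RC constants, budgets, and proportional-split policy are illustrative, not AMD's actual parameters, and inter-TE thermal coupling is ignored):

```python
# Illustrative BAPM-style loop: power estimate -> RC thermal model -> budgets.
T_AMBIENT, T_MAX = 45.0, 100.0   # deg C (illustrative)
R_TH, C_TH, DT = 1.5, 0.4, 0.01  # thermal resistance/capacitance per TE, time step (s)

def rc_step(temp, power):
    """One explicit-Euler step of a lumped RC thermal model for one entity."""
    dT = (power - (temp - T_AMBIENT) / R_TH) / C_TH
    return temp + dT * DT

def assign_budgets(temps, total_budget):
    """Split the package power budget in proportion to each TE's thermal headroom."""
    headroom = {te: max(T_MAX - t, 0.0) for te, t in temps.items()}
    total = sum(headroom.values()) or 1.0
    return {te: total_budget * h / total for te, h in headroom.items()}

temps = {"CU0": 60.0, "CU1": 58.0, "GPU": 55.0}
power_est = {"CU0": 12.0, "CU1": 10.0, "GPU": 20.0}   # digital power estimates (W)
temps = {te: rc_step(temps[te], power_est[te]) for te in temps}
print(assign_budgets(temps, total_budget=45.0))
```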
54
Power Management
Performance and energy efficiency depend on effective utilization of power and thermal headroom.
- CPU DVFS states: HW-only (boost) Pb0, Pb1; SW-visible P0, P1, P2, ..., Pmin
- GPU DVFS states: HW-only High, Medium, Low
(Figure: APU die temperature vs. time below the max die temperature, leaving thermal headroom; instructions/cycle vs. time)
Convert thermal headroom to higher performance through boost (HW boost states, SW-visible states).
55
Key observations
- Overall application performance is a function of both the CPU and the GPU
- State of the practice: manage to thermal limits by locally boosting when power and thermal headroom are available -> utilize all of the available headroom
- Pitfall: boosting may not lead to a proportional performance improvement -> energy inefficient
- Need a concept of performance sensitivity to power states
56
Application Characteristics
57
Frequency Sensitivity of GPU Kernels
Some kernels are more sensitive to GPU frequency than others. This creates opportunities for DVFS management to improve energy efficiency: running insensitive kernels at lower frequency is more power efficient.
58
Sensitivity of GPU Kernel Execution to CPU Frequency
- Some kernels are more tightly coupled to the CPU's performance
- Smaller kernels, such as Comm, have high overheads in launching and feeding the GPU
59
Sensitivity to Shared-Resource Interference
Example: miniMD - Neighbor kernel
- Performance is actually limited by GPU memory demand
- Power management locally boosts the CPU to the highest DVFS states -> wasted energy, power inefficient
- Need online estimates of sensitivity to interference
60
Computation and Control Divergence
Example: graph algorithm - BFS
- GPU_freq_sensitivity: performance gain per unit frequency increase
- GPU_ALUBusy%: measured hardware compute utilization
- Control divergence -> increased thread serialization -> increased frequency sensitivity
61
Key Observations
HPC applications exhibit varying degrees of CPU and GPU frequency sensitivity due to:
- Control divergence
- Interference at shared resources
- Performance coupling between the CPU and GPU
Efficient energy management requires metrics that can predict frequency sensitivity (to power) in heterogeneous processors. Sensitivity metrics drive the coordinated setting of CPU and GPU power states.
62
Energy Management
63
Performance Metrics for APU Frequency Sensitivity
We take the available performance counters and perform correlation analysis to understand which ones are best. A linear regression model over these metrics computes measures of:
- GPU compute
- Interference (shared resources)
- CPU compute / performance coupling
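A hypothetical sketch of what such a model looks like (the counter names and coefficients here are placeholders for illustration, not the paper's fitted values):

```python
# Hypothetical linear frequency-sensitivity model from performance counters.
# Coefficients would normally be fit offline by linear regression.
COEFFS = {"gpu_alu_busy": 0.6, "mem_bw_util": -0.3, "cpu_ipc": 0.2}
INTERCEPT = 0.1

def gpu_freq_sensitivity(counters):
    """Predicted performance gain per unit GPU frequency increase, clamped to [0, 1]."""
    s = INTERCEPT + sum(COEFFS[k] * counters[k] for k in COEFFS)
    return max(0.0, min(1.0, s))

sample = {"gpu_alu_busy": 0.8, "mem_bw_util": 0.9, "cpu_ipc": 0.5}
print(gpu_freq_sensitivity(sample))  # compute-heavy but memory-limited -> moderate sensitivity
```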
64
DynaCo: Run-time System for Coordinated Energy Management
Pipeline: performance-metric monitor -> CPU-GPU frequency-sensitivity computation -> CPU-GPU power-state decision

GPU frequency sensitivity   CPU frequency sensitivity   Decision
High                        Low                         Shift power to GPU
High                        High                        Proportional power allocation
Low                         High                        Shift power to CPU
Low                         Low                         Reduce power of both CPU and GPU

- DynaCo-1levelTh: lowest CPU DVFS state limited to P2
- DynaCo-multilevelTh: lowest CPU DVFS state allowed to go down to Pmin, based on the degree of performance coupling
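A minimal sketch of the decision step described above (the threshold and the power-shift wording are illustrative assumptions, not the paper's exact implementation):

```python
# Illustrative DynaCo-style power-state decision from sensitivity estimates.
HIGH = 0.5  # sensitivity threshold (illustrative)

def dynaco_decision(gpu_sens, cpu_sens):
    gpu_hi, cpu_hi = gpu_sens >= HIGH, cpu_sens >= HIGH
    if gpu_hi and not cpu_hi:
        return "shift power to GPU (lower CPU P-state, raise GPU DVFS)"
    if gpu_hi and cpu_hi:
        return "proportional power allocation"
    if not gpu_hi and cpu_hi:
        return "shift power to CPU (raise CPU P-state, lower GPU DVFS)"
    return "reduce power of both CPU and GPU"

print(dynaco_decision(0.8, 0.2))  # shift power to GPU ...
```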
65
Key Observations: Coordinated CPU-GPU Execution
- A linear combination of three key high-level performance metrics is proposed to model APU frequency-sensitivity behavior
- A run-time coordinated energy-management scheme, DynaCo, manages CPU and GPU DVFS states dynamically based on measured frequency sensitivities
66
Experimental Set-Up: Trinity A10-5800 APU (100 W TDP)
- CPU: managed by HW or SW
- GPU: managed by sending software messages through the driver layer
- DynaCo implemented as a run-time software policy overlaid on top of BAPM in real hardware

GPU P-state   Freq (MHz)
GPU-high      800
GPU-med       633
GPU-low       304

CPU P-state        Voltage (V)   Freq (MHz)
HW only (boost)
  Pb0              1.000         2400
  Pb1              0.875         1800
SW-visible
  P0               0.825         1600
  P1               0.812         1400
  P2               0.787         1300
  P3               0.762         1100
  P4               0.750          900
67
Benchmarks

BM           Problem size
miniMD       32 x 32 x 32 elements
miniFE       100 x 100 x 100 elements
Lulesh
Sort         2,097,152 elements
Stencil2D    4,096 x 4,096 elements
S3D          SHOC default for integrated GPU
BFS          1,000,000 nodes
68
Energy Efficiency (ED² product)
- Significant opportunities for power management under performance constraints in a heterogeneous architecture
- Varying power-performance requirements for kernels on both the CPU and GPU
- DynaCo adapts to varying compute and memory demands at kernel granularity and even within a kernel (e.g., miniFE, where memory bandwidth saturates about 80% of the time the MATVEC kernel executes)
- Energy is saved on both the CPU and GPU; more of the savings came from the CPU
- Average energy-efficiency improvement of 24% and 30% with DynaCo-1levelTh and DynaCo-multilevelTh, respectively
69
Execution Time Impact
- Worst-case performance loss is in miniMD: 2%-4% loss due to fine-grained phase behavior shorter than the monitoring and control intervals
- Average performance slowdown of 0.78% and 1.61% with DynaCo-1levelTh and DynaCo-multilevelTh, respectively
70
Power Savings Average power savings of 24% and 31% with DynaCo-1levelTh and DynaCo-multilevelTh respectively
71
Conclusions
- Shared-resource interference, control divergence, and performance coupling all affect energy management for HPC applications
- Frequency sensitivity is important for characterizing energy behavior in a tightly coupled heterogeneous architecture
- Dynamic power shifting: move power to the entity that can best utilize it
Two key points:
- Managing power and performance in a heterogeneous processor is critical for coordinated execution
- We have an algorithm that accounts for coordinated execution and frequency sensitivities to manage the CPU and GPU dynamically
72
Cooperative Boosting: Needy versus Greedy Power Management
Indrani Paul (1,2), Srilatha Manne (1), Manish Arora (1,3), W. Lloyd Bircher (1), Sudhakar Yalamanchili (2) - June 2013
1 Advanced Micro Devices, Inc.  2 Georgia Institute of Technology  3 University of California, San Diego
73
Goal & Outline
Goal: Optimize performance under power and thermal constraints in a heterogeneous architecture
Outline:
- State-of-the-Art Power and Thermal Management
- Thermal Coupling
- Performance Coupling
- Cooperative Boosting
- Results
We examine the interaction between thermal-management techniques and power boosting in a state-of-the-art heterogeneous processor consisting of a set of CPU and GPU cores. Our goal is to optimize performance under power and thermal constraints in this architecture model. We first look at state-of-the-art power and thermal management algorithms. We then demonstrate that boost algorithms that greedily boost performance based on available thermal headroom can degrade performance due to their interaction with thermal coupling. We examine the causes of this behavior and explain the interaction between thermal coupling, performance coupling, and workload behavior. Then we propose a dynamic power-management approach called cooperative boosting (CB) to allocate power dynamically between the CPU and GPU in a manner that balances thermal coupling against the needs of performance coupling, to optimize performance under a given thermal constraint.
74
State-of-the-art Heterogeneous processor
Accelerated processing unit (APU)
- Shared Northbridge access to overlapping CPU/GPU physical address spaces
- Graphics processing unit (GPU): 384 AMD Radeon™ cores
- Multi-threaded CPU cores
- Many resources are shared between the CPU and GPU, for example the memory hierarchy, power, and thermal capacity
(Trinity organization as described earlier: two dual-core x86 compute units, each with a shared 2MB L2 cache; a 384-core Radeon™ GPU organized as six 16-wide, 4-way VLIW SIMD units; shared Northbridge and memory controller.)
75
What is Thermal Design Power?
Thermal design power (TDP)
- Upper bound for the sustainable power draw
- Determines the cooling solution and package limits
- Usually set by determining a worst-case execution profile
- Performance depends on effective utilization of thermal headroom
(Figure: instructions/cycle vs. time)
TDP means the die must be kept under a particular temperature, and the power associated with that temperature must be supported. Two things determine how power is allocated: (1) the power supply - how much it can deliver - and (2) temperature. In a homogeneous chip, assigning power behaves similarly everywhere; in a heterogeneous system, how power translates to heat differs between the CPU and GPU - it depends on the type of component and on what surrounds it. For a time-varying workload, performance is sometimes proportional to power and sometimes constant, so there is a relationship between the workload and the available thermal headroom. TDP is assumed to be proportional to maximum performance, but the worst-case execution profile is not easy to define for an APU; when the workload is not at that profile, there is headroom to exploit.
76
Key Observations
- Power and thermals are shared resources in a heterogeneous processor -> thermal coupling
- Overall application performance is a function of both the CPU and the GPU -> performance coupling
- State of the practice: manage to thermal limits by locally boosting when thermal headroom is available -> utilize all of the headroom!
77
Thermal Coupling
78
Thermal Signatures: CPU & GPU
Steady-state thermal fields produced by BAPM on a 19W AMD Trinity APU:
- High-power CPU benchmark, idle GPU: worst-case CPU power 18.8 W
- High-power GPU benchmark: worst-case GPU power 19.7 W
Observations:
- A watt on the GPU is not a watt on the CPU - heterogeneity in physical properties. SIMD structures vs. complex out-of-order cores create very different thermal signatures and thermal densities; power distribution is much wider on the GPU, which can act as a thermal heat sink.
- The higher thermal density of the CPU produces steeper thermal gradients, so the CPU consumes thermal headroom faster than the GPU; the GPU can sustain higher power consumption.
- Thermal signatures, thermal pollution, and power-supply current limits are the three primary things to manage.
79
Thermal Time Constant
- Idle GPU temperature rose by ~20°C: a significant rise in the temperature of the idle component due to thermal coupling and pollution from the active components within a die
- The CPU consumes thermal headroom more rapidly (4X faster)
- The GPU can sustain higher power boosts for longer
80
Thermal Coupling: Headroom Availability
Temperature rises and reaches steady state, leaving thermal headroom available.
81
Thermal Coupling: Consumption of Thermal Headroom
- GPU temperature rises; CU0 is at the corner of the die
- 6°C rise in GPU temperature once the CPU power limit was removed and both CUs were allowed to boost
82
Thermal Coupling: Thermal Throttling
- 6°C rise in GPU temperature once the CPU power limit was removed and both CUs were allowed to boost (note the gradient between CU0/CU1 and the GPU acting as a thermal heat sink)
- All the concepts introduced earlier manifest in this measurement
- We need better management that is cognizant of thermal-coupling effects
- Minimize the detrimental effects of thermal coupling by capping the maximum CPU P-state -> P-state limiting
83
Residency in Different Power States
(Figure: power-state residency under BAPM, with the max CPU DVFS state capped at P2 and capped at P4)
There is a dependency between the CPU and GPU: when CPU power is statically lowered, we see less throttling. If GPU performance is needed, going to P4 works great; however, if CPU performance is needed, then we need to balance the power.
84
Key Observations
- Thermal signatures differ between the CPU and GPU: heterogeneity in physical properties
- High thermal density leads to faster consumption of thermal headroom in the CPU cores
- Significant thermal coupling from active to idle components
- Near the thermal limit, boosting based on available thermal headroom introduces inefficiencies -> reduce the CPU P-state limit
85
Performance Coupling
86
OpenCL™ or other Software Stack
Programming model
(Figure: software stack - user application with host tasks and GPU tasks, OpenCL™ or other software stack, operating system, APU hardware with CPU and GPU; each OpenCL kernel executes over an N-dimensional range)
- Grid of threads, each operating over a data partition
- Coupled programming model: offload compute-intensive tasks to the GPU
As discussed earlier, the coupled programming model makes performance a shared entity between the CPU and GPU: the CPU feeds the GPU today, and emerging balanced workloads split the computation between them.
87
Managing thermals for performance-coupled applications
88
Managing thermals for performance-coupled applications
89
Managing thermals for performance-coupled applications
90
P-state Sensitivity
- Not all programs suffer from a reduced CPU P-state limit; in the case of Needle, performance improves continuously
- The CPU is not fast enough to make full use of the GPU
91
Determining Critical CPU P-state
- Find the inflection point in performance as a function of CPU P-state -> the critical P-state
- The critical P-state is determined by interference (CPU vs. GPU) in the memory system
The indirect effect on shared resources is a critical factor, so the critical CPU P-state limit has to be determined. An increase in GPU memory bandwidth indicates the kind of performance-coupling requirement the GPU has, and CPU IPC is a proxy for CPU performance; we also look at CPU IPC to make sure performance is not completely controlled by the GPU.
92
Key Observations
- Performance coupling: CPU-GPU performance dependency
- Balance the detrimental effects of thermal coupling against the needs of performance coupling
- The critical CPU P-state limit is determined by both performance coupling and thermal coupling
- GPU memory-bandwidth gradients as a function of CPU frequency, along with CPU IPC, serve as a measure of performance coupling
93
Cooperative Boosting
94
Cooperative Boosting (CB)
- Overlaid on top of BAPM; invoked periodically when thermal coupling becomes detrimental, i.e., when the thermal limit is approached
- Dynamic algorithm that monitors performance requirements; control is invoked only when BAPM reaches the thermal limit, and then temperature is managed holistically by changing the CPU P-state limit dynamically
- Monitors peak die temperature, per-core IPC, and memory bandwidth
- Reduces/increases the highest-frequency P-state limit until the optimal bandwidth is reached
- Disables the control part of CB on a phase change, as evidenced by a high-CPU-IPC phase; adjusts for CPU-centric workloads with relatively high IPC (> 0.4)
- In this design the control interval is 500 ms, set by the thermal RC constant (see the sketch below)
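A minimal, illustrative sketch of the control loop described above (the temperature threshold, P-state ordering, and sensor interfaces are assumptions for illustration, not the paper's exact implementation):

```python
# Illustrative Cooperative-Boosting-style control step (run every 500 ms).
P_STATES = ["Pb0", "Pb1", "P0", "P1", "P2", "P3", "P4"]  # fastest -> slowest

def cb_step(limit_idx, peak_temp, cpu_ipc, mem_bw, prev_mem_bw,
            t_limit=71.0, ipc_cpu_centric=0.4):
    """Return the new index into P_STATES for the maximum allowed CPU P-state."""
    if cpu_ipc > ipc_cpu_centric:
        return 0                      # CPU-centric phase: allow full boost
    if peak_temp < t_limit:
        return limit_idx              # below the thermal limit, leave BAPM alone
    if mem_bw > prev_mem_bw and limit_idx < len(P_STATES) - 1:
        return limit_idx + 1          # lowering the CPU P-state still helps GPU bandwidth
    if limit_idx > 0:
        return limit_idx - 1          # bandwidth stopped improving: give power back to CPU
    return limit_idx

idx = 0
idx = cb_step(idx, peak_temp=74.0, cpu_ipc=0.2, mem_bw=30.0, prev_mem_bw=28.0)
print(P_STATES[idx])  # e.g. "Pb1"
```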
95
Experimental Set-Up: Trinity A8-4555M APU (19 W TDP)
- CPU: managed by HW or SW
- GPU: managed by HW only (GPU-high: 423 MHz, GPU-med: 320 MHz)
- Cooperative Boosting implemented as a system-software policy overlaid on top of BAPM in real hardware

P-state            Voltage (V)   Freq (MHz)
HW only (boost)
  Pb0              1.000         2400
  Pb1              0.875         1800
SW-visible
  P0               0.825         1600
  P1               0.812         1400
  P2               0.787         1300
  P3               0.762         1100
  P4               0.750          900

Pb0 and Pb1 cannot be sustained continuously, only when certain parts are idle or power-gated; P0 is the highest sustainable state.
96
Benchmarks

BM (Description)                    Problem size                                    Type
NDL (Needleman-Wunsch)              4096x4096 data points, 1K iterations            Performance-coupled
HS (HotSpot)                        1024x1024 data points, 100K iterations          Performance-coupled
BF (BoxFilter SAT)                  1Kx1K input image, 6x6 filter, 10K iterations   Performance-coupled
FAH (Folding at Home)               Synthesis of large protein: spectrin            Performance-coupled
BS (Binary Search)                  4096 inputs, 256 segments, 1M iterations        Performance-coupled
Viewdle (Haar facial recognition)   1920x1080 image, 2K iterations                  Performance-coupled
Lbm (CPU2006)                       4 threads, ref input                            CPU-centric
Gcc (CPU2006)                                                                       CPU-centric
97
Performance Improvement with Cooperative Boosting
- Static P-state limiting requires profiling and a priori information about the workload: it is essentially an oracle method and is not practical. It also does not work well when there are periodic CPU-GPU synchronization points or concurrent computation, and it hurts CPU-centric, non-coupled applications.
- An average 15% performance gain for performance-coupled applications with CB
98
Power Savings
- Average 10% power savings across performance-coupled applications
- 5°C reduction in peak temperature for BS -> a large fraction of the savings comes from leakage power
Power needs to be considered in the context of performance (next slide); BS is one application where lowering temperature translates into performance.
99
Energy-Delay² (ED²): average 33% energy-delay² savings across performance-coupled applications
100
Conclusions
Effects of thermal and performance coupling on performance:
- Applications with a high GPU compute-to-load ratio are more susceptible to the detrimental effects of thermal coupling
- Emergent balanced workloads with split CPU-GPU computation are tightly performance-coupled
Cooperative Boosting (CB): balances the effects of thermal coupling against the needs of performance coupling; shifts power to the CPU only when needed
Two key points:
- Managing power, thermals, and performance in a heterogeneous processor is critical
- Balancing thermal- and performance-coupling effects is critical, and we have an algorithm that manages performance, power, and thermals together dynamically