Ramya (UCSB), Parthasarathy et al (HP Labs)
Overview Power delivery, consumption and cooling problems in a data center are currently tackled by several systems that address separate aspects of these problems, either locally or globally, in hardware or software. When these systems are deployed simultaneously, the policies of one tend to interfere with those of the others.
Overview… The lack of coordination amongst such systems leads to undesirable consequences. This paper proposes a Global Power Management Solution that coordinates these individual solutions.
Classifying the existing power management solutions.. Approach used: localized/distributed resource management, VMs. Power control: voltage scaling, power states, turning off machines. Implementation scope: server/cluster/data center level. Optimization requirements and constraints: accept performance loss? allow power budget violation?
In a nutshell.. Tracking problem – optimize power consumption while delivering performance. Capping problem – optimize power provisioning and cooling so as not to violate the power budget. Optimization problem – maximize power savings while minimizing performance loss (ACPI, VMs, etc.).
Representative Power Management Solutions Efficiency Controller (EC – tracking) – optimizes per-server average power consumption. Adjusts ACPI P-states based on past resource usage to meet estimated future demand. Server Manager (SM – capping) – reduces the P-state of a server when its power budget is violated.
Representative solutions.. Enclosure Manager (EM) – thermal power capping at blade level. Group Manager (GM) – the same at rack or data center level. These two monitor power usage across sets of machines and re-provision power to maintain the group power budget (determined manually or mandated by higher-level power managers).
Representative solutions.. Virtual Machine Controller (VMC) – reduces average power usage across a set of machines by consolidating workloads, turning off idle machines, etc.
Power Struggles.. What happens if these solutions are deployed simultaneously?
Power Struggles - examples The EC and the SM both operate on the same knob/actuator (P-state) but for different metrics. If uncoordinated, the EC can overwrite the P-state set by the SM, leading to power budget violations and eventual thermal failover! – A correctness issue.
Examples.. If the VMC and the group cappers are uncoordinated, the VMC can consolidate more workload onto a collection of servers than the group power budget allows. In addition to excessive performance violations (inefficiency), the VMC can react to the lower utilization (caused by power capping) and pack even more workloads onto the servers, leading to a vicious cycle and system instability.
Design Challenges of a Coordination System Interaction between different controllers (EC, SM, EM, etc.) must maintain correctness, stability and efficiency. Global awareness of the presence of other controllers while having minimal/zero knowledge of their properties. Adaptability and scalability – new controllers with the same or different properties, new applications, etc.
Design Challenges - Sensitivity Issues. Overlapping functionalities and policies of controllers – can they be mitigated? Is the Coordinated Management System agnostic to the deployed systems and applications (workloads)?
The Design
The Design.. Use of feedback control loops. Measure the required metric, compare with the reference value and manipulate the actuator based on the error so that the output follows the reference.
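The loop described above can be sketched in a few lines. This is a minimal illustration, not the controller from the paper: the toy plant (utilization = demand / capacity), the gain value and the function names are all assumptions made for the example.

```python
def control_step(capacity, utilization, reference, gain):
    """One iteration of a simple feedback loop.

    error > 0 means utilization is below the reference, i.e. there is
    spare capacity, so the actuator (capacity) is lowered; error < 0
    raises it. 'gain' is a hypothetical tuning constant.
    """
    error = reference - utilization
    return max(capacity - gain * error, 1.0)


def simulate(reference=0.75, demand=30.0, steps=50, gain=20.0):
    """Run the loop against a toy plant: utilization = demand / capacity."""
    capacity = 100.0
    for _ in range(steps):
        utilization = demand / capacity          # measure
        capacity = control_step(capacity, utilization, reference, gain)  # actuate
    return demand / capacity


if __name__ == "__main__":
    print(f"final utilization: {simulate():.3f}")
```

With these made-up numbers the measured utilization settles at the reference; a larger gain converges faster but can overshoot.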
Details.. Diagram Efficiency Controller (EC): reference utilization r_ref, actual utilization r_i. If r_i < r_ref, adjust the actuator A (P-state), i.e. step down from, say, P0 to P4, resulting in higher utilization and lower power usage.
Details.. Diagram Server Manager (SM): power capping by measuring per-server power consumption. If current consumption exceeds the power budget, the SM increases r_ref, thereby allowing the EC to reduce the P-state of the machine. In effect, the EC and SM use r_ref as a communication channel.
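A rough sketch of this division of labour follows: only the EC writes the P-state, and only the SM writes r_ref. The number of P-states, the gain `beta` and the clamping bounds are assumptions chosen for illustration, and the one-step-at-a-time EC is a simplification rather than the controller used in the paper.

```python
N_P_STATES = 5  # assumed: P0 (fastest) .. P4 (slowest)


def ec_step(p_index, utilization, r_ref):
    """EC: the only writer of the P-state.

    Utilization below r_ref -> move to a slower P-state (higher index);
    utilization above r_ref -> move to a faster one.
    """
    if utilization < r_ref and p_index < N_P_STATES - 1:
        return p_index + 1
    if utilization > r_ref and p_index > 0:
        return p_index - 1
    return p_index


def sm_step(r_ref, power, budget, beta=0.002, r_min=0.75, r_max=1.0):
    """SM: the only writer of r_ref.

    Raises r_ref in proportion to the budget overshoot and relaxes it
    when the server is under budget, clamped to [r_min, r_max].
    """
    new_ref = r_ref + beta * (power - budget)
    return min(max(new_ref, r_min), r_max)
```

Because the SM never touches the P-state directly, the EC cannot overwrite a capping decision; it simply tracks a higher target.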
Design.. EM & GM: same principle as the SM. Compare current power usage against the reference power budget and assign new budgets to the lower-level controllers (EM -> SM, GM -> EM) based on some policy (FIFO, random, etc.). Each lower-level controller uses the minimum of the upper-level recommendation and its own local power budget.
Design.. VMCs: Use actual utilization instead of apparent utilization (100% at P0 is not the same as 100% at P3). Supplied with data about the approximate power budgets at various levels. Also supplied with data about current power budget violations at various levels (through CIM). Together, these three enable the VMC to consolidate the right workloads while making sure that the consolidated servers neither violate the power budgets nor fall into the vicious cycle mentioned earlier.
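To make the difference between apparent and actual utilization concrete, here is a small sketch. It assumes that capacity scales linearly with clock frequency, which is a simplification, and the 60% frequency ratio in the usage note is a made-up value.

```python
def actual_utilization(apparent, freq_ratio):
    """Express utilization seen at a reduced P-state in P0 terms.

    apparent:   utilization reported at the current P-state (0..1)
    freq_ratio: current frequency / P0 frequency (0..1), assumed to
                scale capacity linearly
    """
    if not 0.0 <= apparent <= 1.0 or not 0.0 < freq_ratio <= 1.0:
        raise ValueError("apparent must be in [0, 1], freq_ratio in (0, 1]")
    return apparent * freq_ratio
```

For example, a server that looks 100% busy at a P-state running at 60% of the P0 frequency is doing only about 60% of a P0 server's work, so `actual_utilization(1.0, 0.6)` is the figure a consolidation decision should use.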
Summary of changes to the controllers
Modeling the Controllers Power – Performance Model – run actual workloads on the hardware at different utilization levels and measure power and performance. Through curve-fitting of the measured data, obtain linear models that represent the controller behavior.
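The curve-fitting step can be illustrated with an ordinary least-squares line fit. The sample points below are synthetic placeholders, not measurements from the paper, and `fit_line` is a hypothetical helper name.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Synthetic (made-up) samples for one P-state: utilization -> watts.
utilization = [0.0, 0.25, 0.5, 0.75, 1.0]
power_watts = [60.0, 70.0, 80.0, 90.0, 100.0]

slope, intercept = fit_line(utilization, power_watts)
# Linear model: power ~= intercept + slope * utilization
```

One such model would be fitted per P-state and per server type, for both power and performance.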
Modeling.. EC – its output is scaled up or down by λ (changes are proportional to the error in utilization). r_ref is increased by the SM when the local power budget cap_loc is violated, resulting in the EC lowering the power states of the machines.
Modeling.. SM: manipulates the r_ref of the EC if power consumption exceeds the local budget cap_loc, subject to a cap determined by the β_loc factor. EM & GM – operate on a fair-share policy: the power allocated to a component is proportional to the power it consumed in the last interval.
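The fair-share policy, together with the rule that a lower-level controller takes the minimum of the recommendation and its own local budget, might look like the sketch below. The function names, the equal-split fallback and the wattages in the usage note are assumptions made for illustration.

```python
def fair_share(group_budget, last_power):
    """Split group_budget in proportion to last-interval power draw.

    last_power: dict mapping member name -> watts consumed last interval.
    """
    total = sum(last_power.values())
    if total == 0:
        # Nothing consumed last interval: fall back to an equal split
        # (an assumption; the slides do not cover this case).
        return {name: group_budget / len(last_power) for name in last_power}
    return {name: group_budget * p / total for name, p in last_power.items()}


def effective_budget(recommended, local_budget):
    """Lower-level controller: never exceed its own local budget."""
    return min(recommended, local_budget)
```

For instance, with a 300 W enclosure budget and blades that drew 100 W and 50 W last interval, `fair_share` recommends 200 W and 100 W; a blade whose local budget is 150 W would then run with `effective_budget(200, 150)`, i.e. 150 W.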
Modeling.. VMCs – constrained optimization problem that maps n VMs to m servers (decision variable matrix X). Includes total power consumption and migration overhead (α_M) in the calculation. Considers server capacity constraints.
Modeling VMCs.. Considers local, enclosure and group level power budget constraints. The level of consolidation is tuned by adjusting the power budget buffers based on the violations at different levels.
Modeling VMCs.. Equations 1 to 6 depict a 0-1 integer optimization problem. The authors use a greedy bin-packing algorithm that yields an approximately optimal solution for the placement of VMs.
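As an illustration of greedy bin packing under a power cap, the sketch below uses first-fit decreasing. This is a generic heuristic, not necessarily the authors' algorithm; it ignores migration overhead and the enclosure/group budgets, and the linear power model and all parameter names are assumptions.

```python
def place_vms(vm_demand, n_servers, capacity, idle_w, w_per_unit, cap_w):
    """First-fit decreasing placement of VMs onto servers.

    vm_demand:  dict VM name -> demand (same units as capacity)
    capacity:   per-server capacity
    idle_w, w_per_unit: assumed linear power model, idle + slope * load
    cap_w:      per-server power budget in watts
    Returns dict VM name -> server index; untouched servers can be turned off.
    """
    load = [0] * n_servers
    placement = {}
    # Largest VMs first, so big items are not stranded at the end.
    for vm, demand in sorted(vm_demand.items(), key=lambda kv: -kv[1]):
        for s in range(n_servers):
            new_load = load[s] + demand
            power = idle_w + w_per_unit * new_load
            if new_load <= capacity and power <= cap_w:
                load[s] = new_load
                placement[vm] = s
                break
        else:
            raise ValueError(f"no feasible server for {vm}")
    return placement


if __name__ == "__main__":
    demands = {"a": 50, "b": 40, "c": 30, "d": 20}  # made-up percentages
    print(place_vms(demands, n_servers=4, capacity=100,
                    idle_w=60.0, w_per_unit=0.4, cap_w=93.0))
```

In the usage example the power cap, not the server capacity, is what stops a third VM from landing on the first server, which is the effect the budget constraints are meant to have.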
Evaluation How? Real-time deployment in a data center or a full-system simulation? Impractical: it limits the set of use-case scenarios that can be studied to the actual system being tested. Instead, trace-driven simulation: real-world traces of enterprise deployments enable detailed workload modeling and evaluation of tradeoffs at the policy and system levels.
Metrics used Aggregate power savings, performance loss and power budget violations at the SM, EM and GM levels. Peak power savings are not measured. No workload queuing, i.e. if the workload exceeds capacity, the excess counts as performance loss due to power capping. No demand carry-over.
Experimentation 180 workload traces (databases, web servers, remote desktops, e-commerce, etc.). Create different types of mixes (real & synthetic) from this set to exercise different utilization scenarios. Systems under test (SUT) – a low-power blade server (Blade A) and an entry-level 2U server (Server B). Experiment with different power budgets and also study the sensitivity of this architecture by varying the time constants.
Power – Performance models for Blade A and Server B
Results Baseline: No power management
Results.. Base Results: Coordinated – 64% reduction in power consumption, 3% performance degradation and 5% power budget violation. Uncoordinated – 12% performance loss and 7% budget violation. Sensitivity towards different systems: Blade A – 5 P-states over a higher power range; Server B – 6 P-states over a lower power range. Blade A's absolute power savings > Server B's. This implies that the range of power control is more important than its granularity.
Results.. Variation for different workloads At low utilization, the VMC is the major contributor to savings (assuming idle machines are turned off). As utilization increases, the benefits of the VMC decrease while the combination of EC & VMC does better (i.e. a coordinated solution is better than a single one). If idle machines are not switched off, savings drop significantly!