Automated Cost-Aware Data Center Management (part 4)
Justin Moore
Advisor: Jeff Chase
Committee: Parthasarathy Ranganathan, Carla Ellis, Alvin Lebeck, Jun Yang
Heat Recirculation
- Rebalance power, reduce cooling costs
- Hot exhaust air mixes with cold incoming air
- Cause: inefficient data center layout, air flow [Sharma2003]
[Figure: diagram labeled δQ, CRAC, Q]
TARP: Results
- Less heat recirculation means lower cooling power cost
- “Hot spots” can be OK and beneficial, provided heat is exiting
- Workload inversely proportional to recirculation
- Avoid servers whose exhaust recirculates until the last minute
[Moore05Usenix]
Cost-Aware Management
- Identify primary sources of costs
  - Fixed: Acquisition
  - Dynamic: Power and Cooling
  - Cumulative: “Wear and tear”
- Combine thermal costs with workload costs
  - Utility functions, SLAs, penalties, etc.
Ongoing Work: Models
[Figure: diagram with components Candidate Settings, Policy, Workload, Weatherman, Cost Model]
- Reliability, utility, profitability
Ongoing Work: Costs Per Day
- Max inlet of 25C: even = 12.25C ($3259.00 / day); wman = 15.75C ($3140.15 / day)
  - Diff = $118.85 / day; $43,380.25 / year
- 0 wear-and-tear: even = 7.25C ($3541.91 / day); wman = 10.75C ($3260.28 / day)
  - Diff = $281.63 / day; $102,794.95 / year
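The yearly figures on this slide follow from the daily costs by simple arithmetic. The sketch below reproduces them from the numbers shown above. The function name `savings` is illustrative only, and the 365-day year is an assumption that happens to match the slide's totals.

```python
from decimal import Decimal

DAYS_PER_YEAR = 365  # assumption: this value reproduces the yearly totals on the slide


def savings(even_per_day: str, wman_per_day: str) -> tuple[Decimal, Decimal]:
    """Return (daily, yearly) savings of wman over even, in dollars."""
    daily = Decimal(even_per_day) - Decimal(wman_per_day)
    return daily, daily * DAYS_PER_YEAR


# Daily costs copied from the slide for the two scenarios.
for label, even, wman in [
    ("Max inlet of 25C", "3259.00", "3140.15"),
    ("0 wear-and-tear", "3541.91", "3260.28"),
]:
    daily, yearly = savings(even, wman)
    print(f"{label}: ${daily} / day ; ${yearly:,} / year")
```

Decimal arithmetic is used here so that the cent-level values are not disturbed by binary floating-point rounding.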
Conclusions
- Need for comprehensive management
- Address new challenges
- Power and thermal considerations are significant
- Unified cost-aware management architecture
- Extend MAPE architecture
- Construct accurate models for instrumentation
- Model and predict data-center-wide conditions
- Formulate cost-aware management policies
Planning: Management Policies
- Ad-hoc
  - Independent and limited in scope
  - Priority-based scheduling, thermal kill switches, …
- Heuristic
  - Qualitative decisions based on “proxy” variables
  - e.g., reduce power to reduce cooling costs
- Informed
  - Model and predict system behavior
  - Prevent, Detect, Recover
TARP: Evaluation Methodology
- 7 racks/row, 40 servers/rack
- 150 W idle, 285 W at 100%
- Redline @ 25C
- 4 CRAC units, 90 kW/CRAC
- Flovent: CFD simulator
  - Validated in [Sharma03]
  - Gives T_in and T_out for each object
- T_sup = 15 + (T_red - max(T_in))
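The supply-temperature rule on this slide can be sketched in a few lines of Python. The function name, parameter names, and sample inlet temperatures below are hypothetical; only the formula, the 15C base, and the 25C redline come from the slide.

```python
def supply_temperature(inlet_temps_c, redline_c=25.0, base_supply_c=15.0):
    """Apply T_sup = 15 + (T_red - max(T_in)), all values in degrees C.

    inlet_temps_c: inlet temperatures reported for each object
    (on the slide these come from the CFD simulator).
    """
    if not inlet_temps_c:
        raise ValueError("need at least one inlet temperature")
    # The hottest inlet decides how far the supply temperature moves.
    return base_supply_c + (redline_c - max(inlet_temps_c))


# Hypothetical inlet readings, not taken from the slides.
print(supply_temperature([21.0, 23.5, 27.0]))  # hottest inlet above redline
print(supply_temperature([20.0, 22.0]))        # hottest inlet below redline
```

Read this way, the rule lowers the supply temperature by however much the hottest inlet exceeds the redline, and raises it by the available headroom otherwise.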
Minimize Heat Recirculation
- 2-phase calibration creates power budgets
- Baseline workload measures Q_b and δQ_b
  - I.e., all servers on but idle
- Bin servers into non-overlapping pods
- Utilize each pod 100%; re-measure Q and δQ
- Heat Recirculation Factor (HRF)
  - HRF_j = (Q_j - Q_b) / (δQ_j - δQ_b)
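The calibration above can be sketched as follows. The HRF formula is the one on the slide. The budgeting rule (splitting total power in proportion to each pod's share of the summed HRF) is an assumption made for illustration, since the slide says only that calibration "creates power budgets". All function names and measurements below are hypothetical.

```python
def heat_recirculation_factors(q, dq, q_base, dq_base):
    """Compute HRF_j = (Q_j - Q_b) / (dQ_j - dQ_b) for each pod j.

    q[j], dq[j]: Q and delta-Q re-measured with pod j at 100% utilization.
    q_base, dq_base: baseline Q_b and delta-Q_b (all servers on but idle).
    """
    hrf = []
    for j, (q_j, dq_j) in enumerate(zip(q, dq)):
        extra_recirculation = dq_j - dq_base
        if extra_recirculation <= 0:
            # Guard added for this sketch: HRF is undefined in this case.
            raise ValueError(f"pod {j} shows no added recirculation")
        hrf.append((q_j - q_base) / extra_recirculation)
    return hrf


def power_budgets(hrf, total_power):
    """Assumed rule: give each pod total_power * HRF_j / sum(HRF)."""
    total_hrf = sum(hrf)
    return [total_power * h / total_hrf for h in hrf]


# Made-up measurements for three pods, in arbitrary consistent units.
hrf = heat_recirculation_factors(
    q=[140.0, 140.0, 140.0], dq=[12.0, 14.0, 18.0], q_base=100.0, dq_base=10.0
)
print(hrf)                         # [20.0, 10.0, 5.0]
print(power_budgets(hrf, 7000.0))  # [4000.0, 2000.0, 1000.0]
```

Under this rule, a pod whose added heat recirculates least gets the highest HRF and the largest budget.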