Published by Brent Dennis. Modified over 8 years ago.
1
Thermal-aware Task Placement in Data Centers
Qinghui Tang, Sandeep K. S. Gupta, Georgios Varsamopoulos
IMPACT Lab (http://impact.asu.edu/), Arizona State University
2
Growth Trends in Data Centers
► Power density increases: circuit density increases by a factor of 3 every 2 years while energy efficiency increases by a factor of 2 every 2 years, so effective power density increases by a factor of 1.5 every 2 years [Kenneth Brill: The Invisible Crisis in the Data Center]
► Maintenance/TCO rising: data center TCO doubles every three years; by 2009, the three-year cost of electricity will exceed the purchase cost of the server; virtualization/consolidation is a one-time, short-term solution [Uptime Institute]
► Thermal management accounts for an increasing portion of expenses: thermal-aware solutions are becoming prominent, and the need for thermal awareness is increasing
3
Related Work (extended domain)
► Physical dimension: IC, case/chassis, room
► Software dimension: firmware, O/S, application (middleware)
► Techniques placed along these dimensions: dynamic voltage scaling, dynamic frequency scaling, circuitry redundancy, fan speed scaling, CPU load balancing, thermal-aware VM, thermal-aware data center job scheduling
4
Thermal Issues in Dense Computer Rooms (e.g. data centers, computer clusters, data warehouses)
► Heat recirculation: hot air from the equipment air outlets is fed back to the equipment air inlets
► Hot spots: areas in the data center with alarmingly high temperature; an effect of heat recirculation
► Consequence: cooling has to be set very low to keep all inlet temperatures in the safe operating range
(Image courtesy: Intel Labs)
5
Conceptual Overview of Thermal-aware Task Placement
► Task placement determines the temperature distribution
► The temperature distribution determines the equipment peak air inlet temperature
► The peak air inlet temperature determines the upper bound on the CRAC (computer room air conditioner) temperature setting
► The CRAC temperature setting determines its efficiency (Coefficient of Performance)
► The lower the peak inlet temperature, the higher the CRAC efficiency
Bottom line: there is a task placement that maximizes cooling efficiency. Find it!
(Coefficient of Performance plot, source: HP)
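The chain above can be made concrete with a small sketch. The slides only show the HP Coefficient of Performance curve as a plot; the quadratic below is the form commonly quoted for that curve and should be treated as an assumption, as should the 25 °C redline and the two example temperature rises. The sketch assumes that raising the supply temperature raises every inlet temperature by the same amount, so a placement with a smaller peak rise permits a warmer (more efficient) supply setting.

```python
def cop(t_sup_celsius):
    """Coefficient of Performance at a given CRAC supply temperature.

    Assumed quadratic fit of the HP curve; not stated in the slides.
    """
    return 0.0068 * t_sup_celsius ** 2 + 0.0008 * t_sup_celsius + 0.458


def max_supply_temperature(peak_inlet_rise, redline_celsius=25.0):
    """Warmest supply setting that keeps the hottest inlet at the redline.

    peak_inlet_rise is the largest (inlet - supply) difference for a placement.
    The 25 C redline is a hypothetical value.
    """
    return redline_celsius - peak_inlet_rise


def cooling_power(computing_power_watts, t_sup_celsius):
    """Power the CRAC needs to remove the heat produced by the equipment."""
    return computing_power_watts / cop(t_sup_celsius)


# Two hypothetical placements of the same 232 kW load.
computing_power = 232_000.0
for name, rise in [("placement A", 10.0), ("placement B", 5.0)]:
    t_sup = max_supply_temperature(rise)
    print(name, t_sup, round(cooling_power(computing_power, t_sup)))
```

Under these assumptions the placement with the smaller peak rise allows a higher supply temperature and therefore a lower cooling power for the same computing load.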
6
Prerequisites for Thermal Management
► Task profiling: CPU utilization, I/O activity, etc.
► Equipment power profiling: CPU consumption, disk consumption, etc.
► Heat recirculation modeling
► Task management technologies
► Need for a comprehensive research framework
7
Thermal Management Research Framework (http://impact.asu.edu/)
► Thermal-aware job scheduling: on-line job scheduling algorithm to minimize the peak air inlet temperature, thus minimizing the cost of cooling
► Thermal models: to enable on-line, real-time thermal-aware job scheduling, the models must be fast (analytical, not CFD-based) and non-invasive (machine learning)
► Characterization: characterize the power consumption of a given workload (CPU, memory, disk, etc.) on given equipment
► Model the thermal impact of multicore systems
People: Sandeep Gupta, Qinghui Tang, Tridib Mukherjee, Michael Jonas, Georgios Varsamopoulos
8
Task Profiling measurements at ASU HPC Data Center (one chassis)
9
Power Model and Profiling
► Power consumption is mainly affected by the CPU utilization
► Power consumption is linear in the CPU utilization: $P = aU + b$
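As an illustration of how the coefficients of P = aU + b might be obtained from profiling data, the sketch below runs a least-squares line fit. The measurement values are hypothetical placeholders, not the ASU measurements, and `predict_power` is a name introduced only for this example.

```python
import numpy as np

# Hypothetical profiling samples (not from the slides): CPU utilization in
# [0, 1] and the chassis power draw in watts measured at that utilization.
utilization = np.array([0.0, 0.25, 0.5, 0.75, 1.0])
power_watts = np.array([2420.0, 2975.0, 3530.0, 4085.0, 4640.0])

# Least-squares fit of the linear model P = a*U + b.
# np.polyfit returns coefficients from the highest degree down: [a, b].
a, b = np.polyfit(utilization, power_watts, deg=1)


def predict_power(u):
    """Predicted chassis power (watts) at CPU utilization u."""
    return a * u + b


print(f"a = {a:.1f} W, b = {b:.1f} W, P(0.6) = {predict_power(0.6):.1f} W")
```

Here b plays the role of the idle power and a the additional power drawn at full utilization.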
10
Linear Thermal Model
► Heat recirculation coefficients: analytical, matrix-based
► Properties of the model: granularity at air inlets (discrete/simplified); assumes steady air flow
► $T_{in} = T_{sup} + D \times P$, where $T_{in}$ is the vector of inlet temperatures, $T_{sup}$ the vector of supplied air temperatures, $D$ the heat distribution matrix, and $P$ the power vector
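A minimal sketch of evaluating this linear model follows. The 3-node heat distribution matrix, the supply temperature, and the power values are made-up numbers chosen for readability; a real D would come from CFD simulation or measurement.

```python
import numpy as np


def inlet_temperatures(t_sup, D, power):
    """Evaluate T_in = T_sup + D x P for all nodes at once."""
    t_sup = np.asarray(t_sup, dtype=float)
    D = np.asarray(D, dtype=float)
    power = np.asarray(power, dtype=float)
    return t_sup + D @ power


# Hypothetical 3-node example. D[i, j] is the inlet temperature rise at
# node i (in degrees C) per watt consumed at node j.
D = np.array([
    [0.002, 0.001, 0.000],
    [0.001, 0.002, 0.001],
    [0.000, 0.001, 0.003],
])
t_sup = np.full(3, 15.0)                    # supplied air temperature, C
power = np.array([1000.0, 2000.0, 1000.0])  # power vector, W

t_in = inlet_temperatures(t_sup, D, power)
print(t_in, "peak:", t_in.max())
```

The whole evaluation is one matrix-vector product, which is why it is cheap enough to run inside a scheduler.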
11
Benefit: Fast Thermal Evaluation
► CFD approach: give workload, run CFD simulation (days), extract temperatures
► Linear model: give workload, compute the vector $T_{in} = T_{sup} + D \times P$ (seconds), which yields the temperatures
(Image courtesy: Flometrics)
12
Thermal-aware Task Placement Problem
► Given an incoming task, find a task partitioning and a placement of the subtasks that minimizes the (increase of) peak inlet temperature
► $T_{in} = T_{sup} + D \times (aU + b)$, where $T_{in}$ is the vector of inlet temperatures, $T_{sup}$ the vector of supplied air temperatures, $D$ the heat distribution matrix, and $U$ the utilization vector; the power model is $P = aU + b$
► XInt algorithm: an approximate solution (genetic algorithm) that takes a feasible solution and performs mutations for a certain number of iterations
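The sketch below illustrates the "start from a feasible solution and mutate" idea in its simplest form. It is not the authors' XInt: it keeps a single candidate rather than a population, has no crossover, and all numbers (matrix D, coefficients a and b, total utilization) are hypothetical. A mutation moves a random amount of utilization between two nodes and is kept only if it lowers the peak inlet temperature.

```python
import numpy as np


def peak_inlet(u, t_sup, D, a, b):
    """Peak of T_in = T_sup + D x (a*U + b) for utilization vector u."""
    return float(np.max(t_sup + D @ (a * u + b)))


def mutate_search(total_util, t_sup, D, a, b, iterations=2000, seed=0):
    """Mutation-only search for a placement with a low peak inlet temperature."""
    rng = np.random.default_rng(seed)
    n = len(t_sup)
    u = np.full(n, total_util / n)  # feasible start: spread the load evenly
    best = peak_inlet(u, t_sup, D, a, b)
    for _ in range(iterations):
        i, j = rng.choice(n, size=2, replace=False)
        # Largest shift from node i to node j that keeps both within [0, 1].
        limit = max(0.0, min(u[i], 1.0 - u[j]))
        delta = rng.uniform(0.0, limit)
        candidate = u.copy()
        candidate[i] -= delta
        candidate[j] += delta
        peak = peak_inlet(candidate, t_sup, D, a, b)
        if peak < best:  # keep the mutation only if it helps
            u, best = candidate, peak
    return u, best


# Hypothetical 3-node data center.
D = np.array([
    [0.002, 0.001, 0.000],
    [0.001, 0.002, 0.001],
    [0.000, 0.001, 0.003],
])
t_sup = np.full(3, 15.0)
a, b = 2220.0, 2420.0  # assumed power model coefficients, in watts

peak_uniform = peak_inlet(np.full(3, 0.5), t_sup, D, a, b)
u_best, peak_best = mutate_search(1.5, t_sup, D, a, b)
print(u_best.round(3), peak_best, "vs uniform", peak_uniform)
```

Because every mutation only moves load between two nodes, the total utilization is conserved and the per-node bounds are respected throughout the search.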
13
Contrasted Scheduling Approaches
► Uniform Outlet Profile (UOP): assigns tasks so as to achieve a uniform outlet temperature distribution; assigns more tasks to nodes with low inlet temperature (a water-filling process)
► Minimum computing energy: assigns tasks so that as few chassis as possible are active (powered on); servers with the coolest inlet temperature are used first
► Uniform Task (UT): assigns all chassis the same amount of tasks (power consumption); all nodes experience the same power consumption and temperature rise
[Figure labels: Inlet Temperature, Outlet Temperature]
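Two of the baseline approaches above are simple enough to sketch directly. The functions below are one reading of the slide text, not the authors' implementation: utilization is expressed as a fraction per chassis, each chassis can take at most 1.0, and the inlet temperatures are hypothetical.

```python
def uniform_task(total_util, n_chassis):
    """Uniform Task: every chassis receives the same share of the load."""
    share = total_util / n_chassis
    return [share] * n_chassis


def coolest_inlet_first(total_util, inlet_temps):
    """Fill chassis completely, coolest inlet first.

    This keeps the number of active chassis as small as possible.
    """
    placement = [0.0] * len(inlet_temps)
    remaining = total_util
    # Visit chassis in order of increasing inlet temperature.
    for idx in sorted(range(len(inlet_temps)), key=lambda i: inlet_temps[i]):
        load = min(1.0, remaining)
        placement[idx] = load
        remaining -= load
        if remaining <= 0:
            break
    return placement


inlet_temps = [22.0, 18.0, 20.0, 25.0]  # hypothetical inlet temperatures, C
print(uniform_task(2.5, 4))
print(coolest_inlet_first(2.5, inlet_temps))
```

With these inputs the second policy leaves the hottest chassis idle, whereas Uniform Task loads it like every other chassis.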
14
Simulated Environment
► Used Flometrics Flovent
► Simulated a small-scale data center
► Physical dimensions: 9.6 m × 8.4 m × 3.6 m
► Industry-standard 42U racks arranged in two rows
► CRAC supply at 8 m³/s
► 10 racks, each equipped with 5 chassis
► 1000 processors in the data center; 232 kW at full utilization
15
Performance Results
► XInt outperforms the other algorithms
► Data centers almost never run at 100% utilization, so there is plenty of room for benefits
17
Power Vector Distribution
► Key point: XInt contradicts the “rule of thumb” of placement at the bottom
18
Supply Heat Index (SHI)
► Supply Heat Index: a metric developed by HP Labs that quantifies the overall heat recirculation of a data center
► XInt consistently has the lowest SHI
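The slides do not spell out how SHI is computed. The sketch below uses the temperature-based form in which the index is commonly cited: the summed inlet temperature rise above the supply (reference) temperature divided by the summed outlet temperature rise above it. Treat that formula, and all the temperatures, as assumptions to check against the HP Labs definition.

```python
def supply_heat_index(inlet_temps, outlet_temps, t_ref):
    """SHI = sum(T_in - T_ref) / sum(T_out - T_ref) (assumed form).

    0 means no recirculated heat reaches the inlets; values closer to 1
    mean more of the exhaust heat is picked up before the air enters
    the equipment.
    """
    inlet_rise = sum(t - t_ref for t in inlet_temps)
    outlet_rise = sum(t - t_ref for t in outlet_temps)
    return inlet_rise / outlet_rise


# Hypothetical three-rack example with a 15 C supply temperature.
shi = supply_heat_index([17.0, 19.0, 18.0], [30.0, 33.0, 31.0], t_ref=15.0)
print(round(shi, 4))
```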
19
Conclusions
► Thermal-aware task placement can significantly reduce heat recirculation
► XInt performs best at around 50% CPU utilization; not much can be done at 100% utilization
► Cooling savings can exceed 30% (in comparison to other schemes); the cost of operation then drops by 15% (assuming an initial 1:1 ratio of computing cost to cooling cost)
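The step from a 30% cooling saving to a 15% reduction in operating cost follows from the 1:1 computing-to-cooling ratio, as this small check shows. The function name and the normalization of the cooling cost to 1.0 are choices made for this example only.

```python
def total_cost_reduction(cooling_savings, computing_to_cooling_ratio=1.0):
    """Fractional drop in total (computing + cooling) cost.

    cooling_savings is the fractional reduction in cooling cost; the
    computing cost is assumed to stay unchanged.
    """
    cooling = 1.0
    computing = computing_to_cooling_ratio * cooling
    before = computing + cooling
    after = computing + cooling * (1.0 - cooling_savings)
    return 1.0 - after / before


print(f"{total_cost_reduction(0.30):.0%}")
```

If cooling is a smaller share of the bill (a ratio above 1), the same cooling saving translates into a smaller overall reduction.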
20
Related Work in Progress
► Waiving simplifying assumptions: equipment heterogeneity [INFOCOM 2008]; stochastic task arrival
► Thermal maps through machine learning: automated, non-invasive, cost-effective [GreenCom 2007]
► Implementations: thermal-aware Moab scheduler; thermal-aware SLURM; SiCortex product thermal management
21
Algorithm Assumptions
► HPC model in mind: long-running jobs (finish time is the same: infinity)
► One-time arrival (starting time is the same)
► Utilization homogeneity (same utilization throughout the task’s length)
► Non-preemptive, non-movable tasks
► Data center equipment homogeneity: power consumption, computational capability
► Cooling is self-controlled
22
Thank You ► Questions? ► Comments? ► Suggestions? http://impact.asu.edu/
23
Additional Slides
24
Functional Model of Scheduling
► Tasks arrive at the data center
► The scheduler figures out the best placement: the placement that has minimal impact on peak inlet temperatures
► The scheduler assigns the tasks accordingly
[Diagram: tasks arriving at the scheduler]
25
Architectural View
[Diagram components: Scheduler (Moab, SLURM), Machine Learning, Monitoring Processes, Thermal Model. Edge labels: dispatch, create/update, provide, report, control]
26
A Simple Thermal Model
► Basic idea: we don’t need an extensive CFD model; we only need to know the effect of recirculation at specific points
► Express recirculation as “coefficients”
[Figure: nodes N1 to N5. Courtesy: Intel Labs]
27
Recirculation Coefficients: A Fast Thermal Model
► Reduce/simplify the “thermal map” concept to the points of interest: equipment air inlets
► Can be computed from CFD models/simulations
► Matrix $A$ of recirculation coefficients: $a_{ij}$ is the portion of the heat exhausted from node $i$ that goes directly to node $j$
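Given that definition of a_ij, the heat each node receives directly from the exhausts of all nodes is a single matrix product. The sketch below shows this first-order bookkeeping only; the coefficient values are hypothetical, and the sanity check (non-negative entries, no row summing to more than 1) is an assumption that follows from reading each a_ij as a fraction of node i's exhaust heat.

```python
import numpy as np


def recirculated_heat(A, exhaust_heat):
    """Heat arriving directly at each node j: sum over i of a_ij * q_i."""
    A = np.asarray(A, dtype=float)
    q = np.asarray(exhaust_heat, dtype=float)
    # Each row holds fractions of one node's exhaust heat, so entries must be
    # non-negative and a row cannot hand out more than 100% of that heat.
    if np.any(A < 0) or np.any(A.sum(axis=1) > 1.0 + 1e-9):
        raise ValueError("rows of A must be non-negative and sum to at most 1")
    return A.T @ q


# Hypothetical coefficients for 3 nodes and their exhaust heat in watts.
A = np.array([
    [0.10, 0.20, 0.00],
    [0.05, 0.10, 0.10],
    [0.00, 0.10, 0.20],
])
exhaust_heat = np.array([1000.0, 2000.0, 1000.0])

received = recirculated_heat(A, exhaust_heat)
print(received)
```

The heat not accounted for by a row of A is the portion of that node's exhaust that returns to the CRAC instead of another inlet.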
28
Opportunities & Challenges
► Data centers don’t run at full utilization: a job can be allocated to one of multiple CPUs, each with a different thermal impact
► Need for fast thermal evaluation
► Temporal and spatial heterogeneity of data centers: in equipment, in workload
Thermal Issues
► Heat recirculation: increases as equipment density exceeds the planned cooling capacity
► Hot spots: an effect of heat recirculation
► Impact: cooling has to be set low enough to keep all inlet temperatures in the safe operating range
Data Center Thermal Management: Increasing Need for Thermal Awareness
► Power density increases: circuit density increases by a factor of 3 every 2 years while energy efficiency increases by a factor of 2 every 2 years, so effective power density increases by a factor of 1.5 every 2 years [Kenneth Brill: The Invisible Crisis in the Data Center]
► Maintenance/TCO rising: data center TCO doubles every three years; by 2009, the three-year cost of electricity will exceed the purchase cost of the server; virtualization/consolidation is a one-time, short-term solution
► Thermal management accounts for an increasing portion of expenses: thermal-aware solutions are becoming prominent
Thermal-aware Solutions at Various Levels
► Physical dimension: IC, case/chassis, room
► Software dimension: firmware, O/S, application (middleware)
► Techniques: dynamic voltage scaling, dynamic frequency scaling, circuitry redundancy, fan speed scaling, CPU load balancing, thermal-aware VM, data center job scheduling
► A dynamic thermal-aware control platform is necessary for online thermal evaluation
[Figure: computation and cooling cost ($1M, $10M, $100M) by year, without and with thermal-aware management]
29
Scheduling Impacts Cooling Setting
[Figure: inlet temperature distribution without cooling and with cooling, marked at 25 °C, for Scheduling 1 and Scheduling 2]
► Different schedules place different demands on cooling capacity
30
Results (1)
► Recirculation coefficients: consistent with data center observations
► Large values are observed along the diagonal
► Strong recirculation among neighboring servers, or between bottom servers and top servers
[Figure: recirculation coefficient matrix, axis ticks 1 to 10]