Thermal-aware Task Placement in Data Centers Qinghui Tang Sandeep K S Gupta Georgios Varsamopoulos IMPACT Lab Arizona State University.


1 Thermal-aware Task Placement in Data Centers Qinghui Tang Sandeep K S Gupta Georgios Varsamopoulos IMPACT Lab http://impact.asu.edu/ Arizona State University

2 Growth Trends in data centers ► Power density increases  Circuit density increases by a factor of 3 every 2 years  Energy efficiency increases by a factor of 2 every 2 years  Effective power density increases by a factor of 1.5 every 2 years [Kenneth Brill: The Invisible Crisis in the Data Center] ► Maintenance/TCO rising  Data Center TCO doubles every three years  By 2009, the three-year cost of electricity will exceed the purchase cost of the server  Virtualization/Consolidation is a one-time/short-term solution [Uptime Institute] ► Thermal management corresponds to an increasing portion of expenses  Thermal-aware solutions becoming prominent  Increasing need for thermal awareness

3 Related Work (extended domain) IC Case/chassis room firmware O/S Application (middleware) Dynamic voltage scaling Dynamic frequency scaling Circuitry redundancy Fan speed scaling CPU Load balancing Thermal-aware VM Thermal-aware data center job scheduling software dimension physical dimension

4 Thermal issues in dense computer rooms (i.e. Data centers, Computer Clusters, Data warehouses) ► Heat recirculation  Hot air from the equipment air outlets is fed back to the equipment air inlets ► Hot spots  Effect of Heat Recirculation  Areas in the data center with alarmingly high temperature ► Consequence  Cooling has to be set very low to have all inlet temperatures in safe operating range Courtesy: Intel Labs

5 Conceptual overview of thermal-aware task placement ► Task placement determines temperature distribution ► Temperature distribution determines the equipment peak air inlet temperature ► Peak air inlet temperature determines the upper bound to the CRAC temperature setting ► CRAC temperature setting determines its efficiency (Coefficient of Performance) ► The lower the peak inlet temperature, the higher the CRAC efficiency ► Bottom line: there is a task placement that maximizes cooling efficiency. Find it! [Figure: Coefficient of Performance (source: HP)]
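
The chain on this slide can be traced numerically. The following is a minimal sketch, not the authors' model: the 25 °C redline, the function names, and the quadratic CoP curve are assumptions. The coefficients are the ones commonly quoted for an HP Labs CRAC model and do not appear in the slides.

```python
def cop(t_sup_c):
    """Assumed CoP curve: quadratic in the CRAC supply temperature (deg C)."""
    return 0.0068 * t_sup_c**2 + 0.0008 * t_sup_c + 0.458

def cooling_power_kw(computing_kw, peak_inlet_rise_c, redline_c=25.0):
    # Highest supply temperature that still keeps the hottest inlet at the redline.
    t_sup = redline_c - peak_inlet_rise_c
    # Cooling power is the heat to remove divided by the CoP at that setting.
    return computing_kw / cop(t_sup)

# Same computing load, two placements with different peak inlet rise (hypothetical).
hot_spot = cooling_power_kw(100.0, peak_inlet_rise_c=10.0)  # supply must be 15 C
flatter = cooling_power_kw(100.0, peak_inlet_rise_c=5.0)    # supply can be 20 C
```

Under these assumptions the placement with the lower peak inlet rise lets the CRAC run warmer, at a higher CoP, so it needs less cooling power for the same load.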

6 Prerequisites for thermal management ► Task profiling  CPU utilization, I/O activity etc ► Equipment power profiling  CPU consumption, disk consumption etc ► Heat recirculation modeling ► Task management technologies ► Need for a comprehensive research framework

7 Thermal-aware job scheduling On-line job scheduling algorithm to minimize peak air inlet temperature, thus minimizing the cost of cooling. Thermal Models To enable on-line real-time thermal-aware job scheduling ► fast (analytical, non CFD based) ► non-invasive (machine-learning) Characterization Characterize the power consumption of a given workload (CPU, memory, disk etc) on a given equipment Thermal management research framework Model the thermal impact of multicore systems http://impact.asu.edu/ Sandeep Gupta Qinghui Tang Tridib Mukherjee Michael Jonas Georgios Varsamopoulos

8 Task Profiling measurements at ASU HPC Data Center (one chassis)

9 Power Model and Profiling ► Power Consumption is mainly affected by the CPU utilization ► Power consumption is linear to the CPU utilization P = a U + b
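
As an illustration of the linear power model P = a U + b, the coefficients can be estimated from profiling samples with an ordinary least-squares fit. This is a sketch under assumptions: the sample values below are hypothetical, not the ASU measurements, and `predict_power` is an illustrative helper.

```python
import numpy as np

# Hypothetical (utilization, watts) samples for one chassis; not measured data.
utilization = np.array([0.0, 0.25, 0.5, 0.75, 1.0])
power_w = np.array([1500.0, 1750.0, 2000.0, 2250.0, 2500.0])

# Least-squares fit of P = a*U + b; for deg=1 polyfit returns [slope, intercept].
a, b = np.polyfit(utilization, power_w, deg=1)

def predict_power(u):
    """Predicted chassis power (W) at CPU utilization u in [0, 1]."""
    return a * u + b
```

Here `b` plays the role of idle power and `a` the extra power drawn at full CPU utilization.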

10 Linear Thermal Model ► Heat Recirculation Coefficients  Analytical  Matrix-based ► Properties of model  Granularity at air inlets (discrete/simplified)  Assumes steadiness of air flow ► Model: $T_{in} = T_{sup} + D \times P$, where $T_{in}$ is the vector of inlet temperatures, $T_{sup}$ the supplied air temperatures, $D$ the heat distribution matrix, and $P$ the power vector
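
The linear model (inlet temperatures = supplied air temperatures + heat distribution matrix × power vector) reduces thermal evaluation to one matrix-vector product. A minimal sketch, assuming a made-up 3-node heat distribution matrix; the values of `D`, `t_sup`, and `p` are illustrative only.

```python
import numpy as np

def inlet_temperatures(t_sup, D, p):
    """Evaluate T_in = T_sup + D @ P for the given power vector."""
    return t_sup + D @ p

# Hypothetical heat distribution matrix (deg C per watt) for three nodes.
D = np.array([[0.002, 0.001, 0.000],
              [0.001, 0.002, 0.001],
              [0.000, 0.001, 0.003]])
t_sup = np.full(3, 15.0)                 # supplied air temperature, deg C
p = np.array([1000.0, 2000.0, 1000.0])   # power drawn by each node, W

t_in = inlet_temperatures(t_sup, D, p)
peak = t_in.max()                        # the quantity a scheduler would minimize
```

With these numbers the middle node, which receives heat from both neighbors, ends up with the peak inlet temperature.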

11 Benefit: fast thermal evaluation ► CFD route: give workload, run CFD simulation (days), extract temperatures ► Model route: give workload, compute vector $T_{in} = T_{sup} + D \times P$ (seconds), which yields temperatures Courtesy: Flometrics

12 Thermal-aware Task Placement Problem Given an incoming task, find a task partitioning and placement of subtasks to minimize the (increase of) peak inlet temperature: $T_{in} = T_{sup} + D \times (aU + b)$, where $T_{in}$ is the vector of inlet temperatures, $T_{sup}$ the supplied air temperatures, $D$ the heat distribution matrix, $U$ the utilization vector, and $P = aU + b$ the power model XInt Algorithm Approximation solution (genetic algorithm) ► Take a feasible solution and perform mutations until a certain number of iterations is reached
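
To make the "feasible solution plus mutations" idea concrete, here is a much simpler stand-in for XInt: random pairwise mutations with greedy acceptance, not the genetic algorithm from the talk. The function names, step size, matrix `D`, and all numeric values are assumptions for illustration.

```python
import numpy as np

def peak_inlet(u, D, a, b, t_sup):
    """Peak inlet temperature for utilization vector u, with P = a*U + b."""
    return float(np.max(t_sup + D @ (a * u + b)))

def mutate_search(D, a, b, t_sup, total_u, iters=2000, step=0.05, seed=0):
    rng = np.random.default_rng(seed)
    n = D.shape[0]
    u = np.full(n, total_u / n)            # feasible start: uniform placement
    best = peak_inlet(u, D, a, b, t_sup)
    for _ in range(iters):
        # Mutation: shift a little utilization from node i to node j.
        i, j = rng.choice(n, size=2, replace=False)
        delta = min(step, u[i], 1.0 - u[j])  # keep every u[k] within [0, 1]
        if delta <= 0:
            continue
        cand = u.copy()
        cand[i] -= delta
        cand[j] += delta
        score = peak_inlet(cand, D, a, b, t_sup)
        if score <= best:                  # accept only non-worsening mutations
            u, best = cand, score
    return u, best

# Hypothetical 3-node example; total_u = 1.5 means 1.5 nodes' worth of work.
D = np.array([[0.002, 0.001, 0.000],
              [0.001, 0.002, 0.001],
              [0.000, 0.001, 0.003]])
u_best, peak_best = mutate_search(D, a=1000.0, b=500.0, t_sup=15.0, total_u=1.5)
```

Because every mutation moves utilization between two nodes, the total workload is preserved, and the greedy acceptance rule means the result is never worse than the uniform starting placement.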

13 Contrasted scheduling approaches ► Uniform Outlet Profile (UOP)  Assigning tasks in a way that tries to achieve uniform outlet temperature distribution  Assigning more tasks to nodes with low inlet temperature (water filling process) ► Minimum computing energy  Assigning tasks in a way that keeps the number of active (power-on) chassis as few as possible  Server with coolest inlet temperature first ► Uniform Task (UT)  Assigning all chassis the same amount of tasks (power consumptions)  All nodes experience the same power consumption and temperature rise [Figure: inlet temperature and outlet temperature]
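
Two of the contrasted baselines are simple enough to sketch directly. This is an illustrative reading of the slide, not the authors' implementation: utilization is expressed as a fraction per chassis, and the function names and inputs are assumptions.

```python
import numpy as np

def uniform_task(total_u, n):
    """Uniform Task: every chassis gets the same share of the workload."""
    return np.full(n, total_u / n)

def coolest_inlet_first(total_u, inlet_temps):
    """Minimum computing energy: fill chassis to 100%, coolest inlet first,
    so that as few chassis as possible are active."""
    u = np.zeros(len(inlet_temps))
    remaining = total_u
    for i in np.argsort(inlet_temps):      # indices ordered coolest to hottest
        u[i] = min(1.0, remaining)
        remaining -= u[i]
        if remaining <= 0:
            break
    return u

ut = uniform_task(2.0, 4)
mce = coolest_inlet_first(1.5, [20.0, 18.0, 22.0])
```

In the second call the 18 °C chassis is filled completely, the 20 °C chassis takes the remaining half, and the 22 °C chassis stays idle.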

14 Simulated Environment ► Used Flometrics Flovent ► Simulated a small scale data center ► Physical dimensions 9.6 m × 8.4 m × 3.6 m ► Industry standard 42U racks arranged in two rows ► CRAC supply at 8 m³/s ► There are 10 racks  each rack is equipped with 5 chassis ► 1000 processors in data center  232 kW at full utilization

15 Performance Results ► Xint outperforms other algorithms ► Data Centers almost never run at 100%  Plenty of room for benefits!

16 Performance Results ► Xint outperforms other algorithms ► Data Centers almost never run at 100%  Plenty of room for benefits!

17 Power Vector Distribution key Xint contradicts “rule of thumb” placement at bottom

18 Supply Heat Index (SHI) ► Supply Heat Index  Metric developed by HP Labs  quantifies the overall heat recirculation of data center ► Xint consistently has the lowest SHI

19 Conclusions ► Thermal-aware task placement can significantly reduce heat recirculation  XInt performance thrives at around 50% CPU utilization ► Not much can be done at 100% utilization  Cooling savings can exceed 30% (in comparison to other schemes) ► Cost of operation reduces by 15% (if initially 1:1 ratio of computing to cooling cost)
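
The 15% figure follows from the 30% cooling savings once the 1:1 cost ratio is taken into account. A small worked check; the function name and parameterization are illustrative.

```python
def total_cost_reduction(cooling_savings, computing_to_cooling_ratio=1.0):
    """Fractional reduction in total (computing + cooling) operating cost."""
    computing = computing_to_cooling_ratio   # computing cost, in units of cooling cost
    cooling = 1.0
    before = computing + cooling
    after = computing + cooling * (1.0 - cooling_savings)
    return (before - after) / before

reduction = total_cost_reduction(0.30)   # 0.30 / 2 = 0.15
```

With cooling making up half of the total, a 30% cut in cooling cost removes 0.30 / 2 = 15% of the overall operating cost.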

20 Related Work in Progress ► Waiving simplifying assumptions  Equipment heterogeneity [INFOCOM 2008]  Stochastic task arrival ► Thermal maps thru machine learning  Automated, non-invasive, cost-effective [GreenCom 2007] ► Implementations  Thermal-aware Moab scheduler  Thermal-aware SLURM  SiCortex product thermal management

21 Algorithm Assumptions ► HPC model in mind  Long-running jobs (finish time is the same — infinity) ► One-time arrival (starting time is the same) ► Utilization homogeneity (same utilization throughout task’s length) ► Non preemptive/movable tasks ► Data Center equipment homogeneity  power consumption  computational capability ► Cooling is self-controlled

22 Thank You ► Questions? ► Comments? ► Suggestions? http://impact.asu.edu/

23 Additional Slides

24 Functional model of scheduling ► Tasks arrive at the data center ► Scheduler figures out the best placement  Placement that has minimal impact on peak inlet temperatures ► Assigns task accordingly Scheduler Task Tasks

25 Architectural View Scheduler (Moab, SLURM) dispatch Machine Learning create/update provide Monitoring Processes Thermal Model report control

26 A simple thermal model ► Basic Idea:  We don’t need an extensive CFD model  We only need to know the effect of recirculation at specific points ► Express recirculation as “coefficients” [Figure: nodes N1 to N5; courtesy: Intel Labs]

27 Recirculation coefficients: a fast thermal model ► Reduce/Simplify the “thermal map” concept to points of interest: equipment air inlets ► Can be computed from CFD models/simulations ► Matrix $A$ of recirculation coefficients: $a_{ij}$ is the portion of heat exhausted from node $i$ that directly goes to node $j$

28 Opportunities & Challenges ► Data centers don’t run at full utilization  Can choose among multiple CPUs to allocate a job  Different thermal impact per CPU ► Need for fast thermal evaluation ► Temporal and spatial heterogeneity of data centers  In equipment  In workload Thermal issues ► Heat recirculation  Increases as equipment density exceeds cooling capacity as planned ► Hot spots  Effect of heat recirculation ► Impact: Cooling has to be set low enough to have all inlet temperatures in safe operating range Data Center Thermal Management Increasing need for thermal awareness ► Power density increases  Circuit density increases by a factor of 3 every 2 years  Energy efficiency increases by a factor of 2 every 2 years  Effective power density increases by a factor of 1.5 every 2 years [Kenneth Brill: The Invisible Crisis in the Data Center] ► Maintenance/TCO rising  Data Center TCO doubles every three years  By 2009, the three-year cost of electricity will exceed the purchase cost of the server  Virtualization/Consolidation is a one-time/short-term solution ► Thermal management corresponds to an increasing portion of expenses  Thermal-aware solutions becoming prominent Thermal-aware solutions at various levels ► Physical dimension: IC, case/chassis, room ► Software dimension: firmware, O/S, application (middleware) ► Techniques: dynamic voltage scaling, dynamic frequency scaling, circuitry redundancy, fan speed scaling, CPU load balancing, thermal-aware VM, data center job scheduling A dynamic thermal-aware control platform is necessary for online thermal evaluation [Figure: computation and cooling cost ($1M, $10M, $100M) by year, without and with thermal-aware management]

29 Scheduling Impacts Cooling Setting [Figure: inlet temperature distribution without cooling and with cooling, against a 25 °C limit, for Scheduling 1 and Scheduling 2] Different demands for cooling capacity

30 Results (1) ► Recirculation Coefficients  Consistent with datacenter observations  Large values are observed along diagonal  Strong recirculation among neighboring servers, or between bottom servers and top servers [Figure: recirculation coefficients, axis ticks 1 to 10]

