Thermal-aware Task Placement in Data Centers

1 Thermal-aware Task Placement in Data Centers
Sandeep Gupta
Department of Computer Science and Engineering, School of Computing and Informatics
Ira A. Fulton School of Engineering, Arizona State University
Tempe, Arizona, USA

2 Tempe, Fulton School of Engg & CSE

3 Department of Computer Science & Engineering, Tempe, Arizona
IMPACT (Intelligent Mobile Pervasive Autonomic Computing & Technologies) Lab
Research goals
  Enable context-aware pervasive applications
  Dependable distributed sensor networking
Projects
  Wireless Solution for Smart Sensors Biomedical Applications (NSF - ITR)
  Context-Aware Middleware for Pervasive Computing (NSF - NMI)
  Thermal Management Datacenters (SFAZ, NSF)
  Location Based Access Control (CES)
  Identity Assurance (NSF, CES)
  Mobility-Tolerant Multicast (NSF)
  Ayushman - Infrastructure Testbed for Sensor Based Health Monitoring (Mediserve Inc.)
Group
  Faculty: Dr. Sandeep K. S. Gupta
  1 PostDoc + 7 PhD + 2 MS + 1 UG

4 IMPACT: Research
Use-inspired research in pervasive computing & wireless sensor networking
Mobile Ad-hoc Networks
  Goal: Protocols for mobile ad-hoc networks
  Features: Energy efficiency; increased lifetime; data aggregation; localization; caching; multicasting
ID Assurance
  Goal: Protect people’s identity & consumer computing from viral threats
  Features: PKI based; non-tamperable, non-programmable personal authenticator; hardware and VM based trust management
Pervasive Health Monitoring
  Goal: Pervasive health monitoring; evaluation of medical applications
  Features: Secure, dependable and reliable data collection, storage and communication
Criticality-Aware Systems
  Goal: Evaluation of crisis response management
  Features: Theoretical model; performance evaluation; access control for crisis management
Thermal Management for Data Centers
  Goal: Increasing computing capacity for datacenters; energy efficiency
  Features: Online thermal evaluation; thermal-aware scheduling
Intelligent Container
  Goal: Container monitoring for homeland security; dynamic supply chain management
  Features: Integration of RFID and environmental sensors; energy management; communication security
Medical Devices, Mobile Pervasive Embedded Sensor Networks
Book: Fundamentals of Mobile and Pervasive Computing, McGraw-Hill, Dec. 2004

5 Growth Trends in data centers
Power density increases
  Circuit density increases by a factor of 3 every 2 years
  Energy efficiency increases by a factor of 2 every 2 years
  Effective power density increases by a factor of 1.5 every 2 years [Kenneth Brill: The Invisible Crisis in the Data Center]
Maintenance/TCO rising
  Data center TCO doubles every three years
  By 2009, the three-year cost of electricity will exceed the purchase cost of the server
  Virtualization/consolidation is a one-time/short-term solution [Uptime Institute]
Thermal management corresponds to an increasing portion of expenses
  Thermal-aware solutions becoming prominent
Increasing need for thermal awareness

6 Motivation
Cooling is the chief driver of increased data center construction cost, costing up to $5000 per square foot in initial purchase price. [1]
Cooling is one of the leading contributors to ongoing total cost of ownership, costing one half to one watt per watt spent on computation. [2]
If we can eliminate even 25% of total cooling costs, that can translate to a $1-$2 million annual cost reduction in a single large data center. [3]

7 Related Work (extended domain)
Physical dimension: IC, case/chassis, room
Software dimension: firmware, O/S, application (middleware)
Techniques placed along these dimensions: dynamic voltage scaling, dynamic frequency scaling, circuitry redundancy, fan speed scaling, CPU load balancing, thermal-aware VM, thermal-aware data center job scheduling

8 Contributions
Developed thermal models of data centers
  Developed analytical thermal models using theoretical thermodynamic formulations
  Developed online thermal models using machine learning techniques
Designed thermal-aware task placement algorithms
  Designed a genetic-algorithm-based task placement algorithm that minimizes the heat recirculation among the servers and the peak inlet temperature
Created a software architecture for dynamic thermal management of data centers
Developed CFD models of real-world data centers for testing and validation of thermal models and task placement algorithms

9 Thermal issues in dense computer rooms
(i.e. data centers, computer clusters, data warehouses)
Heat recirculation
  Hot air from the equipment air outlets is fed back to the equipment air inlets
Hot spots
  Effect of heat recirculation
  Areas in the data center with alarmingly high temperature
Consequence
  Cooling has to be set very low to keep all inlet temperatures in the safe operating range
Courtesy: Intel Labs

10 Conceptual overview of thermal-aware task placement
Task placement determines the temperature distribution
Temperature distribution determines the equipment peak air inlet temperature
Peak air inlet temperature determines the upper bound of the CRAC temperature setting
CRAC temperature setting determines its efficiency (Coefficient of Performance)
The lower the peak inlet temperature, the higher the CRAC temperature can be set, and the higher the CRAC efficiency
[Figure: Coefficient of Performance curve (source: HP)]
Bottom line: there is a task placement that maximizes cooling efficiency. Find it!
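The chain from peak inlet temperature to cooling cost can be sketched numerically. This is only an illustration: the quadratic CoP coefficients below are an assumption taken from the curve HP Labs published for one of its CRAC units (the slide shows the curve only as a figure), and the function names are hypothetical.

```python
def cop(t_sup_c):
    """Coefficient of Performance at CRAC supply temperature t_sup_c (deg C).

    Assumption: quadratic fit reported by HP Labs for one CRAC unit;
    a different unit would need its own curve.
    """
    return 0.0068 * t_sup_c ** 2 + 0.0008 * t_sup_c + 0.458


def cooling_power(computing_power_w, t_sup_c):
    """Power needed to remove computing_power_w of heat at this setting."""
    return computing_power_w / cop(t_sup_c)


def max_supply_temp(t_sup_c, inlet_temps_c, redline_c):
    """Highest supply setting that keeps the hottest inlet at the redline.

    Assumes raising the supply temperature raises every inlet by the
    same amount (steady air flow, linear model).
    """
    return t_sup_c + (redline_c - max(inlet_temps_c))


# Hypothetical example: a placement with a lower peak inlet temperature
# allows a warmer supply setting, hence a higher CoP and less cooling power.
setting_a = max_supply_temp(15.0, [20.0, 24.0, 22.0], redline_c=25.0)
setting_b = max_supply_temp(15.0, [20.0, 21.0, 20.5], redline_c=25.0)
saving_w = cooling_power(100e3, setting_a) - cooling_power(100e3, setting_b)
```

With these made-up inlet temperatures, the second placement lets the CRAC run three degrees warmer than the first.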

11 Prerequisites for thermal management
Task profiling
  CPU utilization, I/O activity etc.
Equipment power profiling
  CPU consumption, disk consumption etc.
Heat recirculation modeling
Task management technologies
Need for a comprehensive research framework

12 Thermal-aware job scheduling
Thermal management research framework
Characterization
  Characterize the power consumption of a given workload (CPU, memory, disk etc.) on a given piece of equipment
Thermal models
  To enable on-line, real-time thermal-aware job scheduling
  Fast (analytical, non-CFD based)
  Non-invasive (machine learning)
  Model the thermal impact of multicore systems
Thermal-aware job scheduling
  On-line job scheduling algorithm to minimize peak air inlet temperature, thus minimizing the cost of cooling
Sandeep Gupta, Qinghui Tang, Tridib Mukherjee, Michael Jonas, Georgios Varsamopoulos

13 Task Profiling measurements at ASU HPC Data Center (one chassis)

14 Power Model and Profiling
Power consumption is mainly affected by the CPU utilization
Power consumption is linear in the CPU utilization
P = a U + b
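The coefficients a and b of the linear power model can be obtained from profiling data by an ordinary least-squares fit. The sketch below assumes a handful of (utilization, power) samples for one chassis; the sample values and the function name are made up for illustration and are not measurements from the ASU data center.

```python
def fit_power_model(utilization, power_w):
    """Least-squares fit of P = a * U + b; returns (a, b)."""
    n = len(utilization)
    mean_u = sum(utilization) / n
    mean_p = sum(power_w) / n
    cov = sum((u - mean_u) * (p - mean_p) for u, p in zip(utilization, power_w))
    var = sum((u - mean_u) ** 2 for u in utilization)
    a = cov / var
    b = mean_p - a * mean_u
    return a, b


# Hypothetical profiling samples: utilization in [0, 1], power in watts.
samples_u = [0.0, 0.25, 0.5, 0.75, 1.0]
samples_p = [1820.0, 1990.0, 2160.0, 2330.0, 2500.0]
a, b = fit_power_model(samples_u, samples_p)
```

Here b plays the role of the idle power of the chassis and a the extra power drawn at full utilization.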

15 Linear Thermal Model
Heat recirculation coefficients
  Analytical
  Matrix-based
Properties of the model
  Granularity at air inlets (discrete/simplified)
  Assumes steadiness of air flow
Tin = Tsup + D × P
  Tin: inlet temperatures
  Tsup: supplied air temperatures
  D: heat distribution
  P: power vector
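Evaluating the linear model is a single matrix-vector product, which is why it is fast. A minimal sketch follows, assuming D is given in kelvin per watt as a list of rows; the three-node matrix and power values are invented for illustration.

```python
def inlet_temperatures(t_sup, d, p):
    """Tin = Tsup + D x P, evaluated row by row.

    t_sup: supplied air temperature per node (deg C)
    d:     heat distribution matrix, d[i][j] in K/W (assumed units)
    p:     power drawn by each node (W)
    """
    return [
        ts + sum(dij * pj for dij, pj in zip(row, p))
        for ts, row in zip(t_sup, d)
    ]


# Hypothetical three-node example.
t_sup = [15.0, 15.0, 15.0]
d = [
    [0.002, 0.001, 0.000],
    [0.001, 0.002, 0.001],
    [0.000, 0.001, 0.003],
]
p = [1000.0, 2000.0, 1000.0]
t_in = inlet_temperatures(t_sup, d, p)
peak = max(t_in)
```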

16 Benefit: fast thermal evaluation
CFD approach: give workload, run CFD simulation (days), extract temperatures (courtesy: Flometrics)
Model approach: give workload, compute vector Tin = Tsup + D × P (seconds), yields temperatures

17 Thermal-aware Task Placement Problem
Given an incoming task, find a task partitioning and placement of subtasks to minimize the (increase of) peak inlet temperature
P = a U + b
Tin = Tsup + D × (a U + b)
  Tin: inlet temperatures
  Tsup: supplied air temperatures
  D: heat distribution
  U: utilization vector
XInt algorithm
  Approximation solution (genetic algorithm)
  Take a feasible solution and perform mutations until a certain number of iterations
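The search described on this slide can be sketched as a mutation-plus-selection loop over utilization vectors. This is a simplified illustration and not the authors' XInt implementation: the mutation operator, population size, and the example matrix are all assumptions, and crossover is left out.

```python
import random


def peak_inlet(u, d, t_sup, a, b):
    """Peak of Tin = Tsup + D x (a*U + b) for utilization vector u."""
    power = [a * ui + b for ui in u]
    return max(t_sup + sum(dij * pj for dij, pj in zip(row, power)) for row in d)


def mutate(u, rng, step=0.1):
    """Move up to `step` utilization between two random nodes.

    The total stays unchanged and every entry stays within [0, 1],
    so the child is still a feasible placement.
    """
    child = list(u)
    src, dst = rng.sample(range(len(u)), 2)
    delta = min(step, child[src], 1.0 - child[dst])
    child[src] -= delta
    child[dst] += delta
    return child


def place(total_u, d, t_sup, a, b, pop_size=20, iterations=200, seed=0):
    """Return the best placement found; never worse than the uniform one."""
    rng = random.Random(seed)
    n = len(d)
    uniform = [total_u / n] * n  # feasible starting point
    population = [uniform] + [mutate(uniform, rng) for _ in range(pop_size - 1)]
    for _ in range(iterations):
        children = [mutate(rng.choice(population), rng) for _ in range(pop_size)]
        ranked = sorted(population + children,
                        key=lambda u: peak_inlet(u, d, t_sup, a, b))
        population = ranked[:pop_size]  # keep the fittest (elitism)
    return population[0]


# Hypothetical four-node data center; node 0 recirculates the most heat.
d = [
    [0.004, 0.001, 0.0, 0.0],
    [0.001, 0.003, 0.001, 0.0],
    [0.0, 0.001, 0.002, 0.0005],
    [0.0, 0.0, 0.0005, 0.001],
]
best = place(total_u=2.0, d=d, t_sup=15.0, a=200.0, b=100.0)
```

Because the uniform placement is kept in the initial population and the fittest vectors always survive, the result can only match or improve on uniform placement under the model.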

18 Contrasted scheduling approaches
Uniform Outlet Profile (UOP)
  Assigning tasks in a way that tries to achieve a uniform outlet temperature distribution
  Assigning more tasks to nodes with low inlet temperature (water-filling process)
Minimum computing energy
  Assigning tasks in a way that keeps the number of active (powered-on) chassis as small as possible
  Server with coolest inlet temperature first
Uniform Task (UT)
  Assigning all chassis the same amount of tasks (power consumption)
  All nodes experience the same power consumption and temperature rise
[Figure labels: Outlet Temperature, Inlet Temperature]
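Two of the baseline policies are simple enough to sketch directly. The code below is one reading of the slide, not the authors' implementation: Uniform Task spreads the load evenly, and the minimum-computing-energy policy is interpreted as filling chassis to capacity one at a time, coolest inlet first. Function names and the example temperatures are hypothetical.

```python
def uniform_task(total_u, n):
    """Uniform Task (UT): every chassis gets the same utilization."""
    return [total_u / n] * n


def fill_coolest_first(total_u, inlet_temps):
    """Use as few chassis as possible, coolest inlet first.

    Each chassis is filled to full utilization (1.0) before the next
    one is used; untouched chassis stay at 0 and could be powered off.
    """
    u = [0.0] * len(inlet_temps)
    remaining = total_u
    for i in sorted(range(len(inlet_temps)), key=lambda k: inlet_temps[k]):
        u[i] = min(1.0, remaining)
        remaining -= u[i]
        if remaining <= 0:
            break
    return u


# Hypothetical example: 2.5 chassis worth of work on four chassis.
ut = uniform_task(2.5, 4)
mce = fill_coolest_first(2.5, [22.0, 18.0, 20.0, 25.0])
active_chassis = sum(1 for x in mce if x > 0)
```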

19 Simulated Environment
Used Flometrics Flovent
Simulated a small-scale data center
  Physical dimensions 9.6 m × 8.4 m × 3.6 m
  Two rows of industry-standard 42U racks
  CRAC supply at 8 m³/s
  10 racks, each equipped with 5 chassis
  1000 processors in the data center
  232 kW at full utilization

20 Performance Results
XInt outperforms other algorithms
Data centers almost never run at 100%
Plenty of room for benefits!

22 Power Vector Distribution
XInt contradicts the “rule of thumb” of placement at the bottom

23 Supply Heat Index (SHI)
Metric developed by HP Labs
Quantifies the overall heat recirculation of a data center
XInt consistently has the lowest SHI
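For reference, SHI is commonly written as the ratio of the heat picked up by the cold air before it reaches the inlets to the total heat carried at the outlets. The sketch below assumes equal airflow through every node, so the ratio reduces to sums of temperature differences; the function name and the sample temperatures are illustrative only.

```python
def supply_heat_index(inlet_temps, outlet_temps, t_sup):
    """SHI = sum(Tin - Tsup) / sum(Tout - Tsup), assuming equal airflow.

    0 means no recirculation (every inlet sees the supply temperature);
    values closer to 1 mean more exhaust heat reaches the inlets.
    """
    infiltration = sum(t - t_sup for t in inlet_temps)
    total_rise = sum(t - t_sup for t in outlet_temps)
    return infiltration / total_rise


# Hypothetical two-node example.
shi = supply_heat_index([17.0, 19.0], [27.0, 29.0], t_sup=15.0)
shi_ideal = supply_heat_index([15.0, 15.0], [25.0, 25.0], t_sup=15.0)
```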

24 Conclusions
Thermal-aware task placement can significantly reduce heat recirculation
XInt performance thrives at around 50% CPU utilization
  Not much can be done at 100% utilization
Cooling savings can exceed 30% (in comparison to other schemes)
Cost of operation reduces by 15% (if initially a 1:1 ratio of computing to cooling)
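The step from 30% cooling savings to a 15% lower cost of operation is plain arithmetic: at a 1:1 computing-to-cooling ratio, cooling is half of the operating cost. A small sketch, with a hypothetical function name, makes the dependence on that ratio explicit.

```python
def total_cost_reduction(cooling_savings, computing_to_cooling_ratio=1.0):
    """Fraction of total operating cost saved.

    cooling_savings: fraction of the cooling cost that is eliminated
    computing_to_cooling_ratio: computing cost divided by cooling cost
    """
    cooling_share = 1.0 / (1.0 + computing_to_cooling_ratio)
    return cooling_savings * cooling_share


reduction_1_to_1 = total_cost_reduction(0.30)       # cooling is 1/2 of cost
reduction_2_to_1 = total_cost_reduction(0.30, 2.0)  # cooling is 1/3 of cost
```

If cooling initially costs less than computing, the same cooling savings translate into a smaller overall reduction.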

25 Related Work in Progress
Waiving simplifying assumptions
  Equipment heterogeneity [INFOCOM 2008]
  Stochastic task arrival
Thermal maps through machine learning
  Automated, non-invasive, cost-effective [GreenCom 2007]
Implementations
  Thermal-aware Moab scheduler
  Thermal-aware SLURM
  SiCortex product thermal management

26 Algorithm Assumptions
HPC model in mind
  Long-running jobs (finish time is the same: infinity)
  One-time arrival (starting time is the same)
  Utilization homogeneity (same utilization throughout the task’s length)
  Non-preemptive, non-movable tasks
Data center equipment homogeneity
  Power consumption
  Computational capability
Cooling is self-controlled

27 Thank You Questions? Comments? Suggestions?

28 References
1) AMD - Power and Cooling in the Data Center. A_PC_WP_en.pdf
2) HP Labs - Going beyond CPUs: The Potential of Temperature-Aware Solutions for the Data Center.
3) HP Labs - Making Scheduling Cool: Temperature-Aware Workload Placement in Data Centers.

29 Additional Slides

30 Functional model of scheduling
Tasks arrive at the data center
Scheduler figures out the best placement
  Placement that has minimal impact on peak inlet temperatures
Assigns tasks accordingly
[Diagram labels: Tasks, Scheduler, Task]

31 Architectural View
[Diagram components: Monitoring Processes, Thermal Model, Machine Learning, Scheduler (Moab, SLURM)]
[Diagram arrow labels: report, control, create/update, provide, dispatch]

32 A simple thermal model
Basic idea:
  We don’t need an extensive CFD model
  We only need to know the effect of recirculation at specific points
  Express recirculation as “coefficients”
[Figure: nodes N1 to N5 (courtesy: Intel Labs)]

33 Recirculation coefficients: a fast thermal model
Reduce/simplify the “thermal map” concept to points of interest: equipment air inlets
Can be computed from CFD models/simulations
Matrix A of recirculation coefficients
  aij: portion of heat exhausted from node i that directly goes to node j
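To show how such coefficients turn into inlet temperatures, the sketch below iterates a steady-state heat balance until it settles. It rests on assumptions not stated on the slide: every node moves the same amount of air (the same k, in watts per kelvin), all inlet air that is not recirculated comes from the CRAC supply, and the coefficients into each node sum to less than 1. Names and numbers are illustrative.

```python
def inlet_from_recirculation(a, power_w, t_sup, k, iterations=200):
    """Steady-state inlet temperatures from recirculation coefficients.

    a[i][j]: portion of heat exhausted from node i that reaches node j
    power_w: power drawn by each node (W)
    t_sup:   CRAC supply temperature (deg C)
    k:       air heat capacity rate per node in W/K (assumed equal)
    """
    n = len(power_w)
    t_in = [t_sup] * n
    for _ in range(iterations):
        # Each node heats its inlet air by P / k.
        t_out = [t_in[i] + power_w[i] / k for i in range(n)]
        # Each inlet mixes recirculated exhaust with supply air.
        t_in = [
            sum(a[i][j] * t_out[i] for i in range(n))
            + (1.0 - sum(a[i][j] for i in range(n))) * t_sup
            for j in range(n)
        ]
    return t_in


# Hypothetical two-node example: node 0 feeds 20% of its exhaust to node 1.
a = [
    [0.1, 0.2],
    [0.0, 0.1],
]
t_in = inlet_from_recirculation(a, [1000.0, 1000.0], t_sup=15.0, k=100.0)
no_recirc = inlet_from_recirculation([[0.0, 0.0], [0.0, 0.0]],
                                     [1000.0, 1000.0], t_sup=15.0, k=100.0)
```

Because the balance is linear in the power vector, the result has the form of the linear thermal model Tin = Tsup + D × P.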

34 Data Center Thermal Management
Increasing need for thermal awareness
  Power density increases
    Circuit density increases by a factor of 3 every 2 years
    Energy efficiency increases by a factor of 2 every 2 years
    Effective power density increases by a factor of 1.5 every 2 years [Kenneth Brill: The Invisible Crisis in the Data Center]
  Maintenance/TCO rising
    Data center TCO doubles every three years
    By 2009, the three-year cost of electricity will exceed the purchase cost of the server
    Virtualization/consolidation is a one-time/short-term solution
  Thermal management corresponds to an increasing portion of expenses
    Thermal-aware solutions becoming prominent
Thermal-aware solutions at various levels
  Physical dimension: IC, case/chassis, room
  Software dimension: firmware, O/S, application (middleware)
  Techniques: dynamic voltage scaling, dynamic frequency scaling, circuitry redundancy, fan speed scaling, CPU load balancing, thermal-aware VM, data center job scheduling
  A dynamic thermal-aware control platform is necessary for online thermal evaluation
Thermal issues
  Heat recirculation
    Increases as equipment density exceeds cooling capacity as planned
  Hot spots
    Effect of heat recirculation
  Impact: cooling has to be set low enough to keep all inlet temperatures in the safe operating range
Opportunities & challenges
  Data centers don’t run at full utilization
    Can choose among multiple CPUs to allocate a job
    Different thermal impact per CPU
  Need for fast thermal evaluation
  Temporal and spatial heterogeneity of data centers
    In equipment
    In workload
[Chart: cooling and computation cost per year ($1M, $10M, $100M); without thermal-aware management (downplaying the advantages of dense server deployment) versus with thermal-aware management]

35 Scheduling Impacts Cooling Setting
Inlet temperature distribution without cooling
Inlet temperature distribution with cooling
Different demands for cooling capacity
[Figure labels: Scheduling 1 (25 °C), Scheduling 2 (25 °C)]

36 Results (1): Recirculation Coefficients
Consistent with datacenter observations
Large values are observed along the diagonal
Strong recirculation among neighboring servers, or between bottom servers and top servers
[Figure: recirculation coefficient plot for nodes 1 to 10, diagonal marked]

