Thermal-aware Issues in Computers IMPACT Lab
Part A Overview of Thermal-related Technologies
Importance of thermal management
► Cooling cost is very high
  Providing cool air: costs about as much power as is consumed in computation
  Bringing the cooling medium (air/liquid) to the circuitry: at new power densities, about $2 per Watt of material/equipment for ICs of 40+ Watts
► Excessive heat accelerates material degradation
► Power density is only going to increase in the future
Thermal management at various levels
► Physical dimension
  At IC level
  At chassis/case level
  At room level
► Software dimension
  Firmware level
  Operating system level
  Middleware level
  Application level
Image sources: Intel, Apple, Berkeley Lab
At integrated circuit level
► Issues
  Higher temperature leads to increased power leakage
  Increased power leakage leads to higher temperature
  Heat density: hot spots
► Applied solutions
  Dynamic Voltage Scaling
  Dynamic Frequency Scaling
  Clock gating ("pause" mode)
► Research solutions
  Redundant circuitry
    ► Redundant "cores" [Chapparro 2004]
    ► Redundant pipelines [Lim 2002]
    ► Switch from one circuit to the other, either at regular intervals or when the temperature exceeds a threshold
At chassis/case level
► Issues
  Fan capacity at low RPMs is not enough for the generated heat
  Fan noise level at high RPMs is too high
► Solutions
  Dynamic fan speed
  CPU load balancing
  Activity adjustments
    ► Dynamic memory bandwidth scaling [Apple TN2156]
    ► Dynamic FSB frequency scaling
  Layout forces air to flow in a linear fashion
Image sources: Intel, Apple
Terms: inlets, outlets
At room level
► Solutions
  Pause execution of tasks
  Turn machines off
► Performance impacts
  Degraded performance
Image source: Elibo, Hong Kong
Terms: hot aisle, cold aisle, raised floors, CRAC/HVAC
A typical data center Source: Siemens Terms: hot aisle, cold aisle, raised floors, CRAC/HVAC
CRAC & thermal maps: knowing where the hot spots are
► Purpose
  Know the air temperature at any 3-D point
  Adjust CRAC operation
  Adjust computer operation
► Obtained by
  Strategically placed sensors
  On-board sensors
► Predicted by
  Thorough testing
  CFD simulations
Thermal issues in dense computer rooms (data centers, computer clusters, data warehouses)
► Heat recirculation
  Hot air from the equipment outlets is fed back to the equipment inlets
► Hot spots
  An effect of heat recirculation
  Areas in the data center with alarmingly high temperature
► Impact
  Cooling has to be set very low to keep all inlet temperatures within the safe operating range
Image courtesy: Intel Labs
Terms: heat recirculation, hot spots, inlet temperatures, outlet temperatures, redline temperature, peak temperature
Thermal management solutions
(Chart: solutions placed along two axes)
► Physical dimension: IC, case/chassis, room
► Software dimension: firmware, O/S, application (middleware)
► Solutions shown
  Dynamic voltage scaling
  Dynamic frequency scaling
  Circuitry redundancy
  Fan speed scaling
  CPU load balancing
  Thermal-aware JVM
  Data center job scheduling
Part B Reducing Heat Recirculation (at room level)
Reducing heat recirculation (1)
► Heat recirculation is the only reason for increased inlet temperatures
  Without recirculation, the inlet temperatures would be equal to the supplied air temperature
► The peak inlet temperature defines the CRAC operational temperature
(Figure: inlet temperature distribution without cooling and with cooling; reference line at 25 C)
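The relation between peak inlet temperature and CRAC setting can be illustrated with a little arithmetic. This is a minimal sketch, assuming each inlet temperature equals the supply temperature plus a fixed rise caused by recirculation. The function name `required_supply_temp`, the rise values, and the use of 25 C as the redline are all illustrative assumptions, not measurements from the slides.

```python
def required_supply_temp(inlet_rise_c, redline_c):
    """Highest supply temperature that keeps every inlet at or below redline.

    Assumes inlet[i] = supply + inlet_rise_c[i], so the inlet with the
    largest rise (the peak inlet) dictates the CRAC setting.
    """
    return redline_c - max(inlet_rise_c)


# Hypothetical per-inlet temperature rises due to recirculation, in degrees C.
inlet_rise = [2.0, 5.5, 9.0, 3.5]
redline = 25.0  # assumed redline inlet temperature, in degrees C

supply = required_supply_temp(inlet_rise, redline)
inlets = [supply + rise for rise in inlet_rise]
print(supply)       # 16.0
print(max(inlets))  # 25.0
```

Only the worst inlet matters here: lowering the largest rise by one degree lets the supply air be one degree warmer, which is the motivation for reducing recirculation.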
Reducing heat recirculation (2)
► First things first
  Find the causes of it
  Find ways to predict it
► What is causing it
  1. The air flow from the CRAC is not adequate to feed all inlets
  2. Imperfect layout
► Usually 1. and 2. are not adjustable once the equipment is bought and in place
  Find other ways to reduce it
Reducing heat recirculation (3)
► Other ways to reduce it
  Find which equipment contributes the most to heat recirculation
  Mitigate the heat recirculation by throttling activity at the main contributors
  (contributor = equipment unit that is generating heat)
  (throttling activity = changing the jobs or how they are executed)
► How do we know how much heat each piece of equipment contributes to recirculation?
  But first: how do we know how much heat each piece of equipment generates? (i.e. its power profile)
Reducing heat recirculation (general plan of action)
  Assess the effect of a task on the equipment (CPU, memory, I/O)
  Assess the heat generated by the equipment from the task
  Assess how much of that heat is recirculated
  Assess the inlet temperatures given the heat recirculation
► If we had a mechanism like this, we could predict the effects of a running (or potentially running) job and decide its fate according to those effects
Terms: task profile, power profile, thermal map prediction
Task profiling (1)
► Task profiling
  Assess how much CPU utilization, memory activity, disk I/O, network traffic, etc. the application generates
► Task profiling can be done
  Offline, by code analyzers, or
  Online, by test runs
► Dirty (and convenient) fact about HPC (high-performance computing):
  Incoming jobs have highly predictable profiles
Power profiling
► Power profiling
  Assess how much heat is generated by each component (i.e. CPU, memory, disk I/O, network, etc.)
  Assess how much power is consumed by each component (i.e. CPU, memory, disk I/O, network, etc.)
► Power profiling is usually performed offline
Example results of power profiling
► Power consumption is mainly affected by the CPU utilization
► Power consumption is linear in the CPU utilization
  P = a U + b
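The coefficients a and b of the linear model P = a U + b can be estimated from offline measurements. Below is a minimal sketch using an ordinary least-squares fit. The function name `fit_linear_power` and the sample readings are hypothetical, chosen only to show the mechanics.

```python
def fit_linear_power(utilization, power):
    """Ordinary least-squares fit of P = a*U + b; returns (a, b)."""
    n = len(utilization)
    mean_u = sum(utilization) / n
    mean_p = sum(power) / n
    sxx = sum((u - mean_u) ** 2 for u in utilization)
    sxy = sum((u - mean_u) * (p - mean_p) for u, p in zip(utilization, power))
    a = sxy / sxx           # Watts per unit of CPU utilization
    b = mean_p - a * mean_u  # power drawn at zero utilization (idle power)
    return a, b


# Hypothetical measurements: CPU utilization (0..1) and power draw in Watts.
samples_u = [0.0, 0.25, 0.5, 0.75, 1.0]
samples_p = [150.0, 175.0, 200.0, 225.0, 250.0]

a, b = fit_linear_power(samples_u, samples_p)
print(a, b)
```

In this model b is the idle power of the machine and a is the extra power drawn at full utilization, so both can be read directly from the fit.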
A simple thermal model
(Diagram: air enters a machine from the A/C and from other machines; the power consumed adds heat; air leaves to the A/C and to other machines)
Effect of CPU utilization on outlet temperature
► Task profiling
  Assess how much CPU utilization the application generates
► Outlet temperature is a function of utilization plus inlet temperature
  T_outlet = f(U) + T_inlet
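The slides leave f(U) unspecified. One possible form, sketched below, assumes a steady state in which all consumed power leaves the machine as heat in its air stream, so the temperature rise is P / (rho * flow * c_p). The linear power model P = a U + b is taken from the power profiling results. The airflow value, the air constants, and the function names are illustrative assumptions.

```python
AIR_DENSITY = 1.19  # kg/m^3, approximate value for air near room temperature
AIR_CP = 1005.0     # J/(kg*K), approximate specific heat of air


def power_watts(u, a, b):
    """Linear power model P = a*U + b."""
    return a * u + b


def outlet_temp(u, t_inlet_c, a, b, airflow_m3s):
    """T_outlet = f(U) + T_inlet with f(U) = P(U) / (rho * flow * c_p).

    Assumption: all consumed power is carried away by the air stream.
    """
    rise = power_watts(u, a, b) / (AIR_DENSITY * airflow_m3s * AIR_CP)
    return t_inlet_c + rise


# Hypothetical server: 150 W idle, 250 W at full load, 0.02 m^3/s of airflow.
for u in (0.0, 0.5, 1.0):
    print(u, round(outlet_temp(u, 20.0, 100.0, 150.0, 0.02), 2))
```

Because P is linear in U, this f(U) is linear as well, and the outlet temperature shifts one-for-one with the inlet temperature.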
Assessing recirculation for the given computational tasks
► Assessing recirculation
  Obtain the thermal map for the given task assignment
► Compare with offline measurements
► But we don't need to know the temperature at every point in the air
  Only at the inlets and the outlets
(Figure: nodes N1 to N5)
Image courtesy: Intel Labs
Recirculation coefficients
► Purpose
  Know the air temperature at any 3-D point
  Adjust CRAC operation
  Adjust computer operation
► Obtained by
  Strategically placed sensors
  On-board sensors
► Predicted by
  Thorough testing
  CFD simulations
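The slides do not give a formula for the recirculation coefficients. One way to picture them, assumed here purely for illustration, is as a matrix in which entry [i][j] states how many degrees C the inlet of node i rises per Watt consumed at node j. Inlet temperatures then follow from the supply temperature and each node's power. The function name `inlet_temps` and all numbers below are hypothetical.

```python
def inlet_temps(t_supply_c, coeff, power):
    """Predict inlet temperatures under an assumed linear recirculation model.

    T_in[i] = T_supply + sum_j coeff[i][j] * power[j]
    coeff[i][j]: degrees C added to inlet i per Watt consumed at node j.
    """
    return [
        t_supply_c + sum(c_ij * p_j for c_ij, p_j in zip(row, power))
        for row in coeff
    ]


# Hypothetical 3-node room; in practice the coefficients would come from
# sensor measurements or CFD simulations.
coeff = [
    [0.000, 0.004, 0.002],
    [0.002, 0.000, 0.004],
    [0.008, 0.006, 0.000],
]
power = [250.0, 150.0, 200.0]  # Watts consumed at each node

temps = inlet_temps(15.0, coeff, power)
print([round(t, 2) for t in temps])
```

With all coefficients at zero the model reduces to the no-recirculation case, where every inlet sees exactly the supplied air temperature.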
How scheduling impacts cooling cost
(Figure: inlet temperature distributions without cooling and with cooling, reference line at 25 C, shown for Scheduling 1 and Scheduling 2)
► Different schedules place different demands on cooling capacity
Part C Integrated Thermal-aware Management
Functional model of scheduling
► Tasks arrive at the data center
► Scheduler figures out the best placement
  The placement that has minimal impact on peak inlet temperatures
► Scheduler assigns each task accordingly
(Diagram: tasks flow into the scheduler, which assigns each task to a machine)
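A minimal sketch of such a placement step follows. It is a greedy heuristic, not necessarily the algorithm used in this work: each task goes to the free node that yields the lowest predicted peak inlet temperature. It assumes identical long-running tasks, one task per node, and a linear recirculation model with a hypothetical coefficient matrix (degrees C at inlet i per Watt at node j). All names and numbers are illustrative.

```python
def peak_inlet(t_supply_c, coeff, power):
    """Peak inlet temperature under the assumed linear recirculation model."""
    return max(
        t_supply_c + sum(c * p for c, p in zip(row, power)) for row in coeff
    )


def place_tasks(num_tasks, coeff, idle_w, busy_w, t_supply_c=15.0):
    """Greedy placement of identical tasks, one per node.

    Returns the sorted list of node indices that received a task.
    """
    n = len(coeff)
    busy = [False] * n
    for _ in range(num_tasks):
        best_node, best_peak = None, None
        for node in range(n):
            if busy[node]:
                continue
            # Power vector if this task were placed on `node`.
            trial = [busy_w if (busy[j] or j == node) else idle_w
                     for j in range(n)]
            peak = peak_inlet(t_supply_c, coeff, trial)
            if best_peak is None or peak < best_peak:
                best_node, best_peak = node, peak
        if best_node is None:
            raise ValueError("more tasks than free nodes")
        busy[best_node] = True
    return [i for i in range(n) if busy[i]]


# Hypothetical 3-node room: node 0 heats the inlet of node 2 the most.
coeff = [
    [0.000, 0.004, 0.002],
    [0.002, 0.000, 0.004],
    [0.008, 0.006, 0.000],
]
print(place_tasks(1, coeff, idle_w=150.0, busy_w=250.0))  # [2]
print(place_tasks(2, coeff, idle_w=150.0, busy_w=250.0))  # [1, 2]
```

The greedy choice avoids node 0 for as long as possible because it is the largest contributor to the hottest inlet. A lower predicted peak lets the CRAC supply warmer air, which is where the cooling savings come from.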
Architectural View Scheduler (SLURM)
Part D Potential Term Projects
Scheduling Algorithms
► Current work assumes incoming jobs that
  Are identical (same profile)
  Are long-running
► Enhance the scheduling algorithm to work with
  Heterogeneous data centers
  Asynchronous job arrival
  Jobs with non-identical execution times
Scheduler Programming
► Enhance existing job management software (Moab, SLURM, etc.) to support
  Gathering thermal data
  Assigning jobs according to policy