Thermal-aware Issues in Computers IMPACT Lab. Part A Overview of Thermal-related Technologies.

1 Thermal-aware Issues in Computers IMPACT Lab

2 Part A Overview of Thermal-related Technologies

3 Importance of thermal management
► Cooling cost is very high:
  - Providing cool air: the power used equals the power consumed in computation
  - Bringing the cooling medium (air/liquid) to the circuitry: new densities require $2 per Watt of material/equipment for ICs of 40+ Watts
► Excessive heat accelerates material degradation
► Power density will only increase in the future

4 Thermal management at various levels
► Physical dimension
  - At IC level
  - At chassis/case level
  - At room level
► Software dimension
  - Firmware level
  - Operating system level
  - Middleware level
  - Application level
Image sources: Intel, Apple, Berkeley Lab

5 At integrated circuit level
► Issues
  - Higher temperature → increased power leakage
  - Increased power leakage → higher temperature
  - Heat density: hot spots
► Applied solutions
  - Dynamic voltage scaling
  - Dynamic frequency scaling
  - Clock gating (“pause” mode)
► Research solutions
  - Redundant circuitry
    - Redundant “cores” [Chapparro 2004]
    - Redundant pipelines [Lim 2002]
    - Switch from one circuit to the other, either regularly or when the temperature exceeds a threshold

6 At chassis/case level
► Issues
  - Fan capacity at low RPMs is not enough for the generated heat
  - Fan noise at high RPMs is too high
► Solutions
  - Dynamic fan speed
  - CPU load balancing
  - Activity adjustments
    - Dynamic memory bandwidth scaling [Apple TN2156]
    - Dynamic FSB frequency scaling
Figure note: the layout forces air to flow in a linear fashion
Image sources: Intel, Apple
Terms: inlets, outlets

7 At room level
► Solutions
  - Pause execution of tasks
  - Turn machines off
► Performance impacts
  - Degraded performance
Image sources: www.cix.ie; Elibo, Hong Kong
Terms: hot aisle, cold aisle, raised floors, CRAC/HVAC

8 A typical data center
Image source: Siemens
Terms: hot aisle, cold aisle, raised floors, CRAC/HVAC

9 CRAC & thermal maps: knowing where the hot spots are
► Purpose
  - Know the air temperature at any 3-D point
  - Adjust CRAC operation
  - Adjust computer operation
► Obtained by
  - Strategically placed sensors
  - On-board sensors
► Predicted by
  - Thorough testing
  - CFD simulations

10 Thermal issues in dense computer rooms (data centers, computer clusters, data warehouses)
► Heat recirculation
  - Hot air from the equipment outlets is fed back to the equipment inlets
► Hot spots
  - An effect of heat recirculation
  - Areas in the data center with alarmingly high temperature
► Impact
  - The cooling set point has to be very low to keep all inlet temperatures within the safe operating range
Courtesy: Intel Labs
Terms: heat recirculation, hot spots, inlet temperatures, outlet temperatures, redline temperature, peak temperature

11 Thermal management solutions
► Physical dimension (axis): IC, case/chassis, room
► Software dimension (axis): firmware, O/S, application (middleware)
► Solutions placed on the chart
  - Dynamic voltage scaling
  - Dynamic frequency scaling
  - Circuitry redundancy
  - Fan speed scaling
  - CPU load balancing
  - Thermal-aware JVM
  - Data center job scheduling

12 Part B Reducing Heat Recirculation (at room level)

13 Reducing heat recirculation (1)
► Heat recirculation is the only reason for increased inlet temperatures
  - Without recirculation, the inlet temperatures would equal the supplied air temperature
► The peak inlet temperature defines the CRAC operational temperature
[Figure: inlet temperature distribution without cooling and with cooling; 25 °C marked]
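
The link between peak inlet temperature and the CRAC setting can be illustrated with a minimal sketch. It assumes each inlet temperature is the supplied air temperature plus a fixed rise caused by recirculation, and that the rises do not change with the supply temperature. The function name, the sample rises, and the use of 25 °C as the redline are illustrative assumptions, not values taken from the slides.

```python
def max_supply_temp(inlet_rises_c, redline_c):
    """Return the highest supply temperature that keeps every inlet
    at or below the redline.

    inlet_rises_c: hypothetical per-inlet temperature rise (deg C) over the
    supplied air, caused by heat recirculation.
    redline_c: maximum safe inlet temperature (deg C), assumed here.
    """
    if not inlet_rises_c:
        # No inlets to protect: the supply may sit at the redline itself.
        return redline_c
    # The hottest inlet (peak rise) dictates how cold the supply must be.
    return redline_c - max(inlet_rises_c)


# Made-up rises for four inlets; the worst one is 8 deg C.
rises = [0.0, 2.0, 5.0, 8.0]
print(max_supply_temp(rises, redline_c=25.0))      # 17.0
# Without recirculation every rise is zero, so the supply can equal the redline.
print(max_supply_temp([0.0] * 4, redline_c=25.0))  # 25.0
```

Under these assumptions, lowering the peak rise by any amount raises the allowed supply temperature by the same amount, which is why the peak inlet temperature, not the average, sets the cooling demand.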

14 Reducing heat recirculation (2)
► First things first
  - Find its causes
  - Find ways to predict it
► What is causing it
  1. The air flow from the CRAC is not adequate to feed all inlets
  2. Imperfect layout
► Usually 1. and 2. cannot be adjusted once the equipment is bought and in place
  - Find other ways to reduce it

15 Reducing heat recirculation (3)
► Other ways to reduce it
  - Find which units contribute the most to heat recirculation
  - Mitigate heat recirculation by throttling activity at the main contributors
    (contributor = equipment unit that is generating heat)
    (throttling activity = changing the jobs or how they are executed)
► How to know how much heat each unit contributes to recirculation?
  - But first: how to know how much heat each unit generates? (i.e., its power profile)

16 Reducing heat recirculation (general plan of action)
1. Assess the effect of a task on the equipment (CPU, memory, I/O)
2. Assess the heat generated by the equipment from the task
3. Assess how much of that heat is recirculated
4. Assess the inlet temperatures given the heat recirculation
► If we had a mechanism like this
  - we could predict the effects of a running (or potentially running) job and
  - decide its fate according to those effects
Terms: task profile, power profile, thermal map prediction

17 Task profiling (1)
► Task profiling
  - Assess how much CPU utilization, memory activity, disk I/O, network traffic, etc. the application generates
► Task profiling can be done
  - Offline, by code analyzers, or
  - Online, by test runs
► Dirty (and convenient) fact about HPC (high-performance computing):
  - Incoming jobs have highly predictable profiles

18 Power profiling
► Power profiling
  - Assess how much heat is generated by each component (i.e., CPU, memory, disk I/O, network, etc.)
  - Assess how much power is consumed by each component (i.e., CPU, memory, disk I/O, network, etc.)
► Power profiling is usually performed offline

19 Example results of power profiling
► Power consumption is mainly affected by CPU utilization
► Power consumption is linear in CPU utilization: $P = aU + b$
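
As a rough illustration of how the coefficients of the linear model $P = aU + b$ might be obtained from offline power profiling, the sketch below fits a straight line by ordinary least squares. The fitting method, the function names, and the sample readings are assumptions made for illustration; the slides do not say how a and b were derived, and the numbers are made up.

```python
def fit_linear_power(utilizations, powers):
    """Least-squares fit of P = a*U + b.

    utilizations: CPU utilization samples in [0, 1].
    powers: measured power draw (Watts) for each sample.
    Returns (a, b).
    """
    n = len(utilizations)
    if n < 2 or n != len(powers):
        raise ValueError("need at least two matching (U, P) samples")
    mean_u = sum(utilizations) / n
    mean_p = sum(powers) / n
    var_u = sum((u - mean_u) ** 2 for u in utilizations)
    if var_u == 0:
        raise ValueError("utilization samples must not all be equal")
    cov = sum((u - mean_u) * (p - mean_p)
              for u, p in zip(utilizations, powers))
    a = cov / var_u          # slope: extra Watts per unit of utilization
    b = mean_p - a * mean_u  # intercept: idle power in Watts
    return a, b


def predict_power(a, b, utilization):
    """Evaluate the fitted model P = a*U + b."""
    return a * utilization + b


# Hypothetical readings from one machine (not measured data).
u_samples = [0.0, 0.25, 0.5, 0.75, 1.0]
p_samples = [150.0, 175.0, 200.0, 225.0, 250.0]
a, b = fit_linear_power(u_samples, p_samples)
print(a, b)
print(predict_power(a, b, 0.6))
```

In this reading, b is the idle power of the machine and a is the additional power drawn at full CPU utilization.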

20 A simple thermal model
[Diagram labels: from A/C, to A/C, power consumed, from other machines, to other machines]

21 Effect of CPU utilization on outlet temperature
► Task profiling
  - Assess how much CPU utilization the application generates
► Outlet temperature is a function of utilization plus the inlet temperature: $T_{outlet} = f(U) + T_{inlet}$
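
The slide leaves $f(U)$ unspecified. One possible form, sketched below, combines the linear power model $P = aU + b$ from slide 19 with a steady-state energy balance in which all consumed power leaves the machine as heat in the air stream, so the temperature rise is the power divided by (air density × volumetric air flow × specific heat). This choice of $f$, the air constants, the air-flow figure, and the values of a and b are all assumptions for illustration.

```python
AIR_DENSITY = 1.2         # kg/m^3, approximate value for air near room temperature
AIR_SPECIFIC_HEAT = 1005  # J/(kg*K), approximate value for air


def outlet_temp(utilization, t_inlet, a, b, airflow_m3s):
    """Estimate T_outlet = f(U) + T_inlet under the assumed energy balance.

    utilization: CPU utilization in [0, 1].
    t_inlet: inlet air temperature (deg C).
    a, b: coefficients of the assumed linear power model P = a*U + b (Watts).
    airflow_m3s: volumetric air flow through the chassis (m^3/s), hypothetical.
    """
    if airflow_m3s <= 0:
        raise ValueError("air flow must be positive")
    power_w = a * utilization + b
    # All consumed power is assumed to heat the air passing through.
    rise = power_w / (AIR_DENSITY * airflow_m3s * AIR_SPECIFIC_HEAT)
    return t_inlet + rise


# Illustrative numbers only: 250 W at full load, 0.05 m^3/s of air, 20 deg C inlet.
print(outlet_temp(1.0, t_inlet=20.0, a=100.0, b=150.0, airflow_m3s=0.05))
```

With a fixed fan speed this $f$ is itself linear in U; with dynamic fan speed the air flow would also depend on load, which this sketch does not model.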

22 Assessing recirculation for the given computational tasks
► Assessing recirculation
  - Obtain the thermal map for the given task assignment
  - Compare with offline measurements
► But we don’t need to know the temperature at every point in the air
  - Only at the inlets and the outlets
Courtesy: Intel Labs
[Figure: nodes N1 to N5]

23 Recirculation coefficients
► Purpose
  - Knowing air temperature at any 3-D point
  - Adjust CRAC operation
  - Adjust computer operation
► Obtaining by
  - Strategically placed sensors
  - On-board sensors
► Predicting by
  - Thorough testing
  - CFD simulations

24 How scheduling impacts cooling cost
[Figure: inlet temperature distributions without cooling and with cooling for Scheduling 1 and Scheduling 2; 25 °C marked]
► Different schedules place different demands on cooling capacity

25 Part C Integrated Thermal-aware Management

26 Functional model of scheduling
► Tasks arrive at the data center
► The scheduler figures out the best placement
  - The placement that has minimal impact on peak inlet temperatures
► The scheduler assigns each task accordingly
[Diagram: tasks, scheduler]
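
The functional model can be sketched as a greedy placement loop. The sketch assumes a linear recirculation model in which the inlet temperature of node i is the supply temperature plus the sum over nodes j of d[i][j] × power[j], where d is a hypothetical matrix of recirculation coefficients (degrees C of inlet rise at node i per Watt consumed at node j). The matrix values, the function names, and the greedy strategy are illustrative assumptions, not the scheduler described in the slides.

```python
def inlet_temps(d, powers, t_sup):
    """Predict inlet temperatures under the assumed linear model:
    T_in[i] = t_sup + sum_j d[i][j] * powers[j]."""
    return [t_sup + sum(dij * pj for dij, pj in zip(row, powers))
            for row in d]


def place_task(d, powers, task_power, t_sup, candidates):
    """Pick the candidate node whose use yields the lowest peak inlet temperature.

    d: hypothetical recirculation coefficients (deg C per Watt).
    powers: current power draw of every node (Watts).
    task_power: extra power the task adds to its host node (Watts).
    candidates: indices of nodes that can accept the task.
    Returns (node_index, predicted_peak_inlet_temp).
    """
    best_node, best_peak = None, float("inf")
    for node in candidates:
        trial = list(powers)
        trial[node] += task_power  # try the task on this node
        peak = max(inlet_temps(d, trial, t_sup))
        if peak < best_peak:
            best_node, best_peak = node, peak
    return best_node, best_peak


# Made-up three-node example: node 0's exhaust reaches the other inlets the most.
d = [
    [0.000, 0.010, 0.005],
    [0.020, 0.000, 0.010],
    [0.030, 0.020, 0.000],
]
powers = [100.0, 100.0, 100.0]
node, peak = place_task(d, powers, task_power=100.0, t_sup=15.0,
                        candidates=[0, 1, 2])
print(node, peak)
```

With these made-up coefficients the task lands on node 2, whose exhaust recirculates least into the hottest inlet. A greedy choice per task is only one option; it does not guarantee the best placement for a whole batch of tasks.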

27 Architectural View Scheduler (SLURM)

28 Part D Potential Term Projects

29 Scheduling Algorithms
► Current work assumes incoming jobs that
  - Are identical (same profile)
  - Are long-running
► Enhance the scheduling algorithm to work with
  - Heterogeneous data centers
  - Asynchronous job arrival
  - Jobs with non-identical execution times

30 Scheduler Programming
► Enhance existing job management software (Moab, SLURM, etc.) to support
  - Gathering thermal data
  - Assigning jobs according to policy

