Adaptable Approach to Estimating Thermal Effects in a Data Center Environment Corby Ziesman IMPACT Lab Arizona State University
Outline Introduction to the thermal model Overview of architecture and neural net Determining metrics Learning phase effects Software architecture Summary Future Work
Introduction
Thermal model part 1: the air intake heat of node i is the sum of two terms:
–Supplied cool air
–Air heat from other nodes’ exhaust that reaches node i (cross interference)
The formal abstract heat model uses coefficients to represent the level of cross interference, but these are hard to determine without disrupting data center operation, and they vary from one data center to another. A practical way to estimate them is needed in order to accurately determine the thermal characteristics of the data center and the effects of thermal-aware scheduling decisions.
Introduction
Thermal model part 2: the air exhaust heat of node i is the sum of two terms:
–Air intake heat (from part 1)
–Heat from consumed electrical power
We can try to use a learning algorithm to estimate the cross interference from other nodes, using sensors that monitor measurable values such as air exhaust and intake temperatures (and possibly power consumption) to fine-tune the estimate. In simulation, where the part 1 model and known cross-interference coefficients are available, we can compare this approach's estimates against the actual values (future work).
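The two parts of the thermal model can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the layout of the `cross` matrix (`cross[j][i]` is taken to be the fraction of node j's exhaust heat that reaches node i's intake), and all numbers are hypothetical and are not taken from the formal model.

```python
# Hypothetical sketch of the two-part heat model; names and values are illustrative only.

def intake_heat(supplied, exhaust, cross):
    """Part 1: intake heat of each node = supplied cool air
    + exhaust heat from the OTHER nodes that reaches it (cross interference).

    cross[j][i] is an assumed coefficient: the fraction of node j's
    exhaust heat that arrives at node i's intake.
    """
    n = len(exhaust)
    return [
        supplied[i] + sum(cross[j][i] * exhaust[j] for j in range(n) if j != i)
        for i in range(n)
    ]


def exhaust_heat(intake, power):
    """Part 2: exhaust heat of each node = intake heat + heat from consumed power."""
    return [q_in + p for q_in, p in zip(intake, power)]


if __name__ == "__main__":
    supplied = [100.0, 100.0]           # cool air supplied to each node (arbitrary units)
    exhaust = [150.0, 160.0]            # current exhaust heat of each node
    cross = [[0.0, 0.1], [0.2, 0.0]]    # made-up cross-interference coefficients
    q_in = intake_heat(supplied, exhaust, cross)
    print(q_in, exhaust_heat(q_in, [50.0, 45.0]))
```

In a simulation, a learned estimate of `cross` could be compared against the known coefficients by running both through `intake_heat` and measuring the difference.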
Overview: Architecture, Neural Net, Process Cycle
Finding a Proper Metric
To check the thermal effects of each decision and readjust the weights of the neural nets, proper evaluation criteria must be determined.
Metrics:
–Number and severity of any “hot spots” in the data center environment
–Average data center ambient temperature
Finding a Proper Metric
What is a hot spot?
–How do we define what constitutes a hot spot, and how large an area counts as a spot?
–5 degrees above surrounding areas within a 10-foot radius?
–10 degrees above surrounding areas within a 5-foot radius?
–These values will need to be determined through experimentation (future work)
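One possible reading of the hot spot question can be sketched as follows. It assumes temperature sensors at known floor positions and treats a spot as "the sensors within a given radius of some sensor", compared against all sensors outside that radius. The function name, the data layout, and this definition of "surrounding areas" are assumptions for illustration; the radius and threshold remain parameters to be tuned by experiment.

```python
from math import hypot
from statistics import mean


def find_hot_spots(readings, radius_ft, delta_deg):
    """Return (x, y, excess) for each sensor-centred area that is a hot spot.

    readings:  list of (x_ft, y_ft, temperature) tuples (hypothetical layout).
    radius_ft: radius of the candidate spot around each sensor.
    delta_deg: how far the spot's mean temperature must exceed the mean of
               the sensors outside the radius to count as a hot spot.
    """
    spots = []
    for cx, cy, _ in readings:
        inside = [t for x, y, t in readings if hypot(x - cx, y - cy) <= radius_ft]
        outside = [t for x, y, t in readings if hypot(x - cx, y - cy) > radius_ft]
        if not outside:
            continue  # nothing to compare against
        excess = mean(inside) - mean(outside)
        if excess >= delta_deg:
            spots.append((cx, cy, excess))  # excess doubles as a severity score
    return spots
```

The length of the returned list gives the number of hot spots, and the `excess` values give a rough severity, which matches the first metric listed above.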
Can Only Improve?
It would be a great advantage if, during the learning phase of the neural net, performance were no worse than that of a scheduler that ignores thermal effects (i.e., learning would not harm overall performance; it could only make it better).
Can Only Improve?
Case 1: Neural net weights result in an action that worsens the thermal environment
–This effect is detected, the weights are adjusted, and the neural net becomes less likely to repeat this decision
–A thermal-agnostic approach, which ignores thermal effects in the first place, would also produce negative effects, but would do nothing to avoid making the same mistake in the future
Case 2: Neural net weights result in an action that improves the thermal environment
–This effect is also detected, and the weights are adjusted so that this action is even more likely to be reproduced in the future
Overall, the negative impact of the learning phase should be negligible: comparable to, or smaller than, the random negative outcomes that result from a thermal-agnostic approach.
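The two cases can be illustrated with a deliberately simplified stand-in for the neural net: one preference weight per node, nudged up or down according to the measured change in a thermal metric. This is not the neural net itself, and the names `choose_node`, `apply_feedback`, and `rate` are hypothetical.

```python
def choose_node(weights):
    """Pick the node with the highest preference weight."""
    return max(weights, key=weights.get)


def apply_feedback(weights, node, metric_before, metric_after, rate=0.1):
    """Adjust the weight of the node that was chosen.

    The metric is assumed to be "lower is better" (e.g. average ambient
    temperature). A worse metric lowers the weight (Case 1); a better
    metric raises it (Case 2).
    """
    improvement = metric_before - metric_after
    weights[node] += rate * improvement


if __name__ == "__main__":
    weights = {"a": 1.0, "b": 0.9}
    node = choose_node(weights)                # "a" has the highest weight
    apply_feedback(weights, node, 25.0, 27.0)  # Case 1: environment got worse
    print(choose_node(weights))                # the penalized node is no longer preferred
```

A thermal-agnostic scheduler corresponds to never calling `apply_feedback`, so it would keep choosing the same node regardless of the outcome.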
Software Architecture
To achieve high performance, create a Node data structure for each node in the data center. Each Node contains an update() function that runs continuously, monitoring the node’s eligibility to receive new jobs. The update() function relies on other functions, such as a check() function that checks for thermal effects some length of time after a job has been scheduled on this node.
Each update() function runs in a separate thread, so that every node’s information is autonomously and continuously updated and corrected, and the scheduler can query it as needed without a high processing-time overhead.
–It is easier to perform many small calculations continually than to do them all at once: this reduces latency
–Every Node is reasonably up to date on whether or not it is in a good position to take on a new job
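A minimal sketch of this design using Python's standard `threading` module is shown below. Only `Node`, `update()`, and `check()` come from the slides; the sensor callback `read_temp`, the `max_temp` threshold, the polling `interval`, and the simplified body of `check()` (a plain temperature threshold rather than a delayed post-scheduling check) are assumptions made for illustration.

```python
import threading


class Node:
    """One data center node whose eligibility is refreshed by its own thread."""

    def __init__(self, name, read_temp, max_temp=30.0, interval=0.01):
        self.name = name
        self._read_temp = read_temp    # hypothetical sensor callback
        self._max_temp = max_temp      # assumed eligibility threshold
        self._interval = interval      # seconds between refreshes
        self._eligible = False
        self._lock = threading.Lock()
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self.update, daemon=True)

    def start(self):
        self._thread.start()

    def stop(self):
        self._stop.set()
        self._thread.join()

    def check(self):
        """Simplified thermal check: is the node cool enough for a new job?"""
        return self._read_temp() < self._max_temp

    def update(self):
        """Run continuously, keeping the cached eligibility current."""
        while not self._stop.is_set():
            ok = self.check()
            with self._lock:
                self._eligible = ok
            self._stop.wait(self._interval)  # sleep, but wake early on stop()

    def is_eligible(self):
        """Cheap read for the scheduler: no sensor access on this path."""
        with self._lock:
            return self._eligible
```

The scheduler only calls `is_eligible()`, which returns a cached value, so the cost of reading sensors is paid in the background threads rather than at scheduling time.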
Summary
This approach aims to approximate the formal heat model developed by Qinghui Tang. By using neural nets and multithreaded programming, we hope to create an adaptable, self-configuring, and time-efficient system that avoids excess heat and thereby reduces cooling costs. In theory, this approach can only be beneficial (if the evaluation criteria are accurate).
Future Work
–Write new code and compare results in simulation with the formal model
–Evaluate performance (response time) and scalability
–Determine the proper metrics for accurately detecting hot spots so that they can be avoided
–Test on real hardware in a real environment
Questions