Forecasting with Cyber-physical Interactions in Data Centers (part 3) Lei Li leili@cs.cmu.edu 9/28/2011 PDL Seminar
Big Picture: Predictive AC Control and Server Management. [Diagram components: server/workload management, computing energy model, sensor measuring, model of computing energy, temperature prediction, cooling energy model, CRAC control] (c) Lei Li 2012
Outline: Overview of time series mining; Motivation; Experimental setup; Time series examples; What problems do we solve; ThermoCast: the forecasting model; Results; Other time series models and algorithms
Experimental setup: tested in the JHU data center with 171 1U servers, instrumented with a network of 80 sensors.
Sample measurements
Observations: The temperature difference cycle (max/min temperature on the same rack) is in anti-phase with the air velocity cycle. The middle and bottom sections are coldest; the top is hottest. Shutting down under-utilized servers could reduce energy consumption.
What happens when servers are shut down? [Figure annotation: shut down]
ThermoCast [Li et al, KDD 2011]. Given: intake temperatures, outtake temperatures, and workload for each server, plus floor air speed. Goal: forecast the temperature distribution and support thermal-aware placement of workload. Approach: a zonal forecasting model that divides the machine room into zones and each rack into sections.
Assumptions
A0: incompressible air
A1: environmental temperature is constant
A2: supply air temperature is constant within a period
A3: constant server fan speed
A4: vertical air flow at the outtake is negligible
A5: vertical air flow at the intake is linear in height
Sensor measurements & Air interactions
ThermoCast
ThermoCast Model. Variables: outlet temperature, inlet temperature, floor air speed. Derived from fluid dynamics and thermodynamics together with the assumptions [Li et al, KDD 2011].
Parameter Learning s.t.
ThermoCast Results. Q1: How accurately can a server learn its local thermal dynamics for prediction? ThermoCast is 2x better than the AR baseline, using 90 minutes of data for training and predicting 5 minutes ahead. [Figure: AR vs. ThermoCast; plot annotations: 75%, 100%, shutdown]
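The AR baseline named on this slide can be sketched as an ordinary least-squares autoregressive fit. This is a minimal illustration, not the setup from the talk: the 30-second sampling interval, the model order, the synthetic temperature trace, and the function names `fit_ar` and `forecast_ar` are all assumptions made for the example.

```python
import numpy as np

def fit_ar(series, order):
    """Least-squares fit of x[t] = c + sum_k a[k] * x[t-1-k]."""
    x = np.asarray(series, dtype=float)
    # Each row holds the `order` values preceding the target, newest first.
    rows = [x[t - order:t][::-1] for t in range(order, len(x))]
    design = np.column_stack([np.ones(len(rows)), np.array(rows)])
    coef, *_ = np.linalg.lstsq(design, x[order:], rcond=None)
    return coef  # coef[0] is the intercept, coef[1:] are the lag weights

def forecast_ar(history, coef, steps):
    """Roll the fitted model forward, feeding predictions back in."""
    order = len(coef) - 1
    window = list(np.asarray(history, dtype=float)[-order:])
    out = []
    for _ in range(steps):
        recent = window[::-1]  # newest value first, matching fit_ar
        nxt = coef[0] + float(np.dot(coef[1:], recent))
        out.append(nxt)
        window = window[1:] + [nxt]
    return np.array(out)

# Hypothetical trace: one sample every 30 s, so 90 min = 180 samples
# for training and 5 min = 10 samples to predict.
t = np.arange(190)
temps = 20.0 + 2.0 * np.sin(2 * np.pi * t / 40.0)  # synthetic, noise-free
train, truth = temps[:180], temps[180:]

coef = fit_ar(train, order=2)
pred = forecast_ar(train, coef, steps=10)
print(np.max(np.abs(pred - truth)))
```

A purely data-driven model like this sees only past temperatures; the slide's comparison is against ThermoCast, which also uses workload and air speed.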
ThermoCast Results. Q2: How far ahead can ThermoCast forecast thermal alarms? 2x faster than the baseline.
         Baseline   ThermoCast
Recall   62.8%      71.4%
FAR      45%        43.1%
MAT      2.3 min    4.2 min
FAR = false alarm rate; MAT = mean look-ahead time
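The three metrics in this table can be computed from a list of actual alarm times and a list of times at which warnings were raised. The sketch below is one plausible reading, not the evaluation code from the talk: the matching rule (a warning counts if an alarm follows within `horizon` minutes), the definition of FAR as the fraction of warnings with no matching alarm, and the function name `alarm_metrics` are assumptions.

```python
def alarm_metrics(actual, warnings, horizon):
    """Return (recall, false alarm rate, mean look-ahead time).

    actual   -- times (minutes) at which a thermal alarm really occurred
    warnings -- times (minutes) at which the forecaster raised a warning
    horizon  -- assumed maximum lead for a warning to count as a hit
    """
    lead_times = []
    for a in actual:
        leads = [a - w for w in warnings if 0 <= a - w <= horizon]
        if leads:
            # The earliest matching warning gives the longest lead.
            lead_times.append(max(leads))
    false_alarms = [
        w for w in warnings
        if not any(0 <= a - w <= horizon for a in actual)
    ]
    recall = len(lead_times) / len(actual) if actual else 0.0
    far = len(false_alarms) / len(warnings) if warnings else 0.0
    mat = sum(lead_times) / len(lead_times) if lead_times else 0.0
    return recall, far, mat

# Hypothetical event times, in minutes.
actual = [10.0, 30.0, 50.0]
warnings = [6.0, 27.0, 40.0, 70.0]
recall, far, mat = alarm_metrics(actual, warnings, horizon=5.0)
print(recall, far, mat)
```

Under these definitions a forecaster can trade recall against FAR by changing how readily it raises warnings, which is why the table reports both alongside the look-ahead time.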
Implication on Capacity Gain. Preliminary results comparing workload placement strategies, with a 5-minute forecast length. With the same cooling: inlet temperature with ThermoCast is 13.75 C; inlet temperature with static profiling is 16.5 C. Assuming the servers consume 200 W on average (Dell PowerEdge 1950), we gain an extra 26% computing power with the same cooling.
Contributions and Impact. Predictability: a hybrid approach that integrates thermodynamics and sensor data. Scalable learning/training thanks to the zonal thermal model. Real data and instrumentation in a data center with practical workload. Projected impact: can handle an extra 26% workload (e.g. PUE 1.5 → PUE 1.4).
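The PUE figures quoted on this slide are consistent with a simple reading of PUE (total facility power divided by IT power) in which the non-IT overhead stays fixed while the IT load grows by 26%. That fixed-overhead assumption and the helper name `pue_after_it_growth` are ours, not stated in the talk.

```python
def pue_after_it_growth(pue_before, it_growth):
    """PUE after IT load grows by `it_growth`, assuming fixed overhead.

    PUE = (IT power + overhead power) / IT power, with the original
    IT power normalized to 1.0.
    """
    overhead = pue_before - 1.0   # cooling and other non-IT power
    it_after = 1.0 + it_growth    # IT power after the extra workload
    return (it_after + overhead) / it_after

pue_after = pue_after_it_growth(1.5, 0.26)
print(round(pue_after, 3))
```

Starting from PUE 1.5, the overhead is 0.5 per unit of IT power; spreading it over 1.26 units gives 1.76 / 1.26, which rounds to 1.4.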
DynaMMo: imputation/forecasting. [Figure: readings over time from sensor 1, sensor 2, …, sensor m, with a blackout period] Goal: recover the missing values. Details in [Li et al, KDD 2009].
DynaMMo result. [Figure: reconstruction error vs. average missing length (average length of successive missing values) for Spline, MSVD [Srebro'03], Linear Interpolation, and DynaMMo; lower error is better, longer gaps are harder, and the ideal is marked] Why is there a drop at 100? Each point is the average of 10 repeats with randomly placed missing values, so there is variance. Dataset: CMU Mocap #16 (mocap.cs.cmu.edu). More results in [Li et al, KDD 2009].
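For reference, the linear interpolation baseline named in this comparison can be sketched in a few lines. This is only that baseline, not DynaMMo; the synthetic signal, the blackout position, and the function names are assumptions for illustration.

```python
import numpy as np

def linear_impute(values):
    """Fill NaN gaps by linear interpolation between known neighbors.

    Assumes at least one value is observed; gaps at either end are
    filled with the nearest known value (np.interp's behavior).
    """
    x = np.asarray(values, dtype=float)
    idx = np.arange(len(x))
    known = ~np.isnan(x)
    filled = x.copy()
    filled[~known] = np.interp(idx[~known], idx[known], x[known])
    return filled

def reconstruction_error(truth, filled, missing):
    """Mean squared error over the positions that were missing."""
    truth = np.asarray(truth, dtype=float)
    return float(np.mean((truth[missing] - filled[missing]) ** 2))

# Hypothetical single-sensor sequence with a 10-sample blackout.
truth = np.sin(np.linspace(0.0, 2.0 * np.pi, 50))
observed = truth.copy()
observed[20:30] = np.nan
missing = np.isnan(observed)

filled = linear_impute(observed)
err = reconstruction_error(truth, filled, missing)
print(err)
```

Interpolation treats each sequence independently, so it cannot use correlations between sensors to fill a long gap.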
PLiF and CLDS for clustering. BGP data: hierarchical clustering + PLiF features. Details in [Li et al, VLDB 2010] and [Li & Prakash, ICML 2011].
CLDS Clustering Mocap Data (walking motion vs. running motion). CLDS, two features: accuracy = 93.9%. PCA, top 2 components: accuracy = 51.0%.
WindMine. Goal: find patterns and anomalies from user-click streams.
Discoveries by WindMine. [Figure panels: job website, weather, kids, health]
Conclusion: Time series mining has many applications. Energy consumption in data centers is substantial, and cooling is a large part of the cost. Sensor networks are useful for data center monitoring. ThermoCast: the forecasting model. Other time series models and algorithms: DynaMMo for imputation, PLiF & CLDS for clustering, WindMine for web clicks.
References
Lei Li, et al. ThermoCast: A Cyber-Physical Forecasting Model for Data Centers. KDD 2011.
Lei Li, et al. Time Series Clustering: Complex is Simpler. ICML 2011.
Yasushi Sakurai, Lei Li, et al. WindMine: Fast and Effective Mining of Web-click Sequences. SDM 2011.
Lei Li, et al. Parsimonious Linear Fingerprinting for Time Series. VLDB 2010.
Lei Li, et al. DynaMMo: Mining and Summarization of Coevolving Sequences with Missing Values. KDD 2009.
Thanks! contact: Lei Li (leili@cs.cmu.edu) papers, software, datasets on http://www.cs.cmu.edu/~leili