Download presentation
Presentation is loading. Please wait.
1
Resource Signal Prediction and Its Application to Real-time Scheduling Advisors or How to Tame Variability in Distributed Systems Peter A. Dinda Carnegie Mellon University
2
2 Outline Bird’s eye view Highly variable resource availability Real-time scheduling advisor Predicting task running times Characterizing variability with confidence intervals Performance results (feasible, practical, useful) Prototype system Host load prediction Traces, structure, linear models, evaluation RPS Toolkit Conclusion
3
3 A Universal Challenge in High Performance Distributed Applications Highly variable resource availability Shared resources No reservations No globally respected priorities Competition from other users - “background workload” Running time can vary drastically Adaptation
4
4 A Universal Problem Task ? Which host should the application send the task to so that its running time is appropriate? Known resource requirements
5
5 Real-time Scheduling Advisor Distributed interactive applications Examples: CMU Dv/QuakeViz, BBN OpenMap Assumptions Sequential tasks initiated by user actions Aperiodic arrivals Resilient deadlines (soft real-time) Compute-bound tasks Known computational requirements Best-effort semantics Recommend host where deadline is likely to be met Predict running time on that host No guarantees
6
6 Running Time Advisor Predicted Running Time Task ? Application notifies advisor of task’s computational requirements (nominal time) Advisor predicts running time on each host Application assigns task to most appropriate host nominal time
7
7 Real-time Scheduling Advisor Predicted Running Time Task deadline nominal time ? deadline Application notifies advisor of task’s computational requirements (nominal time) and its deadline Advisor acquires predicted task running times for all hosts Advisor recommends one of the hosts where the deadline can be met
8
8 Variability and Prediction t High Resource Availability Variability Low Prediction Error Variability Characterization of variability Exchange high resource availability variability for low prediction error variability and a characterization of that variability t resource t error ACF t Prediction
9
9 Task deadline nominal time ? Confidence Intervals to Characterize Variability Predicted Running Time deadline Application specifies confidence level (e.g., 95%) Running time advisor predicts running times as a confidence interval (CI) Real-time scheduling advisor chooses host where CI is less than deadline CI captures variability to the extent the application is interested in it “3 to 5 seconds with 95% confidence” 95% confidence
10
10 Confidence Intervals And Predictor Quality Bad Predictor No obvious choice Good Predictor Two good choices Predicted Running Time Good predictors provide smaller CIs Smaller CIs simplify scheduling decisions Predicted Running Time deadline
11
11 Overview of Research Results Predicting CIs is feasible Host load prediction using AR(16) models Running time estimation using host load predictions Predicting CIs is practical RPS Toolkit (inc. in CMU Remos, BBN QuO) Extremely low-overhead online system Predicting CIs is useful Performance of real-time scheduling advisor Statistically rigorous analysis and evaluation Measured performance of real system
12
12 Experimental Setup Environment –Alphastation 255s, Digital Unix 4.0 –Workload: host load trace playback –Prediction system on each host Tasks –Nominal time ~ U(0.1,10) seconds –Interarrival time ~ U(5,15) seconds Methodology –Predict CIs / Host recommendations –Run task and measure
13
13 Predicting CIs is Feasible 3000 randomized tasks Near-perfect CIs on typical hosts
14
14 Predicting CIs is Practical - RPS System 1-2 ms latency from measurement to prediction 2KB/sec transfer rate <2% of CPU At Appropriate Rate
15
15 Predicting CIs is Useful - Real-time Scheduling Advisor 16000 tasks Predicted CI < Deadline Random Host Host With Lowest Load
16
16 Outline Bird’s eye view Highly variable resource availability Real-time scheduling advisor Predicting task running times Characterizing variability with confidence intervals Performance results (feasible, practical, useful) Prototype system Host load prediction Traces, structure, linear models, evaluation RPS Toolkit Conclusion
17
17 Design Space Can the gap between the resources and the application can be spanned? yes!
18
18 Resource Signals Characteristics Easily measured, time-varying scalar quantities Strongly correlated with resource availability Periodically sampled (discrete-time signal) Examples Host load (Digital Unix 5 second load average) Network flow bandwidth and latency Leverage existing statistical signal analysis and prediction techniques
19
19 Prototype System RPS components can be composed in other ways
20
20 Research Results Host load on real hosts has exploitable structure –Strong autocorrelation, self-similarity, epochal behavior –Trace database and host load trace playback Host load is predictable using simple linear models –Recommendation: AR(16) models or better for 1-30 sec predictions –RPS Toolkit for low overhead systems (<2% of CPU) C++, ported to 5 OSes, incorporated in CMU Remos, BBN QuO Running time CIs can be computed from load predictions –Load discounting, error covariances Effective real-time scheduling advice can be based on CIs –Know if deadline will be met before running task Statistically rigorous analysis and evaluation
21
21 Outline Bird’s eye view Highly variable resource availability Real-time scheduling advisor Predicting task running times Characterizing variability with confidence intervals Performance results (feasible, practical, useful) Prototype system Host load prediction Traces, structure, linear models, evaluation RPS Toolkit Conclusion
22
22 Questions What are the properties of host load? Is host load predictable? What predictive models are appropriate? Are host load predictions useful?
23
23 Overview of Answers Host load exhibits complex behavior Strong autocorrelation, self-similarity, epochal behavior Host load is predictable 1 to 30 second timeframe Simple linear models are sufficient Recommend AR(16) or better Predictions are useful Can compute effective CIs from them
24
24 Host Load Traces DEC Unix 5 second exponential average Full bandwidth captured (1 Hz sample rate) Long durations
25
25 If Host Load Was “Random” (White Noise)... Time domainAutocorrelation SpectrogramFrequency domain
26
26 Host Load Has Exploitable Structure Time domainAutocorrelation SpectrogramFrequency domain
27
27 Linear Time Series Models (2000 sample fits, largest models in study, 30 secs ahead) Pole-zero / state-space models capture autocorrelation parsimoniously
28
28 Evaluation Methodology Ran ~190,000 randomly chosen testcases on the traces –Evaluate models independently of prediction/evaluation framework No monitoring –~30 testcases per trace, model class, parameter set Data-mine results Offline and online systems implemented using RPS Toolkit
29
29 Testcases Models –MEAN, LAST/BM(32) –Randomly chosen model from: AR(1..32), MA(1..8), ARMA(1..8,1..8), ARIMA(1..8,1..2,1..8), ARFIMA(1..8,d,1..8)
30
30 Evaluating a Testcase Measurements in Fit Interval Model Modeler Load Predictor Evaluator Measurements in Test Interval Prediction Stream z t+n-1,…, z t+1, z t z’ t,t+w z’ t,t+1 z’ t,t+2... z’ t+1,t+1+w z’ t+1,t+2 z’ t+1,t+3... z’ t+2,t+2+w z’ t+2,t+3 z’ t+2,t+4... Model Type Error Metrics Error Estimates One-time use Production Stream Characterization of variation Measurement of variation
31
31 Measured Prediction Variance: Mean Squared Error …, z t+1, z t z’ t,t+w z’ t,t+1 z’ t,t+2... z’ t+1,t+1+w z’ t+1,t+2 z’ t+1,t+3... z’ t+2,t+2+w z’ t+2,t+3 z’ t+2,t+4... 1 step ahead predictions 2 step ahead predictions w step ahead predictions... ( - z t+i ) 2 Variance of z... (z’ t+i,t+i+1 - z t+i+1 ) 2 (z’ t+i,t+i+2 - z t+i+2 ) 2 (z’ t+i,t+i+w - z t+i+w ) 2 1 step ahead mean squared error 2 step ahead mean squared error w step ahead mean squared error... aw = a1 = a2 = z = Load Predictor Good Load Predictor : a1, a2,…, aw z
32
32 Unpaired Box Plot Comparisons Good models achieve consistently low error Mean Squared Error Model AModel BModel C Inconsistent low error Consistent low error Consistent high error 2.5% 25% 50% Mean 75% 97.5%
33
33 1 second Predictions, All Hosts 2.5% 25% 50% Mean 75% 97.5% Predictive models clearly worthwhile
34
34 30 second Predictions, All Hosts 2.5% 25% 50% Mean 75% 97.5% Predictive models clearly beneficial even at long prediction horizons
35
35 30 Second Predictions, High Load, Dynamic Host 2.5% 25% 50% Mean 75% 97.5% Predictive models clearly worthwhile Begin to see differentiation between models
36
36 RPS Toolkit Extensible toolkit for implementing resource prediction systems Easy “buy-in” for users C++ and sockets (no threads) Prebuilt prediction components Libraries (sensors, time series, communication) Users have bought in Incorporated in CMU Remos, BBN QuO Used in research by Bruce Lowekamp, Nancy Miller, LeMonte Green
37
37 Outline Bird’s eye view Highly variable resource availability Real-time scheduling advisor Predicting task running times Characterizing variability with confidence intervals Performance results (feasible, practical, useful) Prototype system Host load prediction Traces, structure, linear models, evaluation RPS Toolkit Conclusion
38
38 Related Work Distributed interactive applications QuakeViz/ Dv, Aeschlimann [PDPTA’99] Quality of service QuO, Zinky, Bakken, Schantz [TPOS, April 97] QRAM, Rajkumar, et al [RTSS’97] Distributed soft real-time systems Lawrence, Jensen [assorted] Workload studies for load balancing Mutka, et al [PerfEval ‘91] Harchol-Balter, et al [SIGMETRICS ‘96] Resource signal measurement systems Remos [HPDC’98] Network Weather Service [HPDC‘97, HPDC’99] Host load prediction Wolski, et al [HPDC’99] (NWS) Samadani, et al [PODC’95] Hailperin [‘93] Application-level scheduling Berman, et al [HPDC’96] Stochastic Scheduling, Schopf [Supercomputing ‘99]
39
39 Conclusions Tame variability in distributed systems Resource signal prediction Predict running times as confidence intervals –Predicting CIs is feasible Host load prediction using AR(16) models Running time estimation using host load predictions –Predicting CIs is practical RPS Toolkit (inc. in CMU Remos, BBN QuO) Extremely low-overhead online system –Predicting CIs is useful Performance of real-time scheduling advisor
40
40 Future Work (Near Term) New resource signals –Network bandwidth and latency (Remos) New prediction approaches –Wavelets, nonlinearity Resource scheduler models –Better Unix scheduler model –Network models Adaptation advisors Applications and workloads –QuakeViz/DV, GIMP, Instrumentation
41
41 Tools/Venues for Future work Resource signal methodolgy RPS Toolkit Remos QuakeViz/DV Grid Forum
42
42 Future Work (Long Term) Experimental computer science research Application-oriented view Measurement studies and analysis Statistical approach Application services Systems building systems X applications X statistics
43
43 Teaching “Signals, systems, and statistics for computer scientists” “Performance data analysis” “Introduction to computer systems”
Similar presentations
© 2025 SlidePlayer.com. Inc.
All rights reserved.