1
On Dynamic Resource Availability in Grids
A. Iosup, O. Sonmez, D.H.J. Epema (PDS Group, ST/EWI, TU Delft)
M. Jan (LRI/INRIA Futurs Paris, INRIA)
IEEE Grid 2007, Austin, TX, USA
November 24, 2018
2
Grids are far from being reliable job execution environments
In today's grids, reliability is more important than performance!
Server: …% reliable
Small Cluster: 99.999% reliable
99.9% reliable
Production Cluster: 5x decrease in failure rate after first year [Sch06]
DAS-2: >10% of jobs fail [Ios06]
TeraGrid: 20-45% failures [Kha06]
Grid3: 27% failures, retries [Dum05]
CERN LCG: 25.29% of jobs now fail, and that's after 5 years (3 ~production). Source: dboard-gr.cern.ch, May '07.

We see each day that a commercially acceptable server needs to be …% reliable. That's one error every … hours! As soon as we increase in size, failures become more frequent. A top cluster manufacturer advertises only "three nines" reliability. We are not unhappy with this value. [goal for grids] It is well known that the reliability of a system is lower in the beginning and towards the end of its life (the famous bath-tub shape). Production clusters have been found to be xx to yy % reliable, after passing the ??-year margin [Sch06]. With grids, we seem to still be in the beginning: errors occur very frequently. Recent studies have shown that ... [Ios06, Kha06]

[Ios06] A. Iosup and D. H. J. Epema. GrenchMark: A framework for analyzing, testing, and comparing grids. In CCGRID, pages 313–320. IEEE Computer Society, 2006.
[Kha06] O. Khalili, J. He, et al. Measuring the performance and reliability of production computational grids. In GRID. IEEE Computer Society, 2006.
[Dum05] C. Dumitrescu, I. Raicu, and I. T. Foster. Experiences in running workloads over Grid3. In H. Zhuge and G. Fox, editors, GCC, volume 3795 of LNCS, pages 274–286, 2005.

Large-scale clusters: 1 error per day lasting a few hours (4 hours?) at 1000 resources -> 999 resources + (24-4)/24 resources per day = 999.8
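The back-of-the-envelope estimate in the notes for this slide can be checked directly. This is a minimal sketch, assuming one failure per day that takes a single resource down for 4 hours (the notes themselves mark the 4 hours as a guess); the function name is illustrative and not from the talk.

```python
def average_available_resources(total, failures_per_day, downtime_hours):
    """Time-averaged number of available resources over one day.

    Assumes each failure takes exactly one resource down and that
    failures do not overlap in time.
    """
    lost_resource_hours = failures_per_day * downtime_hours
    return total - lost_resource_hours / 24.0


avg = average_available_resources(total=1000, failures_per_day=1, downtime_hours=4)
print(round(avg, 1))         # 999.8
print(round(avg / 1000, 5))  # 0.99983, the fraction of resources available
```

Under these assumptions a single daily failure already costs about one sixth of a resource on average.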
3
Sources of failure in grids and what to do about them
1. User applications: failures [HLi07], error propagation [Tha02]
2. Grid + user middleware: grid middleware tests [Ios07]
3. Machine + OS (resource availability): desktop grids [Kon04]; cluster-based grids?
We don't know much about our work-horse, the cluster-based grid!
4
Research Questions
Q1: What are the characteristics of the resource (un)availability in grids?
Q2: What is the performance impact of the dynamic grid resource availability?
5
Outline
Introduction and Motivation
Resource Availability in Grids
    Environment and Traces
    Availability Characteristics
    Availability Model
Performance Impact of Resource Availability
Conclusion and Future Work
6
2.1. Grid’5000: Environment and Traces
Grid'5000
    Experimental platform: 9 sites, 15 clusters
    Target: 5000 nodes, >5000 CPUs
    All clusters managed by OAR
Traces
    Jobs: 05/ /2006 (30 mo.)
    Resource availability: 05/ /2006 (18 mo.)
    Jobs: 951 K
    Availability status change events: 600 K
    Users: 473, Groups: 10
    CPUs: ~2500, Consumed CPU time: 651 years
7
2.2. Availability Characteristics (1/3) Number of Resources Over Time
[Figure: number of available resources over time, grid-level view, with the average availability marked]
Grid-level availability
    Availability range: 35-92%
    Average availability: 69%
Cluster-level availability
    Availability range: 32-98%
8
2.2. Availability Characteristics (2/3) Mean Time Between Failures (MTBF) Mean Time to Repair (MTTR)
MTBF
    Grid-level: ~12 minutes
    Node-level: ~2 days
MTTR
    Node-level: ~half a day
[Figure: CDF of the time between failures, TBF [s], grid-level view]
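To relate the node-level figures to an availability fraction, the textbook steady-state formula A = MTBF / (MTBF + MTTR) can be applied. This is a rough sketch and not a computation from the talk: it assumes MTBF is the mean up time between a repair and the next failure, that a node simply alternates between up and down periods, and it reads "~2 days" as 48 hours and "~half day" as 12 hours.

```python
def steady_state_availability(mtbf_hours, mttr_hours):
    """Long-run fraction of time a node is up, for a node that
    alternates between up periods (mean MTBF) and down periods (mean MTTR)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


# Assumed readings of the slide values: ~2 days up, ~half a day to repair.
node_availability = steady_state_availability(mtbf_hours=48.0, mttr_hours=12.0)
print(node_availability)  # 0.8
```

The result is only an order-of-magnitude check; it is not the measured average availability reported for the traces.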
9
2.2. Availability Characteristics (3/3) Correlated Failures
Correlated failure: a maximal set of failures (ordered by increasing event time), of time parameter $\Delta$, in which for any two successive failures $E$ and $F$, $ts(F) \le ts(E) + \Delta$, where $ts(\cdot)$ returns the timestamp of the event; $\Delta$ = … s.
Size of correlated failures (grid-level view)
    Range: 1-339
    Average: 11
Cluster span
    Range: 1-3
    Average: 1.06
    Failures "stay" within cluster
[Figure: CDF of the size of correlated failures, grid-level view, with the average marked]
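The grouping described by this definition can be sketched in a few lines. The sketch assumes that two successive failures belong to the same correlated failure when their timestamps are at most the time parameter apart; the timestamps and the 60-second value below are made up for illustration and are not taken from the traces.

```python
def group_correlated_failures(timestamps, delta):
    """Split failure timestamps (seconds) into maximal groups in which
    every two successive failures are at most `delta` seconds apart."""
    groups = []
    for ts in sorted(timestamps):
        if groups and ts - groups[-1][-1] <= delta:
            groups[-1].append(ts)   # close enough to the previous failure
        else:
            groups.append([ts])     # gap too large: start a new group
    return groups


# Hypothetical failure events, in seconds since the start of a trace.
events = [0, 10, 25, 400, 405, 2000]
groups = group_correlated_failures(events, delta=60)
sizes = [len(g) for g in groups]
print(sizes)                    # [3, 2, 1]
print(sum(sizes) / len(sizes))  # 2.0, the average size
```

The size of a correlated failure is then the number of events in its group, which is the quantity whose range and average the slide reports.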
10
2.3. Availability Model
Model components: MTBF, MTTR, correlated failures (see article for per-cluster parameter values)
Assume no correlation of failure occurrence between clusters
Which site/cluster? $f_s$, the fraction of failures at cluster $s$
Weibull distribution for the inter-arrival time (IAT)
    Shape parameter > 1: increasing hazard rate; the longer a node is online, the higher the chances that it will fail
Reboot machines just to keep them in the "safe zone"?
    Software recovers from accumulating problems
    Hardware may crash during reboot; otherwise rebooting won't help
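The link between the Weibull shape parameter and the hazard rate can be made concrete with the standard hazard function h(t) = (k/λ)(t/λ)^(k-1), where k is the shape and λ the scale. The sketch below is illustrative only: the parameter values and the per-cluster fractions are invented (the real per-cluster values are in the article), and `sample_failure` is a hypothetical helper showing one way to draw a cluster and an inter-arrival time.

```python
import random


def weibull_hazard(t, shape, scale):
    """Instantaneous failure rate of a Weibull distribution at time t > 0."""
    return (shape / scale) * (t / scale) ** (shape - 1)


def hazard_trend(shape, scale=10.0):
    """Classify how the hazard rate changes between an early and a late time."""
    early, late = weibull_hazard(1.0, shape, scale), weibull_hazard(20.0, shape, scale)
    if late > early:
        return "increasing"
    if late < early:
        return "decreasing"
    return "constant"


def sample_failure(rng, fractions, shape, scale):
    """Pick the cluster of the next failure with probability f_s and
    draw the inter-arrival time from a Weibull distribution."""
    clusters = list(fractions)
    cluster = rng.choices(clusters, weights=[fractions[c] for c in clusters])[0]
    return cluster, rng.weibullvariate(scale, shape)  # (scale, shape) order


for k in (0.5, 1.0, 2.0):
    print(k, hazard_trend(k))  # 0.5 decreasing / 1.0 constant / 2.0 increasing

# Hypothetical failure fractions per cluster (must sum to 1).
fractions = {"cluster-a": 0.7, "cluster-b": 0.3}
cluster, iat = sample_failure(random.Random(0), fractions, shape=2.0, scale=10.0)
```

A shape above 1 is what motivates the reboot question on the slide: under such a model, a freshly restarted node sits at the low end of the hazard curve.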
11
Outline
Introduction and Motivation
Resource Availability in Grids
Performance Impact of Resource Availability
    Models of Availability Information
    Performance Evaluation
Conclusion and Future Work
12
3.1. Models of Availability Information
4 types of systems, placed along two axes:
    Resource availability: Static, Dynamic
    Availability information delay: On-Time (0), Short period, Long period
1. SA, Steady Availability: availability Static; info. delay On-Time; real? N
2. KA, Known Availability: availability Dynamic; info. delay On-Time; real? N Y
3. AMA, Automated Monitoring of Availability: availability Dynamic; info. delay Short-Time; real? Y
4. HMA, Human Monitoring of Availability: availability Dynamic; info. delay Long-Time; real? Y
Grids vs. clusters: resource availability information is not always present
13
3.2. Performance Evaluation (1/2) Overview
Scenario: SA vs. KA vs. AMA vs. HMA
    AMA: information update every 60s, 1h
    HMA: information update every 1 week, 1 month, never (fixed availability information)
Metrics
    Traditional: AWT, ART
    Adapted for dynamic availability: utilization, throughput, goodput
    Failures: number and reason (submission or execution)
Experimental setup
    Custom trace-driven discrete event simulator
    Simulate Grid'5000
    Use job and availability traces: 06/ /2006 (739K jobs)
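The difference between the dynamic-availability scenarios comes down to how stale the scheduler's view is. The sketch below is a simplified illustration, not the simulator from the article: it assumes the availability information is refreshed at multiples of the update period starting at time 0, and the event list and function name are hypothetical.

```python
import bisect


def scheduler_view(changes, t, update_period):
    """Number of nodes the scheduler believes are available at time t.

    changes: sorted list of (time_s, available_nodes) status changes.
    update_period: 0 for on-time information, None for never updated,
    otherwise the refresh interval in seconds.
    """
    if update_period is None:
        last_update = 0                  # only the initial information is known
    elif update_period == 0:
        last_update = t                  # information is always current
    else:
        last_update = (t // update_period) * update_period
    times = [time for time, _ in changes]
    idx = bisect.bisect_right(times, last_update) - 1
    return changes[max(idx, 0)][1]


# Hypothetical trace: 100 nodes at start, 90 after 50 s, 70 after 4000 s.
changes = [(0, 100), (50, 90), (4000, 70)]
periods = {"KA": 0, "AMA 60s": 60, "AMA 1h": 3600, "HMA 1w": 7 * 24 * 3600, "HMA never": None}
views = {name: scheduler_view(changes, t=5000, update_period=p) for name, p in periods.items()}
for name, nodes in views.items():
    print(name, nodes)  # KA 70, AMA 60s 70, AMA 1h 90, HMA 1w 100, HMA never 100
```

In this toy trace the longer the update period, the more nodes the scheduler wrongly believes to be available, which is the kind of mismatch that can turn into submission or execution failures.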
14
3.2. Performance Evaluation (2/2) Sample Results
[Figure: three panels, one bar per model (SA, KA, AMA 60s, AMA 1h, HMA 1w, HMA 1mo, HMA Never): Avg. Norm. G'put. [cpus/day/proc], Avg. Norm. T'put. [jobs/day/proc], Job Failures [%]]
Goodput decreases with intervention delay
Throughput decreases with intervention delay
KA = AMA > HMA 1wk > HMA 1mo >> SA
Human intervention leads to submission failures
Any intervention is better than no intervention
See article for detailed results
15
4. Conclusion and Future Work
Q1: What are the characteristics of the resource (un)availability in grids?
    Location, Inter-arrival (MTBF), Duration (MTTR), Size (Cor. Failures)
    Model for grid resource availability
Q2: What is the performance impact of the dynamic grid resource availability?
    Four models for grid resource availability information
    Trace-based performance evaluation through simulations
    KA = AMA > HMA >> SA
Future Work
    Validate the model for other grids
    Model-based performance evaluation
    Availability-aware scheduling policies for grids
16
Thank you! Questions? Remarks? Observations?
Help build our community's Grid Workloads Archive: add your job and resource availability traces!