Davide Salomoni INFN-CNAF Bologna, Jan 12, 2006 Hunting High (Availability) and Low (Level) (or, who wants low-availability systems?)
Farm (High-) Availability
- To state the obvious: middleware works if and only if the lower layers (“underware”?) work. Otherwise, the king is naked.
  - Will mostly consider farming only here (but there are other subsystems, like networking, for example) and, within farming, I will be taking a fairly low-level approach (no details about specific services like AFS, NFS, DNS, etc.).
  - High performance is another story.
- Having “working lower layers” should not be regarded as a given, especially if the set-up is large/complex enough.
  - Because it requires complex/composite solutions, money, testing and, above all, people and know-how.
  - And, as we all know, “Complexity is the enemy of reliability.”™
- Applies to the Tier-1, but (at least) to the future Tier-2s as well.
- Please do not underestimate the importance of these issues.
Strawman Example
- Clustering (in this context): a group of computers which trust each other to provide a service even when system components fail.
- When one machine goes down, others take over its work: IP address takeover, name takeover, service takeover, etc.
- Without this, we will have a very hard time keeping reasonable SLAs.
  - Considering also INFN- (or Italian-) specific constraints, e.g. off-hours coverage capabilities/possibilities.
  - Max downtime for typical SLAs: 99% ~3.65 days/year; 99.9% ~8.8 hours/year.
- How does this affect a site with e.g. 80 servers?
  - Consider drives only (MTBF = 1e6 hours), 1 drive/server: 80 hard drives x 8760 hours in a year = 700800 drive-hours; 700800 / 1e6 = ~0.70 expected drive failures per year across the site (i.e. ~0.88% annual failure rate per drive).
  - If time to repair is 6 hours (a fairly unrealistically good estimate) => 0.70 * 6 = ~4.2 hours of downtime per year for some service, i.e. about half of the 99.9% budget.
  - And this is for drives only; if you want the total MTBF you should do 1/MTBF(total) = 1/MTBF(subcomponent1) + 1/MTBF(subcomponent2) + ...
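The arithmetic on this slide can be reproduced with a short sketch. It assumes a constant failure rate of 1/MTBF per device-hour and independent failures (assumptions not stated on the slide); the function names are illustrative only.

```python
HOURS_PER_YEAR = 8760


def max_downtime_hours(availability: float) -> float:
    """Yearly downtime budget allowed by an SLA (0.999 -> about 8.76 hours)."""
    return (1.0 - availability) * HOURS_PER_YEAR


def expected_failures(n_devices: int, mtbf_hours: float,
                      hours: float = HOURS_PER_YEAR) -> float:
    """Expected number of failures in a fleet over `hours`,
    assuming a constant failure rate of 1/MTBF per device-hour."""
    return n_devices * hours / mtbf_hours


def combined_mtbf(*mtbf_hours: float) -> float:
    """MTBF of a system that fails as soon as any subcomponent fails:
    1/MTBF(total) = sum of 1/MTBF(subcomponent)."""
    return 1.0 / sum(1.0 / m for m in mtbf_hours)


if __name__ == "__main__":
    failures = expected_failures(80, 1e6)   # 80 servers, 1 drive each
    downtime = failures * 6                 # 6 hours to repair each failure
    print(f"expected drive failures/year: {failures:.2f}")
    print(f"expected downtime/year:       {downtime:.1f} h")
    print(f"99.9% SLA budget:             {max_downtime_hours(0.999):.2f} h")
```

For 80 drives with an MTBF of 1e6 hours, `expected_failures` gives about 0.70 failures per year for the whole site. Adding further subcomponents through `combined_mtbf` can only lower the total MTBF.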
Some Applications to Grid MW
How does system reliability affect grid middleware? Comments taken from Markus' presentation at the WLCG meeting, Dec 20, 2005:
- BDII: an RB is statically configured to use a certain BDII (could use “smart” aliases here).
- RB: cannot sit behind a load-balanced alias yet. If an RB is rebooted, jobs in steady state will not be affected; jobs in transit may be lost.
- CE: cannot sit behind a load-balanced alias yet. If a CE is rebooted, jobs in steady state will not be affected; jobs in transit may be lost.
- MyProxy: jobs currently can only have a single PX server. Downtime can cause many jobs to fail.
- FTS: currently depends critically on the PX servers specified in transfer jobs. A down FTS may stop any new file transfers (no intrinsic redundancy implemented, at least as of FTS 1.4).
- LFC: downtime may cause many jobs to fail. An LFC instance may have read-only replicas.
- SE: too many parameters to consider here (CASTOR/dCache/DPM/etc.). Some experiments fail over to other SEs on writes, while fail-over on reads is only possible for replicated data sets (might cause chaotic data transfers and bad usage of network and CPU).
- RGMA: clients can only handle a single instance.
- VOMS: critical for SC4; there must be at least a hot spare.
- VOBOX: downtimes could cause significant numbers of job failures.
- g-PBox: to become critical for implementing VO and site policies for job management.
- DGAS: to become vital for user-level accounting.
- GridICE: only a single instance for now.
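As an illustration of the “smart” alias idea mentioned for the BDII, here is a minimal sketch that decides which candidate hosts should currently back an alias, based on a simple TCP health probe. The host names, the port number 2170 and the function names are assumptions made for the example; how the resulting list is published (e.g. in DNS) is left out.

```python
import socket
from typing import Callable, Iterable, List


def tcp_alive(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name-resolution failures.
        return False


def alias_members(candidates: Iterable[str], port: int,
                  probe: Callable[[str, int], bool] = tcp_alive) -> List[str]:
    """Hosts that should currently back the alias: those answering the
    probe, kept in their configured order of preference."""
    return [host for host in candidates if probe(host, port)]


if __name__ == "__main__":
    # Hypothetical host names; port 2170 is assumed for the BDII service.
    hosts = ["bdii1.example.org", "bdii2.example.org"]
    print(alias_members(hosts, 2170))
```

A TCP connect only shows that something is listening; a real check would query the service itself. The `probe` argument is there so that a stricter check can be swapped in.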
Points for consideration - 1
- Relying only on application-level (specifically, grid middleware) HA mechanisms is currently unrealistic. Besides, a correct implementation will very likely take quite some time, or won’t be done at all (for lack of resources, or because of architectural constraints).
- Hence, we should [also] invest in lower-level HA efforts.
  - This, of course, goes together with the idea that the more redundancy the middleware has in itself, the better.
- This item is high on the list of activities of the INFN Tier-1. We are evaluating and testing several alternatives and approaches:
  - Application monitoring.
  - Multiple virtual machines on a single physical machine (to protect e.g. against an OS crash, to provide rolling upgrades of OS and applications, etc.).
  - Multiple physical machines.
- But more tests could be done, and more effectively, with the contribution of other entities (example: to-be T2 sites, SA1). Think for example of split-site clusters.
- Besides, architectures can be different depending on scope. For example, HA issues (and many other things!) are different for a computing vs. an analysis farm.
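To get a feel for what “multiple physical machines” can buy, here is a rough sketch. It assumes independent failures and perfect, instantaneous fail-over, both of which are optimistic and neither of which is claimed on the slide; the function names are illustrative.

```python
def redundant_availability(single: float, n: int) -> float:
    """Availability of a service backed by n redundant machines, where the
    service is up as long as at least one machine is up.
    Assumes independent failures and perfect fail-over."""
    if not 0.0 <= single <= 1.0 or n < 1:
        raise ValueError("need 0 <= single <= 1 and n >= 1")
    return 1.0 - (1.0 - single) ** n


def machines_needed(single: float, target: float) -> int:
    """Smallest n for which redundant_availability(single, n) >= target."""
    if not (0.0 < single < 1.0 and 0.0 < target < 1.0):
        raise ValueError("need 0 < single < 1 and 0 < target < 1")
    n = 1
    while redundant_availability(single, n) < target:
        n += 1
    return n


if __name__ == "__main__":
    # Two machines at 99% each, under the assumptions above.
    print(f"{redundant_availability(0.99, 2):.4f}")
    print(machines_needed(0.99, 0.999))
```

Under these assumptions two 99% machines already exceed a 99.9% target. In practice shared components (power, network, storage) and imperfect fail-over reduce the gain, which is why testing the actual set-ups matters.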
Points for consideration - 2
- As INFN, we have just formed a group trying to work on a continuum of activities covering lower layers like networking (esp. LAN), storage and farming, and their interaction with SC, grid middleware and the rest.
  - For farming, consider e.g. high availability (of grid and non-grid services) applied to systems, but also reliability applied e.g. to massive installations (Quattor and Quattor-YAIM integration come in here, for example).
  - This group will also consider best practices in deploying middleware services (with the aim of integrating these practices with lower-layer HA solutions).
- Some pointers:
  - http://www.linux-ha.org/
  - http://lcic.org/ha.html
  - Worldwide LCG Service Coordination Meeting (http://agenda.cern.ch/fullAgenda.php?ida=a056628), cf. in particular Maarten Litmaath’s talk