Hunting High (Availability) and Low (Level)
(or, who wants low-availability systems?)
Davide Salomoni, INFN-CNAF, Bologna, Jan 12, 2006

Farm (High-) Availability
- To state the obvious: middleware works if and only if the lower layers ("underware"?) work. Otherwise, the king is naked.
- I will mostly consider farming here (there are other subsystems, networking for example) and, within farming, take a fairly low-level approach: no details about specific services like AFS, NFS, DNS, etc. High performance is another story.
- Having "working lower layers" should not be taken for granted, especially when the set-up is large or complex enough: it requires complex, composite solutions, money, testing and, above all, people and know-how. And, as we all know, "Complexity is the enemy of reliability."™
- This applies to the Tier-1, but (at least) to the future Tier-2s as well. Please do not underestimate the importance of these issues.

Strawman Example
- Clustering (in this context): a group of computers which trust each other to provide a service even when system components fail. When one machine goes down, others take over its work: IP address takeover, name takeover, service takeover, etc.
- Without this, we will have a very hard time keeping reasonable SLAs, considering also INFN- (or Italian-) specific constraints, e.g. off-hours coverage capabilities/possibilities.
- Maximum downtime for typical SLAs: 99% → ~3.6 days/year; 99.9% → ~8.8 hours/year.
- How does this affect a site with e.g. 80 servers? Consider drives only (MTBF = 1e6 hours), 1 drive/server: 80 drives x 8760 hours in a year = 700,800 drive-hours/year; divided by the 1e6-hour MTBF, that is ~0.7 expected drive failures per year. With a (fairly optimistic) time to repair of 6 hours, this already means ~4 hours of downtime per year for some service, roughly half of a 99.9% budget, and this is for drives only. For the total you should combine all subcomponents, 1/MTBF(total) = 1/MTBF(subcomponent1) + 1/MTBF(subcomponent2) + ..., which makes the per-server failure rate, and hence the aggregate downtime, considerably worse. (See the sketch below.)
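A back-of-the-envelope sketch of the numbers above. The subcomponent MTBF values in the last line are hypothetical, purely to illustrate the series-combination rule; only the 80-server/1e6-hour drive figures come from the slide.

```python
HOURS_PER_YEAR = 8760

def downtime_budget_hours(availability):
    """Maximum downtime per year allowed by an availability target."""
    return (1.0 - availability) * HOURS_PER_YEAR

def expected_failures(n_units, mtbf_hours, period_hours=HOURS_PER_YEAR):
    """Expected number of failures for n identical units over a period."""
    return n_units * period_hours / mtbf_hours

def combined_mtbf(mtbf_list):
    """Series combination: 1/MTBF(total) = sum of 1/MTBF(subcomponent)."""
    return 1.0 / sum(1.0 / m for m in mtbf_list)

print(downtime_budget_hours(0.99) / 24)   # ~3.65 days/year at 99%
print(downtime_budget_hours(0.999))       # ~8.76 hours/year at 99.9%

# 80 servers, 1 drive each, MTBF = 1e6 hours:
failures = expected_failures(80, 1e6)     # ~0.7 drive failures/year
print(failures, failures * 6)             # ~4.2 hours of repair time/year

# Per-server MTBF is much lower once all subcomponents are included
# (hypothetical figures for drive, PSU, fan, mainboard):
print(combined_mtbf([1e6, 3e5, 2e5, 5e5]))
```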

Some Applications to Grid MW
How does system reliability affect grid middleware? Comments taken from Markus' presentation at the WLCG meeting, Dec 20, 2005:
- BDII: an RB is statically configured to use a certain BDII (could use "smart" aliases here; see the sketch after this list).
- RB: cannot sit behind a load-balanced alias yet. If an RB is rebooted, jobs in steady state will not be affected; jobs in transit may be lost.
- CE: cannot sit behind a load-balanced alias yet. If a CE is rebooted, jobs in steady state will not be affected; jobs in transit may be lost.
- MyProxy: jobs currently can only have a single PX server. Downtime can cause many jobs to fail.
- FTS: currently depends critically on the PX servers specified in transfer jobs. A down FTS may stop any new file transfers (no intrinsic redundancy implemented, at least as of FTS 1.4).
- LFC: downtime may cause many jobs to fail. An LFC instance may have read-only replicas.
- SE: too many parameters to consider here (CASTOR/dCache/DPM/etc.); some experiments fail over to other SEs on writes, while fail-over on reads is only possible for replicated data sets (might cause chaotic data transfers, bad usage of network and CPU).
- R-GMA: clients can only handle a single instance.
- VOMS: critical for SC4; there must be at least a hot spare.
- VOBOX: downtimes could cause significant amounts of job failures.
- g-PBox: to become critical for implementing VO and site policies for job management.
- DGAS: to become vital for user-level accounting.
- GridICE: only a single instance for now.
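A minimal illustration of what a "smart" alias would do: resolve an alias with several A records and hand back an instance whose service port actually answers. The host name is made up; the port is the standard BDII LDAP port. In a real deployment this logic would live in the DNS layer (a health-checked, load-balanced alias), not in every client.

```python
import socket

def pick_live_instance(alias, port, timeout=2.0):
    """Return the first IP behind `alias` that accepts a TCP connection on `port`."""
    _, _, addresses = socket.gethostbyname_ex(alias)
    for ip in addresses:
        try:
            with socket.create_connection((ip, port), timeout=timeout):
                return ip
        except OSError:
            continue
    raise RuntimeError(f"no live instance behind {alias}:{port}")

# Hypothetical BDII alias, standard BDII port 2170:
# print(pick_live_instance("bdii.example.org", 2170))
```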

Points for consideration - 1
- Relying only on application-level (specifically, grid middleware) HA mechanisms is currently unrealistic. Besides, a correct implementation will very likely take quite some time, or won't be done at all (for lack of resources, or because of architectural constraints). Hence, we should [also] invest in lower-level HA efforts, keeping in mind that the more redundancy the middleware has in itself, the better.
- This item is high on the list of activities of the INFN Tier-1. We are evaluating and testing several alternatives and approaches (a monitoring sketch follows below):
  - Application monitoring.
  - Multiple virtual machines on a single physical machine (to protect e.g. against OS crashes, to provide rolling upgrades of OS and applications, etc.).
  - Multiple physical machines.
- But more tests could be done, and more effectively, with the contribution of other entities (example: to-be T2 sites, SA1). Think for example of split-site clusters.
- Besides, architectures can differ depending on scope: HA issues (and many other things!) are different for a computing farm vs. an analysis farm.
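A minimal sketch of the "application monitoring" idea above: probe a service and restart it if the probe fails repeatedly. The service name, probe target and restart command are placeholders; a production setup would use a proper monitoring/HA framework (e.g. the Linux-HA tools linked in the pointers below) rather than an ad-hoc script.

```python
import socket
import subprocess
import time

def probe(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def watchdog(host, port, restart_cmd, max_failures=3, interval=30):
    """Restart the service after max_failures consecutive failed probes."""
    failures = 0
    while True:
        if probe(host, port):
            failures = 0
        else:
            failures += 1
            if failures >= max_failures:
                subprocess.call(restart_cmd)  # placeholder restart command
                failures = 0
        time.sleep(interval)

# watchdog("localhost", 8443, ["service", "myservice", "restart"])
```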

Points for consideration - 2
- As INFN, we have just formed a group trying to work on a continuum of activities covering lower layers like networking (esp. LAN), storage and farming, and their interaction with the SCs, grid middleware and the rest.
- For farming, consider e.g. high availability (of grid and non-grid services) applied to systems, but also reliability applied e.g. to massive installations (Quattor and Quattor-YAIM integration comes in here, for example).
- This group will also consider best practices in deploying middleware services, with the aim of integrating these practices with lower-layer HA solutions.
- Some pointers:
  - http://www.linux-ha.org/
  - http://lcic.org/ha.html
  - Worldwide LCG Service Coordination Meeting (http://agenda.cern.ch/fullAgenda.php?ida=a056628), cf. in particular Maarten Litmaath's talk