Tier1 View: Resilience
Status, plans, and best practice
Martin Bly, RAL Tier1 Fabric Manager
GridPP22 – UCL - 2 April 2009
2 April 2009 - Resilience at the Tier1 - Martin Bly - GridPP22
Overview
How to make critical services at the T1 bullet proof
Resilience - Why?
Services and system components fail – it happens!
You don't want your services to be brought down by a failure
–MoU commitments are quite taxing to meet even without failures
–You can't hide from auntie SAM…
Better to deal with problems without pressure to restart services
–Fewer mistakes
Even better to avoid the problems in the first place
So: design the service implementation so that it *will* survive failures of whatever nature
Approaches to resilience
Hardware
–Use hardware that can survive component failure
Software
–Use software that can survive problems on hardware
–Use software designed for distributed operation
–Use software that has inbuilt resilience
Location
–Locate hosts such that a service can survive failure at host location
Hardware
Resilient hardware will help your services survive common failure modes and keep them operating until you can replace the failed component and make the service resilient again
Storage
Most common is RAID as used in storage arrays
Single (RAID5) or double (RAID6) disk failures do not take out the storage array
–Use of hot spares allows automatic rebuilds to maintain the resilience
RAID1 for system disks in servers – in the event of a single disk failure the server carries on
–RAID1 with a hot spare can be used for super-critical systems – automatic rebuild maintains the resilience
Works with software RAID as well as hardware RAID controllers
–If you set the BIOS up for hot-swap capability…
Failed disks can be replaced without taking the service down
–If you have hot-swap caddies
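To make the trade-off between the RAID levels on this slide concrete, the following minimal sketch tabulates usable capacity and the number of disk failures each level is guaranteed to survive. The function name `raid_summary` and the example array sizes are illustrative assumptions, not Tier1 configuration, and hot spares are deliberately left out of the arithmetic.

```python
def raid_summary(level, disks, disk_tb):
    """Return (usable_tb, guaranteed_failures_survived) for a simple array.

    Only the levels named in the talk are covered:
      RAID1: every disk holds a full copy (a mirror set)
      RAID5: one disk's worth of parity
      RAID6: two disks' worth of parity
    Hot spares are not counted; they restore resilience after a failure
    but do not add to the failures survived at any one moment.
    """
    if level == 1:
        if disks < 2:
            raise ValueError("RAID1 needs at least 2 disks")
        return disk_tb, disks - 1
    if level == 5:
        if disks < 3:
            raise ValueError("RAID5 needs at least 3 disks")
        return (disks - 1) * disk_tb, 1
    if level == 6:
        if disks < 4:
            raise ValueError("RAID6 needs at least 4 disks")
        return (disks - 2) * disk_tb, 2
    raise ValueError("unsupported RAID level: %r" % (level,))


# Hypothetical 8 x 1 TB array: RAID6 gives up one more disk of capacity
# than RAID5 in exchange for surviving a second failure.
for level in (5, 6):
    usable, survives = raid_summary(level, disks=8, disk_tb=1.0)
    print("RAID%d: %.1f TB usable, survives %d failure(s)" % (level, usable, survives))
```

The second tolerated failure is what protects a RAID6 array if another disk dies while a rebuild is in progress.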
Memory
ECC helps systems to detect and correct single-bit errors in the RAM, and to detect (though typically not correct) multi-bit errors – can help prevent data corruption
If the ECC correction rate begins to rise, the RAM may be failing, or need reseating, or be subject to interference, or be slipping out of tolerance.
Higher-end kit can stop using bad RAM – if not interrupting the service is considered worth the cost (high)
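One way to spot a rising correction rate is to compare the most recent samples of an error counter with the samples just before them. The sketch below assumes a cumulative corrected-error counter read at regular intervals; the function name, window size and factor are illustrative assumptions rather than anything prescribed in the talk.

```python
def error_rate_rising(counts, window=3, factor=2.0):
    """Return True if the recent error rate clearly exceeds the earlier one.

    counts: cumulative corrected-error counter, sampled at regular intervals.
    window: number of intervals in each of the two periods being compared.
    factor: how many times larger the recent rate must be to raise a flag.
    """
    # Errors seen in each interval between consecutive samples.
    deltas = [later - earlier for earlier, later in zip(counts, counts[1:])]
    if len(deltas) < 2 * window:
        return False  # not enough history to compare two periods
    recent = sum(deltas[-window:]) / window
    baseline = sum(deltas[-2 * window:-window]) / window
    # Any errors after a clean baseline also count as a rise.
    return recent > 0 and recent > factor * baseline
```

The same comparison applies to other counters that tend to climb before a component fails, such as error counts on a network link.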
Power Supply
Redundant PSU configurations
–N+1 redundancy: at least one more PSU in a server than is needed to make it work. If one fails, the server keeps running and the failed unit can be replaced without taking the server down
Multiple power feeds
–For an N+1 redundant PSU configuration, one can feed each PSU from a different PDU. If one PDU fails (and they do), or the fuse blows (and they definitely do!) the other PSU is still powered and the service can continue
UPS for systems where loss of power is a problem
–Bridge blips, brownouts and short interruptions, smoothed feed, harmonic reduction
–Permanent or time-limited – how much power must it provide and how long must it continue?
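The N+1 and multiple-feed arguments above reduce to simple arithmetic, sketched below under the assumption that all PSUs in a server are identical and that any surviving subset can carry the load. The function names, wattages and PDU labels are hypothetical.

```python
def survives_psu_failures(psu_count, psu_watts, load_watts, failures=1):
    """True if the remaining PSUs can still carry the load after `failures` fail."""
    remaining = psu_count - failures
    return remaining > 0 and remaining * psu_watts >= load_watts


def survives_any_pdu_loss(psu_to_pdu, psu_watts, load_watts):
    """True if losing any single PDU still leaves enough powered PSUs.

    psu_to_pdu: one entry per PSU, naming the PDU that feeds it.
    """
    for failed_pdu in set(psu_to_pdu):
        powered = sum(1 for pdu in psu_to_pdu if pdu != failed_pdu)
        if powered * psu_watts < load_watts:
            return False
    return True


# Two 750 W PSUs on a 600 W server: N+1 holds, but only separate feeds
# protect against a PDU failure or blown fuse.
print(survives_psu_failures(2, 750, 600))             # one PSU can carry the load
print(survives_any_pdu_loss(["A", "B"], 750, 600))    # each PSU on its own PDU
print(survives_any_pdu_loss(["A", "A"], 750, 600))    # both PSUs on the same PDU
```

The last case shows why N+1 PSUs alone are not enough: with both units on one PDU, a single fuse still takes the server down.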
Interconnects
Networking
–Two or more bonded network ports can provide resilience if the cables are routed to different switches or via different routes – increases performance too
–Bonded links in fibre installations can provide resilience against transceiver failure or fibre cuts
–Stacked switches with bi-directional stacking capability: if one cable fails, data goes the other way; if one unit fails, data can still reach the unit on the other side
–Fail-over links in site infrastructure and national / international long-haul links - fibre cuts happen with depressing regularity
Fibre-channel
–Multi-port FC HBAs and array controllers can be set up to provide two independent routes from servers to storage devices, with multi-path and failover support keeping the data flowing
Software
Software services should be designed to be resilient and to be provided by multiple hosts and at distributed locations.
"This is the Grid – it's distributed. If the services aren't distributable, rewrite them." – anon
Monitoring
If it can be monitored…
Look for and restart failed service daemons
Look for signatures of impending problems to predict component failure
Idle disks hide their faults
–Regular low-level verification runs to push sick drives over the edge
–Replace early in the failure cycle, so the drive doesn't fail during a rebuild…
Increased error rates on network links from failing line cards, transceivers or cable/fibre degradation
–If you have redundant links, you can replace the faulty one and keep the service going
Call-out system for problems that impact services
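The "restart failed daemons, call out for real problems" pattern can be sketched as a single watchdog pass. This is a minimal illustration, not the Tier1's tooling: the function name, the restart limit and the injected `is_running` / `restart` callables are assumptions, so the logic can be tried without touching real services.

```python
def watchdog_pass(services, is_running, restart, restart_counts, max_restarts=2):
    """Run one monitoring pass and return the services needing a call-out.

    services:       names of the daemons to check
    is_running:     callable(name) -> bool, supplied by the caller
    restart:        callable(name) that attempts a restart
    restart_counts: dict of consecutive failed checks, kept between passes
    max_restarts:   automatic restarts allowed before escalating to a human
    """
    call_out = []
    for name in services:
        if is_running(name):
            restart_counts[name] = 0  # healthy again, so forget past failures
            continue
        failures = restart_counts.get(name, 0)
        if failures >= max_restarts:
            call_out.append(name)  # restarting is not helping, so escalate
        else:
            restart(name)
            restart_counts[name] = failures + 1
    return call_out
```

Capping the number of automatic restarts keeps a restarter from masking a persistent fault: after the limit, the problem goes to the call-out system instead.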
Multiple hosts
Services can be provided by more than one host if the application supports it
–Share the load and increase performance
–If one host fails, the rest provide the service
–Use DNS round-robin to randomly select a host using a service alias with short TTL
–Take broken host/s out of active DNS
–Avoid single-points-of-failure
Can locate multiple hosts…
–… in different rooms
–… in different buildings
–… at different sites
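From the client side, a round-robin alias is simply a name that resolves to several addresses. The sketch below resolves such an alias and picks one address at random. The client-side `is_healthy` filter is an added assumption for illustration (the slide's approach is to remove broken hosts from DNS), and the function names are hypothetical.

```python
import random
import socket


def resolve_alias(alias):
    """Return every IPv4 address currently published for a service alias."""
    _name, _aliases, addresses = socket.gethostbyname_ex(alias)
    return addresses


def pick_host(addresses, is_healthy=lambda address: True, rng=random):
    """Choose one address at random from those that pass the health check."""
    candidates = [address for address in addresses if is_healthy(address)]
    if not candidates:
        raise RuntimeError("no healthy hosts behind the alias")
    return rng.choice(candidates)
```

A caller would combine the two as `pick_host(resolve_alias(alias))`. The short TTL matters because clients cache the answer: the shorter the TTL, the sooner a host removed from DNS stops receiving traffic.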
Tier1
Resilience steps at the Tier1…
Hardware at the Tier1
Most of the hardware techniques are used at the Tier1
Bulk storage uses RAID1/5/6, ECC RAM, N+1 PSUs, multiple power feeds, regular verifies of arrays (scrubbing)
Services nodes use RAID1, ECC RAM, some with N+1 PSUs
Databases: RAID1/10/5/6, ECC RAM, N+1 PSU, dual FC links, multiple power feeds
Networking: redundant off-site link to SJ5
–working on redundancy (failover/backup) for OPN link to CERN
UPS (in the new building)
–24/7 UPS for critical services / database racks
–Short-lived UPS for storage systems to allow clean shutdown
CASTOR Service
[Diagram labels: FC ARRAY (Neptune); ORACLE RAC (Pluto); srmns; LSF licence; Stager; LSF Master; Shared Castor Core; rmmaster; Single CASTOR Instance, e.g. CMS]
In general (all for CMS): mirror disks on stager/LSF master and on rmmaster
3D Services + LHCB LFC
[Diagram labels: FC ARRAY; 3D ORACLE RAC; 3D; lhcb lfc]
lhcb lfc: readonly replica, single host, fast kickstart, failover to CERN
FTS and General LFC
[Diagram labels: 5 Web Front Ends in DNS RR; 1 channel / VO agent host (RAID 1), hot spare soon; RAID 10 SAN; FTS Oracle RAC; LFC DNS RR]
Oracle currently 2 independent servers. Work active to deploy 3 server RAC
LFC currently single host. Second host planned for mid September – work in progress, running late
CE and Fabric
[Diagram labels: ce; torque/maui; CE; NIS; dn to account mapping; mirrored disks; /home file system (hardware RAID)]
3 doublets, one for each of ATLAS, CMS and LHCB; each CE has mirror disks
CE/SRM instances
WMS and LB
Now:
–lcgwms01 – LHC
–lcgwms02 – everyone
–lcgwms03 – non-LHC
Developments:
–lcgwms01 – LHC
–lcgwms02 – LHC
–lcgwms03 – non-LHC
All WMS use both LB systems
WMS triplet, LB doublet
Other Tier1 Services
UK-BDII:
–DNS R-R triplet of simple hosts
–Copes with load, provides resilience
–Easy kickstart for rapid instancing
RGMA registry:
–single host, RAID disks, easy kickstart
MONbox:
–single host, RAID disks, easy kickstart
VO boxes:
–several x single host, easy kickstart
Site BDII:
–DNS R-R doublet of simple hosts (same as UK-BDII)
PROXY:
–Doublet of simple hosts, easy kickstart
GOCDB:
–internal failover with alternative database (Oracle), and external failover to another web front-end in Germany and mirrored database in Italy. Latter still being tested.
APEL:
–has a warm standby and is buying new hardware.
Tier1 Monitoring
Catch problems early with nagios where possible (or at least catch problems before anyone notices)
–load alarms
–File systems near to full
–certificates close to expiry
–Failed drives
Some ganglia/cacti capacity planning reviews (but ad hoc) looking for long term trends.
Service Operations team making a difference.
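A check such as "file systems near to full" maps naturally onto the Nagios plugin convention of exit status 0 for OK, 1 for WARNING and 2 for CRITICAL. The sketch below is a minimal illustration of that convention, not one of the Tier1's plugins; the function names and the 85% / 95% thresholds are assumptions.

```python
import shutil

# Exit statuses defined by the Nagios plugin convention.
OK, WARNING, CRITICAL = 0, 1, 2
LABELS = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL"}


def classify_usage(percent_used, warn=85.0, crit=95.0):
    """Map a percentage-used figure onto a Nagios status code."""
    if percent_used >= crit:
        return CRITICAL
    if percent_used >= warn:
        return WARNING
    return OK


def check_filesystem(path, warn=85.0, crit=95.0):
    """Return (status, one-line message) for the file system holding `path`."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    percent_used = 100.0 * usage.used / usage.total
    status = classify_usage(percent_used, warn, crit)
    message = "DISK %s - %s is %.1f%% used" % (LABELS[status], path, percent_used)
    return status, message
```

A wrapper script would print the message and exit with the status, which is how Nagios reads the result. The warning threshold is what turns a full file system from an outage into a routine daytime task.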
Tier1 Backups
Critical hosts all backed up to tape store
Tape details written to central loggers
–So we can find which tape numbers to restore if the host is toast
Speedy restores to toasted systems
Verify and exercise backups…
Tier1 On-call
A good driver for service improvement.
Continuous improvement process with weekly review of night-time incidents
Review is driver for:
–Auto-restarters (team still not 100% keen)
–Improved monitoring (more plugins)
–Better response documentation.
–Changes to processes
Also runs daytime
Gradually routine operations will become more and more the responsibility of the service intervention team.
CASTOR team carry out weekly detailed review of all incidents (looking to see how to avoid them again). Will generalise to whole Tier-1
Tier1 People
Several teams with some degree of expertise sharing within each team
–Fabric, Grid/Support, CASTOR, Databases
–This has been pretty successful and we are reasonably confident we can handle tractable problems without the specialist present
As far as is reasonably fair/practicable we seek to ensure leave is scheduled to ensure expert cover – not always possible
On-call also spreading expertise in critical services (e.g., even the Facility Manager knows how to restart the CASTOR request handler!)
Able to call upon RAL Tier-2 staff (or other GRIDPP/elsewhere) in case of complete lack of expertise. Have done this occasionally. Should probably be prepared to do it more often.
Off Site services
A few critical services are candidates for off-site replication; others, such as BDIIs and LHCB LFC, are already federated
Possible candidates: FTS and general LFC (possibly RGMA)
–Both essential to GRIDPP
–LFC based on Oracle Streaming technology already deployed and tested elsewhere (3D)
–RAL could operate these remotely, but existing configuration very expensive (£40K hardware) plus Oracle licences. Failover to new DNS names would also need to be site resilient (not trivial). May be worth exploring with nearby sites or Daresbury
Questions
To Andrew, please…!