Tier1 View: Resilience Status, plans, and best practice Martin Bly RAL Tier1 Fabric Manager GridPP22 – UCL - 2 April 2009.

2 April 2009 - Resilience at the Tier1 - Martin Bly - GridPP22

Overview
How to make critical services at the T1 bulletproof

Resilience - Why?
Services and system components fail: it happens!
You don't want your services to be brought down by a failure
  – MoU commitments are quite taxing to meet even without failures
  – You can't hide from auntie SAM…
Better to deal with problems without pressure to restart services
  – Fewer mistakes
Even better to avoid the problems in the first place
So: design the service implementation so that it *will* survive failures of whatever nature

Approaches to resilience
Hardware
  – Use hardware that can survive component failure
Software
  – Use software that can survive problems on hardware
  – Use software designed for distributed operation
  – Use software that has inbuilt resilience
Location
  – Locate hosts such that a service can survive failure at host location

Hardware
Resilient hardware will help your services survive common failure modes and keep them operating until you can replace the failed component and make the service resilient again

Storage
Most common is RAID as used in storage arrays
Single (RAID5) or double (RAID6) disk failures do not take out the storage array
  – Use of hot spares allows automatic rebuilds to maintain the resilience
RAID1 for system disks in servers: in the event of a single disk failure the server carries on
  – RAID1 with a hot spare can be used for super-critical systems: automatic rebuild maintains the resilience
Works with software RAID as well as hardware RAID controllers
  – If you set the BIOS up for hot-swap capability…
Failed disks can be replaced without taking the service down
  – If you have hot-swap caddies
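The trade-off between the RAID levels on this slide can be sketched in a few lines of Python. This is only an illustration, not anything taken from the Tier1 configuration: the names `RaidSummary` and `raid_summary` are hypothetical, capacity is counted in whole disks, and hot spares are left out because they restore resilience after a failure rather than add to it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RaidSummary:
    usable_disks: int        # capacity available for data, in whole disks
    failures_tolerated: int  # disk failures survivable without data loss


def raid_summary(level: int, n_disks: int) -> RaidSummary:
    """Usable capacity and guaranteed fault tolerance for a single array.

    Hypothetical helper for illustration; hot spares are not counted.
    """
    if level == 1:
        if n_disks < 2:
            raise ValueError("RAID1 needs at least 2 disks")
        # Every disk holds a full copy, so all but one may fail.
        return RaidSummary(usable_disks=1, failures_tolerated=n_disks - 1)
    if level == 5:
        if n_disks < 3:
            raise ValueError("RAID5 needs at least 3 disks")
        # One disk's worth of parity: survives a single disk failure.
        return RaidSummary(usable_disks=n_disks - 1, failures_tolerated=1)
    if level == 6:
        if n_disks < 4:
            raise ValueError("RAID6 needs at least 4 disks")
        # Two disks' worth of parity: survives a double disk failure.
        return RaidSummary(usable_disks=n_disks - 2, failures_tolerated=2)
    raise ValueError(f"RAID level {level} is not covered by this sketch")
```

For an 8-disk array, for example, RAID5 gives 7 disks of capacity and tolerates one failure, while RAID6 gives 6 disks of capacity and tolerates two.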

Memory
ECC helps systems to detect and correct single-bit and multi-bit errors in the RAM, which can help prevent data corruption
If the ECC correction rate begins to rise, the RAM may be failing, or need reseating, or be subject to interference, or be slipping out of tolerance
Higher-end kit can stop using bad RAM, if not interrupting the service is considered worth the (high) cost
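One way to act on a rising correction rate is to sample the cumulative corrected-error counter periodically and compare the latest rate with the earlier baseline. The sketch below assumes such a counter is available from the platform; the function names, the factor of 2 and the floor of one error per hour are all illustrative assumptions, not values from the talk.

```python
def corrected_error_rates(samples):
    """Errors per hour between consecutive samples.

    `samples` is a time-ordered list of (timestamp_seconds, cumulative_count).
    """
    rates = []
    for (t0, c0), (t1, c1) in zip(samples, samples[1:]):
        hours = (t1 - t0) / 3600.0
        if hours <= 0:
            raise ValueError("samples must be strictly increasing in time")
        # A counter that goes backwards (e.g. after a reboot) is clamped to 0.
        rates.append(max(c1 - c0, 0) / hours)
    return rates


def rate_is_rising(samples, factor=2.0, floor=1.0):
    """True if the latest rate exceeds `factor` times the earlier average.

    `factor` and `floor` (errors/hour) are assumed thresholds for this sketch.
    """
    rates = corrected_error_rates(samples)
    if len(rates) < 2:
        return False  # not enough history to call it a trend
    baseline = sum(rates[:-1]) / len(rates[:-1])
    latest = rates[-1]
    return latest >= floor and latest > factor * baseline
```

A host flagged this way is a candidate for reseating or replacing the DIMM at the next convenient intervention, before uncorrectable errors appear.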

Power Supply
Redundant PSU configurations
  – N+1 redundancy: at least one more PSU in a server than is needed to make it work. If one fails, the server keeps running and the failed unit can be replaced without taking the server down
Multiple power feeds
  – For an N+1 redundant PSU configuration, one can feed each PSU from a different PDU. If one PDU fails (and they do), or the fuse blows (and they definitely do!), the other PSU is still powered and the service can continue
UPS for systems where loss of power is a problem
  – Bridge blips, brownouts and short interruptions, smoothed feed, harmonic reduction
  – Permanent or time-limited: how much power must it provide and how long must it continue?
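Whether a given cabling actually delivers the multiple-feed benefit can be checked mechanically. The following is a minimal sketch under the assumption that an inventory records which PDU feeds each PSU; the function name and the PSU/PDU labels are hypothetical.

```python
def survives_single_pdu_failure(psu_to_pdu: dict, psus_needed: int) -> bool:
    """True if losing any one PDU still leaves `psus_needed` powered PSUs.

    `psu_to_pdu` maps a PSU label to the PDU feeding it (hypothetical labels).
    """
    if len(psu_to_pdu) < psus_needed:
        return False  # not enough PSUs even with every feed up
    for failed_pdu in set(psu_to_pdu.values()):
        still_powered = sum(
            1 for pdu in psu_to_pdu.values() if pdu != failed_pdu
        )
        if still_powered < psus_needed:
            return False
    return True
```

A server that needs one PSU and has two, each on its own PDU, passes the check; the same server with both PSUs on one PDU does not, even though it is still N+1 against PSU failure.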

Interconnects
Networking
  – Two or more bonded network ports can provide resilience if the cables are routed to different switches or via different routes; this increases performance too
  – Bonded links in fibre installations can provide resilience against transceiver failure or fibre cuts
  – Stacked switches with bi-directional stacking capability: if one cable fails, data goes the other way; if one unit fails, data can still reach the unit on the other side
  – Fail-over links in site infrastructure and national/international long-haul links: fibre cuts happen with depressing regularity
Fibre-channel
  – Multi-port FC HBAs and array controllers can be set up to provide two independent routes from servers to storage devices, with multi-path and failover support keeping the data flowing

Software
Software services should be designed to be resilient and to be provided by multiple hosts and at distributed locations.
"This is the Grid – it's distributed. If the services aren't distributable, rewrite them." – anon

Monitoring
If it can be monitored…
Look for and restart failed service daemons
Look for signatures of impending problems to predict component failure
Idle disks hide their faults
  – Regular low-level verification runs to push sick drives over the edge
  – Replace early in the failure cycle, so a drive doesn't fail during a rebuild…
Increased error rates on network links from failing line cards, transceivers or cable/fibre degradation
  – If you have redundant links, you can replace the faulty one and keep the service going
Call-out system for problems that impact services
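The "look for and restart failed service daemons" step, with escalation to a call-out when the restart does not help, might look roughly like this. It is a sketch only: `restart_failed` is a hypothetical name, and the `is_running` and `restart` callables stand in for whatever site-specific probe and init-script call are actually used.

```python
def restart_failed(services, is_running, restart, max_attempts=1):
    """Check each service once and try to restart the ones that are down.

    Returns a dict mapping service name to "ok", "restarted" or "failed".
    `is_running(name)` and `restart(name)` are supplied by the caller.
    """
    outcome = {}
    for name in services:
        if is_running(name):
            outcome[name] = "ok"
            continue
        outcome[name] = "failed"
        for _ in range(max_attempts):
            restart(name)
            if is_running(name):
                outcome[name] = "restarted"
                break
    return outcome
```

Services left in the "failed" state are the ones that would be handed to the call-out system; the "restarted" ones are worth logging so that repeated restarts show up in later incident reviews.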

Multiple hosts
Services can be provided by more than one host if the application supports it
  – Share the load and increase performance
  – If one host fails, the rest provide the service
  – Use DNS round-robin to randomly select a host using a service alias with short TTL
  – Take broken host/s out of active DNS
  – Avoid single-points-of-failure
Can locate multiple hosts…
  – … in different rooms
  – … in different buildings
  – … at different sites
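The DNS round-robin idea can be illustrated from the client side. In the sketch below, `resolve_alias` uses the standard `socket.getaddrinfo` call to list every address behind an alias, while `active_records` and `pick_host` model what happens when a broken host is withdrawn from the active set. All three function names are hypothetical, and the addresses in the usage note come from the reserved documentation range.

```python
import random
import socket


def resolve_alias(alias, port):
    """Return the distinct addresses currently published for a service alias."""
    infos = socket.getaddrinfo(alias, port, proto=socket.IPPROTO_TCP)
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0]
    # is the address string for both IPv4 and IPv6.
    return sorted({info[4][0] for info in infos})


def active_records(all_hosts, broken):
    """Model of the active DNS set: every host except those taken out."""
    return [host for host in all_hosts if host not in broken]


def pick_host(records, rng=random):
    """Choose one host at random, as a round-robin alias effectively does."""
    if not records:
        raise LookupError("no active hosts behind the alias")
    return rng.choice(records)
```

With three hosts `192.0.2.1` to `192.0.2.3` and `192.0.2.2` marked broken, `pick_host(active_records(...))` only ever returns one of the two healthy addresses. The short TTL matters because clients keep using a cached record until it expires, so a long TTL delays the moment a withdrawn host stops receiving traffic.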

Tier1
Resilience steps at the Tier1…

Hardware at the Tier1
Most of the hardware techniques are used at the Tier1
Bulk storage uses RAID1/5/6, ECC RAM, N+1 PSUs, multiple power feeds, regular verifies of arrays (scrubbing)
Services nodes use RAID1, ECC RAM, some with N+1 PSUs
Databases: RAID1/10/5/6, ECC RAM, N+1 PSU, dual FC links, multiple power feeds
Networking: redundant off-site link to SJ5
  – Working on redundancy (failover/backup) for OPN link to CERN
UPS (in the new building)
  – 24/7 UPS for critical services / database racks
  – Short-lived UPS for storage systems to allow clean shutdown

CASTOR Service
[Diagram. Labels: FC ARRAY (Neptune); ORACLE RAC (Pluto); srmns; LSF licence; Stager; LSF Master; rmmaster; Shared Castor Core; Single CASTOR Instance, e.g. CMS]
In general (all for CMS): mirror disks on stager/LSF master and rmmaster

3D Services + LHCB LFC
[Diagram. Labels: FC ARRAY; 3D ORACLE RAC; 3D; LHCB LFC read-only replica, single host, fast kickstart, failover to CERN]

FTS and General LFC
[Diagram. Labels: 5 Web Front Ends in DNS RR; 1 channel / VO agent host (RAID 1), hot spare soon; RAID 10 SAN; FTS Oracle RAC; LFC DNS RR]
Oracle currently 2 independent servers. Work active to deploy 3 server RAC
LFC currently single host. Second host planned for mid September: work in progress, running late

CE and Fabric
[Diagram. Labels as extracted: CE; torque/maui; 3 doublets, one for each of ATLAS, CMS and LHCB; each CE has mirror disks; CE; NIS; DN to account mapping; mirrored disks; /home file system (hardware RAID)]

CE/SRM instances

WMS and LB
Now:
  – lcgwms01 – LHC
  – lcgwms02 – everyone
  – lcgwms03 – non-LHC
Developments:
  – lcgwms01 – LHC
  – lcgwms02 – LHC
  – lcgwms03 – non-LHC
All WMS use both LB systems
WMS triplet, LB doublet
[Diagram. Labels: LB; WMS]

Other Tier1 Services
UK-BDII:
  – DNS R-R triplet of simple hosts
  – Copes with load, provides resilience
  – Easy kickstart for rapid instancing
RGMA registry:
  – Single host, RAID disks, easy kickstart
MONbox:
  – Single host, RAID disks, easy kickstart
VO boxes:
  – Several x single host, easy kickstart
Site BDII:
  – DNS R-R doublet of simple hosts (same as UK-BDII)
PROXY:
  – Doublet of simple hosts, easy kickstart
GOCDB:
  – Internal failover with alternative database (Oracle), and external failover to another web front-end in Germany and mirrored database in Italy. Latter still being tested.
APEL:
  – Has a warm standby and is buying new hardware.

Tier1 Monitoring
Catch problems early with Nagios where possible (or at least catch problems before anyone notices)
  – Load alarms
  – File systems near to full
  – Certificates close to expiry
  – Failed drives
Some Ganglia/Cacti capacity planning reviews (but ad hoc) looking for long term trends.
Service Operations team making a difference.
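A "file systems near to full" check of the kind listed above could be written as a small Nagios-style plugin. The sketch below follows the usual plugin convention of exit status 0 for OK, 1 for WARNING and 2 for CRITICAL; the function names and the 85%/95% thresholds are assumptions for illustration, not the Tier1's actual settings.

```python
import shutil

# Conventional Nagios plugin exit statuses.
OK, WARNING, CRITICAL = 0, 1, 2
_LABELS = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL"}


def classify_usage(percent_used, warn=85.0, crit=95.0):
    """Map a usage percentage to a status (thresholds are assumed values)."""
    if percent_used >= crit:
        return CRITICAL
    if percent_used >= warn:
        return WARNING
    return OK


def check_filesystem(path, warn=85.0, crit=95.0):
    """Return (status, one-line message) for the file system holding `path`."""
    usage = shutil.disk_usage(path)
    percent_used = 100.0 * usage.used / usage.total
    status = classify_usage(percent_used, warn, crit)
    message = f"DISK {_LABELS[status]} - {path} is {percent_used:.1f}% full"
    return status, message
```

Wrapped in a script that prints the message and exits with the status, this gives the monitoring system an early warning at the lower threshold, leaving time to act before the file system fills and the service fails.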

Tier1 Backups
Critical hosts all backed up to tape store
Tape details written to central loggers
  – So we can find which tape numbers to restore if the host is toast
Speedy restores to toasted systems
Verify and exercise backups…

Tier1 On-call
A good driver for service improvement.
Continuous improvement process with weekly review of night-time incidents
Review is driver for:
  – Auto-restarters (team still not 100% keen)
  – Improved monitoring (more plugins)
  – Better response documentation
  – Changes to processes
Also runs daytime
Gradually routine operations will become more and more the responsibility of the service intervention team.
CASTOR team carry out weekly detailed review of all incidents (looking to see how to avoid them again). Will generalise to whole Tier-1

Tier1 People
Several teams with some degree of expertise sharing within each team
  – Fabric, Grid/Support, CASTOR, Databases
  – This has been pretty successful and we are reasonably confident we can handle tractable problems without the specialist present
As far as is reasonably fair/practicable we schedule leave so that expert cover is maintained (not always possible)
On-call is also spreading expertise in critical services (e.g., even the Facility Manager knows how to restart the CASTOR request handler!)
Able to call upon RAL Tier-2 staff (or other GRIDPP/elsewhere) in case of complete lack of expertise. Have done this occasionally. Should probably be prepared to do it more often.

Off Site services
A few critical services are candidates for off-site replication; others such as BDIIs, LHCB LFC are already federated
Possible candidates: FTS and general LFC (possibly RGMA)
  – Both essential to GRIDPP
  – LFC based on Oracle Streaming technology already deployed and tested elsewhere (3D)
  – RAL could operate these remotely, but existing configuration very expensive (£40K hardware) plus Oracle licences. Failover to new DNS names would also need to be site resilient (not trivial). May be worth exploring with nearby sites or Daresbury

Questions
To Andrew, please…!