RAL Tier1 Operations Andrew Sansum 18th April 2012
Staffing
Staff changes since GridPP27:
Leavers
–Kier Hawker (Database Team Leader)
New Starters
–Orlin Alexandrov (Grid Team)
–Dimitrios (Fabric Team)
–Vasilij Savin (Fabric Team)
New Roles
–Ian Collier - Grid Team Leader
–Richard Sinclair - Database Team Leader
–James Adams - storage system development
31 March 2014 Tier-1 Status
Some Changes
CVMFS in use for Atlas & LHCb:
–The Atlas (NFS) software server used to give significant problems.
–Some CVMFS teething issues but overall much better!
Virtualisation:
–Starting to bear fruit. Uses Hyper-V.
–Numerous test systems.
–Production systems that do not require particular resilience.
Quattor:
–Large gains already made.
Database Infrastructure
We are making significant changes to the Oracle database infrastructure.
Why?
–Old servers are out of maintenance
–Move from 32-bit to 64-bit databases
–Performance improvements
–Standby systems
–Simplified architecture
Database Disk Arrays - Future
[Diagram: Oracle RAC nodes, Fibre Channel SAN, disk arrays, power supplies (on UPS), Data Guard]
Castor
Changes since last GridPP Meeting:
–Castor upgrade to (March)
–Castor version (July) needed for the higher capacity "T10KC" tapes.
–Updated Garbage Collection Algorithm to LRU rather than the default, which is based on size. (July)
–(Moved logrotate to 1pm rather than 4am.)
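To illustrate the difference between the two garbage collection orderings mentioned above, here is a minimal sketch. It is not Castor's implementation: the `CachedFile` record, the `select_for_deletion` helper, and the choice of "largest first" for the size-based policy are all assumptions made purely for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CachedFile:
    """Hypothetical record for a file held in a disk cache."""
    name: str
    size_bytes: int
    last_access: float  # seconds since epoch


def select_for_deletion(files, bytes_to_free, policy="lru"):
    """Pick files to delete until at least bytes_to_free is reclaimed.

    policy="lru"  : least recently used files go first.
    policy="size" : largest files go first (an assumed size-based rule).
    """
    if policy == "lru":
        ordered = sorted(files, key=lambda f: f.last_access)
    elif policy == "size":
        ordered = sorted(files, key=lambda f: f.size_bytes, reverse=True)
    else:
        raise ValueError(f"unknown policy: {policy}")

    victims, freed = [], 0
    for f in ordered:
        if freed >= bytes_to_free:
            break
        victims.append(f)
        freed += f.size_bytes
    return victims


if __name__ == "__main__":
    cache = [
        CachedFile("a", 100, last_access=3.0),
        CachedFile("b", 500, last_access=2.0),
        CachedFile("c", 50, last_access=1.0),
    ]
    print([f.name for f in select_for_deletion(cache, 120, "lru")])
    print([f.name for f in select_for_deletion(cache, 120, "size")])
```

With LRU ordering, a large file that is still being read is kept in preference to small files that have not been touched for a long time, which is the usual motivation for such a change.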
Recent Developments (I)
Hardware
–Procured and commissioned 2.6PB disk
–Procured and commissioned 15 kHS06 CPU
–T10KC tape drives deployed and ATLAS data (1.5PB) migrated
–New head nodes and core infrastructure storage capacity
–Procured a new Tier-1 core network and new Site network
ORACLE Database Hardware upgrade and re-organisation
–Rebuilding database SAN infrastructure
–Increased CASTOR database resilience. Now have two copies of the CASTOR database, maintained in step by Oracle Data Guard.
–Upgraded 3D service to ORACLE 11
Virtualisation infrastructure (Hyper-V) now approved for critical production systems (deployment starting).
CASTOR (significant improvements in latency)
–Upgraded to CASTOR (major upgrade)
–Head node replacement
EMI/UMD upgrades of Grid Middleware
Castor Issues
Load-related issues on small/full service classes (e.g. AtlasScratchDisk; LHCbRawRDst)
–Load can become concentrated on one or two disk servers.
–Exacerbated by an uneven distribution of disk server sizes.
Solutions:
–Add more capacity; clean-up.
–Changes to tape migration policies.
–Re-organization of service classes.
Disk Server Outages by Cause (2011)
Disk Drive Failure – Year 2011
Double Disk Failures (2011)
We are in the process of updating the firmware on the particular batch of disk controllers.
Data Loss Incidents
Summary of losses since GridPP26
Total of 12 incidents logged:
–1: Due to a disk server failure (loss of 8 files for CMS)
–1: Due to a bad tape (loss of 3 files for LHCb)
–1: Files not in Castor Nameserver but no location (9 LHCb files)
–9: Cases of corrupt files. In most cases the files were old (and pre-date Castor checksumming).
Checksumming is now in place for tape and disk files.
Daily and random checks are made on disk files.
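The random checks on disk files described above can be sketched roughly as follows. This is an illustration only: the slide does not say which checksum algorithm is used, so Adler-32 is an assumption here, and the `catalogue` mapping and both function names are hypothetical.

```python
import random
import zlib


def adler32_of_file(path, chunk_size=1 << 20):
    """Compute the Adler-32 checksum of a file, reading it in chunks."""
    checksum = 1  # Adler-32 starting value
    with open(path, "rb") as fh:
        while True:
            chunk = fh.read(chunk_size)
            if not chunk:
                break
            checksum = zlib.adler32(chunk, checksum)
    return checksum & 0xFFFFFFFF


def find_corrupt(catalogue, sample_size, rng=None):
    """Re-checksum a random sample of files and return the mismatches.

    catalogue is a hypothetical mapping {path: recorded_checksum}.
    """
    rng = rng or random.Random()
    paths = sorted(catalogue)
    sample = rng.sample(paths, min(sample_size, len(paths)))
    return [p for p in sample if adler32_of_file(p) != catalogue[p]]
```

A daily job could call `find_corrupt` with a modest `sample_size`, so that every file is eventually re-read without loading the disk servers heavily on any one day. Files written before checksums were recorded have no entry to compare against, which is consistent with most of the corrupt files above being old.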
T10KC Tapes In Production
Type  Capacity  In Use  Total Capacity
A  0.5TB  PB
B  1TB  PB (CMS)
C  5TB
T10000C Issues
Failure of 6 out of 10 tapes.
–Current A/B failure rate roughly 1 in
–After writing part of a tape an error was reported.
Concerns are threefold:
–A high rate of write errors causes disruption
–If tapes could not be filled our capacity would be reduced
–We were not 100% confident that data would be secure
Updated firmware in drives.
–100 tapes now successfully written without problem.
In contact with Oracle.
A couple of final comments
–Disk server issues are the main area of effort for hardware reliability / stability... but do not forget the network.
–Hardware that has performed reliably in the past may throw up a systematic problem.
Formal Operations Processes
[Diagram: Change Review, Exception Review, SIR Review, Team Fault Review, WLCG Daily Ops, Liaison Meeting, Production Scheduling, Management Meeting, Requirements, Exception Handling]
Service Exceptions 2011
Definitions
–Service exception: high priority fault alert raising a pager call
–Callout: service exception raised outside formal working hours
Operations Team
–Daytime: Admin on Duty (AoD). Holds pager, handles service exceptions, passes on to daytime teams.
–Nighttime: Primary On-call (like AoD). Holds pager, fixes easy problems, operationally in charge. Second-line on-call (one per team) guarantees response. Some (not guaranteed) third-line support or escalation in serious incidents.
Exceptions Count in 2011
–461 service exceptions
–265 callouts
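To put the 2011 counts above in proportion, the short calculation below derives the out-of-hours share and rough weekly averages. It assumes, following the definitions on this slide, that every callout is also counted as a service exception, and it uses a 52-week year; the variable names are illustrative.

```python
SERVICE_EXCEPTIONS_2011 = 461  # from the slide
CALLOUTS_2011 = 265            # from the slide; assumed to be a subset of the above
WEEKS_PER_YEAR = 52            # simplifying assumption

in_hours = SERVICE_EXCEPTIONS_2011 - CALLOUTS_2011
callout_share = CALLOUTS_2011 / SERVICE_EXCEPTIONS_2011
exceptions_per_week = SERVICE_EXCEPTIONS_2011 / WEEKS_PER_YEAR
callouts_per_week = CALLOUTS_2011 / WEEKS_PER_YEAR

print(f"In-hours exceptions:   {in_hours}")                  # 196
print(f"Out-of-hours share:    {callout_share:.1%}")         # 57.5%
print(f"Exceptions per week:   {exceptions_per_week:.1f}")   # 8.9
print(f"Callouts per week:     {callouts_per_week:.1f}")     # 5.1
```

On these figures, more than half of all service exceptions arrived outside formal working hours, roughly five per week on average.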
Exceptions by Type by Week
Exceptions by Service
Plans for Future
ORACLE 11 upgrade for CASTOR/LFC/FTS needed by July
CASTOR
–Switch on transfer manager (reduce transfer startup latency)
–Upgrade to (needed before Oracle 11 upgrade)
–Upgrade to
Network (move Tier-1 backbone to 40Gb/s)
–Site front of house network upgrade early summer
–Tier-1 new routing and spine layer
DRI ….