RAL Tier1 Operations, Andrew Sansum, 18th April 2012



Staffing
Staff changes since GridPP27:
Leavers
– Kier Hawker (Database Team Leader)
New Starters
– Orlin Alexandrov (Grid Team)
– Dimitrios (Fabric Team)
– Vasilij Savin (Fabric Team)
New Roles
– Ian Collier – Grid Team Leader
– Richard Sinclair – Database Team Leader
– James Adams – storage system development
31 March 2014 Tier-1 Status

Some Changes
CVMFS in use for ATLAS & LHCb:
– The ATLAS (NFS) software server used to give significant problems.
– Some CVMFS teething issues, but overall much better!
Virtualisation:
– Starting to bear fruit. Uses Hyper-V.
– Numerous test systems.
– Production systems that do not require particular resilience.
Quattor:
– Large gains already made.

Database Infrastructure
We are making significant changes to the Oracle database infrastructure.
Why?
– Old servers are out of maintenance
– Move from 32-bit to 64-bit databases
– Performance improvements
– Standby systems
– Simplified architecture

Database Disk Arrays - Future
[Diagram labels: Fibre Channel SAN, Oracle RAC nodes, disk arrays, power supplies (on UPS), Data Guard]

Castor
Changes since last GridPP Meeting:
– Castor upgrade to (March)
– Castor version (July) needed for the higher capacity "T10KC" tapes.
– Updated Garbage Collection Algorithm to LRU rather than the default, which is based on size. (July)
– (Moved logrotate to 1pm rather than 4am.)
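The garbage collection change above swaps one eviction order for another. The sketch below shows how the two orders can pick different files on the same disk pool. It is illustrative only, not CASTOR code: `CachedFile`, `select_lru` and `select_largest_first` are hypothetical names, and "largest first" is an assumed stand-in for the size-based default, which the slides do not describe in detail.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CachedFile:
    name: str
    size_gb: float
    last_access: int  # hypothetical timestamp; larger means more recent


def _take_until_freed(ordered_files, space_needed_gb):
    """Walk files in the given order until enough space would be freed."""
    victims, freed = [], 0.0
    for f in ordered_files:
        if freed >= space_needed_gb:
            break
        victims.append(f)
        freed += f.size_gb
    return victims


def select_lru(files, space_needed_gb):
    """LRU policy: evict the least recently accessed files first."""
    ordered = sorted(files, key=lambda f: f.last_access)
    return _take_until_freed(ordered, space_needed_gb)


def select_largest_first(files, space_needed_gb):
    """Assumed size-based policy: evict the largest files first."""
    ordered = sorted(files, key=lambda f: f.size_gb, reverse=True)
    return _take_until_freed(ordered, space_needed_gb)


if __name__ == "__main__":
    pool = [
        CachedFile("a", 10.0, 300),  # large, accessed most recently
        CachedFile("b", 2.0, 100),   # small, accessed longest ago
        CachedFile("c", 3.0, 200),
    ]
    print([f.name for f in select_lru(pool, 4.0)])
    print([f.name for f in select_largest_first(pool, 4.0)])
```

In this toy pool the size-based order evicts the large file that was read most recently, while LRU keeps it and removes the two older, smaller files.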

Recent Developments (I)
Hardware
– Procured and commissioned 2.6PB disk
– Procured and commissioned 15 kHS06 CPU
– T10KC tape drives deployed and (1.5PB) ATLAS data migrated
– New head nodes and core infrastructure storage capacity
– Procured a new Tier-1 core network and new site network
ORACLE database hardware upgrade and re-organisation
– Rebuilding database SAN infrastructure
– Increased CASTOR database resilience. Now have two copies of the CASTOR database, maintained in step by Oracle Data Guard.
– Upgraded 3D service to ORACLE 11
Virtualisation infrastructure (Hyper-V) now approved for critical production systems (deployment starting).

CASTOR (significant improvements in latency)
– Upgraded to CASTOR (major upgrade)
– Head node replacement
EMI/UMD upgrades of Grid Middleware

Castor Issues
Load-related issues on small/full service classes (e.g. AtlasScratchDisk; LHCbRawRDst)
– Load can become concentrated on one or two disk servers.
– Exacerbated by an uneven distribution of disk server sizes.
Solutions:
– Add more capacity; clean-up.
– Changes to tape migration policies.
– Re-organization of service classes.

Disk Server Outages by Cause (2011)

Disk Drive Failure – Year 2011

Double Disk Failures (2011)
In process of updating the firmware on the particular batch of disk controllers.

Data Loss Incidents
Summary of losses since GridPP26
Total of 12 incidents logged:
– 1 – Due to a disk server failure (loss of 8 files for CMS)
– 1 – Due to a bad tape (loss of 3 files for LHCb)
– 1 – Files not in Castor Nameserver but no location (9 LHCb files)
– 9 – Cases of corrupt files. In most cases the files were old (and pre-date Castor checksumming).
Checksumming in place for tape and disk files.
Daily and random checks made on disk files.
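A minimal sketch of the kind of check described above: recompute a file's checksum and compare it with the value recorded when the file was written. It assumes Adler-32 as the checksum (the slides do not name the algorithm). The catalogue mapping and the names `adler32_of_file`, `find_mismatches` and `pick_random_sample` are hypothetical and are not part of CASTOR.

```python
import random
import zlib


def adler32_of_file(path, chunk_size=1024 * 1024):
    """Compute the Adler-32 checksum of a file, reading it in chunks."""
    checksum = 1  # Adler-32 starting value
    with open(path, "rb") as fh:
        while chunk := fh.read(chunk_size):
            checksum = zlib.adler32(chunk, checksum)
    return checksum & 0xFFFFFFFF


def find_mismatches(catalogue):
    """Return paths whose current checksum differs from the recorded one.

    `catalogue` is a hypothetical mapping of path -> recorded checksum.
    """
    return [
        path
        for path, recorded in catalogue.items()
        if adler32_of_file(path) != recorded
    ]


def pick_random_sample(paths, sample_size, seed=None):
    """Choose a random subset of files for a spot check."""
    paths = list(paths)
    rng = random.Random(seed)
    return rng.sample(paths, min(sample_size, len(paths)))
```

A check like this only detects corruption in files that had a checksum recorded at write time, which is consistent with most of the corrupt files above pre-dating checksumming.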

T10KC Tapes In Production
Type   Capacity   In Use   Total Capacity
A      0.5TB               PB
B      1TB                 PB (CMS)
C      5TB

T10000C Issues
Failure of 6 out of 10 tapes.
– Current A/B failure rate roughly 1 in
– After writing part of a tape an error was reported.
Concerns are threefold:
– A high rate of write errors causes disruption
– If tapes could not be filled our capacity would be reduced
– We were not 100% confident that data would be secure
Updated firmware in drives.
– 100 tapes now successfully written without problem.
In contact with Oracle.
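As a rough guide to what 100 clean writes can show, the sketch below computes the largest per-tape failure probability still consistent with seeing no failures. This calculation is not from the slides. It assumes tapes fail independently with the same probability, which may not hold for a firmware fault, and `upper_bound_failure_rate` is a hypothetical helper name.

```python
def upper_bound_failure_rate(successes, confidence=0.95):
    """Upper bound on per-item failure probability after zero failures.

    If each write fails independently with probability p, the chance of
    `successes` clean writes in a row is (1 - p) ** successes. Setting
    that equal to (1 - confidence) and solving for p gives the bound.
    """
    return 1.0 - (1.0 - confidence) ** (1.0 / successes)


if __name__ == "__main__":
    before = 6 / 10  # failures observed before the firmware update
    after = upper_bound_failure_rate(100)  # 100 tapes written cleanly
    print(f"observed before update: {before:.0%}")
    print(f"95% upper bound after update: {after:.1%}")
```

Under these assumptions the 100 clean writes rule out a failure rate above about 3% at 95% confidence, far below the 6 in 10 seen initially, but they cannot by themselves confirm a rate as low as that of the older A/B media.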

A couple of final comments
Disk server issues are the main area of effort for hardware reliability / stability...
...but do not forget the network.
Hardware that has performed reliably in the past may throw up a systematic problem.

Formal Operations Processes
[Diagram labels, in transcript order: Change Review Exception Review SIR Review Team Fault Review WLCG Daily Ops Liaison Meeting Production Scheduling Management Meeting Requirements Exception Handling]

Service Exceptions 2011
Definitions
– Service exception – high priority fault alert raising a pager call
– Callout – service exception raised outside formal working hours
Operations Team
– Daytime – Admin on Duty (AoD). Holds pager, handles service exceptions, passes on to daytime teams.
– Nighttime – Primary On-call (like AoD). Holds pager, fixes easy problems, operationally in charge. Second-line on-call (one per team) guarantees response. Some (not guaranteed) third-line support or escalation in serious incidents.
Exceptions Count in 2011
– 461 service exceptions
– 265 callouts
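The two counts above can be turned into rates with simple arithmetic. The sketch below uses only the 2011 totals from the slide; `summarise_exceptions` is a hypothetical helper, and the averages assume a 365-day year with no allowance for downtime.

```python
def summarise_exceptions(service_exceptions, callouts, days_in_year=365):
    """Derive simple rates from yearly exception and callout totals."""
    weeks = days_in_year / 7
    return {
        # Callouts are the exceptions raised outside formal working hours.
        "callout_fraction": callouts / service_exceptions,
        "in_hours_exceptions": service_exceptions - callouts,
        "exceptions_per_week": service_exceptions / weeks,
        "callouts_per_week": callouts / weeks,
    }


if __name__ == "__main__":
    stats = summarise_exceptions(461, 265)
    for key, value in stats.items():
        print(f"{key}: {value:.3f}")
```

On these figures a little over half of all service exceptions (about 57%) arrived outside working hours, averaging roughly five callouts a week for the on-call team.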

Exceptions by Type by Week

Exceptions by Service

Plans for Future
ORACLE 11 upgrade for CASTOR/LFC/FTS needed by July
CASTOR
– Switch on transfer manager (reduce transfer startup latency)
– Upgrade to (needed before Oracle 11 upgrade)
– Upgrade to
Network (move Tier-1 backbone to 40Gb/s)
– Site front of house network upgrade early summer
– Tier-1 new routing and spine layer
DRI ….