RAL Site Report
Castor F2F, CERN
Matthew Viljoen


Stuff to cover (at some point)
- General operations report (Matthew)
- Database report (Rich)
- Tape report (Tim)
- Disk-only solution investigations (Shaun)

Current group structure
Scientific Computing Department (Adrian Wander) -> Data Services group (Jens) -> DB Team (Rich) and CASTOR Team (Matt)
Running CASTOR:
- Matt 70% mgmt/ops
- Chris 50% ops
- Shaun 25% ops & problem solving
- Rob 100% ops, CIP
- Jens 10% CIP
- Tim 50% tape
… plus effort from the DBAs, 1 FTE from the Fabric Team and 0.5 FTE from the Production Team

Current status
CASTOR production instances (all NS):
- Tier 1: ATLAS, CMS, LHCb and Gen (ALICE, H1, T2K, MICE, MINOS, ILC)
- Facilities: DLS, CEDA
Test instances (2.1.12/2.1.13): Preprod + Cert
CIP: homegrown, in LISP
DB: details from Rich…
Tape: SL8500 with 2,453 x A, 4,655 x B, 1,998 x C tapes for Tier 1

Setup
Headnodes:
- "Stager/Scheduler/DLF" per instance, plus 2 NS
- 10 SRMs (4 ATLAS, 2 each for CMS, LHCb and Gen, running SRM 2.11)
Disk servers: approx. 443, 10-40 TB each, RAID6, ext4/XFS
DB: details later from Rich…
CIP: homegrown, in LISP

Stats (Nov '12)

VO          Disk (used/total)   Tape (used)
ATLAS       3.4/4.2 PB          2.6 PB
CMS         1.2/1.9 PB          3.7 PB
LHCb        1.3/2 PB            1.2 PB
Gen         0.3/0.4 PB          0.85 PB
Facilities  (no D1T0)           1.5 PB (inc. D0T2)

Total used: 6.2 PB disk, 11.5 PB tape
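As a quick illustration of how the per-VO disk figures combine into the 6.2 PB total, here is a minimal Python sketch; the numbers are simply copied from the table above and the script is illustrative only, not part of the site's tooling.

    # Disk usage per VO, copied from the Nov '12 table above (units: PB).
    disk = {
        "ATLAS": (3.4, 4.2),   # (used, total)
        "CMS":   (1.2, 1.9),
        "LHCb":  (1.3, 2.0),
        "Gen":   (0.3, 0.4),
    }

    for vo, (used, total) in disk.items():
        print(f"{vo:6s} {used:.1f}/{total:.1f} PB  ({used / total:.0%} full)")

    total_used = sum(used for used, _ in disk.values())
    total_size = sum(total for _, total in disk.values())
    print(f"Total  {total_used:.1f}/{total_size:.1f} PB used")  # reproduces the 6.2 PB figure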

SL8500 usage (All RAL)

Tape usage (Facilities)

Recent changes (2011/2012)
Complete hardware refresh this year
12Q2:
- Minor * upgrades
- Switched tape subsystem to the Tape Gateway
- Switched from LSF to the Transfer Manager on all instances: no more licensing costs, better performance, and… simpler!
12Q3:
- Repack for LHCb: 2300 tapes, A -> C
12Q4:
- Major stager upgrade
- Introduction of global federated xroot for CMS, ATLAS

Hardware refresh
New SRMs, CASTOR + DB headnodes
SL5 and Configuration Management System (CMS) - Quattor + Puppet - control throughout
Leading to:
- Improved overall performance
- Switch-over of availability stats from SAM Ops tests to VO tests
- No more ATLAS background noise in SAM tests (before, a consistent <5% of miscellaneous ATLAS failures)
- Configuration Management System (CMS) adoption (installation, DR; no more need to rely on system backups)

CMS – a few words
Before we relied on backups; now on re-installation. A node can be reinstalled in <1 hr.
The Tier 1 solution is Quattor, supported by the Fabric Team. CASTOR has always used Puppet (for config files). Now we use a Quattor/Puppet hybrid:

Content type      Examples                                         Managed by
OS payload        glibc el5_8.4, vim-enhanced el5 RPMs             Quattor
OS-level config   resolv.conf, crontab, admin accounts/keys        Quattor
CASTOR payload    castor-stager-server, castor-vmgr-client RPMs    Puppet
CASTOR config     castor.conf, tnsnames.ora                        Puppet
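Purely to illustrate the division of labour in the table above, a small Python sketch; the category keys and the managing_tool() helper are hypothetical names invented here, not part of the actual Quattor or Puppet setup.

    # Hypothetical encoding of the Quattor/Puppet split shown in the table above.
    # The real site uses Quattor profiles and Puppet manifests, not this script.
    CONTENT_MAP = {
        "os_payload":     "Quattor",  # e.g. glibc, vim-enhanced RPMs
        "os_config":      "Quattor",  # e.g. resolv.conf, crontab, admin accounts/keys
        "castor_payload": "Puppet",   # e.g. castor-stager-server, castor-vmgr-client RPMs
        "castor_config":  "Puppet",   # e.g. castor.conf, tnsnames.ora
    }

    def managing_tool(content_type: str) -> str:
        """Return which tool manages a given content type in the hybrid scheme."""
        if content_type not in CONTENT_MAP:
            raise ValueError(f"unknown content type: {content_type}")
        return CONTENT_MAP[content_type]

    print(managing_tool("castor_config"))  # -> Puppet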

Remaining problem areas
- Disk server deployment and decommissioning overheads: can extend automation with CMS -> shouldn't need to manually "bless" disk servers
- Ongoing need for Oracle database expertise: large number of different instances (4 prod, 3 test, Facilities…)
- Lack of read-only mode with the new scheduler
- Lack of disk server balancing
- Monitoring (based on Ganglia) currently difficult to use; CK looking at new solutions
- Quattor clunkiness and general reluctance to use it: a better template structure should help; Aquilon?

SIRs affecting CASTOR over the last year
- /10/31  CASTOR ATLAS outage
- 2011/12/02  VO software server
- 2011/12/15  Network break, ATLAS SRM DB
- 2012/03/16  Network packet storm
- 2012/06/13  Oracle 11 update failure
- 2012/11/07  Site-wide power failure
- 2012/11/20  UPS over-voltage
Underlying causes included a bad Oracle execution plan, an Oracle RM bug (now fixed), a power supply outage and a power intervention gone wrong.

What next?
- Full "off-site" database Data Guard backup
- Common headnode type, for improved:
  - Resiliency: easier to replace a faulty node
  - Scalability: dynamically changing pool of headnodes
  - Doubling up daemons wherever possible -> better uptime, e.g. when applying errata or rebooting (see the sketch below)
- SL6 upgrade in the new year
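The uptime argument behind doubling up daemons is that any responsive node in a common pool can take a request while another is being patched or rebooted. A hypothetical Python sketch of that idea follows; the host names, port and TCP probe are invented for illustration and are not the actual RAL deployment.

    # Hypothetical sketch: pick the first responsive headnode from a pool, so a
    # node that is down for errata or a reboot is simply skipped by clients.
    # Host names and port are placeholders, not the real RAL configuration.
    import socket

    HEADNODE_POOL = ["castor-hn01.example.org", "castor-hn02.example.org"]
    SERVICE_PORT = 9002  # placeholder port

    def is_up(host: str, port: int, timeout: float = 2.0) -> bool:
        """Crude availability probe: can we open a TCP connection to the daemon?"""
        try:
            with socket.create_connection((host, port), timeout=timeout):
                return True
        except OSError:
            return False

    def pick_headnode() -> str:
        """Return the first headnode that answers; fail only if the whole pool is down."""
        for host in HEADNODE_POOL:
            if is_up(host, SERVICE_PORT):
                return host
        raise RuntimeError("no headnode in the pool is responding")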

Further ahead…
- Using virtualization more: Cert instance already virtualized; virtualize by default (headnodes, tape servers, CIPs…); VTL?
- 2013: new disk-only solution alongside CASTOR: higher performance for analysis, easier to run
- IPv6?

To conclude…
- CASTOR is nice and stable nowadays
  - Rigorous change control at Tier 1 also helps! Track record of good interventions
  - Comprehensive testing infrastructure paying dividends
- Right balance between new functionality and stability: trailing 3-6 months behind the CERN head version
- Good performance (esp. for tape)
- No plans to move away from CASTOR; it will run alongside the new "next-gen" disk storage solution