Castor services at the Tier-0

Jan van Eldik, CERN Castor Operations team

Outline

- Castor in numbers
- 24 x 7 operations: teams, workflows
- Building blocks for Castor services: DNS load balancing, redundant hardware
- Case studies: Castor service deployment, SRM service deployment

Castor service challenge

         Disk cache size   Servers   Disk pools   Files on disk   Data on tape
Alice         261 TB          47          3          1,311,901       742 TB
Atlas         505 TB          77          6          1,794,670       720 TB
CMS           472 TB          85          5            299,396     1,220 TB
LHCb          197 TB          36          -            700,097       453 TB
Total       1,435 TB         245         19          4.1 M         3,115 TB

Numbers from September 2007
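As a quick sanity check, the per-VO figures above can be summed and compared against the quoted totals. A minimal Python sketch (numbers copied from the table; the comparison itself is mine, not part of the slide):

```python
# Per-VO figures from the September 2007 table: (disk TB, servers, files on disk, tape TB)
vos = {
    "Alice": (261, 47, 1_311_901,   742),
    "Atlas": (505, 77, 1_794_670,   720),
    "CMS":   (472, 85,   299_396, 1_220),
    "LHCb":  (197, 36,   700_097,   453),
}

disk    = sum(v[0] for v in vos.values())  # 1435 TB  -> matches the quoted 1,435 TB
servers = sum(v[1] for v in vos.values())  # 245      -> matches
files   = sum(v[2] for v in vos.values())  # ~4.1 M   -> matches
tape    = sum(v[3] for v in vos.values())  # 3135 TB  -> slide quotes 3,115 TB (rounding or a typo on the slide)

print(disk, servers, files, tape)
```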

Castor service challenge - II

[Chart: projected disk cache growth for Alice, Atlas, CMS and LHCb, 2007-2012, in PB]
2007: 2 PB, 400 servers
2012: 21 PB, ~1000 servers
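Taken at face value, the two data points on the chart imply roughly 60% compound annual growth in disk capacity, with capacity per server rising by about a factor of four. An illustrative calculation (my own arithmetic, derived only from the figures above):

```python
# Growth implied by the chart annotations: 2 PB / 400 servers in 2007,
# 21 PB / ~1000 servers in 2012.
start_pb, end_pb, years = 2, 21, 5
cagr = (end_pb / start_pb) ** (1 / years) - 1
print(f"implied capacity growth: {cagr:.0%} per year")   # ~60%

tb_per_server_2007 = 2_000 / 400      # ~5 TB per diskserver
tb_per_server_2012 = 21_000 / 1_000   # ~21 TB per diskserver
print(tb_per_server_2007, "->", tb_per_server_2012, "TB per server")
```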

Support Infrastructure

- HelpDesk, GGUS
- Operator
- Service Manager On Duty
- SysAdmin
- CASTOR Service Expert
- CASTOR Developer

System-level alarms

- Operator: 1st level alarm handling; 24 x 7 coverage on site; driven by procedures
- SysAdmin: 2nd level alarm handling; 24 x 7 coverage, on-call out of working hours; problem determination; manages hardware repairs
- CASTOR Service Expert: service responsible; applies software upgrades and configuration changes, and provides procedures; manages disruptive interventions; handles problematic situations

User support

- HelpDesk, GGUS: 1st level user support; handles common user questions; triage
- Service Manager On Duty: 2nd level user support; handles common problems; procedure driven
- CASTOR Service Expert: handles uncommon and complex problems; provides procedures
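The two escalation chains above can be written down as simple ordered lists, which makes the handover points explicit. A minimal sketch (role names and duties are taken from the slides; modelling them as Python data is purely illustrative):

```python
# Escalation chains from the "System-level alarms" and "User support" slides.
SYSTEM_ALARM_CHAIN = [
    ("Operator",              "1st level alarm handling, 24x7 on site, procedure driven"),
    ("SysAdmin",              "2nd level, on-call out of hours, problem determination, hardware repairs"),
    ("CASTOR Service Expert", "service responsible: upgrades, config changes, disruptive interventions"),
]

USER_SUPPORT_CHAIN = [
    ("HelpDesk / GGUS",         "1st level user support: common questions, triage"),
    ("Service Manager On Duty", "2nd level: common problems, procedure driven"),
    ("CASTOR Service Expert",   "uncommon and complex problems, provides procedures"),
]

def handler(chain, escalations):
    """Role that owns a ticket after `escalations` hand-overs (0 = first line)."""
    role, duty = chain[min(escalations, len(chain) - 1)]
    return role, duty

print(handler(USER_SUPPORT_CHAIN, 1))   # ('Service Manager On Duty', ...)
```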

Number of calls per week

- Approximate weekly call rates along the support chains: Operator -> SysAdmin: 127; HelpDesk/GGUS -> Service Manager On Duty: 18; SMoD -> Castor service manager: 6; SAO -> us: 5; Castor service manager -> Developer: 0.5
- Underlying ticket counts: Ops -> SysAdmin: 4187 tickets; SAO -> us: 173 (94 direct, 79 via SMoD); HelpDesk -> SMoD: 605; SMoD -> Castor service manager: 200; CSM -> Dev: 19
- Plus ~10 calls via support lists, direct e-mails, phone calls, ...
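The per-week rates are consistent with the quoted ticket totals if those totals were accumulated over roughly 33 weeks; the length of the counting period is an inference, not stated on the slide. A small check:

```python
# (total tickets, tickets per week) for each escalation path, from the slide.
flows = {
    "Operator -> SysAdmin":           (4187, 127),
    "HelpDesk -> SMoD":               (605,   18),
    "SMoD -> CASTOR service manager": (200,    6),
    "SAO -> CASTOR team":             (173,    5),
    "CSM -> Developer":               (19,   0.5),
}

for path, (total, per_week) in flows.items():
    print(f"{path:32s} {total:5d} tickets -> ~{total / per_week:.0f} weeks at {per_week}/week")
```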

Technologies used

- DNS aliases
  - Provide user-friendly entry points to services
  - Allow changing the deployment layout transparently (failover to standby servers)
- Load balancing where possible
  - Multiple (cheap) servers, split over network switches, power bars, ...
  - Scale the service by adding servers
  - Allows 'cyclic upgrades'
  - Pre-requisite: stateless daemons
- Hardware with redundancy features
  - Hot-swappable disks and power supplies, hardware RAID
  - 'Mid-range' servers for core components
  - NAS diskservers in RAID-5 + spare configurations; remaining SPOFs: motherboards, RAID controllers, ...
- Oracle databases on RAC
  - Only component on 'critical power'
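To illustrate the DNS load-balancing building block: a load-balanced alias simply publishes several A records, so clients spread over the pool and a server can be dropped from the alias (drained) before a cyclic upgrade. A minimal sketch using the standard resolver; srm.cern.ch is used only as an example name and may resolve differently today:

```python
import socket

def resolve_all(alias, port=443):
    """Return the IPv4 addresses currently published behind a DNS alias."""
    infos = socket.getaddrinfo(alias, port, socket.AF_INET, socket.SOCK_STREAM)
    return sorted({sockaddr[0] for *_, sockaddr in infos})

# A load-balanced service alias should resolve to several back-end servers;
# removing a host from the alias takes it out of production transparently.
addresses = resolve_all("srm.cern.ch")
print(f"{len(addresses)} back-end address(es):", addresses)
```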

Castor central services

Castor diskcaches

- Five independent diskcaches: Alice, Atlas, CMS, LHCb, Public
- Each instance runs its own server cluster: stager, request handler, scheduler, rtcpclientd, ...
  - Most of them stateless
  - Current deployment: midrange servers, with DNS aliases for all of them
  - Near future: load-balanced aliases (except for the scheduler)
- Diskserver pools
  - Configured and sized according to experiment needs
  - Rely on hardware redundancy, and on copies on tape
  - Fully automated box management
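To make the DNS-alias scheme concrete: each daemon of each instance gets its own alias, which can later be turned into a load-balanced alias without touching clients. A sketch with entirely hypothetical alias names (the real aliases are not listed on the slide):

```python
# Hypothetical mapping of per-instance CASTOR daemons to DNS aliases.
# The instance and daemon names come from the slide; the alias naming
# scheme below is invented purely for illustration.
INSTANCES = ["alice", "atlas", "cms", "lhcb", "public"]
DAEMONS   = ["stager", "requesthandler", "scheduler", "rtcpclientd"]

def alias_for(instance: str, daemon: str) -> str:
    # e.g. "c2alice-stager.cern.ch" -- hypothetical, not a real hostname scheme
    return f"c2{instance}-{daemon}.cern.ch"

endpoints = {(i, d): alias_for(i, d) for i in INSTANCES for d in DAEMONS}
print(endpoints[("atlas", "stager")])
```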

Castor SRM

SRM v1
- DNS alias srm.cern.ch, shared by all VOs
- Load-balanced over 10 CPU servers
- Deployment bug: on a single switch
- SPOF: shared request spool

SRM v2
- Separate endpoints per VO
- Fixed the bug in load balancing
- Any SPOFs left?
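The single-switch deployment bug mentioned for SRM v1 is the kind of SPOF that can be caught by cross-checking the resolved back-ends of an alias against the hardware inventory. A sketch with an entirely hypothetical address-to-switch mapping (the real inventory and addresses are not in the slides):

```python
import socket
from collections import Counter

def backends(alias, port=8443):
    """IPv4 addresses behind a DNS alias (the port number here is arbitrary)."""
    infos = socket.getaddrinfo(alias, port, socket.AF_INET, socket.SOCK_STREAM)
    return {sockaddr[0] for *_, sockaddr in infos}

# Hypothetical inventory: back-end address -> network switch.
# In practice this would come from the site's hardware database.
SWITCH_OF = {
    "10.0.1.11": "switch-A",
    "10.0.1.12": "switch-A",
    "10.0.2.21": "switch-B",
}

def switch_spread(addresses):
    per_switch = Counter(SWITCH_OF.get(a, "unknown") for a in addresses)
    if len(per_switch) < 2:
        print("WARNING: all back-ends sit behind a single switch (SPOF)")
    return dict(per_switch)

# Example with the made-up addresses above; for a real check one would
# call switch_spread(backends("srm.cern.ch")) against a real inventory.
print(switch_spread({"10.0.1.11", "10.0.1.12", "10.0.2.21"}))
```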

SRM v2.2 production endpoints

Conclusion & Outlook

- Castor service = H/W + S/W + operations
- Workflows are in place for 24 x 7 operations: teams, alarms, procedures
- We are actively hunting down SPOFs: DNS aliases and load balancing, redundant hardware
- Castor software is rapidly maturing ("Thanks!" to the developers)
- Ready to add 700 diskservers in 2008, and to operate them!

[Overview diagram: Castor nameserver, disk cache and tape backend, serving the online farms, reconstruction farms, analysis facility and WAN data exports.]