CASTOR Status at RAL
CASTOR External Operations Face To Face Meeting
Bonny Strong, 10 June 2008
Topics
– Current Architecture
– Upgrades and Changes Completed
– Operational Challenges
– Tape Server Issues
– Diskserver Deployment Project
– CCRC08
– Certification Testbed
– Top 5 Issues
– Plans for Next 6 Months
Current Architecture – Production
– cms stager: 3 head nodes, 45 diskservers (552 TB)
– atlas stager: 3 head nodes, 64 diskservers (433 TB)
– lhcb stager: 3 head nodes, 21 diskservers (133 TB)
– gen stager: 3 head nodes, 1 diskserver (6 TB) (repack plus smaller users)
Current Architecture – Production Shared Services
– Nameservers: 2 servers for nsdaemon in a DNS load-balanced cluster (see the sketch below); 1 of these also hosts vdqm, vmgr, and cupv
– Tape servers: 15 servers with FC-attached STK T10000 tape drives
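The slides don't say how the load-balanced alias is implemented; as a minimal sketch, assuming a hypothetical alias name, the A records behind such a DNS cluster can be listed like this:

```python
import socket

# Hypothetical alias: the real RAL nameserver hostname is not
# given in the slides.
ALIAS = "castor-ns.example.ac.uk"

def resolve_all(alias):
    """List the A records behind a DNS load-balanced alias.

    The cluster publishes one A record per nsdaemon host; clients
    pick one per lookup, which spreads load across the servers.
    """
    try:
        infos = socket.getaddrinfo(alias, None, socket.AF_INET)
    except socket.gaierror as exc:
        print("lookup failed: %s" % exc)
        return []
    return sorted({info[4][0] for info in infos})

if __name__ == "__main__":
    for address in resolve_all(ALIAS):
        print(address)
```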
Upgrades and Changes Completed
– Nov 2007 – upgrade to 2.1.4 and SL4-64 (except tape servers)
– Dec 2007 – SRMv2 in production
– Apr 2008 – upgrade to 2.1.6
– Oracle RAC (currently 3 nodes), hosting:
  – SRMv2
  – Gen instance stager and dlf
Upgrades and Changes Completed – continued
– Diskserver deployment project – prototype and testing
– Nagios monitoring for 24/7 support and callout procedures (a hedged check sketch follows)
– Dynamic Information Provider developed
– Migration from dCache:
  – Completed for lhcb
  – Completed for atlas disk
  – Still working on atlas tape
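The Nagios configuration itself isn't shown in the deck; a minimal sketch of the kind of check involved is a plugin that probes a daemon's TCP port and reports through the standard Nagios exit codes (0 OK, 1 WARNING, 2 CRITICAL). The hostname below is a placeholder, and 5010 as the nameserver port is an assumption.

```python
#!/usr/bin/env python
"""Sketch of a Nagios-style check for a CASTOR daemon port.

The host is a placeholder; 5010 (the usual CASTOR nameserver
port) is an assumption, not taken from the slides.
"""
import socket
import sys

OK, WARNING, CRITICAL = 0, 1, 2

def check_tcp(host, port, timeout=5.0):
    """Probe host:port with a TCP connect; return (status, message)."""
    try:
        socket.create_connection((host, port), timeout).close()
        return OK, "OK: %s:%d is accepting connections" % (host, port)
    except socket.timeout:
        return WARNING, "WARNING: %s:%d timed out" % (host, port)
    except socket.error as exc:
        return CRITICAL, "CRITICAL: %s:%d unreachable (%s)" % (host, port, exc)

if __name__ == "__main__":
    status, message = check_tcp("castor-ns.example.ac.uk", 5010)
    print(message)       # Nagios shows the first line of output
    sys.exit(status)     # the exit code drives the alert state
```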
Operational Challenges
– Power failures:
  – 8 Feb: 1½ days to get CASTOR back in service
  – 6 May: 1 full day to bring CASTOR back in service
  – Database filesystem corruption; trying to get a UPS for the DBs
  – LSF startup issues
– Establishing an after-hours callout system:
  – Linkage between Nagios and bleeper callout
  – Hindered by a broken link between the Tier1 helpdesk and GGUS
  – Tuning monitoring alerts
  – Developing operational documentation
  – Developing handoff procedures
– Backplane meltdown of Viglen diskservers (Feb)
– Failure of new diskservers (April)
– Loss of DBA due to budget cuts
Tape Server Issues
– Have not yet been able to install SLC4 64-bit:
  – NI failure with 64-bit (now understood?)
  – Missing device /dev/nst1 – fibre-channel card? Transtec vs IBM servers? Something special in the kernel at CERN? (a basic device-presence check is sketched below)
  – Have been working on this tape server upgrade for over 6 months
  – Still running CASTOR 2.1.3 on SLC3
– Much work on improving migration performance
– Now seeing serious problems with servers hanging under a heavy load of recalls
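A minimal sketch of the first diagnostic for the missing /dev/nst1 problem: check that the expected no-rewind tape device nodes exist. The expected drive count is a placeholder.

```python
"""Hedged sketch: verify expected SCSI tape device nodes exist.

The missing /dev/nst1 problem above was diagnosed by hand; this
just automates the obvious first check.
"""
import os

EXPECTED_DRIVES = 2  # placeholder: drives attached to this server

def missing_tape_devices(expected=EXPECTED_DRIVES):
    """Return the no-rewind tape device nodes that are absent."""
    wanted = ["/dev/nst%d" % n for n in range(expected)]
    return [dev for dev in wanted if not os.path.exists(dev)]

if __name__ == "__main__":
    missing = missing_tape_devices()
    if missing:
        print("Missing tape devices: %s" % ", ".join(missing))
    else:
        print("All %d expected tape devices present" % EXPECTED_DRIVES)
```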
Disk server deployment today and tomorrow (1/2)
[Diagram, today: kick-start script → post-install script → CASTOR personalization → disk server registration]
Disk server deployment today and tomorrow (2/2)
[Diagram, tomorrow: the same chain – kick-start script → post-install script → CASTOR personalization → disk server registration – driven by Puppet; a hedged orchestration sketch follows]
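The slides name the stages but not how RAL implements them; a hedged orchestration sketch, with every command path invented for illustration, might look like this:

```python
"""Hedged sketch of the deployment pipeline above.

Every command and path here is a placeholder: the slides only name
the stages (kickstart, post-install, CASTOR personalization,
registration, Puppet), not how they are implemented.
"""
import subprocess

# Placeholder stage commands, one per box in the slide's diagram
# (kickstart itself runs before the OS is up, so it isn't listed).
STAGES = [
    ("post-install",        ["/usr/local/sbin/post-install.sh"]),
    ("castor-personalize",  ["/usr/local/sbin/castor-personalize.sh"]),
    ("register-diskserver", ["/usr/local/sbin/register-diskserver.sh"]),
]

def deploy(hostname):
    """Run each stage in order, stopping at the first failure."""
    for name, command in STAGES:
        print("[%s] running stage %s" % (hostname, name))
        result = subprocess.call(command + [hostname])
        if result != 0:
            raise RuntimeError("stage %s failed (exit %d)" % (name, result))
    print("[%s] deployment complete" % hostname)
```

In the "tomorrow" version, Puppet would own these steps instead of ad-hoc scripts.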
CCRC08
– Big success of migration policies in overcoming poor tape migration for CMS
– Working out the system and procedures for out-of-hours callouts has required much time and effort
– Biggest problems:
  – Power outage on 6 May
  – Problems with the new root certificate on 20 May
  – SRMv2 crashes
  – Major problems with tape drives hanging, starting last week
CCRC08 – continued
Smaller problems:
– Disk-disk copies staying in PEND; setting DiskCopyPendTimeout made this workable
– Jobs submitted to disk1 servers which became full; overcome by setting PendTimeout
– Daily restart of castor-gridftp (to clean up dead processes) causing failures of both SE and CE jobs (a hedged restart-guard sketch follows)
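One mitigation for the restart-induced job failures, sketched under assumptions (the process pattern and init script path are placeholders), is to skip the daily restart while transfers are still in flight:

```python
"""Sketch: restart castor-gridftp only when no transfers are active.

A blind daily restart killed in-flight SE and CE jobs; deferring
it while transfer processes exist is one mitigation.
"""
import subprocess
import sys

TRANSFER_PATTERN = "gridftp"                # placeholder pgrep pattern
INIT_SCRIPT = "/etc/init.d/castor-gridftp"  # placeholder path

def active_transfers(pattern=TRANSFER_PATTERN):
    """Count processes whose command line matches the pattern."""
    proc = subprocess.Popen(["pgrep", "-f", pattern],
                            stdout=subprocess.PIPE)
    out, _ = proc.communicate()
    return len(out.split())

if __name__ == "__main__":
    # The listening daemon itself matches, so more than one hit
    # means child transfer processes are still running.
    if active_transfers() > 1:
        print("Transfers in flight; skipping restart")
        sys.exit(0)
    sys.exit(subprocess.call([INIT_SCRIPT, "restart"]))
```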
CCRC08 – Tape Transfer Stats
[Chart: tape transfer statistics, data from the last week of May]
Certification Testbed
– Have not delivered what we expected
– Have completed installation of the infrastructure
– Extending the contract for the testbed sysadmin another 4 months, pending STFC approval
– Need to maintain the release structure to be able to take advantage of work completed
Top 5 Issues
1. Tape drives hanging
2. GC bug
3. Problems installing tape servers with SL4-64
4. Repack
5. Disk-disk copies staying in PEND or (earlier) multiple copies
Other issues:
– Slow database, requires new stats
– Support for work on migration policies
– Jobs submitted to disk1 servers which become full
– Bulk deletes
Plans for Next 6 Months
1. Diskserver deployment project
2. Certification testbed used to test new releases
3. Tape servers upgraded to SLC4-64
4. Disaster recovery plan: based on Puppet, kickstart, and DB backups; documented; tested
5. Support for small experiments
6. Repack operational
7. Xrootd and rootd
Plans for Next 6 Months – continued
8. Improved resilience: Oracle RAC and Dataguard; redundant, load-balanced stagers; cold-standby jobManager/LSF hosts; failover of vdqm/vmgr/cupv
9. Decommission SRMv1 endpoints
10. Castor-gridftp v2 – internal
11. Continued improvements in monitoring and documentation
Plans for Next 6 Months – continued
12. Planning for the move to the new computer building in Dec 08
13. Possible second robot
14. Tape media – T10000B
15. UPS for DB servers ASAP!
CIP – CASTOR Information Provider
Jens Jensen (in absentia)
CASTOR F2F meeting, 10-11 June 08
Anatomy of the CIP
[Diagram: CASTOR stagers → CIP back end → CIP front end → the Grid]
Front End
– Written by Derek Ross from Tier 1
– Publishes into the Tier 1 BDIIs
– Uses a condensed text format for communicating with the back end:
  – Historical reasons
  – One line per service class, condensing "all the news that's fit to print" (a hypothetical parsing sketch follows)
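The slides only say "one line per service class"; the field layout below is invented for illustration, not the real format:

```python
"""Hypothetical sketch of a condensed one-line-per-service-class
format like the one the CIP front and back ends exchange.

The field names and layout are assumptions made for illustration.
"""

def parse_service_class(line):
    """Parse one whitespace-separated line into a dict.

    Assumed layout: name, total space (bytes), free space (bytes).
    """
    name, total, free = line.split()
    return {"name": name, "total": int(total), "free": int(free)}

if __name__ == "__main__":
    sample = "atlasFarm 433000000000000 12000000000000"
    print(parse_service_class(sample))
```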
Back End
– Queries the stager for information about disk pools
– Could have queried the SRM DBs for STDs (space token descriptions) etc., but uses a configuration file for non-dynamic information
– Not all classes are published
– Documented
Next Steps
– Deploy the backend with improved exception handling:
  – Currently hangs if the stager is down (a hedged timeout sketch follows)
  – The most recent version is ready for production but hasn't been deployed yet
– Could replace the front end and publish LDIF directly
– Packaging and improved deployment
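A hedged sketch of the exception-handling fix: bound the stager query with a timeout so the CIP fails cleanly instead of hanging. `query_stager` is a placeholder for whatever call the real back end makes; only the timeout pattern is the point.

```python
"""Sketch: bound a stager query with a timeout so the CIP cannot
hang forever when the stager is down.
"""
import multiprocessing

def query_stager(service_class):
    """Placeholder for the real (potentially hanging) stager query."""
    return {"name": service_class, "free": 0, "total": 0}

def query_with_timeout(service_class, timeout=30):
    """Run the query in a worker process; give up after `timeout` s."""
    pool = multiprocessing.Pool(processes=1)
    try:
        result = pool.apply_async(query_stager, (service_class,))
        return result.get(timeout)    # raises TimeoutError if stuck
    except multiprocessing.TimeoutError:
        return None                   # publish nothing rather than hang
    finally:
        pool.terminate()              # kill a wedged worker outright

if __name__ == "__main__":
    print(query_with_timeout("atlasFarm"))
```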
See Also
http://www.gridpp.ac.uk/wiki/RAL_Tier1_CASTOR_Accounting
– Currently slightly out of date; refers to the January version
– Can deal with Bs, KBs, MBs, GBs, PBs, and EBs (a small unit-scaling sketch follows)
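A small sketch of the unit scaling involved; the slide omits TB, which is added here so every step is a uniform factor, and decimal (1000-based) units are an assumption, since the slide doesn't say:

```python
"""Sketch of byte-count scaling for capacity accounting.

TB is included (unlike the slide's list) so each step is a uniform
factor of 1000; decimal units are an assumption.
"""

UNITS = ["B", "KB", "MB", "GB", "TB", "PB", "EB"]

def humanize(nbytes):
    """Render a raw byte count using the largest sensible unit."""
    value = float(nbytes)
    for unit in UNITS:
        if value < 1000 or unit == UNITS[-1]:
            return "%.1f %s" % (value, unit)
        value /= 1000.0

if __name__ == "__main__":
    for n in (999, 552 * 10**12, 3 * 10**18):
        print(humanize(n))   # 999.0 B, 552.0 TB, 3.0 EB
```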