CASTOR Status at RAL CASTOR External Operations Face To Face Meeting Bonny Strong 10 June 2008.


1 CASTOR Status at RAL CASTOR External Operations Face To Face Meeting Bonny Strong 10 June 2008

2 Topics Current Architecture Upgrades and Changes Completed Operational Challenges Tape Server Issues Diskserver Deployment Project CCRC08 Certification Testbed Top 5 Issues Plans for Next 6 Months

3 Current Architecture - Production
cms stager: 3 head nodes, 45 diskservers (552 TB)
atlas stager: 3 head nodes, 64 diskservers (433 TB)
lhcb stager: 3 head nodes, 21 diskservers (133 TB)
gen stager: 3 head nodes, 1 diskserver (6 TB) (repack plus smaller users)
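As a quick cross-check of the per-stager figures on slide 3, the totals can be tallied with a few lines of Python. This is only an illustrative sketch: the dictionary layout and function names are invented here, and the numbers are copied from the slide rather than queried from CASTOR.

```python
# Figures copied from slide 3: (head nodes, diskservers, capacity in TB).
STAGERS = {
    "cms": (3, 45, 552),
    "atlas": (3, 64, 433),
    "lhcb": (3, 21, 133),
    "gen": (3, 1, 6),
}


def totals(stagers):
    """Sum head nodes, diskservers and capacity (TB) over all stager instances."""
    heads = sum(h for h, _, _ in stagers.values())
    disks = sum(d for _, d, _ in stagers.values())
    capacity_tb = sum(t for _, _, t in stagers.values())
    return heads, disks, capacity_tb


def tb_per_diskserver(stagers):
    """Average capacity per diskserver for each instance, in TB."""
    return {name: t / d for name, (_, d, t) in stagers.items()}


if __name__ == "__main__":
    heads, disks, capacity_tb = totals(STAGERS)
    print(f"{heads} head nodes, {disks} diskservers, {capacity_tb} TB")
    for name, avg in tb_per_diskserver(STAGERS).items():
        print(f"{name}: {avg:.2f} TB per diskserver")
```

On the slide's numbers this gives 12 head nodes, 131 diskservers and 1124 TB in total across the four production instances.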

4 Current Architecture – Production Shared Services
Nameservers:
–2 servers for nsdaemon
–DNS load-balanced cluster
–1 of these also hosts: vdqm, vmgr, cupv
Tape servers:
–15 servers
–FC-attached STK T10000 tape drives

5 Upgrades and Changes Completed
Nov 2007 – upgrade to 2.1.4 and SL4-64
–Except tape servers
Dec 2007 – SRMv2 in production
Apr 2008 – upgrade to 2.1.6
Oracle RAC (currently 3 nodes)
–SRMv2
–Gen instance stager and dlf

6 Upgrades and Changes Completed - continued
Diskserver deployment project – prototype and testing
Nagios monitoring for 24/7 support and callout procedures
Dynamic Information Provider developed
Migration from dcache
–Completed for lhcb
–Completed for atlas disk
–Still working on atlas tape

7 Operational Challenges
Power failures
–8 Feb: 1 ½ days to get castor back in service
–6 May: 1 full day to bring castor back in service
–Database FS corruption. Trying to get UPS for DBs.
–LSF startup issues
Establishing after-hours callout system
–Linkages between nagios and bleeper callout
–Hindered by broken link between Tier1 helpdesk and GGUS
–Tuning monitoring alerts
–Developing operational documentation
–Developing handoff procedures
Backplane meltdown of Viglen diskservers (Feb)
Failure of new diskservers (April)
Loss of DBA due to budget cuts

8 Tape Server Issues
Have not yet been able to install SLC4-64bit
–NI Failure with 64-bit (now understood?)
–Missing device /dev/nst1: Fibre-channel card? Transtec vs IBM servers? Something special in kernel at CERN?
–Have been working on this tape server upgrade for over 6 months
–Still running castor 2.1.3 on SLC3
Much work on improving migration performance
Now seeing serious problems with servers hanging when doing heavy load of recalls

9 Disk servers deployment today and tomorrow (1/2): Kick-start script; Post-install script; Castor personalization; Disk server registration

10 Disk servers deployment today and tomorrow (2/2): Kick-start script; Post-install script; Castor personalization; Disk server registration; Puppet

11 CCRC08
Big success of migration policies overcoming poor tape migration for CMS
Working out system and procedures for out-of-hours callouts has required much time and effort
Biggest problems:
–Power outage on 6 May
–Problems with new root certificate on 20 May
–SRMv2 crashes
–Major problems with tape drives hanging, starting last week

12 CCRC08 - continued
Smaller problems:
–Had disk-disk copies staying in PEND, but setting DiskCopyPendTimeout made this workable
–Jobs submitted to disk1 servers which became full: overcome by setting PendTimeout
–Daily restart of castor-gridftp (to clean dead processes) caused job failures, for both SE and CE jobs

13 CCRC08 - Tape Transfer Stats Data from Last Week of May

14 Certification Testbed
Have not delivered what we expected
Have completed installation of infrastructure
Extending contract for testbed sys admin another 4 months, pending STFC approval
Need to maintain release structure to be able to take advantage of work completed

15 Top 5 Issues
Tape drives hanging
GC bug
Problems installing tape servers with SL4-64
Repack
Disk-disk copies stay in PEND or (earlier) multiple copies
Other issues:
–Slow database, requires new stats
–Support for work on migration policies
–Submitting jobs to disk1 servers which become full
–Bulk deletes

16 Plans for Next 6 Months
1. Diskserver deployment project
2. Certification testbed used to test new releases
3. Tape servers upgraded to SLC4-64
4. Disaster recovery plan
–Based on puppet, kickstart, and DB backups
–Documented
–Tested
5. Support for small experiments
6. Repack operational
7. Xrootd and rootd

17 Plans for Next 6 Months - cont
8. Improved resilience:
–Oracle RAC and Dataguard
–Redundant, load-balanced stagers
–Cold standby jobManager/LSF hosts
–Failover of vdqm/vmgr/cupv
9. Decommission srmv1 endpoints
10. Castor-gridftp v2 - internal
11. Continued improvements in monitoring and documentation

18 Plans for Next 6 Months - cont
12. Planning for move to new computer building in Dec 08
13. Possible second robot
14. Tape media – T10000B
15. UPS for DB servers ASAP!

19 CIP CASTOR Information Provider Jens Jensen (in absentia) CASTOR F2F mtg 10-11 June 08

20 Anatomy of the CIP CASTOR stagers CIP Back end CIP Front end The Grid

21 Front End
Written by Derek Ross from Tier 1
Publishes into Tier 1 BDIIs
Uses a condensed text format for communicating with the back end
–Historical reasons
–One line per service class condensing all the news that’s fit to print

22 Back End
Queries stager for information about disk pools
Could have queried SRM DBs for STDs etc
–But uses configuration file for non-dynamic information
–Not all classes are published
–Documented

23 Next steps
Deploy backend with improved exception handling
–Currently hangs if stager is down
–Most recent version is ready for prod’n but hasn’t been deployed yet
–Could replace the front end and publish LDIF directly
–Packaging and improved deployment
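The "hangs if stager is down" problem on slide 23 is the kind of failure a hard timeout around the query can contain. The sketch below is not the CIP back end's code: it assumes, purely for illustration, that the stager query is run as an external command, and the function name `query_with_timeout` is hypothetical. It uses only the standard `subprocess` module.

```python
import subprocess
import sys


def query_with_timeout(cmd, timeout_s):
    """Run cmd and return its stdout, or None if it fails or exceeds timeout_s.

    subprocess.run kills the child process when the timeout expires, so a
    query against an unresponsive service cannot block the caller forever.
    """
    try:
        result = subprocess.run(
            cmd,
            capture_output=True,
            text=True,
            timeout=timeout_s,
            check=True,
        )
    except subprocess.TimeoutExpired:
        return None  # the query hung: treat the stager as unavailable
    except (subprocess.CalledProcessError, OSError):
        return None  # the query failed outright or could not be started
    return result.stdout


if __name__ == "__main__":
    # Stand-ins for a real stager query (hypothetical): one command that
    # answers promptly and one that simulates a hung stager.
    healthy = [sys.executable, "-c", "print('pool info')"]
    hung = [sys.executable, "-c", "import time; time.sleep(30)"]
    print(query_with_timeout(healthy, timeout_s=10))
    print(query_with_timeout(hung, timeout_s=1))
```

Returning `None` lets the caller decide whether to publish stale values or skip the update, instead of stalling the whole information provider.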

24 See Also
http://www.gridpp.ac.uk/wiki/RAL_Tier1_CASTOR_Accounting
–Currently slightly out of date, refers to January version
–Can deal with Bs and KBs and MBs and GBs and PBs and EBs
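Handling sizes in mixed units, as the last bullet of slide 24 describes, usually comes down to normalising everything to bytes. The following is a minimal sketch, not the CIP's implementation. The function name is hypothetical, TB is included here although the slide's list skips it, and decimal multiples (powers of 1000) are an assumption, with `base=1024` available for binary multiples.

```python
from decimal import Decimal

# Unit suffixes in increasing order; the list index is the exponent of the base.
UNITS = ["B", "KB", "MB", "GB", "TB", "PB", "EB"]


def to_bytes(text, base=1000):
    """Convert a string such as '552 TB' to an integer number of bytes."""
    value, unit = text.split()
    exponent = UNITS.index(unit.upper())  # raises ValueError for unknown units
    # Decimal avoids float rounding for fractional values like '1.5 GB'.
    return int(Decimal(value) * base ** exponent)


if __name__ == "__main__":
    for size in ("6 TB", "1.5 GB", "512 B"):
        print(size, "=", to_bytes(size), "bytes")
```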

