CASTOR Status at RAL
CASTOR External Operations Face To Face Meeting
Bonny Strong, 10 June 2008
Topics
– Current Architecture
– Upgrades and Changes Completed
– Operational Challenges
– Tape Server Issues
– Diskserver Deployment Project
– CCRC08
– Certification Testbed
– Top 5 Issues
– Plans for Next 6 Months
Current Architecture – Production
– cms stager: 3 head nodes, 45 diskservers (552 TB)
– atlas stager: 3 head nodes, 64 diskservers (433 TB)
– lhcb stager: 3 head nodes, 21 diskservers (133 TB)
– gen stager: 3 head nodes, 1 diskserver (6 TB) (repack plus smaller users)
Current Architecture – Production Shared Services
– Nameservers: 2 servers for nsdaemon in a DNS load-balanced cluster (see the sketch below); 1 of these also hosts vdqm, vmgr, and cupv
– Tape servers: 15 servers with FC-attached STK T10000 tape drives
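The slides don't say how the load-balanced alias is implemented; as a minimal sketch, assuming a hypothetical alias name, the A records behind such a DNS cluster can be listed like this:

```python
import socket

# Hypothetical alias: the real RAL nameserver hostname is not
# given in the slides.
ALIAS = "castor-ns.example.ac.uk"

def resolve_all(alias):
    """List the A records behind a DNS load-balanced alias.

    The cluster publishes one A record per nsdaemon host; clients
    pick one per lookup, which spreads load across the servers.
    """
    try:
        infos = socket.getaddrinfo(alias, None, socket.AF_INET)
    except socket.gaierror as exc:
        print("lookup failed: %s" % exc)
        return []
    return sorted({info[4][0] for info in infos})

if __name__ == "__main__":
    for address in resolve_all(ALIAS):
        print(address)
```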
Upgrades and Changes Completed
– Nov 2007 – upgrade to 2.1.4 and SL4-64 (except tape servers)
– Dec 2007 – SRMv2 in production
– Apr 2008 – upgrade to 2.1.6
– Oracle RAC (currently 3 nodes), hosting:
  – SRMv2
  – Gen instance stager and dlf
Upgrades and Changes Completed – continued
– Diskserver deployment project – prototype and testing
– Nagios monitoring for 24/7 support and callout procedures (a hedged check sketch follows)
– Dynamic Information Provider developed
– Migration from dCache:
  – Completed for lhcb
  – Completed for atlas disk
  – Still working on atlas tape
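The Nagios configuration itself isn't shown in the deck; a minimal sketch of the kind of check involved is a plugin that probes a daemon's TCP port and reports through the standard Nagios exit codes (0 OK, 1 WARNING, 2 CRITICAL). The hostname below is a placeholder, and 5010 as the nameserver port is an assumption.

```python
#!/usr/bin/env python
"""Sketch of a Nagios-style check for a CASTOR daemon port.

The host is a placeholder; 5010 (the usual CASTOR nameserver
port) is an assumption, not taken from the slides.
"""
import socket
import sys

OK, WARNING, CRITICAL = 0, 1, 2

def check_tcp(host, port, timeout=5.0):
    """Probe host:port with a TCP connect; return (status, message)."""
    try:
        socket.create_connection((host, port), timeout).close()
        return OK, "OK: %s:%d is accepting connections" % (host, port)
    except socket.timeout:
        return WARNING, "WARNING: %s:%d timed out" % (host, port)
    except socket.error as exc:
        return CRITICAL, "CRITICAL: %s:%d unreachable (%s)" % (host, port, exc)

if __name__ == "__main__":
    status, message = check_tcp("castor-ns.example.ac.uk", 5010)
    print(message)       # Nagios shows the first line of output
    sys.exit(status)     # the exit code drives the alert state
```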
Operational Challenges
– Power failures:
  – 8 Feb: 1½ days to get CASTOR back in service
  – 6 May: 1 full day to bring CASTOR back in service
  – Database filesystem corruption; trying to get a UPS for the DBs
  – LSF startup issues
– Establishing an after-hours callout system:
  – Linkage between Nagios and bleeper callout
  – Hindered by a broken link between the Tier1 helpdesk and GGUS
  – Tuning monitoring alerts
  – Developing operational documentation
  – Developing handoff procedures
– Backplane meltdown of Viglen diskservers (Feb)
– Failure of new diskservers (April)
– Loss of DBA due to budget cuts
Tape Server Issues
– Have not yet been able to install SLC4 64-bit:
  – NI failure with 64-bit (now understood?)
  – Missing device /dev/nst1 – fibre-channel card? Transtec vs IBM servers? Something special in the kernel at CERN? (a basic device-presence check is sketched below)
  – Have been working on this tape server upgrade for over 6 months
  – Still running CASTOR 2.1.3 on SLC3
– Much work on improving migration performance
– Now seeing serious problems with servers hanging under a heavy load of recalls
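A minimal sketch of the first diagnostic for the missing /dev/nst1 problem: check that the expected no-rewind tape device nodes exist. The expected drive count is a placeholder.

```python
"""Hedged sketch: verify expected SCSI tape device nodes exist.

The missing /dev/nst1 problem above was diagnosed by hand; this
just automates the obvious first check.
"""
import os

EXPECTED_DRIVES = 2  # placeholder: drives attached to this server

def missing_tape_devices(expected=EXPECTED_DRIVES):
    """Return the no-rewind tape device nodes that are absent."""
    wanted = ["/dev/nst%d" % n for n in range(expected)]
    return [dev for dev in wanted if not os.path.exists(dev)]

if __name__ == "__main__":
    missing = missing_tape_devices()
    if missing:
        print("Missing tape devices: %s" % ", ".join(missing))
    else:
        print("All %d expected tape devices present" % EXPECTED_DRIVES)
```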
Disk server deployment today and tomorrow (1/2)
[Diagram, today: kick-start script → post-install script → CASTOR personalization → disk server registration]
Disk server deployment today and tomorrow (2/2)
[Diagram, tomorrow: the same chain – kick-start script → post-install script → CASTOR personalization → disk server registration – driven by Puppet; a hedged orchestration sketch follows]
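The slides name the stages but not how RAL implements them; a hedged orchestration sketch, with every command path invented for illustration, might look like this:

```python
"""Hedged sketch of the deployment pipeline above.

Every command and path here is a placeholder: the slides only name
the stages (kickstart, post-install, CASTOR personalization,
registration, Puppet), not how they are implemented.
"""
import subprocess

# Placeholder stage commands, one per box in the slide's diagram
# (kickstart itself runs before the OS is up, so it isn't listed).
STAGES = [
    ("post-install",        ["/usr/local/sbin/post-install.sh"]),
    ("castor-personalize",  ["/usr/local/sbin/castor-personalize.sh"]),
    ("register-diskserver", ["/usr/local/sbin/register-diskserver.sh"]),
]

def deploy(hostname):
    """Run each stage in order, stopping at the first failure."""
    for name, command in STAGES:
        print("[%s] running stage %s" % (hostname, name))
        result = subprocess.call(command + [hostname])
        if result != 0:
            raise RuntimeError("stage %s failed (exit %d)" % (name, result))
    print("[%s] deployment complete" % hostname)
```

In the "tomorrow" version, Puppet would own these steps instead of ad-hoc scripts.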
CCRC08
– Big success of migration policies in overcoming poor tape migration for CMS
– Working out the system and procedures for out-of-hours callouts has required much time and effort
– Biggest problems:
  – Power outage on 6 May
  – Problems with the new root certificate on 20 May
  – SRMv2 crashes
  – Major problems with tape drives hanging, starting last week
CCRC08 – continued
Smaller problems:
– Disk-disk copies staying in PEND; setting DiskCopyPendTimeout made this workable
– Jobs submitted to disk1 servers which became full; overcome by setting PendTimeout
– Daily restart of castor-gridftp (to clean up dead processes) causing failures of both SE and CE jobs (a hedged restart-guard sketch follows)
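One mitigation for the restart-induced job failures, sketched under assumptions (the process pattern and init script path are placeholders), is to skip the daily restart while transfers are still in flight:

```python
"""Sketch: restart castor-gridftp only when no transfers are active.

A blind daily restart killed in-flight SE and CE jobs; deferring
it while transfer processes exist is one mitigation.
"""
import subprocess
import sys

TRANSFER_PATTERN = "gridftp"                # placeholder pgrep pattern
INIT_SCRIPT = "/etc/init.d/castor-gridftp"  # placeholder path

def active_transfers(pattern=TRANSFER_PATTERN):
    """Count processes whose command line matches the pattern."""
    proc = subprocess.Popen(["pgrep", "-f", pattern],
                            stdout=subprocess.PIPE)
    out, _ = proc.communicate()
    return len(out.split())

if __name__ == "__main__":
    # The listening daemon itself matches, so more than one hit
    # means child transfer processes are still running.
    if active_transfers() > 1:
        print("Transfers in flight; skipping restart")
        sys.exit(0)
    sys.exit(subprocess.call([INIT_SCRIPT, "restart"]))
```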
CCRC08 – Tape Transfer Stats
[Chart: tape transfer statistics, data from the last week of May]
Certification Testbed
– Have not delivered what we expected
– Have completed installation of the infrastructure
– Extending the contract for the testbed sysadmin another 4 months, pending STFC approval
– Need to maintain the release structure to be able to take advantage of work completed
Top 5 Issues
1. Tape drives hanging
2. GC bug
3. Problems installing tape servers with SL4-64
4. Repack
5. Disk-disk copies staying in PEND or (earlier) multiple copies
Other issues:
– Slow database, requires new stats
– Support for work on migration policies
– Jobs submitted to disk1 servers which become full
– Bulk deletes
Plans for Next 6 Months
1. Diskserver deployment project
2. Certification testbed used to test new releases
3. Tape servers upgraded to SLC4-64
4. Disaster recovery plan: based on Puppet, kickstart, and DB backups; documented; tested
5. Support for small experiments
6. Repack operational
7. Xrootd and rootd
Plans for Next 6 Months – continued
8. Improved resilience: Oracle RAC and Dataguard; redundant, load-balanced stagers; cold-standby jobManager/LSF hosts; failover of vdqm/vmgr/cupv
9. Decommission SRMv1 endpoints
10. Castor-gridftp v2 – internal
11. Continued improvements in monitoring and documentation
Plans for Next 6 Months – continued
12. Planning for the move to the new computer building in Dec 08
13. Possible second robot
14. Tape media – T10000B
15. UPS for DB servers ASAP!
CIP – CASTOR Information Provider
Jens Jensen (in absentia)
CASTOR F2F meeting, 10-11 June 08
Anatomy of the CIP
[Diagram: CASTOR stagers → CIP back end → CIP front end → the Grid]
Front End
– Written by Derek Ross from Tier 1
– Publishes into the Tier 1 BDIIs
– Uses a condensed text format for communicating with the back end:
  – Historical reasons
  – One line per service class, condensing "all the news that's fit to print" (a hypothetical parsing sketch follows)
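The slides only say "one line per service class"; the field layout below is invented for illustration, not the real format:

```python
"""Hypothetical sketch of a condensed one-line-per-service-class
format like the one the CIP front and back ends exchange.

The field names and layout are assumptions made for illustration.
"""

def parse_service_class(line):
    """Parse one whitespace-separated line into a dict.

    Assumed layout: name, total space (bytes), free space (bytes).
    """
    name, total, free = line.split()
    return {"name": name, "total": int(total), "free": int(free)}

if __name__ == "__main__":
    sample = "atlasFarm 433000000000000 12000000000000"
    print(parse_service_class(sample))
```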
Back End
– Queries the stager for information about disk pools
– Could have queried the SRM DBs for STDs (space token descriptions) etc., but uses a configuration file for non-dynamic information
– Not all classes are published
– Documented
Next Steps
– Deploy the backend with improved exception handling:
  – Currently hangs if the stager is down (a hedged timeout sketch follows)
  – The most recent version is ready for production but hasn't been deployed yet
– Could replace the front end and publish LDIF directly
– Packaging and improved deployment
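A hedged sketch of the exception-handling fix: bound the stager query with a timeout so the CIP fails cleanly instead of hanging. `query_stager` is a placeholder for whatever call the real back end makes; only the timeout pattern is the point.

```python
"""Sketch: bound a stager query with a timeout so the CIP cannot
hang forever when the stager is down.
"""
import multiprocessing

def query_stager(service_class):
    """Placeholder for the real (potentially hanging) stager query."""
    return {"name": service_class, "free": 0, "total": 0}

def query_with_timeout(service_class, timeout=30):
    """Run the query in a worker process; give up after `timeout` s."""
    pool = multiprocessing.Pool(processes=1)
    try:
        result = pool.apply_async(query_stager, (service_class,))
        return result.get(timeout)    # raises TimeoutError if stuck
    except multiprocessing.TimeoutError:
        return None                   # publish nothing rather than hang
    finally:
        pool.terminate()              # kill a wedged worker outright

if __name__ == "__main__":
    print(query_with_timeout("atlasFarm"))
```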
See Also
http://www.gridpp.ac.uk/wiki/RAL_Tier1_CASTOR_Accounting
– Currently slightly out of date; refers to the January version
– Can deal with Bs, KBs, MBs, GBs, PBs, and EBs (a small unit-scaling sketch follows)
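A small sketch of the unit scaling involved; the slide omits TB, which is added here so every step is a uniform factor, and decimal (1000-based) units are an assumption, since the slide doesn't say:

```python
"""Sketch of byte-count scaling for capacity accounting.

TB is included (unlike the slide's list) so each step is a uniform
factor of 1000; decimal units are an assumption.
"""

UNITS = ["B", "KB", "MB", "GB", "TB", "PB", "EB"]

def humanize(nbytes):
    """Render a raw byte count using the largest sensible unit."""
    value = float(nbytes)
    for unit in UNITS:
        if value < 1000 or unit == UNITS[-1]:
            return "%.1f %s" % (value, unit)
        value /= 1000.0

if __name__ == "__main__":
    for n in (999, 552 * 10**12, 3 * 10**18):
        print(humanize(n))   # 999.0 B, 552.0 TB, 3.0 EB
```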