Presentation is loading. Please wait.

Presentation is loading. Please wait.

Bonny Strong RAL RAL CASTOR Update External Institutes Meeting 13-15 Nov 2006 Bonny Strong, Tim Folkes, and Chris Kruk.

Similar presentations


Presentation on theme: "Bonny Strong RAL RAL CASTOR Update External Institutes Meeting 13-15 Nov 2006 Bonny Strong, Tim Folkes, and Chris Kruk."— Presentation transcript:

1 Bonny Strong RAL RAL CASTOR Update External Institutes Meeting 13-15 Nov 2006 Bonny Strong, Tim Folkes, and Chris Kruk

2 Bonny Strong RAL Overview 1.Current Status 2.Tape Systems: Status/Issues 3.Experience with CMS CSA06 4.Developing Monitoring at RAL 5.Issues/Problems 6.Goals for Next 6 Months

3 Bonny Strong RAL 1. Current Status of CASTOR at RAL

4 Bonny Strong RAL Castor Operations Team Central services –System manager: Bonny Strong –Deputy manager: Shaun de Witt Disk frontend systems and LSF –System manager: Chris Kruk –Deputy manager: Cheney Ketley Tape backend systems –System manager: Tim Folkes –Deputy manager: Bonny Strong SRM systems –System manager: Shaun de Witt –Deputy manager: Chris Kruk Grid security and infrastructure –Special advisor: Jens Jensen

5 Bonny Strong RAL Current Infrastructure of Castor2 at RAL Chris Kruk CCLRC-RAL

6 Bonny Strong RAL Platforms Pre-production platform Currently supports only: CMS for CSA06 Dteam Ops Default, for testing Test platform

7 Bonny Strong RAL Current pre-production Castor infrastructure (1/4): Central Servers Cupvdaemon Msgdaemon Nsdaemon vdqmserv Vmgrdaemon rfiod Cupvdaemon Msgdaemon Nsdaemon vdqmserv Vmgrdaemon rfiod Castor100: Name Server Castor101: Stager Castor103: LSF Master Castor102: DLF Rhserver Stager Rtcpclientd MigHunter Rhserver Stager Rtcpclientd MigHunter LSF Master FlexLM LSF Master FlexLM Dlfserver Rmmaster expertd Dlfserver Rmmaster expertd Castor version 2.1.0-3 OS version CERN SL 3.0.6

8 Bonny Strong RAL 7 disk servers CMS DTeam Default Disk Servers OPS 1 disk server Current pre-production Castor infrastructure (2/4): Castor version 2.1.0-3 OS version CERN SL 3.0.6 SL 4.2

9 Bonny Strong RAL 4 tape drives CMS DTeam Default Tape Servers OPS Current pre-production Castor infrastructure (3/4): Castor version 2.1.0-3 OS version CERN SL 3.0.6 Max drives allocation: Total number of drives = 4 CMS DTeam Default OPS 1 tape drive

10 Bonny Strong RAL ns stager vmgr cupv srm ns stager vmgr cupv srm castor151 castor150 DB machines DLF Current pre-production Castor infrastructure (4/4): OS version Red Hat Enterprise 3 Oracle 10.2.0.2.0

11 Bonny Strong RAL Pre-production platform: Pre-production - fully functional platform where number of disk servers will be increase soon. c/sd/st/sDBSRM in used:410421 Not in use yet:00602

12 Bonny Strong RAL Current test Castor infrastructure (1/2): rfiodCupvdaemon Msgdaemon Nsdaemon Vdqmserv Vmgrdaemon Rfiod Rhserver Stager Rtcpclientd MigHunter rfiodCupvdaemon Msgdaemon Nsdaemon Vdqmserv Vmgrdaemon Rfiod Rhserver Stager Rtcpclientd MigHunter Central Servers Tcastor100: NameServer Stager Tcastor100: NameServer Stager Tcastor102: DLF LSF Tcastor102: DLF LSF dlfserver Rmmaster expertd dlfserver Rmmaster expertd lcg0611 lcg0610 DB machines OS version CERN SL 3.0.6 Castor version 2.1.0-3 OS version Red Hat Enterprise 3 Oracle 10.2.0.2.0 ns stager vmgr cupv srm ns stager vmgr cupv srm DLF

13 Bonny Strong RAL Current test Castor infrastructure (2/2): Disk Servers disk1tape0 disk0tape1 1 disk server 1 tape drive tcastor200 Tape server disk1tape1 default 1 disk server Waiting for SRMv2.2 OS version CERN SL 3.0.6 Castor version 2.1.0-3 Castor version 2.1.0-3 OS version CERN SL 3.0.6

14 Bonny Strong RAL Test platform: Test – still under construction and configuration. c/sd/st/sDBSRM in used:23121 Not in used yet:01000 Sharing the same machine

15 Bonny Strong RAL Current Castor infrastructure: 3 servers SRMv1 SRM machines SRMv2.2 1 server Sharing machine with SRMv1 Test platform Associated with 3 different storage types per VO

16 Bonny Strong RAL 2. Tape Systems: Status and Issues Tim Folkes

17 Bonny Strong RAL Configuration 5 STK T10000 tape drives dedicated to castor 10 tape servers, 5 in use. –4GB memory per server (enough?) All connected via Brocade 4100 switch

18 Bonny Strong RAL Performance Doing non-disk IO ~120MB/sec Disk IO limited to disk speeds –Tape IO depends on disk activity No disk activity can get ~100MB/sec With disk activity ~15-20MB/sec

19 Bonny Strong RAL Performance How do we tune disk servers for better tape performance? –More tape server memory –XFS rather than EXT-3 –Reduce number of streams –More disk servers

20 Bonny Strong RAL Problems Kudzu causing FC loop resets at reboot –Caused tapes to rewind and over-write data No re-pack (yet). Affects our resourcing /var/tmp/RTCOPY vanishes. –We do not run the script to remove old files, so why vanishing? When VDQM restarts drives left in UNKN state with date of 1 Jan tpusage does not work

21 Bonny Strong RAL 3. Experience with CMS CSA06

22 Bonny Strong RAL Deployment for CSA06 Problems in deployment left us behind schedule for mid-Oct startup. Had to enlist extra help: –Olof, visit to RAL to help trouble-shoot –Shaun and Jens, who delayed other commitments –RAL managers, assisted coordination Coordination meetings with CMS – 3 times weekly – reviewed problems and progress We made it - just in time!

23 Bonny Strong RAL CMS Configuration Diskservers provided by Tier1 group –Started with only 2 diskservers –Ramped up to 7 diskservers –All running SLC4 Pre-production platform frozen after start of CSA06; no changes made except adding diskservers

24 Bonny Strong RAL CMS Service Classes CMS originally planned 3 service classes Due to delays in deploying diskservers, decided to run with only 1 service class CMS wanted disk1tape0, promising to manage disks themselves to not fill We chose not to run without GC due to known LSF meltdown problems Set GC at 90% and kept tape copy Found filesystems on same diskserver filled unevenly, so one FS may hit 90% while another only at 50%

25 Bonny Strong RAL CMS Results Achieved overall throughput of 250 MB/sec to disk Now have about 100TB data on tape End result is that CMS are very happy with castor performance –Found it big improvement over dcache

26 Bonny Strong RAL 4. Developing Monitoring at RAL

27 Bonny Strong RAL Choosing a Monitoring Tool Require close coordination between data storage group and Tier1 group Tier1 (and other RAL groups) have already chosen Nagios for future monitoring and after- hours callouts Castor team developing monitoring for castor with Nagios, using Nagios service provided by Tier1 Brought in contractor to do castor Nagios development

28 Bonny Strong RAL Monitoring Checks Standard Linux checks Checks for castor services running and number processes Castor-specific checks –Still working to define most useful checks –Trying to find way of alerting when tape migration is not working

29 Bonny Strong RAL Time-series Data Standard Nagios checks only single point- in-time values – e.g. number jobs PEND in LSF queue Want to be able to keep time series of some data, and be able to design checks based on time series Tried Nagios Grapher without success Now using ganglia to provide graphs, but not alerts Planning to develop Oracle db solution to record time series to allow monitoring checks, and to prepare reports

30 Bonny Strong RAL 5. Issues and Problems Encountered

31 Bonny Strong RAL Most Troublesome Problems for CMS deployment Diskservers in different network domain from central servers Tape looping Problems with castor_gridftp Problems with database upgrades and hotfixes

32 Bonny Strong RAL Main Problems Not Yet Resolved Rfio issues –Timeout in stager waiting reply from rmmaster –Address in use (fixed?) Failures not reported to users Database connection loss takes service down Tape looping (fixed in 2.1.1-4?)

33 Bonny Strong RAL Key Issues (1) 1.Problems with database upgrades 2.Castor_gridftp 3.Error messages don’t have enough info 4.“Normal” errors fill logs

34 Bonny Strong RAL Key Issues (2) 5.Need more admin tools Would like to not access database directly 6.Limited by only one diskpool per diskserver, especially for testing 7.Uneven filling of filesystems on diskservers Is this an SL4 issues? 8.Clearing out bad files/data

35 Bonny Strong RAL Database Upgrade Issues Sep: hotfix for 2.1.0-6 was missing a “create table” statement. Even after inserting this statement, db was in bad state and had to do point-in-time rollback to get system operational. Problems with identifying versions of hotfixes. Hotfix labeled 2.1.0-6.hotfix on download pages was overwritten with another script with extensive changes. Makes it difficult tracking what hotfixes have been applied, and comparing one installation to another. July: upgrade to 2.1.0-3 had incorrect code due to manual changes in released db scripts made on the incorrect code lines.

36 Bonny Strong RAL castor_gridftp Lack of documentation Lack of support or expertise Problems identifying latest version Difficulties in deployment –RPM does not define dependencies –Which version of gridftp should be used? Limitations of stagemap.conf

37 Bonny Strong RAL Example Error Message DBCALL=C_Services_fillObj(); ERROR_STRING=No such file or directory; DB_ERROR=No object found for i;:1522761; File=rtcpcldCatalogueInterface.c; DBKEY=1479796; Line=380; errno=115; serrno=2 Eventually found that record in tape table had associated svcclass that no longer existed, but had to go back to code to interpret this message, then guess which record it referred to. Could error message include which DB table, field, and record?

38 Bonny Strong RAL Example Error Message GC Failed to remove file File name=/exportstage/castor5/58/149658@castorns.ads.rl. ac.uk.5596911 Error=gcGetFileSize : Cannot stat file /exportstage/castor5/58/149658@castorns.ads.rl.ac.uk.5 596911 Origin=No such file or directory On which diskserver?

39 Bonny Strong RAL 6. Goals for Next 6 Months

40 Bonny Strong RAL Goals (1) Moving pre-production to production platform, including more VOs Testing on test platform –Upgrade to 2.1.1-4 –SRM 2.2 –Test CASTOR-SRB driver Improve monitoring Add reporting and accounting Move to SRMv2.2

41 Bonny Strong RAL Goals (2) Database changes –3 rd Oracle instance to serve nameserver vmgr, cupv, and SRM2.2 –Setup Oracle standby for production Plan second Castor instance for non- PP users Bring castor operations team up to production-level standards –Coverage for support –Documentation and training –Helpdesk, notification, change control, etc


Download ppt "Bonny Strong RAL RAL CASTOR Update External Institutes Meeting 13-15 Nov 2006 Bonny Strong, Tim Folkes, and Chris Kruk."

Similar presentations


Ads by Google