CASTOR Operations Face to Face 2006 Miguel Coelho dos Santos

1 CASTOR Operations Face to Face 2006 Miguel Coelho dos Santos miguel.coelho.santos@cern.ch

2 Topics: CASTOR dirty laundry; CASTOR integration; CASTOR monitoring

3 CASTOR dirty laundry: a small list of fresh items

4 Diskserver: If a new diskserver starts reporting to rmmaster, it goes into ‘production’ in the ‘default’ pool. This can bite you when you install a new box or when you clean an old box (it becomes new to the stager). This is not so fresh, but...

5 Diskserver (II): In the latest versions this behavior acquired one more variable: MINALLOWEDFREESPACE. If it is not set, nothing will run. If it is set to zero, filesystems may fill up. Depending on the version, you get one or the other by default.

6 Looping Tape Recalls: Files stay in STAGEIN forever while the tape is being mounted and unmounted (looping), failing with "Re-enable tape+segment for selection" in the recaller log file. This command will give you the tape IDs:
    grep "Re-enable tape+segments for selection" /var/spool/rtcpclientd/rtcpcld | cut -d" " -f19 | sort -u
There is a ‘procedure’ on the twiki to:
1. isolate the subrequests that refer to files on this tape
2. fail the diskcopies
3. restart the subrequests
4. clean the unlinked tapecopies
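
The grep pipeline above can also be sketched in Python, which is handy if the tape IDs need to feed a script. This is a minimal sketch, not the procedure from the twiki: the function name `looping_tape_ids` is hypothetical, and the assumption that the tape ID is the 19th space-separated field comes only from the `cut -f19` on the slide.

```python
from typing import Iterable, List

# Message text copied from the grep pattern on the slide.
PATTERN = "Re-enable tape+segments for selection"


def looping_tape_ids(lines: Iterable[str], field: int = 19) -> List[str]:
    """Collect the given 1-based, space-separated field from every line that
    contains PATTERN, de-duplicated and sorted (grep | cut -d" " | sort -u)."""
    ids = set()
    for line in lines:
        if PATTERN not in line:
            continue
        # Like cut -d" ", split on every single space (empty fields count).
        parts = line.rstrip("\n").split(" ")
        # Unlike cut, lines with too few fields are skipped instead of
        # contributing an empty entry.
        if len(parts) >= field:
            ids.add(parts[field - 1])
    return sorted(ids)
```

To run it against the log file named on the slide, open `/var/spool/rtcpclientd/rtcpcld` and pass the file object as `lines`.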

7 Cleaning Daemon: Three things to keep in mind:
1. it needs a separate configuration file
2. it now logs to DLF!
3. it appears to be having some Oracle connection problems
Monitor PID existence and the OCI problem (we still don’t, but will this afternoon).
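
A PID existence check of the kind mentioned above could look like the following. This is a generic POSIX sketch and not the actual sensor: the function name `pid_alive` is hypothetical, and how the cleaning daemon's PID is obtained (for example from a PID file) is left out because the slides do not say.

```python
import os


def pid_alive(pid: int) -> bool:
    """Return True if a process with this PID currently exists (POSIX only)."""
    if pid <= 0:
        return False  # 0 and negative values address process groups
    try:
        os.kill(pid, 0)  # signal 0 only performs the existence/permission check
    except ProcessLookupError:
        return False  # no such process
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True
```

A check like this says nothing about whether the daemon is healthy, which is why the slide pairs it with monitoring the OCI problem.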

8 LSF File descriptor: "Failed in an LSF library call: Failed in sending/receiving a message: Bad file descriptor". Again a fresh problem in a piece of code that was not touched. It’s a bonus!! Current dirty workaround: monitor the message and restart LSF.
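
The workaround above amounts to "scan for the message, then restart". A minimal sketch, assuming the log lines are already available as strings: `restart_if_lsf_error` is a hypothetical name, and the restart action is passed in as a callback because the slides do not give the restart command.

```python
from typing import Callable, Iterable

# Tail of the error message quoted on the slide.
LSF_ERROR = "Failed in sending/receiving a message: Bad file descriptor"


def restart_if_lsf_error(lines: Iterable[str], restart: Callable[[], None]) -> bool:
    """Call restart() once if any line contains LSF_ERROR.

    Returns True when a restart was triggered, False otherwise.
    """
    for line in lines:
        if LSF_ERROR in line:
            restart()
            return True  # one restart per scan, however often the message repeats
    return False
```

Triggering at most one restart per scan avoids restart storms when the same message is logged many times in a row.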

9 Impact of LSF dropouts

10 Diskcopies in Failed: When a user tries to recall a file that is neither on tape nor on the stager, the file is either:
1. on a disabled filesystem
2. on another stager (with a common name server)
3. lost
The result is the same: you get an accumulation of ‘FAILED’ diskcopies and stager_qry degrades (a lot!)

11 0-length files on NS: Deadlocks may cause the update on the name server to fail, which means the file on tape has a segment size but a file size of 0: nsls -l returns 0 while nsls -T returns more than 0.
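
The inconsistency described above can be expressed as a simple comparison. This sketch assumes the sizes have already been extracted from `nsls -l` and `nsls -T`; parsing their output is not shown because the slides do not give the layout, and `zero_length_suspects` is a hypothetical name.

```python
from typing import Dict, List


def zero_length_suspects(
    ns_size: Dict[str, int],
    segment_sizes: Dict[str, List[int]],
) -> List[str]:
    """Return paths whose name server size is 0 although their tape
    segments hold data.

    ns_size:       path -> file size as reported by the name server
    segment_sizes: path -> sizes of the file's tape segments
    """
    return sorted(
        path
        for path, size in ns_size.items()
        if size == 0 and sum(segment_sizes.get(path, [])) > 0
    )
```

Files that are genuinely empty (size 0 and no segment data) are not reported.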

12 rmmaster does not stop: It actually does, but ‘service rmmaster stop’ currently does not always kill all the processes. Impact? ‘service rmmaster start’ does not start rmmaster if some process is still there.

13 Some Hacks: stager restart (restart was stopped yesterday on c2public. Currently 50% more but not so bad); recall restart; LSF restart (already mentioned); manual migration restart (fixed??). I’m sure there are others that I have forgotten about, but they haven’t forgotten about us... and will come knocking again, soon...

14 CASTOR ELFms integration

15 What is ‘integrated’? Configuration files (see Jan’s presentation); daemon monitoring; log file monitoring for some ‘exceptions’ (Oracle connection problems, DB full, etc.); CASTOR diskserver status is ruled by SMS; consistency monitoring

16 SMS and CASTOR: SMS is the ELFms State Management System. For example, SMS controls whether LEMON monitoring will raise alarms for a WN. We map SMS state to CASTOR rmmaster state (production -> Idle; standby -> draining; maintenance -> Down).
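
The state mapping above fits in a small lookup table. A minimal sketch: the three pairs are the ones listed on the slide, while the names `SMS_TO_RMMASTER` and `rmmaster_state` and the choice to reject unknown states are assumptions for illustration.

```python
# SMS state -> rmmaster state, exactly as listed on the slide.
SMS_TO_RMMASTER = {
    "production": "Idle",
    "standby": "draining",
    "maintenance": "Down",
}


def rmmaster_state(sms_state: str) -> str:
    """Translate an SMS state into the matching rmmaster state."""
    try:
        return SMS_TO_RMMASTER[sms_state]
    except KeyError:
        # Fail loudly rather than guess a state for an unmapped value.
        raise ValueError(f"unmapped SMS state: {sms_state!r}") from None
```

Raising on an unmapped state keeps a typo or a new SMS state from silently putting a diskserver into the wrong rmmaster state.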

17 Next: Currently we monitor RAID status on the diskserver and we react to RAID problems using a script running centrally. We want to start doing this at the diskserver level because:
1. we stop depending on lemon central services (we will just be using the lemon agent on the box)
2. we can use the severity reported by the sensor
3. we reduce delays

18 diskNodeShutDown: diskNodeShutDown is a tool that uses CASTOR external commands to move data out of a diskserver. It uses diskServer_qry, stager_qry and stager_get to move files out of a diskserver when we want to perform an intervention. It is still very cumbersome to use and still suffers from bugs in stager_qry and stager_get.

19 Monitoring

20 Recent developments (I): Recently we started a shift in the way we do monitoring. Until now, monitoring was done by LEMON sensors. Current and future metrics are to be implemented inside the DBs (if applicable). The lemon agent will only read the results. You may use other tools!

21 Recent developments (II): Code developments are underway for CASTOR to log different kinds of messages. The migrator is already logging, per file, the svcclass and tapepool, among other stats. Some requests are already logging enough information for DLF to get latencies and other stats. More news expected soon.

22 Consistency checking: Because CASTOR has no central/common configuration tool, we have to check for consistency across the different parts. http://castoradm4.cern.ch/lrf/status/c2public.php

23 Metrics: Service metrics ➡ metrics concerning availability and performance. Accounting metrics ➡ metrics concerning overall usage.

24 Current Monitoring (internal): 29 Lemon metrics (more than 140 values), not including daemon status, actuators, etc.; ~45K measurements per day; more than 200K values sampled per day

25 Current Monitoring Organization: Three levels:
1. stager: head nodes responsible for the service
2. service class: functional partition of a stager (~groups of disk servers)
3. disk servers: the basic element

26 Displays (4Q 2006): These displays will be migrated to the new Lemon Service Displays (1Q 2007).

27 Some plot examples

28 CASTOR Service (diagram labels): CASTOR2; ns commands; rfio commands; stager commands; Service metrics (high update freq.); Accounting metrics (low update freq.); Service Interface

29 Pending Service Metrics:
1. Migrator information: number of files, size, tape pool by service class
2. Complete meta-migration: snapshot of the number of files to be migrated, and avg/min/max of size and age on disk
3. Complete meta-recall: snapshot of the number of files to be recalled, and avg/min/max of size and age of the request
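
The meta-migration and meta-recall snapshots above boil down to a count plus avg/min/max over two quantities. The slides say such metrics are to be implemented inside the DBs; the Python below is only an illustration of the arithmetic, with hypothetical names (`Stats`, `Snapshot`, `snapshot`) and an assumed input of (size, age) pairs.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional, Tuple


@dataclass
class Stats:
    average: float
    minimum: float
    maximum: float


@dataclass
class Snapshot:
    count: int
    size: Optional[Stats]  # None when nothing is pending
    age: Optional[Stats]   # None when nothing is pending


def _stats(values: List[float]) -> Stats:
    return Stats(sum(values) / len(values), min(values), max(values))


def snapshot(pending: Iterable[Tuple[float, float]]) -> Snapshot:
    """Summarise pending files given as (size, age) pairs."""
    items = list(pending)
    if not items:
        return Snapshot(0, None, None)
    sizes = [size for size, _ in items]
    ages = [age for _, age in items]
    return Snapshot(len(items), _stats(sizes), _stats(ages))
```

The same shape covers both cases: for meta-migration the age is the time on disk, for meta-recall it is the age of the request.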

30 Sad news... Yesterday at ~18h the dev box for CASTOR monitoring passed away... Let’s have a look at the old version: http://castoradm4.cern.ch/lrf/index.php

31 Questions?

