CASTOR Operations Face to Face 2006 Miguel Coelho dos Santos


Topics
- CASTOR dirty laundry
- CASTOR integration
- CASTOR monitoring

CASTOR dirty laundry: a small list of fresh items

Diskserver
If a new diskserver starts reporting to rmmaster, it goes into ‘production’ in the ‘default’ pool. You can be bitten by this when you install a new box or when you clean an old box (it becomes new to the stager). This is not so fresh, but...

Diskserver (II)
In the latest versions this behaviour acquired one more variable: MINALLOWEDFREESPACE.
- If it is not set, nothing will run.
- If it is set to zero, the filesystem may fill up.
Depending on the version, you get one or the other by default.
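A minimal sketch of a sanity check for this setting. The config file path and the key=value format are assumptions for illustration, not the real CASTOR configuration layout:

```shell
# Warn about the two failure modes described above: key absent (nothing runs)
# or key set to zero (filesystem may fill up).
check_minallowedfreespace() {
  local conf="$1"
  local val
  # last occurrence wins, mirroring typical key=value config parsing
  val=$(sed -n 's/^MINALLOWEDFREESPACE=//p' "$conf" | tail -n 1)
  if [ -z "$val" ]; then
    echo "MINALLOWEDFREESPACE not set: nothing will run"
    return 1
  elif [ "$val" -eq 0 ]; then
    echo "MINALLOWEDFREESPACE=0: filesystems may fill up"
    return 2
  fi
  echo "MINALLOWEDFREESPACE=$val"
}
```

Running this from the LEMON agent would turn a silent default into an alarm.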

Looping Tape Recalls
Files stay in STAGEIN forever while the tape is being mounted and unmounted (looping), failing with "Re-enable tape+segment for selection" in the recaller log file.

grep "Re-enable tape+segments for selection" /var/spool/rtcpclientd/rtcpcld | cut -d" " -f19 | sort -u

will give you the tape IDs! There is a ‘procedure’ on twiki to:
1. isolate the subrequests that refer to files on this tape
2. fail the diskcopies
3. restart the subrequests
4. clean the unlinked tapecopies
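The pipeline above, wrapped as a reusable function (the field position of the tape ID in the log line is taken from the slide):

```shell
# Extract the unique tape IDs of looping recalls from a recaller log file.
extract_looping_tapes() {
  local log="$1"
  grep "Re-enable tape+segments for selection" "$log" \
    | cut -d" " -f19 \
    | sort -u            # one line per affected tape
}
```

Usage: `extract_looping_tapes /var/spool/rtcpclientd/rtcpcld`, then feed each tape ID into the twiki procedure.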

Cleaning Daemon
Three things to keep in mind:
1. it needs a separate configuration file
2. it now logs to DLF!
3. it appears to have some Oracle connection problems
Monitor PID existence and the OCI problem (we still don't, but will start this afternoon).
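A sketch of the "monitor PID existence" part, assuming the daemon writes a pidfile (the pidfile location is a placeholder, not the real CASTOR path):

```shell
# Return 0 if the process recorded in the pidfile is still alive.
daemon_alive() {
  local pidfile="$1"
  [ -r "$pidfile" ] || return 1          # no pidfile: treat as dead
  kill -0 "$(cat "$pidfile")" 2>/dev/null  # signal 0 only tests existence
}
```

A LEMON-style sensor would call this periodically and raise an alarm on failure.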

LSF File Descriptor
"Failed in an LSF library call: Failed in sending/receiving a message: Bad file descriptor"
Again a fresh problem in a piece of code that was not touched. It's a bonus!!
Current dirty workaround: monitor for the message and restart LSF.
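The dirty workaround as a sketch. The log path and the restart action are placeholders; the real procedure is site-specific:

```shell
# Scan an LSF log for the bad-file-descriptor message and report whether
# a restart is needed. In production the echo would be replaced by the
# actual restart command plus an operator alarm.
check_lsf_fd_error() {
  local log="$1"
  if grep -q "Bad file descriptor" "$log"; then
    echo "restart LSF"
    return 0
  fi
  return 1
}
```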

Impact of LSF dropouts

Diskcopies in FAILED
When a user tries to recall a file that is neither on tape nor on the stager, the file is either:
1. on a disabled filesystem
2. on another stager (with a common name server)
3. lost
The result is the same: you get an accumulation of ‘FAILED’ diskcopies, and stager_qry degrades (a lot!).

0-length files on NS
Deadlocks may cause the update on the name server to fail, which means the file on tape has segment sizes but size 0:
- nsls -l returns 0
- nsls -T is bigger than 0
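The consistency test implied above, as a sketch. Extracting the two numbers from the real `nsls -l` and `nsls -T` output is left out; the function just compares them:

```shell
# Flag a name-server entry whose logical size (from nsls -l) is 0 while
# its tape segment size (from nsls -T) is greater than 0.
ns_size_mismatch() {
  local logical="$1" segment="$2"
  [ "$logical" -eq 0 ] && [ "$segment" -gt 0 ]
}
```

A nightly sweep over recently migrated files with this check would catch the deadlock victims.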

rmmaster does not stop
It actually does, but ‘service rmmaster stop’ currently does not kill all the processes all the time. Impact? ‘service rmmaster start’ does not start rmmaster if some process is still there.
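A generic sketch of the obvious workaround: after the service stop, kill any leftover processes matching the daemon name (the pattern `rmmaster` is from the slide; pgrep/pkill usage is generic):

```shell
# Kill any processes still matching a pattern after a service stop,
# so that the next 'service ... start' is not blocked by leftovers.
kill_leftovers() {
  local pattern="$1"
  if pgrep -f "$pattern" >/dev/null; then
    pkill -f "$pattern"
  fi
}
```

Usage: `service rmmaster stop; kill_leftovers rmmaster; service rmmaster start`.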

Some Hacks
- stager restart (the restart was stopped yesterday on c2public; currently 50% more, but not so bad)
- recall restart
- LSF restart (already mentioned)
- manual migration restart (fixed??)
I'm sure there are others that I have forgotten about, but they haven't forgotten about us... and will come knocking again, soon...

CASTOR ELFms integration

What is ‘integrated’?
- Configuration files (see Jan's presentation)
- Daemon monitoring
- Log file monitoring for some ‘exceptions’ (Oracle connection problems, DB full, etc.)
- CASTOR diskserver status is ruled by SMS
- Consistency monitoring

SMS and CASTOR
SMS is the ELFms State Management System. For example, SMS controls whether LEMON monitoring will raise alarms for a WN. We map SMS state to CASTOR rmmaster state: production->Idle; standby->draining; maintenance->Down.
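The mapping above, as a lookup function (state names are taken verbatim from the slide):

```shell
# Map an SMS machine state to the corresponding CASTOR rmmaster state.
sms_to_rmmaster() {
  case "$1" in
    production)  echo "Idle" ;;
    standby)     echo "draining" ;;
    maintenance) echo "Down" ;;
    *)           echo "unknown"; return 1 ;;
  esac
}
```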

Next
Currently we monitor RAID status on the diskserver and react to RAID problems using a script running centrally. We want to start doing this at the diskserver level because:
- we stop depending on LEMON central services (we will just be using the LEMON agent on the box)
- we can use the severity reported by the sensor
- we reduce delays
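One way a local sensor could detect a degraded array; this sketch parses an `/proc/mdstat`-style status text, which is an assumption (hardware RAID boxes would need the vendor tool instead):

```shell
# Print the lines of an mdstat-style status file describing degraded arrays.
# A degraded md array shows at least one "_" inside its [UU...] brackets.
raid_degraded() {
  grep -E '\[[U_]*_[U_]*\]' "$1"
}
```

The LEMON agent on the box would run this and raise an alarm with the appropriate severity, with no round-trip through the central services.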

diskNodeShutDown
diskNodeShutDown is a tool that uses CASTOR external commands to move data out of a diskserver. It uses diskServer_qry, stager_qry and stager_get to move files out of a diskserver when we want to perform an intervention. It is still very cumbersome to use, and still suffers from bugs in stager_qry and stager_get.

Monitoring

Recent developments (I)
Recently we started a shift in the way we do monitoring. Until now, monitoring was done by LEMON sensors. Current and future metrics are to be implemented inside the DBs (where applicable); the LEMON agent will only read the results. You may use other tools!

Recent developments (II)
Code developments are underway for CASTOR to log different kinds of messages:
- the migrator is already logging, per file, the svcclass and tapepool among other stats
- some requests are already logging enough information for DLF to compute latencies and other stats
- more news expected soon

Consistency checking
Because CASTOR has no central/common configuration tool, we have to check for consistency across the different parts.
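A minimal sketch of such a cross-check: diff the host inventories of two components. The two input files (one hostname per line) stand in for dumps from two CASTOR subsystems; producing those dumps is site-specific:

```shell
# List hosts that appear in one inventory file but not the other.
inconsistent_hosts() {
  local a b
  a=$(mktemp); b=$(mktemp)
  sort -u "$1" > "$a"        # comm requires sorted input
  sort -u "$2" > "$b"
  comm -3 "$a" "$b"          # suppress column 3: keep only unshared lines
  rm -f "$a" "$b"
}
```

An empty output means the two components agree on the diskserver list.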

Metrics
Service metrics:
➡ metrics concerning availability and performance
Accounting metrics:
➡ metrics concerning overall usage

Current Monitoring (internal)
- 29 LEMON metrics (more than 140 values), not including daemon status, actuators, etc.
- ~45K measurements per day
- more than 200K values sampled per day

Current Monitoring Organization
Three levels:
- stager: head nodes responsible for the service
- service class: functional partition of a stager (~groups of disk servers)
- disk servers: the basic element

Displays (4Q 2006)
These displays will be migrated to the new Lemon Service Displays (1Q 2007).

Some Plot Examples

CASTOR Service
(Diagram: CASTOR2 — ns commands, rfio commands, stager commands — behind a Service Interface, feeding service metrics at a high update frequency and accounting metrics at a low update frequency.)

Pending Service Metrics
- Migrator information: number of files, size, tape pool, by service class
- Complete meta-migration: snapshot of the number of files to be migrated, and avg/min/max of size and age on disk
- Complete meta-recall: snapshot of the number of files to be recalled, and avg/min/max of size and age of the request

Sad news...
Yesterday at ~18h the dev box for CASTOR monitoring passed away... Let's have a look at the old version:

Questions?