RAL CASTOR Update
External Institutes Meeting, 13-15 Nov 2006
Bonny Strong, Tim Folkes, and Chris Kruk

Overview
1. Current Status
2. Tape Systems: Status/Issues
3. Experience with CMS CSA06
4. Developing Monitoring at RAL
5. Issues/Problems
6. Goals for Next 6 Months

1. Current Status of CASTOR at RAL

Castor Operations Team
Central services
– System manager: Bonny Strong
– Deputy manager: Shaun de Witt
Disk frontend systems and LSF
– System manager: Chris Kruk
– Deputy manager: Cheney Ketley
Tape backend systems
– System manager: Tim Folkes
– Deputy manager: Bonny Strong
SRM systems
– System manager: Shaun de Witt
– Deputy manager: Chris Kruk
Grid security and infrastructure
– Special advisor: Jens Jensen

Current Infrastructure of Castor2 at RAL (Chris Kruk, CCLRC-RAL)

Platforms
Pre-production platform. Currently supports only:
– CMS for CSA06
– Dteam
– Ops
– Default, for testing
Test platform

Current pre-production Castor infrastructure (1/4): Central Servers
– Castor100 (Name Server): cupvdaemon, msgdaemon, nsdaemon, vdqmserv, vmgrdaemon, rfiod
– Castor101 (Stager): rhserver, stager, rtcpclientd, MigHunter
– Castor103 (LSF Master): LSF Master, FlexLM
– Castor102 (DLF): dlfserver, rmmaster, expertd
OS version: CERN SL 3.0.6

Current pre-production Castor infrastructure (2/4): Disk Servers
– 7 disk servers for CMS
– 1 disk server for DTeam, Default, and OPS
OS versions: CERN SL and SL 4.2

Current pre-production Castor infrastructure (3/4): Tape Servers
– 4 tape drives serving CMS, DTeam, Default, and OPS
– Max drives allocation per VO (CMS, DTeam, Default, OPS); total number of drives = 4
OS version: CERN SL

Current pre-production Castor infrastructure (4/4): DB machines
– castor150 and castor151, hosting the ns, stager, vmgr, cupv, SRM, and DLF schemas
OS version: Red Hat Enterprise 3; database: Oracle

Pre-production platform:
The pre-production platform is fully functional; the number of disk servers will be increased soon.
The slide tabulates central servers (c/s), disk servers (d/s), tape servers (t/s), DB, and SRM machines in use and not yet in use.

Current test Castor infrastructure (1/2): Central Servers
– Tcastor100 (Name Server and Stager): cupvdaemon, msgdaemon, nsdaemon, vdqmserv, vmgrdaemon, rfiod, rhserver, stager, rtcpclientd, MigHunter
– Tcastor102 (DLF and LSF): dlfserver, rmmaster, expertd
– DB machines: lcg0610 and lcg0611, hosting the ns, stager, vmgr, cupv, SRM, and DLF schemas
OS versions: CERN SL (Castor servers); Red Hat Enterprise 3 with Oracle (DB machines)

Current test Castor infrastructure (2/2): Disk and Tape Servers
– 1 disk server (disk1tape0, disk0tape1)
– Tape server tcastor200 with 1 tape drive
– 1 disk server (disk1tape1, default), waiting for SRMv2.2
OS version: CERN SL 3.0.6

Test platform:
The test platform is still under construction and configuration.
In use: 2 central servers, 3 disk servers, 1 tape server, 2 DB machines, 1 SRM machine (sharing a machine with SRMv1); not in use yet: 1 disk server.

Current Castor infrastructure: SRM machines
– SRMv1: 3 servers, associated with 3 different storage types per VO
– SRMv2.2: 1 server (test platform), sharing a machine with SRMv1

2. Tape Systems: Status and Issues (Tim Folkes)

Configuration
– 5 STK T10000 tape drives dedicated to Castor
– 10 tape servers, 5 in use
  – 4 GB memory per server (enough?)
– All connected via a Brocade 4100 switch

Performance
– Non-disk I/O: ~120 MB/sec
– Disk I/O limited to disk speeds
  – Tape I/O depends on disk activity: with no disk activity we can get ~100 MB/sec; with disk activity, ~15-20 MB/sec

Performance
How do we tune disk servers for better tape performance?
– More tape server memory
– XFS rather than ext3
– Reduce the number of streams
– More disk servers

Problems
– Kudzu causing FC loop resets at reboot, which caused tapes to rewind and over-write data
– No repack (yet); this affects our resourcing
– /var/tmp/RTCOPY vanishes; we do not run the script to remove old files, so why is it vanishing?
– When VDQM restarts, drives are left in the UNKN state with a date of 1 Jan
– tpusage does not work

3. Experience with CMS CSA06

Deployment for CSA06
Problems in deployment left us behind schedule for the mid-October startup. We had to enlist extra help:
– Olof, who visited RAL to help troubleshoot
– Shaun and Jens, who delayed other commitments
– RAL managers, who assisted with coordination
Coordination meetings with CMS were held three times weekly to review problems and progress.
We made it - just in time!

CMS Configuration
– Diskservers provided by the Tier1 group: started with only 2 diskservers, ramped up to 7 diskservers, all running SLC4
– Pre-production platform frozen after the start of CSA06; no changes made except adding diskservers

CMS Service Classes
– CMS originally planned 3 service classes; due to delays in deploying diskservers, we decided to run with only 1 service class
– CMS wanted disk1tape0, promising to manage the disks themselves so they would not fill up
– We chose not to run without GC due to known LSF meltdown problems, so we set GC at 90% and kept the tape copy
– Found that filesystems on the same diskserver filled unevenly, so one filesystem may hit 90% while another is only at 50% (see the sketch after this list)
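
To make the uneven-filling effect concrete, here is a minimal sketch (not the tooling used at RAL) that reports each data filesystem's fill level against a GC threshold; the /exportstage* mount-point pattern and the 90% threshold are assumptions made for the example.

```python
#!/usr/bin/env python
"""Illustrative sketch (not the RAL tooling): report the fill level of each
Castor data filesystem so that one partition reaching the GC threshold is
visible even when others on the same diskserver are half empty.
The /exportstage* mount pattern and the 90% threshold are assumptions."""
import glob
import os

GC_THRESHOLD = 0.90  # assumed GC trigger level for this example

def used_fraction(path):
    """Fraction of the filesystem holding `path` that is currently in use."""
    st = os.statvfs(path)
    total = st.f_blocks * st.f_frsize
    free = st.f_bavail * st.f_frsize
    return 1.0 - float(free) / total

if __name__ == "__main__":
    for fs in sorted(glob.glob("/exportstage*")):  # hypothetical mount points
        frac = used_fraction(fs)
        flag = "  <-- above GC threshold" if frac >= GC_THRESHOLD else ""
        print("%-25s %5.1f%% used%s" % (fs, 100.0 * frac, flag))
```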

CMS Results
– Achieved overall throughput of 250 MB/sec to disk
– Now have about 100 TB of data on tape
– End result is that CMS are very happy with Castor performance, finding it a big improvement over dCache

4. Developing Monitoring at RAL

Choosing a Monitoring Tool
– Requires close coordination between the data storage group and the Tier1 group
– The Tier1 (and other RAL groups) have already chosen Nagios for future monitoring and after-hours callouts
– The Castor team is developing monitoring for Castor with Nagios, using the Nagios service provided by the Tier1
– Brought in a contractor to do Castor Nagios development

Monitoring Checks
– Standard Linux checks
– Checks that Castor services are running and for the number of processes (a minimal sketch follows this list)
– Castor-specific checks: still working to define the most useful checks; trying to find a way of alerting when tape migration is not working
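
As an illustration of the process-count item, here is a minimal Nagios-style plugin in Python that checks whether a set of Castor daemons have at least the expected number of running processes. The daemon names, minimum counts, and the use of pgrep are assumptions for the sketch, not the plugin actually deployed at RAL.

```python
#!/usr/bin/env python
"""Minimal Nagios-style check: are the expected Castor daemon processes
running?  The daemon names and minimum counts below are assumptions for
the example, not the plugin deployed at RAL."""
import subprocess
import sys

# Hypothetical daemons and the minimum number of processes expected for each.
EXPECTED = {"stager": 1, "rhserver": 1, "rmmaster": 1, "dlfserver": 1}

def process_count(name):
    """Number of running processes whose command name matches `name`."""
    result = subprocess.run(["pgrep", "-c", "-x", name],
                            capture_output=True, text=True)
    return int(result.stdout.strip() or 0)

def main():
    missing = [d for d, n in EXPECTED.items() if process_count(d) < n]
    if missing:
        print("CRITICAL: daemons not running: " + ", ".join(missing))
        return 2  # Nagios CRITICAL exit code
    print("OK: all monitored Castor daemons are running")
    return 0      # Nagios OK exit code

if __name__ == "__main__":
    sys.exit(main())
```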

Time-series Data
– Standard Nagios checks give only single point-in-time values, e.g. the number of jobs in PEND state in an LSF queue
– We want to keep time series of some data and be able to design checks based on those time series
– Tried Nagios Grapher without success
– Now using Ganglia to provide graphs, but not alerts
– Planning to develop an Oracle DB solution to record time series, to allow monitoring checks and to prepare reports (a sketch of the idea follows this list)
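
A minimal sketch of the planned idea: sample a metric and append it to an Oracle table so checks and reports can look at trends rather than single values. The cx_Oracle module, the placeholder connect string, the castor_metrics table, and the use of LSF's `bjobs -u all -p` for the PEND count are all assumptions for the example, not the RAL implementation.

```python
#!/usr/bin/env python
"""Sketch of the planned time-series idea: sample a metric (here the number
of LSF jobs in PEND state) and append it to an Oracle table.
Assumptions: cx_Oracle, the placeholder connect string, the castor_metrics
table, and `bjobs -u all -p` are illustrative, not the RAL implementation."""
import subprocess
import cx_Oracle

def pending_lsf_jobs():
    """Count LSF jobs currently in PEND state (bjobs header line excluded)."""
    try:
        out = subprocess.check_output(["bjobs", "-u", "all", "-p"], text=True)
    except subprocess.CalledProcessError:
        return 0  # bjobs exits non-zero when there are no pending jobs
    lines = [l for l in out.splitlines() if l.strip()]
    return max(len(lines) - 1, 0)

def record_sample(metric, value):
    """Append one (timestamp, metric, value) row to the time-series table."""
    conn = cx_Oracle.connect("monitor/secret@castordb")  # placeholder DSN
    try:
        cur = conn.cursor()
        cur.execute(
            "INSERT INTO castor_metrics (sampled_at, metric, value) "
            "VALUES (SYSTIMESTAMP, :m, :v)", m=metric, v=value)
        conn.commit()
    finally:
        conn.close()

if __name__ == "__main__":
    record_sample("lsf_pending_jobs", pending_lsf_jobs())
```

Run from cron at a fixed interval, this would give the stored history that single-shot Nagios checks cannot provide.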

5. Issues and Problems Encountered

Most Troublesome Problems for the CMS Deployment
– Diskservers in a different network domain from the central servers
– Tape looping
– Problems with castor_gridftp
– Problems with database upgrades and hotfixes

Main Problems Not Yet Resolved
– Rfio issues: timeout in the stager waiting for a reply from rmmaster; "address in use" (fixed?)
– Failures not reported to users
– Database connection loss takes the service down
– Tape looping (fixed in ?)

Key Issues (1)
1. Problems with database upgrades
2. castor_gridftp
3. Error messages don't have enough info
4. "Normal" errors fill the logs

Key Issues (2)
5. Need more admin tools; we would like not to have to access the database directly
6. Limited by only one diskpool per diskserver, especially for testing
7. Uneven filling of filesystems on diskservers (is this an SL4 issue?)
8. Clearing out bad files/data

Database Upgrade Issues
– Sep: a hotfix was missing a "create table" statement. Even after inserting this statement, the DB was in a bad state and we had to do a point-in-time rollback to get the system operational.
– Problems with identifying versions of hotfixes: a hotfix published on the download pages was overwritten with another script containing extensive changes. This makes it difficult to track which hotfixes have been applied and to compare one installation to another (one possible approach is sketched below).
– July: an upgrade had incorrect code because manual changes to the released DB scripts were made on the wrong code lines.
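
One way of tracking this, shown only as an illustrative sketch rather than a Castor tool: record a checksum of every downloaded hotfix script, so a script silently replaced under the same label shows up as a different hash, and two installations can be compared script by script. The directory paths are hypothetical.

```python
#!/usr/bin/env python
"""Illustrative sketch (not a Castor tool): checksum every downloaded hotfix
script so that a script silently replaced under the same label shows up as a
different hash.  The directory paths are hypothetical."""
import hashlib
import os

def script_hashes(directory):
    """Map each .sql script in `directory` to its SHA-1 digest."""
    hashes = {}
    for name in sorted(os.listdir(directory)):
        if name.endswith(".sql"):
            with open(os.path.join(directory, name), "rb") as fh:
                hashes[name] = hashlib.sha1(fh.read()).hexdigest()
    return hashes

if __name__ == "__main__":
    applied = script_hashes("/opt/castor/hotfixes-applied")    # hypothetical path
    download = script_hashes("/opt/castor/hotfixes-download")  # hypothetical path
    for name in sorted(download):
        if name in applied and applied[name] != download[name]:
            print("WARNING: %s differs from the version applied here" % name)
```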

castor_gridftp
– Lack of documentation
– Lack of support or expertise
– Problems identifying the latest version
– Difficulties in deployment: the RPM does not define dependencies; which version of gridftp should be used?
– Limitations of stagemap.conf

Example Error Message
DBCALL=C_Services_fillObj(); ERROR_STRING=No such file or directory; DB_ERROR=No object found for i;: ; File=rtcpcldCatalogueInterface.c; DBKEY= ; Line=380; errno=115; serrno=2
We eventually found that a record in the tape table had an associated svcclass that no longer existed, but we had to go back to the code to interpret this message and then guess which record it referred to. Could the error message include which DB table, field, and record is involved?
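
Because the message is a semicolon-separated list of KEY=value fields, even a small helper makes the relevant fields (File, Line, serrno, ...) easier to pick out. This is purely an illustration, not a Castor utility, and the sample message below is a lightly cleaned copy of the one above.

```python
#!/usr/bin/env python
"""Purely an illustration (not a Castor utility): split the semicolon-
separated KEY=value fields of such an error line into a dictionary so the
interesting fields are easier to pick out."""

def parse_castor_error(line):
    """Split 'KEY=value; KEY=value; ...' into a dictionary."""
    fields = {}
    for part in line.split(";"):
        if "=" in part:
            key, _, value = part.partition("=")
            fields[key.strip()] = value.strip()
    return fields

if __name__ == "__main__":
    msg = ("DBCALL=C_Services_fillObj(); "
           "ERROR_STRING=No such file or directory; "
           "DB_ERROR=No object found for i; "
           "File=rtcpcldCatalogueInterface.c; Line=380; errno=115; serrno=2")
    fields = parse_castor_error(msg)
    for key in ("DBCALL", "ERROR_STRING", "DB_ERROR", "File",
                "Line", "errno", "serrno"):
        print("%-12s %s" % (key, fields.get(key, "")))
```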

Example Error Message
GC Failed to remove file File ac.uk Error=gcGetFileSize : Cannot stat file Origin=No such file or directory
On which diskserver?

6. Goals for Next 6 Months

Goals (1)
– Move pre-production to a production platform, including more VOs
– Testing on the test platform:
  – Upgrade to a newer Castor release
  – SRM 2.2
  – Test the CASTOR-SRB driver
– Improve monitoring
– Add reporting and accounting
– Move to SRMv2.2

Goals (2)
– Database changes:
  – 3rd Oracle instance to serve the nameserver, vmgr, cupv, and SRM2.2
  – Set up an Oracle standby for production
– Plan a second Castor instance for non-PP users
– Bring the Castor operations team up to production-level standards:
  – Coverage for support
  – Documentation and training
  – Helpdesk, notification, change control, etc.