CASTOR Status at RAL CASTOR External Operations Face To Face Meeting Bonny Strong 10 June 2008.

Topics
Current Architecture
Upgrades and Changes Completed
Operational Challenges
Tape Server Issues
Diskserver Deployment Project
CCRC08
Certification Testbed
Top 5 Issues
Plans for Next 6 Months

Current Architecture - Production
cms stager: 3 head nodes, 45 diskservers (552 TB)
atlas stager: 3 head nodes, 64 diskservers (433 TB)
lhcb stager: 3 head nodes, 21 diskservers (133 TB)
gen stager: 3 head nodes, 1 diskserver (6 TB) (repack plus smaller users)
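The per-stager figures above can be totalled with a few lines of Python. This is a minimal sketch: the dictionary layout and variable names are illustrative only, and the numbers are the ones quoted on the slide.

```python
# Production stager figures as quoted on the slide:
# stager name -> (head nodes, diskservers, disk capacity in TB)
stagers = {
    "cms": (3, 45, 552),
    "atlas": (3, 64, 433),
    "lhcb": (3, 21, 133),
    "gen": (3, 1, 6),
}

# Sum each column across the four production stagers.
total_head_nodes = sum(heads for heads, _, _ in stagers.values())
total_diskservers = sum(servers for _, servers, _ in stagers.values())
total_tb = sum(tb for _, _, tb in stagers.values())

print(f"head nodes:  {total_head_nodes}")
print(f"diskservers: {total_diskservers}")
print(f"capacity:    {total_tb} TB")
```

Across the four production instances this gives 12 head nodes, 131 diskservers and 1124 TB of disk.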

Current Architecture – Production Shared Services
Nameservers: 2 servers for nsdaemon
DNS load-balanced cluster
1 of these also hosts: vdqm, vmgr, cupv
Tape servers: 15 servers
FC-attached STK T10000 tape drives

Upgrades and Changes Completed
Nov 2007 – upgrade to and SL4-64
–Except tape servers
Dec 2007 – SRMv2 Dec SRMv2 in production
Apr 2008 – upgrade to Oracle RAC (currently 3 nodes)
–SRMv2
–Gen instance stager and dlf

Upgrades and Changes Completed - continued
Diskserver deployment project – prototype and testing
Nagios monitoring for 24/7 support and callout procedures
Dynamic Information Provider developed
Migration from dcache
–Completed for lhcb
–Completed for atlas disk
–Still working on atlas tape

Operational Challenges
Power failures
–8 Feb: 1 ½ days to get castor back in service
–6 May: 1 full day to bring castor back in service
–Database FS corruption. Trying to get UPS for DBs.
–LSF startup issues
Establishing after-hours callout system
–Linkages between nagios and bleeper callout
–Hindered by broken link between Tier1 helpdesk and GGUS
–Tuning monitoring alerts
–Developing operational documentation
–Developing handoff procedures
Backplane meltdown of Viglen diskservers (Feb)
Failure of new diskservers (April)
Loss of DBA due to budget cuts

Tape Server Issues
Have not yet been able to install SLC4-64bit
–NI Failure with 64-bit (now understood?)
–Missing device /dev/nst1: Fibre-channel card? Transtec vs IBM servers? Something special in kernel at CERN?
–Have been working on this tape server upgrade for over 6 months
–Still running castor on SLC3
Much work on improving migration performance
Now seeing serious problems with servers hanging when doing heavy load of recalls

Disk servers deployment today and tomorrow (1/2):
Kick-start script
Post-install script
Castor personalization
Disk server registration

Disk servers deployment today and tomorrow (2/2):
Kick-start script
Post-install script
Castor personalization
Disk server registration
Puppet

CCRC08
Big success of migration policies overcoming poor tape migration for CMS
Working out system and procedures for out-of-hours callouts has required much time and effort
Biggest problems:
–Power outage on 6 May
–Problems with new root certificate on 20 May
–SRMv2 crashes
–Major problems with tape drives hanging, starting last week

CCRC08 - continued
Smaller problems:
–Had disk-disk copies staying in PEND, but setting DiskCopyPendTimeout made this workable
–Jobs submitted to disk1 servers which became full: overcome by setting PendTimeout
–Daily restart of castor-gridftp (to clean dead processes) caused job failures for both SE and CE jobs

CCRC08 - Tape Transfer Stats
Data from Last Week of May

Certification Testbed
Have not delivered what we expected
Have completed installation of infrastructure
Extending contract for testbed sys admin another 4 months, pending STFC approval
Need to maintain release structure to be able to take advantage of work completed

Top 5 Issues
Tape drives hanging
GC bug
Problems installing tape servers with SL4-64
Repack
Disk-disk copies stay in PEND or (earlier) multiple copies
Other issues:
–Slow database, requires new stats
–Support for work on migration policies
–Submitting jobs to disk1 servers which become full
–Bulk deletes

Plans for Next 6 Months
1. Diskserver deployment project
2. Certification testbed used to test new releases
3. Tape servers upgraded to SLC
4. Disaster recovery plan
–Based on puppet, kickstart, and DB backups
–Documented
–Tested
5. Support for small experiments
6. Repack operational
7. Xrootd and rootd

Plans for Next 6 Months - cont
8. Improved resilience:
–Oracle RAC and Dataguard
–Redundant, load-balanced stagers
–Cold standby jobManager/LSF hosts
–Failover of vdqm/vmgr/cupv
9. Decommission srmv1 endpoints
10. Castor-gridftp v2 - internal
11. Continued improvements in monitoring and documentation

Plans for Next 6 Months - cont
12. Planning for move to new computer building in Dec
13. Possible second robot
14. Tape media – T10000B
15. UPS for DB servers ASAP!

CIP
CASTOR Information Provider
Jens Jensen (in absentia)
CASTOR F2F mtg June 08

Anatomy of the CIP
CASTOR stagers
CIP Back end
CIP Front end
The Grid

Front End
Written by Derek Ross from Tier 1
Publishes into Tier 1 BDIIs
Uses a condensed text format for communicating with the back end
–Historical reasons
–One line per service class condensing all the news that’s fit to print
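As an illustration of the "one line per service class" idea, here is a minimal Python sketch of a parser for such a format. The slides do not show the actual condensed format, so the field layout (`name total_bytes free_bytes`, whitespace-separated), the `#` comment convention, and all names below are hypothetical and are not the CIP's real interface.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ServiceClassInfo:
    """One record of the HYPOTHETICAL condensed format: name, total, free."""
    name: str
    total_bytes: int
    free_bytes: int


def parse_line(line: str) -> ServiceClassInfo:
    # Assumed layout: "<service class> <total bytes> <free bytes>"
    name, total, free = line.split()
    return ServiceClassInfo(name, int(total), int(free))


def parse_report(text: str) -> List[ServiceClassInfo]:
    # One record per non-blank line; lines starting with '#' are skipped.
    records = []
    for raw in text.splitlines():
        line = raw.strip()
        if line and not line.startswith("#"):
            records.append(parse_line(line))
    return records
```

A front end of this kind would read the back end's output once per publishing cycle and turn each record into whatever the information system expects.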

Back End
Queries stager for information about disk pools
Could have queried SRM DBs for STDs etc
–But uses configuration file for non-dynamic information
–Not all classes are published
–Documented

Next steps
Deploy backend with improved exception handling
–Currently hangs if stager is down
–Most recent version is ready for prod’n but hasn’t been deployed yet
–Could replace the front end and publish LDIF directly
–Packaging and improved deployment
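One generic way to keep a process from hanging on an unresponsive service is to bound the query with a timeout and fall back to a default. The Python sketch below shows that pattern only; it is not the CIP's actual exception handling, and `query_with_timeout` and its parameters are hypothetical names introduced for illustration.

```python
import concurrent.futures


def query_with_timeout(query, timeout_s, fallback=None):
    """Run query() in a worker thread and give up after timeout_s seconds.

    Returns query()'s result, or `fallback` if the call times out or raises.
    Note: the worker thread is not killed on timeout; a real back end would
    also want a timeout on the underlying database or network connection.
    """
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = pool.submit(query)
    try:
        return future.result(timeout=timeout_s)
    except concurrent.futures.TimeoutError:
        # The service did not answer in time: report the fallback instead.
        return fallback
    except Exception:
        # The query itself failed (e.g. connection refused).
        return fallback
    finally:
        # Do not block waiting for a stuck worker.
        pool.shutdown(wait=False)
```

With this shape, a caller can publish stale or empty information when the stager is down instead of blocking indefinitely.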

See Also
RAL_Tier1_CASTOR_Accounting
–Currently slightly out of date, refers to January version
–Can deal with Bs and KBs and MBs and GBs and PBs and EBs
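Handling sizes across that range of units comes down to a small conversion table. The Python sketch below is an illustration under stated assumptions: it uses decimal multipliers (powers of 1000), it includes TB even though the slide does not list it, and the function names are hypothetical rather than taken from the accounting tool.

```python
# Assumed unit ladder, decimal (1 KB = 1000 B). TB is included here
# although the slide's list skips it.
UNITS = ["B", "KB", "MB", "GB", "TB", "PB", "EB"]


def to_bytes(value, unit):
    """Convert a value in the given unit to a whole number of bytes."""
    exponent = UNITS.index(unit.upper())
    return int(value * 1000 ** exponent)


def humanize(n_bytes):
    """Format a byte count using the largest unit that keeps the value >= 1."""
    value = float(n_bytes)
    for unit in UNITS:
        # Stop at the first unit where the value fits, or at the last unit.
        if value < 1000 or unit == UNITS[-1]:
            return f"{value:.1f} {unit}"
        value /= 1000
```

For example, `to_bytes(6, "TB")` gives 6000000000000 and `humanize(1500)` gives `"1.5 KB"`.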