RAL Site Report
CASTOR Face-to-Face Meeting, September 2014
Rob Appleyard, Shaun de Witt, Juan Sierra
Contents
Operations – Our experience at RAL
Future Plans for CASTOR at RAL
Ceph (Shaun)
DB report (Juan)
RAL
Load issues (lots of xroot transfers -> wait I/O on disk servers) - crashing
Draining
Rebalancing
DB deadlocks (Juan)
SL6 – shift to full Quattor configuration
2013 disk deployment – 57*110TB nodes
–Current numbers
Elasticsearch – scaled up
Operations
Mostly smooth operation
Deployed the 2013 disk buy into production
–54*120TB RAID 6 nodes
Change of leadership
–Rob is now CASTOR service manager, taking over from Matt
–Matt is leading the DB team
CASTOR
Issues from the Upgrade
Very long DB upgrade time for ATLAS
–Scripts not optimised for a large disk-only instance
ALICE xroot package not available for SL5
We feel a lot of these are caused by differences in usage patterns between RAL and CERN
Rebalancing – see later
The upgrade led us to find lots of crap in our DBs
–Tables present in inappropriate DBs, bad constraints, etc.
Feedback from production
New admin commands work well (modifydiskserver etc.)
Read-only mode very useful
–Free space on read-only disk servers is still reported as available for use
Some lack of documentation
–E.g. we didn't know about modifydbconfig
–deletediskcopy as the replacement for cleanlostfiles
In fact we have our own home-rolled cleanlostfiles, which could be submitted as contrib (see the sketch below):
cleanlostfiles [diskcopyid|diskserver[:filesystem]]
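A minimal sketch of what such a home-rolled wrapper can look like; the argument handling follows the usage line above, but the way deletediskcopy is invoked here is an assumption, not the documented interface:

#!/usr/bin/env python
# Sketch of a cleanlostfiles-style wrapper. The argument handling matches the
# usage string above; how deletediskcopy is actually invoked is an assumption.
import re
import sys

USAGE = 'usage: cleanlostfiles [diskcopyid|diskserver[:filesystem]]'


def parse_target(arg):
    """Classify the argument as a diskcopy id or diskserver[:filesystem]."""
    if arg.isdigit():
        return 'diskcopy', arg
    if re.match(r'^[\w.-]+(:/\S+)?$', arg):
        return 'diskserver', arg
    raise SystemExit(USAGE)


def main():
    if len(sys.argv) != 2:
        raise SystemExit(USAGE)
    kind, target = parse_target(sys.argv[1])
    # Placeholder: substitute the real deletediskcopy arguments for your release,
    # then replace the print with a subprocess call.
    print('%s target %s -> would run: deletediskcopy %s' % (kind, target, target))


if __name__ == '__main__':
    main()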
Feedback from production (2)
We needed a way to spot unmigrated files on D0T1 nodes
–printmigrationstatus doesn't tell you about failed migrations
–Home-rolled 'printcanbemigr' script created for this use case (sketched below)
LHCb wanted an HTTP endpoint – implementing a test WebDAV interface for CASTOR
–Graduate project
–Interested to hear how xroot is going to do this…
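A rough sketch of the idea behind a printcanbemigr-style check. The table and column names and the epoch-seconds convention are assumptions about the stager schema, shown only to illustrate the approach, not the actual script:

#!/usr/bin/env python
# Illustrative only: MigrationJob/CastorFile names and epoch-second timestamps
# are assumptions about the stager schema, not the real printcanbemigr.
import time
import cx_Oracle

QUERY = """
SELECT cf.lastKnownFileName, mj.creationTime, mj.status
  FROM MigrationJob mj JOIN CastorFile cf ON cf.id = mj.castorFile
 WHERE mj.creationTime < :cutoff
 ORDER BY mj.creationTime
"""


def stuck_migrations(user, password, dsn, older_than_hours=24):
    """List migration jobs that have been sitting around longer than the cutoff."""
    cutoff = time.time() - older_than_hours * 3600
    conn = cx_Oracle.connect(user, password, dsn)
    try:
        cur = conn.cursor()
        cur.execute(QUERY, cutoff=cutoff)
        return cur.fetchall()
    finally:
        conn.close()


if __name__ == '__main__':
    for name, created, status in stuck_migrations('stager_ro', 'secret', 'stagerdb'):
        print(name, created, status)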
SRM
Stream of 'SRM DB duplicate' problems
–Easy to clean up
–But they are disruptive
–Duplicate users
–Duplicate files (more common, less problematic)
Hotfix applied to the SRM DB to deal with clients who put double slashes in their filenames (sketched below)
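The gist of the double-slash problem as a sketch: treat paths with repeated slashes as their normalised form so they cannot create duplicate entries. This illustrates the idea, not the actual SRM DB hotfix:

import re

def normalise_surl_path(path):
    """Collapse repeated slashes so requests for /castor//x and /castor/x
    refer to the same file instead of creating duplicate entries."""
    return re.sub('/{2,}', '/', path)

assert normalise_surl_path('/castor/example//dir///file') == '/castor/example/dir/file'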
Xroot
High load on disk servers tends to produce high wait I/O
–50 concurrent xroot transfers…
Xroot (2)
–Experimenting with transfer counts in diskmanager to optimise the number of allowed transfers for each node (a wait I/O probe is sketched below)
–Currently Shaun is our single xroot expert, but we're trying to fix that
–The xroot manager daemon seems leaky…
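A minimal sketch of how wait I/O can be sampled on a disk server while experimenting with those transfer limits; it only reads /proc/stat and is not CASTOR-specific:

#!/usr/bin/env python
# Samples aggregate CPU iowait from /proc/stat; useful while tuning the
# number of transfer slots a disk server is allowed.
import time

def cpu_times():
    """Return (iowait, total) jiffies from the first 'cpu' line of /proc/stat."""
    with open('/proc/stat') as f:
        fields = [float(x) for x in f.readline().split()[1:]]
    return fields[4], sum(fields)        # field 4 is iowait

def iowait_percent(interval=5):
    """Percentage of CPU time spent waiting on I/O over the sampling interval."""
    w1, t1 = cpu_times()
    time.sleep(interval)
    w2, t2 = cpu_times()
    return 100.0 * (w2 - w1) / (t2 - t1)

if __name__ == '__main__':
    while True:
        print('iowait: %.1f%%' % iowait_percent())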
Draining
Problem with draining service classes with > 1 copy
–'Patched' thanks to CERN
Overall better, but consistency of draining is still a problem
–Draining a whole server causes problems (transfer manager crashes, DoS against user requests)
–Draining a single filesystem seems better
–But it frequently needs kicking (many remaining files, drain still running, but nothing happening) – a staleness check is sketched after the example below
–Also seems to be better on servers with 10Gb networking
Draining example
More on Draining
Example monitoring output (numeric values not reproduced here):
–draindiskserver -q refreshed every 10s via watch: one RUNNING drain on gdss515.gridpp.rl.ac.uk /exportstage/castor3/, 22.5% done, ETC 6h49mn6s, plus a TOTAL row; columns are DiskServer, MountPoint, Created, TFiles, TSize, RFiles, RSize, Done, Failed, RunTime, Progress, ETC, Status
–listtransfers -p -x -r d2ddest:d2dsrc for the atlasStripInput diskpool: NBSLOTS plus pending/running transfer counts (NBTPEND, NBSPEND, NBTRUN, NBSRUN) broken down into TOTAL, D2DDEST and D2DSRC
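A sketch of the 'does this drain need kicking?' check referred to above, assuming draindiskserver -q prints the column-aligned table shown in the example and that the Done column counts completed files; the parsing below relies on that layout:

#!/usr/bin/env python
# Flags drains whose 'Done' count has not moved between two polls.
# Column slicing assumes the whitespace-aligned table shown above.
import subprocess
import time

POLL_SECONDS = 600


def drain_progress():
    """Map (diskserver, mountpoint) -> the 'Done' field for RUNNING drains."""
    out = subprocess.check_output(['draindiskserver', '-q']).decode()
    lines = [l for l in out.splitlines() if l.strip()]
    header = next(l for l in lines if l.lstrip().startswith('DiskServer'))
    # start offset of each column name in the header row
    offsets = {name: header.index(name) for name in
               ('DiskServer', 'MountPoint', 'Created', 'Done', 'Failed', 'Status')}
    progress = {}
    for line in lines[lines.index(header) + 1:]:
        if line.lstrip().startswith('TOTAL'):
            continue
        server = line[offsets['DiskServer']:offsets['MountPoint']].strip()
        mount = line[offsets['MountPoint']:offsets['Created']].strip()
        done = line[offsets['Done']:offsets['Failed']].strip()
        if 'RUNNING' in line[offsets['Status']:]:
            progress[(server, mount)] = done
    return progress


if __name__ == '__main__':
    before = drain_progress()
    time.sleep(POLL_SECONDS)
    for key, done in drain_progress().items():
        if before.get(key) == done:
            print('no progress in %ds, may need kicking: %s %s' % (POLL_SECONDS, key[0], key[1]))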
Rebalancing
Too heavyweight – causes problems…
–DoS to users
–Unexplained transfer manager crashes
–Too large a queue
Consider…
–Do you really need to rebalance Disk0 service classes? (an occupancy-spread check is sketched below)
–Move it into a controllable daemon
We tried to tune it using the 'Sensitivity' parameter
–All-or-nothing behaviour
We have rebalancing turned off for all instances
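For the 'do you really need to rebalance?' question, something as crude as the spread in filesystem occupancy across a pool is usually enough to decide; a sketch with made-up figures:

def occupancy_spread(occupancy):
    """Spread between fullest and emptiest filesystem, in percentage points.

    occupancy maps 'diskserver:/mountpoint' -> used fraction (0.0 - 1.0).
    """
    used = list(occupancy.values())
    return 100.0 * (max(used) - min(used))

# Hypothetical pool figures, purely for illustration.
pool = {'serverA:/exportstage/fs1': 0.91,
        'serverB:/exportstage/fs2': 0.62,
        'serverC:/exportstage/fs3': 0.88}

if occupancy_spread(pool) > 20:
    print('pool is lopsided; rebalancing might be worth the disruption')
else:
    print('spread is small; leave rebalancing off')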
Future plans for CASTOR at RAL
SL6
Bruno is working on this
–We plan to shift all our headnodes over to SL6 this autumn
–Full Quattorisation of all headnodes (no more Puppet)
–We've wanted to do this for a long time: two config management systems is one too many
–Disk servers to follow
Log Analysis
Elasticsearch logging system edging towards production
–Lack of suitable hardware for the search cluster
–…but the system works well for now on old worker nodes
–~5TB of logging information currently stored in Elasticsearch
–Looking to scale out to other Tier 1 applications
–Differing log formats cause problems – better than DLF, but xroot and gridftp are still problematic (see the parsing sketch below)
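A sketch of why structured CASTOR daemon logs index so much more easily than xroot or gridftp logs: key=value pairs map straight onto JSON documents. The sample line is made up for illustration:

import json
import re

KV = re.compile(r'(\w+)=("[^"]*"|\S+)')

def to_document(line):
    """Turn a key=value style log line into a dict ready for indexing."""
    return dict((k, v.strip('"')) for k, v in KV.findall(line))

# Illustrative line only, not copied from a real log.
sample = 'LVL=Info TID=42 MSG="Transfer finished" SvcClass=atlasStripInput'
print(json.dumps(to_document(sample), indent=2, sort_keys=True))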
2015 and beyond…
Can we get before LHC startup?
Ceph – the future of RAL disk-only?
–Test instance under development – Bruno is working on this
–Shaun will now be telling you more…