CASTOR in SC3: Operational aspects
Vladimír Bahyl, CERN IT-FIO
GDB meeting, 20 July 2005

Outline
- CASTOR2 and its role in SC3
- Configuration of disk pools
- Issues we came across
- Making it a service
- Conclusions

What is CASTOR2?
- A complete replacement of the stager (disk cache management)
- Database centric (illustrated in the sketch below)
  - Request spooling
  - Event logging
- Request scheduling externalized (at CERN via LSF)
- More interesting features:
  - Stateless daemons
  - Tape request bundling
- Scalable (tested by the Tier-0 exercise)
- Central services stayed unmodified:
  - Name server, VMGR, VDQM, etc.
- More details: talk of Olof Bärring at the PEB meeting of 7 June 2005
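To make the "database-centric request spooling with stateless daemons" idea concrete, here is a minimal Python sketch using sqlite3 purely for illustration; CASTOR2 itself is built on Oracle, and the table, column and function names below are invented for this example, not the real schema.

import sqlite3

def init(conn):
    # Requests are spooled in a table instead of being held in daemon memory,
    # so any stateless daemon instance can pick up and continue the work.
    conn.execute("""CREATE TABLE IF NOT EXISTS request
                    (id INTEGER PRIMARY KEY,
                     castor_path TEXT,
                     status TEXT DEFAULT 'PENDING')""")

def submit(conn, path):
    # Spool a new request; this is all the request-handling daemon has to do.
    conn.execute("INSERT INTO request (castor_path) VALUES (?)", (path,))
    conn.commit()

def claim_next(conn):
    # A worker claims the oldest pending request; if the worker dies, the row
    # is still in the database and another (stateless) instance can retry it.
    row = conn.execute("SELECT id, castor_path FROM request "
                       "WHERE status = 'PENDING' ORDER BY id LIMIT 1").fetchone()
    if row:
        conn.execute("UPDATE request SET status = 'SCHEDULED' WHERE id = ?",
                     (row[0],))
        conn.commit()
    return row

if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    init(conn)
    submit(conn, "/castor/cern.ch/grid/somefile")   # placeholder path
    print(claim_next(conn))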

Role of CASTOR2 in SC3
- CASTOR holds the data, hence it is a founding element in the transfers
- FTS does the negotiations
- Transfers are done over the gridFTP protocol as third-party copies (see the sketch below)
(Diagram: FTS steering transfers between CASTOR2 at CERN, fronted by SRM and gridFTP, and the Tier-1 centres)
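As a rough illustration of what one such transfer boils down to, the sketch below drives a third-party gridFTP copy with globus-url-copy from Python. The hostnames and paths are placeholders, a valid grid proxy is assumed to exist, and FTS performs this negotiation and retry logic itself rather than shelling out like this.

import subprocess

def third_party_copy(source_turl, destination_turl, parallel_streams=4):
    # With gsiftp URLs on both sides the data flows directly between the two
    # gridFTP servers (a third-party copy); the client only steers the transfer.
    cmd = ["globus-url-copy", "-p", str(parallel_streams),
           source_turl, destination_turl]
    return subprocess.call(cmd)

if __name__ == "__main__":
    rc = third_party_copy(
        "gsiftp://diskserver.cern.ch/castor/cern.ch/grid/file1",   # placeholder source
        "gsiftp://se.tier1.example.org/data/sc3/file1")            # placeholder destination
    print("globus-url-copy exit code:", rc)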

CASTOR2 disk pool configuration
(Diagram: CASTOR2 service classes wan, PhEDEx, spare, aliprod, cmsprod and default, spread over IA64 nodes and IA32 disk servers of ~3 TB each; WAN access for the Tier-1s via gridFTP+SRM, the other servers reachable from the CERN internal network only)

WAN service class evolution 1/4
1. All nodes were in a CASTOR1 disk pool with a dedicated stager.
(Diagram: IA64 nodes and IA32 disk servers, all under CASTOR1)

WAN service class evolution 2/4
2. Move of 9 nodes to CASTOR2.
3. Creation of the wan service class.
(Diagram: the new wan service class alongside the remaining CASTOR1 nodes)

WAN service class evolution 3/4
4. Introduction of the IA64 oplaproXX nodes into the wan service class.
5. PhEDEx joined the party.
(Diagram: the class now serving both wan and PhEDEx traffic, with CASTOR1 still present)

WAN service class evolution 4/4
6. Separation of the PhEDEx and wan service classes to prevent interference.
7. Elimination of the need for CASTOR1.
8. Merge of the remaining IA32 CASTOR1 nodes into the wan service class.
(Diagram: separate wan and PhEDEx service classes, CASTOR1 gone)

Issues confronted – external
- Load distribution with SRM
  - Previously, SRM returned a static hostname (which was supposed to be a DNS alias) in the TURLs
  - Now it can return the hostname of the disk server where the data resides
  - It can also extract the hostnames behind a DNS alias and rotate over the given set in the returned TURLs (sketched below)
    - Good for requests of multiple files
- Support of the IA32 and IA64 architectures
  - The reason for using IA64 was to compare different architectures doing high-performance transfers
  - Compiling CASTOR2 for the IA64 architecture uncovered missing include files and type inconsistencies in the C code
  - The installation procedures had to be extended
  - IA64 hardware support is not of production quality
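The rotation behaviour can be pictured with the following minimal Python sketch: expand a load-balanced DNS alias into its member hosts and hand out a different host for each file of a multi-file request. The alias name is a placeholder and the real logic lives inside the CASTOR SRM server, not in a standalone script.

import itertools
import socket

def build_turls(dns_alias, castor_paths, port=2811):
    # gethostbyname_ex() returns (canonical name, aliases, IP addresses);
    # reverse-resolving each address yields the individual disk-server names.
    _, _, addresses = socket.gethostbyname_ex(dns_alias)
    hosts = sorted({socket.gethostbyaddr(ip)[0] for ip in addresses})
    rotation = itertools.cycle(hosts)
    # Hand out a different disk server for each requested file (round robin).
    return ["gsiftp://%s:%d%s" % (next(rotation), port, path)
            for path in castor_paths]

if __name__ == "__main__":
    print(build_turls("castorgrid.cern.ch",            # placeholder alias
                      ["/castor/cern.ch/grid/file1",
                       "/castor/cern.ch/grid/file2",
                       "/castor/cern.ch/grid/file3"]))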

Issues confronted – internal 1/2
- CASTOR2 LSF plug-in problems
  - The plug-in selects candidates to run jobs based on external (CASTOR2-related) resource requirements (see the sketch below)
  - Its initialization occasionally failed, halting the system
  - All incidents are understood (caused by misconfiguration)
  - Operational procedures have been put in place
- Waiting for tape recalls
  - SC3 is run with disk-only files
  - If a file is removed, the system will look for it on tape
  - The fact that the file is not on tape was not correctly propagated back to the requester
  - Fixed promptly at the database level
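For orientation, here is a much-simplified Python sketch of the kind of filtering the LSF plug-in performs: keep only the disk servers that belong to the requested service class, have enough free space and are below their concurrency limit, then prefer the least loaded one. The data structures and server names are invented for the example; the real plug-in runs inside LSF and reads this state from the stager database.

def select_candidates(disk_servers, service_class, min_free_gb, max_streams):
    candidates = []
    for server in disk_servers:
        if service_class not in server["service_classes"]:
            continue                      # wrong pool for this request
        if server["free_gb"] < min_free_gb:
            continue                      # not enough space to receive the data
        if server["streams"] >= max_streams:
            continue                      # already at its concurrency limit
        candidates.append(server)
    # Prefer the least loaded servers among those that qualify.
    candidates.sort(key=lambda s: s["streams"])
    return [server["name"] for server in candidates]

if __name__ == "__main__":
    servers = [   # invented example state
        {"name": "diskserver01", "service_classes": {"wan"},    "free_gb": 900, "streams": 3},
        {"name": "diskserver02", "service_classes": {"wan"},    "free_gb": 40,  "streams": 1},
        {"name": "diskserver03", "service_classes": {"phedex"}, "free_gb": 500, "streams": 0},
    ]
    print(select_candidates(servers, "wan", min_free_gb=100, max_streams=10))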

Issues confronted – internal 2/2
- Replication causing internal traffic
  - SRM returns the exact location where the data is
  - gridFTP calls stager_get() (via rfio_open())
  - The scheduler returns a replica
  - Not an issue, as this is an unlikely usage pattern
- Requests are treated twice (modelled in the toy sketch below)
  - An SRM request for a file schedules an LSF job
  - Once the file is ready, gridFTP does an rfio_open()
  - Another job is scheduled internally
  - Will be resolved once the gsiftp protocol is fully supported
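The "treated twice" behaviour and its intended fix can be pictured with a small Python toy: in the current setup the SRM request and the subsequent rfio_open() each produce a job, whereas a natively scheduled gsiftp protocol would let both map onto the one job already created for the request. Everything here, including the request id, is invented for illustration.

class ToyScheduler:
    def __init__(self, native_gsiftp=False):
        self.native_gsiftp = native_gsiftp
        self.jobs = {}                              # request id -> number of LSF jobs

    def schedule(self, request_id, trigger):
        if self.native_gsiftp and request_id in self.jobs:
            return                                  # reuse the job created by the SRM request
        self.jobs[request_id] = self.jobs.get(request_id, 0) + 1
        print("scheduled job %d for %s (trigger: %s)"
              % (self.jobs[request_id], request_id, trigger))

if __name__ == "__main__":
    today = ToyScheduler(native_gsiftp=False)
    today.schedule("req-42", "SRM get")
    today.schedule("req-42", "gridFTP rfio_open")   # a second job for the same file
    future = ToyScheduler(native_gsiftp=True)
    future.schedule("req-42", "SRM get")
    future.schedule("req-42", "gridFTP gsiftp")     # no extra job is scheduled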

GDB meeting13 20 July 2005 Circa 2 revisions per week Schedule: Deployment point of view – dynamic system  Packaging level changes: Structure of the configuration files Names of the startup scripts RPM dependencies are modified  Database bottlenecks are being cleaned up Function based indexes configured  LSF configuration changes required Removed eexec script Release marathon XX releases: June July August 1

Making it a service
- Monitoring enabled
  - New daemons on the server nodes as well as on the disk servers
  - Check the log files for new patterns (a sketch follows below)
  - CASTOR LSF status integrated in Lemon
- Alarms configured
  - Standard tasks (e.g. HW issues) are handled by the Operators & SysAdmins
  - CASTOR2-specific problems: call the developers, 24x7
- CASTOR operational procedures being drafted
  - How to put a disk server in and out of production?
  - How to create a new service class, and how to move disk servers between service classes?
  - How to modify the number of concurrent jobs running on a disk server?
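The log-file checks can be imagined along the lines of the Python sketch below: scan a daemon log for known error signatures and emit a count that a monitoring system such as Lemon could turn into an alarm. The log path, metric name and patterns are placeholders; this is not the actual Lemon sensor interface.

import re
import sys

ERROR_PATTERNS = [
    re.compile(r"ORA-\d{5}"),                                    # Oracle errors surfacing in the stager log
    re.compile(r"plugin.*initiali[sz]ation failed", re.IGNORECASE),
    re.compile(r"no space left on device", re.IGNORECASE),
]

def count_error_lines(logfile):
    hits = 0
    with open(logfile) as handle:
        for line in handle:
            if any(pattern.search(line) for pattern in ERROR_PATTERNS):
                hits += 1
    return hits

if __name__ == "__main__":
    logfile = sys.argv[1] if len(sys.argv) > 1 else "/var/log/castor/stager.log"  # placeholder path
    hits = count_error_lines(logfile)
    print("castor.stager.log.errorcount %d" % hits)              # metric line for the monitoring agent
    sys.exit(1 if hits else 0)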

Conclusions
- SC3 is proving valuable in shaping up CASTOR2
- CASTOR2 as it stands has not been fully optimized for the SC3 usage pattern
  - The disk-only file copy configuration is unlikely to be the mainstream usage pattern once the LHC starts
  - Native support for the gsiftp protocol is not yet built in (only ROOT & RFIO are supported for now)
    - Replication of hot files is affected
    - Internal transfers over the loopback interface exist
    - This in fact doubles the load on the system (which in itself is good, as it helps us find and fix problems)
- The recommended CASTOR2 hardware setup has proved suitable for the SC3 tests
  - See my talk from 17 May 2005 at the GDB and check the 4 scenarios
- CASTOR2 can throttle requests per hardware type (in fact per machine)
  - Something the CASTOR1 stager could not do
- None of the issues mentioned above is a show stopper, and no major problems are expected ahead
- For the tasks required by the service phase of SC3, we have already reached production quality and are confident that we can handle it

Thank you