CASTOR Operational experiences. HEPiX Taiwan, Oct. 2008. Miguel Coelho dos Santos. CERN IT Department, CH-1211 Genève 23, Switzerland (www.cern.ch/it).

Presentation transcript:

CASTOR Operational experiences. HEPiX Taiwan, Oct. 2008. Miguel Coelho dos Santos.

Summary
CASTOR Setup at CERN
Recent activities
Operations
Current Issues
Future upgrades and activities

CASTOR Setup at CERN
Some numbers regarding the CASTOR HSM system at CERN.
Production setup:
– 5 tape robots, 120 drives
– 35K tape cartridges, 21 PB
– 900 disk servers, 28K disks, 5 PB
– 108M files in name space, 13M copies on disk
– 15 PB of data on tape
Instances: castorns, castorcentral, tapeservers, castoralice, castoratlas, castorcms, castorlhcb, castorpublic, castorcernt3, srm

CASTORNS setup: GHz Xeon, 2 CPU, 4 GB RAM

CASTORCENTRAL setup: GHz Xeon, 2 CPU, 4 GB RAM

Instance setup (cernt3): 2.33 GHz CPU, 8 cores, 16 GB RAM

On High-Availability
Highly available setup achieved by running more than one daemon of each kind.
Strong preference for Active-Active setups, i.e. load-balancing servers using DNS (a client-side sketch follows below).
Remaining single points of failure:
– vdqm
– dlfserver
– rtcpclientd
– mighunter
Deployment of the CERNT3 model to other production instances depends on deployment of the new SRM version.
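The DNS-based Active-Active setup can be pictured from the client side: one alias name publishes the addresses of all servers of a given kind, and lookups spread clients across them. A minimal Python sketch, using a hypothetical alias name and port (neither is taken from this talk):

import socket

def resolve_alias(alias, port):
    """Return every IPv4 address currently published for a load-balanced alias."""
    infos = socket.getaddrinfo(alias, port, socket.AF_INET, socket.SOCK_STREAM)
    return sorted({info[4][0] for info in infos})

if __name__ == "__main__":
    # Hypothetical alias; each member behind it runs the same daemon,
    # so any returned address can serve the request (Active-Active).
    print(resolve_alias("stager-alias.example.ch", 5015))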

Recent activities
CCRC-08 was a successful test.
Introducing data services piquet and debugging alarm lists.
Ongoing cosmic data taking.
Created REPACK instance:
– Needed a separate instance to exercise repack at larger scale
– Problems are still being found and fixed
Created CERNT3 instance (local analysis):
– New monitoring (daemons and service)
– Running the new xroot redirector
– Improved HA deployment

CCRC-08
Large number of files moved through the system.
Performance (speed, time to tape, etc.) and reliability met targets => Tier-0 works OK.
SRM was the most problematic component: SRMserver deadlock (no HA possible), DB contention, gcWeight, crashes.

Current data activity
No accelerator... still there is data moving.
[Plots: current data activity in the Disk Layer and the Tape Layer]

Data Services piquet
In preparation for data taking, besides making the service more resilient to failures (HA deployment), the Service Manager on Duty rota was extended to 24/7 coverage of critical data services => Data Services Piquet.
The services covered are: CASTOR, SRM, FTS and LFC.
Manpower goes up to increase coverage, but the number of admins that can make changes goes up while the average service-specific experience of admins making changes goes down. Hence:
– Improve documentation
– Make admin processes more robust (and formal)

Operations
Some processes have been reviewed recently:
Change management
– Application upgrades
– OS upgrades
– Configuration changes or emergencies
Procedure documentation
DISCLAIMER: ITIL covers some of these topics extensively. The following slides are a summary of a few of our internal processes. They originate from day-to-day experience; some ITIL guidelines are followed, but as ITIL points out, implementation depends heavily on the specific environment. Our processes might not apply to other organizations.

Change Management (I)
Application upgrades, for example to deploy a CASTOR patch.
Follow simple rules:
– Exercise the upgrade on a pre-production setup, as identical as possible to the production setup
– Document the upgrade procedure
– Reduce concurrent changes
– Announce with enough lead time, O(days)
– “No upgrades on Friday”
– Don’t upgrade multiple production setups at the same time

Change Management (II)
OS upgrades: monthly OS upgrade, 5K servers.
– Test machines are incrementally upgraded every day
– The test state is frozen into the pre-production setup
All relevant (especially mission-critical) services should (must!) have a pre-prod instance: castorpps for CASTOR.
– 1 week to verify mission-critical services (internal verification)
– 1 week to verify the rest of the services (external verification: several clients, for example VO boxes)
– Changes to production are deployed on a Monday (D-day)
– D-day is widely announced from D-19 (= 7+7+5) days; a sketch of the timeline follows below
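One plausible reading of that calendar, as a Python sketch: the announcement goes out 19 days (7 + 7 + 5) before the deployment Monday, and the two one-week verification windows occupy the last 14 of those days. The placement of the remaining 5 days is an assumption, not something stated on the slide.

from datetime import date, timedelta

def upgrade_timeline(d_day):
    """Map the D-19 announcement scheme onto concrete dates for a given D-day."""
    assert d_day.weekday() == 0, "changes are deployed to production on a Monday"
    return {
        "announcement (D-19)": d_day - timedelta(days=7 + 7 + 5),
        "internal verification starts (D-14)": d_day - timedelta(days=14),  # mission-critical services, 1 week
        "external verification starts (D-7)": d_day - timedelta(days=7),    # remaining services and clients, 1 week
        "deployment (D-day)": d_day,
    }

if __name__ == "__main__":
    for step, day in upgrade_timeline(date(2008, 11, 3)).items():
        print(f"{step:40} {day:%a %Y-%m-%d}")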

Change Management (III)
Change requests by customers:
– Handled case by case
– It is generally a documented, exercised, standard change => low risk
Emergency changes:
– Service already degraded
– Not change management but incident management!
Starting to work on measuring change management success (and failures) in order to keep on improving the process.

Procedure documentation
Document day-to-day operations.
Procedure writer (expert)
Procedure executor:
– Implementing service changes
– Handling service incidents
The executor validates the procedure each time it executes it successfully; otherwise the expert is called and the procedure updated.
Simple/standard incidents are now being handled by a recently arrived team member without any special CASTOR experience.

Current Issues (I)
File access contention
– LSF scheduling is not scaling as initially expected
– Most disk movers are memory intensive: gridftp, rfio, root
– So far xrootd (Scalla) seems more scalable and the redirector should be faster
Hardware replacement
– A lot of disk server movement (retirements + arrivals); moving data is very hard. Need disk server draining tools.
File corruption
– Not common, but recently a RAID array started to go bad without any alarm from fabric monitoring
– Some files on disk were corrupted
– Need to calculate and verify the checksum of every file on every access (a sketch follows below)
– 2.1.7 has basic, RFIO-only checksumming on file creation (update bug to be fixed: an update removes the checksum)
– No checksum on replicas
– 2.1.8 has replica checksums and checksum verification before PUT
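A minimal sketch of the per-access verification idea: compute an Adler-32 checksum while reading the disk copy and compare it with the value recorded when the file was written. Adler-32 is shown as an example algorithm, and the recorded value is simply passed in by the caller; how the real stager or name server stores and exposes checksums is not covered here.

import zlib

def adler32_of(path, chunk_size=1 << 20):
    """Compute the Adler-32 checksum of a file, reading it in 1 MiB chunks."""
    checksum = 1  # Adler-32 seed value
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            checksum = zlib.adler32(chunk, checksum)
    return format(checksum & 0xFFFFFFFF, "08x")

def verify_on_access(path, recorded_checksum):
    """True if the on-disk copy still matches the checksum recorded at write time."""
    return adler32_of(path) == recorded_checksum.lower()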

Current issues (II)
Hotspots
– Description: without any hardware/software failure, users are not able to access some files on disk within an acceptable time
– Causes:
1. Mover contention: not enough resources (memory) on the disk server to start more movers. The problem is especially relevant for gridftp and rfio transfers; xroot should solve it for local access, but Grid access needs to be addressed.
2. File location: although more movers can be started on the disk server, other disk server resources are being starved, for example network or disk IO. This can be caused by:
– Multiple high-speed accesses to the same file (few hot files)
– Multiple high-speed accesses to different files (various hot files)
There is currently no way to relocate hot files or to cope with peaks by temporarily having more copies of hot files (a toy detection sketch follows below).
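As a toy illustration of how the hot-file cases could at least be detected, the sketch below counts accesses per file over some window and flags those above a threshold. The record format and the threshold are assumptions; in practice the data would come from mover or request monitoring.

from collections import Counter

def find_hot_files(access_records, threshold=50):
    """access_records: iterable of (castor_path, disk_server) seen in a time window."""
    per_file = Counter(path for path, _server in access_records)
    return {path: count for path, count in per_file.items() if count >= threshold}

# A file flagged here would be a candidate for relocation or for a temporary
# extra replica, i.e. exactly the operations the slide notes are missing today.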

Current issues (III)
Small files
– Small files expose overheads, i.e. initiating/finishing transfers, seeking, metadata writing (for example tape labels, catalogue entries and POSIX information on disk (xfs))
– Currently tape label writing seems to be the biggest issue...
– More monitoring should be put in place to better understand the contributions from the different overheads
Monitoring
– The current monitoring paradigm relies on parsing log messages (a sketch follows below)
– CASTOR would benefit from metric data collection at daemon level
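To make the "parsing log messages" point concrete, here is a minimal sketch of the current approach: derive a metric by scanning daemon logs with a regular expression. The message format below is invented for illustration (real CASTOR/DLF lines differ), which is exactly why native metric export from the daemons would be more robust.

import re

# Hypothetical message format, for illustration only.
MIGRATED_RE = re.compile(r"migrated file .* size=(?P<bytes>\d+)")

def bytes_migrated(log_lines):
    """Sum the sizes reported by (hypothetical) 'migrated file ... size=N' messages."""
    total = 0
    for line in log_lines:
        match = MIGRATED_RE.search(line)
        if match:
            total += int(match.group("bytes"))
    return total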

Future upgrades
Upgrade FTS
– It will allow us to start using internal gridftp! (lower memory footprint)
Upgrade SRM
– A lot of instabilities in recent versions
– Getting SRM stable is critical
– It will allow us to run load-balanced stagers everywhere!
– New xrootd
– Complete file checksumming
– Etc. (see talk by S. Ponce)
Monitoring
– New fabric monitoring sensor and more Lemon metrics are in the pipeline

Future activities
Data taking
Local analysis support
Disk server retirements
– ~65 disk servers to retire in 1Q 2009
– More throughout the year

Questions?