CERN - IT Department, CH-1211 Genève 23, Switzerland, www.cern.ch/it
CASTOR Status, March 19th 2007
CASTOR dev+ops teams, presented by Germán Cancio.



Outline
– Current status/issues
– ATLAS T0 challenge
– Current work items
  – SRM 2.2 (see previous presentation)
  – LSF plugin
  – New DB hardware
  – xrootd interface
– Medium-term activities
  – Tape repack service
  – Common CASTOR/DPM RFIO library
– Future development areas

Current status/issues (1)
The current production release was deployed in Nov '06. Bugs and issues:
– The LSF plugin fails in an unrecoverable way under high load. Workaround: automatic LSF master restart (up to 5 minutes without service).
– A DLF (logging) client socket leak leads to stuck servers.
– A bug in the tape error-handling module may cause insufficient retries upon failing recalls/migrations.
– Known performance/interference bottleneck in the LSF scheduler plugin.
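The automatic-restart workaround above can be pictured as a simple watchdog loop. The sketch below is a minimal illustration only: the health probe and restart action are hypothetical stand-ins, not the actual CASTOR operations scripts.

```python
import time

def watchdog(probe, restart, polls, interval=0.0):
    """Run `polls` health checks; whenever `probe` reports the LSF
    master as dead, invoke `restart` (a stand-in for the real restart
    command). Returns the number of restarts performed."""
    restarts = 0
    for _ in range(polls):
        if not probe():
            restart()  # in production: restart the LSF master daemon
            restarts += 1
        time.sleep(interval)
    return restarts

# Simulated run: the master dies once and is brought back.
health = iter([True, False, True])
print(watchdog(lambda: next(health), lambda: None, polls=3))  # -> 1
```

The point of the design is that the restart is bounded: the probe interval plus the master restart time determines the "up to 5 minutes" of non-service mentioned above.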

Current status: ATLAS T0 challenge
– 3-4/3: Problem with the DLF DB, leading to a socket leak in two central servers.
– 5/3: DLF problem fixed. New problem: corrupted block in the stager DB. DB shut down and recovered; de-fragmentation of the largest table.
– 6/3: Problem with an Oracle execution plan causing high row-lock contention.
– 7-11/3: DB OK, but clients see a high error rate due to unrecoverable errors in the LSF plugin. Deployed a new binary without DLF.
– 12/3: 5 more disk servers added to 'default' to better cope with non-T0 activities.

Current status/issues (2)
– Known LSF limitations (details on the next slide) in the current release prevent the stager from supporting loads above ~5-10 requests/s.
  – This limitation also affects the migration of non-LHC CASTOR1 instances (e.g. NA48, COMPASS) to CASTOR2.
– Hardware/Oracle problems:
  – Broken-mirror errors followed by performance degradation during mirror rebuild.
  – Block corruptions requiring DB recovery.
– These problems are being addressed with top priority, though the timescales for deploying new releases did not match the dates of the ATLAS T0 challenge.
  – A newer release with improved Oracle auto-reconnection is to be installed ASAP.
– Other recent developments approaching production deployment:
  – More efficient handling of some requests used by SRM: prepareToGet and putDone requests are no longer scheduled in LSF. This represents ~25% of the ATLAS request load.
  – Support for the xrootd protocol in CASTOR2, which may help shield chaotic analysis activity from the inherent latencies of CASTOR2 request handling.

New LSF Plug-in
– The LSF interface has long been seen as a bottleneck and was identified as a priority area during the June 2006 review.
  – Heavy use of database queries in the CASTOR-specific LSF plug-in slows down scheduling.
– Bottleneck for performance and priority scheduling:
  – To limit the performance degradation, the current version passes only the first 1,000 requests to LSF for scheduling, so it cannot schedule the 1,001st request according to relative user priorities.
  – This causes interference between independent activities, e.g. for ATLAS: T0 is blocked when analysis fills the 1,000 available slots.
– A proof of concept using a shared-memory system was demonstrated in August 2006, avoiding the inefficient plugin-DB connections.
– A complete version has been tested on a development instance for ~2 weeks.
– The original plan was to deliver it in late October, but delivery had to be pushed back (cf. the CASTOR delta review) due to reduced expert knowledge in the development team.
– Next steps:
  – Installation on the ITDC pre-production setup starting this week.
  – Stress testing and consolidation (until end of April).
  – Deployment on production instances (until mid-May).
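The slot-limit problem above can be shown with a minimal sketch (illustrative names and data structures, not the plugin's actual interface): when only the first 1,000 requests in arrival order reach LSF, a later high-priority T0 request is starved, whereas selecting by priority over the whole backlog is not.

```python
def schedulable(requests, slot_limit=1000):
    """Current plugin behaviour: only the first `slot_limit` requests
    (arrival order) are passed to LSF, whatever their priority."""
    return requests[:slot_limit]

def schedulable_by_priority(requests, slot_limit=1000):
    """New-plugin behaviour, conceptually: pick the highest-priority
    requests from the full backlog (read cheaply, e.g. from shared
    memory rather than per-call DB queries)."""
    return sorted(requests, key=lambda r: r[1], reverse=True)[:slot_limit]

# 1,000 low-priority analysis requests arrive before one T0 request.
backlog = [("analysis", 1)] * 1000 + [("t0", 10)]
print(("t0", 10) in schedulable(backlog))              # False: T0 blocked
print(("t0", 10) in schedulable_by_priority(backlog))  # True
```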

New DB Hardware
– RAC clusters of Oracle-certified hardware have been purchased for hosting the name server and stager databases.
– The CASTOR name server moves on the 2nd of April.
– Definite plans for moving the stager databases will be negotiated with the experiments (~1/2 day of downtime).
– The SRM v2.2 database will also be moved before production deployment.

Xrootd Interface
– Core development was done in Q3 2006, followed by testing by SLAC (A. Hanushevsky) with support from CERN.
– Current status:
  – The dev setup is already open to ALICE test users.
  – CASTOR and xrootd RPMs are ready and will be made available this week on the pre-production setup.
  – The ALICE instance will be upgraded as soon as verified by ALICE.

Tape Repack
A new tape repack component is required for CASTOR2 for several reasons:
– In the short term, to efficiently migrate 5 PB of data to new tape media.
– In the long term, because we expect tape repacking to become an essentially continuous background activity.
– The old Repack was CASTOR1-specific.
Current status:
– The new version has been running on the pre-production instance for 2 months, with intense debugging.
– Testing long, concurrent repacks.
– Pure Repack functionality is now complete, but tests are highlighting the need for stability and reliability improvements in the underlying core CASTOR2 system.
– While work on Repack2 continues, the migration from 9940 to LHC robotics is being performed using CASTOR1: O(500) days with 20 drives to complete (14K tapes left, out of 21K).
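The O(500)-day figure above matches a simple back-of-the-envelope estimate. The per-drive turnaround rate below is an assumed illustrative value (chosen to be consistent with the quoted totals), not a number stated in the slides.

```python
def repack_days(tapes_left, drives, tapes_per_drive_per_day):
    """Days needed to copy the remaining tapes, assuming each drive
    sustains a fixed number of tapes per day (read + re-write)."""
    return tapes_left / (drives * tapes_per_drive_per_day)

# Assuming each drive turns around ~1.4 tapes/day, with 20 drives
# working on the 14K remaining tapes:
print(round(repack_days(14_000, 20, 1.4)))  # -> 500
```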

Common CASTOR/DPM RFIO
– CASTOR and DPM have two independent sets of RFIO commands and libraries, but with the same naming.
  – This has been a source of link-time and run-time problems for experiments.
– A common framework architecture allowing MSS-specific plug-ins has been designed and implemented:
  – Plug-ins for CASTOR2 and DPM have been developed.
  – A common CVS repository contains the framework and the clients (e.g. rfcp, ...).
– The new framework has been extensively tested on the CASTOR dev setups.
– Coordination with the DPM developers on the production deployment schedule is ongoing.
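Conceptually, such a framework dispatches each RFIO call to an MSS-specific back-end chosen at run time, instead of linking one of two identically-named libraries. The sketch below uses hypothetical names and a path-prefix rule purely for illustration; it is not the real library's API.

```python
class Backend:
    """Interface every MSS-specific plug-in implements (hypothetical)."""
    def copy(self, src, dst):
        raise NotImplementedError

class Castor2Backend(Backend):
    def copy(self, src, dst):
        return f"castor2: {src} -> {dst}"

class DpmBackend(Backend):
    def copy(self, src, dst):
        return f"dpm: {src} -> {dst}"

# One registry replaces two clashing, identically-named libraries.
REGISTRY = {"castor:": Castor2Backend(), "dpm:": DpmBackend()}

def rfcp(src, dst):
    """Framework entry point: select the plug-in from the source path."""
    for prefix, backend in REGISTRY.items():
        if src.startswith(prefix):
            return backend.copy(src, dst)
    raise ValueError("no plug-in registered for " + src)

print(rfcp("castor:/castor/cern.ch/f", "/tmp/f"))  # routed to CASTOR2 plug-in
```

The design choice this illustrates is the one named in the slide: the clients (rfcp and friends) are written once against the framework, and only the plug-ins differ per storage system.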

Future Development Areas (1)
We expect development work on critical areas (especially the LSF plugin and SRM v2.2) to be complete by mid-year. The development plan for the second half of the year is still to be agreed, but two issues are likely to be addressed with priority:
1) Server release support:
– Simultaneous support of two stable major server releases, as requested by external institutes.
– Comparable to supporting SLC3 and SLC4: critical bugfixes are backported to the more mature release; new functionality goes only into the newer release.
– Release lifetime is defined by the available support resources; aiming for O(6) months.
– Client releases: backwards compatibility is defined by user needs; a mechanism for ensuring backwards compatibility is in place.

Future Development Areas (2)
2) Strong authentication:
– The current uid/gid-based authentication is not secure.
– Strong authentication libraries for CASTOR were developed in 2005 and integrated in DPM, but not yet in CASTOR2.
– Developments are ongoing.
– The rollout/transition phase needs to be defined, in particular during the lifetime of CASTOR1 (common name server).
– Improved ACL support for VOMS roles and groups will be built on top.
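The core weakness of uid/gid-based authentication is that the identity is simply asserted by the client and trusted by the server. A minimal illustration (a hypothetical check, not the actual RFIO wire protocol):

```python
def authorize(request, file_owner_uid):
    """Insecure scheme: trust whatever uid the client put in the request.
    Nothing binds the claimed uid to the real caller, so any client can
    impersonate any user."""
    return request["uid"] == file_owner_uid

# A forged request claiming uid 1042 is accepted just like a genuine one:
print(authorize({"uid": 1042}, file_owner_uid=1042))  # -> True
print(authorize({"uid": 9999}, file_owner_uid=1042))  # -> False, but the
# attacker simply claims 1042 instead
```

Strong authentication replaces the self-asserted uid with a cryptographically verified identity (e.g. a certificate), which is what the libraries mentioned above provide.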