CASTOR Overview and Status
Sebastien Ponce, CERN / IT
Castor External Operation Face-to-Face Meeting, CNAF, October 29-31, 2007
CERN - IT Department, CH-1211 Genève 23, Switzerland

2/22 Outline
– Brief history of the last year: statistics, major events
– Overview of major versions: 2.1.2, 2.1.3, 2.1.4
– Other improvements: tape part, release scheme, coordination with external institutes
– Current situation: 2.1.5 and 2.1.6 in a nutshell, CASTOR dev staff

3/22 Some statistics for the year
– > 9000 commits, i.e. > 40 per working day
– 130 bugs fixed
– 3 new versions, 44 release candidates, 12 official releases, 14 hotfixes
– 10 PB on tape, 80M files
– 2 PB on disk, 5.3M files
– 500 diskservers
– CMS: "The castor facility has performed remarkably well"

4/22 Major events
February: version 2.1.2
– prepareToGets are not scheduled anymore
– all deadlocks are gone in the DB
June: version 2.1.3
– new monitoring and shared memory
– putDones are not scheduled anymore
September: version 2.1.4
– support for disk only
– jobManager has replaced rmmaster
– SLC4 is supported, as well as 64-bit servers

5/22 The 2.1.2 key features
PrepareToGets are not scheduled
– first step to solve LSF meltdowns in case of large recalls from an experiment
– recalls don't use LSF anymore
– but disk-to-disk copies are still scheduled
Database improvements
– deadlocks were all removed
– reconnections automated (see the sketch below)
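The slide does not say how automatic reconnection is done; purely as an illustration, a retry-on-disconnect wrapper around a query could look like the sketch below. All names (TransientDbError, connect, run_query) are invented for the example and are not CASTOR APIs.

    import time

    class TransientDbError(Exception):
        """Raised by the (hypothetical) driver when the connection is lost."""

    def with_reconnect(run_query, connect, max_retries=3, backoff_s=1.0):
        """Run a query, transparently reconnecting on transient failures."""
        conn = connect()
        for attempt in range(max_retries + 1):
            try:
                return run_query(conn)
            except TransientDbError:
                if attempt == max_retries:
                    raise
                time.sleep(backoff_s * (attempt + 1))  # simple linear backoff
                conn = connect()                       # drop and re-establish the session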

6/22 The 2.1.3 key features
Solution to the LSF meltdown
– shared memory used
New monitoring schema
– rmMasterDaemon replaced rmmaster
Fully qualified domain names everywhere
CLIPS was dropped, replaced by Python
putDones are no longer scheduled
Nameserver file name length extended
GC desynchronized (a possible reading is sketched below)
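The slide does not explain "GC desynchronized"; one plausible reading is that garbage-collection passes on the diskservers are offset by a random delay so they do not all start at the same moment. The sketch below only illustrates that idea; the function name and intervals are assumptions, not CASTOR code.

    import random
    import time

    def gc_loop(run_gc_pass, base_interval_s=300, jitter_s=120):
        """Run GC passes forever, desynchronized across diskservers.

        Each node sleeps base_interval_s plus a random jitter, so GC passes on
        different diskservers drift apart instead of firing in lockstep.
        """
        time.sleep(random.uniform(0, jitter_s))  # initial random offset per node
        while True:
            run_gc_pass()
            time.sleep(base_interval_s + random.uniform(0, jitter_s))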

7/22 2.1.3: removing the LSF limitation
Major issues prior to 2.1.3 were:
– LSF meltdown
– LSF scheduling speed
– LSF message boxes
An analysis of the causes was presented at the November 2006 face-to-face meeting
– lack of multithreading in LSF
– combined with the latency of the DB access
The new architecture was also presented (see the sketch below)
– rmmaster and scheduler on a single node
– shared memory for monitoring access
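To illustrate the idea of serving monitoring data through shared memory instead of the database, here is a minimal sketch using a named shared-memory segment. The record layout, segment naming and field choice are invented for the example and do not reflect CASTOR's actual data structures.

    import struct
    from multiprocessing import shared_memory

    # Hypothetical per-diskserver record: free space in bytes, number of running jobs.
    RECORD = struct.Struct("qi")

    def publish_state(name, free_bytes, running_jobs):
        """Monitoring side: write the latest state into a named shared segment."""
        shm = shared_memory.SharedMemory(name=name, create=True, size=RECORD.size)
        RECORD.pack_into(shm.buf, 0, free_bytes, running_jobs)
        return shm  # keep a reference so the segment stays alive

    def read_state(name):
        """Scheduler-plugin side: read the state without any database round trip."""
        shm = shared_memory.SharedMemory(name=name)
        free_bytes, running_jobs = RECORD.unpack_from(shm.buf, 0)
        shm.close()
        return free_bytes, running_jobs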

8/22 What will improve (slide from November 2006, with the current status in brackets)
LSF
– plugin limitations gone: expecting > 100K queuing jobs is fine, to be measured [the limit is ~70K]
– code better supported now, as we understand it
– improvements in the scheduling made possible: limiting the number of retries, returning proper errors from the plugin [Done, see … & Dennis' presentation]
RmMaster, RmNode
– fully rewritten (monitoring part), so better supported
– RmMaster made almost "stateless": DB sync allows restarting after a crash, machine states are not lost [DB sync is used for any upgrade; the system has proved to recover from rmMasterDaemon crashes]

9/22 2.1.3: removing the LSF limitation (2)
The LSF message boxes have been dropped
Replaced by
– a web server on the scheduler machine, accessed by the diskservers
– or a shared filesystem
As a consequence, jobs don't go to PUSP anymore

10/22 The 2.1.4 key features
jobManager replaces rmmaster
DiskOnly support
Black and white lists
Support for SLC4 and 64-bit servers
– CERN's new head nodes are 64-bit
– CERN's name server nodes are 64-bit
– still issues at the tape level? See the discussion this afternoon
First handling of hardware "come back"
– stopping useless recalls
– recognizing files that were overwritten and are not available
– more to come in 2.1.6
See next slides

11/22 The 2.1.4 key features (2)
Support for running 2 request handlers on the same machine is built into the init scripts
Nameserver
– nschclass can now change regular files if the user is ADMIN
– this is needed for disk-only pools forcing the fileclass
– when not ADMIN, the behavior did not change at all
Balancing of VMGR pools
– randomized the tape choice within the ones matching the space requirements (see the sketch below)
Removed the message daemon from the tape code
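As a purely illustrative sketch of the pool-balancing change: instead of always writing to the first tape that fits, the selection picks a random tape among all candidates with enough free space. The Tape structure and field names below are made up for the example.

    import random
    from dataclasses import dataclass

    @dataclass
    class Tape:
        vid: str          # volume id
        free_bytes: int   # estimated space left on the tape

    def pick_tape(candidates, needed_bytes):
        """Pick a tape at random among those that can hold the request.

        Randomizing instead of taking the first match spreads writes across the
        pool, which is the balancing behaviour described on the slide.
        """
        eligible = [t for t in candidates if t.free_bytes >= needed_bytes]
        if not eligible:
            raise RuntimeError("no tape with enough free space in this pool")
        return random.choice(eligible)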

12/22 2.1.4: the new jobManager
Uses the standard framework (see the sketch below)
– one master daemon distributing the load
– a pool of preforked children for the LSF calls
– pipes between master and children for communication
Uses the stager DB as backend
– introduces new states for SubRequests: READYFORSCHED & BEINGSCHEDULED
RPMs
– castor-jobmanager-server was added, to be installed on the jobManager machine
– castor-rmmaster-server contains only rmMasterDaemon and stays only on the scheduler machine
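The master/preforked-children pattern mentioned above can be illustrated with the generic sketch below: a parent forks a fixed pool of workers and dispatches newline-delimited jobs to them over pipes. This is not CASTOR's jobManager code; the pool size, job payloads and round-robin dispatch are arbitrary choices for the example (POSIX only, since it relies on fork).

    import os

    NUM_CHILDREN = 4  # hypothetical pool size

    def handle_job(line):
        # Placeholder for the real work (e.g. submitting the job to the scheduler).
        print(f"child {os.getpid()} handling: {line}")

    def spawn_child():
        """Fork a child that reads newline-delimited jobs from its pipe."""
        read_fd, write_fd = os.pipe()
        if os.fork() == 0:                     # child process
            os.close(write_fd)
            with os.fdopen(read_fd) as pipe:
                for line in pipe:
                    handle_job(line.rstrip("\n"))
            os._exit(0)
        os.close(read_fd)                      # parent keeps only the write end
        return os.fdopen(write_fd, "w", buffering=1)

    if __name__ == "__main__":
        pipes = [spawn_child() for _ in range(NUM_CHILDREN)]
        for i, job in enumerate(["job-a", "job-b", "job-c", "job-d"]):
            pipes[i % len(pipes)].write(job + "\n")  # round-robin dispatch
        for p in pipes:
            p.close()                                # EOF lets the children exit
        for _ in pipes:
            os.wait()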

13/22 2.1.4: support for diskOnly
A svcClass can be declared "diskOnly"
– user requests that need to allocate space on disk will fail if no space is available
– by default, they would have queued forever (see the sketch below)
Ability to force a fileClass
– only supported in diskOnly svcClasses
– requires the new nameserver
– allows enforcing a tape-0 fileclass
All this is orthogonal to the garbage collection
– diskOnly + GC can coexist ("scratch" area)
stager_rm per svcClass
– '*' as svcClass name gives the old behavior
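For illustration only, the behavioural change on full disk pools can be summarised by the following decision; the function and its arguments are invented, not CASTOR code.

    def allocate_space(disk_only, needed_bytes, free_bytes):
        """Decide what happens when a request needs disk space in a service class."""
        if free_bytes >= needed_bytes:
            return "ALLOCATE"
        # 2.1.4 diskOnly behaviour: fail immediately instead of queuing forever.
        return "FAIL" if disk_only else "QUEUE"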

14/22 2.1.4: Black and white lists
These lists can be enabled in castor.conf
– RH USEACCESSLISTS YES (disabled by default)
– the check is done in the request handler before insertion into the DB
Lists contain svcClass, uid, gid and type of request
– the wildcard is the NULL value
Access allowed if (see the sketch below)
– the user is in the white list
– AND the user is not in the black list
– e.g. you accept all of ATLAS (WL) except one user (BL)
Tools to handle these lists will come only in 2.1.6
A stress test was run that showed no impact on the request handler load
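A minimal sketch of the rule as stated on the slide, assuming list entries hold (svcClass, uid, gid, request type) with NULL acting as a wildcard; the entry structure below is an assumption, not the real schema. A white-list entry with only the gid set then accepts a whole experiment, while a black-list entry with only a uid set excludes a single user.

    from typing import NamedTuple, Optional

    class ListEntry(NamedTuple):
        svc_class: Optional[str]  # None plays the role of NULL, i.e. a wildcard
        uid: Optional[int]
        gid: Optional[int]
        req_type: Optional[str]

    def matches(entry, svc_class, uid, gid, req_type):
        """An entry matches when every non-NULL field equals the request's value."""
        return all(e is None or e == v
                   for e, v in zip(entry, (svc_class, uid, gid, req_type)))

    def access_allowed(white, black, svc_class, uid, gid, req_type):
        """Allowed if some white-list entry matches AND no black-list entry matches."""
        req = (svc_class, uid, gid, req_type)
        return (any(matches(e, *req) for e in white)
                and not any(matches(e, *req) for e in black))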

15/22 Tape improvements
Tape components use DLF
– so far "tpdaemon" and "rtcpd" are using DLF
– rmcdaemon to come soon
Removal of the message daemon
– CERN has now stopped it completely
Several bug fixes
– bad MIR auto repair
– drives stuck in UNKNOWN/RUNNING
– "volume in use" problems
– see the tape-related presentation

16/22 Release process
New schema (details in Giuseppe's talk)
– Versions: major releases, e.g. …
– Releases: minor, bug-fix releases, e.g. …
– Hotfixes are SQL-only fixes, e.g. …
Updated forum
– description of known bugs
– workaround scripts provided and documented
– snapshot of the Savannah state for each release

17/22 Coordination with external institutes
Phone conference every second week
– mainly led by the dev team
– agenda: status reports, discussion of current problems, recent developments
Deployment meeting every month
– operation + dev + tape teams
– agenda: review of Savannah tickets, discussions on task priorities, definition of the release contents
Face-to-face meeting every year

18/22 Current situation
2.1.3 and 2.1.4 are supported
– already 5 releases of 2.1.4 out (-5 to -9)
– in production at CERN & ASGC
Dev team is working on the next versions
– 2.1.5, due by Monday: contains only the new stager, backward compatible with 2.1.4, will stay internal
– 2.1.6, due on December 1st: the next official release, whose content was discussed in the monthly meeting

19/22 2.1.5 in a nutshell
Only contains a new stager
Why rewrite it:
– to integrate it in the overall framework
– to be able to maintain it
– to review the monitoring of requests
Timetable:
– Marisa worked on it for > 6 months
– Giuseppe is helping to integrate and debug
– first internal release this week, to start testing
– official release only in 2.1.6

20/22 2.1.6 highlights
List discussed in the last deployment meeting (Oct 17th)
– first implementation of strong authentication
– Python policy framework (recall policy and stream policy; a sketch of the idea follows below)
– extended visualization support for Repack2
– tools for managing black and white lists
– file synchronization between diskserver and stager
– cleaning of STAGEOUT files
– GridFTPv2 integration
– improved SQL code for monitoring
– consistency checks when hardware comes back
– scheduling of disk-to-disk copies
– checksums for disk-resident files
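The slide only names the Python policy framework; to give an idea of the concept, a policy could be a plain Python function that the stager calls to decide when to act. The signature, thresholds and decision rule below are entirely made up for illustration.

    # Hypothetical recall policy: decide whether the pending recalls for a tape
    # justify mounting it now, or whether it is worth waiting for more requests.
    def recall_policy(num_pending_files, total_size_bytes, oldest_request_age_s):
        enough_work = num_pending_files >= 50 or total_size_bytes >= 50 * 1024**3
        waited_too_long = oldest_request_age_s > 4 * 3600
        return enough_work or waited_too_long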

21/22 Timelines for 2.1.6
November
– tests of the new stager as part of 2.1.6
December 1st 2007
– first candidate release
During December
– testing at RAL and CERN
Mid January
– first production release
February
– deployment for use in CCRC

22/22 The core CASTOR dev team
– Sebastien Ponce: project leader, DB code generation, LSF
– Giuseppe Lo Presti: DB, core framework, SRM, new stager
– Giulia Taurelli: Repack, Python policies, test suites, RFIO
– Dennis Waldron: DLF, jobManager, D2D-copy scheduling
– Arne Wiebalck: tape-related software (hardware team, tape part)
– Rosa Maria Garcia Rioja: 64-bit port, xroot, gridFTP v2, security
– Steven Murray (arriving next Monday): tape layer
– Maria Isabel Martin Serrano (leaving end of December): new stager