1
CASTOR Overview and Status
Sebastien Ponce, CERN / IT
Castor External Operation Face-to-Face Meeting, CNAF, October 29-31, 2007
2
Outline
Brief history of the last year
– statistics, major events
Overview of major versions
– 2.1.2, 2.1.3, 2.1.4
Other improvements
– tape part
– release scheme
– coordination with external institutes
Current situation
– 2.1.5 and 2.1.6 in a nutshell
– CASTOR dev staff
3
Some statistics for the year
– > 9000 commits, i.e. > 40 per working day
– 130 bugs fixed
– 3 new versions, 44 release candidates, 12 official releases, 14 hotfixes
– 10 PB on tape, 80M files
– 2 PB on disk, 5.3M files
– 500 diskservers
CMS: "The castor facility has performed remarkably well"
4
Major events
February: 2.1.2 version
– prepareToGets are not scheduled anymore
– all deadlocks are gone in the DB
June: 2.1.3 version
– new monitoring and shared memory
– putDones are not scheduled anymore
September: 2.1.4 version
– support for disk only
– jobManager has replaced rmmaster
– SLC4 is supported, as well as 64-bit servers
5
The 2.1.2 key features
PrepareToGets are not scheduled
– first step to solve LSF meltdowns in case of large recalls from an experiment
– recalls don't use LSF anymore
– but disk-to-disk copies are still scheduled
Database improvements
– deadlocks were all removed
– reconnections automated
6
The 2.1.3 key features
Solution to the LSF meltdown
– shared memory used
New monitoring schema
– rmMasterDaemon replaced rmmaster
Fully qualified domain names everywhere
CLIPS was dropped, python replaces it
putDones are no longer scheduled
nameserver file name length extended
GC desynchronized
7
2.1.3: removing the LSF limitation
Major issues prior to 2.1.3 were:
– LSF meltdown
– LSF scheduling speed
– LSF message boxes
An analysis of the causes was presented in November 2006 at the face-to-face meeting
– lack of multithreading in LSF
– combined with the latency of the DB access
The new architecture was also presented
– rmmaster and scheduler on a single node
– shared memory for monitoring access
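The shared-memory idea described above can be pictured with a minimal sketch. This is an illustration only: the real CASTOR code is not python, and the record layout, segment name and field names below are invented for the example. The monitoring daemon publishes per-diskserver state into a shared segment, and the scheduler plugin reads it without a DB round trip per scheduling decision.

    # Minimal sketch of the shared-memory monitoring idea (hypothetical layout,
    # not the actual CASTOR data structures).
    import struct
    from multiprocessing import shared_memory

    RECORD = struct.Struct("16s q q")   # hostname, free space, running streams

    def publish(states):
        """Writer side (monitoring daemon): one record per diskserver."""
        shm = shared_memory.SharedMemory(name="monitoring", create=True,
                                         size=RECORD.size * len(states))
        for i, (host, free, streams) in enumerate(states):
            RECORD.pack_into(shm.buf, i * RECORD.size,
                             host.encode(), free, streams)
        return shm

    def read_states(count):
        """Reader side (scheduler plugin): no DB access on the scheduling path."""
        shm = shared_memory.SharedMemory(name="monitoring")
        for i in range(count):
            host, free, streams = RECORD.unpack_from(shm.buf, i * RECORD.size)
            yield host.rstrip(b"\0").decode(), free, streams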
8
What will improve (slide from November 2006, with current status added in brackets)
LSF
– plugin limitations gone: expecting > 100K queuing jobs is fine, to be measured [the measured limit is ~70K]
– code better supported now, as we understand it
– improvements in the scheduling made possible: limiting the number of retries, returning proper errors from the plugin [done, see 2.1.4 & Dennis' presentation]
RmMaster, RmNode
– fully rewritten (monitoring part), so better supported
– RmMaster made almost "stateless": DB sync allows a restart after a crash, machine states not lost [DB sync now used for any upgrade; the system has proved to recover from rmMasterDaemon crashes]
9
2.1.3: removing the LSF limitation (2)
The LSF message boxes have been dropped
Replaced by
– a web server on the scheduler machine, accessed by the diskservers
– or a shared filesystem
As a consequence, jobs don't go to PSUSP anymore
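A rough sketch of the web-server alternative to the LSF message boxes: the scheduler machine exposes the job description over HTTP and the diskserver pulls it when the job starts. The URL scheme, port and payload format below are made up for illustration; the actual exchange is CASTOR-internal.

    # Sketch only: hypothetical URL and payload, not the real CASTOR protocol.
    import json
    from urllib.request import urlopen

    def fetch_job_description(scheduler_host, job_id):
        # The diskserver pulls the job details from the scheduler's web server
        # instead of waiting for an LSF message box to be filled in.
        url = f"http://{scheduler_host}:8080/jobs/{job_id}"
        with urlopen(url, timeout=10) as response:
            return json.load(response)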
10
The 2.1.4 key features
jobManager replaces rmmaster
DiskOnly support
Black and white lists
(see the next slides for the three items above)
Support for SLC4 and 64-bit servers
– CERN new head nodes are 64 bits
– CERN name server nodes are 64 bits
– still issues at the tape level? See discussion this afternoon
First handling of hardware "come back"
– stopping useless recalls
– recognizing files overwritten and not available
– more to come in 2.1.6
11
The 2.1.4 key features (2)
Support for running 2 request handlers on the same machine is built into the init scripts
nameserver
– nschclass can now change the class of regular files if the user is ADMIN
– this is needed for disk only pools forcing the fileclass
– when not ADMIN, the behavior did not change at all
Balancing of VMGR pools
– randomized the tape choice among the tapes matching the space requirements
Removed message daemon from tape code
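The pool-balancing change can be summarised with a small sketch (the Tape record and function below are hypothetical; the real selection lives inside the VMGR): instead of always taking the first tape with enough space, pick at random among all tapes that satisfy the space requirement.

    # Illustrative only: shows the randomized-choice idea, not the VMGR code.
    import random
    from dataclasses import dataclass

    @dataclass
    class Tape:
        vid: str          # volume identifier
        free_space: int   # bytes still writable

    def pick_tape(tapes, needed_bytes):
        candidates = [t for t in tapes if t.free_space >= needed_bytes]
        if not candidates:
            raise RuntimeError("no tape with enough free space")
        # Randomizing among matching tapes spreads the load across the pool
        # instead of always hitting the same tape.
        return random.choice(candidates)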
12
2.1.4: the new jobManager
Uses the standard framework
– one master daemon distributing the load
– a pool of preforked children for the LSF calls
– pipes between master and children for communication
Uses the stager DB as backend
– introduces new states for SubRequests: READYFORSCHED & BEINGSCHEDULED
RPMs
– castor-jobmanager-server was added, to be installed on the 'jobManager' machine
– castor-rmmaster-server contains only rmMasterDaemon and stays only on the scheduler machine
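A minimal sketch of the master / preforked-children / pipes pattern mentioned above, in python and purely for illustration (the real jobManager is a CASTOR daemon talking to LSF and the stager DB): the master forks a fixed pool of workers at startup and distributes work to them over pipes.

    # Toy illustration of a prefork + pipe dispatcher; not the real jobManager code.
    import os

    def worker(read_fd):
        with os.fdopen(read_fd) as pipe:
            for line in pipe:
                job_id = line.strip()
                # Here the real child would perform the (slow) LSF submission call.
                print(f"worker {os.getpid()} submitting {job_id}")

    def master(jobs, nworkers=4):
        writers = []
        for _ in range(nworkers):
            r, w = os.pipe()
            if os.fork() == 0:            # child: keep only this pipe's read end
                os.close(w)
                for other in writers:     # drop inherited write ends of earlier pipes
                    other.close()
                worker(r)
                os._exit(0)
            os.close(r)                   # parent: keep only the write end
            writers.append(os.fdopen(w, "w", buffering=1))
        # The master distributes the load round-robin over the preforked children.
        for i, job in enumerate(jobs):
            writers[i % nworkers].write(job + "\n")
        for w in writers:
            w.close()
        for _ in range(nworkers):
            os.wait()

    if __name__ == "__main__":
        master([f"job-{n}" for n in range(10)])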
13
2.1.4: support for diskOnly
A svcClass can be declared "diskOnly"
– user requests that need to allocate space on disk will fail if no space is available
– by default, they would have queued forever
Ability to force a fileClass
– only supported in diskOnly svcClasses
– requires a 2.1.4 nameserver
– allows to enforce a tape-0 fileclass
All this is orthogonal to the Garbage Collection
– diskOnly + GC can coexist ("scratch" area)
stager_rm per svcClass
– '*' as svcClass name gives the old behavior
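The behavioural change for diskOnly service classes boils down to a few lines; the sketch below just restates the rule above with invented names, it is not CASTOR code.

    # Sketch of the diskOnly admission rule (names are invented).
    from dataclasses import dataclass

    @dataclass
    class SvcClass:
        name: str
        disk_only: bool

    class NoSpaceError(Exception):
        pass

    def admit_put_request(svc_class, free_bytes, requested_bytes):
        if free_bytes >= requested_bytes:
            return "SCHEDULE"      # space available: handle the request as usual
        if svc_class.disk_only:
            # diskOnly pool with no space left: fail fast instead of queueing forever
            raise NoSpaceError(f"no space left in diskOnly svcClass {svc_class.name}")
        return "QUEUE"             # previous behaviour: queue until space appears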
14
2.1.4: Black and white lists
These lists can be enabled in castor.conf
– RH USEACCESSLISTS YES (disabled by default)
– the check is done in the request handler, before insertion into the DB
– lists contain svcClass, uid, gid and type of request
– the wild card is the NULL value
Access allowed if
– the user is in the white list
– AND the user is not in the black list
– e.g. you accept all of atlas (WL) except one guy (BL)
Tools to handle these lists will come only in 2.1.6
A stress test was run and showed no impact on the request handler load
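The access rule spelt out above (in the white list AND not in the black list, with NULL acting as a wild card) can be written out as a short sketch; the entry layout and the uid/gid values are illustrative, not the actual CASTOR schema.

    # Sketch of the list-matching rule; None plays the role of the NULL wild card.
    def matches(entry, svc_class, uid, gid, req_type):
        # An entry matches if every non-NULL field equals the request's value.
        return all(field is None or field == value
                   for field, value in zip(entry, (svc_class, uid, gid, req_type)))

    def access_allowed(white_list, black_list, svc_class, uid, gid, req_type):
        in_white = any(matches(e, svc_class, uid, gid, req_type) for e in white_list)
        in_black = any(matches(e, svc_class, uid, gid, req_type) for e in black_list)
        return in_white and not in_black

    # Example: accept a whole group (white list) except one uid (black list).
    white = [("default", None, 1307, None)]     # (svcClass, uid, gid, requestType)
    black = [("default", 4242, 1307, None)]
    assert access_allowed(white, black, "default", 1000, 1307, "StagePutRequest")
    assert not access_allowed(white, black, "default", 4242, 1307, "StagePutRequest")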
15
Tape improvements
Tape components use DLF
– so far "tpdaemon" and "rtcpd" are using DLF
– rmcdaemon to come soon
Removal of the message daemon
– CERN has now stopped it completely
Several bug fixes
– bad MIR auto repair
– drives stuck in UNKNOWN/RUNNING
– "volume in use" problems
– see the tape related presentation
16
Release process
New schema (details in Giuseppe's talk)
– Versions: major releases, e.g. 2.1.3, 2.1.4
– Releases: minor, bug-fix releases, e.g. 2.1.3-24, 2.1.4-9
– Hotfixes: SQL-only fixes, e.g. 2.1.3-24-2
Updated forum
– description of known bugs
– work-around scripts provided and documented
– snapshot of the savannah state for each release
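A tiny helper shows how the three-level numbering above reads; this is just a parsing illustration of the naming scheme, not a CASTOR tool.

    # Parse "2.1.3-24-2" into (version, release, hotfix); illustration only.
    def parse_castor_version(tag):
        parts = tag.split("-")
        version = parts[0]                                    # e.g. "2.1.3" (major version)
        release = int(parts[1]) if len(parts) > 1 else None   # e.g. 24 (bug-fix release)
        hotfix = int(parts[2]) if len(parts) > 2 else None    # e.g. 2 (SQL-only hotfix)
        return version, release, hotfix

    print(parse_castor_version("2.1.4-9"))      # ('2.1.4', 9, None)
    print(parse_castor_version("2.1.3-24-2"))   # ('2.1.3', 24, 2)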
17
Coordination with external institutes
Phone conference every second week
– mainly led by the dev team
– agenda: status reports, discussion of current problems, recent developments
Deployment meeting every month
– operation + dev + tape teams
– agenda: savannah ticket review, discussions on task priorities, definition of the release contents
Face-to-face meeting every year
18
Current situation
2.1.3 and 2.1.4 versions are supported
– already 5 releases of 2.1.4 out (-5 to -9)
– 2.1.4 in production at CERN & ASGC
Dev team is working on the next versions
– 2.1.5, due by Monday: contains only the new stager, backward compatible with 2.1.4, will stay internal
– 2.1.6, due on December 1st: the next official release, content was discussed in the monthly meeting
19
2.1.5 in a nutshell
Only contains a new stager
Why rewrite it:
– to integrate it into the overall framework
– to be able to maintain it
– to review the monitoring of requests
Timetable:
– Marisa worked on it for > 6 months
– Giuseppe is helping to integrate and debug
– first internal release this week to start testing
– official release only in 2.1.6
20
2.1.6 highlights
List discussed in the last deployment meeting (Oct 17th):
– First implementation of strong authentication
– Python policy framework (recall policy and stream policy)
– Extended visualization support for Repack2
– Tools for managing black and white lists
– File synchronization between diskserver and stager
– Cleaning of STAGEOUT files
– GridFTP v2 integration
– Improved SQL code for monitoring
– Consistency checks when hardware comes back
– Scheduling of disk-to-disk copies
– Checksums for disk resident files
21
Timelines for 2.1.6
November: tests of the new stager as part of 2.1.5
December 1st, 2007: first candidate release
During December: testing at RAL and CERN
Mid January: first production release
February: deployment for use in CCRC
22
The core castor dev team
Sebastien Ponce: Project Leader, DB code generation, LSF
Giuseppe Lo Presti: DB, core framework, SRM, new stager
Giulia Taurelli: Repack, python policies, test suites, RFIO
Dennis Waldron: DLF, jobManager, D2D copy scheduling
Rosa Maria Garcia Rioja: 64-bit port, xroot, gridFTP v2, security
Maria Isabel Martin Serrano: new stager (leaving end of December)
Tape part:
– Arne Wiebalck: tape related software
– Hardware team
– Steven Murray: tape layer (arriving next Monday)