S. Gadomski, "The ATLAS cluster in Geneva", Swiss WLCG experts workshop, CSCS, June ATLAS cluster in Geneva Szymon Gadomski University of Geneva and INP Cracow the existing system − hardware, configuration, use − failure of a file server the system for the LHC startup − draft architecture of the cluster − comments
S. Gadomski, "The ATLAS cluster in Geneva", Swiss WLCG experts workshop, CSCS, June The cluster at Uni Dufour First batch (2005) 12 Sun Fire v20z (wn01 to 12) –2 AMD Opteron 2.4 GHz, 4 GB RAM 1 Sun Fire v20z (grid00) –7.9 GB disk array SLC3 Second batch ( ) 21 Sun Fire X2200 (atlas-ui01 to 03 and wn16 to 33) –2 Dual Core AMD Opteron 2.6 GHz, 4 GB RAM 1 Sun Fire X4500 with 48 disks 500 GB each –17.6 TB effective storage (1 RAID 6, 1 RAID 50, 2 RAID 5) SLC4 Description at
S. Gadomski, "The ATLAS cluster in Geneva", Swiss WLCG experts workshop, CSCS, June Diagram of the existing system 26 TB and 108 cores in workers and login PCs atlas-ui01..03, grid00 and grid02 have double identity because of the CERN network setup by Frederik Orellana
S. Gadomski, "The ATLAS cluster in Geneva", Swiss WLCG experts workshop, CSCS, June The cluster at Uni Dufour (2) Frederik, Jan 07“old” worker nodesnew ones
S. Gadomski, "The ATLAS cluster in Geneva", Swiss WLCG experts workshop, CSCS, June disk arrays CERN network worker nodes back side
S. Gadomski, "The ATLAS cluster in Geneva", Swiss WLCG experts workshop, CSCS, June Current use (1) GRID batch facility in NorduGrid –in fact two independent systems “old” (grid00 + wn01..12), SLC3, older NG “new” (grid02 + wn16..33), SLC4, newer NG –runs most of the time for the ATLAS Mote Carlo production controlled centrally by the “production system” (role of the Tier 2) –higher priority for Swiss ATLAS members mapped to a different user ID, using a different batch queue –ATLAS releases installed locally and regularly updated some level of maintenance is involved
S. Gadomski, "The ATLAS cluster in Geneva", Swiss WLCG experts workshop, CSCS, June NorduGrid (ATLAS sites) n.b. you can submit to all ATLAS NG resources prerequisites –grid certificate –read NG user guide easier to deal with then ATLAS software
S. Gadomski, "The ATLAS cluster in Geneva", Swiss WLCG experts workshop, CSCS, June Current use (2) Interactive work like on lxplus.cern.ch –login with the same account (“afs account”) –see all the software installed on /afs/cern.ch, including the nightly releases –use the ATLAS software repository in CVS like on lxplus “cmt co …” works –same environment to work with Athena, just follow the “ATLAS workbook” Offer advantages of lxplus without the limitations –You should be able to do the same things the same way. –Your disk space can easily be ~TB and not ~100 MB. Roughly speaking we have a factor of –Around 18’000 people have accounts on lxplus…
S. Gadomski, "The ATLAS cluster in Geneva", Swiss WLCG experts workshop, CSCS, June Problems and failure of grid02 The PC Sun Fire X4500 “Thumper” 48 x 500 GB of disk space, 17 TB effective storage 2 Dual Core AMD Opteron processors, 16 GB of RAM one Gb network connection to CERN, one Gb to the University network including all worker nodes (not enough) SLC4 installed by Frederik last January working as a file server, batch server (Torque) and Nordu Grid server (manager + ftp + info) The problems batch system stopping often when many jobs running (18 OK, 36 not OK, 64 definitely not OK) high “system load” caused by data copying (scp) can take minutes to respond to a simple command The crash More jobs allowed again last Thursday, to understand scaling problems. More jobs sent. A few minutes later the system stopped responding. It could not be shut down. After a power cycle it is corrupted. It will be restored under SLC4. What to do in the long run is a good question. Any advice?
S. Gadomski, "The ATLAS cluster in Geneva", Swiss WLCG experts workshop, CSCS, June Plan of the system for Autumn 2007 new hardware before end of September one idea of the architecture 60 TB, 176 cores some open questions comments are welcome
S. Gadomski, "The ATLAS cluster in Geneva", Swiss WLCG experts workshop, CSCS, June Comments about future evolution Interactive work is vital. –Everyone needs to login somewhere. We are not supporting ATLAS software or Grid tools on laptops and desktops. –The more we can do interactively, the better for our efficiency. –A larger fraction of the cluster will be available for login. Maybe even all. Plan to remain a Grid site. –Bern and Geneva have been playing a role of a Tier 2 in ATLAS. We plan to continue that. Data transfer are too unreliable in ATLAS. –Need to find ways to make them work much better. –Data placement from FZK directly to Geneva would be welcome. No way to do that (LCG>NorduGrid) at the moment. Choice of gLite vs NorduGrid is (always) open. It depends on –feasibility of installation and maintenance by one person not full time –reliability (file transfers, job submissions) –integration with the rest of ATLAS computing Colleagues in Geneva working for Neutrino experiments would like to use the cluster. We are planning to make it available to them and use priorities. Be careful with extrapolations from present experience. Real data volume will be 200x larger then a large Monte Carlo production.
S. Gadomski, "The ATLAS cluster in Geneva", Swiss WLCG experts workshop, CSCS, June You can always use the (Nordu) GRID directly: –make your own.xrsl file (see next page) –submit jour jobs and get back the results (ngsub, ngstat, ngcat, ngls) –write your own scripts to mange this when you have 100 or 1000 similar jobs to submit, each one with different input data –many people work like that now –if you like to work like that, less problems for me I think we will use higher level tools which: –operate on data sets and not on files –know about ATLAS catalogues of data sets –automatize the process: make many similar jobs and look after them (split the input data, merging of output) –work with many Grids Higher level GRID tools
S. Gadomski, "The ATLAS cluster in Geneva", Swiss WLCG experts workshop, CSCS, June An example of the job description in.xrsl &("executable" = "SUSYWG_prewrapper.sh" )("inputFiles" = ("SUSYWG_wrapper.sh" "SUSYWG_wrapper.sh" ) ("SUSYWG_prewrapper.sh" "SUSYWG_prewrapper.sh" ) ("SUSYWG_csc_evgen_trf.sh" "SUSYWG_csc_evgen_trf.sh" ) ("SUSYWG_csc_atlfast_trf.sh" "SUSYWG_csc_atlfast_trf.sh" ) ("SUSYWG_SetCorrectMaxEvents.py" "SUSYWG_SetCorrectMaxEvents.py" ) ("z1jckkw_nunu_PT40_ unw" "gsiftp://lheppc50.unibe.ch/se1/borgeg/IN_FILES/zNj_nnPT40_in_noEF/z1jckkw_nunu_PT40_ unw" ) ("z1jckkw_nunu_PT40_01.001_unw.par" "gsiftp://lheppc50.unibe.ch/se1/borgeg/IN_FILES/zNj_nnPT40_in_noEF/z1jckkw_nunu_PT40_ _unw.par.ok" ) ("SUSYWG_DC AlpgenJimmyToplnlnNp3.py" "SUSYWG_DC AlpgenJimmyToplnlnNp3.py" ) ("SUSYWG_csc_atlfast_log.py" "SUSYWG_csc_atlfast_log.py" ) )("outputFiles" = ("z1jckkw_nunu_PT40_ Atlfast12064.NTUP.root" "gsiftp://lheppc50.unibe.ch/se1/borgeg/atlfast12064/zNj_nnPT40_noEF/z1jckkw_nunu_PT40_ Atlfast12064.NTUP.root" ) ("z1jckkw_nunu_PT40_ Atlfast12064.AOD.pool.root" "gsiftp://lheppc50.unibe.ch/se1/borgeg/atlfast12064/zNj_nnPT40_noEF/z1jckkw_nunu_PT40_ Atlfast12064.AOD.pool.root" ) ("z1jckkw_nunu_PT40_ Atlfast12064.log" "gsiftp://lheppc50.unibe.ch/se1/borgeg/atlfast12064/zNj_nnPT40_noEF/z1jckkw_nunu_PT40_ Atlfast12064.log" ) ("z1jckkw_nunu_PT40_ Atlfast12064.logdir.tar.gz" "gsiftp://lheppc50.unibe.ch/se1/borgeg/atlfast12064/zNj_nnPT40_noEF/z1jckkw_nunu_PT40_ Atlfast12064.logdir.tar.gz" ) )("memory" = "1000" )("disk" = "5000" )("runTimeEnvironment" = "APPS/HEP/ATLAS _COTRANSFER" )("stdout" = "stdout" )("stderr" = "stderr" )("gmlog" = "gmlog" )("CPUTime" = "2000" )("jobName" = "noEF_z1jckkw_nunu_PT40_ Atlfast " ) ) ) May 07 02:01:18 RELATION: jobName = SEQUENCE( LITERAL( noEF_z1jckkw_nunu_PT40_ Atlfast ) )
S. Gadomski, "The ATLAS cluster in Geneva", Swiss WLCG experts workshop, CSCS, June Higher level GRID tools (2) GridPilot –a Graphical User Interface (GUI) in Java –developed by Frederik Orellana and Cyril Topfel –only works with Nordu Grid so far –used so far for beam test analysis and simulation as a tool to get data to Bern and to Geneva Ganga –a command line tool (in Python) or a GUI –an “official” development, well supported –several people, including Till, have used it with success in the command line mode –can work with NorduGrid, LCG and plain batch system pathena –the easiest to use, you run pathena like athena –developed by Tadashi Maeno, BNL –needs a server that does all and a pilot process on every node –for the moment the only server is at BNL, but Tadashi will move here… The plan is to try Ganga and GridPilot on some real cases and see how we like them.
S. Gadomski, "The ATLAS cluster in Geneva", Swiss WLCG experts workshop, CSCS, June Summary The ATLAS cluster in Geneva is a large Tier 3 –now 108 cores and 26 TB –in a few months 178 cores and 70 TB In NorduGrid since a couple of years, runs ATLAS simulation like a Tier 2. Plan to continue that. Recently more interactive use by the Geneva group, plan to continue and to develop further. Not so good experience with Sun Fire X4500 running Scientific Linux. Need to decide if one should change to Solaris. Choice of middleware can be discussed. Room for more exchange of experiences between the Swiss WLCG experts: hardware, cluster architecture, OS, cluster management tools, middleware, data transfer tools
S. Gadomski, "The ATLAS cluster in Geneva", Swiss WLCG experts workshop, CSCS, June backup slides
S. Gadomski, "The ATLAS cluster in Geneva", Swiss WLCG experts workshop, CSCS, June NorduGrid and LCG When we started with the Grid in 2004, the LCG was not an option –Not applicable to small clusters like Bern ATLAS and Geneva ATLAS. You would need 4 machines to run services. –You need to reinstall the system and have SLC4. Not possible on the Ubelix cluster in Bern. –Support was given to large sites only. The situation now –NG in Bern and Geneva, NG + LCG in Manno –Ubelix (our biggest resource) will never move to SLC4. –Our clusters are on SLC4 and have become larger, but still a sacrifice of 4 machines would be significant. –The change would be disruptive and might not pay off in the long term. NG still has a better reputation for what concerns stability. We see it as a low(er) maintenance solution. –Higher level Grid tools promise to hide the differences to the users. Do not change now, but reevaluate the arguments regularly.
Dario Barberis: ATLAS Computing Plans for 2007 WLCG Workshop - 22 January Computing Model: central operations l Tier-0: nCopy RAW data to Castor tape for archival nCopy RAW data to Tier-1s for storage and subsequent reprocessing nRun first-pass calibration/alignment (within 24 hrs) nRun first-pass reconstruction (within 48 hrs) nDistribute reconstruction output (ESDs, AODs & TAGs) to Tier-1s l Tier-1s: nStore and take care of a fraction of RAW data (forever) nRun “slow” calibration/alignment procedures nRerun reconstruction with better calib/align and/or algorithms nDistribute reconstruction output (AODs, TAGs, part of ESDs) to Tier-2s nKeep current versions of ESDs and AODs on disk for analysis l Tier-2s: nRun simulation (and calibration/alignment when appropriate) nKeep current versions of AODs on disk for analysis
S. Gadomski, "The ATLAS cluster in Geneva", Swiss WLCG experts workshop, CSCS, June General comments The tier 3 is not counted for the ATLAS “central operations”. We can use it as we prefer. We have done quite some Tier 2 duty already. We are not obliged. It makes sense to keep the system working, but our work has a higher priority. How are we going to work when the data comes? Be careful when you extrapolate from your current experience –A large simulation effort in ATLAS (the Streaming Test) produces “10 runs of 30 minute duration at 200 Hz event rate (3.6 million events)”. That is ~10 -3 of the data sample produced by ATLAS in one year. Need to prepare for a much larger scale. FDR tests of the Computing this Summer. Need to connect to them, exercise the flow of data from Manno, FZK and CERN. Also exercise sending our jobs to the data and collecting the output.
S. Gadomski, "The ATLAS cluster in Geneva", Swiss WLCG experts workshop, CSCS, June To Do list, for discussion see with Olivier if GridPilot can work for him one way of working with Ganga on our cluster –from lxplus only grid00-cern is visible and ngsub command has a problem with double identity –on atlas-ui01 there is a library missing fix Swiss ATLAS computing documentation pages –many pages obsolete, a clearer structure is needed –need updated information about Grid tools and ATLAS software exercise data transfers –from CERN (T0), Karlsruhe (T1) and Manno (T2), –a few different ways and tools if thermal modeling for the SCT goes ahead, help to use the cluster
S. Gadomski, "The ATLAS cluster in Geneva", Swiss WLCG experts workshop, CSCS, June To Do list, for discussion (2) Medium term –consolidate system administration see if we can make routing more efficient make all software run on atlas-ui01, then make all the ui identical update the system on the new workers when ATLAS ready change old workers to SLC4 and setup some monitoring of the cluster –prepare next round of hardware, possibly evaluate prototypes from Sun –get Manno involved in the “FDR” (tests of ATLAS computing model) –decide about Ganga vs GridPilot, provide instructions how to use one Longer term –try pathena when available at CERN –try PROOF –setup a web server –try the short-lived credential service of Switch Help would be welcome!
S. Gadomski, "The ATLAS cluster in Geneva", Swiss WLCG experts workshop, CSCS, June Some open questions Local accounts vs afs accounts. Direct use of the batch system –with standard command-line tools or as back-end to Ganga –would require some work (queues in the batch system, user accounts on every worker) but can be done Use of the system by the Neutrino group.