Published by Milton Pitts. Modified over 9 years ago.
1
Large scale simulations on the EGEE Grid
Jiri Chudoba, FZU and CESNET, Prague
EGEE Summer School, Budapest, 13.7.2005
www.eu-egee.org
EGEE is a project funded by the European Union under contract IST-2003-508833
2
Contents
- Single job vs. big production
- ATLAS Data Challenges
- Job rates, distributions
- Problems: global, local
- Outlook
3
From 1 Job to Big Productions
- Basic job submission on the EGEE grid:
    edg-job-submit my.jdl
    edg-job-status -i myjobs
    edg-job-get-output -i myjobs
- ~100 jobs: still manageable with a couple of shell scripts looping around the basic commands
- ~1000 jobs: a small database is needed to be able to resubmit failed jobs
- LCG experiments require much more: complicated systems built around the basic commands
- ATLAS requirements for 2008:
  - CPU: 51 MSI2K (34 000 current processors)
  - Storage: 25 000 TB disk, 17 000 TB tape
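The "small database" needed at the ~1000-job scale can be sketched as follows. This is only an illustration, not the tooling used in the productions: the table layout, the state names and the `submit` callback are all assumptions, and the callback stands in for whatever wrapper actually calls the submission command.

```python
import sqlite3

def open_db(path=":memory:"):
    """Open (or create) the job bookkeeping database. Hypothetical schema."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS jobs ("
        "name TEXT PRIMARY KEY, jdl TEXT, state TEXT, attempts INTEGER)"
    )
    return db

def add_job(db, name, jdl):
    """Register a job description file that still has to be submitted."""
    db.execute("INSERT INTO jobs VALUES (?, ?, 'new', 0)", (name, jdl))
    db.commit()

def mark(db, name, state):
    """Record the outcome reported for a job, e.g. 'done' or 'failed'."""
    db.execute("UPDATE jobs SET state = ? WHERE name = ?", (state, name))
    db.commit()

def submit_pending(db, submit, max_attempts=3):
    """Submit every new or failed job that has attempts left.

    `submit` is a caller-supplied function taking the JDL file name;
    it is a placeholder for the real submission wrapper.
    """
    rows = db.execute(
        "SELECT name, jdl FROM jobs "
        "WHERE state IN ('new', 'failed') AND attempts < ? ORDER BY name",
        (max_attempts,),
    ).fetchall()
    for name, jdl in rows:
        submit(jdl)
        db.execute(
            "UPDATE jobs SET state = 'submitted', attempts = attempts + 1 "
            "WHERE name = ?",
            (name,),
        )
    db.commit()
    return [name for name, _ in rows]
```

Calling `submit_pending` again after some jobs have been marked `failed` resubmits only those, which is the bookkeeping that plain shell loops make awkward.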
4
Example: ATLAS Data Processing
- The ATLAS (A Toroidal LHC Apparatus) experiment at the Large Hadron Collider at CERN will start taking data in 2007
- Proton-proton collisions at a center-of-mass energy of 14 TeV, with a collision rate of 10^9 Hz and a storage rate of 200 Hz
- Total amount of "raw" data: 1 PB/year
- Analysis Object Data: target size 100 kB per event
- Each collaborator must have transparent access to the data
- ~2000 collaborators, ~150 institutes, 34 countries
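A quick back-of-the-envelope check of what these rates imply. The rates and the AOD event size are taken from the slide; the live time of 10^7 seconds per year is an assumption introduced here only to turn a rate into a yearly volume.

```python
# Numbers quoted on the slide.
collision_rate_hz = 10**9
storage_rate_hz = 200
aod_event_bytes = 100 * 1000  # 100 kB, taking 1 kB = 1000 bytes

# ASSUMPTION (not on the slide): seconds of data taking per year.
live_seconds_per_year = 10**7

# Factor by which the online selection must reduce the event rate.
online_reduction = collision_rate_hz // storage_rate_hz

# AOD volume written per second and, under the live-time assumption, per year.
aod_bytes_per_second = storage_rate_hz * aod_event_bytes
aod_tb_per_year = aod_bytes_per_second * live_seconds_per_year / 10**12

print(f"online reduction factor: {online_reduction:,}")
print(f"AOD rate: {aod_bytes_per_second / 10**6:.0f} MB/s")
print(f"AOD volume: {aod_tb_per_year:.0f} TB/year (assumed live time)")
```

Only one event in five million is kept, and even the compact AOD format adds up to hundreds of terabytes per year under this assumption.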
5
ATLAS Productions
- Large-scale tests of readiness for many components
- ATLAS Data Challenge 1 (2002): production sites independent, no grid tools
  - drawbacks: long delays for some jobs although some sites were already idle; at least 1 ATLAS person needed per site
- ATLAS DC2 (2004): first big test of the ATLAS production system
  - upgrade of the LCG middleware (mw) during production, several versions of the ATLAS software (sw)
- ATLAS Rome production: simulations for the ATLAS Physics Workshop in Rome (June 2005)
6
ATLAS Production System
- ATLAS uses 3 grids:
  - LCG (= EGEE)
  - NorduGrid (evolved from EDG)
  - GRID3 (US)
- plus the possibility of local batch submission: 4 interfaces in total
- Input and output data must be accessible from all grids
- Jobs vary in their requirements on I/O, CPU time and RAM
- ATLAS developed a custom system from several components
7
ATLAS Production System Overview
[Diagram. Labels recoverable from the slide: back ends LCG, NG, Grid3, LSF; executors "LCG exe" (shown twice), "NG exe", "G3 exe", "LSF exe"; "super" (supervisor); prodDB; dms; RLS; protocols jabber and soap; component names Don Quijote, Windmill, Lexor, AMI, Capone, Dulcinea.]
8
ATLAS Production Rates
9
DC2 on LCG
10
Production Rates
- July to September 2004: DC2 GEANT4 simulation (long jobs)
- LCG/EGEE : GRID3 : NorduGrid = 40 : 30 : 30
- October to December 2004: DC2 digitization and reconstruction (short jobs)
- February to May 2005: Rome production (mix of jobs)
- LCG/EGEE : GRID3 : NorduGrid = 65 : 24 : 11
- CondorG improved the efficiency of LCG site usage
11
ATLAS Rome Production: countries (sites)
- Austria (1), Canada (3), CERN (1), Czech Republic (2), Denmark (3), France (4), Germany (1+2), Greece (1)
- Hungary (1), Italy (17), Netherlands (2), Norway (2), Poland (1), Portugal (1), Russia (2), Slovakia (1)
- Slovenia (1), Spain (3), Sweden (5), Switzerland (1+1), Taiwan (1), UK (8), USA (19)
- Subtotals shown on the slide: 17 countries, 51 sites; 7 countries, 14 sites
- Total: 22 countries, 84 sites
12
Rome Production Statistics
- 73 data sets containing 6.1M events simulated and reconstructed (without pile-up)
- Total simulated data: 8.5M events
- Pile-up done later (done for 1.3M events, 50K reconstructed)
13
Rome Production: Number of Jobs (as of 17 June 2005)
16
Critical Servers
- RB, BDII, RLS, SE, UI, DQ, MyProxy, DB servers
- At the beginning of DC2 there was just 1 combo machine with RB, UI, BDII and DQ; this quickly evolved into a complex system of many services running on many machines
- RB: several machines, 1 RB at CERN with a reliable disk array; the RB can be changed if 1 machine has a problem (but some jobs may be lost)
- UI: 1 or 2 machines per submitter, up to 1000 jobs handled by 1 instance of lexor; big memory requirements
- BDII: 2 machines behind a DNS alias
- RLS: single point of failure, cannot be replaced by another machine; problems at the beginning of May stopped production and data replication for several days. The missing features are implemented in the new catalogues.
17
Critical Servers (cont.)
- DQ server for data management
- Production DB: Oracle server shared with other clients, service guaranteed by the CERN db group
- Other DBs required by the ATLAS sw (geometryDB, conditionsDB): MySQL servers
  - A hard limit of 1000 connections to a server was hit during the Rome production. Replica servers were quickly introduced and the code was changed to select between them.
- SE: problems if input data are on an SE which is down
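Selecting between replica servers can be sketched as below. The slide does not say how the ATLAS code chose a replica; random choice with fallback is an assumption made here for illustration, and `connect` is a hypothetical caller-supplied function standing in for the real database client call.

```python
import random

def connect_to_replica(hosts, connect, rng=random):
    """Try the replicas in random order and return the first that accepts.

    hosts   -- list of replica host names (hypothetical values in any example)
    connect -- function(host) returning a connection or raising ConnectionError
    rng     -- source of randomness, replaceable to make tests reproducible
    """
    candidates = list(hosts)
    rng.shuffle(candidates)  # spread the load so no single server fills up
    errors = []
    for host in candidates:
        try:
            return host, connect(host)
        except ConnectionError as exc:
            # e.g. the server has reached its connection limit; try the next one
            errors.append((host, str(exc)))
    raise ConnectionError(f"no replica accepted a connection: {errors}")
```

Shuffling on every call spreads clients over the replicas, and the fallback loop keeps a job running when one server is saturated.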
18
Jobs validation
- Output files are marked by a suffix with the attempt number
- Only when a job is validated (by parsing its log file) are the output files renamed and the job marked as validated
- Renaming the physical files would be difficult (they may be on tape!), so only the entry in the catalogue is changed
- Pollution of disk and tape space with files from failed attempts; no systematic clean-up done
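The validation scheme can be illustrated with a toy catalogue. This is a sketch only: the catalogue is modelled as a plain dictionary from logical name to physical location, and the suffix format and the success marker searched for in the log are hypothetical, not the ones used by the ATLAS production system.

```python
def attempt_name(logical_name, attempt):
    """Logical name under which an unvalidated output is registered."""
    return f"{logical_name}.{attempt}"  # hypothetical suffix format

def log_is_valid(log_text, success_marker="JOB FINISHED OK"):
    """Decide from the job log whether the attempt succeeded.

    The marker string is an assumption made for this example.
    """
    return success_marker in log_text

def validate_job(catalogue, logical_name, attempt, log_text):
    """Promote the output of a successful attempt to its final name.

    Only the catalogue key changes; the physical file (possibly on tape)
    is never touched. Outputs of failed attempts stay registered under
    their suffixed names until someone cleans them up.
    """
    if not log_is_valid(log_text):
        return False
    physical = catalogue.pop(attempt_name(logical_name, attempt))
    catalogue[logical_name] = physical
    return True
```

After a failed first attempt and a successful second one, the catalogue holds both the final name and the leftover `.1` entry, which is the "pollution" the slide mentions.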
19
Monitoring
- Production overview: via proddb (ATLAS specific)
- Grid monitors:
  - GOC monitor: http://goc.grid-support.ac.uk/gridsite/monitoring/
  - Site Functional Tests
  - BDII monitors (several):
    - http://hpv.farm.particle.cz/chudoba/atlas/lcg/bdii/html/latest.html
    - http://www.nordugrid.org/applications/prodsys/lcg2-atlas.php
    - http://www.mi.infn.it/~gnegri/rome_bdii.htm
20
Monitoring (cont.)
- GRIDICE ATLAS VO view: http://gridice2.cnaf.infn.it:50080/gridice/vo/vo_details.php?voName=atlas
- ATLAS production view: http://atlfarm003.mi.infn.it/~negri/cgi-bin/rome_jm.cgi
21
Black Holes
- A wrongly configured site can attract many jobs, since it processes them in a very short time
- Protection:
  - ATLAS sw not found: the job sends a mail and sleeps for 4 hours (often caused by NFS server problems)
  - Automatic exclusion of a site from the BDII if it does not pass the SFT
    - SFT run once a day: too long a delay for big production sites
    - Sometimes the error is caused by an external site
  - Possibility to include/exclude sites by hand (2 persons in different time zones to cover almost 24 hours/day)
  - Since spring 2005, possibility to select which tests are critical: VO-dependent selection!
- Statistics from ATLAS prodDB
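Job statistics of the kind kept in prodDB could be used to flag suspected black holes automatically. The sketch below is an assumption about how such a check might look, not a description of an existing ATLAS tool: the record format and all three thresholds are invented for illustration.

```python
from collections import defaultdict

def suspect_black_holes(jobs, min_jobs=20, max_fail_seconds=300,
                        min_fail_fraction=0.9):
    """Return the sites where almost every job failed almost immediately.

    jobs -- iterable of (site, succeeded, wall_seconds) records
            (hypothetical format)
    The thresholds are illustrative defaults, not tuned values.
    """
    per_site = defaultdict(list)
    for site, succeeded, wall_seconds in jobs:
        per_site[site].append((succeeded, wall_seconds))

    suspects = []
    for site, records in sorted(per_site.items()):
        if len(records) < min_jobs:
            continue  # too few jobs to judge the site
        fast_failures = sum(
            1 for succeeded, wall in records
            if not succeeded and wall <= max_fail_seconds
        )
        if fast_failures / len(records) >= min_fail_fraction:
            suspects.append(site)
    return suspects
```

A check like this reacts on the time scale of the job flow itself, rather than waiting for the next daily SFT run.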
22
Monitor of Blocked CPUs
- Started by doing qstat on a local farm
- Later extended to more sites using globus-job-run
- Only some sites scanned, LSF not supported
- http://www-hep2.fzu.cz/~chudoba/atlas/lcg/last-bad.html
23
Jobs distribution
- Some sites had many jobs in queues while others had free resources
- Different ranking expressions were tried, based on ERT, number of waiting jobs, ...
- Local sites of the submitters were better used
- Problems if a site publishes incorrect info (local MDS or BDII stuck or dead)
- Per-VO information is missing; the new Glue schema should help
- CondorG bypassed the RB: better distribution
- LHCb approach: submit many jobs as placeholders; the actual content of a job is defined only when the job starts
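The placeholder idea can be shown in a few lines. This is a toy model under stated assumptions: the central task queue is a local `deque`, and `node_is_healthy` is a hypothetical check (for example, "is the experiment software visible?"); it does not reproduce the LHCb system.

```python
from collections import deque

def run_placeholder(task_queue, node_is_healthy):
    """Model one placeholder job starting on a worker node.

    The real work is bound to the job only now, at start time:
    a placeholder on a broken node takes nothing from the queue,
    so a misconfigured site cannot drain the list of real tasks.
    """
    if not node_is_healthy():
        return None  # exit without consuming any work
    try:
        return task_queue.popleft()  # late binding of the actual task
    except IndexError:
        return None  # nothing left to do


tasks = deque(["simul-001", "simul-002"])
print(run_placeholder(tasks, lambda: False))  # broken node: no task taken
print(run_placeholder(tasks, lambda: True))   # healthy node: gets simul-001
```

Because tasks are handed out only to jobs that are already running, the queue lengths and published site information matter much less for the distribution of real work.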
24
Local problems (experience from 2 CZ sites)
- Load on the local SE when many jobs start at once
  - no crash, but an enormous increase of clock time per job
  - dCache installed on a new machine, under tests now
- Stability of the NFS server
  - solved by several kernel upgrades
- Disk array crash
  - a backplane problem led to data loss
  - RLS entries for lost files removed
- HyperThreading not used on nodes for ATLAS production
  - some increase of performance in simulation tests, but it introduces more dependence on other jobs running on the same node
25
Local problems (cont.)
- Job distribution based on the default expression for ERT
  - no new jobs for a bigger site if some jobs were already running, although some CPUs were still free
  - OK after a change in the evaluation of ERT
- Remote SE overloaded or down
  - input data on an SE with problems: the lcg-cp command blocks; a timeout was introduced later in lexor
  - predefined SE for output not available: lcg-cr blocks
- Misbehaved jobs
  - a job with an infinite loop writing to a log file took all available space on a local disk and crashed the other job on the WN too
  - the $EDG_JOB_SCRATCH definition was missing on a reinstalled WN: the job filled the shared /home partition
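Guarding a blocking transfer command with a timeout can be sketched with the standard library. This is a generic illustration, not the change made in lexor: the command is passed in as a list by the caller, and no particular options of lcg-cp or lcg-cr are assumed.

```python
import subprocess

def run_with_timeout(command, timeout_seconds):
    """Run an external command, giving up if it hangs.

    command         -- argument list, e.g. the copy command and its arguments
    timeout_seconds -- how long to wait before killing the process
    Returns the exit code, or None if the command was killed on timeout.
    """
    try:
        finished = subprocess.run(
            command,
            capture_output=True,
            text=True,
            timeout=timeout_seconds,
        )
    except subprocess.TimeoutExpired:
        # subprocess.run has already killed the child at this point
        return None
    return finished.returncode
```

A `None` result lets the caller treat a hung storage element like any other failed transfer, for example by trying another replica or marking the job attempt as failed.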
26
Special Jobs (DC2)
- Pileup jobs combine a signal event with several background events
- Required > 1 GB RAM per job and 700 GB of input data with background events
- Data were copied to selected sites in advance using DQ
- Special "InputHint" for these jobs
- Each job used 8 input files with min. bias events (~250 MB each), downloaded from the closeSE, and 1 input file with signal
- Selected sites allowed ATLAS jobs only on machines with enough RAM
27
Conclusions
- ATLAS DC2 and the Rome production were done on 3 grids; LCG/EGEE had the biggest share
- A rate of several thousand jobs/day was achieved
- The MW is not yet mature enough; many problems were met, but all were solved
- Productions require a lot of manpower
  - mostly covered by ATLAS
  - good support from the EIS team
  - some services managed by other CERN groups (DB, RLS, BDII, CASTOR, ...)
28
Pileup jobs
29
Outlook
- Only large-scale tests such as the DCs can find problems and bottlenecks in the system: they must continue
- Service Challenges
  - several phases, starting with tests of basic services and adding more
  - phase 3, during the 2nd half of 2005, will include the LHC experiments
- gLite components should solve some problems (WMS, new data catalogues)
- ATLAS DC3 (Computing System Commissioning) in 2006
- LHC experiment data taking starts already in 2007: we must have a reliable, well-tested system by then
30
Thanks to
- my colleagues from the ATLAS production team for the good cooperation leading to such good results
  - special thanks to Gilbert Poulard for the plots
- the organizers of this EGEE school