The First CMS Data Challenge (~1998/99) Using Condor
P. Capiluppi, June 29, 2006
Slide 2: Disclaimer
- Official presentations of those activities are no longer available...
  - It was a long time ago; the machines used were decommissioned long ago, and files were lost with the dismissed disks.
  - Only fragments of information are still around, mostly on "printed" slides and unlinked Web pages.
  - And ... my memory is not what it was at that time.
- However, I could find some information and, unsurprisingly, a number of "well known" names.
  - Any list of them would certainly forget somebody, so I will avoid making one, but PAOLO MAZZANTI deserves to be mentioned!
Slide 3: Environment and History
- CMS Simulation program (CMSIM) using Geant3 (Fortran), with different versions in rapid development.
- Objectivity was used by CMS at that time.
- First CMS reconstruction programs using C++.
- SUN OS and HP Unix were the basic CMS operating systems.
  - But Linux was growing rapidly, and we had a legacy of many Digital Alphas from LEP.
- Around year 2000 INFN started to fund PC farms.
- In 1999 INFN launched the INFN-Grid project.
- The MONARC project was running at CERN.
  - Then ... we were flooded by GRID and Tiers.
Slide 4: The Data Challenge start ...
- From the minutes of a meeting of 14 May 1998:
  - Need to generate events of single muons (3 different momenta) and events of Higgs -> 2 muons (3 different masses).
  - To be done over Condor, starting June 1998.
  - The CMSIM code had been ported from SUN to Alpha; it needed to be "linked" with the Condor libraries (see the sketch after this slide).
  - Local test runs of the Higgs simulation gave ~1.4 min/event on both Alpha and SUN (with ~5 min of program initialization): more than 700 hours of CPU time for that sample of events.
- From another meeting of 13 May 1998:
  - Planning of the National (INFN) Condor pool (~57 machines available).
  - CMSIM is one of the possible applications over the WAN.
  - GARFIELD (electric-field simulation of the CMS Muon Detector DC cells) will run only locally (checkpoint file too big! Less than a typical mail attachment of today ...).
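As an illustration of what "linking with the Condor libraries" involved, here is a minimal sketch (written in Python only for readability) of relinking a CMSIM-like executable with condor_compile for Condor's checkpointing "standard" universe and submitting a batch of jobs from a generated submit description. The executable name, object files, card files and job counts are hypothetical; only condor_compile, condor_submit and the standard-universe submit keywords are actual Condor features.

```python
#!/usr/bin/env python
"""Hypothetical sketch: relink a CMSIM-like Fortran program with the Condor
checkpointing libraries and submit N simulation jobs to the pool.
All file names, paths and counts are made up for illustration."""

import subprocess
import textwrap

N_JOBS = 30          # e.g. one job per dataset (assumption)
EXE = "cmsim_alpha"  # hypothetical name of the relinked executable

# Step 1: relink the Fortran objects through condor_compile so the job can
# be checkpointed and migrated (Condor "standard" universe).
subprocess.run(
    ["condor_compile", "f77", "-o", EXE, "cmsim_main.o", "cmsim_libs.a"],
    check=True,
)

# Step 2: write a submit description; one process per dataset.
submit = textwrap.dedent(f"""\
    universe   = standard
    executable = {EXE}
    arguments  = dataset_$(Process).ffcards
    output     = cmsim_$(Process).out
    error      = cmsim_$(Process).err
    log        = cmsim.log
    queue {N_JOBS}
    """)
with open("cmsim.sub", "w") as f:
    f.write(submit)

# Step 3: hand the jobs over to the pool.
subprocess.run(["condor_submit", "cmsim.sub"], check=True)
```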
Slide 5: The challenge ... before starting
- We (CMS Bologna) were already using Condor!
Slide 6: Method Used to Produce the Drift Times (from the 9/12/1998 report)
- Full simulation on ALPHA machines; Bologna Condor facility used.
- Four tracks for each x, ..., B considered.
- For each track we assumed the drift time is given by: 50% one electron, 40% two electrons, 10% three electrons.
Slide 7: Drift Lines when Bw = 0.3 T (plot, from the 9/12/1998 report)
Slide 8: And we did start
- A strange (to my mind) CMS Simulation statement (dated 20 Apr 1998).
- The objective was to measure the throughput (in terms of CMS simulated events per hour) of our Condor Pool ...
  - At the beginning we had some compatibility problems between the CERN Library and the Condor libraries, but the Condor Team promptly solved them. This has to be stressed again: the support from the Condor team is very good!
- Indeed, in that period we (CMS Italy) started to support the Condor team (a concrete, even if small, contribution).
- The number of machines running the simulation under Condor ranged from 9 to 19!
  - About 40% of the jobs were checkpointed (note that in the CMS case the checkpoint file was of the order of 66 MB!).
Slide 9: The real challenge (1/2)
- CMSIM jobs were mostly CPU intensive.
  - Very small I/O, compared to the CPU time required to simulate the (carefully chosen) number of events per job.
  - Executable of the order of 140 MBytes.
  - Some of the simulation programs required access to input data (via RPC, not NFS, even in the "local" environment of Bologna).
    - Small in size in any case: ~130 KBytes/event read, and the same amount written.
    - Some of the jobs had a larger I/O: ~600 KBytes/event.
- Propagation of the random seed for the simulation among the jobs required careful bookkeeping (done by hand at that time; a sketch of the idea follows this slide).
- Coordination between different activities over the Condor Pool(s).
  - We were not the only users, and some of the time constraints of the production required coordination.
  - In particular, when going to the national WAN implementation, we faced large fluctuations in response time and in the consistency of local machines.
    - Well known, nowadays, in the Grid ...
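The seed bookkeeping mentioned above was done by hand; as a purely illustrative sketch (not the procedure actually used), one way to guarantee non-overlapping, reproducible random seeds across jobs is to derive each job's seed deterministically from the dataset and job indices and record every assignment. The stride and the CSV format are assumptions.

```python
import csv

def job_seed(dataset_id: int, job_index: int, base: int = 12345) -> int:
    """Derive a unique, reproducible seed for (dataset, job).
    The stride of 1000 is an arbitrary assumption; it just keeps the
    seed ranges of different datasets well separated."""
    return base + dataset_id * 1000 + job_index

# Record every assignment so a failed job can be re-run with the same seed.
with open("seed_bookkeeping.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["dataset", "job", "seed"])
    for dataset in range(30):        # e.g. 30 single-muon datasets
        for job in range(4):         # hypothetical number of jobs per dataset
            writer.writerow([dataset, job, job_seed(dataset, job)])
```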
Slide 10: The real challenge (2/2)
- Going from SUN OS to Alpha OS required some different configuration and, of course, recompilation.
  - Some of the CMSIM Fortran packages for a CMS sub-detector could not be ported, so they were dropped.
    - Fortunately not important for the physics scope.
- All the jobs were submitted via a single SUN station.
  - Limited resources for the input and output of the many jobs.
  - Complicated procedure to make the Alpha executable available.
  - Single point of failure for all the simulations.
  - And all the participating people had to have a local account and coordinate among themselves.
- The results of the simulation had to be made available to all of CMS.
  - Some GBytes of data over AFS? Not possible at that time: procedures were needed to export the data (FTS) and to store them permanently (a trivial local tape system).
Slide 11: The successful Challenge
- Looking back at the (lost) Web pages, the following were available in Bologna (Oct 1998):
  - 30 datasets, each of 4000 events, of single-muon signal at 4 GeV
  - 30 datasets, each of 4000 events, of single-muon signal at 25 GeV
  - 30 datasets, each of 4000 events, of single-muon signal at 200 GeV
  - 30 datasets, each of 1000 events, of Higgs events at the planned masses
- All the data were produced in a remarkably short time, given the resources dedicated to the experiment in Bologna.
  - As an example, a dataset was produced in 3 days on the Condor Pool, against 17 days on a dedicated machine!
- Condor proved to be VERY robust against machine crashes and network interruptions.
  - We experienced both network and machine crashes: in both cases we could recover the "running" jobs without human intervention (more or less ...).
  - Condor's checkpointing was a key feature in this scenario.
Slide 12: And we continued ... (Bologna + Padova)
- From an October 1999 report: 15 days on 6 SUNs in the Condor Pool of Padova, and the same effort on the Bologna Pool.
Slide 13: The machines (resources) used
- Bologna Condor Pool:
  - 19 Digital Alpha, Unix 4.0x
  - 3 HP-UX
  - 8 PC Linux
    - We used them!
  - 2 SGI, IRIX 6.2 or 6.3
  - 1 SUN, Solaris 2.5
  - Located in two WAN-connected sites: RPC access.
- The INFN WAN Condor pool:
  - 48 Digital Alpha (various Unix releases)
  - 14 HP-UX
  - 17 PC Linux
  - 2 SGI IRIX
  - 1 SUN Solaris
Slide 14: Performance evaluations (CMSIM on Condor)
- A Computer-Science thesis by Ilaria Colleoni ( ) (co-tutor: C. Grandi).
  - An attempt to evaluate numerically the running of CMSIM on Condor.
  - With "real" simulation jobs of different computing loads:
    - Single muons (4 GeV, 25 GeV, 200 GeV)
    - Higgs (2 muons) of different masses
    - CPU times/job: from ~4 hours up to ~45 hours
  - Both in a local Condor Pool (Bologna) and in the INFN WAN Condor environment.
  - Alpha platform used, but the submitting machine was a SUN.
  - Checkpointing enabled (executable ~140 MB).
  - All I/O operations (when needed) via RPC.
Slide 15: Single muon events, Local Pool
- Increasing computational load for the different momenta: 4 GeV, 25 GeV, 200 GeV.
- Comparison of the CPU time on Condor with an identical simulation run locally.
- Normalization of the CPU time on Condor to account for the different CPU power of the nodes used (plus some other considerations, like memory, etc.); see the sketch below.
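The details of the thesis are not available, but a plausible form of such a normalization (an assumption on my part, not the procedure actually used) is to rescale each node's measured CPU time by its relative CPU power with respect to the reference machine used for the local runs:

```python
def normalized_cpu_time(measured_cpu_time: float,
                        node_power: float,
                        reference_power: float) -> float:
    """Rescale the CPU time measured on a pool node to the equivalent time
    on the reference machine. 'Power' can be any consistent benchmark
    rating; only the ratio matters (hypothetical normalization form)."""
    return measured_cpu_time * node_power / reference_power

# Example: 10 h measured on a node twice as powerful as the reference
# corresponds to ~20 h of reference-machine CPU time.
print(normalized_cpu_time(10.0, node_power=2.0, reference_power=1.0))
```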
Slide 16: Single muon events, WAN Pool
- Same kind of jobs.
  - They would have required about a week to execute on the Local Pool; we got the results in ~3 days.
- Same normalization of the CPU time.
- Estimate of the load when running over the WAN.
Slide 17: Some (historical) Issues
- During that "first Data Challenge" we faced the "data" problem for the first time:
  - We were worried about the I/O of the jobs, over the LAN and the WAN.
    - And we discovered that the simulation jobs are so CPU intensive that it was a negligible problem, even with those bandwidths.
    - It might be a problem with the current CPUs.
  - But we had to cope with the disk space of the submitting machine.
    - And then we had to find a way to make the produced data available for access (copies).
  - Nowadays we know that the real problem is not the distributed computing, but the distribution of data accesses.
- Another point was the predictability of the Condor system.
  - I remember long discussions with Miron and Paolo (in his office), trying to understand whether Condor could be a solution for "Distributed Analysis".
    - Is it solved?
Slide 18: Conclusion
- CMS (Bologna) started at that time to use "distributed computing" to perform a "simulation challenge".
  - We found everything (mostly) ready, thanks to Condor.
  - And it was a success!
- CMS (at large) has gone through many "computing, data and analysis challenges" since then.
  - Many of them were successful (and we hope we will be successful with the "real challenge" of "real data").
  - However, from that exercise in 1998/99 we learnt a lot:
    - Distributed services, coordination, etc.
    - And, very important: robustness of the underlying software!
- That (modest) Data Challenge was the precursor of a GRID activity that, since then, has taken most of our time ...
Slide 19
- First evaluations (Ilaria)
- Running the production
  - Problems, people, pools, resources
- Results (Bologna + Padova)
- Some issues
  - Historical (Miron & Paolo presentations)
  - Dependencies of the available Condor (CPU vs I/O)
  - Predictability of the results, or simulation vs analyses
- Conclusions
  - First "distributed" CMS challenge
  - Grid precursor