June 29, 2006 P. Capiluppi The First CMS Data Challenge (~1998/99) Using Condor.


1 June 29, 2006 P. Capiluppi The First CMS Data Challenge (~1998/99) Using Condor

2 Disclaimer
- Official presentations of those activities are no longer available…
  - It was a long time ago
  - The machines used were decommissioned long ago
  - Files were lost with the decommissioned disks
  - Only fragments of information are still around
    - Mostly on “printed” slides and unlinked Web pages
  - And … my memory is not what it was at that time …
- However, I could find some information and, unsurprisingly, a number of “well known” names
  - Any list of them would certainly forget somebody, so I’ll avoid making one,
  - but PAOLO MAZZANTI deserves to be mentioned!

3 Environment and History
- CMS simulation program (CMSIM) using Geant3 (Fortran)
  - Different versions in rapid development
- Objectivity in use at that time for CMS
- First CMS reconstruction programs using C++
- SUN OS and HP Unix were the basic CMS operating systems
  - But Linux was growing rapidly
  - And we had a legacy of many Digital Alphas from LEP
- Around the year 2000 INFN started to fund PC farms
- In 1999 INFN launched the INFN-Grid project
- The MONARC project was running at CERN
Then … we were flooded by GRID and Tiers

4 The Data Challenge start …
- From the minutes of a meeting of 14 May 1998:
  - Need to generate 360,000 single-muon events (3 different momenta) and 30,000 Higgs -> 2 muons events (3 different masses)
  - To be done over Condor, starting June 98
  - CMSIM code has been ported from SUN to Alpha: it needs to be “linked” with the Condor libraries
  - Local test runs of the Higgs simulation gave ~1.4 min/event on both Alpha and SUN (with ~5 min of program initialization): >700 hours of CPU time for that sample of events
- From another meeting of 13 May 1998:
  - Planning the national (INFN) Condor pool (~57 machines available)
  - CMSIM is one of the possible applications over WAN
  - GARFIELD (electric field simulation of the CMS Muon Detectors, DC cells) will run only locally (checkpoint file too big! Less than a mail attachment today …)
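The ">700 hours" figure can be checked with simple arithmetic. The sketch below uses only the numbers quoted in the minutes; the function name and the `jobs` parameter are illustrative, and the job count of 30 in the second call is an assumption (the minutes do not say how the sample was split).

```python
# Figures quoted in the 14 May 1998 minutes.
HIGGS_EVENTS = 30_000      # Higgs -> 2 muons events to generate
MIN_PER_EVENT = 1.4        # measured on both Alpha and SUN
INIT_MIN_PER_JOB = 5.0     # program initialization, paid once per job


def cpu_hours(events, min_per_event, jobs=0, init_min=INIT_MIN_PER_JOB):
    """CPU time in hours: per-event cost plus one initialization per job."""
    return (events * min_per_event + jobs * init_min) / 60.0


# Pure event cost: 30,000 events x 1.4 min = 42,000 min = 700 h.
print(f"events only: {cpu_hours(HIGGS_EVENTS, MIN_PER_EVENT):.1f} h")

# Hypothetical split into 30 jobs: each adds ~5 min of initialization.
print(f"with 30 jobs: {cpu_hours(HIGGS_EVENTS, MIN_PER_EVENT, jobs=30):.1f} h")
```

The event cost alone is about 700 hours, so any per-job initialization pushes the total above it, which is consistent with the ">700 hours" in the minutes.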

5 The challenge … before starting
- We (CMS Bologna) were already using Condor!

6 Method Used to Produce the Drift Times
- Full simulation on ALPHA machines;
- Bologna Condor facility used;
- Four tracks for each x, , B considered.
- For each track we assumed the drift time is given by:
  - 50% one electron
  - 40% two electrons
  - 10% three electrons
(From the 9/12/1998 Report)

7 Drift Lines when B_w = 0.3 T
(From the 9/12/1998 Report)

8 And we did start
- A strange (to my mind) CMS Simulation statement (dated 20 Apr 1998)
- The objective was to measure the throughput (in terms of CMS simulated events per hour) of our Condor Pool … At the beginning we had some compatibility problems between the CERN Library and the Condor libraries, but the Condor Team promptly solved these problems. This has to be stressed again: the support from the Condor team is very good!
- Indeed we (CMS Italy) started in that period to support the Condor team (concretely, even if with a small contribution)
- The number of machines running simulation under Condor ranged from 9 to 19! 40% of the jobs were checkpointed (we note that in the CMS case the checkpoint file was of the order of 66 MB!).

9 The real challenge (1/2)
- CMSIM jobs were mostly CPU intensive
  - Very small I/O compared to the CPU time required to simulate the (carefully chosen) number of events per job
  - Executable of the order of 140 MBytes
  - Some of the simulation programs required access to input data (via RPC, not NFS, even in the “local” environment of Bologna)
    - Small in size in any case: ~130 KBytes/event read, same amount written
    - Some of the jobs had a larger I/O: ~600 KBytes/event
- Propagation of the random seed for the simulation among the jobs
  - Required careful bookkeeping (done by hand at that time)
- Coordination between different activities over the Condor Pool(s)
  - We were not the only users, and some of the time constraints for the production required coordination
  - In particular, when going to the national WAN implementation, we faced large fluctuations in response time and in the consistency of local machines
    - Well known nowadays in Grid …
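The seed bookkeeping mentioned above was done by hand, and the slides do not say how seeds were chosen. As a purely illustrative sketch, a small ledger could assign one seed per job and refuse duplicates; all names here (`assign_seeds`, `check_unique`, the dataset labels and seed values) are hypothetical.

```python
import csv
import io


def assign_seeds(dataset, n_jobs, first_seed):
    """Give each job of a dataset its own seed, consecutive from first_seed."""
    return [
        {"dataset": dataset, "job": job, "seed": first_seed + job}
        for job in range(n_jobs)
    ]


def check_unique(ledger):
    """Raise ValueError if two jobs in the ledger share a seed."""
    seen = {}
    for row in ledger:
        if row["seed"] in seen:
            raise ValueError(
                f"seed {row['seed']} used by {seen[row['seed']]} "
                f"and {(row['dataset'], row['job'])}"
            )
        seen[row["seed"]] = (row["dataset"], row["job"])


def to_csv(ledger):
    """Serialize the ledger so it can be kept next to the job outputs."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=["dataset", "job", "seed"])
    writer.writeheader()
    writer.writerows(ledger)
    return buf.getvalue()


# Hypothetical example: two datasets of 30 jobs each, non-overlapping ranges.
ledger = assign_seeds("mu_4GeV", 30, first_seed=1000)
ledger += assign_seeds("mu_25GeV", 30, first_seed=2000)
check_unique(ledger)
print(to_csv(ledger).splitlines()[0])
```

Keeping the ledger as a flat file makes it easy to verify, before submission, that no two jobs would generate the same event sequence.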

10 The real challenge (2/2)
- Moving from SUN OS to Alpha OS required a somewhat different configuration and, of course, recompilation
  - Some of the CMSIM Fortran packages for a CMS sub-detector could not be ported, so they were dropped
    - Fortunately not important for the Physics scope
  - All the jobs were submitted via a single SUN station
    - Limited resources for the input and output of the many jobs
    - Complicated procedure to make the Alpha executable available
    - Single point of failure for all the simulations
    - And all the participating persons had to have a local account and coordinate among themselves
- The results of the simulation had to be available to all of CMS
  - Some GBytes of data over AFS? Not possible at that time: procedures were needed to get the data exported (FTS) and permanently stored (Local Tape trivial System)

11 The successful Challenge
- Looking back at the (lost) Web pages: available in Bologna (Oct 1998)
  - 30 datasets, each of 4000 events, of single muon signal at 4 GeV
  - 30 datasets, each of 4000 events, of single muon signal at 25 GeV
  - 30 datasets, each of 4000 events, of single muon signal at 200 GeV
  - 30 datasets, each of 1000 events, of Higgs events at the planned masses
- All the data were produced in a considerably short time, given the resources that CMS had dedicated to the Experiment in Bologna
  - As an example, a dataset was produced in 3 days over the Condor Pool, against 17 days on a dedicated machine!
- Condor proved to be VERY robust against machine crashes and network interruptions
  - We experienced both network interruptions and machine crashes: in both cases we could recover the “running” jobs without human intervention (more or less …)
  - Condor checkpointing was a key feature in this scenario
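The dataset counts above can be tallied against the plan in the May 1998 minutes (360,000 single-muon and 30,000 Higgs events), and the 3-day versus 17-day example gives a rough speedup. This is a minimal sketch using only the numbers on the slides; the variable and function names are illustrative.

```python
# (number of datasets, events per dataset) as listed for Oct 1998.
SINGLE_MUON = {"4 GeV": (30, 4000), "25 GeV": (30, 4000), "200 GeV": (30, 4000)}
HIGGS = (30, 1000)


def total_events(n_datasets, events_each):
    """Total events in a group of equally sized datasets."""
    return n_datasets * events_each


single_muon_total = sum(total_events(n, e) for n, e in SINGLE_MUON.values())
higgs_total = total_events(*HIGGS)

# One dataset: 17 days on a dedicated machine vs 3 days on the Condor Pool.
speedup = 17 / 3

print(f"single muon events: {single_muon_total}")
print(f"Higgs events:       {higgs_total}")
print(f"speedup:            {speedup:.1f}x")
```

The totals match the planned sample sizes, and the example dataset was produced between five and six times faster on the pool than on a single dedicated machine.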

12 And we continued … (Bo+Pd)
October 1999 Report
- 15 days on 6 SUN machines in the Condor Pool of Padova
- Same effort on the Bologna Pool

13 The machines (resources) used
- Bologna Condor Pool
  - 19 Digital Alpha Unix 4.0x
  - 3 HP-UX 10.20
  - 8 PC Linux
    - We used them!
  - 2 SGI IRIX 6.2 or 6.3
  - 1 SUN Solaris 2.5
  - Located in two WAN-connected sites: RPC access
- The INFN WAN Condor
  - 48 Digital Alpha (various Unix releases)
  - 14 HP-UX
  - 17 PC Linux
  - 2 SGI IRIX
  - 1 SUN Solaris

14 Performance evaluations (CMSIM on Condor)
- A Computer-Science Thesis by Ilaria Colleoni (1998-99) (Co-tutor: C. Grandi)
  - Attempt to numerically evaluate the running of CMSIM on Condor
  - With “real” simulation jobs of different computing loads
    - Single Muons (4 GeV, 25 GeV, 200 GeV)
    - Higgs (2 muons) of different masses
    - CPU times/job: from ~4 hours up to ~45 hours
  - Both in a Local Condor Pool (Bologna) and in the INFN WAN Condor environment
  - Alpha platform used, but the submitting machine was a SUN
  - Checkpointing enabled (exe ~140 MB)
  - All I/O operations (when needed) via RPC

15 Single muon events Local Pool
- Increasing computational load for the different momenta
  - 4 GeV, 25 GeV, 200 GeV
- Comparison of the CPU time on Condor with that of identical simulations run locally
- Normalization of the CPU time on Condor, accounting for the different CPU power of the nodes used (plus some other considerations, like memory, etc.)
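The slide does not give the normalization formula used in the thesis. One common way to compare runs on heterogeneous nodes is to scale each node's CPU time by its power relative to a reference machine; the sketch below assumes that approach, and every name and number in it is hypothetical.

```python
def normalized_cpu_hours(cpu_hours, node_power, reference_power=1.0):
    """Express CPU time measured on a node in reference-machine hours.

    node_power and reference_power are relative speed factors (assumed
    to come from some benchmark); a node half as fast as the reference
    has node_power = 0.5 when reference_power = 1.0.
    """
    if node_power <= 0 or reference_power <= 0:
        raise ValueError("power factors must be positive")
    return cpu_hours * node_power / reference_power


def overhead_ratio(condor_runs, local_hours):
    """Ratio of normalized Condor CPU time to the local run's CPU time.

    condor_runs is a list of (cpu_hours, node_power) pairs, one per node.
    A ratio near 1.0 means running under Condor cost little extra CPU.
    """
    total = sum(normalized_cpu_hours(hours, power) for hours, power in condor_runs)
    return total / local_hours


# Hypothetical numbers, for illustration only.
runs = [(10.0, 0.5), (4.0, 1.0)]
print(overhead_ratio(runs, local_hours=9.0))
```

Memory and the other considerations mentioned on the slide are not modelled here.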

16 Single muon events WAN Pool
- Same kind of jobs
  - Would have required about a week to execute on the Local Pool
  - Got the results in ~3 days
- Same normalization of the CPU time
- Estimate of the WAN running load

17 Some (historical) Issues
- During that “first Data Challenge” we faced the “data” problem for the first time:
  - We were worried about the I/O of the jobs over the LAN and WAN
    - And we discovered that the simulation jobs were so CPU intensive that it was a negligible problem, even with the bandwidths of that time
    - It might be a problem with the current CPUs
  - But we had to cope with the disk space of the submitting machine
    - And then we had to find a way to make the produced data available for access (copies)
  - Nowadays we know that the real problem is not the distributed computing, but the distribution of data accesses
- Another point was the predictability of the Condor System
  - I remember long discussions with Miron and Paolo (in his office), to try to understand if Condor could be a solution for “Distributed Analysis”
    - Is it solved?
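A back-of-the-envelope estimate shows why the I/O turned out to be negligible. The sketch below combines figures from other slides: ~130 KBytes/event (or ~600 for the heavier jobs), ~1.4 min/event, and up to 19 machines. Pairing the per-event time of the Higgs tests with these I/O sizes is an assumption, since the time per event varied by sample.

```python
def io_rate_kb_per_s(kb_per_event, min_per_event):
    """Average I/O rate of one job: data per event over time per event."""
    return kb_per_event / (min_per_event * 60.0)


typical = io_rate_kb_per_s(130, 1.4)   # most jobs, read (same amount written)
heavy = io_rate_kb_per_s(600, 1.4)     # jobs with larger I/O
aggregate = 19 * typical               # if all 19 machines ran typical jobs

print(f"typical job: {typical:.2f} KB/s")
print(f"heavy job:   {heavy:.2f} KB/s")
print(f"19 machines: {aggregate:.1f} KB/s")
```

Under these assumptions a single job reads only a few KBytes per second, and the whole pool a few tens of KBytes per second.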

18 Conclusion
- CMS (Bologna) started at that time to use “distributed computing” to perform a “simulation challenge”
  - We found everything (mostly) ready, thanks to Condor
  - And it was a success!
- CMS (at large) has gone through many “computing, data and analysis challenges” since then
  - Many of them were successful (and we hope we will be successful with the “real challenge” of “real data”)
  - However, from that exercise in 1998-99 we learnt a lot:
    - Distributed Services, Coordination, etc.
    - And very important: Robustness of the underlying software!
- That (modest) Data Challenge was the precursor of a GRID activity that has taken most of our time ever since …

19
- First evaluations (Ilaria)
- Running the production
  - Problems
  - People
  - Pools
  - Resources
- Results (Bo+Pd)
- Some issues
  - Historical (Miron & Paolo presentations)
  - Dependencies of the available Condor (CPU vs I/O)
  - Predictability of the results, or simulation vs analyses
- Conclusions
  - First “distributed” CMS challenge
  - Grid precursor
