The First CMS Data Challenge (~1998/99) Using Condor
P. Capiluppi, June 29, 2006
Slide 2: Disclaimer
- Official presentations of those activities are no longer available...
  - It was a long time ago; the machines used were decommissioned long ago, and files were lost with the dismissed disks.
  - Only fragments of information are still around, mostly on "printed" slides and unlinked Web pages.
  - And ... my memory is not what it was at that time.
- However, I could find some information and, unsurprisingly, a number of "well known" names.
  - Any list of them would certainly forget somebody, so I will avoid making one, but PAOLO MAZZANTI deserves to be mentioned!
Slide 3: Environment and History
- CMS Simulation program (CMSIM) using Geant3 (Fortran), with different versions in rapid development.
- Objectivity was used by CMS at that time.
- First CMS reconstruction programs using C++.
- SUN OS and HP Unix were the basic CMS operating systems.
  - But Linux was growing rapidly, and we had a legacy of many Digital Alphas from LEP.
- Around year 2000 INFN started to fund PC farms.
- In 1999 INFN launched the INFN-Grid project.
- The MONARC project was running at CERN.
  - Then ... we were flooded by GRID and Tiers.
Slide 4: The Data Challenge start ...
- From the minutes of a meeting of 14 May 1998:
  - Need to generate events of single muons (3 different momenta) and events of Higgs -> 2 muons (3 different masses).
  - To be done over Condor, starting June 1998.
  - The CMSIM code had been ported from SUN to Alpha; it needed to be "linked" with the Condor libraries (see the sketch after this slide).
  - Local test runs of the Higgs simulation gave ~1.4 min/event on both Alpha and SUN (with ~5 min of program initialization): more than 700 hours of CPU time for that sample of events.
- From another meeting of 13 May 1998:
  - Planning of the National (INFN) Condor pool (~57 machines available).
  - CMSIM is one of the possible applications over the WAN.
  - GARFIELD (electric-field simulation of the CMS Muon Detector DC cells) will run only locally (checkpoint file too big! Less than a typical mail attachment of today ...).
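As an illustration of what "linking with the Condor libraries" involved, here is a minimal sketch (written in Python only for readability) of relinking a CMSIM-like executable with condor_compile for Condor's checkpointing "standard" universe and submitting a batch of jobs from a generated submit description. The executable name, object files, card files and job counts are hypothetical; only condor_compile, condor_submit and the standard-universe submit keywords are actual Condor features.

```python
#!/usr/bin/env python
"""Hypothetical sketch: relink a CMSIM-like Fortran program with the Condor
checkpointing libraries and submit N simulation jobs to the pool.
All file names, paths and counts are made up for illustration."""

import subprocess
import textwrap

N_JOBS = 30          # e.g. one job per dataset (assumption)
EXE = "cmsim_alpha"  # hypothetical name of the relinked executable

# Step 1: relink the Fortran objects through condor_compile so the job can
# be checkpointed and migrated (Condor "standard" universe).
subprocess.run(
    ["condor_compile", "f77", "-o", EXE, "cmsim_main.o", "cmsim_libs.a"],
    check=True,
)

# Step 2: write a submit description; one process per dataset.
submit = textwrap.dedent(f"""\
    universe   = standard
    executable = {EXE}
    arguments  = dataset_$(Process).ffcards
    output     = cmsim_$(Process).out
    error      = cmsim_$(Process).err
    log        = cmsim.log
    queue {N_JOBS}
    """)
with open("cmsim.sub", "w") as f:
    f.write(submit)

# Step 3: hand the jobs over to the pool.
subprocess.run(["condor_submit", "cmsim.sub"], check=True)
```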
Slide 5: The challenge ... before starting
- We (CMS Bologna) were already using Condor!
Slide 6: Method Used to Produce the Drift Times (from the 9/12/1998 report)
- Full simulation on ALPHA machines; Bologna Condor facility used.
- Four tracks for each x, ..., B considered.
- For each track we assumed the drift time is given by: 50% one electron, 40% two electrons, 10% three electrons.
Slide 7: Drift Lines when Bw = 0.3 T (plot, from the 9/12/1998 report)
Slide 8: And we did start
- A strange (to my mind) CMS Simulation statement (dated 20 Apr 1998).
- The objective was to measure the throughput (in terms of CMS simulated events per hour) of our Condor Pool ...
  - At the beginning we had some compatibility problems between the CERN Library and the Condor libraries, but the Condor Team promptly solved them. This has to be stressed again: the support from the Condor team is very good!
- Indeed, in that period we (CMS Italy) started to support the Condor team (a concrete, even if small, contribution).
- The number of machines running the simulation under Condor ranged from 9 to 19!
  - About 40% of the jobs were checkpointed (note that in the CMS case the checkpoint file was of the order of 66 MB!).
Slide 9: The real challenge (1/2)
- CMSIM jobs were mostly CPU intensive.
  - Very small I/O, compared to the CPU time required to simulate the (carefully chosen) number of events per job.
  - Executable of the order of 140 MBytes.
  - Some of the simulation programs required access to input data (via RPC, not NFS, even in the "local" environment of Bologna).
    - Small in size in any case: ~130 KBytes/event read, and the same amount written.
    - Some of the jobs had a larger I/O: ~600 KBytes/event.
- Propagation of the random seed for the simulation among the jobs required careful bookkeeping (done by hand at that time; a sketch of the idea follows this slide).
- Coordination between different activities over the Condor Pool(s).
  - We were not the only users, and some of the time constraints of the production required coordination.
  - In particular, when going to the national WAN implementation, we faced large fluctuations in response time and in the consistency of local machines.
    - Well known, nowadays, in the Grid ...
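The seed bookkeeping mentioned above was done by hand; as a purely illustrative sketch (not the procedure actually used), one way to guarantee non-overlapping, reproducible random seeds across jobs is to derive each job's seed deterministically from the dataset and job indices and record every assignment. The stride and the CSV format are assumptions.

```python
import csv

def job_seed(dataset_id: int, job_index: int, base: int = 12345) -> int:
    """Derive a unique, reproducible seed for (dataset, job).
    The stride of 1000 is an arbitrary assumption; it just keeps the
    seed ranges of different datasets well separated."""
    return base + dataset_id * 1000 + job_index

# Record every assignment so a failed job can be re-run with the same seed.
with open("seed_bookkeeping.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["dataset", "job", "seed"])
    for dataset in range(30):        # e.g. 30 single-muon datasets
        for job in range(4):         # hypothetical number of jobs per dataset
            writer.writerow([dataset, job, job_seed(dataset, job)])
```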
Slide 10: The real challenge (2/2)
- Going from SUN OS to Alpha OS required some different configuration and, of course, recompilation.
  - Some of the CMSIM Fortran packages for a CMS sub-detector could not be ported, so they were dropped.
    - Fortunately not important for the physics scope.
- All the jobs were submitted via a single SUN station.
  - Limited resources for the input and output of the many jobs.
  - Complicated procedure to make the Alpha executable available.
  - Single point of failure for all the simulations.
  - And all the participating people had to have a local account and coordinate among themselves.
- The results of the simulation had to be made available to all of CMS.
  - Some GBytes of data over AFS? Not possible at that time: procedures were needed to export the data (FTS) and to store them permanently (a trivial local tape system).
Slide 11: The successful Challenge
- Looking back at the (lost) Web pages, the following were available in Bologna (Oct 1998):
  - 30 datasets, each of 4000 events, of single-muon signal at 4 GeV
  - 30 datasets, each of 4000 events, of single-muon signal at 25 GeV
  - 30 datasets, each of 4000 events, of single-muon signal at 200 GeV
  - 30 datasets, each of 1000 events, of Higgs events at the planned masses
- All the data were produced in a remarkably short time, given the resources dedicated to the experiment in Bologna.
  - As an example, a dataset was produced in 3 days on the Condor Pool, against 17 days on a dedicated machine!
- Condor proved to be VERY robust against machine crashes and network interruptions.
  - We experienced both network and machine crashes: in both cases we could recover the "running" jobs without human intervention (more or less ...).
  - Condor's checkpointing was a key feature in this scenario.
Slide 12: And we continued ... (Bologna + Padova)
- From an October 1999 report: 15 days on 6 SUNs in the Condor Pool of Padova, and the same effort on the Bologna Pool.
Slide 13: The machines (resources) used
- Bologna Condor Pool:
  - 19 Digital Alpha, Unix 4.0x
  - 3 HP-UX
  - 8 PC Linux
    - We used them!
  - 2 SGI, IRIX 6.2 or 6.3
  - 1 SUN, Solaris 2.5
  - Located in two WAN-connected sites: RPC access.
- The INFN WAN Condor pool:
  - 48 Digital Alpha (various Unix releases)
  - 14 HP-UX
  - 17 PC Linux
  - 2 SGI IRIX
  - 1 SUN Solaris
Slide 14: Performance evaluations (CMSIM on Condor)
- A Computer-Science thesis by Ilaria Colleoni ( ) (co-tutor: C. Grandi).
  - An attempt to evaluate numerically the running of CMSIM on Condor.
  - With "real" simulation jobs of different computing loads:
    - Single muons (4 GeV, 25 GeV, 200 GeV)
    - Higgs (2 muons) of different masses
    - CPU times/job: from ~4 hours up to ~45 hours
  - Both in a local Condor Pool (Bologna) and in the INFN WAN Condor environment.
  - Alpha platform used, but the submitting machine was a SUN.
  - Checkpointing enabled (executable ~140 MB).
  - All I/O operations (when needed) via RPC.
Slide 15: Single muon events, Local Pool
- Increasing computational load for the different momenta: 4 GeV, 25 GeV, 200 GeV.
- Comparison of the CPU time on Condor with an identical simulation run locally.
- Normalization of the CPU time on Condor to account for the different CPU power of the nodes used (plus some other considerations, like memory, etc.); see the sketch below.
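The details of the thesis are not available, but a plausible form of such a normalization (an assumption on my part, not the procedure actually used) is to rescale each node's measured CPU time by its relative CPU power with respect to the reference machine used for the local runs:

```python
def normalized_cpu_time(measured_cpu_time: float,
                        node_power: float,
                        reference_power: float) -> float:
    """Rescale the CPU time measured on a pool node to the equivalent time
    on the reference machine. 'Power' can be any consistent benchmark
    rating; only the ratio matters (hypothetical normalization form)."""
    return measured_cpu_time * node_power / reference_power

# Example: 10 h measured on a node twice as powerful as the reference
# corresponds to ~20 h of reference-machine CPU time.
print(normalized_cpu_time(10.0, node_power=2.0, reference_power=1.0))
```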
Slide 16: Single muon events, WAN Pool
- Same kind of jobs.
  - They would have required about a week to execute on the Local Pool; we got the results in ~3 days.
- Same normalization of the CPU time.
- Estimate of the load when running over the WAN.
Slide 17: Some (historical) Issues
- During that "first Data Challenge" we faced the "data" problem for the first time:
  - We were worried about the I/O of the jobs, over the LAN and the WAN.
    - And we discovered that the simulation jobs are so CPU intensive that it was a negligible problem, even with those bandwidths.
    - It might be a problem with the current CPUs.
  - But we had to cope with the disk space of the submitting machine.
    - And then we had to find a way to make the produced data available for access (copies).
  - Nowadays we know that the real problem is not the distributed computing, but the distribution of data accesses.
- Another point was the predictability of the Condor system.
  - I remember long discussions with Miron and Paolo (in his office), trying to understand whether Condor could be a solution for "Distributed Analysis".
    - Is it solved?
Slide 18: Conclusion
- CMS (Bologna) started at that time to use "distributed computing" to perform a "simulation challenge".
  - We found everything (mostly) ready, thanks to Condor.
  - And it was a success!
- CMS (at large) has gone through many "computing, data and analysis challenges" since then.
  - Many of them were successful (and we hope we will be successful with the "real challenge" of "real data").
  - However, from that exercise in 1998/99 we learnt a lot:
    - Distributed services, coordination, etc.
    - And, very important: robustness of the underlying software!
- That (modest) Data Challenge was the precursor of a GRID activity that, since then, has taken most of our time ...
Slide 19
- First evaluations (Ilaria)
- Running the production
  - Problems, people, pools, resources
- Results (Bologna + Padova)
- Some issues
  - Historical (Miron & Paolo presentations)
  - Dependencies of the available Condor (CPU vs I/O)
  - Predictability of the results, or simulation vs analyses
- Conclusions
  - First "distributed" CMS challenge
  - Grid precursor