CMS Data Challenge 2004
Claudio Grandi, CMS Grid Coordinator
EGEE Cork Conference, April 19th 2004
www.eu-egee.org
EGEE is a project funded by the European Union under contract IST-2003-508833
Contents
- Definition of DC04
- Pre-Challenge Production (PCP04)
- PCP on grid
- Description of DC04 setup
- RLS
- Preliminary results
- Appendix: DC04 setup schemas (for the really interested people)
Definition of DC04
Aim of DC04:
- reach a sustained 25 Hz reconstruction rate in the Tier-0 farm (25% of the target conditions for LHC startup)
- register data and metadata to a catalogue
- transfer the reconstructed data to all Tier-1 centers
- analyze the reconstructed data at the Tier-1’s as they arrive
- publicize to the community the data produced at Tier-1’s
- monitor and archive performance criteria of the ensemble of activities for debugging and post-mortem analysis
Not a CPU challenge, but a full chain demonstration!
Pre-challenge production in 2003/04:
- 70M Monte Carlo events (20M with Geant-4) produced
- Classic and grid (CMS/LCG-0, LCG-1, Grid3) productions
- Digitization still going on “in background”
Pre-Challenge Production setup
[Diagram: PCP workflow. Actors: Phys. Group asks for a new dataset; Production Manager defines assignments; Site Manager starts an assignment. Components: RefDB (dataset metadata, data-level query), McRunjob + plug-in CMSProd, BOSS DB (job metadata, job-level query), shell scripts with Local Batch Manager on a computer farm, JDL with Grid (LCG) Scheduler on LCG-0/1 and RLS, DAG job with DAGMan (MOP) on Grid3, Chimera VDL with Virtual Data Catalogue and Planner. Legend: push data or info; pull info.]
Statistics for PCP
- 750K jobs
- 3500 KSI2000 months
- 700K files
- 80 TB of data
[Plots: Simulation and Digitization production versus time, each with the start of DC04 marked.]
PCP on grid: CMS-LCG
Gen+Sim on LCG (CMS-LCG Regional Center):
- 0.5 Mevts “heavy” pythia: ~2000 jobs, ~8 hours each, ~10 KSI2000 months
- 2.1 Mevts cmsim+oscar: ~8500 jobs, ~10 hours each, ~130 KSI2000 months
- ~2 TB data
CMS/LCG-0:
- Joint project CMS-LCG-EDT
- Based on LCG pilot distribution
- Including GLUE, VOMS, GridICE, RLS
- About 170 CPU’s and 4 TB disk
- Sites: Bari, Bologna, Bristol, Brunel, CERN, CNAF, Ecole Polytechnique, Imperial College, ISLAMABAD-NCP, Legnaro, Milano, NCU-Taiwan, Padova, U.Iowa
[Diagram: Gen+Sim on LCG (CMS/LCG-0 and LCG-1). Components: RefDB (dataset metadata), McRunjob + ImpalaLite, JDL, UI, RB, bdII, CE’s, WN, SE’s, RLS, BOSS (job metadata). Legend: push data or info; pull info.]
PCP on Grid: Grid3
Simulation on Grid3 (USMOP Regional Center):
- 7.7 Mevts pythia: ~30000 jobs, ~1.5 min each, ~0.7 KSI2000 months
- 16 Mevts cmsim+oscar: ~65000 jobs, ~10 hours each, ~1000 KSI2000 months
- ~12 TB data
- Still running!!!
Grid3:
- US grid projects + US LHC expt.’s
- Over 2000 CPU’s in 25 sites
MOP:
- DAGMan and Condor-G for specification and submission
- Condor-based match-making process selects resources
[Diagram: MOP System. Master Site: MCRunJob, mop_submitter, DAGMan, Condor-G. Remote Sites 1 to N: GridFTP, Batch Queue.]
DC04 layout
[Diagram: DC04 layout. Tier-0: Castor, fake on-line process, IB, RefDB, POOL RLS catalogue, TMDB, ORCA RECO Job, GDB, data distribution agents, EB, LCG-2 Services. Each of three Tier-1’s: Tier-1 agent, T1 storage, MSS, ORCA Analysis Job, Grid Job. Each of three Tier-2’s: Physicist, T2 storage, ORCA Local Job.]
Main aspects of DC04 1/2
- Maximize reconstruction efficiency
  - no interactions of Tier-0 jobs with outside components
- Automatic registration and distribution of data via a set of loosely coupled agents
- Support a (reasonable) variety of data transfer tools
  - SRB (RAL, GridKA, Lyon, with Castor, HPSS and Tivoli SE)
  - LCG Replica Manager (CNAF, PIC, with SE/Castor)
  - SRM (FNAL, with dCache/Enstore)
- Use a single file catalogue (accessible from Tier-1’s)
  - RLS used for data and metadata (POOL) by all transfer tools
  - Test replica at CNAF (via ORACLE multi-master mirroring)
- Transfer Management DB (TMDB) used for assigning data to Tier-1’s and for inter-agent communication
- Failover systems and automatic recovery
Main aspects of DC04 2/2
- Monitor and archive resource and process information
  - MonALISA used on almost all resources
  - GridICE used on all LCG resources (including WN’s)
  - LEMON on all IT resources
  - Ad-hoc monitoring of TMDB information
- Job submission at Regional Centers left to their choice
- Using LCG-2 in Italy and Spain and at most Tier-2’s
  - Copy of the LCG-2 bdII at CERN also includes CMS-only resources
  - Submission via a dedicated Resource Broker at CERN
  - Using the official RLS at CERN; will use the RLS mirror at CNAF
  - Using the official LCG-2 VOMS
  - Software installation via the new LCG tools (CMS Software Manager)
- User analysis
  - Prototyping GROSS: based on BOSS, supports user analysis on LCG
CMS software and POOL
- Reconstruction and analysis: using ORCA
  - DST have links to raw data but may be processed without raw data
  - Event streams operational
- Persistency through POOL
  - All jobs use local XML catalogues
  - Updates to central RLS catalogue only done for successful jobs:
    - using an external agent for reconstruction jobs at Tier-0
    - in the job wrapper for user jobs
- SCRAM re-creates run-time environment on Worker Nodes
[Diagram: event streams. Labels: Runs; Trigger; Digis; L1; DiMuon Stream; Tracks and Partial Muon Reconstruction; Full DST including Tracks, Muons, Cluster, jets.]
Use of POOL-RLS catalogue
- RLS used as a POOL catalogue
  - Register files with their POOL metadata
  - Query metadata to determine where to send files
  - Register physical location of files on Tier-0 Export Buffers
  - Use catalogue to replicate files to Tier-1’s
- Tools have been developed to synchronize SRB-GMCAT and RLS
- Local POOL catalogues at Tier-1’s are optionally populated
- Analysis jobs on LCG use the catalogue through the Resource Broker to submit jobs close to the data
- Analysis jobs on LCG register their private data
- Replication via ORACLE multi-master mirroring
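The catalogue operations listed on this slide (register a file with metadata, query by metadata, add a replica location) can be illustrated with a toy in-memory model. This is only a sketch: the class `ToyCatalogue`, its method names, the GUIDs and the example URLs are all hypothetical and do not reproduce the RLS or POOL interfaces.

```python
from collections import defaultdict


class ToyCatalogue:
    """In-memory stand-in for a replica catalogue: GUID -> metadata and PFNs."""

    def __init__(self):
        self.metadata = {}                # guid -> dict of metadata attributes
        self.replicas = defaultdict(set)  # guid -> set of physical file names

    def register(self, guid, pfn, **metadata):
        """Register a file with its first physical location and its metadata."""
        self.metadata.setdefault(guid, {}).update(metadata)
        self.replicas[guid].add(pfn)

    def add_replica(self, guid, pfn):
        """Record an additional physical copy of an already registered file."""
        if guid not in self.metadata:
            raise KeyError(f"unknown GUID {guid}")
        self.replicas[guid].add(pfn)

    def query(self, **wanted):
        """Return the GUIDs whose metadata match all requested attributes."""
        return sorted(
            guid for guid, md in self.metadata.items()
            if all(md.get(key) == value for key, value in wanted.items())
        )


# Hypothetical file names and stream labels, for illustration only.
cat = ToyCatalogue()
cat.register("guid-001", "rfio://t0-eb.example/reco_001.root", stream="DiMuon")
cat.register("guid-002", "rfio://t0-eb.example/reco_002.root", stream="Full DST")
cat.add_replica("guid-001", "srm://t1.example/reco_001.root")
dimuon = cat.query(stream="DiMuon")
print(dimuon, sorted(cat.replicas["guid-001"]))
```

Keeping metadata and replica locations under the same GUID is what lets a single query answer both "which files belong to this stream" and "where are their copies".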
Description of RLS usage
1. Register files (XML Publication Agent)
2. Find Tier-1 location, based on metadata (Configuration agent)
3. Copy files to export buffers (RM/SRM/SRB EB agents)
4. Copy files to Tier-1’s (Tier-1 Transfer agent)
5. Submit analysis job (Resource Broker)
6. Process DST and register private data (ORCA Analysis Job)
[Diagram: RLS usage. Other components: POOL RLS catalogue, RLS replica (ORACLE mirroring), TMDB, Local POOL catalogue, SRB GMCAT, Replica Manager, LCG, Grid Production Job.]
RLS performance
[Plot: RLS registration performance, with marks at April 2nd, 18:00, at 0.4 files/s and at 25 Hz.]
● Left axis: time to register the output of a single job (16 files)
● Right axis: load on client machine at the time of registration
Statistics for DC04
- 2200 jobs/day (about 500 CPU’s) running at Tier-0
- 4 MB/s produced and distributed to each Tier-1
- 0.4 files/s registered to RLS (with POOL metadata)
- Reconstruction: 25 Hz, 15 Mevt/week
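The figures on this slide can be cross-checked against each other with simple arithmetic. The sketch below assumes a full week of uninterrupted running and takes the 16 output files per job from the RLS performance slide; the variable names are illustrative.

```python
# Consistency check of the DC04 figures quoted on the slides.
SECONDS_PER_DAY = 24 * 60 * 60

reco_rate_hz = 25     # sustained reconstruction rate (events per second)
jobs_per_day = 2200   # jobs running at Tier-0 per day
files_per_job = 16    # output files per job (from the RLS performance slide)

# Events reconstructed in one week (assumption: no downtime).
events_per_week = reco_rate_hz * SECONDS_PER_DAY * 7

# File registration rate implied by the job rate.
files_per_second = jobs_per_day * files_per_job / SECONDS_PER_DAY

print(f"{events_per_week / 1e6:.1f} Mevt/week")  # 15.1 Mevt/week
print(f"{files_per_second:.2f} files/s")         # 0.41 files/s
```

Both derived values agree with the quoted 15 Mevt/week and 0.4 files/s.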
Preliminary results
- The full chain is demonstrated for a limited amount of time
  - Reconstruction, data transfer and analysis may run at 25 Hz
  - When too many files are registered in the system, it slows down below the 25 Hz threshold
- Identified the main areas for improvement:
  - Reduce number of files (increase <#events>/<#files>)
    - more efficient use of bandwidth
    - fixed time to “start-up” dominates command execution times (e.g. Java for EDG commands, or positioning of tape drives)
    - address scalability of MSS systems
    - reduce load on databases indexed by files (e.g. POOL cat.)
  - Improve handling of file metadata in catalogues
    - RLS too slow both inserting and extracting full file records
    - introduce the concept of “file-set” to support bulk operations
  - Need to manage read-write “objects” to store event metadata
    - needed to cope with evolving datasets!
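The remark that a fixed start-up time dominates command execution, and the proposed "file-set" bulk operations, can be illustrated with a simple cost model. This is a sketch under invented numbers: `STARTUP_S` and `PER_FILE_S` are hypothetical costs chosen only to show the trend, not measurements from DC04.

```python
def time_per_file(startup_s, per_file_s, files_per_call):
    """Average wall time per file when one command handles a batch of files."""
    return startup_s / files_per_call + per_file_s


# Hypothetical costs, chosen only to illustrate the trend.
STARTUP_S = 2.0   # fixed start-up of one command invocation
PER_FILE_S = 0.1  # marginal cost of handling one file record

for batch in (1, 16, 100):
    t = time_per_file(STARTUP_S, PER_FILE_S, batch)
    print(f"{batch:4d} files/call -> {t:.3f} s/file, {1 / t:.2f} files/s")
```

With these assumed costs the per-file time falls quickly as the batch grows, which is the motivation for grouping files into sets and for fewer, larger files.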
Appendix DC04 set-up schemas …only if you’re really interested!
Fake on-line operations
[Diagram: fake on-line operations. Components: Dataset priority list (PRS), Castor, 25 Hz fake on-line process, Input Buffer (Digi files, Digi+Hits COBRA metadata), RefDB, POOL RLS catalogue, TMDB. Arrows: 1. get dataset file names; 2. stage; 3. attachRun; 4. get POOL fragment; 5. register PFN & metadata; 6. insert new “request”; 7. insert. Legend: reference to; push or create; read.]
The Input Buffer (IB) is a Castor stage area and is available through rfio (assumed in the following) or gridftp from all Tier-0 Worker Nodes (WN). Files for all digi runs are in Castor. POOL metadata for all digi runs are in the RefDB. PRS have provided a sorted list of runs (i.e. groups of files) to be injected in the system. An empty version of the COBRA metadata for all digis that have to be processed is available on the Input Buffer. An empty version of the hits COBRA metadata is available (with the geometry). The fake on-line process stages on the Input Buffer a number of runs with a frequency corresponding to 25 Hz (e.g. 1 run of 250 events every 10 seconds).
The sequence of the operations is:
1. The fake on-line process reads from the Dataset priority list the group of files to be staged.
2. The fake on-line process stages the files from Castor to the Input Buffer.
3. The fake on-line process updates the relevant COBRA metadata files with the info of the new run (i.e. attachRun).
4. The fake on-line process extracts the POOL metadata of the run from the RefDB.
5. The fake on-line process registers the staged files in the CERN RLS POOL catalogue with their metadata.
6. The fake on-line process registers the request for a new digi run to be processed in a dedicated DB (the RefDB is assumed in the following).
7. The fake on-line process registers the digi files in the Transfer Management Database (if they have to be transferred to Tier-1’s).
Pre-conditions:
- empty Digi and Hits COBRA metadata available
- RefDB has POOL metadata for Digis
Post-conditions:
- input buffer filled with Digi files and consistent COBRA metadata
- POOL catalogue correctly filled
- entry in RefDB specifies new job to be run
- entry in Transfer Management DB for digi files (if transferring Digi files)
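The pacing rule quoted on this slide (1 run of 250 events every 10 seconds for 25 Hz) follows from dividing the run size by the target rate. A minimal sketch of that calculation and of the resulting injection schedule is given below; the function names and run identifiers are hypothetical, and a real agent would sleep until each offset and then perform the staging steps.

```python
def staging_interval(events_per_run, target_rate_hz):
    """Seconds between run injections needed to sustain the target event rate."""
    return events_per_run / target_rate_hz


def staging_schedule(run_ids, events_per_run, target_rate_hz, start=0.0):
    """Yield (time_offset_s, run_id) pairs for a list of runs of equal size."""
    step = staging_interval(events_per_run, target_rate_hz)
    for index, run_id in enumerate(run_ids):
        yield start + index * step, run_id


# Hypothetical run identifiers; 250 events per run at 25 Hz as on the slide.
schedule = list(staging_schedule(["run-A", "run-B", "run-C"], 250, 25))
for offset, run_id in schedule:
    print(f"t = {offset:5.1f} s: stage {run_id}")
```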
Tier-0 job preparation operations
[Diagram: Tier-0 job preparation. Components: RefDB, Job preparation agent, POOL RLS catalogue, Input Buffer (Digi files, Digi+Hits COBRA metadata), General Dist. Buffer (Empty Reco COBRA metadata), XML catalogue, ORCA RECO script & .orcarc, LSF. Arrows: 1. discover; 2a. POOL cat. read; 2b. POOL publish; 3. McRunJob create; 4. McRunJob run.]
The General Distribution Buffer (GDB) is a Castor stage area and is available through rfio (assumed in the following) or gridftp from all Tier-0 Worker Nodes (WN). An empty version of the Reco COBRA metadata is available on the General Distribution Buffer.
The sequence of the operations is:
1. The job preparation agent discovers from the RefDB that a new run is available for processing.
2. The job preparation agent prepares an XML POOL catalogue with the information about the input files and the empty Reco COBRA metadata file, which is assumed to be local to the job at run time.
3. The job preparation agent invokes McRunJob (or equivalent) to produce the job script and the other needed files (e.g. the .orcarc).
4. The job preparation agent invokes McRunJob (or equivalent) to submit the job to the local resource manager (LSF).
It is assumed that BOSS is used for job monitoring, but it is not relevant for this discussion.
Pre-condition:
- Empty Reco COBRA metadata file is available and registered in POOL
Post-conditions:
- XML catalogue to be used by the job is ready
- execution script and accessory files are ready
- job is submitted to LSF
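Step 2 of the sequence above, preparing an XML catalogue that lists the job's input files, might look like the following sketch. The element and attribute names (`POOLFILECATALOG`, `File`, `physical`, `pfn`, `logical`, `lfn`) are an assumption about the catalogue layout rather than something stated in the slides, and the GUIDs and paths are invented.

```python
import xml.etree.ElementTree as ET


def build_catalogue(entries):
    """Build an XML catalogue tree from (guid, pfn, lfn) tuples.

    The element layout is an assumed, simplified catalogue format.
    """
    root = ET.Element("POOLFILECATALOG")
    for guid, pfn, lfn in entries:
        file_el = ET.SubElement(root, "File", ID=guid)
        physical = ET.SubElement(file_el, "physical")
        ET.SubElement(physical, "pfn", name=pfn, filetype="ROOT_All")
        logical = ET.SubElement(file_el, "logical")
        ET.SubElement(logical, "lfn", name=lfn)
    return root


# Hypothetical input files of one digi run.
catalogue = build_catalogue([
    ("guid-001", "rfio:/castor/example/digi_001.root", "digi_001.root"),
    ("guid-002", "rfio:/castor/example/digi_002.root", "digi_002.root"),
])
print(ET.tostring(catalogue, encoding="unicode"))
```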
Tier-0 reconstruction
[Diagram: Tier-0 reconstruction. Components: LSF, ORCA RECO script & .orcarc, Original XML catalogue, Input Buffer (Digi files, Digi COBRA metadata), ORCA RECO Job, General Dist. Buffer (Reco files, Empty Reco COBRA metadata, XML fragment, Checksum file), Castor, XML Publ. Agent, POOL RLS catalogue, TMDB Agent, TMDB, Summary e-mail, RefDB updater, RefDB. Arrows: 1. execute; 2. read catalogue; 3. rfcp (download); 4. update XML with local copy of Reco COBRA metadata; 5. rfcp (upload); 6. diff; 7. write; 8. send e-mail; 9. Discover XML catalog; 10. register files & metadata; 11. Discover cksm file; 12. insert; 13. read e-mail; 14. update.]
The sequence of the operations is:
1. LSF sends the job to a Worker Node. The job will use the executable script prepared at the previous step.
2. The job reads the XML catalogue to discover what input files are needed.
3. The job downloads the needed input files to the local WN work area (an alternative is that they’re accessed from the Input Buffer through rfio; the Empty Reco COBRA metadata file is ALWAYS downloaded to the local work area).
4. The job updates the local version of the XML catalogue to use the local files. In any case the local version of the Reco COBRA metadata is used.
5. At the end of the execution the job uploads the new Reco files to the General Distribution Buffer. The files are automatically archived in Castor. The local version of the Reco COBRA metadata is discarded.
6. The job updates the XML catalogue with the location of the files in the General Distribution Buffer (i.e. Castor) and writes an XML fragment.
7. The job writes a file with checksums of produced files.
8. The job sends an e-mail with summary information of the job to RefDB.
9. The XML Publication Agent discovers new XML catalogue files.
10. The XML Publication Agent publishes the XML catalogue in the CERN RLS catalogue.
11. The TMDB agent discovers new checksum files.
12. The TMDB agent inserts new files in the Transfer Management DB (TMDB).
13. The RefDB updater process reads e-mail.
14. The RefDB updater process updates the RefDB with job information.
Post-conditions:
- Reco files are on the General Distribution Buffer and on tape
- POOL catalogue correctly updated with Reco files
- Reco file entries are inserted in the Transfer Management Database
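Steps 7 and 11 to 12 above rely on a checksum file written by the job and later picked up by the TMDB agent. The slides do not specify the checksum algorithm or the file layout, so the sketch below assumes CRC-32 and a "checksum size name" line format purely for illustration; the file names are invented.

```python
import os
import tempfile
import zlib


def file_checksum(path, chunk_size=1 << 20):
    """CRC-32 of a file, read in chunks (the algorithm is an assumption)."""
    crc = 0
    with open(path, "rb") as handle:
        while chunk := handle.read(chunk_size):
            crc = zlib.crc32(chunk, crc)
    return crc


def write_checksum_file(paths, out_path):
    """Job side: one 'checksum size name' line per produced file."""
    with open(out_path, "w") as out:
        for path in paths:
            out.write(f"{file_checksum(path):08x} {os.path.getsize(path)} "
                      f"{os.path.basename(path)}\n")


def read_checksum_file(path):
    """Agent side: parse the checksum file into {name: (crc, size)}."""
    entries = {}
    with open(path) as handle:
        for line in handle:
            crc, size, name = line.split(maxsplit=2)
            entries[name.strip()] = (int(crc, 16), int(size))
    return entries


# Demonstration with a fake output file in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    reco = os.path.join(tmp, "reco_001.root")
    with open(reco, "wb") as handle:
        handle.write(b"fake reco payload")
    checksum_path = os.path.join(tmp, "job_001.cksum")
    write_checksum_file([reco], checksum_path)
    entries = read_checksum_file(checksum_path)
print(entries)
```

An agent that polls for such files can verify size and checksum again after each copy before recording the file as transferred.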
Data distribution @ Tier-0
[Diagram: data distribution at Tier-0. Components: POOL RLS catalogue, Configuration agent, Transfer Manag. DB, Tier-1, Clean-up agent, RM/SRM/SRB clean-up agent, EB agent, Input Buffer (Digi files), General Dist. Buffer (Reco files), export buffers (dCache SRM, SE RM, SRB Vault SRB). Arrows: 1. new file discovery; 2. get metadata; 3. Assign file to Tier-1; 4. discover; 5a. copy (read); 5b. copy (write); 6. add PFN; 7. update; 8. discover; 9. update; 10. check; 11. delete; 12. Delete PFN; 13. check; 14. purge; 15. update.]
It is assumed that the CERN Castor will have an SRM interface. The Castor stage area is assumed to be big enough to be used as SRM Export area for SRM Tier-1’s.
The sequence of the operations is:
1. The Configuration agent queries the Transfer Management Database to discover the existence of new files to be transferred.
2. The Configuration agent extracts the metadata from the RMC to determine the stream to which the file belongs.
3. The Configuration agent updates the Transfer Management DB to assign files to Tier-1’s. The operation should ensure that each Tier-1 will receive files in complete sets that are usable by the analysis jobs.
4. The SRM, SRB and Replica Manager (RM) EB agents query the TMDB to discover new files to be transferred.
5. The SRM, SRB and RM EB agents copy the files from the General Distribution Buffer (Reco files) and Input Buffer (Digi files) to the appropriate export buffer. NOTE that an empty version of the COBRA metadata for all Digi and Reco Owners must be available for distribution on the export buffers (and registered to the RLS).
6. The SRM, SRB and RM EB agents insert in the RLS the PFN for the new location on the export buffer (at least for SRM and RM agents).
7. The SRM, SRB and RM EB agents update the TMDB with information about the new status of the files.
8. The Tier-1 discovers the existence of files to be downloaded from the TMDB.
9. The Tier-1 updates the TMDB with the information about the status of the transfer.
10. The SRM, SRB and RM clean-up agents query the TMDB to check whether files have already been transferred to all Tier-1’s to which they were supposed to go.
11. The SRM, SRB and RM clean-up agents delete the physical instances of the files in the EB.
12. The SRM, SRB and RM clean-up agents delete the entries for files deleted from EB in the RLS.
13. The Global clean-up agent checks the status of the transfers in the Transfer Management DB.
14. The Global clean-up agent purges files in the IB and GDB (via Castor).
15. The Global clean-up agent updates the file status in the TMDB.
Pre-conditions:
- Digi and Reco files are registered in the Transfer Management DB
Post-conditions:
- Input and general distribution buffers are cleared of any files already at Tier-1’s
- All data files assigned (copied) to Tier-1 as decided by Configuration agent logic
- Transfer Management DB and POOL RLS cat. kept up-to-date with file locations
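Two decisions in the sequence above lend themselves to a small sketch: the Configuration agent assigning files to Tier-1's by stream (step 3), and the clean-up agents checking that a file has reached every assigned Tier-1 before deleting it (step 10). The routing table, function names and file names below are hypothetical; the slides do not give the actual assignment logic.

```python
# Hypothetical stream-to-site routing table; the real logic is not in the slides.
STREAM_TO_TIER1 = {
    "DiMuon": ["CNAF", "PIC"],
    "Full DST": ["FNAL", "RAL"],
}


def assign(files, routing):
    """Configuration agent: map each file to the Tier-1's that should receive it."""
    return {name: list(routing.get(stream, [])) for name, stream in files.items()}


def safe_to_delete(assignments, transferred):
    """Clean-up agent: files that already reached every assigned Tier-1."""
    return sorted(
        name for name, sites in assignments.items()
        if sites and set(sites) <= transferred.get(name, set())
    )


files = {"reco_001.root": "DiMuon", "reco_002.root": "Full DST"}
assignments = assign(files, STREAM_TO_TIER1)

# Transfer status as it might be read back from the transfer database.
transferred = {
    "reco_001.root": {"CNAF", "PIC"},
    "reco_002.root": {"FNAL"},
}
deletable = safe_to_delete(assignments, transferred)
print(deletable)
```

Files with no assigned destination are never reported as deletable, so an unrouted file cannot be purged by accident.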
Tier-1 RM data import/export
[Diagram: Tier-1 data import/export. Components: TMDB, Tier-1 agent, local POOL catalogue, POOL RLS catalogue, RM, SRM, SRB, GMCAT, MSS, SRB2LCG agent, LCG SE. Arrows: 1. discover; 2. lookup; 3. replicate; 4ac. lookup & update; 4b. copy; 4d. add SFN; 5. update; 6. FCpublish (if not an RLS mirror); 7. discover; 8. Sget; 9. gridftp; 10. add SFN.]
The RLS catalogue used by the Tier-1’s is either the CERN one or another one synchronized (via Oracle mirroring) with the CERN one. A local POOL catalogue (MySQL?) is also foreseen in case a synchronized RLS is not available locally. The Transfer Management DB must be accessible from the Tier-1’s via a predefined authentication method.
The sequence of the operations is:
1. The Tier-1 agent discovers new files to be transferred from the Transfer Management DB.
2. The Tier-1 agent queries the RLS to find the source file name (SURL) for the GUID found in step 1:
   - for RM: the query is not needed
   - for SRM: translate the SURL to transfer file name (TURL) using a query to the RM (to be verified)
   - for SRB: the SURL is enough
3. The Tier-1 agent starts the replication of the files via the chosen method.
4. The copy is performed by the chosen tool:
   - The RM accepts the GUID found at step 1 and the destination Storage Element and internally queries the RLS to find the TURL, copies the file and updates the RLS with the new SURL.
   - SRM accepts the TURL found at step 2 and the destination SRM and copies the file.
   - SRB accepts the SURL found at step 2 and the destination SRB vault and copies the file. Internally the GMCAT is queried/updated and the GMCAT itself adds the new SFN to RLS.
5. The Tier-1 agent updates the Transfer Management DB with the new status of the files.
6. In case a local POOL catalogue is to be used by the local job, the Tier-1 agent inserts the information of the new files (publish).
In case the local store is an SRB vault and the data have to be exported to Tier-2’s that are using LCG-2 tools, an SRB2LCG agent is needed:
7. The SRB2LCG agent discovers via an Sls in SRB the existence of new files.
8. The SRB2LCG agent downloads the files from the local SRB vault (Sget).
9. The SRB2LCG agent uploads the files to an LCG Storage Element (gridftp).
10. The SRB2LCG agent adds the new SFN in the RLS catalogue.
Steps 7 to 10 may have an equivalent for SRM. In case the SRM-SE will be managed by the Replica Manager, the specific SRM use case will disappear (and will become equivalent to the RM one).
Pre-conditions:
- POOL RLS catalogue is either the CERN one or a local mirror
- Transfer Manag. DB at CERN is accessible from Tier-1’s
Post-conditions:
- data copied at Tier-1 on MSS and available to Tier-2
- CERN POOL RLS catalogue and local POOL catalogue updated
- Transfer Management DB updated
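Step 2 of the sequence above differs by transfer tool: the RM takes the GUID directly, SRB takes the SURL, and SRM needs the SURL translated into a TURL. A minimal sketch of that dispatch follows. The function name, the two lookup callables and the example URLs are hypothetical stand-ins for the catalogue queries; none of them is a real RM, SRM or SRB call.

```python
def transfer_source(tool, guid, lookup_surl, surl_to_turl):
    """Return the identifier handed to the transfer tool for a given GUID.

    lookup_surl and surl_to_turl are caller-supplied functions standing in
    for the catalogue queries described in step 2.
    """
    if tool == "RM":
        return guid  # the RM resolves the GUID internally, no query needed
    surl = lookup_surl(guid)
    if tool == "SRB":
        return surl  # the SURL is enough
    if tool == "SRM":
        return surl_to_turl(surl)  # the SURL is translated to a TURL
    raise ValueError(f"unknown transfer tool: {tool}")


# Hypothetical catalogue contents for the demonstration.
SURLS = {"guid-001": "srm://t0.example/reco_001.root"}
TURLS = {"srm://t0.example/reco_001.root": "gsiftp://t0.example/reco_001.root"}

sources = {
    tool: transfer_source(tool, "guid-001", SURLS.__getitem__, TURLS.__getitem__)
    for tool in ("RM", "SRB", "SRM")
}
print(sources)
```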
Tier-1 analysis job preparation
[Diagram: Tier-1 analysis job preparation. Components: Local storage or SE (Reco files, Empty Reco COBRA metadata), Job preparation agent, Local POOL catalogue, Global (RLS) POOL catalogue, XML catalogue, ORCA RECO script & .orcarc, Local or Grid Resource Manager. Arrows: 1. discover; 2a. POOL cat. read; 2b. POOL publish; 3. McRunJob create; 4. McRunJob run.]
The preparation is similar to that at the Tier-0. The main differences are:
- Files may be either in the local storage or on a grid Storage Element for LCG-2 Tier-1’s.
- The discovery of new runs to process is not done through the RefDB but through the Global RLS catalogue (the local one is not enough because it is not possible to identify all files needed for running a job). Alternatively a DSdump on the metadata file of the run may do the job.
- The output data are assumed not to be part of a new dataset (e.g. they’re root files or ntuples). To be verified.
The local storage is available from all Worker Nodes through unix cp, rfcp or gridftp.
The sequence of the operations is:
1. The job preparation agent discovers from the RLS catalogue that a new run is available for processing (still to be defined).
2. The job preparation agent prepares an XML POOL catalogue with the information about the input files.
3. The job preparation agent invokes McRunJob (or equivalent) to produce the job script and the other needed files (e.g. the .orcarc).
4. The job preparation agent invokes McRunJob (or equivalent) to submit the job to the local resource manager or to the Grid scheduler.
It is assumed that BOSS is used for job monitoring, but it is not relevant for this discussion.
Pre-conditions:
- a local POOL catalogue is populated with at least the local files (may be an RLS)
- the list of files of a given run is provided by the global POOL catalogue (i.e. RLS)
Post-conditions:
- XML catalogue to be used by the job is ready
- execution script and accessory files are ready
- job is submitted to a local or grid resource manager
Tier-1 analysis
[Diagram: Tier-1 analysis. Components: Resource Manager, ORCA RECO script & .orcarc, Original XML catalogue, Local storage or SE (Reco files, Empty Reco COBRA metadata), ORCA Analysis Job, root or ntuple files, RLS catalogue. Arrows: 1. execute; 2. read catalogue; 3. file download; 4. update XML with local copy files (only if downloaded); 5. attachRun on the local copy of the COBRA metadata; 6a. file upload; 6b. register new files (if on grid).]
Running the job is similar to the Tier-0 case. The main differences are:
- Files may be either in the local storage or on a grid Storage Element for LCG-2 Tier-1’s.
- The Reco COBRA metadata file is empty and attachRun needs to be done on the WN with the needed input files.
- Output files are not POOL files, so the local XML catalogue does not need to be published to the RLS at the end of the job.
The sequence of the operations is:
1. The local or grid resource manager sends the job to a WN. The job will use the executable script prepared at the previous step, possibly sent in an “input sandbox”.
2. The job reads the XML catalogue (possibly sent via input sandbox) to discover what input files are needed.
3. The job downloads the needed input files to the local WN work area (an alternative is that they’re accessed from the local storage if an access method is provided). The Empty Reco COBRA metadata file is ALWAYS downloaded to the local work area.
4. The job updates the local version of the XML catalogue to use the local files. In any case the local version of the Reco COBRA metadata is used.
5. The job registers the input Reco files in the COBRA metadata file via attachRun.
6. At the end of the execution the job uploads the new root or ntuple files to the local storage or to a Storage Element. If on the grid, this operation implies registration to the RLS catalogue.
Post-conditions:
- Root or ntuple files are on the local storage or on a storage element
- RLS updated if on the grid
Note: if the Tier-1 uses SRB the local storage may be an SRB vault and the RLS catalogue is replaced by the GMCAT.
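Step 4 of the sequence above, pointing the local XML catalogue at the downloaded copies, amounts to rewriting the physical file names of the files that were actually fetched. The sketch below assumes a simplified catalogue layout with `pfn` elements carrying a `name` attribute; that layout, the paths and the work directory are assumptions made for illustration.

```python
import posixpath
import xml.etree.ElementTree as ET

# Hypothetical catalogue content in an assumed, simplified layout.
CATALOGUE = """<POOLFILECATALOG>
  <File ID="guid-001">
    <physical><pfn name="rfio:/castor/example/reco_001.root"/></physical>
  </File>
  <File ID="guid-002">
    <physical><pfn name="rfio:/castor/example/reco_002.root"/></physical>
  </File>
</POOLFILECATALOG>"""


def localize(root, downloaded, work_dir):
    """Rewrite PFNs of downloaded files to their copy in the job work area.

    Returns the number of entries changed; other entries are left untouched.
    """
    changed = 0
    for pfn in root.iter("pfn"):
        base = posixpath.basename(pfn.get("name"))
        if base in downloaded:
            pfn.set("name", posixpath.join(work_dir, base))
            changed += 1
    return changed


root = ET.fromstring(CATALOGUE)
n_changed = localize(root, {"reco_001.root"}, "/tmp/job")
names = [pfn.get("name") for pfn in root.iter("pfn")]
print(n_changed, names)
```

Leaving non-downloaded entries unchanged matches the "only if downloaded" condition on the diagram: such files keep their remote location and are read through the storage access method.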