CMS Distributed Data Analysis Challenges
Claudio Grandi (INFN Bologna), on behalf of the CMS Collaboration
ACAT'03, KEK, 3 December 2003
Outline
– CMS Computing Environment
– CMS Computing Milestones
– OCTOPUS: the CMS Production System
– 2002 Data Productions
– 2003 Pre-Challenge Production (PCP04)
– PCP04 on grid
– 2004 Data Challenge (DC04)
– Summary
CMS Computing Environment
CMS computing context
– LHC will produce 40 million bunch crossings per second in the CMS detector (1000 TB/s)
– The on-line system will reduce the rate to 100 events per second (100 MB/s of raw data)
  – Level-1 trigger: hardware
  – High-Level Trigger: on-line farm
– Raw data (1 MB/evt) will be:
  – archived on persistent storage (~1 PB/year)
  – reconstructed to DST (~0.5 MB/evt) and AOD (~20 kB/evt)
– Reconstructed data (and part of the raw data) will be:
  – distributed to the computing centers of the collaborating institutes
  – analyzed by physicists at their own institutes
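As a sanity check on these numbers, a few lines of arithmetic (assuming the canonical ~10^7 seconds of effective LHC running per year, a figure not stated on the slide) reproduce the ~1 PB/year of raw data:

```python
# Back-of-the-envelope check of the CMS raw-data volume.
# Assumption (not on the slide): one LHC "year" ~ 1e7 seconds of running.
rate_hz = 100            # events/s after the High-Level Trigger
event_size_mb = 1.0      # raw event size in MB
seconds_per_year = 1e7   # effective LHC running time per year (assumed)

raw_mb_per_s = rate_hz * event_size_mb                  # 100 MB/s
raw_tb_per_year = raw_mb_per_s * seconds_per_year / 1e6
print(f"raw data rate : {raw_mb_per_s:.0f} MB/s")
print(f"raw data/year : {raw_tb_per_year:.0f} TB (~1 PB)")
```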
CMS data production at LHC
– Collision rate: 40 MHz (1000 TB/s)
– After the Level-1 Trigger: 75 kHz (50 GB/s)
– After the High-Level Trigger: 100 Hz (100 MB/s), to data recording and offline analysis
CMS distributed computing model
[Figure: the hierarchical (tiered) model. The experiment's online system at CERN (~PB/s) feeds the Tier 0+1 CERN center (PBs of disk, tape robot); Tier-1 centers (FNAL, IN2P3, INFN, RAL) connect at ~Gbps; Tier-2 centers at 0.1 to 10 Gbps; below them, Tier-3 institutes and Tier-4 workstations with a physics data cache (~MB/s).]
CMS software for data simulation
– Event generation: Pythia and other generators
  – generally Fortran programs; produce N-tuple files (HEPEVT format)
– Detector simulation:
  – CMSIM (uses GEANT-3): Fortran program; produces formatted Zebra (FZ) files from the N-tuples
  – OSCAR (uses GEANT-4 and the CMS COBRA framework): C++ program; produces POOL files (hits) from the N-tuples
– Digitization (DAQ simulation):
  – ORCA (uses the CMS COBRA framework): C++ program; produces POOL files (digis) from hits POOL files or FZ files
– Trigger simulation:
  – ORCA: reads the digis POOL files; normally run as part of the reconstruction phase
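Purely as an illustration of how these programs chain together, here is a toy Python driver for the simulation flow. The stage functions are hypothetical stand-ins; in reality each stage is a separate Fortran or C++ program, run by the OCTOPUS production tools described later in this talk.

```python
# Illustration only: a toy driver for the simulation chain described above.
# The stage functions and file names are invented for the example.
def generate(dataset, nevents):
    return f"{dataset}.ntpl"        # Pythia et al.: HEPEVT N-tuples

def simulate(ntuple, engine="cmsim"):
    # CMSIM (GEANT-3) writes Zebra FZ files; OSCAR (GEANT-4) writes POOL hits
    return ntuple.replace(".ntpl", ".fz" if engine == "cmsim" else "_hits.pool")

def digitize(hits):
    return hits.rsplit(".", 1)[0] + "_digis.pool"   # ORCA/COBRA: POOL digis

digis = digitize(simulate(generate("bt03_ttbar", 500)))
print(digis)    # bt03_ttbar_digis.pool
```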
CMS software for data analysis
– Reconstruction: ORCA
  – produces POOL files (DST and AOD) from hits or digis POOL files
– Analysis:
  – ORCA: reads POOL files in hits, digis, DST and AOD formats
  – IGUANA (uses ORCA and OSCAR as back-ends): visualization program (event display, statistical analysis)
CMS software: ORCA & co.
[Figure: data-flow diagram. Simulation: Pythia and other generators write HEPEVT N-tuples; CMSIM (GEANT3) turns them into Zebra files with hits, which the ORCA/COBRA hit formatter loads into a hits database (POOL); alternatively OSCAR/COBRA (GEANT4) writes hits directly to POOL. ORCA/COBRA digitization merges signal and pile-up into a digis database (POOL). Analysis: ORCA reconstruction or user analysis produces N-tuples or Root files, with IGUANA for interactive analysis.]
CMS Computing Milestones
CMS computing milestones
– DAQ TDR (Technical Design Report): Spring-2002 data production; software baselining
– Computing & Core Software TDR: 2003 data production (PCP04); 2004 Data Challenge (DC04)
– Physics TDR: 2004/05 data production (DC05); data analysis for the Physics TDR
– “Readiness Review”: 2005 data production (PCP06); 2006 Data Challenge (DC06)
– Commissioning
Size of CMS data challenges
– 1999: 1 TB – 1 month – 1 person
– 2000/01: 27 TB – 12 months – 30 persons
– 2002: 20 TB – 2 months – 30 persons
– 2003: 175 TB – 6 months – <30 persons
[Figure: challenge size vs. time, average slope ×2.5/year, with the milestones DAQ TDR, DC04, Physics TDR, DC05, LCG TDR, DC06, Readiness, and the LHC at 2×10^33 and 10^34 cm^-2 s^-1.]
World-wide distributed productions
[Figure: world map of the CMS production regional centers and the CMS distributed-production regional centers.]
CMS computing challenges
CMS computing challenges include:
– production of simulated data for studies on:
  – detector design
  – trigger and DAQ design and validation
  – physics system setup
– definition and set-up of the analysis infrastructure
– definition of the computing infrastructure
– validation of the computing model:
  – a distributed system
  – increasing size and complexity
  – tied to the other CMS activities
– providing computing support for all CMS activities
OCTOPUS: the CMS Production System
OCTOPUS data production system
[Figure: workflow diagram. A physics group asks for a new dataset in the RefDB (dataset metadata); the Production Manager defines assignments; a Site Manager starts an assignment with McRunjob plus a plug-in: CMSProd shell scripts go to a local batch manager on a computer farm; JDL goes to the grid (LCG) scheduler, backed by the LCG RLS/POOL catalogue; or a DAG job goes through a planner to DAGMan (MOP) and the Chimera VDL virtual data catalogue (DPE). The BOSS DB holds job metadata; the components are linked by job-level and data-level queries, pushing data or info and pulling info.]
Remote connections to databases
[Figure: on the User Interface, the job is prepared with its input, output and a journal catalog; on the Worker Node, a job wrapper (job instrumentation) runs the user job while a journal writer feeds a remote updater. The metadata DBs (RLS/POOL, RefDB, BOSS DB) are updated either by direct connection from the worker node or by an asynchronous updater that replays the journal.]
Job production: MCRunJob (with the CMSProd plug-in)
– Modular: plug-ins exist for:
  – reading from the RefDB
  – reading from a simple GUI
  – submitting to a local resource manager
  – submitting to DAGMan/Condor-G (MOP)
  – submitting to the EDG/LCG scheduler
  – producing derivations in the Chimera Virtual Data Catalogue
– Runs on the user's (e.g. the site manager's) host
– Also defines the sandboxes needed by the job
– If needed, the specific submission plug-in takes care of:
  – preparing the XML POOL catalogue with the input-file information
  – moving the sandbox files to the worker nodes
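The plug-in architecture can be pictured with a short sketch. This is not MCRunJob code, just a hypothetical illustration of the registry pattern (all class and function names are invented; only edg-job-submit is a real EDG command, shown here as a string):

```python
# Hypothetical illustration of a plug-in pattern like MCRunJob's;
# none of these class or method names come from the real tool.
SUBMITTERS = {}

def register(name):
    def deco(cls):
        SUBMITTERS[name] = cls
        return cls
    return deco

@register("local")
class LocalSubmitter:
    def submit(self, job):
        print(f"bsub {job['script']}")           # hand off to a local batch manager

@register("lcg")
class LCGSubmitter:
    def submit(self, job):
        # a grid plug-in also prepares the XML POOL catalogue and the sandbox
        print(f"edg-job-submit {job['jdl']}")

def run(backend, job):
    SUBMITTERS[backend]().submit(job)

run("lcg", {"jdl": "cmsim.jdl", "script": "cmsim.sh"})
```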
Job metadata management: BOSS
Job parameters that represent the job's running status are stored in a dedicated database:
– when did the job start?
– is it finished?
but also:
– how many events has it produced so far?
BOSS is a CMS-developed system that does this by extracting the information from the job's standard input/output/error streams (sketched below).
– The remote updater is based on MySQL
– Remote updaters are now being developed based on:
  – R-GMA (still has scalability problems)
  – Clarens (just started)
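A toy version of the stream-extraction idea, for illustration only: run the real job, watch its stdout, and push status updates to a database. The regular expression and update_db() are invented for the example; real BOSS filters are user-defined, and the real updater talks to MySQL.

```python
import re, subprocess

# Toy BOSS-style job instrumentation (all names invented for the example).
EVENT_RE = re.compile(r"processed event\s+(\d+)")

def update_db(job_id, key, value):
    print(f"UPDATE jobs SET {key}={value} WHERE id={job_id}")  # MySQL stand-in

def run_instrumented(job_id, cmd):
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE, text=True)
    for line in proc.stdout:            # journal/filter the stream as it flows
        m = EVENT_RE.search(line)
        if m:
            update_db(job_id, "n_events", int(m.group(1)))
    proc.wait()
    update_db(job_id, "finished", 1)

# stand-in for a real CMS job, so the sketch actually runs:
run_instrumented(42, ["echo", "processed event 500"])
```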
Dataset metadata management: RefDB
Dataset metadata are stored in the RefDB:
– which (logical) files is the dataset made of?
but also:
– which input parameters were given to the simulation program?
– how many events have been produced so far?
Information may be updated in the RefDB in several ways:
– manual Site Manager operation
– automatically from the job
– remote updaters based on R-GMA and Clarens (similar to those developed for BOSS) will be developed
Mapping of logical names to physical file names will be done on the grid by RLS/POOL (illustrated below).
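The logical-to-physical mapping can be pictured as a simple lookup. The sketch below is a schematic stand-in for what RLS/POOL provide; the catalogue contents, LFNs and site names are invented for the illustration.

```python
# Schematic stand-in for an RLS/POOL logical-to-physical file lookup;
# catalogue content and API are invented for illustration.
replica_catalog = {
    "lfn:bt03_ttbar_digis_0001": [
        "srb://castor.cern.ch/cms/pcp/bt03_ttbar_digis_0001.pool",
        "gsiftp://gridse.bo.infn.it/cms/pcp/bt03_ttbar_digis_0001.pool",
    ],
}

def best_replica(lfn, preferred_site=None):
    """Return one physical file name for a logical file name."""
    pfns = replica_catalog[lfn]
    if preferred_site:
        for pfn in pfns:
            if preferred_site in pfn:   # prefer a replica close to the job
                return pfn
    return pfns[0]

print(best_replica("lfn:bt03_ttbar_digis_0001", preferred_site="bo.infn.it"))
```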
2002 Data Productions
2002 production statistics
– Used Objectivity/DB for persistency
– 11 Regional Centers, more than 20 sites, about 30 site managers
– Spring 2002 data production:
  – generation and detector simulation: 6 million events in 150 physics channels
  – digitization: >13 million events with different configurations (luminosity)
  – about 200 KSI2000 months
  – more than 20 TB of digitized data
– Fall 2002 data production:
  – 10 million events, full chain (small output)
  – about 300 KSI2000 months
  – also productions on the grid!
Spring 2002 production history
[Figure: CMSIM production rate over time – about 1.5 million events per month.]
Fall 2002 CMS grid productions
– CMS/EDG Stress Test on the EDG testbed and CMS sites
  – top-down approach: more functionality but less robust; large manpower needed
  – 260,000 events in 3 weeks
– USCMS IGT Production in the US
  – bottom-up approach: less functionality but more stable; little manpower needed
  – 1.2 million events in 2 months
2003 Pre-Challenge Production (PCP04)
PCP04 production statistics
Started in July; supposed to end by Christmas.
– Generation and simulation:
  – 48 million events with CMSIM: 50-150 KSI2K s/event, 2000 KSI2K months; ~1 MB/event, 50 TB
    – hit-formatting in progress; the POOL format reduces the size by a factor of 2!
  – 6 million events with OSCAR: 100-200 KSI2K s/event, 350 KSI2K months (in progress)
– Digitization just starting:
  – need to digitize ~70 million events (not all in time for DC04!)
  – estimated ~30-40 KSI2K s/event, ~950 KSI2K months; ~1.5 MB/event, 100 TB
– Data movement to CERN: ~1 TB/day for 2 months
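A quick consistency check of the CMSIM CPU figure, assuming ~2.6×10^6 seconds per month (an assumption, not a number from the slide):

```python
# Consistency check of the CMSIM CPU estimate quoted above.
nevents = 48e6
ksi2k_s_per_event = 100          # middle of the quoted 50-150 KSI2K s/event range
seconds_per_month = 2.6e6        # ~30 days (assumed)

ksi2k_months = nevents * ksi2k_s_per_event / seconds_per_month
print(f"{ksi2k_months:.0f} KSI2K months")   # ~1850, consistent with the quoted 2000
```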
PCP 2003 production history
[Figure: CMSIM production rate over time – 13 million events per month.]
PCP04 on grid
US DPE production system
Running on Grid2003 (~2000 CPUs):
– based on VDT
– EDG VOMS for authentication
– GLUE schema for the MDS information providers
– MonALISA for monitoring
– MOP for production control:
  – DAGMan and Condor-G for specification and submission
  – a Condor-based match-making process selects resources
[Figure: MOP system: MCRunJob and mop_submitter on the master site feed DAGMan/Condor-G; jobs run in the batch queues of remote sites 1..N, with GridFTP moving data between the sites.]
Performance of US DPE (USMOP Regional Center)
– Pythia: ~30,000 jobs of ~1.5 min each; ~0.7 KSI2000 months
– CMSIM: ~9000 jobs of ~10 hours each; ~90 KSI2000 months; ~3.5 TB of data
– Now running OSCAR productions
[Figure: CMSIM production history.]
CMS/LCG-0 testbed
CMS/LCG-0 is a CMS-wide testbed based on the LCG pilot distribution (LCG-0), owned by CMS:
– a joint CMS, DataTAG-WP4 and LCG-EIS effort
– started in June 2003
– components from VDT and EDG 1.4.X (LCG pilot)
– components from DataTAG (GLUE schemas and information providers)
– Virtual Organization management: VOMS
– RLS in place of the replica catalogue (uses rlscms by CERN/IT)
– monitoring: GridICE by DataTAG
– tests with R-GMA (as the BOSS transport layer for specific tests)
– no direct MSS access (bridge to SRB at CERN)
About 170 CPUs and 4 TB of disk at: Bari, Bologna, Bristol, Brunel, CERN, CNAF, Ecole Polytechnique, Imperial College, ISLAMABAD-NCP, Legnaro, Milano, NCU-Taiwan, Padova, U. Iowa.
It allowed CMS software integration to proceed while LCG-1 was not yet out.
CMS/LCG-0 production system
[Figure: OCTOPUS (BOSS DB for job metadata, McRunjob + ImpalaLite, RefDB for dataset metadata) is installed on the User Interface; jobs are submitted via JDL to the grid (LCG) scheduler, which consults the Grid Information System (MDS) and the RLS; the CMS software is installed on the Computing Elements as RPMs; Worker Nodes, Computing Elements and Storage Elements push data or info while the scheduler pulls info.]
CMS/LCG-0 performance (CMS-LCG Regional Center, based on CMS/LCG-0)
– “Heavy” Pythia: ~2000 jobs of ~8 hours each; ~10 KSI2000 months
– 1.5 Mevts CMSIM: ~6000 jobs of ~10 hours each; ~55 KSI2000 months; ~2.5 TB of data
– Inefficiency estimate:
  – 5% to 10% due to site misconfiguration and local failures
  – 0% to 20% due to RLS unavailability
  – a few errors in the execution of the job wrapper
  – overall inefficiency: 5% to 30%
– Now used as a play-ground for CMS grid-tools development
[Figure: Pythia + CMSIM production history.]
Data Challenge 2004 (DC04)
2004 Data Challenge
Test the CMS computing system at a rate corresponding to 5% of the full LHC luminosity:
– corresponds to 25% of the LHC start-up luminosity
– for one month (February or March 2004)
– 25 Hz data-taking rate at a luminosity of 0.2 × 10^34 cm^-2 s^-1
– 50 million events (completely simulated up to digis during PCP04) used as input
Main tasks:
– reconstruction at the Tier-0 (CERN) at 25 Hz (~40 MB/s)
– distribution of the DST to Tier-1 centers (~5 sites)
– re-calibration at selected Tier-1 centers
– physics-group analysis at the Tier-1 centers
– user analysis from the Tier-2 centers
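The ~40 MB/s figure follows directly from the quoted rate and the digi event size given on the next slide; a quick check:

```python
# Check of the DC04 Tier-0 input rate and daily volume.
rate_hz = 25            # DC04 data-taking rate
event_size_mb = 1.5     # digitized (raw) event size quoted for DC04

mb_per_s = rate_hz * event_size_mb
tb_per_day = mb_per_s * 86400 / 1e6
print(f"{mb_per_s:.1f} MB/s, {tb_per_day:.1f} TB/day")   # ~37.5 MB/s, ~3.2 TB/day
```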
DC04 overview
[Figure: DC04 setup. During the PCP, 50M events (75 TB) flow into the CERN tape archive at ~1 TB/day for 2 months. T0 challenge: a fake DAQ at CERN feeds a ~40 TB disk pool (~20 days of data); first-pass reconstruction runs at 25 Hz (1.5 MB/evt in: 40 MB/s, 3.2 TB/day), archiving raw data (1 MB/evt) and reconstructed DST (0.5 MB/evt) to the CERN tape archive. DST event streams (Higgs, SUSY background, …), TAG/AOD replicas (20 kB/evt) and conditions-DB replicas flow to T1 and T2 sites (analysis challenge); calibration samples and calibration jobs feed the master conditions DB at the T0 (calibration challenge); a Higgs background study requests new events from an event server; an HLT filter is a possible addition.]
Tier-0 challenge
– Data-serving pool to serve digitized events at 25 Hz to the computing farm, with 20/24-hour operation:
  – 40 MB/s
  – adequate buffer space (at least 1/4 of the digi sample in the disk buffer)
  – pre-staging software: file locking while in use, buffer cleaning and restocking as files are processed
– Computing farm: approximately 400 CPUs
  – jobs running 20/24 hours; 500 events/job, 3 hours/job
  – files in the buffer locked until successful job completion
  – no dead-time can be introduced to the DAQ; latencies must be no more than of order 6-8 hours
– CERN MSS: ~50 MB/s archiving rate
  – archive ~1.5 MB × 25 Hz of raw data (digis)
  – archive ~0.5 MB × 25 Hz of reconstructed events (DST)
– File catalogue: POOL/RLS
  – secure and complete catalogue of all data inputs/products
  – accessible and/or replicable by the other computing centers
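The buffer logic described above (lock while in use, clean and restock only after successful processing) can be sketched as follows. This is a schematic illustration, not the actual DC04 pre-staging software; all paths and the stage-in mechanism are invented.

```python
import os, shutil

# Schematic sketch of the disk-buffer discipline described above.
BUFFER = "/data/dc04_buffer"          # hypothetical buffer location

def prestage(fname):
    shutil.copy(f"/mss/dc04/{fname}", f"{BUFFER}/{fname}")   # tape -> disk buffer
    open(f"{BUFFER}/{fname}.lock", "w").close()              # lock while in use

def release(fname, job_ok):
    os.remove(f"{BUFFER}/{fname}.lock")
    if job_ok:
        # restock buffer space; on failure the file is kept so the job
        # can be retried without a second stage-in from tape
        os.remove(f"{BUFFER}/{fname}")
```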
Data distribution challenge
– Replication of the DST and part of the raw data to one or more Tier-1 centers
  – possibly using the LCG replication tools
  – some event duplication is foreseen
  – at CERN, ~3 GB/s of traffic without inefficiencies (about 1/5 of that at each Tier-1)
– Tier-0 catalogue accessible by all sites
– Replication of the calibration samples (DST/raw) to selected Tier-1 centers
– Transparent access of jobs at the Tier-1 sites to the local data, whether in the MSS or on the disk buffer
– Replication of any physics-group (PG) data produced at the Tier-1 sites to the other Tier-1 sites and to interested Tier-2 sites
– Monitoring of data-transfer activities, e.g. with MonALISA
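The slide only says "possibly using the LCG replication tools", so as a purely schematic picture of the replicate-and-register step: copy each DST file to every Tier-1 storage element and record the new replica in the catalogue. copy_to() and register_replica() are hypothetical stand-ins, and the site names are illustrative.

```python
# Purely schematic sketch of the Tier-0 -> Tier-1 DST fan-out.
TIER1_SITES = ["cnaf.infn.it", "fnal.gov", "in2p3.fr", "gridka.de", "rl.ac.uk"]

def copy_to(lfn, site):
    print(f"copying {lfn} -> {site}")            # would be a GridFTP transfer

def register_replica(lfn, site):
    print(f"registering {lfn}@{site} in the RLS")  # keeps the T0 catalogue complete

def distribute_dst(lfn):
    """Replicate one DST logical file to every Tier-1 and record it."""
    for site in TIER1_SITES:
        copy_to(lfn, site)
        register_replica(lfn, site)

distribute_dst("lfn:dc04_higgs_dst_0001")
```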
Calibration challenge
– Selected sites will run the calibration procedures
– Rapid distribution of the calibration samples (within hours at most) to the Tier-1 site, and automatically scheduled jobs to process the data as they arrive
– Publication of the results in an appropriate form that can be returned to the Tier-0 for incorporation into the calibration “database”
– Ability to switch the calibration “database” at the Tier-0 on the fly, and to track from the metadata which calibration table has been used
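The last requirement (switch calibrations on the fly, but keep the provenance) amounts to stamping every reconstruction output with the concrete calibration version it used, while the running system only ever reads a mutable "current" pointer. A hypothetical sketch, with all names invented:

```python
# Hypothetical sketch of calibration provenance tracking at the Tier-0.
current_calibration = {"version": "ecal_2004_02_v3"}   # switched on the fly

def reconstruct(run_id):
    calib = current_calibration["version"]             # resolved at job start
    dst_metadata = {"run": run_id, "calibration": calib}
    # ... reconstruction would happen here ...
    return dst_metadata                                # provenance travels with the DST

print(reconstruct(1042))   # {'run': 1042, 'calibration': 'ecal_2004_02_v3'}
```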
Tier-1 analysis challenge
– All data distributed from the Tier-0 safely inserted into local storage
– Management and publication of a local catalogue indicating the status of the locally resident data:
  – define tools and procedures to synchronize a variety of catalogues with the CERN RLS catalogue (EDG-RLS, Globus-RLS, SRB-MCat, …)
  – Tier-1 catalogue accessible to at least the “associated” Tier-2 centers
– Operation of the physics-group (PG) productions on the imported data: a “production-like” activity
– Local computing facilities made available to Tier-2 users, possibly via the LCG job-submission system
– Export of the PG data to requesting sites (Tier-0, -1 or -2)
– Registration of the locally produced data in the Tier-0 catalogue to make them available to at least selected sites, possibly via the LCG replication tools
Tier-2 analysis challenge
– Point of access to computing resources for the physicists
– Pulling of data from peered Tier-1 sites as defined by the local Tier-2 activities
– Analysis of the local PG data, producing plots and/or summary tables
– Analysis of distributed PG data or DST available at least at the reference Tier-1 and the “associated” Tier-2 centers
  – results are made available to selected remote users, possibly via the LCG data-replication tools
– Private analysis of distributed PG data or DST is outside the DC04 scope but will be kept as a low-priority milestone:
  – use of a Resource Broker and Replica Location Service to gain access to appropriate resources without knowing where the input data are
  – distribution of user code to the executing machines
  – a user-friendly interface to prepare, submit and monitor jobs and to retrieve results
Summary of DC04 scale
– Tier-0: reconstruction and DST production at CERN
  – 75 TB of input data
  – 180 KSI2K months = ~400 CPUs for 1 month of 20/24-hour operation (~500 SI2K/CPU)
  – 25 TB of output data
  – 1-2 TB/day data distribution from CERN to the sum of the T1 centers
– Tier-1: assume all “CMS” Tier-1s (except CERN) participate: CNAF, FNAL, Lyon, Karlsruhe, RAL
  – share the T0 output DST between them (~5-10 TB each)
  – 200 GB/day transfer from CERN (per T1)
  – perform scheduled analysis-group “production”: ~100 KSI2K months total = ~50 CPUs per T1 (24 hrs/30 days)
– Tier-2: assume about 5-8 T2 (may be more…)
  – store some of the PG data at each T2 (500 GB-1 TB)
  – estimate 20 CPUs at each center for 1 month
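These numbers hang together; a quick cross-check of the Tier-0 farm size and the per-Tier-1 transfer rate (the ~500 SI2K per CPU is the slide's figure, the rest is arithmetic):

```python
# Cross-check of the DC04 scale numbers quoted above.
ksi2k_months = 180          # Tier-0 reconstruction CPU budget
ksi2k_per_cpu = 0.5         # ~500 SI2K per CPU
duty_cycle = 20 / 24        # 20/24-hour operation

cpus = ksi2k_months / ksi2k_per_cpu / duty_cycle
print(f"Tier-0 farm: ~{cpus:.0f} CPUs for one month")   # ~430, i.e. the quoted ~400

dst_tb, n_tier1, days = 25, 5, 30
gb_per_day_per_t1 = dst_tb * 1e3 / n_tier1 / days
print(f"~{gb_per_day_per_t1:.0f} GB/day per Tier-1")    # ~170 GB/day; the quoted
                                                        # 200 GB/day allows headroom
```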
Summary
– Computing is a CMS-wide activity: 18 regional centers, ~50 sites
– Committed to supporting the other CMS activities: analysis support for DAQ, Trigger and Physics studies
– Increasing in size and complexity:
  – 1 TB in 1 month at 1 site in 1999
  – 170 TB in 6 months at 50 sites today
  – ready for the full LHC size in 2007
– Exploiting new technologies:
  – the grid paradigm adopted by CMS
  – close collaboration with LCG and with the EU and US grid projects
  – grid tools assuming more and more importance in CMS