CMS Report – GridPP Collaboration Meeting VIII
Peter Hobson, Brunel University, 22/9/2003

CMS Applications

Progress towards GridPP milestones
  - Data management (Bristol)
  - Monitoring (Brunel + Imperial)
Bristol, Brunel and Imperial (1.5 GridPP FTE in total)

DC04 Pre-Challenge Production

Data Challenge: March 2004 nominal (see Hugh Tallini’s talk)
  - An end-to-end test of the CMS offline computing system
  - ‘Play back’ digi data, emulating CMS DAQ -> storage, reconstruction, calibration, data reduction and analysis at T0 and external T1s
Pre-challenge production
  - >70M fully simulated, hit-formatted, digitised events required for DC04
  - Using both Geant3 and Geant4 simulation; based on POOL persistency
UK status
  - In production for ~3 months at RAL T1 (Bristol-managed) and Imperial
  - UK has contributed ~25% of production so far
  - RAL is also a major data store and hosts the central catalogue for the current data management solution (SRB) [very high-profile contribution]
Next steps
  - Digitisation of simulated data, which is much more demanding of farms
  - Production will continue for the rest of 2003 (though not at all sites)
  - Large-scale replication of digis to CERN (Castor) via WAN

Production Stats

We are late!
  => Need to maximise use of the RAL farm until the last minute (November)
  - Hand over resources to ATLAS as they are required (i.e. keep the queues full, and use queue policies to ramp down CMS production as ATLAS ramps up)
  - Migrate to LCG farm…?

Data management (wide-area)

Short-term solution (PCP03)
  - Using SRB for all data management, across ~20 sites
  - After considerable effort by RAL e-science staff and CMS people, it works very nicely (deployed across all sites in ~10 days)
  - RAL is doing a highly professional job of hosting the central MCAT
The medium term (DC04)
  - Move towards LCG ASAP: introduce middleware components into the running production as they are released and tested (LCG timescale?)
  - Potential problem with data management MSS interface timescales (ask technical gurus for details); currently discussing our approach
  - One possibility: integrate SRB (incl. MSS interface) below LCG RLS
      - Of potential interest to BaBar, Belle, US Grid projects; will discuss at SLAC
      - Abstract submitted to ACAT ’03
  - Alternative: each T1 implements its own MSS interface
      - At RAL, this will probably be SRB-ADS anyway, since it is tested and working
The longer term (analysis of data for Physics TDR): LCG
  - Will need a transparent migration of current catalogues, etc.

Data Management (local)

Digitisation setup
  - Serving data (hits) for realistic full-pileup digitisation (25 overlapping events) is very demanding
  - Current RAID disk servers + LAN don’t scale to 100s of CPUs
  - Performance scales roughly with the number of spindles, so bigger disks don’t gain us much
  - Solution: use distributed disk resources, localised in ‘sub-farms’
  - Use dCache as the local data management solution
  - FNAL and RAL are the testbeds for this approach
POOL
  - POOL release 1.3 now integrated within the CMS COBRA framework
  - Functional/performance testing and development of catalogue handling approach under way within CMS (incl. Bristol)
  - Full integration of POOL catalogue with local + wide-area data management is the next step (work within LCG + CMS)
  - Also re-examining data clustering strategy for wide-area optimisation

Stress testing BOSS with R-GMA

The CMS job submission and monitoring system, BOSS, is now Grid-enabled using the R-GMA middleware from EDG WP3.

Stress testing BOSS with R-GMA

Static information is relayed via an information service
  - Architecture, operating system, CPU details, disk capacity, access policy and application version
  - Query/response semantics only
Dynamic information is relayed via a monitoring service
  - CPU load, fraction of disk used, network speed and application trace data
  - Both query/response and publish/subscribe semantics
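
The difference between the two access styles can be illustrated with a toy in-memory service. This is only a generic sketch of query/response versus publish/subscribe; it is not the R-GMA API, and the class name, method names and metric names are all hypothetical.

```python
class ToyMonitoringService:
    """In-memory stand-in for a monitoring service (hypothetical, not R-GMA)."""

    def __init__(self):
        self._latest = {}        # metric name -> most recent value
        self._subscribers = {}   # metric name -> list of callbacks

    def publish(self, metric, value):
        # A producer pushes a new value; subscribers are notified immediately.
        self._latest[metric] = value
        for callback in self._subscribers.get(metric, []):
            callback(metric, value)

    def query(self, metric):
        # Query/response: the consumer asks and gets the latest known value.
        return self._latest.get(metric)

    def subscribe(self, metric, callback):
        # Publish/subscribe: the consumer registers once and is called on updates.
        self._subscribers.setdefault(metric, []).append(callback)


service = ToyMonitoringService()
received = []
service.subscribe("cpu_load", lambda metric, value: received.append((metric, value)))

service.publish("cpu_load", 0.75)            # delivered to the subscriber
service.publish("disk_used_fraction", 0.4)   # nobody subscribed; query only

print(service.query("cpu_load"))
print(received)
```

Static information changes rarely, so polling it with a query is sufficient; dynamic information such as CPU load benefits from the subscription path because consumers do not need to poll.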

Stress testing BOSS with R-GMA

Use Case Name: Production
  - The production coordinator submits 10,000 production jobs using BOSS from a single Grid node.
  - Each job takes of the order of 10 hours to run on a CPU with a speed of the order of 1 GHz, and produces output files of the order of 500 MB.
  - The jobs are likely to be distributed to around 10 sites.
  - Each job may contain up to 20 messages inserted by the physicist for the purposes of alert or, more rarely, alarm.
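
As a back-of-envelope reading of this use case, the quoted figures imply the aggregate load computed below. This is only a sketch: the function name is hypothetical, the output size is taken as 500 decimal megabytes, and the message rate assumes that all jobs run concurrently over a single 10-hour window, which is an illustrative assumption not stated in the talk.

```python
def production_load(n_jobs=10_000, hours_per_job=10,
                    output_mb_per_job=500, max_messages_per_job=20):
    """Aggregate totals for the 'Production' use case (order-of-magnitude only)."""
    total_messages = n_jobs * max_messages_per_job            # upper bound on user messages
    total_output_tb = n_jobs * output_mb_per_job / 1_000_000  # MB -> TB (decimal units)
    cpu_hours = n_jobs * hours_per_job
    # Assumption: every job runs at the same time, so all messages arrive
    # within one job duration. Real production would be spread out more.
    messages_per_second = total_messages / (hours_per_job * 3600)
    return {
        "total_messages": total_messages,
        "total_output_tb": total_output_tb,
        "cpu_hours": cpu_hours,
        "messages_per_second": messages_per_second,
    }


load = production_load()
for name, value in load.items():
    print(f"{name}: {value}")
```

Under these assumptions the user-inserted messages alone amount to at most 200,000 messages and about 5 TB of output; sensor data from the job wrapper would add to the message count.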

Stress testing BOSS with R-GMA

[Figure: architecture diagram. Labels: BOSS DB, UI, IMPALA/BOSS, WN, Sandbox, BOSS wrapper, Job, Tee, OutFile, R-GMA API, Farm servlets, Receiver servlets, Registry, Receiver]
Tested on CMS-LCG0 testbed at IC and Brunel

Stress testing BOSS with R-GMA

Plausible sensor data volume for a single BOSS job
Plan:
  - Submit 50 real production jobs to a local batch system, and deduce an approximation to the distribution of intervals between sensor messages and of the sizes of those messages
  - The sensor data produced will be fed directly into R-GMA to investigate scaling and failure modes
Results will be presented at the IEEE NSS conference in Oregon in October 2003
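
The measurement step of this plan could be sketched as follows. This is a minimal illustration, assuming each sensor message has been logged as a (timestamp in seconds, size in bytes) pair; the function name, the record layout and the sample values are hypothetical and are not data from the talk.

```python
from statistics import mean, median


def message_stats(messages):
    """Summarise one job's sensor messages.

    messages: list of (timestamp_seconds, size_bytes) tuples (assumed layout).
    """
    ordered = sorted(messages)                 # sort by timestamp
    times = [t for t, _ in ordered]
    sizes = [s for _, s in ordered]
    # Interval between each message and the one before it.
    intervals = [later - earlier for earlier, later in zip(times, times[1:])]
    return {
        "n_messages": len(ordered),
        "mean_interval_s": mean(intervals) if intervals else None,
        "median_interval_s": median(intervals) if intervals else None,
        "mean_size_bytes": mean(sizes) if sizes else None,
        "total_bytes": sum(sizes),
    }


# Made-up sample for one job, for illustration only.
sample = [(0.0, 120), (30.0, 80), (45.0, 200), (105.0, 100)]
stats = message_stats(sample)
print(stats)
```

Collecting these per-job summaries over the 50 jobs would give an empirical distribution from which synthetic sensor traffic could be generated for the scaling tests.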

Summary

Successes
  - 1/4 of all pre-production data produced in the UK.
  - SRB for pre-production challenge data management has worked well.
  - POOL release 1.3 now integrated within the CMS COBRA framework.
Problems
  - Late start to the pre-production challenge.
  - Some concerns over the stability and scalability of R-GMA.