MC production at the CMS Tiers in 2007: CMS-wide outlook and the Italian contribution
Università, Politecnico and INFN Bari
N. De Filippis
M. Abbrescia, G. Cuscela, G. Donvito, G. Maggi, S. My, A. Pierro, A. Pompili, with contributions from the developers (Kavka, Fanfani, Codispoti, Bacchi)
Outline
– Status of CMS Monte Carlo production: organization and current requests
– Monte Carlo production in Italy:
  – Post-CSA06 activity
  – Problems with sites
  – Efficiency of Italian sites
  – Reliability of sites
– CMS plans and milestones for 2007
MC production cycle
Goal of MC production: to produce events for CMSSW validation (simulation/reconstruction) and physics studies
– Small RelVal samples are produced upon each new CMSSW release
– PhysVal / HLT groups make requests in the form of cfg files
– Experts provide ProdAgent workflows
– Assignments to production teams are posted on twiki
– Currently 6 teams: LCG(1,2,3,5,6) and OSG
– Each team has O(10) dedicated T1/T2 sites
– When done, files are merged and injected into PhEDEx
Too many manual steps and too many extra-production duties (e.g. monitoring and dealing with site availability and stability)
A lot of pressure from the SDV group (P. Janot) to produce events ASAP
Current official requests (P. Kreuzer)
– After CSA06: CMSSW_1_1_1 and 1_1_2 used until Xmas
– CMSSW_1_2_0 released mid-Dec06
– Production with CMSSW_1_2_0 running continuously since Dec06
  – PhysVal requests (10M w/o PU M w PU)
  – HLT requests (100M w/o PU + 20M w PU x 2)
  – HLT + PU in 2 steps: GEN-SIM / DIGI-RECO
– About 20M done, many running, but very tight schedule!
– Some samples:
  – QCD di-jets (0 < pt-bin < 3.5 TeV), w & w/o PU
  – Excl. W & Z decays, Wjets (0 < pt < 1 TeV), w & w/o PU
  – Inclusive ttbar, … see
PhysVal samples with CMSSW_1_2_0 LCG (3)
HLT samples with CMSSW_1_2_0 LCG (3)
Once the 120 bulk production is over, a few «special» requests will be addressed:
– Muon Enriched sample with 121: a few hundred thousand events
– Cosmics for Tracker with 122: M events
Ongoing effort of the OSG, LCG1,2,5,6
Conclusions of P. Kreuzer: with 2 new and efficient production teams on board, the remaining 120 assignments should be delivered (at least partially) within 10 days.
MC production in Italy
Post-CSA06 activity (1)
– Official CSA06 note complete
– Internal CMS note on CSA06 in the Italian Tiers complete
– CSA06 analyses completed
Post-CSA06 activity (2)
From October 2006 until today the LCG(3) team:
– restarted the Monte Carlo production and ran it without stops, including during the Xmas break
– increased the number of experts able to run ProdAgent
– exported the monitoring tool developed at Bari to the other LCG teams
– produced about 15M events for the Physics validation and HLT studies, with and without PU: 1/3 of the entire CMS production
– used the European LCG resources continuously, giving enormous feedback for resolving problems at remote sites
Sites used by the LCG(3) team
– CERN, used intensively before and after Xmas
– Italian sites
– English sites
– Hungary
– Taiwan
– IN2P3
Ongoing effort of LCG (3)
Ongoing GEN-SIM and DIGI-RECO production with low-luminosity pileup
Issues with ProdAgent
Production setup at Bari:
– 3 instances of PA running at Bari:
  – two for FEVT and GEN-SIM production
  – one for DIGI-RECO production with PU
– one machine for the on-line dump of the DBs
Monitoring tool exported to the other LCG teams, with positive feedback.
Job submission is rather slow (up to 2-3 jobs/min) due to:
– the performance of the PA machines, which are two years old
– the overhead of the RBs
– no bulk submission
Keeping track of jobs that failed or aborted because of middleware problems is difficult.
Killing the jobs of a given production, or those submitted to a given site, was problematic; the PA developers provided a script to do this.
LCG(3) will smoothly hand over the English CEs to LCG(6) (the English team) and IN2P3 to LCG(5) (the Belgian team) as far as debugging and intensive use are concerned.
In the long run: BulkSubmission & Resource Monitor
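To get a feel for what a submission rate of 2-3 jobs/min means in practice, the sketch below estimates the wall-clock time needed to submit a batch of jobs one at a time. The batch size of 1000 jobs and the function name `submission_hours` are illustrative assumptions, not figures or tools from the production itself.

```python
def submission_hours(n_jobs: int, jobs_per_min: float) -> float:
    """Hours needed to submit n_jobs sequentially at a fixed rate."""
    if jobs_per_min <= 0:
        raise ValueError("jobs_per_min must be positive")
    return n_jobs / jobs_per_min / 60.0


# Hypothetical batch of 1000 jobs at the two rates quoted on the slide.
for rate in (2.0, 3.0):
    print(f"{rate:.0f} jobs/min: {submission_hours(1000, rate):.1f} h")
```

Under these assumptions a batch of that size keeps the submitter busy for several hours, which is consistent with bulk submission being listed as a long-term goal.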
Problems with sites
Most LCG(3) sites had various problems before and during the Xmas break:
– November: Bari, Pisa and Roma when restarting production
– CNAF: problems with CASTOR
– English sites and IN2P3 alternated between active and inactive periods, including during the last month.
– Italian sites were really efficient during the last month.
Debugging sites is typically really painful and requires continuous interaction with the site administrators.
Problems:
– Stage-out was the main cause of job failures.
– Site validation: storage, software tag, software mount points, local copy of PU
– Grid problems: instabilities of the CE because of high load, and overload of the RBs, which caused:
  – the RB did not change the status of jobs («Waiting» status forever)
  – no chance to monitor: FWJobreport and log files lost
  – killing jobs via BOSS commands was difficult/tedious for the production teams
Debugging sites is not a task to be covered by the production teams. CMS is reacting by preparing centralized tests to ensure the reliability of sites.
Efficiency of the Italian sites (last month): CNAF
No PU
CE replaced
Except for a few days, CNAF worked very well, ensuring high efficiency of the CMS production during the last month.
Statistics of use of CNAF (last month)
CPU hours and percentage of Tier-1 resources used by CMS:
Month-week | CPU hr | %
jan 21 jan : 33.4%
22 jan to 28 jan : 19.0%
29 jan to 4 feb : 24.8%
5 feb to 11 feb : 22.4%
The percentage of use depends on the fair-share setup at CNAF.
Successful jobs
Queues were always full of jobs, with CMS at its maximum usage at CNAF.
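As a quick summary of the four weekly shares listed on this slide, the sketch below computes their plain average. This is an unweighted mean: the CPU-hour values are not available here, so the weeks are assumed to carry equal weight, and the names `weekly_share` and `mean_share` are illustrative.

```python
# Weekly share of CNAF Tier-1 resources used by CMS, in percent (from the slide).
weekly_share = [33.4, 19.0, 24.8, 22.4]


def mean_share(shares):
    """Unweighted arithmetic mean of the weekly percentages, to one decimal."""
    return round(sum(shares) / len(shares), 1)


print(mean_share(weekly_share))  # 24.9
```

A CPU-hour-weighted average could differ from this figure if the total Tier-1 usage varied strongly from week to week.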
Efficiency of the Italian sites (last month): INFN
Except for limited storage problems at Bari, Pisa and Rome, all the Italian Tier-2-like sites worked very well during the last month.
Statistics from dashboard
Reliability of sites: tests
Following the feedback on problems found by the production operators, CMS is defining centralized tests, to be run at regular intervals, to certify sites for production and analysis. The ideas are:
1) Submit a small processing job for each CMSSW release advertised at a site. This job checks that:
  – the job can be submitted to the site
  – local stage-out can be done
  – a report can be sent back via the grid middleware
  – 10-event Minimum Bias? Test frontier access as well?
2) Following completion of the test job, submit a read-back job, which:
  – verifies job submission
  – checks data access
  – cleans up the file, to test the cleanup procedure
3) Check the global DBS datasets at the site:
  – check read access to all fileblocks at the site
  – report back bad files and invalidate them in DBS
  – perhaps randomly select a dataset to test every day/week etc.
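The three test steps on this slide can be sketched as a small ordered pipeline. This is only an illustration under stated assumptions: the check functions are hypothetical stand-ins (real ones would submit grid jobs and query DBS), the site names are made up, and stopping at the first failed step is a design choice of this sketch rather than something the slide specifies.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

# Hypothetical check: takes a site name, returns True on success.
Check = Callable[[str], bool]


@dataclass
class SiteReport:
    site: str
    results: Dict[str, bool] = field(default_factory=dict)

    @property
    def certified(self) -> bool:
        # Certified only if at least one check ran and none failed.
        return bool(self.results) and all(self.results.values())


def certify_site(site: str, steps: List[Tuple[str, Check]]) -> SiteReport:
    """Run the named checks in order, stopping at the first failure."""
    report = SiteReport(site)
    for name, check in steps:
        ok = check(site)
        report.results[name] = ok
        if not ok:
            break  # assumption: later steps rely on earlier ones succeeding
    return report


# Stand-in checks mirroring steps 1-3; "site-b" is made to fail the last one.
steps = [
    ("processing_job", lambda s: True),
    ("read_back_job", lambda s: True),
    ("dbs_dataset_check", lambda s: s != "site-b"),
]
for site in ("site-a", "site-b"):
    print(site, certify_site(site, steps).certified)
```

Keeping the per-step results in the report, rather than a single pass/fail flag, is meant to show operators which stage of the test broke at a given site.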
Reliability of sites: SAM tests
SAM (Service Availability Monitoring)
The human resources needed for MC production are expected to decrease, with fewer production teams submitting jobs to any site.
Plans for MC production in 2007
Milestones
– Finalize 120 production (aim for mid-Feb!)
  – Expecting small 12x requests (RelVal, Muon-enriched HLT, …)
– 130 release (all HLT components) end Feb07
  – 130 HLT production in Mar07
– In parallel, Alpgen integration in production
  – Timescale: integrate till Mar07 + test samples, PH prod. Apr-May07
– 140 release (new geo) end Mar07
  – 140 physics production Apr-May07 (30M / month)
– 150 release mid-May07 with improved reco algorithms (re-RECO)
– Launch CSA07 with 16x end-July07
The contribution of Italy to these activities, and the manpower for it, are still to be defined. In addition, CSA07 during the summer could be a real problem.
Conclusions
– The Monte Carlo production of the LCG(3) team has run continuously from the end of CSA06 until now.
– About 15M events produced (1/3 of the overall CMS production).
– Italian sites have been working very well during the last month, ensuring high-efficiency production.
– Warning: keep a close watch on the Italian Tiers, mainly CNAF.
– Effective interaction between operators and the developers of PA.
– The load on production operators should decrease as soon as the centralized SAM tests are running to certify sites for production.
– The Italian contribution to the activities in preparation for CSA07, and to CSA07 itself, has to be discussed.