Published by Barrie Roberts. Modified over 9 years ago.
Status of RAW data production (III)
ALICE-LCG Task Force weekly
CERN – Alice Offline – Thu, 27 Mar 2008 – Marco MEONI
Current Production

Switched to raw data production with a single master job per global run, under the control of the Lightweight Production Monitor (LPM).

Main problems during the Easter break:
1. Successful jobs (output written at the WN) end up in "EXPIRED" after staying in "SAVING" for a few hours (ALICE::CERN::CASTOR2 not responding).
2. Jobs go to "ERROR_E" due to zero CPU consumption in the last 20 minutes (probably because a raw file takes longer to be staged the first time it is accessed). Solved using LPM: after re-submission the number of "ERROR_E" jobs decreases significantly.
3. User alidaq was squeezed by aliprod on the production partition "rawreco":
   - CEs are CERN::LCG and CERN::CERN-gLite
   - Could run no more than 20 jobs in parallel when the CEs were overloaded
   - Now it has a higher priority (200 concurrent jobs)
Output Restructuring

AliEn Catalog (raw data)
  ..
  ../LHC08a/000026001/raw/08000026001002.10.root
  ../LHC08a/000026001/raw/08000026001012.10.root
  ../LHC08a/000026001/raw/08000026001012.20.root
  ../LHC08a/000026007
  ../LHC08a/000026020
  ..

AliEn Catalog (reconstructed data)
  ..
  ../000026001/ESDs/pass1/08000026001002.10.root/AliESDs.root
  ../000026001/ESDs/pass1/08000026001002.10.root/.QA*.root
  ../000026001/ESDs/pass1/08000026001002.10.root/.RecPoints.root
  ../000026001/ESDs/pass1/08000026001002.10.root/debug.root
  ../000026001/ESDs/pass1/08000026001002.10.root/rec.log|stdout|stderr
  ../000026001/ESDs/pass1/08000026001002.10.root/root_archive
  ../000026001/ESDs/pass1/08000026001002.10.root/log_archive
  ..

1..~40
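The mapping between the two catalog layouts above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper name `esd_dir` is hypothetical, the leading part of each path is elided on the slide and is left out here, and the rule "one output directory per raw chunk, named after the chunk" is inferred from the listing rather than taken from AliEn or LPM documentation.

```python
def esd_dir(raw_lfn: str, pass_name: str = "pass1") -> str:
    """Return the reconstructed-data directory for one raw chunk.

    The result is relative to the elided catalog prefix shown as ".." on
    the slide. Hypothetical helper; the layout is inferred from the listing.
    """
    parts = raw_lfn.split("/")
    # Expected tail of a raw LFN: <run>/raw/<chunk>.root
    if len(parts) < 3 or parts[-2] != "raw":
        raise ValueError(f"not a raw-data path: {raw_lfn!r}")
    run, chunk = parts[-3], parts[-1]
    return f"{run}/ESDs/{pass_name}/{chunk}"


if __name__ == "__main__":
    raw = "../LHC08a/000026001/raw/08000026001002.10.root"
    print(esd_dir(raw) + "/AliESDs.root")
```

Deriving the output directory from the run number and chunk name alone keeps the reconstructed tree browsable per run and per pass, which appears to be the point of the restructuring.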
Production using LPM

- Master job is kept running until 95% of the sub-jobs are DONE.
- This may slow down or stall production in the long term if too many sub-jobs end in ERROR_V.
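A minimal sketch of why the 95% rule can stall a master job: once more than 5% of its sub-jobs have failed for good, the threshold is out of reach. The function names are hypothetical, and treating ERROR_V as a permanent failure is an assumption made for the illustration; this is not LPM's actual code.

```python
DONE_THRESHOLD_PCT = 95  # from the slide: master job runs until 95% are DONE


def master_job_finished(done: int, total: int) -> bool:
    """True once at least 95% of the sub-jobs are DONE (integer arithmetic)."""
    return total > 0 and done * 100 >= DONE_THRESHOLD_PCT * total


def can_still_finish(total: int, failed_for_good: int) -> bool:
    """True if the threshold is still reachable.

    Assumption: sub-jobs counted in failed_for_good (e.g. ERROR_V) never
    become DONE unless they are resubmitted.
    """
    return master_job_finished(total - failed_for_good, total)


if __name__ == "__main__":
    print(master_job_finished(2185, 3878))  # progress of one master job
    print(can_still_finish(3878, 200))      # more than 5% lost
```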
Production using LPM (1)

Task queue automatically re-filled if the number of waiting jobs is < 3500.

  resubmitting error jobs
  pid 12028083 had 7 error jobs (job is 30% done - 877 out of 2846)
  pid 12051531 had 0 error jobs (job is 49% done - 112 out of 228)
  pid 12051672 had 0 error jobs (job is 0% done - 0 out of 2)
  pid 12051673 had 797 error jobs (job is 56% done - 2185 out of 3878)
  total resubmitted : 804
  there are 4192 jobs waiting in queue for user alidaq
  target queue size is 4000
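The figures in the resubmission log above are internally consistent, which a short script can confirm. The sketch below parses the per-job lines, sums the resubmitted jobs, and recomputes the percentages; it also shows one possible reading of the refill rule. The regular expression, the function names, and the policy "refill up to the target size once the queue drops below 3500" are assumptions for illustration, not LPM's actual logic.

```python
import re

LOG = """\
pid 12028083 had 7 error jobs (job is 30% done - 877 out of 2846)
pid 12051531 had 0 error jobs (job is 49% done - 112 out of 228)
pid 12051672 had 0 error jobs (job is 0% done - 0 out of 2)
pid 12051673 had 797 error jobs (job is 56% done - 2185 out of 3878)
"""

LINE = re.compile(
    r"pid (\d+) had (\d+) error jobs \(job is (\d+)% done - (\d+) out of (\d+)\)"
)


def parse(log: str) -> list[dict]:
    """Turn each 'pid ...' line into a dict of integers."""
    keys = ("pid", "errors", "pct", "done", "total")
    return [dict(zip(keys, map(int, m.groups()))) for m in LINE.finditer(log)]


def jobs_to_submit(waiting: int, low_water: int = 3500, target: int = 4000) -> int:
    """Assumed policy: top the queue up to `target` once it drops below `low_water`."""
    return target - waiting if waiting < low_water else 0


if __name__ == "__main__":
    rows = parse(LOG)
    print("total resubmitted:", sum(r["errors"] for r in rows))
    for r in rows:
        # The logged percentage matches the rounded-down done/total ratio.
        print(r["pid"], r["pct"], r["done"] * 100 // r["total"])
    print("to submit with 4192 waiting:", jobs_to_submit(4192))
```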
Production using LPM (2)

- All the urgent runs are already scheduled in LPM for production.
- At this rate (~900 jobs/h) we would finish reconstruction in two days.
- Jobs run successfully but fail when writing their outputs to CASTOR2 (expired).
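As a back-of-the-envelope reading of the estimate above, a rate of about 900 jobs/h sustained for two days corresponds to roughly 43,200 sub-jobs. The sketch below just does that arithmetic; it assumes a constant rate and round-the-clock running, and the function names are hypothetical.

```python
import math


def jobs_processed(hours: float, rate_per_hour: float = 900) -> float:
    """Jobs completed in `hours` at a constant rate (assumption: no downtime)."""
    return hours * rate_per_hour


def hours_needed(remaining_jobs: int, rate_per_hour: float = 900) -> int:
    """Whole hours needed to work through `remaining_jobs` at a constant rate."""
    return math.ceil(remaining_jobs / rate_per_hour)


if __name__ == "__main__":
    print(jobs_processed(48))   # two days at ~900 jobs/h
    print(hours_needed(43200))  # inverse of the line above
```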