9th Weekly Operation Report on DIRAC Distributed Computing
YAN Tian
From to
Weekly Running Jobs by User
Notes:
1. zhanglei and huangy submitted BES jobs.
2. CEPC's production user weiyq kept running MC jobs.

Item                   Value
Active users           3
Max running jobs       785
Average running jobs   394
Total executed jobs    31.4 k
Final Status of Running Jobs

Failed reason            Percent
Application error        3.99%
Upload/download failed   3.77%
Stalled                  1.10%
Other                    0.025%

Notes: CEPC upload errors; huangy: 74# package download errors.
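The failure breakdown above can be collapsed into one overall failure rate. This is a minimal sketch that assumes each percentage is relative to the same set of finished jobs, so the four rows can simply be added. The names `FAILED_PERCENT` and `total_failed` are illustrative and do not come from DIRAC.

```python
# Failure breakdown copied from the table above (percent of jobs).
# Assumption: all rows share the same denominator, so they can be summed.
FAILED_PERCENT = {
    "Application error": 3.99,
    "Upload/download failed": 3.77,
    "Stalled": 1.10,
    "Other": 0.025,
}


def total_failed(breakdown):
    """Return the combined failure percentage of all listed reasons."""
    return sum(breakdown.values())


if __name__ == "__main__":
    failed = total_failed(FAILED_PERCENT)
    print(f"failed: {failed:.3f}%  succeeded: {100 - failed:.3f}%")
```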
Output Data Generated and Transferred
Total: 7.48 TB (~1.07 TB/day)
Transfer quality: good
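The ~1.07 TB/day figure is the weekly total divided by the length of the reporting period. The sketch below assumes a full 7-day week, since the exact report dates are not given. `daily_rate` is a hypothetical helper used only for illustration.

```python
TOTAL_OUTPUT_TB = 7.48  # weekly total from this report
REPORT_DAYS = 7         # assumption: the report covers one full week


def daily_rate(total_tb, days):
    """Average output volume per day, in TB."""
    if days <= 0:
        raise ValueError("days must be positive")
    return total_tb / days


if __name__ == "__main__":
    print(f"{daily_rate(TOTAL_OUTPUT_TB, REPORT_DAYS):.2f} TB/day")
```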
Running Jobs by Site
Notes:
– UMN's maximum number of running jobs is set to 500.
– WHU finished 47.7% of the jobs.
Job Final Status at Each Site

Site         Success   Note
WHU          92.0%     upload failed
GRID.JINR    100%
UMN          91.3%     74# BES
OpenNebula   96.5%
OpenStack    85.8%     11# CEPC
UCAS         96.7%     66# BES
Failed Types at Site
– GRID.JINR is good.
– UCAS is also good, but it still has 66# randomtrg download errors.
– OpenNebula is good.
– OpenStack occasionally has problems for a short time after VMs start; 900 jobs failed within 1 hour.
– WHU still has upload errors. This should improve when CEPC runs long jobs.
– UMN is good, but jobs submitted to UMN failed with input data download errors. These errors arise when a large number of jobs fetch data from the DIRAC server simultaneously:
  :07:39 UTC DataManagement/StorageElement NOTICE: Returning response ( :43120)[bes_user:huangy] (30.23 secs) ERROR: Failed to get file dips://besdirac02.ihep.ac.cn:9148/DataManagement/StorageElement/bes/user/h/huangy/Upload/524 0cc601bde6085a9baf9185a2cc22979c2624bc76705cb10e5301e0fd3fcfb
  :07:39 UTC DataManagement/StorageElement ERROR: Error processing proposal Error while sending: [('SSL routines', 'SSL3_WRITE_PENDING', 'bad write retry')]
Cumulative User Jobs
Total user jobs: 31.4 k
Running Jobs and Walltime Usage of VOs
Walltime usage: BES 17.6%, CEPC 82.4%
New Features in Frontend
Added this week to meet users' needs:
– gangaBOSS: support for BOSS added (done). Only large-scale testing is still needed.
– dsub: support for setting the start event number and the number of batches to run (done, and now in production use to process 200k events of input data). It can handle input stdhep files in which each job's events lie in [0, evtmax].
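Setting a start event number plus a batch size is enough to cover a large input such as the 200k-event sample without gaps or overlaps. This is a minimal sketch of that splitting logic. `split_events` and its parameters are hypothetical and are not dsub's actual interface, which this report does not describe.

```python
def split_events(total_events, events_per_job, start_evt=0):
    """Hypothetical helper: return (first_event, n_events) for each job.

    Consecutive batches start where the previous one ended, so every
    event in [start_evt, start_evt + total_events) is processed once.
    """
    if events_per_job <= 0:
        raise ValueError("events_per_job must be positive")
    batches = []
    first = start_evt
    end = start_evt + total_events
    while first < end:
        # The last batch may be shorter than events_per_job.
        n = min(events_per_job, end - first)
        batches.append((first, n))
        first += n
    return batches


if __name__ == "__main__":
    jobs = split_events(200_000, 1_000)
    print(len(jobs), jobs[0], jobs[-1])
```

Handing each job only its `(first_event, n_events)` pair keeps the per-job configuration small, and the job list can be regenerated deterministically if some batches need to be resubmitted.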