Published by Damon Tucker. Modified over 8 years ago.
1
Squad Report 19 Nov – 6 Dec 2010
Eric, Sabine, Luc, Manu, Wenjing, Irena
CAF, 6 Dec 2010
2
Efficiencies on all clouds, 19 Nov - 5 Dec. FR efficiency: 90.9 %
3
Successful jobs on all clouds, 19.11-5.12: 2 M jobs
4
Errors on all clouds, 19.11-5.12.2010
5
French cloud efficiency, 19.11-5.12
Lowest efficiency in LPC; RO-02 offline; CC-T2: 77 jobs holding
6
Successful jobs on the French cloud, 19.11-5.12: 259815 successful jobs
7
Errors on the French cloud, 19.11-5.12: 26366 failing jobs
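As a rough cross-check of the two French cloud counts quoted above, the job efficiency can be recomputed by hand. This is only a minimal sketch: it assumes efficiency means successful / (successful + failing), and the function name `job_efficiency` is illustrative. The monitoring page behind the FR efficiency quoted for all clouds may use a different definition or time window, so the result need not match it exactly.

```python
def job_efficiency(successful: int, failing: int) -> float:
    """Fraction of finished jobs that succeeded (assumed definition)."""
    total = successful + failing
    if total == 0:
        raise ValueError("no finished jobs to compute an efficiency from")
    return successful / total


# Counts quoted on the slides for the French cloud, 19.11-5.12
fr_successful = 259815
fr_failing = 26366

fr_efficiency = job_efficiency(fr_successful, fr_failing)
print(f"FR efficiency: {fr_efficiency:.1%}")
```

The same function can be applied per site to rank sites by efficiency, provided the per-site counts are available.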
8
Errors in CC IN2P3
9
Monday 22 Nov (Irena on shift)
● IN2P3-CC_DataDisk reopened
● 11k jobs running and 8k jobs transferring: number of pilots lowered for the night
● Message from Graeme: the FR pilot factory is "aggressive" in its polling of the Panda server. Actions taken:
– Vobox 04 stopped
– sleep set to 120 on Vobox 02 and 03 in /vo/atlas/panda/Submit_production
– sleep set to 240 on Vobox 02 and 03 in /vo/atlas/panda/Submit_analysis: python /opt/panda/bin/factory.py --sleep=240 --conf... & sleep 240
● Catherine modifies factory_prod.conf on Vobox 02: pilotlimit reduced by a factor of 4 (max 2000 if 2 Voboxes) and depthboost by a factor of 2
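To illustrate why a larger sleep value eases the load on the Panda server, the sketch below estimates how many polling cycles one factory performs per hour for a given sleep interval. It is a simplification, not a description of factory.py: it assumes each cycle costs a fixed time plus the sleep, and both `cycles_per_hour` and the 10-second cycle cost are hypothetical.

```python
def cycles_per_hour(sleep_s: float, cycle_s: float = 10.0) -> float:
    """Approximate number of polling cycles per hour.

    sleep_s: pause between cycles (the --sleep value, in seconds)
    cycle_s: assumed time spent in one cycle (hypothetical value)
    """
    period = sleep_s + cycle_s
    if sleep_s < 0 or cycle_s < 0 or period <= 0:
        raise ValueError("sleep_s and cycle_s must give a positive period")
    return 3600.0 / period


# Sleep values set on Monday 22 Nov for the two submission scripts
for script, sleep_s in [("Submit_production", 120), ("Submit_analysis", 240)]:
    print(f"{script}: about {cycles_per_hour(sleep_s):.1f} cycles per hour")
```

Under these assumptions, doubling the sleep roughly halves the polling rate per factory, and stopping one Vobox removes its share of the requests entirely.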
10
Monday (cont)
● Consequences: not much
● Number of jobs running at LAPP reduced, but not at the other T2s
● Total number of jobs running in FR stable
● Number of production jobs running at the T1 stable
● dCache stable
● Sites already offline: Romania 02 and 07
11
Tuesday 23 Nov
● dCache OK through the night
● On Vobox 03, pilot limits reduced
● Number of running jobs decreases slightly around 8:00
● Number of jobs writing or waiting to write outputs decreases a little
● Original conf file put back on Vobox 02 and Vobox 03
● Number of jobs queued on the T1 drops drastically at 10:00
● On Stephane's request, all AK47 transfers (merge.AOD and merge.DESD and..) are resumed. This triggers T1->Lyon transfers and Lyon->T2 transfers. Requested and implemented by Lyon yesterday
● LPC put to brokeroff
● CC understands the problem with the CEs (02 and 07: dead or almost!)
12
Wednesday 24 Nov
1. Submit_production restarted on Vobox 03 by Manu. Problem with the python version in Submit_production on Vobox 03: corrected
2. CEs are being "killed" (original French: "on tue les CE") because there is too much activity on the T1s
3. For Vobox 02, nqueue in factory_prod.conf reduced for the long queues (from 1000 to 300) and for the very long queues (from 600 to 300) on cclcgeli07 and cclcgeli02
4. Reactivation of SantaClaus RAW distribution from CERN to IN2P3-CC_DATATAPE
13
Wednesday 24 Nov (cont)
5. IN2P3-CC-T2-cclcgceli06*, IN2P3-CC-T2-cclcgceli09*, ANALY_LYON_LYON-T2, ANALY_LYON-T2 put online
6. Vobox 04 stopped
7. Eric starts PD2P
14
Thursday 25 Nov
- stopped pilot submission for production and analysis from Vobox 03
- stopped analysis pilot submission from Vobox 04
- Lyon T2 changed to disabled in factory_production on Vobox 03 and Vobox 02
- stopped Submit_atlasfr pilot submission (Vobox 02)
- analy_Lyon put online
- changed the values of pilotLimit, pilotDepth and pilotDepthBoost in factory_production.conf and factory_analysis.conf on Vobox 03 so that they are identical to the values on Vobox 02
15
Thursday 25 Nov (cont)
● Disabled T2 Lyon in factory_analysis.conf. For the T2, decoupling: one Vobox per CE: Vobox 03 for CE06, Vobox 02 for CE09
● The LCG-CEs of the T2 have an overload of jobs to process:
- 24 Nov evening on cclcgceli09
- 24 Nov afternoon on cclcgceli06
- 25 Nov morning on cclcgceli06
● Vobox 03 stopped in the afternoon of 25 Nov
● atlasfr pilot submission stopped on Vobox 02 on 25 Nov
● Analysis pilot submission on Vobox 04 stopped in the evening of 25 Nov
● In factory_production.conf, T2 Lyon set to offline on Vobox 02 and Vobox 03
16
Friday 26 Nov
- PD2P still not enabled due to a bug in SchedConfig
- on Vobox 02 and 03, the values of nqueue in factory_analysis.conf were doubled
- the T1 CEs were decoupled, one CE per Vobox: CE02 for Vobox 02, CE07 for Vobox 03
- Vobox 02: depth increased for the long queue (300 -> 500) and pilot limit set to 2250 = 4500/2 (was 1600 before)
- Vobox 03: pilot limit set to 2250 = 4500/2
- nothing changed for the verylong queue
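The pilot limit arithmetic on this slide (2250 = 4500/2) can be sketched as an even split of a total limit across the active Voboxes. This is an illustration only: it assumes 4500 is the total pilot limit for the T1 shared by two Voboxes, as the slide's division suggests, and the helper `per_vobox_pilot_limit` is a hypothetical name, not part of the pilot factory.

```python
def per_vobox_pilot_limit(total_limit: int, n_voboxes: int) -> int:
    """Split a total pilot limit evenly across Voboxes (rounded down)."""
    if n_voboxes <= 0:
        raise ValueError("need at least one Vobox")
    return total_limit // n_voboxes


# Assumed reading of the slide: 4500 pilots in total, two Voboxes serving the T1
voboxes = ["Vobox 02", "Vobox 03"]
limits = {name: per_vobox_pilot_limit(4500, len(voboxes)) for name in voboxes}
print(limits)
```

If a Vobox is stopped, recomputing with the smaller count shows how much the remaining ones would have to carry to keep the same total.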
17
Friday 26 Nov (cont)
● RO_07-NIPNE-PRODDISK full and cannot be cleaned manually
● PD2P restarted in the evening
● Zombie jobs found by Eric on cclchatlas02, generating a heavy load on the Japanese CEs
● Asked for submission of test jobs for LPC
18
Saturday 27 Nov
● Installation problems with 16.2.1: Alessandro patches. Message from Pavel saying that France has run 54 HI jobs
Sunday 28 Nov
● Hardware problem (kernel) at LPNHE: all queues closed
● Eric L. modifies the pilot limit for the T1 on Vobox 03 and 02 to 2500
19
29 Nov to 2 Dec: Sabine on shift
● CC: 16.2.1 installation at CC:
● Alessandro's installation jobs blocked because the T2prod queue is closed (T2prod = atlas099 + atlagrid) => queue reopened by Éric C., 20 jobs max
● 16.2.1 validated on cclcgceli02.in2p3.fr at 15:30
20
29 Nov - 2 Dec: Sites
● RO-07 SRM problem, PRODDISK to be cleaned: asked for news in GGUS
● GRIF-LAL DATADISK transfer problem: https://gus.fzk.de/ws/ticket_info.php?ticket=64702 no answer... 30/11 => set as slave of 64744
● ANALY_GRIF-LAL automatically set to brokeroff by a HammerCloud test (NEW!). Seems to be a problem with the SE => https://gus.fzk.de/ws/ticket_info.php?ticket=64743 Problem disappeared, HammerCloud tests OK: setting queue back online... where to comment?
21
● IN2P3-LPC still blacklisted in DDM although they recovered from downtime 2 days ago: https://savannah.cern.ch/support/index.php?118115 30/11 => unblacklisted
● Excluded site in DDM in the French cloud: SDU_LCG2, not in the production site list of Panda? Éric: Chinese T3. 30/11: unblacklisted
● GRIF-LPNHE back from downtime, asked for tests. Shifter tests bad; sent my own ones, OK. 30/11: prod back, but analy set to brokeroff by the shifter, why? Set analy online
● ANALY_CPPM offline: disk server problem https://gus.fzk.de/ws/ticket_info.php?ticket=64732 30/11: solved, back online
● Queued jobs on CC T2 disappeared
22
● ANALY_GRIF_LPNHE: set to brokeroff by HammerCloud during the night: one disk server at GRIF-LPNHE crashed last night with memory problems; it was restarted and is currently running with reduced RAM. Queue set back online in the morning; a few problems reappeared during the day
● BEIJING: get and put errors (DPM server down and promptly restarted), queues offline. Solved, in test
● ANALY_ROMANIA07 put back online (HammerCloud tests OK)
● Low FTS efficiency between IRFU and IN2P3 for a few days; IRFU as destination OK
● Low FTS efficiency between IN2P3 and LAL for a few days; LAL as source OK
● Files lost at LPC and DAST / space problem on ANALY_LPSC
23
Summary 2 Dec – 4 Dec (Sabine)
● LPSC: lack of space on local disk
● DAST message: some user jobs failed because of insufficient space on the WN scratch disk; the SchedConfig default is 14G, too large for some WNs
● RO-07 PRODDISK cleaned => queues to be restarted. RO-16 SE variables updated in SchedConfig
● FR/LYON: jobs failing -> software installation problem. Understood: the 16.3.0 release has to be set up exclusively using the asetup.[c]sh script
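The scratch-space failures reported by DAST come down to comparing the space a job expects with what a worker node actually offers. The sketch below shows one way a site admin might check this locally with the standard library. It is not part of Panda or the pilot: `has_enough_scratch` is a hypothetical helper, and reading "14G" as 14 GiB is an assumption.

```python
import shutil


def has_enough_scratch(path: str, required_gb: float = 14.0) -> bool:
    """Return True if the filesystem holding `path` has enough free space.

    required_gb is interpreted as GiB here (assumed reading of "14G").
    """
    free_bytes = shutil.disk_usage(path).free
    return free_bytes >= required_gb * 1024**3


# Check the current directory against the default quoted on the slide
if has_enough_scratch("."):
    print("scratch area can hold the default request")
else:
    print("scratch area is too small for the default request")
```

A node that fails this check would need either a smaller per-queue value in the configuration or more local disk.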
24
● ANALY-LYON-T2: no pilots
- queue online for cclcgceli06, all pilots unsubmitted: killed via condor_rm => no change
- queue offline for cclcgceli09 on Vobox 02 => set all online; error message in the log: problems with the gatekeeper
● RO-07 tests OK => back online
● GRIF-IRFU_PRODDISK: SOURCE error during TRANSFER phase: SOLVED
● IN2P3-LPC_DATADISK SRM failure: SOLVED
● RO-16: new tests after the SchedConfig modification: failed
● Beijing online
25
- 2 GGUS tickets open for transfer problems from LAL, with some updates in the elog
- keep an eye on transfers: FTS channels and new DDM destination vs source sites
- new version of condor and factory for CC doesn't work (Manu follows this)
- still to be done: reopen CC queues and set pilots for CC to the level before the problems (waiting for the previous point to be solved)
- check mailing list changes: OK
- RO-16 to be set up correctly (Sabine)