Eric, Sabine, Luc, Manu, Wenjing, Irena CAF dec 6 2010 Squad Report 19 Nov – 6 Dec 2010.


Efficiencies on all clouds, 19 Nov - 5 Dec: FR efficiency 90.9%

Successful jobs on all clouds: M jobs

Errors on all clouds

French cloud efficiency: lowest efficiency at LPC; RO-02 offline; CC-T2 has 77 jobs holding

Successful jobs on FR cloud: successful jobs

Errors on French cloud failing jobs

Errors in CC IN2P3

Monday Nov 22 (Irena on shift)
● IN2P3-CC_DataDisk reopened
● 11k jobs running and 8k jobs transferring: number of pilots lowered for the night
● Message from Graeme: the FR pilot factory is "aggressive" in its polling of the Panda server. Actions taken:
  - Vobox 04 stopped
  - sleep set to 120 on 02 and 03 in /vo/atlas/panda/Submit_production
  - sleep set to 240 on 02 and 03 in /vo/atlas/panda/Submit_analysis: python /opt/panda/bin/factory.py --sleep=240 --conf... & sleep 240
● Catherine modifies factory_prod.conf on 02: pilotlimit reduced by 4 (max 2000 if 2 Vobox), depthboost reduced by 2
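To see why raising the sleep value eases the load on the Panda server, here is a rough sketch. It assumes that each factory polls the server about once per cycle and that the cycle length is dominated by the sleep interval. Both are simplifying assumptions, and `polls_per_hour` is a hypothetical helper, not part of the pilot factory.

```python
def polls_per_hour(sleep_seconds: float, n_factories: int = 1) -> float:
    """Approximate number of server polls per hour.

    Assumption (not taken from the factory code): one poll per cycle,
    with the cycle time roughly equal to the sleep interval.
    """
    if sleep_seconds <= 0:
        raise ValueError("sleep_seconds must be positive")
    if n_factories < 0:
        raise ValueError("n_factories must not be negative")
    return n_factories * 3600.0 / sleep_seconds


if __name__ == "__main__":
    # Sleep values from the slide: 120 s (production) and 240 s (analysis),
    # each assumed to run on two VO boxes (02 and 03).
    for label, sleep in (("production", 120), ("analysis", 240)):
        rate = polls_per_hour(sleep, n_factories=2)
        print(f"{label}: ~{rate:.0f} polls/hour")
```

Under these assumptions, doubling the sleep interval halves the polling rate, which is the effect the change was aiming for.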

Monday (cont)
● Consequences: not much
● Number of jobs running at LAPP reduced, but not at the other T2s
● Global number of jobs running in FR stable
● Number of jobs running in T1 production stable
● dCache stable
● Sites already offline: Romania 02 and 07

Tuesday 23 Nov
● dCache OK through the night
● On Vobox 03, reduction in pilot limits
● Number of jobs running drops slightly around 8:00
● Number of jobs writing or waiting to write outputs drops a little
● Original conf file put back on Vobox 02 and Vobox 03
● Number of jobs queued on the T1 drops drastically at 10:00
● On Stephane's request, all AK47 transfers (merge.AOD and merge.DESD and..) are resumed. This triggers T1->Lyon transfers and Lyon->T2 transfers. Requested and implemented by Lyon yesterday
● LPC put to brokeroff
● CC understands the problem with the CEs (02 and 07: dead or nearly so!)

Wednesday 24 Nov
1. Submit_production restarted on Vobox 03 by Manu. Problem with the python version in Submit_production on Vobox 03: corrected
2. CEs are being "killed" because there is too much activity on the T1s
3. For Vobox 02, nqueue in factory_prod.conf reduced for the long queues (from 1000 to 300) and for the very long queues (from 600 to 300) on cclcgeli07 and cclcgeli02
4. Reactivation of SantaClaus RAW distribution from CERN to IN2P3-CC_DATATAPE

Wednesday 24 Nov (cont)
5. IN2P3-CC-T2-cclcgceli06*, IN2P3-CC-T2-cclcgceli09*, ANALY_LYON_LYON-T2, ANALY_LYON-T2 put online
6. Vobox 04 stopped
7. Eric starts PD2P

Thursday Nov 25
- Stopped pilot submission for production and analysis from Vobox 03
- Stopped analysis pilot submission for Vobox 04
- Lyon T2 set to disabled in factory_production on Vobox 03 and Vobox 02
- Stopped Submit_atlasfr pilot submission (Vobox 02)
- analy_Lyon put online
- Changed the values of pilotLimit, pilotDepth and pilotDepthBoost in factory_production.conf and factory_analysis.conf on Vobox 03 so that they are identical to the values on Vobox 02

Thursday Nov 25 (cont)
● T2 Lyon disabled in factory_analysis.conf
● For the T2, decoupling: one Vobox per CE (Vobox 03 with CE06, Vobox 02 with CE09)
● The LCG-CEs of the T2 are overloaded with jobs to process:
  - 24 Nov evening on cclcgceli
  - Nov afternoon on cclcgceli
  - Nov morning on cclcgceli06
● Vobox 03 stopped in the afternoon of Nov 25
● atlasfr pilot submission stopped on Vobox 02 on Nov 25
● Analysis pilot submission on Vobox 04 stopped on Nov 25, evening
● In factory_production.conf, T2 Lyon set to offline on Vobox 02 and Vobox 03

Friday Nov 26
- PD2P still not enabled due to a bug in SchedConfig
- On Vobox 02 and 03, the values of nqueue in factory_analysis.conf were doubled
- T1 CEs were decoupled, one CE per Vobox: CE02 to Vobox 02, CE07 to Vobox 03
- Vobox 02: depth increased for the long queue (300 -> 500) and pilot limit set to 2250 = 4500/2 (was 1600 before)
- Vobox 03: pilot limit set to 2250 = 4500/2
- Nothing changed for the verylong queue
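The pilot limit arithmetic on this slide (2250 = 4500/2) amounts to sharing one overall limit evenly among the VO boxes that feed the T1. A minimal sketch of that split, where `per_vobox_pilot_limit` is a hypothetical helper for illustration and not a factory option:

```python
def per_vobox_pilot_limit(total_limit: int, n_voboxes: int) -> int:
    """Share an overall pilot limit evenly among VO boxes.

    Integer division is used so the sum over all VO boxes never
    exceeds the overall limit.
    """
    if n_voboxes <= 0:
        raise ValueError("n_voboxes must be positive")
    if total_limit < 0:
        raise ValueError("total_limit must not be negative")
    return total_limit // n_voboxes


if __name__ == "__main__":
    # Values from the slide: overall limit 4500, two VO boxes (02 and 03).
    print(per_vobox_pilot_limit(4500, 2))
```

If one VO box is stopped, the same helper gives the limit the remaining box would need in order to keep the overall ceiling unchanged.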

Friday 26 (cont)
● RO_07-NIPNE-PRODDISK full and cannot be cleaned manually
● PD2P restarted in the evening
● Zombie jobs found by Eric on cclchatlas02, generating a heavy load on the Japanese CEs
● Asked for submission of test jobs for LPC

Saturday 27 Nov
● Installation problems with: Alessandro patches. Message from Pavel saying that France has run 54 HI jobs

Sunday 28 Nov
● Hardware problem (kernel) at LPNHE: all queues closed
● Eric L. changes the pilot limit for the T1 on Vobox 03 and 02 to 2500

29 Nov to 2 Dec: Sabine on shift
● CC:
● Alessandro's installation jobs blocked because the T2prod queue is closed (T2prod = atlas099 + atlagrid) => queue reopened by Éric C., 20 jobs max
● validated on cclcgceli02.in2p3.fr at 15:30

29 Nov - 2 Dec: Sites
● RO-07: SRM problem, PRODDISK to be cleaned: ask for news in GGUS
● GRIF-LAL DATADISK transfer problem, no answer... 30/11 => set as slave of 64744 https://gus.fzk.de/ws/ticket_info.php?ticket=64702
● ANALY_GRIF-LAL automatically set to brokeroff by a HammerCloud test (NEW!). Seems to be an SE problem => problem disappeared, HammerCloud tests OK: setting the queue back online... where to comment?

● IN2P3-LPC still blacklisted in DDM although they recovered from downtime 2 days ago: 30/11 => unblacklisted
● Excluded site in DDM in the French cloud: SDU_LCG2, not in the production site list of Panda? Éric: Chinese T3. 30/11 unblacklisted
● GRIF-LPNHE back from downtime, asked for tests. Shifter tests bad; sent my own ones, OK. 30/11 prod back, but analy set to brokeroff by the shifter, why? Set analy online
● ANALY_CPPM offline: disk server problem. 30/11 solved, back online
● Queued jobs on CC T2 disappeared

● ANALY_GRIF_LPNHE: set to brokeroff by HammerCloud in the night. One disk server at GRIF-LPNHE crashed last night with memory problems; it was restarted and is currently running with reduced RAM. Queue set back online in the morning; a few problems reappeared during the day
● BEIJING: get and put errors (DPM server down and promptly restarted), queues offline. Solved, in test
● ANALY_ROMANIA07 put back online (HammerCloud tests OK)
● Low FTS efficiency between IRFU and IN2P3 for a few days; IRFU as destination OK
● Low FTS efficiency between IN2P3 and LAL for a few days; LAL as source OK
● Files LPC and DAST / space problem on ANALY_LPSC

Summary Dec 2 - Dec 4 (Sabine)
● LPSC: lack of space on local disk
● DAST message: some user jobs failed because there was not enough space on the WN scratch disk; the SchedConfig default is 14G, too large for some WNs
● RO-07: PRODDISK cleaned => queues to be restarted
● RO-16: SE variables updated in SchedConfig
● FR/LYON: jobs failing -> software installation problem. Understood: the release has to be set exclusively using the asetup.[c]sh script
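The scratch-disk failure above comes down to comparing the free space on a worker node with the configured requirement. A minimal sketch of such a check follows. It assumes that "14G" means 14 GiB and that the scratch area is a local directory; the function names and the `/tmp` default are hypothetical and do not reflect how the pilot or SchedConfig performs the check.

```python
import shutil

GIB = 1024 ** 3  # assumption: "14G" on the slide is read as 14 GiB


def scratch_ok(free_bytes: int, required_gib: float = 14.0) -> bool:
    """Return True if the free space covers the configured requirement."""
    return free_bytes >= required_gib * GIB


def free_scratch_bytes(path: str = "/tmp") -> int:
    """Free bytes on the filesystem holding `path` (hypothetical scratch dir)."""
    return shutil.disk_usage(path).free


if __name__ == "__main__":
    free = free_scratch_bytes()
    print(f"free: {free / GIB:.1f} GiB, enough for 14 GiB: {scratch_ok(free)}")
```

A node with, say, 10 GiB of free scratch would fail the default check but pass if the requirement were lowered for that site.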

● ANALY-LYON-T2: no pilots
  - queue online for cclcgceli06, all pilots unsubmitted: killed via condor_rm => no change
  - queue offline for cclcgceli09 on Vobox 02 => set all online; error message in the log: problems with the gatekeeper
● RO-07: tests OK => back online
● GRIF-IRFU_PRODDISK: SOURCE error during TRANSFER phase: SOLVED
● IN2P3-LPC_DATADISK: SRM failure: SOLVED
● RO-16: new tests after SchedConfig modification: failed
● Beijing online

- 2 GGUS tickets open for transfer problems from LAL, with some updates in the elog
- Keep an eye on transfers: FTS channels and new DDM destination vs source sites
- New version of condor and factory for CC doesn't work (Manu is following this)
- Still to be done: reopen the CC queues and set the pilots for CC back to the level before the problems (waiting for the previous point to be solved)
- Check mailing list changes: OK
- RO-16 to be set up correctly (Sabine)