Tier-0 CCRC'08 May Post-Mortem
Miguel Santos, Ricardo Silva
CERN IT-FIO-FS

Outline
- Overall summary
- Disk cache
  - CASTOR backend
  - SRM
- Tape
- CASTOR monitoring
- Batch
- LFC
- WMS

Overall Summary
- Services ran generally smoothly.
- The combined activity of the exercise was handled well.
  - Was the activity representative of production load?
- Inter-experiment interference from the combined exercise was limited:
  - some interference at the SRM layer;
  - reduced interference at the tape layer.
- Good experience, some lessons learnt.

Disk Cache Overview
- Normal activity; in general we were expecting more load on the system.
- Tier-0 storage works well.
- A few issues on the CASTOR backend, especially garbage collection on CASTORCMS.
  - Glad to have caught the GC problems now rather than later!
- Alert mailing list used for the first time.
  - Good to have exercised and debugged the alert procedure now!
- SRM2 'teething' problems.
- Monitoring improvements.

System performance
- CASTOR disk cache throughput (plot): peaks of 9 GB/s (IN+OUT).
  - On the bright side, it is nice to know the capacity is there!
- CASTORCMS disk cache throughput (plot): three CMS GC problems contributed to large peaks of internal traffic.

CASTOR incidents
- 05/05: CMS GC problem (file-size-based weighting)
- 11/05: CMS t1transfer has a large migration queue
- 14/05: CMS GC (SRM set-weight function incrementing weight + baseline weight)
- 20/05: CMS DB slowdown
- 20/05: LHCBFAILOVER has a large migration queue
- 24/05: PUBLIC slowdown
- 29/05: CMS GC (SRM default weight)

In general the incidents were detected and resolved within short periods of time. The impact was compartmentalised, with reduced interference both within the affected instance and with other instances.
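Most of the CMS entries above are garbage-collection weight-policy problems. As a rough illustration (a minimal sketch, not the actual CASTOR policy code), the snippet below contrasts a purely size-based eviction weight with an access-time-based one, showing why the former can evict freshly written files that are then re-staged, producing the internal-traffic peaks mentioned earlier:

```python
# Illustrative sketch only, not the CASTOR GC implementation: files with the
# lowest weight are evicted first.

def weight_size_based(size_bytes, last_access):
    # Largest files are evicted first, however recently they were written.
    return -size_bytes

def weight_lru(size_bytes, last_access):
    # Least recently accessed files are evicted first.
    return last_access

def gc_candidates(files, weight_fn, n):
    """Return the n files a collector using weight_fn would remove first."""
    return sorted(files, key=lambda f: weight_fn(f["size"], f["atime"]))[:n]

# Example: a freshly written 3 GB file is the first victim of the size-based
# policy, while the LRU-style policy picks the oldest file instead.
pool = [
    {"name": "new_big.raw", "size": 3_000_000_000, "atime": 1212100000},
    {"name": "old_small.root", "size": 200_000_000, "atime": 1209000000},
]
print(gc_candidates(pool, weight_size_based, 1))  # -> new_big.raw
print(gc_candidates(pool, weight_lru, 1))         # -> old_small.root
```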

Alert mailing lists
- The May 24th PUBLIC slowdown was detected and fixed at 9 a.m.
- The problem had propagated to the shared SRM and from there to the other SRMs.
- The SRM outage was noticed and reported by ATLAS, and a mail was sent to the atlas-operator-alarm list.
- The 24/7 operator had noticed a high load on the SRM DB service due to a high number of Oracle sessions. The operator rebooted the DB server at ~11:30, after which the SRM endpoints slowly recovered.
- In parallel, the data service stand-by service manager reported the problem to the SRM service manager.
- The operator procedure has been improved, with clearer instructions on how to check the information, verify whether the service is covered by piquet, and contact the piquet.

SRM
- When it works, it works well:
  - a large volume of data was transferred;
  - the average rate was high.
- Reliability is still an issue: ~10 incidents, with impact ranging from service degradation to complete unavailability.

SRM Incidents & Conclusions
- May 5th – redundant SRM back-ends lock each other in the database [all VOs]
- May 13th – lack of space on the SRM DB [LHCb]
- May 13th – DB "extreme locking" / DB deadlocks [all VOs]
- May 9th, 14th, 19th – SRM 'stuck', no threads to handle requests [ATLAS]
- May 21st, 24th – slow stager backend causes the SRM to get stuck / DB overload [all VOs]
- May 30th – get timeouts due to slowness on the CASTOR backend [ATLAS, LHCb]
- Three times in May – problematic use of soft pinning caused GC problems [CMS]
- June 6th – patch update crashed backend servers [ATLAS, ALICE, CMS]

To be improved: better resiliency to problems, more service decoupling, some bugs need to be fixed, and better testing needs to be done.

SRM2 Outlook
- Separate the LHC VOs out of the shared instance.
- Migrate all SRM databases to Oracle RAC (done for ATLAS).
- Upgrade to SRM 2.7 and deploy on SLC4:
  - redundant backends;
  - uses the CASTOR API, which allows deployment of redundant stager daemons;
  - deploy fixes for identified bugs.
- Configure SRM DLF to send logs to the appropriate stager DLF, improving our debugging response time.
- Continue improving service monitoring.

Tape Overview
- Writing efficiency continues to improve:
  - file sizes, especially for ATLAS;
  - write policies working well;
  - tiny files are still an issue; follow-up will continue.
- High read activity continues:
  - data volume per mount is low;
  - non-production access to the Tier-0 continues.
- The tape service passed the CCRC May run without issues. More load was expected.

Operational Issues
- Maintenance took place, but without significant disruption:
  - IBM robot failure lasting 3 hours;
  - Sun robot arm failure;
  - firmware upgrade of both the IBM and Sun robots;
  - rolling upgrades of the CASTOR tape server code.
- Outlook:
  - tape queue prioritisation of production users;
  - tuning bulk transfers;
  - improve automation of metrics.

File size and performance

  Date           ALICE    ATLAS    CMS      LHCb
  CCRC May '08   322 MB   1291 MB  872 MB   1327 MB
  March '08      143 MB   230 MB   1490 MB  865 MB
  CCRC Feb '08   340 MB   320 MB   1470 MB  550 MB
  Jan '08        200 MB   250 MB   2000 MB  200 MB

CASTOR monitoring
- Monitoring improvements continue.
- New metrics are being proposed, implemented and deployed in close collaboration between IT-DM and IT-FIO.
- Deeper understanding of the various activities.
- Some examples follow...

Files migrated per day (CMS & ATLAS)
- From 11/05 to 13/05, ~95K files were written to the default pool.
- Top users, by number of files: … files, 6080 files, 4938 files.
- The average file size was 48.8 MB.
- We will follow up on such issues.
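As a sketch of the kind of follow-up tooling this implies (illustrative only: the record format, threshold and function name are assumptions, not the actual CASTOR monitoring code), one could flag the users writing many small files to the default pool like this:

```python
from collections import Counter

def small_file_report(migrations, threshold=100 * 1024 * 1024):
    """Count files below `threshold` bytes per user, to identify who is
    writing large numbers of small files to tape.

    `migrations` is assumed to be an iterable of (user, size_bytes) records
    extracted from the migration logs for the period of interest.
    """
    counts = Counter(user for user, size in migrations if size < threshold)
    return counts.most_common(10)

# Example with dummy records:
print(small_file_report([("userA", 40e6), ("userA", 55e6), ("userB", 2e9)]))
```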

ATLAS migrations
- Migration plots for the 'default' and 't0atlas' pools.
- T0ATLAS is working well.
- Already-mentioned issue on the default pool...

Repeat tape mounts
- Plot: repeated (read) tape mounts per VO, as average repeat mounts per VO per day.
- The metric was still being deployed during CCRC.
- High value of repeat mounts for reading...
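A minimal sketch of how such a metric can be derived, assuming read-mount records of the hypothetical form (day, vo, tape_id) extracted from the tape logs (this is not the implementation that was being deployed during CCRC):

```python
from collections import defaultdict

def repeat_mounts_per_vo_per_day(mounts):
    """Average number of repeat read mounts per VO per day.

    A 'repeat mount' is any mount of a tape beyond the first one by the same
    VO on the same day. The average is taken over the days on which the VO
    mounted at least one tape.
    """
    per_day_vo = defaultdict(lambda: defaultdict(int))  # (day, vo) -> tape -> mounts
    for day, vo, tape in mounts:
        per_day_vo[(day, vo)][tape] += 1

    repeats_per_day = defaultdict(list)  # vo -> list of daily repeat counts
    for (day, vo), tapes in per_day_vo.items():
        repeats_per_day[vo].append(sum(count - 1 for count in tapes.values()))

    return {vo: sum(v) / len(v) for vo, v in repeats_per_day.items()}

# Example with dummy records: the same tape mounted twice counts as one repeat.
print(repeat_mounts_per_vo_per_day([
    ("2008-05-20", "atlas", "T00001"),
    ("2008-05-20", "atlas", "T00001"),
    ("2008-05-20", "cms", "T00002"),
]))
```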

Power cut, 30th May
- What happened?
  - An equipment failure internal to CERN caused power to fail on the 18 kV loop feeding B513.
- What went right?
  - Resupply of the critical zone from the dedicated supply.
  - Communications with the CERN Control Centre much improved with respect to previous incidents.
  - Recovery: most services fully operational within 4 hours of power being restored.
- What went wrong?
  - Incorrect power connections in the critical area, notably for essential network components (DNS, time servers).
  - Communications between CC Operations and the Physics Database Support team: services were assumed to be OK due to the presence of critical power.
- What next?
  - Re-organisation of power connections where necessary (many already done; network equipment pending).
  - Review recovery procedures for equipment split across physics and critical power (many physics database services).
  - Regular (3-monthly) live tests of physics power failure in the critical zone; first in June if possible.

CERN - IT Department CH-1211 Genève 23 Switzerland CPU Services

Activity overview
- Overall smooth running.
- No visible problems found with the system.
- 2.4 M jobs executed, including 490 k Grid jobs.
- Low average number of pending jobs.

Preparations (1)
- Anticipating heavy usage, new resources were added to the dedicated (T0) and public shares.
- A new version of LSF was deployed after problems with double logging appeared at the end of the February run of CCRC'08.
  - The load we saw in May was not enough to properly test the system in production.
- WN and CE software versions were updated.
- Old CE hardware was replaced by new machines.
- As recommended, we changed back to publishing physical CPUs instead of cores.

Preparations (2)

At the start of the exercise (1)
- The number of pending jobs in the system dropped.
- No problems were found in the system.
- Maybe we looked "small", so we got fewer Grid jobs?
  - In GridMap, CERN was a 'small' site: not everybody had the same understanding of the recommendations.
  - So we did the same; we now publish cores as physical CPUs.
- We saw almost no effect on the number of pending jobs.

At the start of the exercise (2)
- Decreased activity overall?
  - 2.4 M jobs executed, including 490 k Grid jobs: 33% more than in April.
  - Average of 80 k jobs per day.
- Did the increased capacity cause faster draining of the queues?

Job throughput

Job throughput
- Is this the production running level? If yes, we are ready! ;)

No problems until…
- Power cut at 6:00 a.m. on Friday (30th May).
- lxmaster01 survived (critical power); scheduling was stopped.
- Running jobs were not re-queued and were lost (~5k).
  - It would be nice to have at least local jobs marked as re-queueable! (See the sketch below.)
- Queues were reopened at 11:00 when CASTOR became available again.
- Another ~5k jobs terminated a bit too quickly.
- Overall, ~10k jobs were affected by the event.
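For illustration, LSF jobs can be marked rerunnable at submission time with `bsub -r`, so that jobs interrupted by a host or power failure are re-queued by the scheduler instead of being lost. The wrapper below is a minimal sketch under that assumption; the function and queue name are hypothetical, not an existing CERN tool:

```python
import subprocess

def submit_rerunnable(command, queue="local_queue"):
    """Submit `command` to LSF marked as rerunnable (bsub -r)."""
    # Requires the LSF client tools (bsub) to be available on the PATH.
    return subprocess.run(["bsub", "-r", "-q", queue, command], check=True)

if __name__ == "__main__":
    submit_rerunnable("echo hello")
```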

CERN - IT Department CH-1211 Genève 23 Switzerland LFC Service Smooth operation (except for the power cut) Two bugs found in the gLite Middleware –Savannah #36550 and #36508 –Monitoring modified to avoid service impact Software version used: – sec.slc4 CCRC’08 May Tier-0 Post-Mortem- 28

WMS Service
- Smooth operation overall.
- A few problems:
  - 2nd May: a large fraction of CMS jobs aborted, due to a misconfiguration.
  - 6th, 14th, 27th May: CMS WMSes overloaded; will hopefully be fixed when moving to the SLC4 gLite 3.1 WMS (July?).
  - 30th May: the power cut corrupted the active job list on one WMS node. Painful recovery; some jobs were lost. The gLite 3.1 WMS "jobdir" functionality will avoid such problems, but it is not yet configured.
- In parallel with CCRC'08:
  - a pilot SLC4 gLite 3.1 WMS was installed: used by CMS, worked OK (but there are known issues).
- Note: LCG-RBs are still being used a lot.

All services
- A scheduled software upgrade was deployed, including a kernel upgrade:
  - reboots were needed on all machines;
  - no problems seen in most of the services.
- VOBoxes:
  - some internal confusion in CMS led to an "unexpected" reboot of one important machine.
- LXBuild:
  - the kernel was downgraded on the ATLAS machines because it caused problems with their software. A pre-production machine is missing to test such changes before they go into production.

Questions?