WLCG Service Report ~~~ WLCG Management Board, 9th December 2008


WLCG Service Report ~~~ WLCG Management Board, 9th December 2008

FTS SLC3 Services at CERN - RIP
- Access to the CERN-PROD SLC3-based FTS services (FTS-T0-EXPORT, FTS-T1-IMPORT, FTS-T2-SERVICE) was STOPPED on Monday 8th December.
- All production activity should now be on the new SLC4-based FTS services.
- See for details.

GGUS Summary

VO      Alarm   Team         Total
ALICE   0       0            1 (was 0)
ATLAS   0       11 (was 4)   39 (was 46)
CMS     0       0            6 (was 6)
LHCb    0       0            8 (was 7)

Comment – the total number of GGUS tickets per LHC VO is surprisingly(!) constant over at least a few weeks.
At this level the service is smooth and under control but…

Summary of the week
- Relatively smooth week – until Friday…
- Fri 5th: CASTOR ATLAS SRM “meltdown” - more later…
- Mon 8th: Scheduled CASTOR ATLAS stager DB intervention at CERN ran into overtime [ and so did the others … ]
  - This was an upgrade to and hence not possible as a rolling upgrade (scheduled for 1 hour – took 4.5)
  - Similar SRM symptoms to Friday also seen this Monday…
- Fire in Taipei - Sinica computing center (not ASGC) – message received Saturday – update received last night after prompting…
- ATLAS e-p (CNAF) in unscheduled downtime Friday evening due to gpfs problems – fixed over the weekend (very early Saturday morning)
  - Luca joined the call on Monday – even though it was a holiday in Italy (thanks!) – but was unable to follow the meeting (phone problems)
  - Plus ça change…

CASTOR ATLAS SRM
The problems on srm-atlas.cern.ch last Friday arose after the srm-atlas endpoint had been upgraded to the CASTOR SRM 2.7 release last Wednesday. The 2.7 release is a major upgrade (it also includes an upgrade from SLC3->4), which had been tested by the 4 LHC VOs for several weeks on the srm--pps endpoints. The upgrade last week was actually implemented as a switch-over of the DNS alias srm-atlas-pps -> srm-atlas in order to minimize the impact for ATLAS.
The new SRM worked fine between Wednesday and Friday morning, when ATLAS started to report FTS timeouts, which we traced to an accumulation of requests in the SRM. One of the things that comes with 2.7 is a change in the scheduling of the gridftp transfers, which seems to have been the cause of the problems we had on Friday (I think the developers are working on a full explanation for the problems). Why this didn't come up in the VO pps testing is not clear to me, but it could be load related.

CASTOR ATLAS DB
This was not a rolling upgrade operation, which is why we had requested a 1 hour intervention slot. We upgraded both the Clusterware and the RDBMS software from to , and we had to stop and reinstall everything.
The problem that we faced was that the ATLAS Stager cluster had one of the first installations of the RAC software, non-RPM based and not following the current naming conventions that the RPMs expect to find, so our Oracle RPM packages were not able to install the software. We had to manually clean up everything to be able to install the software from scratch.
We also discovered that the CDB info for these two nodes was not correct, and that prevented the Oracle NCM component from properly configuring the Oracle clusterware. …
For the rest of this week's upgrades, we will continue with the standard procedure and stick to the assigned 1 hour slot.

CASTOR CMS DB (today…)
Dear all, We are late with the Oracle upgrade due to RAC cluster configuration problems. Sorry about that. I will keep you informed. Best regards...Ignacio...
Service back up – announced to users at 12:57

CASTOR LHCb DB
Scheduled 09:00 – 10:00
Started
Hello, We have discovered that after the "successful" upgrade of the databases to , the clusterware refuses to start up after the reboot of the machines (to take the latest kernel into account). More news later.. ciao.
Ended (OK from CASTOR OPS): 12: :10
Dear all, The service is back. Apologies for the delay. Best regards...Ignacio…

TAIPEI FIRE & NETWORK ISSUES
- The network problem should have been resolved 2 hours after the first event was escalated (around 4:30pm).
- The service provider was able to bring in a different power source, with a separate power generator, to avoid exhausting the UPS system supporting the ASGC 10G network.
- Before that, the two affected links were rerouted through the AMS and HK routes; latency might increase to two or four times when connecting to US and JP (KEK and HIROSHIMA) sites respectively.
- The connection with Tokyo Univ saw only minor impact, as it is rerouted at HKiX.
- The local service provider will submit a quality improvement report and a risk assessment asap, to be reviewed by the local network operation administrators.
Report from NOC in daily meeting minutes (today’s…)

Pros & Cons – Managed Services
- Predictable service level and interventions; fewer interventions, lower stress level and more productivity, good match of expectations with reality, steady and measurable improvements in service quality, more time to work on the physics, more and better science, …
- Stress, anger, frustration, burn-out, numerous unpredictable interventions, including additional corrective interventions, unpredictable service level, loss of service, less time to work on physics, less and worse science, loss and / or corruption of data, …
All too often it is the 2nd column that typifies our services! (By “all too often” I mean > 1 per week…)

Historical Note
- Almost exactly 25 years ago I joined CERN in the IT operations group (DD-CO)
- Key technologies: clusters, storage, Oracle, VMS, just a little Unix…
- Key customers: the LEP experiments
  - VXCRNA came later & had a single 450MB disk per experiment!
- Key issues: change management, incident management
- Worked in 3 departments (DD, CO, IT), 3 buildings (513, 31, 28) and 9 groups

Summary
- WLCG Operations Page:
  - Linked directly from WLCG home page
- WLCG Operations mailing list:
- WLCG “Service Coordinator on Duty”
- WLCG Collaboration Workshop pre-CHEP
  - Registration deadline is December 28th 2008
  - People are already asking about agenda before they reserve for CHEP and book flights / hotels…
  - I leave on holiday December 17th…
- Let’s look at agenda…