WLCG ‘Weekly’ Service Report ~~~ WLCG Management Board, 5th August 2008

Introduction
This ‘weekly’ report covers two weeks (MB summer schedule): last week (21 to 27 July) and this week (28 July to 3 August). Notes from the daily meetings can be found from: … (Some additional info from CERN C5 reports & other sources.)

GGUS Operator Alarm Tickets
At the beginning of July the GGUS operator alarm ticket template was deployed, allowing 3-4 users per experiment, identified by their grid certificates, to submit GGUS tickets directly to Tier-1 operations via a local mechanism. The end points are given in: …
The mechanism was tested over the following weeks and by 25 July only FNAL and CERN were still failing. At CERN this was reported to the official GGUS contact list, grid-cern-prod-…, with no reaction – too many cooks? In fact the source address at GGUS simply had to be added as owner of the CERN Simba lists (…-operator-alarm and grid-cern-prod-…).
A trailing issue is the site dependency of what may go into these tickets. The SA1 USAG group proposal states: “Grid Partners, especially VOs, require a direct way to report urgent problems, that must be solved within hours, to the service experts/site responsibles.” CERN, for example, restricts this to Data Management services (CASTOR, FTS, SRM, LFC). Should there be a common expectation, or simply a per-site definition (to be added into the Twiki)?

Site Reports (1/2)
CERN: On Friday 24 July a replacement of the NAS disk switches in front of the infrastructure Oracle databases stopped Twiki editing and indexing from … to …. Both ATLAS and LHCb have since emphasised the importance of the Twiki service to them.
A post-mortem of last week's failure of the ATLAS offline DB streams replication to the Tier-1s, between Saturday 26 July and Wednesday (4 days), has been prepared (see …). The issue was caused by a gap in the archivelog sequence propagated to the downstream capture database. Improvements in the monitoring will be needed to spot similar problems (assigned to development); a sketch of such a gap check is given below.
There was a problem submitting jobs to the CERN T0-export service (FTS-T0-EXPORT) on 28 July between … and … CEST, and the day after (29 July) between … and … CEST. During these periods all FTS job submission attempts failed with an Oracle error. The root cause was an unexpected behaviour of Oracle Data Pump (export/import). The problem has been fixed and is documented in ….
On … August a converter between fibre and a router at P5 failed at …, cutting off the CMS pit. Fixed by the network piquet at ….
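The monitoring improvement mentioned above amounts to detecting holes in the stream of archived-log sequence numbers before the downstream capture stalls. A minimal, illustrative Python sketch follows; the function name is ours, and the idea that the numbers would come from something like Oracle's V$ARCHIVED_LOG view is an assumption about the deployment, not a description of the actual Tier-0 monitoring:

    def find_sequence_gaps(sequence_numbers):
        """Return (first_missing, last_missing) ranges in a list of archivelog sequence numbers."""
        seqs = sorted(set(sequence_numbers))
        gaps = []
        for prev, cur in zip(seqs, seqs[1:]):
            if cur - prev > 1:  # hole between consecutive archived logs
                gaps.append((prev + 1, cur - 1))
        return gaps

    # Example: sequences 1005-1007 never reached the downstream capture database.
    print(find_sequence_gaps([1001, 1002, 1003, 1004, 1008, 1009]))  # -> [(1005, 1007)]

An alert would simply fire whenever the returned list is non-empty.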

Site Reports (2/2)
BNL: On 30 July there was an unexpected change in the published BNL site name, where a legacy value had previously been used. It was changed in OSG, which propagated to WLCG and stopped the FTS channel to BNL. CERN and the Tier-1 sites had to run a daily reconfiguration tool by hand (the following morning).
ATLAS Tier-2: Site AGLT2 (U. Michigan) – one of 4 used for ATLAS muon calibration – disappeared from the WLCG BDII on 16 July. The cause was a new version of the way the OSG-WLCG mapping is made – sites have to be marked as interoperable at several levels in order to be propagated (a sketch of such a visibility check follows below). Fixed on 30 July, a few days after the problem was noticed.
CNAF: Installed LFC on 21 July and so experienced the repeated crashes. Installed a cron job to check and restart the service, but with inverted logic, so it restarted it every 2 minutes. Now corrected, so back to 1-2 days between crashes.
NDGF: The PNFS log filled up over the weekend of 26/27 July and stopped their dCache from working.
General: A new LFC version (which fixes the periodic LFC crashes) entered the PPS in release 34 last Friday. The plan is to accelerate its passage through the PPS.
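Disappearances like the AGLT2 one can be watched for by querying the top-level BDII directly. A hedged Python sketch using python-ldap; the BDII host alias, port 2170, base DN "o=grid" and the GlueSiteUniqueID attribute reflect the usual Glue 1.x publishing conventions of the time and should be treated as assumptions, not as a description of the actual calibration-site monitoring:

    import ldap  # python-ldap

    def site_in_bdii(site_name, bdii="lcg-bdii.cern.ch", port=2170):
        """Return True if the site publishes a GlueSite entry in the top-level BDII."""
        conn = ldap.initialize("ldap://%s:%d" % (bdii, port))
        entries = conn.search_s(
            "o=grid", ldap.SCOPE_SUBTREE,
            "(GlueSiteUniqueID=%s)" % site_name, ["GlueSiteUniqueID"])
        return len(entries) > 0

    # Example: warn if a calibration site has dropped out of the information system.
    for site in ["AGLT2"]:
        if not site_in_bdii(site):
            print("WARNING: %s not visible in the WLCG BDII" % site)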

Experiment reports (1/3)
ALICE: Presented the results of the integration of the CREAM-CE with AliEn to a recent ALICE task force meeting (P. Mendez): the FZK-PPS VOBOX is fully configured for CREAM CE access, with 30 worker nodes behind it (since increased to 330), and access to the local WMS is ensured. Submission was tested following two different approaches.
Submission through the WMS: ALICE submits from a VOBOX with many delegations in the user proxy – from the UI, entering the VOBOX, passing through the WMS, then through the CREAM CE and through BLAH into the CE – which exceeds the current limit of 9 delegations (a quick way to gauge the chain length is sketched below). A bug report has been made; no further testing through the WMS yet.
Direct submission to the CREAM CE: job submission takes 3 seconds compared with 10 seconds for the LCG-RB. It needs the definition of a GridFTP server at each site to return the output sandbox (currently left on the CE, with messy recovery/cleanup; note that such servers were dropped with gLite 3.1).
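Each delegation step in that chain adds one proxy certificate on top of the user credential, which is how the ceiling of 9 is reached. A minimal, illustrative Python sketch for estimating the depth of a proxy chain by counting the PEM certificate blocks in the proxy file; the default file path and the one-certificate-per-delegation reading are assumptions for illustration, not part of the ALICE report:

    def proxy_chain_length(proxy_path="/tmp/x509up_u1000"):
        """Count PEM certificates in a proxy file; each delegation adds roughly one."""
        with open(proxy_path) as f:
            pem = f.read()
        return pem.count("-----BEGIN CERTIFICATE-----")

    # Example: warn before submission if the chain is already at the limit.
    if proxy_chain_length() >= 9:
        print("Proxy delegation chain is at or above the 9-certificate limit")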

Experiment reports (2/3)
LHCb: Heavy load on their CERN gLite 3 WMS instances around the 23rd, with … jobs/WMS/day. Borrowed a third WMS from ATLAS, with a plan to convert one to gLite 3.1 (which can handle more jobs) for testing. In testing, pilot jobs were found to exceed the limit of 10 proxy delegations in VDT 1.6 when going through the 3.1 WMS. A patch is said to be available – it will need a coordinated priority rollout. A proxy mix-up bug preventing multiple FQANs was found in the 3.1 WMS when a user wants to submit to the same instance with different VOMS roles.
CMS: Continuing the pattern of cosmics runs on Wednesday and Thursday. An Oracle security patch upgrade of devdb10 caused an unexpected interruption of pit to Tier-0 testing; this will be followed up at the regular CMS meeting tomorrow.

Experiment reports (3/3)
ATLAS: Since the end of July ATLAS has been in quasi-continuous cosmics data-taking mode. System tests are done during the day, and there is combined running (with as many systems as possible) usually every night and over the weekends. ATLAS processes these combined data at the Tier-0, which also includes registration with DDM. Cosmics data is hence flowing to the Tier-1s at any time, without prior dedicated announcement. Data rates and volumes are hard to predict; last week, for example, there was ~9 TB of RAW and ~2 TB of ESD. In addition, functional tests at 10% of nominal rates continue. Last week ALL Tier-1s received 100% of the FT data from CERN (at the same time all cosmics data is also replicated) and there was no double registration for Tier-1 and Tier-2 (an old problem).

Summary
Continued solid progress on many fronts, but several long outages emphasise the need for continual vigilance at all levels when data taking starts.