WLCG Tier0 – Tier1 Service Coordination Meeting Update
WLCG Grid Deployment Board, 23rd March 2010
CERN IT Department, Experiment Support
Introduction
- Since January 2010 a Tier1 Service Coordination meeting has been held, covering medium-term issues (e.g. a few weeks ahead), complementing the daily calls and ensuring good information flow between sites, the experiments and service providers
- Held every two weeks (the maximum practical frequency) at 15:30, using the same dial-in details as the daily call, and lasting around 90 minutes
- Also helps communication between teams at a site – including Tier0!
- Clear service recommendations are given – e.g. on deployment of new releases – in conjunction with input from the experiments, as well as longer-term follow-up on service issues
- The daily Operations call continues at 15:00 Geneva time; these calls focus on short-term issues – those that have come up since the previous meeting – and short-term fixes, and last 20-30 minutes
WLCG T1SCM Summary
- 4 meetings held so far this year; the 5th is this week (tomorrow) – the agenda has already been circulated
- We stick closely to agenda times – minutes available within a few days (max) of the meeting
- "Standing agenda" (next slide) + topical issues, including:
  - Service security issues
  - Oracle 11g client issues (under "conditions data access")
  - FTS status and rollout
  - CMS initiative on handling prolonged site downtimes
  - Review of Alarm Handling & Incident Escalation for Critical Services
- Highlights of each of these issues plus a summary of the standing agenda items will follow: see minutes for details!
- Good attendance from experiments, service providers at CERN and Tier1s – attendance list will be added as from this week
WLCG T1SCM Standing Agenda
- Data Management & Other Tier1 Service Issues
  - Includes update on "baseline versions", outstanding problems, release updates etc.
- Conditions Data Access & related services
  - FroNTier, CORAL server, ...
- Experiment Database Service issues
  - Reports from experiments & DBA teams, Streams replication, ...
- [ AOB ]
- As for the daily meeting, minutes are Twiki-based and pre-reports are encouraged
- Experts help to compile the material for each topic & summarize it in the minutes that are distributed soon after
Key Issues by Meeting
- January 21: Database & data management service update; Conditions data access and related services
- February 11: As above + FTS; experiment issues with DM services; Experiment services: security issues
- February 25: FTS deployment + related issues; Database services; Prolonged site downtime strategies
- March 11: FTS again; Data management service – deployment status & plans; Support, problem handling and alarm escalation; Database services
Service Security for ATLAS – generalize?
- A lot of detailed work presented on security for ATLAS services – too detailed for discussion here
- Some – or possibly many – of the points could also be applicable to other experiments
- To be followed up at future meetings...
Service Security for ATLAS – generalize? (cont.)
- ATLAS VOC for CERN/IT (alarm tickets, security, etc.)
- Management of ATLAS Central Services according to the SLA between CERN-IT and ATLAS (hardware requests, quattor, best practices, etc.)
- Providing assistance with machine, software and service management to ATLAS service developers and providers (distribution, software versions, rebooting, etc.)
- Provision of general frameworks, tools and practices to guarantee availability and reliability of services (squid/FroNTier distribution, web redirector, hot spares, sensors, better documentation, etc.)
- Collection of information and operational statistics (service inventory, machine usage, etc.)
- Enforcing the application of security policies
- Training of newcomers in the team
- Spreading knowledge among service developers and providers about the tools available within CERN/IT (Shibboleth2, SLS, etc.)
FTS Status and Rollout
- There were two serious bugs in FTS 2.2: one related to corruption of delegated proxies (#60095) and one related to agents crashing (#59955) – both fixed in FTS 2.2.3
- Rollout status (Site / Status / Comment):
  - CERN: installed
  - ASGC: new machine – UPDATE: FTS installed, will be put in production on …
  - BNL: installed 23/02
  - CNAF: installed 09/03
  - FNAL: installed
  - IN2P3: scheduled 18/03 – UPDATE: FTS installed on …
  - KIT: installed 18/02
  - NDGF: "soon"
  - NL-T1: ongoing – UPDATE: FTS installed on …
  - PIC: installed – UPDATE: FTS installed on …
  - RAL: scheduled – UPDATE: FTS installed on …
  - TRIUMF: scheduled 24/03, assuming results of tests are positive
- Expect this Thursday's meeting to ~conclude on FTS deployment (a minimal post-upgrade smoke test is sketched below)
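The sketch below shows, purely as an illustration, the kind of post-upgrade smoke test a site might run after installing FTS 2.2.3: submit a single transfer with the gLite FTS client tools and poll the job until it reaches a terminal state. The service endpoint and SURLs are placeholders, not real site values, and the terminal-state list is the commonly documented one.

```python
#!/usr/bin/env python3
# Minimal FTS smoke-test sketch (illustration only): submit one transfer via the
# gLite FTS client tools and poll its state. Endpoint and SURLs are placeholders.
import subprocess
import sys
import time

FTS_SERVICE = "https://fts.example-t1.org:8443/glite-data-transfer-fts/services/FileTransfer"
SOURCE = "srm://se.example-t0.org/dpm/example.org/home/dteam/smoketest-file"
DEST = "srm://se.example-t1.org/pnfs/example.org/data/dteam/smoketest-file"


def run(cmd):
    """Run a command and return its stripped stdout, raising on failure."""
    return subprocess.check_output(cmd).decode().strip()


# glite-transfer-submit prints the job identifier of the newly created job.
job_id = run(["glite-transfer-submit", "-s", FTS_SERVICE, SOURCE, DEST])
print("Submitted test job", job_id)

# Poll until the job reaches a terminal state, then exit accordingly.
while True:
    state = run(["glite-transfer-status", "-s", FTS_SERVICE, job_id])
    print("Current state:", state)
    if state in ("Finished", "FinishedDirty", "Failed", "Canceled"):
        sys.exit(0 if state == "Finished" else 1)
    time.sleep(30)
```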
Handling Prolonged Site Downtimes
- CMS activity, focusing on 4 main scenarios:
  1. Data loss at a Tier1
  2. Partial loss of a Tier1
  3. Procurement failure at a Tier1
  4. Extended Tier1 (site) outage
- These scenarios also apply to the Tier0 (including scenario 4...)
- Most Tier1s are also shared with ATLAS (and others...)
- WLCG-level coordination proposed (session at the workshop in July, but work will progress in parallel before then, with updates at the T1SCM)
Alarm Handling & Incident Escalation for Critical CERN CASTOR + SRM, grid services & databases presented the support levels in place since the reorganization of IT Services have moved into different groups FIO, DM, DES -> (CF), PES, DSS, DB ATLAS pointed out some inconsistencies in the support for elements of what users see as the CASTOR service (e.g. CASTOR + SRM backend DBs) CMS had asked for clarification of streams support for conditions online- offline outside working hours: currently 8x5 The bottom line: authorized alarmers in the experiments should use the alarm mechanism as previously agreed Propose an Alarm Test for all such services to check end-end flow Service Report will always drill-down on any alarm tickets at next MB + daily OPS follow-up 10
Data Management & Other Tier1 Service Issues Regular updates on storage, transfer and data management services FTS has been one of the main topics but updates on versions of storage systems installed at sites, recommended versions, plans, major issues have also been coveredinstalled The status of the installed versions (CASTOR, dCache, DPM, StoRM, …) and planned updates was covered in the last meeting revealing quite some disparity (abridged table at end)status Important to have this global overview of the status of services and medium term planning at the sites… Release status handover “in practice”: from EGEE (IT-GT) to EGI (SP-PT) (Ibergrid) 11
Conditions Data Access & related services This includes COOL / CORAL server / FroNTier / Squid Issues reported not only (potentially) impact more than 1 VO but also sites supporting multiple VOsreported Discussions about caching strategies and issues, partitioning for conditions, support for CernVM at Tier3s accessing squids at Tier2s… The recommended versions for the FroNTier-squid server and for the FroNTier servlet are now included in the WLCG Baseline Versions Table Detailed report on problems seen with Oracle client and various attempts at their resolution (summarized next) 12
Oracle 11g Client Issues
- ATLAS reported crashes of the Oracle 11.2 client used by CORAL and COOL on AMD Opteron quad-core nodes at NDGF (ATLAS bug #62194)
- No progress from Oracle in several weeks – some suggested improvements in reporting and follow-up
- ATLAS rolled back to the 10.2 client in February (a simple client-version check is sketched below)
- No loss of functionality for current production releases, although the 11g client does provide new features that are currently being evaluated for CORAL, such as client result caching and integration with TimesTen caching
- Final resolution still pending – and needs to be followed up on: should we maintain an action list? Global or by area?
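CORAL and COOL are C++ packages; the snippet below is only an illustration, using the cx_Oracle Python binding, of the kind of check a site might run to confirm which Oracle client library a node actually picks up after rolling back from 11.2 to 10.2. The connect string and account are placeholders, not a real service.

```python
# Illustration only: report the Oracle client (and, if reachable, server) version
# seen on a node, e.g. to verify a client rollback. DSN and credentials are fake.
import cx_Oracle

client = ".".join(str(part) for part in cx_Oracle.clientversion())
print("Oracle client version:", client)

# Optional server-side check, once a connection can be opened.
conn = cx_Oracle.connect("reader", "SECRET",
                         "example-db.cern.ch:10121/EXAMPLE.cern.ch")
print("Oracle server version:", conn.version)
conn.close()
```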
Experiment Database Service Issues
- For a full DB service update, see IT-DB
- This slide just lists some of the key topics discussed in recent meetings:
  - Oracle patch update – status at the sites & plans
  - Streams bugs, replication problems, status of storage and other upgrades (e.g. memory, RAC nodes)
  - Need for further consistency checks – e.g. in the case of replication of ATLAS conditions to BNL (followed up on the experiment side); a naive example of such a check is sketched below
  - Having DBAs more exposed to the WLCG service perspective – as opposed to the pure DB-centric view – is felt to have been positive
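As a naive illustration of the consistency checks mentioned above: compare row counts for a few replicated tables between the source database and a Tier1 replica. Connect strings, schema and table names are placeholders; a real check would also compare content (e.g. per-IOV checksums), not just counts.

```python
# Naive replication consistency check (sketch only): compare row counts between
# a source database and its replica. All connection details and tables are fake.
import cx_Oracle

SOURCE_DSN = "reader/SECRET@source-db.cern.ch:10121/SOURCE.cern.ch"
REPLICA_DSN = "reader/SECRET@replica-db.example-t1.org:1521/REPLICA.example.org"
TABLES = ["COND_SCHEMA.IOV_TABLE", "COND_SCHEMA.PAYLOAD_TABLE"]  # placeholders


def row_counts(dsn, tables):
    """Return {table: row count} for the given connect string."""
    conn = cx_Oracle.connect(dsn)
    try:
        cursor = conn.cursor()
        counts = {}
        for table in tables:
            # Table names come from the trusted list above, so plain
            # concatenation is acceptable for this sketch.
            cursor.execute("SELECT COUNT(*) FROM " + table)
            counts[table] = cursor.fetchone()[0]
        return counts
    finally:
        conn.close()


source = row_counts(SOURCE_DSN, TABLES)
replica = row_counts(REPLICA_DSN, TABLES)
for table in TABLES:
    status = "OK" if source[table] == replica[table] else "MISMATCH"
    print(f"{table:<30} source={source[table]:<10} replica={replica[table]:<10} {status}")
```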
Topics for Future Meetings
- Key issues from the LHCb Tier1 jamboree at NIKHEF
  - e.g. "Usability" plots used as a KPI in WLCG operations reports to the MB, and other dashboard / monitoring issues
- Data access – a long-standing and significant problem (all VOs)
  - (Update on the Data Access working group of the Technical Forum?)
- Update on GGUS ticket routing to the Tier0
- Alarm testing of all Tier0 critical services
  - Not the regular tests of the GGUS alarm chain...
- DB monitoring at Tier1 sites – update (BNL)
- Jamborees provide very valuable input to the setting of service priorities for the coming months
- Preparation of the July workshop + follow-on events
WLCG Collaboration Workshop
- The agenda for the July 7-9 workshop has been updated to give the experiments sequential slots for "jamborees" on Thursday 8th, plus a summary of the key issues on Friday 9th
- Protection set so that the experiments can manage their own sessions: Federico, Latchezar, Kors, Ian, Marco, Philippe
- Non experiment-specific issues should be pulled out into the general sessions
- Sites are (presumably) welcome to attend all sessions that they feel are relevant – others too?
- There will be a small fee to cover refreshments + a sponsor for a "collaboration drink"
- Registration set to open 1st May
Summary
- The WLCG T1SCM is addressing a wide range of important service-related problems on a timescale of 1-2 weeks or more (longer in the case of prolonged site downtime strategies)
- Complementary to other meetings such as the daily operations calls, MB + GDB and workshops – to which regular summaries can be made – plus input to the WLCG Quarterly Reports
- Very good attendance so far and positive feedback received from the experiments – including suggestions for agenda items (also from sites)!
- Clear service recommendations are believed to be a strength...
- Further feedback welcome!
BACKUP
Storage service versions and plans, abridged (Site / Version / Comments):
- CERN: CASTOR (all); SRM (ALICE, CMS, LHCb); SRM (ATLAS)
- ASGC: ?
- BNL: dCache
- CNAF: CASTOR (ALICE), SRM (ALICE); StoRM (ATLAS, LHCb), StoRM 1.4 (CMS) – 15/3: StoRM upgrade for CMS
- FNAL: dCache (admin nodes); dCache (pool nodes)
- IN2P3: dCache with Chimera – April: configuration update and tests on tape robot, integration of new drives; Q3-Q4: one-week downtime
- KIT: dCache (admin nodes); dCache (pool nodes)
- NDGF: dCache; dCache (some pools); dCache (some pilot admin nodes)
- NL-T1: dCache – next LHC stop: migrate three dCache admin nodes to new hardware
- PIC: dCache
- RAL: CASTOR (stagers); CASTOR (nameserver central node); CASTOR (nameserver local node on SRM machines); CASTOR (tape servers); SRM
- TRIUMF: dCache