WLCG Tier0 – Tier1 Service Coordination Meeting Update
WLCG Grid Deployment Board, 23rd March 2010
CERN IT Department, Experiment Support
Introduction
- Since January 2010 a Tier1 Service Coordination meeting has been held, covering medium-term issues (e.g. a few weeks ahead), complementing the daily calls and ensuring good information flow between sites, the experiments and service providers
- Held every two weeks (the maximum practical frequency) at 15:30, using the same dial-in details as the daily call, and lasting around 90 minutes
- Also helps communication between teams at a site – including Tier0!
- Clear service recommendations are given – e.g. on deployment of new releases – in conjunction with input from the experiments, as well as longer-term follow-up on service issues
- The daily Operations call continues at 15:00 Geneva time; these calls focus on short-term issues – those that have come up since the previous meeting – and short-term fixes, and last 20-30 minutes
WLCG T1SCM Summary
- 4 meetings held so far this year; the 5th is this week (tomorrow) – the agenda has already been circulated
- We stick closely to agenda times – minutes available within a few days (max) of the meeting
- "Standing agenda" (next slide) + topical issues, including:
  - Service security issues
  - Oracle 11g client issues (under "conditions data access")
  - FTS status and rollout
  - CMS initiative on handling prolonged site downtimes
  - Review of Alarm Handling & Incident Escalation for Critical Services
- Highlights of each of these issues plus a summary of the standing agenda items will follow: see minutes for details!
- Good attendance from experiments, service providers at CERN and Tier1s – attendance list will be added as from this week
WLCG T1SCM Standing Agenda
- Data Management & Other Tier1 Service Issues
  - Includes update on "baseline versions", outstanding problems, release updates etc.
- Conditions Data Access & related services
  - FroNTier, CORAL server, ...
- Experiment Database Service issues
  - Reports from experiments & DBA teams, Streams replication, ...
- [ AOB ]
- As for the daily meeting, minutes are Twiki-based and pre-reports are encouraged
- Experts help to compile the material for each topic & summarize it in the minutes that are distributed soon after
Key Issues by Meeting
- January 21: Database & data management service update; Conditions data access and related services
- February 11: As above + FTS; experiment issues with DM services; Experiment services: security issues
- February 25: FTS deployment + related issues; Database services; Prolonged site downtime strategies
- March 11: FTS again; Data management service – deployment status & plans; Support, problem handling and alarm escalation; Database services
Service Security for ATLAS – generalize?
- A lot of detailed work presented on security for ATLAS services – too detailed for discussion here
- Some – or possibly many – of the points could also be applicable to other experiments
- To be followed up at future meetings...
Service Security for ATLAS – generalize? (cont.)
- ATLAS VOC for CERN/IT (alarm tickets, security, etc.)
- Management of ATLAS Central Services according to the SLA between CERN-IT and ATLAS (hardware requests, quattor, best practices, etc.)
- Providing assistance with machine, software and service management to ATLAS service developers and providers (distribution, software versions, rebooting, etc.)
- Provision of general frameworks, tools and practices to guarantee availability and reliability of services (squid/FroNTier distribution, web redirector, hot spares, sensors, better documentation, etc.)
- Collection of information and operational statistics (service inventory, machine usage, etc.)
- Enforcing the application of security policies
- Training of newcomers in the team
- Spreading knowledge among service developers and providers about the tools available within CERN/IT (Shibboleth2, SLS, etc.)
FTS Status and Rollout
- There were two serious bugs in FTS 2.2: one related to corruption of delegated proxies (#60095) and one related to agents crashing (#59955) – both fixed in FTS 2.2.3
- Rollout status (Site / Status / Comment):
  - CERN: installed
  - ASGC: new machine – UPDATE: FTS installed, will be put in production on …
  - BNL: installed 23/02
  - CNAF: installed 09/03
  - FNAL: installed
  - IN2P3: scheduled 18/03 – UPDATE: FTS installed on …
  - KIT: installed 18/02
  - NDGF: "soon"
  - NL-T1: ongoing – UPDATE: FTS installed on …
  - PIC: installed – UPDATE: FTS installed on …
  - RAL: scheduled – UPDATE: FTS installed on …
  - TRIUMF: scheduled 24/03, assuming results of tests are positive
- Expect this Thursday's meeting to ~conclude on FTS deployment (a minimal post-upgrade smoke test is sketched below)
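The sketch below shows, purely as an illustration, the kind of post-upgrade smoke test a site might run after installing FTS 2.2.3: submit a single transfer with the gLite FTS client tools and poll the job until it reaches a terminal state. The service endpoint and SURLs are placeholders, not real site values, and the terminal-state list is the commonly documented one.

```python
#!/usr/bin/env python3
# Minimal FTS smoke-test sketch (illustration only): submit one transfer via the
# gLite FTS client tools and poll its state. Endpoint and SURLs are placeholders.
import subprocess
import sys
import time

FTS_SERVICE = "https://fts.example-t1.org:8443/glite-data-transfer-fts/services/FileTransfer"
SOURCE = "srm://se.example-t0.org/dpm/example.org/home/dteam/smoketest-file"
DEST = "srm://se.example-t1.org/pnfs/example.org/data/dteam/smoketest-file"


def run(cmd):
    """Run a command and return its stripped stdout, raising on failure."""
    return subprocess.check_output(cmd).decode().strip()


# glite-transfer-submit prints the job identifier of the newly created job.
job_id = run(["glite-transfer-submit", "-s", FTS_SERVICE, SOURCE, DEST])
print("Submitted test job", job_id)

# Poll until the job reaches a terminal state, then exit accordingly.
while True:
    state = run(["glite-transfer-status", "-s", FTS_SERVICE, job_id])
    print("Current state:", state)
    if state in ("Finished", "FinishedDirty", "Failed", "Canceled"):
        sys.exit(0 if state == "Finished" else 1)
    time.sleep(30)
```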
Handling Prolonged Site Downtimes
- CMS activity, focusing on 4 main scenarios:
  1. Data loss at a Tier1
  2. Partial loss of a Tier1
  3. Procurement failure at a Tier1
  4. Extended Tier1 (site) outage
- These scenarios also apply to the Tier0 (including scenario 4...)
- Most Tier1s are also shared with ATLAS (and others...)
- WLCG-level coordination proposed (session at the workshop in July, but work will progress in parallel before then, with updates at the T1SCM)
Alarm Handling & Incident Escalation for Critical CERN CASTOR + SRM, grid services & databases presented the support levels in place since the reorganization of IT Services have moved into different groups FIO, DM, DES -> (CF), PES, DSS, DB ATLAS pointed out some inconsistencies in the support for elements of what users see as the CASTOR service (e.g. CASTOR + SRM backend DBs) CMS had asked for clarification of streams support for conditions online- offline outside working hours: currently 8x5 The bottom line: authorized alarmers in the experiments should use the alarm mechanism as previously agreed Propose an Alarm Test for all such services to check end-end flow Service Report will always drill-down on any alarm tickets at next MB + daily OPS follow-up 10
Data Management & Other Tier1 Service Issues Regular updates on storage, transfer and data management services FTS has been one of the main topics but updates on versions of storage systems installed at sites, recommended versions, plans, major issues have also been coveredinstalled The status of the installed versions (CASTOR, dCache, DPM, StoRM, …) and planned updates was covered in the last meeting revealing quite some disparity (abridged table at end)status Important to have this global overview of the status of services and medium term planning at the sites… Release status handover “in practice”: from EGEE (IT-GT) to EGI (SP-PT) (Ibergrid) 11
Conditions Data Access & related services This includes COOL / CORAL server / FroNTier / Squid Issues reported not only (potentially) impact more than 1 VO but also sites supporting multiple VOsreported Discussions about caching strategies and issues, partitioning for conditions, support for CernVM at Tier3s accessing squids at Tier2s… The recommended versions for the FroNTier-squid server and for the FroNTier servlet are now included in the WLCG Baseline Versions Table Detailed report on problems seen with Oracle client and various attempts at their resolution (summarized next) 12
Oracle 11g Client Issues
- ATLAS reported crashes of the Oracle 11.2 client used by CORAL and COOL on AMD Opteron quad-core nodes at NDGF (ATLAS bug #62194)
- No progress from Oracle in several weeks – some suggested improvements in reporting and follow-up
- ATLAS rolled back to the 10.2 client in February (a simple client-version check is sketched below)
- No loss of functionality for current production releases, although the 11g client does provide new features that are currently being evaluated for CORAL, such as client result caching and integration with TimesTen caching
- Final resolution still pending – and needs to be followed up on: should we maintain an action list? Global or by area?
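CORAL and COOL are C++ packages; the snippet below is only an illustration, using the cx_Oracle Python binding, of the kind of check a site might run to confirm which Oracle client library a node actually picks up after rolling back from 11.2 to 10.2. The connect string and account are placeholders, not a real service.

```python
# Illustration only: report the Oracle client (and, if reachable, server) version
# seen on a node, e.g. to verify a client rollback. DSN and credentials are fake.
import cx_Oracle

client = ".".join(str(part) for part in cx_Oracle.clientversion())
print("Oracle client version:", client)

# Optional server-side check, once a connection can be opened.
conn = cx_Oracle.connect("reader", "SECRET",
                         "example-db.cern.ch:10121/EXAMPLE.cern.ch")
print("Oracle server version:", conn.version)
conn.close()
```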
Experiment Database Service Issues
- For a full DB service update, see IT-DB
- This slide just lists some of the key topics discussed in recent meetings:
  - Oracle patch update – status at the sites & plans
  - Streams bugs, replication problems, status of storage and other upgrades (e.g. memory, RAC nodes)
  - Need for further consistency checks – e.g. in the case of replication of ATLAS conditions to BNL (followed up on the experiment side); a naive example of such a check is sketched below
  - Having DBAs more exposed to the WLCG service perspective – as opposed to the pure DB-centric view – is felt to have been positive
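As a naive illustration of the consistency checks mentioned above: compare row counts for a few replicated tables between the source database and a Tier1 replica. Connect strings, schema and table names are placeholders; a real check would also compare content (e.g. per-IOV checksums), not just counts.

```python
# Naive replication consistency check (sketch only): compare row counts between
# a source database and its replica. All connection details and tables are fake.
import cx_Oracle

SOURCE_DSN = "reader/SECRET@source-db.cern.ch:10121/SOURCE.cern.ch"
REPLICA_DSN = "reader/SECRET@replica-db.example-t1.org:1521/REPLICA.example.org"
TABLES = ["COND_SCHEMA.IOV_TABLE", "COND_SCHEMA.PAYLOAD_TABLE"]  # placeholders


def row_counts(dsn, tables):
    """Return {table: row count} for the given connect string."""
    conn = cx_Oracle.connect(dsn)
    try:
        cursor = conn.cursor()
        counts = {}
        for table in tables:
            # Table names come from the trusted list above, so plain
            # concatenation is acceptable for this sketch.
            cursor.execute("SELECT COUNT(*) FROM " + table)
            counts[table] = cursor.fetchone()[0]
        return counts
    finally:
        conn.close()


source = row_counts(SOURCE_DSN, TABLES)
replica = row_counts(REPLICA_DSN, TABLES)
for table in TABLES:
    status = "OK" if source[table] == replica[table] else "MISMATCH"
    print(f"{table:<30} source={source[table]:<10} replica={replica[table]:<10} {status}")
```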
Topics for Future Meetings
- Key issues from the LHCb Tier1 jamboree at NIKHEF
  - e.g. "Usability" plots used as a KPI in WLCG operations reports to the MB, and other dashboard / monitoring issues
- Data access – a long-standing and significant problem (all VOs)
  - (Update on the Data Access working group of the Technical Forum?)
- Update on GGUS ticket routing to the Tier0
- Alarm testing of all Tier0 critical services
  - Not the regular tests of the GGUS alarm chain...
- DB monitoring at Tier1 sites – update (BNL)
- Jamborees provide very valuable input to the setting of service priorities for the coming months
- Preparation of the July workshop + follow-on events
WLCG Collaboration Workshop
- The agenda for the July 7-9 workshop has been updated to give the experiments sequential slots for "jamborees" on Thursday 8th, plus a summary of the key issues on Friday 9th
- Protection set so that the experiments can manage their own sessions: Federico, Latchezar, Kors, Ian, Marco, Philippe
- Non experiment-specific issues should be pulled out into the general sessions
- Sites are (presumably) welcome to attend all sessions that they feel are relevant – others too?
- There will be a small fee to cover refreshments + a sponsor for a "collaboration drink"
- Registration set to open 1st May
Summary
- The WLCG T1SCM is addressing a wide range of important service-related problems on a timescale of 1-2 weeks or more (longer in the case of prolonged site downtime strategies)
- Complementary to other meetings such as the daily operations calls, MB + GDB and workshops – to which regular summaries can be made – plus input to the WLCG Quarterly Reports
- Very good attendance so far and positive feedback received from the experiments – including suggestions for agenda items (also from sites)!
- Clear service recommendations are believed to be a strength...
- Further feedback welcome!
BACKUP
Storage service versions and plans, abridged (Site / Version / Comments):
- CERN: CASTOR (all); SRM (ALICE, CMS, LHCb); SRM (ATLAS)
- ASGC: ?
- BNL: dCache
- CNAF: CASTOR (ALICE), SRM (ALICE); StoRM (ATLAS, LHCb), StoRM 1.4 (CMS) – 15/3: StoRM upgrade for CMS
- FNAL: dCache (admin nodes); dCache (pool nodes)
- IN2P3: dCache with Chimera – April: configuration update and tests on tape robot, integration of new drives; Q3-Q4: one-week downtime
- KIT: dCache (admin nodes); dCache (pool nodes)
- NDGF: dCache; dCache (some pools); dCache (some pilot admin nodes)
- NL-T1: dCache – next LHC stop: migrate three dCache admin nodes to new hardware
- PIC: dCache
- RAL: CASTOR (stagers); CASTOR (nameserver central node); CASTOR (nameserver local node on SRM machines); CASTOR (tape servers); SRM
- TRIUMF: dCache