Operations model Maite Barroso, CERN On behalf of EGEE operations WLCG Service Workshop 11/02/2006.


1 Operations model
Maite Barroso, CERN
On behalf of EGEE operations
WLCG Service Workshop, 11/02/2006

WLCG Service Workshop - Operations model - 11/02/06

2 Overview
Goal: what do we want to achieve?
Overview of EGEE operations
Present status
Plan and timelines to converge

3 Goal
Operation process, includes:
– Problem detection
– Reporting
– Problem solving
– Escalation procedures
To have a single operation process
– No EGEE/LCG division; we have a single infrastructure and should have a single process
In place when SC4 starts

4 EGEE Operations Structure
Operations Management Centre (OMC)
Core Infrastructure Centres (CIC)
– Manage daily grid operations: oversight, troubleshooting (“Operator on Duty”)
– Run infrastructure services
– UK/I, Fr, It, CERN, Ru, Taipei
Regional Operations Centres (ROC)
– Front-line support for user and operations issues
– Provide local knowledge and adaptations
– One in each region, many distributed
User Support Centre (GGUS)
– In FZK: provide single point of contact (service desk) + portal

5 EGEE Operations Process
Grid operator on duty
– 6 teams working in weekly rotation: CERN, IN2P3, INFN, UK/I, Ru, Taipei
– Crucial in improving site stability and management
Operations coordination
– Weekly operations meetings
– Regular ROC, CIC managers meetings
– Series of EGEE Operations Workshops: Nov 04, May 05, Sep 05
Geographically distributed responsibility for operations:
– There is no “central” operation
– Tools are developed/hosted at different sites: GOC DB (RAL), SFT (CERN), GStat (Taipei), CIC Portal (Lyon)
Procedures described in Operations Manual
– Introducing new sites
– Site downtime scheduling
– Suspending a site
– Escalation procedures
– etc.
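The weekly operator-on-duty rotation can be pictured as a simple round-robin over the six teams. The sketch below is illustrative only: the team names come from the slide, but the order of the rotation, the start date, and the function name `team_on_duty` are assumptions, not part of the EGEE procedures.

```python
from datetime import date

# Teams named on the slide; their order within the rotation is an assumption.
TEAMS = ["CERN", "IN2P3", "INFN", "UK/I", "Ru", "Taipei"]


def team_on_duty(day: date, rotation_start: date) -> str:
    """Return the team on duty for the 7-day block that contains `day`.

    `rotation_start` is a hypothetical date on which the first team's
    week begins; blocks of 7 days are counted from it.
    """
    if day < rotation_start:
        raise ValueError("day precedes the start of the rotation")
    weeks_elapsed = (day - rotation_start).days // 7
    return TEAMS[weeks_elapsed % len(TEAMS)]


# Hypothetical start date, chosen only for illustration.
start = date(2006, 2, 13)
for d in (date(2006, 2, 15), date(2006, 2, 20), date(2006, 3, 27)):
    print(d.isoformat(), team_on_duty(d, start))
```

With six teams, each team is on duty again after six weeks, which is what the modulo operation expresses.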

6 Operations tools: Dashboard
Dashboard provides top level view of problems:
– Integrated view of monitoring tools (SFT, GStat) shows only failures and assigned tickets
– Single tool for ticket creation and notification emails with detailed problem categorisation and templates
– Detailed site view with table of open tickets and links to monitoring results
– Ticket browser highlighting expired tickets
Screenshot labels: Test summary (SFT, GStat); GGUS ticket status; Problem categories; Sites list (reporting new problems)
Developed and operated by CC-IN2P3: http://cic.in2p3.fr/
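Two of the dashboard behaviours described above, showing only failures and highlighting expired tickets, amount to simple filters. The following is a minimal sketch under stated assumptions: the `Ticket` class, its `deadline` field, the test names, and both function names are hypothetical and do not reflect the actual CIC portal code or the SFT/GStat data formats.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Ticket:
    site: str
    summary: str
    deadline: date  # hypothetical: date by which a response is expected


def failing_sites(test_results: dict) -> dict:
    """Keep only sites with at least one failed test, listing the failed tests.

    `test_results` maps site name -> {test name -> passed (bool)}.
    """
    failures = {}
    for site, results in test_results.items():
        failed = sorted(name for name, ok in results.items() if not ok)
        if failed:
            failures[site] = failed
    return failures


def expired_tickets(tickets: list, today: date) -> list:
    """Return the tickets whose deadline has already passed."""
    return [t for t in tickets if t.deadline < today]


# Illustrative data only; site and test names are made up.
results = {
    "site-a": {"job-submit": True, "replica-mgmt": False},
    "site-b": {"job-submit": True, "replica-mgmt": True},
}
tickets = [
    Ticket("site-a", "replica-mgmt test fails", date(2006, 2, 1)),
    Ticket("site-b", "scheduled follow-up", date(2006, 3, 1)),
]
print(failing_sites(results))
print([t.site for t in expired_tickets(tickets, date(2006, 2, 11))])
```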

7 Operations Support flow
Monitoring shows a problem
Operator submits a GGUS ticket against the ROC and CCs the site. The ticket is followed until it is solved
ROC and Site work to resolve the problem
Diagram labels: Operator-on-duty, Site, ROC; 1st level support, 2nd level support

8 LCG SC Operations
Daily operational problems are followed using the service-challenge-tech mailing list
Periodic meetings:
– https://twiki.cern.ch/twiki/bin/view/LCG/ServiceChallengeMeetings
– Daily meeting at CERN, 9:00 am
– Weekly meeting, Monday at 16:00, with developers, sites and experiments represented

9 Operations plan - I
1. Merge site contacts, so there is one single contact point for each site, and register them in GOCDB. End of February
   Existing SC contacts: https://twiki.cern.ch/twiki/bin/view/LCG/TierOneContactDetails
   ASCC          lcg-sc@lists.grid.sinica.edu.tw
   BNL           bnl-sc@rcf.rhic.bnl.gov
   FNAL          cms-t1@fnal.gov
   GRIDKA        Service.Challenge@iwr.fzk.de
   IN2P3         sc@cc.in2p3.fr
   INFN          sc@infn.it
   NDGF          sc-tech@ndgf.org
   PIC           lcg.sc@pic.es
   RAL           lcg-support@gridpp.rl.ac.uk
   SARA/NIKHEF   tier1-ams@sara.nl
   TRIUMF        sc@triumf.ca
2. Build list of services run at each site, and register them in GOCDB. End of February

10 Operations plan - II
3. SFT evolution. Mid-March
New sensors to monitor all services:
   LFC       tests done
   FTS       end Feb
   RB        available
   SRM       tests available
   BDII      available
   CE        available
   VOMS      scripts done
   MyProxy   basic tests
SFT framework extension:
– To be able to integrate information from a variety of sources and tools
– New schema defined
– Move to Oracle db
Alarm Displays

11 Operations plan - III
4. Evolve support flow. March
Monitoring shows a problem
Operator submits a GGUS ticket against the Tier-1 and CCs the site
Tier-1 and Site work to resolve the problem
If the Tier-1 and Site cannot resolve the problem, the Tier-1 contacts the relevant Support Unit for assistance
Diagram labels: Operator-on-duty, Site, Tier-1, Service Support Unit (experts); 1st level support, 2nd level support, 3rd level support
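The evolved support flow can be sketched as a small state change on a ticket: it is filed against the Tier-1 with the site in CC, and only if the two cannot resolve it does the Tier-1 bring in the support unit for the affected service. The code below is a hypothetical illustration, not the GGUS data model: the `Ticket` fields, the function names, and the support-unit names are all assumptions.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Ticket:
    service: str                       # service the problem was detected on
    assigned_to: str                   # the Tier-1 the ticket is filed against
    cc: list                           # the site is copied on the ticket
    escalated_to: Optional[str] = None
    history: list = field(default_factory=list)


def open_ticket(site: str, tier1: str, service: str) -> Ticket:
    """Operator on duty files the ticket against the Tier-1 and CCs the site."""
    ticket = Ticket(service=service, assigned_to=tier1, cc=[site])
    ticket.history.append(f"opened against {tier1}, cc {site}")
    return ticket


def escalate(ticket: Ticket, support_units: dict) -> Ticket:
    """Tier-1 contacts the support unit for the service when it and the
    site cannot resolve the problem themselves.

    `support_units` maps service name -> support unit name (hypothetical).
    """
    if ticket.service not in support_units:
        raise LookupError(f"no support unit registered for {ticket.service}")
    ticket.escalated_to = support_units[ticket.service]
    ticket.history.append(
        f"{ticket.assigned_to} contacted {ticket.escalated_to}"
    )
    return ticket


# Illustrative names only.
units = {"FTS": "fts-support-unit", "LFC": "lfc-support-unit"}
t = open_ticket(site="site-x", tier1="tier1-y", service="FTS")
escalate(t, units)
print(t.history)
```

The ticket stays assigned to the Tier-1 after escalation, reflecting that the Tier-1 remains the contact point with the support units.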

12 Tier-1: role and responsibilities
Problems are reported from the Operator on duty to the Tier-1
Tier-1s are RESPONSIBLE for following up and solving the problem, in direct contact with the associated sites (Tier-2)
Tier-1s are the contact points with the Service Units, in case the Tier-1 and site are not able to fix a problem
Eventually, Tier-1s are responsible for building up the operation competence to support all the associated sites

13 Operations Checklist
The ROCs/Tier-1s need support to build their competence, especially in the beginning (now)
Support coming from service support units (developers), a unit per service
Support coming from distributed operation team:
– Operator on duty documenting the most common problems
– Operation guides
Service checklist: documentation, debugging procedures, etc.:
– 2nd level support organisation defined (who to call when there is a problem with the application or middleware)
– Mechanism to contact 2nd level organisation
– Response time for 2nd level organisation
– List of machines where service is running defined
– List of configuration parameters and their values for the software components
– List of processes to monitor
– List of file systems and their emergency thresholds for alarms
– Application status check script requirements defined
– Definition of scheduled processes (e.g. cron)
– Test environment defined and available
– Problem determination procedures including how to determine application vs middleware vs database issues
– Procedures for start/stop/drain/check status defined
– Automatic monitoring of the application in place
– Backup procedures defined and tested
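One way to track the service checklist per service is to hold the items as data and report which ones are still open. This is a minimal sketch: the item texts are taken from the slide, while the function name `missing_items` and the idea of tracking completion as a set are assumptions made for illustration.

```python
# Checklist items as listed on the slide.
SERVICE_CHECKLIST = [
    "2nd level support organisation defined",
    "Mechanism to contact 2nd level organisation",
    "Response time for 2nd level organisation",
    "List of machines where service is running defined",
    "List of configuration parameters and their values",
    "List of processes to monitor",
    "List of file systems and their emergency thresholds for alarms",
    "Application status check script requirements defined",
    "Definition of scheduled processes",
    "Test environment defined and available",
    "Problem determination procedures",
    "Procedures for start/stop/drain/check status defined",
    "Automatic monitoring of the application in place",
    "Backup procedures defined and tested",
]


def missing_items(completed: set) -> list:
    """Return checklist items not yet completed, in checklist order.

    Raises ValueError if `completed` names an item that is not on the
    checklist, to catch typos early.
    """
    unknown = completed - set(SERVICE_CHECKLIST)
    if unknown:
        raise ValueError(f"unknown checklist items: {sorted(unknown)}")
    return [item for item in SERVICE_CHECKLIST if item not in completed]


# Hypothetical progress for one service.
done = {
    "List of processes to monitor",
    "Backup procedures defined and tested",
}
remaining = missing_items(done)
print(f"{len(remaining)} of {len(SERVICE_CHECKLIST)} items still open")
```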

14 Intermediate steps
While the previous steps are being completed (end of March), some preparations could be made to start getting used to the operation process:
– Start using standard procedures from the operations dashboard:
    Declare a scheduled maintenance
    Include the SC problems in the weekly ROC report
    Broadcast tool
– Detect problems via other means, e.g.:
    Read the service-challenge-tech mailing list
    Read daily SC meeting summaries
– So the operators on duty report them via the operator dashboard
– Advantages:
    We test the operation flow
    The sites get used to receiving tickets, processing and answering them
    We test the escalation procedures and response time
    We define and use site contacts

15 Summary
We need the operation process to be ready for SC4
There is a plan to get there, where different groups need to be involved
– Developers
    Provide and maintain sensors
    Initial support from service support units (developers) is essential, especially in the beginning
    Operations checklist
– Distributed operation team
    Grid operator on duty
    Problem follow up and escalation procedures
– Sites
    They are ultimately responsible for ensuring that the services are running

