Grid Operations – Keeping the Grid Running. EB-TB Joint Meeting, John Gordon, 13th May 2004.



Production Service Grids
CCLRC is involved in Grid Operations for
–LCG
–GridPP
–NGS
–CCLRC
–EGEE
This means different things for different grids

UK GOC
Core of GOC built around experience in deploying and running the National Grid Service (NGS)
–Support service
–Help Desk/call centre?
Important to coordinate and integrate this with deployment and operations work in EGEE, LCG and similar projects.
–EGEE – low level services, CA, GOC, CERT...
Dedicated deployment and operations management will be a key component
Develop relationship to ETF(o), ETFp/NGS, HPC, and large campus and project focused grids, which are not under the direct control of the GOC

The LCG GOC Vision
GOC Processes and Activities
–Coordinating Grid Operations
–Defining Service Level Parameters
–Monitoring Service Performance Levels
–First-Level Fault Analysis
–Interacting with Local Support Groups
–Coordinating Security Activities
–Operations Development
–Grid Accounting

LCG Wider Picture
In LCG, GOC sits alongside
–Deployment Team – who roll out the middleware
–Certification & Testing team
–User Support Centre
–Experiment Support – for the applications

LCG GOC Monitoring
Within the scope of LCG we are responsible for monitoring how the grid is running – who is up, who is down, and why
Identifying problems, contacting the right people, suggesting actions
Providing scalable solutions to allow other people to monitor resources
Managing site information – definitive source of information
Accounting – aggregate job throughput (per site, per VO)
Established at CCLRC (RAL)
Status of LCG2 Grid here:

Overview
GOC Proposal envisaged three Phases
–Phase 1 Jul 03 – Oct 03
–Phase 2 Nov 03 – May 04
–Phase 3 Jun 04 – Jun 05
GOC Vision
What was planned in Phase 1 and its current status
What is planned for Phase 2

The Vision
GOC Processes and Activities
–Coordinating Grid Operations
–Defining Service Level Parameters
–Monitoring Service Performance Levels
–First-Level Fault Analysis
–Interacting with Local Support Groups
–Coordinating Security Activities
–Operations Development

Phase 1 (Jun 03 – Oct 03)
Taken from Proposal Jun 2003
a) Set up an initial monitoring centre - Done
–Steering Group established
–LCG-Rollout list installed
–GOC website set up
–Variety of Monitoring
–SLA tests developed and running for CE and RB

Phase 1
b) Draft Security Policy and Procedures - Done
–Drafted with the LCG Security Group
    Approved by GDB in October
    Will be submitted to SC2 for Adoption
–Three GOC-related supporting Annexes in preparation
    Service Level Agreement Guide - drafted
    Procedures for Resource Admins - partly drafted
    Procedure for site self-audit - in outline

Phase 1
c) Define Service Level Parameters – Partly Done
–Schedule, Availability, Reliability all clear and defined
    Schedule: the published periods of downtime for upgrading etc.
    Availability: the proportion of actual up-time to scheduled up-time
    Reliability: the mean time to failure
–Performance is service-specific; ideas under discussion
    needs experience with real users before deciding what is important
–Service Level Agreement
    The publication by the site of the targeted (designed) service level parameters for an LCG service in a prescribed format will comprise the SLA for that service
    The GOC will monitor and publish alongside the actual achieved values of the same parameters
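The two numeric parameters defined above can be worked through in a few lines. This is a minimal sketch, not the GOC's own tooling: the record layout and the names `ServiceRecord`, `availability` and `mean_time_to_failure` are assumptions made for illustration, and mean time to failure is taken as up-time divided by the number of failures.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ServiceRecord:
    """Hypothetical per-service figures for one reporting period."""
    scheduled_up_hours: float  # period length minus published downtime
    actual_up_hours: float     # hours the service was really up
    failures: int              # unscheduled failures seen in the period


def availability(rec: ServiceRecord) -> float:
    """Proportion of actual up-time to scheduled up-time."""
    if rec.scheduled_up_hours <= 0:
        raise ValueError("scheduled up-time must be positive")
    return rec.actual_up_hours / rec.scheduled_up_hours


def mean_time_to_failure(rec: ServiceRecord) -> Optional[float]:
    """Up-time per failure in hours; None when no failure was observed."""
    if rec.failures == 0:
        return None
    return rec.actual_up_hours / rec.failures
```

A site would publish its targeted values, and the monitor would publish the achieved values computed this way next to them.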

Phase 1
d) Establish a Monitoring Regime – Done (but further development is ongoing)
–SLA Monitoring
    CE and RB availability and reliability are being crudely monitored now
    Reports of significant failures sent to Rollout List
–Use and Development of MapCenter
–Use and Development of GppMon
–GridICE

Phase 1
e) Select tools for use and evaluation in Phase 2 - Done
–As Phase 1
    GppMon (extended to add history)
    MapCenter (extended to accommodate SLA tests)
    GridICE (run server for LCG2)
–plus MonALISA
    needs local sensing agents
–plus network monitoring tools from EDG WP7
    needs local agents
    needs R-GMA

Phase 1
In addition to the work envisaged in the Proposal for Phase 1, RAL is acting as an operational GOC by monitoring LCG sites from the moment they install the LCG software.
–All CEs are tested every 10 mins with an authentication test
–All RBs are tested every 10 mins with a job-list-match test
–Network connectivity is tested every 10 mins from RAL to every host
–Port accessibility is tested to every externally accessible service every 10 mins
–A trivial job is submitted to every CE every hour via Globus and via the CERN RB
–Logs are examined and analysed several times a week
–Significant failings or problems are reported to the LCG-Rollout list
–Several problems have been uncovered in both the monitors and in various sites
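The port accessibility test in the list above amounts to attempting a TCP connection to each published service. The sketch below shows one way to do that with the Python standard library; it is not the script RAL ran, and the function names and the `(host, port)` input format are assumptions.

```python
import socket
from typing import Dict, Iterable, Tuple


def port_is_open(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name-resolution failures.
        return False


def check_services(services: Iterable[Tuple[str, int]]) -> Dict[Tuple[str, int], bool]:
    """Probe every (host, port) pair once and report which ones answered."""
    return {(host, port): port_is_open(host, port) for host, port in services}
```

A scheduler such as cron could call `check_services` every 10 minutes with the hosts and ports registered for each site, for example a Gatekeeper on port 2119.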

Plans for Phase 2 (Nov 03 – May 04)
a) Set up a second monitoring centre
–Eventually there should be 2 more, one in the East and one in the West, to provide 24 hour cover and regional coordination of operational issues like alerts and SLAs
–Taipei have taken the packaged monitoring and installed it
–Now sharing monitoring duties
–Discussions with TRIUMF as third

Plans for Phase 2
b) Establish Grid operations and security coordination regime in consultation with
–LCG Security Group
–Local Security Officers
–Local Support Groups
–LCG User Support Centre (GGUS)
to
–promote the Security Policy and associated documents
–agree and establish common operational practices, principally the way in which SLAs and monitoring will work
–agree a fault analysis and alerting mechanism
–agree an incident response mechanism

Plans for Phase 2
c) Establish a simple change control regime
–question whether or to what degree 'control' is appropriate
–as a minimum ensure information about recent and prospective changes is published to the community
–establish whatever mechanism is agreed in coordination with local support groups
–the minimum in outline would include:
    the schedule of service down time (part of SLA)
    the schedule and nature of proposed changes
    site would publish information via GOC web site

Plans for Phase 2
d) Monitoring service levels
–Investigate using EDG WP7 network monitoring tools
    uses R-GMA
–Install tools to monitor and detect deviations from SLA
–Deploy remote agents - include in software distributions?
–Automatic alert mechanisms for operations staff
–Set up mechanisms to notify local support of problems

Monitoring Overview
Why We Monitor
–Keep systems up and running
–Notice failures; grid-wide services MDS, RBs
–Knowing what services a site should be running
    no point raising an alert if the site isn’t meant to run it!
    definition of services and which sites run them (SLA)
What Tools Do We Use
–Job Submission; GridICE; Nagios
How – Database
Developments Planned
–Nagios
–3 Stage Plan over next 12 months

Monitoring Services
There are many frameworks which can be used to monitor distributed environments:
MapCenter, GppMon, GridICE, Nagios, MonALISA
Example: MapCenter, 30 sites ~ 500 lines in config file (static version)
Example: Nagios, 30 sites, 12 individual config files with dependencies
Developed tools to configure these services to make the job easier: Nagios, MapCenter and GppMon
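Generating monitoring configuration from a list of registered hosts, rather than maintaining it by hand, can be sketched as below. The output uses the standard Nagios `define host` object syntax; the input dictionaries, the function names, the example hostnames and the `generic-host` template name are assumptions for illustration, not the GOC's actual generator.

```python
from typing import Dict, Iterable

HOST_TEMPLATE = """define host {{
    use         {template}
    host_name   {hostname}
    alias       {site} {node_type}
    address     {address}
}}
"""


def render_host(node: Dict[str, str], template: str = "generic-host") -> str:
    """Render one Nagios host object from a node record (hypothetical fields)."""
    return HOST_TEMPLATE.format(template=template, **node)


def render_config(nodes: Iterable[Dict[str, str]]) -> str:
    """Concatenate host objects for every node, sorted for stable output."""
    ordered = sorted(nodes, key=lambda n: (n["site"], n["hostname"]))
    return "\n".join(render_host(n) for n in ordered)


if __name__ == "__main__":
    demo = [
        {"site": "SiteA", "node_type": "CE",
         "hostname": "ce.site-a.example", "address": "192.0.2.10"},
        {"site": "SiteA", "node_type": "RB",
         "hostname": "rb.site-a.example", "address": "192.0.2.11"},
    ]
    print(render_config(demo))
```

With 30 sites, regenerating the files from one source of truth avoids editing hundreds of configuration lines by hand.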

GOC Features – GppMon
Status of Grid, based on the success of job submission to resources, displayed as a world map, with sites represented by coloured dots
SQL query of database -> list of resources (CE, RB)
Job submission to each site in two ways:
–Direct to CE = globus-job-run
–Indirect to CE via Resource Brokers = edg-job-submit
Responses collected and translated into a site status colour index
–Success via RB = Green, Globus only = Orange, Fail = Red
Geographical view presented against world map
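The colour index described above reduces to a small decision function over the two submission results. This sketch assumes each result has already been reduced to a boolean; the function name and the treatment of "RB succeeded but direct Globus failed" as green are assumptions, since the slide only lists the three outcomes.

```python
def site_status_colour(rb_ok: bool, globus_ok: bool) -> str:
    """Map the two job-submission results for a site to a map colour.

    rb_ok     -- job submitted via a Resource Broker (edg-job-submit) succeeded
    globus_ok -- job submitted directly to the CE (globus-job-run) succeeded
    """
    if rb_ok:
        return "green"   # success via RB
    if globus_ok:
        return "orange"  # Globus only
    return "red"         # both routes failed


if __name__ == "__main__":
    # Hypothetical results for three sites.
    results = {"SiteA": (True, True), "SiteB": (False, True), "SiteC": (False, False)}
    for site, (rb_ok, globus_ok) in results.items():
        print(site, site_status_colour(rb_ok, globus_ok))
```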

LCG2 CORE SITES Status: 23 March SITES

LCG2 CORE SITES Status: 12th May ~30 SITES

GOC Job Submission Flow Diagram
[Diagram labels: GOC (UI): build list of CE, RB resources; SITE DB: SQL query; create job script RB.CE; edg-job-submit; RB; Other.GlueCEUniqueID; CE; WN; wget; sent acknowledge; received acknowledgement]

GOC Job Submission Flow Diagram
[Diagram labels: GOC (UI): build list of CE, RB resources; SITE DB: SQL query; create job script GLOBUS.CE; globus-job-run; CE; wget; sent acknowledge; received acknowledgement]

GOC Features – Nagios Monitoring
Nagios is a powerful monitoring service that supports notifications and the execution of remote agents to correct problems when faults are discovered.
Advantages => proactively monitor grid (NRPE daemon)
Automatic configuration of Nagios based on database
Developed a set of plugins which focus on service behaviour and data consistency
–Do RBs find resources?
–Do site GIISs publish the correct hostname?
–Is the site running the latest stable software release?
–Does the Gatekeeper authentication service work?
–Are the host certificates valid, e.g. issued by a trusted CA?
–Are essential services running, e.g. GridFTP?
Further plugins are being developed (e.g. certification)
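A Nagios plugin reports its result through its exit status (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN) plus one line of text. The sketch below shows the decision logic a certificate lifetime check might use; it is not one of the GOC plugins, the 30-day and 7-day thresholds are assumed values, and reading the expiry date out of the certificate is left out.

```python
from datetime import datetime, timedelta
from typing import Tuple

# Standard Nagios plugin return codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def check_cert_lifetime(not_after: datetime, now: datetime,
                        warn_days: int = 30, crit_days: int = 7) -> Tuple[int, str]:
    """Classify a host certificate by the time left before it expires."""
    remaining = not_after - now
    if remaining <= timedelta(0):
        return CRITICAL, "CRITICAL - certificate has expired"
    days = remaining.days
    if days < crit_days:
        return CRITICAL, f"CRITICAL - certificate expires in {days} days"
    if days < warn_days:
        return WARNING, f"WARNING - certificate expires in {days} days"
    return OK, f"OK - certificate valid for {days} more days"
```

A real plugin would print the message and call `sys.exit` with the returned code so that Nagios can pick up the state.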

Nagios Screen Shot
Service Summary for Nodes: Certificate Lifetime Check, GridFTP, GRAM Authentication, Site Attributes via GIIS (siteName, Tag, …)
[Screenshot columns: HOST, PLUGIN, STATUS, STATUS INFORMATION]

Nagios Screen Shots LCG-1 Host and Service Summary tables for BDII nodes

GOC Site Database
Develop and maintain a database to hold site information
–Contact lists, nodes, IP, URLs, scheduled maintenance
Each site has its own administration page where access is controlled through the use of X509 certificates (GridSite)
Monitoring scripts read information in the database and run a set of customised tools to monitor the infrastructure
To be included in the monitoring a site must register its resources (CE, SE, RB, RC, RLS, MDS, RGMA, BDII, ...)
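How a monitoring script might read its target list from such a database can be sketched as follows. The GOC database ran on MySQL; this example swaps in an in-memory SQLite database so it is self-contained, and the table layout, column names and hostnames are all hypothetical. Skipping nodes flagged as in scheduled maintenance is also an assumption about how that information would be used.

```python
import sqlite3
from typing import List, Sequence, Tuple


def build_demo_db() -> sqlite3.Connection:
    """Create an in-memory database with a hypothetical 'node' table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE node (site TEXT, hostname TEXT, node_type TEXT, "
        "in_maintenance INTEGER DEFAULT 0)"
    )
    conn.executemany(
        "INSERT INTO node VALUES (?, ?, ?, ?)",
        [
            ("SiteA", "ce.site-a.example", "CE", 0),
            ("SiteA", "rb.site-a.example", "RB", 0),
            ("SiteA", "se.site-a.example", "SE", 0),
            ("SiteB", "ce.site-b.example", "CE", 1),  # in scheduled maintenance
        ],
    )
    return conn


def resources_to_monitor(conn: sqlite3.Connection,
                         node_types: Sequence[str] = ("CE", "RB")
                         ) -> List[Tuple[str, str, str]]:
    """List registered nodes of the given types that are not in maintenance."""
    placeholders = ",".join("?" for _ in node_types)
    query = (
        "SELECT site, hostname, node_type FROM node "
        f"WHERE node_type IN ({placeholders}) AND in_maintenance = 0 "
        "ORDER BY site, hostname"
    )
    return conn.execute(query, tuple(node_types)).fetchall()
```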

GOC GridSite MySQL
Secure database management via HTTPS / X.509
–People, contact information, resources, scheduled maintenance
[Diagram labels: Resource Centre (RC): resources & site information; EDG, LCG-1, LCG-2, …; ce, se, bdii, rb; monitoring; SQL; https; server]

People: Who do we notify when there are problems EXAMPLE: RAL Site

Node Information (Type, Hostname, IP Address, Group) EXAMPLE: RAL Site

Fault Diagnosis
Monitoring is currently checked every day
–And a report sent to LCG-ROLLOUT mail-list
Further diagnosis done by GOC on problem sites by additional tools
–and possible causes suggested
Additional monitoring developed in response to new problems
–E.g. certificate lifetimes

LCG1 CERT Status: 27 Feb 2004

Distributing GOC Software
Packaging Monitoring Tools
–Provide ROCs with a standard set of tools to proactively monitor resources
–2nd prototype GOC established in Taipei (GMT+8 hours)
[Diagram labels: GOC centres CLRC, TW; GOC GridSite MySQL; tools; site config; remote query to collect a list of resources; local query if service not available; monitor resources via job submission]

Monitoring Developments
–Provide ROCs with a package to monitor the resources in the region
    Tailored monitoring
    ROCs can upload their own maps
    GUI to automate site locations on the map
–Hierarchical view of resources
    Example: GridPP federated into 4 virtual T2 centres
[Diagram labels: EGEE; France; UK/I; S.E.E; GridPP; LondonT2; ScotGrid; IMPERIAL; QMUL; Edinburgh]

LCG Accounting Overview
We have an accounting solution. The accounting is provided by R-GMA.
At each site, log-file data is processed from different sources and published into a local database.
[Diagram labels: CE: PBS/LSF Jobmanager log; GateKeeper: listens on port 2119, GRAM authentication; GIIS: LDAP information server; MON: R-GMA database]

LCG Accounting – How it Works
GOC provides an interface to produce accounting plots “on-demand”
–Total number of jobs per VO per site (ok)
–Total number of jobs per VO aggregated over all sites (to be done)
Tailor plots according to the requirements of the user community
[Plot: Taipei statistics Feb/Mar, ~ 1000 Alice jobs]
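The two aggregations named above differ only in the grouping key. The sketch below works on plain `(site, vo, jobs)` tuples; that record shape and the function names are assumptions for illustration, and the real accounting data is published through R-GMA rather than held in Python lists.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple

Record = Tuple[str, str, int]  # (site, VO, number of jobs)


def jobs_per_vo_per_site(records: Iterable[Record]) -> Dict[Tuple[str, str], int]:
    """Total number of jobs for each (site, VO) pair."""
    totals: Counter = Counter()
    for site, vo, jobs in records:
        totals[(site, vo)] += jobs
    return dict(totals)


def jobs_per_vo(records: Iterable[Record]) -> Dict[str, int]:
    """Total number of jobs for each VO, aggregated over all sites."""
    totals: Counter = Counter()
    for _site, vo, jobs in records:
        totals[vo] += jobs
    return dict(totals)


if __name__ == "__main__":
    # Made-up figures, not the statistics shown on the slides.
    demo = [("SiteA", "alice", 10), ("SiteA", "alice", 5),
            ("SiteB", "alice", 7), ("SiteA", "atlas", 3)]
    print(jobs_per_vo_per_site(demo))
    print(jobs_per_vo(demo))
```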

LCG Accounting
[Plot: CNAF statistics March, ~ 10,000 Alice jobs]
[Plot: RAL statistics March, ~ 6,300 Alice jobs]

EGEE - Consortia
10 Consortia (incl. GEANT/TERENA/DANTE)
≈ 70 Partners
UK e-Science: PPARC + Core Programme
USA
Enabling Grids for E-science for Europe Everyone

EGEE – SA ROCs, 4 CICs cf 3 worldwide in LCG RAL proposes to extend LCG GOC monitoring to ROCs

EGEE Stage 1
–RAL runs monitoring
–All RCs added to database through their ROC, i.e. the ROC takes responsibility for adding and checking information / data consistency in the database.
–Provide tailored maps (example GridPP)
–Each ROC will monitor its sites and regional services through the GOC monitoring at RAL
–Timescale ~ 3-6 months

EGEE Stage 2
–Distribution of GOC s/w to allow ROCs to run their own monitoring, i.e. they run the monitoring tools themselves!
–Centralised database based at RAL, but ROCs configure their monitoring from the centralised database
–Further monitoring development required before completion of this stage.
–[Nagios not finished; other outstanding things, e.g. packaging and documentation; CVS: do we continue to use the LCG CVS repository?]
–Timescale ~ 6 – 12 months

EGEE Stage 3
–Distribute database amongst the ROCs
–A large distributed database instead of a single database
–Distributed database hops to monitor core services
–Timescale ~12 months and beyond

Summary
A Grid Operations Centre involves many roles
–Security, agreements, monitoring, accounting, support
RAL has tackled all of these to different degrees
–Still developing
Share work with other grids
–NGS, EGEE
Biggest problem is problem and issue tracking