Technical workshop: Grid and ROC operations; Planning
The SEE-GRID initiative is co-funded by the European Commission under the FP6 Research Infrastructures contract no. 002356.

Workshop program
DEPLOYMENT:
- Deployment, installation and certification process - Jozsef Patvarczki, 30 minutes
- Deployment strategy, operational procedures, organization - Ognjen Prnjat, 30 minutes
- Discussion and planning - Ognjen Prnjat, 30 minutes
RUNTIME OPERATIONS:
- Monitoring - Min Tsai, 30 minutes
- Helpdesk: EGEE state of the art - Alex Tudose, 15 minutes

Deployment, installation and certification process
Jozsef Patvarczki, SZTAKI

Organization, deployment strategy, operational procedures
Ognjen Prnjat, GRNET

Overall organization

Specific responsibilities
- Middleware installation, deployment and site certification activities: SZTAKI
- Helpdesk and RC list: ICI
- Monitoring: UKIM (TBC)
- Regional application integration and deployment, plus user support: UoB and TUBITAK
- EGEE application deployment coordination: IPP

Site responsibilities
Coordinated by the GIM in each country:
- Perform full cluster administration (fabric, OS, middleware).
- Port the m/w to the target platform if necessary (depending on the reference development platform from EGEE).
- Perform m/w certification, and customisations if necessary (add region-specific VOs, define service configurations for local VOs, define local configurations at the node).
- Perform m/w deployment and upgrades as often as necessary.
- Carry out site certification in collaboration with SZTAKI and CERN.
- Provide relevant documentation for the site.
- Provide front-line support for operational problems on the cluster, and for local and remote users of the cluster (respond to local trouble tickets related to the node within reasonable timelines).
- Work towards establishing automated procedures for daily checks and notification systems in case of failure (a minimal sketch follows this list).
- Participate in developing and running a coherent Trouble Ticket (TT) and knowledge base infrastructure.
- Support the local monitoring service.
- Monitor resource utilization and SLAs, and provide the necessary monitoring, accounting and SLA-compliance statistics for deliverables and other purposes on a regular basis.
- Keep detailed logs of all interventions on the site.
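
As an illustration of the kind of automated daily check with failure notification a site might run, here is a minimal Python sketch. The hostnames, mail addresses and the exact list of probed services are placeholders, not anything prescribed by the project; the ports are the usual LCG-2 defaults.

```python
#!/usr/bin/env python
"""Minimal daily site check: probe key service ports and mail the admin
on failure. Hostnames and the SMTP relay are placeholders."""
import socket
import smtplib
from email.mime.text import MIMEText

# Services a typical LCG-2 site might watch (hypothetical hostnames).
CHECKS = [
    ("ce.example.org", 2119),   # Globus gatekeeper
    ("ce.example.org", 2135),   # MDS GRIS / site information system
    ("se.example.org", 2811),   # GridFTP on the storage element
]

def port_open(host, port, timeout=10):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        socket.create_connection((host, port), timeout).close()
        return True
    except OSError:
        return False

failures = ["%s:%d unreachable" % (h, p) for h, p in CHECKS if not port_open(h, p)]

if failures:
    msg = MIMEText("\n".join(failures))
    msg["Subject"] = "[site-check] %d service(s) down" % len(failures)
    msg["From"] = "grid-check@example.org"
    msg["To"] = "grid-admin@example.org"
    smtplib.SMTP("smtp.example.org").sendmail(
        msg["From"], [msg["To"]], msg.as_string())
```

Run from cron each morning, this covers the "daily checks and notification" responsibility in its simplest form; real deployments would add functional tests on top of plain reachability.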

Deployment, installation and certification
- First phase: coordination by SZTAKI, as presented before
- When production level is reached: EGEE-SEE ROC

Core services
- RB (+LBS) + BDII: workload management, logging and bookkeeping, and the information system, per VO
- MyProxy: extends the effective life of a job by renewing its short-lived proxy from a stored credential (see the sketch below)
- VO server / VOMS: VO management; the VO manager handles administration
- RLS: file and metadata catalogue service
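
To show how these services fit together from the user side, a sketch driving the LCG-2-era command-line clients follows. This assumes the standard tools of the time (voms-proxy-init, myproxy-init, edg-job-submit); the server name, VO name and JDL file are placeholders.

```python
#!/usr/bin/env python
"""Sketch of the user-side flow against the core services, driving the
LCG-2-era command-line clients. Server name, VO and JDL are placeholders."""
import subprocess

def run(cmd):
    """Run a command, echoing it first; raise on failure."""
    print("+ " + " ".join(cmd))
    subprocess.check_call(cmd)

# 1. Get a VOMS proxy carrying SEE-GRID VO membership (VOMS server).
run(["voms-proxy-init", "--voms", "seegrid"])

# 2. Deposit a longer-lived credential in MyProxy so the broker can
#    renew the short-lived job proxy (-d: use the cert subject as username).
run(["myproxy-init", "-s", "myproxy.example.org", "-d"])

# 3. Submit a job through the Resource Broker, which matches the JDL
#    requirements against resources published in the BDII.
run(["edg-job-submit", "--vo", "seegrid", "hello.jdl"])
```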

Deployment strategy (D2.2)
- Roll out primary sites into EGEE
- Deploy core services: RB, BDII, MyProxy
- Deploy centralized services: VOMS, RLS
- Monitoring and CIC (out of scope)

Regional applications
Phases of deployment:
- 1st: only on local clusters, testing job submission through the local UI and CE (see the sketch below)
- 2nd: through EGEE core services
- 3rd: through own core services
Middleware adaptation / customisation:
- Regional applications might require some m/w customizations and specific configurations at sites
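
For the first phase, the classic smoke test is to push a trivial job straight at the local gatekeeper, bypassing the broker entirely. A minimal sketch, assuming the Globus globus-job-run client is installed on the UI; the CE contact string is a placeholder.

```python
#!/usr/bin/env python
"""Phase-1 smoke test: submit a trivial job to the local CE through the
Globus gatekeeper. The CE hostname and jobmanager are placeholders."""
import subprocess

CE_CONTACT = "ce.example.org:2119/jobmanager-pbs"  # hypothetical contact string

# globus-job-run blocks until the job finishes and prints its stdout;
# a returned hostname proves UI -> gatekeeper -> batch system -> WN works.
out = subprocess.check_output(
    ["globus-job-run", CE_CONTACT, "/bin/hostname"])
print("job ran on worker node: %s" % out.decode().strip())
```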

EGEE applications
- HEP, BioMed
- A number of experiments are available within each field; only a subset can be supported
- Some experiments require specific configurations (e.g. MPI availability)
- A scavenger Grid is not suitable for HEP

Runtime operations
- Repository of RCs: SZTAKI runs this currently; the RO should maintain it
- Monitoring: Min Tsai to present; UKIM to support (TBC)
- Operational/user support (helpdesk + TTS): RO; the relationship with EGEE must be defined
- Security: presentation by Auth later

Discussion
Ognjen Prnjat, GRNET

Discussion: goals
- Consensus on the D2.2 deployment strategy
- EGEE support still unclear: NA4Test vs. GILDA
- Core services: timeline, responsibilities and teams
- Cluster support for applications; initial timelines

Deployment strategy
- Roll out primary sites into EGEE: support of all core services by EGEE? Which ones? GILDA?
- Deploy core services (RB, BDII, MyProxy): rely on EGEE centralized services? Which ones?
- Deploy centralized services: VOMS, RLS
- Monitoring and CIC (out of scope)

Core services: SEEGRID VO
Core services needed:
- VO server / VOMS [Greece - Auth]: is one VOMS enough to support a number of VOs? Yes.
- MyProxy [Auth, running]: is one MyProxy enough to support a number of VOs? Yes.
- RB (+LBS) + BDII: who will run these?
- RLS: is it needed?
- VO manager (administration): IPP?
If the first test-job submission happens by May 2005, that is very good.
Coordination: SZTAKI

Regional applications
Phases of deployment:
- 1st: only on local clusters, testing job submission through the local UI and CE
- 2nd: through EGEE core services
- 3rd: through own core services
Middleware adaptation / customisation:
- Integrating the regional applications with the EGEE m/w: are any changes to the m/w needed? A customised version of the m/w? [UoB, TUBITAK]
- Dependency on OS / m/w? Maybe SL clusters should be deployed straight away?
- Can, and should, the UoB application be part of the EGEE BioMed VO?

EGEE applications
- HEP, BioMed
- Which experiments to support, on which clusters (based on requirements), and when?
- BioMed: RB institute?
- LHC: volunteers?

Status of resources

Country               CPUs   Storage (TB)
Bulgaria              ?
Romania
Turkey
Hungary
Albania
Bosnia-Herzegovina
FYRoM
Serbia-Montenegro
Croatia

Countries: 10   CPUs: ?   Storage: ?

EGEE/LCG Monitoring, Role of GOC, and SEE-GRID strategy
Min Tsai, CERN

EGEE/LCG Monitoring
- A look at the existing monitoring tools being used in LCG
- Grid Operations Centre

GOC Configuration Database (monitoring application slides from D. Kant)
- Secure database management via HTTPS / X.509 (GridSite front end to a MySQL server); a client-side sketch follows
- Stores a subset of the Grid Information System: people, contact information, resources
- Drives: scheduled maintenance, monitoring services, operations maps, configuration of other tools, organisation structures, secure services (site news, self-certification, accounting)
- The GOC DB can also contain information that is not present in the IS, such as scheduled maintenance, news, organisational structures, and geographic coordinates for maps
(Architecture diagram: Resource Centre resources and site information - EDG, LCG-1, LCG-2, ..., bdii, ce, se, rb - feed the GOC MySQL server over SQL/HTTPS)
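
To illustrate what "secure database management via HTTPS / X.509" means from the client side, a minimal sketch follows. The endpoint URL is hypothetical (the real GOC DB is browsed through GridSite); the certificate paths follow the usual Globus layout.

```python
#!/usr/bin/env python
"""Sketch of an HTTPS request authenticated with an X.509 client
certificate, the access model the GOC database uses. The URL and the
file paths are placeholders."""
import requests

resp = requests.get(
    "https://goc.example.org/db/sites",          # hypothetical endpoint
    cert=("/home/user/.globus/usercert.pem",     # client certificate
          "/home/user/.globus/userkey.pem"),     # matching private key
    verify="/etc/grid-security/certificates",    # CA directory to trust
)
resp.raise_for_status()
print(resp.text)
```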

EGEE/LCG Monitoring (from D. Kant)
Ganglia Monitoring - http://gridpp.ac.uk/ganglia
- Ganglia is a scalable distributed monitoring system for clusters and grids, using RRDtool for data storage and visualisation (a minimal RRD sketch follows)
- Example: the RAL Tier-1 Centre LCG PBS server display shows job status for each VO
- It is relatively easy to install, and you get a lot for little effort
- One of its strengths is that it can federate clusters together
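
Since Ganglia's storage layer is RRDtool, the round-robin idea can be shown directly. A minimal sketch using the rrdtool CLI; the metric name, step and archive size are arbitrary choices, not Ganglia's actual schema.

```python
#!/usr/bin/env python
"""Minimal round-robin database of the kind Ganglia keeps per metric,
driven via the rrdtool CLI. Metric name, step and sizes are arbitrary."""
import subprocess
import time

# One GAUGE data source sampled every 300 s; keep 600 averaged points.
subprocess.check_call([
    "rrdtool", "create", "load.rrd", "--step", "300",
    "DS:load:GAUGE:600:0:U",
    "RRA:AVERAGE:0.5:1:600",
])

# Store one sample (timestamp:value); old data is overwritten in place,
# which is what keeps the database a fixed size forever.
subprocess.check_call(
    ["rrdtool", "update", "load.rrd", "%d:0.42" % int(time.time())])
```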

Ganglia Monitoring - Federated Cluster Information
- Ganglia can also monitor clusters of clusters: separate and distinct clusters federated together
- It provides a wealth of information, much of it low level, that can be useful for operations
- Ganglia/R-GMA integration through Ranglia

GridICE - Architecture
A different kind of monitoring tool: processes, low-level metrics, grid metrics. Developed by the INFN-GRID team - http://infnforge.cnaf.infn.it/gridice
Unlike GPPMON, which runs simple functional tests, GridICE monitors services in several ways:
- A measurement service uses monitoring sensor agents to probe the "core processes" belonging to a service, plus other low-level metrics such as memory and CPU
- A publisher service collects this information in a local database (fmonServer) at the site
- A discovery service finds resources and harvests the data into a central database (PostgreSQL)
- Finally, a publication service provides a portal to the monitoring data, which can be aggregated in different ways

GridICE - Global View
- Different views of the data: site / VO / geographic
- Resource usage: CPU count, load, storage, job information; list of sites
- The web interface gives an overall global view of grid resources: a list of participating sites and a description of resource usage, such as total CPU and storage available
- GridICE uses Nagios to schedule updates of its central monitoring repository, so the information shown is reasonably up to date

GridICE - Job Monitoring
- Version 1.6.3, recently deployed on LCG, adds job monitoring: the current status of queued, running and finished jobs, organised in different ways (per site, per VO, etc.)
- XML views of the data

Gstat (GIIS Monitor)
http://goc.grid.sinica.edu.tw/gstat/
- A tool to display and check the information published by the site GIIS (BDII updates, IS sanity, R-GMA, core service checks, usage statistics)
- Developed by the GOC in Taipei to monitor the grid information system
- Primary goals: detect faults, perform sanity checks, and display useful data
- Provides an overview of the current grid status, with drill-down for more detail (the sketch below shows the kind of query involved)
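
The information system being checked here is an LDAP service, so the kind of query Gstat issues can be sketched with python-ldap. The BDII hostname is a placeholder; the port, base DN and GLUE attribute names are the usual LCG-2 conventions, cited from memory.

```python
#!/usr/bin/env python
"""Sketch of an information-system query of the kind Gstat performs:
an anonymous LDAP search against a BDII. Hostname is a placeholder."""
import ldap  # python-ldap

con = ldap.initialize("ldap://bdii.example.org:2170")  # standard BDII port
con.simple_bind_s()  # the BDII allows anonymous binds

# GLUE schema: every computing element publishes a GlueCE object.
results = con.search_s(
    "mds-vo-name=local,o=grid",          # usual LCG-2 BDII base DN
    ldap.SCOPE_SUBTREE,
    "(objectClass=GlueCE)",
    ["GlueCEUniqueID", "GlueCEStateFreeCPUs"],
)

for dn, attrs in results:
    print(attrs.get("GlueCEUniqueID"), attrs.get("GlueCEStateFreeCPUs"))
```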

Real Time Grid Monitor
http://www.hep.ph.ic.ac.uk/e-science/projects/demo/index.html
- A visualisation tool, developed by GridPP at Imperial College, to track jobs currently running on the grid
- An applet queries the RB logging and bookkeeping (L&B) service for information about grid jobs; because the L&B service is continually updated, the tool shows jobs flowing from the RB to a site for processing, and back to the source once completed
- Helps answer: why are jobs failing? Why are jobs queued at some sites while others are empty?
- Useful for getting a global picture of trends and quickly spotting problems such as job pile-up; it also helps publicise the grid to non-experts at conferences
- [The applet queries files no older than 6 hours, so long-running jobs don't show up]

GPPMON - Job Submission Tests
- Displays the results of tests against sites; developed by the GridPP collaboration
- Test: job submission. The job is a simple test of the grid middleware components, e.g. the gatekeeper service, the RB service, and the Information System via JDL requirements
- A map shows each test result as a coloured dot: a job request is sent to a site through a resource broker; if the job executes successfully the site is marked green, if it fails the site is marked red
- The maps can be tailored for different communities, e.g. a grid community identified by a list of sites in a BDII configuration database, such as the one shown here for LCG
- These lightweight tests of the functional behaviour of the core services ("do simple jobs run?") run hourly, but they have limitations, e.g. the Dteam VO and WN reach (specialised monitoring queues)

Site Functional Tests
- A set of many (~20) tests that run on a WN and check the essential functionality of a site:
  - general WN configuration: RPM versions, environment, CSH, BrokerInfo...
  - grid tools and services: Replica Manager (local and remote SE involved), lcg-utils, R-GMA
- New tests are added promptly when new types of problem are detected or reported
- Executed every morning - now 6am for scavenger grids
- Relatively easy to install and configure (a minimal sketch of one such check follows)
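
As a flavour of what one SFT-style check looks like on a worker node, here is a minimal sketch that verifies the grid environment a job relies on. The variable list is illustrative, drawn from common LCG-2 conventions, and is not the actual SFT test set.

```python
#!/usr/bin/env python
"""Sketch of one SFT-style check run on a worker node: verify that the
grid environment variables a job relies on are present. The variable
list is illustrative, not the actual SFT test set."""
import os
import sys

# Variables an LCG-2 worker node environment was commonly expected to
# define (illustrative subset; VO_SEEGRID_SW_DIR follows the VO_<VO>_SW_DIR
# convention and is hypothetical here).
REQUIRED = ["X509_USER_PROXY", "EDG_WL_JOBID", "VO_SEEGRID_SW_DIR"]

missing = [v for v in REQUIRED if v not in os.environ]

if missing:
    print("FAIL: missing environment variables: %s" % ", ".join(missing))
    sys.exit(1)
print("OK: WN environment looks sane")
```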

Certification Test Results
http://lcg-testzone-reports.web.cern.ch/lcg-testzone-reports/cgi-bin/listreports.cgi
- Test results are shown on a web page: a large matrix of data where each row identifies a site and its corresponding test results
- One difficulty of having so much information is that it can be hard to find the information you need
- It is also quite detailed: most links allow you to drill down and examine the debug information - a useful tool for the expert

Problems with Monitoring Tools
- Inconsistent site configuration sources: the monitoring tools don't have the same coverage, which makes correlation more difficult
- Difficult to correlate test results: searching through many web pages is time consuming, especially with 90 sites!
- Need a view with only alerts (see the sketch below); operators don't need to be flooded with information
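
The "alerts only" view amounts to collapsing the full status matrix down to the failing entries. A minimal sketch of the idea; the rows and field names are invented for illustration, not taken from any real monitoring feed.

```python
#!/usr/bin/env python
"""Sketch of an 'alerts only' view: collapse a full status table down
to the entries that need attention. All data here is made up."""

# (site, test, status) rows as a unified monitoring system might publish
# them; contents are invented for illustration.
results = [
    ("site-a.example.org", "job-submission", "OK"),
    ("site-b.example.org", "replica-manager", "FAIL"),
    ("site-c.example.org", "rgma", "WARN"),
]

# Keep only what needs attention, instead of showing all ~90 sites.
alerts = [r for r in results if r[2] != "OK"]

for site, test, status in alerts:
    print("%-22s %-18s %s" % (site, test, status))
```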

Unified Monitoring System (in progress)
http://goc.grid.sinica.edu.tw/gocwiki/RgmaUnifiedMonitoringSystem
- Site configurations are consistent: taken only from the GOC DB
- Data is sent over a single data transport (R-GMA), in a shared data format
- A single console displays all data; a single alarm system monitors all data
R-GMA issues:
- Reliability: the registry is a single point of failure
- Complex queries are not available
- An older, less-supported version is in production

Other Problems
- Tests can't readily be performed on demand to verify whether a problem still exists; some tests run only once a day. Solution: allow on-demand testing through a secure interface
- Not enough help on error messages: what are the possible causes and solutions? Build a knowledge base - http://goc.grid.sinica.edu.tw/gocwiki/SiteProblemsFollowUpFaq

Role of the Grid Operations Centre (CIC procedure)
Current CIC operation procedures: http://cic.in2p3.fr
- Problem detection: the Site Functional Tests provide the most detailed fault detection; other tools are also used and monitored periodically
- Diagnosis: check the detailed reports on the monitoring tools, refer to the wiki knowledge base, and collect information to help the site admin troubleshoot the problem
- Problem tracking: Savannah is used; new solutions go back into the wiki

Role of the Grid Operations Centre (CIC procedure), continued
Escalation:
1. Mail to the site admin and the ROC
2. Second mail to the ROC
3. Phone call to the ROC
4. Reported to SA1 management
Deadlines are typically 3 days per step (1 day for large sites); the timetable is sketched below.
The CIC also runs core services: RB, BDII, MyProxy, VOMS, RLS, and monitoring.
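
The escalation timetable is simple enough to compute mechanically from the date a problem is first reported. A small sketch; the step names paraphrase the slide, and the 3-day/1-day deadlines are taken from it.

```python
#!/usr/bin/env python
"""Sketch of the CIC escalation timetable: compute when each step is
due from the date a problem is first reported."""
from datetime import date, timedelta

STEPS = [
    "mail to site admin and ROC",
    "second mail to ROC",
    "phone to ROC",
    "report to SA1 management",
]

def escalation_schedule(opened, large_site=False):
    """Yield (step, due date); each step gets 3 days, or 1 for large sites."""
    step_len = timedelta(days=1 if large_site else 3)
    due = opened
    for step in STEPS:
        due += step_len
        yield step, due

for step, due in escalation_schedule(date(2004, 12, 7)):
    print("%s -> due %s" % (step, due.isoformat()))
```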

Monitoring Strategy
Phase I:
- Register as EGEE Resource Centres; sites are automatically incorporated into the monitoring systems: TZtest page, GPPMON (regional view), Gstat / GIIS Monitor (regional view), GridICE, accounting
- Gain operational experience with these tools
Phase II:
- Decide what is lacking at that point (e.g. service-level reports)
- Develop, or contribute to, other projects

EGEE Helpdesk and operational support procedures; SEE-GRID strategy
Alex and Alex, ICI