EGEE Operation Tools and Procedures
M. Jouvin (LAL-Orsay), jouvin@lal.in2p3.fr
Grid Administration Training, LAL, Orsay, 15-19 September 2008
Agenda
- Regional Operation Centers
- Site registration
- Site monitoring
- Grid Support
- Accounting
- Site downtimes
- Activity Visualization
- Mailing List
- Useful Links
Grid Operation Tools and Procedures - M. Jouvin 27/04/2019
Regional Operation Centers
- EGEE is organized in 12 regions, each with a Regional Operation Center (ROC)
  - A region is one country or a group of countries
  - No African ROC so far…
- ROCs are part of EGEE SA1 (EGEE operations)
- A ROC is in charge of coordinating operations in its region
  - Acts as the EGEE contact for the sites
  - Helps and validates new sites
  - Follows up site problems in the region
  - Coordinates the sites in the region
  - Participates in global support
- EGEE-funded people in each ROC
Site Registration
- To be published in the EGEE production grid, a site must be registered in a central database, GOCDB: https://goc.gridops.org/site
  - A new site must pass validation tests before it is seen as a production site
  - After validation, the site appears in the production BDII and is used automatically
- A site has registered administrators
  - Site admins are identified by their DN, which requires a grid certificate
- A site belongs to a Regional Operation Center (ROC)
  - Senegal: French ROC
  - South Africa: Italian ROC
Site Monitoring…
- SAM: Site Availability Monitoring
  - Set of tests run every hour at each site to check site health
    - Run under the ops VO
    - Other VOs can run their own SAM tests under their VO credentials
  - Several groups of tests: CE, SE, BDII… CE is the main one
    - Several tests per group
  - Test results are published on the SAM portal: https://lcg-sam.cern.ch:8443/sam/sam.py
    - Results are organized by ROC
    - Site status is OK (green), WARNING (orange) or ERROR (red)
    - Some tests are considered critical: when one fails, the site is no longer used by the WMS until the problem is fixed
    - A VO can publish its own view of site status
- Gstat: monitors site performance, in particular the BDII: http://goc.grid.sinica.edu.tw/gstat
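The status colours and the role of critical tests can be illustrated with a small sketch. This is not how SAM is implemented: the rule that a site takes the worst status among its tests is an assumption made for illustration, and the test names used in the usage note are hypothetical.

```python
from enum import IntEnum


class Status(IntEnum):
    """Test or site status, ordered from best to worst."""
    OK = 0
    WARNING = 1
    ERROR = 2


# Colours used for each status on the portal, as listed on the slide.
COLOURS = {Status.OK: "green", Status.WARNING: "orange", Status.ERROR: "red"}


def site_status(results):
    """Return the worst status among all test results.

    `results` maps a test name to a Status. Taking the worst value is an
    assumption of this sketch, not a documented SAM rule.
    """
    return max(results.values(), default=Status.OK)


def usable_by_wms(results, critical_tests):
    """A site stays usable only while no critical test is in ERROR."""
    return all(
        results.get(name, Status.OK) != Status.ERROR
        for name in critical_tests
    )
```

For example, with `results = {"job-submit": Status.ERROR, "ca-version": Status.OK}` and `critical_tests = ["job-submit"]` (both names hypothetical), `site_status(results)` is `Status.ERROR` and `usable_by_wms(results, critical_tests)` is `False`.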
… Site Monitoring
- CIC Portal: a grid operation portal, http://www.gridops.org/
  - Can run SAM tests on demand (SAMAP): https://cic.gridops.org/index.php?section=roc&page=samadmin
    - Requires being a site administrator
    - Requires the site to be in production
- Gridmap: http://gridmap.cern.ch
  - Treemap view of the grid
  - Displays site status
  - Gives access to the SAM page for a site or resource
Grid Support
- GGUS: Global Grid User Support, https://ggus.org
  - Single entry point for grid support, for both users and administrators
  - Used to report problems with site behaviour as well as middleware (MW) bugs
  - Tickets are processed by dedicated people (TPM) and assigned to ROCs and sites
  - Problem resolution is followed up by the ROC
- Important to report problems there: it allows traceability and global follow-up
- Important to answer and close tickets
Accounting
- Accounting is a critical service in a grid
  - Users and VO managers need to know the real share of the resources they can use
  - Funding bodies want to know whether their “users” get the expected share
  - Sites need to know who is using their resources
- Based on a publisher run at each site plus one portal
  - Portal: http://www3.egee.cesga.es/gridsite/accounting/CESGA/egee_view.php
  - The publisher runs on a grid machine called a MonBox and publishes into a central grid database
  - The portal has site-oriented and user-oriented views
  - Some (detailed) views require an appropriate role, others are public
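To make the notion of "share" concrete, here is a minimal sketch that computes the fraction of CPU time consumed by each VO from a list of usage records. The record layout (VO name, CPU hours) is a hypothetical simplification and does not reflect the actual format published by the MonBox or shown on the portal.

```python
from collections import defaultdict


def cpu_share_by_vo(records):
    """Return the fraction of total CPU hours used by each VO.

    `records` is an iterable of (vo_name, cpu_hours) pairs; this layout
    is an assumption made for the example.
    """
    totals = defaultdict(float)
    for vo, cpu_hours in records:
        totals[vo] += cpu_hours

    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}  # nothing was used, so no share can be computed
    return {vo: hours / grand_total for vo, hours in totals.items()}
```

A site-oriented view would apply this to the records of one site, while a VO-oriented view would group the same records by site instead of by VO.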
Site Downtimes
- Site downtimes must be declared in GOCDB
  - Lets users know the site is unavailable
  - Prevents “central operators” from raising alarms
  - Avoids SAM test failures: tests are still run but errors are “ignored”
  - Site status is MAINT(enance): the site is not used by the WMS
- A downtime can be either “scheduled” or “unscheduled”
- Severity can be “outage” (most often) or “at risk”, the latter to declare an important configuration change which should not affect availability
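The effect of a declared downtime on site status can be sketched as follows. The `Downtime` class and both functions are hypothetical and are not part of GOCDB; the sketch also assumes that only an "outage" downtime switches the site to MAINT, since an "at risk" downtime is not supposed to affect availability.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Downtime:
    """A declared downtime (hypothetical model, not the GOCDB schema)."""
    start: datetime
    end: datetime
    severity: str           # "outage" or "at risk"
    scheduled: bool = True  # False for an unscheduled downtime


def active_downtime(downtimes, now):
    """Return the first downtime covering `now`, or None."""
    for downtime in downtimes:
        if downtime.start <= now < downtime.end:
            return downtime
    return None


def effective_status(sam_status, downtimes, now):
    """Status shown for the site once declared downtimes are applied.

    Assumption of this sketch: only an "outage" hides the SAM status
    behind MAINT.
    """
    downtime = active_downtime(downtimes, now)
    if downtime is not None and downtime.severity == "outage":
        return "MAINT"
    return sam_status
```

With an outage declared from 08:00 to 12:00, a SAM status of "ERROR" observed at 09:00 is reported as "MAINT", while the same error at 13:00 is reported as "ERROR".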
Alarms
- The CIC Portal raises an alarm when there is a SAM error for a resource without a declared downtime
- Site administrators can subscribe to alarms on the CIC Portal: http://www.gridops.org
  - Subscription is per site/resource
  - Alarms are sent by email
  - Each alarm contains a reference to the relevant page on the SAM portal
- If the problem persists a few hours after the alarm is sent, the ROC should open a GGUS ticket and assign it to the site
  - Not done automatically but by the people on duty: 24x7 coverage, 2 persons in charge
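The alarm rule and the follow-up by the ROC can be written down as two small checks. This is only a sketch: the function names are hypothetical, and the 3-hour delay is an arbitrary stand-in for "a few hours". Since tickets are opened by people rather than automatically, the second function is best read as a reminder for the person on duty.

```python
from datetime import datetime, timedelta

# Assumption: the slide only says "a few hours"; 3 is an arbitrary choice.
ESCALATION_DELAY = timedelta(hours=3)


def should_raise_alarm(sam_status, in_declared_downtime):
    """An alarm is raised for a SAM error outside any declared downtime."""
    return sam_status == "ERROR" and not in_declared_downtime


def needs_ggus_ticket(alarm_sent_at, now, still_failing,
                      delay=ESCALATION_DELAY):
    """True when the problem has outlived the alarm by at least `delay`."""
    return still_failing and (now - alarm_sent_at) >= delay
```

For an alarm sent at 08:00, `needs_ggus_ticket` is `False` at 09:00 and `True` at 12:00 if the test is still failing.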
Activity Visualization
- LCG RTM: a “propaganda” tool at the beginning, but also a nice way to look at CE/WMS activity
  - Displays grid activity on top of GoogleEarth
  - Clicking a site gives access to the jobs on its CE and WMS
  - 3D application in Java: some problems on Vista and MacOS
- GridIce: http://gridice2.cnaf.infn.it:50080/gridice/site/
  - An old application to look at queue statistics per site: running jobs, waiting jobs, free CPUs…
  - Unfortunately not well maintained and often broken
- GridView: grid activity history, http://gridview.cern.ch
  - Jobs
  - Data transfers
Mailing Lists
- LCG ROLLOUT: email list for discussion between site administrators
  - Don’t use it as an alternative to GGUS
  - To subscribe: http://www.jiscmail.ac.uk/cgi-bin/webadmin?REPORT=&t=1&s=0&X=&Y=&z=3&9=O&n=lcg-rollout&a=maybe
  - Pretty high volume, but worth having at least one person per site subscribed
  - Many grid experts on the list
- EGEE BROADCAST: email list used to broadcast important information related to EGEE operations
  - All site administrators declared in GOCDB are automatically subscribed
  - Mainly used to broadcast downtimes and the availability of new gLite updates
  - Medium volume (depends on the number of downtimes!)
Useful Links…
- EGEE Operational Procedures: https://twiki.cern.ch/twiki/bin/view/EGEE/EGEEROperationalProcedures