Grid Monitoring and Operations
SAM Development Team, CERN IT/GD
Tier2 Admin Workshop, 03 Dec. 2006, Mumbai

Grid Operations, Tier2 Admin Workshop, 03 Dec. 2006, Mumbai

Outline
Monitoring and operational tools
  – SAM
      – framework
      – sensors
      – availability metrics
  – FCR
  – gstat, GOCDB, SAM Admin Portal, COD Dashboard
Grid Operations (COD)

Monitoring tools
Service Availability Monitoring (SAM)

SAM -- Overview
Grid service-level monitoring framework
successor of SFT (Site Functional Tests)
used in Grid Operations
basis for Availability Metrics
VO-based submissions
  – VO-specific tests
services tested currently: CE, gCE, SE, RB, sBDII, BDII, FTS, LFC
JobWrapper tests

Central SAM submissions
Official CERN submissions
  – Production and Certified sites
      – ops (+ dteam) VO
      – jobs submitted every hour
      – basis of COD alarms
  – PPS
      – ops VO
      – hourly
  – SAM Admin Portal
      – ops VO
      – on-demand
      – Certified + Uncertified sites

VO-specific test submission
LHCb
  – successfully migrated to SAM (only CE, gCE)
  – VO-specific test (DIRAC installation)
ATLAS
  – all sensors
  – submitted from SAM UI
CMS
  – set up, but no regular submission yet

SAM Portal

SAM Internals
framework structure
  – client
      – submission framework (developed by CERN team)
      – sensors (developed by different contributors + CERN team)
      – tests: plug-in modules
  – server
      – web services
      – portal
      – Oracle DB accessed by web services
      – static (GOCDB) + dynamic (BDIIs) info

Sensors – CE, gCE, SE, SRM
CE, gCE
  – job submission: UI → RB → CE → WN chain
  – CA certificates (on WN)
  – software middleware version (WN)
  – replica management
      – lcg-utils
      – default SE + 3rd-party replication
  – R-GMA, APEL, etc.
SE, SRM
  – UI ↔ SE/SRM
      – lcg-utils (LFC)

Sensors – LFC, FTS
LFC
  – lfc-ls + create file in /grid/
FTS
  – BDII entry check
  – listing channels
      – glite-transfer-channel-list (ChannelManagement service)
  – transfer test (in development)
      – submitting transfer jobs between SRMs in all Tier0 and Tier1 sites (N-N testing)
      – checking the status of jobs
Note: the test relies on the availability of the SRMs at the sites.
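The N-N transfer test described above can be illustrated with a minimal sketch. It assumes that "N-N" means one transfer for every ordered pair of distinct SRM endpoints; the slide does not say whether both directions are tested. The function name and the endpoint names are hypothetical and not part of SAM or FTS.

```python
from itertools import permutations


def transfer_pairs(srm_endpoints):
    """Return every ordered (source, destination) pair of distinct endpoints."""
    return list(permutations(srm_endpoints, 2))


# Hypothetical endpoint names, used only for illustration.
endpoints = ["srm.tier0.example", "srm.tier1-a.example", "srm.tier1-b.example"]

for source, destination in transfer_pairs(endpoints):
    # A real test would submit a transfer job here and later poll its status.
    print(f"{source} -> {destination}")
```

With N endpoints this yields N × (N − 1) transfers, so the number of test jobs grows quadratically with the number of Tier0 and Tier1 SRMs.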

Standalone sensors – BDII, RB
sBDII (Gstat)
  – accessibility
  – sanity checks
top-level BDIIs (Gstat)
  – accessibility
  – reliability of data (number of entries)
RB
  – job submission: UI → important RBs → “reliable” CEs
  – time of matchmaking

JobWrapper tests
JobWrapper
  – requested by experiments, also useful in operations
  – testing all WNs
      – SAM always tests just an arbitrary one
  – tests executed by CE wrapper script
      – executed with every production job
  – test results
      – passed to the job
      – published to the SAM DB
  – test code
      – core scripts in the release
      – tests in the software area (signed tarball)
  – soon in production

Availability metrics - algorithm
Status of node N: $\mathrm{Status}(N) = \bigwedge_{t \in \mathrm{CriticalTests}} \mathrm{TestResult}(N, t)$
Status of service C: $\mathrm{Status}(C) = \bigvee_{N \in \mathrm{instances}(C)} \mathrm{Status}(N)$
Status of site S: $\mathrm{Status}(S) = (CE_1 \lor CE_2 \lor \dots \lor CE_n) \land (SRM_1 \lor SRM_2 \lor \dots \lor SRM_n) \land \text{site BDII}$
where ∧ = boolean AND, ∨ = boolean OR
Everything is calculated for each VO that defined critical tests in FCR.
Results make sense only if the VO submits tests!
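The status algorithm above can be sketched in a few lines: a node is up if all critical tests pass (AND), a service is up if any of its instances is up (OR), and a site is up if all of its services are up (AND). This is an illustrative sketch, not SAM code. The function names, the test names, the per-service grouping of critical tests, and the choice to treat a missing test result as a failure are all assumptions.

```python
def node_status(results, critical_tests):
    """Boolean AND of one node's results over the VO's critical tests.

    A missing result counts as a failure (an assumption of this sketch).
    """
    return all(results.get(test, False) for test in critical_tests)


def service_status(instances, critical_tests):
    """Boolean OR over all instances (nodes) of one service."""
    return any(node_status(r, critical_tests) for r in instances.values())


def site_status(services, critical):
    """Boolean AND over the site's services (CE, SRM, site BDII)."""
    return all(
        service_status(instances, critical.get(name, []))
        for name, instances in services.items()
    )


# Hypothetical critical tests and results for one VO at one site.
critical = {
    "CE": ["job-submit", "ca-certs"],
    "SRM": ["put-file"],
    "sBDII": ["reachable"],
}
site = {
    "CE": {
        "ce1": {"job-submit": False, "ca-certs": True},
        "ce2": {"job-submit": True, "ca-certs": True},
    },
    "SRM": {"srm1": {"put-file": True}},
    "sBDII": {"bdii1": {"reachable": True}},
}
print(site_status(site, critical))
```

In this example the site counts as up even though ce1 fails, because ce2 passes all critical tests and the CE service is an OR over its instances.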

Availability metrics - algorithm II
service and site status calculated every hour
daily, weekly, monthly availability
scheduled downtime information from GOCDB
details of the algorithm on GOC:
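As a rough illustration of how hourly statuses could be rolled up into a daily figure, the sketch below computes availability as the fraction of hours in which the status was up. How the real algorithm accounts for scheduled downtime is not stated on the slide, so this sketch only reports the scheduled share separately. The function names and the sample data are hypothetical.

```python
def availability(hourly_up):
    """Fraction of hours in which the status was up (True)."""
    if not hourly_up:
        raise ValueError("need at least one hourly status")
    return sum(hourly_up) / len(hourly_up)


def scheduled_fraction(hourly_up, scheduled_hours):
    """Fraction of hours that were down and inside a scheduled downtime."""
    down_scheduled = sum(
        1
        for hour, up in enumerate(hourly_up)
        if not up and hour in scheduled_hours
    )
    return down_scheduled / len(hourly_up)


# One hypothetical day: down during hours 0-5, of which hours 0-1 were
# announced as scheduled downtime.
day = [False] * 6 + [True] * 18
print(availability(day), scheduled_fraction(day, {0, 1}))
```

The same functions apply to weekly or monthly figures by passing a longer list of hourly statuses.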

Availability metrics - GridView

Availability metrics - data export

VO tools
Freedom of Choice for Resources (FCR)

FCR -- Overview
Freedom of Choice for Resources
VO policy enforcement tool
critical test and resource selection for VOs
  – by manipulating top-level BDII information
goal is to be able to
  – select which aspects of site functionality are important for the VO
  – blacklist unreliable sites
  – always use stable, "important" sites
  – use less reliable sites based on SAM results
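The selection goals above can be sketched as a simple filter. This is not FCR code: the function and parameter names are hypothetical, and the rule that a blacklist entry wins over an "always use" entry is an assumption. Here `sam_ok` stands for the per-site outcome of the critical tests the VO has chosen.

```python
def select_sites(sites, sam_ok, blacklist, always_use):
    """Return the sites a VO would keep visible.

    - blacklisted sites are always dropped (takes precedence; assumption)
    - "always use" sites are kept regardless of test results
    - all other sites are kept only if their critical SAM tests pass
    """
    selected = []
    for site in sites:
        if site in blacklist:
            continue
        if site in always_use or sam_ok.get(site, False):
            selected.append(site)
    return selected


# Hypothetical site names and test outcomes.
sites = ["SITE-A", "SITE-B", "SITE-C", "SITE-D"]
sam_ok = {"SITE-A": False, "SITE-B": True, "SITE-C": False, "SITE-D": True}
print(select_sites(sites, sam_ok, blacklist={"SITE-D"}, always_use={"SITE-A"}))
```

In the example, SITE-A is kept despite failing tests, SITE-C is dropped for failing tests, and SITE-D is dropped because it is blacklisted.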

FCR -- Overview
integrated with SAM
  – sharing the same DB
optional usage
  – BDII configuration parameter
  – FCR output: LDIF file
information from GOCDB + BDII
DN-based authentication (2 levels)

FCR Admin Portal

FCR User Pages
read-only view of VO settings
tells if the resource is available at the moment
grouping selection

FCR User Portal

Monitoring tools
gstat, SAM Admin Portal, COD dashboard

gstat (Sinica)
  – Information System (BDII) monitoring
  – response time, consistency (sanity), completeness
  – site-BDII + top-level BDII
  – aggregated and detailed views
  – plots (history)
  – refreshed every 5 mins (non-intrusive)

gstat

SAM Admin Portal
  – on-demand SAM submission
  – easy to use
  – target site selection
  – used by:
      – ROCs: certification of a site
      – ROCs, site admins, CODs: speed up debugging

SAM Admin Portal

GOCDB
  – central database to store static site information
  – all EGEE sites have to register
  – contact, security contact, certification status, site type
  – scheduled maintenance
  – used by
      – script that generates top-level BDII config file
      – monitoring tools: SAM DB → SAM, FCR, Availability calc.
      – operations management tools

GOCDB

CIC Operations Portal
COD management
  – schedule for rotations
  – COD dashboard
  – COD handover notes
ROC management
  – ROC contacts
  – weekly reports
VO management
  – VO ID cards (VO contacts, etc.)
EGEE broadcast

CIC Operations Portal

GGUS (FZK)
Global Grid User Support
ticketing system for the EGEE Grid
based on Remedy
tickets created by
  – individual users (manually)
  – Grid Operators (via COD Dashboard)
news, documentation

GGUS Portal

Operations
Grid Operations

EGEE Operations Structure
Regional Operations Centres (ROC)
  – One in each region (incl. Asia-Pacific)
  – Front-line support for user and operations issues
      – point of contact for sites in the region
  – Provide local knowledge and adaptations
  – Manage daily Grid operations: oversight, troubleshooting
  – Run infrastructure services for Asia-Pacific region
  – Asia-Pacific
      – Jason Shih, Min-Hong Tsai, Shu-Ting Liao
  – CERN (catch-all ROC)
      – Nicholas Thackray

COD
COD is Operator on Duty
  – was: CIC-on-Duty
global LCG/EGEE Grid monitoring
1 (2) ROCs responsible for the whole Grid operations at a time
  – 12 ROCs involved
  – weekly rotation
weekly WLCG-OSG-EGEE Operations meeting
  – ROCs, Tier1, VOs
  – all sites invited

COD Procedures
Look at monitoring tools
  – SAM, Certificate Monitoring pages
Open tickets using COD Dashboard
Escalate expired tickets
Process site responses (update tickets accordingly)
End of duty: hand-over notes
Update the GOC wiki pages

COD Dashboard
summary of necessary monitoring information + tools for ticket processing
tickets linked to GGUS tickets
GOCDB information
  – site downtime information!
SAM alarms
ticket creation and management tool
tools for related

COD Dashboard

Connection between the tools used
[Diagram; labels: COD dashboard, Monitoring tools, SAM, GGUS, Grid Operators (COD), Problem tracking and reporting, Ticket follow-up, Modifications on the tickets]

Escalation Procedure
defines the steps to be taken during the lifetime of a ticket
  – tickets don't get forgotten!
available on CIC Portal
prioritization: alarms depending on the amount of resources at the site

Escalation Steps
1. ticket creation
2. first mail (to: site + ROC)
3. second mail (to: site + ROC)
4. suspension from the Grid
before step 4:
  a) mail to ROC
  b) mail to OCC for validation
  c) site is invited to the weekly operations meeting

Escalation Procedure -- Quarantine
site categories
  – low: CPU < 20
  – normal: 20 < CPU < 100
  – high: 100 < CPU
between and
  – low + normal: 3 days
  – high: 1 day
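The category thresholds and quarantine periods above map naturally onto a small lookup. The sketch below is illustrative only. The slide leaves the boundary values (exactly 20 or exactly 100 CPUs) undefined, so assigning 20 to "normal" and 100 to "high" is an assumption, as are the function and constant names.

```python
# Quarantine period in days per site category, as listed on the slide.
QUARANTINE_DAYS = {"low": 3, "normal": 3, "high": 1}


def site_category(cpu_count):
    """Classify a site by its number of CPUs.

    Boundary handling (20 -> normal, 100 -> high) is an assumption;
    the slide only gives strict inequalities.
    """
    if cpu_count < 20:
        return "low"
    if cpu_count < 100:
        return "normal"
    return "high"


def quarantine_days(cpu_count):
    """Return the quarantine period for a site of the given size."""
    return QUARANTINE_DAYS[site_category(cpu_count)]


for cpus in (8, 64, 500):
    print(cpus, site_category(cpus), quarantine_days(cpus))
```

Larger sites get the shorter period, which matches the idea of prioritizing alarms by the amount of resources at the site.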

COD Escalation Procedure
[Flowchart; nodes and labels: Create ticket, When deadline reached, Problem solved? (yes / no), Close ticket, site responds, Extend deadline, last escalation?, Escalate (mail), Suspend site (mail)]
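One plausible reading of the flowchart labels, combined with the four escalation steps listed earlier, is the decision function below. The exact edges of the original flowchart are not recoverable from the transcript, so the order of the checks is an assumption, and all names are hypothetical.

```python
STEPS = ["ticket creation", "first mail", "second mail", "suspension"]


def on_deadline(step, problem_solved, site_responded):
    """Decide what happens to a ticket when its deadline is reached.

    `step` is an index into STEPS. Returns (action, next_step).
    Assumed order of checks: solved -> responded -> last escalation.
    """
    if problem_solved:
        return ("close ticket", step)
    if site_responded:
        return ("extend deadline", step)
    if step >= len(STEPS) - 2:
        # "second mail" was the last escalation; the next step is suspension.
        return ("suspend site", len(STEPS) - 1)
    return ("escalate", step + 1)


# Walk a ticket for a site that never responds and never fixes the problem.
step = 0
while STEPS[step] != "suspension":
    action, step = on_deadline(step, problem_solved=False, site_responded=False)
    print(action, "->", STEPS[step])
```

Because every deadline leads to an explicit action, a ticket can be closed, extended, escalated or end in suspension, but it is never silently dropped.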

What a site is expected to do
Look at the monitoring tools (SAM)
  – try to notice & fix failures before the CODs
COD notification about a failure
  – fix it ASAP
  – contact the ROC for help if needed
Scheduled downtime
  – enter it in GOCDB
  – broadcast it in advance
  – broadcast when it's finished
weekly site reports (at COD portal)
  – input to weekly Operations meeting

What a site could do
problems → contact the ROC
  – best way: GGUS ticket
question → ask the ROC
open a ticket if there is a failure in Central Services
  – LFC, SAM, etc.

Happy End
Thanks for your attention :)