Metrics and Monitoring on FermiGrid
Keith Chadwick, Fermilab

Outline (25 June 2007)
 FermiGrid Introduction and Background
 Metrics
 Service Monitoring
 Availability (Acceptance) Monitoring
 Dashboard
 Lessons Learned
 Future Plans

Personnel (all Fermilab, Batavia, IL):
Eileen Berman, Philippe Canal, Keith Chadwick, David Dykstra, Ted Hesselroth, Gabriele Garzoglio, Chris Green, Tanya Levshina, Don Petravick, Ruth Pordes, Valery Sergeev, Igor Sfiligoi, Neha Sharma, Steven Timm, D.R. Yocum

What is FermiGrid?
FermiGrid is:
 The Fermilab campus Grid and Grid portal.
 –The site globus gateway.
 –Accepts jobs from external (to Fermilab) sources and forwards them onto internal clusters.
 A set of common services to support the campus Grid and interface to the Open Science Grid (OSG) / LHC Computing Grid (LCG):
 –VOMS, VOMRS, GUMS, SAZ, MyProxy, Squid, Gratia accounting, etc.
 A forum for promoting stakeholder interoperability and resource sharing within Fermilab:
 –CMS, CDF, D0;
 –KTeV, MiniBooNE, MINOS, MIPP, etc.
 The Open Science Grid portal to Fermilab compute and storage services.
FermiGrid web site & additional documentation:

FermiGrid - Current Architecture
Clusters: CMS WC1, CDF OSG1, CDF OSG2, D0 CAB1, D0 CAB2, GP Farm (with BlueArc storage, periodic synchronization).
Common services on the site wide gateway (exterior/interior boundary): VOMS server, SAZ server, GUMS server.
Step 1 - user issues voms-proxy-init and receives voms signed credentials.
Step 2 - user submits their grid job via globus-job-run, globus-job-submit, or condor-g.
Step 3 - gateway checks against the Site AuthoriZation Service.
Step 4 - gateway requests a GUMS mapping based on VO & role.
Step 5 - grid job is forwarded to the target cluster.
Clusters send ClassAds via CEMon to the site wide gateway.

Software Stack
Baseline:
 SL 3.0.x, SL 4.x, SL 5.0 (just released)
 OSG (VDT 1.6.1, GT 4, WS-GRAM, pre-WS GRAM)
Additional components:
 VOMS (VO Management Service)
 VOMRS (VO Membership Registration Service)
 GUMS (Grid User Mapping Service)
 SAZ (Site AuthoriZation Service)
 jobmanager-cemon (job forwarding job manager)
 MyProxy (credential storage)
 Squid (web proxy cache)
 syslog-ng (auditing)
 Gratia (accounting)
 Xen (virtualization)
 Linux-HA (high availability)

Timeline
FermiGrid services were initially deployed on April 1.
The first formal metrics collection was commissioned in late August:
 initially a manual process;
 automated during the fall.
Service monitoring was commissioned in June.
VO acceptance monitoring was commissioned in August.
Availability monitoring was commissioned earlier this month.

Metrics vs. Monitoring
Metrics collection:
 Takes place once per day.
Service monitoring:
 Takes place multiple times per day (typically once an hour).
 May be able to detect failed (or about to fail) services, notify administrators and (optionally) restart the service.
 Generates capacity planning information.
Acceptance monitoring:
 Does a grid site accept “my” VO and pass a minimal set of tests?
 May not guarantee that a real application can run - just that it can get in the door.
Availability monitoring:
 Very lightweight.
 Can be run very frequently (multiple times per hour).
 Optional automatic notification if results are “unexpected”.
 Feeds an automatic “Dashboard” display.

Metrics Collection - Mechanics
Metrics collection is implemented on FermiGrid as follows:
 A central metrics collection system launches a central metrics collection process once per day.
 –collect_grid_metrics.sh
 The central metrics collection process in turn launches copies of itself (secondary metrics collection processes) via ssh across all systems (and the services) that are designated for metrics collection.
 –collect_grid_metrics.sh
 The secondary metrics collection processes identify the system, service and metrics to be collected, then launch a script which has been custom written to collect the desired metrics from the specified service.
 –collect-globus-metrics.sh
 –collect-voms-metrics.sh
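The fan-out described above can be sketched roughly as follows. The host list, the service-to-host assignment, and the DRYRUN flag are illustrative assumptions, not the actual collect_grid_metrics.sh (which re-launches a copy of itself over ssh rather than a per-service script directly):

```shell
#!/bin/sh
# Sketch of a central metrics fan-out. Hosts and service names are
# hypothetical; with DRYRUN=1 (the default) the ssh commands are
# only printed, not executed.

HOSTS="fermigrid1:globus fermigrid2:voms fermigrid3:gums fermigrid4:saz"
DRYRUN=${DRYRUN:-1}

collect_on_host() {
    # The secondary invocation runs the service-specific collector,
    # e.g. collect-globus-metrics.sh or collect-voms-metrics.sh.
    cmd="ssh $1 ./collect-$2-metrics.sh"
    if [ "$DRYRUN" = "1" ]; then echo "would run: $cmd"; else $cmd; fi
}

run_all() {
    for entry in $HOSTS; do
        collect_on_host "${entry%%:*}" "${entry#*:}"
    done
}
run_all
```

A cron entry on the central collection host would then invoke this once per day.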

Metrics collected within FermiGrid
Globus gatekeeper:
 # of authenticated, authorized, jobmanager, jobmanager-fork, jobmanager-managedfork, batch (jobmanager-condor, jobmanager-pbs, etc.), jobmanager-condorg, jobmanager-cemon, jobmanager-mis, default.
 # of total IP connections, # of unique IP connections, # of unique IP connections from within Fermilab.
VOMS:
 # of voms-proxy-init’s by VO.
 # of voms-proxy-init’s by group within the fermilab VO.
 # of total IP connections, # of unique IP connections, # of unique IP connections from within Fermilab.
GUMS:
 # of successful GUMS mapping calls & # of failed GUMS mapping calls.
 # of total certificates, # of unique DNs, # of unique mappings, # of unique VOs.
 # of voms-proxy-inits, # of grid-proxy-inits.
 # of total IP connections, # of unique IP connections, # of unique IP connections from within Fermilab.
SAZ:
 # of successful SAZ calls & # of rejected SAZ calls.
 # of unique DNs, # of unique VOs, # of unique roles, # of unique CAs.
 # of total IP connections, # of unique IP connections, # of unique IP connections from within Fermilab.

Metrics Storage and Publication
Metrics are stored using two mechanisms:
 First, they are appended to “.csv” files which contain a leading date followed by tag-value pairs. Example:
 –22-Jun-2007,total=5721,success=5698,fails=53
 –total_ip=5721,unique_ip=231,fermilab_ip=12
 Second, the “.csv” files are processed and loaded into round robin databases using rrdtool.
A set of “standard” png plots is automatically generated from the rrdtool databases.
All of these formats (.csv, .rrd and .png) are periodically uploaded from the metrics collection host to the central FermiGrid web server.
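The second step can be sketched as parsing one csv record into an rrdtool update. The rrd file name and the field order are assumptions for illustration, and the command is printed rather than executed:

```shell
#!/bin/sh
# Sketch: convert one metrics ".csv" record (leading date, then
# tag=value pairs, as in the example above) into the corresponding
# "rrdtool update" command line.

line='22-Jun-2007,total=5721,success=5698,fails=53'

val() {  # val TAG RECORD -> numeric value of TAG in RECORD
    echo "$2" | sed -n "s/.*$1=\([0-9]*\).*/\1/p"
}

total=$(val total "$line")
success=$(val success "$line")
fails=$(val fails "$line")

# rrdtool update takes a timestamp (or N for "now") and
# colon-separated data source values.
cmd="rrdtool update saz-metrics.rrd N:$total:$success:$fails"
echo "$cmd"
```

A real pipeline would convert the leading date to an epoch timestamp instead of N, so that historical records can be back-loaded.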

Globus Gatekeeper Metrics 1 (plot)
Globus Gatekeeper Metrics 2 (plot)
VOMS Metrics 1 (plot)
VOMS Metrics 2 (plot)
VOMS Metrics 3 (plot)
GUMS Metrics 1 (plot)
GUMS Metrics 2 (plot)
GUMS Metrics 3 (plot)
SAZ Metrics 1 (plot)
SAZ Metrics 2 (plot)
SAZ Metrics 3 (plot)

Service Monitoring - Mechanics
 A central service monitor system launches the central service monitor collection script once per hour.
 –monitor_grid_script.sh
 The central service monitor process in turn launches background copies of itself (secondary service monitor processes) across all systems (and the services) that are designated for service monitoring.
 –monitor_grid_script.sh
 The secondary service monitor processes identify the system and service to be monitored, then launch a script which has been custom written to monitor the specified service.
 –monitor_ _script.sh
 –monitor_gatekeeper_script.sh
 –monitor_voms_script.sh
 –monitor_gums_script.sh
 –monitor_saz_script.sh

Service Monitor Configuration
Configuration of the service monitor system is via a central configuration file:

fermigrid0 fermigrid0.fnal.gov master publish var/www/html
#
fermigrid0 fermigrid0.fnal.gov vo fermilab
fermigrid1 fermigrid1.fnal.gov gatekeeper
fermigrid2 fermigrid2.fnal.gov voms voms.fnal.gov
fermigrid3 fermigrid3.fnal.gov gums gums.fnal.gov
fermigrid3 fermigrid3.fnal.gov mapping cms
fermigrid3 fermigrid3.fnal.gov mapping dteam
fermigrid4 fermigrid4.fnal.gov saz saz.fnal.gov
fermigrid4 fermigrid4.fnal.gov myproxy myproxy.fnal.gov
fermigrid4 fermigrid4.fnal.gov squid squid.fnal.gov
#
fcdfosg1 fcdfosg1.fnal.gov gatekeeper
fcdfosg2 fcdfosg2.fnal.gov gatekeeper
d0cabosg1 d0cabosg1.fnal.gov gatekeeper ssh:/grid/login/chadwick
d0cabosg2 d0cabosg2.fnal.gov gatekeeper ssh:/grid/login/chadwick
###cmsosgce cmsosgce.fnal.gov gatekeeper grid:/uscms/osg/app/fermilab/chadwick
###cmsosgce2 cmsosgce2.fnal.gov gatekeeper grid:/uscms/osg/app/fermilab/chadwick
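A monitor driven by a configuration like the one above might read it line by line and dispatch one probe per entry. This is a minimal sketch under that assumption; the probe is stubbed out with echo, and the sample config entries are a subset of the slide's:

```shell
#!/bin/sh
# Sketch: parse a service-monitor config (host, fqdn, service,
# optional argument; '#' starts a comment) and dispatch one probe
# per non-comment line.

config() {
cat <<'EOF'
# host       fqdn                 service  argument
fermigrid2   fermigrid2.fnal.gov  voms     voms.fnal.gov
fermigrid3   fermigrid3.fnal.gov  gums     gums.fnal.gov
fermigrid3   fermigrid3.fnal.gov  mapping  cms
fermigrid4   fermigrid4.fnal.gov  saz      saz.fnal.gov
EOF
}

dispatch() {
    config | while read host fqdn service arg; do
        case "$host" in ''|\#*) continue;; esac   # skip blanks/comments
        # A real monitor would launch monitor_<service>_script.sh here.
        echo "probe $service on $fqdn ${arg:+(arg $arg)}"
    done
}
dispatch
```

Keeping one shared config for the service monitor, availability monitor and dashboard (as the later slides describe) means a host only has to be registered once.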

Service Monitor - Information Collected
Globus gatekeeper:
 # of authenticated, authorized, jobmanager, jobmanager-fork, jobmanager-managedfork, batch (condor, pbs, lsf, etc.), condorg/cemon, mis, default.
 The value of uptime, load1, load5 and load15.
VOMS:
 # of voms-proxy-init’s.
 # of apache and tomcat processes.
 The rss and vsz of the Tomcat VOMS server process.
 The value of uptime, load1, load5 and load15.
GUMS:
 # of successful GUMS mapping calls & # of failed GUMS mapping calls.
 # of apache and tomcat processes.
 The rss and vsz of the Tomcat GUMS server process.
 The value of uptime, load1, load5 and load15.
SAZ:
 # of successful SAZ calls & # of rejected SAZ calls.
 # of apache and tomcat processes.
 The rss and vsz of the Tomcat SAZ server process.
 The value of uptime, load1, load5 and load15.
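The host-health values listed above (load averages, process sizes) can be collected with standard tools. This sketch probes the current shell ($$) instead of a real Tomcat pid so it is self-contained; a real monitor would look up the service's pid first:

```shell
#!/bin/sh
# Sketch: load1/load5/load15 from uptime(1) and the resident (rss)
# and virtual (vsz) size of a process from ps(1), in kilobytes.

loads=$(LC_ALL=C uptime | awk -F'load average[s]*: *' '{print $2}')
load1=$(echo "$loads"  | cut -d, -f1 | tr -d ' ')
load5=$(echo "$loads"  | cut -d, -f2 | tr -d ' ')
load15=$(echo "$loads" | cut -d, -f3 | tr -d ' ')

# Probe our own shell; a Tomcat monitor would use Tomcat's pid.
set -- $(ps -o rss= -o vsz= -p $$)
rss=$1; vsz=$2

echo "load1=$load1 load5=$load5 load15=$load15 rss=${rss}kB vsz=${vsz}kB"
```

Appending these values to the per-service csv files gives the capacity-planning trail mentioned on the "Metrics vs. Monitoring" slide.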

Service Monitor Storage and Publication
Results of the service monitors are stored using two mechanisms:
 First, they are appended to “.csv” files which contain a leading time (in seconds from the Unix epoch) followed by tag-value pairs. Example:
 –time= ,authenticated=42,authorized=26,jobmanager=26
 Second, the “.csv” files are processed and loaded into round robin databases using rrdtool.
A set of “standard” png plots is automatically generated from the rrdtool databases.
All of these formats (.csv, .rrd and .png) are periodically uploaded from the metrics collection host to the central FermiGrid web server.

Globus Gatekeeper Monitor 1 (plot)
Globus Gatekeeper Monitor 2 (plot)
VOMS Monitor 1 (plot)
VOMS Monitor 2 (plot)
GUMS Monitor 1 (plot)
GUMS Mapping Monitor (plot)
SAZ Monitor 1 (plot)

VO Acceptance Monitoring
Monitor the acceptance of a VO across a Grid in order to:
 Identify where the members of the VO can consider running jobs.
 –Not a guarantee that the job can actually run.
 Identify misconfigured sites that advertise that they “support” the VO but do not actually accept jobs from VO members.
 Log formal trouble tickets through the OSG GOC.
 –Ideally have the sites respond and fix their configuration.
 –Unfortunately some sites have not been very responsive.
 –And still other sites have responded by removing support for the VO.

VO Acceptance Monitoring Mechanics
How it is done:
 A cron script periodically launches kcroninit.
 kcroninit launches a script which does authentication:
 –kx509
 –kxlist -p
 Robot certificate “issued” by the Fermilab KCA:
 –/DC=gov/DC=fnal/O=Fermilab/OU=Robots/CN=cron/CN=Keith Chadwick/UID=chadwick
 Get VO signed credentials:
 –voms-proxy-init -noregen -voms fermilab:/fermilab
 Pulls the list of OSG sites from the OSG gridscan reports.
 For each site in the report, the acceptance monitor tests:
 –Unix ping.
 –globusrun -a -r (authenticate).
 –globus-job-run (existing application - typically /usr/bin/id).
 –globus-url-copy (to and from).
 Periodically I review the list of failing sites and, if appropriate, log trouble tickets.
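The per-site test sequence above can be sketched as a loop that emits one probe command per test. The site names are placeholders and the globus command lines are printed rather than run, since executing them requires valid VO credentials and a live gatekeeper:

```shell
#!/bin/sh
# Sketch of the acceptance-test sequence: ping, authenticate, run a
# trivial job, and copy a file in each direction. Sites and paths
# are illustrative.

SITES="sitea.example.org siteb.example.org"

tests_for() {
    site=$1
    echo "ping -c 1 $site"
    echo "globusrun -a -r $site"
    echo "globus-job-run $site /usr/bin/id"
    echo "globus-url-copy file:///tmp/probe gsiftp://$site/tmp/probe"
    echo "globus-url-copy gsiftp://$site/tmp/probe file:///tmp/probe.back"
}

for site in $SITES; do
    tests_for "$site"
done
```

Stopping at the first failed step per site (ping before authentication, authentication before job submission) keeps the failure report easy to interpret.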

VO Acceptance Monitor 1 (plot)

Availability (Infrastructure) Monitoring
Designed to be very “lightweight”.
Currently run with the service monitor, but designed and implemented so that it can run much more frequently.
Monitors both the host system and the service which is running on the system.
Driven by the same configuration file as the service monitor.

Base Infrastructure Monitor (plot)

Dashboard
Based on a secondary analysis of the infrastructure monitor data.
The design goal is to be a simple “health” dashboard.
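A secondary analysis like this can be as simple as rendering the latest status of each service as a red/green table. The status data and the HTML layout below are invented for illustration, not the actual FermiGrid dashboard:

```shell
#!/bin/sh
# Sketch: generate a minimal "health dashboard" HTML table from
# per-service status records (service name, up/down).

status() {
cat <<'EOF'
voms up
gums up
saz down
EOF
}

dashboard_html() {
    echo "<table>"
    status | while read service state; do
        case "$state" in
            up) color=green ;;
            *)  color=red ;;
        esac
        echo "  <tr><td>$service</td><td bgcolor=\"$color\">$state</td></tr>"
    done
    echo "</table>"
}
dashboard_html
```

Regenerating the page on every availability-monitor pass keeps the dashboard passive: it never probes services itself, it only summarizes data the monitor already collected.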

Dashboard - Typical Display (plot)

Lessons Learned 1
Metrics and service monitoring are difficult:
 Every service has its own log file format (at least today).
 –find, grep and awk are your friends.
 –The format of the messages within the service log file will change as new versions of the services are deployed.
 Some services don’t log all necessary and/or interesting information “out of the box”; they need additional logging options enabled.
 –You may have to work with the service developers to ensure that they log the necessary service information.
 Some services are extremely “talkative” and place lots of information (that I am certain is useful to the developers) in the log file along with the “golden nuggets” that are needed by the metrics collection and service monitoring.
 –You may have to work with the service developers to ensure that they log the necessary service information.
 You may have to extract and correlate information from multiple logs.
 You must also monitor services that the monitored service depends on (especially apache and tomcat).
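The grep/awk style of log mining described above looks roughly like this. The log lines here are invented for illustration; real gatekeeper log formats differ, and change between releases, which is exactly the fragility this slide warns about:

```shell
#!/bin/sh
# Sketch: count interesting events in a (fabricated) service log
# with grep -c. A production collector would also have to cope with
# log rotation and per-release format changes.

log() {
cat <<'EOF'
Jun 25 04:10:01 gk note: Authenticated globus user /DC=org/CN=alice
Jun 25 04:10:02 gk note: Authorized as local user uscms01
Jun 25 04:11:07 gk note: Authenticated globus user /DC=org/CN=bob
Jun 25 04:11:08 gk note: Failure: token processing error
EOF
}

authenticated=$(log | grep -c 'Authenticated globus user')
authorized=$(log | grep -c 'Authorized as local user')
failures=$(log | grep -c 'Failure:')
echo "authenticated=$authenticated authorized=$authorized failures=$failures"
```

Pinning each metric to one distinctive log phrase, as above, is what breaks when developers reword messages in a new release, hence the advice to work with them on stable logging.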

Lessons Learned 2
Out of band access and monitoring is quite useful and necessary.
 ssh and ksu, as well as grid access.
Using grid services to monitor other grid services may not correctly identify the problem:
 Did some local (non-grid) service fail?
 –kx509, kxlist -p
 Did the local grid service fail?
 –voms-proxy-init
 Did some intermediate service fail or time out?
 –Network congestion
 Did the remote grid service fail or time out?
 –Globus gatekeeper

Lessons Learned 3
Service monitoring with automatic service recovery can be very useful:
 especially when responding to automated security probing,
 and also for getting a full night’s rest…
Automatic service recovery will usually require some level of root access.
 Sites are understandably reluctant to grant “remote” root access (I know that I am…).
Robot certificates are extremely useful for automating grid service monitoring.
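A monitor-with-automatic-restart loop can be sketched as below. The check and restart actions are stubs controlled by a SERVICE_UP flag; a real monitor might probe a TCP port or run a test call, then invoke the service's init script, which is where the root-access requirement mentioned above comes in:

```shell
#!/bin/sh
# Sketch of a service watchdog. check_service and restart_service
# are stand-ins; SERVICE_UP=0 simulates a failed health check.

SERVICE_UP=${SERVICE_UP:-0}

check_service()   { [ "$SERVICE_UP" = "1" ]; }
restart_service() { echo "restarting $1"; }   # would need root

watchdog() {
    name=$1
    if check_service; then
        echo "$name ok"
    else
        echo "$name down"
        restart_service "$name"
    fi
}
watchdog voms
```

Logging every automatic restart (rather than restarting silently) keeps the recovery mechanism from masking a recurring underlying fault.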

Plans for the Future
Continue with the development of additional metrics and monitor probes.
Continue with the development of automated reports & publication.
Integrate/incorporate the new OSG SAM probes into fermilab VO monitoring.
As part of the FermiGrid-HA deployment, enhance the metrics and monitoring infrastructure:
 Collect from all [voms, gums, saz] service instances.
 Collate an HA view of the services.
Work towards making this infrastructure more portable.

Fin
Any questions?