EGEE-III INFSO-RI
Enabling Grids for E-sciencE
EGEE and gLite are registered trademarks

Operations Automation in EGEE-III
What does the OAT mean to you?
James Casey, CERN
EGEE’08, Istanbul, Turkey

What is the Operations Automation Team (OAT)?
Defined in EGEE MSA1.1
  – ‘Operations Automation Strategy’
  – Initial focus on multi-level monitoring
  – Delivered mid-June; comments still welcome
Abstract: In EGEE-III, within the SA1 activity, a group called the ‘Operations Automation Team’ was formed with the task of coordinating operational tools and their development, with the specific goal of advising on the strategic directions to take in terms of automating the operations effort. This will entail replacing manual processes with automated ones in order that the overall staffing level of operations can be significantly reduced in a long-term, sustainable infrastructure. This document outlines a strategy for achieving this automation using an integration architecture based on messaging. It describes how current tools and processes, such as operational alarming and ticketing, will evolve during the lifetime of EGEE-III and lays out a roadmap for this evolution.

Questions
What’s this got to do with EGI?
I’ve heard Nagios replaces SAM; what does this mean?
  – It uses messaging; what does that mean?
  – I’m in a VO; does this affect me?
Will you help me manage my site better?
When will this all happen?
How can I help?

OAT and EGI
OAT is an EGEE-III body
  – Using EGEE effort to automate (improve?) operations during the project
  – Oversee all operational tool development within EGEE SA1
Following EGI visions on upcoming strategy
  – EGI “subsidiarity principle”
  – This is a big driver for us: moving processes and tools to regional models
      Where possible!
Provide input to EGI on operational tool development and deployment that is on the roadmap beyond the end of EGEE-III

Operational Tools in EGEE-III

Current Operational Model
Several teams involved
  – Operations Management (OCC)
  – Monitoring system operators (SAM)
  – Grid operators (COD)
  – Regional Operations Centres (ROC)
  – First line support teams (ROC)
  – Resource Centres/sites (RC)
  – User support team (GGUS)

Improving reliability and availability

Current operational model(s)

Future operational model

Multi-level monitoring
Based on existing work in CE ROC
  – Replaces central SAM execution framework with Nagios at ROC and site
  – Interacts with existing SAM components
      Visualization, availability calculation, historical result store
  – Tied together via a reliable messaging infrastructure
  – Regional operations dashboard and alarms DB
  – Link into regional ticketing, e.g. via GGUS
Follow new operational model
  – Raise alarms immediately at the site
  – 1st level support sees them and can respond if needed
  – Central COD only involved after 2-3 weeks, e.g. for site banning
Tutorial yesterday with much more detail
  – Full install of all components at a site done in 1.5 hours...
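
The escalation path of the new operational model can be sketched as a small routing function. This is an illustration only: the 14-day threshold is an assumption standing in for the "2-3 weeks" on the slide, and the names `responsible_parties` and `COD_ESCALATION_DAYS` are hypothetical, not part of any OAT tool.

```python
from datetime import datetime, timedelta

# Assumption: 14 days stands in for the "2-3 weeks" mentioned on the slide.
COD_ESCALATION_DAYS = 14


def responsible_parties(raised_at, now):
    """Return who should be looking at an alarm, given when it was raised."""
    # The site and 1st level support see every alarm as soon as it is raised.
    parties = ["site", "1st level support"]
    # Central COD only gets involved once the alarm has stayed open long enough.
    if now - raised_at >= timedelta(days=COD_ESCALATION_DAYS):
        parties.append("central COD")
    return parties


raised = datetime(2008, 9, 1)
fresh = responsible_parties(raised, datetime(2008, 9, 2))
stale = responsible_parties(raised, datetime(2008, 9, 20))
print(fresh)
print(stale)
```

Keeping the threshold in one constant makes it easy for a region to tune when central COD is pulled in.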

Monitoring is multi-level

Source        # checks / service    Type
Central       1-2                   Network monitoring, service ‘Ping’
Regional      5-10                  User-oriented actions (e.g. existing SAM tests)
Site local    10-30                 Detailed functional tests

Messaging Systems
Flexible architecture:
  – Deliver messages either in point-to-point mode (queues) ...
  – ... or in multicast mode (topics)
  – Support synchronous or asynchronous communication
Reliable delivery of messages:
  – Provide reliability to the senders if required
  – Configurable persistency / master-slave
Highly scalable:
  – Network of brokers
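
The difference between the two delivery modes can be shown with a toy, in-memory broker. This is a sketch of the semantics only, assuming nothing about how a real broker is implemented; the class `TinyBroker` and the queue name `alarms` are hypothetical, and persistence, acknowledgements and networking are left out.

```python
from collections import defaultdict, deque


class TinyBroker:
    """In-memory toy broker illustrating queue and topic delivery."""

    def __init__(self):
        self.queues = defaultdict(deque)      # queue name -> pending messages
        self.subscribers = defaultdict(list)  # topic name -> subscriber inboxes

    def send_to_queue(self, name, message):
        # Point-to-point: the message waits until exactly one consumer takes it.
        self.queues[name].append(message)

    def receive_from_queue(self, name):
        # Returns None when the queue is empty.
        pending = self.queues[name]
        return pending.popleft() if pending else None

    def subscribe(self, topic):
        # Each subscriber gets its own inbox.
        inbox = []
        self.subscribers[topic].append(inbox)
        return inbox

    def publish(self, topic, message):
        # Multicast: every current subscriber receives its own copy.
        for inbox in self.subscribers[topic]:
            inbox.append(message)


broker = TinyBroker()

# Queue: only the first consumer gets the message.
broker.send_to_queue("alarms", "CE down")
first = broker.receive_from_queue("alarms")
second = broker.receive_from_queue("alarms")

# Topic: both subscribers get a copy.
site_view = broker.subscribe("grid.probe.metricOutput")
roc_view = broker.subscribe("grid.probe.metricOutput")
broker.publish("grid.probe.metricOutput", "SRM-put OK")

print(first, second)
print(site_view, roc_view)
```

A queue suits work that must be handled once, such as an alarm to act on; a topic suits results that several consumers want to see independently.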

ActiveMQ
Mature open-source implementation of these ideas
  – Top-level Apache project
  – Commercial support available from IONA
  – Widely-used commodity software
Easy to integrate into your code
  – Multiple language + transport protocol support
Good performance characteristics
  – See later ...
Work done to integrate into our environment
  – RPMs, YAIM configuration, monitoring and alarms

    use strict;
    use warnings;
    use Net::Stomp;

    # Connect to the broker over STOMP
    my $stomp = Net::Stomp->new(
        { hostname => 'gridmsg102.cern.ch', port => '6163' } );
    $stomp->connect();

    # Subscribe to the probe results topic. With 'client' acks the broker
    # waits for an explicit ACK; a prefetch size of 1 hands over one
    # message at a time.
    $stomp->subscribe(
        {   'destination'           => '/topic/grid.probe.metricOutput',
            'ack'                   => 'client',
            'activemq.prefetchSize' => 1
        }
    );

    # Receive, print and acknowledge each message
    while (1) {
        my $frame = $stomp->receive_frame;
        warn $frame->body;
        print $frame->as_string;
        $stomp->ack( { frame => $frame } );
    }

    # Not reached while the loop above runs forever
    $stomp->disconnect;
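
Part of what makes multi-language support easy is that STOMP is a simple text protocol. As a rough sketch, the SUBSCRIBE request used in the Perl example could be serialised by hand following the STOMP 1.0 frame layout (command line, `key:value` headers, blank line, body, NUL byte). The helper name `stomp_frame` is hypothetical; a client library normally does this for you.

```python
def stomp_frame(command, headers=None, body=""):
    """Serialise a STOMP 1.0 frame: command, headers, blank line, body, NUL."""
    lines = [command]
    for key, value in (headers or {}).items():
        lines.append(f"{key}:{value}")
    # A blank line separates headers from the body; a NUL byte ends the frame.
    return ("\n".join(lines) + "\n\n" + body + "\x00").encode("utf-8")


subscribe = stomp_frame(
    "SUBSCRIBE",
    {
        "destination": "/topic/grid.probe.metricOutput",
        "ack": "client",
        "activemq.prefetchSize": 1,
    },
)
print(subscribe)
```

The resulting bytes are what a client would write to the broker's socket after a successful CONNECT exchange.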

ActiveMQ

Results: Throughput
[Plot not reproduced; labels: Consumers, Throughput]

Vendor tests
From “Optimizing FUSE Message Broker”

Usages of Messaging
We use it as an ‘integration bus’
  – Use when systems want to share information
      E.g. VO transfer systems publishing data rates to WLCG
It’s another string to our bow
  – When the application model fits well, then use it
      E.g. asynchronous communications, broadcast messages
Don’t force applications to use it
  – Have other solutions too
      E.g. “RESTful” web services à la SAM Programmatic Interface

‘Standard’ Integration Patterns
The same patterns are repeated in many of the following examples:
  – Gather results at many points
  – Collect the raw results and store in a database
  – Perform some operation on the raw results
      Summarisation, availability calculation, ...
  – Publish the summarised results to many clients
      E.g. site monitoring, dashboards, ...
  – Store historical data in a database and visualize via web client
We provide ‘standard’ components to make this plug and play for many workflows
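
The gather, store, summarise and publish steps can be sketched in a few lines. Everything here is an assumption made for illustration: the site and metric names are invented, a plain list stands in for the database, and the "fraction of OK results" is a naive stand-in rather than the real availability calculation.

```python
from collections import defaultdict

# Hypothetical raw results gathered at many points: (site, metric, status).
raw_results = [
    ("SITE-A", "CE-job-submit", "OK"),
    ("SITE-A", "SRM-put", "CRITICAL"),
    ("SITE-B", "CE-job-submit", "OK"),
    ("SITE-B", "SRM-put", "OK"),
]

store = []  # stands in for the database of raw results


def collect(results):
    """Collect the raw results into the store."""
    store.extend(results)


def summarise(rows):
    """Fraction of OK results per site (naive stand-in for availability)."""
    ok = defaultdict(int)
    total = defaultdict(int)
    for site, _metric, status in rows:
        total[site] += 1
        if status == "OK":
            ok[site] += 1
    return {site: ok[site] / total[site] for site in total}


def publish(summary, clients):
    """Hand the same summary to every interested client."""
    for client in clients:
        client(summary)


received = []  # stands in for a dashboard that keeps what it is sent
collect(raw_results)
summary = summarise(store)
publish(summary, [received.append, print])
```

Because each stage only depends on the data handed to it, a stage can be replaced (for example, a different summarisation) without touching the others, which is the point of providing them as standard components.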

VO, ROC, Project & Local monitoring

Another application: Usage Reporting
Other main part of monitoring: usage statistics
  – GridFTP transfers, FTS transfers, job records, ...
      Used to calculate throughput and reliability
Currently handled in GridView, Dashboards
  – Use messaging system to unite these efforts
      Delegate parsing/routing of specific information back to experts
  – L&B, FTS, ...
Other integration examples include
  – Accounting
  – GOCDB synchronization
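
As a rough illustration of how throughput and reliability might be derived from transfer records, the sketch below uses invented records and simple definitions: throughput as megabytes moved per second of successful transfer time, and reliability as the fraction of attempts that succeeded. These definitions and the `Transfer` record layout are assumptions, not the formulas used by GridView or the dashboards.

```python
from collections import namedtuple

# Hypothetical record layout: payload size, duration and outcome of one attempt.
Transfer = namedtuple("Transfer", ["bytes_moved", "seconds", "succeeded"])


def throughput_mb_per_s(transfers):
    """Megabytes moved by successful transfers per second spent on them."""
    done = [t for t in transfers if t.succeeded]
    elapsed = sum(t.seconds for t in done)
    if elapsed == 0:
        return 0.0
    return sum(t.bytes_moved for t in done) / 1e6 / elapsed


def reliability(transfers):
    """Fraction of transfer attempts that succeeded."""
    if not transfers:
        return 0.0
    return sum(1 for t in transfers if t.succeeded) / len(transfers)


records = [
    Transfer(500_000_000, 10.0, True),
    Transfer(300_000_000, 10.0, True),
    Transfer(0, 5.0, False),
    Transfer(200_000_000, 5.0, True),
]
rate = throughput_mb_per_s(records)
success = reliability(records)
print(rate, success)
```

If every source published records in one agreed shape over the messaging system, a single consumer like this could serve several tools.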

Site Management
gLite often doesn’t provide enough management tools
  – Direct feedback from site and service managers
Site managers often write tools themselves
Strategy defined to get these tools to a wide audience
  – Lightning talks
      5-minute presentations on tools people have developed
      Stay for the rest of the session!
  – Publicity of tool development, e.g. via iSGTW
      Doubling of visitors to gridmap.cern.ch after publishing an article

Deployment support
EGEE-SA1 tools project started
Policy being defined now
  – With some ‘early adopter’ projects
      Some of the tools you’ll see in the lightning talks session
Facilities
  – Support in using ETICS
  – Support in writing YAIM
  – Yum repository
  – Documentation repository
Contact us if you’re interested in contributing here

Global Roadmap
Covers multi-level monitoring
Roadmaps for other areas (e.g. accounting) are in the process of being defined by individual teams
  – And co-ordinated by the OAT

Roadmap for tools
Milestone ‘Messaging 1’: August 2008
  – Production-level messaging broker in production. This should have internal failover capabilities, but will not have the WAN failover capabilities of a network of brokers
Milestone ‘Messaging 2’: December 2008
  – A scalable and reliable network of brokers, deployed over at least 3 sites, is in place
Milestone ‘Site Monitoring 1’: September 2008
  – A release of the site components for the multi-level monitoring, including packaging and configuration as part of an EGEE middleware release, exists and is ready for deployment to the sites
Milestone ‘ROC Monitoring 1’: December 2008
  – The ROC components for the multi-site monitoring are ready for deployment to sites
Milestone ‘ROC Monitoring 2’: February 2009
  – The alarm component has been integrated with the regionalized dashboard
Milestone ‘ROC Monitoring 3’: July 2009
  – The regional dashboard is now available to be deployed at the ROCs

Roadmap for distributed COD
Milestone ‘rCOD 1’: September 2008
  – 4 ROCs carry out r-COD and 1st line support roles directly. This will be done with a ‘regionalized’ version of the current operations dashboard, and with SAM as the alarm generation system
Milestone ‘rCOD 2’: April 2009
  – 4 additional ROCs carry out r-COD and 1st line support roles using the regionalized dashboard
Milestone ‘rCOD 3’: April 2009
  – 2 additional ROCs carry out r-COD and 1st line support roles directly using the new multi-level monitoring framework
Milestone ‘rCOD 4’: September 2009
  – All 11 ROCs carry out r-COD and 1st line support roles directly. The c-COD is fully established
Milestone ‘rCOD 5’: December 2009
  – All 11 ROCs carry out r-COD and 1st line support roles using the new multi-level monitoring framework

OAT and new tools
OAT is the body to oversee new tool development
  – In response to needs of sites, ROC, OCC
New projects under investigation
  – SLA Portal
      In response to MSA 1.5 SLA
  – Metrics portal
      In response to MSA 1.3: Activity QA plan
Also new development for multi-level monitoring
  – Improvement of Nagios probes for services
  – Re-engineering of existing SAM probes
  – Re-engineering of other existing tools for regional models
      SAMAP, GridView, ...
  – ‘Probe description database’: metadata store for probes

Take home messages
The OAT is trying to provide tools to improve operations
  – Reduce effort
The OAT is a process
  – We’ve started now
  – There’s still lots to do
Site administrators are needed to contribute
  – By deploying the tools and giving feedback
  – By contributing best-of-breed system management tools
  – By working on design and development of operational tools
Get in touch!
  – Talk to an OAT member
  – Send us mail, join the discussion list
  – Read the strategy document

Contacts
Strategy Document:
Contact the team:
Discuss Mailing List: egee3-operations-automation- (please join!)
Documentation Site: share/oat/default.aspx (in development)

List of OAT Members
  AP – Joanna Huang
  CE – Emir Imamagic
  CE – Marcin Radecki
  CERN – James Casey
  CERN – John Shade
  DECH – Angela Poschlad
  FR – Cyril L'Orphelin
  FR – Guillaume Cessieux
  IT – Giuseppe Misurelli
  NE – Ronald Starink
  SEE – Antun Balaz
  SWE – Javier Lopez Cacheiro
  UKI – Gilles Mathieu