Published by Dayna Bishop. Modified over 9 years ago.
1
Operations Automation in EGEE-III
What does the OAT mean to you?
James Casey, CERN
EGEE’08, Istanbul, Turkey
EGEE-III INFSO-RI-222667, Enabling Grids for E-sciencE, www.eu-egee.org
EGEE and gLite are registered trademarks
2
What is the Operations Automation Team (OAT)?
Defined in EGEE MSA1.1
  – ‘Operations Automation Strategy’
  – Initial focus on multi-level monitoring
  – Delivered mid-June; comments still welcome
      https://edms.cern.ch/document/927171
Abstract: In EGEE-III, within the SA1 activity, a group called the ‘Operations Automation Team’ was formed with the task of coordinating operational tools and their development, with the specific goal of advising on the strategic directions to take in terms of automating the operations effort. This will entail replacing manual processes with automated ones in order that the overall staffing level of operations can be significantly reduced in a long-term, sustainable infrastructure. This document outlines a strategy for achieving this automation using an integration architecture based on messaging. It describes how current tools and processes, such as operational alarming and ticketing, will evolve during the lifetime of EGEE-III and lays out a roadmap for this evolution.
3
Questions
What’s this got to do with EGI?
I’ve heard Nagios replaces SAM; what does this mean?
  – It uses messaging; what does that mean?
  – I’m in a VO; does this affect me?
Will you help me manage my site better?
When will this all happen?
How can I help?
4
OAT and EGI
OAT is an EGEE-III body
  – Using EGEE effort to automate (improve?) operations during the project
  – Oversee all operational tool development within EGEE SA1
Following EGI visions on upcoming strategy
  – EGI “subsidiarity principle”
  – This is a big driver for us: moving processes and tools to regional models
      Where possible!
Provide input to EGI on operational tool development and deployment that is on the roadmap beyond the end of EGEE-III
5
Operational Tools in EGEE-III
6
Current Operational Model
Several teams involved
  – Operations Management (OCC)
  – Monitoring system operators (SAM)
  – Grid operators (COD)
  – Regional Operations Centres (ROC)
  – First line support teams (ROC)
  – Resource Centres/sites (RC)
  – User support team (GGUS)
7
Improving reliability and availability
8
Current operational model(s)
9
Future operational model
10
Multi-level monitoring
Based on existing work in CE ROC
  – Replaces central SAM execution framework with Nagios at ROC and site
  – Interacts with existing SAM components
      Visualization, availability calculation, historical result store
  – Tied together via a reliable messaging infrastructure
  – Regional operations dashboard and alarms DB
  – Link into regional ticketing, e.g. via GGUS
Follow new operational model
  – Raise alarms immediately at the site
  – 1st level support sees them and can respond if needed
  – Central COD only involved after 2-3 weeks, e.g. site banning
Tutorial yesterday with much more detail
  – Full install of all components at a site done in 1.5 hours...
11
Monitoring is multi-level

Source       # checks / service   Type
Central      1-2                  Network monitoring, service ‘ping’
Regional     5-10                 User-oriented actions (e.g. existing SAM tests)
Site local   10-30                Detailed functional tests
12
Messaging Systems
Flexible architecture:
  – Deliver messages, either in point-to-point mode (queues)…
  – … or in multicast mode (topics)
  – Support synchronous or asynchronous communication
Reliable delivery of messages:
  – Provide reliability to the senders if required
  – Configurable persistency / master-slave
Highly scalable:
  – Network of brokers
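The difference between the two delivery modes can be illustrated with a small in-memory sketch. This is not ActiveMQ or any real broker API: the TinyBroker class, its method names and the "tickets" queue are hypothetical, and round-robin dispatch is just one possible way a queue might share messages among consumers. The sketch uses Python rather than the Perl seen elsewhere in these slides, purely for illustration.

```python
from collections import defaultdict


class TinyBroker:
    """Toy in-memory broker showing queue vs. topic delivery (illustrative only)."""

    def __init__(self):
        self._queue_consumers = defaultdict(list)
        self._queue_cursor = defaultdict(int)  # round-robin position per queue
        self._topic_subscribers = defaultdict(list)

    def consume_queue(self, name, callback):
        """Register a consumer competing for messages on a queue."""
        self._queue_consumers[name].append(callback)

    def subscribe_topic(self, name, callback):
        """Register a subscriber that receives every message on a topic."""
        self._topic_subscribers[name].append(callback)

    def send_queue(self, name, message):
        """Point-to-point: exactly one consumer receives each message."""
        consumers = self._queue_consumers[name]
        if not consumers:
            raise LookupError(f"no consumer attached to queue {name!r}")
        index = self._queue_cursor[name] % len(consumers)
        self._queue_cursor[name] = index + 1
        consumers[index](message)

    def publish_topic(self, name, message):
        """Multicast: every current subscriber receives the message."""
        for callback in list(self._topic_subscribers[name]):
            callback(message)


def demo():
    broker = TinyBroker()

    # Two workers share the load of one queue.
    worker_a, worker_b = [], []
    broker.consume_queue("tickets", worker_a.append)
    broker.consume_queue("tickets", worker_b.append)
    for n in range(4):
        broker.send_queue("tickets", f"ticket-{n}")

    # Two subscribers each get their own copy of a topic message.
    dashboard, archive = [], []
    broker.subscribe_topic("grid.probe.metricOutput", dashboard.append)
    broker.subscribe_topic("grid.probe.metricOutput", archive.append)
    broker.publish_topic("grid.probe.metricOutput", "CE check OK")

    return worker_a, worker_b, dashboard, archive


if __name__ == "__main__":
    print(demo())
```

A real broker would also hold queue messages until a consumer attaches and could persist them to disk; the sketch leaves that out to keep the two modes easy to compare.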
13
ActiveMQ
Mature open-source implementation of these ideas
  – Top-level Apache project
  – Commercial support available from IONA
  – Widely-used commodity software
Easy to integrate into your code
  – Multiple language + transport protocol support
Good performance characteristics
  – See later …
Work done to integrate into our environment
  – RPMs, YAIM configuration, monitoring and alarms

Example: a Perl consumer using Net::Stomp

    use strict;
    use warnings;
    use Net::Stomp;

    # Connect to the broker's STOMP port.
    my $stomp = Net::Stomp->new(
        { hostname => 'gridmsg102.cern.ch', port => '6163' } );
    $stomp->connect();

    # Subscribe with client-side acknowledgement, one message at a time.
    $stomp->subscribe(
        {   'destination'           => '/topic/grid.probe.metricOutput',
            'ack'                   => 'client',
            'activemq.prefetchSize' => 1,
        }
    );

    # Print each frame, then acknowledge it so the broker sends the next.
    while (1) {
        my $frame = $stomp->receive_frame;
        warn $frame->body;
        print $frame->as_string;
        $stomp->ack( { frame => $frame } );
    }

    # Not reached while the loop above runs forever.
    $stomp->disconnect;
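The multiple-language claim follows largely from how simple the STOMP wire format is: a frame is a command line, "key:value" header lines, a blank line, a body, and a terminating NUL byte. The Python sketch below builds and parses such frames by hand, reusing the destination and headers from the Perl consumer above. The function names encode_frame and decode_frame are hypothetical helpers, not part of any library, and the sketch assumes bodies contain no NUL bytes (otherwise a content-length header would be needed).

```python
def encode_frame(command, headers=None, body=""):
    """Serialise one STOMP frame: command, headers, blank line, body, NUL."""
    lines = [command]
    for key, value in (headers or {}).items():
        lines.append(f"{key}:{value}")
    return ("\n".join(lines) + "\n\n" + body).encode("utf-8") + b"\x00"


def decode_frame(raw):
    """Parse one complete STOMP frame back into (command, headers, body)."""
    if not raw.endswith(b"\x00"):
        raise ValueError("frame is not NUL-terminated")
    head, _, body = raw[:-1].decode("utf-8").partition("\n\n")
    command, *header_lines = head.split("\n")
    headers = {}
    for line in header_lines:
        key, _, value = line.partition(":")
        headers[key] = value
    return command, headers, body


def subscribe_frame():
    """The SUBSCRIBE frame matching the Perl example's subscription."""
    return encode_frame(
        "SUBSCRIBE",
        {
            "destination": "/topic/grid.probe.metricOutput",
            "ack": "client",
            "activemq.prefetchSize": 1,
        },
    )


if __name__ == "__main__":
    print(subscribe_frame())
    print(decode_frame(subscribe_frame()))
```

Any language that can open a TCP socket and write these bytes can talk to the broker, which is why client libraries such as Net::Stomp are small.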
14
ActiveMQ
15
Results: Throughput
[Figure: plot with axes labelled Consumers and Throughput]
16
Vendor tests
From “Optimizing FUSE Message Broker”: http://open.iona.com/resources/collateral/#whitepapers
17
Usages of Messaging
We use it as an ‘integration bus’
  – Use when systems want to share information
      E.g. VO transfer systems publishing data rates to WLCG
It’s another string to our bow
  – When the application model fits well, then use it
      E.g. async communications, broadcast messages
Don’t force applications to use it
  – Have other solutions too
      E.g. “RESTful” web services à la SAM Programmatic Interface
18
‘Standard’ Integration Patterns
The same patterns are repeated in many of the following examples:
  – Gather results at many points
  – Collect the raw results and store in a database
  – Perform some operation on the raw results
      Summarisation, availability calculation, …
  – Publish the summarised results to many clients
      E.g. site monitoring, dashboards, …
  – Store historical data in a database and visualize via web client
We provide ‘standard’ components to make this plug and play for many workflows
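The gather, store, summarise and publish steps can be sketched end to end in a few lines of Python. Everything here is an assumption made for illustration: the run_pipeline function, the raw_result table, the site names, and above all the availability formula (fraction of results with status OK), which is a deliberate simplification and not the availability algorithm used by SAM or GridView. An in-memory SQLite database stands in for the real result store, and plain callables stand in for messaging clients.

```python
import sqlite3


def run_pipeline(sources, clients):
    """Gather raw results, store them, summarise per site, publish to clients."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE raw_result (site TEXT, status TEXT)")

    # Gather results at many points and store the raw rows.
    for source in sources:
        db.executemany("INSERT INTO raw_result VALUES (?, ?)", source())

    # Summarise: availability per site as the fraction of OK results.
    rows = db.execute(
        "SELECT site, AVG(status = 'OK') FROM raw_result "
        "GROUP BY site ORDER BY site"
    ).fetchall()
    summary = dict(rows)
    db.close()

    # Publish the summarised results to every interested client.
    for client in clients:
        client(summary)
    return summary


def demo():
    def regional_probe():
        return [("SITE-A", "OK"), ("SITE-A", "CRITICAL")]

    def site_probe():
        return [("SITE-A", "OK"), ("SITE-A", "OK"), ("SITE-B", "OK")]

    dashboard, site_monitor = [], []
    summary = run_pipeline(
        [regional_probe, site_probe], [dashboard.append, site_monitor.append]
    )
    return summary, dashboard, site_monitor


if __name__ == "__main__":
    print(demo())
```

Each stage only depends on the shape of the data passed along, so any one of them can be swapped for a shared component without touching the others.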
19
VO, ROC, Project & Local monitoring
20
Another application: Usage Reporting
Other main part of monitoring: usage statistics
  – GridFTP transfers, FTS transfers, job records, …
      Used to calculate throughput and reliability
Currently handled in GridView, Dashboards
  – Use messaging system to unite these efforts
Delegate parsing/routing of specific information back to experts
  – L&B, FTS, …
Other integration examples include
  – Accounting
  – GOCDB synchronization
21
Site Management
gLite often doesn’t provide enough management tools
  – Direct feedback from site and service managers
Site managers often write tools themselves
Strategy defined to get these tools to a wide audience
  – Lightning talks
      5-minute presentations on tools people have developed
      Stay for the rest of the session!
  – Publicity of tool development
      E.g. via iSGTW
      Doubling of visitors to gridmap.cern.ch after publishing an article
22
Deployment support
EGEE-SA1 tools project started
Policy being defined now
  – With some ‘early adopter’ projects
      Some of the tools you’ll see in the lightning talks session
Facilities
  – Support in using ETICS
  – Support in writing YAIM
  – Yum repository
  – Documentation repository
Contact us if you’re interested in contributing here
23
Global Roadmap
Covers multi-level monitoring
Roadmaps for other areas (e.g. accounting) in the process of being defined by individual teams
  – And co-ordinated by the OAT
24
Roadmap for tools
Milestone ‘Messaging 1’: August 2008
  – Production-level messaging broker in production. This should have internal failover capabilities, but will not have the WAN failover capabilities of a network of brokers.
Milestone ‘Messaging 2’: December 2008
  – A scalable and reliable network of brokers, deployed over at least 3 sites, is in place.
Milestone ‘Site Monitoring 1’: September 2008
  – A release of the site components for the multi-level monitoring, including packaging and configuration as part of an EGEE middleware release, exists and is ready for deployment to the sites.
Milestone ‘ROC Monitoring 1’: December 2008
  – The ROC components for the multi-site monitoring are ready for deployment to sites.
Milestone ‘ROC Monitoring 2’: February 2009
  – The alarm component has been integrated with the regionalized dashboard.
Milestone ‘ROC Monitoring 3’: July 2009
  – The regional dashboard is now available to be deployed at the ROCs.
25
Roadmap for distributed COD
Milestone ‘rCOD 1’: September 2008
  – 4 ROCs carry out r-COD and 1st line support roles directly. This will be done with a ‘regionalized’ version of the current operations dashboard, and with SAM as the alarm generation system.
Milestone ‘rCOD 2’: April 2009
  – 4 additional ROCs carry out r-COD and 1st line support roles using the regionalized dashboard.
Milestone ‘rCOD 3’: April 2009
  – 2 additional ROCs carry out r-COD and 1st line support roles directly using the new multi-level monitoring framework.
Milestone ‘rCOD 4’: September 2009
  – All 11 ROCs carry out r-COD and 1st line support roles directly. The c-COD is fully established.
Milestone ‘rCOD 5’: December 2009
  – All 11 ROCs carry out r-COD and 1st line support roles using the new multi-level monitoring framework.
26
OAT and new tools
OAT is the body to oversee new tool development
  – In response to needs of sites, ROC, OCC
New projects under investigation
  – SLA Portal
      In response to MSA 1.5 SLA
  – Metrics portal
      In response to MSA 1.3, Activity QA plan
Also new development for multi-level monitoring
  – Improvement of Nagios probes for services
  – Re-engineering of existing SAM probes
  – Re-engineer other existing tools for regional models
      SAMAP, GridView, ...
  – ‘Probe description database’
      Metadata store for probes
27
Take home messages
The OAT is trying to provide tools to improve operations
  – Reduce effort
The OAT is a process
  – We’ve started now
  – There’s still lots to do
Site administrators are needed to contribute
  – With deploying the tools and giving feedback
  – With contributing best-of-breed system management tools
  – Working on design and development of operational tools
Get in touch!
  – Talk to an OAT member
  – Send us mail, join the discussion list
  – Read the strategy document
28
Contacts
Strategy Document: https://edms.cern.ch/document/927171
Contact the team: egee3-operations-automation-core@cern.ch
Discuss Mailing List: egee3-operations-automation-discuss@cern.ch (please join!)
Documentation Site: http://espace.cern.ch/sa1-share/oat/default.aspx (in development)

List of OAT Members
  AP – Joanna Huang
  CE – Emir Imamagic
  CE – Marcin Radecki
  CERN – James Casey
  CERN – John Shade
  DECH – Angela Poschlad
  FR – Cyril L'Orphelin
  FR – Guillaume Cessieux
  IT – Giuseppe Misurelli
  NE – Ronald Starink
  SEE – Antun Balaz
  SWE – Javier Lopez Cacheiro
  UKI – Gilles Mathieu