Monitoring and Alerting Systems

Paolo Veronesi - INFN-CNAF. Meeting of the PON Avviso 1575 Projects with the INFN Grid ROC, 4 July 2008

Outline
* Current Operational Model
  * Regional Operations Center (ROC);
  * Resource Center (RC);
* Functional Areas and Operational tools
  * Repositories of information;
  * Monitoring;
  * Accounting;
  * Reporting;
  * Support;
* Future Operational Model
  * Evolution of the operational tools;
  * Integration issues

Current Operations Model
The current operational model is mainly centralised. Some areas of work are devolved completely to the regions, for example user certification and site and infrastructure support. Other areas are run by teams drawn from across the grid, each led by an individual: the Grid operator on duty (COD), Monitoring, TPM, and GGUS are all centralised in this sense. This does not mean they are all run from or by the OCC at CERN, but at any one time there is a single central team, be it TPM or COD, responsible for the entire grid. The duty rotates around the regions, with one team responsible for a given time period, although backup teams exist. This contrasts with an approach based on regionally autonomous teams, e.g. regional helpdesks, where a region can continue to operate without GGUS and handle internal tickets according to its local requirements and working practices.

Regional Operations Center (ROC)
A ROC consists of a manager or a management team and support staff. ROCs provide a framework of support, to both users and sites, in order to allow them to use the data and computational resources of the grid. The main responsibilities of the ROC are:
* Provide Help Desk facilities (first-level support) either by using GGUS support units to create a regional Help Desk within GGUS, or by providing a regional Help Desk which is interfaced with GGUS;
* Register site administrators in the available Help Desk facilities;
* Provide third-level support by helping in the resolution of advanced and specialized operational problems that cannot be solved by site administrators. If necessary, the ROC will propagate and follow up problems with higher-level operational or development teams;
* Ticket follow-up (ensure that sites work on tickets opened against them);
* Respond to tickets from sites in a timely manner.
ROCs manage and support the deployment of gLite middleware on sites, and are also responsible for registering new sites.

Resource Center (RC)
The Site (Resource Centre) provides the actual computational resources, such as Computing Elements (CE), Storage Elements (SE), and middleware services. Sites provide second-level support, have one or several site administrators, and have a designated site security officer. Sites are expected to:
* adhere to the Operational Procedures described in the Operations Procedures Manual;
* maintain accurate information on the services they provide in GOCDB;
* adhere to the Grid Site Operations Policy, and other policy documents referenced therein;
* adhere to the requirements stated in the Security and Availability Policy document;
* adhere to the criteria and metrics that are defined in the ROC-Site Service Level Description (SLD);
* deploy supported versions of gLite (or compatible) middleware;
* respond to tickets in a timely manner.

Functional Areas Operational tools Ticketing System

The GOC Database (GOCDB) is a core service within EGEE and is used by many tools for monitoring and accounting purposes. It also contains essential static information about the sites such as:
* site name;
* location (region/country);
* list of responsible people and contact details (site administrators, security managers);
* list of all services running on the nodes (CE, SE, UI, RB, BDII etc.);
* phone numbers.
Site administrators have to enter all scheduled downtimes into the GOCDB. The information provided by GOCDB is an important source during problem follow-up and the escalation procedure.
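As a rough illustration of the kind of static record described above, the following sketch models a site entry together with its scheduled downtimes. The class names, field names and sample values (`SiteRecord`, `Downtime`, `EXAMPLE-SITE`, the host names) are hypothetical and do not reflect the real GOCDB schema or API.

```python
from dataclasses import dataclass, field
from datetime import datetime


@dataclass
class Downtime:
    """A scheduled downtime entered by a site administrator."""
    start: datetime
    end: datetime
    description: str = ""

    def covers(self, moment: datetime) -> bool:
        # Half-open interval: the downtime ends just before `end`.
        return self.start <= moment < self.end


@dataclass
class SiteRecord:
    """Static site information of the kind listed above (hypothetical layout)."""
    name: str
    region: str
    country: str
    contacts: dict[str, str] = field(default_factory=dict)        # role -> contact
    services: dict[str, list[str]] = field(default_factory=dict)  # node -> service types
    downtimes: list[Downtime] = field(default_factory=list)

    def in_scheduled_downtime(self, moment: datetime) -> bool:
        return any(d.covers(moment) for d in self.downtimes)


site = SiteRecord(
    name="EXAMPLE-SITE",
    region="Italy",
    country="IT",
    contacts={"site administrator": "admin@example.org"},
    services={"ce01.example.org": ["CE"], "se01.example.org": ["SE"]},
)
site.downtimes.append(
    Downtime(datetime(2008, 7, 7, 8), datetime(2008, 7, 7, 18), "middleware upgrade")
)

print(site.in_scheduled_downtime(datetime(2008, 7, 7, 12)))
print(site.in_scheduled_downtime(datetime(2008, 7, 8, 12)))
```

A monitoring tool could use a check like `in_scheduled_downtime` to avoid raising alarms for failures that fall inside a declared downtime.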

GOCDB View

SAM tool
SAM EGEE and SAM PON
* Common framework
* Concept of a common VO for tests (OPS vs PONCERT)
* Verify which tests are critical
* Verify the source of the resource list for the PON projects (EGEE uses the GOCDB)

GSTAT
* GSTAT - IT
* GSTAT - GRISU

CPU accounting
* CESGA
* HLRMon
* HLR2APEL

SAM test and reporting tool
* GridMap
* SAM Admin

Availability/Reliability
Targets come from the LCG MoU for the T1 and T2 sites and from the SLD for the other sites.
* TIER1: Availability 97%, Reliability ?
* TIER2: Availability 95%, Reliability ?
* TIER3: Availability 70%, Reliability 75%
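The slide gives targets but not the formulas behind them. The sketch below assumes the commonly used definitions, in which availability is uptime over the whole period and reliability is uptime over the period minus scheduled downtime; these definitions are an assumption here, not something stated in the presentation. The function names and sample hours are illustrative only.

```python
# Targets quoted on the slide; None marks the values shown there as "?".
TARGETS = {
    "TIER1": {"availability": 0.97, "reliability": None},
    "TIER2": {"availability": 0.95, "reliability": None},
    "TIER3": {"availability": 0.70, "reliability": 0.75},
}


def availability(up_hours: float, total_hours: float) -> float:
    """Assumed definition: fraction of the whole period the site was up."""
    return up_hours / total_hours


def reliability(up_hours: float, total_hours: float, scheduled_down_hours: float) -> float:
    """Assumed definition: uptime over the time not covered by scheduled downtime."""
    return up_hours / (total_hours - scheduled_down_hours)


def meets_targets(tier: str, up_hours: float, total_hours: float,
                  scheduled_down_hours: float) -> dict[str, bool]:
    """Compare measured figures with the slide targets, skipping unknown ones."""
    target = TARGETS[tier]
    result = {"availability": availability(up_hours, total_hours) >= target["availability"]}
    if target["reliability"] is not None:
        measured = reliability(up_hours, total_hours, scheduled_down_hours)
        result["reliability"] = measured >= target["reliability"]
    return result


# A 30-day month (720 h) with 540 h of uptime and 48 h of scheduled downtime.
print(availability(540, 720))
print(round(reliability(540, 720, 48), 4))
print(meets_targets("TIER3", 540, 720, 48))
print(meets_targets("TIER2", 540, 720, 48))
```

Under these assumed definitions, scheduled downtime lowers availability but not reliability, which is why the two figures can carry different targets.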

GridView

Ticketing system
PON TICKET SYSTEMS
Each site has a local ticket system:
* CRESCO project: based on Xoops
* CYBERSAR project: based on Xoops
* PI2S2 project: based on Xoops
* SCoPE project: based on the proprietary HDA system from PATH
Grid-IT/EGEE TICKET SYSTEMS
A single ticket system for all Grid-IT sites, based on Xoops/Xhelp and interfaced with the European ticket system GGUS.

Current and Future Current Operational Model Future Operational Model
In this model, alarms no longer come from a central monitoring system like SAM, but from ROC or site local monitoring systems. When an alarm is raised, it is sent directly to the site, and at the same time the first line support sees it. This gives the site a chance to respond to the alarm before it even becomes a ticket within a regional or global alarm system. If necessary, the first line support team can immediately start helping the site to fix the problem. The regional operations dashboard is used to keep track of who “owns” a problem at a given point in time. The site has a certain time period (e.g. 6 hours) within which it can fix the underlying problem (possibly with support from the regional first line support team). If the problem is still open when this period expires, the regional COD team (r-COD) looks at the alarm. They then track the problem in whatever way they wish, e.g. by raising a local ticket to the site asking it to follow up on the alarm. After a second, longer time period (e.g. 2-3 weeks) a GGUS ticket is raised automatically from the regional operations dashboard. This would normally happen only when a site had failed to meet an SLA placed upon it, and would allow the breach of SLA to be followed up at project level.
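The escalation path described above can be sketched as a simple function of how long an alarm has been open. The two deadlines use the example values from the text (6 hours, and the lower bound of 2-3 weeks); measuring both from the moment the alarm was raised is an assumption, and the function and constant names are hypothetical.

```python
from datetime import timedelta

# Example values taken from the text; real deadlines would be set by policy.
SITE_DEADLINE = timedelta(hours=6)   # site (plus first line support) works on it
GGUS_DEADLINE = timedelta(weeks=2)   # after this a GGUS ticket is raised


def alarm_owner(age: timedelta) -> str:
    """Return who is expected to handle an unresolved alarm of the given age.

    Assumption: both deadlines are measured from the moment the alarm was raised.
    """
    if age < SITE_DEADLINE:
        return "site"
    if age < GGUS_DEADLINE:
        return "r-COD"
    return "GGUS"


for age in (timedelta(hours=1), timedelta(days=1), timedelta(weeks=3)):
    print(age, "->", alarm_owner(age))
```

A regional operations dashboard could evaluate such a rule periodically for every open alarm in order to show the current owner and to trigger the automatic GGUS ticket.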

Nagios display

[Diagram: Nagios hierarchy. Labels: GLOBAL REPORTING; REGIONAL NAGIOS; NAGIOS Multi RC; NAGIOS RC; RC]

Integration issues
Grid-IT
* Middleware gLite based;
* GSTAT and SAM test (=>NAGIOS);
* Accounting (CPU);
* Accounting (STORAGE);
* GOCDB;
* Reporting;
* Ticketing;
PON
* Middleware gLite based;
* GSTAT and SAM test (=>NAGIOS);
* Accounting (CPU);
* Accounting (STORAGE);
* Repository of information;
* Reporting;
* Ticketing;
EGI

Docs, tools and links
WEB PORTAL
* The Italian Grid Infrastructure
* Production Grid
* EGEE Operation Portal
DOC
* Operational Procedures Manual
* Operation Automation Strategy
* Service Level Definition
TOOL
* Italian Support System
* Global Grid User Support
* GOCDB
* SAM test
* GridMap
* SAM Admin
* HLRMon: https://dgas.cnaf.infn.it/hlrmon/
* EGEE Accounting Portal (CESGA)

Backup Slide

Operations Management (OCC)
The operations management team provides overall coordination and management of operations at a project level. In the context of operations automation, the OCC are the primary consumers of reports coming from usage accounting and SLA monitoring.

Monitoring System Operators (SAM)
Currently, a dedicated team runs the operational components of SAM. These are hosted centrally at CERN. This includes gLite User Interface (UI) systems for the execution framework and dedicated gLite resource broker (RB) and BDII nodes for the job submission tests. Due to its central nature, a highly scalable system is needed for the visualization tools and programmatic interfaces. Alarms generated in SAM are passed to the COD team via the operations dashboard for diagnosis and follow-up.

Grid Operators (COD)
The COD team is responsible for tracking problems with grid services, coordinating the diagnosis, and monitoring the problems through to resolution. This has to be done in cooperation with the Regional Operations Centres to allow for a hierarchical approach and overall management of tasks. The operations team continually monitors the production infrastructure using the results from SAM. This is currently a rotating shift carried out by members of the project from different ROCs. The team works on the problems from initial detection through to diagnosis and solution. An escalation procedure is used in case the problem is not solved on time by the site and the ROC.

Global Grid User Support (GGUS)
The Global Grid User Support (GGUS) acts as a portal or helpdesk for users and supporters reporting and solving grid problems, as well as an integration platform for existing help desk systems. GGUS is fully connected with the Italian Ticketing System based on Xoops/XHelp (see next presentation for details).

FUTURE OPERATIONAL MODEL
EGEE is already a mature operational infrastructure with a working operational model. Therefore, it is logical that the future operational model will be based on the major components of the existing model. Changes to the model are driven by the project needs of increased automation, and a move to a more distributed mode of operations. Key points in moving to such a distributed model are:
* Responsibility for daily operations moves from COD to ROCs and sites. Over the course of EGEE-III, daily operations, currently carried out as a central role by the COD, should become devolved to a set of regional teams at the ROCs. These teams will take full responsibility for monitoring their regional section of the grid, and ensuring appropriate availability and reliability for these sites.
* Faster and more direct notification of problems to sites. Site administrators are the most appropriate people to fix problems at their site. Therefore, the quicker that a problem notification gets to a site administrator, the quicker it can be resolved.
* Automation of oversight and manual ticket creation processes. Currently, the COD is a manpower-intensive task. Many aspects of this, such as ticket creation and oversight reporting, can and should be automated to reduce (and possibly remove completely) the central COD team.

Central or Regional?
Although regional autonomy is a goal, it is not necessarily useful to move all operational tools to a fully distributed architecture. There are different architectures that tools could use, for example:
* Central: there is logically a single central instance of the tool (disregarding multiple physical instances for reliability and scalability reasons). Many of the tools currently in use, such as the GOCDB or the operations portal, follow this model;
* Partitioned: the system is partitioned into several logical instances, each of which deals with a particular subset of the information/work carried out. There is no interaction between the partitions;
* Partitioned with central aggregation/co-ordination: the system is again partitioned into separate instances, but a central component aggregates the data and co-ordinates the work of the partitioned instances;
* Distributed: the system is again partitioned, but individual partitioned instances can communicate directly with each other to share information or work.

Central or Regional?
For each of the operational tools we need to decide which architecture is the most appropriate. This choice must consider technical and social constraints such as ease of implementation and deployment, and the manpower required to operate the tools. The wish for regional autonomy should also be a factor in the decision. Some operational tools and systems lend themselves readily to a particular architecture, while for others there is still some choice to be made:
* GOCDB: The responsibility for the maintenance of static information storage lies with the site. However, many tools require an overview of all sites. Partitioned with central aggregation would work well, with a separate GOCDB instance at each ROC, for example. Another alternative would be distributed, with some sort of caching layer at each ROC to store data retrieved from other ROCs.
* HelpDesk: For users, a single point of entry to submit tickets is very important, so a central architecture is most appropriate. A more complicated alternative could be provided by having all users communicate with their local ticketing system in the region, along with a mechanism to pass tickets between them. This could also be of help when considering integration with another grid project that already has a "central" ticketing system.
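To make the "partitioned with central aggregation" option for GOCDB more concrete, here is a minimal sketch in which each ROC keeps its own site table and a central component only merges them into a read-only overview. The function name, the ROC and site names, and the record layout are all hypothetical; this is not how GOCDB is actually implemented.

```python
def aggregate_sites(regional_instances: dict[str, dict[str, dict]]) -> dict[str, dict]:
    """Merge per-ROC site tables into one overview keyed by site name.

    Each site stays owned by its ROC; the overview only records which ROC
    the entry came from. A site name appearing in two ROCs is treated as an error.
    """
    overview: dict[str, dict] = {}
    for roc, sites in regional_instances.items():
        for site_name, info in sites.items():
            if site_name in overview:
                raise ValueError(f"site {site_name!r} is registered in more than one ROC")
            overview[site_name] = {**info, "roc": roc}
    return overview


# Hypothetical regional instances, one table per ROC.
regional = {
    "ROC-A": {"SITE-1": {"services": ["CE", "SE"]}},
    "ROC-B": {"SITE-2": {"services": ["CE"]}, "SITE-3": {"services": ["SE", "BDII"]}},
}

overview = aggregate_sites(regional)
print(sorted(overview))
print(overview["SITE-2"]["roc"])
```

In such a layout, tools that need an overview of all sites read the merged view, while updates are made only in the regional instance that owns the site.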

Central or Regional?
* Site monitoring and alarming: Generation of site alarms, for instance based on Nagios, is an example of Partitioned. Here the data for each region is completely independent of all other regions.
* Grid Accounting: This is a perfect example of Partitioned with Aggregation: each region collects its own accounting data, and then sends summary information upwards to a central project-level aggregation system.
Italian ROC accounting is based on DGAS (DataGrid Accounting System). Accounting data are collected at site level using an HLR server. A second-level HLR server collects data from the other HLRs at regional level. Reports are available through a Web interface (hlrmonitor). Data are also sent to the EGEE Accounting Portal, where they are available through the CESGA portal.
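The accounting flow above (site-level collection, regional second-level collection, summaries sent upwards) can be sketched as follows. This is only an illustration of the pattern: the record fields (`vo`, `cpu_hours`), the function names and the sample sites are assumptions and do not correspond to the DGAS or HLR interfaces.

```python
from collections import defaultdict


def regional_collect(site_records: dict[str, list[dict]]) -> list[dict]:
    """Second-level step: gather usage records from every site-level server."""
    collected = []
    for site, records in site_records.items():
        for rec in records:
            collected.append({**rec, "site": site})
    return collected


def summarise(records: list[dict]) -> dict[tuple[str, str], float]:
    """Reduce job-level records to CPU hours per (site, VO) for upward reporting."""
    totals: dict[tuple[str, str], float] = defaultdict(float)
    for rec in records:
        totals[(rec["site"], rec["vo"])] += rec["cpu_hours"]
    return dict(totals)


# Hypothetical job-level records held by two site-level servers.
site_hlrs = {
    "SITE-1": [{"vo": "ops", "cpu_hours": 1.5}, {"vo": "ops", "cpu_hours": 2.5}],
    "SITE-2": [{"vo": "ops", "cpu_hours": 3.0}, {"vo": "other", "cpu_hours": 4.0}],
}

summary = summarise(regional_collect(site_hlrs))
for key in sorted(summary):
    print(key, summary[key])
```

Only the summary leaves the region in this sketch, which matches the idea that detailed data stays partitioned while the central system works on aggregated figures.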

Monitoring System Operators (SAM)
* SAM test
* GridMap
* SAM Admin
* GridView
