Monitoring in EGEE: Automation & Regionalization with a View to EGI
Torsten Antoni (KIT), James Casey (CERN), Sabine Reißer (KIT)
Torsten.Antoni@kit.edu


1 Monitoring in EGEE: Automation & Regionalization with a View to EGI
Torsten Antoni (KIT), James Casey (CERN), Sabine Reißer (KIT)

2 Expertise / Requirements
[Diagram: EGEE, NGI D-Grid, EGI, other NGIs, WLCG]
2 | Antoni, Casey, Reißer | D-Grid Monitoring Workshop | Dresden | 26./

3 Service Provider
[Diagram: NGI D-Grid, EGI, other NGIs]

4 Overview
- Operations automation team
- Messaging as an integration platform
- Multi-level site monitoring
- Futures
- Conclusion

5 Operations Automation Team
- Sub-task within EGEE SA1
  - “Reduce the level of effort required for operations by improving automation of operational procedures”
  - “Prepare for a NGI structured environment by devolving the operations and management fully to the ROCs within EGEE-III”
  - “Keep the existing production-level grid service running with no interruptions”
- Membership
  - Representation from all EGEE regions
  - Members represent ROC, COD, sites, tools developers
  - Some people wear more than one hat…

6 Mandate
- Define evolution strategy for operational tools
  - Key is devolution of tools to ROCs where appropriate
  - If, when, how, …
- Define work plan for operational tools development
- Ensure work on new operational tools is within the agreed strategy
  - Bring into the mainstream existing effort which is on the periphery
- Manage the deployment process of tools

7 [Overview of tools by area]
- Information Model
  - GOCDB
- Active Monitoring
  - SAM
  - Site Monitoring (Nagios)
  - gStat
- Operations Tools
  - CIC Portal
  - COD Dashboard
    - In collaboration with COD
  - SLA compliance monitoring
    - In collaboration with SLA team
- Accounting
  - Accounting (APEL, DGAS)
- Usage Monitoring
  - Gridview (Job state, Data Transfer)
- Reporting
  - Availability Reports (Gridview)
  - Accounting Reports (CESGA Portal)
- User Support
  - GGUS

8 Areas of Work
- Deployment of site monitoring tools to EGEE sites
  - Based on Nagios
- Devolve central monitoring tools to regional instances
  - Where appropriate
  - Prime candidate is SAM
- Create architecture for new shared infrastructure to support operational tools
  - Messaging system
  - Standard mechanisms for failover, robustness
- ‘Topology Database’
  - Improvement of schema in GOCDB
  - Integration with VO/ROC topology sources
  - Standards to interoperate (SEE-GRID-SCI, OSG, NGIs)

9 Possible tool development areas
- Resource Allocation / SLA Portal
  - How do we report on SLAs, resources provided, …
- Configuration Information
  - VOs + operational management poll sites for configuration info
  - Which version of lcg_utils is deployed? How many specints?
  - Currently gathered via SAM tests…
  - But no real way to query, plot, report
- Infrastructure
  - Failover best-practices/procedures
  - Monitoring of operational tools
  - Common Infrastructure – e.g. Messaging

10 We’ve got an integration problem!

11 Messaging has the answer…
- We need:
  - Loose coupling of systems
  - Distributed components
  - Reliable delivery of messages
  - Standard methods of communication
  - Flexibility to add new producers and consumers of the information without having to reconfigure everything
- Message Oriented Middleware provides this
  - And is widely used in similar scenarios
- Chosen solution: Apache ActiveMQ
  - Top-level Apache project
  - Mature, stable product; widespread adoption
  - Commercial support available
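The "loose coupling" and "add new consumers without reconfiguring everything" requirements can be illustrated with a minimal sketch. This is a toy in-memory broker, not the ActiveMQ API; the class `ToyBroker`, the topic name, and the message fields are all hypothetical, and the sketch makes no attempt at reliable delivery or persistence.

```python
from collections import defaultdict
from typing import Callable, Dict, List

Handler = Callable[[dict], None]


class ToyBroker:
    """In-memory stand-in for a message broker (not the ActiveMQ API)."""

    def __init__(self) -> None:
        self._subscribers: Dict[str, List[Handler]] = defaultdict(list)

    def subscribe(self, topic: str, handler: Handler) -> None:
        # Consumers register interest in a topic; producers never see them.
        self._subscribers[topic].append(handler)

    def publish(self, topic: str, message: dict) -> int:
        # Hand a copy to every current subscriber; return the delivery count.
        handlers = list(self._subscribers[topic])
        for handler in handlers:
            handler(dict(message))
        return len(handlers)


broker = ToyBroker()
alarms: List[dict] = []
archive: List[dict] = []

# Hypothetical topic name and message fields, for illustration only.
TOPIC = "grid.probe.result"
broker.subscribe(TOPIC, alarms.append)
broker.publish(TOPIC, {"site": "EXAMPLE-SITE", "status": "CRITICAL"})

# A second consumer is added later; the publishing code is unchanged.
broker.subscribe(TOPIC, archive.append)
delivered = broker.publish(TOPIC, {"site": "EXAMPLE-SITE", "status": "OK"})
```

The producer only knows the topic name, so adding the archive consumer required no change on the publishing side. A real broker network would add what this toy omits: persistence, redelivery, and failover.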

12 Broker at the centre ..
- Reliability and persistence of messaging built into the broker network
- Mitigates the single points of failure we’ve had with previous solutions

13 … or some of them…
- Not a silver bullet
  - Still can end up with spaghetti
- Tight specification of interaction of components
  - Message format specifications
  - Standard metadata schema
  - Message Queue naming schemas
  - Protocols
- System management is key
  - You’ve got code for free from the messaging system
  - But you need to write your management layer
    - Component co-ordination
    - Configuration
    - Message tracing
    - Debugging
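A minimal sketch of what a "tight specification" might look like in practice: a fixed metadata schema for every message, plus a check on destination names. The field names, the status values, and the dot-separated naming pattern are assumptions made for illustration; they are not the schema or naming convention actually adopted in EGEE.

```python
import json
import re
from dataclasses import asdict, dataclass

# Hypothetical naming scheme: lowercase dot-separated segments,
# e.g. "grid.probe.result". Not an EGEE or ActiveMQ convention.
DESTINATION_PATTERN = re.compile(r"^[a-z0-9_]+(\.[a-z0-9_]+)+$")
ALLOWED_STATUSES = {"OK", "WARNING", "CRITICAL", "UNKNOWN"}


@dataclass(frozen=True)
class ProbeResult:
    """Hypothetical standard metadata carried by every probe result."""
    site: str
    service: str
    probe: str
    status: str
    timestamp: str  # ISO 8601, UTC


def validate_destination(name: str) -> str:
    # Reject destinations that do not follow the agreed naming scheme.
    if not DESTINATION_PATTERN.match(name):
        raise ValueError(f"destination {name!r} violates the naming scheme")
    return name


def encode(result: ProbeResult) -> str:
    # Serialize to JSON with a stable key order; refuse unknown statuses.
    if result.status not in ALLOWED_STATUSES:
        raise ValueError(f"unknown status {result.status!r}")
    return json.dumps(asdict(result), sort_keys=True)


def decode(payload: str) -> ProbeResult:
    data = json.loads(payload)
    # Raises TypeError if a field is missing or unexpected.
    return ProbeResult(**data)
```

Rejecting malformed messages and badly named destinations at the edge is one way to keep a growing set of producers and consumers from drifting into the "spaghetti" the slide warns about.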

14 Multi-level site monitoring
- Current site testing model is based on remote probing of a site to determine status/availability
  - And a central operations team (COD) to follow up on problems
- Service Availability Monitoring (SAM) run from CERN
  - Tests all EGEE sites (~280)
  - COD raises GGUS tickets from SAM alarms
  - Availability calculated centrally
- Problems:
  - If SAM goes down, there’s no testing
  - SAM operations effort just keeps growing!
  - Sites are ‘informed’ remotely of problems
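To make "availability calculated centrally" concrete, here is a deliberately simple sketch: availability per site as the fraction of test results with status OK. This is an assumption for illustration, not the algorithm SAM or Gridview actually uses; the function name, the status strings, and the choice to ignore UNKNOWN results are all hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def availability_by_site(results: Iterable[Tuple[str, str]]) -> Dict[str, float]:
    """Return, per site, the fraction of counted test results that are "OK".

    `results` yields (site, status) pairs. Results with status "UNKNOWN"
    are skipped so that a monitoring outage does not count against a site.
    Sites with only "UNKNOWN" results do not appear in the output.
    """
    ok = defaultdict(int)
    counted = defaultdict(int)
    for site, status in results:
        if status == "UNKNOWN":
            continue
        counted[site] += 1
        if status == "OK":
            ok[site] += 1
    return {site: ok[site] / counted[site] for site in counted}


# Invented sample data for two placeholder sites.
sample = [
    ("SITE-A", "OK"), ("SITE-A", "OK"), ("SITE-A", "CRITICAL"), ("SITE-A", "OK"),
    ("SITE-B", "OK"), ("SITE-B", "UNKNOWN"),
]
report = availability_by_site(sample)
```

Because the calculation needs every site's results in one place, it depends on the central service being up, which is the first problem the slide lists.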

15 Solution
- WLCG made a push to add site fabric monitoring at sites based on Nagios
  - A set of common probes
  - A configuration tool which queries GOCDB & SAM on structure of grid site
- Re-use of this effort now in EGEE-III
  - Local reporting of problems leads to quicker resolution.
- Next step is to ‘regionalize’ SAM
  - Run a regional SAM instance, which does the same tests as SAM does now, but only for the region
  - Better scaling
  - Regional autonomy
- Have a central database of results
  - To be used for reporting
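The idea of a configuration tool that turns grid topology into Nagios configuration, restricted to one region, can be sketched as follows. The topology records, host names, region labels, and check command names are invented; the `generic-host` and `generic-service` templates are assumed to be defined elsewhere in the Nagios configuration. This is not the actual WLCG configuration tool.

```python
from typing import Dict, List

# Hypothetical topology records, standing in for what a tool might
# assemble from GOCDB and SAM. Hosts, sites and regions are invented.
TOPOLOGY: List[Dict[str, str]] = [
    {"region": "REGION-1", "site": "SITE-A", "host": "ce01.site-a.example.org", "service": "CE"},
    {"region": "REGION-1", "site": "SITE-A", "host": "ce01.site-a.example.org", "service": "Site-BDII"},
    {"region": "REGION-1", "site": "SITE-A", "host": "se01.site-a.example.org", "service": "SE"},
    {"region": "REGION-2", "site": "SITE-B", "host": "ce01.site-b.example.org", "service": "CE"},
]

# Hypothetical mapping from service type to a Nagios check command name.
CHECK_COMMANDS = {"CE": "check_ce", "SE": "check_se", "Site-BDII": "check_bdii"}


def nagios_objects(nodes: List[Dict[str, str]], region: str) -> str:
    """Render Nagios host and service definitions for one region only."""
    blocks: List[str] = []
    seen_hosts = set()
    for node in nodes:
        if node["region"] != region:
            continue  # a regional instance tests only its own sites
        host = node["host"]
        if host not in seen_hosts:
            # Define each host once, even if it runs several services.
            seen_hosts.add(host)
            blocks.append(
                "define host {\n"
                "    use        generic-host\n"
                f"    host_name  {host}\n"
                f"    address    {host}\n"
                "}"
            )
        blocks.append(
            "define service {\n"
            "    use                  generic-service\n"
            f"    host_name            {host}\n"
            f"    service_description  {node['service']}\n"
            f"    check_command        {CHECK_COMMANDS[node['service']]}\n"
            "}"
        )
    return "\n\n".join(blocks)


config = nagios_objects(TOPOLOGY, "REGION-1")
```

Generating the configuration from the topology source, rather than editing it by hand, is what lets the same tests run unchanged in each region.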

16 SAM – Current Architecture
- Complex and centralized!

17 Migrating to the regions

18 Nagios-based Grid Monitoring
- Monitoring CRO-GRID Infrastructure ( )
  - Globus Toolkit Pre-WS & WS, UNICORE, other services
  - active recovery of services
- Monitoring EGEE resources in Central Europe (CE)
  - core services since mid 2006
  - all CE sites for 1st line support since September 2006
- Grid Services Monitoring (GSM) WG
  - site monitoring prototype, mid 2007
  - (egee.srce.hr) (CERN-PPS)

19 Site Monitoring Prototype
[Architecture diagram. Components: monitoring server, site admins, MyProxy, site BDII, site nodes (CE, SE, LFC). Labelled interactions: get Nagios results, get site status, issue alarms, probe descriptions, get site’s & nodes information, get nodes information, live node checks, get remote results, refresh proxy, get VOMS proxy, service checks.]

20 Grid Probes
- Provided by SRCE, CERN, OSG
- Security facilities & services
  - CA distribution, Certificate lifetime, MyProxy
- Monitoring & information services
  - R-GMA, BDII, MDS, GridICE
- Job management services
  - Globus Gatekeeper, RB, WMS, WMProxy, Job matching
- File management services
  - GridFTP, SRM, DPNS, LFC, FTS
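As an illustration of what a probe such as "Certificate lifetime" might compute, here is a minimal sketch following the Nagios plugin convention of exit codes 0 (OK), 1 (WARNING), 2 (CRITICAL), and 3 (UNKNOWN). The function name and the 30-day and 7-day thresholds are assumptions, not taken from the SRCE, CERN, or OSG probes.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional, Tuple

# Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def check_cert_lifetime(
    not_after: Optional[datetime],
    now: datetime,
    warn: timedelta = timedelta(days=30),  # hypothetical threshold
    crit: timedelta = timedelta(days=7),   # hypothetical threshold
) -> Tuple[int, str]:
    """Map a certificate's expiry time to a Nagios status code and message."""
    if not_after is None:
        return UNKNOWN, "UNKNOWN - could not read certificate expiry"
    remaining = not_after - now
    if remaining <= timedelta(0):
        return CRITICAL, "CRITICAL - certificate has expired"
    if remaining < crit:
        return CRITICAL, f"CRITICAL - certificate expires in {remaining.days} days"
    if remaining < warn:
        return WARNING, f"WARNING - certificate expires in {remaining.days} days"
    return OK, f"OK - certificate valid for {remaining.days} more days"


# Example evaluation against a fixed reference time.
reference = datetime(2009, 1, 1, tzinfo=timezone.utc)
code, message = check_cert_lifetime(reference + timedelta(days=20), reference)
```

A real probe would read the expiry time from the certificate, print the message on one line, and exit with the returned code so that Nagios can pick up the status.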

21 Current Status
- Three sets of standard probes integrated
  - SRCE, CERN, OSG
- Two external monitoring systems
  - SAM, ENOC DownCollector
- Several deployments
  - CERN-PPS, SRCE, NIKHEF, PIC, IN2P3, ScotGrid
- RPMs in apt and yum repository
- Installation and configuration manual
- More info

22 Futures
- We have worked on how to devolve centralized probing of services & sites
- Still not clear how to devolve:
  - GOCDB-like databases
  - CIC operations
- Move towards distributed 1st level support with a thin central co-ordination body
  - Ticketing and alarm follow-up
- An NGI should be able to host all monitoring services itself by end of EGEE-III
  - Or choose who they pay to provide it for them

23 Summary
- EGEE-III will try to devolve the operations of the grid as much as possible to regional entities
- Clear strategy for SAM
  - And integration architecture being defined
  - Other systems will follow
- Commodity components will play a vital role
  - Stop developing new solutions
  - Reuse existing solutions, e.g. Nagios, ActiveMQ, …
- An NGI should be able to host all monitoring services itself by end of EGEE-III
  - And still interoperate at an EGI level for operations

