Monitoring: problems, solutions, experiences


1 Monitoring: problems, solutions, experiences
Enrico Fattibene INFN-CNAF Rimini, 09/05/2007 Workshop CCR

2 Outline
Local vs. Grid monitoring
The GridICE + LEMON approach
Storage monitoring
Experiment dashboard
Enrico Fattibene - Workshop CCR - Rimini, May 9 2007

3 Why monitoring?
We need monitoring functionalities:
  to observe the composition, state and features of available resources
  to analyze their behavior and performance
  to detect and prevent fault situations
Two different monitoring domains:
  Local monitoring: the domain is defined by an administrative site
  Grid monitoring: the domain includes many geographically dispersed administrative sites

4 Terms and concepts
Entity: any networked and useful resource having a considerable lifetime (e.g. CPU, RAM, disk space, etc.).
Measurement: the process of assigning numbers that reflect relationships among the attributes of the phenomena being observed.
Measurement value: an estimate of an observable physical value.
Sensor: a process monitoring an entity and generating measurement values.

5 The monitoring process
Generation: sensors inquiring entities and encoding the measurement values according to a schema
  Dynamics: e.g., fairly static (software and hardware configuration) or dynamic (current processor load)
  Timing: e.g., periodic or on demand
Processing: filtering or aggregation of the measurement values according to predefined criteria
Distributing: transmission of the measurement values from the source to any interested parties (data delivery model: push vs. pull; periodic vs. aperiodic)
Presenting: processing and abstracting the number of received measurement values in order to enable the consumer to draw conclusions about the operation of the monitored system
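
The four stages of the monitoring process can be pictured as a chain of small functions. The following is only a minimal sketch: the `Measurement` record, the function names and the use of a plain average are illustrative assumptions, not part of GridICE or LEMON.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable, Dict, Iterable, List


@dataclass(frozen=True)
class Measurement:
    entity: str       # monitored resource, e.g. a host name
    metric: str       # what was measured, e.g. "cpu_load"
    value: float
    timestamp: float


def generate(sensors: Dict[str, Callable[[], float]], entity: str, now: float) -> List[Measurement]:
    """Generation: query each sensor and encode the result in a common schema."""
    return [Measurement(entity, metric, read(), now) for metric, read in sensors.items()]


def process(samples: Iterable[Measurement], metric: str) -> float:
    """Processing: filter by metric, then aggregate (here, a plain average)."""
    return mean(m.value for m in samples if m.metric == metric)


def distribute(summary: Dict[str, float], consumers: List[Callable[[Dict[str, float]], None]]) -> None:
    """Distributing: push model, every registered consumer receives the summary."""
    for consumer in consumers:
        consumer(summary)


def present(summary: Dict[str, float]) -> str:
    """Presenting: render the summary for a human reader."""
    return ", ".join(f"{name}={value:.2f}" for name, value in sorted(summary.items()))


# Two hypothetical worker nodes, each with one fake sensor.
samples = generate({"cpu_load": lambda: 0.5}, "wn01", 0.0)
samples += generate({"cpu_load": lambda: 1.5}, "wn02", 0.0)
received: List[Dict[str, float]] = []
distribute({"cpu_load": process(samples, "cpu_load")}, [received.append])
print(present(received[0]))
```

A pull model would invert `distribute`: consumers would call the source when they want data instead of being called by it.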

6 Monitoring goals
Collecting information in order to describe status and characteristics of available resources
Enable retrospective analysis
Ability to detect faulty situations and prevent them
Dynamic aggregation of information about resources and services using different dimensions
Event handler and notification system

7 Local vs. Grid: differences
                        Local                       Grid
Network                 LAN                         WAN
Privileges              Direct access to resources  No direct access
Scope                   Single site                 Many sites
Resource quantity       O(100), O(1000)             O(10000)
Resource heterogeneity  Low                         High
User types              Local admins                Local, Grid admins, VO managers and users

8 Local + Grid monitoring
Diagram, from top to bottom:
Grid-level aggregation and presentation
Site publisher (one per site)
Site collector (one per site)
Entities (local monitoring)
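
The layering of local and Grid monitoring can be sketched as two aggregation steps: each site reduces its own entities to a summary, and the Grid level only ever sees those summaries. The field names (`hosts`, `total_load`, `mean_load`) are assumptions chosen for illustration.

```python
from typing import Dict


def site_collect(entity_loads: Dict[str, float]) -> Dict[str, float]:
    """Site collector: reduce per-entity values to one site summary."""
    loads = list(entity_loads.values())
    return {"hosts": len(loads), "total_load": sum(loads)}


def grid_aggregate(site_summaries: Dict[str, Dict[str, float]]) -> Dict[str, float]:
    """Grid level: combine the summaries published by each site."""
    hosts = sum(s["hosts"] for s in site_summaries.values())
    total = sum(s["total_load"] for s in site_summaries.values())
    return {
        "sites": len(site_summaries),
        "hosts": hosts,
        "mean_load": total / hosts if hosts else 0.0,
    }


# Hypothetical sites and hosts.
published = {
    "SITE-A": site_collect({"h1": 1.0, "h2": 3.0}),
    "SITE-B": site_collect({"h3": 2.0}),
}
grid_view = grid_aggregate(published)
```

Publishing totals rather than per-site averages lets the Grid level compute a correct overall mean even when sites differ in size.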

9 GridICE + LEMON approach

10 GridICE
a distributed monitoring tool for Grid systems
started in late 2002 (EU-DataTAG project)
is evolving in the context of EU-EGEE
fully integrated with the gLite middleware (and previously with LCG)
  Metering and publishing of data can be configured via gLite standard installation mechanisms
Self-configurable collection and presentation
  just give the URL of the root Grid Information Service (GIS)
using W3C standards to offer easy access to monitoring data

11 LEMON
Set of tools for local monitoring:
  Distributed monitoring system scalable to ~10k nodes
  Provides active monitoring of software and hardware in the farm on centrally managed clusters
  Facilitates early error detection and problem prevention
  Executes corrective actions and sends notifications
  Provides persistent storage of the monitoring data
  Offers a framework for further creation of sensors for monitoring

12 GridICE architecture
Architecture diagram; labels recoverable from the slide:
GridICE Server: presentation layer, data aggregation, persistent storage, discovery, consumer, scheduler
Grid Information Service
Site (administrative domain): Site Publisher, Site Collector, Monitored Entity
LEMON: Sensor, Local Publisher, Site Consumer, Site Persistent Storage, GIS Adapter

13 LEMON agent (Local Publisher)
Runs sensors and communicates with them
Checks on status of sensors
Sends data to servers using TCP or UDP
Caches data locally
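
The agent's responsibilities (run sensors, check their status, send data, cache locally) can be sketched as follows. This is not LEMON code and does not reproduce its wire format: the `Agent` class, the `Sample` tuple and the injected `send` callable (which in practice could wrap a UDP or TCP socket) are all assumptions made for the example.

```python
from collections import deque
from typing import Callable, Deque, Dict, Tuple

Sample = Tuple[str, float]  # (metric name, value); a made-up record shape


class Agent:
    def __init__(self, sensors: Dict[str, Callable[[], float]],
                 send: Callable[[Sample], None], cache_size: int = 1000) -> None:
        self.sensors = sensors
        self.send = send                       # transport is injected by the caller
        self.cache: Deque[Sample] = deque(maxlen=cache_size)
        self.failed: Dict[str, str] = {}       # sensor name -> last error message

    def poll(self) -> None:
        """Run every sensor once, recording failures instead of crashing."""
        for name, read in self.sensors.items():
            try:
                self.cache.append((name, read()))
                self.failed.pop(name, None)
            except Exception as exc:           # a broken sensor must not stop the agent
                self.failed[name] = str(exc)

    def flush(self) -> int:
        """Send cached samples in order; stop at the first transport error."""
        sent = 0
        while self.cache:
            try:
                self.send(self.cache[0])
            except OSError:
                break                          # keep the rest cached for the next cycle
            self.cache.popleft()
            sent += 1
        return sent
```

Because the cache is a bounded deque, a long server outage drops the oldest samples first rather than exhausting memory on the node.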

14 LEMON server (Site Consumer)
Two implementations:
Oracle based (OraMon)
  optimized for high performance and for large computer centers
  runs on Oracle 9i+
Flat-file based (FlatMon, edg-fmon-server)
  uses OS files for storing data
  for smaller sites (scalable to 1000 machines max.)

15 GridICE Sensor and Publisher
Sensor and publisher are in the scope of each administrative domain:
Sensor
  Output as extension of the GLUE Schema-based information already available in the GIS adopted by gLite (i.e., Globus MDS 2.x):
    fabric-level information
    Grid services monitoring (daemons monitoring)
    job monitoring
    summary info for computing resources from LRMS
Publisher
  Adopt the available Grid Information Service in gLite

16 GridICE + LEMON
GIS adapter (fmon2glue) transforms fabric data into LDAP entries
Data stored in a special instance of a GRIS (site publisher)
GridICE can detect the existence of this GRIS (discovery process) and publish the information
Current LEMON version included in GridICE is v.2.5.4
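
The kind of transformation a GIS adapter performs can be illustrated by rendering one host's metrics as an LDIF entry, the text form in which LDAP data is commonly exchanged. This is a hedged sketch only: the DN layout, the `ExampleHost` object class and the attribute names are placeholders, not the GLUE schema and not the real fmon2glue output.

```python
from typing import Mapping


def to_ldif(host: str, metrics: Mapping[str, object], base_dn: str = "o=example") -> str:
    """Render one host's metrics as a single LDIF entry.

    Values are assumed to be plain ASCII with no leading space,
    so no base64 encoding is attempted.
    """
    lines = [
        f"dn: HostName={host},{base_dn}",   # placeholder DN layout
        "objectClass: ExampleHost",         # placeholder object class
        f"HostName: {host}",
    ]
    lines += [f"{name}: {value}" for name, value in sorted(metrics.items())]
    return "\n".join(lines) + "\n"


entry = to_ldif("wn01", {"LoadAvg": 0.5, "FreeDiskMB": 1024})
print(entry)
```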

17 GridICE + LEMON new version
Substantial effort to integrate the new version (v.2.13.x)
  Revision of the whole set of GridICE-specific sensors to comply with the new naming conventions
  A few early GridICE sensors replaced by new LEMON ones
  Significant rewriting of fmon2glue
Testing at INFN-BARI since February 2007

18 Integration tests
After 17 April, all the jobs in the batch system are detected by GridICE

19 GridICE screenshots

20 INFN-T1

21 The INFN-T1 solution
RedEye
  Designed for analysis and actions
  Sensors: customized for T1 farm monitoring
  Data analyzer
  Web interface: refreshed every 5 min
  Event handler: WN banned in LSF, e.g. in case of serious disk problems
  Notifications: customizable through SMS, RSS logbook, mail, etc.
    Temperature problem in the farm
    Production CEs not available
LEMON
  In the future: RedEye analyzer interfaced with LEMON
  Oracle based
Grid level
  In the future: GridICE integrated with the new LEMON version
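
An event handler of the kind described for RedEye pairs a condition on the collected metrics with a corrective action and a notification. The sketch below assumes nothing about RedEye's real implementation: the `Rule` class, the `disk_errors` metric and the injected `action` and `notify` callables are hypothetical stand-ins (the action would, for instance, close the worker node in the batch system).

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Rule:
    name: str
    matches: Callable[[Dict[str, float]], bool]   # inspects one host's metrics
    action: Callable[[str], None]                 # corrective action on that host


def handle(host: str, metrics: Dict[str, float], rules: List[Rule],
           notify: Callable[[str], None]) -> List[str]:
    """Apply every matching rule to a host and return the names of rules that fired."""
    fired = []
    for rule in rules:
        if rule.matches(metrics):
            rule.action(host)
            notify(f"{rule.name} on {host}")
            fired.append(rule.name)
    return fired


# Hypothetical wiring: lists stand in for the batch system and the mail/SMS channel.
banned: List[str] = []
messages: List[str] = []
rules = [Rule("disk_errors", lambda m: m.get("disk_errors", 0) > 0, banned.append)]
handle("wn07", {"disk_errors": 3}, rules, messages.append)
```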

22 RedEye screenshots

23 Storage monitoring

24 The INFN-T1 solution
LEMON
  Oracle based
  150 nodes
  Useful multilevel aggregation for monitoring analysis, thanks to the possibility of creating groups
Nagios
  Good event handler and alarm system
  Notification of problems through mail

25 The INFN-Bari tool
Complete detail at local level
  Site administrators can visualize all information about file transfers:
    host (source, destination)
    points of failure detection (host, protocols...)
    DN of operating users (read/write storage data)
Possible aggregation of information in GridICE, selecting data on:
  administrative site
  Virtual Organization
  operation type (read/write)
Tested successfully on dCache at INFN-Bari
To be tested on Castor, DPM, SEClassic
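
Aggregating transfer records along the dimensions listed above (site, Virtual Organization, operation type) amounts to a group-and-count. A minimal sketch, assuming each transfer is a flat record with hypothetical `site`, `vo` and `op` fields:

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def aggregate(transfers: Iterable[Dict[str, str]], keys: Tuple[str, ...]) -> Counter:
    """Count transfers grouped by the chosen dimensions."""
    return Counter(tuple(t[k] for k in keys) for t in transfers)


# Made-up records for illustration only.
transfers = [
    {"site": "INFN-BARI", "vo": "cms", "op": "read"},
    {"site": "INFN-BARI", "vo": "cms", "op": "write"},
    {"site": "INFN-BARI", "vo": "atlas", "op": "read"},
]
by_vo = aggregate(transfers, ("vo",))
by_vo_and_op = aggregate(transfers, ("vo", "op"))
```

Passing the dimensions as a tuple keeps the local detail intact while letting the Grid-level view pick whichever grouping it needs.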

26 Experiment dashboard

27 Experiment Dashboard
Single entry point to the monitoring data collected for LHC VOs.
  No available system can collect all kinds of data
  Collects information from multiple sources
VO-specific monitoring service showing Grid usage from a VO point of view (cross Grid, cross application, submission tool, etc.), merging Grid information and VO information.

28 Information sources of the VO monitoring data
Sources:
  Monitoring systems (RGMA, GridICE, SAM, RTM, MonAlisa, BDII, GridView…)
  Generic Grid services
  Experiment-specific services
  Experiment workload management and data management systems
  Jobs instrumented to report monitoring information
Consumers: VO users with various roles
Multiple sources, multiple interfaces…
  Is it easy to find what I am looking for?
  Is it possible to correlate data coming from various sources?

29 Experiment Dashboard concept
Sources:
  Monitoring systems (RGMA, GridICE, SAM, RTM, MonAlisa, BDII, GridView…)
  Generic Grid services
  Experiment-specific services
  Experiment workload management and data management systems
  Jobs instrumented to report monitoring information
Dashboard:
  Collect data of VO interest coming from various sources
  Store it in a single location
  Provide GUI following VO requirements
Consumers: VO users with various roles
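
The "collect from various sources, store in a single location" idea can be sketched as a merge keyed on a job identifier. This is an illustration under assumptions, not the dashboard's actual data model: the `job_id` key, the `seen_by` field and the source names are invented for the example.

```python
from typing import Dict, Iterable, Tuple

Record = Dict[str, str]


def merge(sources: Iterable[Tuple[str, Iterable[Record]]]) -> Dict[str, Record]:
    """Fold per-source records into one entry per job id.

    Later sources fill in or overwrite fields from earlier ones;
    'seen_by' lists the sources that reported the job.
    """
    store: Dict[str, Record] = {}
    for source_name, records in sources:
        for record in records:
            entry = store.setdefault(record["job_id"], {"seen_by": ""})
            entry.update({k: v for k, v in record.items() if k != "seen_by"})
            entry["seen_by"] = ",".join(filter(None, [entry["seen_by"], source_name]))
    return store


# Hypothetical inputs from a submission tool and a batch-system sensor.
store = merge([
    ("submission_tool", [{"job_id": "1", "user": "alice"}]),
    ("batch_system", [{"job_id": "1", "status": "running"}]),
])
```

With everything in one store, questions that span sources (for example, which user's jobs are running at a given site) become a single lookup instead of a correlation across interfaces.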

30 Job monitoring
How is the status of a job evolving?
  Belonging to an individual user/group/VO
  Submitted to a given site or via a given resource broker
  Reading a certain data sample, running a certain application
If pending/running: since when and where?
If finished: successfully or not?
If failed: why?

31 Job monitoring information flow
Diagram: information flows into the Dashboard for Job Monitoring (the diagram also shows a MonAlisa service) from:
Job submission tools (CRAB, ProdAgent, Panda, Ganga)
  At submission time: meta information about the user task, submission info for individual jobs
Jobs at the WNs
  Running jobs report their progress
Experiment-specific monitoring systems (Production system in Atlas, Dirac monitoring)
  Job status info and application-related information
Grid monitoring systems (RGMA, RTM, GridICE, BDII)
  Grid status info only for jobs submitted via RB (RGMA, RTM)
  Job status according to the local batch system (only where GridICE is running)

32 GridICE contribution to the dashboard
Job monitoring data from GridICE sensors
  Direct query on the batch system
  A different viewpoint for jobs
  Information on Condor-G jobs (submitted directly on CE)
ROC-IT data from a dedicated GridICE server
Customized API extracting data from persistent storage on GridICE server

33 Conclusions
Local and Grid monitoring differences and integration
The GridICE+LEMON approach
Storage monitoring experiences
Experiment dashboard as example of integration of different tools


