1
LCG Monitoring and Accounting
Dave Kant, CCLRC e-Science Centre, UK
LCG Workshop, Nov 2nd-5th 2004
2
Introduction
Look at the existing monitoring tools that are being used in LCG:
- Grid Operations Centre
- GPPMON
- GRIDICE
- GSTAT
- Certification Testing
- Real Time Grid Monitor
- Job Accounting
There has been a coordinated effort to develop, deploy and integrate a variety of monitoring tools from CERN, CCLRC (UK), GridPP, INFN-Grid (Italy) and Taiwan.
3
Monitoring Challenges
- We have only fragmentary information about the services that sites are running.
- We don't know which RBs/SEs/sites the VOs are using for data challenges.
- We don't know what the core services are and who is running them.
- We don't have a toolkit to test specific core services. We have to concentrate on the functional behaviour of services, e.g. if an RB sends your job to a CE, then we must assume the RB is working fine. Is this the only test of an RB?
- Not all the tests that we perform are effective at finding problems. We must develop tests which simulate the life cycle of real applications in a Grid environment.
- ...and lots more (see earlier talks).
Useful operations information.
4
GOC Configuration Database
- Secure database management via HTTPS / X.509
- Store a subset of the Grid Information System: people, contact information, resources
- Scheduled maintenance
- Monitoring services: operations maps, configure other tools, organisation structures
- Secure services: site news, self certification, accounting
[Diagram labels: Resource Centre (RC: bdii, ce, se, rb); Resources & Site Information; https; GOC GridSite; SQL; MySQL server; EDG, LCG-1, LCG-2, …]
The GOC Database is a repository of site information:
- It contains information provided by site administrators through a secure interface (GridSite).
- The information can be resources, contact and security details: basically a subset of the things you might find in the information system (IS).
- It can also contain information that is not present in the IS, such as scheduled maintenance, news, organisational structures and geographic coordinates for maps.
=> A focal point for operational information.
The database can also be used to configure some of the monitoring tools which I showed in the previous slide:
- Mapcenter -> 30 sites require a configuration file of a few hundred lines
- Nagios > 12 separate configuration files with dependencies
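As a rough illustration of how a site database could drive tool configuration, the sketch below renders Nagios-style host definitions from a table of nodes. The table name `nodes`, its columns, the host names and the template name `goc-host` are all hypothetical, since the real GOC database schema is not described in the talk. SQLite stands in for MySQL only so that the example is self-contained.

```python
import sqlite3

def load_example_db():
    """Build an in-memory stand-in for the site database (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE nodes (site TEXT, node_type TEXT, hostname TEXT)")
    conn.executemany(
        "INSERT INTO nodes VALUES (?, ?, ?)",
        [
            ("SITE-A", "ce", "ce.site-a.example.org"),
            ("SITE-A", "se", "se.site-a.example.org"),
            ("SITE-B", "ce", "ce.site-b.example.org"),
        ],
    )
    return conn

def nagios_host_definitions(conn):
    """Render one Nagios 'define host' object per node, ordered by hostname."""
    rows = conn.execute(
        "SELECT site, node_type, hostname FROM nodes ORDER BY hostname"
    ).fetchall()
    blocks = []
    for site, node_type, hostname in rows:
        blocks.append(
            "define host {\n"
            "    use        goc-host\n"  # assumed template, defined elsewhere
            f"    host_name  {hostname}\n"
            f"    alias      {site} {node_type}\n"
            f"    address    {hostname}\n"
            "}\n"
        )
    return "\n".join(blocks)

config = nagios_host_definitions(load_example_db())
print(config)
```

The point of the design is that the site list lives in one place: when a site is added to the database, the configuration can be regenerated rather than edited by hand.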
5
Operations Map – Job Submission Tests
GPPMON displays the results of tests against sites.
Test: job submission. The job is a simple test of the grid middleware components, e.g. the Gatekeeper service, the RB service, and the Information System via JDL requirements.
This is the GPPMON tool developed by the GridPP Collaboration. Basically, it is a map which represents the result of each kind of test as a coloured dot. For example, a job submission test sends a job request to a site through a resource broker. If the job executes successfully, the site is marked with a green dot. If the job fails, the site is marked with a red dot.
These maps can be tailored for different communities, for example a grid community identified by a list of sites in a BDII configuration database, such as the one shown here for LCG.
This kind of test deals with the functional behaviour of the core grid services: do simple jobs run? They are lightweight tests which run hourly. However, they have certain limitations, e.g. Dteam VO; WN reach (specialised monitoring queues).
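The green/red rule described above can be sketched in a few lines. This is not GPPMON code: the record layout, the site names and the choice to colour each site by its most recent test result are assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class JobTestResult:
    site: str            # site name (made up for this example)
    finished: datetime   # when the test job completed or was declared failed
    succeeded: bool      # True if the job executed successfully

def latest_per_site(results):
    """Keep only the most recent result for each site."""
    latest = {}
    for r in results:
        if r.site not in latest or r.finished > latest[r.site].finished:
            latest[r.site] = r
    return latest

def dot_colours(results):
    """Green dot if the latest test job ran successfully, red dot otherwise."""
    return {
        site: ("green" if r.succeeded else "red")
        for site, r in latest_per_site(results).items()
    }

results = [
    JobTestResult("site-a", datetime(2004, 11, 2, 9, 0), False),
    JobTestResult("site-a", datetime(2004, 11, 2, 10, 0), True),
    JobTestResult("site-b", datetime(2004, 11, 2, 10, 0), False),
]
colours = dot_colours(results)
print(colours)
```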
6
GRIDICE – Architecture
A different kind of monitoring tool: processes / low level metrics / grid metrics.
- Developed by the INFN-GRID team
- Measurement service
- Data harvest via discovery service (PostgreSQL)
- Publication service
Unlike GPPMON, which runs simple functional tests, there are other tools which can monitor services in different ways. One example is GRIDICE, a monitoring tool for a grid operations centre, which has been developed by the INFN-GRID team. GridIce implements a number of services:
- A Measurement service, which uses monitoring sensor agents to probe "core processes" belonging to a service, and other low level metrics such as memory and CPU.
- A Publisher service to collect this information in a local database (fmonServer) at the site.
- A Discovery service to find resources and harvest data into a central database.
- And finally a publication service: a portal to the monitoring data, which can be aggregated in different ways.
7
Different Views of the data: Site / VO / Geographic
GRIDICE – Global View
- Resource usage: CPU#, load, storage, job info
- List of sites
The web interface shows what you might see if you want an overall global view of grid resources. Here you can see a list of participating sites and a description of resource usage, such as total CPU and storage available. GridIce uses Nagios to schedule updates of its central monitoring repository, and the information you see is reasonably up-to-date. The information can be viewed in different ways, for example geographically or for each VO.
The display shows the processes belonging to the Broker service. Problems are flagged.
8
GridIce Job Monitoring
- Recently deployed version on LCG features job monitoring: queued, running, finished
- Organised in different ways (site, VO etc.)
- XML views of data
The latest version of GridIce (1.6.3) implements job monitoring features: current status of running/queued/finished jobs per VO per site.
9
Ganglia Monitoring http://gridpp.ac.uk/ganglia
- Can use Ganglia to monitor a cluster
- Scalable distributed monitoring system for clusters and grids using RRD for storage and visualisation
- RAL Tier-1 Centre LCG PBS server displays job status for each VO
- Get a lot for little effort
Ganglia is a scalable distributed monitoring system for clusters and grids which uses RRDtool (the Round Robin Database tool) for data storage and visualisation. It's relatively easy to install and you get a lot for little effort. One of its strengths is that it can federate clusters together.
10
Federating Cluster Information
- Can also use Ganglia to monitor clusters of clusters
- Separate and distinct clusters federated together
- Ganglia/R-GMA integration through Ranglia
Ganglia provides a wealth of information, much of it low level, that can be useful for operations.
11
GIIS Monitor http://goc.grid.sinica.edu.tw/gstat/
- Developed by Min Tsai (GOC Taipei)
- Tool to display and check information published by the site GIIS (sanity checks, fault detection)
The GIIS Monitor has been developed by the GOC based in Taipei. It's a tool to monitor the grid information system. The primary goal of the application is to detect faults, perform sanity checks and display useful data. It provides an overview of the current grid status and you can drill down to get more information.
12
Regional Monitoring
EGEE is made up of regions. Each region contains many computing centres. Regional Operations Centres (ROCs) are a focus for operations.
One of the ways in which we can deal with operations and the complexity of managing a grid is to divide the task into smaller pieces. For example in EGEE, the GOC can provide monitoring services that are tailored to each region.
13
Regional Monitoring Maps
- Provide ROCs with a package to monitor the resources in the region
- Tailored monitoring
- GUIs to create organisations and populate them with sites
- Hierarchical view of resources
- Example: UK Particle Physics, GridPP
- Materialised Path encoding
Organisational structures, such as the Regional Operations Centres belonging to EGEE, can be described in the GOC database using Materialised Path encoding.
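Materialised path encoding stores, for every node of a hierarchy, the full path from the root to that node, so that a whole subtree can be selected with a simple prefix match. The sketch below shows the idea in plain Python. The hierarchy, the names `ROC-X`, `site-a`, `site-b` and `site-c`, and the `/` separator are illustrative assumptions, not the actual contents or layout of the GOC database.

```python
# (name, path) rows as they might be stored in a table; each path lists
# every ancestor from the root down to the node itself.
ORGANISATIONS = [
    ("EGEE", "/EGEE"),
    ("UKI", "/EGEE/UKI"),
    ("GridPP", "/EGEE/UKI/GridPP"),
    ("site-a", "/EGEE/UKI/GridPP/site-a"),
    ("site-b", "/EGEE/UKI/GridPP/site-b"),
    ("ROC-X", "/EGEE/ROC-X"),
    ("site-c", "/EGEE/ROC-X/site-c"),
]

def descendants(rows, path):
    """All nodes below `path`; in SQL this is: WHERE path LIKE '<path>/%'."""
    prefix = path.rstrip("/") + "/"
    return [name for name, p in rows if p.startswith(prefix)]

def ancestors(path):
    """Ancestor names are read straight from the path, with no extra lookup."""
    return path.strip("/").split("/")[:-1]

def depth(path):
    """Level in the hierarchy, with the root at depth 1."""
    return len(path.strip("/").split("/"))

print(descendants(ORGANISATIONS, "/EGEE/UKI"))
print(ancestors("/EGEE/UKI/GridPP/site-a"))
```

The trade-off is the usual one for this encoding: reads of a subtree are a single prefix query, while moving an organisation means rewriting the paths of everything beneath it.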
14
Site Certification Service
In terms of middleware, the installation and configuration of a site is quite a complicated procedure.
- When there is a new release, sites don't upgrade at the same time
- Some upgrades don't go smoothly
- Unexpected things happen (who turned off the power?)
- Day-to-day problems; robustness of service under load?
It's necessary to actively hunt for problems.
Site certification testing is run by the CERN deployment team on a daily basis. The first step toward providing this service involves running a series of replica manager tests which register files onto the grid, move them around and delete them, and perform 3rd party copies from a remote SE. Unlike the simple job submission tests implemented in GPPMON, these tests are more heavyweight and attempt to simulate the life cycle of real applications.
15
Certification Test Results
Test results are shown on a web page. As you can see, it's a large matrix of data where each row identifies a site and the corresponding test results. One of the difficulties of having too much information is that it can be hard to find the information you need. It's also quite detailed: most links allow you to drill down and examine the debug information, again a useful tool for the expert!
16
Syndication of Monitoring Information
- GOC generates RSS feeds which clients can pull using an RSS aggregator.
- How can we integrate feeds and ticketing systems?
Clearly, there is a lot of monitoring information. We need a way to syndicate this information to the right people.
- Really Simple Syndication (RSS) is an XML format that is used by many organisations to syndicate content (e.g. Slashdot, Nature).
- Client pull model: the GOC creates RSS formatted documents; clients pull these feeds and render them in HTML.
- Feeds can be tailored to communities: per site, per ROC, whole Grid, specific test.
The screenshot shows the layout of the results that a site administrator would see when subscribing to such a feed. There are plenty of open source RSS clients out there to render the feeds into HTML.
[Screenshot: aggregator RSSReader (Windows client)]
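To make the client pull model concrete, here is a minimal sketch of how a server side could write test results as an RSS 2.0 document, optionally tailored to one site. It uses only Python's standard `xml.etree.ElementTree`; the result fields, site names, test names and the feed URL are invented for the example and do not reflect the GOC's actual feeds.

```python
import xml.etree.ElementTree as ET

def build_feed(title, link, results, site=None):
    """Return an RSS 2.0 document as a string; `site` restricts the items."""
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    # title, link and description are the required channel elements in RSS 2.0
    ET.SubElement(channel, "title").text = title
    ET.SubElement(channel, "link").text = link
    ET.SubElement(channel, "description").text = "Monitoring test results"
    for r in results:
        if site is not None and r["site"] != site:
            continue  # tailor the feed to a single site
        item = ET.SubElement(channel, "item")
        ET.SubElement(item, "title").text = f'{r["site"]}: {r["test"]} {r["status"]}'
        ET.SubElement(item, "description").text = r["detail"]
    return ET.tostring(rss, encoding="unicode")

# Made-up results for illustration only.
results = [
    {"site": "site-a", "test": "job-submission", "status": "OK", "detail": "Job ran"},
    {"site": "site-b", "test": "job-submission", "status": "FAILED", "detail": "Job aborted"},
]
feed_xml = build_feed("Test results for site-b", "https://goc.example.org/feeds/site-b",
                      results, site="site-b")
print(feed_xml)
```

Any RSS aggregator can then poll the URL where this document is published, which is what keeps the server side simple: it only writes files and never needs to know who is subscribed.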
17
Real Time Grid Monitor
- Why are jobs failing?
- Why are jobs queued at sites while others are empty?
- A visualisation tool to track jobs currently running on the grid.
- The applet queries the logging and bookkeeping service to get information about grid jobs.
A visualisation tool developed by GridPP at Imperial College. The monitor works by querying the RB Logging and Bookkeeping (L&B) database for job information. Because the L&B service is continually being updated, the tool shows jobs flowing from the RB to a site for processing, or returning to the source once completed. The tool is useful to get a global picture of trends and to quickly identify potential problems such as "job pile-up", and it helps to publicise the grid at conferences to non-experts.
[The applet queries files not older than 6 hours. Long running jobs don't show up.]
18
Problems with existing tools
Lots of monitoring tools have been described. They have a few things in common:
- all the information which they generate is hidden away or difficult to access
- limited interfaces: the data can only be accessed in specific ways
Therefore, it's difficult to build "on-demand" services to allow communities ("players") to interact with the data. Examples include:
- Job Accounting service: to allow an organisation to compare resource usage for each VO
- Certification Testing service: a secure service to allow a site administrator to run the certification test suite against their site through an RB of their choice
The idea is for the services to collect information and put it into a common repository such as an R-GMA Archiver. In this way, the information can be shared and accessible to all. Services (EGEE parlance: ROC and CIC services) munch the data and present it to the community.
Example: the current limitation with GIIS is that it's hard to drill down to the information you want, e.g. how much CPU in GridPP today? How much disk in the UKI ROC? The new paradigm solves this problem by allowing the data to be aggregated in different ways.
19
Information Repository (RGMA)
Monitoring Paradigm
A better way to unify monitoring information:
- GOC services collect information and publish it into an archiver.
- ROC/CIC services provide a means for the community to interact with this information on demand.
- GOC provides services tailored to the requirements of the community.
[Diagram labels: Information Repository (RGMA); Accounting, Monitoring, GSTAT, Testing; ROC Services; Self Certification; CIC Services; Communities: VOs, ROCs, EGEE, Sites, Organisations; GOC Services]
Thus far I've described lots of tools. They have a few things in common:
- they have all been developed independently
- all the information which they generate is hidden away or difficult to access
- limited interfaces: the data can only be accessed in specific ways, e.g. a web page
Therefore, it's difficult to build "on-demand" services to allow communities ("players") to interact with the data. What sort of services are we thinking about? Examples include:
- Job Accounting service: to allow an organisation to compare resource usage for each VO
- Certification Testing service: a secure service to allow a site administrator to run the certification test suite against their site through an RB of their choice
The idea is for the services to collect information and put it into a common repository such as an R-GMA Archiver. In this way, the information can be shared and accessible to all. Services (EGEE parlance: ROC and CIC services) munch the data and present it to the community.
Example: the current limitation with GIIS is that it's hard to drill down to the information you want, e.g. how much CPU in GridPP today? How much disk in the UKI ROC? The new paradigm solves this problem by allowing the data to be aggregated in different ways. Each community has its own requirements, so we need to tailor services.
20
GOC UseCase Job Accounting
An accounting package for LCG has been developed by the GOC at RAL. There are two main parts:
- the accounting data-gathering infrastructure, based on R-GMA, which brings the data to a central point
- a web portal to allow on-demand reports for a variety of players
21
Accounting Flow Diagram
[Diagram labels: LCG Site (x2); Data Sources: GK log, batch log, messages; Site GIIS; CE; filters; Site MON; RGMA; GOC; Archiver; Accounting Data; Data Aggregation per VO, per ROC; Accounting Service; On Demand Reports]
The picture shows the flow of accounting data from collection at source to the presentation of this information to the community via the GOC. This system uses R-GMA, a relational implementation of the GGF Grid Monitoring Architecture, which serves as the transport layer for sending information from one place to another.
At each site, filters process data from log files and write to database tables on the site MON via JDBC/MySQL:
- GK filter to get the DN (a requirement for grid jobs)
- Event filter to get job information
- Messages to map Grid users to local jobs
- GIIS filter to get specINTs and specFLOATs
Data is joined to build accounting records. Accounting records are sent to the GOC via R-GMA.
At the GOC:
- GOC collects records from the sites (1 record per grid job)
- GOC processes records to create summary tables from which the on-demand services run (ROC views of data, VO views etc.)
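The join step can be pictured with a small SQL example. The sketch below is an assumption-laden stand-in, not the accounting package itself: the table and column names, the identifiers and the sample values are invented, and SQLite replaces MySQL so the example runs on its own. It shows how the four filter outputs could be combined into one record per grid job, with purely local batch jobs dropping out because they have no gatekeeper entry.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hypothetical tables, one per filter described above.
CREATE TABLE gk_records      (gram_id TEXT PRIMARY KEY, user_dn TEXT);
CREATE TABLE message_records (gram_id TEXT, local_job_id TEXT);
CREATE TABLE event_records   (local_job_id TEXT PRIMARY KEY,
                              unix_group TEXT, cpu_seconds INTEGER);
CREATE TABLE spec_records    (spec_int INTEGER);  -- one row per site in this sketch
""")
conn.executemany("INSERT INTO gk_records VALUES (?, ?)", [
    ("gram-1", "/C=UK/O=Example/CN=Alice"),
    ("gram-2", "/C=UK/O=Example/CN=Bob"),
])
conn.executemany("INSERT INTO message_records VALUES (?, ?)", [
    ("gram-1", "101.batch"),
    ("gram-2", "102.batch"),
])
conn.executemany("INSERT INTO event_records VALUES (?, ?, ?)", [
    ("101.batch", "atlas", 3600),
    ("102.batch", "cms", 120),
    ("103.batch", "localuser", 50),   # local job: no grid mapping, so not accounted
])
conn.execute("INSERT INTO spec_records VALUES (400)")

# One accounting record per grid job: DN + local job details + site benchmark.
records = conn.execute("""
    SELECT g.user_dn, e.local_job_id, e.unix_group, e.cpu_seconds, s.spec_int
    FROM event_records e
    JOIN message_records m ON m.local_job_id = e.local_job_id
    JOIN gk_records g      ON g.gram_id = m.gram_id
    CROSS JOIN spec_records s
    ORDER BY e.local_job_id
""").fetchall()

for rec in records:
    print(rec)
```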
22
GOC Accounting Services
On-demand services to the EGEE community:
- Simple interface to customise views of data: VO, time frame and region (default = EGEE)
- BaseCpuSeconds aggregated across EGEE: each site, per VO, per month; each region, per VO, per month
- Other distributions: normalised CPU, # jobs
GOC provides "services" to the EGEE community.
Use case 1: Accounting service. Accounting data is fed into the R-GMA database. Markus/Piotr/Laurence/Min worry about getting monitoring information into a common datastore; GOC should worry about presenting this information. No duplication of effort!
Use case 2: Monitoring/Testing service. NB: monitoring/testing information comes from Piotr/Markus via the R-GMA database. Philosophy: Piotr worries about what kind of tests to run; GOC worries about how to present the information to the community.
Use case 3: gstat. Show disk/CPU/# of jobs running/waiting per site/region etc.
Use case 4: Job monitoring service. Hourly job submission tests of sites by GOC.
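The "each site, per VO, per month" view amounts to a group-by over the per-job accounting records. A minimal sketch of that aggregation follows; the record fields and sample values are assumptions for illustration, and the real summary tables are built inside the GOC database rather than in Python.

```python
from collections import defaultdict
from datetime import date

def summarise(records):
    """Total job count and CPU seconds per (site, VO, month)."""
    totals = defaultdict(lambda: {"jobs": 0, "cpu_seconds": 0})
    for rec in records:
        key = (rec["site"], rec["vo"], rec["end_date"].strftime("%Y-%m"))
        totals[key]["jobs"] += 1
        totals[key]["cpu_seconds"] += rec["cpu_seconds"]
    return dict(totals)

# Made-up per-job records, one per grid job.
records = [
    {"site": "site-a", "vo": "atlas", "end_date": date(2004, 10, 3), "cpu_seconds": 3600},
    {"site": "site-a", "vo": "atlas", "end_date": date(2004, 10, 20), "cpu_seconds": 1800},
    {"site": "site-a", "vo": "cms", "end_date": date(2004, 10, 21), "cpu_seconds": 120},
    {"site": "site-a", "vo": "atlas", "end_date": date(2004, 11, 1), "cpu_seconds": 60},
]
summary = summarise(records)
for key in sorted(summary):
    print(key, summary[key])
```

A per-region view would use the same function with the site replaced by the region it belongs to as the first element of the key.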
23
Accounting Issues
- A stable release of the accounting package has been certified and tested at CERN. Should sites wait for the official release or press ahead independently?
- The package supports PBS only; there is an initial implementation for LSF. 80 sites advertise 313 job managers: 300 PBS (91% of sites), 3 CONDOR (KFKI, FNAL, TRIUMF), 7 LSF (GSI, LNL, CERN).
- Accounting requires the R-GMA infrastructure to be deployed at the site.
- The VO associated with a user's DN is not available in the batch or gatekeeper logs. It will be assumed that the group ID used to execute user jobs, which is available, is the same as the VO name.
- The global jobID assigned by the Resource Broker is not available in the batch or gatekeeper logs, so it cannot appear in the accounting reports. The RB Events Database contains it, but that database is not accessible, nor is it designed to be easily processed. [Andrea Guarise: JRA1 proposal]
24
Accounting Issues
- Most sites keep GK/batch logs but throw away message log files after 9 weeks due to default log rotation.
- At present the logs provide no means of distinguishing sub-clusters of a CE which have nodes of differing processing power. Changes to the information logged by the batch system will be required before such heterogeneous sites can be accounted properly. At present it is believed all sites are homogeneous.
25
Future Plans
- Extend the ideas developed in the accounting service to the other tools. Example: feeds, such as Regional Operations news feeds (accounting, #cpu, disk, Piotr's daily test results).
- Move toward a Service Orientated Architecture model and provide the community with a direct interface into the monitoring.
26
Summary
- An accounting information gathering infrastructure has been developed. It has been through the C&T cycle and should be deployed in the next release.
- A web portal for display of this information has been developed (work in progress). This is an EGEE deliverable (DSA1.3).
- The display infrastructure can be deployed for other monitoring information.
- Development towards on-demand services to provide the community with up-to-date information, aggregated at different levels.
- Development of visualisation tools to enhance our understanding of the grid.