Download presentation
Presentation is loading. Please wait.
Published byShannon Reed Modified over 9 years ago
1
Dave Kant D.Kant@rl.ac.uk Monitoring and Accounting Dave Kant CCLRC e-Science Centre, UK GridPP 12 Jan 31 st - Feb 1 st 2005
2
2 Overview 1.GOC Database 2.Monitoring Tools 3.Accounting 4.Issues 5.Future Plans
3
3 Overview GOC DataBase –What it does –What role it will play in future grid –What features New site registration procedure Administrative Roles Replication BDII generation (with other information) e.g. through RGMA Speak to pyotr/min –Security Monitoring packages –Whats out there; What they do –GPPMon: What features RSS feeds, regional monitoring, certificates and accounting Usefulness => map of red dots can be used to find problems quickly –Where are we going Services Accounting service
4
4 GOC Database –What features? Configuration of monitoring tools Security Organisations Administrative Roles Replication –What role will it play in the future? New site registration procedure BDII generation
5
5 Monitoring the Grid is a Challenge Number of participating sites is growing every day: August 2003 => 12 sites ; October 2004 => 83 sites ; 8000 CPUs; 96 PB Disk Grid Operations Centre Monitor the operational status of sites; Fault detection Problem Management Identify problems; escalate; track;
6
6 Introduction Look at the existing monitoring tools that are being used in LCG Grid Operations Centre –GPPMON –GRIDICE –GSTAT –CERTIFICATION TESTING –REAL TIME GRID MONITOR –Job Accounting There has been a coordinated effort to develop, deploy and integrate a variety of monitoring tools from CERN, CCLRC (UK), GridPP, INFN-Grid (Italy) and Taiwan.
7
7 We have only fragmentary information about the services that sites are running. We don’t know what RBs/SEs/Sites the VOs are using for data challenges. We don’t know what the core services are and who is running them. We don’t have a toolkit to test specific core services. We have to concentrate on functional behaviour of services e.g. If an RB sends your job to a CE, then we must assume the RB is working fine. Is this the only test of a RB? Not all the tests that we perform are effective at finding problems. We must develop tests which simulate the life cycle of real applications in a Grid environment. …and lots more (see earlier talks) Monitoring Challenges
8
8 GRID Configuration Database GOCDB GridSite MySQL Resource Centre Resources & Site Information EDG, LCG-1, LCG-2, … ce se bdii rb Monitoring Services Operations Maps Configure other Tools Resource Provider Organisation Structures Secure services - Site News - Self Certification - Accounting Secure Database Management via HTTPS / X.509 Store a Subset of the Grid Information system People, Contact Information, Resources Maintenance Bit RC SQL https SERVERSERVER GOC DB can also contain information that is not present in the IS such as: Scheduled maintenance; News; Organisational Structures; Geographic coordinates for maps.
9
9 EGEE ROC Structure EGEE is made up of regions. Each region contains many computing centres. Regional Operational Centres are a focus for operational activities. USA
10
10 Developed a tool to manage organisational structures. Modelled on GridPP Tier1/2 Structure Materialised Path Encoding Provide ROCs with a package to monitor the resources in the region Tailored Monitoring Administrative roles to the coordinators in GOCDB Organisational Structures EGEE (1) France (1.1) UK/I (1.2) GridPP (1.2.1) LondonT2 IMPERIAL QMUL ScotGrid Edinburgh S.E.E (1.3)
11
11 Total List of all sites is derived from GOCDB (via RGMA) GOC bit: sites which have opted out e.g. scheduled maintenance White List: Sites that failed one or more core tests but are well supported are put back in e.g. a Tier1 site Core tests are a subset of the site functional tests run by CERN every day Black List: Sites that are not trusted 100’s of Sites Monitoring Services Total List of all sites Sites pass core tests Trusted Sites Black List White List BDII RGMA GOC Bit GOC DB Site info Gstat Data Site Functional Tests GOC Hourly Tests Generation of BDII configuration file via feedback into IS Adaptive Job Brokering Based on the Monitoring System Environments Production, VO, GridPP, …
12
12 How Are New Sites Added? SiteROC GOCDB Site and ROC liaise [1] EGEE 1.JSPG have written a “Site Registration Policy & Procedure” Document 2.https://edms.cern.ch/document/503198/https://edms.cern.ch/document/503198/ 3.New GOCDB portal to streamline the site registration process. [3] Site installs middleware [2] “candidate” site [4] “uncertified” Site [6] “certified” Site [5] Certification Testing
13
13 Replication Two replicas, each one has a different security considerations “Services” replica managed by Taipei –Direct connections to the database by the monitoring tools from known hosts “Users” replica to be setup at IN2P3 –Web portal based on X.509 certificates –CIC on duty
14
14 Monitoring Tools What are the main tools that are used in the day-to-day operations of the LCG Grid? –GPPMON –GSTAT –Site Functional Tests Other monitoring tools exist, but I won’t discuss them here –GridIce
15
15 Operations Map – Job Submission Tests GPPMON Displays the results of tests against sites. Test: Job Submission Job is a simple test of the grid middleware components e.g. Gatekeeper service, RB service, and the Information System via JDL requirements. This kind of test deals with the functional behaviour core grid services – do simple jobs run. They are lightweight tests which run hourly. However, they have certain limitations e.g. Dteam VO; WN reach (specialised monitoring queues).
16
16 Operations Map – Certificate Lifetime GPPMON Displays the results of tests against sites. Test:Certificate Lifetime Many grid services require a valid certificate for security. By probing the host certificates on CEs and SEs at sites with a simple SSL client service, we can identify certificates which are due to expire and send an early warning to them. A predictive tool!
17
17 GRIDICE – Architecture A different kind of monitoring tool – processes / low level metrics / grid metrics Developed by the INFN-GRID Team http://infnforge.cnaf.infn.it/gridicehttp://infnforge.cnaf.infn.it/gridice Data harvest via discovery service (postgreSQL) Measurement service Publication service
18
18 GRIDICE – Global View Display shows the processes belonging to the Broker service. Problems are flagged List of Sites Resource Usage CPU#, Load, Storage, Job Info Different Views of the data: Site / VO / Geographic
19
19 GridIce Job Monitoring Recently deployed version 1.6.3 on to LCG which features job monitoring: Queued, Running, Finished organised in different ways (site, Vo etc) XML views of data
20
20 GRIDICE – Expert View Display shows the processes belonging to the Broker service. Problems are flagged Node Processes
21
21 Ganglia Monitoring http://gridpp.ac.uk/ganglia Can use Ganglia to monitor a cluster Scalable distributed monitoring system for clusters and grids using RRD for storage and visualisation. RAL Tier-1 Centre LCG PBS Server displays Job status for each VO Get a lot for little effort
22
22 Federating Cluster Information Can also use Ganglia to monitor clusters of clusters Ganglia/R-GMA integration through Ranglia.
23
23 GIIS Monitor Developed by MinTsai (GOC Taipei) Tool to display and check information published by the site GIIS (sanity checks, fault detection) http://goc.grid.sinica.edu.tw/gstat/ Regional Plot: http://map.gridpp.ac.uk
24
24 Site Certification Service In terms of middleware, the installation and configuration of a site is quite a complicated procedure. –When there is a new release, sites don’t upgrade at the same time –Some upgrades don’t always go smoothly –Unexpected things happen (who turned of the power?) –Day-to-day problems; robustness of service under load? Its necessary to actively hunt for problems Site certification testing is by CERN deployment team on a daily basis. First step toward providing this service involves running a series of replica manager tests which register files onto the grid, move them around, delete them; and 3 rd party copies from remote SE. Unlike the simple job submission tests implemented in GPPMON, these tests are more heavy weight and attempt simulate the life cycle of real applications.
25
25 Certification Test Results http://lcg-testzone-reports.web.cern.ch/lcg-testzone-reports/cgi-bin/listreports.cgi
26
26 Aggregator RSSReader (Windows Client) GOC generates RSS feeds which clients can pull using an RSS aggregator. How can we integrate feeds and ticketing systems? Syndication of Monitoring Information
27
27 Real Time Grid Monitor http://www.hep.ph.ic.ac.uk/e-science/projects/demo/index.html A Visualisation tool to track jobs currently running on the grid. Applet queries the logging and bookkeeping service to get information about grid jobs. Why are jobs failing? Why are jobs queued at sites while others are empty?
28
28 Problems with Existing Tools Lots of monitoring tools around which have things in common:- - all the information which they generate is hidden away or difficult to access - limited interfaces: the data can only be accessed in specific ways Therefore, its difficult to build “on-demand” services to allow communities “Players” to interact with the data. The idea is for the services to collect information and put it into a common repository such as an RGMA Archiver. In this way, the information can be shared and accessible to all. Services (EGEE parlance: ROC and CIC services) munch the data and present it to the community. How much CPU in UKI ROC –How much in GridPP? How much in each Tier2? => Integrate data from different sources to provide this information
29
29 Monitoring Paradigm A Better way to unify monitoring information. GOC Services collect information and publish into an archiver. ROC/CIC Services provide a means for the community to interact with this information on-demand. GOC provides services tailored to the requirements of the community. Information Repository (RGMA) Accounting Monitoring GSTAT Testing ROC Services Self Certification CIC Services Communities VOs ROCs EGEE Sites Organisations GOC Services
30
30 Use Cases Monitoring services which use RGMA as the backbone for data transport and data location via the registry service. –Grid Event Monitoring System –“Site Functional Test” Reporting Tool –Accounting
31
31 UseCases - GEMS Grid Event Monitoring System List of resources to monitor is provided by GOCDB Alert system that uses RGMA Looks for changes of state in the monitoring data tables Generates an alert and displays on the GEMS console. Notification features Event filtering
32
32 Reporting Tool Prototype Organisational Identities taken from GOCDB
33
33 Reporting Tool Prototype
34
34 Advantages No need for configuration of access points both for sensors and for reporting tools – data is located automatically by the registry No risk of data loss in case of failure of monitoring software – R-GMA is just for data transport, not for storage Flexibility – by using predicates, one can setup a number of archivers with different policies in different physical places, fault tolerance Usage of SQL and relational database model makes it all elegant
35
35 UseCases Job Accounting An accounting package for LCG has been developed by the GOC at RAL There are two main parts –the accounting data-gathering infrastructure based on R-GMA which brings the data to a central point –a web portal to allow on-demand reports for a variety of players.
36
36 Accounting Information collected at each site from batch logs, gatekeeper logs etc Information joined at site level to select grid jobs and stored in database on R-GMA MON box at site. Information published through R-GMA and collected centrally in an R-GMA archive at GOC Web site presents various views of this data for presentation Information schema based on GGF Usage Group Structure of Grid taken from GOC DB – the grid configuration database. Only normalised cpu time collected (at the moment)
37
37
38
38 Archiving Log Files Which Log files should we keep and why? System logs, GateKeeper logs and Batch logs. Batch logs because we want job efficiency as a function of time for each VO. Currently, we do not extract this information from the batch log.
39
39 GOC Accounting Services http://goc.grid-support.ac.uk/gridsite/accounting/index.html BaseCpuSeconds Aggregated across EGEE Each Site, per VO, per Month Simple interface to customise views of data: VO, time frame and Region (default = EGEE) Each Region, per VO, per Month On Demand Services to EGEE Community Other Distributions Normalised CPU # Jobs
40
40 Web form to apply selection criteria on the data Aggregate data across an organisation structure (Default= All ROCs) Select VOs (Default = All) Select date range
41
41 VO Index Summed CPU (Seconds) consumed by resources in selected Region Selected Date Range
42
42 List of Sites Belonging to the Selected ROC A breakdown of the resource usage per Site, per VO, per Month
43
43 Deployment Package was released to LCG in August 2004 and certified soon afterwards. There was no LCG release after that until LCG2_3_0 on 18th December 2004 Today there are still very few 2_3_0 sites. There are 28 sites producing accounting records today. The 2_3_0 release has some bugs which are fixed in a new release that is available on the accounting home page Recommend that sites upgrade accounting to version APEL 3.4.40 available on the accounting homepage http://goc.grid-support.ac.uk/gridsite/accounting/index.html
44
44 Accounting Issues 1.Package supports PBS only. 80 sites advertising 313 Job managers: - 300 PBS (91% of sites) - 3 CONDOR (KFKI, FNAL, TRIUMF) -7 LSF (GSI, LNL, CERN) and 15 sites in Italy. 2.Accounting requires the R-GMA infrastructure to be deployed at the site. 3.The VO associated with a user’s DN is not available in the batch or gatekeeper logs. It will be assumed that the group ID used to execute user jobs, which is available, is the same as the VO name. 4.The global jobID assigned by the Resource Broker is not available in the batch or gatekeeper logs. This global jobID cannot therefore appear in the accounting reports. The RB Events Database contains this, but that is not accessible nor is it designed to be easily processed. [Andrea Guarise: JRA1 proposal]
45
45 Accounting Issues 6.Most sites keep GK/Batch logs but throw away message log files after 9 weeks due to default log rotation. 7.At present the logs provide no means of distinguishing sub-clusters of a CE which have nodes of differing processing power. Changes to the information logged by the batch system will be required before such heterogeneous sites can be accounted properly. At present it is assumed that all PBS sites are homogeneous. 8.Normalisation.
46
46 Future Plans Support for the LSF batch system. Understand Normalisation issues; do we have faith in the numbers we present? Extend accounting schema to include information about the worker node, Job efficiency and globalJobID. Integrate the LCG schema with de-facto grid accounting standards, namely GGF –Share data with other Grid Communities NorduGrid, Grid03
47
47 Summary GOCDB to take a more important role in operation environment A shift in the monitoring paradigm which relies on sharing data through RGMA Accounting Information gathering infrastructure and reporting web site Development towards on-demand services to provide the community with up-to-date information, aggregated at different levels. Development of Visualisation tools to enhance our understanding of the grid. Adaptive Job brokering based on the monitoring system
48
48 Summary Since August 2003, the LCG GOC has been working to understand the problems of running a large scale distributed grid. Setup a distributed GOC and deployed tools to help understand the issues. Development towards on-demand services to provide the community with up-to-date information, aggregated at different levels. Development of Visualisation tools to enhance our understanding of the grid.
Similar presentations
© 2025 SlidePlayer.com. Inc.
All rights reserved.