1 A lightweight Monitoring and Accounting system for LHCb DC'04 production. V. Garonne, R. Graciani Díaz, J. J. Saborido Silva, M. Sánchez García, R. Vizcaya Carrillo
2 Outline: Manifesto; Monitoring (Web interface, Internals); Accounting (Web interface, Internals); Outlook; URLs
3 Manifesto. Monitoring and Accounting are tasks in DIRAC 377. DIRAC is a production grid for LHCb. The Monitoring reports the status of jobs while in the WMS (Workload Management System) 366: an instantaneous snapshot of the system, with no historic records. The Accounting records the status of jobs after leaving the WMS: it provides a historic record, accumulated statistics and the evolution of recorded variables with time. Main users: production and site managers.
4 Design choices. Monitoring: job information is stored centrally in the WMS. Information is provided directly by the job and the WMS (push). Passive services: no pushing of information. No need for a common consumer API. Job and application state are stored together. Accounting: separate infrastructure from the monitoring. A job can never be in both the Accounting and the Monitoring at the same time. Domain specific: LHCb production jobs.
5 Information Flow [Diagram. Labels: Job, Job Heart-beat, DIRAC, WMS, Job Database, Cleaner Agent, Accounting Database, Monitoring (Read/Write), Accounting (Write/Read), Web interface, Users, Backend Services & Agents]
6 Monitoring Web Interface 1: interface to query the monitoring service. Clicking a JobId pops up a window with the job details.
7 Monitoring Web Interface 2: the overview shows predefined plots on the production, generated every few minutes. PyChart is used as the graphics engine: 100% Python, supports SVG. Plot: running jobs by site.
8 Monitoring Web Interface 3 Job status by site and production id
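The tabulation behind a plot such as "job status by site and production id" can be sketched as a simple count over job records. This is only an illustration: the record layout, the field names (`Site`, `ProductionId`, `Status`), the helper `count_by` and the sample values are assumptions, not DIRAC's actual schema or code.

```python
from collections import Counter

# Hypothetical job records; field names and values are illustrative only.
jobs = [
    {"JobId": 1, "Site": "DIRAC.CERN.ch", "ProductionId": 100, "Status": "running"},
    {"JobId": 2, "Site": "DIRAC.CERN.ch", "ProductionId": 100, "Status": "done"},
    {"JobId": 3, "Site": "DIRAC.Example.org", "ProductionId": 101, "Status": "running"},
    {"JobId": 4, "Site": "DIRAC.CERN.ch", "ProductionId": 100, "Status": "running"},
]


def count_by(records, *fields):
    """Count records grouped by the tuple of the given field values."""
    return Counter(tuple(r[f] for f in fields) for r in records)


# Job status by site and production id.
status_counts = count_by(jobs, "Site", "ProductionId", "Status")

# Running jobs by site.
running_by_site = count_by([j for j in jobs if j["Status"] == "running"], "Site")

for key, n in sorted(status_counts.items()):
    print(key, n)
```

The resulting counters are what a plotting package would then render as bars per site.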
9 Monitoring Internals. An XML-RPC service exposes whatever parameters are known to DIRAC. Job parameters are stored internally by DIRAC. Primary parameters (execution site, job status, job owner, etc.): fixed and centrally defined; fast access; can be queried on. Secondary parameters (number of steps, internal job state, etc.): defined by the production job itself; stored as key-value pairs; slower access; cannot be queried on.
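One way to picture the primary/secondary split is a fixed, indexed table next to a free-form key-value table. The sketch below uses an in-memory SQLite database purely for illustration; the table layout, column names, sample rows and the helpers `get_jobs` and `get_job_parameter` are assumptions, not the actual DIRAC storage. The "cannot query on them" restriction is enforced here at the API level.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    -- Primary parameters: fixed columns, indexed, so selections are fast.
    CREATE TABLE jobs (
        job_id INTEGER PRIMARY KEY,
        site   TEXT,
        status TEXT,
        owner  TEXT
    );
    CREATE INDEX idx_jobs_site_status ON jobs (site, status);

    -- Secondary parameters: free-form key-value pairs defined by each job.
    CREATE TABLE job_parameters (
        job_id INTEGER REFERENCES jobs (job_id),
        name   TEXT,
        value  TEXT,
        PRIMARY KEY (job_id, name)
    );
""")

# Illustrative sample data.
db.execute("INSERT INTO jobs VALUES (1, 'DIRAC.CERN.ch', 'running', 'alice')")
db.execute("INSERT INTO jobs VALUES (2, 'DIRAC.CERN.ch', 'done', 'bob')")
db.execute("INSERT INTO job_parameters VALUES (1, 'LocalBatchId', '4711')")

PRIMARY = {"site", "status", "owner"}


def get_jobs(conditions):
    """Select job ids by primary parameters only."""
    if not set(conditions) <= PRIMARY:
        raise ValueError("can only query on primary parameters")
    keys = sorted(conditions)
    where = " AND ".join("%s = ?" % k for k in keys) or "1"
    rows = db.execute(
        "SELECT job_id FROM jobs WHERE " + where + " ORDER BY job_id",
        [conditions[k] for k in keys],
    )
    return [r[0] for r in rows]


def get_job_parameter(job_id, name):
    """Look up one secondary parameter for an already known job id."""
    row = db.execute(
        "SELECT value FROM job_parameters WHERE job_id = ? AND name = ?",
        (job_id, name),
    ).fetchone()
    return row[0] if row else None
```

A typical use is to select job ids with `get_jobs({"status": "running"})` and then fetch secondary values one job at a time with `get_job_parameter`.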
10 JMS basic API example

    # Python 2 module name; in Python 3 the same class lives in xmlrpc.client
    from xmlrpclib import ServerProxy

    # monitoring_url must hold the URL of the monitoring service
    server = ServerProxy(monitoring_url)

    # Retrieve the list of jobs satisfying some conditions
    conditions = {'Status': 'running', 'Site': 'DIRAC.CERN.ch'}
    jobreq = server.getJobs(conditions)

    if jobreq['Status']:
        # Print some parameters for each job
        for jobid in jobreq['Value']:
            print(server.getJobSite(jobid))
            print(server.getJobParameter(jobid, 'LocalBatchId'))

        # Bulk operations
        summary = server.getJobsPrimarySummary(jobreq['Value'])

Timings quoted on the slide: ~3 s to select 95 out of 50k jobs; ~0.7 s; ~40 s
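The motivation for bulk operations is mostly the number of round trips to the service. A minimal sketch with an in-process stand-in makes this countable. The class `FakeMonitoring`, its data and its call counter are hypothetical; only the method names `getJobs`, `getJobSite` and `getJobsPrimarySummary` come from the slide, and the return layouts used here are assumed.

```python
class FakeMonitoring:
    """In-process stand-in for the XML-RPC service; each call counts as one round trip."""

    def __init__(self, jobs):
        self._jobs = jobs  # {job_id: {"Site": ..., "Status": ...}}
        self.calls = 0

    def getJobs(self, conditions):
        self.calls += 1
        ids = [jid for jid, job in sorted(self._jobs.items())
               if all(job.get(k) == v for k, v in conditions.items())]
        return {"Status": True, "Value": ids}

    def getJobSite(self, job_id):
        self.calls += 1
        return self._jobs[job_id]["Site"]

    def getJobsPrimarySummary(self, job_ids):
        self.calls += 1
        # XML-RPC struct keys must be strings, so the job ids are stringified.
        return {str(jid): dict(self._jobs[jid]) for jid in job_ids}


# 95 running jobs plus one finished job, all illustrative.
jobs = {i: {"Site": "DIRAC.CERN.ch", "Status": "running"} for i in range(1, 96)}
jobs[96] = {"Site": "DIRAC.CERN.ch", "Status": "done"}
server = FakeMonitoring(jobs)

selected = server.getJobs({"Status": "running", "Site": "DIRAC.CERN.ch"})["Value"]

# One call per job ...
server.calls = 0
sites = [server.getJobSite(jid) for jid in selected]
per_job_calls = server.calls

# ... versus a single bulk call for the same jobs.
server.calls = 0
summary = server.getJobsPrimarySummary(selected)
bulk_calls = server.calls

print(per_job_calls, bulk_calls)  # prints "95 1"
```

With a real remote service each of those calls carries network and serialisation overhead, which is why collapsing them into one request pays off.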
11 Accounting Web Interface 1: GUI for querying the Accounting. Shows results as graphics, as a table, or as an Excel sheet. Several types of report; only a few are shown here.
12 Accounting Web Interface 2 Used resources by site
13 Accounting Web Interface 3: used resources by event type. Plots: MB/job, CPU/job, failed jobs, CPU vs. exec time, input and output data vs. CPU.
14 Accounting Web Interface 4: produced data by production ID. Rates and cumulative; number of events and GB of output.
15 Accounting Web Interface 5: WMS statistics on DIRAC's performance. Plots: job execution time vs. WMS waiting time; job execution time vs. WMS matching time. Granularity: per site, per production, integral. Allows assessment of DIRAC's performance.
16 Accounting Internals. Job and DIRAC statistics are kept in a database: site contribution; data produced and used by jobs and steps; timing for jobs, steps and DIRAC internals. Separate XML-RPC interfaces populate and query the accounting tables; both interfaces have restricted access. Jobs are moved to the accounting system by a cleaner agent after being validated.
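The hand-over from monitoring to accounting can be sketched as a single pass of a cleaner agent. Everything here is assumed for illustration: the in-memory dictionaries standing in for the two databases, the `validate` rule, the field `CPUTime` and the status names.

```python
def validate(job):
    """Hypothetical validation rule: the job has finished and reports a CPU time."""
    return job["Status"] in ("done", "failed") and job.get("CPUTime") is not None


def clean_once(job_db, accounting_db):
    """Move every validated job from job_db to accounting_db; return the moved ids."""
    moved = []
    for job_id in sorted(job_db):  # sorted() copies the keys, so deleting is safe
        job = job_db[job_id]
        if validate(job):
            # With real databases these two steps would ideally be one
            # transaction, so that a job is never visible on both sides.
            accounting_db[job_id] = job
            del job_db[job_id]
            moved.append(job_id)
    return moved


# Illustrative data: one running job, one valid finished job, one invalid job.
job_db = {
    1: {"Status": "running", "CPUTime": None},
    2: {"Status": "done", "CPUTime": 120.0},
    3: {"Status": "failed", "CPUTime": None},
}
accounting_db = {}

moved = clean_once(job_db, accounting_db)
print(moved)  # prints "[2]"
```

Jobs that fail validation stay on the monitoring side, where they remain visible until they are fixed or removed by some other means.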
17 Accounting Usage. About 10 hits per day. Time to generate the daily static reports: 8 min (60-70% of the time querying the database, 30-40% in the drawing package). Server load < 0.2. Total: 169k jobs.
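The quoted shares translate into absolute times as follows. This is plain arithmetic on the numbers from the slide; the variable names are illustrative.

```python
total_min = 8.0  # time to generate the daily static reports

# Fractions of the total quoted on the slide, as (low, high) bounds.
shares = {
    "database queries": (0.60, 0.70),
    "drawing package": (0.30, 0.40),
}

minutes = {name: (total_min * lo, total_min * hi)
           for name, (lo, hi) in shares.items()}

for name, (lo, hi) in minutes.items():
    print("%s: %.1f-%.1f min" % (name, lo, hi))
```

So roughly 5 of the 8 minutes go into database queries and about 3 into drawing.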
18 Outlook. Monitoring page: transactions in monitoring updates; further optimisation (bulk operations...); search for a faster rendering package; make the web page dynamic (fewer reloads). Accounting: new report types (normalized CPU; contribution by country; rate by site, country, etc.).
19 URLs Monitoring page Mirror on: Direct link to overview pages Accounting page