Dave Kant LCG Accounting Overview GDA 7th June 2004.

1 Dave Kant D.Kant@RL.AC.UK LCG Accounting Overview GDA 7th June 2004

2 LCG Accounting Overview
- Overview of the accounting program
- Offline analysis of core site data
- Accounting issues

3 Motivation
- Identify jobs that are submitted through the grid and distinguish them from jobs submitted by non-grid users.
- Collect job statistics and aggregate job information across sites and across virtual communities on a daily basis.
- Provide feedback into SLA and MoU.
Jason Leak, Rob Byrom, Martin Craig, Trevor Daniels, John Gordon, Dave Kant

4 Basic Concepts
- The program runs on several different nodes within a site and processes the log files needed to generate accounting information on a daily basis.
- It uses R-GMA to publish information into a MySQL database on the R-GMA MON node at the site, so the site is required to have a properly configured R-GMA installation.
- The MON node processes this data to create accounting records.
- Accounting records are streamed to the GOC R-GMA MON node.

5 Event Log Processing
- Data source: PBS event logs on the CE, in /var/spool/pbs/server_priv/accounting (e.g. 20040203, 20040204, 20040205).
- A PBS filter extracts data from the event log records.
- The R-GMA API publishes the data to a PbsRecords database table on the MON box and records the names of the processed logs in the LcgProcessed table for book-keeping (diagram label: dbProducer).
- This runs every day.
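The book-keeping step on this slide can be illustrated with a minimal sketch. It assumes that the log files are named by date (YYYYMMDD), as in the examples above, and that the set of already processed names has been read from the LcgProcessed table beforehand. The function name `unprocessed_logs` is hypothetical and is not taken from the accounting program.

```python
import re
import tempfile
from pathlib import Path

# Assumption: accounting logs are named by date, e.g. 20040203.
LOG_NAME = re.compile(r"^\d{8}$")


def unprocessed_logs(log_dir, processed):
    """Return date-named log files in log_dir that are not yet in `processed`."""
    return sorted(
        path
        for path in Path(log_dir).iterdir()
        if path.is_file() and LOG_NAME.match(path.name) and path.name not in processed
    )


# Demonstration on a temporary directory standing in for the accounting directory.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("20040203", "20040204", "20040205", "notes.txt"):
        (Path(tmp) / name).touch()
    pending = [p.name for p in unprocessed_logs(tmp, processed={"20040203"})]

print(pending)
```

Keeping the processed names in a set makes a daily rerun cheap: only logs that have not been recorded are parsed again.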

6 PBS Event Record Format
A typical “END” event record from PBS (log file 20040330):

03/30/2004 00:00:47;E;21891.lcgce02.gridpp.rl.ac.uk;user=dteam001 group=dteam jobname=STDIN queue=short ctime=1080601234 qtime=1080601235 etime=1080601235 start=1080601242 exec_host=lcg0317.gridpp.rl.ac.uk/1 Resource_List.cput=00:15:00 Resource_List.neednodes=1 Resource_List.nodect=1 Resource_List.nodes=1 Resource_List.walltime=02:00:00 session=9842 end=1080601247 Exit_status=0 resources_used.cput=00:00:03 resources_used.mem=0kb resources_used.vmem=0kb resources_used.walltime=00:00:11

To identify these records, the program looks for the date and time at the start of the record and the E flag, which marks an end-of-job entry.

Fields extracted:
JobName = 21891.lcgce02.gridpp.rl.ac.uk
LocalUserID = dteam001
LocalUserGroup = dteam
WallDuration = 02:00:00
WallDurationSeconds = 120
CpuDuration = 00:00:03
CpuDurationSeconds = 3
StartTimeEpoch = 1080601242
StartTimeUTC
StopTimeEpoch = 1080601247
StopTimeUTC
SubmitHost = lcgce02.gridpp.rl.ac.uk
MemoryReal = 0
MemoryVirtual = 0

Note: the WallDuration shown matches Resource_List.walltime (the requested limit) rather than resources_used.walltime=00:00:11, and 02:00:00 corresponds to 7200 seconds, not 120.
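As an illustration of the extraction described on this slide, the following sketch parses an END record into a dictionary of accounting fields. It is not the accounting program's own code. Two choices are assumptions: the wall time is taken from resources_used.walltime, and the submit host is taken to be the job name without its leading sequence number. The names `parse_end_record` and `hms_to_seconds` are hypothetical.

```python
import re
from typing import Optional

# "MM/DD/YYYY HH:MM:SS;E;<job name>;<space-separated key=value pairs>"
END_RECORD = re.compile(r"^\d{2}/\d{2}/\d{4} \d{2}:\d{2}:\d{2};E;([^;]+);(.*)$")


def hms_to_seconds(hms: str) -> int:
    """Convert an HH:MM:SS duration to seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def parse_end_record(line: str) -> Optional[dict]:
    """Return the accounting fields of an END record, or None for other lines."""
    match = END_RECORD.match(line.strip())
    if match is None:
        return None
    job_name, rest = match.groups()
    attrs = dict(item.split("=", 1) for item in rest.split() if "=" in item)
    cput = attrs["resources_used.cput"]
    # Assumption: wall time is the time actually used, not the requested limit.
    walltime = attrs["resources_used.walltime"]
    return {
        "JobName": job_name,
        "LocalUserID": attrs["user"],
        "LocalUserGroup": attrs["group"],
        "CpuDuration": cput,
        "CpuDurationSeconds": hms_to_seconds(cput),
        "WallDuration": walltime,
        "WallDurationSeconds": hms_to_seconds(walltime),
        "StartTimeEpoch": int(attrs["start"]),
        "StopTimeEpoch": int(attrs["end"]),
        # Assumption: submit host = job name without the sequence number.
        "SubmitHost": job_name.partition(".")[2],
    }


# The END record shown on the slide.
SAMPLE = (
    "03/30/2004 00:00:47;E;21891.lcgce02.gridpp.rl.ac.uk;user=dteam001 "
    "group=dteam jobname=STDIN queue=short ctime=1080601234 qtime=1080601235 "
    "etime=1080601235 start=1080601242 exec_host=lcg0317.gridpp.rl.ac.uk/1 "
    "Resource_List.cput=00:15:00 Resource_List.neednodes=1 Resource_List.nodect=1 "
    "Resource_List.nodes=1 Resource_List.walltime=02:00:00 session=9842 "
    "end=1080601247 Exit_status=0 resources_used.cput=00:00:03 "
    "resources_used.mem=0kb resources_used.vmem=0kb resources_used.walltime=00:00:11"
)

record = parse_end_record(SAMPLE)
print(record["JobName"], record["CpuDurationSeconds"], record["WallDurationSeconds"])
```

Records of other types (for example ";S;" entries) do not match the pattern and are returned as None, so a caller can simply skip them.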

7 PbsRecords Table Schema
R-GMA publishes data to a PbsRecords database table on the MON box and records the names of the processed logs for book-keeping.

+---------------------+--------------+
| Field               | Type         |
+---------------------+--------------+
| RecordIdentityP     | varchar(255) |
| SiteName            | varchar(50)  |
| JobName             | varchar(100) |
| LocalUserID         | varchar(20)  |
| LocalUserGroup      | varchar(20)  |
| WallDuration        | varchar(30)  |
| CpuDuration         | varchar(30)  |
| WallDurationSeconds | int(11)      |
| CpuDurationSeconds  | int(11)      |
| StartTime           | varchar(30)  |
| StopTime            | varchar(30)  |
| SubmitHost          | varchar(50)  |
+---------------------+--------------+

8 GateKeeper & Message Log Processing
Extract data from the globus-gatekeeper and system messages logs, every day.
- Data sources: Globus gatekeeper logs and system messages logs in /var/log (globus-gatekeeper.log.20040201040203.gz, messages.2.gz, messages.3.gz).
- Tables on the MON box: GKRecords, JobNames, LcgProcessed.

9 GateKeeper Log File Format
The program searches through the gatekeeper logs looking for pairs of JMA records:

JMA 2004/03/29 23:59:49 GATEKEEPER_JM_ID 2004-03-29.23:59:32.0000017193.0000031464 for /C=UK/O=eScience/OU=QueenMaryLondon/L=Physics/CN=dave kant on 130.246.183.189
JMA 2004/03/29 23:59:49 GATEKEEPER_JM_ID 2004-03-29.23:59:32.0000017193.0000031464 has GRAM_SCRIPT_JOB_ID 1080601189:lcgpbs:internal_1591872368:2192.1080601179 manager type lcgpbs

This tells us that the job was submitted through the grid and that the jobmanager was lcgpbs. JMA record pairs that involve a fork are ignored.

Fields extracted:
GramScriptJobID = 1080601189:lcgpbs:internal_1591872368:2192.1080601179
LocalJobID = 2004-03-29.23:59:32.0000017193.0000031464
GlobalUserName = /C=UK/O=eScience/OU=QueenMaryLondon/L=Physics/CN=dave kant
MeasurementDate = 2004-03-29
MeasurementTime = 23:59:32
SubmitHost => from config file
SiteName => from config file

Since the GOC may process logs independently of the sites and store this data in the same tables, a unique record identifier is added to the database. It is derived from the GramScriptJobID, the MeasurementDate and the MeasurementTime.
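The pairing of JMA records described on this slide could look roughly like the sketch below. It assumes that both records of a pair share the same GATEKEEPER_JM_ID, that the "for" record precedes the "has" record, and that fork jobs appear with the manager type string "fork". The function name `pair_jma_records` is hypothetical.

```python
import re

# First record of a pair: who submitted the job.
FOR_RE = re.compile(r"^JMA \S+ \S+ GATEKEEPER_JM_ID (\S+) for (.+) on (\S+)$")
# Second record of a pair: which batch job and jobmanager it became.
HAS_RE = re.compile(
    r"^JMA \S+ \S+ GATEKEEPER_JM_ID (\S+) has GRAM_SCRIPT_JOB_ID (\S+) manager type (\S+)$"
)


def pair_jma_records(lines):
    """Match 'for' and 'has' JMA records on their GATEKEEPER_JM_ID."""
    users = {}    # GATEKEEPER_JM_ID -> user DN
    records = []
    for line in lines:
        line = line.strip()
        match = FOR_RE.match(line)
        if match:
            users[match.group(1)] = match.group(2)
            continue
        match = HAS_RE.match(line)
        if match and match.group(1) in users:
            jm_id, gram_id, manager = match.groups()
            if manager == "fork":  # assumption: fork jobs are labelled "fork"
                continue
            date, time = jm_id.split(".")[:2]
            records.append({
                "GramScriptJobID": gram_id,
                "LocalJobID": jm_id,
                "GlobalUserName": users[jm_id],
                "MeasurementDate": date,
                "MeasurementTime": time,
            })
    return records


# The two records shown on the slide.
LOG = [
    "JMA 2004/03/29 23:59:49 GATEKEEPER_JM_ID 2004-03-29.23:59:32.0000017193.0000031464 "
    "for /C=UK/O=eScience/OU=QueenMaryLondon/L=Physics/CN=dave kant on 130.246.183.189",
    "JMA 2004/03/29 23:59:49 GATEKEEPER_JM_ID 2004-03-29.23:59:32.0000017193.0000031464 "
    "has GRAM_SCRIPT_JOB_ID 1080601189:lcgpbs:internal_1591872368:2192.1080601179 "
    "manager type lcgpbs",
]

gk_records = pair_jma_records(LOG)
print(gk_records[0]["GlobalUserName"])
```

The user DN contains spaces, so the "for" pattern captures everything up to the final " on " rather than splitting on whitespace.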

10 GkRecords Table Schema
GKRecords table on the MON box:

+-----------------+--------------+
| Field           | Type         |
+-----------------+--------------+
| RecordIdentityG | varchar(255) |
| GramScriptJobID | varchar(100) |
| LocalJobID      | varchar(50)  |
| GlobalUserName  | varchar(255) |
| SubmitHost      | varchar(50)  |
| SiteName        | varchar(50)  |
+-----------------+--------------+

11 Message Log Processing
Message log files contain gridinfo records, which map lcgpbs GramScriptJobID records to PBS event log records. Such a mapping is not necessary in vanilla PBS, as these records are identical.
- Gatekeeper log: GramScriptJobID 1080601189:lcgpbs:internal_1591872368:2192.1080601179
- PBS event log: PBSJobNameID 21891.lcgce02.gridpp.rl.ac.uk
- Messages log: gridinfo records match GK “JMA” records to PBS “E” records.

12 Message Log File Format
Mar 30 00:02:25 lcgce02 gridinfo: [3308-3308] Job 1080601189:lcgpbs:internal_1591872368:2192.1080601179 (ID 21891.lcgce02.gridpp.rl.ac.uk) has finished

Fields extracted:
GramScriptJobID = 1080601189:lcgpbs:internal_1591872368:2192.1080601179
JobName = 21891.lcgce02.gridpp.rl.ac.uk
MeasurementDate = 30 Mar 2004
MeasurementTime = 00:02:25

JobNameRecords table on the MON box:

+-----------------+--------------+
| Field           | Type         |
+-----------------+--------------+
| GramScriptJobID | varchar(100) |
| LocalJobID      | varchar(50)  |
| MeasurementDate | varchar(255) |
| MeasurementTime |              |
+-----------------+--------------+
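A gridinfo record of this form can be picked apart with a single regular expression, as in the sketch below. Syslog timestamps carry no year, so the year is passed in by the caller; that parameter and the function name `parse_gridinfo` are assumptions made for this illustration.

```python
import re

# "<Mon> <day> <HH:MM:SS> <host> gridinfo: [<pid info>] Job <gram id> (ID <pbs job>) has finished"
GRIDINFO_RE = re.compile(
    r"^(\w{3})\s+(\d{1,2}) (\d{2}:\d{2}:\d{2}) \S+ gridinfo: \[[^\]]*\] "
    r"Job (\S+) \(ID ([^)]+)\) has finished$"
)


def parse_gridinfo(line, year):
    """Return the GramScriptJobID-to-JobName mapping from one gridinfo line, or None."""
    match = GRIDINFO_RE.match(line.strip())
    if match is None:
        return None
    month, day, time, gram_id, job_name = match.groups()
    return {
        "GramScriptJobID": gram_id,
        "JobName": job_name,
        # Assumption: the year comes from the caller, since syslog omits it.
        "MeasurementDate": f"{day} {month} {year}",
        "MeasurementTime": time,
    }


# The message log line shown on the slide.
LINE = (
    "Mar 30 00:02:25 lcgce02 gridinfo: [3308-3308] Job "
    "1080601189:lcgpbs:internal_1591872368:2192.1080601179 "
    "(ID 21891.lcgce02.gridpp.rl.ac.uk) has finished"
)

mapping = parse_gridinfo(LINE, year=2004)
print(mapping["GramScriptJobID"], "->", mapping["JobName"])
```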

13 CPU Performance
- Data source: LDAP GIIS server.
- A GIIS filter collects CPU performance benchmarks for the worker nodes from the subclusters attached to the CE.
- The R-GMA API publishes the data to the SpecRecords database table on the MON box.
- This runs every day.

14 SpecRecords Schema
CPU performance benchmarks for the worker nodes in the subclusters attached to the CE are stored in the SpecRecords table on the MON box.

+----------------+--------------+
| Field          | Type         |
+----------------+--------------+
| RecordIdentity | varchar(255) |
| SiteName       | varchar(50)  |
| ClusterID      | varchar(50)  |
| SubClusterID   | varchar(50)  |
| SpecInt2000    | int(11)      |
| SpecFloat2000  | int(11)      |
+----------------+--------------+

15 Joining Records Together
- A 4-way join matches records from the GKRecords, PbsRecords, JobNames and SpecRecords tables and writes them to the LcgRecords table on the MON box, every day.
- These records are unique.
- The site now has a copy of its own accounting data.
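The slide does not give the join conditions, so the sketch below is only one plausible reading. It assumes that gatekeeper records link to the job-name mapping on GramScriptJobID, that the mapping links to PBS records on JobName, and that benchmark records link on SiteName. It uses an in-memory SQLite database in place of the MySQL database on the MON box, with reduced column sets. The site name, the benchmark value and the second PBS job are made up for the illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # SQLite stands in for MySQL here
conn.executescript("""
CREATE TABLE GkRecords   (GramScriptJobID TEXT, GlobalUserName TEXT, SiteName TEXT);
CREATE TABLE JobNames    (GramScriptJobID TEXT, JobName TEXT);
CREATE TABLE PbsRecords  (JobName TEXT, SiteName TEXT, LocalUserGroup TEXT,
                          CpuDurationSeconds INTEGER);
CREATE TABLE SpecRecords (SiteName TEXT, SpecInt2000 INTEGER);

INSERT INTO GkRecords VALUES
  ('1080601189:lcgpbs:internal_1591872368:2192.1080601179',
   '/C=UK/O=eScience/OU=QueenMaryLondon/L=Physics/CN=dave kant', 'RAL');
INSERT INTO JobNames VALUES
  ('1080601189:lcgpbs:internal_1591872368:2192.1080601179',
   '21891.lcgce02.gridpp.rl.ac.uk');
-- One grid job from the slides and one made-up local job with no gatekeeper record.
INSERT INTO PbsRecords VALUES
  ('21891.lcgce02.gridpp.rl.ac.uk', 'RAL', 'dteam', 3),
  ('99999.lcgce02.gridpp.rl.ac.uk', 'RAL', 'localgrp', 42);
-- Made-up benchmark value.
INSERT INTO SpecRecords VALUES ('RAL', 1000);

-- Assumed join keys: GramScriptJobID, then JobName, then SiteName.
CREATE TABLE LcgRecords AS
SELECT g.GramScriptJobID, g.GlobalUserName, p.JobName, p.SiteName,
       p.LocalUserGroup, p.CpuDurationSeconds, s.SpecInt2000
FROM GkRecords g
JOIN JobNames    j ON j.GramScriptJobID = g.GramScriptJobID
JOIN PbsRecords  p ON p.JobName = j.JobName
JOIN SpecRecords s ON s.SiteName = p.SiteName;
""")

rows = conn.execute(
    "SELECT JobName, LocalUserGroup, CpuDurationSeconds, SpecInt2000 FROM LcgRecords"
).fetchall()
print(rows)
```

Because these are inner joins, the local job that never passed through the gatekeeper produces no LcgRecords row, which is how grid jobs are separated from non-grid jobs in this sketch.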

16 GOC
- Each site (Site 1 to Site n) holds its LcgRecords table on its own MON node.
- The GOC runs a special program on its MON node. This program listens for data streamed from the LcgRecords table by R-GMA, every day (diagram label: streamProducer).
- In this way, the GOC collects accounting data aggregated across all LCG sites.

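17 Stand-Alone Test Results
Stand-alone means that the GOC has processed log data sent to it by the sites. Data was received from 7 sites covering different periods of time.

+--------+---------------+------------+----------+----------+
| Site   | CE            | JobManager | Start    | End      |
+--------+---------------+------------+----------+----------+
| CAM    | farm012       | lcgpbs     | 30/03/04 | 15/05/04 |
| CERN   | lxn1181       | lcgpbs     | 16/02/04 | 15/04/04 |
| CNAF   | wn-04-10-14-a | lcgpbs     | 09/02/04 | 31/03/04 |
| FZK    | pbs-server2   | pbspro     | 11/03/04 | 31/03/04 |
| NIKHEF | tbn18         | lcgpbs     | 11/02/04 | 15/04/04 |
| RAL    | lcgce02       | pbs/lcgpbs | 20/01/04 | 15/04/04 |
| Taipei | lcg00125      | lcgpbs     | 02/02/04 | 24/05/04 |
+--------+---------------+------------+----------+----------+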

18 Stand-Alone Test Results
CPU usage per VO per site. Note that ALICE jobs dominate by more than an order of magnitude.

19 Stand-Alone Test Results
CPU usage per VO, aggregated over sites.

20 Accounting Issues
1. Only vanilla PBS, lcgpbs and pbspro are supported. IN2P3 is supporting BQS. Extending support to LSF and other batch systems will depend on the amount of effort required; this is to be investigated.
2. The program has been tested in stand-alone mode using log files sent to the GOC by site administrators. It will begin production-mode testing this week.
3. At present the logs provide no means of distinguishing sub-clusters of a CE which have nodes of differing processing power.
4. VO synonyms: FZK prefers “d0” whereas other sites prefer “dzero”. Does LCG impose a fixed naming scheme for VOs?
5. The VO associated with a user’s DN is not available in the batch or gatekeeper logs. It will be assumed that the group ID used to execute user jobs, which is available, is the same as the VO name. This needs to be acknowledged as an LCG requirement. Refer to the next slide for an example.

21 Specific Issues: CNAF
1. 25% of all accounting records built have an unrecognised group in the PBS event END record. There is no way to trace this to the user without access to the log file.
   03/31/2004 15:49:02;S;40372.wn-04-10-14-a.cr.cnaf.infn.it;user=dteam003 group=2688 jobname=STDIN queue=lcg ctime=1080720292 qtime=1080720
2. PBS log files show that group 2688 appeared on 16 March 2004.
3. Prior to this, a named “dteam” group was defined.
   03/15/2004 21:07:51;E;15099.wn-04-10-14-a.cr.cnaf.infn.it;user=dteam004 group=dteam jobname=STDIN queue=lcg ctime=1079381237 qtime=107938123
CNAF to associate “dteam” with group ID 2688.
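One way to handle VO synonyms and numeric group IDs when deriving the VO name from the batch group is a pair of lookup tables, sketched below. The tables, the choice of "dzero" as the canonical name and the function name `resolve_vo` are assumptions made for illustration. The only entries are the two cases mentioned on these slides.

```python
# Assumption: "dzero" is treated as the canonical spelling of the VO name.
VO_SYNONYMS = {"d0": "dzero"}

# Per-site overrides for numeric group IDs, keyed by (site, group).
SITE_GROUP_IDS = {("CNAF", "2688"): "dteam"}


def resolve_vo(site, group):
    """Map the group found in a PBS record to a VO name, or None if unrecognised."""
    vo = SITE_GROUP_IDS.get((site, group), group).lower()
    vo = VO_SYNONYMS.get(vo, vo)
    if vo.isdigit():
        # A numeric group with no known mapping cannot be attributed to a VO.
        return None
    return vo


print(resolve_vo("CNAF", "2688"), resolve_vo("FZK", "d0"), resolve_vo("CNAF", "9999"))
```

Returning None for unmapped numeric groups keeps such records visible as unattributed rather than filing them under a VO named after a number.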

