1
Vandy Berten, Luc Goossens (CERN/EP/ATC)
Alvin Tan (University of Birmingham)
07/11/2018
2
Pre-history of AtCom project
- test productions in Jan 2002 at CERN (LSF) enjoyed a nearly 100% success rate
  - no need for tools to clean up/resubmit the odd failed job
- real production in summer 2002 suffered an average of 20% failures
  - many factors were against us
  - at 300 CPU slots capacity, cleanup became overwhelming
- Sep 2002: Vandy Berten (technical student) needed a well-defined project
  - AtCom (Atlas Commander) project was born
3
What is the Atlas Commander?
graphical interactive tool to support the production manager
- define jobs in large quantities
- submit and monitor progress
- scan log files for (un)known errors
- update bookkeeping databases (AMI, Magda)
- clean up in case of failures
4
History of AtCom project
- total resource count is about 5 man-months
- ideal situation: two persons in same office, one CS student as developer/designer plus one software engineer as client/designer/developer
- successful multi-cluster live demo at the Atlas SW workshop in Nov
- has been in continuous production use at CERN since Oct 2002
- v1.0 released end of Jan -> production quality
- v1.2 released early March
  - removed last important limitation
  - last important limitation is the remote capability
5
AtCom has its own web site
atlas-project-atcom/ contains user guide, developer’s guide, documentation, downloads, relevant contacts, etc.
6
Architecture: application + plug-ins
[Diagram: AtCom core connected through plug-ins to clusters and bookkeeping DBs]
- cluster plug-ins: LSFComputingSystem, EDGComputingSystem, NGComputingSystem, PBSComputingSystem, ...
- bookkeeping DB plug-ins: AMIMgt (AMI), MagdaMgt (Magda)
7
Architecture (continued)
- a plug-in implements the abstract ‘cluster’ interface for a specific kind of cluster, e.g. LSF
- a plug-in is a Java class + configuration parameters
- the AtCom configuration file defines all existing plug-ins and allows each to have its own configuration section
- plug-ins are loaded at run-time
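The plug-in mechanism described above can be sketched roughly as follows. AtCom plug-ins are Java classes; this sketch uses Python only for brevity. The `submit` method, the layout of the configuration text, and the queue names are all assumptions made for illustration and are not taken from the AtCom code or its real configuration file.

```python
import configparser
from abc import ABC, abstractmethod


class ComputingSystem(ABC):
    """Hypothetical abstract 'cluster' interface that every plug-in implements."""

    def __init__(self, params):
        # Each plug-in keeps the parameters from its own configuration section.
        self.params = dict(params)

    @abstractmethod
    def submit(self, wrapper_path):
        """Submit a wrapper script and return some job identifier."""


class LSFComputingSystem(ComputingSystem):
    def submit(self, wrapper_path):
        # Placeholder behaviour: a real plug-in would talk to the batch system.
        return f"lsf:{self.params['queue']}:{wrapper_path}"


class PBSComputingSystem(ComputingSystem):
    def submit(self, wrapper_path):
        return f"pbs:{self.params['queue']}:{wrapper_path}"


# Hypothetical configuration: one section listing the plug-ins,
# then one section per plug-in with its own parameters.
CONF_TEXT = """
[plugins]
names = LSFComputingSystem, PBSComputingSystem

[LSFComputingSystem]
queue = queue_a

[PBSComputingSystem]
queue = queue_b
"""


def load_plugins(conf_text, namespace):
    """Instantiate every plug-in named in the configuration at run-time."""
    parser = configparser.ConfigParser()
    parser.read_string(conf_text)
    names = [n.strip() for n in parser["plugins"]["names"].split(",")]
    plugins = {}
    for name in names:
        cls = namespace[name]  # look the class up by name
        if not issubclass(cls, ComputingSystem):
            raise TypeError(f"{name} does not implement ComputingSystem")
        plugins[name] = cls(parser[name])
    return plugins


PLUGINS = load_plugins(CONF_TEXT, globals())
```

The core only ever sees the abstract interface, so supporting a new kind of cluster means adding one class and one configuration section.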
8
Available plug-ins
- LSF: well understood and supported
- NG: development suspended after last SW
- PBS: developed by Alvin Tan
- EDG: working, but no EDG-based clusters used in production
- BQS: developed by Jerome Fulachier
9
Bookkeeping databases
5 logical database domains, two physical databases
- physical databases
  - AMI (Atlas Meta-data Interface) - mySQL DB hosted at Grenoble
  - Magda (Manager for grid-based data) - mySQL DB hosted at BNL
- logical domains
  - physics meta-data
  - recipe catalog
  - permanent production log
  - transient production log
  - replica catalog
10
Concepts: datasets
- abstract level: transformation (process), dataset
[Diagram labels: evgen, evgen.2000, simul, simul.2000, simul.2099, pileup, lumi]
11
Concepts: partitions
- concrete level: transformation process = job, partition = file
[Diagram labels: evgen, evgen546, simul, simul876, pileup760, lumi]
12
Two main functions of AtCom
- definition of jobs
- job submission/monitoring
13
Definition of jobs
- select a dataset with the SQL query composer
  - dataset determines transformation
[Screenshot callouts: Select a dataset using the SQL query composer. Choose to view relevant fields. Preview SQL query.]
15
- select a version of its transformation
  - version determines uses, signature, outputs, …
16
- AtCom allows you to define a counter range and then use the counter value in expressions for the parameter values
- for each partition you want to create, define
  - some constant attributes
  - the logical values for the parameters of the transformation
  - the LocalFileName -> LogicalFileName mapping of all outputs
  - the destination of stdout/stderr
[Screenshot callouts: Transformation-type aware Parameter Load/Save feature. User can define a variable range and various constants for use in expressions. Formulate expressions for partition and job-specific parameters.]
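The counter-range idea above can be illustrated with a small sketch. AtCom's actual expression syntax is not shown in the slides, so plain Python functions stand in for the expressions here; the attribute names, the file-naming scheme, and the log path are hypothetical.

```python
def define_partitions(counter_range, constants, expressions):
    """Create one partition definition per counter value.

    Each expression is a function of (counter, constants) that yields
    the value of one partition-specific parameter.
    """
    partitions = []
    for counter in counter_range:
        partition = dict(constants)  # constant attributes are copied as-is
        for name, expr in expressions.items():
            partition[name] = expr(counter, constants)
        partitions.append(partition)
    return partitions


# Hypothetical constants shared by all partitions.
constants = {"dataset": "simul.2000", "events_per_partition": 100}

# Hypothetical expressions using the counter value n and the constants c.
expressions = {
    "partition_nr": lambda n, c: n,
    "first_event": lambda n, c: (n - 1) * c["events_per_partition"] + 1,
    "output_lfn": lambda n, c: f"{c['dataset']}.{n:05d}.data",
    "stdout_dest": lambda n, c: f"logs/{c['dataset']}.{n:05d}.out",
}

partitions = define_partitions(range(1, 4), constants, expressions)
```

With a counter range of 1 to 3 this yields three partition definitions that differ only in the values derived from the counter.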
19
Submission
- select a number of defined partitions using the SQL query composer
- select a target cluster
- jobs are submitted
  - for most clusters this means a number of auxiliary files are created (wrappers, jdl/xrsl files, …)
[Screenshot callouts: Construct SQL query. Specify dataset constraints. Specify partition constraints. Preview SQL query.]
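As a rough illustration of what a query composer does, the sketch below builds a parameterized SELECT from a set of constraints and runs it against an in-memory SQLite database. The table name `partitions`, its columns, and the status values are hypothetical; the real AMI schema is not given in the slides.

```python
import sqlite3


def compose_query(table, fields, constraints):
    """Return (sql, params) for a SELECT with AND-ed equality constraints.

    Table and field names are interpolated directly, so they must come
    from trusted code; only the constraint values are bound as parameters.
    """
    sql = f"SELECT {', '.join(fields)} FROM {table}"
    if constraints:
        sql += " WHERE " + " AND ".join(f"{col} = ?" for col in constraints)
    return sql, list(constraints.values())


# Hypothetical partition table with a few example rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE partitions (name TEXT, dataset TEXT, status TEXT)")
conn.executemany(
    "INSERT INTO partitions VALUES (?, ?, ?)",
    [
        ("simul.2000.0001", "simul.2000", "defined"),
        ("simul.2000.0002", "simul.2000", "validated"),
        ("evgen.2000.0001", "evgen.2000", "defined"),
    ],
)

# Dataset constraint plus partition constraint, as in the composer dialog.
sql, params = compose_query(
    "partitions", ["name"], {"dataset": "simul.2000", "status": "defined"}
)
rows = conn.execute(sql, params).fetchall()
```

Previewing the query amounts to showing `sql` to the user before it is executed.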
21
What happens on submission? (LSF@CERN)
- when partition is unreserved in DB, reserve it and create part_run_info record
- transformation definition path is resolved
  - just prepend an AFS path
- a wrapper is created
  - insert commands to set up correct environment
  - resolve logical actual values into physical actual values
    - LFN parameter -> prepend Castor path according to fixed algorithm
    - LFNlist parameter -> expand LFNlist syntax into list of LFNs
      - for each LFN prepend Castor path according to fixed algorithm
      - insert into wrapper commands to create aux file containing these Castor PFNs
  - insert line calling “core” script
  - insert commands to copy outputs to final destination
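The logical-to-physical resolution step can be sketched as below. The slides do not spell out the fixed algorithm or the LFNlist syntax, so both are invented here for illustration: the Castor prefix, the rule "subdirectory = first two dot-separated fields of the LFN", and the `prefix[first-last]suffix` list syntax are all assumptions.

```python
import re

# Hypothetical Castor base directory.
CASTOR_PREFIX = "/castor/example/atlas"


def lfn_to_pfn(lfn):
    """Hypothetical fixed algorithm: file under a directory named after its dataset."""
    dataset = ".".join(lfn.split(".")[:2])
    return f"{CASTOR_PREFIX}/{dataset}/{lfn}"


def expand_lfn_list(spec):
    """Expand the hypothetical syntax 'prefix[first-last]suffix' into LFNs."""
    match = re.fullmatch(r"(.*)\[(\d+)-(\d+)\](.*)", spec)
    if match is None:
        return [spec]  # a plain LFN, nothing to expand
    prefix, first, last, suffix = match.groups()
    width = len(first)  # keep the zero padding of the range bounds
    return [
        f"{prefix}{i:0{width}d}{suffix}"
        for i in range(int(first), int(last) + 1)
    ]


def aux_file_commands(spec, aux_name):
    """Shell lines for the wrapper that write all Castor PFNs into an aux file."""
    pfns = [lfn_to_pfn(lfn) for lfn in expand_lfn_list(spec)]
    return [f"rm -f {aux_name}"] + [f"echo {pfn} >> {aux_name}" for pfn in pfns]


lines = aux_file_commands("evgen.2000.[0001-0003].data", "inputs.txt")
```

The wrapper then passes the aux file name to the core script in place of the LFNlist parameter.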
22
What happens when you submit? (continued)
- a wrapper is created (continued)
  - insert line calling “core” script
    - use PFN/name of aux file instead of LFN/LFNlist
  - insert commands to copy outputs to final destination
    - convert LFN into Castor PFN using fixed algorithm
    - rfcp local file to that PFN
    - could insert code here to check rfcp/retry
- the job is submitted to LSF
  - using the right queue (conf file)
  - using -o and -e to specify temp locations for stdout and stderr in same dir
- the wrapper code is saved with a unique name in a dir with a unique name (in the dir specified in the AtCom.conf file)
- the jobID returned by LSF is recorded in the part_run_info record together with the temp locations for stdout and stderr
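The LSF submission step above might look roughly like this. The sketch only builds the `bsub` command line and parses a job ID; it does not run anything. The `-q`, `-o` and `-e` options are standard `bsub` options, while the paths, the queue name, the record layout, and the assumption that `bsub` answers with a line of the form `Job <id> is submitted to queue <name>.` are illustrative.

```python
import re


def build_bsub_command(queue, stdout_path, stderr_path, wrapper_path):
    """Command line submitting the wrapper with stdout/stderr in temp locations."""
    return ["bsub", "-q", queue, "-o", stdout_path, "-e", stderr_path, wrapper_path]


def parse_job_id(bsub_output):
    """Extract the numeric job ID from bsub's reply, or None if not found."""
    match = re.search(r"Job <(\d+)>", bsub_output)
    return int(match.group(1)) if match else None


# Hypothetical unique directory holding the wrapper and the temp log files.
run_dir = "/tmp/atcom/run_000123"
command = build_bsub_command(
    queue="queue_a",
    stdout_path=f"{run_dir}/stdout",
    stderr_path=f"{run_dir}/stderr",
    wrapper_path=f"{run_dir}/wrapper.sh",
)

# Example reply text; in practice this would come from running the command.
reply = "Job <4711> is submitted to queue <queue_a>."

# Hypothetical shape of the part_run_info record updated after submission.
part_run_info = {
    "job_id": parse_job_id(reply),
    "stdout": f"{run_dir}/stdout",
    "stderr": f"{run_dir}/stderr",
}
```

Keeping the job ID and the temp log locations together in one record is what later allows running jobs to be recovered and post-processed.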
23
Monitoring
- jobs you submit are automatically added to the list of monitored jobs
- running jobs can be recovered from the part_run_info table if needed
  - e.g. after having closed AtCom
- any other partition can be added to the list as well using the SQL query composer
  - allows you to “see” also finished, defined jobs
  - for the bar charts of course
24
When a job moves from RUNNING to DONE post processing commences
- resolve validation script logical name into physical name and apply it to stdout/stderr in temp locations
  - returns 1=OK, 2=Undecided or 3=Failed
- if OK
  - register output files with Magda replica catalog
  - resolve extract script and apply it to stdout
    - writes to stdout a set of attribute value pairs
    - AtCom will attempt an UPDATE query with this on the partition table
  - copy/move logfiles to final destination
  - set status of partition to Validated
25
- if Failed
  - delete output files
- if Undecided
  - mark job as such
  - production manager can look at output of validation script or at the logfiles themselves and then force a decision as OK or Failed
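The post-processing decision described on the last two slides can be summarized in a short sketch. The return codes 1/2/3 come from the slides; the `name=value` format of the extract script output, the table name `partitions`, the column names, and the status strings other than Validated are assumptions.

```python
OK, UNDECIDED, FAILED = 1, 2, 3  # validation script return codes from the slides


def parse_attributes(extract_stdout):
    """Parse attribute value pairs, assuming one 'name=value' per line."""
    pairs = {}
    for line in extract_stdout.splitlines():
        if "=" in line:
            key, value = line.split("=", 1)
            pairs[key.strip()] = value.strip()
    return pairs


def build_update(partition_name, attributes):
    """Parameterized UPDATE on a hypothetical 'partitions' table."""
    assignments = ", ".join(f"{col} = ?" for col in attributes)
    sql = f"UPDATE partitions SET {assignments} WHERE name = ?"
    return sql, list(attributes.values()) + [partition_name]


def post_process(partition_name, validation_code, extract_stdout=""):
    """Return the resulting status, the follow-up actions and the UPDATE, if any."""
    if validation_code == OK:
        attributes = parse_attributes(extract_stdout)
        update = build_update(partition_name, attributes) if attributes else None
        actions = [
            "register output files with replica catalog",
            "update partition table",
            "copy/move logfiles to final destination",
        ]
        return {"status": "Validated", "actions": actions, "update": update}
    if validation_code == FAILED:
        return {"status": "Failed", "actions": ["delete output files"], "update": None}
    # Undecided: leave the decision to the production manager.
    return {"status": "Undecided", "actions": ["await manual decision"], "update": None}


result = post_process("simul.2000.0001", OK, "events=100\ncpu_time=42")
```

A forced decision by the production manager would simply call the same function again with OK or FAILED in place of UNDECIDED.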
28
Questions?
29
Thank you!