Published by Deirdre Lambert. Modified over 9 years ago.
1
JOB DEFINITION as a part of Production
Pavel Nevski
DDM Workshop, BNL, September 27, 2006
2
Job Definition Role
• It is largely seen as a data provider for ProdSys
  – Interface for ATLAS physicists to the Distributed Production
  – Job provider for the Distributed Production System (ProdSys)
• On the other hand, it is the first line of data consumption
  – Jobs have to be defined dynamically to optimize production throughput and avoid waste of resources
  – Users need to know the request progress in a reliable way to actively participate in production validation
• Hence, it has to be fully integrated with ATLAS DDM
  – … Point of concern for the Software Integration group
3
System Functionality
• A lot of control: input parameter values, input availability, software availability, etc.
• Automated task status update
• Automated task progress monitoring
• Automated mail notification
• Dynamic job definition based on job status in the Production database
• Integration with DDM to access inputs
4
Job Definition tools
• Task Request Interface
• Request control tools
• Job submission tools
• Statistics collection
• Mailer
• Input DDM interface
5
Task Request Page (AKTR)
• Part of the ATLAS-wide Apache server provided with the Panda monitor; users see a single comprehensive web interface to the information services: http://gridui02.usatlas.bnl.gov:25880/server/pandamon/query?mode=listtask
• The back end is a Python server interfaced to MySQL DBs at BNL (Panda DB, logging DB, DB with task, dataset and (site) info) and to DQ2
  – Monitor and database run at BNL
  – Development version at CERN
7
Task Request Features
• Flexible control scripts are integrated with the User Request tools
  – Check naming, number of events, component existence, etc. at the moment the user is typing the request, so that mistakes can be corrected immediately
  – Check parameters against transformations
• The job of the production manager is limited to a "one click" request acceptance/rejection
• The rest (request -> task -> jobs -> DDM) proceeds automatically
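The request-time checks could look roughly like the sketch below. It is only an illustration: the function `check_request`, the name pattern, the transformation names and the events-per-job rule are all assumptions, since the slides do not list the actual rules used by the task request page.

```python
import re

# Assumed values for illustration only; the slides do not list the real rules.
KNOWN_TRANSFORMATIONS = {"example.evgen.trf", "example.simul.trf"}
NAME_PATTERN = re.compile(r"^[A-Za-z0-9_.]+$")


def check_request(name, n_events, events_per_job, transformation):
    """Return a list of problems found in a task request (empty if none)."""
    problems = []
    if not NAME_PATTERN.match(name):
        problems.append("task name contains unsupported characters")
    if n_events <= 0:
        problems.append("number of events must be positive")
    elif events_per_job <= 0:
        problems.append("events per job must be positive")
    elif n_events % events_per_job != 0:
        problems.append("number of events is not a multiple of events per job")
    if transformation not in KNOWN_TRANSFORMATIONS:
        problems.append("unknown transformation: " + transformation)
    return problems
```

Returning a list of messages rather than raising on the first error lets a web form show every problem at once while the user is still editing the request.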
8
Input Parameter Control
• Most conflicts are detected at task request time and corrected immediately (mostly numeric information)
• Some conflicts that could potentially be corrected within the same release are detected at job submission time
• Some additional checks are implemented on request, such as requiring the same grid flavor for simulation and reconstruction tasks
9
Task status flow
• Requested (used to create datasets)
  – Pending: a grace period for submitter and manager
  – Testing: often needed for new types of jobs
• Active state:
  – Submitting: inputs are not all finished
  – Submitted: fully available for production
  – Running: some outputs are available
• Final state:
  – Done: 100% finished successfully
  – Finished: finished, but with less than 100% success
  – Failed (?): finished with no successful jobs
  – Aborted
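The final-state rules above can be expressed as a small function. This is a sketch under stated assumptions: the function name `final_task_state` and its arguments are hypothetical, and treating a manager abort as taking precedence over the job counts is a guess, not something the slides state.

```python
def final_task_state(n_jobs, n_succeeded, aborted_by_manager=False):
    """Classify a terminated task from its job counts.

    Assumes every job of the task has already terminated.
    """
    if aborted_by_manager:
        return "Aborted"  # assumption: an explicit abort overrides job counts
    if n_jobs <= 0 or not 0 <= n_succeeded <= n_jobs:
        raise ValueError("inconsistent job counts")
    if n_succeeded == n_jobs:
        return "Done"      # 100% finished successfully
    if n_succeeded == 0:
        return "Failed"    # finished with no successful jobs
    return "Finished"      # finished, but with less than 100% success
```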
10
Task progress control
• Task execution status is checked daily
• Automatic mail notification is generated when jobs are stuck for the usual reasons (prepared, pending, maxattempt reached, etc.)
• After about one month (depending on task type, grid, etc.) some automatic actions are forced: increasing maxattempt, aborting autoaborted jobs, etc.
• After two months some human interaction is initiated to achieve progress
• After three months the production manager aborts the remaining jobs (do not complain!)
As a result, the oldest running task now is the one submitted on June 22
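The age-based escalation can be sketched as follows. The thresholds of 30, 60 and 90 days are assumptions standing in for "about one month", "two months" and "three months"; the slide notes that the real limits depend on task type and grid. The function name `escalation_step` and the returned labels are hypothetical.

```python
from datetime import date

# Assumed thresholds in days; the real ones depend on task type, grid, etc.
AUTOMATIC_AFTER = 30
HUMAN_AFTER = 60
ABORT_AFTER = 90


def escalation_step(submitted, today):
    """Return the action a daily check would take for an unfinished task."""
    age_days = (today - submitted).days
    if age_days >= ABORT_AFTER:
        return "abort remaining jobs"
    if age_days >= HUMAN_AFTER:
        return "request human interaction"
    if age_days >= AUTOMATIC_AFTER:
        return "force automatic actions"
    return "daily check only"
```

For example, `escalation_step(date(2006, 8, 1), date(2006, 9, 27))` covers 57 days and falls into the automatic-action band.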
11
Managing Parameter Input
• When a group of tasks with similar conditions is required, we introduce a project which is associated with a set of default parameters, e.g.:
  – Specific geometry version (Ideal, Misaligned, etc.)
  – Fast shower parametrization vs. full G4
  – Different digitization (low noise, high noise, etc.)
• When the project is selected, default values are automatically set for all associated parameters
• When input comes from a different project, the project names are concatenated
• Example: Ideal_06_MisAl_07_csc11 – events generated with release 11 passed through simulations with misaligned geometry and reconstructed with
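A minimal sketch of project defaults and name concatenation is given below. The parameter keys, the default values and the way the example name splits into "Ideal_06" and "MisAl_07" are assumptions made for illustration; the slides do not show how projects are actually stored or how the names are joined.

```python
# Hypothetical project table; keys and values are illustrative only.
PROJECT_DEFAULTS = {
    "Ideal_06": {"geometry": "Ideal", "shower": "full G4", "digitization": "low noise"},
    "MisAl_07": {"geometry": "Misaligned", "shower": "full G4", "digitization": "low noise"},
}


def resolve_parameters(project, overrides=None):
    """Start from the project defaults, then apply explicit user values."""
    params = dict(PROJECT_DEFAULTS[project])
    params.update(overrides or {})
    return params


def combined_project_name(input_project, project):
    """Concatenate names when the input comes from a different project."""
    if input_project == project:
        return project
    return input_project + "_" + project
```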
12
Input Control Integrated with DDM
• If input data were produced on the same grid flavor, jobs are released in TOBEDONE status
  – Still, simulation jobs are in WAITING status to allow evgen file replication
• If input data were produced on a different grid flavor, jobs are defined in WAITINGCOPY status
• If user input (events) is required, jobs are defined as WAITINGINPUT until input is fully available
• Input data needed are first collected at CERN (CAF)
• A subscription in DQ2 is made and monitored until input becomes available for jobs ( …often waiting forever… )
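The status rules above can be summarized in one function. This is a sketch only: the function name, its arguments, the "simul" job-type label and the order in which the rules are tested are assumptions; only the four status names come from the slide.

```python
def initial_job_status(job_type, input_grid, target_grid,
                       needs_user_input=False, user_input_ready=False):
    """Pick the status a newly defined job starts in."""
    if needs_user_input and not user_input_ready:
        return "WAITINGINPUT"   # user-supplied events not fully available yet
    if input_grid != target_grid:
        return "WAITINGCOPY"    # input was produced on a different grid flavor
    if job_type == "simul":
        return "WAITING"        # leave time for evgen file replication
    return "TOBEDONE"           # same grid flavor, ready for release
```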
13
Current Status
• Currently used by
  – central production
  – CTB simulation production
  – "private" production (Saclay, BNL)
• More than 3000 task requests served
• Fast turnaround
• Running as a daemon on a SWING machine
• Works with both old and new (Python) trfs, with automatic parameter extraction
• Job definition is being documented on the ATLAS TWiki by Junichi
• Production is often blocked as soon as data movement is involved
14
Plans
• Fully debug the DDM interface and move it to the common cron
  – Better synchronization with dataset opening/closing
  – Typical error correction on input dataset delivery
  – Control the final state of jobs with DDM
• Maintain configuration parameters in the database
15
BACK-UP MATERIAL
16
Components and their relations
[Diagram with the following elements: USER, Web interface (AKTR), Request Table, Production Manager, Task table, JobDef entry, JobExec, Outputs, DDM Dataset, metadata, Production Operation]
17
How It Works
• User submits a Request
  – Automatic parameter control is performed(!)
• Production manager accepts or rejects it
  – Manager can affect priority and/or grid assignment, but not the request parameters
  – A mail to the submitter is generated
• Production scripts (daemon) automatically convert approved requests into Prod DB Tasks and Jobs and register new target datasets with DDM
19
Control Flow Chart
• Tasks go through a chain of states:
  – After the request, a task is Pending for submission, or Testing if debugging is needed
  – After approval it is Submitting or Submitted
  – When it is Submitted and jobs start to succeed, it goes to Running
  – When all jobs are terminated (DONE or ABORTED), the task is Done or Finished (or Failed)
• Current status and statistics are available on the Task Page
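The chain of states can be written down as a transition table. The table below is a reading of the slide, not the production code: the names `ALLOWED_TRANSITIONS` and `can_move` are hypothetical, and the transitions into Aborted and from Submitted directly to Failed are assumptions added so that every state has a way out.

```python
# Transition table inferred from the slide; entries marked "assumed" are guesses.
ALLOWED_TRANSITIONS = {
    "Requested": {"Pending", "Testing"},
    "Pending": {"Submitting", "Submitted", "Aborted"},   # Aborted assumed
    "Testing": {"Submitting", "Submitted", "Aborted"},   # Aborted assumed
    "Submitting": {"Submitted", "Aborted"},              # Aborted assumed
    "Submitted": {"Running", "Failed", "Aborted"},       # Failed, Aborted assumed
    "Running": {"Done", "Finished", "Failed", "Aborted"},
}


def can_move(current, new):
    """Return True if a task may go from state `current` to state `new`."""
    return new in ALLOWED_TRANSITIONS.get(current, set())
```

Final states (Done, Finished, Failed, Aborted) have no entry in the table, so `can_move` rejects any attempt to leave them.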
20
Conclusion
• Job definition for ATLAS distributed production is in good shape
• More integration with ATLAS DDM is foreseen in the near future
• We are looking forward to the challenging ramp-up of ATLAS distributed production