DIRAC Production Manager Tools
Presenter: Gennady Kuznetsov, for the LHCb collaboration
CHEP 2006, 13-18 February, Mumbai

Outline
Introduction to DIRAC; DIRAC Authoring Tools; Production Repository and Production Agent; Automatic Data Processing; Conclusion.

DIRAC Introduction
DIRAC (Distributed Infrastructure with Remote Agent Control) is the LHCb Workload and Data Management system for distributed data production and physics analysis (Monte Carlo simulation, reconstruction…). It is composed of a set of light-weight services and a network of distributed agents that deliver workload to computing resources. It integrates computing resources available at LHCb production sites as well as on the LCG grid. It was demonstrated to support 5000 simultaneous jobs across 65 sites with no observed limitation, and it currently handles over 70 TB of data. (See the talk by Dr. Andrei Tsaregorodtsev, [301] DIRAC, the LHCb Data Production and Distributed Analysis system.)

Production Manager
The Production Manager plays a major role in executing organised data production activities within LHCb on a very large scale. The main tasks are: designing data processing sequences; creating many thousands of data processing jobs; submitting and monitoring jobs; verifying output files; replicating data files in different centres. All of these activities can be fulfilled efficiently only with automated tools that assist the Production Manager in handling many thousands of distributed jobs and files.

Jobs and Workflow
A Job performs a certain activity that can be described by a complex succession of operations (e.g. running applications). Each operation takes input data, produces output data and is defined by parameters. The Workflow is an abstract description of the sequence of operations to be performed.
Example: the simulation workflow consists of 5 operations (Gauss 1, Gauss 2, Gauss 3, Boole, Brunel).
Gauss: run the Monte Carlo simulation and write the file xxxxx_1.sim; create spillover events in xxxxx_2.sim and xxxxx_3.sim.
Boole: run the digitization program, read the detector simulation and spillover files, and write the detector response output to xxxxxx_4.digi.
Brunel: run the reconstruction program, read the digitization data and write the reconstructed output to xxxxxx_5.dst.
The Workflow Template is derived by specifying the internal parameters of the Workflow, such as the package versions (Gauss v15r5, LHCb v15r4, XMLDDDB v22r0, Boole v5r3, Brunel v23r5), event type = 61 and number of events = 500.
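
The relation between a Workflow and a Workflow Template can be illustrated with a minimal Python sketch. The class names, field names and parameter keys below are assumptions made for illustration and are not taken from the DIRAC code; only the operation names, file name placeholders and parameter values come from the slide.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Operation:
    """One operation of a workflow (hypothetical structure)."""
    name: str
    inputs: tuple = ()
    output: str = ""


@dataclass
class Workflow:
    """Abstract sequence of operations plus its internal parameters."""
    operations: tuple
    parameters: dict = field(default_factory=dict)

    def unbound(self):
        # Parameters that still have no value.
        return sorted(k for k, v in self.parameters.items() if v is None)

    def derive_template(self, **bound):
        # A template is the same workflow with internal parameters specified.
        unknown = set(bound) - set(self.parameters)
        if unknown:
            raise KeyError(f"unknown parameters: {sorted(unknown)}")
        return Workflow(self.operations, {**self.parameters, **bound})


simulation = Workflow(
    operations=(
        Operation("Gauss 1", output="xxxxx_1.sim"),
        Operation("Gauss 2", output="xxxxx_2.sim"),
        Operation("Gauss 3", output="xxxxx_3.sim"),
        Operation("Boole",
                  inputs=("xxxxx_1.sim", "xxxxx_2.sim", "xxxxx_3.sim"),
                  output="xxxxxx_4.digi"),
        Operation("Brunel", inputs=("xxxxxx_4.digi",), output="xxxxxx_5.dst"),
    ),
    parameters={
        "gauss_version": None,
        "boole_version": None,
        "brunel_version": None,
        "event_type": None,
        "number_of_events": None,
    },
)

template = simulation.derive_template(
    gauss_version="v15r5",
    boole_version="v5r3",
    brunel_version="v23r5",
    event_type=61,
    number_of_events=500,
)
```

In this sketch the abstract workflow is left unchanged when a template is derived, so several templates (for example, one per event type) could share the same sequence of operations.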

Step
Each Workflow can be decomposed into a series of Steps connected to each other via input-output files. In the example simulation workflow the Steps are Gauss 1, Gauss 2, Gauss 3, Boole and Brunel. A Step is the smallest entity that could run independently as a job (if persistent storage is used for the input-output files), although all Steps can be combined into a single job.

Module
A Step usually consists of a series of smaller unbreakable operations. These do not necessarily produce output or require input files, but they may produce, exchange or share some data. To make these operations reusable components, we introduce the smallest entity, called a Module. Instances of Modules can be configured or linked with each other by defining their parameters.
Example: the Step "Brunel" (reconstruction program) is built from four Modules: install package, execute Brunel, Brunel logfiles check, bookkeeping.

Authoring Tools
To create and manipulate the objects described above, we designed a set of tools (Module Editor, Step Editor and Workflow Editor) that allows the Production Manager to do complex editing of Modules, Steps and Workflows. Each object can be stored as an XML file in the file system, in the Workflow Library. Since the tools are general and completely content independent, the only LHCb-specific part is the Workflow Library.
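
Storing an object as an XML file could look roughly like the following Python sketch. The slides do not show the actual XML schema, so the element names ("Module", "Option") and attributes used here are purely hypothetical; the sketch only shows that a list of options survives a round trip through XML using the standard library.

```python
import xml.etree.ElementTree as ET


def module_to_xml(name, options):
    """Serialise a module name and its options to an XML string.

    `options` maps an option name to a (type, value) pair.
    The element and attribute names are illustrative assumptions.
    """
    root = ET.Element("Module", name=name)
    for opt_name, (opt_type, value) in options.items():
        ET.SubElement(root, "Option", name=opt_name, type=opt_type,
                      value=str(value))
    return ET.tostring(root, encoding="unicode")


def module_from_xml(text):
    """Read back the module name and options (values come back as strings)."""
    root = ET.fromstring(text)
    options = {
        opt.get("name"): (opt.get("type"), opt.get("value"))
        for opt in root.findall("Option")
    }
    return root.get("name"), options


xml_text = module_to_xml("ExecuteBrunel", {"version": ("string", "v23r5")})
name, options = module_from_xml(xml_text)
```

Because the editors only deal with generic objects and their options, a file format of this kind carries no LHCb-specific knowledge; that knowledge lives in the content of the library.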

Production Repository & Agent
Since our task is to generate Jobs from a Workflow, we need a place for this kind of operation. The Production Repository is a hierarchical storage designed to provide an area for handling Jobs. It holds Production Templates, Productions and Jobs. It is populated by the Production Manager Console, which draws on the Workflow Library, and its content is used by the DIRAC Production Agent. The Production Agent submits Jobs to the DIRAC WMS, checks Job status and downloads the output sandbox. The diagram also includes command line tools.
Example repository content: an MCProduction template with the Productions MCProduction1, MCProduction2 (Job 1 … Job n) and MCProduction3; a Reconstruction template with the Production Reconstruction1.

Submitting production jobs
The sequence of operations is shown on this slide as an animation. The Production Manager creates a Workflow and adds some parameters. The Workflow is published in the Production Repository as a Production Template. The Production Console reads the Production Template, adds parameters and publishes the result as a Production in the Repository. The Production then generates the requested number of Jobs inside the Repository. The Production Agent takes the Jobs and submits them to the DIRAC Workload Management System (WMS), where they are executed on a computing resource through a DIRAC CE.
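
The submission sequence can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions: the classes, the status strings and the `submit` callback are hypothetical stand-ins, not the DIRAC Production Repository or WMS interfaces.

```python
from dataclasses import dataclass, field


@dataclass
class Job:
    production: str
    job_id: int
    parameters: dict
    status: str = "new"


@dataclass
class Production:
    """A Production Template plus the parameters added in the Console."""
    name: str
    template: dict
    parameters: dict
    jobs: list = field(default_factory=list)

    def generate_jobs(self, count):
        # Each job gets the merged template and production parameters.
        start = len(self.jobs) + 1
        new_jobs = [
            Job(self.name, job_id, {**self.template, **self.parameters})
            for job_id in range(start, start + count)
        ]
        self.jobs.extend(new_jobs)
        return new_jobs


def production_agent_cycle(production, submit):
    """One pass of a (hypothetical) agent: submit every job not yet sent."""
    submitted = []
    for job in production.jobs:
        if job.status == "new":
            submit(job)  # stands in for submission to the WMS
            job.status = "submitted"
            submitted.append(job.job_id)
    return submitted


production = Production("MCProduction1",
                        template={"event_type": 61},
                        parameters={"number_of_events": 500})
production.generate_jobs(3)
sent = []
first_pass = production_agent_cycle(production, sent.append)
second_pass = production_agent_cycle(production, sent.append)
```

Keeping a status on each job lets the agent run repeatedly without submitting the same job twice, which is the property the repeated agent cycle relies on in this sketch.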

Automatic Data Processing
Data processing is automated through a Processing Database and a series of Transformation Agents. The Processing Database is in fact a combination of a file catalogue, a replica catalogue, Transformation tables and the Transformation Definition table. The Transformation Definition table defines all Transformation Agents.
Example Transformations entry: ID 00012; Request ID 00001028; Condition '(00000742)_[0-9]+_[0-9]+.dst'; Files per Job 80. The other tables on the slide hold the file records: File ID and LFN, File ID and File Status, File ID and Replica (example File ID 0000123).
The appropriate Transformation Agents are triggered by the presence of a sufficient number of file entries in the Processing Database. Acting as a data filter, the Transformation Agent creates the required number of Jobs in the Production Repository, which are then picked up by the Production Agent.
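
The triggering rule can be made concrete with a small Python sketch that applies the Condition and Files per Job values from the example Transformations entry. The function name and the file names are hypothetical, and real LFNs would normally include a directory path; the sketch assumes the Condition is used as a regular expression searched for within each LFN.

```python
import re


def group_files_into_jobs(lfns, condition, files_per_job):
    """Select files matching the condition and cut them into full job groups.

    Returns (jobs, leftover): `jobs` is a list of file lists, each with
    exactly `files_per_job` entries; `leftover` holds matching files that
    do not yet fill a complete job and must wait for more data.
    """
    pattern = re.compile(condition)
    matching = [lfn for lfn in lfns if pattern.search(lfn)]
    n_full = len(matching) // files_per_job
    jobs = [
        matching[i * files_per_job:(i + 1) * files_per_job]
        for i in range(n_full)
    ]
    leftover = matching[n_full * files_per_job:]
    return jobs, leftover


# 170 hypothetical files of production 00000742, plus two that must not match.
lfns = [f"00000742_{i:08d}_5.dst" for i in range(1, 171)]
lfns += ["00000743_00000001_5.dst", "00000742_00000001_5.digi"]

jobs, leftover = group_files_into_jobs(
    lfns, r"(00000742)_[0-9]+_[0-9]+.dst", files_per_job=80
)
```

With 170 matching files and 80 files per job, this yields two complete jobs and ten files left waiting, which is what "a sufficient number of file entries" amounts to in this simplified picture.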

Automatic job submission
The sequence of events for automatic Job submission is shown on this slide as an animation. The Production Manager creates a Transformation, publishes it in the Processing Database and associates it with a specific Production. The Transformation then instantiates a Transformation Agent.

Automatic job submission (continued)
A previously submitted Job publishes a File Entry in the Processing Database; alternatively, the Production Manager can publish the File Entry. The Transformation Agent picks up the File Records that satisfy its requirements and activates the associated Production. The Production generates Jobs in the Repository, and the Production Agent sends them to the DIRAC Workload Management System.

Operational experience in DC04
The Production Console was used during the DC04 production period. More than 100,000 jobs were created and executed. 10 Workflows of 5 to 12 application Steps were used. The whole production was handled by a small team of production managers, with just one person on shift at any moment. In total: 10 workflows, 120 productions, 55 event types, 50M events.

Use cases in DC06
In DC06, LHCb will test its complete Computing Model, which includes: data acquisition at CERN with subsequent inline replication to Tier1 centres; automatic submission of reconstruction and filtering jobs at Tier1 centres as soon as data become available; DST data replication across the Tier1 centres. The Production Manager tools will be used to cover all these new tasks, to minimize human intervention and to increase the responsiveness of the production system.

Conclusion
The DIRAC Production Manager tool suite allows Workflows of arbitrary complexity to be defined. It allows a single Production Manager to handle large numbers of jobs seamlessly. Automatic job preparation and submission, triggered by the availability of input data, reduces the need for human intervention. The Production Console was successfully used in the LHCb DC04. The complete system to be used in DC06 will support all the production operations defined in the LHCb Computing Model.

Backup slides

Variables
How can we avoid variable clashes if different programmers are creating Modules and Steps? If we use multiple instances of the same Module, we need a local scope.
Global scope: $VAR2 = "string"; var1 = $VAR2; var3 = $VAR2.
Name space based scope: var1 = self.var2 (inside Instance 1); var3 = inst1.var2 (inside Instance 2).
Options are grouped in the form of a list: Instance 1 holds the Options var1 and var2, Instance 2 holds var3. Each Option has the fields Name, Type, Value, Linked Instance, Linked Option, Input Flag, Output Flag and Description. All our objects (Module, Step, Workflow) are lists of Options.
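
The name space based scope can be illustrated with a short Python sketch of link resolution. The field names follow the Option fields listed on the slide (Name, Value, Linked Instance, Linked Option); the classes, the `resolve` function and the treatment of "self" are assumptions made for illustration, not the actual implementation.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class Option:
    name: str
    value: Any = None
    linked_instance: Optional[str] = None  # "self" or another instance name
    linked_option: Optional[str] = None


class Instance:
    """A named list of Options; names only need to be unique per instance."""

    def __init__(self, name, options):
        self.name = name
        self.options = {option.name: option for option in options}


def resolve(instances, instance_name, option_name, _seen=None):
    """Follow links until an Option with a plain value is reached."""
    seen = set() if _seen is None else _seen
    key = (instance_name, option_name)
    if key in seen:
        raise ValueError(f"circular link at {instance_name}.{option_name}")
    seen.add(key)
    option = instances[instance_name].options[option_name]
    if option.linked_option is None:
        return option.value
    target = option.linked_instance
    if target in (None, "self"):
        target = instance_name
    return resolve(instances, target, option.linked_option, seen)


instances = {
    "inst1": Instance("inst1", [
        Option("var1", linked_instance="self", linked_option="var2"),
        Option("var2", value="string"),
    ]),
    "inst2": Instance("inst2", [
        Option("var3", linked_instance="inst1", linked_option="var2"),
    ]),
}

var1 = resolve(instances, "inst1", "var1")  # var1 = self.var2
var3 = resolve(instances, "inst2", "var3")  # var3 = inst1.var2
```

Because every lookup is qualified by an instance name, two instances of the same Module can each carry their own var2 without clashing, which a single global $VAR2 cannot offer.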

Architecture
Every object is built on a VariableCollection, which holds n Variables (Options). Each Variable has the fields Name, Type, Value, Linked Instance, Linked Option, Input Flag, Output Flag and Description.
Module Editor: edits a ModuleDefinition, from which ModuleInstances are created.
Step Editor: edits a StepDefinition, which contains n ModuleInstances and from which StepInstances are created.
Workflow Editor: edits a WorkflowDefinition, which contains n StepInstances.

Step Editor
The Step Editor shows the Step Variables and the Module instances ins1 to ins4 of the Modules M1, M2 and M3.
Generated Python Code:

    class HistogramTest:
        # Step variables
        seed = 3
        numbers = 200
        filename = "WorkflowHistogramTest1.155"

        def execute(self):
            ins1 = M1()
            # <skipped>
            ins1.execute()
            ins2 = M2()
            ins2.seed = self.seed              # linked to a Step variable
            ins2.filename = 'default_file_name'
            ins2.mu = ins1.mu                  # linked to an option of ins1
            ins2.sigma = 2.5
            ins2.numbers = self.numbers        # linked to a Step variable
            ins2.execute()

Conclusion
The DIRAC Production Manager tool suite includes: a flexible, experiment (content) independent framework to formulate complex multi-staged jobs; an LHCb-specific Workflow Library assembled from smaller, reusable components; tools for bulk Job preparation, submission and control, based on the Production Repository and the Production Agent; automated data processing tools based on the Processing Database and Transformation Agents.