Pavel Nevski DDM Workshop BNL, September 27, 2006 JOB DEFINITION as a part of Production

Job Definition Role
- It is largely seen as a data provider for PRODSYS
  - Interface for ATLAS physicists to the Distributed Production
  - Job provider for the Distributed Production System (ProdSys)
- On the other hand, it is the first line of data consumption
  - Jobs have to be defined dynamically to optimize production throughput and avoid waste of resources
  - Users need to know the request progress in a reliable way to actively participate in production validation
- Hence, it has to be fully integrated with ATLAS DDM
  - ... a point of concern for the Software Integration group

System Functionality
- A lot of control: input parameter values, input availability, software availability, etc.
- Automated task status update
- Automated task progress monitoring
- Automated mail notification
- Dynamic job definition based on job status in the Production database
- Integration with DDM to access inputs
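To make the "dynamic job definition" point concrete, here is a minimal sketch of how the number of jobs to define could be derived from job-status counts read from the Production database; the status names and the queue target are assumptions for illustration, not the actual ProdSys logic.

```python
# Hypothetical helper: keep a task's job queue topped up, driven purely by
# the job-status counts seen in the Production database.
def jobs_to_define(status_counts, target_queue=500):
    """status_counts: e.g. {'defined': 50, 'waiting': 30, 'running': 200}.
    Return how many new jobs to define so that roughly target_queue jobs
    are still waiting to run."""
    backlog = status_counts.get('defined', 0) + status_counts.get('waiting', 0)
    return max(0, target_queue - backlog)

# Example: only 80 jobs left in the queue -> define 420 more for this task.
print(jobs_to_define({'defined': 50, 'waiting': 30, 'running': 200}))
```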

Job Definition tools
- Task Request Interface
- Request control tools
- Job submission tools
- Statistics collection
- Mailer
- Input DDM interface

Task Request Page (AKTR)
- Part of the ATLAS-wide Apache server provided with the Panda monitor; seen by users as a single comprehensive web interface to the information services
- Back end is a Python server interfaced to the MySQL DBs at BNL (Panda DB, logging DB, DB with task, dataset and site info) and to DQ2
- Monitor and database run at BNL; a development version runs at CERN
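A minimal sketch of what such a back-end query might look like, assuming the classic MySQLdb driver; the host, credentials and the tasks table schema are placeholders, not the real BNL setup.

```python
import MySQLdb  # MySQL driver commonly used by Python servers of this era

def list_tasks(status='running'):
    # Placeholder host, credentials and schema; the real server talks to the
    # Panda DB, logging DB and the task/dataset DBs at BNL.
    conn = MySQLdb.connect(host='dbhost.example.bnl.gov', user='reader',
                           passwd='***', db='prodsys')
    try:
        cur = conn.cursor()
        cur.execute("SELECT taskid, taskname, status FROM tasks WHERE status = %s",
                    (status,))
        return cur.fetchall()
    finally:
        conn.close()
```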


Task Request Features
- Flexible control scripts are integrated with the user request tools
  - Check naming, number of events, component existence, etc. while the user is typing the request, so mistakes can be corrected immediately
  - Check parameters against the transformations
- The job of the production manager is limited to a "one click" request acceptance/rejection
- The rest (request -> task -> jobs -> DDM) happens automatically
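As an illustration of the kind of checks the control scripts run while the user fills in the form, here is a hedged sketch; the field names, the naming pattern and the list of transformations are invented for the example, not taken from the AKTR code.

```python
import re

# Hypothetical request record: {'taskname': ..., 'nevents': ..., 'transformation': ...}
KNOWN_TRFS = {'csc.evgen.trf', 'csc.simul.trf', 'csc.reco.trf'}  # placeholder list

def check_request(req):
    errors = []
    # naming convention check (illustrative pattern)
    if not re.match(r'^[A-Za-z0-9._]+$', req.get('taskname', '')):
        errors.append('task name contains illegal characters')
    # number of events must be a positive integer
    try:
        if int(req.get('nevents', 0)) <= 0:
            errors.append('number of events must be positive')
    except (TypeError, ValueError):
        errors.append('number of events is not a number')
    # requested transformation must exist
    if req.get('transformation') not in KNOWN_TRFS:
        errors.append('unknown transformation')
    return errors  # reported back to the web form so the user can fix them at once
```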

Input Parameter Control
- Most conflicts are detected at task request time and immediately corrected (mostly numeric information)
- Some conflicts which could potentially be corrected within the same release are detected at job submission time
- Some additional checking is implemented on request, e.g. requiring the same grid flavor for simulation and reconstruction tasks

Task status flow
- Requested (used to create datasets)
  - Pending: a grace period for submitter and manager
  - Testing: often needed for new types of jobs
- Active states:
  - Submitting: inputs are not all finished
  - Submitted: fully available for production
  - Running: some outputs are available
- Final states:
  - Done: 100% finished successfully
  - Finished: finished, but with less than 100% success
  - Failed (?): finished with no successful jobs
  - Aborted
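The same flow written down as a small state table; the allowed transitions are reconstructed from this slide (and the Control Flow Chart slide below), not copied from the ProdSys source.

```python
# Task-status flow as a transition table; Done/Finished/Failed/Aborted are final.
TASK_TRANSITIONS = {
    'requested':  ['pending', 'testing'],
    'pending':    ['submitting', 'submitted', 'aborted'],
    'testing':    ['submitting', 'submitted', 'aborted'],
    'submitting': ['submitted', 'aborted'],
    'submitted':  ['running', 'aborted'],
    'running':    ['done', 'finished', 'failed', 'aborted'],
}

def advance(current, new):
    if new not in TASK_TRANSITIONS.get(current, []):
        raise ValueError('illegal task transition %s -> %s' % (current, new))
    return new
```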

Task progress control
- Task execution status is checked daily
- Automatic mail notification is generated when jobs are stuck for the usual reasons (prepared, pending, maxattempt reached, etc.)
- After about one month (depending on task type, grid, etc.) some automatic actions are forced: increasing maxattempt, aborting auto-aborted jobs, etc.
- After two months some human interaction is initiated to achieve progress
- After three months the production manager aborts the remaining jobs (do not complain!)
- As a result, the oldest running task now is the one submitted on June 22
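A sketch of this escalation policy as code; the thresholds mirror the bullet points above, but the function itself is only an illustration of the daily cron logic, not the production code.

```python
import datetime

def escalation_action(submitted_on, today=None):
    """Return the action foreseen for a task of a given age (illustrative)."""
    today = today or datetime.date.today()
    age_days = (today - submitted_on).days
    if age_days > 90:
        return 'production manager aborts the remaining jobs'
    if age_days > 60:
        return 'initiate human interaction to achieve progress'
    if age_days > 30:
        return 'force automatic actions (increase maxattempt, etc.)'
    return 'mail notification if jobs are stuck'
```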

Managing Parameter Input
- When a group of tasks with similar conditions is required, we introduce a project associated with a set of default parameters, e.g.:
  - Specific geometry version (ideal, misaligned, etc.)
  - Fast shower parametrization vs. full G4
  - Different digitization (low noise, high noise, etc.)
- When the project is selected, default values are automatically set for all associated parameters
- When the input comes from a different project, the project names get concatenated
- Example: Ideal_06_MisAl_07_csc11 (events generated with release 11, passed through simulation with misaligned geometry and reconstructed with ...)
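A toy version of the project mechanism, assuming a simple dictionary of defaults; the project names, parameters and the exact concatenation rule are illustrative only (the real bookkeeping lives in the production database).

```python
# Illustrative project defaults; real projects carry geometry, shower and
# digitization settings, among others.
PROJECT_DEFAULTS = {
    'Ideal_06': {'geometry': 'ideal',      'shower': 'full G4', 'noise': 'low'},
    'MisAl_07': {'geometry': 'misaligned', 'shower': 'full G4', 'noise': 'high'},
}

def task_parameters(project, overrides=None):
    """Start from the project defaults, then apply per-task overrides."""
    params = dict(PROJECT_DEFAULTS.get(project, {}))
    params.update(overrides or {})
    return params

def chained_name(input_project, this_project):
    """When the input comes from a different project, the names concatenate,
    giving labels of the Ideal_06_MisAl_07_csc11 kind."""
    if input_project == this_project:
        return this_project
    return '%s_%s' % (input_project, this_project)
```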

Input Control Integrated with DDM
- If the input data were produced on the same grid flavor, jobs are released in TOBEDONE status
  - Still, simulation jobs are kept in WAITING status to allow evgen file replication
- If the input data were produced on a different grid flavor, jobs are defined in WAITINGCOPY status
- If user input (events) is required, jobs are defined as WAITINGINPUT until the input is fully available
- The input data needed are first collected at CERN (CAF)
- A subscription in DQ2 is made and monitored until the input becomes available for the jobs (... often waiting forever ...)
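The rules above, condensed into one decision function; this is a reconstruction of the slide, not the actual job-definition script.

```python
def initial_job_status(input_grid, job_grid, needs_user_input, is_simulation):
    """Pick the state a newly defined job starts in (reconstructed rules)."""
    if needs_user_input:
        return 'WAITINGINPUT'  # wait until the user-supplied events are fully available
    if input_grid != job_grid:
        return 'WAITINGCOPY'   # input must first be copied to the target grid flavor
    if is_simulation:
        return 'WAITING'       # give the evgen files time to replicate
    return 'TOBEDONE'          # input is already where the job will run
```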

Current Status
- Currently used by:
  - central production
  - CTB simulation production
  - "private" production (Saclay, BNL)
- More than 3000 task requests served
- Fast turnaround
- Runs as a daemon on a SWING machine
- Works both with old and new (Python) trfs, with automatic parameter extraction
- Job definition is being documented on the ATLAS TWiki by Junichi
- Production is often blocked as soon as data movement is involved

Plans
- Fully debug the DDM interface and move it to the common cron
  - Better synchronization with dataset opening/closing
  - Correction of typical errors in input dataset delivery
  - Control the jobs' final state with DDM
- Maintain configuration parameters in the database

BACK-UP MATERIAL

Components and their relations
(Diagram showing the relations between the Request Table, Task table, JobDef entries, JobExec, outputs, DDM dataset metadata, the user, the production manager, production operations, and the web interface (AKTR).)

How It Works
- The user submits a request
  - Automatic parameter control is performed(!)
- The production manager accepts or rejects it
  - The manager can affect the priority and/or grid assignment, but not the request parameters
  - A mail to the submitter is generated
- Production scripts (a daemon) automatically convert approved requests into Prod DB tasks and jobs and register the new target datasets with DDM
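A highly simplified sketch of that daemon step; every helper passed in here is a hypothetical stand-in for the real Prod DB and DQ2 calls.

```python
def process_approved_requests(fetch_approved, make_task, define_jobs,
                              register_dataset, send_mail):
    """One pass of the (hypothetical) daemon: requests -> tasks -> jobs -> DDM."""
    for request in fetch_approved():              # approved requests from the request table
        task = make_task(request)                 # create the Prod DB task entry
        define_jobs(task)                         # define the individual jobs
        register_dataset(task['output_dataset'])  # register the target dataset with DDM (DQ2)
        send_mail(request['submitter'],
                  'task %s defined from your request' % task['taskid'])
```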


Control Flow Chart
- Tasks go through a chain of states:
  - After the request it is Pending for submission, or Testing if debugging is needed
  - After approval it is Submitting or Submitted
  - When Submitted and jobs start to succeed, it goes to Running
  - When all jobs are terminated (DONE or ABORTED), the task is Done or Finished (or Failed)
- Current status and statistics are available on the Task Page

Conclusion
- Job definition for ATLAS distributed production is in good shape
- More integration with ATLAS DDM is foreseen in the near future
- We are looking forward to the challenging ramp-up of ATLAS distributed production