AtCom (Atlas Commander)
Vandy Berten, Luc Goossens (CERN/EP/ATC), Alvin Tan (University of Birmingham)

Pre-history of AtCom project
- test productions in Jan 2002 at CERN (LSF)
  - enjoyed nearly 100% success rate
  - no need for tools to clean up/resubmit the odd failed job
- real production in summer 2002 suffered on average 20% failures
  - many factors were against us
  - at 300 CPU slots capacity, cleanup became overwhelming
- Sep 2002: Vandy Berten (technical student) needed a well-defined project
  - the AtCom (Atlas Commander) project was born

What is the Atlas Commander?
- graphical interactive tool to support the production manager
  - define jobs in large quantities
  - submit and monitor progress
  - scan log files for (un)known errors
  - update bookkeeping databases (AMI, Magda)
  - clean up in case of failures

History of AtCom project
- total resource count is about 5 man-months
- ideal situation: two persons in the same office, one CS student as developer/designer plus one software engineer as client/designer/developer
- successful multi-cluster live demo at the Atlas SW workshop in Nov
- has been in continuous production use at CERN since Oct 2002
- v1.0 released end of Jan -> production quality
- v1.2 released early March -> removed the last important limitation: remote capability

AtCom has its own web site
- http://atlas-project-atcom.web.cern.ch/atlas-project-atcom/
- contains user guide, developer's guide, documentation, downloads, relevant contact e-mails, etc.

Architecture: application + plug-ins
[Diagram: the AtCom core talks to the clusters through plug-ins (LSFComputingSystem, EDGComputingSystem, NGComputingSystem, PBSComputingSystem, ...) and to the bookkeeping DBs (AMI, Magda) through AMIMgt and MagdaMgt.]

Architecture (continued)
- a plug-in implements the abstract 'cluster' interface for a specific cluster type, e.g. LSF
- a plug-in is a Java class + configuration parameters, e.g. LSF@TIMBUKTU
- the AtCom configuration file defines all existing plug-ins and allows each to have its own configuration section
- plug-ins are loaded at run-time
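
Not on the original slides, but as a concrete illustration: a minimal Java sketch of what the abstract 'cluster' interface and the run-time loading could look like. The name ComputingSystem is taken from the architecture diagram; the method signatures and the Properties-based configuration are assumptions, not AtCom's real API.

```java
import java.util.Properties;

// Abstract 'cluster' interface that every plug-in implements (sketch).
interface ComputingSystem {
    void configure(Properties section);  // its own section of the AtCom config file
    String submit(String wrapperPath);   // returns the cluster-specific job ID
    String status(String jobId);         // e.g. PENDING, RUNNING, DONE
}

// Plug-ins are plain Java classes, so they can be instantiated by name at
// run-time from the class names listed in the configuration file.
class PluginLoader {
    static ComputingSystem load(String className, Properties section) throws Exception {
        ComputingSystem cs = (ComputingSystem)
                Class.forName(className).getDeclaredConstructor().newInstance();
        cs.configure(section);  // hand the plug-in its own config section
        return cs;
    }
}
```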

Available plug-ins
- LSF: well understood and supported
- NG: development suspended after the last SW workshop
- PBS: developed by Alvin Tan
- EDG: working, but no EDG-based clusters used in production
- BQS: developed by Jerome Fulachier

Bookkeeping databases
- 5 logical database domains, two physical databases
- AMI (Atlas Meta-data Interface): mySQL DB hosted at Grenoble
- Magda (Manager for grid-based data): mySQL DB hosted at BNL
- logical domains: physics meta-data, recipe catalog, permanent production log, transient production log, replica catalog

Concepts: datasets
- a transformation is an abstract process; a dataset is abstract data
[Diagram: the evgen transformation produces dataset evgen.2000; simul turns it into simul.2000 and simul.2099; pileup turns those into lumi02.2000.]

Concepts: partitions
- a concrete transformation process = a job; a concrete partition = a file
[Diagram: job evgen546 produces partition evgen.2000.0001; job simul876 turns it into simul.2000.0035; job pileup760 turns simul.2099.0812 into lumi02.2000.0035.]
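
A hypothetical snippet to make the naming scheme concrete: a partition is one file of a dataset, named with a zero-padded index appended to the dataset name. The format string is an assumption inferred from the examples above.

```java
public class PartitionNames {
    public static void main(String[] args) {
        String dataset = "simul.2000";   // abstract dataset
        int index = 35;                  // partition index within the dataset
        // zero-padded four-digit suffix, as in the slide's examples (assumed rule)
        String partition = String.format("%s.%04d", dataset, index);
        System.out.println(partition);   // prints simul.2000.0035
    }
}
```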

Two main functions of AtCom
- definition of jobs
- job submission/monitoring

Definition of jobs
- select a dataset with the SQL query composer
- the dataset determines the transformation
[Screenshot: SQL query composer; select a dataset, choose the relevant fields to view, preview the SQL query.]

Select a version of its transformation
- the version determines uses, signature, outputs, …

[Screenshot: transformation-type-aware parameter Load/Save feature; the user defines a counter range and various constants, and formulates expressions for partition- and job-specific parameters.]
- AtCom allows you to define a counter range and then use the counter value in expressions for the parameter values of each partition you want to create
- define some constant attributes:
  - the logical values for the parameters of the transformation
  - the LocalFileName -> LogicalFileName mapping of all outputs
  - the destination of stdout/stderr
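
The slide does not spell out the expression language, so the following sketch assumes a simple ${i} placeholder syntax purely to illustrate how one counter range expands into per-partition parameter values.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class ExpandPartitions {
    // Substitute the counter into each parameter expression, once per partition.
    static List<Map<String, String>> expand(int from, int to, Map<String, String> exprs) {
        List<Map<String, String>> partitions = new ArrayList<>();
        for (int i = from; i <= to; i++) {
            Map<String, String> params = new LinkedHashMap<>();
            String counter = String.format("%04d", i);  // zero-padded, as in the LFNs
            exprs.forEach((name, expr) -> params.put(name, expr.replace("${i}", counter)));
            partitions.add(params);
        }
        return partitions;
    }

    public static void main(String[] args) {
        Map<String, String> exprs = new LinkedHashMap<>();
        exprs.put("outputLFN", "simul.2000.${i}");  // one output LFN per partition
        exprs.put("randomSeed", "${i}");            // hypothetical per-job parameter
        expand(1, 3, exprs).forEach(System.out::println);
        // {outputLFN=simul.2000.0001, randomSeed=0001} ... and so on for 0002, 0003
    }
}
```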

Submission
- select a number of defined partitions using the SQL query composer
- select a target cluster
- jobs are submitted; for most clusters this means a number of auxiliary files are created (wrappers, jdl/xrsl files, …)
[Screenshot: construct the SQL query; specify dataset constraints, specify partition constraints, preview the SQL query.]

What happens on submission? (LSF@CERN)
- when the partition is unreserved in the DB, reserve it and create a part_run_info record
- the transformation definition path is resolved: just prepend an AFS path
- a wrapper is created:
  - insert commands to set up the correct environment
  - resolve logical actual values into physical actual values:
    - LFN parameter -> prepend the Castor path according to a fixed algorithm
    - LFNlist parameter -> expand the LFNlist syntax into a list of LFNs, prepend the Castor path to each according to the same algorithm, and insert into the wrapper commands to create an aux file containing these Castor PFNs
  - insert the line calling the "core" script
  - insert commands to copy outputs to their final destination

What happens when you submit? (continued)
- a wrapper is created (continued):
  - insert the line calling the "core" script, using the PFN/name of the aux file instead of the LFN/LFNlist
  - insert commands to copy outputs to their final destination:
    - convert the LFN into a Castor PFN using the fixed algorithm
    - rfcp the local file to that PFN (could insert code here to check the rfcp / retry)
- the job is submitted to LSF using the right queue (conf file), with -o and -e specifying temp locations for stdout and stderr in the same dir
- the wrapper code is saved with a unique name in a dir with a unique name (in the dir specified in the AtCom.conf file)
- the jobID returned by LSF is recorded in the part_run_info record together with the temp locations for stdout and stderr
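
Putting the two slides together, a hedged sketch of the LSF submission step: build a wrapper script, submit it with bsub (the -q/-o/-e flags and rfcp are quoted from the slides), and keep the returned jobID for the part_run_info record. The paths, file names and output parsing are illustrative assumptions, not AtCom's actual code.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class LsfSubmit {
    static String submit(Path wrapperDir, String queue) throws IOException, InterruptedException {
        // Wrapper: environment setup, the "core" script call, output copy to Castor.
        String wrapper = String.join("\n",
                "#!/bin/sh",
                "source /afs/cern.ch/.../setup.sh      # environment setup (placeholder path)",
                "./core.sh params.aux                  # \"core\" script, aux file of Castor PFNs",
                "rfcp output.root /castor/cern.ch/...  # copy output to its Castor PFN");
        Path script = wrapperDir.resolve("wrapper-" + System.nanoTime() + ".sh");
        Files.writeString(script, wrapper);

        // -q from the conf file; -o/-e put stdout/stderr in temp locations in the same dir.
        Process p = new ProcessBuilder("bsub", "-q", queue,
                "-o", script + ".out", "-e", script + ".err", script.toString()).start();
        p.waitFor();
        // bsub prints a line like: Job <12345> is submitted to queue <atlas>.
        String reply = new String(p.getInputStream().readAllBytes());
        return reply.replaceAll("(?s).*?<(\\d+)>.*", "$1");  // jobID for part_run_info
    }
}
```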

Monitoring
- jobs you submit are automatically added to the list of monitored jobs
- running jobs can be recovered from the part_run_info table if needed, e.g. after having closed AtCom
- any other partition can be added to the list as well using the SQL query composer
  - allows you to "see" also finished and defined jobs
  - for the bar charts, of course
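
A small sketch of the recovery step, assuming a JDBC connection to the bookkeeping DB; the part_run_info table name comes from the slide, while the column names and status value are invented for illustration.

```java
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

public class RecoverRunningJobs {
    static void recover(Connection db) throws SQLException {
        // Column names are assumptions; only the table name appears on the slide.
        String sql = "SELECT jobid, stdout_loc, stderr_loc FROM part_run_info"
                   + " WHERE status = 'RUNNING'";
        try (Statement st = db.createStatement(); ResultSet rs = st.executeQuery(sql)) {
            while (rs.next()) {
                // Re-add each still-running job to the monitored list.
                System.out.printf("re-monitoring job %s (out=%s, err=%s)%n",
                        rs.getString("jobid"), rs.getString("stdout_loc"),
                        rs.getString("stderr_loc"));
            }
        }
    }
}
```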

When a job moves from RUNNING to DONE, post-processing commences
- resolve the validation script's logical name into a physical name and apply it to stdout/stderr in their temp locations
  - the script returns 1=OK, 2=Undecided or 3=Failed
- if OK:
  - register output files with the Magda replica catalog
  - resolve the extract script and apply it to stdout
    - it writes to stdout a set of attribute-value pairs
    - AtCom will attempt an UPDATE query with these on the partition table
  - copy/move logfiles to their final destination
  - set the status of the partition to Validated

- if Failed: delete output files
- if Undecided: mark the job as such
  - the production manager can look at the output of the validation script or at the logfiles themselves and then force a decision as OK or Failed
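
The post-processing logic of the last two slides, condensed into a sketch; all method bodies are placeholders, and only the 1/2/3 verdict codes and the Validated/Undecided states come from the slides.

```java
public class PostProcessing {
    static final int OK = 1, UNDECIDED = 2, FAILED = 3;

    static void onDone(String stdout, String stderr) {
        int verdict = runValidationScript(stdout, stderr);  // 1=OK, 2=Undecided, 3=Failed
        if (verdict == OK) {
            registerOutputsWithMagda();               // replica catalog
            String attrs = runExtractScript(stdout);  // attribute-value pairs from stdout
            updatePartitionTable(attrs);              // UPDATE ... on the partition table
            moveLogsToFinalDestination();
            setPartitionStatus("Validated");
        } else if (verdict == FAILED) {
            deleteOutputFiles();
        } else {
            setPartitionStatus("Undecided");  // production manager forces OK/Failed later
        }
    }

    // Placeholder implementations; the real ones live in AtCom and its plug-ins.
    static int runValidationScript(String out, String err) { return OK; }
    static void registerOutputsWithMagda() {}
    static String runExtractScript(String out) { return "events=1000"; }
    static void updatePartitionTable(String attrs) {}
    static void moveLogsToFinalDestination() {}
    static void deleteOutputFiles() {}
    static void setPartitionStatus(String s) {}
}
```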

Questions?

Thank you!