WMS baseline issues in Atlas
Miguel Branco, Alessandro De Salvo
15-4-2005


Outline
- The Atlas Production System
- WMS baseline issues in Atlas

The Atlas Production System (1)
[diagram slide; no text transcribed]

Atlas Production System (2)
Major components:
- Common Central Database (ProdDB)
- Common Data Management System (DQ)
- Production Supervisor (Windmill)
- Executors, one per grid flavour:
  - GRID3 → Capone
  - LCG → Lexor
  - NorduGrid → Dulcinea
  - legacy systems → Bequest

Atlas Production System components (1)
Production Database (ProdDB):
- holds records for:
  - job transformations: a transformation is an "executable" that transforms data from one state to another, using a set of pre-defined actions to be executed in sequence
  - job definitions
  - status of jobs
  - job executions
  - logical files
- the bookkeeping database (AMI) is currently synchronized with ProdDB
- Oracle database hosted at CERN
Data Management System (DQ):
- allows global cataloguing of files
- interfaced to the existing replica catalog flavours
- allows global file movement
- see previous talks from Miguel about Data Management
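To make the record types above concrete, here is a minimal sketch in Python; the field names are illustrative assumptions, not the actual ProdDB schema (which is an Oracle database hosted at CERN).

```python
# Illustrative sketch only: field names are assumptions, not the real ProdDB schema.
from dataclasses import dataclass, field

@dataclass
class Transformation:
    """An "executable" that transforms data from one state to another."""
    name: str
    version: str
    actions: list[str] = field(default_factory=list)  # pre-defined actions, run in sequence

@dataclass
class JobDefinition:
    job_id: int
    transformation: Transformation
    parameters: dict[str, str]
    status: str = "defined"  # bookkeeping state, e.g. defined/submitted/done/failed

@dataclass
class LogicalFile:
    lfn: str          # logical file name, resolved through the replica catalogs
    produced_by: int  # job_id of the producing execution
```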

Atlas Production System components (2)
Supervisor (Windmill):
- consumes jobs from the production database
- submits them to one of the executors it is connected to
- follows up on the job
- validates the presence of the expected outputs
- takes care of the final registration of output products in case of success
- possibly takes care of clean-up in case of failure; will retry n times if necessary
- uses Jabber to communicate with the executors
Executor:
- one for each facility flavour: LCG (Lexor), NG (Dulcinea), GRID3 (Capone), PBS, LSF, BQS, Condor?, ...
- translates the facility-neutral job definition into the facility-specific language
- implements the facility-neutral interface
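The supervisor/executor split lends itself to a small sketch. The class and method names below are invented for illustration; the real Windmill protocol exchanges Jabber messages rather than making direct calls.

```python
# Sketch only: names are illustrative; the real supervisor talks to executors over Jabber.
from abc import ABC, abstractmethod

class Executor(ABC):
    """Facility-neutral interface, implemented once per grid flavour
    (Lexor for LCG, Dulcinea for NorduGrid, Capone for GRID3, ...)."""

    @abstractmethod
    def submit(self, job):
        """Translate the facility-neutral job definition and submit it."""

    @abstractmethod
    def status(self, handle):
        """Follow up on the job; return e.g. 'running', 'done', 'failed'."""

    @abstractmethod
    def outputs_present(self, handle):
        """Validate the presence of the expected outputs."""

def supervise(prod_db, executor, max_retries=3):
    """One pass of the supervisor loop: consume, submit, verify, retry."""
    for job in prod_db.consume_jobs():            # hypothetical ProdDB interface
        for _ in range(max_retries):              # retry n times if necessary
            handle = executor.submit(job)
            if executor.status(handle) == "done" and executor.outputs_present(handle):
                prod_db.register_outputs(job)     # final registration on success
                break
            prod_db.clean_up(job)                 # clean-up on failure, then retry
```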

WMS problems seen in recent productions
Information System / Ranking:
- some sites do not publish all the information in the IS correctly
- the ranking expression based on "EstimatedResponseTime" had to be changed: if there are no queued jobs, rank on the percentage of free CPUs over the number of running jobs; otherwise, use the negative value of the percentage of waiting jobs over the number of running jobs (see the sketch below)
- the number of free CPUs for Atlas alone is not available
- the value of Max running jobs (which is VO-dependent) is not always set to a reasonable value
- sites that don't pass the daily test suite run by the CIC on duty were automatically excluded from the ATLAS BDII:
  - this may exclude sites that are good for Atlas: not all the tests are relevant to deciding whether a site is usable by ATLAS
  - some tests may fail because of a temporary problem, but currently no retry procedure is implemented
  - excluded sites are often not efficiently informed about the failure
  - it may also exclude a site's Storage Resources even when only its Computing Resources should be excluded
RB / Job submission:
- only single jobs may be submitted, and a single RB cannot efficiently serve more than a limited number of jobs
- submission time reaches peaks of over 40 s, while normal values are around 10 s
- submission speed is also affected by the load on the L&B server, in particular when querying job by job; a global query is impossible via the CLI
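The reworked ranking rule is easier to see in code. In production it would be a JDL Rank expression evaluated by the RB against IS attributes; the Python below is only one plausible reading of the rule quoted above (the transcript leaves the exact normalisation of the "percentage" ambiguous).

```python
# One plausible reading of the modified ranking rule; in reality this is a
# JDL Rank expression evaluated by the RB, not Python.
def rank(free_cpus: int, total_cpus: int, running: int, waiting: int) -> float:
    if waiting == 0:
        # No queued jobs: percentage of free CPUs over the number of running jobs.
        return (100.0 * free_cpus / total_cpus) / max(running, 1)
    # Queued jobs present: negative percentage of waiting jobs over running jobs.
    # (The base of the percentage is an assumption; the transcript does not say.)
    return -(100.0 * waiting / total_cpus) / max(running, 1)
```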

WMS baseline requirements for Atlas (1)
- Bulk submission: submission of collections of jobs; sandbox sharing; parametric jobs (see the sketch below)
- Bulk status: status retrieval for collections of jobs
- User tags in jobs: associate a tag with a job in order to label it with its Production System name; the middleware jobID could be lost if a crash occurs
- Real-time job output inspection: the ability to retrieve the partial output of a running job
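What parametric submission with sandbox sharing buys is one request that expands into many jobs, instead of one 10-40 s submission call per job. A sketch under invented names (expand() is not a real WMS API):

```python
# Hypothetical illustration of parametric-job expansion; expand() is not a real WMS call.
def expand(template: dict, placeholder: str, values) -> list[dict]:
    """Expand one parametric job description into concrete job definitions."""
    return [
        {**template, "arguments": template["arguments"].replace(placeholder, str(v))}
        for v in values
    ]

# One description, one shared input sandbox, 100 jobs submitted as a collection.
jobs = expand(
    {"executable": "transform.sh",
     "arguments": "--partition _PARAM_",
     "input_sandbox": ["transform.sh"]},  # uploaded once, shared by the collection
    "_PARAM_",
    range(100),
)
```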

WMS baseline requirements for Atlas (2)
- Data-driven matchmaking: the replica location information should be used in the ranking expression, for example via a function that returns "1" if the file passed as parameter is available on the close SE
- Provide metrics for the CE-SE distance, to be used in matchmaking: at least something like "local", "close", "same continent", "far away" (see the sketch below)
- Task queue in the RB: keep a historical view of the submitted jobs; better optimization; avoid the "black hole" effect, where a misconfigured site pulls many jobs and makes them all fail; dynamic rescheduling when the state of the grid changes
- Better handling of sandboxes: the RB should not crash if the sandboxes take up too much space; input sandbox sharing
- Better error reporting: avoid generic error messages (like "Max Retry Count exceeded")
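Both requirements (the "file on the close SE" rank term and the coarse CE-SE distance classes) can be sketched together. The `replicas`, `close_se`, and `topology` lookups below are stand-ins for real replica-catalog and IS queries, not actual middleware interfaces.

```python
# Sketch: locality-aware ranking with coarse CE-SE distance classes.
# `replicas`, `close_se` and `topology` stand in for catalog/IS lookups.
DISTANCE = {"local": 0, "close": 1, "same_continent": 2, "far_away": 3}

def file_is_close(lfn: str, ce: str, replicas: dict, close_se: dict) -> int:
    """The requested rank term: 1 if a replica of `lfn` is on the CE's close SE."""
    return int(close_se[ce] in replicas.get(lfn, set()))

def nearest_replica(lfn: str, ce: str, replicas: dict, topology: dict) -> int:
    """Smallest CE-SE distance class among the SEs holding a replica of `lfn`."""
    return min((DISTANCE[topology[(ce, se)]] for se in replicas.get(lfn, ())),
               default=DISTANCE["far_away"])

def rank_ce(ce: str, lfns: list, replicas: dict, topology: dict) -> int:
    """Lower total distance to the input files gives a higher rank."""
    return -sum(nearest_replica(lfn, ce, replicas, topology) for lfn in lfns)
```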