CMS experience on EDG testbed
A.Fanfani, Dept. of Physics and INFN, Bologna – on behalf of the CMS/EDG Task Force
School on Grid Computing – July 2003

- Introduction
- Use of EDG middleware in the CMS experiment:
  o CMS/EDG Stress Test
  o Other tests

Introduction
o Large Hadron Collider
o CMS (Compact Muon Solenoid) detector
o CMS Data Acquisition
o CMS Computing Model

Large Hadron Collider
LHC proton-proton collisions:
o Beam energy: 7 TeV
o Luminosity: 10^34 cm^-2 s^-1
o Data taking: > 2007
o Bunch-crossing rate: 40 MHz
o ~20 p-p collisions per bunch crossing → p-p collision rate ~10^9 evt/s (~GHz)

CMS detector
(Figure: schematic of the CMS detector with the colliding proton beams)

CMS Data Acquisition
- Bunch crossing at 40 MHz → ~GHz interaction rate (~PB/sec, with 1 event ~1 MB in size)
- Level 1 Trigger (special hardware): output at 75 kHz (75 GB/sec)
- High Level Trigger (PCs): output at 100 Hz (100 MB/sec) to data recording
- Multi-level trigger to: filter out uninteresting events, reduce the data volume
- The online system feeds the recorded data to offline analysis
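A quick numerical cross-check of the rates above, as a minimal Python sketch that uses only the values quoted on this slide:

# Back-of-the-envelope check of the CMS DAQ rates quoted above.
bunch_crossing_hz = 40e6          # 40 MHz bunch-crossing rate
collisions_per_crossing = 20      # ~20 p-p collisions per crossing
event_size_mb = 1.0               # one event is ~1 MB

interaction_rate_hz = bunch_crossing_hz * collisions_per_crossing   # ~1e9, i.e. ~GHz
raw_rate_pb_s = interaction_rate_hz * event_size_mb / 1e9           # MB/s -> PB/s
l1_rate_gb_s = 75e3 * event_size_mb / 1e3                           # Level-1 output, GB/s
hlt_rate_mb_s = 100 * event_size_mb                                 # HLT output, MB/s

print(f"interaction rate ~ {interaction_rate_hz:.1e} Hz")   # ~GHz
print(f"raw data rate    ~ {raw_rate_pb_s:.1f} PB/s")       # ~PB/s
print(f"after Level-1    ~ {l1_rate_gb_s:.0f} GB/s")        # 75 GB/s
print(f"after HLT        ~ {hlt_rate_mb_s:.0f} MB/s")       # 100 MB/s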

CMS Computing
- Large-scale distributed computing and data access:
  o Must handle PetaBytes per year
  o Tens of thousands of CPUs
  o Tens of thousands of jobs
  o Heterogeneity of resources: hardware, software, architecture and personnel

CMS Computing Hierarchy
- Online system (~PB/sec) → Tier 0 at ~100 MB/sec: CERN computer centre / offline farm, ~10K PCs*
- Tier 1: regional centres (Italy, Fermilab, France, …), ~2K PCs each, connected at ~2.4 Gbits/sec
- Tier 2: Tier-2 centres, ~500 PCs each, connected at ~0.6–2 Gbits/sec
- Tier 3: institutes (Institute A, Institute B, …) and workstations, connected at ~Mbits/sec
(* 1 PC ~ PIII 1 GHz)

CMS Production and Analysis
- The main computing activity of CMS is currently the simulation, with Monte Carlo based programs, of how the experimental apparatus will behave once it is operational
- Large samples of simulated data are needed to:
  o optimise the detectors and investigate any possible modifications required to the data acquisition and processing
  o better understand the physics discovery potential
  o perform large-scale tests of the computing and analysis models
- This activity is known as “CMS Production and Analysis”

CMS Monte Carlo production chain
Generation → Simulation → Digitization → Reconstruction → Analysis
o CMKIN: Monte Carlo generation of the proton-proton interaction, based on PYTHIA; input: generation cards (text); output: a random-access zebra file (Hbook ntuple)
o CMSIM: simulation of tracking in the CMS detector, based on GEANT3; input: simulation cards (text) and the CMS geometry; output: a sequential-access zebra file (FZ)
o ORCA: reproduction of detector signals (Digis), simulation of the trigger response, reconstruction of physical information for final analysis; persistency in Objectivity → RootI/O (the replacement of Objectivity for persistency will be POOL); analysis output as Hbook/Root ntuples
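A minimal sketch of how one such two-step chain (generation followed by simulation) could be driven from a script; the executable names and argument conventions are illustrative placeholders, not the actual CMS command lines:

import pathlib
import subprocess

# Hypothetical driver for one production job: CMKIN generation, then CMSIM simulation.
# Executable names and arguments are placeholders for illustration only.
def run_chain(run_number: int, gen_cards: str, sim_cards: str, workdir: str) -> pathlib.Path:
    wd = pathlib.Path(workdir)
    wd.mkdir(parents=True, exist_ok=True)
    ntuple = wd / f"cmkin_run{run_number}.ntpl"   # CMKIN output (Hbook ntuple)
    fz_file = wd / f"cmsim_run{run_number}.fz"    # CMSIM output (FZ zebra file)

    # Step 1: event generation (PYTHIA-based CMKIN)
    subprocess.run(["cmkin.exe", gen_cards, str(ntuple)], check=True, cwd=wd)
    # Step 2: detector simulation (GEANT3-based CMSIM), reading the generated ntuple
    subprocess.run(["cmsim.exe", sim_cards, str(ntuple), str(fz_file)], check=True, cwd=wd)
    return fz_file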

CMS Tools for “Production”
- RefDB
  o Contains the production requests, with all parameters needed to produce a physics channel and the details of the production process. It is an SQL database located at CERN.
- IMPALA
  o Accepts a production request
  o Produces the scripts for each single job that needs to be submitted
  o Submits the jobs and tracks their status
- MCRunJob
  o Evolution of IMPALA: modular (plug-in approach)
- BOSS
  o Tool for job submission and real-time job-dependent parameter tracking. The running job's standard output/error are intercepted and the filtered information is stored in the BOSS database. The remote updator is based on MySQL.
(Workflow: RefDB parameters (cards, etc.) → IMPALA → job1, job2, job3, …)

CMS/EDG Stress Test
o Test of the CMS event simulation programs in the EDG environment, using the full CMS production system
o Running from November 30th to Christmas (tests continued up to February)
o A joint effort involving CMS, EDG, EDT and LCG people

CMS/EDG Stress Test Goals
- Verification of the portability of the CMS production environment into a grid environment
- Verification of the robustness of the European DataGrid middleware in a production environment
- Production of data for the physics studies of CMS, with an ambitious goal of ~1 million simulated events in 5 weeks

CMS/EDG Strategy
- Use as much as possible the high-level Grid functionalities provided by EDG:
  o Workload Management System (Resource Broker)
  o Data Management (Replica Manager and Replica Catalog)
  o MDS (Information Indexes)
  o Virtual Organization Management, etc.
- Interface (modify) the CMS production tools to the Grid-provided access methods
- Measure performance, efficiency and the reasons for job failures, to provide feedback both to CMS and to EDG

CMS/EDG Middleware and Software
- Middleware: EDG, upgraded from one version to the next during the test
  o Resource Broker server
  o Replica Manager and Replica Catalog servers
  o MDS and Information Index servers
  o Computing Elements (CEs) and Storage Elements (SEs)
  o User Interfaces (UIs)
  o Virtual Organization Management servers (VO) and clients
  o EDG monitoring
  o etc.
- CMS software distributed as rpms and installed on the CEs
- CMS production tools installed on the User Interface

User Interface set-up
CMS production tools installed on the EDG User Interface.
- IMPALA
  o Gets from RefDB the parameters needed to start a production
  o “JDL” files are produced along with the job scripts
- BOSS
  o BOSS accepts a JDL file and passes it on to the Resource Broker
  o Additional info is stored in the BOSS DB:
    - Logical file names of input/output files
    - Name of the SE hosting the output files
    - Outcome of the copy and registration of the files in the RC
    - Status of the replication of the files
(Diagram: RefDB parameters feed the User Interface running IMPALA/BOSS, which produces job1/JDL1, job2/JDL2, …; the BOSS database sits alongside the UI)
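To give a concrete picture of what IMPALA hands to BOSS at this point, here is a hypothetical sketch of per-job JDL generation in Python; the attribute values (script names, sandbox contents, the CMS software tag) are placeholders, not the settings actually used in the Stress Test:

# Hypothetical per-job JDL generation (attribute values are placeholders).
JDL_TEMPLATE = """\
Executable    = "{script}";
StdOutput     = "job_{run}.out";
StdError      = "job_{run}.err";
InputSandbox  = {{"{script}", "cards_{run}.txt"}};
OutputSandbox = {{"job_{run}.out", "job_{run}.err"}};
Requirements  = Member("CMS", other.RunTimeEnvironment);
"""

def make_jdl(run: int, script: str) -> str:
    return JDL_TEMPLATE.format(run=run, script=script)

for run in (1, 2):
    with open(f"job{run}.jdl", "w") as jdl_file:
        jdl_file.write(make_jdl(run, f"cmkin_job{run}.sh"))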

CMS production components interfaced to EDG middleware
- Production is managed from the EDG User Interface with IMPALA/BOSS
(Diagram: on the CMS side, RefDB and the BOSS DB feed the UI running IMPALA/BOSS; on the EDG side, the UI submits JDL to the Workload Management System, which dispatches jobs to CEs with CMS software installed and to the associated SEs; arrows indicate push of data/info and pull of info)

CMS jobs description
- CMS official jobs for “Production” of results used in physics studies (dataset eg02_BigJets)
- Production in 2 steps:
  1. CMKIN: MC generation for a physics channel (dataset) – 125 events, ~1 minute, ~6 MB ntuples (“short” jobs)
  2. CMSIM: CMS detector simulation – 125 events, ~12 hours, ~230 MB FZ files (“long” jobs)
- Per-event size and time (* on a PIII 1 GHz, 512 MB, ~46.8 SI95):
  o CMKIN: ~0.05 MB/event, less than a second per event (125 events in ~1 minute)
  o CMSIM: ~1.8 MB/event, ~6 min/event
- Workflow: the CMKIN job writes its ntuples to a Grid Storage Element; the CMSIM job reads them from the Storage Element and writes its FZ files back to Grid storage
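These per-job figures set the scale of the exercise; a small Python estimate using only the numbers above and the ~1 million event goal:

# Scale of the Stress Test, derived from the per-job numbers on this slide.
events_per_job = 125
goal_events = 1_000_000

jobs_per_step = goal_events / events_per_job          # jobs needed per production step
cmsim_output_gb = goal_events * 1.8 / 1024            # ~1.8 MB per CMSIM event
cmsim_cpu_hours = jobs_per_step * 12                  # ~12 h per CMSIM job

print(f"jobs per step : {jobs_per_step:.0f}")         # 8000
print(f"CMSIM output  : ~{cmsim_output_gb:.0f} GB")   # ~1800 GB
print(f"CMSIM CPU time: ~{cmsim_cpu_hours:.0f} h")    # ~96000 h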

CMKIN Workflow
IMPALA creation and submission of CMKIN jobs:
- The Resource Broker sends jobs to computing resources (CEs) having the CMS software installed
- Output ntuples are saved on the close SE and registered in the Replica Catalog with a Logical File Name (LFN)
- The LFN of the ntuple is recorded in the BOSS database
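The save-and-register step at the end of each job can be pictured with a short Python sketch; the three helper functions are stand-ins for the real EDG replica-management and BOSS commands (which are not spelled out on this slide), so treat this purely as an illustration of the bookkeeping flow:

# Hypothetical end-of-job bookkeeping for a CMKIN job.
# The helpers below are stand-ins for the real EDG/BOSS calls; they only print the intent.
def copy_to_se(local_file: str, se: str) -> str:
    return f"sfn://{se}/cms/{local_file}"          # pretend physical file name (PFN)

def register_lfn(lfn: str, pfn: str) -> bool:
    print(f"register {lfn} -> {pfn}")              # would talk to the Replica Catalog
    return True

def boss_record(job_id: int, info: dict) -> None:
    print(f"BOSS job {job_id}: {info}")            # would update the BOSS MySQL DB

def store_output(local_file: str, close_se: str, lfn: str, job_id: int) -> bool:
    pfn = copy_to_se(local_file, close_se)
    ok = register_lfn(lfn, pfn)
    boss_record(job_id, {"lfn": lfn, "se": close_se, "registered": ok})
    return ok

store_output("cmkin_run1.ntpl", "se.example.org", "lfn:cmkin_run1.ntpl", 42)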

CMS production of CMKIN jobs
- CMKIN jobs running on all EDG Testbed sites with CMS software installed
(Diagram: the UI with IMPALA/BOSS submits JDL to the Workload Management System; jobs run on worker nodes of CEs with CMS software, write their ntuples to the close SE and register them through the Replica Manager; the BOSS DB and RefDB stay on the CMS side)

CMSIM Workflow
IMPALA creation and submission of CMSIM jobs:
- Computing resources are matched to the job requirements:
  o Installed CMS software, MaxCPUTime, etc.
  o CE close to the input data that have to be processed
- FZ files are saved on the close SE or on a predefined SE and registered in the Replica Catalog
- The LFN of the FZ file is recorded in the BOSS DB
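A hypothetical JDL fragment for such a data-driven job could look as follows (written as a Python string for consistency with the other sketches; the attribute names follow the EDG JDL style, but the values are illustrative, and the InputData entry is what lets the Resource Broker pick a CE near a replica):

# Illustrative JDL for a CMSIM job that must run close to its input ntuple.
# Values are placeholders, not the actual Stress Test settings.
cmsim_jdl = """\
Executable         = "cmsim_job1.sh";
InputData          = {"lfn:cmkin_run1.ntpl"};
DataAccessProtocol = {"file", "gridftp"};
Requirements       = Member("CMS", other.RunTimeEnvironment) && other.MaxCPUTime > 720;
OutputSandbox      = {"job_1.out", "job_1.err"};
"""
print(cmsim_jdl)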

CMS production of CMSIM jobs
- CMSIM jobs running on a CE close to the input data
(Diagram: the UI with IMPALA/BOSS submits JDL to the Workload Management System, which queries the Replica Manager for the input data location; the job runs on a worker node of a CE with CMS software, reads the input ntuple from the SE, writes the FZ file and registers it)

Data management
- Two practical approaches:
  1. FZ files are directly stored at some dedicated SE
  2. FZ files are stored on the “close SE” and later replicated to CERN, to test the creation of file replicas: 402 FZ files (~96 GB) were replicated
- All sites use disk for the file storage, but:
  – CASTOR at CERN: FZ files replicated to CERN are also automatically copied into the CASTOR mass storage
  – HPSS in Lyon: FZ files stored in Lyon are automatically copied into the HPSS mass storage
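The quoted volume is consistent with the per-job output size given earlier; a one-line Python check (all numbers taken from these slides, so only approximate):

# Rough consistency check: 402 replicated FZ files at ~230 MB each.
n_files, fz_size_mb = 402, 230
print(f"~{n_files * fz_size_mb / 1024:.0f} GB replicated")   # ~90 GB, close to the ~96 GB quoted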

Monitoring CMS jobs
- Job monitoring and bookkeeping: BOSS database, EDG Logging & Bookkeeping service
(Diagram: the running job's output is filtered at runtime and sent back to the BOSS DB on the CMS side, while the EDG Logging & Bookkeeping service tracks the job through the Workload Management System; data registration and input data location go through the Replica Manager)

Monitoring the production
- Information about the job (number of events, executing host, …) from the BOSS database (boss SQL)
- Job status from the Logging & Bookkeeping service (dg-job-status)
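As an illustration of how such a monitoring loop could be scripted, here is a hedged Python sketch: the BOSS host, credentials, table and column names are made up for the example (the real BOSS schema is not shown on this slide), and dg-job-status is the EDG command quoted above, called here with just a job identifier:

import subprocess
import MySQLdb   # BOSS keeps its bookkeeping in a MySQL database

# Hypothetical monitoring sketch; host, credentials, table and column names are placeholders.
db = MySQLdb.connect(host="boss-db.example.org", user="reader", passwd="secret", db="boss")
cursor = db.cursor()
cursor.execute("SELECT job_id, edg_job_id, n_events, exec_host FROM jobs WHERE dataset='eg02_BigJets'")

for job_id, edg_job_id, n_events, exec_host in cursor.fetchall():
    # Grid-level status comes from the EDG Logging & Bookkeeping service.
    result = subprocess.run(["dg-job-status", edg_job_id], capture_output=True, text=True)
    status_line = result.stdout.strip().splitlines()[-1] if result.stdout.strip() else "unknown"
    print(job_id, exec_host, n_events, status_line)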

Monitoring
Offline monitoring:
- Two main sources of information:
  o EDG monitoring system (MDS based)
    - MDS information is volatile and needs to be archived somehow
    - collected regularly by scripts running as cron jobs and stored for offline analysis
  o BOSS database
    - permanently stored in the MySQL database
- Both sources are processed by boss2root, a tool developed to read the information saved in BOSS and store it in a ROOT tree for analysis
(Flow: the BOSS DB and the Information System (MDS) are read from the CMS UI / workstation via boss SQL and boss2root into a ROOT tree)
Online monitoring: with a Nagios-based web tool set up by the DataTag project
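The "archive MDS with cron jobs" idea can be sketched in a few lines; MDS was queried over LDAP, but the host, port and base DN below are placeholders rather than the actual testbed endpoints:

import pathlib
import subprocess
import time

# Hypothetical MDS snapshot script, meant to be run periodically from cron.
MDS_HOST, MDS_PORT = "mds.example.org", "2135"
BASE_DN = "mds-vo-name=local,o=grid"   # assumed base DN, typical of Globus MDS 2.x
OUTDIR = pathlib.Path("mds_snapshots")

def snapshot() -> pathlib.Path:
    OUTDIR.mkdir(exist_ok=True)
    stamp = time.strftime("%Y%m%d_%H%M%S")
    outfile = OUTDIR / f"mds_{stamp}.ldif"
    result = subprocess.run(
        ["ldapsearch", "-x", "-h", MDS_HOST, "-p", MDS_PORT, "-b", BASE_DN],
        capture_output=True, text=True)
    outfile.write_text(result.stdout)   # keep the raw LDIF for later offline analysis
    return outfile

if __name__ == "__main__":
    print("wrote", snapshot())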

Organisation of the Test
- Four UIs controlling the production (reduces the bottleneck due to the BOSS DB):
  o Bologna / CNAF
  o Ecole Polytechnique
  o Imperial College
  o Padova
- Several Resource Brokers, each seeing all resources (reduces the bottleneck due to intensive use of the RB and the 512-owner limit in Condor-G):
  o CERN (dedicated to CMS) – Ecole Polytechnique UI
  o CERN (common to all applications) – backup
  o CNAF (common to all applications) – Padova UI
  o CNAF (dedicated to CMS) – CNAF UI
  o Imperial College (dedicated to CMS and BaBar) – IC UI
- Replica Catalog at CNAF
- Top MDS at CERN
- Information Indexes at CERN and CNAF
- VO server at NIKHEF

EDG hardware resources
(Sites in the testbed: CERN, CNAF Bologna, Legnaro & Padova, Ecole Polytechnique, Lyon, RAL, Imperial College, NIKHEF; * = resources dedicated to the CMS Stress Test)
Site – Number of CPUs – Disk space (GB) – MSS:
o CERN (CH)* – (+100) – yes
o CNAF (IT)* – 1000
o RAL (UK) – 16 – 360
o Lyon (FR) – 120 shared (400) – 200 – yes
o NIKHEF (NL) – 22 – 35
o Legnaro (IT)*
o Ecole Polytechnique (FR)* – 4 – 220
o Imperial College (UK)*
o Padova (IT)*
o Totals – 402 (400) – 3000 (+2245)

Distribution of jobs: executing CEs
(Plot: number of jobs per executing Computing Element)

CMS/EDG Production – CMKIN “short” jobs
(Plot: number of events produced vs. time, broken down by submitting UI)

CMS/EDG Production – CMSIM “long” jobs
(Plot: number of events produced vs. time from 30 Nov to 20 Dec, broken down by submitting UI; annotations mark the CMS Week, the upgrade of the middleware, and the point where some implementation limits (RC, MDS) were hit)
o ~260K events produced
o ~7 sec/event on average, ~2.5 sec/event at peak (12–14 Dec)

Total number of events
- Each job produced 125 events; 0.05 MB/event (CMKIN), 1.8 MB/event (CMSIM)
(Table: total number of CMKIN and CMSIM events, and their % of total, per submitting UI – CNAF, PD, IC, POLY – and in total)
- Total size of data produced: ~500 GB
- Total number of successful jobs: ~7000

Summary of Stress Test
Total EDG Stress Test jobs = 10676, successful = 7196, failed = 3480
- CMKIN (“short”) jobs: efficiency 87% in the EDG evaluation, 83% in the “CMS evaluation” (table with finished / crashed / total job counts per evaluation)
- CMSIM (“long”) jobs: efficiency 39% in the EDG evaluation, 70% in the “CMS evaluation” (table with finished / crashed / total job counts per evaluation)
- EDG evaluation: all submitted jobs are considered; successful jobs are those correctly finished for EDG
- “CMS evaluation”: only jobs that had a chance to run are considered; successful jobs are those with the output data properly stored
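The headline totals above are enough to reproduce the overall success fraction, and the two efficiency definitions can be written down explicitly (a small sketch; only the numbers quoted on this slide are used):

# Overall outcome of the Stress Test, from the totals quoted above.
total, successful, failed = 10676, 7196, 3480
assert successful + failed == total
print(f"overall success fraction: {successful / total:.0%}")   # ~67%

# The two efficiency definitions used on this slide, as functions.
def edg_efficiency(finished_ok: int, submitted: int) -> float:
    # EDG evaluation: every submitted job counts in the denominator.
    return finished_ok / submitted

def cms_efficiency(output_stored: int, jobs_that_ran: int) -> float:
    # "CMS evaluation": only jobs that had a chance to run count,
    # and success means the output data were properly stored.
    return output_stored / jobs_that_ran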

EDG reasons of failure (categories)
CMKIN (“short”) jobs – crashed or bad status: 818. Reasons of failure for crashed jobs:
o No matching resource found: 509
o Generic failure: MyProxyServer not found in JDL expression: 102
o Running forever: 74
o Failure while executing job wrapper: 37
o Other failures: 96
CMSIM (“long”) jobs – crashed or bad status: 2662. Reasons of failure for crashed jobs:
o Failure while executing job wrapper: 1476
o No matching resource found: 722
o Globus failure: Globus down / submit to Globus failed: 144
o Running forever: 116
o Globus failure: 90
o Other failures: 114

Main sources of trouble (I)
- Weakness of the Information Service (MDS and Information Index) → “No matching resources found” error:
  o As the query rate increases, the top MDS and the II slow down dramatically. Since the RB relies on the II to discover available resources, the MDS instability caused jobs to abort due to lack of matching resources.
  o Work-around: use a cache of the information stored in a Berkeley-database LDAP back-end (from EDG version 1.4). The rate of jobs aborted due to information-system problems was reduced from 17% to 6%.

Main sources of trouble (II)
- Problems in the job submission chain related to the Workload Management System → “Failure while executing job wrapper” error (the most relevant failure for “long” jobs):
  o Failures in downloading/uploading the Input/Output Sandbox files between the RB and the WN, due for example to problems in the GridFTP file transfer, network failures, etc.
  o The standard output of the script that wraps the user job was empty. This file is transferred via Globus GASS from the CE node to the RB machine in order to check whether the job reached the end. There are many possible reasons (e.g. home directory not available on the WN, glitches in the GASS transfer, race conditions for file updates between the WN and the CE node with PBS, etc.).
  o Several fixes were introduced to reduce this effect (if necessary transfer the stdout also with GridFTP, PBS-specific fixes, …), from EDG 1.4.3.

Main sources of trouble (III)
- Replica Catalog performance limitations:
  o Limit on the number of lengthy named entries in one file collection → several collections used
  o The catalog responds badly to a high query/write rate, with queries hanging indefinitely → a very difficult situation to deal with, since the jobs hung while accessing it and stayed in “Running” status forever, requiring manual intervention from the local system administrators
- Efficiency of copying the output file to an SE and registering it in the RC (total number of files written into the RC: ~8000):
  o CMKIN: 97% (copy), 86% (register), 83% (copy & register) → small output file, higher write rate into the RC
  o CMSIM: 84% (copy), 93% (register), 78% (copy & register) → bigger output file, slower write rate into the RC
- Some instability of the Testbed due to a variety of reasons (from hardware failures to network instabilities to mis-configurations)

Tests after the Stress Test
- Including fixes and performance enhancements, mainly to reduce the rate of failures in the job submission chain
CMKIN (“short”) jobs, EDG evaluation:
o Finished correctly: 1014
o Crashed or bad status: 57
o Total number of jobs: 1071
o Efficiency: 95%
CMSIM (“long”) jobs, EDG evaluation:
o Finished correctly: 653
o Crashed or bad status: 264
o Total number of jobs: 917
o Efficiency: 71%
- Increased efficiency, in particular for long jobs (limited statistics w.r.t. the Stress Test)

Main results and observations
- RESULTS
  o Could distribute and run CMS software in the EDG environment
  o Generated ~250K events for physics with ~10,000 jobs in a 3-week period
- OBSERVATIONS
  o Were able to quickly add new sites to provide extra resources
  o Fast turnaround in bug fixing and installing new software
  o The test was labour-intensive (since the software was still developing and the overall system was fragile)
    - WP1: at the start there were serious problems with long jobs – recently improved
    - WP2: replication tools were difficult to use and not reliable, and the performance of the Replica Catalogue was unsatisfactory
    - WP3: the Information System based on MDS performed poorly with increasing query rate
    - The system is sensitive to hardware faults and site/system mis-configuration
    - The user tools for fault diagnosis are limited
  o EDG 2.0 should fix the major problems, providing a system suitable for full integration in distributed production

Other tests: systematic submission of CMS jobs
o Use CMS jobs to test the behaviour/response of the grid as a function of the job characteristics
o No massive tests in a production environment
o Systematic submission over a period of ~4 months (March–June)

Characteristics of CMS jobs
- CMS jobs with different CPU and I/O requirements, varying:
  o Kind of application: CMKIN and CMSIM jobs
  o Number of events: 10, 100, 500
  o Cards file (defines the kind of events to be simulated): datasets “ttbar”, “eg02BigJets”, “jm_minbias”
- 18 different kinds of jobs in total (enumerated in the sketch below)
- Measure the requirements of these jobs in terms of:
  o Resident Set Size
  o Wall Clock Time
  o Input size
  o Output size
(Plot: time in seconds, e.g. from ~300 sec to ~6400 sec, per kind of job)
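The 18 job types are simply the Cartesian product of the three knobs listed above; enumerating them in Python makes that explicit:

from itertools import product

# The 18 job types: 2 applications x 3 event counts x 3 datasets.
applications = ["CMKIN", "CMSIM"]
event_counts = [10, 100, 500]
datasets = ["ttbar", "eg02BigJets", "jm_minbias"]

job_types = list(product(applications, event_counts, datasets))
print(len(job_types))   # 18
for app, n_events, dataset in job_types:
    print(f"{app:6s} {n_events:4d} events  {dataset}")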

Definition of classes and strategy for job submission
- Definition of classes of jobs according to their characteristics:
(Table: classes G1–G4 characterised by Time (h), RSS (MB), Input (MB) and Output (MB); G1 corresponds to the not-demanding CMKIN jobs, G2–G4 to CMSIM jobs with increasing requirements)
- Submission of the various kinds of jobs to the EDG testbed:
  o Use of the same EDG functionalities as described for the Stress Test (Resource Broker, Replica Catalog, etc.)
  o 2 Resource Brokers were used (Lyon and CNAF)
  o Several submissions for each kind of job:
    - submission in bunches of 5 jobs
    - submission spread over a long period

Behaviour of the classes on EDG
o Comparison of the Wall Clock Time (WCT) and the Grid Wall Clock Time (GWCT)
o Report of the failure rate for each class
(Plot: WCT and GWCT in seconds for classes G1–G4)
o Overhead (GWCT vs. WCT): ClassG1 2072%, ClassG2 74%, ClassG3 82%, ClassG4 32% (see also the next slide)
o Failure rate: ClassG1 26%, ClassG2 47%, ClassG3 53%, ClassG4 86%
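The overhead quoted here measures how much extra time the grid adds on top of the pure execution time; a short sketch of that definition, assuming overhead = (GWCT - WCT) / WCT, which reproduces the pattern of numbers on these slides (the input values below are illustrative, not the measured ones):

# Assumed definition: overhead = (GWCT - WCT) / WCT, in percent.
def grid_overhead_pct(wct_sec: float, gwct_sec: float) -> float:
    return 100.0 * (gwct_sec - wct_sec) / wct_sec

# Illustrative values only: a 5-minute job spending 100 minutes in the system
# has a ~1900% overhead (G1-like); a ~20 h job with ~6.4 h of waiting sits near 32% (G4-like).
print(f"{grid_overhead_pct(300, 6000):.0f}%")      # 1900%
print(f"{grid_overhead_pct(72000, 95000):.0f}%")   # ~32%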

Comments
- The behaviour of the identified classes of jobs on the EDG testbed:
  o The best class is G2, with an execution time ranging from 5 minutes to ~2 hours
  o Very short jobs have a huge overhead
    - Mean time affected by a few jobs with strange pathologies
  o The failure rate increases dramatically as the needed CPU time increases
    - Instability of the testbed: e.g. there were frequent operational interventions on the RB which caused loss of jobs. Jobs lasting more than 20 hours have very little chance to survive.
Class – Overhead % – Failure rate %:
o ClassG1: 2072% – 26%
o ClassG2: 74% – 47%
o ClassG3: 82% – 53%
o ClassG4: 32% – 86%
(Plot: number of jobs vs. time in seconds, with increasing complexity)

Conclusions
- HEP applications requiring Grid computing are already there
- All the LHC experiments are using the current implementations of many projects
  o Need to test the scaling capabilities (testbeds)
  o Robustness and reliability are the key issues for the applications
- LHC experiments look forward to the EGEE and LCG deployments