Friday 27 April 2007
Management of ATLAS at CC-IN2P3: Specificities, issues and advice
J. Devemy / N. Lajili, ATLAS production

Slide 2: Summary
- A few metrics
- Overview of local monitoring tools
- Concrete actions taken
- Issues
- Advice
- BQS point of view
- Questions

Slide 3: A few metrics (March 2007)
Jobs:
- Total submitted jobs:
- Total submitted jobs, class Long:
Separate farms:
- Pistoo: 56 CPUs (for parallel jobs)
- Anastasie: 1616 CPUs
Farm usage:
- 62 groups (experiments or laboratories) and 384 users

Slide 4: A few metrics, jobs (class Long)
- Total submitted jobs:
- Total submitted jobs, class Long: = 47 %
Memory use for jobs on class Long:
- Memory consistently requested: 2 GB
- 69 % used less than 1.5 GB
- 29 % used between 1.5 GB and 2 GB
CPU time use for jobs on class Long (in IN2P3 units):
- CPU time consistently requested: s
- 97 % of jobs used less than s
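Percentages like these are straightforward to derive from per-job accounting records. A minimal sketch, assuming a hypothetical list of job records; the field name `mem_used_gb` and the sample values are illustrative, not BQS's actual accounting schema (the CPU-time buckets would be computed the same way):

```python
# Sketch: derive the memory-usage buckets quoted above from
# hypothetical per-job accounting records.
jobs = [
    {"mem_used_gb": 1.2},
    {"mem_used_gb": 1.8},
    {"mem_used_gb": 0.9},
]

def pct(selected: int, total: int) -> float:
    """Percentage of jobs matching a predicate."""
    return 100.0 * selected / total if total else 0.0

n = len(jobs)
under_1p5 = sum(1 for j in jobs if j["mem_used_gb"] < 1.5)
between = sum(1 for j in jobs if 1.5 <= j["mem_used_gb"] < 2.0)

print(f"{pct(under_1p5, n):.0f} % used less than 1.5 GB")
print(f"{pct(between, n):.0f} % used between 1.5 and 2 GB")
```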

Slide 5: A few metrics, jobs (class T)
- Total submitted jobs:
- Total submitted jobs, class T: = 59 % of total jobs
Memory use for jobs on class T:
- Memory consistently requested: 2 GB
- 86 % used less than 1.5 GB
- 14 % used between 1.5 GB and 2 GB
CPU time use for jobs on class T (in IN2P3 units):
- CPU time consistently requested: s
- 98 % of jobs used less than s

Slide 6: Monitoring at CC: MRTG
Real-time production status:
- Green: all ATLAS running jobs at CC-IN2P3
- Blue: all LHC running jobs at CC-IN2P3
- Orange: ((all ATLAS running jobs at CC-IN2P3) / (all LHC running jobs at CC-IN2P3)) * 100
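For concreteness, the orange curve is just the ATLAS share of LHC jobs expressed as a percentage. A one-line sketch, with placeholder counts (a real MRTG setup would pull these from the batch system):

```python
# Orange curve: ATLAS running jobs as a percentage of all LHC running
# jobs at CC-IN2P3. The two counts are illustrative placeholders.
atlas_running = 420
lhc_running = 1050

orange = (atlas_running / lhc_running) * 100 if lhc_running else 0.0
print(f"ATLAS share of LHC jobs: {orange:.1f} %")  # -> 40.0 %
```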

Slide 7: Local job monitoring tools
Tools to detect problematic job behaviour:
1. "Slow" jobs: running jobs which do not consume CPU time.
2. "Early ended" jobs: batches of jobs using much less CPU time than requested (see the sketch below).
3. Alert mails: BQS sends mail to the grid site admins in case of job failure.
4. Manual checks: by running scripts.
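A minimal sketch of the first two checks, assuming a hypothetical snapshot of job records; the fields, thresholds and sample IDs are invented for illustration and are not BQS's actual accounting interface:

```python
# Sketch of the "slow" and "early ended" checks described above.
# Job records and thresholds are hypothetical, for illustration only.
from dataclasses import dataclass

@dataclass
class Job:
    job_id: str
    wall_time_s: float      # elapsed wall-clock time
    cpu_time_s: float       # CPU time actually consumed
    cpu_requested_s: float  # CPU time requested at submission
    running: bool

def is_slow(job: Job, min_wall_s: float = 3600, cpu_ratio: float = 0.01) -> bool:
    """Running job that has burned almost no CPU despite significant wall time."""
    return (job.running and job.wall_time_s > min_wall_s
            and job.cpu_time_s < cpu_ratio * job.wall_time_s)

def ended_early(job: Job, used_fraction: float = 0.05) -> bool:
    """Finished job that used far less CPU time than it requested."""
    return (not job.running
            and job.cpu_time_s < used_fraction * job.cpu_requested_s)

jobs = [
    Job("bqs-0001", wall_time_s=7200, cpu_time_s=12, cpu_requested_s=90000, running=True),
    Job("bqs-0002", wall_time_s=300, cpu_time_s=250, cpu_requested_s=90000, running=False),
]
print([j.job_id for j in jobs if is_slow(j)])      # -> ['bqs-0001']
print([j.job_id for j in jobs if ended_early(j)])  # -> ['bqs-0002']
```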

Slide 8: Local job monitoring tool (screenshot)

Slide 9: Concrete actions taken
Find a detailed diagnosis of job failures, e.g.:
- Lack of resources, expired proxy, pending transfers, core LCG services unavailable
- Job environment settings for a given VO
Find the job identity:
- LCG job IDs, BQS job IDs, Globus job IDs (see the correlation sketch below)
Inform the users or the VO admin.
Notify the administrators of the services involved:
- Mail, GGUS ticket
Various tasks for managing the production, including:
- Jobs can be deleted, or locked in the queue, in case of problems
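Correlating the three ID spaces is essentially a lookup-table problem. A minimal sketch, assuming a hypothetical mapping built from batch-system and middleware logs; the ID formats, hostnames and the `id_map` source are illustrative:

```python
# Sketch: correlate LCG, BQS and Globus job IDs so that a problem seen
# in one system can be traced to the others. The mapping source and ID
# formats are hypothetical; a real site would build this from logs.
id_map = [
    {"lcg": "https://rb.example.org:9000/AbCdEf",
     "bqs": "bqs-123456",
     "globus": "https://ce.example.org:2119/jobmanager-bqs/987654"},
]

def find_job(any_id: str):
    """Return the full ID record matching any one of the three IDs."""
    for record in id_map:
        if any_id in record.values():
            return record
    return None

print(find_job("bqs-123456"))
```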

Slide 10: Concrete actions taken (continued)
Increasing a VO's quota:
- To cope with intensive computing: data challenges, MC production...
Creating BQS resources:
- To cope with the unavailability of internal services (HPSS, dCache)
Setting up VO agents:
- To automatically regulate job priorities and resources according to the VO's requirements (see the sketch below)
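What such a VO agent might look like is sketched below as a simple polling pass: compare each VO's running jobs to its quota and derive a priority adjustment. The quota numbers, the `running_jobs()` helper and the adjustment rule are all assumptions for illustration, not the agent actually deployed at CC-IN2P3:

```python
# Sketch of a VO agent: compare each VO's running jobs to its quota
# and derive a priority adjustment. Quotas, the job-count helper and
# the rule itself are hypothetical.
QUOTAS = {"atlas": 800, "cms": 600, "lhcb": 300}  # illustrative quotas

def running_jobs(vo: str) -> int:
    """Placeholder: a real agent would query the batch system here."""
    return {"atlas": 750, "cms": 200, "lhcb": 310}[vo]

def priority_adjustment(vo: str) -> int:
    """Boost under-used VOs, throttle those over quota."""
    usage = running_jobs(vo) / QUOTAS[vo]
    if usage > 1.0:
        return -1   # over quota: lower priority of queued jobs
    if usage < 0.5:
        return +1   # far under quota: raise priority
    return 0

for vo in QUOTAS:
    print(vo, priority_adjustment(vo))
# A real agent would repeat this pass periodically (e.g. every minute).
```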

Slide 11: Issues
- Sometimes it is hard to find the user
- Sometimes very low reactivity from users
- Users are not well informed about the LCG service status
- Recurrent problems with file access or copy: remote SRM SE unavailable, LFC not responding, failing transfers...
- Hard to trace jobs which are not submitted through Resource Brokers
- Lack of visibility on the status of core LCG services

Slide 12: Issues (continued)
- Zombie processes left by ended jobs on the worker nodes (solved in the next BQS version; see the detection sketch below)
- Lack of tools for managing VO priorities (to be solved in a future version, autumn 2007)
- Memory waste by jobs submitted on the class Long
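Until the BQS fix landed, zombies could be spotted on a worker node by scanning /proc. A minimal Linux-only sketch (cleanup is left out, since reaping a zombie requires its parent's cooperation):

```python
# Sketch: list zombie processes on a Linux worker node by reading the
# process state field from /proc/<pid>/stat (state 'Z' means zombie).
import os

def zombies():
    found = []
    for pid in filter(str.isdigit, os.listdir("/proc")):
        try:
            with open(f"/proc/{pid}/stat") as f:
                stat = f.read()
        except OSError:
            continue  # process exited while we were scanning
        # The state letter follows the parenthesised command name.
        state = stat.rsplit(")", 1)[1].split()[0]
        if state == "Z":
            found.append(int(pid))
    return found

print(zombies())
```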

Slide 13: Advice
To get more jobs running:
- Always keep jobs queued, so that a high number of running jobs can be sustained
- Limit the memory requested by jobs submitted on the class Long
- Keep us informed as soon as possible about critical production periods

Slide 14: BQS
BQS in a few words...
- Home-built batch system (10 years old)
- Runs on all UNIX flavours (GNU/Linux, Solaris, AIX...)
- Under continuous evolution; new functionalities are added enabling:
  - scalability, robustness, reliability
  - functionality required by users
  - GRID compliance
- Very rich scheduling policy, including: quotas, resource status, number of queued and running jobs...

Slide 15: BQS (2)
BQS philosophy:
- Dispatch of heterogeneous jobs onto worker nodes
- Usage of BQS resources (a kind of semaphore; see the sketch below)
Current developments:
- Addition of GRID functionalities:
  - managing VOMS groups and roles
  - storing more GRID information in BQS
- New BQS servers (to easily absorb the growth of activity)
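The "resources as semaphores" idea can be illustrated with a small sketch: a job is dispatched only while every resource it declares (say, HPSS or dCache availability) is marked up. The resource names and the dispatch hook are assumptions for illustration, not BQS internals:

```python
# Sketch of BQS-style resources as semaphores: a job declaring a
# resource is only dispatched while that resource is marked available.
# Resource names and the dispatch hook are illustrative.
resources = {"hpss": True, "dcache": False}  # operator-toggled flags

def dispatchable(job_resources: list[str]) -> bool:
    """A job runs only if all resources it depends on are up."""
    return all(resources.get(r, False) for r in job_resources)

print(dispatchable(["hpss"]))            # True
print(dispatchable(["hpss", "dcache"]))  # False: dCache is down
```

Marking a resource down (as an operator would during an HPSS or dCache outage) holds back every job that depends on it, without touching jobs that do not.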

Slide 16: Comments / Questions