Overview of operations at CC-IN2P3
Exploitation team, reported by Philippe Olivero
CMS Production, 11/30/2007

Slide 2: Summary
- BQS
- A few metrics
- Overview of local monitoring tools
- Concrete actions taken
- Issues
- Advice
- Questions

Slide 3: BQS
BQS in a few words:
- A home-built batch system (15 years old), maintained by more than 2 full-time developers
- Quick response to problems and requests
- Under continuous evolution: new functionality is added to provide scalability, robustness and reliability (RDBMS back end), functionality requested by users, and Grid compliance
- A very rich scheduling policy, including quotas, resource status, the number of simultaneously spawned and running jobs...

Slide 4: BQS (2)
BQS philosophy:
- Dispatch of heterogeneous jobs onto the worker nodes (workers are not dedicated to one experiment)
- Usage of BQS resources (a kind of semaphore; a minimal sketch follows below)
Current developments:
- Addition of Grid functionality: managing VOMS groups and roles, storing more Grid information in BQS
- Support for the HEP unit [SI2K]
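The slides describe a BQS resource only as "a kind of semaphore", and slide 9 shows operators creating and modifying such resources when internal services (HPSS, dCache) are unavailable. A minimal Python sketch of that idea; every name here (Resource, can_dispatch, capacity) is hypothetical, not BQS's actual interface:

    # Minimal sketch of the "BQS resource" idea: a named counting
    # semaphore that gates job dispatch. Setting capacity to 0 closes
    # the resource, so dependent jobs stay queued instead of failing.

    class Resource:
        def __init__(self, name, capacity):
            self.name = name
            self.capacity = capacity
            self.in_use = 0

        def available(self):
            return self.in_use < self.capacity

    resources = {
        "hpss":   Resource("hpss", 200),    # cap jobs staging from HPSS
        "dcache": Resource("dcache", 500),  # cap jobs reading from dCache
    }

    def can_dispatch(needed):
        # A job is dispatched only if every resource it declares is free.
        return all(resources[r].available() for r in needed)

    def dispatch(needed):
        if not can_dispatch(needed):
            return False                    # job stays queued
        for r in needed:
            resources[r].in_use += 1
        return True

    # dCache goes down: the operator closes the resource.
    resources["dcache"].capacity = 0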

Slide 5: A few metrics – October 2007
Jobs:
- Total submitted jobs: >30,000/day (October 2006: ~ jobs/day)
- 5,000 running jobs (October 2006: ~3,500)
- Jobs submitted to the LONG class: 64%
2 separate farms:
- Pistoo: 64 cores (for parallel jobs), 60 kSI2K
- Anastasie: ~5,000 cores, 7,500 kSI2K (SL3: 700 cores; SL4 32-bit: 3,700 cores; SL4 64-bit: 600 cores)
Farm usage:
- 62 Unix groups, 3,000 users, 500 regular users

Slide 6: A few metrics – 2007
Jobs:
- Total submitted jobs: (<15% of all)
- Submitted jobs in the LONG class: = 96%
- CPU: 12.3% of all
Memory use for jobs in the LONG class (a sketch of this kind of computation follows below):
- Memory consistently requested: 2 GB (local job manager)
- 72% used less than 1 GB
- 93% used less than 1.5 GB
CPU time use for jobs in the LONG class:
- 93% of jobs used less than 40% of the class limit (but the MEDIUM class allows only 1 GB of memory)
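Percentiles like the ones above are typically derived from batch accounting records; a hypothetical sketch of that computation (the record layout and the sample values are invented, not taken from BQS accounting):

    # Hypothetical sketch: computing the memory-use fractions quoted on
    # this slide from accounting records. The (requested, used) layout
    # and the sample values are invented; BQS's real schema is not shown.

    records = [
        ("job1", 2048, 800),    # (job id, requested MB, peak used MB)
        ("job2", 2048, 1400),
        ("job3", 2048, 2000),
    ]

    def fraction_under(recs, limit_mb):
        # Fraction of jobs whose peak memory stayed under limit_mb.
        return sum(1 for _, _, used in recs if used < limit_mb) / len(recs)

    print(f"{fraction_under(records, 1024):.0%} used less than 1 GB")
    print(f"{fraction_under(records, 1536):.0%} used less than 1.5 GB")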

Slide 7: Local job monitoring
Tools to detect problematic jobs:
1. Stalled jobs: running jobs which do not consume CPU time (a detection sketch follows below)
2. "Early-ended" jobs: bunches of jobs using much less CPU time than requested
3. Mails: BQS sends a mail in case of job failure
4. Manual checks: by running various scripts
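The slides do not show how stalled jobs are detected; the usual approach is to sample each running job's consumed CPU time twice and flag jobs whose counter does not advance. A minimal sketch; get_cpu_seconds is a hypothetical hook into the batch system's accounting, not a real BQS call:

    import time

    # Minimal sketch of stalled-job detection: a running job whose CPU
    # counter barely moves between two samples is flagged for follow-up.
    # get_cpu_seconds(job) is a hypothetical accounting hook.

    def find_stalled(jobs, get_cpu_seconds, interval=600, threshold=1.0):
        before = {job: get_cpu_seconds(job) for job in jobs}
        time.sleep(interval)
        stalled = []
        for job in jobs:
            if get_cpu_seconds(job) - before[job] < threshold:
                stalled.append(job)     # < 1 CPU-second in 10 minutes
        return stalled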

Slide 8: Concrete actions taken
Diagnose job failures, e.g.:
- Lack of resources, expired proxies, pending transfers, core LCG services unavailable
- Job environment setup for a given VO
Find the Grid job identity (see the sketch below):
- LCG job IDs, BQS job IDs, Globus job IDs
Inform the users or the VO admins
Notify the administrators of the services involved in the problem:
- Mail, GGUS ticket
Various tasks for managing the production:
- Jobs can be deleted, locked in the queue, rerun, etc.
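Correlating the three job identities is what makes user follow-up possible; a hypothetical sketch of such a lookup (the IDs and field names are invented for illustration):

    # Hypothetical sketch of the identity correlation this slide relies
    # on: from a local BQS job ID back to the Grid-level IDs and the
    # submitting user's DN. All values below are invented examples.

    job_index = {
        "bqs-1234567": {
            "lcg_id":    "https://rb.example.org:9000/AbCdEf",
            "globus_id": "https://ce.example.org:2119/jobmanager-bqs/98765",
            "user_dn":   "/O=GRID-FR/C=FR/O=CNRS/CN=Some User",
        },
    }

    def identify(bqs_id):
        # Return everything known about a local job, for user/VO follow-up.
        info = job_index.get(bqs_id)
        if info is None:
            raise KeyError(f"no Grid identity recorded for {bqs_id}")
        return info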

Slide 9: Concrete actions taken (2)
Increasing a VO's quota:
- To cope with bursts of intensive computing: data challenges, MC production...
Creating and modifying BQS resources (semaphores):
- To cope with the unavailability of internal services (HPSS, dCache)
VO-specific job prioritization (a hedged sketch follows below):
- To regulate job priorities and resources automatically according to the VO's requirements
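The slides do not spell out the prioritization rule; a common fair-share scheme, sketched below as an assumption rather than BQS's documented behaviour, lowers a VO's effective priority as its recent usage approaches its quota:

    # Hedged sketch of VO-level fair-share prioritization. This formula
    # is an assumption; the slides only say priorities are regulated
    # "according to the VO requirements".

    def vo_priority(base_priority, used_cpu_hours, quota_cpu_hours):
        # Scale a VO's priority down as its recent usage nears its quota.
        if quota_cpu_hours <= 0:
            return 0.0                    # closed VO: never dispatched
        usage_ratio = used_cpu_hours / quota_cpu_hours
        return base_priority / (1.0 + usage_ratio)

    # Raising a VO's quota for a data challenge (the slide's "increasing
    # VO's quota") immediately raises its effective priority:
    print(vo_priority(100.0, used_cpu_hours=500, quota_cpu_hours=1000))  # 66.7
    print(vo_priority(100.0, used_cpu_hours=500, quota_cpu_hours=2000))  # 80.0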

Slide 10: Issues
- The real resource needs (CPU, memory) of Grid jobs are unknown
- CMS software stresses AFS, so the number of running jobs has to be limited
- Users are not well informed about the LCG service status (downtimes)
- Recurrent problems with file access or copy: remote SRM SEs unavailable, LFC not responding, transfers failing...
- Sometimes it is hard to find the user
- Sometimes users react very slowly
- It is hard to trace jobs that are not submitted through Resource Brokers

Slide 11: Issues (2)
- Zombie processes left on the worker nodes by ended jobs: solved in the current BQS version
- Lack of tools to manage priorities inside a VO (T1/T2, production/analysis): solved in a future BQS version (end of 2007)
- Memory wasted by jobs submitted to the LONG class: gLite? CREAM?

Slide 12: Advice
To get more jobs running:
- Always keep jobs queued, to reach a good ratio of running jobs
- Limit the memory request of jobs submitted to the LONG class whenever possible (direct submissions); a worked illustration follows below
- Keep us informed as early as possible about critical production periods
- Migrate to SL4 64-bit
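The memory advice is a simple packing argument: jobs are placed on nodes according to their requested memory, so over-requesting wastes slots. A worked illustration; the 8 GB node size is an assumption, not a figure from the slides:

    # Why limiting memory requests matters: the scheduler packs jobs by
    # *requested* memory, not actual use. Node size below is an assumed
    # example.

    node_mb = 8192
    print(node_mb // 2048)  # 4 jobs per node when each requests 2 GB
    print(node_mb // 1024)  # 8 jobs per node when each requests 1 GB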

Slide 13: Comments / Questions