Fabric Monitoring at the INFN Tier1
Felice Rosso, on behalf of INFN Tier1
Joint OSG & EGEE Operations WS, Culham (UK), 28-9-2005

Outline
- CNAF-INFN Tier1
- Farm Monitoring
- Local Queues Monitoring
  – Local statistics
- Storage Monitoring
- Summary

Introduction
- Location: INFN-CNAF, Bologna (Italy)
  – one of the main nodes of the GARR network
- Computing facility for the INFN HENP community
  – participating in the LCG, EGEE and INFNGRID projects
- Multi-experiment Tier1
  – LHC experiments (Alice, Atlas, CMS, LHCb)
  – CDF, BaBar
  – VIRGO, AMS, MAGIC, ARGO, Pamela...
- Resources assigned to experiments on a yearly plan

The Farm in a Nutshell
- SLC 3.0.5, LCG 2.6.0, LSF
- ~500 WNs in the LSF pool (484 GRID usable)
- CPU slots usable (HT; 1878 GRID usable, ~96%)
- Common LSF pool: 1 job per logical CPU (slot)
- Max 1 process running at the same time per job
- GRID and local submission allowed
  – GRID and non-GRID jobs can run on the same WN
  – GRID and non-GRID jobs can be submitted to the same queue
- For each VO/experiment, one or more queues
- Since the 24th of April almost jobs were executed on our LSF pool (~ GRID)

Access to Batch system
(diagram: "legacy" non-Grid access from a local UI with an LSF client and Grid access through the CE both submit to the same LSF pool of WNs (WN1...WNn); an SE provides storage access)

Farm Monitoring Goals
- Scalability to the Tier1 full size
- Many parameters for each WN/server
- Database and plots on web pages
- Data analysis
- Problems reported on web pages
- Share data with GRID tools
- RedEye: the Tier1 local monitoring tool

Farm Monitoring: What do we get?
- CPU load, status and jiffies
- IDE and SCSI I/O
- Ethernet I/O, packets, errors, overruns
- Temperatures, fan RPM
- Total number and type of active TCP connections
- Processes created, running, zombie, etc.
- RAM and swap memory
- Users logged in
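To make this list concrete, below is a minimal Python sketch (illustrative only; the actual RedEye sensor code is not shown in these slides) of how a few of these per-WN metrics can be sampled from /proc on a Linux worker node.

#!/usr/bin/env python
# Illustrative per-WN sampling sketch (not the actual RedEye sensor):
# read a few of the metrics listed above directly from /proc.
import time

def cpu_jiffies():
    # First line of /proc/stat: "cpu user nice system idle ..."
    with open("/proc/stat") as f:
        fields = f.readline().split()
    return dict(zip(("user", "nice", "system", "idle"),
                    (int(x) for x in fields[1:5])))

def meminfo():
    # RAM and swap figures (kB) from /proc/meminfo
    info = {}
    with open("/proc/meminfo") as f:
        for line in f:
            parts = line.split(":", 1)
            if len(parts) != 2:
                continue
            try:
                info[parts[0].strip()] = int(parts[1].split()[0])
            except (ValueError, IndexError):
                continue
    return dict((k, info.get(k)) for k in
                ("MemTotal", "MemFree", "SwapTotal", "SwapFree"))

def tcp_established():
    # Count established TCP connections (state "01" in /proc/net/tcp)
    with open("/proc/net/tcp") as f:
        return sum(1 for line in f.readlines()[1:] if line.split()[3] == "01")

if __name__ == "__main__":
    sample = {"time": int(time.time()),
              "cpu": cpu_jiffies(),
              "mem": meminfo(),
              "tcp_established": tcp_established()}
    print(sample)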

Local WN Monitoring

- On each WN, every 5 minutes (local crontab) the information is saved locally (2-3 TCP packets)
- One minute later a collector "gets" the information via socket
  – "gets": tidy parallel fork with timeout control
- Getting and saving the data from 500 WNs takes ~4 s (best case), 15 s worst case (timeout cut-off)
- Update the database (daily, weekly, monthly, yearly)
- Each WN --> 1 file (possibility of cumulative per-rack plots)
- Check values (temperatures too high, etc.)
- Local thumbnail cache creation (web clickable)
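The collector side ("tidy parallel fork with timeout control") can be pictured with the hedged Python sketch below; it uses a thread pool instead of forked processes for brevity, and the hostnames, port and on-wire format are invented placeholders, not the real CNAF setup.

#!/usr/bin/env python
# Illustrative collector sketch (not the production tool): fetch the locally
# cached metrics from each WN over a short TCP exchange, in parallel and with
# a hard timeout, then write one file per WN as described above.
import os
import socket
from concurrent.futures import ThreadPoolExecutor, as_completed

WN_HOSTS = ["wn%03d.example.infn.it" % i for i in range(1, 501)]  # hypothetical
PORT = 9999       # hypothetical sensor port
TIMEOUT = 15      # seconds: the worst-case cut-off quoted in the slide

def fetch(host):
    with socket.create_connection((host, PORT), timeout=TIMEOUT) as s:
        s.settimeout(TIMEOUT)
        chunks = []
        while True:
            data = s.recv(4096)
            if not data:
                break
            chunks.append(data)
    return b"".join(chunks)

def collect():
    results, failed = {}, []
    with ThreadPoolExecutor(max_workers=50) as pool:
        futures = dict((pool.submit(fetch, h), h) for h in WN_HOSTS)
        for fut in as_completed(futures):
            host = futures[fut]
            try:
                results[host] = fut.result()
            except OSError:
                failed.append(host)   # WN skipped, reported separately
    return results, failed

if __name__ == "__main__":
    ok, failed = collect()
    os.makedirs("data", exist_ok=True)
    for host, payload in ok.items():
        with open("data/%s.dat" % host, "wb") as f:   # one file per WN
            f.write(payload)
    print("collected %d WNs, %d failed" % (len(ok), len(failed)))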

Web Snapshot CPU-RAM

Web Snapshot TCP connections

Web Snapshot NIC

Fabric → GRID Monitoring
- Effort on exporting relevant fabric metrics to the Grid level, e.g.:
  – number of active WNs
  – average CPU load
  – etc.
- GridICE integration
  – configuration based on Quattor
- Avoid duplication of sensors on the farm

Local Queues Monitoring
- Every 5 minutes a snapshot of the queue status is saved on the batch manager
- A collector gets the information and updates the local database (same logic as the farm monitoring)
  – daily / weekly / monthly / yearly DB
  – DB: total and per-queue
- 3 classes of users for each queue
- Plot generator: gnuplot 4.0
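As a hedged illustration of the snapshot step (not the production collector), the sketch below records one timestamped line per queue from the output of LSF's bqueues command; the column positions for NJOBS/PEND/RUN are an assumption for LSF 6.x and should be checked against the local output, and the file path is a placeholder.

#!/usr/bin/env python
# Illustrative queue-snapshot sketch: run "bqueues", append one timestamped
# line per queue to a flat file from which daily/weekly/monthly/yearly plots
# can later be produced with gnuplot.
import subprocess
import time

SNAPSHOT_FILE = "/var/local/monitor/queues.dat"   # hypothetical path

def snapshot():
    out = subprocess.run(["bqueues"], capture_output=True, text=True,
                         check=True).stdout.splitlines()
    now = int(time.time())
    rows = []
    for line in out[1:]:                      # skip the header line
        fields = line.split()
        # Assumed columns: QUEUE_NAME ... NJOBS PEND RUN (indices 7-9)
        queue, njobs, pend, run = fields[0], fields[7], fields[8], fields[9]
        rows.append("%d %s %s %s %s" % (now, queue, njobs, pend, run))
    return rows

if __name__ == "__main__":
    with open(SNAPSHOT_FILE, "a") as f:
        f.write("\n".join(snapshot()) + "\n")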

Web Snapshot LSF Status

- UGRID: general GRID user (cms001, cms030, …)
- SGM: Software GRID Manager (cmssgm)
- OTHER: local user

RedEye - LSF Monitoring
- Almost real-time slot usage
- Fast, needs little CPU power, stable, works over the WAN
- RedEye runs as a simple user, not root
BUT...
1. All slots have the same weight
2. Jobs shorter than 5 minutes can be lost
3. No info about RAM or CPU time
AND we need something good for ALL jobs: we need to know who uses our farm and how.
Solution: offline parsing of the LSF log file (lsb.acct) once per day

Job-related metrics
From the LSF log file we get the following non-GRID info:
- LSF JobID, local UID of the job owner
- "any kind of time" (submission, WCT, etc.)
- Max RSS and virtual memory usage
- From which computer (hostname) the job was submitted (GRID CE / locally)
- Where the job was executed (WN hostname)
We complete this set with the KspecINT2K of the slot.
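A hedged sketch of the daily offline pass is shown below; it is not the production parser, the lsb.acct path is a placeholder, and the field positions inside a JOB_FINISH record vary with the LSF version, so the indices used here must be verified against the format documented for the installed release.

#!/usr/bin/env python
# Illustrative lsb.acct parsing sketch: each JOB_FINISH record is one line
# of space-separated, quoted fields; shlex handles the quoting.  Field
# positions below are placeholders to be checked against the local format.
import shlex
from collections import defaultdict

LSB_ACCT = "/usr/local/lsf/work/cluster1/logdir/lsb.acct"   # hypothetical path

# Assumed positions of a few fields (only user and queue are used below).
F_JOBID, F_USER, F_QUEUE, F_FROM_HOST = 3, 11, 12, 16

def jobs_per_user_queue(path):
    counts = defaultdict(int)
    with open(path) as f:
        for line in f:
            fields = shlex.split(line)
            if not fields or fields[0] != "JOB_FINISH":
                continue
            counts[(fields[F_USER], fields[F_QUEUE])] += 1
    return counts

if __name__ == "__main__":
    for (user, queue), njobs in sorted(jobs_per_user_queue(LSB_ACCT).items()):
        print("%-12s %-12s %6d jobs" % (user, queue, njobs))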

Queues usage report

GRID jobs come from CEs

Queues usage report

How do we use KspecINT2K?
- 1 CPU → 1 job (XEON → 4 jobs; P3/Opteron → 2 jobs)
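The slides do not spell out the formula, but a common way to use a per-slot KSpecInt2K rating is to weight the accounted CPU and wall-clock times by it; the toy Python example below illustrates the idea with made-up ratings, not the real CNAF figures.

# Hypothetical illustration of per-slot normalisation: accounted times are
# weighted by the KSpecInt2K rating of the slot the job ran on.
SLOT_KSI2K = {"xeon": 0.7, "p3": 0.4, "opteron": 1.0}   # invented ratings

def normalised_seconds(seconds, wn_type):
    # e.g. 3600 s of CPU on a slot rated 0.7 KSI2K -> 2520 KSI2K*s
    return seconds * SLOT_KSI2K[wn_type]

print(normalised_seconds(3600, "xeon"))   # 2520.0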

Fabric and GRID monitoring
- Effort on exporting relevant queue and job metrics to the Grid level
  – integration with GridICE (monitoring) and DGAS (Grid accounting)
  – Grid (VO) level view of resource usage
- Integration of local job information with Grid-related metrics, e.g.:
  – DN of the user proxy
  – VOMS extensions to the user proxy
  – Grid Job ID

GridICE screenshots

Job Check Control
- lsb.acct had a big bug!
  – randomly: CPU user time = 0.00 sec
  – the correct CPU time was available from bjobs -l
  – fixed by Platform on the 25th of July 2005
- CPU time > WCT? → possible spawned processes
- RAM memory: is the job on the right WN?
- We have a daily report
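A minimal sketch of these daily sanity checks, assuming job records with the fields extracted from lsb.acct above (the field names and the per-WN RAM table are invented for the example):

# Illustrative daily checks (field names and RAM table are assumptions):
# flag jobs whose CPU time exceeds the wall-clock time (possible spawned
# processes) or whose memory usage does not fit the WN they ran on.
WN_RAM_MB = {"wn001.example.infn.it": 2048}   # hypothetical per-WN RAM table

def check_job(job):
    problems = []
    if job["cpu_time"] > job["wall_clock_time"]:
        problems.append("CPU time > WCT: possible spawned processes")
    if job["max_rss_mb"] > WN_RAM_MB.get(job["exec_host"], float("inf")):
        problems.append("max RSS larger than the RAM of the execution WN")
    return problems

example = {"cpu_time": 7500, "wall_clock_time": 7200,
           "max_rss_mb": 1500, "exec_host": "wn001.example.infn.it"}
print(check_job(example))   # ['CPU time > WCT: possible spawned processes']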

The Storage in a Nutshell
- Different hardware (NAS, SAN, tapes)
  – 275 TB of disk, 130 TB of tape
- Different access methods (NFS / RFIO / Xrootd / gridftp)
- Disk volume filesystems: EXT3 and GPFS
- Tape access: CASTOR (50 TB of disk as staging area)
- Volume management via a PostgreSQL DB
- 80 servers to manage and export filesystems
- Nagios to check hardware and service status
  – gridftp, srm, rfio, castor, ssh
- RRD for plots (total and used volume space)
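As a hedged sketch of the RRD-based volume plotting (not the production scripts), the example below reads total/used space with statvfs and pushes the values to pre-existing RRD files through the rrdtool command line; the paths and the RRD layout (two data sources, total and used bytes) are assumptions.

#!/usr/bin/env python
# Illustrative volume-occupancy feeder: read total/used bytes for each
# monitored volume and update its RRD file, from which the web plots of
# total and used space can be drawn.
import os
import subprocess

VOLUMES = {"/storage/cms": "/var/rrd/cms.rrd",      # hypothetical mapping
           "/storage/atlas": "/var/rrd/atlas.rrd"}

def volume_usage(path):
    st = os.statvfs(path)
    total = st.f_blocks * st.f_frsize
    used = (st.f_blocks - st.f_bfree) * st.f_frsize
    return total, used

if __name__ == "__main__":
    for mount, rrd in VOLUMES.items():
        total, used = volume_usage(mount)
        # "N" means "now"; the value order must match the RRD data sources.
        subprocess.run(["rrdtool", "update", rrd, "N:%d:%d" % (total, used)],
                       check=True)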

Storage usage report

Summary
- Fabric-level monitoring is needed to ease management
- At the Grid level, more specific tools such as GridICE and DGAS are mandatory to publish VO-, job- and user-related metrics
- An effort to integrate these two levels is ongoing
  – unify data collection
  – avoid duplication