Fabric Monitor, Accounting, Storage and Reports experience at the INFN Tier1
Felice Rosso, on behalf of INFN Tier1
Workshop sul calcolo e reti INFN (INFN Workshop on Computing and Networks), Otranto

Outline
- CNAF-INFN Tier1
- FARM and GRID Monitoring
- Local Queues Monitoring
  – Local and GRID accounting
- Storage Monitoring and accounting
- Summary

Introduction
- Location: INFN-CNAF, Bologna (Italy)
  – one of the main nodes of the GARR network
- Computing facility for the INFN HENP community
  – Participating in the LCG, EGEE and INFNGRID projects
- Multi-experiment Tier1
  – LHC experiments (Alice, Atlas, CMS, LHCb)
  – CDF, BABAR
  – VIRGO, MAGIC, ARGO, Bio, TheoPhys, Pamela...
- Resources assigned to experiments on a yearly plan

The Farm in a Nutshell
- SLC 3.0.6, LCG 2.7, LSF
- ~720 WNs in the LSF pool (~1580 KSI2K)
- Common LSF pool: 1 job per logical CPU (slot)
  – MAX 1 process running at the same time per job
- GRID and local submission allowed
  – On the same WN both GRID and non-GRID jobs can run
  – To the same queue both GRID and non-GRID jobs can be submitted
- For each VO/EXP one or more queues
- Since the 24th of April jobs were executed on our LSF pool (~ GRID)
- 3 CEs (main CE: 4 Opteron dual-core, 24 GB RAM) + 1 gLite CE

Access to Batch system
[Diagram: "legacy" non-Grid access from a UI with an LSF client, and Grid access through the CE; both paths submit to LSF, which dispatches jobs to WN1...WNn; the WNs access the SE]

Farm Monitoring Goals
- Scalability to Tier1 full size
- Many parameters for each WN/server
- Database and plots on web pages
- Data analysis
- Report problems on web page(s)
- Share data with GRID tools
- RedEye: the INFN-T1 monitoring tool
- RedEye runs as a simple local user. No root!

Tier1 Fabric Monitoring: what do we get?
- CPU load, status and jiffies
- Ethernet I/O (MRTG, run by the network team)
- Temperatures, fan RPM (IPMI)
- Total and type of active TCP connections
- Processes created, running, zombie, etc.
- RAM and SWAP memory
- Users logged in
- SLC3 and SLC4 compatible
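Most of these metrics can be read directly from /proc on each WN. Below is a minimal sketch of such a probe; it is an assumption about how a RedEye-like sensor could work, not the actual RedEye code, and the selection of fields is illustrative.

```python
#!/usr/bin/env python
# Sketch of a per-WN probe reading the metrics listed above from /proc.
# Illustrative only, not the actual RedEye sensor code.

def cpu_jiffies():
    # First line of /proc/stat: "cpu user nice system idle [iowait ...]"
    with open("/proc/stat") as f:
        return [int(x) for x in f.readline().split()[1:]]

def meminfo():
    # /proc/meminfo lines look like "MemTotal:  1017424 kB"
    info = {}
    with open("/proc/meminfo") as f:
        for line in f:
            parts = line.split()
            if len(parts) >= 2:
                info[parts[0].rstrip(":")] = int(parts[1])
    return info

def tcp_states():
    # /proc/net/tcp: the 4th column is the connection state (hex code)
    states = {}
    with open("/proc/net/tcp") as f:
        next(f)  # skip header line
        for line in f:
            st = line.split()[3]
            states[st] = states.get(st, 0) + 1
    return states

if __name__ == "__main__":
    mem = meminfo()
    print(cpu_jiffies())
    print(mem.get("MemTotal"), mem.get("SwapTotal"))
    print(tcp_states())
```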

Tier1 Fabric Monitor

Local WN Monitoring
- On each WN, every 5 min (local crontab) the info is saved locally (2-3 TCP packets)
- 1 minute later a collector "gets" the info via socket
  – "gets": tidy parallel fork with timeout control
- Getting and saving locally the data from 750 WNs takes ~6 sec in the best case, 20 sec in the worst case (timeout cut-off)
- Update of the database (last day, week, month)
- For each WN --> 1 file (possibility of cumulative plots)
- Analysis of the monitoring data
- Local thumbnail cache creation (web clickable)
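A minimal sketch of the collector side, under the assumption that each WN makes its locally saved metrics available on a TCP port; the hostnames, port and thread pool below are illustrative and this is not the actual RedEye collector.

```python
#!/usr/bin/env python
# Sketch of the central collector: fetch the locally saved metrics from all WNs
# in parallel, with a per-connection timeout so slow or dead nodes cannot block
# the whole cycle. Hostnames, port and pool size are assumptions.

import socket
from concurrent.futures import ThreadPoolExecutor

PORT = 9099      # illustrative port where a tiny agent on the WN serves the metrics
TIMEOUT = 5.0    # seconds per WN; the slide quotes ~6 s best / 20 s worst overall

def fetch(host):
    try:
        s = socket.create_connection((host, PORT), timeout=TIMEOUT)
        chunks = []
        while True:
            data = s.recv(4096)
            if not data:
                break
            chunks.append(data)
        s.close()
        return host, b"".join(chunks)
    except OSError:
        return host, None   # this WN is skipped for this cycle and reported later

def collect(hosts, workers=100):
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(pool.map(fetch, hosts))

if __name__ == "__main__":
    wns = ["wn%03d.cr.cnaf.infn.it" % i for i in range(1, 751)]   # hypothetical names
    results = collect(wns)
    missing = sorted(h for h, d in results.items() if d is None)
    print("collected %d WNs, %d missing" % (len(results) - len(missing), len(missing)))
```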

Web Snapshot CPU-RAM

Web Snapshot TCP connections

Web Snapshot users logged

Analyzer.html

Fabric  GRID Monitoring Effort on exporting relevant fabric metrics to the Grid level e.g.: –# of active WNs, –# of free slots, –etc… GridICE integration –Configuration based on Quattor Avoid duplication of sensors on farm

Local Queues Monitoring
- Every 5 minutes the queue status is saved on the batch manager (snapshot); see the sketch below
- A collector gets the info and updates the local database (same logic as the farm monitoring)
  – Daily / Weekly / Monthly / Yearly DB
  – DB: total and individual queues
- 3 classes of users for each queue
- Plot generator: Gnuplot 4.0
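A minimal sketch of what the 5-minute queue snapshot could look like, assuming it is built by parsing the output of the standard LSF `bqueues -w` command; the real collector is not shown in the slides, and the column names should be verified against the installed LSF version.

```python
#!/usr/bin/env python
# Sketch of a 5-minute queue snapshot built from `bqueues -w` (standard LSF
# command). Column names are the usual ones but should be checked against the
# LSF version in use. Not the actual collector code.

import subprocess
import time

def queue_snapshot():
    out = subprocess.run(["bqueues", "-w"], capture_output=True, text=True).stdout
    lines = out.splitlines()
    idx = {name: i for i, name in enumerate(lines[0].split())}
    snap = {}
    for line in lines[1:]:
        f = line.split()
        snap[f[idx["QUEUE_NAME"]]] = {
            "njobs": int(f[idx["NJOBS"]]),
            "pend": int(f[idx["PEND"]]),
            "run": int(f[idx["RUN"]]),
        }
    return snap

if __name__ == "__main__":
    ts = int(time.time())
    for queue, v in sorted(queue_snapshot().items()):
        # one record per queue per snapshot; the collector appends these to the DB
        print(ts, queue, v["njobs"], v["pend"], v["run"])
```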

Web Snapshot LSF Status

UGRID: general GRID user (lhcb001, lhcb030…) SGM: Software GRID Manager (lhcbsgm) OTHER: local user

UGRID: general GRID user (babar001, babar030…) SGM: Software GRID Manager (babarsgm) OTHER: local user

RedEye - LSF Monitoring
- Real-time slot usage
- Fast, little CPU power needed, stable, works over WAN
- RedEye runs as a simple user, not root
- BUT…
  1. all slots have the same weight (future: Jeep solution)
  2. jobs shorter than 5 minutes can be lost
- SO: we need something good for ALL jobs; we need to know who uses our FARM and how
- Solution: offline parsing of the LSF log files once per day (Jeep integration)

Job-related metrics
From the LSF log file we get the following non-GRID info:
- LSF JobID, local UID owning the job
- "any kind of time" (submission, WCT, etc.)
- Max RSS and virtual memory usage
- From which computer (hostname) the job was submitted (GRID CE / locally)
- Where the job was executed (WN hostname)
We complete this set with KSI2K & GRID info (Jeep) and the DGAS interface.
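A minimal sketch of the daily offline parsing of lsb.acct: JOB_FINISH records are space-separated with quoted string fields, so shlex can split them. The field positions below are indicative only and must be checked against the lsb.acct format of the LSF release in use; this is not the Jeep/DGAS interface code.

```python
#!/usr/bin/env python
# Sketch of the daily offline parsing of LSF's lsb.acct. Field positions are
# indicative; verify them against the documented lsb.acct format for your LSF.

import shlex

def parse_lsb_acct(path):
    jobs = []
    with open(path) as f:
        for line in f:
            fields = shlex.split(line)
            if not fields or fields[0] != "JOB_FINISH":
                continue
            jobs.append({
                "job_id": fields[3],        # LSF JobID
                "submit_time": fields[7],   # epoch seconds (position assumed)
                "user": fields[11],         # local user owning the job
                "queue": fields[12],
                "from_host": fields[16],    # submission host (GRID CE or local UI)
            })
    return jobs

if __name__ == "__main__":
    # path is illustrative; the real log directory depends on the LSF setup
    for job in parse_lsb_acct("/lsf/work/logdir/lsb.acct"):
        print(job["job_id"], job["user"], job["queue"], job["from_host"])
```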

Queues accounting report

KSI2K [WCT] May 2006, All jobs

Queues accounting report: CPUTime [hours], May 2006, GRID jobs

How do we use KSpecInt2000 (KSI2K)?
- 1 slot → 1 job
- For each job:
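The per-job formula itself did not survive in the transcript; a plausible reconstruction, assuming the usual normalization by the KSI2K rating of the WN divided by its number of slots, is:

```latex
% Hedged reconstruction; the original slide's formula is not in the transcript.
\[
  \mathrm{KSI2K\ usage}_{\mathrm{job}}
  \;=\;
  T_{\mathrm{job}}
  \times
  \frac{\mathrm{KSI2K}_{\mathrm{WN}}}{N^{\mathrm{WN}}_{\mathrm{slots}}}
\]
% T_job: WCT (or CPU time) of the job, as used in the plots on the other slides;
% KSI2K_WN: benchmark rating of the WN the job ran on; N_slots: its job slots.
```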

KSI2K T1-INFN history

Job Check and Report
- lsb.acct had a big bug!
  – Randomly: CPU user time = 0.00 sec
  – bjobs -l reported the correct CPUtime
  – Fixed by Platform on the 25th of July 2005
- CPUtime > WCT? --> possible spawn
- RAM memory: is the job on the right WN?
- Is the WorkerNode a "black hole"?
- We have a daily report (web page); a sketch of the checks is shown below
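A minimal sketch of the kind of per-job checks behind the daily report; the job record layout and thresholds are assumptions, not the actual report code.

```python
#!/usr/bin/env python
# Sketch of the per-job checks behind the daily report; layout and thresholds
# are assumptions.

def check_job(job, wn_ram_mb):
    problems = []
    # With one slot per job, CPUtime larger than the wall-clock time suggests
    # the job spawned extra processes.
    if job["cpu_time"] > job["wct"]:
        problems.append("possible spawn: CPUtime > WCT")
    # A job whose max RSS exceeds the RAM of its WN class ran on the wrong WN.
    if job["max_rss_mb"] > wn_ram_mb:
        problems.append("RAM: job does not fit the WN it ran on")
    return problems

def black_hole_candidates(jobs, min_jobs=50, max_avg_wct=120.0):
    # A WN that runs many jobs which all end almost immediately is probably
    # failing them: flag it as a "black hole" candidate.
    per_wn = {}
    for j in jobs:
        per_wn.setdefault(j["wn"], []).append(j["wct"])
    return [wn for wn, wcts in per_wn.items()
            if len(wcts) >= min_jobs and sum(wcts) / len(wcts) < max_avg_wct]

if __name__ == "__main__":
    job = {"cpu_time": 7200.0, "wct": 3600.0, "max_rss_mb": 900.0, "wn": "wn042"}
    print(check_job(job, wn_ram_mb=2048))
```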

Fabric and GRID monitoring
- Effort on exporting relevant queue and job metrics to the Grid level
  – Integration with GridICE
  – Integration with DGAS (done!)
  – Grid (VO) level view of resource usage
- Integration of local job information with Grid-related metrics, e.g.:
  – DN of the user proxy
  – VOMS extensions to the user proxy
  – Grid Job ID

GridICE Dissemination
- GridICE server (development, with upcoming features)
- GridICE server for the EGEE Grid
- GridICE server for INFN-Grid

GridICE
- For each site, check the GRID services (RB, BDII, CE, SE…)
  – Check service --> does the PID exist?
  – Summary and/or notification
- From the GRID servers: summary of CPU and storage resources available per site and/or per VO
- Storage available on the SE per VO, from the BDII
- Downtimes
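A minimal sketch of the "does the PID exist?" check; the service names and pidfile paths are illustrative and not taken from GridICE.

```python
#!/usr/bin/env python
# Sketch of a PID-existence service check; service names and pidfile paths are
# illustrative.

import os

def pid_alive(pid):
    # Signal 0 checks for process existence without actually signalling it.
    try:
        os.kill(pid, 0)
        return True
    except ProcessLookupError:
        return False
    except PermissionError:
        return True   # process exists but is owned by another user

def check_service(pidfile):
    try:
        with open(pidfile) as f:
            return pid_alive(int(f.read().strip()))
    except (OSError, ValueError):
        return False

if __name__ == "__main__":
    services = [("bdii", "/var/run/bdii.pid"),              # illustrative paths
                ("gridftp", "/var/run/globus-gridftp.pid")]
    for name, pidfile in services:
        print(name, "UP" if check_service(pidfile) else "DOWN")
```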

GridICE as fabric monitor for "small" sites
- Based on LeMon (server and sensors)
- Parsing of LeMon flat-file logs
- Plots based on RRDtool
- Legnaro: ~70 WorkerNodes

GridICE screenshots

Jeep
- General-purpose data collector (push technology)
- DB-WNINFO: historical hardware DB (MySQL on the HLR node)
- KSI2K used by each single job (DGAS)
- Job monitoring (check RAM usage in real time, efficiency history)
- FS-INFO: enough available space on the volumes?
- AutoFS: are all dynamic mount points working?
- Matchmaking UID/GID --> VO

The Storage in a Nutshell
- Different hardware (NAS, SAN, tapes)
  – More than 300 TB of disk, 130 TB of tape
- Different access methods (NFS/RFIO/Xrootd/GridFTP)
- Volume filesystems: EXT3, XFS and GPFS
- Volumes bigger than 2 TB: RAID 50 (EXT3/XFS), direct (GPFS)
- Tape access: CASTOR (50 TB of disk as stage area)
- Volume management via a PostgreSQL DB
- 60 servers to export the filesystems to the WNs

Storage at T1-INFN
- Hierarchical Nagios servers to check service status
  – gridftp, srm, rfio, castor, ssh
- Local tool to sum the space used by the VOs (see the sketch below)
- RRD to plot (total and used volume space)
- Binary, proprietary (IBM/STEK) software to check some hardware status; very difficult to interface the proprietary software to the T1 framework
- For now: report only for bad blocks, disk failures and filesystem failures
- Plots: intranet & on demand by VO
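A minimal sketch of the local tool summing the space used by each VO and pushing the value into an RRD for plotting; the mount points and RRD file names are assumptions, while `rrdtool update` is the standard RRDtool command line.

```python
#!/usr/bin/env python
# Sketch of the per-VO space accounting feeding an RRD. Mount points and RRD
# file names are assumptions.

import os
import subprocess
import time

def used_bytes(path):
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            try:
                total += os.lstat(os.path.join(root, name)).st_size
            except OSError:
                pass   # file vanished or unreadable: skip it
    return total

def update_rrd(vo, value, rrd_dir="/var/lib/storage-rrd"):
    rrd = os.path.join(rrd_dir, "%s.rrd" % vo)
    subprocess.call(["rrdtool", "update", rrd, "%d:%d" % (int(time.time()), value)])

if __name__ == "__main__":
    for vo in ("alice", "atlas", "cms", "lhcb"):
        update_rrd(vo, used_bytes("/storage/%s" % vo))   # hypothetical mount points
```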

Tape/Storage usage report

Summary
- Fabric-level monitoring with smart reporting is needed to ease management
- T1 already has a solution for the next 2 years!
  – Not exportable due to man-power (no support)
- Future at INFN? What is the T2s' man-power?
- LeMon & Oracle? What is the T2s' man-power?
- RedEye? What is the T2s' man-power?
- Real collaboration requires more than mailing lists and phone conferences only