Tier 3 and Tier 3 monitoring. Ivan Kadochnikov, LIT JINR, 17.05.2012.

Overview: WLCG structure; Tier3; T3mon concept; monitoring tools; T3mon implementation.

WLCG structure: goals of WLCG; hierarchic approach; production and analysis; argument for Tier3.

Goals of WLCG: turn detector signal into physics. ATLAS, for example: 1 PB/s => the trigger => 200 MB/s; 15 petabytes of raw data per year. To do: store, process, analyze.

Hierarchic approach
Tier 0: the CERN computer centre; safe-keeping of the first copy of raw data; first-pass reconstruction.
Tier 1: 11 centres around the world; safe-keeping of shares of raw, reconstructed, reprocessed and simulated data; reprocessing.
Tier 2: about 140 sites; production and reconstruction of simulated events; analysis.

Production and analysis: data selection algorithms improve; calibration data change; all data gathered since LHC start-up are re-processed several times a year.

Argument for Tier3: analysis on Tier2 is inconvenient; institutions have local computing resources; local access and resources dedicated to analysis improve user response time dramatically.

Tier3: what is Tier3?; types of Tier3; compare and contrast with Tier2; current status; need for monitoring.

What is Tier 3: non-pledged computing resources; institutional computing; no formal commitment to WLCG; for analysis; not really another level of the model.

Types of Tier3 sites: Tier3 with Tier2 functionality; collocated with Tier2; national analysis facilities; non-grid Tier3s.

Compare and contrast: Tier2 vs. Tier3
Tier2 | Tier3
Different jobs | Analysis only
All VO users | Local/regional users
Strict requirements on the quality of service | No QoS information gathered globally (yet!)
Pledged resources | Non-pledged resources
A set of mandatory grid services on site | Often no grid services; may be a full or partial set
Processed data can go back in the grid | No data allowed back in the grid
VO central control | Local control

Current status: more than 300 Tier3 sites right now; a survey by ATLAS in 2010; Tier 3 sites come in many different sizes; storage methods vary; different LRMSs are used; different ways to access WLCG; many don't have monitoring; limited manpower.

Need for monitoring
Any system needs monitoring.
Some information is required on the global level: dataset popularity; usage statistics.
Grid services may not be present: existing grid monitoring systems can't be used.
Tier 3 sites have limited manpower: they need an easy way to set up monitoring.

T3mon concept: users and requirements; what to monitor; structure; local monitoring; global monitoring.

Users and requirements
Local administrators: detailed fabric monitoring; local resource management systems (LRMS); mass storage systems (MSS).
VO managers: general usage statistics and quality of service.
Global services: dataset popularity.

What to monitor
Local resource management systems (LRMS): Proof, PBS, Condor, Oracle Grid Engine.
Mass storage systems (MSS): XRootD, Lustre.

Components
Local monitoring: detailed fabric monitoring; gather data used by the global system; present detailed data to local administrators.
Global monitoring: aggregate metrics from local monitoring; give necessary information to central services; present data via Dashboard.

Local monitoring system [diagram; components shown: Condor, Lustre, OGE, PBS, XRootD, Proof, local DB, publishing agent, MSG]

Global monitoring system [diagram; components shown: four local monitoring systems (labelled P), MSG, Dashboard and data management (labelled C)]

Tools: Ganglia (data flow, plug-in system); Dashboard; MSG (ActiveMQ).

Ganglia: a distributed monitoring system for clusters and Grids. [diagram; components shown: Condor, Lustre, OGE, PBS, XRootD, Proof, local DB, publishing agent, MSG]

Why Ganglia? Easy to set up fabric monitoring; popular choice among Tier 3 sites; extension modules for LRMS and MSS monitoring.

Ganglia data flow [diagram; labels shown: node, head node, gmond, gmetad, rrdtool, web frontend, XDR via UDP, XML by request, numeric metrics, string metrics]
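The "XML by request" step in the data flow can be illustrated with a short sketch. It parses a hand-written sample shaped like the XML report that gmond and gmetad serve; the cluster name, host name and metric values are made up for illustration, and the function name parse_metrics is hypothetical rather than part of T3mon.

```python
import xml.etree.ElementTree as ET

# Hand-written sample in the shape of a Ganglia XML report.
# Cluster, host and values are invented for this example.
SAMPLE_XML = """<GANGLIA_XML VERSION="3.1.7" SOURCE="gmond">
  <CLUSTER NAME="t3-test" LOCALTIME="1337241600" OWNER="unspecified">
    <HOST NAME="wn01.example.org" IP="10.0.0.11" REPORTED="1337241590">
      <METRIC NAME="load_one" VAL="0.42" TYPE="float" UNITS=" " SLOPE="both"/>
      <METRIC NAME="os_name" VAL="Linux" TYPE="string" UNITS="" SLOPE="zero"/>
    </HOST>
  </CLUSTER>
</GANGLIA_XML>"""

# Ganglia metric types that carry numbers; everything else is kept as text.
NUMERIC_TYPES = {
    "int8", "uint8", "int16", "uint16", "int32", "uint32", "float", "double",
}


def parse_metrics(xml_text):
    """Return {host: {metric: value}}, converting numeric metrics to float."""
    result = {}
    root = ET.fromstring(xml_text)
    for host in root.iter("HOST"):
        metrics = {}
        for metric in host.iter("METRIC"):
            value = metric.get("VAL")
            if metric.get("TYPE") in NUMERIC_TYPES:
                value = float(value)
            metrics[metric.get("NAME")] = value
        result[host.get("NAME")] = metrics
    return result


if __name__ == "__main__":
    for host, metrics in parse_metrics(SAMPLE_XML).items():
        print(host, metrics)
```

The split between numeric and string values mirrors the two metric kinds named on the slide: numeric metrics are the ones suited to time-series storage, while string metrics are only displayed.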

Ganglia web interface

Adding custom metrics [diagram; labels shown: gmond, module, callback, monitored subsystem (twice), custom monitoring daemon, gmetric, ganglia]
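Both routes named on this slide can be sketched in a few lines of Python. The first half follows the usual layout of a gmond Python metric module (metric_init, one callback per metric, metric_cleanup); the second half only builds a gmetric command line for a separate daemon to run. The metric name, the group name and the read_running_jobs placeholder are assumptions made for illustration and are not taken from T3mon.

```python
# Route 1: a gmond Python module. gmond imports the file, calls
# metric_init() once, then calls each metric's callback periodically.

def read_running_jobs():
    """Placeholder for a real query to the monitored subsystem."""
    return 3


def running_jobs_handler(name):
    """Callback invoked by gmond; 'name' is the metric name."""
    return read_running_jobs()


def metric_init(params):
    """Describe the metrics this module provides."""
    descriptor = {
        "name": "t3_running_jobs",          # hypothetical metric name
        "call_back": running_jobs_handler,
        "time_max": 90,
        "value_type": "uint",
        "units": "jobs",
        "slope": "both",
        "format": "%u",
        "description": "Running jobs (illustrative)",
        "groups": "t3mon_example",          # hypothetical group name
    }
    return [descriptor]


def metric_cleanup():
    """Called by gmond on shutdown; nothing to release here."""
    pass


# Route 2: a custom daemon that reports through the gmetric tool.

def gmetric_command(name, value, value_type, units):
    """Build the argument list for one gmetric call (not executed here)."""
    return [
        "gmetric",
        "--name", name,
        "--value", str(value),
        "--type", value_type,
        "--units", units,
    ]


if __name__ == "__main__":
    for d in metric_init({}):
        print(d["name"], d["call_back"](d["name"]))
    print(" ".join(gmetric_command("t3_running_jobs", 3, "uint32", "jobs")))
```

The module route keeps collection inside gmond's own schedule, while the gmetric route suits subsystems that already need a long-running daemon of their own.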

Dashboard: “The Experiment Dashboard's main goal is to collect and expose to users relevant information about the services and applications running on the grid environment.” [diagram; components shown: Tier3 software, other applications, messaging system, feeders, collectors, Data Access Layer (DAO), web application]

MSG: WLCG Messaging System for Grids. “Aims to help the integration and consolidation of the various grid monitoring systems used in WLCG.” Based on the ActiveMQ open-source message broker.
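To make the messaging step concrete, here is a minimal sketch of what one published message could look like on the wire. It assumes the STOMP text protocol, which ActiveMQ supports among others; the slides do not say which protocol or message schema T3mon uses, so the destination name and the payload fields below are hypothetical.

```python
import json


def build_stomp_send(destination, payload):
    """Encode a payload as a single STOMP SEND frame (bytes).

    A STOMP frame is a command line, header lines, a blank line,
    the body, and a terminating NUL byte.
    """
    body = json.dumps(payload, sort_keys=True).encode("utf-8")
    headers = [
        "destination:" + destination,
        "content-type:application/json",
        "content-length:%d" % len(body),
    ]
    head = "SEND\n" + "\n".join(headers) + "\n\n"
    return head.encode("utf-8") + body + b"\x00"


if __name__ == "__main__":
    # Hypothetical topic and fields, for illustration only.
    frame = build_stomp_send(
        "/topic/t3mon.example",
        {"site": "T3_EXAMPLE", "metric": "running_jobs", "value": 3},
    )
    print(frame)
```

The sketch only builds the frame; connecting to a broker, authentication and acknowledgements are left out on purpose.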

T3Mon implementation: project structure; subsystem modules (Proof, PBS, Condor, Lustre and XRootD monitoring modules); testing infrastructure.

Project structure: Python; SVN provided by CERN; RPM repository with a separate package for each monitoring module; each module handles one software system to be monitored on Tier3; one configuration file for all modules.
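The "one configuration file for all modules" idea could look roughly like the sketch below, which uses one INI section per module and Python's standard configparser. The section names, option names, broker address and log path are all invented for illustration; the actual T3mon file layout is not given in the slides.

```python
import configparser

# Hypothetical shared configuration: one [general] section plus one
# section per monitoring module. None of these names come from T3mon.
SAMPLE_CONFIG = """
[general]
site_name = T3_EXAMPLE
msg_broker = broker.example.org:61613

[pbs]
enabled = yes
log_dir = /var/spool/pbs/server_priv/accounting

[lustre]
enabled = no
"""


def enabled_modules(config_text):
    """Return the sorted names of module sections with enabled = yes."""
    parser = configparser.ConfigParser()
    parser.read_string(config_text)
    return sorted(
        section
        for section in parser.sections()
        if section != "general"
        and parser.getboolean(section, "enabled", fallback=False)
    )


if __name__ == "__main__":
    print(enabled_modules(SAMPLE_CONFIG))
```

With a layout like this, each separately packaged module would read only the general section and its own section, so installing or removing a module does not require touching the others' settings.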

Proof [diagram; components shown: Proof, Proof plug-in, database, gmond, Ganglia, MSG]

PBS [diagram; components shown: PBS, log files, PBS plug-in, gmond, Ganglia, MSG]
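The slide shows the PBS plug-in reading log files. As a sketch of what that can involve, the code below parses lines in the semicolon-separated layout of PBS/Torque accounting records (timestamp; record type; job id; key=value pairs) and sums walltime per user for finished jobs. Which logs and fields the real plug-in reads is not stated, so the sample lines and the function names are assumptions.

```python
# Sample records in the PBS/Torque accounting layout; users, job ids
# and times are made up. "S" marks a job start, "E" a job end.
SAMPLE_LOG = """\
05/17/2012 10:00:00;S;101.head.example.org;user=alice group=atlas queue=batch
05/17/2012 10:30:00;E;101.head.example.org;user=alice group=atlas queue=batch Exit_status=0 resources_used.walltime=00:30:00
05/17/2012 11:00:00;E;102.head.example.org;user=bob group=atlas queue=batch Exit_status=0 resources_used.walltime=01:00:10
"""


def parse_accounting_line(line):
    """Split one accounting record into its four parts and key=value fields."""
    timestamp, record_type, job_id, message = line.rstrip("\n").split(";", 3)
    fields = {}
    for token in message.split():
        if "=" in token:
            key, value = token.split("=", 1)
            fields[key] = value
    return {
        "timestamp": timestamp,
        "type": record_type,
        "job_id": job_id,
        "fields": fields,
    }


def walltime_seconds(hms):
    """Convert an HH:MM:SS string to seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def summarize(lines):
    """Count finished jobs (type E) and sum their walltime per user."""
    summary = {}
    for line in lines:
        if not line.strip():
            continue
        record = parse_accounting_line(line)
        if record["type"] != "E":
            continue
        user = record["fields"].get("user", "unknown")
        entry = summary.setdefault(user, {"jobs": 0, "walltime_s": 0})
        entry["jobs"] += 1
        entry["walltime_s"] += walltime_seconds(
            record["fields"].get("resources_used.walltime", "0:0:0")
        )
    return summary


if __name__ == "__main__":
    print(summarize(SAMPLE_LOG.splitlines()))
```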

Condor [diagram; components shown: Condor, condor_master, condor_startd, condor_quill, database, Condor plug-in, gmond, Ganglia, MSG]

Lustre [diagram; components shown: Lustre, /proc/fs/lustre, Lustre plug-in, gmond, Ganglia]
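Reading counters from files under /proc/fs/lustre might look like the sketch below. It parses text in the "name count samples [unit] min max sum" layout commonly seen in Lustre stats files; the sample text is hand-written, and which files and counters the real plug-in reads is an assumption not covered by the slides.

```python
# Hand-written sample in the layout of a Lustre "stats" file.
# All numbers are invented for illustration.
SAMPLE_STATS = """\
snapshot_time             1337241600.123456 secs.usecs
read_bytes                10 samples [bytes] 4096 1048576 5242880
write_bytes               4 samples [bytes] 4096 4096 16384
open                      7 samples [reqs]
"""


def parse_lustre_stats(text):
    """Return {counter: {"samples", "unit", and min/max/sum when present}}."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        # Counter lines look like: name N samples [unit] [min max sum]
        if len(parts) < 4 or parts[2] != "samples":
            continue
        entry = {"samples": int(parts[1]), "unit": parts[3].strip("[]")}
        if len(parts) >= 7:
            entry["min"], entry["max"], entry["sum"] = (
                int(value) for value in parts[4:7]
            )
        stats[parts[0]] = entry
    return stats


if __name__ == "__main__":
    for name, entry in parse_lustre_stats(SAMPLE_STATS).items():
        print(name, entry)
```

Because these counters are cumulative, a monitoring module would typically report the difference between two successive reads rather than the raw sum.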

XRootD [diagram; components shown: xrootd and cmsd (each shown twice), mpxstats, summary_to_ganglia.py, gmetric, gmond, xrdsummond, xrootd.py, xrddetmond, database, MSG, Ganglia]

Testing infrastructure
Goals: document installing Ganglia on a cluster; document configuring Tier3 subsystems for monitoring; test modules in a minimal cluster environment.
Clusters:
PBS: 3 nodes (1 head node, 2 worker nodes)
Proof: 3 nodes (1 head node, 2 worker nodes)
Condor: 3 nodes (1 head node, 1 worker node, 1 client)
OGE: 3 nodes (1 head node, 2 worker nodes)
Lustre: 3 nodes (1 MDS, 1 OSS, 1 client)
Xrootd: 3 nodes (1 manager, 2 servers)
Xrootd II: 3 nodes (1 manager, 2 servers)
Development machine
Installation testing machine

Virtual testing infrastructure: 23 nodes total; only 2 physical servers running virtualization software (OpenVZ and Xen); fast deployment and reconfiguring of nodes as required; performance is not a deciding factor.

Results and plans: the project is nearing completion; most modules are done; the Proof and XRootD modules are already being tested on real clusters. Next steps: message consumers; OGE; testing and support; data transfer monitoring project.

Thank you!