Mario Reale – IGI / GARR Lyon, Sept 19, 2011

Slides:

Advertisements

Similar presentations

TeraGrid Deployment Test of Grid Software JP Navarro TeraGrid Software Integration University of Chicago OGF 21 October 19, 2007.

Advertisements

30-31 Jan 2003J G Jensen, RAL/WP5 Storage Elephant Grid Access to Mass Storage.

Current methods for negotiating firewalls for the Condor ® system Bruce Beckles (University of Cambridge Computing Service) Se-Chang Son (University of.

EGI-InSPIRE RI EGI-InSPIRE EGI-InSPIRE RI Wrap up on perfSONAR-Lite_TSS and Network Troubleshooting Mario Reale GARR.

1 CHEP 2000, Roberto Barbera Roberto Barbera (*) Grid monitoring with NAGIOS WP3-INFN Meeting, Naples, (*) Work in collaboration with.

Grid and CDB Janusz Martyniak, Imperial College London MICE CM37 Analysis, Software and Reconstruction.

Computer Monitoring System for EE Faculty By Yaroslav Ross And Denis Zakrevsky Supervisor: Viktor Kulikov.

16: Distributed Systems1 DISTRIBUTED SYSTEM STRUCTURES NETWORK OPERATING SYSTEMS The users are aware of the physical structure of the network. Each site.

EGI-InSPIRE RI EGI-InSPIRE EGI-InSPIRE RI Etienne Dublé - CNRS/UREC Alfredo Pagano – GARR NetJobs: Network Monitoring.

Makrand Siddhabhatti Tata Institute of Fundamental Research Mumbai 17 Aug

DIRAC Web User Interface A.Casajus (Universitat de Barcelona) M.Sapunov (CPPM Marseille) On behalf of the LHCb DIRAC Team.

Data Management Kelly Clynes Caitlin Minteer. Agenda Globus Toolkit Basic Data Management Systems Overview of Data Management Data Movement Grid FTP Reliable.

Honeypot and Intrusion Detection System

Chapter 34 Java Technology for Active Web Documents methods used to provide continuous Web updates to browser – Server push – Active documents.

Csi315csi315 Client/Server Models. Client/Server Environment LAN or WAN Server Data Berson, Fig 1.4, p.8 clients network.

EGEE-II INFSO-RI Enabling Grids for E-sciencE EGEE and gLite are registered trademarks gLite IPv6 compliance project tests Further.

Introduction to dCache Zhenping (Jane) Liu ATLAS Computing Facility, Physics Department Brookhaven National Lab 09/12 – 09/13, 2005 USATLAS Tier-1 & Tier-2.

First attempt for validating/testing Testbed 1 Globus and middleware services WP6 Meeting, December 2001 Flavia Donno, Marco Serra for IT and WPs.

CEOS WGISS-21 CNES GRID related R&D activities Anne JEAN-ANTOINE PICCOLO CEOS WGISS-21 – Budapest – 2006, 8-12 May.

What is SAM-Grid? Job Handling Data Handling Monitoring and Information.

Experiment Management System CSE 423 Aaron Kloc Jordan Harstad Robert Sorensen Robert Trevino Nicolas Tjioe Status Report Presentation Industry Mentor:

1 Andrea Sciabà CERN Critical Services and Monitoring - CMS Andrea Sciabà WLCG Service Reliability Workshop 26 – 30 November, 2007.

EGEE-III INFSO-RI Enabling Grids for E-sciencE EGEE and gLite are registered trademarks Etienne Dublé - CNRS/UREC EGEE SA2 Xavier.

EGEE-II INFSO-RI Enabling Grids for E-sciencE EGEE Site Architecture Resource Center Deployment Considerations MIMOS EGEE Tutorial.

EMI INFSO-RI Argus Policies in Action Valery Tschopp (SWITCH) on behalf of the Argus PT.

SAM Sensors & Tests Judit Novak CERN IT/GD SAM Review I. 21. May 2007, CERN.

Grid Interoperability Update on GridFTP tests Gregor von Laszewski

Testing and integrating the WLCG/EGEE middleware in the LHC computing Simone Campana, Alessandro Di Girolamo, Elisa Lanciotti, Nicolò Magini, Patricia.

INFSO-RI Enabling Grids for E-sciencE EGEE is a project funded by the European Union under contract IST Job sandboxes.

INFSO-RI Enabling Grids for E-sciencE /10/20054th EGEE Conference - Pisa1 gLite Configuration and Deployment Models JRA1 Integration.

EGI-InSPIRE RI EGI-InSPIRE EGI-InSPIRE RI How to integrate portals with the EGI monitoring system Dusan Vudragovic.

EGEE-II INFSO-RI Enabling Grids for E-sciencE EGEE and gLite are registered trademarks The LCG interface Stefano BAGNASCO INFN Torino.

Status of Globus activities Massimo Sgaravatto INFN Padova for the INFN Globus group

Accounting in DataGrid HLR software demo Andrea Guarise Milano, September 11, 2001.

BNL dCache Status and Plan CHEP07: September 2-7, 2007 Zhenping (Jane) Liu for the BNL RACF Storage Group.

The Institute of High Energy of Physics, Chinese Academy of Sciences Sharing LCG files across different platforms Cheng Yaodong, Wang Lu, Liu Aigui, Chen.

EGI-InSPIRE RI EGI-InSPIRE EGI-InSPIRE RI Mario Reale – GARR NetJobs: Network Monitoring Using Grid Jobs.

Tutorial on Science Gateways, Roma, Catania Science Gateway Framework Motivations, architecture, features Riccardo Rotondo.

INFSO-RI Enabling Grids for E-sciencE File Transfer Software and Service SC3 Gavin McCance – JRA1 Data Management Cluster Service.

Consorzio COMETA - Progetto PI2S2 UNIONE EUROPEA Grid2Win : gLite for Microsoft Windows Elisa Ingrà - INFN.

A System for Monitoring and Management of Computational Grids Warren Smith Computer Sciences Corporation NASA Ames Research Center.

G. Russo, D. Del Prete, S. Pardi Kick Off Meeting - Isola d'Elba, 2011 May 29th–June 01th A proposal for distributed computing monitoring for SuperB G.

EGEE-II INFSO-RI Enabling Grids for E-sciencE EGEE and gLite are registered trademarks CYFRONET site report Marcin Radecki CYFRONET.

EGI-InSPIRE RI EGI-InSPIRE EGI-InSPIRE RI EGI Services for Distributed e-Infrastructure Access Tiziana Ferrari on behalf.

The EPIKH Project (Exchange Programme to advance e-Infrastructure Know-How) gLite Grid Introduction Salma Saber Electronic.

EGI-InSPIRE EGI-InSPIRE RI Network Troubleshooting and PerfSONAR-Lite_TSS Mario Reale GARR.

EGI-InSPIRE RI EGI-InSPIRE EGI-InSPIRE RI EGI IPv6 Report for HEPiX CERN October 5, 2012 CERN 1

CH-1211 Genève 23 Job efficiencies at CERN Review of job efficiencies at CERN status report James Casey, Daniel Rodrigues, Ulrich Schwickerath.

Progress Apama Fundamentals

Jean-Philippe Baud, IT-GD, CERN November 2007

Gri2Win: Porting gLite to run under Windows XP Platform

Introduction to Distributed Platforms

High Availability Linux (HA Linux)

Use of Nagios in Central European ROC

Design rationale and status of the org.glite.overlay component

ALICE FAIR Meeting KVI, 2010 Kilian Schwarz GSI.

Global Banning List and Authorization Service

ATLAS support in LCG.

Network Monitoring Using Grid Jobs EGEE SA2

Short update on the latest gLite status

Network Requirements Javier Orellana

Gri2Win: Porting gLite to run under Windows XP Platform

Interoperability & Standards

SCL, Institute of Physics Belgrade, Serbia

Lecture 1: Multi-tier Architecture Overview

Pierre Girard ATLAS Visit

Grid Coordination by Using the Grid Coordination Protocol

Job Application Monitoring (JAM)

Site availability Dec. 19 th 2006

Presentation transcript:

Mario Reale – IGI / GARR mario.reale@garr.it Lyon, Sept 19, 2011 Network Support Operations Workshop @ EGI TF

Outline Basic idea Architecture GUI Potentials

“Instead of installing a probe at each site, run a grid job” basic idea: “Instead of installing a probe at each site, run a grid job”

Features of NetJobs Lightweight deployment High scalability Security Reliability Cost-effectiveness

Pros No installation/deployment needed in the sites Monitoring 10 or 300 sites is just a matter of configuration Running on a proven architecture (the grid) Possibility to use grid services (ex: AuthN and AuthZ)

Cons Some low-level metrics can’t be implemented in the job Because we have no control of the “Worker Node” environment (hardware, software) where the job is running Some sites will have to slightly update their middleware configuration The maximum lifetime of jobs should be increased if it is too low (at least for the DN of the certificate that the system uses)

System architecture: Global view

System Architecture the components www request Front-end DB 1 Monitoring server Grid network monitoring jobs DB 2 Monitoring server Monitoring server DB ROC1 Monitoring server @ ROC1 – Server A Monitoring server @ ROC1 – Server B Possible new configuration Frontend: Apache Tomcat, Ajax, Google Web Toolkit (GWT) Monitoring server & Jobs: Python, bash script (portability is a major aspect for jobs) Database: PostgreSQL

Choice of network paths The system is completely configurable about the e2e paths and the scheduling of measurements The admin specifies a list of scheduled tests, giving for each one The source and the remote site The type of test The time and frequency of the test Users can contact and request the administrator to have a given path monitored (form available on the UI) This request is then validated by the administrator. If you still have many paths, you can start several server instances (in order to achieve the required performance)

Example of scheduling Latency test Hop count MTU size TCP RTT Every 10 minutes Hop count Iterative connect() test MTU size Socket (IP_MTU socket option) Achievable Bandwidth TCP throughput transfer via GridFTP transfer between 2 Storage Elements Every 8h In order to avoid too many connections these three measurements are done in the same test

System architecture: The Server, the Jobs, and the Grid

Technical constraints When running a job, the grid user is mapped to a Linux user of the Worker Node (WN): This means the job is not running as root on the WN Some low level operations are not possible (for example opening an ICMP listening socket is not allowed) Heterogeneity of the WN environments (various OS, 32/64 bits…) Ex: making the job download and run an external tool may be tricky (except if it is written in an OS independent programming language) The system has to deal with the grid mechanism overhead (delays, job lifetime limit…)

Initialization of grid jobs Site paris-urec-ipv6 Site X Ready! UI WMS Central monitoring server program (CMSP) Site A Site B Site C CE CE CE Request: RTT test to site A Request: BW test to site B WN WN WN Job Job Job Job submission Socket connection Probe Request 13

Remarks Chosen design (1 job <-> many probes) is much more efficient than starting a job for each probe Considering (grid-related) delays Considering the handling of middleware failures (nearly 100% of failures occur at job submission, not once the job is running) TCP connection is initiated by the job No open port needed on the WN  better for security of sites An authentication mechanism is implemented between the job and the server A job cannot last forever (GlueCEPolicyMaxWallClockTime), so actually there are 2 jobs running at each site A ‘main’ one, and A ‘redundant’ one which is waiting and will become ‘main’ when the other one ends

Request: RTT test to site C RTT, MTU and hop count Site paris-urec-ipv6 UI Central monitoring server program (CMSP) Site B Site C CE Request: RTT test to site C WN Job Probe Request Socket connection Probe Result

Just sending => no network delay RTT, MTU and hop test The ‘RTT’ measure is the time a TCP ‘connect()’ call takes: Because a connect() call involves a round-trip of packets: SYN SYN-ACQ ACQ Results very similar to the ones of ‘ping’ The MTU is given by the IP_MTU socket option The number of hops is calculated in an iterative way These measures require: To connect to an accessible port (1) on a machine of the remote site To close the connection (no data is sent) Note: This (connect/disconnect) is detected in the application log (1): We use the port of the gatekeeper of the CE since it is known to be accessible (it is used by the grid middleware gLite) Round trip Just sending => no network delay

Request: GridFTP BW test to site C Active GridFTP BW Test Site paris-urec-ipv6 UI Central monitoring server program (CMSP) Replication of a large grid file Site A Site C SE SE Read the gridFTP log file Request: GridFTP BW test to site C WN Job Probe Request Socket connection Probe Result

GridFTP BW test If the GridFTP log file is not accessible (cf. dCache? ) In this case we just do the transfer via globus-url-copy in a verbose mode in order to get the transfer rate. A passive version of this BW test could be envisageable The job just reads the gridftp log file periodically (the system does not request additional transfers) This is only possible if the log file is available on the Storage Element (i.e. it is a DPM)

User Interface

Changes/Progress since January Set up server/DB and GUI at GARR at http://netjobs.dir.garr.it No new sites enrolled Some minor code update to cope with minor changes in gLite Updated GUI (GWT ext 2.0.0 to 2.0.2) to fix issues with latest Firefox version

Potentials & Future Highly scalable tool Robust w.r.t. the Grid worker nodes environment Easy to configure Currently focusing on low level network metrics Scheduled bandwidth measurements can be integrated easily Any other network measurement (unpriv.user on EGI Worker Nodes) can be integrated Anyone (site) wishing to be included in the current system server just tell us We can provide support for setting up new servers & GUIs Documentation (Install & Admin Guide) will be provided - pending potential interested users

References / Contacts http://netjobs.dir.garr.it/ Current Instance: Wiki: https://twiki.cern.ch/twiki/bin/view/EGI/GridNetworkMonitoring Contacts: etienne.duble@urec.cnrs.fr mario.reale@garr.it Acknowledgements: Stefano Gargiulo / GARR Roma Etienne Duble / LIG IMAG Grenoble