NetJobs: Network Monitoring Using Grid Jobs. Etienne Dublé - CNRS/UREC, Alfredo Pagano - GARR. EGI-InSPIRE RI-261323, www.egi.eu

Presentation transcript:

NetJobs: Network Monitoring Using Grid Jobs
Etienne Dublé - CNRS/UREC
Alfredo Pagano - GARR

Content
Network Monitoring…
  – In the context of grids
  – In the context of EGI
The idea
System architecture
  – Global view
  – The Server, the Jobs and the Grid
  – User Interface
Next steps

Network Monitoring…
  – In the context of grids
  – In the context of EGI

Network Monitoring for Grids
Grids are big users and they will exercise the network
  – The LHC, generating ~15 petabytes of raw data per year, is certainly a big user
Grid middleware can benefit from monitoring:
  – Example: network-aware job and data transfer scheduling
When a problem occurs, a grid operator or user would like to check quickly whether the network is involved:
  => This is especially important for grids because in such a complex environment the network is one of many layers

Previous related efforts
e2emonit (pingER, UDPmon, IPERF)
NPM (Network Performance Monitor)
  – PCP (Probe Control Protocol)
Diagnostic Tool
PerfSONAR_Lite-TSS
PerfSONAR-MDM

The EGI context
The EGEE/EGI project did not recommend any specific solution for network monitoring
  – A part of the grid is already monitored (LHCOPN, specific national initiatives, …), and there are plans to monitor more links
    => Monitor all Tier-1 to Tier-2 links using PerfSONAR?
PerfSONAR Lite TSS is dedicated to troubleshooting
  => In this project we are trying to address the needs which are not yet addressed

Characteristics of the tool
Our approach had to take into account:
  – High scalability
  – Security
  – Reliability
  – Cost-effectiveness
And preferably:
  – A lightweight deployment

The idea: “Instead of installing a probe at each site, run a grid job”

Pros and cons
Added value:
  – No installation/deployment needed in the sites
    => Monitoring 10 or 300 sites is just a matter of configuration
  – A monitoring system running on a proven architecture (the grid)
  – Possibility to use grid services (e.g. AuthN and AuthZ)

Pros and cons
Limits:
  – Some low-level metrics can't be implemented in the job
    => Because we have no control over the “Worker Node” environment (hardware, software) where the job is running
  – Some sites will have to slightly update their middleware configuration
    => The maximum lifetime of jobs should be increased if it is too low (at least for the DN of the certificate that the system uses)

System architecture: Global view

System architecture: the components
Frontend: Apache Tomcat, Ajax, Google Web Toolkit (GWT)
Monitoring server & Jobs: Python, bash script (portability is a major aspect for jobs)
Database: PostgreSQL
[Diagram labels: www request; Front-end; DB 1, DB 2, DB ROC1; Monitoring server; Grid network monitoring jobs; Monitoring ROC1 - Server A; Monitoring ROC1 - Server B; Possible new configuration]

Current prototype: 8 sites

Choice of network paths
Monitoring all possible site-to-site paths would be too much: N x (N-1) paths, with N ~ 300 sites for whole-grid coverage
We must restrict the number of these paths
  – To a specific VO, to an experiment, to the most used paths, etc.
  – We have studied this at
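
To make the N x (N-1) figure concrete, the count of directed site-to-site paths can be worked out directly. This is only an illustrative sketch; the function name is hypothetical and not part of NetJobs.

```python
def directed_paths(n_sites: int) -> int:
    """Number of ordered (source, destination) pairs among n_sites sites."""
    return n_sites * (n_sites - 1)


# Compare the 8-site prototype with whole-grid coverage (N ~ 300).
for n in (8, 300):
    print(n, "sites ->", directed_paths(n), "paths")
# 8 sites -> 56 paths
# 300 sites -> 89700 paths
```

At roughly 90,000 paths for the whole grid, restricting the set of monitored paths is what keeps the measurement load manageable.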

Choice of network paths
The paths and the scheduling of measurements are fully configurable
  – The admin specifies a list of scheduled tests, giving for each one
    » The source and the remote site
    » The type of test
    » The time and frequency of the test
  – Users can ask the administrator to have a given path monitored (form available on the UI)
    » This request is then validated by the administrator
If there are still many paths, several server instances can be started (in order to achieve the needed performance)

Example of scheduling
Latency test
  – TCP RTT
  – Every 10 minutes
Hop count
  – Iterative connect() test
  – Every 10 minutes
MTU size
  – Socket (IP_MTU socket option)
  – Every 10 minutes
Achievable bandwidth
  – TCP throughput measured via a GridFTP transfer between 2 Storage Elements
  – Every 8h
In order to avoid too many connections, the first three measurements (latency, hop count, MTU size) are done in the same test
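
A scheduled test as described in these slides (source site, remote site, type, time and frequency) could be represented roughly as follows. This is a sketch under assumptions: the class, field names, site names and test-type labels are all hypothetical, and the real NetJobs configuration format is not shown in the slides.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class ScheduledTest:
    source: str    # site whose job runs the probe
    remote: str    # site being probed
    kind: str      # hypothetical labels: "rtt_mtu_hops" or "gridftp_bw"
    start_s: int   # first run, in seconds since the schedule origin
    period_s: int  # interval between two runs, in seconds

    def is_due(self, now_s: int) -> bool:
        """True when now_s falls exactly on one of the scheduled run times."""
        if now_s < self.start_s:
            return False
        return (now_s - self.start_s) % self.period_s == 0


# Intervals taken from the slide: 10 minutes for the combined
# latency / hop count / MTU test, 8 hours for the bandwidth test.
SCHEDULE = [
    ScheduledTest("site-A", "site-C", "rtt_mtu_hops", 0, 10 * 60),
    ScheduledTest("site-A", "site-C", "gridftp_bw", 0, 8 * 3600),
]


def due_tests(now_s: int) -> List[ScheduledTest]:
    """Return the tests of SCHEDULE that should be started at now_s."""
    return [t for t in SCHEDULE if t.is_due(now_s)]
```

Grouping the three lightweight measurements under one test type mirrors the slide's point that they share a single connection.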

System architecture: The Server, the Jobs, and the Grid

Technical constraints
When running a job, the grid user is mapped to a Linux user of the Worker Node (WN):
  – This means the job is not running as root on the WN
    => Some low-level operations are not possible (for example, opening an ICMP listening socket is not allowed)
Heterogeneity of the WN environments (various OS, 32/64 bits…)
  – E.g. making the job download and run an external tool may be tricky (except if it is written in an OS-independent programming language)
The system has to deal with the grid mechanism overhead (delays, job lifetime limit…)
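
The root-privilege constraint can be checked from inside a job with a few lines of Python. This is a minimal sketch, not code from NetJobs; the function name is hypothetical.

```python
import socket


def can_open_icmp_socket() -> bool:
    """Return True if this process is allowed to open a raw ICMP socket."""
    try:
        s = socket.socket(socket.AF_INET, socket.SOCK_RAW, socket.IPPROTO_ICMP)
    except OSError:
        # Typically PermissionError for an unprivileged user such as a
        # grid job mapped to a normal Linux account.
        return False
    s.close()
    return True


if __name__ == "__main__":
    print("raw ICMP allowed:", can_open_icmp_socket())
```

When this returns False, ping-style probes are unavailable, which is why the slides fall back on TCP connect() timing.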

Initialization of grid jobs
[Diagram: the central monitoring server program (CMSP), on the UI of site paris-urec-ipv6, submits jobs through the WMS of site X to the CEs of sites A, B and C. Each job starts on a WN, opens a socket connection back to the CMSP and reports "Ready!". The CMSP then sends probe requests, e.g. "RTT test to site A", "BW test to site B".]

Remarks
The chosen design (1 job, many probes) is much more efficient than starting a job for each probe
  – Considering (grid-related) delays
  – Considering the handling of middleware failures (nearly 100% of failures occur at job submission, not once the job is running)
The TCP connection is initiated by the job
  => No open port needed on the WN
  => Better for the security of sites
An authentication mechanism is implemented between the job and the server
A job cannot last forever (GlueCEPolicyMaxWallClockTime), so there are actually 2 jobs running at each site
  – A ‘main’ one, and
  – A ‘redundant’ one which is waiting and will become ‘main’ when the other one ends
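
The "1 job, many probes" design with a job-initiated connection can be sketched as below. Everything about the wire format is an assumption made for illustration: the slides do not describe the real protocol or the real authentication mechanism, so the line-delimited JSON messages, the shared `token` and the function names are hypothetical.

```python
import json
import socket


def run_job(server_addr, token, probes):
    """Job side: connect OUT to the server, then serve probe requests."""
    with socket.create_connection(server_addr) as sock, \
            sock.makefile("r", encoding="utf-8") as rfile, \
            sock.makefile("w", encoding="utf-8") as wfile:
        # Hypothetical handshake standing in for the real authentication.
        wfile.write(json.dumps({"msg": "ready", "token": token}) + "\n")
        wfile.flush()
        for line in rfile:  # one long-lived job handles many probe requests
            request = json.loads(line)
            if request["msg"] == "stop":
                break
            value = probes[request["probe"]](request["target"])
            reply = {"msg": "result", "probe": request["probe"], "value": value}
            wfile.write(json.dumps(reply) + "\n")
            wfile.flush()


def demo_server(listener, token, requests, results):
    """Toy server: accept one job, send it requests, collect its results."""
    conn, _ = listener.accept()
    with conn, conn.makefile("r", encoding="utf-8") as rfile, \
            conn.makefile("w", encoding="utf-8") as wfile:
        hello = json.loads(rfile.readline())
        if hello.get("token") != token:
            return  # reject unauthenticated jobs by closing the connection
        for request in requests:
            wfile.write(json.dumps(request) + "\n")
            wfile.flush()
            results.append(json.loads(rfile.readline()))
        wfile.write(json.dumps({"msg": "stop"}) + "\n")
        wfile.flush()
```

Because the job dials out, only the server needs a listening port; the Worker Node needs none, which matches the security remark above.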

RTT, MTU and hop count
[Diagram: over the existing socket connection, the central monitoring server program (CMSP), on the UI of site paris-urec-ipv6, sends the probe request "RTT test to site C" to the job on a WN of site B; the job probes the CE of site C and returns the probe result to the CMSP.]

RTT, MTU and hop test
The ‘RTT’ measure is the time a TCP ‘connect()’ call takes:
  – Because a connect() call involves a round trip of packets: SYN and SYN-ACK (round trip), then ACK (just sending => no network delay)
  – Results are very similar to those of ‘ping’
The MTU is given by the IP_MTU socket option
The number of hops is calculated in an iterative way
These measures require:
  – Connecting to an accessible port (1) on a machine of the remote site
  – Closing the connection (no data is sent)
  – Note: this (connect/disconnect) is detected in the application log
(1): We use the port of the gatekeeper of the CE since it is known to be accessible (it is used by the grid middleware gLite)
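
The three unprivileged measurements can be approximated with plain TCP sockets, as in the sketch below. It is an illustration under assumptions, not the NetJobs code: the function names are hypothetical, the hop count is taken to be the smallest IP TTL for which connect() reaches the remote host (the slides only say "iterative"), and reading IP_MTU is Linux-specific (the numeric fallback 14 is the Linux value of that option).

```python
import socket
import time


def connect_rtt(host, port, timeout=5.0):
    """Return the duration of a TCP connect() in seconds; no data is sent."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.settimeout(timeout)
        start = time.perf_counter()
        s.connect((host, port))
        return time.perf_counter() - start


def hop_count(host, port, max_hops=30, timeout=2.0):
    """Raise the IP TTL step by step until the remote host answers."""
    for ttl in range(1, max_hops + 1):
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
            s.settimeout(timeout)
            s.setsockopt(socket.IPPROTO_IP, socket.IP_TTL, ttl)
            try:
                s.connect((host, port))
                return ttl
            except ConnectionRefusedError:
                return ttl  # the host was reached, it just refused the port
            except OSError:
                continue    # TTL expired on the way, or the attempt timed out
    return None


def path_mtu(host, port, timeout=5.0):
    """Read the IP_MTU option of a connected socket (Linux only)."""
    ip_mtu = getattr(socket, "IP_MTU", 14)  # 14 is IP_MTU on Linux
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.settimeout(timeout)
        s.connect((host, port))
        return s.getsockopt(socket.IPPROTO_IP, ip_mtu)
```

None of these calls needs root, and each one only opens and closes a connection, which is consistent with the "no data is sent" requirement.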

Active GridFTP BW test
[Diagram: over the existing socket connection, the central monitoring server program (CMSP), on the UI of site paris-urec-ipv6, sends the probe request "GridFTP BW test to site C" to the job on a WN of site A; a large grid file is replicated between Storage Elements (SE), the job reads the GridFTP log file and returns the probe result to the CMSP.]

GridFTP BW test
If the GridFTP log file is not accessible (cf. dCache?)
  – In this case we just do the transfer via globus-url-copy in verbose mode in order to get the transfer rate
A passive version of this BW test is being developed
  – The job just reads the GridFTP log file periodically (the system does not request additional transfers)
  – This is only possible if the log file is available on the Storage Element (i.e. it is a DPM)
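
The slides obtain the transfer rate from the GridFTP log or from the verbose output of globus-url-copy, whose formats are not reproduced here. As a substitute technique, the sketch below simply times an arbitrary transfer and converts the result to megabits per second; the function names and the `transfer` callable are hypothetical.

```python
import time


def rate_mbps(size_bytes, elapsed_s):
    """Convert a transferred size and a duration into megabits per second."""
    if elapsed_s <= 0:
        raise ValueError("elapsed_s must be positive")
    return size_bytes * 8 / elapsed_s / 1e6


def measure_throughput(transfer, size_bytes):
    """Run transfer() once and return the achieved rate in Mbit/s.

    transfer is any callable that moves size_bytes bytes, for instance a
    wrapper that launches an external copy command and waits for it.
    """
    start = time.perf_counter()
    transfer()
    elapsed = time.perf_counter() - start
    return rate_mbps(size_bytes, elapsed)


# A 125 MB file moved in 10 seconds corresponds to 100 Mbit/s.
print(rate_mbps(125_000_000, 10.0))  # 100.0
```

Timing the whole command includes startup and authentication overhead, so a large file is needed for the figure to approach the achievable bandwidth.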

System architecture: User Interface

The user interface

The contact form

Next steps

Next steps
1. Near future: GridFTP passive BW test alerts

Next steps
2. Other possible enhancements:
  – Refresh measurements on demand (don’t wait several hours for the next BW test...)
  – Add more types of measurements?
  – Consider adding a dedicated box (VObox?)
    o If some of the metrics needed are not available with the job-based approach (e.g. low-level measurements requiring root privileges)
    o The job would interact with this box and transport the results
    o This might be done in a restricted set of major sites
  – Consider interaction with other systems (some probes may already be installed at some sites; we could benefit from them)

Thank You
Feedback, discussion, requests…
Wiki: