Network Monitoring for OSG. Shawn McKee, University of Michigan. OSG Staff Planning Retreat, July 10th, 2012.

Outline
 Motivation for Network Monitoring
 Status and Related Work
 perfSONAR-PS
 Modular Dashboard
 Goals
 Draft Work Plan

Motivations for OSG Network Monitoring
 Distributed collaborations rely upon the network as a critical part of their infrastructure, yet finding and debugging network problems can be difficult and, in some cases, can take months.
 There is typically no differentiation of how the network is used amongst the OSG users (quantity may vary).
 We need a standardized way to monitor the network and locate problems quickly if they arise.
 We don't want to have a network monitoring system per VO!

Data Movement for Science
This should not be news to anyone here … flows are getting larger (e.g. science datasets in the R&E world).
– Special requirements (e.g. streaming media is sensitive to jitter; bulk data transfer is sensitive to loss)
– Number of users/devices is increasing
– Locations are spread out
– Everything is cross domain
Slide from Jason Zurawski

Network Realities
Where are the problems?
Network core? Everything is well connected, well provisioned, and flawlessly configured, RIGHT?
End systems? Properly tuned for optimal TCP performance (no matter the operating system), proper drivers installed and functioning optimally, RIGHT?
The LAN? The regional net?
Better to ask "Where aren't there problems?"
Slide from Jason Zurawski

Need for a "Finger Pointing" Tool
 As you can imagine (or have experienced), network problems can be hard to identify and/or isolate.
 To first order, most users identify any problem where the WAN is involved as a "network problem" (sometimes they are right).
 How can we quickly identify when problems are network problems and help isolate their locations?
 The perfSONAR project was designed to help do this.

History of perfSONAR
 perfSONAR: a joint effort of ESnet, Internet2, GEANT and RNP to standardize network monitoring protocols, schema and tools.
 USATLAS adopted the perfSONAR-PS toolkit early on; all Tier-2s and the Tier-1 were instrumented, with full-mesh tests in place.
 A modular dashboard was developed by Tom Wlodek/BNL, based upon USATLAS requirements, to better understand the deployed infrastructure (working well for USATLAS).
 LHCOPN chose to adopt it in June 2011; it was mostly deployed within 3 months (by September 2011).

OSG perfSONAR-PS Deployment
 We want a set of tools that:
 Are easy to install
 Measure the "network" behavior
 Provide a baseline of network performance between end-sites
 Are standardized and broadly deployed
 Details of how LHCONE sites set up their perfSONAR-PS installations are documented on the Twiki.
 An example OSG could follow (with minor changes)
 In the next few slides I will highlight some of the relevant details.

OSG Network Monitoring Goals
 We want OSG sites to have the ability to easily monitor their network status:
 Sites should be able to determine if network problems are occurring.
 Sites should have a reasonable "baseline" measurement of usable bandwidth between themselves and selected peers.
 Sites should have standardized diagnostic tools available to identify, isolate and aid in the repair of network-related issues.
 We want OSG VOs to have the ability to easily monitor the set of network paths used by their sites:
 VOs should be able to identify problematic sites regarding their network.
 VOs should be able to track network performance and alert on network problems between VO sites.

How To Achieve These Goals?
 OSG should plan to leverage the existing and ongoing efforts in the LHC community regarding network monitoring.
 The perfSONAR-PS toolkit is an actively developed set of network monitoring tools following the perfSONAR standards.
 There is an existing modular dashboard which is currently undergoing a redesign. OSG should not only use this but also provide input about the design features needed to enable its effective use for OSG.
 Some effort is underway to enable alerting for network problems. I have an undergraduate working on an example system.
 Details of how best to integrate within OSG planning and existing and future infrastructure are why we are here.
 Later we can discuss a draft workplan.

perfSONAR-PS Deployment Considerations
 We want to measure (to the extent possible) the entire network path between OSG resources. This means:
 We want to locate perfSONAR-PS instances as close as possible to the storage/compute resources associated with a site. The goal is to ensure we are measuring the same network path to/from the relevant site resources.
 There are two separate instances that should be deployed: latency & bandwidth (two instances to prevent interference).
 The latency instance measures one-way delay by using an NTP-synchronized clock and sending 10 packets per second to target destinations (the important metric is packet loss!).
 The bandwidth instance measures achievable bandwidth via a short test (20-60 seconds) per src-dst pair every 4 (or 'n') hours.

perfSONAR-PS Deployment Considerations
 Each "site" should have perfSONAR-PS instances in place.
 If an OSG site has more than one "network" location, each should be instrumented and made part of scheduled testing.
 Standardized hardware and software is a good idea:
 Measurements should represent what the network is doing and not differences in hardware/firmware/software.
 USATLAS has identified and tested systems from Dell for perfSONAR-PS hardware. Two variants: R310 and R610.
 The R310 is cheaper (<$900) and can host 10G (Intel X520 NIC), but that configuration is not supported by Dell (most US ATLAS sites chose this).
 The R610 officially supports the X520 NIC (Canadian sites chose this).
 Orderable off the Dell LHC portal for LHC sites.
 VOs should try to upgrade perfSONAR-PS toolkit versions together.

Network Impact of perfSONAR-PS
 To provide an idea of the network impact of a typical deployment, here are some numbers as configured in USATLAS:
 Latency tests send 10 Hz of small packets (20 bytes) to each testing location. USATLAS Tier-2s test to ~9 locations. Since headers account for 54 bytes, each packet is 74 bytes, so the rate for testing to 9 sites is 6.7 kbytes/sec.
 Bandwidth tests try to maximize the throughput. A 20-second test is run from each site in each direction once per 4-hour window. Each site runs tests in both directions. Typically the best result is around 925 Mbps on a 1 Gbps link for a 20-second test. That means we send 4 x 925 Mbps x 20 sec every 4 hours per testing pair (src-dst); averaged over time this is about 5 Mbps per pair, or roughly 46 Mbps in aggregate when testing with 9 other sites.
 Tests are configurable, but the above settings are working fine.
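As a sanity check, the overhead figures on this slide can be recomputed directly from the stated parameters (a quick sketch; the 74-byte packet size, 10 Hz rate, 925 Mbps result and 4-hour window are the numbers quoted above):

```python
# Estimate the network impact of the USATLAS-style perfSONAR-PS test schedule,
# using only the parameters quoted on the slide.

# Latency tests: 10 packets/sec of 74-byte packets (20 B payload + 54 B headers)
# to each of ~9 peer sites.
PACKET_BYTES = 20 + 54
latency_bytes_per_sec = 10 * PACKET_BYTES * 9
print(f"latency load: {latency_bytes_per_sec / 1000:.1f} kB/s")  # 6.7 kB/s

# Bandwidth tests: 4 x 20-second runs at ~925 Mbit/s per src-dst pair
# (each of the two hosts tests both directions) in every 4-hour window.
WINDOW_SEC = 4 * 3600
mbits_per_window = 4 * 925 * 20                  # Mbit sent per pair per window
avg_mbps_per_pair = mbits_per_window / WINDOW_SEC
print(f"throughput load: {avg_mbps_per_pair:.1f} Mbit/s per pair, "
      f"{9 * avg_mbps_per_pair:.0f} Mbit/s aggregate for 9 peers")
```

Running this reproduces the 6.7 kB/s latency-test figure and shows the throughput tests average only a few Mbit/s per pair, which is why the schedule is considered low-impact.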

Modular Dashboard
 While the perfSONAR-PS toolkit is very nice, it was designed as a distributed, federated installation:
 It is not easy to get an "overview" of a set of sites or their status.
 USATLAS needed some "summary interface".
 Thanks to Tom Wlodek's work on developing a "modular dashboard", we have a very nice way to summarize the extensive information being collected for the near-term network characterization.
 The dashboard provides a highly configurable interface to monitor a set of perfSONAR-PS instances via simple plug-in test modules. Users can be authorized based upon their grid credentials. Sites, clouds, services, tests, alarms and hosts can be quickly added and controlled.
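To illustrate the plug-in idea, here is a minimal sketch of what a dashboard test module could look like. The class, method and field names are hypothetical; the slides do not document the dashboard's actual plug-in API:

```python
# Hypothetical plug-in interface for a modular-dashboard test module.
# All names here are illustrative; the real dashboard API may differ.
from dataclasses import dataclass

@dataclass
class TestResult:
    host: str
    status: str   # "OK", "WARN", "ALARM", or "UNKNOWN"
    detail: str

class PingServiceCheck:
    """Example plug-in: report whether a perfSONAR-PS service answered."""
    name = "service-ping"

    def run(self, host, responded):
        # A real module would query the host's measurement archive;
        # here the probe outcome is passed in directly for clarity.
        if responded:
            return TestResult(host, "OK", "service responded")
        return TestResult(host, "ALARM", "no response from service")

check = PingServiceCheck()
print(check.run("ps.example.edu", True).status)   # OK
print(check.run("ps.example.edu", False).status)  # ALARM
```

The point of the plug-in shape is that the dashboard core only needs a uniform result type; adding a new kind of test (latency, throughput, service liveness) means adding one small module, not changing the framework.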

Example of Dashboard for US CMS
(Screenshot: the "primitive" service status view, with links to other dashboards.)

VO Site Configuration Considerations
 Determine what the VO wants for scheduled tests.
 Recommendations for tests:
 Latency tests (for the packet-loss info): use default settings.
 Throughput: decide how often and how long (USATLAS: one test per 4 hrs, 20-second duration; 10GE may need a longer test).
 Traceroute: sites should set up a traceroute test to each other VO site.
 Use a "community" to self-identify VO sites of interest. I recommend the VO name. This will allow VO sites to pick that community and see everyone "advertising" that attribute, and allows adding sites to tests with a "click".
 Get VO sites onto the same (current) version.
 Make sure firewalls are not blocking either VO sites or the collector at BNL (or OSG?): rnagios01.usatlas.bnl.gov.
 Copy/rewrite the LHCONE info on the Twiki for VO use.

Targets for OSG
 There are two "clients" for OSG network monitoring: sites and VOs. How do we support both most effectively?
 Sites need:
 Details of options for required hardware
 Software (perfSONAR-PS) and detailed installation instructions
 Configuration options documented with suggested best practices
 Notification when problems are identified
 VOs need:
 Site details (perfSONAR-PS instances at each VO site)
 Software (modular dashboard hosted by OSG?) and detailed configuration options
 Dashboard configuration details: how to add my VO sites for monitoring?
 Centralized test/scheduling management (a "pull" model seems best)

Draft Work Plan for OSG
 Develop OSG site install procedures for perfSONAR-PS.
 Use existing infrastructure for software download, or provide an OSG distribution?
 Provide a site recommendations and best-practices guide.
 Provide a VO-level recommendations and best-practices doc.
 OSG should host a set of services providing a modular dashboard for VOs. Need to determine details.
 Should OSG provide packaged "modular dashboard" components to allow sites/VOs to deploy their own instance?
 OSG should allow VOs or sites to request "alerting" when monitoring identifies network problems. Need to create and deploy such a capability.

Challenges Ahead
 Getting the hardware/software platform installed at OSG sites.
 Dashboard development: currently USATLAS/BNL, and soon OSG, Canada (ATLAS, HEPnet) and USCMS. OSG input?
 Managing site and test configurations:
 Determining the right level of scheduled tests for a site, e.g., which other OSG or VO sites?
 Improving the management of the configurations for VOs/clouds.
 Tools to support "central" configuration (Internet2 is working on this).
 Alerting: a high-priority need, but complicated:
 Alert who? Network issues could arise in any part of the end-to-end path.
 Alert when? Defining criteria for the alert threshold. Primitive services are easier; network test results are more complicated to decide on.
 Integration with existing VO and OSG infrastructures.
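To make the "alert when?" question concrete, here is a minimal sketch of how a packet-loss alarm might classify latency-test results. The thresholds are illustrative assumptions, not values taken from any OSG or USATLAS policy:

```python
def classify_packet_loss(lost, sent, warn_frac=0.001, alarm_frac=0.01):
    """Classify a one-way latency test result by packet-loss fraction.

    warn_frac / alarm_frac are illustrative thresholds (0.1% and 1%),
    not an agreed OSG policy.
    """
    if sent == 0:
        return "NO_DATA"          # host unreachable or test not running
    loss = lost / sent
    if loss >= alarm_frac:
        return "ALARM"
    if loss >= warn_frac:
        return "WARN"
    return "OK"

# The 10 Hz latency tests accumulate 36000 packets per hour per peer.
print(classify_packet_loss(0, 36000))    # OK
print(classify_packet_loss(50, 36000))   # WARN  (~0.14% loss)
print(classify_packet_loss(400, 36000))  # ALARM (~1.1% loss)
```

Even this toy version shows why "alert when?" is the hard part: the thresholds, the averaging window, and the distinction between a down service (NO_DATA) and a lossy path all have to be chosen before any alert can be routed to the right people.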

Discussion/Questions
Questions or Comments?

References
 perfSONAR-PS site
 Install/configuration guide: ps/wiki/pSPerformanceToolkit32
 Modular Dashboard
 Tools, tips and maintenance
 LHCONE perfSONAR
 LHCOPN perfSONAR
 CHEP 2012 presentation on USATLAS perfSONAR-PS experience

Modular Dashboard Development
 The dashboard that currently exists has some shortcomings, which are being addressed by a new development effort.
 There is a mailing list tracking the effort.
 We (OSG) need to ensure the product will meet our needs. If there is input appropriate for the development effort, we need to make sure it gets into the development process. Coding is just starting now…

Old dashboard - overview
(Diagram showing components: PS host, collector, collector API, database, dashboard, user.)

Proposed structure of new dashboard framework
(Diagram showing components: Data Store, Data Access API, Data Persistence Layer, Database, Display GUI, Object config GUI, Alarms, Authentication, Collector, Other?)

Modular Dashboard Schedule
 Current modular dashboard development schedule from Tom Wlodek/BNL and Andy Lake/ESnet:
 July 1st: We will have official version 1.0 of the design document ready and we can start coding. We can add changes to the document later, but it will be a starting point for development. See X9dNXH1K-62Ax9rFnZvKE/edit?pli=1
 August 1st: We will have the first version of the dashboard deployed. It shall consist of a collector (Andy), data store and data access API (myself), and some rudimentary text GUI. We may reuse Andy's GUI if possible; Andy is going to look into that. Not included will be: configuration GUI, persistence and probe history.
 Sep 1st: We will have the full dashboard including history, configuration GUI and persistence. I am not sure if we will fit the alarms in by then.