perfSONAR for LHCOPN/LHCONE Update. Shawn McKee/University of Michigan. LHCONE/LHCOPN Meeting, Amsterdam, NL, October 28th, 2015.


Overview of Talk
- perfSONAR changes and updates
- WLCG, LHCONE and LHCOPN infrastructure overview
- Status and changes in our meshes
- Some new tools
  - ElasticSearch, MadAlert and topology explorations for our data
- Summary and discussion
LHCONE-Amsterdam, October 28, 2015

perfSONAR v3.5 Toolkit
- perfSONAR v3.5 was released on the 28th of September
- Main themes for this release:
  - Support for central host management and node auto-configuration
  - Support for low-cost nodes
  - Support for Debian, VMs, and other installation options
  - Modernize the GUIs
- In addition, v3.5 incorporates feedback and bugfixes from our WLCG/OSG deployments, improving robustness.
- WLCG/OSG deployment status as of today (great progress), as version: number of nodes
  - …: 6
  - …: 23
  - 3.5: 2
  - …: 195
  - Unknown: 18 (these nodes are either down or hung)

New perfSONAR Deployment Options
- Configuration-managed deployments via bundles (see http://docs.perfsonar.net/install_options.html)
  - perfSONAR Tools (just tools)
  - perfSONAR TestPoint (passive, no MA)
  - perfSONAR Core (+MA)
  - perfSONAR Complete (+Web and Toolkit Configuration)
  - perfSONAR Central Management (MaDDash, Auto-config, Centralized config service)
- Low-cost nodes to support large-scale deployment
  - $ range should enable broad deployment
  - Small form factor enables more locations
  - Some limitations in capabilities due to hardware
- VMs: still not recommended but possible
  - Target: whole-node VMs, VMs with dedicated physical NICs
  - Main use: “end-to-end” infrastructure testing (not network)
- What about Docker?

Current perfSONAR Deployment
- 278 perfSONAR instances registered in GOCDB/OIM
- 245 active perfSONAR instances
- 197 running the latest version (3.5+)
- Initial deployment coordinated by the WLCG perfSONAR TF
- Commissioning of the network followed by the WLCG Network and Transfer Metrics WG
- … for stats

Overview of perfSONAR Pipeline
The diagram on the right provides a high-level view of how WLCG/OSG is managing our perfSONAR deployments, gathering metrics and making them available for use. We will cover some of the details in what follows.

Gathering & Storing Metrics
- OSG is providing network metric data for its members and WLCG via the Network Datastore
  - The data is gathered from all WLCG/OSG perfSONAR instances
  - Stored indefinitely on OSG hardware
  - Made available via API
  - In production since September 14th
- The primary use cases:
  - Network problem identification and localization
  - Network-related decision support
  - Network baseline: set expectations and identify weak points for upgrading

OSG Network Datastore Diagram
- OSG is gathering relevant metrics from the complete set of OSG and WLCG perfSONAR instances
- Operating now
- Running VMs on dedicated hardware
- Data also published to the CERN ActiveMQ instance and available for user subscription
- Actively tuning and debugging
Diagram labels: 8 VMs; storage must host 7 distinct areas

Changes for LHCOPN/LHCONE
- We have changed to uni-directional tests for OWAMP to reduce the load
  - The source host is responsible for initiating and recording test results to each destination
- We are using iperf3 as the baseline for bandwidth measurements (adds retransmit information)
- A recent fix for NDT ensures the TCP congestion control algorithm is ‘htcp’ rather than ‘reno’ when NDT and NPAD are not in use. This should improve BW results.
- Plan to gather all the LHCOPN and LHCONE data into ElasticSearch (ongoing)
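
As a rough illustration of the uni-directional scheme described above, the sketch below assigns each directed test to its source host only, so every source initiates and records one test per destination. The function name and the host names are hypothetical; this is not the mesh-config format or any perfSONAR API.

```python
from itertools import permutations


def unidirectional_plan(hosts):
    """Map each source host to the destinations it initiates tests to.

    Every ordered (source, destination) pair appears exactly once, so a
    mesh of n hosts yields n * (n - 1) directed tests in total.
    """
    plan = {src: [] for src in hosts}
    for src, dst in permutations(hosts, 2):
        plan[src].append(dst)
    return plan


if __name__ == "__main__":
    # Hypothetical host names, for illustration only.
    plan = unidirectional_plan(["ps-a.example.org", "ps-b.example.org", "ps-c.example.org"])
    for src, dsts in plan.items():
        print(src, "->", ", ".join(dsts))
```

Each host appears as a source exactly once, which is what lets the source be the single place where results for that direction are recorded.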

Existing Test Coverage
- Current perfSONAR measurement coverage for WLCG/OSG:
  - Full latency (one-direction only, 10Hz, OWAMP, IPv4)
  - Full traceroute (bi-directional, hourly, BWCTL/OWAMP, IPv4, IPv6)
  - Full bandwidth (one-direction only, fortnightly, BWCTL-only!, IPv4, IPv6)
- Regional meshes are still disabled; we need to discuss how to evolve them
  - We can create any sub-mesh of the full latency mesh (for free, but only IPv4 and using the same params)
  - We could move from regional to bigger meshes (European, Asia/Pacific, US)
  - We can create new bandwidth meshes, as BWCTL needs fewer resources (but only for BWCTL-only nodes, not on dual-nodes)
- We re-enabled project meshes
  - Belle II: both latency and bandwidth
  - Dual-stack: just bandwidth (both IPv4 and IPv6)
  - LHCONE/LHCOPN: these need to be separately tracked

OMD for LHCONE/LHCOPN perfSONARs
- Two instances: Prototype and Production
- We monitor:
  - “Expected” test coverage
  - NDT/NPAD running?
  - Memory on hosts (<4GB)
  - New “version” test
- Access requires an X.509 credential from an IGTF CA
- Gives us a good view into where problems still exist

OMD Hostgroup Summary LHCOPN/LHCONE

perfSONAR Monitoring Pages
- We have 3 versions of our perfSONAR monitoring pages
  - Prototype at maddash.aglt2.org (intending to phase this out soon)
  - Testing at OSG’s ITB instance
  - Production at OSG’s production instance
- Main monitoring types are MaDDash and OMD/Check_MK
- Notes:
  - OSG instances rely upon the OSG Datastore
  - X.509 cert needed to view Check_MK/OMD pages (any IGTF cert)

Monitoring Metrics
- Use MaDDash to view metric summaries
  - Provides a quick view of how networks are working
  - OSG hosts the production instance
- Metrics are displayed via a source-destination matrix
- Multiple dashboards (meshes) can be selected
- Custom menus link to relevant resources
- New release (2.0) will incorporate MadAlert

LHCONE MaDDash – 01 Jun 2015
Things looked pretty good. Some missing measurements (orange) but not much packet loss seen. Bandwidth measurements so-so with some problems indicated.

LHCONE MaDDash – 27 Oct 2015
There are some issues getting data from Internet2/GEANT instances that we need to look into.

LHCOPN MaDDash – 01 Jun 2015
Latency had some issues: RAL showed signs of continuing network problems. BW mesh similar. KISTI still has problems; “red” throughput is worth examining.

LHCOPN MaDDash – 27 Oct 2015
Some firewall problems for the OSG collector from FNAL. Setup being examined at INFN and PIC. Patches to address IPv6/IPv4 and http/https were put in yesterday and things are currently broken. Should be fixed later today.

Existing Tools
- We have a number of tools available to help debug and understand network problems.
- There are very good presentations on these tools in the training materials provided by perfSONAR.
- I don’t have time to cover all the details (see the training materials, especially the Measurement Tools, Use Cases and Debugging presentations from Jason Zurawski), but I do want to note that command-line tools exist that let you create on-demand 3rd-party tests (between two remote instances) for bandwidth, latency and traceroute.
- Follow the debugging strategy as a guide to finding and fixing LHCONE/LHCOPN network issues using perfSONAR capabilities
- As for new tools…

How to Find Nearest pS Instances
- perfSONAR measures our networks, but how do we find the “right” perfSONAR instance (or metrics) when we need to understand a path?
- We need tools that let us identify which perfSONAR is relevant:
  - provide “glue” between experiments and sonar topology
  - map sonars to storages and vice versa
  - determine existing tools/technologies that can be used
- Looked at different approaches
  - Site-based: join based on common mapping to GOCDB/OIM sites
  - GeoIP: join based on geographical distance (prototyped)
  - Traceroutes: join based on network distance (prototyped)
  - RTT: multilateration based on RTT; determine position by measuring distance from reference points (perfSONARs)
- GeoIP API prototype exists (works with any hostname)
- Interest from GEANT and ESnet to work on an open-source based project
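
The GeoIP approach (join based on geographical distance) can be sketched as follows. This is a minimal illustration, not the prototype API: it assumes coordinates have already been obtained from some GeoIP lookup, and the host names and coordinates below are made up.

```python
import math

EARTH_RADIUS_KM = 6371.0


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (latitude, longitude) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def nearest_instances(target, instances, k=1):
    """Return the names of the k instances geographically closest to target.

    target is a (lat, lon) tuple; instances maps a name to its (lat, lon).
    """
    ranked = sorted(instances, key=lambda name: haversine_km(*target, *instances[name]))
    return ranked[:k]


if __name__ == "__main__":
    # Hypothetical perfSONAR hosts with approximate, illustrative coordinates.
    sonars = {
        "ps-ams.example.org": (52.36, 4.95),
        "ps-gva.example.org": (46.23, 6.05),
        "ps-chi.example.org": (41.88, -87.63),
    }
    storage_location = (52.37, 4.90)  # e.g. a storage element to be mapped to a sonar
    print(nearest_instances(storage_location, sonars, k=2))
```

Geographical distance is only a proxy for network distance, which is why the traceroute- and RTT-based approaches were explored alongside it.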

ATLAS Network Metrics Pipeline
- Ilija Vukotic, Kaushik De, Rob Gardner and Jorge Batista are working with the Network and Transfer Metrics WG to make perfSONAR metrics available to PANDA
- Pipeline: OSG Network Datastore -> CERN ActiveMQ -> Flume -> ES -> PANDA
- Prototype working and analytics being performed in ElasticSearch to validate data (see following slide)
- Plan is to create a network source-destination cost matrix PANDA can use to evaluate options
- Actual interface details being discussed with the PANDA team
- Can also be used to analyze LHCONE/LHCOPN data!
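
To make the idea of a source-destination cost matrix concrete, here is a minimal sketch. Since the actual interface is still under discussion, everything here is an assumption: the record layout (source, destination, packet-loss fraction), the choice of mean packet loss as the cost, and the function names are all hypothetical.

```python
from collections import defaultdict


def build_cost_matrix(records):
    """Average a per-measurement metric into a nested {src: {dst: cost}} matrix.

    records is an iterable of (src, dst, value) tuples; here the value is
    assumed to be a packet-loss fraction, and the cost is its mean.
    """
    sums = defaultdict(lambda: [0.0, 0])
    for src, dst, value in records:
        acc = sums[(src, dst)]
        acc[0] += value
        acc[1] += 1
    matrix = defaultdict(dict)
    for (src, dst), (total, count) in sums.items():
        matrix[src][dst] = total / count
    return {src: dict(row) for src, row in matrix.items()}


def best_source(matrix, dst, candidates):
    """Pick the candidate source with the lowest known cost to dst, or None."""
    known = [s for s in candidates if dst in matrix.get(s, {})]
    return min(known, key=lambda s: matrix[s][dst]) if known else None


if __name__ == "__main__":
    # Made-up site names and loss values, for illustration only.
    records = [
        ("SiteA", "SiteB", 0.0),
        ("SiteA", "SiteB", 0.02),
        ("SiteC", "SiteB", 0.001),
    ]
    matrix = build_cost_matrix(records)
    print(matrix)
    print(best_source(matrix, "SiteB", ["SiteA", "SiteC"]))
```

A real cost would likely combine several metrics (loss, latency, throughput); a single averaged value just keeps the matrix structure easy to see.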

perfSONAR Data into ElasticSearch
Example plots using WLCG data: avg src loss, avg src loss %, avg dst loss, avg dst loss %

MadAlert: A new project to analyze meshes
- Gabriele Carcassi has been working with me on creating a new utility to analyze meshes: MadAlert
  - Meshes and reports can be viewed from the project page
  - Reports find both infrastructure and network problems
- We are now working with Andy Lake/ESnet to incorporate this into the next major release of MaDDash (v2.0)
- Now testing a “diff” to allow us to compare meshes; e.g., IPv4 vs IPv6, testing vs production, mesh(t1) vs mesh(t2)
  - Could be really helpful for understanding new software versions or changes in time. Time-based comparison will require some modifications to MaDDash to allow specifying time-based meshes.
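
The mesh “diff” idea can be illustrated with a small sketch. This is not MadAlert’s implementation or data format: it assumes a mesh is simply a mapping from a (source, destination) cell to a status string, and the host names and status values are hypothetical.

```python
def diff_meshes(mesh_a, mesh_b):
    """Return the cells whose status differs between two meshes.

    Each mesh maps a (src, dst) tuple to a status string. A cell present
    in only one mesh is reported with the status "missing" on the other side.
    """
    changed = {}
    for cell in sorted(set(mesh_a) | set(mesh_b)):
        status_a = mesh_a.get(cell, "missing")
        status_b = mesh_b.get(cell, "missing")
        if status_a != status_b:
            changed[cell] = (status_a, status_b)
    return changed


if __name__ == "__main__":
    # Illustrative meshes, e.g. the same hosts tested over IPv4 and IPv6.
    ipv4 = {("h1", "h2"): "ok", ("h2", "h1"): "ok", ("h1", "h3"): "warning"}
    ipv6 = {("h1", "h2"): "ok", ("h2", "h1"): "critical"}
    for cell, (a, b) in diff_meshes(ipv4, ipv6).items():
        print(cell, a, "->", b)
```

The same comparison applies to testing vs production or mesh(t1) vs mesh(t2), provided both meshes are keyed by the same source-destination cells.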

Understanding Network Topology
- Can we create tools to manipulate, visualize, compare and analyze network topologies from the OSG network datastore contents?
- Can we build upon these tools to create a set of next-generation network diagnostic tools to make debugging network problems easier, quicker and more accurate?
- Even without requiring the ability to perform complicated data analysis and correlation, basic tools developed in the area of network topology-based metric visualization would be very helpful in letting users and network engineers better understand what is happening in our networks.
- Last time I showed some examples of potentially useful components to begin looking at network topology…

Exploring Path Analysis
- We can correlate paths with packet-loss/latency information
- We can simplify the graph by aggregating nodes that belong to the same NREN (visual debugging)
Figure labels: latency, packet-loss, throughput; DFN, JANET, GEANT, RAL, Aachen, ITEP, QMUL
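
Aggregating nodes that belong to the same NREN can be sketched as collapsing consecutive traceroute hops that map to the same network. The hop names and the hop-to-network mapping below are hypothetical; in practice the mapping would have to come from some external source (for example AS or domain information), which is not shown here.

```python
from itertools import groupby


def collapse_path(hops, hop_to_network):
    """Collapse consecutive hops in the same network into a single node.

    Hops without a known network are kept under their own name.
    """
    labelled = [hop_to_network.get(hop, hop) for hop in hops]
    return [label for label, _ in groupby(labelled)]


if __name__ == "__main__":
    # Made-up router names; the network labels match those in the figure.
    hop_to_network = {
        "dfn-r1": "DFN", "dfn-r2": "DFN",
        "geant-r1": "GEANT", "geant-r2": "GEANT",
        "janet-r1": "JANET",
    }
    path = ["aachen-gw", "dfn-r1", "dfn-r2", "geant-r1", "geant-r2", "janet-r1", "ral-gw"]
    print(" -> ".join(collapse_path(path, hop_to_network)))
```

Only adjacent hops are merged, so a path that leaves a network and later re-enters it still shows both visits.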

WLCG Support Unit
- Reminder: We have a GGUS support unit (WLCG Network Throughput) used to report incidents (mailing list: wlcg-network-throughput at cern.ch)
  - Experiments can report potential network performance incidents.
  - WLCG perfSONAR support investigates and confirms whether this is a network-related issue.
  - Once confirmed, it will notify relevant sites and will try to assist in narrowing down the problem to particular link(s). Tracking of ongoing incidents will be via the WG page.
- Sites observing a network performance problem should follow their standard procedure, i.e. report to their network team and if necessary escalate to their network provider.
  - If confirmed to be WAN related, the WLCG perfSONAR support unit can assist in further debugging. For non-technical (policy) issues, sites should escalate to WLCG operations coordination.
  - …ormance_Incidents
- LHCOPN/LHCONE experts are very important in this coordinated activity.

Next Steps
- We are working on getting ALL WLCG/OSG perfSONAR instances fully operational and properly configured
  - We have hints that some perfSONAR services stop or hang under some circumstances. Working with developers to isolate/fix.
  - Some hosts are underpowered (<4GB of memory on latency hosts) or broken
- There are some bugs we know of in the data acquisition chain that need fixing. Ongoing effort on this.
- As we fix known issues and get to reliable operation, we can free up time to pursue possible issues in the network itself, rather than in the framework that gets us network metrics.
- We need to plan for a campaign to clear up remaining LHCONE/LHCOPN problems.
  - Currently working with FNAL, INFN and PIC on some issues

References
- Network Documentation
- Deployment documentation for OSG and WLCG hosted in OSG
- New MA guide
- Modular Dashboard and OMD Prototypes
- OSG Production instances for OMD, MaDDash and Datastore
- Mesh-config in OSG
- Use-cases document for experiments and middleware

Discussion/Questions/Comments?

Discussion Topics
- What would you like to see as next steps?
- Are the tools sufficient to help us with our goals?
- Any changes in coverage/tests needed?
- Should we push forward on using our metrics to start addressing issues in the network? (Who is “we”?)