1
LHCOPN/LHCONE perfSONAR Update
Ian Collier/RAL, presenting for Shawn McKee/UM
LHCONE/LHCOPN Meeting
Taipei, Taiwan
March 13th, 2016
2
Overview of Talk
perfSONAR changes and updates
WLCG, LHCONE and LHCOPN infrastructure overview
Status and changes in our meshes
Some new tools: ElasticSearch, MadAlert and topology explorations for our data
Summary and discussion
LHCONE-Taipei March 13, 2016
3
Importance of LHCONE perfSONAR
As we start this presentation, it is important to note the usefulness of having LHCONE perfSONAR instances in place.
Just within the last 2 months we have used instances in the US and Europe to help diagnose network issues.
We see a gap in coverage for Asia, and it would be very good to get additional instances in place, especially in the regional R&E networks.
We are hoping this LHCONE/LHCOPN meeting will be a chance to encourage additional instances in Asia to join the LHCONE monitoring mesh.
Contact Shawn McKee and Marian Babik if you are interested!
4
perfSONAR v3.5.1 Toolkit
perfSONAR v3.5.1 released on the 4th of March 2016
Main themes for this release:
  A new web interface for creating/managing your regular tests
  Normalized package names, configuration files and paths
  Upgrade to Esmond (backward incompatibilities for writing data)
  Improved support for Debian 7 and 8
  See release notes
In addition v3.5.1 incorporates feedback and bugfixes from our WLCG/OSG deployments, improving robustness.
WLCG/OSG deployment status as of today (great progress):
  3.4.1: 6
  3.4.2: 8
  3.5: 2
  3.5.0: 37
  3.5.1: 169
  Unknown: 25 (these nodes are either down or hung)
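As a rough illustration, the deployment counts listed above can be tallied to see what share of nodes already runs 3.5 or later. This is only a sketch: the dictionary is copied from the slide, while the function names and the choice to exclude "Unknown" nodes from the up-to-date count are assumptions made for the example.

```python
# Node counts per perfSONAR version, copied from the slide.
deployment = {
    "3.4.1": 6,
    "3.4.2": 8,
    "3.5": 2,
    "3.5.0": 37,
    "3.5.1": 169,
    "Unknown": 25,
}


def version_key(version):
    """Turn '3.5.1' into (3, 5, 1) so versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


def summarize(counts, minimum="3.5"):
    """Return (total nodes, nodes at or above `minimum`, percentage of total).

    Assumption: nodes reporting "Unknown" are never counted as up to date.
    """
    total = sum(counts.values())
    current = sum(
        n
        for version, n in counts.items()
        if version != "Unknown" and version_key(version) >= version_key(minimum)
    )
    return total, current, 100.0 * current / total


total, current, share = summarize(deployment)
print(f"{current} of {total} nodes run 3.5 or later ({share:.1f}%)")
```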
5
Review perfSONAR Deployment Options
Configuration managed deployments via bundles (see )
  perfSONAR Tools (just tools)
  perfSONAR TestPoint (passive, no MA)
  perfSONAR Core (+MA)
  perfSONAR Complete (+Web and Toolkit Configuration)
  perfSONAR Central Management (MaDDash, Auto-config, Centralized config service)
Low-cost nodes to support large-scale deployment ( )
  $ range should enable broad deployment
  Small form factor enables more locations
  Some limitations in capabilities due to hardware
VMs - Still not recommended but possible
  Target: whole node VMs, VMs with dedicated physical NICs
  Main use “end-to-end” infrastructure testing (not network)
What about Docker?
6
Map of perfSONAR Deployment
for stats
278 perfSONAR instances registered in GOCDB/OIM
248 active perfSONAR instances
208 running latest version (3.5+)
Jorge volunteered to help out here …
Initial deployment coordinated by WLCG perfSONAR TF
Commissioning of the network followed by WLCG Network and Transfer Metrics WG
7
Gathering & Storing Metrics
OSG is providing network metric data for its members and WLCG via the Network Datastore
  The data is gathered from all WLCG/OSG perfSONAR instances
  Stored indefinitely on OSG hardware
  Data available via Esmond API
  In production since September 14th 2015
The primary use-cases
  Network problem identification and localization
  Network-related decision support
  Network baseline: set expectations and identify weak points for upgrading
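To make "data available via Esmond API" more concrete, here is a minimal sketch that only builds a query URL and does not contact any server. The host names are placeholders, and the archive path and the parameter names (`source`, `destination`, `event-type`, `time-range`) follow the Esmond REST conventions as I understand them; treat them as assumptions and check them against the datastore you actually use.

```python
from urllib.parse import urlencode


def esmond_query_url(archive_host, source, destination, event_type, time_range_s):
    """Build a metadata query URL for an Esmond measurement archive.

    Assumptions: the archive is served under /esmond/perfsonar/archive/
    and accepts these filter parameters; `archive_host` is a placeholder.
    """
    params = {
        "source": source,
        "destination": destination,
        "event-type": event_type,    # e.g. packet-loss-rate or throughput
        "time-range": time_range_s,  # look-back window in seconds
    }
    return f"http://{archive_host}/esmond/perfsonar/archive/?{urlencode(params)}"


url = esmond_query_url(
    "datastore.example.org",  # hypothetical datastore host
    "ps-a.example.org",       # hypothetical source perfSONAR host
    "ps-b.example.org",       # hypothetical destination perfSONAR host
    "packet-loss-rate",
    86400,                    # last 24 hours
)
print(url)
```

The resulting URL can be fetched with any HTTP client; the response format depends on the archive version.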
8
Review of perfSONAR Pipeline
The diagram on the right provides a high-level view of how WLCG/OSG is managing our perfSONAR deployments, gathering metrics and making them available for use.
End users can monitor the data via the OSG MaDDash instance, grab the data directly from the OSG datastore or subscribe to the ActiveMQ bus at CERN.
9
Configuration for LHCOPN/LHCONE
We have changed to use uni-directional tests for OWAMP to reduce the load
  Source host is responsible for initiating and recording test results to each destination
We are using iperf3 as the baseline for bandwidth measurements (adds retry information)
Fall fix for NDT ensured the TCP congestion control algorithm would be ‘htcp’ rather than ‘reno’ when NDT and NPAD are not in use, which improves BW results.
We are sending all the LHCOPN and LHCONE data into ElasticSearch (ongoing)
NDT – Network Diagnostic Tool
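A quick way to see which congestion control algorithm a Linux host is actually using is to read the kernel setting directly. The sketch below assumes the standard Linux `/proc` path; the function names are made up for illustration and this is not part of the perfSONAR toolkit.

```python
from pathlib import Path

# Standard Linux location of the active TCP congestion control algorithm.
DEFAULT_SYSCTL = "/proc/sys/net/ipv4/tcp_congestion_control"


def active_congestion_control(path=DEFAULT_SYSCTL):
    """Return the algorithm name, or None if the file cannot be read."""
    try:
        return Path(path).read_text().strip()
    except OSError:
        return None


def uses_expected_algorithm(expected="htcp", path=DEFAULT_SYSCTL):
    """True only if the host reports exactly the expected algorithm."""
    return active_congestion_control(path) == expected


print("active algorithm:", active_congestion_control())
print("htcp in use:", uses_expected_algorithm("htcp"))
```

On non-Linux systems the file does not exist, so the first function returns None and the check reports False.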
10
Existing Test Coverage
Current perfSONAR measurement coverage for WLCG/OSG:
  Full latency (one-direction only, 10Hz, OWAMP, IPv4)
  Full traceroute (bi-directional, hourly, BWCTL/OWAMP, IPv4, IPv6)
  Full bandwidth (one-direction only, fortnightly, BWCTL-only!, IPv4, IPv6)
Regional meshes still disabled, need to discuss how to evolve
  We can create any sub-mesh of the full latency mesh (for free, but only IPv4 and using same params)
  We could move from regional to bigger meshes (European, Asia/Pacific, US)
  We can create new bandwidth meshes as bwctl needs fewer resources (but only for BWCTL-only nodes, not on dual-nodes)
We re-enabled project meshes
  Belle II – both latency and bandwidth
  Dual-stack – just bandwidth (both IPv4 and IPv6)
  LHCONE/LHCOPN – These are separately tracked
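The idea of a full mesh of uni-directional tests, and of carving a sub-mesh out of it, can be sketched in a few lines. The host names are placeholders and the functions are illustrative only; the real mesh configuration is produced by the perfSONAR mesh-config tooling, not by code like this.

```python
from itertools import permutations


def directed_tests(hosts):
    """Every ordered (source, destination) pair.

    With uni-directional tests the source host initiates and records
    the test to each destination, so A->B and B->A are separate tests.
    """
    return list(permutations(hosts, 2))


def sub_mesh(full_mesh, members):
    """Keep only the tests whose source and destination are both members."""
    members = set(members)
    return [(s, d) for s, d in full_mesh if s in members and d in members]


# Hypothetical hosts, for illustration only.
hosts = ["ps-a.example.org", "ps-b.example.org", "ps-c.example.org", "ps-d.example.org"]
full = directed_tests(hosts)
regional = sub_mesh(full, ["ps-a.example.org", "ps-c.example.org"])

print(len(full), "tests in the full mesh")
print(regional)
```

Because the sub-mesh is just a filter over tests that already run, it needs no extra measurements, which matches the "for free" remark above.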
11
perfSONAR Monitoring Pages
We have 3 versions of our perfSONAR monitoring pages
  Prototype at maddash.aglt2.org (intending to phase this out soon)
  Testing at OSG’s ITB instance
  Production at OSG’s production instance
Main monitoring types are MaDDash and OMD/Check_MK
  Prototype:
  Testing:
  Production:
Notes:
  OSG instances rely upon OSG Datastore:
  X509 cert needed to view check_mk/OMD pages (any IGTF cert)
12
Check_mk for LHCONE/LHCOPN perfSONARs
(Prototype)
(Production)
Access requires x509 credential from IGTF CA
Gives us a good view into where problems still exist
We monitor:
  “Expected” test coverage
  NDT/NPAD running?
  Memory on hosts (<4GB)
  New “version” test
13
Monitoring Metrics
Use MaDDash to view metric summaries
  Provide quick view about how networks are working
  OSG hosts production instance
Metrics are displayed via source-destination matrix
  Multiple dashboards (meshes) can be selected
  Custom menus link to relevant resources
New release (2.0) will incorporate MadAlert
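A source-destination matrix of this kind can be sketched as follows. The thresholds, status names and host names here are assumptions chosen for the example; they are not MaDDash's actual configuration or code.

```python
def classify(loss_pct, warn=0.1, crit=1.0):
    """Map a packet-loss percentage to a status (thresholds are hypothetical)."""
    if loss_pct is None:
        return "UNKNOWN"  # no measurement available for this pair
    if loss_pct >= crit:
        return "CRITICAL"
    if loss_pct >= warn:
        return "WARNING"
    return "OK"


def status_matrix(hosts, loss):
    """Rows are sources, columns are destinations; the diagonal is left empty."""
    return [
        ["" if s == d else classify(loss.get((s, d))) for d in hosts]
        for s in hosts
    ]


# Hypothetical hosts and loss measurements (percent), keyed by (source, destination).
hosts = ["a.example.org", "b.example.org", "c.example.org"]
loss = {
    ("a.example.org", "b.example.org"): 0.0,
    ("b.example.org", "a.example.org"): 2.5,
    ("a.example.org", "c.example.org"): 0.3,
}

matrix = status_matrix(hosts, loss)
for source, row in zip(hosts, matrix):
    print(source, row)
```

Pairs with no data show up as UNKNOWN, which is how missing measurements stay visible rather than silently disappearing.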
14
Evolution of LHCOPN/LHCONE Monitoring
As usual we will show how the monitoring in MaDDash has changed since the last meeting
We have two known problems with LHCONE instances from GEANT and Internet2
  GEANT instance in Amsterdam was recently upgraded to perfSONAR v3.5.1 BUT there is a problem writing to the updated Esmond
  The Internet2 instances are “multi-purpose” and have an MA which uses a different FQDN/IP than the LHCONE measurement interface. The current mesh-config isn’t set up to handle this configuration. Additionally there may be some problems with these v3.4.1 instances
MA – Measurement Archive
15
LHCONE MaDDash – 27 Oct 2015
Some issues getting data from Internet2/GEANT instances we need to look into
16
LHCONE MaDDash – 11 Mar 2016
Things are looking a bit worse. We have known issues with the AMS_GEANT and Internet2 instances that are being worked on.
Real issues into IN2P3 as well as problems outbound? Should be investigated.
17
LHCOPN MaDDash – 27 Oct 2015
Some firewall problems for the OSG collector from FNAL. Setup being examined at INFN and PIC.
Patches to address IPv6/IPv4 and http/https put in yesterday and things are broken. Should be fixed later today.
18
LHCOPN MaDDash – 11 Mar 2016
RAL and TRIUMF showing signs of continuing network problems.
Latency mesh improved. BW mesh still shows many issues.
Kisti still has BW problems.
19
Existing Tools
We have a number of tools available to help debug and understand network problems.
There are very good presentations on these tools in the training materials provided by perfSONAR:
While I don’t have time to cover all the details (see and especially the Measurement Tools, Use Cases and Debugging presentations from Jason Zurawski), I do want to note that command line tools exist to allow you to create on-demand 3rd party tests (between two remote instances) for bandwidth, latency and traceroute.
Follow the debugging strategy as a guide to finding and fixing LHCONE/LHCOPN network issues using perfSONAR capabilities.
As for new tools….
20
ATLAS Network Metrics Pipeline
Ilija Vukotic, Kaushik De, Rob Gardner and Jorge Batista are working with the Network and Transfer Metrics WG to make perfSONAR metrics available to PANDA
  See Ilija’s presentation at
Pipeline: OSG Network Datastore -> CERN ActiveMQ -> Flume -> ES -> PANDA
Prototype working and analytics being performed in ElasticSearch to validate data (see following slide)
Working on a network source-destination cost-matrix PANDA can use to evaluate options
  Interface details being discussed with PANDA team
Could also be used to analyze LHCONE/LHCOPN data!
21
perfSONAR Data into ElasticSearch
[Plots: Avg src loss %, Avg dst loss]
for example plots using WLCG data
22
MadAlert: A project to analyze meshes
Gabriele Carcassi has been working with me on creating a new utility to analyze meshes: MadAlert
  See details at
  You can see meshes and reports from the page
  Reports find both infrastructure and network problems
We are now working with Andy Lake/ESnet to incorporate this into the next major release of MaDDash (v2.0)
Now testing a “diff” to allow us to compare meshes; e.g., IPv4 vs IPv6, testing vs production, mesh(t1) vs mesh(t2)
  Could be really helpful for understanding new software versions or changes in time.
  Time based comparison will require some modifications to MaDDash to allow specifying time-based meshes.
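The mesh "diff" idea can be illustrated with a small sketch that compares the status of each source-destination cell in two meshes. This is not MadAlert's implementation; the data layout (a dictionary keyed by host pair), the status names and the "MISSING" marker are assumptions made for the example.

```python
def diff_meshes(mesh_a, mesh_b):
    """Return {(source, destination): (status_a, status_b)} for cells that differ.

    A cell present in only one mesh is reported with "MISSING" on the other side.
    """
    changes = {}
    for pair in sorted(set(mesh_a) | set(mesh_b)):
        status_a = mesh_a.get(pair, "MISSING")
        status_b = mesh_b.get(pair, "MISSING")
        if status_a != status_b:
            changes[pair] = (status_a, status_b)
    return changes


# Hypothetical IPv4 and IPv6 views of the same three tests.
mesh_ipv4 = {
    ("site-a", "site-b"): "OK",
    ("site-b", "site-a"): "OK",
    ("site-a", "site-c"): "WARNING",
}
mesh_ipv6 = {
    ("site-a", "site-b"): "OK",
    ("site-b", "site-a"): "CRITICAL",
}

changes = diff_meshes(mesh_ipv4, mesh_ipv6)
for pair, (v4, v6) in changes.items():
    print(pair, "IPv4:", v4, "IPv6:", v6)
```

The same function works for testing vs production or for two snapshots in time, provided both meshes use the same host naming.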
23
Understanding Network Topology
Can we create tools to manipulate, visualize, compare and analyze network topologies from the OSG network datastore contents?
Can we build upon these tools to create a set of next-generation network diagnostic tools to make debugging network problems easier, quicker and more accurate?
Even without requiring the ability to perform complicated data analysis and correlation, basic tools developed in the area of network topology-based metric visualization would be very helpful in letting users and network engineers better understand what is happening in our networks.
This area is under active investigation in various projects. Lots of work to do here.
24
Exploring Path Analysis
We can simplify the graph by aggregating nodes that belong to same NREN (visual debugging)
We can correlate paths with packet-loss/latency information (PuNDIT)
[Diagram labels: DFN, JANET, GEANT, RAL, Aachen, ITEP, QMUL; latency, packet-loss, throughput]
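Aggregating the hops of a traceroute by the network they belong to can be sketched as below. The hop names and the hop-to-network mapping are invented for illustration; in practice the mapping would have to come from some external source such as AS or domain information.

```python
from itertools import groupby


def collapse_path(hops, network_of):
    """Replace each hop by its network and merge consecutive duplicates.

    Hops without a known network are kept under their own name.
    """
    networks = [network_of.get(hop, hop) for hop in hops]
    return [network for network, _ in groupby(networks)]


# Hypothetical traceroute from RAL to Aachen and a hypothetical mapping.
hops = ["ral-gw", "janet-1", "janet-2", "geant-lon", "geant-fra", "dfn-1", "aachen-gw"]
network_of = {
    "ral-gw": "RAL",
    "janet-1": "JANET",
    "janet-2": "JANET",
    "geant-lon": "GEANT",
    "geant-fra": "GEANT",
    "dfn-1": "DFN",
    "aachen-gw": "Aachen",
}

path = collapse_path(hops, network_of)
print(" -> ".join(path))
```

Only consecutive hops are merged, so a path that leaves a network and later re-enters it still shows both visits.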
25
WLCG Support Unit
Reminder: We have a GGUS support unit (WLCG Network Throughput), used to report incidents (mailing list: wlcg-network-throughput at cern.ch)
Experiments can report potential network performance incidents.
  WLCG perfSONAR support investigates and confirms whether this is a network-related issue.
  Once confirmed, it will notify relevant sites and will try to assist in narrowing down the problem to particular link(s).
  Tracking of ongoing incidents will be via the WG page.
Sites observing a network performance problem should follow their standard procedure, i.e. report to their network team and if necessary escalate to their network provider.
  If confirmed to be WAN related, WLCG perfSONAR support unit can assist in further debugging.
For the non-technical (policy) issues, sites should escalate to the WLCG operations coordination.
LHCOPN/LHCONE experts are very important in this coordinated activity.
26
Next Steps
We are working on getting ALL WLCG/OSG perfSONAR instances fully operational and properly configured
  We have hints that some perfSONAR services stop or hang under some circumstances. Working with developers to isolate/fix.
  Some hosts are underpowered (<4GB in latency) or broken
As we fix known issues and get to reliable operation, we can free up time to pursue possible issues in the network itself, rather than the framework that gets us network metrics.
We need to plan for a campaign to clear up remaining LHCONE/LHCOPN problems. Currently working on the LHCONE issues we noted previously.
Need more instances in Asia in the regional R&E networks!!
27
Discussion/Questions/Comments?
28
References
Network Documentation
Deployment documentation for OSG and WLCG hosted in OSG
New MA guide
Modular Dashboard and OMD Prototypes
OSG Production instances for OMD, MaDDash and Datastore
Mesh-config in OSG
New mesh config info:
Send feedback to Soichi
Use-cases document for experiments and middleware