perfSONAR for LHCOPN/LHCONE Update. Shawn McKee/University of Michigan. LHCONE/LHCOPN Meeting, Amsterdam, NL, October 28th, 2015.


1 perfSONAR for LHCOPN/LHCONE Update
Shawn McKee/University of Michigan
LHCONE/LHCOPN Meeting, Amsterdam, NL
October 28th, 2015

2 Overview of Talk
- perfSONAR changes and updates
- WLCG, LHCONE and LHCOPN infrastructure overview
- Status and changes in our meshes
- Some new tools
  - ElasticSearch, MadAlert and topology explorations for our data
- Summary and discussion
LHCONE-Amsterdam, October 28, 2015

3 perfSONAR v3.5 Toolkit
- perfSONAR v3.5 was released on the 28th of September
- Main themes for this release:
  - Support for central host management and node auto-configuration
  - Support for low-cost nodes
  - Support for Debian, VMs, and other installation options
  - Modernized GUIs
- In addition, v3.5 incorporates feedback and bug fixes from our WLCG/OSG deployments, improving robustness.
- WLCG/OSG deployment status as of today (great progress):
  - 3.4.1: 6
  - 3.4.2: 23
  - 3.5: 2
  - 3.5.0: 195
  - Unknown: 18 (these nodes are either down or hung)
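As a quick check on the deployment status, the listed counts can be tallied to see what share of instances already runs 3.5 or later. This is only a sketch: the counts are copied from the slide, while the helper name `is_at_least_3_5` and the decision to count "Unknown" hosts in the total are assumptions made here.

```python
# Version counts as listed on the slide (WLCG/OSG deployment status).
version_counts = {"3.4.1": 6, "3.4.2": 23, "3.5": 2, "3.5.0": 195, "Unknown": 18}


def is_at_least_3_5(version):
    """Return True for purely numeric version strings that are >= 3.5."""
    parts = version.split(".")
    if not all(p.isdigit() for p in parts):
        return False  # e.g. "Unknown"
    return tuple(int(p) for p in parts) >= (3, 5)


total = sum(version_counts.values())
up_to_date = sum(n for v, n in version_counts.items() if is_at_least_3_5(v))
share = up_to_date / total

print(f"{up_to_date} of {total} instances ({share:.1%}) run 3.5 or later")
```

With the numbers above, 197 of the 244 listed instances run 3.5 or later.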

4 New perfSONAR Deployment Options
- Configuration-managed deployments via bundles (see http://docs.perfsonar.net/install_options.html)
  - perfSONAR Tools (just tools)
  - perfSONAR TestPoint (passive, no MA, i.e. no Measurement Archive)
  - perfSONAR Core (+MA)
  - perfSONAR Complete (+Web and Toolkit Configuration)
  - perfSONAR Central Management (MaDDash, Auto-config, Centralized config service)
- Low-cost nodes to support large-scale deployment (http://docs.perfsonar.net/low_cost_nodes.html)
  - $100-200 range should enable broad deployment
  - Small form factor enables more locations
  - Some limitations in capabilities due to hardware
- VMs: still not recommended but possible
  - Target: whole-node VMs, VMs with dedicated physical NICs
  - Main use: “end-to-end” infrastructure testing (not network)
  - What about Docker?
- http://www.perfsonar.net/deploy/installation-and-configuration/

5 Current perfSONAR Deployment
- 278 perfSONAR instances registered in GOCDB/OIM
- 245 active perfSONAR instances
- 197 running the latest version (3.5+)
- Initial deployment coordinated by the WLCG perfSONAR TF
- Commissioning of the network followed by the WLCG Network and Transfer Metrics WG
- See http://grid-monitoring.cern.ch/perfsonar_report.txt for stats
- https://www.google.com/fusiontables/DataSource?docid=1QT4r17HEufkvnqhJu24nIptZ66XauYEIBWWh5Kpa#map:id=3

6 Overview of perfSONAR Pipeline
The diagram on the right provides a high-level view of how WLCG/OSG is managing our perfSONAR deployments, gathering metrics and making them available for use. We will cover some of the details in what follows.

7 Gathering & Storing Metrics
- OSG is providing network metric data for its members and WLCG via the Network Datastore
  - The data is gathered from all WLCG/OSG perfSONAR instances
  - Stored indefinitely on OSG hardware
  - Made available via API
  - In production since September 14th
- The primary use cases:
  - Network problem identification and localization
  - Network-related decision support
  - Network baseline: set expectations and identify weak points for upgrading
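To make "available via API" concrete, here is a minimal sketch of how a query URL against the datastore's esmond archive endpoint might be built. The base URL is the one this talk lists under the OSG production instances; the filter names (`source`, `destination`, `event-type`, `time-range`) follow my reading of the esmond client guide and should be treated as assumptions to verify there, and the host and event type in the example are illustrative only.

```python
from urllib.parse import urlencode

# Archive endpoint as listed in the talk's references (availability not guaranteed).
ARCHIVE_URL = "http://psds.grid.iu.edu/esmond/perfsonar/archive/"


def build_query(source=None, destination=None, event_type=None, time_range=None):
    """Build a metadata query URL; filter names are assumed, not confirmed by the talk."""
    params = {"format": "json"}
    if source:
        params["source"] = source
    if destination:
        params["destination"] = destination
    if event_type:
        params["event-type"] = event_type
    if time_range:
        params["time-range"] = time_range  # seconds back from now
    return ARCHIVE_URL + "?" + urlencode(params)


url = build_query(source="psum02.aglt2.org",
                  event_type="packet-loss-rate",
                  time_range=3600)
print(url)
```

The resulting URL could then be fetched with any HTTP client and decoded as JSON; the fetch is left out so the sketch stays runnable without network access.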

8 OSG Network Datastore Diagram
- OSG is gathering relevant metrics from the complete set of OSG and WLCG perfSONAR instances
- Operating now
- Running VMs on dedicated hardware
- Data also published to the CERN ActiveMQ instance and available for user subscription
- Actively tuning and debugging
[Diagram annotations: 8 VMs; storage must host 7 distinct areas]

9 Changes for LHCOPN/LHCONE
- We have changed to uni-directional tests for OWAMP to reduce the load
  - The source host is responsible for initiating tests to each destination and recording the results
- We are using iperf3 as the baseline for bandwidth measurements (adds retry information)
- A recent fix for NDT ensures that the TCP congestion-control algorithm is ‘htcp’ rather than ‘reno’ when NDT and NPAD are not in use. This should improve bandwidth results.
- Plan to gather all the LHCOPN and LHCONE data into ElasticSearch (ongoing)

10 Existing Test Coverage
- Current perfSONAR measurement coverage for WLCG/OSG:
  - Full latency (one direction only, 10 Hz, OWAMP, IPv4)
  - Full traceroute (bi-directional, hourly, BWCTL/OWAMP, IPv4, IPv6)
  - Full bandwidth (one direction only, fortnightly, BWCTL-only!, IPv4, IPv6)
- Regional meshes are still disabled; we need to discuss how to evolve them
  - We can create any sub-mesh of the full latency mesh (for free, but only IPv4 and using the same parameters)
  - We could move from regional to bigger meshes (European, Asia/Pacific, US)
  - We can create new bandwidth meshes, as bwctl needs fewer resources (but only for BWCTL-only nodes, not on dual nodes)
- We re-enabled project meshes
  - Belle II: both latency and bandwidth
  - Dual-stack: just bandwidth (both IPv4 and IPv6)
  - LHCONE/LHCOPN: these need to be separately tracked
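Two points from this slide lend themselves to a short sketch: what a 10 Hz latency test means in packets per path, and why a sub-mesh of the full latency mesh comes "for free" (it is just a filter over results that already exist). The data layout, host names and loss values below are hypothetical and only illustrate the idea.

```python
PACKETS_PER_SECOND = 10          # latency tests run at 10 Hz
SECONDS_PER_DAY = 24 * 60 * 60


def packets_per_day(n_paths):
    """Test packets sent per day over n one-way paths at 10 Hz."""
    return n_paths * PACKETS_PER_SECOND * SECONDS_PER_DAY


def sub_mesh(full_mesh, members):
    """Select the cells of a full mesh whose source and destination are both members."""
    members = set(members)
    return {(src, dst): value
            for (src, dst), value in full_mesh.items()
            if src in members and dst in members}


# Hypothetical full-mesh packet-loss results keyed by (source, destination).
full_mesh = {
    ("site-a", "site-b"): 0.0,
    ("site-b", "site-a"): 0.001,
    ("site-a", "site-c"): 0.02,
    ("site-c", "site-a"): 0.0,
}

print(packets_per_day(1))
print(sub_mesh(full_mesh, ["site-a", "site-b"]))
```

No additional tests are scheduled for the sub-mesh, which is also why it inherits the full mesh's parameters (IPv4 only, same settings).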

11 OMD for LHCONE/LHCOPN perfSONARs
- https://maddash.aglt2.org/WLCGperfSONAR/check_mk/ (Prototype)
- https://psomd.grid.iu.edu/WLCGperfSONAR/check_mk/ (Production)
- We monitor:
  - “Expected” test coverage
  - NDT/NPAD running?
  - Memory on hosts (<4GB)
  - New “version” test
- Access requires an X.509 credential from an IGTF CA
- Gives us a good view into where problems still exist
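Two of the monitored conditions, host memory below 4 GB and the perfSONAR version, can be expressed as a simple per-host check. This is not the Check_MK implementation used in OMD; the record layout, function names and the choice of 3.5 as the minimum version are assumptions for illustration.

```python
def parse_version(version):
    """Turn '3.4.2' into (3, 4, 2); return None if the string is not purely numeric."""
    parts = version.split(".")
    if not all(p.isdigit() for p in parts):
        return None
    return tuple(int(p) for p in parts)


def check_host(record, min_memory_gb=4, min_version=(3, 5)):
    """Return a list of problems for one host record (hypothetical layout)."""
    problems = []
    if record["memory_gb"] < min_memory_gb:
        problems.append("low memory")
    version = parse_version(record["version"])
    if version is None:
        problems.append("version unknown")
    elif version < min_version:
        problems.append("old version")
    return problems


print(check_host({"host": "ps-lat.example.org", "memory_gb": 2, "version": "3.4.2"}))
print(check_host({"host": "ps-bw.example.org", "memory_gb": 8, "version": "3.5.0"}))
```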

12 OMD Hostgroup Summary LHCOPN/LHCONE

13 perfSONAR Monitoring Pages
- We have 3 versions of our perfSONAR monitoring pages
  - Prototype at maddash.aglt2.org (intending to phase this out soon)
  - Testing at OSG’s ITB instance
  - Production at OSG’s production instance
- Main monitoring types are MaDDash and OMD/Check_MK
  - Prototype: http://maddash.aglt2.org/maddash-webui and https://maddash.aglt2.org/WLCGperfSONAR/check_mk
  - Testing: http://perfsonar-itb.grid.iu.edu/maddash-webui/ and https://perfsonar-itb.grid.iu.edu/WLCGperfSONAR/check_mk/
  - Production: http://psmad.grid.iu.edu/maddash-webui/ and https://psomd.grid.iu.edu/WLCGperfSONAR/check_mk
- Notes:
  - OSG instances rely upon the OSG Datastore: http://psds.grid.iu.edu
  - X.509 cert needed to view check_mk/OMD pages (any IGTF cert)

14 Monitoring Metrics
- Use MaDDash to view metric summaries
  - Provides a quick view of how networks are working
  - OSG hosts the production instance: http://psmad.grid.iu.edu/maddash-webui/
- Metrics are displayed via a source-destination matrix
- Multiple dashboards (meshes) can be selected
- Custom menus link to relevant resources
- New release (2.0) will incorporate MadAlert: http://maddash.aglt2.org/madalert.html

15 LHCONE MaDDash – 01 Jun 2015
Things looked pretty good. Some missing measurements (orange) but not much packet loss seen. Bandwidth measurements so-so with some problems indicated.

16 LHCONE MaDDash – 27 Oct 2015
Some issues getting data from Internet2/GEANT instances that we need to look into.

17 LHCOPN MaDDash – 01 Jun 2015
Latency had some issues: RAL showed signs of continuing network problems. BW mesh similar. KISTI still has problems; “red” throughput is worth examining.

18 LHCOPN MaDDash – 27 Oct 2015
Some firewall problems for the OSG collector from FNAL. Setup being examined at INFN and PIC. Patches to address IPv6/IPv4 and http/https were put in yesterday and things are broken. Should be fixed later today.

19 Existing Tools
- We have a number of tools available to help debug and understand network problems.
- There are very good presentations on these tools in the training materials provided by perfSONAR: http://www.perfsonar.net/about/training-materials/
- While I don’t have time to cover all the details (see http://www.perfsonar.net/about/training-materials/201507-ps-training/ and especially the Measurement Tools, Use Cases and Debugging presentations from Jason Zurawski), I do want to note that command-line tools exist that let you create on-demand 3rd-party tests (between two remote instances) for bandwidth, latency and traceroute.
- Follow the debugging strategy as a guide to finding and fixing LHCONE/LHCOPN network issues using perfSONAR capabilities
- As for new tools…

20 How to Find Nearest pS Instances
- perfSONAR measures our networks, but how do we find the “right” perfSONAR instance (or metrics) when we need to understand a path?
- We need tools that let us identify which perfSONAR is relevant:
  - provide “glue” between experiments and sonar topology
  - map sonars to storages and vice versa
  - determine existing tools/technologies that can be used
- Looked at different approaches
  - Site-based: join based on common mapping to GOCDB/OIM sites
  - GeoIP: join based on geographical distance (prototyped)
  - Traceroutes: join based on network distance (prototyped)
  - RTT: multilateration based on RTT, i.e. determine position by measuring distance from reference points (perfSONARs)
- GeoIP API prototype exists:
  - http://proximity.cern.ch/api/0.3/geoip/nearest?sonar=psum02.aglt2.org&count=5
  - http://proximity.cern.ch/api/0.3/geoip/nearest?se=lapp-se01.in2p3.fr&count=10 (works with any hostname)
- Interest from GEANT and ESnet to work on an open-source based project
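The GeoIP approach, joining on geographical distance, can be sketched with a great-circle (haversine) distance and a sort. This is not the code behind the proximity prototype; the function names, instance names and coordinates are made up for illustration, and a real service would first resolve hostnames to coordinates through a GeoIP database.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (latitude, longitude) points in degrees."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = radians(lat2 - lat1)
    dlambda = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


def nearest(target, candidates, count=5):
    """Return the names of the `count` candidates closest to target=(lat, lon)."""
    ranked = sorted(candidates,
                    key=lambda c: haversine_km(target[0], target[1], c[1], c[2]))
    return [name for name, _, _ in ranked[:count]]


# Hypothetical perfSONAR instances as (name, latitude, longitude).
sonars = [("far", 0.0, 90.0), ("near", 0.0, 1.0), ("mid", 10.0, 10.0)]
print(nearest((0.0, 0.0), sonars, count=2))
```

Geographical distance is only a proxy for network distance, which is presumably why the traceroute-based and RTT-based joins were explored as well.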

21 ATLAS Network Metrics Pipeline
- Ilija Vukotic, Kaushik De, Rob Gardner and Jorge Batista are working with the Network and Transfer Metrics WG to make perfSONAR metrics available to PANDA
- Pipeline: OSG Network Datastore -> CERN ActiveMQ -> Flume -> ES -> PANDA
- Prototype working and analytics being performed in ElasticSearch to validate data (see following slide)
- Plan is to create a network source-destination cost matrix that PANDA can use to evaluate options
- Actual interface details are being discussed with the PANDA team
- Can also be used to analyze LHCONE/LHCOPN data!
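As a rough illustration of a source-destination cost matrix, the sketch below averages packet-loss samples per (source, destination) pair and picks the cheapest source for a given destination. The actual cost definition and the PANDA interface were still under discussion, so using mean packet loss as the cost, as well as the record layout and site names, is purely an assumption.

```python
from collections import defaultdict


def cost_matrix(records):
    """Average the loss samples per (source, destination) pair; loss stands in for cost."""
    sums = defaultdict(lambda: [0.0, 0])
    for src, dst, loss in records:
        acc = sums[(src, dst)]
        acc[0] += loss
        acc[1] += 1
    return {pair: total / n for pair, (total, n) in sums.items()}


def cheapest_source(matrix, destination):
    """Return the source with the lowest cost toward destination, or None if unknown."""
    options = {src: cost for (src, dst), cost in matrix.items() if dst == destination}
    return min(options, key=options.get) if options else None


# Hypothetical packet-loss samples: (source, destination, loss fraction).
samples = [("A", "C", 0.0), ("A", "C", 0.02), ("B", "C", 0.5), ("C", "A", 0.0)]
matrix = cost_matrix(samples)
print(matrix)
print(cheapest_source(matrix, "C"))
```

A production cost would likely combine several metrics (loss, latency, throughput); a single metric keeps the sketch short.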

22 perfSONAR Data into ElasticSearch
[Plot panels: Avg src loss, Avg src loss %, Avg dst loss, Avg dst loss %]
See http://tinyurl.com/ogcyqh9 for example plots using WLCG data

23 MadAlert: A new project to analyze meshes
- Gabriele Carcassi has been working with me on creating a new utility to analyze meshes: MadAlert
  - See details at http://madalert.aglt2.org/madalert/index.html
  - You can see meshes and reports from the page
  - Reports find both infrastructure and network problems
- We are now working with Andy Lake/ESnet to incorporate this into the next major release of MaDDash (v2.0)
- Now testing a “diff” to allow us to compare meshes; e.g., IPv4 vs IPv6, testing vs production, mesh(t1) vs mesh(t2)
  - http://madalert.aglt2.org/madalert/testDiff.html
  - Could be really helpful for understanding new software versions or changes over time. Time-based comparison will require some modifications to MaDDash to allow specifying time-based meshes.
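The idea of a mesh "diff" can be sketched as a cell-by-cell comparison of two meshes. This is not MadAlert's implementation; the mesh representation (a dict keyed by source and destination), the status strings and the host names are assumptions chosen for illustration.

```python
def diff_meshes(mesh_a, mesh_b):
    """Return {cell: (status_in_a, status_in_b)} for every cell whose status differs."""
    changes = {}
    for cell in sorted(set(mesh_a) | set(mesh_b)):
        a = mesh_a.get(cell, "missing")
        b = mesh_b.get(cell, "missing")
        if a != b:
            changes[cell] = (a, b)
    return changes


# Hypothetical IPv4 and IPv6 meshes keyed by (source, destination).
ipv4 = {("h1", "h2"): "ok", ("h2", "h1"): "ok", ("h1", "h3"): "warning"}
ipv6 = {("h1", "h2"): "ok", ("h2", "h1"): "critical"}

for cell, (before, after) in diff_meshes(ipv4, ipv6).items():
    print(cell, before, "->", after)
```

The same comparison works for testing vs production, or for two snapshots of one mesh taken at different times.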

24 Understanding Network Topology
- Can we create tools to manipulate, visualize, compare and analyze network topologies from the OSG network datastore contents?
- Can we build upon these tools to create a set of next-generation network diagnostic tools to make debugging network problems easier, quicker and more accurate?
- Even without requiring the ability to perform complicated data analysis and correlation, basic tools developed in the area of network topology-based metric visualization would be very helpful in letting users and network engineers better understand what is happening in our networks.
- Last time I showed some examples of potentially useful components to begin looking at network topology…

25 Exploring Path Analysis
- We can correlate paths with packet-loss/latency information
- We can simplify the graph by aggregating nodes that belong to the same NREN (visual debugging)
[Diagram: paths annotated with latency, packet-loss and throughput; labels: DFN, JANET, GEANT, RAL, Aachen, ITEP, QMUL]
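Aggregating nodes that belong to the same NREN can be sketched as collapsing consecutive traceroute hops that map to the same network. The hop names and the hop-to-network mapping below are invented for illustration; in practice the mapping would have to come from something like AS or domain information.

```python
from itertools import groupby


def collapse_path(hops, network_of):
    """Replace each hop by its network label and merge consecutive identical labels."""
    labels = [network_of.get(hop, hop) for hop in hops]
    return [label for label, _ in groupby(labels)]


# Hypothetical traceroute from QMUL to Aachen and a hypothetical hop-to-NREN map.
hops = ["qmul-gw", "janet-1", "janet-2", "geant-lon", "geant-fra", "dfn-1", "aachen-gw"]
network_of = {
    "janet-1": "JANET", "janet-2": "JANET",
    "geant-lon": "GEANT", "geant-fra": "GEANT",
    "dfn-1": "DFN",
}

print(collapse_path(hops, network_of))
```

Hops without a known network keep their own name, so the end sites remain visible in the simplified path.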

26 WLCG Support Unit
- Reminder: We have a GGUS support unit (WLCG Network Throughput; https://wiki.egi.eu/wiki/GGUS:WLCG_Network_Throughput) used to report incidents (mailing list: wlcg-network-throughput at cern.ch)
- Experiments can report potential network performance incidents.
  - WLCG perfSONAR support investigates and confirms whether the issue is network related.
  - Once confirmed, it will notify relevant sites and will try to assist in narrowing down the problem to particular link(s). Tracking of ongoing incidents will be via the WG page.
- Sites observing a network performance problem should follow their standard procedure, i.e. report to their network team and if necessary escalate to their network provider.
  - If confirmed to be WAN related, the WLCG perfSONAR support unit can assist in further debugging. For non-technical (policy) issues, sites should escalate to WLCG operations coordination.
  - https://twiki.cern.ch/twiki/bin/view/LCG/NetworkTransferMetrics#Network_Performance_Incidents
- LHCOPN/LHCONE experts are very important in this coordinated activity.

27 Next Steps
- We are working on getting ALL WLCG/OSG perfSONAR instances fully operational and properly configured
  - We have hints that some perfSONAR services stop or hang under some circumstances. Working with developers to isolate/fix.
  - Some hosts are underpowered (<4GB in latency) or broken
- There are some known bugs in the data acquisition chain that need fixing. Effort on this is ongoing.
- As we fix known issues and get to reliable operation, we can free up time to pursue possible issues in the network itself, rather than in the framework that gets us network metrics.
- We need to plan for a campaign to clear up remaining LHCONE/LHCOPN problems.
  - Currently working with FNAL, INFN and PIC on some issues

28 References
- Network documentation: https://www.opensciencegrid.org/bin/view/Documentation/NetworkingInOSG
- Deployment documentation for OSG and WLCG hosted in OSG: https://twiki.opensciencegrid.org/bin/view/Documentation/DeployperfSONAR
- New MA guide: http://software.es.net/esmond/perfsonar_client_rest.html
- Modular Dashboard and OMD prototypes:
  - http://maddash.aglt2.org/maddash-webui
  - https://maddash.aglt2.org/WLCGperfSONAR/check_mk
- OSG production instances for OMD, MaDDash and Datastore:
  - http://psmad.grid.iu.edu/maddash-webui/
  - https://psomd.grid.iu.edu/WLCGperfSONAR/check_mk/
  - http://psds.grid.iu.edu/esmond/perfsonar/archive/?format=json
- Mesh-config in OSG: https://oim.grid.iu.edu/oim/meshconfig
- Use-cases document for experiments and middleware: https://docs.google.com/document/d/1ceiNlTUJCwSuOuvbEHZnZp0XkWkwdkPQTQic0VbH1mc/edit

29 Discussion/Questions/Comments?

30 Discussion Topics
- What would you like to see as next steps?
- Are the tools sufficient to help us with our goals?
- Any changes in coverage/tests needed?
- Should we push forward on using our metrics to start addressing issues in the network? (Who is “we”?)

