Published by Christian Patrick. Modified over 9 years ago.
1
perfSONAR for LHCOPN/LHCONE Update
Shawn McKee, University of Michigan
LHCONE/LHCOPN Meeting, Amsterdam, NL
October 28th, 2015
2
Overview of Talk
- perfSONAR changes and updates
- WLCG, LHCONE and LHCOPN infrastructure overview
- Status and changes in our meshes
- Some new tools
  - ElasticSearch, MadAlert and topology explorations for our data
- Summary and discussion
3
perfSONAR v3.5 Toolkit
- perfSONAR v3.5 released on the 28th of September
- Main themes for this release:
  - Support for central host management and node auto-configuration
  - Support for low-cost nodes
  - Support for Debian, VMs, and other installation options
  - Modernize the GUIs
- In addition, v3.5 incorporates feedback and bugfixes from our WLCG/OSG deployments, improving robustness.
- WLCG/OSG deployment status as of today (great progress):
  - 3.4.1: 6
  - 3.4.2: 23
  - 3.5: 2
  - 3.5.0: 195
  - Unknown: 18 (these nodes are either down or hung)
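As a quick check on the deployment figures above, the share of hosts already on the 3.5 series can be computed from the listed counts. This is a minimal sketch; the function name `share_on_release` is illustrative and not part of any perfSONAR tooling.

```python
# Host counts per reported toolkit version, as listed on the slide.
counts = {"3.4.1": 6, "3.4.2": 23, "3.5": 2, "3.5.0": 195, "Unknown": 18}


def share_on_release(counts, prefix):
    """Return (hosts whose version starts with prefix, total hosts, percentage)."""
    total = sum(counts.values())
    matching = sum(n for version, n in counts.items() if version.startswith(prefix))
    return matching, total, 100.0 * matching / total


matching, total, pct = share_on_release(counts, "3.5")
print(f"{matching} of {total} hosts report a 3.5.x version ({pct:.1f}%)")
```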
4
New perfSONAR Deployment Options
- Configuration-managed deployments via bundles (see http://docs.perfsonar.net/install_options.html)
  - perfSONAR Tools (just tools)
  - perfSONAR TestPoint (passive, no MA)
  - perfSONAR Core (+MA)
  - perfSONAR Complete (+Web and Toolkit Configuration)
  - perfSONAR Central Management (MaDDash, auto-config, centralized config service)
- Low-cost nodes to support large-scale deployment (http://docs.perfsonar.net/low_cost_nodes.html)
  - $100-200 range should enable broad deployment
  - Small form factor enables more locations
  - Some limitations in capabilities due to hardware
- VMs: still not recommended but possible
  - Target: whole-node VMs, VMs with dedicated physical NICs
  - Main use: “end-to-end” infrastructure testing (not network)
- What about Docker?
- http://www.perfsonar.net/deploy/installation-and-configuration/
5
Current perfSONAR Deployment
- 278 perfSONAR instances registered in GOCDB/OIM
- 245 active perfSONAR instances
- 197 running the latest version (3.5+)
- Initial deployment coordinated by the WLCG perfSONAR TF
- Commissioning of the network followed by the WLCG Network and Transfer Metrics WG
- http://grid-monitoring.cern.ch/perfsonar_report.txt for stats
- https://www.google.com/fusiontables/DataSource?docid=1QT4r17HEufkvnqhJu24nIptZ66XauYEIBWWh5Kpa#map:id=3
6
Overview of perfSONAR Pipeline
The diagram on the right provides a high-level view of how WLCG/OSG manages its perfSONAR deployments, gathers metrics and makes them available for use. We will cover some of the details in what follows.
7
Gathering & Storing Metrics
- OSG is providing network metric data for its members and WLCG via the Network Datastore
  - The data is gathered from all WLCG/OSG perfSONAR instances
  - Stored indefinitely on OSG hardware
  - Made available via API
  - In production since September 14th
- The primary use cases:
  - Network problem identification and localization
  - Network-related decision support
  - Network baseline: set expectations and identify weak points for upgrading
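To illustrate what "made available via API" could look like in practice, here is a minimal sketch that builds a metadata query URL for the datastore archive endpoint listed on the References slide. The filter names (`source`, `destination`, `event-type`, `time-range`) are assumptions taken from the esmond client guide and should be checked against it; the helper `build_metadata_query` is hypothetical, and no request is actually sent.

```python
from urllib.parse import urlencode

# Archive endpoint as listed on the References slide of this talk.
ARCHIVE_URL = "http://psds.grid.iu.edu/esmond/perfsonar/archive/"


def build_metadata_query(source=None, destination=None, event_type=None, time_range=None):
    """Return a metadata query URL for the archive.

    The filter names used here are assumptions based on the esmond
    client guide; verify them before relying on this.
    """
    params = {"format": "json"}
    if source:
        params["source"] = source
    if destination:
        params["destination"] = destination
    if event_type:
        params["event-type"] = event_type
    if time_range:
        params["time-range"] = str(time_range)  # seconds back from now
    return ARCHIVE_URL + "?" + urlencode(params)


# Example: packet-loss metadata for tests sourced at one host over the last day.
url = build_metadata_query(source="psum02.aglt2.org",
                           event_type="packet-loss-rate",
                           time_range=86400)
print(url)
```

The returned URL could then be fetched with any HTTP client and the JSON response parsed for the per-test data links.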
8
OSG Network Datastore Diagram
- OSG is gathering relevant metrics from the complete set of OSG and WLCG perfSONAR instances
- Operating now
- Running VMs on dedicated hardware
- Data also published to the CERN ActiveMQ instance and available for user subscription
- Actively tuning and debugging
[Diagram labels: 8 VMs; storage must host 7 distinct areas]
9
Changes for LHCOPN/LHCONE
- We have changed to uni-directional tests for OWAMP to reduce the load
  - The source host is responsible for initiating tests to each destination and recording the results
- We are using iperf3 as the baseline for bandwidth measurements (adds retry information)
- A recent fix for NDT ensured that the TCP congestion control algorithm is ‘htcp’ rather than ‘reno’ when NDT and NPAD are not in use. This should improve bandwidth results.
- Plan to gather all the LHCOPN and LHCONE data into ElasticSearch (ongoing)
10
Existing Test Coverage
- Current perfSONAR measurement coverage for WLCG/OSG:
  - Full latency (one direction only, 10 Hz, OWAMP, IPv4)
  - Full traceroute (bi-directional, hourly, BWCTL/OWAMP, IPv4, IPv6)
  - Full bandwidth (one direction only, fortnightly, BWCTL-only!, IPv4, IPv6)
- Regional meshes still disabled; we need to discuss how to evolve them
  - We can create any sub-mesh of the full latency mesh (for free, but only IPv4 and using the same parameters)
  - We could move from regional to bigger meshes (European, Asia/Pacific, US)
  - We can create new bandwidth meshes, as bwctl needs fewer resources (but only for BWCTL-only nodes, not on dual nodes)
- We re-enabled project meshes
  - Belle II: both latency and bandwidth
  - Dual-stack: just bandwidth (both IPv4 and IPv6)
  - LHCONE/LHCOPN: these need to be tracked separately
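The point that a sub-mesh of the full latency mesh comes "for free" can be illustrated with a small sketch: since the full mesh already holds a result for every pair, a sub-mesh is just a filter over existing data. The host names and loss values below are made up for illustration, and `sub_mesh` is a hypothetical helper rather than part of the mesh-config tooling.

```python
def sub_mesh(full_mesh, members):
    """Keep only the cells whose source and destination are both in members."""
    members = set(members)
    return {pair: value for pair, value in full_mesh.items()
            if pair[0] in members and pair[1] in members}


# Illustrative packet-loss fractions keyed by (source, destination).
full_mesh = {
    ("t1-a.example.org", "t1-b.example.org"): 0.0,
    ("t1-b.example.org", "t1-a.example.org"): 0.001,
    ("t1-a.example.org", "t2-c.example.org"): 0.0,
    ("t2-c.example.org", "t1-b.example.org"): 0.02,
}

tier1_only = sub_mesh(full_mesh, ["t1-a.example.org", "t1-b.example.org"])
print(tier1_only)
```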
11
OMD for LHCONE/LHCOPN perfSONARs
- https://maddash.aglt2.org/WLCGperfSONAR/check_mk/ (Prototype)
- https://psomd.grid.iu.edu/WLCGperfSONAR/check_mk/ (Production)
- We monitor:
  - “Expected” test coverage
  - NDT/NPAD running?
  - Memory on hosts (<4GB)
  - New “version” test
- Access requires an X.509 credential from an IGTF CA
- Gives us a good view into where problems still exist
12
OMD Hostgroup Summary LHCOPN/LHCONE
13
perfSONAR Monitoring Pages
- We have 3 versions of our perfSONAR monitoring pages
  - Prototype at maddash.aglt2.org (intending to phase this out soon)
  - Testing at OSG’s ITB instance
  - Production at OSG’s production instance
- Main monitoring types are MaDDash and OMD/Check_MK
  - Prototype: http://maddash.aglt2.org/maddash-webui and https://maddash.aglt2.org/WLCGperfSONAR/check_mk
  - Testing: http://perfsonar-itb.grid.iu.edu/maddash-webui/ and https://perfsonar-itb.grid.iu.edu/WLCGperfSONAR/check_mk/
  - Production: http://psmad.grid.iu.edu/maddash-webui/ and https://psomd.grid.iu.edu/WLCGperfSONAR/check_mk
- Notes:
  - OSG instances rely upon the OSG Datastore: http://psds.grid.iu.edu
  - X.509 cert needed to view check_mk/OMD pages (any IGTF cert)
14
Monitoring Metrics
- Use MaDDash to view metric summaries
  - Provides a quick view of how networks are working
  - OSG hosts the production instance: http://psmad.grid.iu.edu/maddash-webui/
- Metrics are displayed via a source-destination matrix
- Multiple dashboards (meshes) can be selected
- Custom menus link to relevant resources
- New release (2.0) will incorporate MadAlert: http://maddash.aglt2.org/madalert.html
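The source-destination matrix idea can be sketched in a few lines: each cell holds a status derived from the measurement for that ordered pair, and pairs with no data show up as unknown. This is only an illustration of the display concept, not MaDDash's implementation; the thresholds, status names and host labels are assumptions.

```python
def classify_loss(loss_fraction, warn=0.0, crit=0.01):
    """Map a packet-loss fraction to a dashboard-style status (thresholds assumed)."""
    if loss_fraction is None:
        return "UNKNOWN"      # no measurement available for this pair
    if loss_fraction >= crit:
        return "CRITICAL"
    if loss_fraction > warn:
        return "WARNING"
    return "OK"


def build_matrix(hosts, measurements):
    """Return one row per source and one column per destination."""
    return [["-" if src == dst else classify_loss(measurements.get((src, dst)))
             for dst in hosts]
            for src in hosts]


hosts = ["A", "B", "C"]
# Illustrative loss fractions; missing pairs stand for missing measurements.
measurements = {("A", "B"): 0.0, ("B", "A"): 0.02, ("A", "C"): 0.001}

matrix = build_matrix(hosts, measurements)
for src, row in zip(hosts, matrix):
    print(src, row)
```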
15
LHCONE MaDDash – 01 Jun 2015
Things looked pretty good. Some missing measurements (orange) but not much packet loss seen. Bandwidth measurements so-so, with some problems indicated.
16
LHCONE MaDDash – 27 Oct 2015
Some issues getting data from Internet2/GEANT instances that we need to look into.
17
LHCOPN MaDDash – 01 Jun 2015
Latency had some issues: RAL showed signs of continuing network problems. BW mesh similar. KISTI still has problems; “red” throughput is worth examining.
18
LHCOPN MaDDash – 27 Oct 2015
- Some firewall problems for the OSG collector from FNAL.
- Setup being examined at INFN and PIC.
- Patches to address IPv6/IPv4 and http/https were put in yesterday and things are broken. Should be fixed later today.
19
Existing Tools
- We have a number of tools available to help debug and understand network problems. There are very good presentations on these tools in the training materials provided by perfSONAR: http://www.perfsonar.net/about/training-materials/
- While I don’t have time to cover all the details (see http://www.perfsonar.net/about/training-materials/201507-ps-training/ and especially the Measurement Tools, Use Cases and Debugging presentations from Jason Zurawski), I do want to note that command-line tools exist that allow you to create on-demand third-party tests (between two remote instances) for bandwidth, latency and traceroute.
- Follow the debugging strategy as a guide to finding and fixing LHCONE/LHCOPN network issues using perfSONAR capabilities
- As for new tools…
20
How to Find Nearest pS Instances
- perfSONAR measures our networks, but how do we find the “right” perfSONAR instance (or metrics) when we need to understand a path?
- We need tools that let us identify which perfSONAR is relevant:
  - provide “glue” between experiments and sonar topology
  - map sonars to storages and vice versa
  - determine existing tools/technologies that can be used
- Looked at different approaches
  - Site-based: join based on common mapping to GOCDB/OIM sites
  - GeoIP: join based on geographical distance (prototyped)
  - Traceroutes: join based on network distance (prototyped)
  - RTT: multilateration based on RTT; determine position by measuring distance from reference points (perfSONARs)
- GeoIP API prototype exists:
  - http://proximity.cern.ch/api/0.3/geoip/nearest?sonar=psum02.aglt2.org&count=5
  - http://proximity.cern.ch/api/0.3/geoip/nearest?se=lapp-se01.in2p3.fr&count=10 (works with any hostname)
- Interest from GEANT and ESnet to work on an open-source based project
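The GeoIP approach (join on geographical distance) can be sketched with a great-circle distance and a simple ranking. This is not the prototype's code: the host names and coordinates are hypothetical, and a real service would first resolve each hostname to coordinates through a GeoIP database.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0


def haversine_km(a, b):
    """Great-circle distance in km between two (latitude, longitude) points in degrees."""
    lat1, lon1 = map(radians, a)
    lat2, lon2 = map(radians, b)
    h = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(h))


def nearest(target, sonars, count):
    """Return the names of the `count` sonars closest to the target location."""
    ranked = sorted(sonars, key=lambda name: haversine_km(target, sonars[name]))
    return ranked[:count]


# Hypothetical sonar hosts with approximate (lat, lon) coordinates.
sonars = {
    "sonar-geneva.example.org": (46.2, 6.1),
    "sonar-amsterdam.example.org": (52.4, 4.9),
    "sonar-annarbor.example.org": (42.3, -83.7),
    "sonar-tokyo.example.org": (35.7, 139.7),
}

storage_location = (45.9, 6.1)  # hypothetical storage element location
closest = nearest(storage_location, sonars, count=2)
print(closest)
```

Geographical distance is only a proxy; the traceroute and RTT approaches listed above aim at network distance, which can differ substantially.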
21
ATLAS Network Metrics Pipeline
- Ilija Vukotic, Kaushik De, Rob Gardner and Jorge Batista are working with the Network and Transfer Metrics WG to make perfSONAR metrics available to PANDA
- Pipeline: OSG Network Datastore -> CERN ActiveMQ -> Flume -> ES -> PANDA
- Prototype working, and analytics being performed in ElasticSearch to validate data (see following slide)
- Plan is to create a network source-destination cost matrix that PANDA can use to evaluate options
- Actual interface details being discussed with the PANDA team
- Can also be used to analyze LHCONE/LHCOPN data!
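A source-destination cost matrix of the kind planned here could be derived from latency and loss metrics along these lines. The cost formula, the `loss_penalty` weight, the site names and the sample values are all hypothetical; the actual interface was still under discussion with the PANDA team.

```python
def link_cost(latency_ms, loss_fraction, loss_penalty=100.0):
    """Hypothetical cost: latency inflated in proportion to packet loss."""
    return latency_ms * (1.0 + loss_penalty * loss_fraction)


def cost_matrix(metrics):
    """metrics maps (source, destination) -> (latency_ms, loss_fraction)."""
    return {pair: link_cost(*values) for pair, values in metrics.items()}


def cheapest_source(costs, destination):
    """Pick the source with the lowest cost towards a given destination."""
    candidates = {src: cost for (src, dst), cost in costs.items() if dst == destination}
    return min(candidates, key=candidates.get)


# Illustrative metrics only: (latency in ms, packet-loss fraction).
metrics = {
    ("site-a", "site-c"): (20.0, 0.02),
    ("site-b", "site-c"): (35.0, 0.0),
}

costs = cost_matrix(metrics)
best = cheapest_source(costs, "site-c")
print(costs, best)
```

In this toy example the lossy but closer source ends up more expensive than the clean, more distant one, which is the kind of trade-off such a matrix is meant to expose.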
22
perfSONAR Data into ElasticSearch
[Plot panels: Avg src loss; Avg src loss %; Avg dst loss; Avg dst loss %]
See http://tinyurl.com/ogcyqh9 for example plots using WLCG data
23
MadAlert: A new project to analyze meshes
- Gabriele Carcassi has been working with me on creating a new utility to analyze meshes: MadAlert
  - See details at http://madalert.aglt2.org/madalert/index.html
  - You can see meshes and reports from the page
  - Reports find both infrastructure and network problems
- We are now working with Andy Lake/ESnet to incorporate this into the next major release of MaDDash (v2.0)
- Now testing a “diff” to allow us to compare meshes, e.g., IPv4 vs IPv6, testing vs production, mesh(t1) vs mesh(t2): http://madalert.aglt2.org/madalert/testDiff.html
  - Could be really helpful for understanding new software versions or changes over time.
  - Time-based comparison will require some modifications to MaDDash to allow specifying time-based meshes.
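The mesh "diff" idea can be sketched as a cell-by-cell comparison of two meshes, reporting only the pairs whose status changed. This is an illustration of the concept, not MadAlert's code; the status strings, host labels and the `MISSING` placeholder are assumptions.

```python
def diff_meshes(mesh_a, mesh_b):
    """Return cells whose status differs, as {(src, dst): (status_a, status_b)}."""
    changes = {}
    for pair in sorted(set(mesh_a) | set(mesh_b)):
        status_a = mesh_a.get(pair, "MISSING")  # pair absent from the first mesh
        status_b = mesh_b.get(pair, "MISSING")  # pair absent from the second mesh
        if status_a != status_b:
            changes[pair] = (status_a, status_b)
    return changes


# Illustrative meshes, e.g. the same hosts tested over IPv4 and IPv6.
mesh_ipv4 = {("A", "B"): "OK", ("B", "A"): "OK", ("A", "C"): "WARNING"}
mesh_ipv6 = {("A", "B"): "OK", ("B", "A"): "CRITICAL"}

changes = diff_meshes(mesh_ipv4, mesh_ipv6)
for pair, (before, after) in changes.items():
    print(pair, before, "->", after)
```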
24
Understanding Network Topology
- Can we create tools to manipulate, visualize, compare and analyze network topologies from the OSG network datastore contents?
- Can we build upon these tools to create a set of next-generation network diagnostic tools to make debugging network problems easier, quicker and more accurate?
- Even without the ability to perform complicated data analysis and correlation, basic tools for network topology-based metric visualization would be very helpful in letting users and network engineers better understand what is happening in our networks.
- Last time I showed some examples of potentially useful components to begin looking at network topology…
25
Exploring Path Analysis
[Figure: path graph annotated with latency, packet loss and throughput; node labels include DFN, JANET, GEANT, RAL, Aachen, ITEP and QMUL]
- We can correlate paths with packet-loss/latency information
- We can simplify the graph by aggregating nodes that belong to the same NREN (visual debugging)
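Aggregating nodes that belong to the same NREN can be sketched as relabelling each traceroute hop with its network and collapsing consecutive repeats. The hop names and the hop-to-NREN mapping below are hypothetical; in practice the mapping might come from AS numbers or reverse DNS.

```python
from itertools import groupby


def aggregate_by_nren(hops, hop_to_nren):
    """Replace each hop by its NREN label and merge consecutive identical labels."""
    labels = [hop_to_nren.get(hop, hop) for hop in hops]  # unmapped hops keep their name
    return [label for label, _ in groupby(labels)]


# Hypothetical traceroute from Aachen to RAL.
hops = ["aachen-gw", "dfn-r1", "dfn-r2", "geant-r1", "geant-r2", "janet-r1", "ral-gw"]
hop_to_nren = {
    "dfn-r1": "DFN", "dfn-r2": "DFN",
    "geant-r1": "GEANT", "geant-r2": "GEANT",
    "janet-r1": "JANET",
}

simplified = aggregate_by_nren(hops, hop_to_nren)
print(" -> ".join(simplified))
```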
26
WLCG Support Unit
- Reminder: we have a GGUS support unit (WLCG Network Throughput; https://wiki.egi.eu/wiki/GGUS:WLCG_Network_Throughput) used to report incidents (mailing list: wlcg-network-throughput at cern.ch)
  - Experiments can report potential network performance incidents. WLCG perfSONAR support investigates and confirms whether the issue is network related. Once confirmed, it will notify relevant sites and try to assist in narrowing down the problem to particular link(s). Ongoing incidents will be tracked via the WG page.
  - Sites observing a network performance problem should follow their standard procedure, i.e. report to their network team and, if necessary, escalate to their network provider. If the problem is confirmed to be WAN related, the WLCG perfSONAR support unit can assist in further debugging. For non-technical (policy) issues, sites should escalate to WLCG operations coordination.
- https://twiki.cern.ch/twiki/bin/view/LCG/NetworkTransferMetrics#Network_Performance_Incidents
- LHCOPN/LHCONE experts are very important in this coordinated activity.
27
Next Steps
- We are working on getting ALL WLCG/OSG perfSONAR instances fully operational and properly configured
  - We have hints that some perfSONAR services stop or hang under some circumstances. Working with developers to isolate/fix.
  - Some hosts are underpowered (<4GB memory on latency hosts) or broken
- There are some bugs we know of in the data acquisition chain that need fixing. This is an ongoing effort.
- As we fix known issues and get to reliable operation, we can free up time to pursue possible issues in the network itself, rather than in the framework that gets us network metrics.
- We need to plan a campaign to clear up the remaining LHCONE/LHCOPN problems. Currently working with FNAL, INFN and PIC on some issues.
28
References
- Network documentation: https://www.opensciencegrid.org/bin/view/Documentation/NetworkingInOSG
- Deployment documentation for OSG and WLCG, hosted in OSG: https://twiki.opensciencegrid.org/bin/view/Documentation/DeployperfSONAR
- New MA guide: http://software.es.net/esmond/perfsonar_client_rest.html
- Modular Dashboard and OMD prototypes: http://maddash.aglt2.org/maddash-webui and https://maddash.aglt2.org/WLCGperfSONAR/check_mk
- OSG production instances for OMD, MaDDash and Datastore:
  - http://psmad.grid.iu.edu/maddash-webui/
  - https://psomd.grid.iu.edu/WLCGperfSONAR/check_mk/
  - http://psds.grid.iu.edu/esmond/perfsonar/archive/?format=json
- Mesh-config in OSG: https://oim.grid.iu.edu/oim/meshconfig
- Use-cases document for experiments and middleware: https://docs.google.com/document/d/1ceiNlTUJCwSuOuvbEHZnZp0XkWkwdkPQTQic0VbH1mc/edit
29
Discussion/Questions/Comments?
30
Discussion Topics
- What would you like to see as next steps?
- Are the tools sufficient to help us with our goals?
- Any changes in coverage/tests needed?
- Should we push forward on using our metrics to start addressing issues in the network? (Who is “we”?)