WLCG perfSONAR-PS Update
Shawn McKee / University of Michigan
WLCG Network and Transfer Metrics Co-Chair
Spring 2014 HEPiX, LAPP, Annecy, France
May 21, 2014
Overview
perfSONAR in WLCG
The WLCG perfSONAR-PS Deployment Task Force
The (new) WLCG Network and Transfer Metrics WG
Future plans
Introductory Considerations
All distributed, data-intensive projects critically depend upon the network.
Network problems can be hard to diagnose and slow to fix; because they span multiple domains, the process is further complicated.
Standardizing on specific tools and methods allows groups to focus resources more effectively, better self-support, and benefit from others' work.
Performance issues involving the network are complicated by the number of components involved end-to-end; we need the ability to better isolate performance bottlenecks.
WLCG wants to make sure its scientists can use the network effectively and resolve network issues quickly, when and where they occur.
Vision for perfSONAR-PS in WLCG
Goals:
Find and isolate "network" problems; alert in a timely way
Characterize network use (base-lining)
Provide a source of network metrics for higher-level services
First step: put monitoring in place to create a baseline of the current situation between sites (see details later)
Next: continue measurements to track the network, alerting on problems as they develop
Choice of a standard "tool/framework": perfSONAR
We wanted to benefit from the R&E community consensus
perfSONAR's purpose is to aid network diagnosis by allowing users to characterize and isolate problems. It provides measurements of network performance metrics over time as well as "on-demand" tests.
WLCG Deployment Plan
WLCG chose to deploy perfSONAR-PS at all sites worldwide
A dedicated WLCG Operations Task Force was started in Fall 2013
Sites are organized in regions, based on geographical location and the experiments' computing models
All sites are expected to deploy a bandwidth host and a latency host
Regular testing is set up using a centralized ("mesh") configuration:
Bandwidth tests: 30-second tests every 6 hours intra-region, every 12 hours for T2-T1 inter-region, weekly elsewhere
Latency tests: 10 Hz of packets to each WLCG site
Traceroute tests between all WLCG sites each hour
Ping(ER) tests between all sites every 20 minutes
Summary of perfSONAR Deployment
WLCG Deployment Task Force completed its work April 27, 2014
All WLCG sites should have installed/upgraded perfSONAR following the instructions at https://twiki.cern.ch/twiki/bin/view/LCG/PerfsonarDeployment
Baseline release is 3.3.2; the task-force deadline for deployment was April 1, 2014
We have 205 hosts running and in the mesh
8 sites are not yet installed
64 sites are not at the current version (versions prior to 3.3 are unable to use the mesh configuration)
Remaining WLCG Deployments
There are 8 sites not installed and configured yet:
BelGrid-UCL: asked for an SLC6 installation, pointed to https://code.google.com/p/perfsonar-ps/wiki/Level1and2Install
GR-07-UOI-HEPLAB: no hardware, on hold
GoeGrid: no reply after 4 reminders
ICM: "We do not have free resources to deploy perfSonar", ticket closed
MPPMU: procuring hardware
RO-11-NIPNE: site under upgrade on 09/01/2014, recent progress, needs to be added to FR
T2_Estonia: under installation/configuration
TECHNION-HEP: first reply yesterday (after 3 reminders)
USCMS-FNAL-WC1: installed and configured (for a long time), but not publishing in OIM
Reported at the WLCG Ops Coordination meeting: https://twiki.cern.ch/twiki/bin/view/LCG/WLCGOpsMinutes140403#perfSONAR_deployment_TF
Good that we have so few missing; most should eventually be deployed
Monitoring Status
MaDDash instance at http://maddash.aglt2.org/maddash-webui
It has shown we still have some issues: too much "orange", meaning data is either not being taken (configuration or firewall problems) or access to results is blocked
We have OMD monitoring the perfSONAR-PS instances at https://maddash.aglt2.org/WLCGperfSONAR/omd/
These services should migrate to OSG over the next month
This monitoring should be useful for any future work to find/fix problems
(Dashboard snapshots shown: March 6 and April 16)
MaDDash LHCONE Matrices: 28 Apr 2014
OWAMP (Latency) and BWCTL (Bandwidth) matrices
Latency legend: no packet loss; packet loss > 0.01
Bandwidth legend: BW > 0.9 Gb; 0.5 < BW < 0.9 Gb; BW < 0.5 Gb
Orange indicates missing measurements/data; the main issue is too much "orange"
Sources are rows, destinations are columns
Each box is split into two regions indicating where the test is run: the top corresponds to the "row" host, the bottom to the "column" host
Task-Force Lessons Learned
Installing a service at every site is one thing, but commissioning an NxN system of links is the square of that effort. This is why we have perfSONAR-PS installed but not all links monitored.
perfSONAR is a "special" service:
It tests a multi-domain network path, with a service at both the source and the destination
It requires dedicated hardware and comes bundled with the OS. We understand this creates complications for some fabric infrastructures. An RPM bundle was provided to help those sites, and we encouraged sites to share configuration experience
We had many releases of perfSONAR during the deployment process, each coming with new features or bug-fixes we requested. Some sites did install perfSONAR but are at old releases missing much functionality.
The change of OS version (v3.2 -> v3.3) was a major reason for the inertia of some sites.
We still have issues with firewalls. There are two kinds of firewall openings to be considered:
Allowing the hosts to run the tests among themselves
Allowing the hosts to expose information to the monitoring tools
Many sites get the first one right but not the second.
Important Remaining Issues
Get sites running older versions to upgrade
Verify we consistently get the needed metrics
Involve cloud/VO leads in debugging/fixing issues
Fix firewalls: still a problem for many sites
Test coverage and parameters:
Should we have more VO-specific meshes/tests, e.g. WLCG-ATLAS, WLCG-CMS?
What frequency of testing for traceroute, BW?
Better docs: how-tos, debugging "orange"
WLCG Operations has convened a new working group to address these issues: Network and Transfer Metrics
Mandate: Network & Transfer Metrics
Ensure all relevant network and transfer metrics are identified, collected and published
Ensure sites and experiments can better understand and fix networking issues
Enable use of network-aware tools to improve transfer efficiency and optimize experiment workflows (e.g. the ANSE project; see http://www.internet2.edu/presentations/tip2013/20130116_barczyk_anse.pdf )
Working Group Objectives
● Identify and continuously make available relevant transfer and network metrics
● Document metrics and their use
● Facilitate their integration in the middleware and/or experiment tool chain
● Coordinate commissioning and maintenance of WLCG network monitoring
  o Finalize perfSONAR deployment
  o Ensure all links continue to be monitored and sites stay correctly configured
  o Verify coverage and optimize test parameters
Working Group Membership
● Chairs: Shawn McKee, Marian Babik
● Members: proposing to invite previous members of the perfSONAR-PS TF responsible for different clouds
  o See https://twiki.cern.ch/twiki/bin/view/LCG/MeshUpdates
  o Fill in members for missing clouds
● Inviting members knowledgeable about FAX, AAA, FTS, PhEDEx, Panda or Rucio
● If anyone is interested in joining the effort, contact me!
Use of perfSONAR-PS Metrics
Throughput: notice problems and debug the network; also helps differentiate server problems from path problems
Latency: notice route changes and asymmetric routes; watch for excessive packet loss
On-demand tests and NPAD/NDT diagnostics via the web
Optionally: install additional perfSONAR nodes inside the local network and/or at the periphery
  Characterize local performance and internal packet loss
  Separate WAN performance from internal performance
Daily dashboard check of your own site and its peers
Debugging Network Problems
Using perfSONAR-PS, we (the VOs) identify network problems by observing degradation in the regular metrics for a particular "path":
  Packet loss appearing in latency tests
  A significant and persistent decrease in bandwidth
  Currently this requires a "human" to trigger
Next, check for correlation with other metric changes between the sites at either end and other sites (is the problem likely at one of the ends or in the middle?)
Correlate with paths and traceroute information: did something change in the routing? Is there a known issue in the path?
In general, all this is NOT as easy to do as we would like with the current perfSONAR-PS toolkit
Improving perfSONAR-PS Deployments
Based upon the issues we have encountered, we set up a wiki to gather best practices and solutions: http://www.usatlas.bnl.gov/twiki/bin/view/Projects/LHCperfSONAR
This page is shared with the perfSONAR-PS developers, and we expect many of the "fixes" to be incorporated into future releases (most are in v3.3.2 already)
Improving resiliency ("set-it-and-forget-it") is a high priority: instances should self-maintain, and the infrastructure should be able to alert when services fail (OIM/GOCDB tests)
We must disentangle problems with the measurement infrastructure from problems in the network indicated by those measurements
Future Use of Network Metrics
Once we have a source of network metrics being acquired, we need to understand how best to incorporate those metrics into our facility operations. Some possibilities:
  Characterizing paths with "costs" to better optimize decisions in workflow and data management (underway in the ANSE project)
  Noting when paths change and providing appropriate notification
  Optimizing data access or data distribution based upon a better understanding of the network between sites
  Identifying structural bottlenecks in need of remediation
  Aiding network problem diagnosis and speeding repairs
In general, incorporating knowledge of the network into our processes
We will require testing and iteration to better understand when and where the network metrics are useful.
OSG & Networking Service
OSG is building a centralized service for gathering, viewing and providing network information to users and applications.
Goal: OSG becomes the "source" of networking information for its constituents, aiding in finding/fixing problems and enabling applications and users to better take advantage of their networks
The plan is to migrate MaDDash and OMD to OSG in the next month
The critical missing component is the datastore to organize and store the network metrics and associated metadata
OSG (via MaDDash) is gathering relevant metrics from the complete set of OSG and WLCG perfSONAR-PS instances
This data must be available via an API, must be visualized, and must be organized to provide the "OSG Networking Service"
Closing Remarks
Over the last few years, WLCG sites have converged on perfSONAR-PS as their way to measure and monitor their networks for data-intensive science.
  Global consensus was not easy to reach, but we have it now after pushing since 2008
The assumption is that perfSONAR (and the perfSONAR-PS toolkit) is the de facto standard way to do this and will be supported long-term
  It is especially critical that R&E networks agree on its use and continue to improve and develop the reference implementation
The dashboard is critical for "visibility" into our networks: we can't manage, fix, or respond to problems we can't "see".
Having perfSONAR-PS fully deployed should give us some interesting options for better management and use of our networks
Discussion/Questions
Questions or comments?