WLCG perfSONAR-PS Update
Shawn McKee / University of Michigan
WLCG Network and Transfer Metrics Co-Chair
Spring 2014 HEPiX, LAPP, Annecy, France
May 21, 2014
Overview
perfSONAR in WLCG
The WLCG perfSONAR-PS Deployment Task Force
The (new) WLCG Network and Transfer Metrics WG
Future plans
Introductory Considerations
All distributed, data-intensive projects critically depend upon the network.
Network problems can be hard to diagnose and slow to fix; because they span multiple domains, the process is further complicated.
Standardizing on specific tools and methods allows groups to focus resources more effectively, better self-support, and benefit from others' work.
Performance issues involving the network are complicated by the number of components involved end-to-end; we need the ability to better isolate performance bottlenecks.
WLCG wants to make sure its scientists can use the network effectively and resolve network issues quickly, when and where they occur.
Vision for perfSONAR-PS in WLCG
Goals:
Find and isolate "network" problems; alert in a timely way
Characterize network use (base-lining)
Provide a source of network metrics for higher-level services
First step: put monitoring in place to create a baseline of the current situation between sites (see details later)
Next: continue measurements to track the network, alerting on problems as they develop
Choice of a standard "tool/framework": perfSONAR
We wanted to benefit from the R&E community consensus
perfSONAR's purpose is to aid network diagnosis by allowing users to characterize and isolate problems. It provides measurements of network performance metrics over time as well as "on-demand" tests.
WLCG Deployment Plan
WLCG chose to deploy perfSONAR-PS at all sites worldwide
A dedicated WLCG Operations Task Force was started in Fall 2013
Sites are organized in regions, based on geographical location and the experiments' computing models
All sites are expected to deploy a bandwidth host and a latency host
Regular testing is set up using a centralized ("mesh") configuration:
Bandwidth tests: 30-second tests every 6 hours intra-region, every 12 hours for T2-T1 inter-region, weekly elsewhere
Latency tests: 10 Hz of packets to each WLCG site
Traceroute tests between all WLCG sites each hour
Ping(ER) tests between all sites every 20 minutes
Summary of perfSONAR Deployment
WLCG Deployment Task Force completed its work April 27, 2014
All WLCG sites should have installed/upgraded perfSONAR following the instructions at https://twiki.cern.ch/twiki/bin/view/LCG/PerfsonarDeployment
Baseline release is 3.3.2; the task-force deadline for deployment was April 1, 2014
We have 205 hosts running and in the mesh
8 sites are not yet installed
64 sites are not at the current version (versions prior to 3.3 are unable to use the mesh configuration)
Remaining WLCG Deployments
There are 8 sites not installed and configured yet:
BelGrid-UCL: asked for an SLC6 installation, pointed to https://code.google.com/p/perfsonar-ps/wiki/Level1and2Install
GR-07-UOI-HEPLAB: no hardware, on hold
GoeGrid: no reply after 4 reminders
ICM: "We do not have free resources to deploy perfSonar", ticket closed
MPPMU: procuring hardware
RO-11-NIPNE: site under upgrade on 09/01/2014, recent progress, needs to be added to FR
T2_Estonia: under installation/configuration
TECHNION-HEP: first reply yesterday (after 3 reminders)
USCMS-FNAL-WC1: installed and configured (for a long time), but not publishing in OIM
Reported at the WLCG Ops Coordination meeting: https://twiki.cern.ch/twiki/bin/view/LCG/WLCGOpsMinutes140403#perfSONAR_deployment_TF
Good that we have so few missing; most should eventually be deployed
Monitoring Status
MaDDash instance at http://maddash.aglt2.org/maddash-webui
It has shown we still have some issues: too much "orange", meaning data is either not being taken (configuration or firewall problems) or access to results is blocked
We have OMD monitoring the perfSONAR-PS instances at https://maddash.aglt2.org/WLCGperfSONAR/omd/
These services should migrate to OSG over the next month
This monitoring should be useful for any future work to find/fix problems
(Dashboard snapshots shown: March 6 and April 16)
MaDDash LHCONE Matrices: 28 Apr 2014
OWAMP (Latency) and BWCTL (Bandwidth) matrices
Latency legend: no packet loss; packet loss > 0.01
Bandwidth legend: BW > 0.9 Gb; 0.5 < BW < 0.9 Gb; BW < 0.5 Gb
Orange indicates missing measurements/data; the main issue is too much "orange"
Sources are rows, destinations are columns
Each box is split into two regions indicating where the test is run: the top corresponds to the "row" host, the bottom to the "column" host
Task-Force Lessons Learned
Installing a service at every site is one thing, but commissioning an NxN system of links is the square of that effort. This is why we have perfSONAR-PS installed but not all links monitored.
perfSONAR is a "special" service:
It tests a multi-domain network path, with a service at both the source and the destination
It requires dedicated hardware and comes bundled with the OS. We understand this creates complications for some fabric infrastructures. An RPM bundle was provided to help those sites, and we encouraged sites to share configuration experience
We had many releases of perfSONAR during the deployment process, each coming with new features or bug-fixes we requested. Some sites did install perfSONAR but are at old releases missing much functionality.
The change of OS version (v3.2 -> v3.3) was a major reason for the inertia of some sites.
We still have issues with firewalls. There are two kinds of firewall openings to be considered:
Allowing the hosts to run the tests among themselves
Allowing the hosts to expose information to the monitoring tools
Many sites get the first one right but not the second.
Important Remaining Issues
Get sites running older versions to upgrade
Verify we consistently get the needed metrics
Involve cloud/VO leads in debugging/fixing issues
Fix firewalls: still a problem for many sites
Test coverage and parameters:
Should we have more VO-specific meshes/tests, e.g. WLCG-ATLAS, WLCG-CMS?
What frequency of testing for traceroute, BW?
Better docs: how-tos, debugging "orange"
WLCG Operations has convened a new working group to address these issues: Network and Transfer Metrics
Mandate: Network & Transfer Metrics
Ensure all relevant network and transfer metrics are identified, collected and published
Ensure sites and experiments can better understand and fix networking issues
Enable use of network-aware tools to improve transfer efficiency and optimize experiment workflows (e.g. the ANSE project; see http://www.internet2.edu/presentations/tip2013/20130116_barczyk_anse.pdf )
Working Group Objectives
● Identify and continuously make available relevant transfer and network metrics
● Document metrics and their use
● Facilitate their integration in the middleware and/or experiment tool chain
● Coordinate commissioning and maintenance of WLCG network monitoring
  o Finalize perfSONAR deployment
  o Ensure all links continue to be monitored and sites stay correctly configured
  o Verify coverage and optimize test parameters
Working Group Membership
● Chairs: Shawn McKee, Marian Babik
● Members: proposing to invite previous members of the perfSONAR-PS TF responsible for different clouds
  o See https://twiki.cern.ch/twiki/bin/view/LCG/MeshUpdates
  o Fill in members for missing clouds
● Inviting members knowledgeable about FAX, AAA, FTS, PhEDEx, Panda or Rucio
● If anyone is interested in joining the effort, contact me!
Use of perfSONAR-PS Metrics
Throughput: notice problems and debug the network; also helps differentiate server problems from path problems
Latency: notice route changes and asymmetric routes; watch for excessive packet loss
On-demand tests and NPAD/NDT diagnostics via the web
Optionally: install additional perfSONAR nodes inside the local network and/or at the periphery
  Characterize local performance and internal packet loss
  Separate WAN performance from internal performance
Daily dashboard check of your own site and its peers
Debugging Network Problems
Using perfSONAR-PS, we (the VOs) identify network problems by observing degradation in the regular metrics for a particular "path":
  Packet loss appearing in latency tests
  A significant and persistent decrease in bandwidth
  Currently this requires a "human" to trigger
Next, check for correlation with other metric changes between the sites at either end and other sites (is the problem likely at one of the ends or in the middle?)
Correlate with paths and traceroute information: did something change in the routing? Is there a known issue in the path?
In general, all this is NOT as easy to do as we would like with the current perfSONAR-PS toolkit
Improving perfSONAR-PS Deployments
Based upon the issues we have encountered, we set up a wiki to gather best practices and solutions: http://www.usatlas.bnl.gov/twiki/bin/view/Projects/LHCperfSONAR
This page is shared with the perfSONAR-PS developers, and we expect many of the "fixes" to be incorporated into future releases (most are in v3.3.2 already)
Improving resiliency ("set-it-and-forget-it") is a high priority: instances should self-maintain, and the infrastructure should be able to alert when services fail (OIM/GOCDB tests)
We must disentangle problems with the measurement infrastructure from problems in the network indicated by those measurements
Future Use of Network Metrics
Once we have a source of network metrics being acquired, we need to understand how best to incorporate those metrics into our facility operations. Some possibilities:
  Characterizing paths with "costs" to better optimize decisions in workflow and data management (underway in the ANSE project)
  Noting when paths change and providing appropriate notification
  Optimizing data access or data distribution based upon a better understanding of the network between sites
  Identifying structural bottlenecks in need of remediation
  Aiding network problem diagnosis and speeding repairs
In general, incorporating knowledge of the network into our processes
We will require testing and iteration to better understand when and where the network metrics are useful.
OSG & Networking Service
OSG is building a centralized service for gathering, viewing and providing network information to users and applications.
Goal: OSG becomes the "source" of networking information for its constituents, aiding in finding/fixing problems and enabling applications and users to better take advantage of their networks
The plan is to migrate MaDDash and OMD to OSG in the next month
The critical missing component is the datastore to organize and store the network metrics and associated metadata
OSG (via MaDDash) is gathering relevant metrics from the complete set of OSG and WLCG perfSONAR-PS instances
This data must be available via an API, must be visualized, and must be organized to provide the "OSG Networking Service"
Closing Remarks
Over the last few years, WLCG sites have converged on perfSONAR-PS as their way to measure and monitor their networks for data-intensive science.
  Global consensus was not easy to reach, but we have it now after pushing since 2008
The assumption is that perfSONAR (and the perfSONAR-PS toolkit) is the de facto standard way to do this and will be supported long-term
  It is especially critical that R&E networks agree on its use and continue to improve and develop the reference implementation
The dashboard is critical for "visibility" into our networks: we can't manage, fix, or respond to problems we can't "see".
Having perfSONAR-PS fully deployed should give us some interesting options for better management and use of our networks
Discussion/Questions
Questions or comments?