
Deploying perfSONAR-PS for WLCG: An Overview
Shawn McKee / University of Michigan, WLCG PS Deployment TF Co-chair
Fall 2013 HEPiX, Ann Arbor, Michigan, October 29th, 2013

Science Motivations for perfSONAR-PS
- WLCG (the Worldwide LHC Computing Grid) encompasses the four main LHC experiments: ATLAS, CMS, ALICE and LHCb.
- We have all heard of the Higgs and the science at the LHC; it is the leading-edge instrument for both high-energy and nuclear physics research. Our global grid infrastructure was even noted as critical to the Higgs discovery.
- WLCG has a critical reliance upon ubiquitous, high-performance networks to tie together data, applications, computing and storage to enable and expedite scientific discovery.

Introductory Considerations
- All distributed, data-intensive projects critically depend upon the network.
- Network problems can be hard to diagnose and slow to fix.
- Network problems are multi-domain, which complicates the process.
- Standardizing on specific tools and methods allows groups to focus resources more effectively and better self-support (as well as benefit from others' work).
- Performance issues involving the network are complicated by the number of components involved end-to-end; we need the ability to better isolate performance bottlenecks.
- WLCG wants to make sure its scientists are able to use the network effectively and to resolve network issues quickly, when and where they occur.

Vision for perfSONAR-PS in WLCG
- Goals:
  - Find and isolate "network" problems; alert in a timely way
  - Characterize network use (base-lining)
  - Provide a source of network metrics for higher-level services
- First step: get monitoring in place to create a baseline of the current situation between sites (see details later)
- Next: continuing measurements to track the network, alerting on problems as they develop (see the sketch below)
- Choice of a standard tool/framework: perfSONAR
  - We wanted to benefit from the R&E community consensus
  - perfSONAR's purpose is to aid network diagnosis by allowing users to characterize and isolate problems. It provides measurements of network performance metrics over time as well as "on-demand" tests.
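To illustrate the base-lining goal above, here is a minimal sketch (my illustration, not tooling from the talk) that keeps a rolling throughput baseline per path and flags samples that fall well below it; the window size and threshold are assumptions:

    from collections import deque

    # Minimal base-lining sketch; window size and drop threshold are assumptions.
    WINDOW = 30        # throughput samples kept per path
    DROP_FACTOR = 0.5  # flag a sample below 50% of the established baseline

    history = {}       # (src, dst) -> deque of recent throughput samples (Mbps)

    def observe(src, dst, mbps):
        """Record a throughput sample; return True if it deviates from baseline."""
        window = history.setdefault((src, dst), deque(maxlen=WINDOW))
        baseline = sum(window) / len(window) if window else None
        window.append(mbps)
        # Only flag once enough history exists to trust the baseline.
        return baseline is not None and len(window) >= 10 and mbps < DROP_FACTOR * baseline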

Initial perfSONAR-PS Issues
- perfSONAR-PS was designed to be standalone and federated. That is nice for standalone end-users/sites, but bad for VOs who want to understand what their ensemble of sites and networks is doing.
- USATLAS identified that having a dashboard to gather, grade, store and display network test metrics is REQUIRED for effective use.
- Managing a set of sites for USATLAS or LHCOPN or 'x' exposed another shortcoming: changes to the set of sites required ALL sites to update their configurations. Not scalable!
- Reliability and resiliency are critical. The feasibility of getting all relevant sites to deploy perfSONAR-PS depends upon it being easy to install and (most importantly) robust in its operation.
  - Sites don't have the manpower or expertise to baby-sit/diagnose PS issues.

Example of Dashboard Showing LHCONE
[Screenshot of the dashboard matrix for the LHCONE mesh; see the Modular Dashboard links under Relevant URLs.]

The perfSONAR Modular Dashboard
- Centrally aggregates measurements from all PS hosts
- Provides a web UI and a REST interface (an illustrative query sketch follows below)
- A new implementation is maturing toward production quality:
  - Addressing scalability issues for large meshes
  - Providing a more extensive REST API
  - Self-configuring from mesh definitions
  - Fancier...
- Discussions with OSG about hosting the Modular Dashboard service and automating mesh-config creation
[Screenshot: bwctl matrix for the last 30 days; color thresholds at 10 Mb/s and 50 Mb/s; dashboard link elided]
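Purely as an illustration of how a higher-level service might consume such an interface: the host name, endpoint path, and JSON layout below are hypothetical placeholders, not the actual Modular Dashboard API.

    import json
    import urllib.request

    DASHBOARD = "http://dashboard.example.org"  # hypothetical host

    def fetch_matrix(matrix_id):
        """Fetch one measurement matrix as a dict from a dashboard-style REST API."""
        url = f"{DASHBOARD}/matrices/{matrix_id}.json"  # hypothetical route
        with urllib.request.urlopen(url) as resp:
            return json.load(resp)

    def failing_cells(matrix, threshold_mbps=10.0):
        """Return (src, dst) pairs whose throughput is below the given threshold."""
        return [(c["src"], c["dst"]) for c in matrix.get("cells", [])
                if c.get("throughput_mbps", 0.0) < threshold_mbps]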

Configuration of Network Testing
- One of the lessons learned from LHC use of perfSONAR-PS was that setting up and maintaining scheduled tests for the perfSONAR-PS toolkit instances was a challenge.
  - As sites changed, joined or left, every other site needed to update its configuration to change, add or remove tests.
  - Labor intensive, slow to get all updates in place, and it gets worse as the deployment grows!
- Aaron Brown/Internet2 provided a solution: the "mesh" configuration, which allows sites to track a central configuration and update themselves when it changes. SUCCESS!
- For current perfSONAR-PS deployments, a set of patches and new RPMs needs to be installed to enable the mesh-config
- v3.3.1 has all the mesh functionality built in

perfSONAR-PS Mesh Example
- perfSONAR-PS instances can participate in more than one configuration (WLCG, Tier-1 cloud, VO-based, etc.)
- The WLCG mesh configurations are centrally hosted at CERN and exposed through HTTP
- perfSONAR-PS toolkit instances can get their configuration information from a URL hosting a suitable JSON file
- An agent_configuration file on the PS node defines one or more URLs (see the sketch below)
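For illustration, an agent_configuration along these lines points a node at one or more centrally hosted meshes; the URLs below are placeholders (not the real CERN locations), and the exact file syntax should be checked against the toolkit documentation:

    # Placeholder URLs, not the real CERN locations
    <mesh>
        configuration_url    http://example.cern.ch/wlcg-meshes/usatlas.json
    </mesh>
    <mesh>
        configuration_url    http://example.cern.ch/wlcg-meshes/lhcopn.json
    </mesh>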

Plans for WLCG Operations
- Simone Campana and I are leading a WLCG operations task-force for perfSONAR:
  - Encouraging all sites to deploy and register two instances
  - All sites to use the "mesh" configuration
  - One set of test parameters to be used everywhere
  - Detailed instructions at [link elided; see Relevant URLs]
- Simone presented at CHEP 2013, bringing perfSONAR-PS to an international audience
- The current dashboard is a central source for network information. We also need to make sure we are gathering the right metrics and making them easily accessible
- We need to encourage discussion about the types of metrics our frameworks and applications would like concerning the network

WLCG Deployment Plan
- WLCG chose to deploy perfSONAR-PS at all sites worldwide
- A dedicated WLCG Operations Task-Force was started in Fall 2012
- Sites are organized in regions
  - Based on geographical locations and the experiments' computing models
- All sites are expected to deploy a bandwidth host and a latency host
- Regular testing is set up using a centralized ("mesh") configuration; a sketch of such a configuration follows below
  - Bandwidth tests: 30-second tests every 6 hours intra-region, every 12 hours for T2-T1 inter-region, weekly elsewhere
  - Latency tests: 10 Hz of packets to each WLCG site
  - Traceroute tests between all WLCG sites each hour
  - Ping(ER) tests between all sites every 20 minutes
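To make the centralized configuration concrete, here is a trimmed, illustrative JSON sketch in the spirit of the mesh format; the host names are placeholders, and the parameter keys should be checked against the actual mesh-config schema rather than taken as exact:

    {
      "description": "WLCG mesh (illustrative sketch, placeholder hosts)",
      "tests": [
        {
          "description": "Intra-region bandwidth: 30-second test every 6 hours",
          "members": {"type": "mesh",
                      "members": ["ps-bw.site-a.example", "ps-bw.site-b.example"]},
          "parameters": {"type": "perfsonarbuoy/bwctl",
                         "duration": 30, "interval": 21600}
        },
        {
          "description": "Latency and packet loss: 10 Hz of packets",
          "members": {"type": "mesh",
                      "members": ["ps-lat.site-a.example", "ps-lat.site-b.example"]},
          "parameters": {"type": "perfsonarbuoy/owamp",
                         "packet_interval": 0.1, "sample_count": 600}
        },
        {
          "description": "Hourly traceroute between all sites",
          "members": {"type": "mesh",
                      "members": ["ps-lat.site-a.example", "ps-lat.site-b.example"]},
          "parameters": {"type": "traceroute", "test_interval": 3600}
        }
      ]
    }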

Current WLCG Deployment
[World map of currently deployed perfSONAR-PS instances; estimated ~150 sites eventually]

Use of perfSONAR-PS Metrics
- Throughput: notice problems and debug the network; also helps differentiate server problems from path problems
- Latency: notice route changes and asymmetric routes
- Watch for excessive packet loss
- On-demand tests and NPAD/NDT diagnostics via the web
- Optionally: install additional perfSONAR nodes inside the local network and/or at the periphery
  - Characterize local performance and internal packet loss
  - Separate WAN performance from internal performance
- Daily dashboard check of one's own site and peers

Debugging Network Problems
- Using perfSONAR-PS, we (the VOs) identify network problems by observing degradation in the regular metrics for a particular "path"
  - Appearance of packet loss in latency tests
  - Significant and persistent decrease in bandwidth
  - Currently requires a "human" to trigger
- Next, check for correlation with other metric changes between the sites at either end and other sites (is the problem likely at one of the ends or in the middle?); a sketch of this correlation step follows below
- Correlate with paths and traceroute information. Did something change in the routing? Is there a known issue in the path?
- In general, all of this is NOT as easy to do as we would like with the current perfSONAR-PS toolkit
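As a rough illustration of the correlation step (my sketch, not tooling from the talk): given recent packet-loss results per source-destination path, decide whether the degradation clusters around one site or is confined to a single path. The threshold and data layout are assumptions:

    # Loss values are fractions (0.01 == 1% packet loss); threshold is an assumption.
    LOSS_THRESHOLD = 0.01

    def localize(loss_by_path):
        """loss_by_path: {(src, dst): loss_fraction}. Return a rough diagnosis."""
        bad = {p for p, loss in loss_by_path.items() if loss > LOSS_THRESHOLD}
        if not bad:
            return "no significant packet loss"
        # Count how many degraded paths touch each site.
        counts = {}
        for src, dst in bad:
            counts[src] = counts.get(src, 0) + 1
            counts[dst] = counts.get(dst, 0) + 1
        site, n = max(counts.items(), key=lambda kv: kv[1])
        if n == len(bad) and n > 1:
            return f"degradation clusters at {site}: likely an end-site problem"
        return "degradation limited to specific path(s): likely in the middle"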

Success Stories
- The deployment of perfSONAR-PS toolkit instances has found numerous network issues
- In USATLAS, the first time sites deployed PS we found numerous issues (dirty fibers, misconfigurations) as soon as we started regular testing
- Once the scheduled tests were in place, we found problems as network configurations were changed. Some recent examples:
  - OU's routing changed from NLR to Internet2, and we almost immediately identified a path problem
  - AGLT2 suddenly saw significant transfer slowdowns, but only to some Tier-2s. PS measurements were used to identify a dirty jumper in Chicago after a networking change
  - Transitions to LHCONE were tracked with PS tests; Nebraska and OU both saw problems in one direction identified by PS tests
- Similar experiences in Europe and Asia as PS testing was deployed
- In general, subtle problems abound, and it requires standard measurements to identify and isolate them

Example: Fixing Asymmetric Throughput
[Throughput plots: asymmetric throughput between peer sites IU and AGLT2, resolved]

Example of Problem Solving
- Problem at MSU after routing upgrade work; a small 0.7 ms latency increase was noticed
- Traceroute found an unintended route change (packets destined for MSU were going through UM)
- Routing preferences were quickly fixed

High-Level perfSONAR-PS Issues
- If we want to rely upon network metrics, we need to be sure we understand what we are collecting and how we are collecting it
- All sites should co-locate perfSONAR-PS instances with their storage resources (same subnet and switches)
  - If some sites are not doing this, we need to flag them appropriately
- The complete set of metrics (latency/packet loss, bandwidth and traceroute) needs to be collected on "important" paths
  - Some test-sets (meshes) may end up with subsets, depending upon impact
- We know there are potential anomalous readings when testing between 10G and 1G instances. We need to flag those
  - The perfSONAR-PS developers are aware of this issue, but a solution is not yet provided within the toolkit

Network Monitoring Challenges
- Getting the hardware/software platform installed at all sites
- Dashboard development: need additional effort to produce something suitable quickly and ensure it meets our needs
- Managing site and test configurations
  - Testing and improving "centralized" (VO-based?) configurations
  - Determining the right level of scheduled tests for a site, e.g., should Tier-2s test to other same-cloud Tier-2s (and their Tier-1)?
- Addressing 10G vs 1G tests that give misleading results
- Improving path-monitoring (traceroute) access within the tool
- Alerting: a high-priority need, but complicated:
  - Alert whom? Network issues could arise in any part of the end-to-end path
  - Alert when? Defining criteria for the alert threshold. Primitive services are easier; network test results are more complicated to decide
- Integration with VO infrastructures and applications

Improving perfSONAR-PS Deployments
- Based upon the issues we have encountered, we set up a Wiki to gather best practices and solutions to issues we have identified [link elided; see Relevant URLs]
- This page is shared with the perfSONAR-PS developers, and we expect many of the "fixes" to be incorporated into future releases (most are in v3.3.1 already)
- Improving resiliency ("set it and forget it") is a high priority. Instances should self-maintain, and the infrastructure should be able to alert when services fail (OIM/GOCDB tests)
- Disentangling problems with the measurement infrastructure from problems with the measurements...

perfSONAR-PS Wishlist
- Continued reliability/resiliency improvements
  - Must be "set it and forget it" to meet the needs of the bulk of our users
- Topology/path diagnosis support
  - Traceroute sensitive to ECMP ("Paris" traceroute)
  - Tools/GUI to visualize the route, show router port usage, and show drops/errors
  - Identify perfSONAR-PS instances on the path
  - Path comparison/correlation tools using metrics coupled with traceroutes (identify "bad" paths via multiple measurements)
- Alerting and alarming
  - Support for configuring notifications to alert users to network problems
  - NAGIOS support exists, but it is not well matched to multi-domain issues
  - Alarms targeted at the most likely problem domain
- Handle jumbo frames and NIC speed mismatches
  - 10GE testing to 1GE "overruns" the receiver and provides misleading information
- Support for additional tests (iperf variants, new tools, etc.)

Future Use of Network Metrics
- Once we have a source of network metrics being acquired, we need to understand how best to incorporate those metrics into our facility operations. Some possibilities:
  - Characterizing paths with "costs" to better optimize decisions in workflow and data management (underway in ANSE); see the sketch below
  - Noting when paths change and providing appropriate notification
  - Optimizing data access or data distribution based upon a better understanding of the network between sites
  - Identifying structural bottlenecks in need of remediation
  - Aiding network problem diagnosis and speeding repairs
- In general, incorporating knowledge of the network into our processes
- We will require testing and iteration to better understand when and where the network metrics are useful
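As one possible shape of the path-cost idea above (my illustration; the actual ANSE cost model is not described in this talk), throughput, packet loss and latency could be folded into a single comparable score per path:

    # Illustrative path-cost sketch; the weights and functional form are assumptions.
    def path_cost(throughput_mbps, loss_fraction, rtt_ms):
        """Lower is better: penalize low throughput, packet loss, and high RTT."""
        if throughput_mbps <= 0:
            return float("inf")              # unusable path
        cost = 1000.0 / throughput_mbps      # base cost from achievable bandwidth
        cost *= 1.0 + 100.0 * loss_fraction  # packet loss is heavily penalized
        cost += rtt_ms / 100.0               # mild latency term
        return cost

    # Example: choose the cheaper source replica for a transfer.
    candidates = {"site-a": path_cost(800, 0.0, 20), "site-b": path_cost(950, 0.02, 15)}
    best = min(candidates, key=candidates.get)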

Closing Remarks
- Over the last few years, WLCG sites have converged on perfSONAR-PS as their way to measure and monitor their networks for data-intensive science
  - It was not easy to get global consensus, but we have it now, after pushing since 2008
- The assumption is that perfSONAR (and the perfSONAR-PS toolkit) is the de facto standard way to do this and will be supported long-term
  - It is especially critical that R&E networks agree on its use and continue to improve and develop the reference implementation
- The modular dashboard is critical for "visibility" into networks. We can't manage/fix/respond to problems if we can't "see" them
- Having perfSONAR-PS fully deployed should give us some interesting options for better management and use of our networks

Discussion/Questions
Questions or Comments?

Relevant URLs
- perfSONAR-PS site: [link elided]
- Install/configuration guide: http://code.google.com/p/perfsonar-ps/wiki/pSPerformanceToolkit33
- Modular Dashboard: [links elided]
- Tools, tips and maintenance: [link elided]
- OSG networking pages: [link elided]
- GitHub perfSONAR Modular Dashboard: [link elided]
- WLCG Installs: [link elided]