Update on WLCG/OSG perfSONAR Infrastructure
Shawn McKee, Marian Babik
HEPiX Fall 2015 Meeting at BNL, 13 October 2015



Overview of perfSONAR in WLCG/OSG
Goals:
– Find and isolate “network” problems; alerting in time
– Characterize network use (base-lining)
– Provide a source of network metrics for higher level services
Choice of a standard open source tool: perfSONAR
– Benefiting from the R&E community consensus
Tasks achieved:
– Monitoring in place to create a baseline of the current situation between sites
– Continuous measurements to track the network, alerting on problems as they develop
– Developed test coverage and made it possible to run “on-demand” tests to quickly isolate problems and identify problematic links

Current perfSONAR Deployment
278 perfSONAR instances registered in GOCDB/OIM
245 active perfSONAR instances
180 running the latest version (3.5+)
Initial deployment coordinated by the WLCG perfSONAR TF
Commissioning of the network followed by the WLCG Network and Transfer Metrics WG

Importance of Measuring Our Networks
End-to-end network issues are difficult to spot and localize
– Network problems are multi-domain, complicating the process
– Standardizing on specific tools and methods allows groups to focus resources more effectively and better self-support
– Performance issues involving the network are complicated by the number of components involved end-to-end
perfSONAR provides a number of standard metrics we can use
Latency measurements provide one-way delays and packet loss metrics
– Packet loss is almost always very bad for performance
Bandwidth tests measure achievable throughput and track TCP retries (using Iperf3)
– Provides a baseline to watch for changes; identify bottlenecks
Traceroute/Tracepath track network topology
– All measurements are only useful when we know the exact path they are taking through the network
– Tracepath additionally measures MTU but is frequently blocked
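
The remark that packet loss is almost always very bad for performance can be illustrated with the well-known Mathis et al. approximation for steady-state TCP throughput, rate ≈ (MSS / RTT) · (C / √p). This formula is not part of the slides. It is a rough model only, and the segment size, RTT and loss values used below are hypothetical.

```python
import math


def mathis_throughput_bps(mss_bytes, rtt_s, loss_rate, c=math.sqrt(1.5)):
    """Approximate upper bound on TCP throughput in bits per second.

    Uses the Mathis approximation: rate ~ (MSS / RTT) * (C / sqrt(p)).
    This is a rough model, not a perfSONAR measurement.
    """
    if loss_rate <= 0:
        raise ValueError("the approximation needs a loss rate greater than zero")
    return (mss_bytes * 8 / rtt_s) * (c / math.sqrt(loss_rate))


if __name__ == "__main__":
    # Hypothetical transatlantic-like path: 1460-byte segments, 100 ms RTT.
    for loss in (1e-6, 1e-4, 1e-2):
        rate = mathis_throughput_bps(1460, 0.100, loss)
        print(f"loss={loss:g}  bound={rate / 1e6:.1f} Mbit/s")
```

Under this model the bound falls with the square root of the loss rate, so a path with 0.01% loss and 100 ms RTT is already limited to the order of 14 Mbit/s per stream.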

Existing Test Coverage
Current perfSONAR measurement coverage for WLCG/OSG:
– Full latency (one-direction only, 10 Hz, OWAMP, IPv4)
– Full traceroute (bi-directional, hourly, BWCTL/OWAMP, IPv4, IPv6)
– Full bandwidth (one-direction only, fortnightly, BWCTL-only!, IPv4, IPv6)
Regional meshes still disabled, need to discuss how to evolve
– We can create any sub-mesh of the full latency mesh (for free, but only IPv4 and using same params). We could move from regional to bigger meshes (European, Asia/Pacific, US)
– We can create new bandwidth meshes as bwctl needs fewer resources (but only for BWCTL-only nodes, not on dual-nodes)
We re-enabled project meshes
– Belle II – both latency and bandwidth
– Dual-stack – just bandwidth (both IPv4 and IPv6)
– LHCONE/LHCOPN – these need to be separately tracked
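
To make the relation between a full mesh and a sub-mesh concrete, here is a minimal sketch that enumerates test pairs. It assumes that "one-direction only" means each unordered pair of hosts is tested once and "bi-directional" means both orderings are tested. The host names are hypothetical, and this is not the mesh-config format itself.

```python
from itertools import combinations, permutations


def mesh_pairs(hosts, bidirectional):
    """Return the (source, destination) test pairs of a full mesh."""
    make = permutations if bidirectional else combinations
    return list(make(sorted(hosts), 2))


def sub_mesh(pairs, members):
    """Keep only the pairs whose two endpoints both belong to `members`."""
    members = set(members)
    return [(s, d) for s, d in pairs if s in members and d in members]


if __name__ == "__main__":
    # Hypothetical hosts; real meshes would list registered perfSONAR instances.
    hosts = ["ps-a.example.org", "ps-b.example.org",
             "ps-c.example.org", "ps-d.example.org"]
    full = mesh_pairs(hosts, bidirectional=False)
    regional = sub_mesh(full, hosts[:3])
    print(len(full), "pairs in the full mesh,", len(regional), "in the sub-mesh")
```

Because a sub-mesh is only a filter over pairs that are already being measured, it costs no additional tests, which matches the "for free" remark above.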

Overview of perfSONAR Pipeline
The diagram on the right provides a high-level view of how WLCG/OSG is managing our perfSONAR deployments, gathering metrics and making them available for use. We will cover some of the details in what follows.

OSG Networking Area
Open Science Grid (OSG) has a networking area which is focused on providing a network service for its constituents including WLCG
The network area components:
– Network monitoring via perfSONAR
– A datastore and API providing a long-term repository for network metrics
– Tools to manage and maintain the network monitoring infrastructure: MaDDash to visualize metrics, Check_mk to monitor services, mesh-config GUI to create and maintain perfSONAR testing meshes
– Documentation, Outreach and Support

Gathering & Storing Metrics
OSG is providing network metric data for its members and WLCG via the Network Datastore
– The data is gathered from all WLCG/OSG perfSONAR instances
– Stored indefinitely on OSG hardware
– Made available via API
– In production since September 14th
– Updated code going into production today
The primary use-cases
– Network problem identification and localization
– Network-related decision support
– Network baseline: set expectations and identify weak points for upgrading

OSG Network Datastore Diagram
– OSG is gathering relevant metrics from the complete set of OSG and WLCG perfSONAR instances
– Operating now
– Running VMs on dedicated hardware
– Data also published to CERN ActiveMQ instance and available for user subscription
– Actively tuning and debugging
8 VMs
Storage must host 7 distinct areas

How to Find Nearest pS Instances
perfSONAR measures our networks but how do we find the “right” perfSONAR instance (or metrics) when we need to understand a path?
We need tools that let us identify which perfSONAR is relevant:
– provide “glue” between experiments and sonar topology
– map sonars to storages and vice versa
– determine existing tools/technologies that can be used
Looked at different approaches
– Site-based: join based on common mapping to GOCDB/OIM sites
– GeoIP: join based on geographical distance (prototyped)
– Traceroutes: join based on network distance (prototyped)
– RTT: multilateration based on RTT, i.e. determine position by measuring distance from reference points (perfSONARs)
GeoIP API prototype exists:
– =5
– se01.in2p3.fr&count=10 (works with any hostname)
Interest from GEANT and ESnet to work on an open-source based project
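
The GeoIP approach (join based on geographical distance) can be sketched as a nearest-neighbour search over great-circle distances. This is an illustration only, not the prototype's code. The instance names and coordinates are hypothetical, and the `count` parameter merely mirrors the `count=10` query fragment above.

```python
import math

EARTH_RADIUS_KM = 6371.0


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two lat/lon points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def nearest_instances(target, instances, count=5):
    """Return the names of the `count` instances closest to `target`.

    `target` is a (lat, lon) tuple, e.g. from a GeoIP lookup of a storage host;
    `instances` maps instance name -> (lat, lon).
    """
    return sorted(
        instances,
        key=lambda name: haversine_km(*target, *instances[name]),
    )[:count]


# Hypothetical perfSONAR instances with approximate coordinates.
INSTANCES = {
    "ps-geneva.example.org": (46.2, 6.1),
    "ps-chicago.example.org": (41.9, -87.6),
    "ps-tokyo.example.org": (35.7, 139.7),
}

if __name__ == "__main__":
    storage_location = (45.76, 4.84)  # hypothetical GeoIP result near Lyon
    print(nearest_instances(storage_location, INSTANCES, count=2))
```

Geographic proximity is only a proxy for network proximity, which is presumably why the traceroute- and RTT-based joins were examined as well.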

Existing Tools
We have a number of tools available to help debug and understand network problems. There are very good presentations on these tools in the training materials provided by perfSONAR.
While I don’t have time to cover all the details (see materials/ ps-training/ and especially the Measurement Tools, Use Cases and Debugging presentations from Jason Zurawski), I do want to note that command line tools exist to allow you to create on-demand 3rd party tests (between two remote instances) for bandwidth, latency and traceroute.
– Follow the debugging strategy as a guide to finding and fixing network issues using perfSONAR capabilities

Monitoring Metrics
Use MaDDash to view metric summaries
– Provides a quick view of how networks are working
OSG hosts the production instance
Metrics are displayed via a source-destination matrix
Multiple dashboards (meshes) can be selected
Custom menus link to relevant resources
New release (2.0) will incorporate MadAlert
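
As an illustration of the source-destination matrix idea, the sketch below classifies per-pair packet loss into coarse states and lays them out as a grid. It does not use MaDDash or its configuration. The thresholds, host names and loss values are hypothetical.

```python
def classify_loss(loss_fraction, warn=0.0001, crit=0.01):
    """Map a packet-loss fraction to a coarse status (thresholds are assumptions)."""
    if loss_fraction is None:
        return "UNKNOWN"  # no measurement available for this pair
    if loss_fraction >= crit:
        return "CRITICAL"
    if loss_fraction > warn:
        return "WARNING"
    return "OK"


def build_matrix(hosts, loss):
    """Build rows of statuses: rows are sources, columns are destinations."""
    return [
        ["-" if src == dst else classify_loss(loss.get((src, dst))) for dst in hosts]
        for src in hosts
    ]


if __name__ == "__main__":
    hosts = ["site-a", "site-b", "site-c"]  # hypothetical sites
    loss = {("site-a", "site-b"): 0.0,
            ("site-b", "site-a"): 0.02,
            ("site-a", "site-c"): 0.001}
    for src, row in zip(hosts, build_matrix(hosts, loss)):
        print(f"{src:8s}", " ".join(f"{cell:9s}" for cell in row))
```

A grid like this makes a bad site stand out as a whole row or column of non-OK cells, while an isolated bad cell points to a specific path.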

Exploring Path Analysis
[Figure: path graph annotated with latency, packet-loss and throughput; node labels DFN, JANET, GEANT, RAL, Aachen, ITEP, QMUL]
We can correlate paths with packet-loss/latency information
We can simplify the graph by aggregating nodes that belong to the same NREN (visual debugging)
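
A minimal sketch of the two ideas on this slide: collapsing traceroute hops into the NREN (or site) they belong to, then correlating the aggregated paths with packet-loss observations. The hop names, the hop-to-domain mapping and the paths are hypothetical and only reuse labels from the figure. The "on every lossy path and on no clean path" rule is one simple heuristic, not the authors' method.

```python
from collections import Counter
from itertools import groupby


def aggregate_path(hops, hop_to_domain):
    """Replace each hop by its domain and collapse consecutive repeats."""
    domains = [hop_to_domain.get(hop, hop) for hop in hops]
    return [domain for domain, _ in groupby(domains)]


def suspect_domains(paths, lossy_pairs, hop_to_domain):
    """Domains seen on every lossy path and on no clean path."""
    lossy_hits, clean_hits = Counter(), Counter()
    n_lossy = 0
    for pair, hops in paths.items():
        domains = set(aggregate_path(hops, hop_to_domain))
        if pair in lossy_pairs:
            n_lossy += 1
            lossy_hits.update(domains)
        else:
            clean_hits.update(domains)
    return sorted(d for d, hits in lossy_hits.items()
                  if hits == n_lossy and clean_hits[d] == 0)


# Hypothetical hop-to-domain mapping and traceroute results.
HOP_TO_DOMAIN = {
    "aachen-gw": "Aachen", "ral-gw": "RAL", "qmul-gw": "QMUL", "itep-gw": "ITEP",
    "dfn-1": "DFN", "dfn-2": "DFN",
    "geant-1": "GEANT", "geant-2": "GEANT",
    "janet-1": "JANET", "janet-2": "JANET",
}
PATHS = {
    ("Aachen", "RAL"): ["aachen-gw", "dfn-1", "dfn-2", "geant-1", "janet-1", "ral-gw"],
    ("Aachen", "QMUL"): ["aachen-gw", "dfn-1", "geant-1", "janet-2", "qmul-gw"],
    ("ITEP", "QMUL"): ["itep-gw", "geant-2", "janet-2", "qmul-gw"],
}

if __name__ == "__main__":
    lossy = {("Aachen", "RAL"), ("Aachen", "QMUL")}
    print(aggregate_path(PATHS[("Aachen", "RAL")], HOP_TO_DOMAIN))
    print(suspect_domains(PATHS, lossy, HOP_TO_DOMAIN))
```

With these made-up inputs the shared GEANT and JANET segments are cleared by the clean ITEP to QMUL path, leaving the Aachen site and DFN as the places to look first.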

perfSONAR v3.5 Toolkit
perfSONAR v3.5 released on the 28th of September
Main themes for this release:
– Support for central host management and node auto-configuration
– Support for low cost nodes
– Support for Debian, VMs, and other installation options
– Modernize the GUIs
In addition v3.5 incorporates feedback and bugfixes from our WLCG/OSG deployments, improving robustness.
WLCG/OSG deployment status as of today (great progress):
– : 6
– : 38
– 3.5 : 8
– : 172
– Unknown: 20 (these nodes are either down or hung)

New Deployment Options
Configuration-managed deployments via bundles (see http://docs.perfsonar.net/install_options.html)
– perfSONAR Tools (just tools)
– perfSONAR TestPoint (passive, no MA)
– perfSONAR Core (+MA)
– perfSONAR Complete (+Web and Toolkit Configuration)
– perfSONAR Central Management (MaDDash, Auto-config, Centralized config service)
Low-cost nodes to support large-scale deployment
– $ range should enable broad deployment
– Small form factor enables more locations
– Some limitations in capabilities due to hardware
VMs: still not recommended but possible
– Target: whole node VMs, VMs with dedicated physical NICs
– Main use: “end-to-end” infrastructure testing (not network)
What about Docker?

Summary
We have a working infrastructure in place to monitor and measure our networks
perfSONAR provides lots of capabilities to understand and debug our networks
– New 3.5 provides new resiliency and install options
Work on new applications is underway to make it easier to find and fix problems
We (OSG and WLCG) welcome feedback on how to further improve

References
Network Documentation
Deployment documentation for OSG and WLCG hosted in OSG
New 3.4 MA guide
Modular Dashboard and OMD Prototypes
OSG Production instances for OMD, MaDDash and Datastore
Mesh-config in OSG
Use-cases document for experiments and middleware: dkPQTQic0VbH1mc/edit