Update on OSG/WLCG perfSONAR infrastructure
Shawn McKee, Marian Babik
HEPiX Spring Workshop, Oxford, 23-27 March 2015


Network Monitoring in WLCG/OSG

Goals:
– Find and isolate "network" problems; alert in a timely manner
– Characterize network use (base-lining)
– Provide a source of network metrics for higher-level services

Choice of a standard open-source tool: perfSONAR
– Benefiting from the R&E community consensus

Tasks achieved by the perfSONAR TF:
– Put monitoring in place to create a baseline of the current situation between sites
– Run continuous measurements to track the network, alerting on problems as they develop
– Develop test coverage and make it possible to run "on-demand" tests to quickly isolate problems and identify problematic links

Network and Transfer Metrics WG

Started in May 2014, bringing together network and transfer experts
Follows up on the perfSONAR TF goals
Mandate:
– Ensure all relevant network and transfer metrics are identified, collected and published
– Ensure sites and experiments can better understand and fix networking issues
– Enable use of network-aware tools to improve transfer efficiency and optimize experiment workflows
Membership:
– WLCG perfSONAR support unit (regional experts), WLCG experiments, FTS, PanDA, PhEDEx, FAX, network experts (ESnet, LHCOPN, LHCONE)

Network and Transfer Metrics WG

Objectives:
– Coordinate commissioning and maintenance of WLCG network monitoring
  - Finalize perfSONAR deployment
  - Ensure all links continue to be monitored and sites stay correctly configured
  - Verify coverage and optimize test parameters
– Identify and continuously make available relevant transfer and network metrics
– Document metrics and their use
– Facilitate their integration in the middleware and/or experiment tool chain

Since its inception, the main focus has been to finalize deployment and commissioning and to extend the infrastructure, but also to jump-start common projects around network and transfer metrics.

perfSONAR Deployment

– 251 perfSONAR instances registered in GOCDB/OIM
– 228 active perfSONAR instances
– 223 running the latest version (3.4.1)
– Instances at WIX, MANLAN and GEANT Amsterdam
– Initial deployment coordinated by the WLCG perfSONAR TF
– Commissioning of the network followed up by the WLCG Network and Transfer Metrics WG
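As an illustration of how the registered-instance counts above can be obtained, the sketch below queries GOCDB's public programmatic interface for perfSONAR service endpoints and counts them. It is a minimal sketch under assumptions: the service-type names and the public PI URL are the ones commonly used for WLCG perfSONAR registration, not taken from the slide, and OIM (the OSG registry) is not covered.

    # Minimal sketch: count perfSONAR endpoints registered in GOCDB.
    # Assumed endpoint and service-type names; OIM-registered hosts not included.
    import requests
    import xml.etree.ElementTree as ET

    GOCDB_PI = "https://goc.egi.eu/gocdbpi/public/"  # assumed public PI URL

    def count_endpoints(service_type):
        # Fetch all registered endpoints of the given service type as XML.
        resp = requests.get(GOCDB_PI, params={
            "method": "get_service_endpoint",
            "service_type": service_type,
        }, timeout=30)
        resp.raise_for_status()
        root = ET.fromstring(resp.content)
        return len(root.findall("SERVICE_ENDPOINT"))

    if __name__ == "__main__":
        for st in ("net.perfSONAR.Bandwidth", "net.perfSONAR.Latency"):
            print(st, count_endpoints(st))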

perfSONAR Metrics and Meshes

Tests are organized in meshes – sets of instances that test to each other
perfSONAR regular tests currently configured:
– Traceroute: end-to-end path, important to understand the context of the other metrics (full WLCG mesh, hourly)
– Throughput: notice problems and debug the network, also helps differentiate server problems from path problems (full WLCG mesh, weekly)
– Latency: notice route changes and asymmetric routes, and watch for excessive packet loss (regional meshes, continuous, 10 Hz)
perfSONAR is a testing framework, so new tests and tools can be integrated as they become available
– e.g. from iperf to iperf3, from traceroute to tracepath
Dynamic reconfiguration is now possible:
– Creation and modification of meshes
– Test frequency and parameters
Additional perfSONAR nodes inside the local network and/or at the periphery are still needed (on LHCONE: MANLAN, WIX, GEANT)
– Characterize local performance and internal packet loss
– Separate WAN performance from internal performance
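To make the mesh idea concrete, the sketch below simply enumerates the source/destination pairs of a full mesh and attaches the test cadences listed above. It is illustrative only: the host names are hypothetical and this is not the actual perfSONAR mesh-configuration JSON format.

    # Minimal sketch of the full-mesh idea described above (hypothetical hosts,
    # not the real perfSONAR mesh-configuration format).
    from itertools import permutations

    hosts = ["ps-bw.site-a.example.org",   # hypothetical bandwidth nodes
             "ps-bw.site-b.example.org",
             "ps-bw.site-c.example.org"]

    # Test cadences as on the slide: traceroute hourly, throughput weekly.
    tests = {"traceroute": "1/hour", "throughput": "1/week"}

    # A full mesh tests every ordered pair of distinct hosts: N*(N-1) links.
    mesh = [(src, dst) for src, dst in permutations(hosts, 2)]

    for test, cadence in tests.items():
        for src, dst in mesh:
            print(f"{test:10s} {cadence:7s} {src} -> {dst}")

    print(f"{len(mesh)} directed links per test type for {len(hosts)} hosts")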

Configuration Interface

– A perfSONAR instance can participate in more than one mesh
– The configuration interface and auto-URL enable dynamic re-configuration of the entire perfSONAR network
– A perfSONAR instance only needs to be configured once, with an auto-URL
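A commissioning-style check of the auto-URL idea could look like the sketch below: fetch the per-host configuration URL and verify it returns valid JSON. The URL is hypothetical (the real one is not preserved in the transcript) and no assumption is made about the JSON structure.

    # Minimal sketch: verify a (hypothetical) per-host auto-URL returns valid JSON.
    import sys
    import requests

    AUTO_URL = "https://example.org/pfmesh/hostname/ps-bw.site-a.example.org"  # hypothetical

    def check_auto_url(url):
        resp = requests.get(url, timeout=30)
        resp.raise_for_status()
        config = resp.json()                       # must parse as JSON
        print("top-level keys:", sorted(config))   # structure depends on the service
        return config

    if __name__ == "__main__":
        try:
            check_auto_url(AUTO_URL)
        except (requests.RequestException, ValueError) as exc:
            sys.exit(f"auto-URL check failed: {exc}")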

Configuration Interface

Infrastructure Monitoring

Based on OMD/check_mk and MaDDash
– MaDDash is developed as part of perfSONAR; version 1.2 released recently
– OMD/check_mk extended to cover WLCG perfSONAR needs
Developed bootstrapping and auto-configuration scripts
– Synchronized with GOCDB/OIM and the OSG configuration interface
– Packaged and deployed in OSG
Developed new plugins – core functionality
– Toolkit version, regular testing, NTP, mesh configuration, esmond (MA), homepage, contacts
– Updated to the perfSONAR 3.4 information API/JSON
High-level functionality plugins
– Esmond freshness – checks whether a perfSONAR node's local MA contains the measurements it was configured to perform
– Extremely useful during commissioning
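The esmond-freshness idea above can be illustrated with a check_mk-style local check. The sketch below is not the actual WLCG plugin: it queries a node's local esmond archive over its REST interface and reports whether any measurements were stored recently; the node name and freshness window are hypothetical.

    #!/usr/bin/env python
    # Minimal check_mk "local check" sketch for the esmond-freshness idea.
    # Not the actual WLCG plugin; host name and thresholds are hypothetical.
    import requests

    NODE = "ps-lat.site-a.example.org"           # hypothetical perfSONAR node
    ARCHIVE = f"http://{NODE}/esmond/perfsonar/archive/"
    WINDOW = 4 * 3600                            # "fresh" = data within 4 hours

    def fresh_metadata_count():
        # esmond's REST archive supports time-range filtering on metadata queries.
        resp = requests.get(ARCHIVE, params={"format": "json",
                                             "time-range": WINDOW},
                            timeout=60)
        resp.raise_for_status()
        return len(resp.json())

    if __name__ == "__main__":
        try:
            n = fresh_metadata_count()
            status, text = (0, "OK") if n > 0 else (2, "CRITICAL - no fresh data")
        except Exception as exc:
            n, status, text = 0, 3, f"UNKNOWN - {exc}"
        # check_mk local check output: <status> <service name> <perfdata> <text>
        print(f"{status} Esmond_Freshness fresh={n} {text}")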

Infrastructure Monitoring

– Auto-summaries are available per mesh
– Service summaries per metric type

OSG perfSONAR Datastore

All perfSONAR metrics should be collected into the OSG network datastore
– This is an esmond datastore from perfSONAR (PostgreSQL + Cassandra backends)
– Loaded via RSV probes; currently one probe per perfSONAR instance every 15 minutes
Validation and testing ongoing in OSG
– Plan is to have it production ready by Q3
Datastore on psds.grid.iu.edu
– JSON interface, Python API and Perl API (https://code.google.com/p/perfsonar-ps/wiki/MeasurementArchivePerlAPI)
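To give a feel for the JSON interface, the sketch below queries an esmond archive for recent packet-loss results between two hosts and follows the per-event-type URI to the raw data. The datastore host name is taken from the slide (the URL scheme is assumed); the measurement endpoints are hypothetical.

    # Minimal sketch of querying an esmond archive over its JSON REST interface.
    import requests

    HOST = "http://psds.grid.iu.edu"             # datastore host from the slide
    BASE = HOST + "/esmond/perfsonar/archive/"
    SRC = "ps-lat.site-a.example.org"            # hypothetical endpoints
    DST = "ps-lat.site-b.example.org"

    # 1) Find metadata for packet-loss measurements between SRC and DST (last day).
    meta = requests.get(BASE, params={
        "source": SRC,
        "destination": DST,
        "event-type": "packet-loss-rate",
        "time-range": 86400,
        "format": "json",
    }, timeout=60).json()

    # 2) Each metadata entry lists its event types with a base-uri to the raw data.
    for m in meta:
        for et in m.get("event-types", []):
            if et["event-type"] != "packet-loss-rate":
                continue
            data = requests.get(HOST + et["base-uri"],
                                params={"time-range": 86400, "format": "json"},
                                timeout=60).json()
            for point in data:
                print(point["ts"], point["val"])   # timestamp, loss fraction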

Datastore API

Integration Projects

Goal
– Provide a platform to integrate network and transfer metrics
– Enable network-aware tools (see the ANSE project)
  - Network resource allocation alongside CPU and storage
  - Bandwidth reservation
  - Creation of custom topologies
Plan
– Provide latency and traceroutes and test how they can be integrated with throughput data from transfer systems
– Provide mapping between sites/storages and sonars
– Uniform access to the network monitoring
Pilot projects
– FTS performance – adding latency and routing to the optimizer
– Experiments' interface to the datastore
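Purely to illustrate the "add latency and routing to the optimizer" idea, the sketch below combines throughput, latency and packet loss into a single channel score. The weighting formula and thresholds are invented for this example and are not the FTS optimizer algorithm.

    # Illustrative only: one way network metrics *could* feed an optimizer's
    # view of a transfer channel. The weighting is invented for this sketch.
    def channel_score(throughput_mbps, rtt_ms, packet_loss):
        """Higher is better. Penalize high latency and especially packet loss."""
        if packet_loss >= 0.05:            # >=5% loss: treat the channel as degraded
            return 0.0
        latency_factor = 1.0 / (1.0 + rtt_ms / 100.0)
        loss_factor = (1.0 - packet_loss) ** 10   # loss hurts TCP throughput hard
        return throughput_mbps * latency_factor * loss_factor

    # Example: compare two channels to the same destination.
    print(channel_score(throughput_mbps=900, rtt_ms=20, packet_loss=0.0))
    print(channel_score(throughput_mbps=900, rtt_ms=150, packet_loss=0.002))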

Experiments' Interface to the Datastore

Aim
– Develop a publish/subscribe interface to perfSONAR
– Enable subscription to different events (filtering) and support different clients – integration via messaging, streaming data via topic/queue
– Provide mapping/translation between the sonar infrastructure and the experiments' topology
Components
– esmond2mq – prototype already exists
  - Retrieves all data (metadata + raw) from esmond based on the existing mesh configs
  - Publishes it to a topic
– Proximity/topology service
  - Handles mapping/translation of services (service to service, storage to sonar) and service to site (sonar to site)
  - Test different algorithms (site mapping, traceroutes, GeoIP)
  - Evaluate whether existing tools can be reused for this purpose
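A minimal sketch in the spirit of esmond2mq is shown below: pull recent metadata from an esmond archive and publish it to a messaging topic using the stomp.py client. This is not the actual prototype; the broker, credentials and topic name are hypothetical.

    # Sketch only: esmond metadata -> STOMP topic. Not the esmond2mq prototype;
    # broker, credentials and topic name are hypothetical.
    import json
    import requests
    import stomp   # stomp.py client for STOMP brokers such as ActiveMQ

    ARCHIVE = "http://psds.grid.iu.edu/esmond/perfsonar/archive/"
    BROKER = ("mq.example.org", 61613)           # hypothetical broker
    TOPIC = "/topic/perfsonar.metadata"          # hypothetical topic name

    def fetch_metadata(time_range=3600):
        resp = requests.get(ARCHIVE, params={"format": "json",
                                             "time-range": time_range}, timeout=60)
        resp.raise_for_status()
        return resp.json()

    def publish(records):
        conn = stomp.Connection([BROKER])
        conn.connect("user", "password", wait=True)   # hypothetical credentials
        for record in records:
            conn.send(destination=TOPIC, body=json.dumps(record),
                      headers={"content-type": "application/json"})
        conn.disconnect()

    if __name__ == "__main__":
        publish(fetch_metadata())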

Proposed Extension

PuNDIT Project

Closing Remarks

perfSONAR is widely deployed and already showing benefits in troubleshooting network issues
– Additional deployments by R&E networks are still needed
Significant progress in configuration and infrastructure monitoring
– Helping to reach the full potential of the perfSONAR deployment
OSG datastore – community network data store for all perfSONAR metrics
– Planned to enter production in Q3
Integration projects aiming to aggregate network and transfer metrics
– FTS performance
– Experiments' interface to perfSONAR
Advanced network monitoring – diagnosis and alerts based on perfSONAR, developed within the NSF-funded PuNDIT project

Backup slides

Introduction

OSG/WLCG critically depend upon the network
– It interconnects sites and resources
– Traffic grows by a factor of 10 every 4 years
– Progressively moving from a tier-based to a peer-to-peer model
– Emergence of new paradigms and network technologies

Network Troubleshooting

End-to-end network issues are difficult to spot and localize
– Network problems are multi-domain, which complicates the process
– Standardizing on specific tools and methods allows groups to focus resources more effectively and better self-support
– Performance issues involving the network are complicated by the number of components involved end-to-end
The famous "BNL-CNAF" network issue
– 7 months, 72 entries in the tracking ticket, requiring work from many people

perfSONAR in Troubleshooting

Recent problem reported SARA -> AGLT2
– FTS timed out because the rate for large files was < Kbytes/sec
– perfSONAR tests confirmed similar results
Opened a ticket with I2, who opened a ticket with GEANT
– GEANT brought up an LHCONE pS instance; tests to AGLT2 showed 3 times the bandwidth compared with the (much closer) SARA
– Problematic link identified within a few days ( % packet loss)
– Impacts overall link throughput; many fixes tried, no solution yet
Currently establishing a procedure to report on network performance issues in WLCG
– Where should we track/document cases?

Commissioning Challenges

Installing a service at every site is one thing, but commissioning an NxN mesh of links squares the effort
– This is why we have perfSONAR-PS installed but not all links are monitored
perfSONAR is a "special" service
– Dedicated hardware and management are required
– We understand this creates complications for some fabric infrastructures; sharing your experience within the WG might be the best way to help other sites
3.4 was a major release that introduced many new features and components
– Some of them required follow-up and bug fixing
– 3.4.2rc is currently being validated in a PS testbed by WLCG/OSG – monitored by the same monitoring infrastructure as production
Security incidents had a major impact on perfSONAR
– The wlcg-perfsonar-security mailing list was established as a single point of contact in case of security incidents
– New documentation was written to give guidelines to sites
We still have issues with firewalls
– New documentation is very clear on port openings and campus/central firewalls
– Monitoring information is now exposed by almost all sites – also thanks to migrating infrastructure monitoring to OSG

Infrastructure Monitoring

We have 3 versions of the perfSONAR infrastructure monitoring
– Prototype at maddash.aglt2.org
– Testing at OSG's ITB instance
– Production at OSG's production instance
Main monitoring types are MaDDash and OMD/Check_MK, with separate pages for the prototype, testing and production instances
Notes:
– OSG instances rely upon the OSG datastore
– An X509 certificate is needed to view the check_mk/OMD pages (any IGTF cert)
Plan:
– Develop additional plugins to check core functionality – IPv6, memory, etc.

FTS Performance

FTS – low-level data movement service used by the majority of WLCG transfers
– Its current granularity and coverage are a good match to the perfSONAR network
– Mapping of SEs to sonars is needed (see the sketch below)
Goal: add latency and routing (tracepath) to the optimizer algorithm
– Better tune the number of active transfers
Integration of traceroutes and FTS monitoring was already attempted in the ATLAS FTS performance study (led by Saul Youssef)
– Integrates FTS monitoring and tracepath to determine weak channels and identify problems
– Proposed to extend it to CMS and LHCb
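As a toy illustration of the SE-to-sonar mapping mentioned above, the sketch below matches a storage endpoint to the perfSONAR instance whose DNS domain it shares. A real mapping service would need topology data or a curated table; all host names here are hypothetical.

    # Toy illustration of mapping storage endpoints (SEs) to perfSONAR instances
    # by shared DNS domain. Host names are hypothetical; a real service would
    # use topology information or a curated mapping.
    def domain(host, levels=2):
        """Return the last `levels` labels of a host name."""
        return ".".join(host.split(".")[-levels:])

    sonars = ["ps-bw.site-a.example.org", "ps-bw.site-b.example.net"]
    storage_endpoints = ["se01.site-a.example.org", "dcache.site-b.example.net"]

    def map_se_to_sonar(se, sonars, levels=3):
        # Prefer the most specific (longest) shared domain suffix.
        for lv in range(levels, 1, -1):
            for sonar in sonars:
                if domain(se, lv) == domain(sonar, lv):
                    return sonar
        return None   # no match: fall back to manual/topology-based mapping

    for se in storage_endpoints:
        print(se, "->", map_se_to_sonar(se, sonars))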